In [69]:
cd C:\\Users\\bunny\\Desktop  # Setting Directory

C:\Users\bunny\Desktop


The sample corpus is read with the readlines function. The articles are stored in a single file, with one article on each line.

In [36]:
# the with-block closes the file automatically
with open("sample.txt", "r") as fin:
    raw_documents = fin.readlines()
#print("Read %d raw text documents" % len(raw_documents))

# Tokenizing text

The code below converts each document to lowercase and then uses the built-in scikit-learn tokenizer to split it into tokens. The stop word list from scikit-learn is then used to remove words that are not useful.

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
tokenize = CountVectorizer().build_tokenizer()

from sklearn.feature_extraction import text
stopwords = text.ENGLISH_STOP_WORDS

all_filtered_tokens = []
for doc in raw_documents:
    # tokenize the next document
    tokens = tokenize(doc.lower())
    # remove the stopwords
    filtered_tokens = []
    for token in tokens:
        if token not in stopwords:
            filtered_tokens.append(token)
    # add to the overall list
    all_filtered_tokens.append(filtered_tokens)
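
As a quick check of what the tokenizer and the stop word filter do, the following minimal sketch runs the same two steps on a single sentence. The sentence and the variable names are made up for illustration and are not taken from sample.txt.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# hypothetical one-line document, used only for illustration
sample = "The sporting industry has come a long way."

tokenize = CountVectorizer().build_tokenizer()
sample_tokens = tokenize(sample.lower())

# keep only the tokens that are not in the built-in English stop word list
sample_filtered = [t for t in sample_tokens if t not in ENGLISH_STOP_WORDS]

print(sample_tokens)
print(sample_filtered)
```

The default token pattern only keeps tokens of two or more alphanumeric characters, so the single-letter word "a" is already dropped by the tokenizer, before the stop word filter runs.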

Document-term matrix:

Since each document can be represented as a term vector, these vectors can be stacked to create a document-term matrix. Here the matrix is built directly from the list of document strings using CountVectorizer from scikit-learn.

In [41]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(raw_documents)

In [42]:
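# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
terms = vectorizer.get_feature_names_out().tolist()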
vocab = vectorizer.vocabulary_
print("Vocabulary has %d distinct terms" % len(terms))

Vocabulary has 23596 distinct terms


In [76]:
print(terms[500:530])
vocab

['10 there', '10 this', '10 times', '10 to', '10 together', '10 took', '10 trucks', '10 us', '10 victory', '10 was', '10 week', '10 weeks', '10 were', '10 which', '10 while', '10 win', '10 with', '10 world', '10 xbox', '10 yards', '10 year', '10 years', '100', '100 000', '100 53', '100 again', '100 and', '100 but', '100 companies', '100 countries']


{'the': 214091,
 'sporting': 201906,
 'industry': 112776,
 'has': 99929,
 'come': 52109,
 'long': 130699,
 'way': 240722,
 'since': 196753,
 '60s': 6376,
 'it': 118391,
 'carved': 45994,
 'out': 160095,
 'for': 85305,
 'itself': 120205,
 'niche': 147823,
 'with': 246907,
 'its': 119399,
 'roots': 185272,
 'so': 198660,
 'deep': 62871,
 'that': 212630,
 'cannot': 44867,
 'fathom': 80346,
 'sports': 201926,
 'showing': 195622,
 'any': 21326,
 'sign': 196122,
 'of': 150893,
 'decline': 62758,
 'time': 223556,
 'soon': 200090,
 'or': 158204,
 'later': 125671,
 'reason': 178137,
 'can': 44401,
 'be': 30754,
 'found': 87806,
 'in': 109588,
 'this': 221728,
 'seemingly': 191864,
 'subtle': 206419,
 'difference': 65673,
 'other': 159307,
 'industries': 112746,
 'have': 100749,
 'customers': 60350,
 'fans': 79840,
 'vivek': 238086,
 'ranadivé': 176717,
 'leader': 126407,
 'ownership': 161722,
 'group': 97140,
 'nba': 145647,
 'sacramento': 186804,
 'kings': 123788,
 'explained': 78229,
 'beauti

# Text Preprocessing

Here the built-in stop word list is selected by passing the lowercase name of the language, i.e. "english".

In [77]:
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(raw_documents)
# Are standard stopwords gone?
"and" in vectorizer.vocabulary_

False
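
The built-in list can also be extended with corpus-specific words. The sketch below is an illustration only: the documents and the extra stop word "said" are assumptions, not something the notebook uses. The set is converted to a list because recent scikit-learn versions expect stop_words to be "english", a list, or None.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# made-up documents for illustration
toy_docs = ["he said the market fell", "she said the team won"]

# extend the built-in list with a corpus-specific word (hypothetical choice)
custom_stop_words = list(ENGLISH_STOP_WORDS.union(["said"]))

toy_vectorizer = CountVectorizer(stop_words=custom_stop_words)
toy_vectorizer.fit(toy_docs)

# the remaining vocabulary no longer contains "said" or standard stop words
print(sorted(toy_vectorizer.vocabulary_))
```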

The min_df parameter removes low-frequency terms, i.e. terms that appear in fewer than the specified number of documents (here 5).

In [78]:
vectorizer = CountVectorizer(min_df = 5)
X = vectorizer.fit_transform(raw_documents)
print("Number of terms in model is %d" % len(vectorizer.vocabulary_) )

Number of terms in model is 7036
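
The effect of min_df is easier to see on a toy corpus. In this sketch (made-up documents, hypothetical variable names) only the terms that occur in at least two documents survive.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up documents: "apple" is in 3 documents, "banana" in 2, the rest in 1
toy_docs = ["apple banana", "apple cherry", "apple banana date"]

all_terms = CountVectorizer().fit(toy_docs)
frequent_terms = CountVectorizer(min_df=2).fit(toy_docs)

print(sorted(all_terms.vocabulary_))
print(sorted(frequent_terms.vocabulary_))
```

An integer min_df is an absolute document count; a float between 0.0 and 1.0 is interpreted as a proportion of documents instead.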


Lemmatisation is performed with NLTK and passed to scikit-learn as a custom tokenizer function.

In [79]:
import nltk
def lemma_tokenizer(text):
    # use the standard scikit-learn tokenizer first
    standard_tokenizer = CountVectorizer().build_tokenizer()
    tokens = standard_tokenizer(text)
    # then use NLTK to perform lemmatisation on each token
    lemmatizer = nltk.stem.WordNetLemmatizer()
    lemma_tokens = []
    for token in tokens:
        lemma_tokens.append( lemmatizer.lemmatize(token) )
    return lemma_tokens

In [55]:
nltk.download('wordnet')  ### Download the appropriate packages.

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bunny\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [169]:
vectorizer = CountVectorizer(stop_words="english",min_df = 3,tokenizer=lemma_tokenizer)
X = vectorizer.fit_transform(raw_documents)
print(X.shape)
print(list(vectorizer.vocabulary_.keys())[0:35])

(1409, 8697)
['sporting', 'industry', 'ha', 'come', 'long', 'way', '60', 'niche', 'root', 'deep', 'sport', 'showing', 'sign', 'decline', 'time', 'soon', 'later', 'reason', 'seemingly', 'difference', 'customer', 'fan', 'leader', 'ownership', 'group', 'nba', 'king', 'explained', 'paint', 'face', 'ceo', 'business', 'dying', 'position', 'passion']


# Term Weighting

The document-term matrix can be improved by giving more weight to the more important terms. The most common weighting scheme for this is term frequency–inverse document frequency (TF-IDF).

In [86]:
from sklearn.feature_extraction.text import TfidfVectorizer
# we can pass in the same preprocessing parameters
vectorizer = TfidfVectorizer(stop_words="english",min_df = 5)
X = vectorizer.fit_transform(raw_documents)
# check the shape of the weighted matrix
X.shape

(1409, 6771)
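
To show what the TF-IDF weights are, the sketch below recomputes them by hand on a toy corpus and compares the result with TfidfVectorizer. It assumes the scikit-learn defaults (smooth_idf=True, norm="l2", sublinear_tf=False); the documents and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["apple banana", "apple cherry", "apple banana date"]

# raw term counts, one row per document
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]

# document frequency: number of documents that contain each term
doc_freq = (counts > 0).sum(axis=0)

# smoothed inverse document frequency (scikit-learn default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# weight the counts, then scale each row to unit Euclidean length (norm="l2")
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

library_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual_tfidf, library_tfidf))
```

With these defaults the check is expected to print True. A term such as "apple", which occurs in every document, gets the smallest idf value, which is how TF-IDF plays down very common words.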

Next, a CSV file is imported that contains three columns: the title of the article, its category and the article text.

In [129]:
import pandas as pd
df = pd.read_csv('category_title_text.csv')
#df
df.columns = ['title','category','text'] ### changing column names

The complete dataset is randomly split into a training set and a test set, with 20% (test_size=0.2) of the data used for the test set.

In [132]:
from sklearn.model_selection import train_test_split
y=df.category
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
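
Because the split is random, the accuracies reported later can change from run to run. One possible variation, sketched here on made-up data, fixes random_state for reproducibility and uses stratify to keep the category proportions the same in both sets. Both arguments are assumptions added for illustration and are not used in the cell above.

```python
from sklearn.model_selection import train_test_split

# made-up stand-ins for the article texts and their categories
toy_texts = ["article %d" % i for i in range(10)]
toy_labels = ["sport"] * 5 + ["business"] * 5

texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,        # 20% of the rows go to the test set
    random_state=0,       # hypothetical seed, makes the split repeatable
    stratify=toy_labels,  # keep the class balance in both sets
)

print(len(texts_train), len(texts_test))
print(labels_test)
```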

Text preprocessing, tokenizing and stop word filtering are all included in CountVectorizer, a high-level component that builds a dictionary of features and transforms documents into feature vectors.

As before, each document is represented as a term vector, and these vectors are stacked into a document-term matrix. This time the matrix is built from the training documents only.

In [134]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train.text)
X_train_counts.shape

(1126, 21485)

The code below uses TF-IDF (term frequency times inverse document frequency) to reduce the weight of very common words such as "the", "is" and "an", which occur in almost all of the documents.

In [135]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(1126, 21485)
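
One detail worth spelling out: documents that were not part of the training set should go through transform, not fit_transform, so that they are mapped onto the vocabulary and idf weights learned from the training data. The following sketch uses made-up documents and hypothetical variable names.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# made-up training documents and one unseen document
toy_train_docs = ["the team won the match", "shares fell on the stock market"]
toy_new_docs = ["the team lost the match in overtime"]

toy_count_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()

# fit_transform learns the vocabulary and idf weights from the training data
toy_train_tfidf = toy_tfidf.fit_transform(toy_count_vect.fit_transform(toy_train_docs))

# transform reuses them, so unseen words such as "overtime" are simply ignored
toy_new_tfidf = toy_tfidf.transform(toy_count_vect.transform(toy_new_docs))

print(toy_train_tfidf.shape, toy_new_tfidf.shape)
```

Both matrices have the same number of columns, which is what a classifier trained on the first matrix needs in order to score the second.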

# Naive Bayes classifier

With the features in place, a classifier can be trained to predict the category of an article. Here a naive Bayes classifier is used; scikit-learn provides the multinomial variant, which is well suited to word counts and TF-IDF features.

In [136]:
from sklearn.naive_bayes import MultinomialNB
#y_train.reshape
clf = MultinomialNB().fit(X_train_tfidf, y_train)
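
To classify a new article with a classifier trained this way, the text has to pass through the same fitted vectorizer and transformer first. A minimal sketch on made-up data (the sentences, labels and variable names are all hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# made-up training data with two categories
toy_texts = [
    "the team won the match",
    "the striker scored a goal",
    "shares fell on the stock market",
    "the company reported profits",
]
toy_labels = ["sport", "sport", "business", "business"]

toy_count_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
toy_features = toy_tfidf.fit_transform(toy_count_vect.fit_transform(toy_texts))

toy_clf = MultinomialNB().fit(toy_features, toy_labels)

# a new document goes through transform (not fit_transform) before predict
new_features = toy_tfidf.transform(toy_count_vect.transform(["the team scored a goal"]))
toy_prediction = toy_clf.predict(new_features)
print(toy_prediction)
```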

# Building a Pipeline

To make the vectorizer => transformer => classifier sequence easier to work with, the three steps are combined using the Pipeline class from scikit-learn.

In [137]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                     ])

The whole pipeline is trained with a single call to fit(...) and then used to predict the categories of the test articles.

In [144]:
text_clf.fit(X_train.text, y_train)  
predictions = text_clf.predict(X_test.text)
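
The convenience of the pipeline is that it accepts raw strings for both fit and predict and applies every step in order. The sketch below shows this on made-up data; the sentences, labels and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# made-up training data with two categories
toy_texts = [
    "the team won the match",
    "the striker scored a goal",
    "shares fell on the stock market",
    "the company reported profits",
]
toy_labels = ["sport", "sport", "business", "business"]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
toy_pipeline.fit(toy_texts, toy_labels)

# raw text in, predicted label out; no manual transform calls are needed
toy_prediction = toy_pipeline.predict(["profits and shares on the stock market"])
print(toy_prediction)

# each fitted step stays accessible by the name given in the pipeline
print(len(toy_pipeline.named_steps["vect"].vocabulary_))
```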

Evaluation of the performance on the test set

In [161]:
import numpy as np
print(np.mean(predictions == y_test))
len(predictions)

0.9716312056737588


282

The naive Bayes pipeline achieved 97.16% accuracy on the 282 test articles.
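
A single accuracy figure hides which categories get confused with each other. As a possible extension, scikit-learn's metrics module can report accuracy together with a confusion matrix. The sketch below uses made-up labels rather than the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# made-up true and predicted labels for four articles
toy_true = ["sport", "sport", "business", "business"]
toy_pred = ["sport", "business", "business", "business"]

# fixing the label order makes the rows and columns of the matrix predictable
toy_label_order = ["business", "sport"]

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_cm = confusion_matrix(toy_true, toy_pred, labels=toy_label_order)

print(toy_accuracy)
# rows are true labels, columns are predicted labels
print(toy_cm)
```

On the real data the same two calls would take y_test and predictions as arguments.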

# Support Vector Machine (SVM)

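Next, the articles are classified with a support vector machine (SVM) so that the results of the two algorithms can be compared. Only the learner changes: a different classifier object is plugged into the pipeline. SGDClassifier with hinge loss trains a linear SVM using stochastic gradient descent.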

In [162]:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         # max_iter replaces n_iter, which recent scikit-learn releases no longer accept
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, max_iter=5, tol=None,
                                                   random_state=42)),
                         ])
_ = text_clf_svm.fit(X_train.text, y_train)
predicted_svm = text_clf_svm.predict(X_test.text)
np.mean(predicted_svm == y_test)



0.9858156028368794

The SVM pipeline reaches 98.58% accuracy, slightly better than the naive Bayes classifier on the same test set.

The code below runs 2-fold cross-validation (cv=2) on the full dataset and prints the accuracy of each fold for both the naive Bayes and the support vector machine pipelines, so that the two methods can be compared.

In [168]:
from sklearn.model_selection import cross_val_score

acc_scores_NB = cross_val_score(text_clf, df.text, y, cv=2, scoring="accuracy")
print(acc_scores_NB)

acc_scores_svm = cross_val_score(text_clf_svm, df.text, y, cv=2, scoring="accuracy")
print(acc_scores_svm)

[0.96879433 0.95305832]




[0.98297872 0.96870555]
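
The per-fold scores can be summarised by their mean. The sketch below copies the fold accuracies from the output above into arrays (the variable names are hypothetical) and averages them.

```python
import numpy as np

# fold accuracies copied from the cross-validation output above
nb_fold_scores = np.array([0.96879433, 0.95305832])
svm_fold_scores = np.array([0.98297872, 0.96870555])

nb_mean = nb_fold_scores.mean()
svm_mean = svm_fold_scores.mean()

print("Naive Bayes mean accuracy: %.4f" % nb_mean)
print("SVM mean accuracy: %.4f" % svm_mean)
```

This gives a mean accuracy of roughly 0.961 for naive Bayes and 0.976 for the SVM, so the SVM stays ahead under cross-validation as well. With only two folds the estimate is coarse; a larger cv value would give a steadier comparison at the cost of more training runs.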
