In [1]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [2]:
twenty_train.target_names #prints all the categories

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [3]:
twenty_train.target

array([7, 4, 4, ..., 3, 1, 8])

In [4]:
len(twenty_train.target)

11314

In [5]:
len(set(twenty_train.target))

20

In [6]:
print("\n".join(twenty_train.data[0].split("\n")[1:3])) #prints second and third lines of the first data file

Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu


In [7]:
type(twenty_train)

sklearn.utils.Bunch

In [8]:
len(twenty_train)

5

In [9]:
len(twenty_train.data)

11314

In [10]:
twenty_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

From https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a:

Calling ‘count_vect.fit_transform(twenty_train.data)’ learns the vocabulary dictionary and returns a document-term matrix of shape [n_samples, n_features].
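
To make the document-term matrix concrete, here is a minimal sketch on a three-sentence toy corpus. The corpus is made up for illustration and is not part of 20 Newsgroups. It shows the learned vocabulary and the count matrix with one row per document and one column per term.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the 20 Newsgroups data.
docs = ["the car is fast", "the car is red", "space is big"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each learned term to its column index;
# sorting by that index gives the column order of the matrix.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

print(vocab)
print(counts.toarray())
print(counts.shape)  # (n_samples, n_features)
```

The same call on the full training set produces the (11314, 130107) matrix shown above, stored as a sparse matrix because most entries are zero.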

TF: Just counting how often each word occurs in each document has one issue: it gives more weight to longer documents than to shorter ones. To avoid this, we can use term frequencies (TF), i.e. #count(word) / #total words, in each document.
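
The length effect can be checked with a small pure-Python sketch. The helper `term_frequencies` and the two example documents are hypothetical and exist only for this illustration. A document repeated ten times has ten times the raw counts but the same term frequencies.

```python
from collections import Counter

def term_frequencies(text):
    """Return {word: count / total_words} for one document."""
    words = text.lower().split()
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}

short_doc = "car car engine"
long_doc = " ".join([short_doc] * 10)  # same content, ten times longer

raw_short = Counter(short_doc.split())
raw_long = Counter(long_doc.split())
tf_short = term_frequencies(short_doc)
tf_long = term_frequencies(long_doc)

print(raw_short["car"], raw_long["car"])  # raw counts differ by a factor of 10
print(tf_short["car"], tf_long["car"])    # term frequencies agree
```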

TF-IDF: Finally, we can also reduce the weight of very common words (the, is, an, etc.) that occur in nearly every document. This is called TF-IDF, i.e. term frequency times inverse document frequency.
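
As a sketch of what the weighting does, the block below recomputes TF-IDF by hand on a made-up toy corpus and compares it with `TfidfTransformer`. It assumes the transformer's default settings (`smooth_idf=True`, `norm='l2'`), under which scikit-learn uses raw counts as the term frequency, the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, and then rescales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus; "the" and "is" occur in every document.
docs = ["the car is fast", "the car is red", "the space is big"]

vect = CountVectorizer()
counts_sparse = vect.fit_transform(docs)
counts = counts_sparse.toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF (default settings)

manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise rows

from_sklearn = TfidfTransformer().fit_transform(counts_sparse).toarray()

print(np.allclose(manual, from_sklearn))
print(dict(zip(sorted(vect.vocabulary_), idf.round(3))))
```

Terms that appear in every document get the smallest IDF, so after weighting they contribute less to each row than rarer terms such as "fast" or "red".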

In [12]:
# TF-IDF: term-frequency times inverse document frequency
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)

In [13]:
# Classifier using Naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [14]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB()),
])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [15]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.7738980350504514

The accuracy we get is ~77.39%, which is not bad for a start and for a naive classifier.
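
The expression `np.mean(predicted == target)` is the fraction of correct predictions. A small sketch with made-up labels shows that it matches `accuracy_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for eight test documents.
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 0])

# The comparison gives a boolean array; its mean is the share of True values.
manual_accuracy = np.mean(y_pred == y_true)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```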

In [16]:
twenty_test.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [17]:
predicted

array([ 7, 11,  0, ...,  9,  3, 15])

In [18]:
# support vector machine classifier
from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-3, max_iter=5, tol=None, random_state=42)),
])

_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)

predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target)



0.8238183749336165

The accuracy we get is ~82.38%. Yipee, a little better 👌

In [19]:
# Tuning hyperparameters
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
               'tfidf__use_idf': (True, False),
               'clf__alpha': (1e-2, 1e-3),
}

In [20]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

In [21]:
%%time
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)



Wall time: 3min 17s


In [22]:
print(gs_clf.best_score_)
gs_clf.best_params_

0.9067526957751458


{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

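The best mean cross-validated accuracy is now ~90.68% for the NB classifier (not so naive anymore! 😄) and the corresponding hyperparameters are {'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}. Note that best_score_ is measured by cross-validation on the training data, so it is not directly comparable to the test-set accuracy reported earlier.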
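
To show where `best_score_` comes from, here is a minimal sketch of the same grid on a tiny made-up corpus. The documents, labels and `cv=5` are assumptions for illustration. `GridSearchCV` fits every parameter combination on each cross-validation split of the data it is given and keeps the combination with the highest mean validation score.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: two easily separable classes, ten documents each.
docs = ["car engine wheel road"] * 10 + ["orbit rocket moon launch"] * 10
labels = [0] * 10 + [1] * 10

pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])
grid = {"vect__ngram_range": [(1, 1), (1, 2)],
        "tfidf__use_idf": (True, False),
        "clf__alpha": (1e-2, 1e-3)}

search = GridSearchCV(pipe, grid, cv=5).fit(docs, labels)

# One mean cross-validated score per combination (2 * 2 * 2 = 8 of them).
mean_scores = search.cv_results_["mean_test_score"]
print(len(mean_scores))
print(search.best_score_ == mean_scores.max())
print(search.best_params_)
```

To compare a tuned model with the earlier test-set numbers, one option is to call `search.predict` on the held-out test documents and compute the accuracy there.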

In [23]:
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)],
               'tfidf__use_idf': (True, False),
               'clf-svm__alpha': (1e-2, 1e-3),
}
gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)

In [24]:
%%time
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)



Wall time: 3min 24s


In [32]:
print(gs_clf_svm.best_score_)
gs_clf_svm.best_params_

0.8979140887396146


{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

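Grid search for the SVM classifier gives a best cross-validated accuracy of ~89.79% on the training set. The untuned pipeline scored ~82.38% on the test set, so the two figures are not directly comparable.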

In [25]:
# Removing stop words from original Naive Bayes model
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [26]:
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.8169144981412639

Removing stop words from the original Naive Bayes model improves the accuracy from 77.39% to 81.69%.

In [27]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB(fit_prior=False)),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [28]:
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.8214285714285714

When fit_prior is set to False, MultinomialNB uses a uniform class prior instead of learning the priors from the training data. This helps only slightly: the accuracy goes from 81.69% to 82.14%.
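
The effect of `fit_prior` can be seen directly in the fitted `class_log_prior_` attribute. The sketch below uses a made-up, imbalanced count matrix: with `fit_prior=True` the priors follow the class frequencies, and with `fit_prior=False` every class gets the same prior.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: three samples of class 0, one sample of class 1.
X = np.array([[2, 0], [3, 1], [1, 0], [0, 2]])
y = np.array([0, 0, 0, 1])

learned = MultinomialNB(fit_prior=True).fit(X, y)
uniform = MultinomialNB(fit_prior=False).fit(X, y)

# class_log_prior_ holds log-probabilities, so exponentiate to read them.
learned_prior = np.exp(learned.class_log_prior_)
uniform_prior = np.exp(uniform.class_log_prior_)

print(learned_prior)  # follows the 3:1 class ratio
print(uniform_prior)  # equal for both classes
```

The 20 Newsgroups classes are roughly balanced, which may be why switching to a uniform prior changes the result so little here.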

In [29]:
import nltk
# nltk.download('stopwords')  # run once; ignore_stopwords=True below needs the stopwords corpus

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english", ignore_stopwords=True)

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])
    
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect),
                      ('tfidf', TfidfTransformer()),
                      ('mnb', MultinomialNB(fit_prior=False)),
])

In [30]:
%%time
text_mnb_stemmed = text_mnb_stemmed.fit(twenty_train.data, twenty_train.target)

Wall time: 34.5 s


In [31]:
predicted_mnb_stemmed = text_mnb_stemmed.predict(twenty_test.data)
np.mean(predicted_mnb_stemmed == twenty_test.target)

0.8167817312798725

The accuracy we get with stemming is ~81.68%, slightly below the 82.14% without it. Stemming didn't make much difference in this case.
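
What stemming changes is the vocabulary: inflected forms are merged into one feature. The sketch below illustrates the `build_analyzer` override with a deliberately crude stand-in, `toy_stem`, which only strips a trailing "s". It is a hypothetical helper, not the Snowball algorithm, and is used so the example runs without NLTK data.

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

class ToyStemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the default analyzer (lowercasing, tokenising, stop words),
        # then stem every token it yields.
        analyzer = super().build_analyzer()
        return lambda doc: [toy_stem(w) for w in analyzer(doc)]

docs = ["cars and engines", "a car with an engine"]

plain = CountVectorizer(stop_words="english").fit(docs)
stemmed = ToyStemmedCountVectorizer(stop_words="english").fit(docs)

plain_vocab = sorted(plain.vocabulary_)
stemmed_vocab = sorted(stemmed.vocabulary_)

print(plain_vocab)    # singular and plural forms are separate features
print(stemmed_vocab)  # plural forms collapse onto the singular
```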

In [33]:
# Put it all together for the best Naive Bayes model
# Original best parameters: {'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB(fit_prior=False)),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [34]:
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
               'tfidf__use_idf': (True, False),
               'clf__alpha': (1e-2, 1e-3),
}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

In [35]:
%%time
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)



Wall time: 2min 14s


In [36]:
print(gs_clf.best_score_)
gs_clf.best_params_

0.9066643097047905


{'clf__alpha': 0.01, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}

About the same accuracy as the original grid search; this time, different parameters performed best.

In [45]:
# Removing stop words from original SVM model
from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                      ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-3, max_iter=5, tol=None, random_state=42)),
])

_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)

predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target)



0.8224907063197026

Removing stop words from the SVM model didn't help: the accuracy is ~82.25%, compared with ~82.38% before.

In [47]:
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                      ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-3, max_iter=5, tol=None, random_state=42)),
])

parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)],
               'tfidf__use_idf': (True, False),
               'clf-svm__alpha': (1e-2, 1e-3),
}
gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)

In [48]:
%%time
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)



Wall time: 3min 16s


In [49]:
print(gs_clf_svm.best_score_)
gs_clf_svm.best_params_

0.8954392787696659


{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

About the same as the last SVM grid search, so removing stop words didn't improve here either.