In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

import nltk
nltk.download()  # opens the interactive downloader; SnowballStemmer(ignore_stopwords=True) below needs the 'stopwords' corpus
from nltk.stem.snowball import SnowballStemmer

import numpy as np

from pprint import pprint

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


## Intro

The sklearn.datasets.fetch_20newsgroups function fetches and caches the data: it downloads the archive from the original 20 newsgroups website, extracts its contents into the ~/scikit_learn_data/20news_home folder, and calls sklearn.datasets.load_files on the training set folder, the testing set folder, or both.

Let's also look at the first three lines of the first article.

In [2]:
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)  # the download is about 14 MB
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
pprint(list(twenty_train.target_names))

print("\nFirst article is:\n" + "\n".join(twenty_train.data[0].split("\n")[:3]))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

First article is:
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu


## Extracting features from text files

To run machine learning algorithms, we need to convert the text files into numerical feature vectors. I will use the bag-of-words model. Briefly, we segment each text file into words (for English, by splitting on whitespace), assign each word an integer id, and count the number of times each word occurs in each document. Each unique word in our dictionary corresponds to one feature (descriptive feature).

Scikit-learn has a high-level component, ‘CountVectorizer’, which creates these feature vectors for us.

Calling ‘count_vect.fit_transform(twenty_train.data)’ learns the vocabulary dictionary and returns a document-term matrix of shape [n_samples, n_features].

In [3]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)
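
To make the document-term matrix concrete, here is a minimal sketch on a made-up three-document corpus (the documents and the `toy_` names are illustrative assumptions, not part of the newsgroups data). Each column is one vocabulary word and each row is one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = ["the car is fast",
            "the car is red",
            "the bike is fast and light"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # sparse matrix [n_samples, n_features]

# vocabulary_ maps each word to its column index; sort words by that index
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

print(toy_vocab)
print(toy_counts.toarray())
print(toy_counts.shape)
```

The full newsgroups matrix works the same way, only with 11314 rows and 130107 columns, which is why scikit-learn stores it as a sparse matrix.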

**TF**: Just counting the number of words in each document has one issue: it gives more weight to longer documents than to shorter ones. To avoid this, we can use term frequencies (TF), i.e. #count(word) / #total words in each document.

**TF-IDF**: Finally, we can also reduce the weight of very common words (the, is, an, etc.) that occur in almost every document. This is called TF-IDF, i.e. term frequency times inverse document frequency.

In [4]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)
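
The arithmetic behind TF and TF-IDF can be sketched by hand. The example below uses the textbook definitions given above (TF = count / total words, IDF = log(N / document frequency)) on a made-up, pre-tokenized corpus. This is an assumption for illustration: TfidfTransformer's defaults use a smoothed IDF and L2 normalization, so its numbers will differ, although the ranking of words is typically similar.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized toy corpus
toy_docs = [["the", "car", "is", "fast"],
            ["the", "car", "is", "red"],
            ["the", "bike", "is", "fast"]]

n_docs = len(toy_docs)
# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

def tf_idf(doc):
    """Return {word: tf * idf} for one document, using the plain textbook formulas."""
    counts = Counter(doc)
    total = len(doc)
    return {word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()}

weights = tf_idf(toy_docs[1])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:>4}: {weight:.3f}")
```

In the second document, "red" (unique to it) gets the largest weight, "car" (in two documents) a smaller one, and "the" and "is" (in every document) get zero.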

## Naive Bayes

You can easily build a naive Bayes (NB) classifier in scikit-learn with the single line of code below. (Note: there are many variants of NB, but discussing them is out of scope.)

In [5]:
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
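
To classify new text with a classifier trained this way, the new documents have to go through the already fitted vectorizer and transformer using `transform`, not `fit_transform`, so that they are mapped onto the same vocabulary. Below is a minimal sketch on a made-up two-class corpus (all `toy_` names, documents and labels are illustrative assumptions).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: 0 = autos, 1 = space
toy_train = ["the engine and the brakes of my car",
             "new car tires and engine oil",
             "the rocket reached orbit",
             "satellite launch into orbit"]
toy_labels = [0, 0, 1, 1]
toy_names = ["autos", "space"]

toy_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
X_toy = toy_tfidf.fit_transform(toy_vect.fit_transform(toy_train))
toy_clf = MultinomialNB().fit(X_toy, toy_labels)

# New documents: reuse the fitted objects with transform(), never fit_transform()
toy_new = ["my car needs a new engine", "satellite orbit launch"]
X_new = toy_tfidf.transform(toy_vect.transform(toy_new))
toy_pred = toy_clf.predict(X_new)

for doc, label in zip(toy_new, toy_pred):
    print(f"{doc!r} => {toy_names[label]}")
```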

## Pipeline

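We can write less code and do all of the above by building a pipeline as follows. The step names ‘vect’, ‘tfidf’ and ‘clf’ are arbitrary. Note that this pipeline also passes stop_words='english' to CountVectorizer, which the cells above did not.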

In [6]:
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
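
As a sanity check that a pipeline really is just the manual steps chained together, the sketch below fits both variants on a made-up corpus and compares their predicted probabilities (the corpus and all `toy_`/`manual_` names are illustrative assumptions).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = autos, 1 = space
toy_train = ["car engine oil", "car tires brakes",
             "rocket orbit launch", "satellite orbit"]
toy_labels = [0, 0, 1, 1]
toy_new = ["engine brakes", "rocket satellite"]

# Manual chain: vectorize, weight, classify
manual_vect = CountVectorizer()
manual_tfidf = TfidfTransformer()
manual_clf = MultinomialNB().fit(
    manual_tfidf.fit_transform(manual_vect.fit_transform(toy_train)), toy_labels)
manual_proba = manual_clf.predict_proba(
    manual_tfidf.transform(manual_vect.transform(toy_new)))

# The same three steps as one pipeline object
toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())]).fit(toy_train, toy_labels)
pipe_proba = toy_pipe.predict_proba(toy_new)

print(np.allclose(manual_proba, pipe_proba))
```

The pipeline also removes the risk of accidentally calling `fit_transform` on test data, because `predict` only ever calls `transform` on the intermediate steps.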

## Testing model #1

In [7]:
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.81691449814126393
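
The expression `np.mean(predicted == target)` is plain accuracy: the fraction of predictions that match the true labels. The sketch below, on made-up label arrays, shows that it agrees with `sklearn.metrics.accuracy_score` and adds a confusion matrix, which shows which classes get mixed up (the arrays are illustrative assumptions, not results from the newsgroups model).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for three classes
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

acc_manual = np.mean(y_pred == y_true)      # fraction of exact matches
acc_sklearn = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)       # rows: true class, columns: predicted class

print(acc_manual, acc_sklearn)
print(cm)
```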

# Grid Search

Almost all classifiers have parameters that can be tuned to improve performance. Scikit-learn provides an extremely useful tool for this, ‘GridSearchCV’.

Here, we create a dictionary of the parameters we would like to tune. Each parameter name starts with the name of the pipeline step it belongs to (remember the arbitrary names we gave), followed by a double underscore. E.g. vect__ngram_range: here we tell the search to try unigrams only and unigrams plus bigrams, and to choose whichever performs better.

Next, we create an instance of the grid search by passing the classifier, the parameters and n_jobs=-1, which tells it to use all available cores on the user's machine.

In [8]:
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),}

In [9]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

In [49]:
gs_clf.best_score_  # not shown below: a cell only echoes its last expression
gs_clf.best_params_

{'clf__alpha': 0.01, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}
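
The `<step name>__<parameter name>` convention can be tried out quickly on a small example. The sketch below runs a grid search over a made-up corpus with a two-step pipeline and prints both the best score and the best parameters (the corpus, the `toy_` names and `cv=2` are illustrative assumptions chosen so that the example runs in well under a second).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = autos, 1 = space
toy_docs = ["car engine oil", "car tires brakes",
            "engine brakes repair", "car repair shop",
            "rocket orbit launch", "satellite orbit",
            "rocket launch pad", "satellite launch orbit"]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('clf', MultinomialNB())])

# get_params() lists every name that is valid as a grid key
valid_names = toy_pipe.get_params()
print('vect__ngram_range' in valid_names, 'clf__alpha' in valid_names)

toy_grid = {'vect__ngram_range': [(1, 1), (1, 2)],
            'clf__alpha': (1e-2, 1e-3)}

# cv=2 keeps two documents per class in each fold of this tiny corpus
toy_search = GridSearchCV(toy_pipe, toy_grid, cv=2).fit(toy_docs, toy_labels)

print(toy_search.best_score_)   # mean cross-validated accuracy of the best setting
print(toy_search.best_params_)
```

Using `print` for both attributes avoids the situation where only the last expression of a cell is displayed.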

# Including NLTK

What we learned in the previous notebooks now comes in handy: we stem each token with NLTK's Snowball stemmer before counting.

In [10]:
stemmer = SnowballStemmer("english", ignore_stopwords=True)

In [11]:
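class StemmedCountVectorizer(CountVectorizer):
    """CountVectorizer that stems every token produced by the default analyzer."""
    def build_analyzer(self):
        # Reuse the parent's preprocessing, tokenization and stop-word removal
        analyzer = super().build_analyzer()
        # Stem each token with the module-level `stemmer`
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]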
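
The effect of stemming on the vocabulary is easiest to see on a tiny example. The sketch below compares a plain CountVectorizer with a stemming subclass on two made-up sentences. The `toy_` names and `ToyStemmedVectorizer` are illustrative assumptions, and the stemmer is created without ignore_stopwords so that the example does not need the NLTK stopwords corpus.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

toy_stemmer = SnowballStemmer("english")

class ToyStemmedVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        # Stem every token returned by the default analyzer
        return lambda doc: [toy_stemmer.stem(tok) for tok in analyzer(doc)]

toy_docs = ["the car stopped", "two cars stopping"]

plain_vocab = sorted(CountVectorizer().fit(toy_docs).vocabulary_)
stemmed_vocab = sorted(ToyStemmedVectorizer().fit(toy_docs).vocabulary_)

print(plain_vocab)    # inflected forms are separate features
print(stemmed_vocab)  # "car"/"cars" and "stopped"/"stopping" share one feature each
```

Fewer, denser features can help a bag-of-words model generalize, although in this notebook the stemmed pipeline ends up with almost the same test accuracy as the unstemmed one.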

In [12]:
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

In [15]:
text_mnb_stemmed = Pipeline([('vect', stemmed_count_vect), ('tfidf', TfidfTransformer()), 
                             ('mnb', MultinomialNB(fit_prior=False))])

In [16]:
text_mnb_stemmed = text_mnb_stemmed.fit(twenty_train.data, twenty_train.target)

In [17]:
predicted_mnb_stemmed = text_mnb_stemmed.predict(twenty_test.data)

In [18]:
np.mean(predicted_mnb_stemmed == twenty_test.target)

0.81678173127987252

In [20]:
with open('second_text.txt', 'r') as f:
    # Join lines with a space so that words on either side of a line break do not merge
    new_document = f.read().replace('\n', ' ')

In [33]:
predicted_own_doc = text_mnb_stemmed.predict([new_document])
print(predicted_own_doc)
twenty_train.target_names[predicted_own_doc[0]]  # the predicted label is already a 0-based index into target_names

[11]


'sci.crypt'