# Text Classification
[Source](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

## Preliminaries
... as usual.

In [1]:
import numpy as np

## Load the data

In [2]:
from sklearn.datasets import fetch_20newsgroups

Load the list of files from a few categories as follows:

In [3]:
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
twenty_train.keys()

['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']

**Description:**<br/>**twenty_train.filenames** lists the files whose contents are stored in **twenty_train.data**. The category of each document is given in **twenty_train.target** as an integer, and each integer is an index into **twenty_train.target_names**.
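A minimal sketch of that index lookup. The category names are listed in the order the classification report further down prints them; the label values are made up for illustration and are not the real contents of **twenty_train.target**.

```python
# Category names, in the order twenty_train.target_names holds them
target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

# Hypothetical labels standing in for the first few entries of twenty_train.target
toy_target = [1, 1, 3, 3, 2]

# Each integer label is simply an index into target_names
toy_labels = [target_names[i] for i in toy_target]
for index, name in zip(toy_target, toy_labels):
    print('%d => %s' % (index, name))
```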

In [4]:
print("The data set contains {} data points".format(len(twenty_train.data)))

The data set contains 2257 data points


## Build the dictionary

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(twenty_train.data)

... and calculate **tfidf**. Note that *fit* and *transform* could be applied separately as well.

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
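As a rough illustration of applying *fit* and *transform* separately, the sketch below runs both variants on a made-up three-document corpus (not the newsgroups data) and checks that they produce the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up toy corpus, used only for this illustration
toy_docs = ['god is love', 'opengl renders graphics', 'love of graphics']
toy_counts = CountVectorizer().fit_transform(toy_docs)

# One step: learn the idf weights and apply them in a single call
one_step = TfidfTransformer().fit_transform(toy_counts)

# Two steps: learn the idf weights first, then apply them
toy_tfidf = TfidfTransformer()
toy_tfidf.fit(toy_counts)
two_step = toy_tfidf.transform(toy_counts)

same_result = np.allclose(one_step.toarray(), two_step.toarray())
print(same_result)
```

Splitting the two calls matters later on: new documents must only be transformed with weights that were already fitted on the training data.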

## Training the classifier

In [7]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

... and run it over some quick test data.

In [8]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']

In [9]:
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))


'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
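Note that the new documents go through **transform**, not **fit_transform**, so they are mapped onto the vocabulary learned from the training data. The sketch below shows the effect on a made-up toy corpus: words that were never seen during fitting are simply dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training documents; the vocabulary is learned from these only
toy_train = ['god is love', 'opengl renders graphics']
toy_vect = CountVectorizer().fit(toy_train)

# 'of' and 'cards' are not in the learned vocabulary, so they are ignored
toy_new = toy_vect.transform(['love of graphics cards'])

print(sorted(toy_vect.vocabulary_))
print(toy_new.shape)   # one row, one column per training-vocabulary word
print(toy_new.sum())   # only the known words 'love' and 'graphics' are counted
```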


### Pipeline building
Alternatively, we can use a Pipeline object to do all of the above in one go.

In [10]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

prediction2 = text_clf.predict(docs_new)

for doc, category in zip(docs_new, prediction2):
    print('%r => %s' % (doc, twenty_train.target_names[category]))


'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
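To make the equivalence concrete, here is a small sketch on a made-up toy corpus (the documents and labels are assumptions for illustration only). It runs the three steps by hand and through a pipeline, then compares the predictions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data: class 0 is "religion", class 1 is "graphics"
toy_docs = ['god is love', 'faith and love',
            'opengl renders graphics', 'gpu graphics are fast']
toy_target = [0, 0, 1, 1]
toy_new = ['love and faith', 'fast gpu']

# Manual route: vectorize, weight, then classify
vect = CountVectorizer()
tfidf = TfidfTransformer()
X_toy = tfidf.fit_transform(vect.fit_transform(toy_docs))
manual = MultinomialNB().fit(X_toy, toy_target).predict(
    tfidf.transform(vect.transform(toy_new)))

# Pipeline route: the same three steps chained together
toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())]).fit(toy_docs, toy_target)
piped = toy_pipe.predict(toy_new)

print(np.array_equal(manual, piped))
# Individual steps stay reachable by the names given in the pipeline
print(type(toy_pipe.named_steps['clf']).__name__)
```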


## Testing the Model

Load the test data.

In [11]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data

... and now run the trained classifier.

In [12]:
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.83488681757656458

So we achieved about 83.5% accuracy.
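The expression `np.mean(predicted == twenty_test.target)` works because comparing two arrays gives a boolean array, and the mean of booleans is the fraction that are `True`. A tiny sketch with made-up labels:

```python
import numpy as np

# Made-up predictions and true labels, for illustration only
toy_predicted = np.array([0, 1, 2, 2])
toy_true = np.array([0, 1, 1, 2])

# Element-wise comparison: True where the prediction is correct
matches = toy_predicted == toy_true

# True counts as 1 and False as 0, so the mean is the accuracy
toy_accuracy = np.mean(matches)
print(matches)
print(toy_accuracy)  # 3 correct out of 4
```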

## SVM
Let’s see if we can do better with a linear **support vector machine** (SVM), which is widely regarded as one of the best text classification algorithms (although it’s also a bit slower than naïve Bayes). We can change the learner by just plugging a different classifier object into our pipeline.

Build the pipeline and run it; it is as simple as that.

In [13]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           # n_iter was renamed max_iter in newer scikit-learn releases
                                           alpha=1e-3, max_iter=5, tol=None, random_state=42)),
])

_ = text_clf.fit(twenty_train.data, twenty_train.target)

predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.9127829560585885

... and we go up to 91.3%, not bad!
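It is the `loss='hinge'` setting that makes **SGDClassifier** behave as a linear SVM. As a rough sketch of what that loss measures, the snippet below evaluates the hinge loss, max(0, 1 - y·score), for a few made-up labels and decision scores (labels are coded as +1 and -1 here for illustration).

```python
import numpy as np

# Made-up labels (+1 / -1) and decision scores, for illustration only
toy_labels = np.array([1, -1, 1])
toy_scores = np.array([2.0, 0.5, 0.3])

# The margin is positive when the score has the same sign as the label
margins = toy_labels * toy_scores

# Hinge loss: zero once the margin reaches 1, growing linearly below that
hinge = np.maximum(0.0, 1.0 - margins)
print(hinge)
```

The first example is confidently correct and costs nothing, the second is on the wrong side and is penalized most, and the third is correct but too close to the boundary, so it still incurs a small loss.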

### Report time

In [15]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502



In [16]:
metrics.confusion_matrix(twenty_test.target, predicted)

array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])
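The report and the confusion matrix describe the same predictions. In the matrix, rows are the true classes and columns are the predicted classes, so the per-class figures can be recomputed from it. A short sketch using the numbers printed above:

```python
import numpy as np

target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[258,  11,  15,  35],
               [  4, 379,   3,   3],
               [  5,  33, 355,   3],
               [  5,  10,   4, 379]])

correct = np.diag(cm).astype(float)    # correctly classified documents per class
recall = correct / cm.sum(axis=1)      # divide by the number of true members
precision = correct / cm.sum(axis=0)   # divide by the number of predictions
accuracy = correct.sum() / cm.sum()

for name, p, r in zip(target_names, precision, recall):
    print('%-24s precision %.2f  recall %.2f' % (name, p, r))
print('accuracy %.4f' % accuracy)
```

The row sums reproduce the support column of the report, and the overall accuracy matches the 0.9128 obtained with `np.mean`.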

## Grid Search

... now let us use the *Grid Search* feature to try a range of parameter values and pick the best-performing combination.

In [17]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}

If we have multiple CPU cores at our disposal, we can tell the grid searcher to try these eight parameter combinations in parallel with the **n_jobs** parameter. If we give this parameter a value of -1, grid search will detect how many cores are installed and use them all.
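The count of eight comes from taking every combination of the three two-valued parameters (2 × 2 × 2). The sketch below enumerates them with the standard library only; it illustrates the counting and is not how the grid search is implemented internally.

```python
from itertools import product

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}

# Cartesian product over the candidate values of every parameter
names = sorted(parameters)
combinations = [dict(zip(names, values))
                for values in product(*(parameters[name] for name in names))]

for combo in combinations:
    print(combo)
print(len(combinations))  # 2 * 2 * 2
```

Each of these combinations is then fitted once per cross-validation fold, which is why running them in parallel pays off.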

Let us use the classifier from earlier, the SVM pipeline.

In [18]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

Let us fit it on the training set and do a quick check on a sample.

In [22]:
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)
twenty_train.target_names[gs_clf.predict(['God is love'])[0]]

'soc.religion.christian'

In [23]:
predicted = gs_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.9127829560585885

Well, it is exactly as good as the SVM pipeline with the settings used earlier, but at least we tried.
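A fitted grid search also records which combination won, in its `best_params_` and `best_score_` attributes. The sketch below shows how to read them. It assumes a made-up twelve-document corpus, `cv=3`, and `max_iter`/`tol` in place of the older `n_iter`, so that it runs quickly without downloading anything; its scores say nothing about the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus: six documents per class so 3-fold CV can stratify
toy_docs = ['god is love', 'faith hope and love', 'the church prays',
            'prayer and faith', 'love thy neighbour', 'hope in god',
            'opengl renders graphics', 'the gpu is fast', 'graphics card drivers',
            'render the image', 'fast image rendering', 'gpu drivers and opengl']
toy_target = [0] * 6 + [1] * 6

toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                                           max_iter=5, tol=None, random_state=42))])
toy_parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                  'clf__alpha': (1e-2, 1e-3)}

toy_gs = GridSearchCV(toy_pipe, toy_parameters, cv=3).fit(toy_docs, toy_target)

# Mean cross-validated score of the best combination, and its settings
print(toy_gs.best_score_)
for name in sorted(toy_parameters):
    print('%s: %r' % (name, toy_gs.best_params_[name]))
```

Applied to **gs_clf**, the same two attributes would show whether the best combination found by the search simply coincides with the settings used earlier.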