# Working with Text Data
Based on [Scikit-learn tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

## Loading the 20 newsgroups dataset

Load/Dowload 20 Newspapers dataset.

In [1]:
from sklearn.datasets import fetch_20newsgroups
categories = ["alt.atheism", "soc.religion.christian", "comp.graphics", "sci.med"]
twenty_train = fetch_20newsgroups(subset="train", shuffle=True, categories=categories, random_state=42)

Check retrieved data.

In [2]:
print("Loaded catagories : {0}".format(twenty_train.target_names))
print("Number of training data : {0}".format(len(twenty_train.data)))

Loaded catagories : ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
Number of training data : 2257


Check first data content.

In [3]:
print("\n".join(twenty_train.data[0].split("\n")[:3]))
print(twenty_train.target_names[twenty_train.target[0]])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
comp.graphics


Check labels of first elements.

In [4]:
print(twenty_train.target[:10])
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

[1 1 3 3 3 3 3 2 2 2]
comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med


## Extracting features from text files

Tokenizing in order to transform documents into feature vectors.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

Count the number of occurence of the word "algorithm" :

In [6]:
count_vect.vocabulary_.get(u'algorithm')

4690

Transform occurences to frequencies using TF-IDF to prevent disparities between short and long texts. 

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)

## Training a classifier

Train a naïve Bayes classifier with current data.

In [8]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

Try to predict the text subject on new documents

In [9]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


## Building a pipeline

Build a pipeline to make the sequence of operations "vectorizer => transformer => classifier" easier to work with.

In [10]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

## Evaluation of the performance on the test set

Check the predictive accuracy of current model :

In [11]:
import numpy as np
twenty_test = fetch_20newsgroups(subset="test", categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)      

0.8348868175765646

Use a SVM to attempt to improve accuracy :

In [12]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=1000, tol=1e-3))])
text_clf.fit(twenty_train.data, twenty_train.target)  

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...om_state=42, shuffle=True, tol=0.001,
       validation_fraction=0.1, verbose=0, warm_start=False))])

Check the predictive accuracy of SVM model :

In [13]:
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)            

0.9114513981358189

Show metrics and confusion matrix: 

In [14]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))
print(metrics.confusion_matrix(twenty_test.target, predicted))

                        precision    recall  f1-score   support

           alt.atheism       0.96      0.81      0.87       319
         comp.graphics       0.87      0.98      0.92       389
               sci.med       0.96      0.89      0.92       396
soc.religion.christian       0.89      0.96      0.92       398

             micro avg       0.91      0.91      0.91      1502
             macro avg       0.92      0.91      0.91      1502
          weighted avg       0.92      0.91      0.91      1502

[[257  11  12  39]
 [  3 380   2   4]
 [  4  36 351   5]
 [  5  10   2 381]]


## Parameter tuning using grid search

Use grid search to search the best hyper-parameters through an exhausitve search.

In [15]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (1e-2, 1e-3)}
gs_clf = GridSearchCV(text_clf, parameters, cv=5, iid=False, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

Try to predict the subject of one document :

In [16]:
twenty_train.target_names[gs_clf.predict(['God is love'])[0]]

'soc.religion.christian'

Show best score and parameters

In [17]:
print("Best score : {0}".format(gs_clf.best_score_))
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

Best score : 0.9228225446811431
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
