# Document classification
Sources:
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
- https://medium.com/text-classification-algorithms/text-classification-algorithms-a-survey-a215b7ab7e2d

## Preparing the training and test data

In [1]:
# Loading the training data set
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [2]:
# Check the labels (categories) and some data files in the TRAINING data
print('20 NEWS CATEGORIES:\n', *twenty_train.target_names, sep='\n') #prints all the categories
print('\nTRAINING DATA :\n', *twenty_train.data[:1], sep='\n')
print('LABELS: ', twenty_train.target[:1])

20 NEWS CATEGORIES:

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc

TRAINING DATA :

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking c

In [3]:
# Check the labels (categories) and some data files in the TEST data
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
print('TEST DATA:\n', *twenty_test.data[:1], sep='\n')
print('LABELS: ', twenty_test.target[:1])

TEST DATA:

From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
Subject: Need info on 88-89 Bonneville
Organization: University at Buffalo
Lines: 10
News-Software: VAX/VMS VNEWS 1.41
Nntp-Posting-Host: ubvmsd.cc.buffalo.edu


 I am a little confused on all of the models of the 88-89 bonnevilles.
I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
differences are far as features or performance. I am also curious to
know what the book value is for prefereably the 89 model. And how much
less than book value can you usually get them for. In other words how
much are they in demand this time of year. I have heard that the mid-spring
early summer is the best time to buy.

			Neil Gandler

LABELS:  [7]


## ML algorithms
Source: https://blogs.sas.com/content/subconsciousmusings/files/2017/04/machine-learning-cheet-sheet-2.png

<img src="https://blogs.sas.com/content/subconsciousmusings/files/2017/04/machine-learning-cheet-sheet-2.png" width=1200 height=800 />

## Extracting features
- CountVectorizer: A text file is an ordered sequence of words. To run machine learning algorithms on it, we need to convert the text into numerical feature vectors. We use the bag-of-words model for our example: each document is split into word tokens (by default `CountVectorizer` lowercases the text and keeps tokens of two or more alphanumeric characters, rather than splitting on spaces only), each unique word in the vocabulary is assigned an integer id, and we count how many times each word occurs in each document. Each unique word thus corresponds to one feature (a descriptive feature).

- TfidfTransformer: Raw word counts give more weight to longer documents than to shorter ones. To avoid this, we can use TF (term frequency), i.e. count(word) / total number of words in the document. Moreover, to reduce the weight of words that are common across all documents (the, is, an, etc.), we use TF-IDF, i.e. term frequency times inverse document frequency.
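
The two steps above can be sketched in plain Python on a toy corpus. This is a minimal illustration, not scikit-learn's implementation: the helper names (`tokenize`, `bag_of_words`, `tf_idf`) and the three sample sentences are made up here, and the IDF uses the textbook form log(N / df), whereas `TfidfTransformer` by default applies a smoothed variant and L2-normalises each row.

```python
import math
import re
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "the car is fast",
    "the car is red",
    "the game is on",
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, count matrix); each row of the matrix is one document."""
    tokenized = [tokenize(d) for d in documents]
    unique_words = sorted({w for doc in tokenized for w in doc})
    vocab = {w: i for i, w in enumerate(unique_words)}  # word -> integer id
    counts = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for word, n in Counter(doc).items():
            row[vocab[word]] = n
        counts.append(row)
    return vocab, counts

def tf_idf(counts):
    """Weight each term frequency by log(N / document frequency)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in counts:
        total = sum(row)  # number of tokens in this document
        weighted.append(
            [(row[j] / total) * math.log(n_docs / df[j]) for j in range(n_terms)]
        )
    return weighted

vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)

for doc, row in zip(docs, weights):
    print(doc, [round(w, 3) for w in row])
```

Words that occur in every document ("the", "is") get a weight of zero under this formula, while a word unique to one document ("fast") is weighted more heavily than one shared by two documents ("car").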

## 1. The simplest text classifier is Naive Bayes (NB)

In [4]:
# Building a pipeline to do count vectorisation, TF-IDF transformation, and Naive Bayes classification.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([('vect', CountVectorizer()), # learns the vocabulary dictionary & returns a Document-Term matrix. [n_samples, n_features]
                     ('tfidf', TfidfTransformer()), # Term Frequency - Inverse Document Frequency
                     ('clf', MultinomialNB())]) # Naive Bayes classifier

# train the NB classifier on the training data
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [5]:
predicted = text_clf.predict(twenty_test.data)

## Performance Measurements on the test data with Accuracy, Precision, Recall, and F1 metrics

- Accuracy is the ratio of correctly predicted observations to the total number of observations.
- Precision measures, out of all observations predicted as positive, how many are actually positive.
- Recall measures how many of the actual positives the model captures by labeling them as positive (true positives).
- F1 score is the harmonic mean of precision and recall. It is a better measure than accuracy when you need a balance between precision and recall and the class distribution is uneven (e.g. a large number of actual negatives).
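
To make the four definitions concrete, here is a small sketch that computes them by hand for a binary problem. The labels and the helper name `binary_metrics` are invented for this example; in the notebook itself the scikit-learn functions are used instead.

```python
# Made-up binary labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With seven negatives out of ten samples, accuracy comes out higher than precision, recall, and F1, which is why accuracy alone can look optimistic on unevenly distributed classes.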

In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

print('NB Accuracy score: ', accuracy_score(twenty_test.target, predicted)) # same as np.mean(predicted == twenty_test.target)
print('NB Precision score: ', precision_score(twenty_test.target, predicted, average='weighted'))
print('NB Recall score: ', recall_score(twenty_test.target, predicted, average='weighted'))
print('NB F1 score: ', f1_score(twenty_test.target, predicted, average='weighted'))

NB Accuracy score:  0.7738980350504514
NB Precision score:  0.8218781741893993
NB Recall score:  0.7738980350504514
NB F1 score:  0.7684457156894653
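
In the output above, the weighted recall equals the accuracy to every printed digit. That is expected rather than a coincidence: with `average='weighted'`, each class's recall is weighted by its share of the true labels, so the sum reduces to the total number of correct predictions divided by the number of samples. The sketch below checks this on made-up three-class labels; the helper name `per_class_recall` is hypothetical.

```python
from collections import Counter

# Made-up multi-class labels for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 0, 2, 2, 2]

def per_class_recall(y_true, y_pred):
    """Return ({class: recall}, {class: support}) for every class in y_true."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    recalls = {c: hits[c] / support[c] for c in support}
    return recalls, support

recalls, support = per_class_recall(y_true, y_pred)
n = len(y_true)

# Macro average: every class counts equally.
macro_recall = sum(recalls.values()) / len(recalls)
# Weighted average: every class counts in proportion to its support.
weighted_recall = sum(recalls[c] * support[c] / n for c in recalls)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

print("macro recall:   ", round(macro_recall, 4))
print("weighted recall:", round(weighted_recall, 4))
print("accuracy:       ", round(accuracy, 4))
```

The same identity does not hold for weighted precision or weighted F1, which is why those two scores differ from the accuracy in the output above.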


## 2. Support Vector Machines (SVM)

In [7]:
from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, random_state=42)), ])

_ = text_clf_svm.fit(twenty_train.data, twenty_train.target)

predicted_svm = text_clf_svm.predict(twenty_test.data)

#accuracy_svm = np.mean(predicted_svm == twenty_test.target)
#print('SVM accuracy = ', accuracy_svm)
print('SVM F1 score: ', f1_score(twenty_test.target, predicted_svm, average='weighted'))

SVM F1 score:  0.8179850964920279


## 3. k-nearest neighbors algorithm (kNN) is a non-parametric technique

In [8]:
from sklearn.neighbors import KNeighborsClassifier

text_kneighbors = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', KNeighborsClassifier()),
                     ])
text_kneighbors.fit(twenty_train.data, twenty_train.target)
predicted_kneighbor = text_kneighbors.predict(twenty_test.data)

print('KNeighborsClassifier F1 score: ', f1_score(twenty_test.target, predicted_kneighbor, average='weighted'))

KNeighborsClassifier F1 score:  0.6597157454309466


## 4. Decision tree classifiers include a hierarchical decomposition of the data space

In [9]:
from sklearn import tree

text_tree = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', tree.DecisionTreeClassifier()),
                     ])
text_tree.fit(twenty_train.data, twenty_train.target)
predicted_tree = text_tree.predict(twenty_test.data)

print('Decision tree F1 score: ', f1_score(twenty_test.target, predicted_tree, average='weighted'))

Decision tree F1 score:  0.5496185802257517


In [10]:
# If you want the performance per category, classification_report can be used
print(classification_report(twenty_test.target, predicted_svm)) # SVM performance

             precision    recall  f1-score   support

          0       0.73      0.72      0.72       319
          1       0.80      0.70      0.74       389
          2       0.73      0.76      0.75       394
          3       0.71      0.70      0.70       392
          4       0.83      0.81      0.82       385
          5       0.83      0.77      0.80       395
          6       0.84      0.90      0.87       390
          7       0.92      0.89      0.91       396
          8       0.92      0.96      0.94       398
          9       0.89      0.90      0.89       397
         10       0.88      0.99      0.93       399
         11       0.83      0.96      0.89       396
         12       0.83      0.60      0.70       393
         13       0.87      0.86      0.86       396
         14       0.84      0.96      0.89       394
         15       0.76      0.94      0.84       398
         16       0.70      0.92      0.80       364
         17       0.90      0.93      0.92   