# "Traditional" Text Classification with Scikit-learn

In this HW, we're going to experiment with a few "traditional" approaches to text classification. These approaches pre-date the deep learning revolution in Natural Language Processing, but are often quick and effective ways of training a text classifier. 

## Data

For our data, we're going to work with the [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/), a classic collection of text documents that is often used as a benchmark for text classification models. The set contains texts about various topics, ranging from computer hardware to religion. Some of the topics are closely related to each other (such as "IBM PC hardware" and "Mac hardware"), while others are very different (such as "religion" and "hockey"). The 20 Newsgroups data set ships with the [Scikit-learn machine learning library](https://scikit-learn.org/stable/), our main tool for this exercise. It has been split into a training set of 11,314 texts and a test set of 7,532 texts.

In [1]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.metrics import precision_recall_fscore_support
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups

In [2]:
train_data = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
test_data = fetch_20newsgroups(subset='test')

print("Training texts:", len(train_data.data))
print("Test texts:", len(test_data.data))

Training texts: 11314
Test texts: 7532


## Preprocessing

## Problem (20%)

The first step in the development of any NLP model is text preprocessing. This means we're going to transform our texts from word sequences to feature vectors. Each feature vector holds one value for each of a large number of features. Two important tools for building weighted word features are CountVectorizer and TfidfTransformer. Could you explain them and how to use them in NLP? 



In [3]:
# Getting labels for classification
y_train, y_test = train_data.target, test_data.target


### CountVectorizer-
CountVectorizer tokenizes a collection of documents and builds a vocabulary (dictionary) of the words it finds. Each document can then be encoded as a vector of word counts over that vocabulary.  
Steps to use CountVectorizer- 
- Import CountVectorizer from sklearn.feature_extraction.text
- Create an instance of the CountVectorizer class
- Call fit_transform() on the training documents to learn the vocabulary and encode each document as a vector of word counts
- Call transform() on the test documents to encode them with the same vocabulary
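
As a minimal sketch of these steps, the toy corpus below (hypothetical, not taken from 20 Newsgroups) shows what the learned vocabulary and the count vectors look like. Note that a test word that was not seen during fitting ("bird") is simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_train = ["the cat sat", "the dog sat on the cat"]
toy_test = ["the bird sat"]

toy_vectorizer = CountVectorizer()
# Learn the vocabulary from the training texts and count the words
toy_train_counts = toy_vectorizer.fit_transform(toy_train)
# Encode the test text with the vocabulary learned above
toy_test_counts = toy_vectorizer.transform(toy_test)

# Sorted vocabulary terms correspond to the columns of the count matrix
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_train_counts.toarray())
print(toy_test_counts.toarray())
```

Each row is one document and each column is one vocabulary word, so the second training text has a 2 in the "the" column.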

### TfidfTransformer-

TfidfTransformer converts raw word counts into tf-idf scores, which down-weight words that occur in many documents. It takes the output of CountVectorizer as its input. The steps to use TfidfTransformer are as follows-
- Call fit_transform() on the CountVectorizer output for the training data
- Transform the test data using the same fitted transformer
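
To make the weighting concrete, here is a small sketch on a hypothetical count matrix. It assumes the default settings of TfidfTransformer (smoothed idf and L2 normalization) and checks the learned idf values against the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 2 documents x 3 terms
toy_counts = np.array([[2, 1, 0],
                       [1, 0, 3]])

toy_tfidf = TfidfTransformer()  # defaults: smooth_idf=True, norm='l2'
toy_weights = toy_tfidf.fit_transform(toy_counts).toarray()

# Recompute the smoothed idf by hand for comparison
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)  # number of documents containing each term
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(toy_tfidf.idf_)
print(manual_idf)
print(np.linalg.norm(toy_weights, axis=1))  # each row has unit length
```

The first term occurs in both documents, so it receives the lowest idf, while the terms that occur in only one document are weighted more heavily.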

In [4]:
# Converting text data to feature vectors using CountVectorizer

count_vectorizer = CountVectorizer()
X_train_countVectorizer = count_vectorizer.fit_transform(train_data.data)
X_test_countVectorizer = count_vectorizer.transform(test_data.data)

In [5]:
# Using TfidfTransformer on vectors
tfidf_transformer=TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_countVectorizer)
X_test_tfidf = tfidf_transformer.transform(X_test_countVectorizer)

## Training

## Problem (40%)

Next, we train a text classifier on the preprocessed training data. We're going to experiment with three classic text classification models: Naive Bayes, Support Vector Machines and Logistic Regression. Could you explain them and implement these classifiers with your preprocessed features from the previous problem to determine classifier accuracy? Of course, we can play with other classifiers also. 

### Naive Bayes-
The Naive Bayes classifier is based on Bayes' theorem. It is a probabilistic classifier that assumes the features (here, the word counts or tf-idf weights) are conditionally independent of each other given the class.
Bayes' theorem states that for two events A and B,   
            P(A|B) = (P(B|A) * P(A)) / P(B)
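
A short worked example may help here. The probabilities below are hypothetical and chosen only for illustration: A is "the post is about hockey" and B is "the post contains the word puck".

```python
# Hypothetical probabilities, chosen only for illustration
p_hockey = 0.10             # P(A): prior probability of a hockey post
p_puck_given_hockey = 0.40  # P(B|A)
p_puck_given_other = 0.02   # P(B|not A)

# P(B) by the law of total probability
p_puck = p_puck_given_hockey * p_hockey + p_puck_given_other * (1 - p_hockey)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_hockey_given_puck = p_puck_given_hockey * p_hockey / p_puck
print(round(p_hockey_given_puck, 4))
```

Even with a low prior of 10%, seeing the word raises the probability of the hockey class to roughly 69% under these assumed numbers. Naive Bayes combines many such word-level updates by multiplying them together.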

### Support Vector Machines-
A Support Vector Machine defines a hyperplane, a boundary that separates the classes, and chooses it to maximize the margin: the distance between the boundary and the closest training points. A hyperplane itself is always linear, but with a non-linear kernel an SVM can also learn a non-linear boundary between classes.
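
The sketch below fits a linear SVM on a few hypothetical, linearly separable 2-D points. The sign of `decision_function` says on which side of the hyperplane a point falls, and its magnitude grows with the distance from the boundary.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical, linearly separable 2-D points
svm_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
svm_y = np.array([0, 0, 0, 1, 1, 1])

toy_svm = LinearSVC(random_state=0)
toy_svm.fit(svm_X, svm_y)

svm_scores = toy_svm.decision_function(svm_X)  # signed score w.x + b per point
svm_pred = toy_svm.predict(svm_X)
print(svm_scores)
print(svm_pred)
```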

### Logistic Regression- 
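Logistic regression is used when the output variable is categorical. For two classes it uses the sigmoid function as its hypothesis, which maps a linear score to a probability between 0 and 1; for more than two classes, as here, the same idea is extended with a softmax or one-vs-rest scheme.   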
            Z = Wx + b  
            h(Z) = sigmoid(Z)  
                  = 1/(1+exp(-Z))
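
The formula can be sketched in a few lines of NumPy. The weights, bias and input below are hypothetical values for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias and input, for illustration only
W = np.array([0.5, -1.0])
b = 0.25
x = np.array([2.0, 1.0])

Z = W @ x + b        # linear score: 0.5*2 - 1.0*1 + 0.25
print(sigmoid(0.0))  # a score of zero sits exactly on the decision boundary
print(sigmoid(Z))    # a positive score gives a probability above 0.5
```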

In [6]:

target_names = train_data.target_names

# train_model function takes classifier as input, trains model on that classifier, and predicts output
def train_model(clf, X_train, y_train, X_test, y_test):
    print('#'*80)
    print("Model to be trained: ", clf)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    
    score = metrics.accuracy_score(y_test, pred)
    print("classification report:")
    print(metrics.classification_report(y_test, pred, target_names=target_names))

    clf_descr = str(clf).split('(')[0]
    print(clf_descr)
    return clf_descr, pred, score

### Training Naive Bayes on the TF-IDF feature vectors

In [7]:
# Training on Naive Bayes Classifiers

clf = MultinomialNB()
clf_descr_NB_tfidf, pred_NB, score_NB =train_model(clf, X_train_tfidf, y_train, X_test_tfidf, y_test)


################################################################################
Model to be trained:  MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.80      0.52      0.63       319
           comp.graphics       0.81      0.65      0.72       389
 comp.os.ms-windows.misc       0.82      0.65      0.73       394
comp.sys.ibm.pc.hardware       0.67      0.78      0.72       392
   comp.sys.mac.hardware       0.86      0.77      0.81       385
          comp.windows.x       0.89      0.75      0.82       395
            misc.forsale       0.93      0.69      0.80       390
               rec.autos       0.85      0.92      0.88       396
         rec.motorcycles       0.94      0.93      0.93       398
      rec.sport.baseball       0.92      0.90      0.91       397
        rec.sport.hockey       0.89      0.97      0.93       399
               sci.cry

In [8]:
# Training on SVM Classifiers

clf = LinearSVC(random_state=0, tol=1e-5)
clf_descr_SVM, pred_SVM, score_SVM = train_model(clf, X_train_tfidf, y_train, X_test_tfidf, y_test)


################################################################################
Model to be trained:  LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=0, tol=1e-05,
          verbose=0)
classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.82      0.80      0.81       319
           comp.graphics       0.76      0.80      0.78       389
 comp.os.ms-windows.misc       0.77      0.73      0.75       394
comp.sys.ibm.pc.hardware       0.71      0.76      0.74       392
   comp.sys.mac.hardware       0.84      0.86      0.85       385
          comp.windows.x       0.87      0.76      0.81       395
            misc.forsale       0.83      0.91      0.87       390
               rec.autos       0.92      0.91      0.91       396
         rec.motorcycles       0.95      0.95      0.

### Training Logistic Regression on the TF-IDF feature vectors

In [9]:
# Training on Logistic Regression Classifiers
clf =  LogisticRegressionCV(cv=5, random_state=0, max_iter=300, multi_class = 'auto')
clf_descr_LR_tfidf, pred_LR, score_LR = train_model(clf, X_train_tfidf, y_train, X_test_tfidf, y_test)


################################################################################
Model to be trained:  LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=300, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=0, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)
classification report:
                          precision    recall  f1-score   support

             alt.atheism       0.83      0.78      0.81       319
           comp.graphics       0.76      0.81      0.78       389
 comp.os.ms-windows.misc       0.76      0.71      0.73       394
comp.sys.ibm.pc.hardware       0.69      0.74      0.72       392
   comp.sys.mac.hardware       0.83      0.85      0.84       385
          comp.windows.x       0.85      0.77      0.81       395
            misc.forsale       0.80      0.90      0.85      



## Extensive evaluation

## Problem (40%)

So far we've only looked at the accuracy of our models: the proportion of test examples for which their prediction is correct. This is fine as a first evaluation, but it doesn't give us much insight into what mistakes the models make and why. We'll therefore perform a much more extensive evaluation, in three steps. What are precision, recall, F-score and confusion matrix? Please use them to measure your trained classifiers.

#### Confusion Matrix
A confusion matrix tabulates the true classes against the predicted classes: entry (i, j) counts the test examples of true class i that the model assigned to class j, so correct predictions lie on the diagonal. Treating any one class as "positive", the four possible outcomes are:  
True positives- The number of examples that are predicted positive and are actually positive  
False positives- The number of examples that are predicted positive but are actually negative  
True negatives- The number of examples that are predicted negative and are actually negative  
False negatives- The number of examples that are predicted negative but are actually positive  
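
As a small illustration, the hypothetical 3-class labels below produce a confusion matrix in which rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem
cm_true = [0, 0, 1, 1, 2, 2, 2]
cm_pred = [0, 1, 1, 1, 2, 0, 2]

toy_cm = confusion_matrix(cm_true, cm_pred)
print(toy_cm)
```

Reading the first row: of the two examples of class 0, one was predicted correctly and one was mistaken for class 1.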

#### Precision
The number of true positives divided by the sum of the number of true positives and the number of false positives.   
Precision = TP/(TP + FP)  

#### Recall
The number of true positives divided by the sum of the number of true positives and the number of false negatives.  
Recall = TP/(TP + FN)  

#### F-score
The F1 score is the harmonic mean of precision and recall.  
F1-score = 2* Precision * Recall/(Precision + Recall)  
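
The three formulas can be checked by hand on a small example. This sketch uses hypothetical 3-class labels, derives TP, FP and FN per class from the confusion matrix, and compares the result with scikit-learn's `precision_recall_fscore_support`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical true and predicted labels for a 3-class problem
prf_true = [0, 0, 1, 1, 2, 2, 2]
prf_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(prf_true, prf_pred)
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp  # predicted as the class, but belonging to another
fn = cm.sum(axis=1) - tp  # belonging to the class, but predicted as another

manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

# average=None returns one value per class, plus the per-class support
sk_precision, sk_recall, sk_f1, sk_support = precision_recall_fscore_support(
    prf_true, prf_pred, average=None)

print(manual_precision, manual_recall, manual_f1)
print(manual_f1.mean())  # macro-averaged F1: unweighted mean over classes
```

Macro averaging, as used with `average='macro'`, gives every class the same weight regardless of how many test examples it has.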

In [10]:
# extensive_evaluation takes y_test and the predicted labels as input and prints the confusion matrix.
# It returns the macro-averaged precision, recall and F-1 score, the support
# (None when average='macro') and the accuracy
def extensive_evaluation(y_test, pred, classifier_name):
    print('#'*80)
    print("Extensive Evaluation for ", classifier_name)
    score = metrics.accuracy_score(y_test, pred)
    print("confusion matrix:")
    confusion_matrix = metrics.confusion_matrix(y_test, pred)
    print(confusion_matrix)
    precision_score, recall_score, f1_score, support = precision_recall_fscore_support(y_test, pred, average='macro')
    
    return precision_score, recall_score, f1_score, support, score

### Evaluation parameters for Naive Bayes Classifier

In [11]:
precision_score_NB, recall_score_NB, f1_score_NB, support_NB, score_NB = extensive_evaluation(y_test, pred_NB, "Naive Bayes")

################################################################################
Extensive Evaluation for  Naive Bayes
confusion matrix:
[[166   0   0   1   0   1   0   0   1   1   1   3   0   6   3 123   4   8
    0   1]
 [  1 252  15  12   9  18   1   2   1   5   2  41   4   0   6  15   4   1
    0   0]
 [  0  14 258  45   3   9   0   2   1   3   2  25   1   0   6  23   2   0
    0   0]
 [  0   5  11 305  17   1   3   6   1   0   2  19  13   0   5   3   1   0
    0   0]
 [  0   3   8  23 298   0   3   8   1   3   1  16   8   0   2   8   3   0
    0   0]
 [  1  21  17  13   2 298   1   0   1   1   0  23   0   1   4  10   2   0
    0   0]
 [  0   1   3  31  12   1 271  19   4   4   6   5  12   6   3   9   3   0
    0   0]
 [  0   1   0   3   0   0   4 364   3   2   2   4   1   1   3   3   4   0
    1   0]
 [  0   0   0   1   0   0   2  10 371   0   0   4   0   0   0   8   2   0
    0   0]
 [  0   0   0   0   1   0   0   4   0 357  22   0   0   0   2   9   1   1
    0   0]
 [  0   0   0

In [12]:
print("FOR NAIVE BAYES CLASSIFIER: ")
print('-'*80)
print("Precision score: ",precision_score_NB)
print("Recall score: ",recall_score_NB )
print("F-1 score: ", f1_score_NB)
print("Accuracy: ", score_NB)


FOR NAIVE BAYES CLASSIFIER: 
--------------------------------------------------------------------------------
Precision score:  0.8255310124210137
Recall score:  0.756525006352595
F-1 score:  0.7557542971333199
Accuracy:  0.7738980350504514


### Evaluation parameters for Linear SVM Classifiers 

In [13]:
precision_score_SVM, recall_score_SVM, f1_score_SVM, support_SVM, score_SVM = extensive_evaluation(y_test, pred_SVM, "SVM Classifier")


################################################################################
Extensive Evaluation for  SVM Classifier
confusion matrix:
[[254   1   0   1   0   2   1   0   2   0   0   1   1   6   7  22   0   1
    1  19]
 [  1 313  12   8   5  17   3   2   1   3   1   4   9   0   4   2   0   1
    0   3]
 [  0  17 288  40   7  13   4   1   0   4   0   2   4   2   5   2   0   0
    1   4]
 [  0  14  20 297  21   1  13   2   1   1   0   1  19   0   1   0   0   0
    0   1]
 [  0   5   4  19 330   0   9   0   0   2   1   1   9   1   0   0   2   0
    1   1]
 [  1  35  38   3   3 302   3   1   2   0   0   0   1   1   4   0   1   0
    0   0]
 [  0   1   1  10   7   0 353   5   1   2   1   1   5   1   0   0   0   0
    1   1]
 [  0   1   0   5   1   1  10 359   6   2   0   0   6   1   0   0   2   0
    2   0]
 [  0   0   0   1   0   0   4   9 380   1   0   0   1   1   0   1   0   0
    0   0]
 [  0   0   1   0   0   0   4   2   0 378  10   0   1   0   0   0   0   0
    1   0]
 [  0   0 

In [14]:
print("FOR SVM CLASSIFIER USING LINEAR KERNEL: ")
print('-'*80)
print("Precision score: ",precision_score_SVM )
print("Recall score: ", recall_score_SVM)
print("F-1 score: ", f1_score_SVM)
print("Accuracy: ", score_SVM)


FOR SVM CLASSIFIER USING LINEAR KERNEL: 
--------------------------------------------------------------------------------
Precision score:  0.8513737648934537
Recall score:  0.845836571817482
F-1 score:  0.8464878396886357
Accuracy:  0.8531598513011153


### Evaluation Parameters for Logistic Regression classifier models

In [15]:
precision_score_LR, recall_score_LR, f1_score_LR, support_LR, score_LR = extensive_evaluation(y_test, pred_LR, "Logistic Regression")

################################################################################
Extensive Evaluation for  Logistic Regression
confusion matrix:
[[250   1   0   2   0   2   1   0   0   0   0   1   0   5   6  22   0   3
    1  25]
 [  1 315   9   9   6  16   6   3   0   3   0   4   7   0   4   2   0   1
    1   2]
 [  0  20 278  41  12  15   3   1   0   3   0   2   4   2   4   3   0   0
    2   4]
 [  0  12  18 292  19   2  17   4   1   1   0   1  24   0   1   0   0   0
    0   0]
 [  0   5   6  20 326   2   9   0   0   2   0   2   9   1   1   0   2   0
    0   0]
 [  0  31  36   5   3 305   5   1   1   0   0   0   2   1   4   0   1   0
    0   0]
 [  0   1   3  13   6   0 350   5   1   1   1   0   7   1   0   0   0   0
    0   1]
 [  0   1   0   3   1   1  10 357   6   1   0   0   9   1   1   0   3   0
    2   0]
 [  0   1   0   1   0   0   4  10 379   1   0   0   2   0   0   0   0   0
    0   0]
 [  0   0   1   0   0   0   4   2   0 377  10   0   1   0   0   0   0   0
    1   1]
 [  0

In [16]:
print("FOR LOGISTIC REGRESSION CLASSIFIER: ")
print('-'*80)
print("Precision score: ", precision_score_LR )
print("Recall score: ",recall_score_LR )
print("F-1 score: ", f1_score_LR)
print("Accuracy: ", score_LR)



FOR LOGISTIC REGRESSION CLASSIFIER: 
--------------------------------------------------------------------------------
Precision score:  0.8450460355737446
Recall score:  0.83973337274024
F-1 score:  0.8405791662355078
Accuracy:  0.8466542750929368


### References:
- https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py
- https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c
- https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
- https://www.geeksforgeeks.org/naive-bayes-classifiers/
- https://monkeylearn.com/blog/introduction-to-support-vector-machines-svm/
- https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc