# "Traditional" Text Classification with Scikit-learn
(follows: https://github.com/nlptown/nlp-notebooks/blob/master/Traditional%20text%20classification%20with%20Scikit-learn.ipynb)

## Data

We look at techniques that predate the deep learning trend in NLP but remain quick and effective ways of training a text classifier.

We use the 20 Newsgroups data set that is shipped with the **Scikit-learn machine learning library**.

It consists of 11,314 training texts and a test set of 7,532 texts.

In [4]:
#| export
from sklearn.datasets import fetch_20newsgroups

In [5]:
#| export
train_data = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
test_data = fetch_20newsgroups(subset='test')

print('Training texts: ', len(train_data.data))
print('Test texts: ', len(test_data.data))

Training texts:  11314
Test texts:  7532


## Pre-processing

**Always the first step**: transform the word sequences of the texts into feature vectors. Here we use bag-of-words (BOW) approaches: `CountVectorizer` constructs vectors that tell us how often each word (or n-gram) occurs in a text.

However, texts contain a lot of **uninteresting** words. We therefore apply TF-IDF weighting with `TfidfTransformer`, which favors words that appear often in a text but not too often in the corpus as a whole.

In order to get these **weighted feature vectors** we combine `CountVectorizer` and `TfidfTransformer` in a `Pipeline`.

In [6]:
#| export
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

preprocessing = Pipeline([
  ('vect', CountVectorizer()),
  ('tfidf', TfidfTransformer())
])

print('Preprocessing training data ...')
train_preprocessed = preprocessing.fit_transform(train_data.data)
print('Preprocessing test data...')
test_preprocessed = preprocessing.transform(test_data.data)

Preprocessing training data ...
Preprocessing test data...
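To make the two preprocessing steps more concrete, here is a minimal sketch on a tiny corpus. The three sentences and all `toy_*` names are made up for illustration and are not part of the 20 Newsgroups data. It shows the raw counts from `CountVectorizer` and the inverse document frequencies learned by `TfidfTransformer` (with its default settings).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical three-document corpus, used only for illustration.
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Step 1: raw term counts, one row per document and one column per word.
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each word to its column index.
the_idx = toy_vectorizer.vocabulary_["the"]
mat_idx = toy_vectorizer.vocabulary_["mat"]
print("Count of 'the' in document 0:", toy_counts[0, the_idx])

# Step 2: learn inverse document frequencies and reweight the counts.
toy_tfidf = TfidfTransformer().fit(toy_counts)
toy_weights = toy_tfidf.transform(toy_counts)

# 'the' occurs in every document, 'mat' in only one,
# so 'mat' receives the higher inverse document frequency.
print("idf('the') =", toy_tfidf.idf_[the_idx])
print("idf('mat') =", toy_tfidf.idf_[mat_idx])
```

A word that occurs in every document gets the lowest possible IDF, so its counts are scaled up the least compared with rarer words.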


## Training

Now we can train a **text classifier** on the preprocessed training data. For the training we will experiment with 3 text classification models:

1. **Naive Bayes** classifiers are simple: they presume all features are independent of each other. They learn how frequent each class is and how frequently each feature occurs in each class. To classify a new text, they multiply the class prior $p(C_k)$ by the probabilities of every feature $x_i$ given the class $C_k$, and pick the class that gives the highest probability:

$$ \hat{y} = \underset{k}{\operatorname{argmax}} \; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k) $$

They are quick to train, but usually fall behind in terms of performance.

2. **Support Vector Machines** try to find the **hyperplane** in feature space that best separates the data from the different classes. They often perform very well on text, where the feature space is high-dimensional and sparse.

3. **Logistic Regression Models** model the log-odds $l = \log(p/(1-p))$ of a class as a linear model and estimate the parameters $\beta$ of the model during training:

$$ l = \beta_0 + \sum_{i=1}^{n}\beta_ix_i $$

These models usually perform very well on text classification.
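The two formulas above can be worked through by hand with a few lines of NumPy. This is only a sketch of the equations as written, not of how `MultinomialNB` or `LogisticRegression` are implemented internally. All numbers (priors, likelihoods, coefficients, and the input `x`) are made up for illustration.

```python
import numpy as np

# --- Naive Bayes decision rule ---
# Hypothetical toy model: 2 classes, 3 features (all values are invented).
class_priors = np.array([0.6, 0.4])      # p(C_k)
feature_likelihoods = np.array([         # p(x_i | C_k), one row per class
    [0.5, 0.2, 0.1],
    [0.1, 0.3, 0.7],
])

# Score per class: p(C_k) * prod_i p(x_i | C_k); predict the argmax.
nb_scores = class_priors * feature_likelihoods.prod(axis=1)
nb_prediction = int(np.argmax(nb_scores))

# The same rule in log space, which avoids numerical underflow
# when many small probabilities are multiplied.
nb_log_scores = np.log(class_priors) + np.log(feature_likelihoods).sum(axis=1)

print("NB scores:", nb_scores, "-> predicted class", nb_prediction)

# --- Logistic regression log-odds ---
# Hypothetical coefficients and input vector.
beta_0 = -1.0
beta = np.array([2.0, -0.5, 1.0])
x = np.array([1.0, 2.0, 0.5])

# l = beta_0 + sum_i beta_i * x_i
log_odds = beta_0 + beta @ x
# Invert l = log(p / (1 - p)) to recover the probability p.
probability = 1.0 / (1.0 + np.exp(-log_odds))

print("Log-odds:", log_odds, "-> probability", probability)
```

Working in log space leaves the predicted class unchanged, because the logarithm is monotonic.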

In [7]:
#| export
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Set up the three classifiers
nb_classifier = MultinomialNB()
lr_classifier = LogisticRegression(multi_class='ovr')
svm_classifier = LinearSVC()

# Do the actual training
print('Training the Naive Bayes Classifier...')
nb_classifier.fit(train_preprocessed, train_data.target)
print('Training the Logistic Regression Classifier...')
lr_classifier.fit(train_preprocessed, train_data.target)
print('Training the SVM classifier...')
svm_classifier.fit(train_preprocessed, train_data.target)

Training the Naive Bayes Classifier...
Training the Logistic Regression Classifier...
Training the SVM classifier...


In order to find out how well each classifier performs, we use their `predict` method to predict the label for all texts in our preprocessed test set.

In [8]:
#| export
nb_predictions = nb_classifier.predict(test_preprocessed)
lr_predictions = lr_classifier.predict(test_preprocessed)
svm_predictions = svm_classifier.predict(test_preprocessed)

In [9]:
#| export
# Let's check the prediction scores for each classifier
import numpy as np

print("NB Accuracy:", np.mean(nb_predictions == test_data.target))
print("LR Accuracy:", np.mean(lr_predictions == test_data.target))
print("SVM Accuracy:", np.mean(svm_predictions == test_data.target))

NB Accuracy: 0.7738980350504514
LR Accuracy: 0.8278013807753585
SVM Accuracy: 0.8531598513011153
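The accuracy computed above with `np.mean` is simply the fraction of predictions that match the true labels. As a small sanity check, the sketch below compares it with scikit-learn's `accuracy_score` on made-up labels (the `toy_*` arrays are hypothetical and unrelated to the newsgroups results).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for five examples.
toy_true = np.array([0, 1, 2, 2, 1])
toy_pred = np.array([0, 2, 2, 2, 1])

# Comparing the arrays element-wise gives booleans; their mean is the accuracy.
manual_accuracy = np.mean(toy_pred == toy_true)

# accuracy_score computes the same quantity.
library_accuracy = accuracy_score(toy_true, toy_pred)

print("Manual accuracy: ", manual_accuracy)
print("Library accuracy:", library_accuracy)
```

Four of the five toy predictions are correct, so both values come out at 0.8.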


## Grid search

These scores are not bad at all, but with the `GridSearchCV` class we can search for better hyperparameters using cross-validation on the training data:

In [10]:
#| export
from sklearn.model_selection import GridSearchCV

# Candidate values for the regularization parameter C.
# A finer alternative grid would be: {'C': np.logspace(0, 3, 10)}
parameters = {'C': [0.1, 1, 10, 100, 1000]}

print("Grid search for logistic regression")
lr_best = GridSearchCV(lr_classifier, parameters, cv=3, verbose=1)
lr_best.fit(train_preprocessed, train_data.target)

print("Grid search for SVM")
svm_best = GridSearchCV(svm_classifier, parameters, cv=3, verbose=1)
svm_best.fit(train_preprocessed, train_data.target)

Grid search for logistic regression
Fitting 3 folds for each of 5 candidates, totalling 15 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Grid search for SVM
Fitting 3 folds for each of 5 candidates, totalling 15 fits




In [11]:
#| export
# Let's see what hyperparameters lead to the best result

print(f'Best SVM params: {svm_best.best_params_}')
print(f'Best LR params: {lr_best.best_params_}')

Best SVM params: {'C': 1}
Best LR params: {'C': 1000}
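Besides `best_params_`, a fitted `GridSearchCV` object also keeps the cross-validation score of every candidate in `cv_results_`. The sketch below shows how to inspect them on a small synthetic data set (generated with `make_classification`, a stand-in for the TF-IDF features, not the newsgroups data). It also sets `max_iter=1000`, which is an assumption on our side: a higher iteration limit is one of the options suggested by the convergence messages in the output above, and we have not tested which value suffices for the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for the preprocessed texts.
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=42)

toy_grid = {'C': [0.1, 1, 10, 100, 1000]}

# max_iter=1000 is an assumed value, raised from the default of 100.
toy_search = GridSearchCV(LogisticRegression(max_iter=1000), toy_grid, cv=3)
toy_search.fit(X_toy, y_toy)

# Mean cross-validated accuracy for every candidate value of C.
toy_mean_scores = toy_search.cv_results_['mean_test_score']
for c_value, mean_score in zip(toy_grid['C'], toy_mean_scores):
    print(f"C={c_value}: mean CV accuracy {mean_score:.3f}")

print("Best params:", toy_search.best_params_)
print("Best mean CV accuracy:", toy_search.best_score_)
```

Looking at all candidates rather than only the winner shows how sensitive the model is to `C`. If the best value sits at the edge of the grid, as `C=1000` does for logistic regression above, it may be worth extending the grid in that direction.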


Now we can use these tuned models to calculate predictions on the test set again:

In [12]:
best_svm_predictions = svm_best.predict(test_preprocessed)
best_lr_predictions = lr_best.predict(test_preprocessed)

print("Best SVM Accuracy:", np.mean(best_svm_predictions == test_data.target))
print("Best LR Accuracy:", np.mean(best_lr_predictions == test_data.target))

Best SVM Accuracy: 0.8531598513011153
Best LR Accuracy: 0.8515666489644185


## Extensive evaluation

### Detailed scores

So far, we have looked at the accuracy of our models: the proportion of test examples for which their prediction is correct. But where do things go wrong?

We start with