This is modeled after the classifier that we used for this story: https://www.washingtonpost.com/graphics/politics/policy-2020/priorities-issues/. It's an example of multi-class, multi-label classification.

In [6]:
%load_ext autoreload
%autoreload 2

In [32]:
import scipy.stats  # import the submodule explicitly; scipy.stats.expon is used below
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score

import data_utils

In [14]:
texts, labels = data_utils.get_texts_and_labels('facebook')

In [16]:
print(texts[312])
print(labels[312])

Brothers and sisters: We're going to win this election not because we have a super PAC funded by billionaires.
('corporate power', 'campaign finance')


In [17]:
def create_binarizer(labels):
    # since this problem is multi label
    binarizer = MultiLabelBinarizer()
    return binarizer.fit(labels)

def create_featurizer(corpus):
    # I like using unigrams and bigrams
    featurizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    return featurizer.fit(corpus)

def create_classifier(hyperparameters):
    # create a classifier with specific hyperparameters if given, otherwise use the defaults
    if hyperparameters is not None and 'estimator__alpha' in hyperparameters:
        base_classifier = SGDClassifier(loss='modified_huber', penalty='elasticnet', tol=1e-3, alpha=hyperparameters['estimator__alpha'])
    else:
        base_classifier = SGDClassifier(loss='modified_huber', penalty='elasticnet', tol=1e-3)
    return OneVsRestClassifier(base_classifier, n_jobs=-1)

For basic text classification, TF-IDF features work well, and I have found that adding bigrams often boosts performance. Adding longer n-grams usually doesn't help much and instead massively increases the size of the feature space.
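To make the feature-space growth concrete, here is a minimal sketch that counts the vocabulary size for different `ngram_range` settings. The three-sentence corpus is made up for illustration and is not the data used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this illustration only.
corpus = [
    "we need to get big money out of politics",
    "health care is a human right",
    "big money is corrupting our politics and our health care",
]

vocab_sizes = {}
for max_n in (1, 2, 3):
    # Same settings as create_featurizer, only the upper n-gram bound changes.
    vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, max_n))
    vectorizer.fit(corpus)
    vocab_sizes[max_n] = len(vectorizer.vocabulary_)
    print(f"ngram_range=(1, {max_n}): {vocab_sizes[max_n]} features")
```

Even on three sentences the vocabulary grows with every additional n-gram order. On a real corpus the growth is much steeper, because most longer n-grams occur only once.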

I have also found that Support Vector Machine (SVM) classifiers perform well on NLP tasks. Unfortunately, `sklearn.svm.SVC` and `sklearn.svm.LinearSVC` often converge quite slowly, especially with large amounts of data, since the usual SVM solvers scale at `O(n^2)` (where `n` is the size of the training set). So I generally prefer `sklearn.linear_model.SGDClassifier`. If the loss is set to `hinge`, it solves the exact same optimization problem as a linear SVM, except using stochastic gradient descent, which scales linearly with the size of the training set and is therefore a lot faster.
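As a rough sanity check of that equivalence, the sketch below fits a `LinearSVC` with hinge loss and an `SGDClassifier` with hinge loss on the same data and compares their accuracy. The synthetic dataset and its parameters are assumptions chosen for illustration, not the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic, well-separated binary problem (illustrative assumption).
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=10,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Linear SVM solved with liblinear's dual coordinate descent.
svm = LinearSVC(loss="hinge", dual=True, max_iter=5000).fit(X_train, y_train)
# The same hinge-loss, L2-penalized objective optimized with SGD.
sgd = SGDClassifier(loss="hinge", penalty="l2", tol=1e-3, random_state=0).fit(
    X_train, y_train
)

svm_accuracy = svm.score(X_test, y_test)
sgd_accuracy = sgd.score(X_test, y_test)
print(f"LinearSVC accuracy:     {svm_accuracy:.3f}")
print(f"SGDClassifier accuracy: {sgd_accuracy:.3f}")
```

The two models will not produce identical weights, since the solvers and the way the regularization strength is parameterized (`C` versus `alpha`) differ, but their accuracy should be in the same range.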

This costs us the ability to use non-linear kernels in the SVM (though we theoretically could use the Nystroem kernel approximation with `sklearn.kernel_approximation.Nystroem`), but for NLP problems this is less of an issue. Since the features are sparse, the kernel trick generally doesn't help.
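For completeness, this is roughly what the Nystroem route looks like. It is only a sketch on a toy two-moons dataset where a non-linear boundary actually matters; the `gamma` and `n_components` values are untuned assumptions, and none of this is used in the rest of the notebook.

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Two interleaving half circles: not linearly separable.
X, y = make_moons(n_samples=500, noise=0.1, random_state=0)

# Plain linear model trained with SGD.
linear = SGDClassifier(loss="hinge", random_state=0).fit(X, y)

# Approximate an RBF kernel feature map, then train the same linear model on it.
approx_kernel = make_pipeline(
    Nystroem(kernel="rbf", gamma=2.0, n_components=100, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),
).fit(X, y)

linear_accuracy = linear.score(X, y)
kernel_accuracy = approx_kernel.score(X, y)
print(f"linear SGD (training accuracy):         {linear_accuracy:.3f}")
print(f"Nystroem + SGD (training accuracy):     {kernel_accuracy:.3f}")
```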

Also note that I use the `modified_huber` loss instead of `hinge`. These losses are quite similar, but I prefer the former since it allows for predicting probabilities and not just outcomes.
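The difference shows up in whether `predict_proba` is available. Here is a small sketch on synthetic multi-label data (generated with `make_multilabel_classification`, an assumption for illustration only):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data, for illustration only.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)

# scikit-learn only exposes predict_proba for probabilistic losses.
huber_has_proba = hasattr(SGDClassifier(loss="modified_huber"), "predict_proba")
hinge_has_proba = hasattr(SGDClassifier(loss="hinge"), "predict_proba")
print("modified_huber has predict_proba:", huber_has_proba)
print("hinge has predict_proba:", hinge_has_proba)

classifier = OneVsRestClassifier(
    SGDClassifier(loss="modified_huber", random_state=0)
).fit(X, Y)

# One column per label; each entry is the estimated probability of that label.
probabilities = classifier.predict_proba(X[:3])
print(probabilities.shape)
```

With per-label probabilities you can, for example, rank the labels for a text instead of only getting a yes/no decision for each.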


In [41]:
mlb = create_binarizer(labels)
featurizer = create_featurizer(texts)

tfidf = featurizer.transform(texts)
binarized_labels = mlb.transform(labels)

In [56]:
tfidf_train, tfidf_test, binarized_labels_train, binarized_labels_test = train_test_split(tfidf, binarized_labels, test_size=0.33)

In [57]:
print(tfidf[312])
print(binarized_labels[312])

  (0, 29192)	0.2867602968749451
  (0, 29191)	0.20563101243820897
  (0, 25712)	0.2597172020626997
  (0, 25709)	0.24389800569408934
  (0, 24446)	0.2867602968749451
  (0, 24442)	0.2239681752576899
  (0, 19066)	0.2867602968749451
  (0, 19065)	0.24389800569408934
  (0, 11552)	0.24389800569408934
  (0, 11429)	0.12156689082737338
  (0, 11000)	0.2867602968749451
  (0, 10998)	0.2239681752576899
  (0, 8880)	0.2867602968749451
  (0, 8869)	0.19892628518558983
  (0, 3475)	0.23267410725045434
  (0, 3471)	0.22024965442790256
  (0, 3064)	0.2081489788890795
[0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
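The binarized row above is just the label tuple encoded as a 0/1 vector with one position per known label. A tiny sketch with three made-up label tuples shows the round trip:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up label tuples in the same shape as `labels` (illustration only).
toy_labels = [
    ("corporate power", "campaign finance"),
    ("health care",),
    (),  # a text without any label
]

toy_binarizer = MultiLabelBinarizer().fit(toy_labels)
encoded = toy_binarizer.transform(toy_labels)
decoded = toy_binarizer.inverse_transform(encoded)

# Classes are stored in sorted order, which fixes the column order.
print(list(toy_binarizer.classes_))
# ['campaign finance', 'corporate power', 'health care']
print(encoded)
# [[1 1 0]
#  [0 0 1]
#  [0 0 0]]
print(decoded)
# [('campaign finance', 'corporate power'), ('health care',), ()]
```

Note that `inverse_transform` returns labels in column order, not in the order they were originally written.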


In [81]:
def test_classifier(hyperparameters):
    classifier = create_classifier(hyperparameters)
    classifier.fit(tfidf_train, binarized_labels_train)
    y_pred = classifier.predict(tfidf_test)
    f1 = f1_score(binarized_labels_test, y_pred, average='micro')
    precision = precision_score(binarized_labels_test, y_pred, average='micro')
    recall = recall_score(binarized_labels_test, y_pred, average='micro')
    print(f"precision: {precision}")
    print(f"recall: {recall}")
    print(f"f1: {f1}")

def find_best_hyperparmeters(tfidf, labels):
    classifier = create_classifier(None)
    # we have to prefix the parameter with `estimator__` since the SGDClassifier is wrapped in the `OneVsRestClassifier`.
    param_distribution = {
                        'estimator__alpha': scipy.stats.expon(scale=0.00001),
    }
    # this runs search on maximum available cores
    extra_kwargs = {'n_jobs': -1}
    # note: the `iid` argument is omitted because newer scikit-learn releases no longer accept it
    scv = RandomizedSearchCV(classifier, param_distribution, n_iter=20, cv=5, scoring='f1_micro', verbose=1, refit=False, **extra_kwargs)
    scv.fit(tfidf, labels)
    return scv.best_params_

I prefer random search over grid search for hyperparameters. It's generally more useful once you have more than one hyperparameter. Here is a good explainer: http://cs230.stanford.edu/section/8/. While it is written about neural networks specifically, it also holds for less complicated machine learning algorithms.
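One way to see the advantage is to count how many distinct values of each hyperparameter get tried for the same budget. The sketch below uses two hypothetical hyperparameters on a unit square and a budget of nine trials; the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
budget = 9  # total number of configurations we can afford to evaluate

# Grid search: 3 values per hyperparameter, 3 x 3 = 9 trials.
grid_values = np.linspace(0.0, 1.0, 3)
grid_trials = np.array([(a, b) for a in grid_values for b in grid_values])

# Random search: 9 trials drawn independently for both hyperparameters.
random_trials = rng.uniform(0.0, 1.0, size=(budget, 2))

grid_distinct = len(np.unique(grid_trials[:, 0]))
random_distinct = len(np.unique(random_trials[:, 0]))
print(f"grid search tries {grid_distinct} distinct values per hyperparameter")    # 3
print(f"random search tries {random_distinct} distinct values per hyperparameter")  # 9
```

If only one of the two hyperparameters really matters, the grid wastes two thirds of its budget re-testing the same three values of it, while random search explores nine.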

I have found that an exponential distribution with its scale set to `1e-5` generally works quite well as the distribution from which to sample `alpha` for an `SGDClassifier`.
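To get a feeling for which values this distribution proposes, here is a quick look at its summary statistics. The value `1e-4` in the comparison is the `SGDClassifier` default for `alpha`.

```python
import numpy as np
from scipy import stats

alpha_distribution = stats.expon(scale=1e-5)

mean = alpha_distribution.mean()              # equals the scale, 1e-5
median = alpha_distribution.median()          # scale * ln(2), roughly 6.9e-6
below_default = alpha_distribution.cdf(1e-4)  # 1 - exp(-10)

print(f"mean:   {mean:.2e}")
print(f"median: {median:.2e}")
print(f"share of samples below the default alpha of 1e-4: {below_default:.5f}")

# A few example draws, as RandomizedSearchCV would sample them.
samples = alpha_distribution.rvs(size=5, random_state=0)
print(samples)
```

In other words, this search almost exclusively proposes values of `alpha` that are smaller than the default, so it mostly explores weaker regularization.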

In [82]:
test_classifier(None)

precision: 0.9085714285714286
recall: 0.2804232804232804
f1: 0.42857142857142855


As we can see, precision is very high, but recall is suffering. High precision means the labels that our algorithm returned are mostly correct, and low recall means that we are missing a lot of the true labels. Generally, this means that our classifier is not very confident, since it is only tagging the "easy" examples.
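A small worked example may help here. The label matrices below are made up: the predictions contain no wrong labels but miss four of the six true ones, which is the same pattern as above in miniature.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Rows are texts, columns are labels (made-up values for illustration).
y_true = np.array([[1, 1, 0],
                   [0, 1, 1],
                   [1, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0]])

# Micro averaging pools all (text, label) decisions before computing the scores:
# 2 true positives, 0 false positives, 4 false negatives.
precision = precision_score(y_true, y_pred, average="micro")  # 2 / (2 + 0) = 1.0
recall = recall_score(y_true, y_pred, average="micro")        # 2 / (2 + 4) = 0.333...
f1 = f1_score(y_true, y_pred, average="micro")                # 0.5

print(f"precision: {precision}")
print(f"recall: {recall}")
print(f"f1: {f1}")
```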

One way to try and improve our score is to use hyperparameter optimization. Many machine learning algorithms rely on hyperparameters that are set before training, which can heavily affect the classification. For an `SGDClassifier` the most interesting hyperparameter is `alpha`, which controls the strength of the regularization (how strongly we are forcing the parameters to zero, which tries to combat overfitting) and defaults to `0.0001`.
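To illustrate what `alpha` does, the sketch below trains the same classifier with three different values and prints the size of the learned weight vector. It uses a synthetic dataset as a stand-in, so the exact numbers say nothing about the Facebook data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary problem, for illustration only.
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, random_state=0
)

coef_norms = {}
for alpha in (1e-6, 1e-4, 1e-1):
    clf = SGDClassifier(
        loss="modified_huber", penalty="elasticnet",
        alpha=alpha, tol=1e-3, random_state=0,
    ).fit(X, y)
    # The L2 norm of the weights shows how far they were allowed to move from zero.
    coef_norms[alpha] = float(np.linalg.norm(clf.coef_))
    print(f"alpha={alpha:g}: weight norm = {coef_norms[alpha]:.4f}")
```

The larger `alpha` is, the more the weights are pulled toward zero, and the less the model can react to individual features.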

In [83]:
# search for the best hyperparameters using cross-validation on the training set
best_hyperparameters = find_best_hyperparmeters(tfidf_train, binarized_labels_train)
# then report the hyperparameters that were found
print("best hyperparmeter are: ", best_hyperparameters)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.2s


best hyperparmeter are:  {'estimator__alpha': 9.66253922334479e-07}


[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    4.1s finished


As we can see, the optimal value we found for `alpha` is substantially smaller than the default setting. This means we were initially underfitting our model, since we were forcing the parameters closer to zero than the data wanted them to be.

In [79]:
test_classifier(best_hyperparameters)

precision: 0.8161290322580645
recall: 0.4462081128747795
f1: 0.5769669327251995


The f1 score improved quite substantially. This is due to a large increase in recall (and a comparatively small decrease in precision). We can see that the hyperparameter tuning worked.

Since we were forcing the parameters toward zero, the features had to be very indicative of a topic for the model to assign that topic. This means we were missing a lot of potentially correct predictions. By loosening the regularization and giving the model more flexibility, we were able to increase the recall without losing too much precision.

In [28]:
classifier = create_classifier(best_hyperparameters)
classifier.fit(tfidf_train, binarized_labels_train)

OneVsRestClassifier(estimator=SGDClassifier(alpha=1.3834633267768954e-06,
                                            average=False, class_weight=None,
                                            early_stopping=False, epsilon=0.1,
                                            eta0=0.0, fit_intercept=True,
                                            l1_ratio=0.15,
                                            learning_rate='optimal',
                                            loss='modified_huber',
                                            max_iter=1000, n_iter_no_change=5,
                                            n_jobs=None, penalty='elasticnet',
                                            power_t=0.5, random_state=None,
                                            shuffle=True, tol=0.001,
                                            validation_fraction=0.1, verbose=0,
                                            warm_start=False),
                    n_jobs=-1)

In [40]:
test_text = texts[312]
test_text_tfidf = featurizer.transform([test_text])
test_text_pred = classifier.predict(test_text_tfidf)
mlb.inverse_transform(test_text_pred)[0]

('campaign finance', 'corporate power')