This notebook is an experiment to see whether a pure scikit-learn implementation of the fastText model can outperform a linear model on a small text classification problem: 20 newsgroups.

http://arxiv.org/abs/1607.01759

fastText models are very similar to Deep Averaging Networks (restricted to a single hidden layer with a linear activation function):

https://www.cs.umd.edu/~miyyer/pubs/2015_acl_dan.pdf


Note that scikit-learn does not provide a hierarchical softmax implementation, but with only 20 classes we don't need it on 20 newsgroups anyway.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

from sklearn.model_selection import train_test_split

In [2]:
twentyng = fetch_20newsgroups(remove=('headers', 'footers'))

docs_train, docs_test, target_train, target_test = train_test_split(
    twentyng.data, twentyng.target, test_size=0.2, random_state=42)

In [75]:
2 ** 18

262144

The following uses the hashing trick on unigrams and bigrams. `binary=True` makes us ignore repeated words in a document. The `l1` normalization ensures that we "average" the embeddings of the tokens in the document instead of summing them.
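As a quick illustration of these settings, here is a minimal sketch on a made-up three-word document. The names `toy_doc`, `toy_vec` and `toy_row` are hypothetical and not part of the notebook. Barring hash collisions, the document has four distinct n-grams, so each active feature should end up with weight 1/4.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy document: "spam" is repeated on purpose.
toy_doc = ["spam spam eggs"]

# Same settings as the notebook's vectorizer (encoding left at its default).
toy_vec = HashingVectorizer(
    binary=True, ngram_range=(1, 2), norm='l1', n_features=2 ** 18)
toy_row = toy_vec.transform(toy_doc)

# Distinct n-grams: "spam", "eggs", "spam spam", "spam eggs".
# binary=True records each of them once, so the repeated "spam"
# does not get a larger weight than "eggs".
n_active = toy_row.nnz

# After l1 normalization every active feature carries the same weight,
# 1 / n_active, and the row sums to 1.
print(n_active, toy_row.data, toy_row.sum())
```

Because `HashingVectorizer` is stateless, `transform` can be called directly without a prior `fit`.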

In [76]:
%%time
vec = HashingVectorizer(
    encoding='latin-1', binary=True, ngram_range=(1, 2),
    norm='l1', n_features=2 ** 18)

X_train = vec.transform(docs_train)
X_test = vec.transform(docs_test)

CPU times: user 10 s, sys: 42.6 ms, total: 10.1 s
Wall time: 10 s


In [77]:
first_doc_vectors = X_train[:3].toarray()
first_doc_vectors

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [78]:
first_doc_vectors.min(axis=1)

array([ 0.,  0.,  0.])

In [79]:
first_doc_vectors.max(axis=1)

array([ 0.01282051,  0.00083195,  0.00098135])

In [80]:
first_doc_vectors.sum(axis=1)

array([ 1.,  1.,  1.])
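Rows that sum to 1 with equal non-zero weights are what turn the first layer of a network into an average of embeddings. The following NumPy sketch checks this on toy sizes; the matrix `W` and the active indices are made up for illustration, and the bias term is left out.

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, n_hidden = 8, 3  # toy sizes, assumed for illustration

# Hypothetical embedding matrix: one row of hidden weights per hashed feature.
W = rng.randn(n_features, n_hidden)

# A document whose active (hashed) features are 1, 4 and 6.
active = np.array([1, 4, 6])
x = np.zeros(n_features)
x[active] = 1.0 / len(active)  # binary indicators followed by l1 normalization

projected = x @ W                  # what the first layer computes (bias omitted)
averaged = W[active].mean(axis=0)  # plain average of the active embeddings

print(np.allclose(projected, averaged))
```

Without the `l1` normalization, the same product would give the sum of the active rows of `W`, so longer documents would produce larger hidden activations.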

Baseline: OvR (one-vs-rest) logistic regression trained with SGD (the multinomial logistic regression loss is not implemented in `SGDClassifier`). In practice, the OvR reduction seems to work well enough.

In [81]:
%%time
from sklearn.linear_model import SGDClassifier

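# Logistic loss, almost no regularization, fixed budget of 30 epochs
# (tol=None disables early stopping).
lr = SGDClassifier(loss='log_loss', alpha=1e-10, max_iter=30, tol=None, n_jobs=-1)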
lr.fit(X_train, target_train)

CPU times: user 34.1 s, sys: 3.32 s, total: 37.4 s
Wall time: 3.53 s


In [91]:
%%time
print("test score: %0.3f" % lr.score(X_test, target_test))

test score: 0.846
CPU times: user 120 ms, sys: 127 µs, total: 120 ms
Wall time: 119 ms
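To make the OvR reduction used by the baseline concrete, here is a minimal sketch on synthetic data. It uses `OneVsRestClassifier` around `LogisticRegression` rather than the `SGDClassifier` from the notebook, so it illustrates the reduction itself, not the exact model trained above. The names `X_toy`, `y_toy`, `ovr` and `manual_pred` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class problem standing in for the 20 newsgroups data.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_toy, y_toy)

# One independent binary "this class vs. all others" model per class.
print(len(ovr.estimators_))

# Prediction: score each sample with every binary model and keep the
# class whose model is the most confident.
scores = np.column_stack(
    [est.decision_function(X_toy) for est in ovr.estimators_])
manual_pred = ovr.classes_[scores.argmax(axis=1)]

print(np.array_equal(manual_pred, ovr.predict(X_toy)))
```

The binary models are trained independently, which is why the OvR fit parallelizes well across classes.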


Let's now use the MLPClassifier of scikit-learn to add a single hidden layer with a small number of hidden units.

Note: instead of tanh or relu, we would prefer a linear / identity activation function for the hidden layer, but this is not (yet) implemented in scikit-learn.

In that respect the following model is closer to a Deep Averaging Network (without dropout) than fastText.

In [88]:
%%time
from sklearn.neural_network import MLPClassifier

# A single hidden layer of 10 tanh units, trained with Adam.
mlp = MLPClassifier(solver='adam', hidden_layer_sizes=(10,), max_iter=100, activation='tanh')
mlp.fit(X_train, target_train)

CPU times: user 9min 37s, sys: 584 ms, total: 9min 37s
Wall time: 9min 38s




In [89]:
%%time
print("test score: %0.3f" % mlp.score(X_test, target_test))

test score: 0.855
CPU times: user 33.2 ms, sys: 677 µs, total: 33.9 ms
Wall time: 32.9 ms
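The reason the activation function matters for the comparison with fastText can be sketched in a few lines of NumPy: with an identity activation, the two layers collapse into a single linear map whose weight matrix has rank at most the number of hidden units, whereas a tanh layer does not collapse. All sizes and weights below are random toy values, not the fitted `mlp` parameters, and the scores are taken before the softmax.

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, n_hidden, n_classes = 50, 10, 20  # toy sizes

X_toy = rng.rand(5, n_features)
W1, b1 = rng.randn(n_features, n_hidden), rng.randn(n_hidden)
W2, b2 = rng.randn(n_hidden, n_classes), rng.randn(n_classes)

# Two layers with an identity activation in between...
two_layer = (X_toy @ W1 + b1) @ W2 + b2

# ...are equivalent to one linear layer with a low-rank weight matrix.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = X_toy @ W + b

print(np.allclose(two_layer, one_layer))
print(np.linalg.matrix_rank(W) <= n_hidden)

# With tanh in between, the hidden layer is no longer a linear map,
# so the scores differ from the collapsed linear model.
tanh_layer = np.tanh(X_toy @ W1 + b1) @ W2 + b2
print(np.allclose(tanh_layer, one_layer))
```

Under these assumptions, the identity-activation network is a linear classifier with a rank constraint, while the tanh and relu variants tried in this notebook add a non-linearity on top of the averaged embeddings.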


In [None]:
%%time
from sklearn.neural_network import MLPClassifier

# Same architecture as above, with relu instead of tanh.
mlp = MLPClassifier(solver='adam', hidden_layer_sizes=(10,), max_iter=100, activation='relu')
mlp.fit(X_train, target_train)

In [None]:
%%time
print("test score: %0.3f" % mlp.score(X_test, target_test))