This notebook is an experiment to see whether a pure scikit-learn implementation of the fastText model can outperform a linear model on a small text classification problem: 20 newsgroups.

http://arxiv.org/abs/1607.01759

fastText models are very similar to Deep Averaging Networks (restricted to a single hidden layer with a linear activation function):

https://www.cs.umd.edu/~miyyer/pubs/2015_acl_dan.pdf


Note that scikit-learn does not provide a hierarchical softmax implementation, but with only 20 classes we do not need one on 20 newsgroups anyway.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

from sklearn.model_selection import train_test_split

In [3]:
twentyng_train = fetch_20newsgroups(
    subset='train',
    #remove=('headers', 'footers'),
)
docs_train, target_train = twentyng_train.data, twentyng_train.target


twentyng_test = fetch_20newsgroups(
    subset='test',
    #remove=('headers', 'footers'),
)

docs_test, target_test = twentyng_test.data, twentyng_test.target

In [18]:
2 ** 18

262144

The following uses the hashing trick on unigrams and bigrams. `binary=True` makes us ignore repeated tokens within a document. The `l1` normalization ensures that we "average" the embeddings of the tokens in the document instead of summing them.
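To make the "averaging" claim concrete, here is a minimal sketch on two made-up documents. The embedding table `W` is hypothetical (random values, 5 dimensions); it only stands in for the first-layer weights that a model would learn on top of these features.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents, made up for illustration (not taken from 20 newsgroups).
docs = ["the cat sat on the mat", "dogs chase cats"]

vec = HashingVectorizer(binary=True, ngram_range=(1, 2),
                        norm='l1', n_features=2 ** 18)
X = vec.transform(docs)

# Hypothetical embedding table: one 5-dimensional vector per hashed feature.
rng = np.random.RandomState(0)
W = rng.normal(size=(X.shape[1], 5))

# One dense vector per document.
doc_vectors = X @ W

for i in range(X.shape[0]):
    start, stop = X.indptr[i], X.indptr[i + 1]
    values, active = X.data[start:stop], X.indices[start:stop]
    # With binary=True and norm='l1', every stored value is
    # 1 / (number of active features in the document) ...
    assert np.allclose(values, 1.0 / len(active))
    # ... so the matrix product is the plain mean of the active embeddings.
    assert np.allclose(doc_vectors[i], W[active].mean(axis=0))
```

With `norm=None` the same product would give the sum of the embeddings instead, which makes the scale of the document vector depend on the document length.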

In [17]:
%%time
vec = HashingVectorizer(
    encoding='latin-1', binary=True, ngram_range=(1, 2),
    norm='l1', n_features=2 ** 18)

X_train = vec.transform(docs_train)
X_test = vec.transform(docs_test)

CPU times: user 16.8 s, sys: 116 ms, total: 16.9 s
Wall time: 16.9 s


In [19]:
first_doc_vectors = X_train[:3].toarray()
first_doc_vectors

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [20]:
first_doc_vectors.min(axis=1)

array([ 0.,  0.,  0.])

In [21]:
first_doc_vectors.max(axis=1)

array([ 0.0049505 ,  0.00469484,  0.00200401])

In [22]:
first_doc_vectors.sum(axis=1)

array([ 1.,  1.,  1.])

Baseline: one-vs-rest (OvR) logistic regression (the multinomial logistic regression loss is currently not implemented in scikit-learn's `SGDClassifier`). In practice, the OvR reduction seems to work well enough.
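As a rough illustration of the difference between the two formulations, the sketch below turns the same decision scores into probabilities in both ways. The scores are made-up numbers and the helper names `ovr_proba` and `softmax_proba` are hypothetical, not scikit-learn functions.

```python
import numpy as np


def ovr_proba(scores):
    # One-vs-rest: one independent sigmoid per class,
    # renormalized so that each row sums to 1.
    p = 1.0 / (1.0 + np.exp(-scores))
    return p / p.sum(axis=1, keepdims=True)


def softmax_proba(scores):
    # Multinomial: a single softmax couples all the classes.
    z = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


# Hypothetical decision scores for 2 documents and 3 classes.
scores = np.array([[2.0, -1.0, 0.5],
                   [-0.3, 0.1, 1.2]])

p_ovr = ovr_proba(scores)
p_softmax = softmax_proba(scores)

# Both mappings are monotonic in the scores, so for given scores the
# predicted class is the same; only the probability values differ.
assert (p_ovr.argmax(axis=1) == p_softmax.argmax(axis=1)).all()
```

The two approaches differ at training time: OvR fits each class against the rest with its own binary loss, whereas the multinomial loss optimizes all class scores jointly.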

In [23]:
%%time
from sklearn.linear_model import SGDClassifier

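# loss='log' was renamed 'log_loss' in scikit-learn 1.1 and n_iter was replaced
# by max_iter in 0.19; tol=None keeps the fixed number of 30 epochs.
lr = SGDClassifier(loss='log_loss', alpha=1e-10, max_iter=30, tol=None, n_jobs=-1)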
lr.fit(X_train, target_train)

CPU times: user 51.3 s, sys: 3.37 s, total: 54.7 s
Wall time: 5.33 s


In [24]:
%%time
print("test score: %0.3f" % lr.score(X_test, target_test))

test score: 0.823
CPU times: user 289 ms, sys: 740 ms, total: 1.03 s
Wall time: 295 ms


Let's now use the MLPClassifier of scikit-learn to add a single hidden layer with a small number of hidden units.

Note: instead of tanh or relu we would prefer to use a linear / identity activation function for the hidden layer, but this is not (yet) implemented in scikit-learn.

In that respect the following model is closer to a Deep Averaging Network (without dropout) than to fastText.
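The role of the activation function can be sketched in plain NumPy. The weights below are random placeholders and the sizes are toy values chosen for illustration; biases are left out for brevity. With an identity activation the two layers collapse into a single linear model whose weight matrix has rank at most the number of hidden units, which is the fastText-like case. A tanh activation breaks that equivalence.

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, n_hidden, n_classes = 1000, 10, 20  # toy sizes (assumed)

# Hypothetical weights: W1 plays the role of the embedding table,
# W2 the role of the output layer.
W1 = rng.normal(size=(n_features, n_hidden))
W2 = rng.normal(size=(n_hidden, n_classes))

# Three non-negative, l1-normalized "documents".
X = rng.rand(3, n_features)
X /= X.sum(axis=1, keepdims=True)

linear_scores = (X @ W1) @ W2        # identity activation (fastText-like)
tanh_scores = np.tanh(X @ W1) @ W2   # tanh activation (closer to a DAN)

# With an identity activation the two layers are equivalent to one
# linear map with a low-rank weight matrix.
W = W1 @ W2
assert np.allclose(linear_scores, X @ W)
assert np.linalg.matrix_rank(W) <= n_hidden
```

The output scores would then go through a softmax to obtain class probabilities, exactly as for a linear model.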

In [25]:
%%time
from sklearn.neural_network import MLPClassifier

# Released scikit-learn versions name this parameter `solver` (formerly `algorithm`).
mlp = MLPClassifier(solver='adam', hidden_layer_sizes=10, max_iter=100, activation='tanh', verbose=100)
mlp.fit(X_train, target_train)

Iteration 1, loss = 3.00852041
Iteration 2, loss = 2.98454154
Iteration 3, loss = 2.95940918
Iteration 4, loss = 2.93309926
Iteration 5, loss = 2.90229842
Iteration 6, loss = 2.86498372
Iteration 7, loss = 2.82006721
Iteration 8, loss = 2.76721310
Iteration 9, loss = 2.70666565
Iteration 10, loss = 2.63868516
Iteration 11, loss = 2.56438989
Iteration 12, loss = 2.48427410
Iteration 13, loss = 2.39930013
Iteration 14, loss = 2.31064855
Iteration 15, loss = 2.21980731
Iteration 16, loss = 2.12689619
Iteration 17, loss = 2.03369668
Iteration 18, loss = 1.94064003
Iteration 19, loss = 1.84855330
Iteration 20, loss = 1.75839564
Iteration 21, loss = 1.67029382
Iteration 22, loss = 1.58553153
Iteration 23, loss = 1.50368438
Iteration 24, loss = 1.42559216
Iteration 25, loss = 1.35092920
Iteration 26, loss = 1.28031093
Iteration 27, loss = 1.21336656
Iteration 28, loss = 1.15018003
Iteration 29, loss = 1.09083649
Iteration 30, loss = 1.03492320
Iteration 31, loss = 0.98235703
Iteration 32, los



In [27]:
%%time
print("test score: %0.3f" % mlp.score(X_test, target_test))

test score: 0.813
CPU times: user 116 ms, sys: 3.9 ms, total: 120 ms
Wall time: 119 ms
