# The 20 newsgroups topic analysis

Instead of repeating the IMDB sentiment analysis from the lesson (because frankly, I'm a little bored with sentiment analysis), I will apply a similar deep-learning approach to text classification on a dataset a coworker has recently been messing around with in `scikit-learn`: `sklearn.datasets.fetch_20newsgroups`.

http://people.csail.mit.edu/jrennie/20Newsgroups/

## Setup data

In [1]:
import os
current_dir = os.getcwd()

LESSON_HOME_DIR = current_dir + '/'
DATA_HOME_DIR = LESSON_HOME_DIR + 'data/'

DATASET_DIR = DATA_HOME_DIR + '20_newsgroup/'
MODEL_DIR = DATASET_DIR + 'models/'

In [2]:
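if not os.path.exists(MODEL_DIR):
    # makedirs also creates the missing parents (data/ and 20_newsgroup/)
    # and does not fail when DATASET_DIR already exists
    os.makedirs(MODEL_DIR)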

In [3]:
from sklearn.datasets import fetch_20newsgroups

category_subset = [
    'alt.atheism',
    'comp.graphics',
    'comp.os.ms-windows.misc',
    'soc.religion.christian',
]

x_train = fetch_20newsgroups(
    subset='train',
    categories = category_subset,
    shuffle = True,
    remove = ('headers', 'footers', 'quotes'))

x_test = fetch_20newsgroups(
    subset='test',
    categories = category_subset,
    shuffle = True,
    remove = ('headers', 'footers', 'quotes'))

In [4]:
x_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'soc.religion.christian']

`target_names` are as requested

In [5]:
x_train.filenames.shape, x_train.target.shape, len(x_train.data)

((2254,), (2254,), 2254)

In [6]:
x_test.filenames.shape, x_test.target.shape, len(x_test.data)

((1500,), (1500,), 1500)

Keras implements `get_word_index()` for the IMDB dataset, which returns a dictionary of word->index derived from a JSON file hosted on Amazon S3.

This seems bizarre to me? Anyway, sklearn doesn't do this. So let's create our own index with `keras.preprocessing.text.Tokenizer` (https://keras.io/preprocessing/text/).

In [7]:
import keras.preprocessing.text
import string

# Workaround to add "Unicode support for keras.preprocessing.text"
# (https://github.com/fchollet/keras/issues/1072#issuecomment-295470970)
def text_to_word_sequence(text,
                          filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                          lower=True, split=" "):
    if lower: text = text.lower()
    if type(text) == unicode:
        translate_table = {ord(c): ord(t) for c,t in zip(filters, split*len(filters)) }
    else:
        translate_table = string.maketrans(filters, split * len(filters))
    text = text.translate(translate_table)
    seq = text.split(split)
    return [i for i in seq if i]
    
keras.preprocessing.text.text_to_word_sequence = text_to_word_sequence

Using Theano backend.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)


In [8]:
from keras.preprocessing.text import Tokenizer

train_tokenizer = Tokenizer()
train_tokenizer.fit_on_texts(x_train.data) # builds the word index
train_sequences = train_tokenizer.texts_to_sequences(x_train.data)
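
For intuition, here is a minimal pure-Python sketch of what the word index and the sequences look like. It assumes the usual `Tokenizer` behaviour of ranking words by frequency starting at 1 (0 is left free for padding) and dropping unknown words. `build_word_index` and `to_sequences` are hypothetical helpers, punctuation filtering is omitted, and the alphabetical tie-break is my own choice to keep the toy deterministic.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; ties are broken alphabetically (an assumption
    # made here only so that the toy result is reproducible).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_texts = ["the cat sat", "the dog sat on the mat"]
toy_index = build_word_index(toy_texts)
toy_seqs = to_sequences(toy_texts, toy_index)

print(toy_index["the"], toy_index["sat"])  # 1 2
print(toy_seqs[0])                         # [1, 3, 2]
```

Because the ids are frequency ranks, small ids are common words and large ids are rare ones, which is what makes the later vocabulary cut-off meaningful.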

In [9]:
train_word_index = train_tokenizer.word_index

In [10]:
train_word_index

{u'3ds2scn': 19125,
 u'wax3': 30528,
 u'fl46vzjp2': 19127,
 u'l1tbk': 19128,
 u'mbhi8bea': 19129,
 u'circuitry': 19130,
 u'pantheistic': 19131,
 u'mdbs': 19132,
 u'hanging': 8849,
 u'beqs': 19133,
 u'sation': 19134,
 u'disobeying': 13192,
 u"'113s1t45": 19135,
 u'sisrg': 19136,
 u"ng2z'kk": 19137,
 u"ng2z'ki": 19138,
 u'4bo0': 13211,
 u'trojan': 19140,
 u"qq'jp": 19141,
 u'yourdon': 19142,
 u'ua8cx': 10454,
 u'y74': 41966,
 u"9l2'": 19143,
 u'fractal': 3361,
 u'rlis': 6209,
 u'wednesday': 10455,
 u'woods': 19145,
 u'598n': 19146,
 u'amplifications': 19147,
 u'rlii': 19148,
 u'v6jylh': 19149,
 u'rlim': 10456,
 u'y8z': 19151,
 u's3u4578': 19152,
 u'matthean': 19153,
 u'rll': 12716,
 u'y8v': 19154,
 u'znb8flb8': 19155,
 u'270': 8850,
 u'271': 13196,
 u'272': 8851,
 u'273': 13197,
 u'274': 19156,
 u'275': 19157,
 u'v0z1t': 40355,
 u'sustaining': 9423,
 u'7bizw': 19159,
 u'y8c': 19360,
 u'targa': 2727,
 u'y8g': 19160,
 u'inanimate': 13199,
 u"x'28": 19161,
 u'errors': 1482,
 u'cooking': 191

Reverse the `word_index` to get an index->word lookup (`idx2word`).

In [11]:
train_idx2word = {v: k for k, v in train_word_index.iteritems()}

Let's take a look at the first post, both as a list of indices and as text reconstructed from the indices.

In [12]:
', '.join(map(str, train_sequences[0]))

'6, 47, 1529, 37, 84, 69, 963, 110, 2, 676, 445, 832, 1268, 1135, 198, 72, 445, 832, 8, 736, 450, 7, 6, 95, 189, 3, 28, 3, 1203, 5, 171, 69, 62, 133, 50864, 12, 8, 970, 7537, 4, 117, 1270, 4, 1268, 7, 84, 94, 3755, 18, 109, 236, 26, 542, 29, 206, 244, 117, 69, 4, 134, 176, 213, 199, 16358, 18502, 15497, 14450, 10736, 2404, 144, 35, 15644, 11379, 9545, 2404, 144'

In [13]:
train_idx2word[6]

u'i'

In [14]:
' '.join([train_idx2word[o] for o in train_sequences[0]])

u"i was wondering if any one knew how the various hard drive compression utilities work my hard drive is getting full and i don't want to have to buy a new one what i'm intrested in is speed ease of use amount of compression and any other aspect you think might be important as i've never use one of these things before thanks morgan bullard mb4008 coewl cen uiuc edu or mjbb uxa cso uiuc edu"

In [15]:
x_train.target[0], x_train.target_names[x_train.target[0]]

(2, 'comp.os.ms-windows.misc')

Reduce vocab size by setting rare words to max index.

First, sequence the test data.

In [16]:
test_tokenizer = Tokenizer()
test_tokenizer.fit_on_texts(x_test.data)
test_sequences = test_tokenizer.texts_to_sequences(x_test.data)
test_word_index = test_tokenizer.word_index
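
One thing worth checking here: a tokenizer fitted separately on the test set builds its own word->index mapping, so the same id can stand for different words in the two splits. The toy sketch below (plain Python, hypothetical `build_index` helper, ids assigned by first appearance rather than by frequency) shows the mismatch and the alternative of encoding the test text with the training index. With Keras the analogous call would be `train_tokenizer.texts_to_sequences(x_test.data)`.

```python
def build_index(texts):
    """Hypothetical helper: 1-based ids in order of first appearance."""
    index = {}
    for text in texts:
        for word in text.lower().split():
            index.setdefault(word, len(index) + 1)
    return index

train_texts = ["graphics card driver", "windows driver crash"]
test_texts = ["crash in windows graphics"]

train_index = build_index(train_texts)
separate_test_index = build_index(test_texts)

# Words present in both indexes but with different ids: (train id, test id).
mismatched = {w: (train_index[w], separate_test_index[w])
              for w in separate_test_index
              if w in train_index and train_index[w] != separate_test_index[w]}

# Encoding the test text with the training index keeps ids consistent;
# words never seen in training ("in") are simply dropped.
shared_test_seq = [train_index[w] for w in test_texts[0].split()
                   if w in train_index]

print(mismatched["crash"])  # (5, 1)
print(shared_test_seq)      # [5, 4, 1]
```

If the two splits disagree on ids, an embedding learned on the training ids is looked up with unrelated ids at validation time, which could hurt validation accuracy.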

In [17]:
import numpy as np

#vocab_size = min(len(train_word_index), len(test_word_index))
vocab_size = 5000

trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in train_sequences]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in test_sequences]
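
The list comprehensions above clamp every id to at most `vocab_size - 1`, so all rare words share the single id 4999. A minimal plain-Python sketch of the same rule (`cap_vocab` is a hypothetical name):

```python
def cap_vocab(sequence, vocab_size):
    """Map every id >= vocab_size - 1 to the shared 'rare word' id."""
    return [min(i, vocab_size - 1) for i in sequence]

capped = cap_vocab([6, 47, 1529, 50864, 4999, 5000], vocab_size=5000)
print(capped)  # [6, 47, 1529, 4999, 4999, 4999]
```

Note that the word that genuinely ranks 4999th is merged with the rare words too.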

Distribution of post lengths (in words):

In [18]:
lens = np.array(map(len, trn))
(lens.max(), lens.min(), lens.mean())

(16306, 0, 289.51863354037266)

In [19]:
lens = np.array(map(len, test))
(lens.max(), lens.min(), lens.mean())

(39458, 0, 253.43933333333334)

Weird that there are posts with 0 words in them... presumably stripping headers, footers and quotes left nothing behind.

In [20]:
# get indices of arrays that do NOT satisfy np.nonzero
nonzero_indices = np.unique(np.nonzero(train_sequences)[0])
zero_indices = set(range(len(train_sequences))).difference(nonzero_indices)
len(zero_indices)

59

So there are 59 posts with no words. E.g.:

In [21]:
train_sequences[18], x_train.target_names[x_train.target[18]]

([], 'comp.graphics')

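Let's remove them (and their labels) from the dataset. Note that both removal cells below are commented out, so the statistics and shapes that follow still include the empty posts.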

In [22]:
#trn = np.delete(trn, list(zero_indices), axis=0)

In [23]:
#x_train.target = np.delete(x_train.target, list(zero_indices), axis=0)
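
A minimal sketch of dropping empty sequences together with their labels, in plain Python rather than `np.delete` (`drop_empty` is a hypothetical helper, not part of this notebook). Filtering both lists in one pass keeps them aligned.

```python
def drop_empty(sequences, labels):
    """Return (sequences, labels) with every empty sequence and its label removed."""
    kept = [(s, y) for s, y in zip(sequences, labels) if len(s) > 0]
    if not kept:
        return [], []
    seqs, ys = zip(*kept)
    return list(seqs), list(ys)

toy_seqs = [[6, 47], [], [3], []]
toy_labels = [2, 1, 0, 3]
clean_seqs, clean_labels = drop_empty(toy_seqs, toy_labels)

print(clean_seqs)    # [[6, 47], [3]]
print(clean_labels)  # [2, 0]
```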

In [24]:
lens = np.array(map(len, trn))
(lens.max(), lens.min(), lens.mean())

(16306, 0, 289.51863354037266)

OK, so apparently there are also posts with 1 word... we'll assume that's valid for now.

Pad (with zero) or truncate each sentence to make consistent length.

In [25]:
from keras.preprocessing import sequence

seq_len = 500

trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)
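
By default `pad_sequences` pads at the front and, for sequences that are too long, keeps the last `maxlen` items. The sketch below mimics that default in plain Python (`pad_pre` is a hypothetical helper and assumes `maxlen > 0`).

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]  # truncate from the front if too long
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([6, 47, 1529], 5))         # [0, 0, 6, 47, 1529]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
print(pad_pre([], 3))                     # [0, 0, 0]
```

So a very long post is represented by its final 500 words only, and an empty post becomes a row of zeros.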

In [26]:
trn[:10]

array([[   0,    0,    0, ..., 4999, 2404,  144],
       [   0,    0,    0, ..., 4999,   16,  546],
       [  26,  104, 4999, ...,  163,  490,  380],
       ..., 
       [   0,    0,    0, ...,  104, 4999,  103],
       [   0,    0,    0, ...,   26, 1263, 4999],
       [   0,    0,    0, ..., 4999, 4999, 2465]], dtype=int32)

Finally, let's turn the labels into categorical information.

In [27]:
from keras.utils.np_utils import to_categorical

x_train.target = to_categorical(np.asarray(x_train.target))
x_test.target = to_categorical(np.asarray(x_test.target))
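
What this conversion produces, sketched in plain Python: each integer label becomes a row with a single 1 in the column for its class (`one_hot` is a hypothetical stand-in for `to_categorical`).

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    rows = []
    for y in labels:
        row = [0.0] * num_classes
        row[y] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([2, 0], num_classes=4)
print(encoded)  # [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
```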

In [28]:
trn.shape, x_train.target.shape

((2254, 500), (2254, 4))

In [29]:
test.shape, x_test.target.shape

((1500, 500), (1500, 4))

## Create simple models

In [30]:
'''from keras.datasets import imdb
idx = imdb.get_word_index()

from keras.utils.data_utils import get_file
import pickle
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
f = open(path, 'rb')
(x_train, labels_train), (x_test, labels_test) = pickle.load(f)

vocab_size = 5000

trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]

trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)'''


### Single hidden layer NN

The simplest model that tends to give reasonable results is a single hidden layer net. So let's try that. Note that we can't expect to get any useful results by feeding word ids directly into a neural net - so instead we use an embedding to replace them with a vector of 32 (initially random) floats for each word in the vocab.

In [31]:
vocab_size, seq_len

(5000, 500)

In [32]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers.core import Flatten, Dense, Dropout
from keras.optimizers import Adam

# input_length => posts padded/truncated to 500 words, 32 floats per word
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(4, activation='sigmoid')])
    #Dense(1, activation='sigmoid')])

In [33]:
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 500, 32)       160000      embedding_input_1[0][0]          
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 16000)         0           embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 100)           1600100     flatten_1[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 100)           0           dense_1[0][0]                    
___________________________________________________________________________________________
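
The parameter counts in the summary can be reproduced by hand. A quick sanity check in plain arithmetic (variable names are mine; the output-layer count is not visible in the truncated summary above and is derived here):

```python
vocab_size, seq_len, emb_dim, hidden, n_classes = 5000, 500, 32, 100, 4

embedding_params = vocab_size * emb_dim           # one 32-float vector per word id
flattened = seq_len * emb_dim                     # 500 positions x 32 floats
dense_params = flattened * hidden + hidden        # weights + biases
output_params = hidden * n_classes + n_classes    # weights + biases

print(embedding_params, flattened, dense_params, output_params)
# 160000 16000 1600100 404
```

That is roughly 1.76 million parameters fitted on 2254 training posts, which may help explain a large gap between training and validation accuracy.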

In [34]:
model.fit(trn, x_train.target, validation_data=(test, x_test.target), nb_epoch=2, batch_size=64)
#model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 2254 samples, validate on 1500 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f1224b39350>

Good? Bad? Here are some accuracies [from an official `sklearn` example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html) that classifies documents by topics using a bag-of-words approach:

```
[('RidgeClassifier', 0.89726533628972649),
 ('Perceptron', 0.88543976348854403),
 ('PassiveAggressiveClassifier', 0.90613451589061345),
 ('KNeighborsClassifier', 0.85809312638580926),
 ('RandomForestClassifier', 0.83813747228381374),
 ('LinearSVC', 0.90022172949002222),
 ('SGDClassifier', 0.90096082779009612),
 ('LinearSVC', 0.87287509238728755),
 ('SGDClassifier', 0.88543976348854403),
 ('SGDClassifier', 0.89874353288987441),
 ('NearestCentroid', 0.85513673318551364),
 ('MultinomialNB', 0.90022172949002222),
 ('BernoulliNB', 0.88396156688839611),
 ('Pipeline', 0.8810051736881005)]
 
 mean: 0.88311688311688319
 ```

So... not a good result in comparison with a much simpler approach. Training accuracy is comparable, but testing accuracy is much poorer.

Since my model is barely training, I'm most likely doing something incorrectly. `pretrained_word_embeddings.ipynb` from Keras's examples repository was able to achieve `0.8734` acc, `0.7257` val_acc after 10 epochs - still not comparable to the 'shallow', bag-of-words models, but viable at least.
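
One thing worth checking (an assumption, not a confirmed diagnosis) is the output activation: each post has exactly one label, and the usual pairing with `categorical_crossentropy` is a softmax output rather than independent sigmoids. The plain-Python sketch below only illustrates the difference between the two activations; it says nothing about how Keras handles either internally.

```python
import math

def sigmoid(logits):
    """Independent per-class scores; they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]

def softmax(logits):
    """Scores normalised into a probability distribution over classes."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1, -1.0]
sigmoid_sum = sum(sigmoid(logits))
softmax_sum = sum(softmax(logits))
print(sigmoid_sum > 1.0)               # True
print(abs(softmax_sum - 1.0) < 1e-9)   # True
```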

### Single conv layer with max pooling

A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.

In [35]:
'''x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)'''

from keras.layers.convolutional import Convolution1D, MaxPooling1D

'''conv1 = Sequential([
    Embedding(vocab_size, 100, input_length=seq_len),
    Convolution1D(128, 5, activation='relu'),
    MaxPooling1D(5),
    Convolution1D(128, 5, activation='relu'),
    MaxPooling1D(5),
    Convolution1D(128, 5, activation='relu'),
    MaxPooling1D(5),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(4, activation='softmax')
    ])'''

conv1 = Sequential([
    Embedding(vocab_size, 100, input_length=seq_len, dropout=0.4),
    Dropout(0.4),
    Convolution1D(128, 5, activation='relu'),
    Dropout(0.4),
    MaxPooling1D(5),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.7),
    Dense(4, activation='sigmoid')])

In [36]:
from keras.optimizers import RMSprop

conv1.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
conv1.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_2 (Embedding)          (None, 500, 100)      500000      embedding_input_2[0][0]          
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 500, 100)      0           embedding_2[0][0]                
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 496, 128)      64128       dropout_2[0][0]                  
____________________________________________________________________________________________________
dropout_3 (Dropout)              (None, 496, 128)      0           convolution1d_1[0][0]            
___________________________________________________________________________________________
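
The shapes and counts in this summary follow from simple arithmetic. A sketch, assuming no padding on the convolution and a pooling stride equal to the pool size (the pooled and flattened sizes are not visible in the truncated summary and are derived here):

```python
def conv1d_valid_len(input_len, kernel):
    """Output length of a 1D convolution with no padding and stride 1."""
    return input_len - kernel + 1

def pool1d_len(input_len, pool, stride=None):
    """Output length of 1D max pooling without padding."""
    stride = stride or pool
    return (input_len - pool) // stride + 1

seq_len, emb_dim, filters, kernel, pool = 500, 100, 128, 5, 5

conv_len = conv1d_valid_len(seq_len, kernel)            # 496
conv_params = kernel * emb_dim * filters + filters      # 64128
pooled_len = pool1d_len(conv_len, pool)                 # 99
flattened = pooled_len * filters                        # 12672

print(conv_len, conv_params, pooled_len, flattened)
# 496 64128 99 12672
```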

In [37]:
conv1.optimizer.lr.get_value().item()

0.0010000000474974513

In [38]:
#conv1.optimizer.lr=0.0001

In [39]:
conv1.fit(trn, x_train.target, validation_data=(test, x_test.target), nb_epoch=4, batch_size=64)

Train on 2254 samples, validate on 1500 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f121ef0b910>

In [40]:
#conv1.optimizer.lr=0.001

In [41]:
conv1.fit(trn, x_train.target, validation_data=(test, x_test.target), nb_epoch=4, batch_size=64)

Train on 2254 samples, validate on 1500 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f121b01a310>

## Pre-trained vectors

You may want to look at `wordvectors.ipynb` before moving on.

In this section, we replicate the previous CNN, but using pre-trained embeddings.

In [42]:
from keras.utils.data_utils import get_file

def get_glove_dataset(dataset):
    """Download the requested glove dataset from files.fast.ai
    and return a location that can be passed to load_vectors.
    """
    # see wordvectors.ipynb for info on how these files were
    # generated from the original glove data.
    md5sums = {'6B.50d': '8e1557d1228decbda7db6dfd81cd9909',
               '6B.100d': 'c92dbbeacde2b0384a43014885a60b2c',
               '6B.200d': 'af271b46c04b0b2e41a84d8cd806178d',
               '6B.300d': '30290210376887dcc6d0a5a6374d8255'}
    glove_path = os.path.abspath('data/glove/results')
    %mkdir -p $glove_path
    return get_file(dataset,
                    'http://files.fast.ai/models/glove/' + dataset + '.tgz',
                    cache_subdir=glove_path,
                    md5_hash=md5sums.get(dataset, None),
                    untar=True)

In [43]:
from utils import load_array
import pickle

def load_vectors(loc):
    return (load_array(loc+'.dat'),
        pickle.load(open(loc+'_words.pkl','rb')),
        pickle.load(open(loc+'_idx.pkl','rb')))

In [44]:
vecs, words, wordidx = load_vectors(get_glove_dataset('6B.50d'))

Untaring file...


The GloVe word ids and our 20 newsgroups word ids use different indexes. So we create a simple function that builds an embedding matrix using the indexes from our tokenizer and the embeddings from GloVe (where they exist).

In [45]:
import re
from numpy.random import normal

def create_emb():
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))

    for i in range(1,len(emb)):
        word = train_idx2word[i]
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in wordidx:
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    emb/=3
    return emb

In [46]:
emb = create_emb()

In [47]:
emb

array([[ 0.    ,  0.    ,  0.    , ...,  0.    ,  0.    ,  0.    ],
       [-0.0736, -0.243 ,  0.0836, ..., -0.1244,  0.3455,  0.2754],
       [ 0.1393,  0.0832, -0.1375, ..., -0.0614, -0.0384, -0.2619],
       ..., 
       [-0.5187,  0.2875,  0.0487, ..., -0.202 , -0.0205, -0.163 ],
       [ 0.5126, -0.0599,  0.275 , ...,  0.2665, -0.1906,  0.2124],
       [-0.2356, -0.4336,  0.1991, ..., -0.1291, -0.458 , -0.0296]])
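
Since this corpus contains many junk tokens, it may be worth measuring how much of the vocabulary actually receives a pre-trained vector rather than a random one. A minimal sketch that applies the same test as `create_emb` to toy data (`pretrained_coverage` and the toy dictionaries are hypothetical; in the notebook the inputs would be `train_idx2word`, `wordidx` and `vocab_size`):

```python
import re

TOKEN_RE = re.compile(r"^[a-zA-Z0-9\-]*$")  # same pattern as in create_emb

def pretrained_coverage(idx2word, pretrained_vocab, vocab_size):
    """Fraction of ids 1..vocab_size-1 whose word has a pre-trained vector."""
    hits = 0
    for i in range(1, vocab_size):
        word = idx2word.get(i)
        if word and TOKEN_RE.match(word) and word in pretrained_vocab:
            hits += 1
    return hits / float(vocab_size - 1)

toy_idx2word = {1: "the", 2: "graphics", 3: "fl46vzjp2", 4: "don't"}
toy_pretrained = {"the", "graphics", "windows"}
coverage = pretrained_coverage(toy_idx2word, toy_pretrained, vocab_size=5)
print(coverage)  # 0.5
```

Words containing an apostrophe (like "don't") fail the pattern and fall back to random initialisation, just as they do in `create_emb`.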

In [48]:
emb_model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, 
              weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(4, activation='sigmoid')])

In [49]:
emb_model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
emb_model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_3 (Embedding)          (None, 500, 50)       0           embedding_input_3[0][0]          
____________________________________________________________________________________________________
dropout_5 (Dropout)              (None, 500, 50)       0           embedding_3[0][0]                
____________________________________________________________________________________________________
convolution1d_2 (Convolution1D)  (None, 500, 64)       16064       dropout_5[0][0]                  
____________________________________________________________________________________________________
dropout_6 (Dropout)              (None, 500, 64)       0           convolution1d_2[0][0]            
___________________________________________________________________________________________

In [50]:
emb_model.fit(trn, x_train.target, validation_data=(test, x_test.target), nb_epoch=4, batch_size=64)

Train on 2254 samples, validate on 1500 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f11fe821a10>