# The 20 newsgroups topic analysis

Instead of repeating the IMDB sentiment analysis from the lesson (frankly, I'm a little bored with sentiment analysis), I will try a similar deep-learning approach to NLP classification on a dataset a coworker has recently been messing around with in `scikit-learn`: `sklearn.datasets.fetch_20newsgroups`.

http://people.csail.mit.edu/jrennie/20Newsgroups/

## Setup data

In [1]:
import os
current_dir = os.getcwd()

LESSON_HOME_DIR = current_dir + '/'
DATA_HOME_DIR = LESSON_HOME_DIR + 'data/'

DATASET_DIR = DATA_HOME_DIR + '20_newsgroup/'
MODEL_DIR = DATASET_DIR + 'models/'

In [2]:
# makedirs also creates any missing parent directories (data/, 20_newsgroup/)
if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR)

In [3]:
from sklearn.datasets import fetch_20newsgroups

category_subset = [
    'alt.atheism',
    'comp.graphics',
    'comp.os.ms-windows.misc',
    'soc.religion.christian',
]

x_train = fetch_20newsgroups(
    subset='train',
    categories = category_subset,
    shuffle = True,
    remove = ('headers', 'footers', 'quotes'))

x_test = fetch_20newsgroups(
    subset='test',
    categories = category_subset,
    shuffle = True,
    remove = ('headers', 'footers', 'quotes'))

In [4]:
x_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'soc.religion.christian']

`target_names` are as requested

In [5]:
x_train.filenames.shape, x_train.target.shape, len(x_train.data)

((2254,), (2254,), 2254)

In [6]:
x_test.filenames.shape, x_test.target.shape, len(x_test.data)

((1500,), (1500,), 1500)

Keras implements `get_word_index()` for the IMDB dataset, which returns a dictionary of word->index derived from a JSON file hosted on Amazon S3.

This seems bizarre to me. Anyway, sklearn doesn't provide anything like it, so let's create our own index with `keras.preprocessing.text.Tokenizer` (https://keras.io/preprocessing/text/).

In [7]:
from keras.preprocessing.text import Tokenizer
from unidecode import unidecode

train_tokenizer = Tokenizer()
unidecoded_x_train = [unidecode(text) for text in x_train.data]
train_tokenizer.fit_on_texts(unidecoded_x_train) # builds the word index
train_sequences = train_tokenizer.texts_to_sequences(unidecoded_x_train)

Using Theano backend.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
  _warn_if_not_unicode(string)


In [8]:
train_word_index = train_tokenizer.word_index

Invert `word_index` to get an index->word lookup, `idx2word`.

In [9]:
# .items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
train_idx2word = {v: k for k, v in train_word_index.items()}
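To make the index concrete, here is a minimal pure-Python sketch of a frequency-ranked word index and its inverse. It is a simplified stand-in for what `Tokenizer` does, not its actual implementation: the helper `build_word_index` and the toy `docs` are made up for illustration, and tokenization is a plain lowercase whitespace split.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: map each word to a 1-based rank, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

docs = ["the disk is full", "the disk is slow", "buy the new disk"]  # toy data
word_index = build_word_index(docs)
idx2word = {idx: word for word, idx in word_index.items()}

# Encode the first document as ids, then decode it back to text.
sequence = [word_index[w] for w in docs[0].split()]
restored = " ".join(idx2word[i] for i in sequence)
```

Index 0 is never assigned, which leaves it free to act as a padding value later on.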

Let's take a look at the first post, both as a list of indices and as text reconstructed from the indices.

In [10]:
', '.join(map(str, train_sequences[0]))

'6, 47, 1529, 37, 84, 69, 963, 110, 2, 676, 445, 832, 1268, 1135, 198, 72, 445, 832, 8, 736, 450, 7, 6, 95, 189, 3, 28, 3, 1203, 5, 171, 69, 62, 133, 50862, 12, 8, 970, 7537, 4, 117, 1270, 4, 1268, 7, 84, 94, 3755, 18, 109, 236, 26, 542, 29, 206, 244, 117, 69, 4, 134, 176, 213, 199, 16359, 18501, 15497, 14450, 10736, 2404, 144, 35, 15644, 11379, 9545, 2404, 144'

In [11]:
train_idx2word[6]

'i'

In [12]:
' '.join([train_idx2word[o] for o in train_sequences[0]])

"i was wondering if any one knew how the various hard drive compression utilities work my hard drive is getting full and i don't want to have to buy a new one what i'm intrested in is speed ease of use amount of compression and any other aspect you think might be important as i've never use one of these things before thanks morgan bullard mb4008 coewl cen uiuc edu or mjbb uxa cso uiuc edu"

In [13]:
x_train.target[0], x_train.target_names[x_train.target[0]]

(2, 'comp.os.ms-windows.misc')

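Reduce the vocab size by mapping rare words to a single max index. `Tokenizer` assigns indices in order of descending word frequency, so the highest indices belong to the rarest words.

First, sequence the test data.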

In [14]:
test_tokenizer = Tokenizer()
unidecoded_x_test = [unidecode(text) for text in x_test.data]
test_tokenizer.fit_on_texts(unidecoded_x_test) # builds the word index
test_sequences = test_tokenizer.texts_to_sequences(unidecoded_x_test)
test_word_index = test_tokenizer.word_index
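One thing to keep in mind with the cell above: a word index built from the test texts does not have to agree with the one built from the training texts, so the same word can end up with different ids in `train_sequences` and `test_sequences`. The toy sketch below illustrates this with a simplified frequency-ranked index (the helper `rank_index` and both corpora are hypothetical, not part of the notebook), and shows the usual alternative of encoding the test text with the training index.

```python
from collections import Counter

def rank_index(texts):
    """Hypothetical helper: 1-based word ids ordered by descending frequency."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ranked, start=1)}

train_texts = ["god exists", "god is love", "windows crashed"]  # toy data
test_texts = ["windows windows drivers", "god exists"]          # toy data

train_index = rank_index(train_texts)
test_index = rank_index(test_texts)

# The same word gets a different id depending on which corpus built the index.
mismatch = (train_index["windows"], test_index["windows"])

# Encoding test text with the *training* index keeps ids consistent;
# in this sketch, words never seen in training are simply dropped.
encoded_test = [[train_index[w] for w in t.split() if w in train_index]
                for t in test_texts]
```

With the real objects, the analogous move would be to call `train_tokenizer.texts_to_sequences` on the test texts instead of fitting a second tokenizer.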

In [15]:
import numpy as np

vocab_size = min(len(train_word_index), len(test_word_index))

trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in train_sequences]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in test_sequences]
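The clipping step relies on the index being ordered by frequency: ids at or above `vocab_size - 1` belong to the rarest words, and they all collapse into one shared id. A small sketch of the same idea on toy sequences (`toy_vocab_size` and the sequences are made up for illustration):

```python
def clip_ids(seq, vocab_size):
    """Replace every id >= vocab_size - 1 with the shared id vocab_size - 1."""
    return [min(i, vocab_size - 1) for i in seq]

toy_vocab_size = 10                           # assumed value, for illustration only
toy_sequences = [[1, 4, 9, 250], [3, 12, 8]]  # toy data
clipped = [clip_ids(s, toy_vocab_size) for s in toy_sequences]
```

After clipping, no id exceeds `vocab_size - 1`, so an embedding with `vocab_size` rows can look up every token.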

Distribution of post lengths (in tokens):

In [16]:
# a list comprehension works on Python 2 and 3 (map() is lazy on Python 3)
lens = np.array([len(s) for s in trn])
(lens.max(), lens.min(), lens.mean())

(16306, 0, 289.51863354037266)

Weird that there are posts with 0 words in them... presumably stripping the headers, footers and quotes left nothing behind.

In [17]:
# collect the indices of posts whose token sequence is empty
zero_indices = {i for i, s in enumerate(train_sequences) if len(s) == 0}
len(zero_indices)

59

So there are 59 posts with no words, e.g.:

In [18]:
train_sequences[18], x_train.target_names[x_train.target[18]]

([], 'comp.graphics')

Let's remove them (and their labels) from the dataset.

In [19]:
trn = np.delete(trn, list(zero_indices), axis=0)

In [20]:
x_train.target = np.delete(x_train.target, list(zero_indices), axis=0)
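Whatever method is used, the sequences and their labels have to be filtered with the same indices, or they fall out of alignment. A minimal sketch that filters both in one pass (the helper `drop_empty` and the toy data are hypothetical):

```python
def drop_empty(sequences, labels):
    """Keep only (sequence, label) pairs whose sequence has at least one token."""
    kept = [(s, y) for s, y in zip(sequences, labels) if len(s) > 0]
    if not kept:
        return [], []
    seqs, ys = zip(*kept)
    return list(seqs), list(ys)

toy_sequences = [[6, 47], [], [3], []]  # toy data: two empty posts
toy_labels = [2, 1, 0, 3]
seqs, ys = drop_empty(toy_sequences, toy_labels)
```

Pairing each sequence with its label before filtering makes it impossible to delete from one list and forget the other.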

In [21]:
lens = np.array([len(s) for s in trn])
(lens.max(), lens.min(), lens.mean())

(16306, 1, 297.30068337129842)

OK, so apparently there are also posts with 1 word... we'll assume that's valid for now.

Pad (with zeros) or truncate each post to a consistent length.

In [22]:
from keras.preprocessing import sequence

seq_len = 1000

trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)

In [23]:
trn[:10]

array([[    0,     0,     0, ...,  9545,  2404,   144],
       [    0,     0,     0, ...,  9485,    16,   546],
       [    0,     0,     0, ...,   163,   490,   380],
       ..., 
       [    0,     0,     0, ...,   104, 36013,   103],
       [    0,     0,     0, ...,    26,  1263, 12586],
       [    0,     0,     0, ...,  5867,  5785,  2465]], dtype=int32)
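The leading zeros in the output come from `pad_sequences` padding on the left by default; it also truncates from the left by default, keeping the end of an over-long sequence. A NumPy sketch of that behaviour (the function `pad_pre` is a hypothetical stand-in, not the Keras implementation):

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` ids of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        tail = list(seq)[-maxlen:]        # truncate from the front
        if tail:
            out[row, -len(tail):] = tail  # right-align; padding stays on the left
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
```

The short sequence becomes `[0, 0, 5, 6]` and the long one keeps only its last four ids, `[3, 4, 5, 6]`.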

Finally, let's turn the labels into categorical information.

In [24]:
from keras.utils.np_utils import to_categorical

x_train.target = to_categorical(np.asarray(x_train.target))
x_test.target = to_categorical(np.asarray(x_test.target))

In [25]:
trn.shape, x_train.target.shape

((2195, 1000), (2195, 4))

In [26]:
test.shape, x_test.target.shape

((1500, 1000), (1500, 4))
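`to_categorical` turns each integer label into a one-hot row, which is why the target shape goes from `(n,)` to `(n, 4)`. A NumPy sketch of the equivalent transformation (the helper `one_hot` is illustrative, not the Keras code):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes)[labels]

targets = one_hot([2, 0, 3], num_classes=4)
```

Taking `targets.argmax(axis=1)` recovers the original integer labels.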

## Create simple models

### Single hidden layer NN

The simplest model that tends to give reasonable results is a single hidden layer net. So let's try that. Note that we can't expect to get any useful results by feeding word ids directly into a neural net - so instead we use an embedding to replace them with a vector of 32 (initially random) floats for each word in the vocab.

In [31]:
vocab_size, seq_len

(36014, 1000)

In [32]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers.core import Flatten, Dense, Dropout
from keras.optimizers import Adam

# input_length => posts padded/truncated to seq_len (1000) tokens, 32 floats per word
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(4, activation='sigmoid')])

In [33]:
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_2 (Embedding)          (None, 1000, 32)      1152448     embedding_input_2[0][0]          
____________________________________________________________________________________________________
flatten_2 (Flatten)              (None, 32000)         0           embedding_2[0][0]                
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 100)           3200100     flatten_2[0][0]                  
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 100)           0           dense_3[0][0]                    
___________________________________________________________________________________________
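The parameter counts in the summary can be checked by hand: an embedding stores one vector per vocabulary entry, and a dense layer has one weight per input-output pair plus one bias per output. A quick check using the values printed above:

```python
vocab_size, seq_len, embed_dim, hidden = 36014, 1000, 32, 100

embedding_params = vocab_size * embed_dim      # one 32-float vector per word id
flatten_size = seq_len * embed_dim             # 1000 positions x 32 floats
dense_params = flatten_size * hidden + hidden  # weights + biases

print(embedding_params, flatten_size, dense_params)  # 1152448 32000 3200100
```

The first dense layer alone holds about 3.2 million weights against only 2,195 training posts, which may be part of why the model fits the training set much better than the test set.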

In [34]:
model.fit(trn, x_train.target, validation_data=(test, x_test.target), nb_epoch=2, batch_size=64)

Train on 2195 samples, validate on 1500 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7eff05b27fd0>

Good? Bad? Here are some accuracies [from an official `sklearn` example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html) that classifies documents by topics using a bag-of-words approach:

```
[('RidgeClassifier', 0.89726533628972649),
 ('Perceptron', 0.88543976348854403),
 ('PassiveAggressiveClassifier', 0.90613451589061345),
 ('KNeighborsClassifier', 0.85809312638580926),
 ('RandomForestClassifier', 0.83813747228381374),
 ('LinearSVC', 0.90022172949002222),
 ('SGDClassifier', 0.90096082779009612),
 ('LinearSVC', 0.87287509238728755),
 ('SGDClassifier', 0.88543976348854403),
 ('SGDClassifier', 0.89874353288987441),
 ('NearestCentroid', 0.85513673318551364),
 ('MultinomialNB', 0.90022172949002222),
 ('BernoulliNB', 0.88396156688839611),
 ('Pipeline', 0.8810051736881005)]
 
 mean: 0.88311688311688319
 ```

So... not a good result in comparison with a much simpler approach. Training accuracy is comparable, but testing accuracy is much poorer.

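It's possible that I'm doing something suboptimal. One suspect: the test set was tokenized with its own `Tokenizer`, so its word ids need not match the training ids that the embedding was learned on.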
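The output layer is also worth a second look: it uses `sigmoid`, which scores each class independently, whereas `softmax` produces a distribution over the four classes and is the usual partner of `categorical_crossentropy`. A small pure-Python comparison on made-up logits:

```python
import math

def sigmoid(xs):
    """Independent per-class scores in (0, 1); they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

def softmax(xs):
    """Scores normalised into a probability distribution over the classes."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up values for four classes
sig, soft = sigmoid(logits), softmax(logits)
```

Both functions rank the classes identically here, but only the softmax scores sum to 1.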

### Single conv layer with max pooling

A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.

In [42]:
'''x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)'''

from keras.layers.convolutional import Convolution1D, MaxPooling1D

'''conv1 = Sequential([
    Embedding(vocab_size, 100, input_length=seq_len),
    Convolution1D(128, 5, activation='relu'),
    MaxPooling1D(5),
    Convolution1D(128, 5, activation='relu'),
    MaxPooling1D(5),
    Convolution1D(128, 5, activation='relu'),
    MaxPooling1D(5),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(4, activation='softmax')
    ])'''

conv1 = Sequential([
    Embedding(vocab_size, 100, input_length=seq_len, dropout=0.4),
    Dropout(0.4),
    Convolution1D(128, 5, activation='relu'),
    Dropout(0.4),
    MaxPooling1D(5),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.7),
    Dense(4, activation='sigmoid')])

In [43]:
# Adam was already imported above; the unused RMSprop import has been dropped
conv1.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
conv1.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_4 (Embedding)          (None, 1000, 100)     3601400     embedding_input_4[0][0]          
____________________________________________________________________________________________________
dropout_6 (Dropout)              (None, 1000, 100)     0           embedding_4[0][0]                
____________________________________________________________________________________________________
convolution1d_2 (Convolution1D)  (None, 996, 128)      64128       dropout_6[0][0]                  
____________________________________________________________________________________________________
dropout_7 (Dropout)              (None, 996, 128)      0           convolution1d_2[0][0]            
___________________________________________________________________________________________
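The shapes and parameter counts in this summary follow from simple arithmetic: a convolution without padding shortens the sequence by `kernel - 1`, and each of the 128 filters has one weight per (kernel position, input channel) pair plus a bias. The pooled and flattened sizes below are not shown in the truncated summary; they are derived on the assumption that the pooling layer uses non-overlapping windows of size 5.

```python
vocab_size, seq_len, embed_dim = 36014, 1000, 100
filters, kernel, pool = 128, 5, 5

embedding_params = vocab_size * embed_dim             # 3601400
conv_out_len = seq_len - kernel + 1                   # no padding: 996 positions
conv_params = kernel * embed_dim * filters + filters  # weights + biases: 64128

# Assumption: non-overlapping pooling windows of size `pool`.
pooled_len = conv_out_len // pool
flatten_size = pooled_len * filters
```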

In [44]:
conv1.optimizer.lr.get_value().item()

0.0010000000474974513

In [45]:
#conv1.optimizer.lr=0.0001

In [46]:
conv1.fit(trn, x_train.target, validation_data=(test, x_test.target), nb_epoch=4, batch_size=64)

Train on 2195 samples, validate on 1500 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7efef5406b50>

In [40]:
#conv1.optimizer.lr=0.001

In [41]:
conv1.fit(trn, x_train.target, validation_data=(test, x_test.target), nb_epoch=4, batch_size=64)

Train on 2195 samples, validate on 1500 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7efefb696590>