# The 20 newsgroups topic analysis

Instead of repeating the IMDB sentiment analysis from the lesson (because frankly, I'm a little bored with sentiment analysis), I'll apply a similar deep-learning approach to text classification on a dataset a coworker has recently been experimenting with in `scikit-learn`: `sklearn.datasets.fetch_20newsgroups`.

http://people.csail.mit.edu/jrennie/20Newsgroups/

## Setup data

In [1]:
import os
current_dir = os.getcwd()

LESSON_HOME_DIR = current_dir + '/'
DATA_HOME_DIR = LESSON_HOME_DIR + 'data/'

DATASET_DIR = DATA_HOME_DIR + '20_newsgroup/'
MODEL_DIR = DATASET_DIR + 'models/'

In [2]:
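if not os.path.exists(MODEL_DIR):
    # makedirs creates every missing directory on the path (data/, 20_newsgroup/, models/);
    # two separate mkdir calls fail if data/ is missing or 20_newsgroup/ already exists
    os.makedirs(MODEL_DIR)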

In [3]:
from sklearn.datasets import fetch_20newsgroups

category_subset = [
    'alt.atheism',
    'comp.graphics',
    'comp.os.ms-windows.misc',
    'soc.religion.christian',
]

newsgroups = fetch_20newsgroups(
    subset = 'all',
    categories = category_subset,
    shuffle = True,
    remove = ('headers', 'footers', 'quotes'))

In [4]:
newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'soc.religion.christian']

`target_names` are as requested.

In [5]:
newsgroups.filenames.shape, newsgroups.target.shape, len(newsgroups.data)

((3754,), (3754,), 3754)

Keras implements `get_word_index()` for the IMDB dataset, which returns a dictionary of word->index derived from a JSON file hosted on Amazon S3.

It seems bizarre to me to host this when you can easily create it on-demand... anyway, sklearn doesn't provide this. So let's create our own index with `keras.preprocessing.text.Tokenizer` (https://keras.io/preprocessing/text/).

In [6]:
import keras.preprocessing.text
import string

# Workaround to add "Unicode support for keras.preprocessing.text"
# (https://github.com/fchollet/keras/issues/1072#issuecomment-295470970)
def text_to_word_sequence(text,
                          filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                          lower=True, split=" "):
    if lower: text = text.lower()
    if type(text) == unicode:
        translate_table = {ord(c): ord(t) for c,t in zip(filters, split*len(filters)) }
    else:
        translate_table = string.maketrans(filters, split * len(filters))
    text = text.translate(translate_table)
    seq = text.split(split)
    return [i for i in seq if i]
    
keras.preprocessing.text.text_to_word_sequence = text_to_word_sequence

Using Theano backend.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)


In [10]:
from keras.preprocessing.text import Tokenizer

vocab_size = 10000

tokenizer = Tokenizer(nb_words=vocab_size)
tokenizer.fit_on_texts(newsgroups.data) # builds the word index
sequences = tokenizer.texts_to_sequences(newsgroups.data)

In [11]:
word_index = tokenizer.word_index

In [12]:
len(word_index)

72905
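
To make the word index less of a black box, here is a minimal pure-Python sketch of the idea: rank words by frequency, give the most frequent word index 1, and apply the vocabulary cutoff only when converting text to indices. The helper names (`build_word_index`, `texts_to_sequences`) and the toy sentences are made up for illustration; the real `Tokenizer` also filters punctuation and may break ties differently.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Sort by descending count, breaking ties alphabetically so the result is deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}


def texts_to_sequences(texts, word_index, nb_words=None):
    """Replace words with indices, dropping unknown words and any index >= nb_words."""
    sequences = []
    for text in texts:
        ids = [word_index.get(w) for w in text.lower().split()]
        sequences.append([i for i in ids
                          if i is not None and (nb_words is None or i < nb_words)])
    return sequences


toy_texts = ["the cat sat on the mat", "the dog sat"]
toy_index = build_word_index(toy_texts)        # keeps all 6 distinct words
toy_sequences = texts_to_sequences(toy_texts, toy_index, nb_words=3)
```

Applying the cutoff only at conversion time would explain why `len(word_index)` above is far larger than `vocab_size`: the index keeps every word, while the sequences only contain the most frequent ones.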

Reverse the `word_index` with `idx2word`.

In [13]:
idx2word = {v: k for k, v in word_index.iteritems()}

Let's take a look at the first post, both as a list of indices and as text reconstructed from the indices.

In [14]:
', '.join(map(str, sequences[0]))

'24, 2, 60, 566, 52, 20, 4829, 3, 389, 2, 3000, 1339, 2, 155, 386, 84, 901, 76, 4, 115, 24, 2, 88, 566, 76, 92, 402, 3, 525, 101, 2, 155, 385, 6, 1685, 2454, 236, 93, 182, 118, 1350, 335, 7, 108, 5725, 2764, 6, 3000, 8981, 20, 396, 3, 118, 127, 2454, 6, 158, 182, 90, 4499, 14, 129, 11, 2274, 81, 4808, 234, 219, 92, 23, 5605, 6, 720, 10, 3000, 61, 783, 5464, 8, 5725, 7, 1736, 3, 239, 3, 1464, 55, 2, 579, 5464, 214, 701, 45, 91, 23, 2, 1802, 4, 5, 1464, 10, 2, 2694, 74, 318, 129, 386, 436, 5, 24, 11, 30, 3, 102, 1641, 11, 114, 332, 8, 2, 1598, 5, 858, 537, 9, 11, 1237, 96, 386, 95, 17, 23, 3, 952, 26, 8, 9, 121, 23, 219, 9951, 9845, 61, 2104, 95, 96, 5725, 11, 8, 5, 2828, 3, 2501, 197, 4, 127, 863, 720, 200, 102, 337, 127, 1576, 1021, 93, 15, 495, 2, 2157, 4, 94, 396, 3, 5777, 127, 9605, 720, 182, 118, 614, 2221, 60, 261, 3, 583, 8, 9, 135, 2, 154, 1464, 8, 210, 1464'

In [15]:
idx2word[24]

u'on'

In [16]:
' '.join([idx2word[o] for o in sequences[0]])

u"on the one hand there are advantages to having the liturgy stay the same john has described some of these on the other hand some people seem to start out the same old and pay attention better when things get changed around i think innovative priests and liturgy committees are trying to get our attention and make things more meaningful for us it drives me crazy too different people have preferences and needs in liturgy my local parish is innovative i prefer to go to mass at the next parish over sometimes we don't have the option of a mass in the style which best us john put a on it but to just offer it up probably is the solution a related issue that it sounds like john does not have to deal with is that may have different liturgical tastes my husband does like innovative it is a challenge to meet both of our spiritual needs without just going our separate ways when you include the factor of also trying to satisfy our children's needs things get pretty complicated one thing to remembe

In [17]:
newsgroups.target[0], newsgroups.target_names[newsgroups.target[0]]

(3, 'soc.religion.christian')

Distribution of post lengths (in characters, since `len` is applied to the raw strings):

In [18]:
import numpy as np

lens = np.array(map(len, newsgroups.data))
(lens.max(), lens.min(), lens.mean())

(158791, 0, 1493.157432072456)

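Weird that there are posts with a length of 0. Presumably nothing was left of them once headers, footers and quotes were removed. Let's count the posts that end up with no word indices at all: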

In [19]:
# np.nonzero flags the non-empty sequences; the indices it misses are the empty ones
nonzero_indices = np.unique(np.nonzero(sequences)[0])
zero_indices = set(range(len(sequences))).difference(nonzero_indices)
len(zero_indices)

101

So there are 101 posts with no words left after tokenization. E.g.

In [20]:
sequences[64], newsgroups.target_names[newsgroups.target[64]]

([], 'alt.atheism')

...sure.

Pad (with zero) or truncate each post to a consistent length.

In [21]:
from keras.preprocessing import sequence

seq_len = 1500

data = sequence.pad_sequences(sequences, maxlen=seq_len, value=0)

In [22]:
data[:10]

array([[   0,    0,    0, ...,    8,  210, 1464],
       [   0,    0,    0, ..., 3162,    8,   11],
       [   0,    0,    0, ...,    2,  318, 1142],
       ..., 
       [   0,    0,    0, ...,   47,    7,  740],
       [   0,    0,    0, ..., 2565,  356,  129],
       [   0,    0,    0, ..., 4386,  364, 8254]], dtype=int32)
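
The leading zeros in the output above come from padding at the front. A minimal pure-Python sketch of that behaviour, assuming the `pad_sequences` defaults of `padding='pre'` and `truncating='pre'` (the helper name `pad_or_truncate` is made up for illustration):

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(seq) >= maxlen:
        # Drop items from the front so the end of the post is preserved
        return list(seq[len(seq) - maxlen:])
    return [value] * (maxlen - len(seq)) + list(seq)


short = pad_or_truncate([5, 6], maxlen=4)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
empty = pad_or_truncate([], maxlen=4)
```

Under these assumptions a long post loses its beginning rather than its end, and the empty posts found earlier become rows made up entirely of zeros.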

Finally, let's turn the labels into categorical information.

In [23]:
from keras.utils.np_utils import to_categorical

newsgroups.target = to_categorical(np.asarray(newsgroups.target))

In [24]:
data.shape, newsgroups.target.shape

((3754, 1500), (3754, 4))
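
The `(3754, 4)` shape is the one-hot encoding of the four class ids. A small sketch of the transformation in plain Python (the helper `one_hot` is a made-up stand-in, not the Keras implementation):

```python
def one_hot(labels, n_classes):
    """Turn integer class ids into rows with a single 1.0 at the class position."""
    rows = []
    for label in labels:
        row = [0.0] * n_classes
        row[label] = 1.0
        rows.append(row)
    return rows


# Class 3 is 'soc.religion.christian' in the target_names list above
encoded = one_hot([3, 0, 1], n_classes=4)
```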

Split data into train-test.

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, newsgroups.target, test_size=0.33)

In [26]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((2515, 1500), (1239, 1500), (2515, 4), (1239, 4))

## Create simple models

### Single hidden layer NN

The simplest model that tends to give reasonable results is a single hidden layer net. So let's try that. Note that we can't expect to get any useful results by feeding word ids directly into a neural net - so instead we use an embedding to replace them with a vector of 32 (initially random) floats for each word in the vocab.

In [27]:
vocab_size, seq_len

(10000, 1500)

In [28]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers.core import Flatten, Dense, Dropout
from keras.optimizers import Adam

# input_length => posts padded/truncated to 1500 word indices, 32 floats per word
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(len(newsgroups.target_names), activation='softmax')])

In [29]:
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 1500, 32)      320000      embedding_input_1[0][0]          
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 48000)         0           embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 100)           4800100     flatten_1[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 100)           0           dense_1[0][0]                    
___________________________________________________________________________________________
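
The parameter counts in the summary can be reproduced by hand, which also shows where the model's capacity sits. This is just arithmetic on the layer sizes defined above; the count for the final softmax layer is derived here because its row is cut off in the summary.

```python
vocab_size, seq_len, emb_dim, hidden, n_classes = 10000, 1500, 32, 100, 4

embedding_params = vocab_size * emb_dim          # one 32-float vector per word
flattened = seq_len * emb_dim                    # width of the Flatten output
dense_params = flattened * hidden + hidden       # weights + biases of Dense(100)
output_params = hidden * n_classes + n_classes   # weights + biases of the softmax layer

total_params = embedding_params + dense_params + output_params
```

Nearly all of the parameters sit in the first dense layer (4,800,100), and there are only 2,515 training posts to fit them with, so some overfitting would not be surprising.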

In [30]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbb7401eb10>

Is a validation accuracy in the mid-70s good or bad?

Here are some accuracies [from an official `sklearn` example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html) that classifies documents by topics using a bag-of-words approach:

```
[('RidgeClassifier', 0.89726533628972649),
 ('Perceptron', 0.88543976348854403),
 ('PassiveAggressiveClassifier', 0.90613451589061345),
 ('KNeighborsClassifier', 0.85809312638580926),
 ('RandomForestClassifier', 0.83813747228381374),
 ('LinearSVC', 0.90022172949002222),
 ('SGDClassifier', 0.90096082779009612),
 ('LinearSVC', 0.87287509238728755),
 ('SGDClassifier', 0.88543976348854403),
 ('SGDClassifier', 0.89874353288987441),
 ('NearestCentroid', 0.85513673318551364),
 ('MultinomialNB', 0.90022172949002222),
 ('BernoulliNB', 0.88396156688839611),
 ('Pipeline', 0.8810051736881005)]
 
 mean: 0.88311688311688319
 ```

So, not a good result in comparison with much simpler approaches. Training accuracy is high, but testing accuracy is much poorer.

As a sanity check, I also ran code from [`pretrained_word_embeddings.py`](https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py) (from Keras's examples repository) which also runs against `20_newsgroups` (not the `sklearn` version though), and it was able to achieve:

    loss: 0.3784 - acc: 0.8734 - val_loss: 0.9177 - val_acc: 0.7257

after 10 epochs. Again, not as accurate as the 'shallow' bag-of-words models, but comparable to the results I'm getting here.

### Single conv layer with max pooling

A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.

In [31]:
from keras.layers.convolutional import Convolution1D, MaxPooling1D

conv1 = Sequential([
    Embedding(vocab_size, 100, input_length=seq_len, dropout=0.2),
    Dropout(0.4),
    Convolution1D(128, 5, activation='relu'),
    Dropout(0.4),
    MaxPooling1D(5),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.7),
    Dense(len(newsgroups.target_names), activation='softmax')])

In [32]:
conv1.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
conv1.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_2 (Embedding)          (None, 1500, 100)     1000000     embedding_input_2[0][0]          
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 1500, 100)     0           embedding_2[0][0]                
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 1496, 128)     64128       dropout_2[0][0]                  
____________________________________________________________________________________________________
dropout_3 (Dropout)              (None, 1496, 128)     0           convolution1d_1[0][0]            
___________________________________________________________________________________________
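
The convolution rows in the summary follow from two small formulas. The sketch below assumes the default 'valid' border mode (no padding) and a pooling stride equal to the pool length; the helper names are made up, and the pooled and flattened sizes are derived rather than read from the summary, which is cut off before those rows.

```python
def conv1d_valid_length(input_len, filter_len):
    """Output steps of an unpadded 1D convolution with stride 1."""
    return input_len - filter_len + 1


def conv1d_params(filter_len, in_channels, n_filters):
    """Weights plus one bias per filter."""
    return filter_len * in_channels * n_filters + n_filters


conv_len = conv1d_valid_length(1500, 5)      # steps after Convolution1D(128, 5)
conv_params = conv1d_params(5, 100, 128)     # parameters of that layer
pooled_len = conv_len // 5                   # steps after MaxPooling1D(5)
flattened = pooled_len * 128                 # width fed into Dense(128)
```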

In [33]:
conv1.optimizer.lr.get_value().item()

0.0010000000474974513

In [34]:
conv1.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbb6a5c9d90>

In [35]:
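# Assigning a float to `optimizer.lr` does not reach the training function that was
# already built by the first fit(); update the backend variable in place instead.
from keras import backend as K
K.set_value(conv1.optimizer.lr, 0.01)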

In [36]:
conv1.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=4, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fbb6a5c9e10>

In [37]:
conv1.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=1, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/1


<keras.callbacks.History at 0x7fbb6e72cdd0>

A good improvement over the previous model.

## Pre-trained vectors

You may want to look at wordvectors.ipynb before moving on.

In this section, we replicate the previous CNN, but using pre-trained embeddings.

In [38]:
from keras.utils.data_utils import get_file

def get_glove_dataset(dataset):
    """Download the requested glove dataset from files.fast.ai
    and return a location that can be passed to load_vectors.
    """
    # see wordvectors.ipynb for info on how these files were
    # generated from the original glove data.
    md5sums = {'6B.50d': '8e1557d1228decbda7db6dfd81cd9909',
               '6B.100d': 'c92dbbeacde2b0384a43014885a60b2c',
               '6B.200d': 'af271b46c04b0b2e41a84d8cd806178d',
               '6B.300d': '30290210376887dcc6d0a5a6374d8255'}
    glove_path = os.path.abspath('data/glove/results')
    %mkdir -p $glove_path
    return get_file(dataset,
                    'http://files.fast.ai/models/glove/' + dataset + '.tgz',
                    cache_subdir=glove_path,
                    md5_hash=md5sums.get(dataset, None),
                    untar=True)

In [39]:
from utils import load_array
import pickle

def load_vectors(loc):
    return (load_array(loc+'.dat'),
        pickle.load(open(loc+'_words.pkl','rb')),
        pickle.load(open(loc+'_idx.pkl','rb')))

In [40]:
vecs, words, wordidx = load_vectors(get_glove_dataset('6B.100d'))

Untaring file...


In [41]:
len(wordidx)

400000

The glove word ids and our newsgroups word ids use different indexes. So we create a simple function that creates an embedding matrix using the indexes from the newsgroups tokenizer, and the embeddings from glove (where they exist).

In [42]:
import re
from numpy.random import normal

def create_emb():
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))

    for i in range(1,len(emb)):
        word = idx2word[i]
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in wordidx:
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))

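    # In the IMDB lesson the last index was the shared "rare word" id, so it is randomly initialized;
    # here the Tokenizer drops out-of-vocabulary words instead, so this only re-randomizes the last kept word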
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    emb/=3
    return emb

In [43]:
emb = create_emb()

In [44]:
emb.shape

(10000, 100)

In [45]:
emb_model = Sequential([
    Embedding(vocab_size, 100, input_length=seq_len, dropout=0.2, 
              weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(128, 5, border_mode='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(4, activation='softmax')])

_Note_: I started seeing lines like `4s - loss: nan - acc: 0.6783 - val_loss: nan - val_acc: 0.2131` where in the previous epoch, `val_acc` was twice that amount. A [quick search on the forums](http://forums.fast.ai/t/why-are-my-losses-nan/2931/2) surfaced this explanation:

    "There is one thing that doesn't look quite right: the final activation is not compatible with that loss function. Categorical cross-entropy expects a 'softmax' activation in the final layer, not 'sigmoid'. Consider changing that to see what happens."
    
**Categorical cross-entropy expects a `softmax` activation in the final layer, not `sigmoid`.** I had been using `sigmoid`, so I switched to `softmax`. I don't recall ever learning this, however, and should ponder why it holds.
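
Part of the intuition can be checked with a few lines of plain Python. This is a sketch of why the pairing matters, not a diagnosis of the `nan` losses above; the logits and the hand-written `categorical_crossentropy` are illustrative stand-ins, not the Keras implementation.

```python
import math


def softmax(logits):
    """Exponentiate and normalize so the outputs sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Squash each logit independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def categorical_crossentropy(y_true, y_pred):
    """Only the predicted probability of the true class contributes."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))


logits = [2.0, -1.0, 0.5, -3.0]
y_true = [1.0, 0.0, 0.0, 0.0]

soft = softmax(logits)    # a proper distribution over the 4 classes
sig = sigmoid(logits)     # 4 independent scores that need not sum to 1

# With independent outputs, being confident about every class costs nothing extra:
confident_right = [0.9, 0.1, 0.1, 0.1]
confident_everything = [0.9, 0.9, 0.9, 0.9]
same_loss = (categorical_crossentropy(y_true, confident_right)
             == categorical_crossentropy(y_true, confident_everything))
```

With softmax, pushing up the probability of one class necessarily pushes the others down, so a loss that only looks at the true class still penalizes wrong guesses. With sigmoid outputs nothing ties the classes together, which is one reason the two are not meant to be combined.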

In [46]:
emb_model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
emb_model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_3 (Embedding)          (None, 1500, 100)     0           embedding_input_3[0][0]          
____________________________________________________________________________________________________
dropout_5 (Dropout)              (None, 1500, 100)     0           embedding_3[0][0]                
____________________________________________________________________________________________________
convolution1d_2 (Convolution1D)  (None, 1500, 128)     64128       dropout_5[0][0]                  
____________________________________________________________________________________________________
dropout_6 (Dropout)              (None, 1500, 128)     0           convolution1d_2[0][0]            
___________________________________________________________________________________________

In [47]:
emb_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbb4a0e4910>

Let's fine-tune the embedding weights - especially since the words we couldn't find in glove just have random embeddings.

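In [48]:
emb_model.layers[0].trainable=True

In [49]:
# A change to `trainable` (or a plain assignment to `optimizer.lr`) only takes effect
# once the model is compiled again, so recompile with the lower learning rate.
emb_model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-4), metrics=['accuracy'])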

In [50]:
emb_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbb4a0e4a50>

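In [51]:
emb_model.layers[0].trainable=False

In [52]:
# Recompile so that both the frozen embedding and the new learning rate are picked up.
emb_model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-2), metrics=['accuracy'])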

In [53]:
emb_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=5, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fbb4a0e4c90>

Interestingly, the pretrained embeddings didn't provide any improvement...

## Pre-trained vectors II

Let's try the model from `pretrained_word_embeddings.py` (with added Dropout).

In [54]:
deep_model = Sequential([
    Embedding(vocab_size, 100, input_length=seq_len, weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(128, 5, activation='relu'),
    Dropout(0.25),
    MaxPooling1D(5),
    Convolution1D(128, 5, activation='relu'),
    Dropout(0.25),
    MaxPooling1D(5),
    Convolution1D(128, 5, activation='relu'),
    Dropout(0.25),
    MaxPooling1D(5),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(len(newsgroups.target_names), activation='softmax')
])

In [55]:
deep_model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
deep_model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_4 (Embedding)          (None, 1500, 100)     0           embedding_input_4[0][0]          
____________________________________________________________________________________________________
dropout_8 (Dropout)              (None, 1500, 100)     0           embedding_4[0][0]                
____________________________________________________________________________________________________
convolution1d_3 (Convolution1D)  (None, 1496, 128)     64128       dropout_8[0][0]                  
____________________________________________________________________________________________________
dropout_9 (Dropout)              (None, 1496, 128)     0           convolution1d_3[0][0]            
___________________________________________________________________________________________

In [56]:
deep_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbb46a71d10>

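In [57]:
deep_model.layers[0].trainable=True

In [58]:
# A change to `trainable` (or a plain assignment to `optimizer.lr`) only takes effect
# once the model is compiled again, so recompile with the lower learning rate.
deep_model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-4), metrics=['accuracy'])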

In [59]:
deep_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbb46a71e50>

In [60]:
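# Recompile so that the new learning rate is actually used by the training function.
deep_model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-3), metrics=['accuracy'])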

In [61]:
deep_model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbb46a72110>

Still no sizeable boost, although slightly better than the first pretrained embedding model.

## Multi-size CNN

This is an implementation of a multi-size CNN as shown in Ben Bowles' [excellent blog post](https://quid.com/feed/how-quid-uses-deep-learning-with-small-data).

In [62]:
from keras.layers import Merge

We use the functional API to create multiple conv layers of different sizes, and then concatenate them.

In [64]:
from keras.layers import Input, Merge
from keras.models import Model

# The graph consumes the embedding output: seq_len steps of 100-d vectors, not vocab_size steps
graph_in = Input((seq_len, 100))
convs = [] 
for fsz in range (3, 6): 
    x = Convolution1D(64, fsz, border_mode='same', activation='relu')(graph_in)
    x = MaxPooling1D()(x) 
    x = Flatten()(x) 
    convs.append(x)
out = Merge(mode='concat')(convs) 
graph = Model(graph_in, out) 

In [65]:
emb = create_emb()

We then replace the conv/max-pool layer in our original CNN with the concatenated conv layers.

In [69]:
multi = Sequential ([
    Embedding(vocab_size, 100, input_length=seq_len, dropout=0.2, weights=[emb]),
    Dropout (0.2),
    graph,
    Dropout (0.5),
    Dense (100, activation='relu'),
    Dropout (0.7),
    Dense (len(newsgroups.target_names), activation='softmax')
    ])

In [70]:
multi.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
multi.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_6 (Embedding)          (None, 1500, 100)     1000000     embedding_input_6[0][0]          
____________________________________________________________________________________________________
dropout_16 (Dropout)             (None, 1500, 100)     0           embedding_6[0][0]                
____________________________________________________________________________________________________
model_1 (Model)                  multiple              76992       dropout_16[0][0]                 
____________________________________________________________________________________________________
dropout_17 (Dropout)             (None, 144000)        0           model_1[2][0]                    
___________________________________________________________________________________________
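
The two numbers visible in the summary, 76992 parameters for `model_1` and a width of 144000 going into the dropout layer, can be checked by hand. The sketch assumes that `border_mode='same'` keeps all 1500 steps and that `MaxPooling1D()` defaults to a pool length of 2.

```python
seq_len, emb_dim, n_filters = 1500, 100, 64
filter_sizes = [3, 4, 5]

# Each branch: filter_len * in_channels * n_filters weights, plus one bias per filter
conv_params = [fsz * emb_dim * n_filters + n_filters for fsz in filter_sizes]
graph_params = sum(conv_params)

pooled_len = seq_len // 2                         # 'same' convolution, then pool length 2
branch_width = pooled_len * n_filters             # Flatten output of one branch
concat_width = branch_width * len(filter_sizes)   # three branches concatenated
```

The three filter sizes add relatively few convolution parameters; most of the weights still end up in the dense layer that reads the 144000-wide concatenation.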

In [71]:
multi.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbb40102610>

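In [72]:
multi.layers[0].trainable=False

In [73]:
# Recompile so that both the frozen embedding and the new learning rate are picked up.
multi.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-5), metrics=['accuracy'])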

In [74]:
multi.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=64)

Train on 2515 samples, validate on 1239 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbb40102d90>

The highest and most 'stable' results so far (where stable means `val_acc` mostly kept rising instead of bouncing around), and the closest to the upper-80s accuracies of the 'shallow' bag-of-words models.