In [1]:
# from theano.sandbox import cuda

In [1]:
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function
from keras import optimizers
from keras.models import Sequential

Using Theano backend.


In [2]:
model_path = '/Users/mdymshits/fastai/data/imdb/models/'

In [3]:
# %mkdir -p /Users/mdymshits/fastai/data/imdb/models/

## Setup data

We're going to look at the IMDB dataset, which contains movie reviews from IMDB, along with their sentiment. Keras comes with some helpers for this dataset.

In [4]:
from keras.datasets import imdb
idx = imdb.get_word_index()

This is the word list:

In [5]:
idx_arr = sorted(idx, key=idx.get)
idx_arr[:10]

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']

...and this is the mapping from id to word:

In [6]:
idx2word = {v: k for k, v in idx.items()}
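
To make the two lookups concrete, here is a minimal sketch on a toy index. The dictionary `toy_idx` and its contents are made up for illustration and only mimic the word-to-rank shape of the real index.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most common).
# The words and ranks here are invented for illustration.
toy_idx = {'the': 1, 'and': 2, 'a': 3, 'movie': 17}

# Sort the words by rank, the same way idx_arr is built above.
toy_idx_arr = sorted(toy_idx, key=toy_idx.get)

# Invert the mapping so that an id can be turned back into a word.
toy_idx2word = {v: k for k, v in toy_idx.items()}

# Decode a short sequence of ids back into text.
decoded = ' '.join(toy_idx2word[i] for i in [1, 17])

print(toy_idx_arr)
print(decoded)
```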

We download the reviews using code copied from keras.datasets:

In [7]:
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
# Use a context manager so the file is closed once the data is loaded
with open(path, 'rb') as f:
    (x_train, labels_train), (x_test, labels_test) = pickle.load(f)

Here's the first review. As you can see, the words have been replaced by ids, which can be looked up in idx2word.

In [8]:
', '.join(map(str, x_train[0]))

'23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1, 169, 55, 14, 46, 82, 5869, 41, 393, 110, 138, 14, 5359, 58, 4477, 150, 8, 1, 5032, 5948, 482, 69, 5, 261, 12, 23022, 73935, 2003, 6, 73, 2436, 5, 632, 71, 6, 5359, 1, 25279, 5, 2004, 10471, 1, 5941, 1534, 34, 67, 64, 205, 140, 65, 1232, 63526, 21145, 1, 49265, 4, 1, 223, 901, 29, 3024, 69, 4, 1, 5863, 10, 694, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1472, 3724, 802, 5, 3521, 177, 1, 393, 10, 1238, 14030, 30, 309, 3, 353, 344, 2989, 143, 130, 5, 7804, 28, 4, 126, 5359, 1472, 2375, 5, 23022, 309, 10, 532, 12, 108, 1470, 4, 58, 556, 101, 12, 23022, 309, 6, 227, 4187, 48, 3, 2237, 12, 9, 215'

The first word of the first review is 23022. Let's see what that is.

In [9]:
idx2word[23022]

'bromwell'

Here's the whole review, mapped from ids to words.

In [10]:
' '.join([idx2word[o] for o in x_train[0]])

"bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell high's satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers' pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i'm here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn't"

The labels are 1 for positive, 0 for negative.

Reduce the vocabulary size by mapping every rare word to the maximum index (vocab_size-1).

In [11]:
vocab_size = 5000

trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]
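
The capping rule can be checked on a handful of ids. This is a small pure-Python sketch; the helper name `cap_vocab` is hypothetical and not part of the notebook or of Keras.

```python
def cap_vocab(seq, vocab_size):
    """Replace every id >= vocab_size - 1 with the shared rare-word id."""
    rare_id = vocab_size - 1
    return [i if i < rare_id else rare_id for i in seq]

# Ids below 4999 are kept; 4999 and everything above collapse to 4999.
capped = cap_vocab([23022, 309, 6, 4998, 4999, 5000], 5000)
print(capped)
```

Note that the last index does double duty: it is both the id of the 4999th word and the bucket for everything rarer.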

Look at the distribution of review lengths.

In [12]:
lens = np.array(map(len, trn))
(lens.max(), lens.min(), lens.mean())

(2493, 10, 237.71364)

Pad (with zeros) or truncate each review to a consistent length.

In [13]:
seq_len = 500

trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)

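This results in nice rectangular matrices that can be passed to ML algorithms. Reviews shorter than 500 words are pre-padded with zeros; longer ones are truncated. With the pad_sequences defaults, truncation also happens at the start, so the last 500 tokens of a long review are kept.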

In [14]:
trn.shape

(25000, 500)
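
As a rough illustration of what padding and truncation do to a single review, here is a pure-Python sketch. It assumes the pad_sequences defaults (padding='pre', truncating='pre'); the helper `pad_or_truncate` is a hypothetical stand-in, not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Pre-pad with `value`, or keep only the last `maxlen` items."""
    if len(seq) >= maxlen:
        # Drop items from the front so the end of the sequence survives.
        return list(seq[len(seq) - maxlen:])
    # Put the padding in front so the real tokens sit at the end.
    return [value] * (maxlen - len(seq)) + list(seq)

short = pad_or_truncate([7, 8, 9], 5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5)
print(short)
print(long_)
```

Every output has the same length, which is what lets the reviews be stacked into the (25000, 500) matrix above.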

## Pre-trained vectors

You may want to look at wordvectors.ipynb before moving on.

In this section, we replicate the previous CNN, but using pre-trained embeddings.

In [15]:
def load_vectors(loc):
    return (load_array(loc+'.dat'),
        pickle.load(open(loc+'_words.pkl','rb')),
        pickle.load(open(loc+'_idx.pkl','rb')))

In [16]:
%ls -1 /Users/mdymshits/fastai/data/glove/results/

6B.50d_idx.pkl
6B.50d_words.pkl
glove.twitter.27B.100d.dat/
glove.twitter.27B.100d_idx.pkl
glove.twitter.27B.100d_words.pkl
glove.twitter.27B.25d.dat/
glove.twitter.27B.25d_idx.pkl
glove.twitter.27B.25d_words.pkl
glove.twitter.27B.50d.dat/
glove.twitter.27B.50d_idx.pkl
glove.twitter.27B.50d_words.pkl


In [17]:
# vecs, words, wordidx = load_vectors('/Users/mdymshits/fastai/data/glove/results/6B.50d')
vecs, words, wordidx = load_vectors('/Users/mdymshits/fastai/data/glove/results/glove.twitter.27B.100d')

The GloVe word ids and the IMDB word ids use different indexes, so we create a simple function that builds an embedding matrix using the indexes from IMDB and the embeddings from GloVe (where they exist).

In [18]:
def create_emb():
    """Build a (vocab_size, n_fact) matrix indexed by IMDB word id."""
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))  # row 0 (the padding id) stays all zeros
    counter = 0  # number of ids in 1..vocab_size-1 with no usable GloVe vector
    for i in range(1,len(emb)):
        word = idx2word[i]
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in wordidx:
            if word == '10': print (i)  # debugging aid: show which id maps to '10'
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            counter += 1
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    emb/=3  # scale every row down by a factor of 3
    return emb, counter

In [19]:
emb, c = create_emb()
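
The lookup-with-fallback logic is easier to follow on a toy vocabulary. The sketch below is pure Python and only illustrative: `build_embedding`, `toy_id2word` and `toy_pretrained` are hypothetical names, the vectors are invented, and the final division by 3 from create_emb is left out.

```python
import random
import re

def build_embedding(id2word, pretrained, vocab_size, n_fact, seed=0):
    """Copy pretrained vectors where available, else draw random ones."""
    rng = random.Random(seed)

    def rand_vec():
        return [rng.gauss(0, 0.6) for _ in range(n_fact)]

    # Row 0 is the padding id and stays all zeros.
    emb = [[0.0] * n_fact for _ in range(vocab_size)]
    missing = 0
    for i in range(1, vocab_size):
        word = id2word.get(i)
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in pretrained:
            emb[i] = list(pretrained[word])
        else:
            missing += 1
            emb[i] = rand_vec()
    # The last row is the shared rare-word id, so it is always random.
    emb[-1] = rand_vec()
    return emb, missing

# Invented toy data: "isn't" fails the regex, 'zzz' has no pretrained vector.
toy_id2word = {1: 'the', 2: "isn't", 3: 'zzz'}
toy_pretrained = {'the': [1.0, 2.0], "isn't": [3.0, 4.0]}
toy_emb, toy_missing = build_embedding(toy_id2word, toy_pretrained, 5, 2)
print(toy_missing)
print(toy_emb[1])
```

Inspecting the returned counter on the real data is a quick way to see how much of the vocabulary ended up with random vectors.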

We pass our embedding matrix to the Embedding constructor, and set it to non-trainable.

In [20]:
emb.shape

(5000, 100)

In [21]:
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=emb.shape[1], input_length=seq_len, dropout=0.2, 
              weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
#     Dense(2, activation ='softmax')])

model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
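
To see where the weights in this network sit, the layer shapes and parameter counts can be worked out by hand. This is a back-of-the-envelope sketch, assuming MaxPooling1D uses its default pool length of 2 and that border_mode='same' keeps all 500 time steps; model.summary() is the authoritative source.

```python
vocab_size, n_fact, seq_len = 5000, 100, 500
n_filters, kernel_len = 64, 5
pool_len = 2  # assumed default pool length of MaxPooling1D

# Embedding: one n_fact-dimensional row per word id (frozen in this model).
emb_params = vocab_size * n_fact

# Convolution1D: each filter spans kernel_len steps of n_fact features, plus a bias.
conv_params = kernel_len * n_fact * n_filters + n_filters

# 'same' padding keeps 500 steps; pooling halves that; Flatten joins steps and filters.
pooled_len = seq_len // pool_len
flat = pooled_len * n_filters

dense1_params = flat * 100 + 100
dense2_params = 100 * 1 + 1

trainable_params = conv_params + dense1_params + dense2_params
print(emb_params, conv_params, flat, dense1_params, dense2_params)
print(trainable_params)
```

Under these assumptions most of the trainable weights are in the first Dense layer, which fits with the heavy Dropout(0.7) placed right after it.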

**previous results**:
```
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 11s - loss: 0.3997 - acc: 0.8207 - val_loss: 0.3032 - val_acc: 0.8943
Epoch 2/2
25000/25000 [==============================] - 11s - loss: 0.2882 - acc: 0.8832 - val_loss: 0.2646 - val_acc: 0.9029
```

In [22]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=32)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x159255910>

We have already beaten our previous model! But let's fine-tune the embedding weights, especially since the words we couldn't find in GloVe just have random embeddings.

In [30]:
model.layers[0].trainable=True

In [37]:
model.optimizer.lr=1e-1

In [38]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=3, batch_size=32)

Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1595ace50>

As expected, that's given us a nice little boost. :)

In [26]:
# model.save_weights(model_path+'glove50.h5')

In [27]:
model.layers[0].trainable=False

In [28]:
model.optimizer.lr=1e-5

In [29]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x107d524d0>

Note that this last step reuses the same architecture: it freezes the embeddings again and trains the remaining layers for two more epochs at a much lower learning rate.

## LSTM

We haven't covered this bit yet!

In [None]:
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, mask_zero=True,
              W_regularizer=l2(1e-6), dropout=0.2),
    LSTM(100, consume_less='gpu'),
    Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

In [None]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=5, batch_size=64)