In [1]:
%matplotlib inline
import utils
import imp
imp.reload(utils)

Using TensorFlow backend.


<module 'utils' from 'G:\\Users\\hjkim\\Documents\\Python Scripts\\fastai\\courses\\deeplearning1\\nbs\\utils.py'>

In [2]:
model_path = 'data\\imdb\\models\\'
#%mkdir $model_path

## Setup data

We're going to look at the IMDB dataset, whicih contains movie reviews from IMDB, along with their sentiment. Kers comes with some helpers for this dataset.

In [3]:
from keras.datasets import imdb
idx = imdb.get_word_index()

This is the word list:

In [4]:
idx_arr = sorted(idx, key = idx.get)
idx_arr[:10]

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']

and this is the mapping from id to word

In [5]:
idx2word = {v: k for k, v in idx.items()}

We download the reviews using code copied from keras dataset:

In [6]:
import keras
path = keras.utils.get_file('idm_full.pkl', origin = 'http://s3.amazonaws.com/text-datasets/imdb_full.pkl', md5_hash = 'd091312047c43cf9e4e38fef92437263')
f = open(path, 'rb')

In [7]:
import pickle
(x_train, labels_trn), (x_test, labels_test) = pickle.load(f)

In [8]:
len(x_train)

25000

Here's 1st review. As you see, the words have been replaced by ids. The ids can be lookedup in idx2word.

In [9]:
', '.join(map(str, x_train[0]))

'23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1, 169, 55, 14, 46, 82, 5869, 41, 393, 110, 138, 14, 5359, 58, 4477, 150, 8, 1, 5032, 5948, 482, 69, 5, 261, 12, 23022, 73935, 2003, 6, 73, 2436, 5, 632, 71, 6, 5359, 1, 25279, 5, 2004, 10471, 1, 5941, 1534, 34, 67, 64, 205, 140, 65, 1232, 63526, 21145, 1, 49265, 4, 1, 223, 901, 29, 3024, 69, 4, 1, 5863, 10, 694, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1472, 3724, 802, 5, 3521, 177, 1, 393, 10, 1238, 14030, 30, 309, 3, 353, 344, 2989, 143, 130, 5, 7804, 28, 4, 126, 5359, 1472, 2375, 5, 23022, 309, 10, 532, 12, 108, 1470, 4, 58, 556, 101, 12, 23022, 309, 6, 227, 4187, 48, 3, 2237, 12, 9, 215'

In [11]:
labels_trn[:10]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Reduce vocab size by setting rare words to max index.

In [12]:
vocab_size = 5000
import numpy as np
trn = [np.array([i if i < vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i < vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]

Look at distribution of lengths of sentences.

In [13]:
lens = np.array(list(map(len, trn)))
(lens.max(), lens.min(), float(lens.sum()) / lens.size)

(2493, 10, 237.71364)

Pad (with zero) or truncate each sentence to make consistent length.

In [14]:
seq_len = 500
import keras
from keras.preprocessing import sequence

In [15]:
trn = sequence.pad_sequences(trn, maxlen = seq_len, value = 0)
test = sequence.pad_sequences(test, maxlen = seq_len, value = 0)

This results in nice rectangular matrices that can be passed to ML algorithms. Reviews shorter than 500 words are pre-padded with zeros, those greater are truncated.

In [16]:
trn.shape

(25000, 500)

## Create simple models

### single hidden layer NN

The simplest model taht tends to give reasonable results is a single hidden layer net. So let's try that. Note that we can;t expect to get any useful results bt feeding word ids directly into a neural net - so instead we use an embedding to replace them with a vector of 32 (initially random) floats for each word in the vocab.

In [17]:
import keras
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten, Dropout, Convolution1D

model = Sequential([
    Embedding(vocab_size, 32, input_length = seq_len),
    Flatten(),
    Dense(100, activation = 'relu'),
    Dropout(0.7),
    Dense(1, activation = 'sigmoid')
])

In [18]:
import keras
from keras.optimizers import Adam

model.compile(loss = 'binary_crossentropy', optimizer = Adam(), metrics = ['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 16000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               1600100   
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 1,760,201
Trainable params: 1,760,201
Non-trainable params: 0
_________________________________________________________________


In [19]:
model.fit(trn, labels_trn, validation_data = (test, labels_test), epochs = 2, batch_size = 64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x294ffef0>

The [stanford paper]() that this dataset if from cites a state of the art accuracy (without unlabelled data) of 0.883. So we're short of that, but on the right track.

### Single conv layer with max pooling

A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1 CNN, since a sequence of words is 1D.

In [26]:
from keras.layers import Convolution1D, MaxPooling1D, SpatialDropout1D

conv1 = Sequential([
    Embedding(vocab_size, 32, input_length = seq_len),
    SpatialDropout1D(0.2),
    Dropout(0.2),
    Convolution1D(64, 5, padding = 'same', activation = 'relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation = 'relu'),
    Dropout(0.7),
    Dense(1, activation = 'sigmoid')
])

In [28]:
from keras.optimizers import Adam
conv1.compile(loss = 'binary_crossentropy', optimizer = Adam(), metrics = ['accuracy'])

In [30]:
conv1.fit(trn, labels_trn, validation_data = (test, labels_test), epochs = 4, batch_size = 64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x2afc5ac8>

That's well past the stanford paper's accuracy - another win for CNNs!

In [31]:
conv1.save_weights(model_path + 'conv1.h5')

In [32]:
conv1.load_weights(model_path + 'conv1.h5')

## Pre-trained vectors

You may want to look at wordvectors.ipynb before moving on.

In this section, we replicate the previous CNN, but using pre-trained embeddings.

In [39]:
import os, keras

def get_glove_dataset(dataset):
    """
    Download the requsted glove dataset brom files.fast.ai and return a location that can be passed to load_vectors.
    """
    
    # see workvectors.ipynb for info on how these files were generated from the original glove data.
    md5sums = {
       '6B.50d': '8e1557d1228decbda7db6dfd81cd9909',
       '6B.100d': 'c92dbbeacde2b0384a43014885a60b2c',
       '6B.200d': 'af271b46c04b0b2e41a84d8cd806178d',
       '6B.300d': '30290210376887dcc6d0a5a6374d8255'
    }
    #glove_path = os.path.abspath('data\\glove\\results')
    glove_path = 'data\\glove\\results'
    %mkdir $glove_path
    return keras.utils.get_file(dataset, 'http://files.fast.ai/models/glove/' + dataset + '.tgz', cache_subdir = glove_path, md5_hash = md5sums.get(dataset, None), untar = True)

In [40]:
def load_vectors(loc):
    return (load_array(loc + '.dat'), pickle.load(open(loc + '_words.pkl', 'rb')), pickle.load(open(loc + '_idx.pkl', 'rb')))

In [None]:
vecs, words, wordidx = load_vectors(get_glove_dataset('6B.50d'))

A subdirectory or file G:\Users\hjkim\Documents\Python already exists.
Error occurred while processing: G:\Users\hjkim\Documents\Python.
A subdirectory or file Scripts\fastai\courses\deeplearning1\nbs\data\glove\results already exists.
Error occurred while processing: Scripts\fastai\courses\deeplearning1\nbs\data\glove\results.


Downloading data from http://files.fast.ai/models/glove/6B.50d.tgz
 2473984/80107627 [..............................] - ETA: 4360s