
# LSTM on the IMDB movie review dataset

In [1]:
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.layers import Dense, Embedding, LSTM

Using TensorFlow backend.


#### Loading the dataset. The vocabulary size is restricted to 10,000, so only the 10,000 most frequent words will appear in our dataset

In [2]:
N_words = 10000 # vocab size

(x_train, y_train), (x_test, y_test) = imdb.load_data(index_from=3, seed=42, num_words=N_words)

In [3]:
y_train[0]

1

#### Each review is stored as a list of word indices, not as a sequence of one-hot encoded vectors. It contains the same information but requires far less memory

In [4]:
print(x_train[0])

[1, 11, 4079, 11, 4, 1986, 745, 3304, 299, 1206, 590, 3029, 1042, 37, 47, 27, 1269, 2, 7637, 19, 6, 3586, 15, 1367, 3196, 17, 1002, 723, 1768, 2887, 757, 46, 4, 232, 1131, 39, 107, 3589, 11, 4, 4539, 198, 24, 4, 1834, 133, 4, 107, 7, 98, 413, 8911, 5835, 11, 35, 781, 8, 169, 4, 2179, 5, 259, 334, 3773, 8, 4, 3497, 10, 10, 17, 16, 3381, 46, 34, 101, 612, 7, 84, 18, 49, 282, 167, 2, 7173, 122, 24, 1414, 8, 177, 4, 392, 531, 19, 259, 15, 934, 40, 507, 39, 2, 260, 77, 8, 162, 5097, 121, 4, 65, 304, 273, 13, 70, 1276, 2, 8, 15, 745, 3304, 5, 27, 322, 2197, 2, 2, 70, 30, 2, 88, 17, 6, 3029, 1042, 29, 100, 30, 4943, 50, 21, 18, 148, 15, 26, 5980, 12, 152, 157, 10, 10, 21, 19, 3196, 46, 50, 5, 4, 1636, 112, 828, 6, 1003, 4, 162, 5097, 2, 517, 6, 2, 7, 4, 9527, 5593, 4, 351, 232, 385, 125, 6, 1693, 39, 2383, 5, 29, 69, 5593, 5670, 6, 162, 5097, 1567, 232, 256, 34, 718, 5612, 2980, 8, 6, 226, 762, 7, 2, 7830, 5, 517, 2, 6, 3242, 7, 4, 351, 232, 37, 9, 1861, 8, 123, 3196, 2, 5612, 188, 5165, 857,
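
To make the memory argument concrete, here is a minimal sketch that builds the one-hot form for the first few indices of the review above and compares its size with the index form. The dtypes (`int32`, `float32`) are assumptions chosen for illustration; the ratio changes with other choices.

```python
import numpy as np

vocab_size = 10000  # same value as N_words above
# First few indices of x_train[0] as printed above.
review = np.array([1, 11, 4079, 11, 4, 1986], dtype=np.int32)

# One-hot form: one row per word, one column per vocabulary entry.
one_hot = np.zeros((len(review), vocab_size), dtype=np.float32)
one_hot[np.arange(len(review)), review] = 1.0

# The indices can be recovered from the one-hot rows, so nothing is lost.
recovered = one_hot.argmax(axis=1)

print(review.nbytes)   # bytes used by the index representation
print(one_hot.nbytes)  # bytes used by the one-hot representation
```

With these dtypes the one-hot array is `vocab_size` times larger than the index array, which is why the dataset ships as indices and the `Embedding` layer looks rows up directly.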

#### As we saw above, the dataset is just a bunch of indices. To convert them back to the original text we need to check the code
```python
imdb.load_data??
```
it says
```
by convention, use 2 as OOV word
reserve 'index_from' (=3 by default) characters:
0 (padding), 1 (start), 2 (OOV)
```

So we need to shift the mapping by the index_from parameter we set above.   
imdb.get_word_index() returns a mapping from words to indices, which we have to invert and shift.

In [5]:
w2idx = imdb.get_word_index()
# invert the word -> index mapping and shift it by index_from (=3)
idx2w = {idx + 3: w for w, idx in w2idx.items()}
idx2w[0] = '<PAD>'
idx2w[1] = '<START>'
idx2w[2] = '<OOV>' # out of vocab

In [6]:
y_train[:3]

array([1, 0, 1])

In [7]:
print(' '.join([idx2w[i] for i in x_train[0]]), '\n')
print(' '.join([idx2w[i] for i in x_train[1]]), '\n')
print(' '.join([idx2w[i] for i in x_train[2]]))

<START> in panic in the streets richard widmark plays u s navy doctor who has his week <OOV> interrupted with a corpse that contains plague as cop paul douglas properly points out the guy died from two bullets in the chest that's not the issue here the two of them become unwilling partners in an effort to find the killers and anyone else exposed to the disease br br as was pointed out by any number of people for some reason director <OOV> kazan did not bother to cast the small parts with anyone that sounds like they're from <OOV> having been to new orleans where the story takes place i can personally <OOV> to that richard widmark and his wife barbara <OOV> <OOV> can be <OOV> because as a navy doctor he could be assigned there but for those that are natives it doesn't work br br but with plague out there and the news being kept a secret the new orleans <OOV> starts a <OOV> of the city's underworld the dead guy came off a ship from europe and he had underworld connections a new orleans w
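
The reverse direction, turning new text into indices the model can read, follows the same convention. The sketch below is an illustration under assumptions: `toy_w2idx` is a hypothetical stand-in for `imdb.get_word_index()`, `encode` is a helper name invented here, and the input is assumed to be lowercase and whitespace-separated like the decoded reviews above.

```python
INDEX_FROM = 3  # same shift as the index_from argument used above
PAD, START, OOV = 0, 1, 2

# Hypothetical toy word index standing in for imdb.get_word_index();
# it maps words to frequency ranks starting at 1.
toy_w2idx = {"the": 1, "and": 2, "a": 3, "movie": 4, "great": 5}

def encode(text, w2idx, num_words):
    """Turn a lowercase, whitespace-separated string into shifted indices."""
    encoded = [START]
    for word in text.split():
        idx = w2idx.get(word)
        # Unknown words and words beyond the vocabulary limit become OOV.
        if idx is None or idx + INDEX_FROM >= num_words:
            encoded.append(OOV)
        else:
            encoded.append(idx + INDEX_FROM)
    return encoded

print(encode("a great movie and a terrible plot", toy_w2idx, num_words=10000))
```

With the real word index, `encode(text, w2idx, N_words)` should produce sequences in the same format as `x_train`, assuming the text is tokenized the same way as the dataset.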

### Model definition
We could use a pre-trained embedding matrix in the first layer, but for now it is just randomly initialized.

In [8]:
model = Sequential()
model.add(Embedding(N_words, 200))
model.add(LSTM(200, dropout=0.5, recurrent_dropout=0.3, return_sequences=True)) 
# note: dropout is applied to the layer inputs, recurrent_dropout to the recurrent state, see https://arxiv.org/pdf/1512.05287.pdf
model.add(LSTM(100, dropout=0.5, recurrent_dropout=0.3, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [9]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 200)         2000000   
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 200)         320800    
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               120400    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 2,441,301
Trainable params: 2,441,301
Non-trainable params: 0
_________________________________________________________________


#### About 82% of the parameters are in the embedding layer!
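
The numbers in the summary can be reproduced by hand. The sketch below uses the standard parameter-count formulas for embedding, LSTM and dense layers; the helper function names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gate/candidate blocks, each with input weights,
    # recurrent weights and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

counts = {
    "embedding_1": embedding_params(10000, 200),
    "lstm_1": lstm_params(200, 200),
    "lstm_2": lstm_params(200, 100),
    "dense_1": dense_params(100, 1),
}
total = sum(counts.values())
embedding_share = counts["embedding_1"] / total

print(counts)
print(total, round(embedding_share, 3))
```

The embedding layer grows linearly with the vocabulary size, so reducing `N_words` or the embedding dimension is the quickest way to shrink the model.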

### How long are the reviews usually?

In [10]:
for i in range(10): print(len(x_train[i]))

467
138
147
168
144
248
125
204
138
182
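
Ten reviews are only a glimpse, but the same check scales to the whole training set. The sketch below summarizes the ten lengths printed above and counts how many exceed a cut-off of 200 tokens, the `maxlen` used for padding in this notebook. For the full picture, replace the hard-coded list with `[len(x) for x in x_train]`.

```python
import numpy as np

# The ten lengths printed above.
lengths = np.array([467, 138, 147, 168, 144, 248, 125, 204, 138, 182])

maxlen = 200  # cut-off used for padding in this notebook

median_length = np.median(lengths)
n_truncated = int((lengths > maxlen).sum())

print(median_length)  # typical review length in this small sample
print(n_truncated)    # reviews that would lose tokens at this cut-off
```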


In [11]:
# Mini-batches need inputs of the same length, so every review is padded or truncated to 200 tokens.
# Without padding we would have to train one sequence at a time (also possible).
padded_x_train = sequence.pad_sequences(x_train, maxlen=200)
padded_x_test  = sequence.pad_sequences(x_test, maxlen=200)
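
By default `pad_sequences` pads on the left with zeros and, for sequences that are too long, drops tokens from the front. The NumPy-only sketch below mimics that default behaviour so it is easy to see on a toy input; `pad_pre` is a hypothetical helper written for this illustration, not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """NumPy-only sketch of pre-padding and pre-truncation."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue  # nothing to copy, the row stays all padding
        kept = seq[-maxlen:]          # too long: drop tokens from the front
        out[row, -len(kept):] = kept  # right-align, padding sits on the left
    return out

padded = pad_pre([[1, 5, 9], [1, 4, 7, 8, 3, 6]], maxlen=4)
print(padded)
```

One consequence of truncating from the front is that long reviews lose their `<START>` token and opening words, while the ending of the review is always kept.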

In [12]:
history = model.fit(padded_x_train, y_train, batch_size=128, epochs=5, validation_data=(padded_x_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
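
The object returned by `fit` keeps the per-epoch metrics in its `history.history` dictionary. As a small sketch, the helper below picks the epoch with the highest validation accuracy. The name `best_epoch` is invented here, and the metric key differs between Keras versions (`val_acc` in older releases, `val_accuracy` in newer ones), so both are checked. The numbers in `fake_history` are made up for illustration and are not results from this notebook.

```python
def best_epoch(history_dict):
    """Return (1-based epoch, value) of the highest validation accuracy."""
    # Older Keras releases use 'val_acc', newer ones 'val_accuracy'.
    key = "val_accuracy" if "val_accuracy" in history_dict else "val_acc"
    values = history_dict[key]
    best = max(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]

# Made-up placeholder values, only to show the call.
fake_history = {"val_acc": [0.50, 0.60, 0.58]}
print(best_epoch(fake_history))
```

In the notebook the call would be `best_epoch(history.history)`.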


## This result is far from state-of-the-art. It is just an example.

Accuracy above 94% is achievable on this dataset.

### How to train sample-by-sample with varying sequence length

In [13]:
for seq, label in zip(x_train[:5], y_train[:5]):
    print(len(seq))
    model.train_on_batch(np.array(seq)[None], label[None])

467
138
147
168
144
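
The `[None]` indexing in the loop adds the batch dimension that Keras expects, so each call sees a batch of size one whose sequence length is whatever the review happens to have. A minimal shape check, using the first five indices of `x_train[0]` and assuming the labels are NumPy integer scalars as in `y_train`:

```python
import numpy as np

seq = [1, 11, 4079, 11, 4]  # first five indices of x_train[0]
label = np.int64(1)         # labels in y_train are NumPy integer scalars

x_batch = np.array(seq)[None]  # same as np.expand_dims(np.array(seq), 0)
y_batch = label[None]

print(x_batch.shape)  # (batch size, sequence length)
print(y_batch.shape)  # (batch size,)
```

This works because the `Embedding` and `LSTM` layers were defined without a fixed input length, which is also why the summary shows `None` for the time dimension.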
