# 11 RNN Language Model on Keras
In this notebook, we are going to implement three language models, each following its own assumption about how many previous words the next word depends on.

## Agenda

1. Recap on the tokenizer: convert a string to a list of word indices

2. One word in, one word out: p(wi | wi-1)

3. Fixed number of words in, one word out: p(wi | wi-1, wi-2, wi-3)

4. Variable number of input words: p(wi | wi-1, wi-2, ..., w1), using a stateful RNN

## 1. Recap on Tokenizer

In [1]:
from keras.models import Sequential
import numpy as np
from keras.layers import Dense, Embedding, LSTM
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


#####  Adopted Corpus 
- The following toy corpus is taken from Chapter 1 of Harry Potter
- We will use the trained language models to write our own `Harry Potter`.

In [2]:
corpus = """Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense\n
          Mr. Dursley was the director of a firm called Grunnings, which made
          drills. He was a big, beefy man with hardly any neck, although he did
          have a very large mustache. Mrs. Dursley was thin and blonde and had 
          nearly twice the usual amount of neck, which came in very useful as she 
          spent so much of her time craning over garden fences, spying on the 
          neighbors. The Dursleys had a small son called Dudley and in their 
          opinion there was no finer boy anywhere. 
          The Dursleys had everything they wanted, but they also had a secret, and 
          their greatest fear was that somebody would discover it. They didn't 
          think they could bear it if anyone found out about the Potters. Mrs. 
          Potter was Mrs. Dursley's sister, but they hadn't met for several years; 
          in fact, Mrs. Dursley pretended she didn't have a sister, because her 
          sister and her good-for-nothing husband were as unDursleyish as it was 
          possible to be. The Dursleys shuddered to think what the neighbors would 
          say if the Potters arrived in the street. The Dursleys knew that the 
          Potters had a small son, too, but they had never even seen him. This boy 
          was another good reason for keeping the Potters away; they didn't want 
          Dudley mixing with a child like that"""

- map words to integer values

In [3]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([corpus])  ## build the mapping from words to unique integers
encoded = tokenizer.texts_to_sequences([corpus])[0]  ## convert the text into a sequence of integers

In [4]:
print(encoded[:10])

[25, 5, 7, 9, 10, 46, 47, 48, 49, 11]
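To make the mapping concrete, here is a minimal pure-Python sketch of what such a tokenizer does, assuming the usual Keras defaults (lowercasing, punctuation replaced by spaces, indices assigned by descending frequency starting at 1). The helper names `split_words`, `fit_word_index` and `encode` are made up for this illustration and are not part of Keras, and the filter string is an assumption modelled on the Keras default.

```python
from collections import Counter

# Characters treated as separators (assumption: modelled on the Keras default filter).
# The apostrophe is not in the list, so "didn't" stays a single token.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase the text, replace filtered characters by spaces and split."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()

def fit_word_index(text):
    """Map each word to an integer: the most frequent word gets 1, the next 2, ..."""
    counts = Counter(split_words(text))
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode(text, word_index):
    """Convert a string to a list of word indices, skipping unknown words."""
    return [word_index[w] for w in split_words(text) if w in word_index]

toy_text = "The cat and the dog."
toy_index = fit_word_index(toy_text)
toy_encoded = encode(toy_text, toy_index)
toy_vocab_size = len(toy_index) + 1   # index 0 is never assigned to a word
print(toy_index)
print(toy_encoded)
```

Because no word ever receives index 0, that value stays free to act as a padding symbol later on.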


- vocab size

In [5]:
vocab_size = len(tokenizer.word_index) + 1 ## plus one because word indices start at 1; index 0 is reserved (used for padding)
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 149


In [6]:
tokenizer.word_index['mrs']

7

## 2. One-gram Language Model

- One word in, one word out
- The assumption is that the current word depends only on the previous word.
- P('I finally finish BT5153') = P(I) * P(finally | I) * P(finish | finally) * P(BT5153 | finish)
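The factorization above can be checked numerically. The sketch below estimates each factor by simple counting on a made-up toy corpus. Both the corpus and the helper `bigram_sentence_prob` are assumptions for illustration; the notebook itself learns these probabilities with an LSTM rather than by counting.

```python
from collections import Counter

def bigram_sentence_prob(sentence, corpus_sentences):
    """P(w1..wn) = P(w1) * product of P(wi | wi-1), estimated by counting."""
    first_words = Counter(words[0] for words in corpus_sentences)
    contexts = Counter()   # how often a word is followed by some other word
    bigrams = Counter()    # how often a specific pair (previous, current) occurs
    for words in corpus_sentences:
        contexts.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))

    # P(w1): fraction of corpus sentences that start with the first word
    prob = first_words[sentence[0]] / len(corpus_sentences)
    for prev, cur in zip(sentence[:-1], sentence[1:]):
        if contexts[prev] == 0:
            return 0.0     # the previous word was never seen as a context
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob

# Hypothetical toy corpus, already split into lowercase words
toy_corpus = [
    ['i', 'finally', 'finish', 'bt5153'],
    ['i', 'finally', 'sleep'],
    ['i', 'finish', 'bt5153'],
]
p = bigram_sentence_prob(['i', 'finally', 'finish', 'bt5153'], toy_corpus)
print(p)
```

Here the factors are P(i) = 3/3, P(finally | i) = 2/3, P(finish | finally) = 1/2 and P(bt5153 | finish) = 2/2, so the product is 1/3.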

### Training data generation

From the training sentence `I finally finish BT5153`, we can generate the following training corpus:

<pre>
Input X               Target y
-------------------------------
I                     Finally

Finally               Finish

Finish                BT5153
</pre>


In [7]:
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

sequences = np.array(sequences)
X, y = sequences[:,0],sequences[:,1]

Total Sequences: 262


In [8]:
print('Input word index')
print(X[0])
print(tokenizer.index_word[X[0]])
print(y[0])
print(tokenizer.index_word[y[0]])

Input word index
25
mr
5
and


######  one hot encoding

With the `categorical_crossentropy` loss used below, the targets of a multi-class classification network must be one-hot vectors, so we convert the word indices in `y`.

In [9]:
y = to_categorical(y, num_classes=vocab_size)
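As a rough illustration of what `to_categorical` produces, the pure-Python sketch below builds the same kind of one-hot rows by hand. The helper `one_hot` is a made-up name for this example, not a Keras function, and it returns plain lists rather than a NumPy array.

```python
def one_hot(indices, num_classes):
    """Turn each integer index into a list with a single 1.0 at that position."""
    rows = []
    for idx in indices:
        row = [0.0] * num_classes
        row[idx] = 1.0
        rows.append(row)
    return rows

toy_targets = one_hot([5, 0, 2], num_classes=6)
for row in toy_targets:
    print(row)
```

In the notebook, the first target is the index 5 (the word `and`), so the first row of `y` has its single 1 at position 5 of a vector of length `vocab_size`.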

######  model build

In [10]:
model = Sequential()
embedding_size = 15
model.add(Embedding(vocab_size, embedding_size, input_length=1))  # one word in, therefore the input length is 1
model.add(LSTM(50))                                   # the output size of LSTM layer is 50
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1, 15)             2235      
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                13200     
_________________________________________________________________
dense_1 (Dense)              (None, 149)               7599      
Total params: 23,034
Trainable params: 23,034
Non-trainable params: 0
_________________________________________________________________
None
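The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the function names are illustrative only.

```python
def embedding_params(vocab_size, embedding_size):
    # one embedding vector per vocabulary index
    return vocab_size * embedding_size

def lstm_params(input_size, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_size + units) * units + units)

def dense_params(input_size, output_size):
    # weight matrix plus one bias per output unit
    return input_size * output_size + output_size

counts = [
    embedding_params(149, 15),   # 149 * 15
    lstm_params(15, 50),         # 4 * ((15 + 50) * 50 + 50)
    dense_params(50, 149),       # 50 * 149 + 149
]
print(counts, sum(counts))       # [2235, 13200, 7599] 23034
```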


In [11]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=100, verbose=2)

Instructions for updating:
Use tf.cast instead.
Epoch 1/100
 - 1s - loss: 5.0038 - acc: 0.0382
Epoch 2/100
 - 0s - loss: 4.9994 - acc: 0.0725
Epoch 3/100
 - 0s - loss: 4.9956 - acc: 0.0649
Epoch 4/100
 - 0s - loss: 4.9915 - acc: 0.0534
Epoch 5/100
 - 0s - loss: 4.9874 - acc: 0.0534
Epoch 6/100
 - 0s - loss: 4.9823 - acc: 0.0534
Epoch 7/100
 - 0s - loss: 4.9767 - acc: 0.0534
Epoch 8/100
 - 0s - loss: 4.9700 - acc: 0.0534
Epoch 9/100
 - 0s - loss: 4.9623 - acc: 0.0649
Epoch 10/100
 - 0s - loss: 4.9530 - acc: 0.0649
Epoch 11/100
 - 0s - loss: 4.9411 - acc: 0.0649
Epoch 12/100
 - 0s - loss: 4.9272 - acc: 0.0649
Epoch 13/100
 - 0s - loss: 4.9109 - acc: 0.0649
Epoch 14/100
 - 0s - loss: 4.8906 - acc: 0.0649
Epoch 15/100
 - 0s - loss: 4.8662 - acc: 0.0649
Epoch 16/100
 - 0s - loss: 4.8380 - acc: 0.0649
Epoch 17/100
 - 0s - loss: 4.8030 - acc: 0.0649
Epoch 18/100
 - 0s - loss: 4.7644 - acc: 0.0649
Epoch 19/100
 - 0s - loss: 4.7156 - acc: 0.0649
Epoch 20/100
 - 0s - loss: 4.6631 - acc: 0.0649
...

<keras.callbacks.History at 0x2eccfca00f0>
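A quick sanity check on the training log: a model that spreads its probability uniformly over all 149 vocabulary entries has a cross-entropy of ln(149), which is close to the loss of about 5.00 reported in the first epochs above. This is only a back-of-the-envelope sketch, and the variable names are illustrative.

```python
import math

n_classes = 149                      # vocabulary size reported earlier in the notebook
uniform_prob = 1.0 / n_classes       # probability assigned to the correct word
uniform_loss = -math.log(uniform_prob)
print(round(uniform_loss, 3))
```

A loss that starts near this value suggests the untrained network is close to guessing uniformly, and training then has to push it well below that level.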

###### call model to generate text

- Here, we can use our trained model to write our own `Harry Potter`
- The idea is to call the language model iteratively, each time selecting the word with the highest probability score.
    - Initialize the first word as w0
    - Loop index i from 0 to the pre-defined length n_words - 1
        1. feed the word wi into the model
        2. assign the word with the highest probability score to wi+1
        3. index i = i + 1
    - At the end, given word w0, we have the complete sentence w0, w1, ..., wn_words
     


In [12]:
def generate_seq_one_word(model, tokenizer, seed_text, n_words):
    """
    Arguments:
    1 model: the trained language model
    2 tokenizer: it holds the same word-to-index mapping that the model was trained with
    3 seed_text: the initial input word (a string)
    4 n_words: the number of words to generate after the seed
    """
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the current word as an integer index
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = np.array(encoded)
        # predict the index of the most likely next word
        # (predict_classes is only available in older Keras versions)
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # the predicted word becomes the next input
        in_text, result = out_word, result + ' ' + out_word
    return result

- Write our own `Harry Potter`

In [13]:
print(generate_seq_one_word(model, tokenizer, 'dudley', 6))

dudley and mrs dursley was mrs dursley
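The repetition in the output above is typical of greedy decoding with a single word of context: the same input word always yields the same prediction, so the text starts to cycle as soon as a word repeats. The sketch below mimics this with a plain lookup table. The table `toy_next` is hypothetical and was chosen to reproduce the output above; it is not read from the trained model.

```python
def greedy_generate(next_word, seed, n_words):
    """Repeatedly append the single most likely successor of the last word."""
    words = [seed]
    for _ in range(n_words):
        words.append(next_word[words[-1]])
    return ' '.join(words)

# Hypothetical "most likely next word" table (an assumption for illustration)
toy_next = {
    'dudley': 'and',
    'and': 'mrs',
    'mrs': 'dursley',
    'dursley': 'was',
    'was': 'mrs',     # leads back to 'mrs', which closes the loop
}
generated = greedy_generate(toy_next, 'dudley', 6)
print(generated)
```

Once `was` maps back to `mrs`, the sequence `mrs dursley was` repeats indefinitely, whatever the value of `n_words`.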


## 3. N-gram Language Model

- Fixed number of words in, one word out
- The assumption is that the current word depends only on the previous N words.
- If N=2, P('I finally finish BT5153') = P(I) * P(finally | I) * P(finish | I, finally) * P(BT5153 | finally, finish)
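To see how the context window shapes the factorization, the small sketch below lists the factors for any window size N. The helper `ngram_factors` is a made-up name used only for this illustration.

```python
def ngram_factors(words, n):
    """List the factors of P(sentence) when each word sees at most n previous words."""
    factors = []
    for i, word in enumerate(words):
        context = words[max(0, i - n):i]   # up to n words before position i
        if context:
            factors.append('P(%s | %s)' % (word, ', '.join(context)))
        else:
            factors.append('P(%s)' % word)
    return factors

sentence = ['I', 'finally', 'finish', 'BT5153']
print(' * '.join(ngram_factors(sentence, 2)))
print(' * '.join(ngram_factors(sentence, 1)))
```

With `n=1` this gives the one-word-in model from the previous section, and with `n` at least as large as the sentence length every word is conditioned on its full history.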

##### Training data generation

From the training sentence `I finally finish BT5153`, we can generate the following training corpus with N=2:

<pre>
Input X               Target y
-------------------------------
I                     Finally

I, Finally            Finish

Finally, Finish       BT5153
</pre>

The first input contains fewer than N words, so it has to be padded to the fixed input length.

In [14]:
win_len = 3  # it means three words in and one word out
sequences = []
for i in range(1, len(encoded)):
    # here, we scan the corpus and we will get four grams
    sequence = encoded[max(0,i-win_len):i+1]
    sequences.append(sequence) 
# the sequences will contain a list of four grams, i.e., four words

In [15]:
# pad sequences
max_length = win_len + 1
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)

Max Sequence Length: 4
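For intuition, the effect of pre-padding can be sketched in pure Python. The helper `pre_pad` below is a hypothetical stand-in written for this example: it fills short sequences with zeros on the left and, for sequences that are too long, keeps only the last `maxlen` items, which is how `pad_sequences` behaves with `padding='pre'` and its default truncation setting. It assumes `maxlen` is at least 1.

```python
def pre_pad(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # drop items from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# The first three inputs mirror the start of the encoded corpus (25, 5, 7, 9, ...)
toy_padded = pre_pad([[25, 5], [25, 5, 7], [25, 5, 7, 9], [1, 2, 3, 4, 5]], maxlen=4)
for row in toy_padded:
    print(row)
```

The zeros on the left are safe to use as filler because index 0 is never assigned to a real word by the tokenizer.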


In [16]:
# for each four grams in the sequences, the first three words will be the model input and the last word will be the model output.
sequences = np.array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

In [17]:
X

array([[  0,   0,  25],
       [  0,  25,   5],
       [ 25,   5,   7],
       [  5,   7,   9],
       [  7,   9,  10],
       [  9,  10,  46],
       [ 10,  46,  47],
       [ 46,  47,  48],
       [ 47,  48,  49],
       [ 48,  49,  11],
       [ 49,  11,  50],
       [ 11,  50,  12],
       [ 50,  12,  26],
       [ 12,  26,  13],
       [ 26,  13,   2],
       [ 13,   2,  11],
       [  2,  11,  51],
       [ 11,  51,  52],
       [ 51,  52,  53],
       [ 52,  53,  54],
       [ 53,  54,  17],
       [ 54,  17,  27],
       [ 17,  27,   2],
       [ 27,   2,  11],
       [  2,  11,   1],
       [ 11,   1,  55],
       [  1,  55,  56],
       [ 55,  56,  57],
       [ 56,  57,  58],
       [ 57,  58,  12],
       [ 58,  12,  28],
       [ 12,  28,  59],
       [ 28,  59,   8],
       [ 59,   8,  60],
       [  8,  60,  61],
       [ 60,  61,  62],
       [ 61,  62,  63],
       [ 62,  63,  29],
       [ 63,  29,   2],
       [ 29,   2,  64],
       [  2,  64,  14],
       ...

In [18]:
y[0]

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)

######  define model

In [19]:
model = Sequential()
embedding_size = 15
model.add(Embedding(vocab_size, embedding_size, input_length=max_length-1)) 
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 3, 15)             2235      
_________________________________________________________________
lstm_2 (LSTM)                (None, 50)                13200     
_________________________________________________________________
dense_2 (Dense)              (None, 149)               7599      
Total params: 23,034
Trainable params: 23,034
Non-trainable params: 0
_________________________________________________________________
None


In [20]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=200, verbose=2)

Epoch 1/200
 - 1s - loss: 5.0038 - acc: 0.0115
Epoch 2/200
 - 0s - loss: 4.9984 - acc: 0.0496
Epoch 3/200
 - 0s - loss: 4.9934 - acc: 0.0534
Epoch 4/200
 - 0s - loss: 4.9876 - acc: 0.0534
Epoch 5/200
 - 0s - loss: 4.9803 - acc: 0.0534
Epoch 6/200
 - 0s - loss: 4.9701 - acc: 0.0534
Epoch 7/200
 - 0s - loss: 4.9562 - acc: 0.0534
Epoch 8/200
 - 0s - loss: 4.9352 - acc: 0.0534
Epoch 9/200
 - 0s - loss: 4.9017 - acc: 0.0534
Epoch 10/200
 - 0s - loss: 4.8486 - acc: 0.0534
Epoch 11/200
 - 0s - loss: 4.7691 - acc: 0.0534
Epoch 12/200
 - 0s - loss: 4.6776 - acc: 0.0534
Epoch 13/200
 - 0s - loss: 4.6113 - acc: 0.0534
Epoch 14/200
 - 0s - loss: 4.5716 - acc: 0.0534
Epoch 15/200
 - 0s - loss: 4.5306 - acc: 0.0534
Epoch 16/200
 - 0s - loss: 4.4878 - acc: 0.0534
Epoch 17/200
 - 0s - loss: 4.4468 - acc: 0.0573
Epoch 18/200
 - 0s - loss: 4.4023 - acc: 0.0573
Epoch 19/200
 - 0s - loss: 4.3560 - acc: 0.0573
Epoch 20/200
 - 0s - loss: 4.3163 - acc: 0.0573
Epoch 21/200
 - 0s - loss: 4.2701 - acc: 0.0534
...

Epoch 171/200
 - 0s - loss: 0.3852 - acc: 0.9771
Epoch 172/200
 - 0s - loss: 0.3774 - acc: 0.9771
Epoch 173/200
 - 0s - loss: 0.3739 - acc: 0.9733
Epoch 174/200
 - 0s - loss: 0.3673 - acc: 0.9771
Epoch 175/200
 - 0s - loss: 0.3608 - acc: 0.9809
Epoch 176/200
 - 0s - loss: 0.3558 - acc: 0.9771
Epoch 177/200
 - 0s - loss: 0.3507 - acc: 0.9733
Epoch 178/200
 - 0s - loss: 0.3455 - acc: 0.9771
Epoch 179/200
 - 0s - loss: 0.3402 - acc: 0.9771
Epoch 180/200
 - 0s - loss: 0.3367 - acc: 0.9771
Epoch 181/200
 - 0s - loss: 0.3314 - acc: 0.9771
Epoch 182/200
 - 0s - loss: 0.3256 - acc: 0.9809
Epoch 183/200
 - 0s - loss: 0.3229 - acc: 0.9885
Epoch 184/200
 - 0s - loss: 0.3178 - acc: 0.9885
Epoch 185/200
 - 0s - loss: 0.3128 - acc: 0.9847
Epoch 186/200
 - 0s - loss: 0.3084 - acc: 0.9924
Epoch 187/200
 - 0s - loss: 0.3035 - acc: 0.9885
Epoch 188/200
 - 0s - loss: 0.2992 - acc: 0.9885
Epoch 189/200
 - 0s - loss: 0.2953 - acc: 0.9847
Epoch 190/200
 - 0s - loss: 0.2911 - acc: 0.9885
Epoch 191/200
...

<keras.callbacks.History at 0x2eccfca3be0>

###### call model to generate text for n-gram language model

- Here, we can use our trained model to write our own `Harry Potter`
- The idea is to call the language model iteratively, each time selecting the word with the highest probability score.
    - Initialize the first word as w0
    - Loop index i from 0 to the pre-defined length n_words - 1
        1. feed the words wi-n+1, wi-n+2, ..., wi into the model
        2. assign the word with the highest probability score to wi+1
        3. index i = i + 1
    - At the end, given word w0, we have the complete sentence w0, w1, ..., wn_words

In [21]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    """
    Arguments:
    1 model: the trained language model
    2 tokenizer: it holds the same word-to-index mapping that the model was trained with
    3 max_length: the input length of the model; shorter inputs are pre-padded, longer ones keep only the last words
    4 seed_text: the initial input string (one or more words)
    5 n_words: the number of words to generate after the seed
    """
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text generated so far as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad (or truncate) the sequence to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict the index of the most likely next word
        # (predict_classes is only available in older Keras versions)
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append the predicted word to the input
        in_text += ' ' + out_word
    return in_text

- Write our own `Harry Potter`

In [22]:
print(generate_seq(model, tokenizer, max_length-1, 'Potters arrived in', 5))

Potters arrived in the street the dursleys knew


## 4. Variable-gram Language Model

- Variable number of words in, one word out
- The assumption is that the current word depends on all of its previous words
- P('I finally finish BT5153') = P(I) * P(finally | I) * P(finish | I, finally) * P(BT5153 | I, finally, finish)
- Ideally, we want to expose the network to the entire sequence and let it learn the inter-dependencies, rather than defining those dependencies explicitly in the framing of the problem.

###### Stateful RNN
We can do this in Keras by making the LSTM layers stateful and manually resetting the state of the network at the end of the epoch, which is also the end of the training sequence.

This is closer to how LSTM networks are intended to be used: the network itself learns the dependencies between the words, instead of having them fixed by the input window. In principle this can allow a smaller network and fewer training epochs, although in this notebook we keep the same LSTM size (50 units) and number of epochs (200) as in the N-gram model.

We first need to define our LSTM layer as stateful. In so doing, we must explicitly specify the batch size as a dimension on the input shape. This also means that when we evaluate the network or make predictions, we must also specify and adhere to this same batch size. This is not a problem now as we are using a batch size of 1. This could introduce difficulties when making predictions when the batch size is not one as predictions will need to be made in batch and in sequence.

Some useful tutorials on stateful RNNs:
1. http://philipperemy.github.io/keras-stateful-lstm/
2. https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/
3. https://keras.io/getting-started/faq/#how-can-i-use-stateful-rnns

##### Training data generation

The training data is the same as for the first model (one word in, one word out).

For stateful RNN, the idea is to split the sequence into elements of size 1 and feed them to the LSTM. Once the sequence is over, we manually reset the states of the RNN to have a clean setup for the next one. For each element, we associate the related target Yi. Because the RNN is stateful, the state will be propagated to the next batch. Also because the batch_size=1, we are sure that the state of the last element will be used as input to the current element.
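The role of the carried state can be illustrated without Keras. The toy cell below is a deliberately simplified one-unit recurrent cell, not an LSTM; its weights and the class name `ToyRecurrentCell` are assumptions chosen for this sketch. It shows that the same input gives a different output once a state has been carried over, and that resetting the state restores the initial behaviour.

```python
import math

class ToyRecurrentCell:
    """A one-unit recurrent cell: h_new = tanh(w_in * x + w_rec * h_old)."""

    def __init__(self, w_in=0.5, w_rec=0.8):
        self.w_in = w_in
        self.w_rec = w_rec
        self.state = 0.0

    def step(self, x):
        # the new state depends on the input and on the state kept from the last call
        self.state = math.tanh(self.w_in * x + self.w_rec * self.state)
        return self.state

    def reset_states(self):
        # forget everything seen so far
        self.state = 0.0

cell = ToyRecurrentCell()
first = cell.step(1.0)         # state starts at 0, so only the input matters
second = cell.step(1.0)        # same input, but the carried state changes the output
cell.reset_states()
after_reset = cell.step(1.0)   # identical to `first` again
print(first, second, after_reset)
```

Feeding one element at a time with a persistent state is what lets a model with an input length of 1 still condition on the whole history of the sequence.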

In [23]:
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

sequences = np.array(sequences)
X, y = sequences[:,0],sequences[:,1]
y = to_categorical(y, num_classes=vocab_size)

Total Sequences: 262


In [24]:
batch_size = 1
model = Sequential()
embedding_size = 15
model.add(Embedding(vocab_size, embedding_size, input_length=1, batch_input_shape=(batch_size,1)))
model.add(LSTM(50, stateful=True))   ## stateful=True: the state is carried over from one batch to the next
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [25]:
nb_epoch = 200
for i in range(nb_epoch):
    model.fit(X, y, epochs=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()   ## important for a stateful RNN: call reset_states() after each full pass over one document
                           ## here, we only have one document, so we call reset_states() once per epoch

Epoch 1/1
 - 1s - loss: 5.0134 - acc: 0.0076
Epoch 1/1
 - 0s - loss: 4.8542 - acc: 0.0382
Epoch 1/1
 - 0s - loss: 4.6247 - acc: 0.0534
Epoch 1/1
 - 0s - loss: 4.5889 - acc: 0.0611
Epoch 1/1
 - 0s - loss: 4.5496 - acc: 0.0649
Epoch 1/1
 - 0s - loss: 4.4451 - acc: 0.0534
Epoch 1/1
 - 0s - loss: 4.2653 - acc: 0.0649
Epoch 1/1
 - 0s - loss: 4.0665 - acc: 0.0840
Epoch 1/1
 - 0s - loss: 4.1282 - acc: 0.0534
Epoch 1/1
 - 0s - loss: 3.8721 - acc: 0.0916
Epoch 1/1
 - 0s - loss: 3.7705 - acc: 0.1107
Epoch 1/1
 - 0s - loss: 3.6054 - acc: 0.1450
Epoch 1/1
 - 0s - loss: 3.5230 - acc: 0.1565
Epoch 1/1
 - 0s - loss: 3.2780 - acc: 0.1756
Epoch 1/1
 - 0s - loss: 3.3419 - acc: 0.1565
Epoch 1/1
 - 0s - loss: 3.0631 - acc: 0.2099
Epoch 1/1
 - 0s - loss: 2.9996 - acc: 0.2061
Epoch 1/1
 - 0s - loss: 3.0025 - acc: 0.1985
Epoch 1/1
 - 0s - loss: 2.9994 - acc: 0.2405
Epoch 1/1
 - 0s - loss: 2.7835 - acc: 0.3321
Epoch 1/1
 - 0s - loss: 2.7471 - acc: 0.2672
Epoch 1/1
 - 0s - loss: 2.5538 - acc: 0.3473
Epoch 1/1


 - 0s - loss: 0.0408 - acc: 0.9962
Epoch 1/1
 - 0s - loss: 0.0311 - acc: 0.9962
Epoch 1/1
 - 0s - loss: 0.0312 - acc: 0.9962
Epoch 1/1
 - 0s - loss: 0.0276 - acc: 1.0000
Epoch 1/1
 - 0s - loss: 0.0233 - acc: 1.0000
Epoch 1/1
 - 0s - loss: 0.0195 - acc: 1.0000
Epoch 1/1
 - 0s - loss: 0.0183 - acc: 1.0000
Epoch 1/1
 - 0s - loss: 0.0175 - acc: 1.0000
Epoch 1/1
 - 0s - loss: 0.0195 - acc: 1.0000
Epoch 1/1
 - 0s - loss: 0.0173 - acc: 1.0000
Epoch 1/1
 - 0s - loss: 0.0159 - acc: 0.9962
Epoch 1/1
 - 0s - loss: 0.0163 - acc: 0.9962
Epoch 1/1
 - 0s - loss: 0.1383 - acc: 0.9618
Epoch 1/1
 - 0s - loss: 0.1329 - acc: 0.9580
Epoch 1/1
 - 0s - loss: 0.1075 - acc: 0.9695
Epoch 1/1
 - 0s - loss: 0.0616 - acc: 0.9809
Epoch 1/1
 - 0s - loss: 0.0291 - acc: 0.9962
Epoch 1/1
 - 0s - loss: 0.0293 - acc: 0.9924


- Write our own `Harry Potter`

In [26]:
print(generate_seq_one_word(model, tokenizer, 'director', 6))

director they were dursley of were as
