## Lab 10, Part 2:   Recurrent Neural Networks (RNN)  -- Extra Credit

When it comes to model sequential data such as sentences, documents and videos, etc, the state of the art approach is to use Recurrent neural network (RNN). At each timestep, RNN takes an element (such as a word) as input, combines with past information encoded as a vector (such as all information in the sentence before this timestep), generate a new vector encoding both current input and past information, then delivers it to next timestep.

For more details about LSTM (a very popular variant of RNN), please refer to http://colah.github.io/posts/2015-08-Understanding-LSTMs/ and here is a very good video explaining RNN: https://www.youtube.com/watch?v=WCUNPb-5EYI.

### Generating text with Long Short-Term Memory Networks

RNN can be used to generate text. For more information, please read: https://karpathy.github.io/2015/05/21/rnn-effectiveness/.

The following is an example script to generate text from Nietzsche's writings.

Note: 
- At least 20 epochs are required before the generated text
starts sounding coherent.

- It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.

- If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.

In [1]:
#Import necessary libraries 
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import io

Using TensorFlow backend.


In [95]:
#Get the data - available from amazon
path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower() # make it all lowercase 
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

corpus length: 600893
total chars: 57


In [96]:
# Cut the text in semi-redundant sequences of maxlen characters
## Cut the text into a series of windows. 
## Each window is 40 characters
## The window moves 3 steps forward each step

maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# Turn these sentances into one-hot encoded vectors
## For all words in the sentances, there is a one, else there is a zero in that index of the vector

print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

nb sequences: 200285
Vectorization...


Now we have data to feed a model for text generation. Next  we build a LSTM model to fit the data. Using Keras this is only few lines of code!

In [88]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Build model...


In [89]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

### Training (reduce the number of epochs, it takes a lot of time!!)
-  Each epoch takes 5-10 minutes or so on a CPU (an epoch took 7.5 minutes for my PC)
-  Recall that training on at least 20 epochs will give intelligible results 
-  So you're gonna have to let that puppy run for a while (2-3 hours)

In [92]:
# print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=25)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x28b97177d68>

In [93]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=1,
          callbacks=[print_callback])

Epoch 1/1

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "as an explanation.
it has eyes and finge"
as an explanation.
it has eyes and finger and interpretation of the subject of the conscious and consideration of the subject, and also the same proud the same thing of the subject, and in the standard to the same must alone is a strange, and also the conscious and such a standard and self-desire to the extent in the species of the sense of the present of the species of the same morality of the strange of the extent of the specie and re
----- diversity: 0.5
----- Generating with seed: "as an explanation.
it has eyes and finge"
as an explanation.
it has eyes and finger the will to the standard of as the same notion of the english and srieble intach and end the same read to the reality of a productively of the self-depression, the sense, the standard and experience of sections are and in the interpreced to his and distrust of the vained and obligates

  after removing the cwd from sys.path.


and chasm of all reced, one even the right reach arreast ald essensaes of referers of the changame and litten: the compro-muchild skeed a human"ioy doubt, the unsentic man find immorally war painor, in dores, a dangeniesces.

242. and a long age excep
----- diversity: 1.2
----- Generating with seed: "as an explanation.
it has eyes and finge"
as an explanation.
it has eyes and fingeralitard thew our "dirged ougation: -and midlmen no
samlysic=); it we had even in i must could "ccainges!--hones), as the racist, for
a she effecyded, and danger-alternanly his logicoms marrce,
e fainss and owerb over
phole-france--this keeme tyraesty in
risk future--there are valosy, to duerrage as pults history of culture is in rdul was
have known, in good
taken hi
ompoines, wut
do the long and 


<keras.callbacks.History at 0x28b97177f60>

## Load pre-trained model
Since it is time consuming to train this LSTM model with CPU for more epochs, we provided a pre-trained model which is trained on GPU for 100 epochs. Use the following code to check how coherency the model is.

It requires h5py packages, please install it to test the following code.

In [99]:
# build the model: a single LSTM
print('Load pre-trained model...')
from keras.models import load_model
model = load_model('shakespear100.h5')


def lstm_generate(seed, model):
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        generated += seed
        print('----- Generating with seed: "' + seed + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(seed):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            seed = seed[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


seed = "from an anguish with which no other is t"
lstm_generate(seed, model)

Load pre-trained model...
----- diversity: 0.2
----- Generating with seed: "from an anguish with which no other is t"
from an anguish with which no other is t7éé77é77 xé77éa7é77é77é7 

  after removing the cwd from sys.path.


77é7 éitqeevf t0a7 to7 . iéiéso7ise7 ' 1 ! xio7éa insvrnérer ë afnéan77-éi éiéésq" équat7é77 éosxe h  e éft iécv7q77 7s )77 x=a ine7-és !7es ;7 ' hje7n get7;éasqië iq7 s 7 bix77 ierhnté ts7neëruh7n7 tf7réa é7s!7éséi sé7i' "a7sft77fo iséa i'r ééi x7an a ëi s
.8i i.  , a7oe 7di(t m "7e7-éiëiesxé7 é77éééi tpvqi! bqw-ere ex 7e777-uoés irofde7é7é7 xé77-t7s x7 t. x é7éas!sbeq é 
----- diversity: 0.5
----- Generating with seed: "ofde7é7é7 xé77-t7s x7 t. x é7éas!sbeq é "
ofde7é7é7 xé77-t7s x7 t. x é7éas!sbeq é 'wim-ane-é; t7out7é shex eéix (a(a7féo7 x  oééa ét at w7  x7és tl so3. the ë77  é7 xéa77éréésuwb7  7sé ! ihdh 7-t7titéa(ashe éiéqééi'éilew7if7éiéa7 'r7n xéi ésq' tne7 'i éoestoë st770e sbeeheoéve d7séoé7éi éé7 xé7asbeve"] wn7ditv5t =77sine7 éé7 7 éon7éi t7e77 an [eq7gr7r(o7(ibitn ieéi) ésx é77 éité7 aos7 ixéa7 éri!77 at3ixbeape(in7 x éis xiqét éi6ixan7é imxiéa e7 =sbre 'oi m7 t7-oë aé7 éanth7 éex"
----- diversity: 1.0
----- Generating with seed: "éa e7 =sbre 'oi m7 t7-oë aé7 éanth7 éex""

It produced readible results the first time I run it, but after I re-download the shakespear pre-trained, it becomes something like this. I somehow lost the previous shakespear pre-trained model, so I don't how to go back to the first result!

### Exercise: try it to generate baby names
-  The baby name data set contains 8000 names. You can download and process the name data set as follows:

```python
name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    text = f.read() # make it all lowercase 
    
text = text.split()
text = ', '.join(text)
```

Using the baby name data set, answer the following tasks:

- Train a LSTM to generate the baby names.
- How long does it take to train? How coherent does it sound? 
- Can you train the LSTM, but for every epoch, shuffle the order of names before call model.fit()? How long does it take to train? Does it improve the coherency?



In [58]:
name_path = get_file('names.txt', origin = 'http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')

with io.open(name_path, encoding = 'utf-8') as f:
    text = f.read()

text = text.split()
text = ', '.join(text)

In [59]:
len(text)

501788

In [60]:
chars = sorted(list(set(text)))
print(len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

58


In [61]:
maxlen = 20
step = 2
names = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    names.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb names:', len(names))

# Turn these sentances into one-hot encoded vectors
## For all words in the sentances, there is a one, else there is a zero in that index of the vector

print('Vectorization...')
x = np.zeros((len(names), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(names), len(chars)), dtype=np.bool)
for i, name in enumerate(names):
    for t, char in enumerate(name):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

nb names: 250884
Vectorization...


In [62]:
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Build model...


In [63]:
def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(40):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

In [64]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

%timeit -n1 -r1
model.fit(x, y,
          batch_size=128,
          epochs=10,
          callbacks=[print_callback])

Epoch 1/10

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "on, Burdsall, Bureau"
on, Burdsall, Bureau, Bureer, Bureer, Bureer, Bureer, Bureer
----- diversity: 0.5
----- Generating with seed: "on, Burdsall, Bureau"
on, Burdsall, Bureau, Buree, Buree, Bureer, Buree, Bureell, 
----- diversity: 1.0
----- Generating with seed: "on, Burdsall, Bureau"
on, Burdsall, Bureau, Burgele, Burgers, Burget, Burger, Burg
----- diversity: 1.2
----- Generating with seed: "on, Burdsall, Bureau"
on, Burdsall, Bureaus, Buref, Burefer, Bureven, Bureyke, Bur
Epoch 2/10

----- Generating text after Epoch: 1
----- diversity: 0.2
----- Generating with seed: "tscher, Deutschman, "
tscher, Deutschman, Deutser, Deutter, Deutter, Deutter, Deut
----- diversity: 0.5
----- Generating with seed: "tscher, Deutschman, "
tscher, Deutschman, Deutter, Deutter, Deutter, Deuts, Deutte
----- diversity: 1.0
----- Generating with seed: "tscher, Deutschman, "
tscher, Deutschman, eetton, Setu

<keras.callbacks.History at 0x28b9175b518>

In [70]:
"Milasse" in text

False

In [77]:
"Milassa" in text

False

It takes around 20 minutes to run 10 epochs. Many names that the model generates don't sound coherent at all, but some sounds good, like "Milasse" or "Milassa".

In [81]:
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Build model...


In [82]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

%timeit -n1 -r1 model.fit(x, y,batch_size=128,epochs=10,callbacks=[print_callback], shuffle = True)

Epoch 1/10

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "teer, Mateja, Matejk"
teer, Mateja, Matejka, Materie, Matery, Matery, Matery, Mate
----- diversity: 0.5
----- Generating with seed: "teer, Mateja, Matejk"
teer, Mateja, Matejka, Mateh, Mather, Mathermen, Matherger, 
----- diversity: 1.0
----- Generating with seed: "teer, Mateja, Matejk"
teer, Mateja, Matejk, Mather, Matherzok, Matherytan, Mather,
----- diversity: 1.2
----- Generating with seed: "teer, Mateja, Matejk"
teer, Mateja, Matejked, Matinta, Mathku, Matky, Matkies, Mat
Epoch 2/10

----- Generating text after Epoch: 1
----- diversity: 0.2
----- Generating with seed: "akers, Rakes, Rakesh"
akers, Rakes, Rakesh, Rakes, Raki, Rakin, Rakle, Rakle, Rakl
----- diversity: 0.5
----- Generating with seed: "akers, Rakes, Rakesh"
akers, Rakes, Rakesh, Rakes, Rakes, Rakick, Rakio, Rakin, Ra
----- diversity: 1.0
----- Generating with seed: "akers, Rakes, Rakesh"
akers, Rakes, Rakesh, Rakew, Ral

In [84]:
"Heystein" in text

False

In [85]:
"Mcgorney" in text

False

It takes somewhat longer, but not too bad. The coherency doesn't seem to improve...