## Lab 10, Part 2: Recurrent Neural Networks (RNN) -- Extra Credit

When it comes to modeling sequential data such as sentences, documents, and videos, the state-of-the-art approach is to use a recurrent neural network (RNN). At each timestep, an RNN takes an element (such as a word) as input, combines it with past information encoded as a vector (such as everything in the sentence before this timestep), generates a new vector encoding both the current input and the past information, and then passes that vector on to the next timestep.
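
The update described above can be sketched in a few lines of plain Python. This is only an illustration of a plain ("vanilla") RNN step, not of the LSTM used later in this lab; the function name `rnn_step`, the weight names, and the toy weight values are all assumptions made up for this example.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla-RNN update: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    h_t = []
    for row_x, row_h, bias in zip(W_xh, W_hh, b_h):
        pre_activation = (
            sum(w * x for w, x in zip(row_x, x_t))       # contribution of the current input
            + sum(w * h for w, h in zip(row_h, h_prev))  # contribution of the past information
            + bias
        )
        h_t.append(math.tanh(pre_activation))
    return h_t

# Toy sizes (assumptions): 3-dimensional inputs, 2-dimensional hidden state.
W_xh = [[0.5, -0.2, 0.1],
        [0.3, 0.8, -0.5]]
W_hh = [[0.1, 0.4],
        [-0.3, 0.2]]
b_h = [0.0, 0.0]

# A sequence of three one-hot encoded "words".
sequence = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

h = [0.0, 0.0]  # the initial state carries no past information
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the new vector is handed to the next timestep
print(h)
```

The same weights are reused at every timestep; only the hidden vector `h` changes as the sequence is read.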

For more details about LSTM (a very popular variant of the RNN), please refer to http://colah.github.io/posts/2015-08-Understanding-LSTMs/. There is also a very good video explaining RNNs: https://www.youtube.com/watch?v=WCUNPb-5EYI.

### Generating text with Long Short-Term Memory Networks

RNNs can be used to generate text. For more information, please read: https://karpathy.github.io/2015/05/21/rnn-effectiveness/.

The following is an example script to generate text from Nietzsche's writings.

Note:
- At least 20 epochs are required before the generated text starts sounding coherent.
- It is recommended to run this script on GPU, as recurrent networks are quite computationally intensive.
- If you try this script on new data, make sure your corpus has at least ~100k characters. ~1M is better.

In [1]:
#Import necessary libraries 
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils import get_file
import numpy as np
import random
import sys
import io

Using TensorFlow backend.


In [2]:
#Get the data - available from amazon
path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower() # make it all lowercase 
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

corpus length: 600893
total chars: 57


In [3]:
# Cut the text into semi-redundant sequences of maxlen characters
## Cut the text into a series of overlapping windows.
## Each window is 40 characters long
## The window moves 3 characters forward at each iteration

maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# Turn these sequences into one-hot encoded vectors
## For each character position, the entry at that character's index is one; every other entry is zero

print('Vectorization...')
# builtin bool: the np.bool alias is missing from some NumPy releases
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

nb sequences: 200285
Vectorization...
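
To make the windowing and one-hot encoding concrete, here is the same procedure on a tiny string, written without NumPy so each step is visible. The `toy_` names and the `one_hot` helper are made up for this illustration and do not touch the lab's variables.

```python
toy_text = "hello world"
toy_maxlen = 4   # window length (the lab uses 40)
toy_step = 3     # stride between windows (same as the lab)

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Slide a window over the text; the character right after each window is its target.
windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i: i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

for w, t in zip(windows, targets):
    print(repr(w), '->', repr(t))

def one_hot(char):
    """Return a list with a single 1 at the index assigned to `char`."""
    vec = [0] * len(toy_chars)
    vec[toy_char_indices[char]] = 1
    return vec

# One window becomes a (toy_maxlen x len(toy_chars)) grid of zeros and ones.
first_window_encoded = [one_hot(c) for c in windows[0]]
for row in first_window_encoded:
    print(row)
```

Consecutive windows overlap by `toy_maxlen - toy_step` characters, which is why the lab calls the sequences "semi-redundant".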


Now we have data to feed a model for text generation. Next, we build an LSTM model to fit the data. Using Keras, this takes only a few lines of code!

In [4]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(learning_rate=0.01)  # Keras releases before 2.3 name this argument lr
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Build model...
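
To get a feel for the size of this model, its parameter count can be worked out by hand. The sketch below assumes the standard LSTM formulation with bias terms (four gate/candidate blocks, each with an input kernel, a recurrent kernel, and a bias) and uses the 57 distinct characters reported for the Nietzsche corpus; calling `model.summary()` is a way to check the figures against your own installation.

```python
units = 128        # LSTM(128, ...)
input_dim = 57     # len(chars) for the Nietzsche corpus
n_classes = 57     # Dense(len(chars))

# Each of the 4 LSTM blocks has an input kernel, a recurrent kernel and a bias.
lstm_params = 4 * ((input_dim + units) * units + units)
# The Dense layer maps the final hidden state to one score per character.
dense_params = units * n_classes + n_classes

print('LSTM parameters: ', lstm_params)
print('Dense parameters:', dense_params)
print('total:           ', lstm_params + dense_params)
```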


In [5]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
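
The `temperature` argument of `sample` (passed in as `diversity` above) reshapes the predicted distribution before a character is drawn. The sketch below applies the same log, divide, exponentiate, and normalize steps to a made-up three-character distribution; `apply_temperature` and `toy_probs` are hypothetical names used only here.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution the way `sample` does before drawing."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

toy_probs = [0.6, 0.3, 0.1]  # hypothetical next-character probabilities
for temperature in [0.2, 0.5, 1.0, 1.2]:
    reshaped = apply_temperature(toy_probs, temperature)
    print(temperature, [round(p, 3) for p in reshaped])
```

Temperatures below 1 push probability toward the most likely character, which gives safer but more repetitive text. A temperature of 1 leaves the distribution unchanged, and values above 1 flatten it, which gives more varied text with more misspellings.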

### Training (reduce the number of epochs, it takes a lot of time!!)
-  Each epoch takes 5-10 minutes or so on a CPU (an epoch took 7.5 minutes on my PC)
-  Recall that at least 20 epochs of training are needed before the results become intelligible
-  So you're gonna have to let that puppy run for a while (2-3 hours)
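
A quick back-of-the-envelope check of these figures, using the sequence count printed by the vectorization cell and the batch size passed to `model.fit`. The 7.5 minutes per epoch is the timing quoted above; treat it as a placeholder for your own machine.

```python
import math

nb_sequences = 200285      # printed by the vectorization cell
batch_size = 128           # passed to model.fit
minutes_per_epoch = 7.5    # CPU timing quoted above; yours will differ

# Each batch triggers one weight update, and the last batch may be partial.
batches_per_epoch = math.ceil(nb_sequences / batch_size)
print('weight updates per epoch:', batches_per_epoch)

for epochs in (20, 25):
    hours = epochs * minutes_per_epoch / 60
    print(epochs, 'epochs ->', round(hours, 2), 'hours')
```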

In [6]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=25,
          callbacks=[print_callback])

Epoch 1/25

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "ook of grace,
still appeals more to his "
ook of grace,
still appeals more to his of the presponsity and the precess of the presporing and enterations and the power the sense that the sens and ever the sense of the self were the personation of the self--and the self--and the consiment of the may a so the prope and in the may and sension of the personations and and the properations and his and in the propo and in the interped and his new which the may a power and the sens and th
----- diversity: 0.5
----- Generating with seed: "ook of grace,
still appeals more to his "
ook of grace,
still appeals more to his and from the endable and with the supmes there in the still at the may of a philosophers and a become the self were were this knowledge in the man in the other in a the most a for of the and become and influction of the instinction. and being there is the past the well propon in the wh

it! i--have no reason the same and and spirit, and in the most subjection, and it is the fact the procept and the distance of the present of the same dealing and individual of the stand of the conscience of the procept and fact that it is a man is always be and who is an and and the strange of the prise of the experities of the man is not more expresses and and intendivality of all the stand of the man and the same the 
----- diversity: 0.5
----- Generating with seed: "ver else require
it! i--have no reason t"
ver else require
it! i--have no reason than the proper conscience and prehald and power, must like the weaks and sensalities, and it is always an principle the interpreths and having considerable new same the problem of present on the extent of the distance and the the same and and influence of the world and an ancient the mare and any prose of the enough, in an invirtured in the prone is one were only it is not doubt the man is always 
----- diversity: 1.0
----- Generating with 


 self-democratic self-destrained and experiences and self-self-destruction of the spirituality of the same time and sensual conscience of the sense of the same conscience of the sense of the same and and sense of the standard and immoral the standaps and even the standard and stronger in the self-experiences in the spirituality of the same time that the spirituality of the
----- diversity: 0.5
----- Generating with seed: "t, to the pleasant frivolity of
clever f"
t, to the pleasant frivolity of
clever for the such the fact that is the simple to him opposing and not be true and and most own and conscience of the standard and life and most german discoveratically even even the power to the continue, with itself the
religious of the world, to such false conscience to its for the interents and pulated and prevailistic and most and precisely and most stands this found of all state of the democratic p
----- diversity: 1.0
----- Generating with seed: "t, to the pleasant frivolity of
clever f"

they have to perform, the subtlenation of the over of one far of "strength of explanations of the prove of man has a sentiments in the strange to the strength, and the former the spirits and point of his expediantince and merely the suffering in the world, and self-sense, and and for their writtle gratificion of a sentiments that the world of the superior of the soul of the process of the experience of its appearance of the sen
----- diversity: 1.0
----- Generating with seed: "he tasks
they have to perform, the subtl"
he tasks
they have to perform, the subtlel one strigt the seld far the significance. or when beet it at compariling
possi-he has we "presumiation indictually. here they have a morals and even the pain--that sentingless, with expeaus--alle indeed, which is aswarieticien it to do the punest will almost
true even alle a
say bking, that
"done," where betener to ethonen that
which thoseighthss the conothishcours onlyer fanatifelhed,
symos of
----- diversity: 1.2
----- Generati

<keras.callbacks.History at 0x1116749d0>

## Load pre-trained model
Since it is time-consuming to train this LSTM model on a CPU for more epochs, we provide a pre-trained model that was trained on a GPU for 100 epochs. Use the following code to check how coherent the model's output is.

Loading the model requires the h5py package; please install it before running the following code.
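
One way to check whether h5py is present before loading the model is sketched below. It relies only on the standard library; the helper name `has_package` is made up for this example.

```python
import importlib.util

def has_package(name):
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

if has_package('h5py'):
    print('h5py is available')
else:
    print('h5py is missing; install it, for example with: pip install h5py')
```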

In [9]:
# load the pre-trained model instead of building and training one
print('Load pre-trained model...')
from keras.models import load_model
model = load_model('shakespear100.h5')


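def lstm_generate(seed, model):
    # `seed` should be `maxlen` characters long and use only characters in `char_indices`.
    # Note: `seed` is overwritten as text is generated, so every diversity level after
    # the first starts from the last `maxlen` characters produced at the previous level.
    for diversity in [0.2, 0.5, 1.0, 1.2]: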
        print('----- diversity:', diversity)

        generated = ''
        generated += seed
        print('----- Generating with seed: "' + seed + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(seed):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            seed = seed[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


seed = "from an anguish with which no other is t"
lstm_generate(seed, model)


Load pre-trained model...
----- diversity: 0.2
----- Generating with seed: "from an anguish with which no other is t"
from an anguish with which no other is the sense of the sense of the present the such a present the conscious and something the subjections of the subjection of the heart of the subjection of the subjection of the constitute the conceive the subjection of the sacrifices of the sense of the subjection of the present the sense of the propers the serve to the subjection of the subjection of the world and such a possible in the part to the 
----- diversity: 0.5
----- Generating with seed: " and such a possible in the part to the "
 and such a possible in the part to the basies the part the more very psychility, the case of the fact the mask of the consist, and in some habit and inversing the more of man" the pleasure in which the present to the more headfuls of men as
the mask of its such a life although action of the above-scieuch must are not it is not the fact to the con

### Exercise: generate baby names
-  The baby name data set contains 8000 names. You can download and process the name data set as follows:

```python
name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    text = f.read()  # keep the original case (no lowercasing here)
    
text = text.split()
text = ', '.join(text)
```

Using the baby name data set, complete the following tasks:

- Train an LSTM to generate baby names.
- How long does it take to train? How coherent do the generated names sound?
- Can you train the LSTM so that, for every epoch, the order of the names is shuffled before calling model.fit()? How long does it take to train? Does it improve the coherence?
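
For the shuffling task, one possible starting point is sketched below. It only shows the reshuffle-and-rejoin step on a made-up list; `shuffled_corpus` and `toy_names` are hypothetical, and in the lab the list of names would be the result of `text.split()` taken before the `', '.join(...)` call.

```python
import random

def shuffled_corpus(names, rng):
    """Return the names joined into one string, in a fresh random order."""
    order = list(names)      # copy so the caller's list is left untouched
    rng.shuffle(order)
    return ', '.join(order)

toy_names = ['Ada', 'Bo', 'Cy', 'Dee']   # stand-in for the real name list
rng = random.Random(0)                    # fixed seed so runs are repeatable

for epoch in range(3):
    epoch_text = shuffled_corpus(toy_names, rng)
    print(epoch, epoch_text)
    # In the lab you would rebuild x and y from `epoch_text` here
    # (windowing plus one-hot encoding, as in the earlier cell) and
    # then call model.fit(x, y, batch_size=128, epochs=1).
```

Shuffling only reorders the names, so the character set and the `char_indices` mapping stay the same from one epoch to the next.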



In [16]:
name_path = get_file('names.txt', origin='http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/other/names.txt')
with io.open(name_path, encoding='utf-8') as f:
    text = f.read()  # keep the original case (no lowercasing here)

text = text.split()
text = ', '.join(text)

In [17]:
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

corpus length: 501788
total chars: 58


In [15]:
lstm_generate(seed, model)

----- diversity: 0.2
----- Generating with seed: "from an anguish with which no other is t"
from an anguish with which no other is the subjection of the states of the subjection of
the consciousness and everything the southernged to the sense of the state of the sense of the serves of the southtrage of the subjection to the state of the subjection of the subjection of the state and the present and the subjection of the states of the sense of the subjection of the sense of the subjection of the constitute the consciousness, the
----- diversity: 0.5
----- Generating with seed: "of the constitute the consciousness, the"
of the constitute the consciousness, the heart--what is a present, to present, the case of the excessive inners and in the proportions of the
passer for the conceive the subjection," and grated even an experience of the serve it be the presented to the statement and such a man superficial that you concerning such a so that the constitute the more to the consciousness, and 