# Recurrent Neural Networks

In this lab we will experiment with recurrent neural networks, a useful type of model for predicting sequences or for handling sequences as inputs. We will implement them in Keras with the TensorFlow backend; many other implementations and variants can be found online. Installation instructions for Keras are at https://keras.io/#installation, and for TensorFlow at https://github.com/tensorflow/tensorflow#download-and-setup. You should also be able to run both from a Docker container.

We will take a set of 10 thousand image descriptions from the MS-COCO dataset (400,000 sentences) and make our recurrent network learn how to compose new sentences character by character.
You can download this data here: http://www.cs.virginia.edu/~vicente/recognition/captions_train.txt.zip

First, let's import libraries and make sure you have everything properly installed.

In [3]:
import tensorflow as tf
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, LSTM
from keras.optimizers import RMSprop
from keras.layers.wrappers import TimeDistributed
import random
import pickle

Using TensorFlow backend.


## 1. Preprocessing the Text
We will first read the sentences and map each character to a unique identifier so that we can treat each sentence as an array of character ids. The code below loads the captions from a text file and places them inside a caption tensor that is a matrix of size numCaptions x maxCaptionLength x charVocabularySize. We will also create a caption tensor that contains the sentences but shifted by one character. Each character is mapped to an incremental ID, so we keep two hashmaps to convert from character to id and back.
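The shifted encoding described above can be sketched on toy data. This is a minimal illustration, not the lab's code: the two sample strings, the helper names `build_vocab`, `encode_pair` and `one_hot`, and the multi-character sentinels `<S>` and `<E>` are all assumptions made here. The lab itself uses the plain characters 'S' and 'E', which would collide with real data if the text ever contained those capitals.

```python
# Toy stand-in for the real strings (hypothetical data).
texts = ["1 oz gin", "2 oz rum"]

# Multi-character sentinels cannot collide with any single real character.
# This is an assumption of the sketch; the lab uses plain 'S' and 'E'.
START, END = "<S>", "<E>"

def build_vocab(samples):
    """Map every distinct character to an incremental id, plus the sentinels."""
    char2id = {}
    for text in samples:
        for ch in text:
            if ch not in char2id:
                char2id[ch] = len(char2id)
    for token in (START, END):
        char2id[token] = len(char2id)
    id2char = {i: ch for ch, i in char2id.items()}
    return char2id, id2char

def encode_pair(text, char2id, max_len):
    """Return (input_ids, target_ids), each of length max_len + 1."""
    chars = list(text)[:max_len]
    pad = [END] * (max_len - len(chars))
    inputs = [START] + chars + pad    # what the network reads
    targets = chars + [END] + pad     # the same text, one step ahead
    return [char2id[c] for c in inputs], [char2id[c] for c in targets]

def one_hot(ids, vocab_size):
    """Expand a list of ids into a list of one-hot rows."""
    return [[1 if i == idx else 0 for i in range(vocab_size)] for idx in ids]

char2id, id2char = build_vocab(texts)
input_ids, target_ids = encode_pair("1 oz gin", char2id, max_len=10)
input_rows = one_hot(input_ids, len(char2id))
```

At every position the target holds the character that follows the corresponding input character, which is exactly what the network is asked to predict.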

In [1]:
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

In [7]:
all_recipes = pickle.load(open('cleaned_recipes.p', 'rb'))
def booze_permuter(recipes):
    for rec in recipes:
        ings = rec['ingredients']
        i = 0
        while i < len(ings) and is_number(ings[i][0]):
            i += 1
            
        ing_list = ings[:i]
        garn_list = ings[i:]
        
        random.shuffle(ing_list)
        random.shuffle(garn_list)
        
        ing_list.extend(garn_list)
        yield '\n'.join(ing_list)
        
def get_new_recipes():
    recipe_list = []
    for recipe in booze_permuter(all_recipes):
        recipe_list.append(recipe)
    return recipe_list

def training_set_generator(num_sets):
    for i in range(num_sets):
        pass
        
test_set = get_new_recipes()
print(len(test_set))
# Compute a char2id and id2char vocabulary.
char2id = {}
id2char = {}
charIndex = 0
for recipe in test_set:
    for char in recipe:
        if char not in char2id:
            char2id[char] = charIndex
            id2char[charIndex] = char
            charIndex += 1

# Add a special starting and ending character to the dictionary.
char2id['S'] = charIndex; id2char[charIndex] = 'S'  # Special sentence start character.
char2id['E'] = charIndex + 1; id2char[charIndex + 1] = 'E'  # Special sentence ending character.
pickle.dump(char2id, open('char2id_2.p', 'wb'), protocol = 2)
pickle.dump(id2char, open('id2char_2.p', 'wb'), protocol = 2)
pickle.dump(all_recipes, open('cleaned_recipes_2.p', 'wb'), protocol = 2)


# max_recipe_length = 500
# # Place test_set inside tensors.
# maxSequenceLength = max_recipe_length + 1
# # inputChars has one-hot encodings for every character, for every caption.
# inputChars = np.zeros((len(test_set), maxSequenceLength, len(char2id)), dtype=bool)  # np.bool was removed from recent NumPy releases.
# # nextChars has one-hot encodings for every character for every caption (shifted by one).
# nextChars = np.zeros((len(test_set), maxSequenceLength, len(char2id)), dtype=bool)

# for i in range(0, len(test_set)):
#     inputChars[i, 0, char2id['S']] = 1
#     nextChars[i, 0, char2id[test_set[i][0]]] = 1
#     for j in range(1, maxSequenceLength):
#         if j < len(test_set[i]) + 1:
#             inputChars[i, j, char2id[test_set[i][j - 1]]] = 1
#             if j < len(test_set[i]):
#                 nextChars[i, j, char2id[test_set[i][j]]] = 1
#             else:
#                 nextChars[i, j, char2id['E']] = 1
#         else:
#             inputChars[i, j, char2id['E']] = 1
#             nextChars[i, j, char2id['E']] = 1

# print("input:")
# print(inputChars.shape)  # Print the size of the inputCharacters tensor.
# print("output:")
# print(nextChars.shape)  # Print the size of the nextCharacters tensor.
# print("char2id:")
# print(char2id)  # Print the character to ids mapping.

30724




<b>Note:</b> To show clearly how inputChars and nextChars store the sequences, let's print a sentence back from its stored format in these two arrays.

In [4]:
trainCaption = inputChars[25, :, :]  # Pick some caption
labelCaption = nextChars[25, :, :]  # Pick what we are trying to predict.

def printCaption(sampleCaption):
    charIds = np.zeros(sampleCaption.shape[0])
    for (idx, elem) in enumerate(sampleCaption):
        charIds[idx] = np.nonzero(elem)[0].squeeze()
    print(np.array([id2char[x] for x in charIds]))

printCaption(trainCaption)
printCaption(labelCaption)

[u'S' u'0' u'.' u'2' u'5' u' ' u'p' u'a' u'r' u't' u' ' u'g' u'i' u'n'
 u'\n' u'0' u'.' u'0' u'6' u'2' u'5' u' ' u'p' u'a' u'r' u't' u's' u' '
 u'f' u'r' u'e' u's' u'h' u' ' u'l' u'i' u'm' u'e' u' ' u'j' u'u' u'i' u'c'
 u'e' u'\n' u'1' u' ' u'p' u'a' u'r' u't' u's' u' ' u'm' u'i' u's' u't'
 u' ' u't' u'w' u's' u't' u' ' u'c' u'r' u'a' u'n' u'b' u'e' u'r' u'r' u'y'
 u'\xae' u'\n' u'0' u'.' u'2' u'5' u' ' u'p' u'a' u'r' u't' u' ' u'v' u'o'
 u'd' u'k' u'a' u'\n' u'c' u'r' u'a' u'n' u'b' u'e' u'r' u'r' u'i' u'e'
 u's' u' ' u'a' u'n' u'd' u' ' u'l' u'e' u'm' u'o' u'n' u' ' u'o' u'r' u' '
 u'l' u'i' u'm' u'e' u' ' u's' u'l' u'i' u'c' u'e' u's' u' ' u'f' u'o' u'r'
 u' ' u'g' u'a' u'r' u'n' u'i' u's' u'h' u'E' u'E' u'E' u'E' u'E' u'E' u'E'
 u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E'
 u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E'
 u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E' u'E'
 u'E' u'E' u'E' u'E' u'E' u'E'

In the above output, you will notice that the sentences are indeed shifted. This is because we are going to predict the next character at each time step. The first input character is 'S', which marks the start of a sentence, and the corresponding character in our target should be the first actual character of the sentence ('0' in the sample above). The later characters in the sentence will also use the "history" of all previous characters to find out what goes next.

## 2. Building our model using an LSTM Recurrent Network.
Next we will create a recurrent neural network in Keras that takes as input a set of one-hot encoded characters of size (batch_size, maxSequenceLength, charVocabularySize). The output of this network is a tensor of the same size, (batch_size, maxSequenceLength, charVocabularySize). However, the output does not contain one-hot encodings: it contains a probability distribution (the output of a softmax) for every time step in the sequence. Section 4 shows how to decode a sequence from these distributions by taking, at every time step, the character whose index has the maximum probability.
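To make the "distribution per time step" idea concrete, here is a small sketch of a softmax followed by argmax decoding. The three-character vocabulary and the scores are made up for illustration, and `softmax` and `greedy_decode` are helper names introduced here rather than Keras functions.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)                         # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(step_probs, id2char):
    """Pick the most probable character at every time step."""
    chars = []
    for probs in step_probs:
        best_id = max(range(len(probs)), key=probs.__getitem__)
        chars.append(id2char[best_id])
    return "".join(chars)

# Hypothetical 3-character vocabulary and made-up scores for 3 time steps.
toy_id2char = {0: "g", 1: "i", 2: "n"}
toy_scores = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.0, 0.4, 3.0]]
toy_probs = [softmax(row) for row in toy_scores]
decoded = greedy_decode(toy_probs, toy_id2char)
```

Each row of `toy_probs` plays the role of one time step of the network output for a single sequence.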

In [5]:
print('Building training model...')
hiddenStateSize = 128
hiddenLayerSize = 128
model = Sequential()
# The output of the LSTM layer are the hidden states of the LSTM for every time step. 
model.add(LSTM(hiddenStateSize, return_sequences = True, input_shape=(maxSequenceLength, len(char2id))))
# Two things to notice here:
# 1. The Dense Layer is equivalent to nn.Linear(hiddenStateSize, hiddenLayerSize) in Torch.
#    In Keras, we often do not need to specify the input size of the layer because it gets inferred for us.
# 2. TimeDistributed applies the linear transformation from the Dense layer to every time step
#    of the output of the sequence produced by the LSTM.
model.add(TimeDistributed(Dense(hiddenLayerSize)))
model.add(TimeDistributed(Activation('relu'))) 
model.add(TimeDistributed(Dense(len(char2id))))  # Add another dense layer with the desired output size.
model.add(TimeDistributed(Activation('softmax')))
# We also specify here the optimization we will use, in this case we use RMSprop with learning rate 0.001.
# RMSprop is commonly used for RNNs instead of regular SGD.
# See this blog for info on RMSprop (http://sebastianruder.com/optimizing-gradient-descent/index.html#rmsprop)
# categorical_crossentropy is the same loss used for classification problems using softmax. (nn.ClassNLLCriterion)
model.compile(loss='categorical_crossentropy', optimizer = RMSprop(lr=0.001))

print(model.summary()) # Convenient function to see details about the network model.

# Test a simple prediction on a batch for this model.
print("Sample input Batch size:"),
print(inputChars[0:32, :, :].shape)
print("Sample input Batch labels (nextChars):"),
print(nextChars[0:32, :, :].shape)
outputs = model.predict(inputChars[0:32, :, :])
print("Output Sequence size:"),
print(outputs.shape)

Building training model...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
lstm_1 (LSTM)                    (None, 501, 128)      131584      lstm_input_1[0][0]               
____________________________________________________________________________________________________
timedistributed_1 (TimeDistribute(None, 501, 128)      16512       lstm_1[0][0]                     
____________________________________________________________________________________________________
timedistributed_2 (TimeDistribute(None, 501, 128)      0           timedistributed_1[0][0]          
____________________________________________________________________________________________________
timedistributed_3 (TimeDistribute(None, 501, 128)      16512       timedistributed_2[0][0]          
________________________________________________________________
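The parameter counts in the summary above can be checked by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights, and a bias) and assumes a character vocabulary of 128 entries. That size is not printed anywhere in the lab, but it is what the reported counts and the last layer's output shape imply. `lstm_param_count` and `dense_param_count` are helper names introduced here for illustration.

```python
def lstm_param_count(input_dim, hidden_dim):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

def dense_param_count(input_dim, output_dim):
    """One weight per input/output pair plus one bias per output."""
    return input_dim * output_dim + output_dim

vocab_size = 128          # assumption: implied by the counts in the summary
hidden_state_size = 128   # hiddenStateSize in the lab
hidden_layer_size = 128   # hiddenLayerSize in the lab

lstm_params = lstm_param_count(vocab_size, hidden_state_size)
dense1_params = dense_param_count(hidden_state_size, hidden_layer_size)
dense2_params = dense_param_count(hidden_layer_size, vocab_size)
```

None of these counts depends on the sequence length (501 in the summary), only on the layer widths.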

## 3. Training the Model
Keras already implements generic training functionality through the model.fit function, but it also provides model.train_on_batch if you want to write the training loop yourself. For more information about Keras model functionality, see https://keras.io/models/model/

If you installed Tensorflow with GPU support, this will automatically run on the GPU.
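If you do want to write the training loop yourself, the main extra piece is splitting the data into shuffled mini-batches. Below is a minimal sketch of that step; `iter_minibatches` is a helper name introduced here, and the commented loop only indicates how it might be combined with `model.train_on_batch` under the lab's variable names.

```python
import random

def iter_minibatches(num_samples, batch_size, rng=None):
    """Yield lists of shuffled sample indices, one list per mini-batch."""
    rng = rng or random.Random()
    order = list(range(num_samples))
    rng.shuffle(order)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]

# How the loop might look with the lab's objects (not executed here):
#
#   for epoch in range(10):
#       for batch in iter_minibatches(len(inputChars), 128):
#           loss = model.train_on_batch(inputChars[batch], nextChars[batch])
#
# Here we only exercise the batching itself on a toy size.
batches = list(iter_minibatches(num_samples=10, batch_size=4, rng=random.Random(0)))
```

The last batch is smaller whenever the number of samples is not a multiple of the batch size.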

In [6]:
model.fit(inputChars, nextChars, batch_size = 128, nb_epoch = 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd0c9272a50>

## 4. Verifying the Model is indeed Learning
Here we input an arbitrary caption from the training set (one-hot encoded), compute the output using the trained model, and decode this output back into a char array. Ideally we should see the same input caption shifted by one character. However, you would need to run the training code for around 24 hours straight to get the model close to that point (for the purposes of this lab it is fine to train the model for only 10 epochs).

In [12]:
model.save_weights('cocktail_weights.h5')


In [11]:
# Test a simple prediction on a batch for this model.
captionId = 132

inputCaption = inputChars[captionId:captionId+1, :, :]
outputs = model.predict(inputCaption)

# printCaption(inputCaption[0])
print(''.join([id2char[x.argmax()] for x in outputs[0, :, :]]))

0 55 clu oz. mhter
0 55 clu oz. mocntreau lr coiple sec
0.55 clu sz. moandy
0.5 tlu oz. mreshllimon juice
0 55 clu oz. mpple cider
(occentrate
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
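One rough way to put a number on "is the model learning" is the fraction of characters that the decoded output gets right against the expected shifted text. The sketch below is an illustration only: `strip_padding` and `char_accuracy` are helper names introduced here, and the two strings are made up rather than taken from the model.

```python
def strip_padding(text, end_char="E"):
    """Drop the run of end markers that pads the tail of a decoded string."""
    return text.rstrip(end_char)

def char_accuracy(predicted, target):
    """Fraction of positions where the two strings agree."""
    if not target:
        return 0.0
    matches = sum(1 for p, t in zip(predicted, target) if p == t)
    return matches / max(len(predicted), len(target))

# Made-up example strings, padded the way the lab pads its sequences.
toy_target = strip_padding("0.5 oz. ginEEEE")
toy_predicted = strip_padding("0 5 oz. gimEEEE")
toy_accuracy = char_accuracy(toy_predicted, toy_target)
```

Tracking this value after each epoch gives a simpler progress signal than reading the decoded text by eye.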


## 5. Building the Inference Model.
We verified in the previous section that the model works somewhat on the training data. However, we want to be able to create new sentences with this model starting from scratch, using the parameters of the trained model to produce text character by character. Here we build such a model and copy the parameters from the trained model above. The following section (section 6) shows how to produce sentences using this inference_model. Please pay attention to the comments in the code below to see how it differs from the model used at training time.
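The role of the carried hidden state can be illustrated with a toy recurrent cell. The sketch below uses a one-unit tanh RNN as a stand-in for the LSTM (a deliberate simplification), and `ToyRecurrentCell` with its weights is invented for this example. It shows that feeding a sequence one step at a time while keeping the state gives the same outputs as processing the whole sequence at once, and that skipping the reset lets the old state leak into the next sequence.

```python
import math

class ToyRecurrentCell:
    """A one-unit tanh RNN, standing in for the LSTM to illustrate state."""

    def __init__(self, w_in, w_rec):
        self.w_in = w_in      # weight on the current input
        self.w_rec = w_rec    # weight on the previous hidden state
        self.state = 0.0

    def reset_states(self):
        """Clear the carried hidden state."""
        self.state = 0.0

    def step(self, x):
        """Consume one input and update the carried hidden state."""
        self.state = math.tanh(self.w_in * x + self.w_rec * self.state)
        return self.state

    def run_sequence(self, xs):
        """Process a whole sequence from a fresh state, as at training time."""
        self.reset_states()
        return [self.step(x) for x in xs]

cell = ToyRecurrentCell(w_in=0.5, w_rec=0.8)
sequence = [1.0, 0.0, -1.0, 0.5]

whole = cell.run_sequence(sequence)              # all time steps in one call

cell.reset_states()
one_by_one = [cell.step(x) for x in sequence]    # one step per call, state kept

without_reset = [cell.step(x) for x in sequence]  # previous state leaks in
```

The weights are identical in both modes of use; only how the input is fed, and whether the state is carried between calls, changes.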

In [14]:
# The only difference with the "training model" is that here the input sequence has 
# a length of one because we will predict character by character.
print('Building Inference model...')
inference_model = Sequential()
# Two differences here.
# 1. The inference model only takes one sample in the batch, and it always has sequence length 1.
# 2. The inference model is stateful, meaning it inputs the output hidden state ("its history state")
#    to the next batch input.
inference_model.add(LSTM(hiddenStateSize, batch_input_shape=(1, 1, len(char2id)), stateful = True))
# Since the above LSTM does not output sequences, we don't need TimeDistributed anymore.
inference_model.add(Dense(hiddenLayerSize))
inference_model.add(Activation('relu'))
inference_model.add(Dense(len(char2id)))
inference_model.add(Activation('softmax'))

# Copy the weights of the trained network. Both should have the same exact number of parameters (why?).
# inference_model.set_weights(model.get_weights())
inference_model.load_weights('cocktail_weights.h5')

# Given the start Character 'S' (one-hot encoded), predict the next most likely character.
startChar = np.zeros((1, 1, len(char2id)))
startChar[0, 0, char2id['S']] = 1
nextCharProbabilities = inference_model.predict(startChar)

# print the most probable character that goes next.
print(id2char[nextCharProbabilities.argmax()])


Building Inference model...
0


In [15]:
charProbs = [(id2char[i], p) for i, p in enumerate(nextCharProbabilities.squeeze())]
charProbs.sort(key=lambda i: i[1], reverse=True)
charProbs[:10]


[(u'0', 0.63272119),
 (u'1', 0.19863351),
 (u'2', 0.073229872),
 (u'3', 0.02815249),
 (u'4', 0.017893698),
 (u'6', 0.010040422),
 (u'5', 0.00622067),
 (u'c', 0.0052369977),
 (u'i', 0.0042802417),
 (u'8', 0.0042548561)]

## 6. Sampling a Complete New Sentence
Now that our inference_model is working, we can start producing new sentences by randomly sampling from the next-character probabilities one step at a time. We rely on numpy's np.random.multinomial function; please check the documentation and make sure you understand what it does: http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multinomial.html
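Drawing a single sample with np.random.multinomial(1, p) returns a one-hot vector of counts, so taking its argmax amounts to drawing one index with probability p[i]. The sketch below shows the same idea using only the standard library, as a stand-in rather than the lab's method. `sample_index` is a helper name introduced here, and the probabilities are made up.

```python
import random

def sample_index(probs, rng=random):
    """Draw one index i with probability proportional to probs[i]."""
    total = float(sum(probs))
    normalized = [p / total for p in probs]    # guard against rounding drift
    return rng.choices(range(len(normalized)), weights=normalized, k=1)[0]

rng = random.Random(0)                         # fixed seed for repeatability
toy_probs = [0.7, 0.2, 0.1]
draws = [sample_index(toy_probs, rng) for _ in range(10000)]
frequencies = [draws.count(i) / len(draws) for i in range(len(toy_probs))]
```

Over many draws the observed frequencies approach the given probabilities. This is why sampling produces varied sentences, whereas always taking the argmax would produce the same one every time.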

In [20]:


for i in range(0, 10):
    inference_model.reset_states()  # This makes sure the initial hidden state is cleared every time.
    startChar = np.zeros((1, 1, len(char2id)))
    startChar[0, 0, char2id['S']] = 1
    end = False
    sent = ""
    for t in range(0, maxSequenceLength):  # Separate name so the outer loop's i is not shadowed.
        nextCharProbs = inference_model.predict(startChar)

        # In theory I should be able to input nextCharProbs to np.random.multinomial.
        nextCharProbs = np.asarray(nextCharProbs).astype('float64') # Weird type cast issues if not doing this.
        nextCharProbs = nextCharProbs / nextCharProbs.sum()  # Re-normalize for float64 to make exactly 1.0.

        nextCharId = np.random.multinomial(1, nextCharProbs.squeeze(), 1).argmax()
        if id2char[nextCharId] == 'E':
            if not end:
                print("~~~~~")
            end = True
        else:
            sent = sent + id2char[nextCharId]  # Append the sampled character to the sentence.
        startChar.fill(0)
        startChar[0, 0, nextCharId] = 1
    print(sent)

~~~~~
0.188 cup coconut
0.0933 teaspoon suparatens freshly squeezed orange
juice from 1 darkmel torthes worcestershire sadges
0.5 cans – 1/2 tablespoon taw splentted lemons
0.0417.9 liters, tequila (optional)
~~~~~
0.0625 tba cold
0.0909 cup flued sugar
2 pans
~~~~~
0.25 lemons, sliced
0.0833 cup grenadine
1 cucumber cileron
0.45 cups heavy
ice
750ml ripe betred seedseren oranges
~~~~~
3 risemmated lime od gomend, thins
0.0835 tebspoon golden gin
0.5 whote cloves
0.0625 teaspoon sugar
mint sprig medges, grened
~~~~~
1 tablespoons sugar
0.5 tablespoons sugar
0.25 lime wine
2.809 ml triple sec
0.5 tbs fresh wedges
30 miderine syrup of sugar, freshly seatled heaveds in orckssmint sugar
6 cup ili's sweetzes
0.25 tbsp lemon juice
1 singer temperries, juiced
0.5 c%) water
0.25 cardaninet lemons
~~~~~
0.429 cups ice cubes
0.05 cups white sugar
tequila
1 cups tequila
to taste
10 teme driek
0.5 tablespoons triple sec
2 ounces puspey
grewed lemon (sor gorgan asule mereato, chilled
1 shots apple 

Notice how the model learns to keep predicting 'E' once it has predicted the first 'E', and produces no other character after that. In practice we can stop the for loop as soon as we find the first 'E'. This produces sentences of arbitrary length, meaning our model has learned when to finish a sentence. The sentences might not be perfect at this point in training. The model has probably already learned to produce basic words like "a", "the", "and" or "with", but it still produces pseudo-words that look like words without being actual ones. Try running the above code several times; the sentences tend to sound funny when read aloud. If you keep training the model for longer, it should get better and better.
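Stopping at the first 'E' can be sketched as follows. This is an illustration under stated assumptions: `generate` is a helper introduced here, and `predict_next` is a hypothetical callable that takes the previous character id and returns next-character probabilities. In the lab, that role would be played by a wrapper around the stateful inference_model. The toy predictor is deterministic so that the result is easy to check.

```python
import random

def generate(predict_next, id2char, start_id, end_id, max_len, rng=random):
    """Sample characters until the end marker appears or max_len is reached."""
    chars = []
    current = start_id
    for _ in range(max_len):
        probs = predict_next(current)
        current = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if current == end_id:
            break                      # stop at the first end marker
        chars.append(id2char[current])
    return "".join(chars)

# Toy deterministic "model": spells g-i-n and then emits the end marker.
toy_id2char = {0: "S", 1: "g", 2: "i", 3: "n", 4: "E"}
toy_next = {0: 1, 1: 2, 2: 3, 3: 4, 4: 4}

def toy_predict(prev_id):
    """Put all probability mass on one fixed successor (hypothetical model)."""
    probs = [0.0] * len(toy_id2char)
    probs[toy_next[prev_id]] = 1.0
    return probs

sentence = generate(toy_predict, toy_id2char, start_id=0, end_id=4, max_len=20)
```

With early stopping, the length of each generated sentence is decided by the model rather than by the padding length.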

## Lab Questions (8 pts)
0. In section 3, how long did it take you to train an epoch on average, and how long did it take to train for 10 epochs? What was your hardware setup? (0.5pts)<br/>
On average, 60s.  10 minutes for 10 epochs.  I'm running this on just an Intel 3570k CPU.<br/>
1. In section 5 we predicted the next character after the starting character 'S' from the output probability distribution. Modify the code to print the top 10 most probable characters at the beginning of a sentence. Show the list of characters and their associated probability to show up as the first character in the sentence: (0.5pts)
<br/>[('a', 0.7477212), ('t', 0.13135067), ('s', 0.020810049), ('p', 0.015685311), ('m', 0.014828731), ('b', 0.013467789), ('c', 0.0088301683), ('w', 0.0085571501), ('l', 0.005540017), ('g', 0.0045442022)]<br/>
2. Print here five sentences that you obtained from section 6 as strings (not as arrays and without 'E' characters). (0.5pts)
    - a white soaking atha sitking in the his beachtor har it while counteroop
    - a bathroom with a ehoolel herr sink a roor tolding andowarount and tomencoolatr
    - ity it a kitchen with tur toderat niam holkightin biketaboden boardow
    - a man roonsing a kitchen op sane
    - a botellood red wark in a eroop opather asdelionts tabbeurarains
    - a fould is howing a gooute if a stale
    - a bathroom with o all wooden thiled with two fisture sinkind in a aroom
    - towe topelan thoroor of athouboord bificels and in the silk
    - a witchen with toilet beardom sink woile aroor fooploth
    - a lasge wooden wooden aistwone filige stowite sink oidowhile ancer an aheas and a roop bike<br/>
3. In section 6, what happens if you remove inference_model.reset_states() from the code? Try removing it and running section 6 code multiple times. Why do you get this effect? (0.5pts)<br/>When removing the inference state reset, the model assumes that the sentence is finished as it was before and thus keeps predicting 'E'.<br/>
4. I have trained this model on a GPU for a few thousand epochs (until the loss went down to around 0.17) and obtained the following weight parameters: <a href="weights-vicente.hdf5">weights-vicente.hdf5</a>. Try loading these weights into your model and producing five sentences (see load_weights in Keras). Include five sentences here as strings: (2pts)
    - a bathroom with a glass darry proceinalllist appliances
    - a nice kitchen with plated oblend laying inside a bagoox out on the floor
    - a large commer table it and a pink shirt is like at a cemech
    - a  gardage cookier down a street next to a train
    - two colessued and paper tys most fisce entign cutting a large scall bowl<br/>
5. Keras already includes an example of how to generate text character by character (using Nietzsche's writings as training text) here: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py. Please describe the differences between that model and the model implemented in this lab. (2pts)<br/>The Keras example predicts one character at a time from a fixed window of the previous 40 characters of the text (including the already-seen part of the word currently being predicted).  This means that each input has size 40 x numChars.<br/>
6. Include any thoughts that you have about other possible uses of this type of model. (For instance, instead of using a one-hot encoding vector for the starting character 'S' as your input, you could use the output of a convolutional neural network applied to an image; this is the most popular model for generating image captions these days.) (2pts) <br/>This type of model could be used to generate names (for people, pets, places...) one character at a time.  One could also use the output of an audio recognition network in place of the 'S' character to generate text captions for spoken word.<br/>

### Optional (2pts)

1. Try to improve the model presented here, for example by changing batch_size, hiddenStateSize, or hiddenLayerSize, or by adding a Dropout layer, a Batch Normalization layer, etc. With the right combination you might reach a very low loss value much faster.<br/><br/>
2. Train the model in this lab using Nietzsche's writings from the Keras example on text generation (you might have to split that text into sentences).

## Answers to lab questions (not inline version)


1. On average, 60s.  10 minutes for 10 epochs.  I'm running this on just an Intel 3570k CPU.
2. [('a', 0.7477212), ('t', 0.13135067), ('s', 0.020810049), ('p', 0.015685311), ('m', 0.014828731), ('b', 0.013467789), ('c', 0.0088301683), ('w', 0.0085571501), ('l', 0.005540017), ('g', 0.0045442022)]
3. sentences below:
    - a white soaking atha sitking in the his beachtor har it while counteroop
    - a bathroom with a ehoolel herr sink a roor tolding andowarount and tomencoolatr
    - ity it a kitchen with tur toderat niam holkightin biketaboden boardow
    - a man roonsing a kitchen op sane
    - a botellood red wark in a eroop opather asdelionts tabbeurarains
    - a fould is howing a gooute if a stale
    - a bathroom with o all wooden thiled with two fisture sinkind in a aroom
    - towe topelan thoroor of athouboord bificels and in the silk
    - a witchen with toilet beardom sink woile aroor fooploth
    - a lasge wooden wooden aistwone filige stowite sink oidowhile ancer an aheas and a roop bike

4. When removing the inference state reset, the model assumes that the sentence is finished as it was before and thus keeps predicting 'E'.
5. sentences below:
    - a bathroom with a glass darry proceinalllist appliances
    - a nice kitchen with plated oblend laying inside a bagoox out on the floor
    - a large commer table it and a pink shirt is like at a cemech
    - a  gardage cookier down a street next to a train
    - two colessued and paper tys most fisce entign cutting a large scall bowl
6. The Keras example predicts one character at a time from a fixed window of the previous 40 characters of the text (including the already-seen part of the word currently being predicted).  This means that each input has size 40 x numChars.
7. This type of model could be used to generate names (for people, pets, places...) one character at a time.  One could also use the output of an audio recognition network in place of the 'S' character to generate text captions for spoken word.

<div style="font-size:0.8em;color:#888;text-align:center;padding-top:20px;">If you find any errors or omissions in this material please contact me at vicente@cs.virginia.edu