# Babbling
Let's write a stupid LSTM RNN which learns to generate text based on a corpus fed to it. Keras has a lovely API so we'll use that, backed by the brawn of TensorFlow.

In [14]:
import math
import pandas as pd
import numpy as np
import nltk

from numpy.random import choice
from keras.layers import *
from keras.models import Sequential

Let's load in a big lump of text for the LSTM to read

In [15]:
book_path = './data/hp_philosophers_stone.txt'

with open(book_path) as f:
    text = f.read().lower()

In [16]:
print('corpus length:', len(text))

corpus length: 439400


Then get a set of the unique characters in the text, and call it our vocabulary. Even ordinary text has a fairly large vocabulary: 26 upper case characters, 26 lower case characters, and loads of punctuation. We lowercased everything on load, which leaves 53 distinct characters here.

In [17]:
characters = sorted(list(set(text)))
vocab_size = len(characters)

vocab_size

53

To make our data computationally interpretable, we need an index that maps each character to a unique numeric id. We can then represent the full book text as a list of character indices. In other words, the output will be a long sequence of numbers which spell out the book.

In [18]:
character_to_index = dict((c, i) for i, c in enumerate(characters))
index_to_character = dict((i, c) for i, c in enumerate(characters))

text_as_indicies = [character_to_index[c] for c in text]
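As a quick sanity check of the mapping idea, here is a minimal sketch on a toy string rather than the book. The names `sample`, `chars`, `char_to_idx` and `idx_to_char` are made up for this illustration; they mirror `characters`, `character_to_index` and `index_to_character` above.

```python
# Toy corpus standing in for the book text (illustrative only).
sample = "hello world"

# Sorted unique characters form the vocabulary.
chars = sorted(set(sample))

# Forward and reverse lookups between characters and integer ids.
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}

# Encode the text as ids, then decode it back to check the round trip.
encoded = [char_to_idx[c] for c in sample]
decoded = ''.join(idx_to_char[i] for i in encoded)

print(chars)
print(encoded)
print(decoded == sample)
```

Because the mapping is one-to-one, decoding the encoded list should always give back the original text.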

Now we can start splitting that massively long series of numbers into a load of training sequences. We'll use a sequence length of 40, because, having tested this with a bunch of lengths, 40 is a nice round number that seems to work well. It also gives us enough context to start picking up on grammar and sentence cadence without being excessive.

In [19]:
sequence_length = 40
num_sequences = len(text) - sequence_length + 1

sequences = [text_as_indicies[i : i + sequence_length] 
             for i in range(num_sequences)]

next_characters = [text_as_indicies[i + 1 : i + sequence_length + 1] 
                   for i in range(num_sequences)]

len(sequences)

439361

The final window's next-character sequence runs off the end of the text and comes out one short, so we trim the tail (the last two windows here) and stack the rest into NumPy arrays.

In [20]:
sequences = np.array(sequences[:-2])
next_characters = np.array(next_characters[:-2])
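To see why the tail needs trimming, here is a small sketch of the same windowing on a toy list of six numbers with a window of 3. `toy`, `window`, `inputs` and `targets` are illustrative names, not part of the notebook.

```python
toy = [0, 1, 2, 3, 4, 5]
window = 3

# Naive count: one window per start position that fits in the list.
naive_n = len(toy) - window + 1
naive_targets = [toy[i + 1 : i + window + 1] for i in range(naive_n)]

# The last target starts one step later and runs off the end, so it is short.
last_target = naive_targets[-1]

# Safe count: stop one window earlier so every target is full length.
n = len(toy) - window
inputs = [toy[i : i + window] for i in range(n)]
targets = [toy[i + 1 : i + window + 1] for i in range(n)]

print(last_target)
print(inputs)
print(targets)
```

Each target is its input shifted along by one position, which is the per-timestep prediction the network is trained on.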

Here's an example of the two things we'll be using to train the network

In [21]:
print('sequence:\n' + str(sequences[0]) + '\n')
print('next characters:\n' + str(next_characters[0]))

sequence:
[44 32 29  1 26 39 49  1 47 32 39  1 36 33 46 29 28 52 52 37 42 10  1 25 38
 28  1 37 42 43 10  1 28 45 42 43 36 29 49  8]

next characters:
[32 29  1 26 39 49  1 47 32 39  1 36 33 46 29 28 52 52 37 42 10  1 25 38 28
  1 37 42 43 10  1 28 45 42 43 36 29 49  8  1]


# Building the model
We're going to use a pretty generic model structure: 
- embedding
- lstm
- dropout
- lstm 
- dropout
- dense (time distributed)
- softmax

We're also going to use the Adam optimizer because it's the best and most clever mashup of things (AdaGrad and RMSProp) ever, sparse categorical cross entropy as our loss function, and the mean absolute error (`mae`) as our metric. For a classifier like this, the loss is the more informative number to watch.

In [22]:
model = Sequential([Embedding(vocab_size, 
                              24, 
                              input_length=sequence_length),
                    LSTM(512, 
                         input_dim=24, 
                         return_sequences=True, 
                         dropout_U=0.2, 
                         dropout_W=0.2, 
                         consume_less='gpu'),
                    Dropout(0.2),
                    LSTM(512, 
                         return_sequences=True, 
                         dropout_U=0.2, 
                         dropout_W=0.2, 
                         consume_less='gpu'),
                    Dropout(0.2),
                    TimeDistributed(Dense(vocab_size)),
                    Activation('softmax')])



In [23]:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['mae'])
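For intuition about the loss, here is a minimal pure-Python sketch of sparse categorical cross entropy for a single prediction. The function below is a hand-written illustration, not the Keras implementation: the "sparse" part just means the target is an integer class id rather than a one-hot vector.

```python
import math

def sparse_categorical_crossentropy(probs, target_index):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probs[target_index])

# A made-up softmax output over three classes.
probs = [0.7, 0.2, 0.1]

loss_good = sparse_categorical_crossentropy(probs, 0)  # true class got 0.7
loss_bad = sparse_categorical_crossentropy(probs, 2)   # true class got 0.1

# A model that guesses uniformly over 53 characters scores ln(53).
uniform_loss = sparse_categorical_crossentropy([1 / 53] * 53, 0)

print(loss_good, loss_bad, uniform_loss)
```

The uniform case gives a handy baseline: with a 53-character vocabulary, a training loss around ln(53), about 3.97, means the network has not learned anything yet.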

# Training the model

In [24]:
model.optimizer.lr = 0.001

model.fit(sequences, 
          np.expand_dims(next_characters,-1), 
          batch_size=64, 
          nb_epoch=1)



Epoch 1/1


<keras.callbacks.History at 0x7f8b483728d0>

Now that we've trained the model and optimised all of the weights in the network, we can save them to an `.h5` file.

In [25]:
import os
os.makedirs('models', exist_ok=True)  # the save fails if this directory is missing
model.save_weights('models/weights.h5')

# Reloading a pretrained model
If you've built the model and trained it elsewhere, you can reload the weights by calling `.load_weights()` with the path to the `.h5` file, as follows:

In [23]:
model.load_weights('models/weights.h5')

# Babbling

In [29]:
def babble(seed_string=' '*sequence_length, output_length=500):
    '''
    Say a lot of stupid stuff based on all of the input text 
    that we trained the model on
    
    Parameters
    ----------
    seed_string : string (optional)
        The story that you want your idiot network to be 
        inspired by. Needs at least `sequence_length` characters, 
        all lowercase and all in the vocabulary
        default = 40 spaces
    
    output_length : int (optional)
        how long do you want the idiot network to talk for
        default = 500
        
    Returns
    -------
    seed_string : string
        the original seed string with `output_length` characters 
        of new stuff attached to the end of it
    '''
    for i in range(output_length):
        # encode the last `sequence_length` characters as a batch of one
        x = np.array([character_to_index[c] for c in seed_string[-sequence_length:]])[np.newaxis,:]
        # keep only the distribution predicted for the final position
        preds = model.predict(x, verbose=0)[0][-1].astype('float64')
        # renormalise so the probabilities sum to 1 for `choice`
        preds = preds / np.sum(preds)
        next_character = choice(characters, p=preds)
        seed_string += next_character
    return seed_string

In [30]:
print(babble())

                                        in it and walked up by trouble glumping on his tricks as harry left harry's broom and back.　　"let's everyone else had to go bit to look at each other.　　"just then," harry. but harry, too, ron, and ron fruffled so back for us," ron sighed, as they telling himself against the stone.　　"then the armchairs wouldn't over his mouth. the flash of the days to give us them id it, just a wafd."　　this is it must be sort. i dungeon had left professor mcgonagall noticing making the first i've got to said. "co
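The generation loop itself does not depend on Keras, so here is a rough standalone sketch of the same rolling-window idea with the model swapped for a toy predictor. Everything named here (`generate`, `predict_next`, `cycle_predictor`, the three-letter `vocab`) is hypothetical, and the standard library's `random.choices` stands in for `numpy.random.choice`.

```python
import random

def generate(predict_next, seed, vocab, window=40, length=20):
    """Repeatedly sample one character and append it to the running text."""
    out = seed
    for _ in range(length):
        context = out[-window:]          # only the most recent characters
        probs = predict_next(context)    # one probability per vocab entry
        out += random.choices(vocab, weights=probs, k=1)[0]
    return out

vocab = ['a', 'b', 'c']

def cycle_predictor(context):
    # Stand-in for the trained network: puts all of the probability
    # on whichever letter follows the last one in the cycle a -> b -> c -> a.
    nxt = (vocab.index(context[-1]) + 1) % len(vocab)
    return [1.0 if i == nxt else 0.0 for i in range(len(vocab))]

result = generate(cycle_predictor, 'a', vocab, window=2, length=5)
print(result)
```

With a real network the probabilities are spread across many characters, so sampling gives different text each run, whereas always taking the most likely character tends to get stuck repeating itself.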


Hooray we did the thing