# Generating Hitchhiker's Guide text with an RNN in Keras

Inspired by http://karpathy.github.io/2015/05/21/rnn-effectiveness/

H2g2 plain text from http://www.clearwhitelight.org/hitch/hhgttg.txt

## Configs

In [1]:
seq_len = 50
batch_size = 100
iterations = 50000

## Setup

In [2]:
import time
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import LSTM
import numpy as np
from sklearn.feature_extraction import DictVectorizer
import logging

Using TensorFlow backend.


In [3]:
# read the whole book into one string; the context manager closes the file
with open('./h2g2.txt') as f:
    text = f.read()

In [4]:
print(text[1000:2000])

tches. 

Many were increasingly of the opinion that they'd all made a big mistake in coming down from the trees in the first place. And some said that even the trees had been a bad move, and that no one should ever have left the oceans. 

And then, one Thursday, nearly two thousand years after one man had been nailed to a tree for saying how great it would be to be nice to people for a change, one girl sitting on her own in a small cafe in Rickmansworth suddenly realized what it was that had been going wrong all this time, and she finally knew how the world could be made a good and happy place. This time it was right, it would work, and no one would have to get nailed to anything. 

Sadly, however, before she could get to a phone to tell anyone- about it, a terribly stupid catastrophe occurred, and the idea was lost forever. 

This is not her story. 

But it is the story of that terrible stupid catastrophe and some of its consequences. 

It is also the story of a book, a book called Th

In [5]:
vec = DictVectorizer()

In [6]:
letters = list(text)
text_array = vec.fit_transform([{l:1} for l in letters])

In [7]:
text_array.shape

(272157, 83)

The shape looks right: the book has 272157 characters drawn from 83 unique characters, so each row of `text_array` is the one-hot encoding of a single character.
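To make the shape check concrete, here is a minimal sketch of the same character-level one-hot encoding written in plain NumPy on a short string. The helper name `one_hot_chars` and the sample string are illustrative assumptions, not part of the notebook, which uses `DictVectorizer` and a sparse matrix instead.

```python
import numpy as np

def one_hot_chars(text):
    """Return (one-hot matrix, sorted vocabulary) for a non-empty string."""
    vocab = sorted(set(text))                       # unique characters
    index = {ch: k for k, ch in enumerate(vocab)}   # character -> column
    encoded = np.zeros((len(text), len(vocab)), dtype=np.int8)
    # set exactly one column per row: the column of that row's character
    encoded[np.arange(len(text)), [index[ch] for ch in text]] = 1
    return encoded, vocab

sample = "don't panic"  # hypothetical sample text
encoded, vocab = one_hot_chars(sample)
# rows = characters in the text, columns = unique characters
print(encoded.shape)
```

Each row sums to 1, and taking the `argmax` of a row recovers the character's position in `vocab`, which is the same round trip that `inverse_transform` performs for the notebook's encoding.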

## Helper functions

In [8]:
def next_batch(data, seq_len, batch_size):
    x_batch = []
    y_batch = []
    for _ in range(batch_size):
        i = np.random.randint(0, data.shape[0]-seq_len)
        x_i = data[i:(i+seq_len)].toarray()
        y_i = data[i+seq_len].toarray()[0]
        x_batch.append(list(x_i))
        y_batch.append(y_i)
    return np.array(x_batch), np.array(y_batch)

In [9]:
# test out next_batch function
x_batch, y_batch = next_batch(text_array, 20, 5)
print(x_batch.shape)
print(y_batch.shape)

(5, 20, 83)
(5, 83)


The function looks good. It generates 5 records, where each record is a sequence of 20 letters and each letter is encoded as an array of length 83. The matching target for each record is the single letter that follows the sequence.
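The pairing of an input window with its target character can be sketched on a plain string, without any encoding. The helper `windows` and the sample string are hypothetical and only illustrate the indexing; `next_batch` draws random start positions rather than enumerating all of them.

```python
def windows(text, seq_len):
    """List every (input window, next character) pair in the text."""
    return [(text[i:i + seq_len], text[i + seq_len])
            for i in range(len(text) - seq_len)]

pairs = windows("arthur dent", 4)  # hypothetical sample text
for x, y in pairs[:3]:
    print(repr(x), '->', repr(y))
```

The last valid start index is `len(text) - seq_len - 1`, because the target character at `i + seq_len` must still exist. This matches the upper bound passed to `np.random.randint` in `next_batch`.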

In [10]:
def text_from_dict(arrays):
    # each decoded row is a one-key dict such as {'a': 1.0}; keep the key
    return ''.join([list(l.keys())[0] for l in vec.inverse_transform(arrays)])

In [11]:
print(text_from_dict(x_batch[0]))
print(text_from_dict(y_batch[0].reshape(1,-1)))

 you?" stammered Art
h


## Define network

In [12]:
#model = Sequential()
#model.add(LSTM(32, return_sequences=True, stateful=True, batch_input_shape=(batch_size, seq_len, text_array.shape[1])))
#model.add(LSTM(32, return_sequences=True, stateful=True))
#model.add(LSTM(32, stateful=True))
#model.add(Dense(text_array.shape[1], activation='softmax'))


model = Sequential()
model.add(LSTM(200,
               input_shape=(seq_len, text_array.shape[1])
#               batch_input_shape=(batch_size, seq_len, text_array.shape[1])
              ))
model.add(Dropout(0.2))
model.add(Dense(text_array.shape[1], activation='softmax'))

In [13]:
print(model.input_shape)

(None, 50, 83)


In [14]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 200)               227200    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 83)                16683     
=================================================================
Total params: 243,883
Trainable params: 243,883
Non-trainable params: 0
_________________________________________________________________
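The parameter counts in the summary above can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias. The function names are illustrative and are not Keras APIs.

```python
def lstm_param_count(units, input_dim):
    # four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(inputs, outputs):
    # one weight per input/output pair plus one bias per output
    return inputs * outputs + outputs

lstm = lstm_param_count(units=200, input_dim=83)
dense = dense_param_count(inputs=200, outputs=83)
print("{} + {} = {}".format(lstm, dense, lstm + dense))
# 227200 + 16683 = 243883
```

Dropout has no weights, so the two layers account for the full total of 243,883 parameters.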


## Train

In [None]:
model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

In [None]:
logging.basicConfig(filename='h2g2.log', level=logging.DEBUG)

t0 = time.time()
for i in range(iterations):
    x_batch, y_batch = next_batch(text_array, seq_len, batch_size)
    model.train_on_batch(x_batch, y_batch)

    # every 100 iterations, log a sample from the batch and score a fresh random batch
    if i % 100 == 0:
        logging.info('iteration complete: {}'.format(i))

        logging.info('x_batch:')
        logging.info(x_batch.shape)
        logging.info(text_from_dict(x_batch[0]))
        logging.info('y_batch:')
        logging.info(y_batch.shape)
        logging.info(text_from_dict(y_batch[0].reshape(1, -1)))

        # test_on_batch returns [loss, accuracy] because of metrics=['accuracy'];
        # the batch is drawn from the same text, so it is not a held-out set
        x_valid, y_valid = next_batch(text_array, seq_len, batch_size)
        i_loss = model.test_on_batch(x_valid, y_valid)
        logging.info('iteration loss: {}'.format(i_loss))
print("training time = {}".format(time.time() - t0))

## Evaluate

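To evaluate the sequence generator, seed the model with a random 50-character window from the book and let it predict the following characters one at a time.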

In [19]:
# keep the start index far enough from the end that the seed has seq_len characters
i = np.random.randint(0, text_array.shape[0] - seq_len)
seed = text_array[i:(i+seq_len)].toarray()

print(text_from_dict(seed))

se. All around the world city streets exploded wit


In [24]:
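n_chars = text_array.shape[1]
output = seed.copy()

for _ in range(200):
    # predict from the most recent seq_len characters
    next_step = output[-seq_len:]
    probs = model.predict(next_step.reshape(-1, seq_len, n_chars))
    # greedy decoding: always append the single most likely character
    y_pred = np.zeros(n_chars, dtype='int')
    y_pred[np.argmax(probs)] = 1
    output = np.concatenate([output, y_pred.reshape(-1, n_chars)], axis=0)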
    

In [25]:
output.shape

(250, 83)

In [26]:
text_from_dict(output)

'se. All around the world city streets exploded with a small problem with a start was a small problem with a small problem with a start was a small problem with a small problem with a start was a small problem with a small problem with a start was a s'
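The generated text above falls into a repeating loop. That is a common outcome when every step takes the single most likely character. One possible remedy, not used in this notebook, is to sample the next character from the predicted distribution, optionally reshaped by a temperature. The sketch below is a hedged illustration: `sample_index`, its `temperature` parameter, and the example probabilities are assumptions introduced here.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw an index from a probability vector reshaped by temperature."""
    rng = np.random.default_rng() if rng is None else rng
    # work in log space; the small constant avoids log(0)
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()            # stabilise the exponential
    scaled = np.exp(logits)
    scaled /= scaled.sum()            # renormalise to a valid distribution
    return int(rng.choice(len(scaled), p=scaled))

example_probs = [0.7, 0.2, 0.1]       # hypothetical softmax output
rng = np.random.default_rng(0)
print([sample_index(example_probs, temperature=1.0, rng=rng) for _ in range(10)])
```

A temperature well below 1 approaches the greedy `argmax` behaviour, while values near 1 follow the model's distribution as predicted. To try it, the `np.argmax(...)` call in the generation loop could be replaced with `sample_index` applied to the first row of `model.predict(...)`. Whether that improves the output for this model has not been tested here.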