# Hitchhiker's generation with RNN using Keras

Inspired by http://karpathy.github.io/2015/05/21/rnn-effectiveness/

H2g2 plain text from http://www.clearwhitelight.org/hitch/hhgttg.txt

## Configs

In [1]:
seq_len = 30
batch_size = 500
iterations = 10000

## Setup

In [2]:
import time
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import LSTM
import numpy as np
from sklearn.feature_extraction import DictVectorizer

Using TensorFlow backend.


In [3]:
# read the whole book into memory; the context manager closes the file afterwards
with open('./h2g2.txt') as f:
    text = f.read()

In [4]:
print(text[1000:2000])

tches. 

Many were increasingly of the opinion that they'd all made a big mistake in coming down from the trees in the first place. And some said that even the trees had been a bad move, and that no one should ever have left the oceans. 

And then, one Thursday, nearly two thousand years after one man had been nailed to a tree for saying how great it would be to be nice to people for a change, one girl sitting on her own in a small cafe in Rickmansworth suddenly realized what it was that had been going wrong all this time, and she finally knew how the world could be made a good and happy place. This time it was right, it would work, and no one would have to get nailed to anything. 

Sadly, however, before she could get to a phone to tell anyone- about it, a terribly stupid catastrophe occurred, and the idea was lost forever. 

This is not her story. 

But it is the story of that terrible stupid catastrophe and some of its consequences. 

It is also the story of a book, a book called Th

In [5]:
vec = DictVectorizer()

In [6]:
letters = list(text)
text_array = vec.fit_transform([{l:1} for l in letters])

In [7]:
text_array.shape

(272157, 83)

This looks about right: the book has 272157 characters, 83 of them unique, so each character is one-hot encoded as a row of length 83.
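To make the encoding concrete, here is a minimal sketch of the same idea in plain NumPy on a toy string. The names `sample`, `chars`, and `char_to_idx` are illustrative and not part of the notebook, and the result is a dense array, whereas `DictVectorizer` returns a sparse matrix by default.

```python
import numpy as np

sample = "don't panic"                       # toy input, not the book text
chars = sorted(set(sample))                  # unique characters in a fixed order
char_to_idx = {c: i for i, c in enumerate(chars)}

# one row per character, one column per unique character
one_hot = np.zeros((len(sample), len(chars)), dtype=int)
one_hot[np.arange(len(sample)), [char_to_idx[c] for c in sample]] = 1

# decoding: the position of the single 1 in each row identifies the character
decoded = ''.join(chars[i] for i in one_hot.argmax(axis=1))
print(one_hot.shape)
print(decoded)
```

Each row contains exactly one 1, so the number of rows equals the number of characters and the number of columns equals the alphabet size, matching the `(272157, 83)` shape above.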

## Helper functions

In [8]:
def next_batch(data, seq_len, batch_size):
    x_batch = []
    y_batch = []
    for _ in range(batch_size):
        i = np.random.randint(0, data.shape[0]-seq_len)
        x_i = data[i:(i+seq_len)].toarray()
        y_i = data[i+seq_len].toarray()[0]
        x_batch.append(list(x_i))
        y_batch.append(y_i)
    return np.array(x_batch), np.array(y_batch)

In [9]:
# test out next_batch function
x_batch, y_batch = next_batch(text_array, 20, 5)
print(x_batch.shape)
print(y_batch.shape)

(5, 20, 83)
(5, 83)


The function looks good. It generates 5 records of 20 letters each, with every letter encoded as an array of length 83, plus one target letter (also of length 83) per record.
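The relationship between an input window and its target is easier to see on raw text than on one-hot arrays. The sketch below uses a hypothetical `toy_text` and a window of 5 characters: the input is a slice of the text and the target is the single character that follows it.

```python
import numpy as np

toy_text = "so long, and thanks for all the fish"   # hypothetical sample text
window = 5                                          # plays the role of seq_len

rng = np.random.RandomState(0)                      # seeded only for repeatability
start = rng.randint(0, len(toy_text) - window)      # same bounds as next_batch

x_chars = toy_text[start:start + window]            # what the network sees
y_char = toy_text[start + window]                   # what it should predict

print(repr(x_chars), '->', repr(y_char))
```

`next_batch` does the same thing on the encoded array, repeated `batch_size` times.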

## Define network

In [10]:
#model = Sequential()
#model.add(LSTM(32, return_sequences=True, stateful=True, batch_input_shape=(batch_size, seq_len, text_array.shape[1])))
#model.add(LSTM(32, return_sequences=True, stateful=True))
#model.add(LSTM(32, stateful=True))
#model.add(Dense(text_array.shape[1], activation='softmax'))


model = Sequential()
model.add(LSTM(200, 
               input_shape=(seq_len, text_array.shape[1])
              ))
model.add(Dropout(0.2))
model.add(Dense(text_array.shape[1], activation='softmax'))

In [11]:
print(model.input_shape)

(None, 30, 83)


In [12]:
print(model.summary())

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
lstm_1 (LSTM)                    (None, 200)           227200      lstm_input_1[0][0]               
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 200)           0           lstm_1[0][0]                     
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 83)            16683       dropout_1[0][0]                  
Total params: 243,883
Trainable params: 243,883
Non-trainable params: 0
____________________________________________________________________________________________________
None
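The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias; a dense layer has one weight per input-output pair plus a bias per output. The helper names below are illustrative, not Keras functions.

```python
def lstm_params(input_dim, units):
    # four gates, each with (input_dim + units) weights per unit plus one bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # full weight matrix plus one bias per output unit
    return input_dim * units + units

lstm = lstm_params(83, 200)     # 83 one-hot inputs, 200 LSTM units
dense = dense_params(200, 83)   # 200 LSTM outputs, 83 softmax outputs
total = lstm + dense
print(total)
```

This gives 227200 for the LSTM, 16683 for the dense layer, and 243883 in total, which agrees with the summary. Dropout adds no parameters.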


## Train

In [13]:
model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

In [None]:
x_batch, y_batch = next_batch(text_array, seq_len, batch_size)
x_valid, y_valid = next_batch(text_array, seq_len, batch_size)

In [None]:
t0 = time.time()
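for i in range(iterations):
    # x_batch/y_batch were sampled once above, so every step reuses the same batch
    model.train_on_batch(x_batch, y_batch)
    # returns [loss, accuracy] on the held-out batch
    i_loss = model.test_on_batch(x_valid, y_valid)

    if i % 100 == 0:
        print('iteration complete: {}'.format(i))
        print('iteration loss: {}'.format(i_loss))
print("training time = {}".format(time.time() - t0))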

iteration complete: 0
iteration loss: [4.4061661, 0.028000003]
iteration complete: 100
iteration loss: [3.2244585, 0.17200001]
iteration complete: 200
iteration loss: [3.2539067, 0.17200002]
iteration complete: 300
iteration loss: [3.2354622, 0.18000001]
iteration complete: 400
iteration loss: [3.3990827, 0.168]
iteration complete: 500
iteration loss: [3.7940867, 0.126]
iteration complete: 600
iteration loss: [4.453474, 0.118]
iteration complete: 700
iteration loss: [5.1190538, 0.088000007]
iteration complete: 800
iteration loss: [5.6093268, 0.082000002]
iteration complete: 900
iteration loss: [5.959826, 0.078000002]
iteration complete: 1000
iteration loss: [6.2345324, 0.088]
iteration complete: 1100
iteration loss: [6.4295669, 0.091999993]
iteration complete: 1200
iteration loss: [6.6050825, 0.089999996]


## Evaluate

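To evaluate the sequence generator, seed it with a random passage from the book and let it predict the following characters one at a time.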

In [20]:
def get_text(arrays):
    # each decoded row is a one-entry dict {character: 1}; take its only key
    return ''.join([list(l.keys())[0] for l in vec.inverse_transform(arrays)])

In [21]:
# leave room for a full seq_len window so the seed is never truncated
i = np.random.randint(0, text_array.shape[0] - seq_len)
seed = text_array[i:(i+seq_len)].toarray()

print(get_text(seed))

blem, I am here to help you so


In [22]:
output = seed.copy()

for _ in range(100):
    next_step = output[-seq_len:]
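    # greedy decoding: always take the most probable next character
    i = np.argmax(model.predict(next_step.reshape(-1, seq_len, text_array.shape[1])))
    y_pred = np.zeros(text_array.shape[1], dtype='int')
    y_pred[i] = 1
    output = np.concatenate([output, y_pred.reshape(-1, text_array.shape[1])], axis=0)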
    

In [23]:
get_text(output)

'blem, I am here to help you soo  lfsmmmmmml eteeeeiieheeur s oo ooooooooooooooornnnnenaeeeeeeennneeeeeeeeesv,,,,,,o,oooo,,,nnn.oen'
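The generated text gets stuck in long runs of a single character. Part of that is the undertrained model, but picking the `argmax` at every step also tends to produce loops. A common alternative, not used in this notebook, is to sample the next character from the predicted distribution, optionally sharpened or flattened by a temperature. The function name `sample_index` and its parameters are assumptions for illustration.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=np.random):
    """Draw an index from `probs` after rescaling by `temperature`."""
    probs = np.asarray(probs, dtype='float64')
    # work in log space; the small constant avoids log(0)
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()                 # for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()                 # renormalize to a distribution
    return int(rng.choice(len(scaled), p=scaled))
```

In the generation loop, `np.argmax(model.predict(...))` could be replaced with `sample_index(model.predict(...)[0], temperature=0.5)`. Low temperatures approach the greedy choice, while values near 1 follow the model's distribution more closely.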