# Hitchhiker's generation with RNN using Keras

Inspired by http://karpathy.github.io/2015/05/21/rnn-effectiveness/

H2g2 plain text from http://www.clearwhitelight.org/hitch/hhgttg.txt

## Configs

In [1]:
seq_len = 50
batch_size = 100
iterations = 10000

## Setup

In [2]:
import time
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import LSTM
import numpy as np
from sklearn.feature_extraction import DictVectorizer

Using TensorFlow backend.


In [3]:
# Read the whole book into memory; the context manager closes the file.
with open('./h2g2.txt') as f:
    text = f.read()

In [4]:
print(text[1000:2000])

tches. 

Many were increasingly of the opinion that they'd all made a big mistake in coming down from the trees in the first place. And some said that even the trees had been a bad move, and that no one should ever have left the oceans. 

And then, one Thursday, nearly two thousand years after one man had been nailed to a tree for saying how great it would be to be nice to people for a change, one girl sitting on her own in a small cafe in Rickmansworth suddenly realized what it was that had been going wrong all this time, and she finally knew how the world could be made a good and happy place. This time it was right, it would work, and no one would have to get nailed to anything. 

Sadly, however, before she could get to a phone to tell anyone- about it, a terribly stupid catastrophe occurred, and the idea was lost forever. 

This is not her story. 

But it is the story of that terrible stupid catastrophe and some of its consequences. 

It is also the story of a book, a book called Th

In [5]:
vec = DictVectorizer()

In [6]:
letters = list(text)
text_array = vec.fit_transform([{l:1} for l in letters])

In [7]:
text_array.shape

(272157, 83)

This looks about right: the book has 272,157 characters, 83 of them unique, so the one-hot encoding has one row per character and one column per unique character.
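
As a sanity check on that shape, the same kind of one-hot encoding can be sketched with plain NumPy on a toy string. The helper name `one_hot_chars` is hypothetical and not part of this notebook, and unlike `DictVectorizer` (which returns a sparse matrix) this sketch returns a dense array.

```python
import numpy as np

def one_hot_chars(text):
    """Return (dense one-hot array, sorted vocabulary) for a string."""
    vocab = sorted(set(text))
    index = {ch: k for k, ch in enumerate(vocab)}
    encoded = np.zeros((len(text), len(vocab)), dtype=np.float32)
    # Set exactly one column per row: the column of that row's character.
    encoded[np.arange(len(text)), [index[ch] for ch in text]] = 1.0
    return encoded, vocab

encoded, vocab = one_hot_chars("don't panic")
print(encoded.shape)   # (number of characters, number of unique characters)

# Decoding takes the argmax of each row and looks it up in the vocabulary.
decoded = ''.join(vocab[k] for k in encoded.argmax(axis=1))
print(decoded)
```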

## Helper functions

In [8]:
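def next_batch(data, seq_len, batch_size):
    """Sample random windows of seq_len characters and the character after each."""
    x_batch = []
    y_batch = []
    for _ in range(batch_size):
        # Random start index, leaving room for the window and its target.
        i = np.random.randint(0, data.shape[0]-seq_len)
        # Input: seq_len consecutive one-hot rows, converted from sparse to dense.
        x_i = data[i:(i+seq_len)].toarray()
        # Target: the single one-hot row that follows the window.
        y_i = data[i+seq_len].toarray()[0]
        x_batch.append(x_i)
        y_batch.append(y_i)
    return np.array(x_batch), np.array(y_batch)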

In [9]:
# test out next_batch function
x_batch, y_batch = next_batch(text_array, 20, 5)
print(x_batch.shape)
print(y_batch.shape)

(5, 20, 83)
(5, 83)


The function looks good. It generates 5 records, each a sequence of 20 letters, with each letter one-hot encoded as an array of length 83.
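
The shapes alone do not show that each target is the character following its input window. One way to check that is to run the same sampling logic on a toy array whose rows are identifiable. The sketch below assumes a dense NumPy array rather than the sparse matrix used above, and `next_batch_dense` is a hypothetical name.

```python
import numpy as np

def next_batch_dense(data, seq_len, batch_size, rng=np.random):
    """Sample `batch_size` random windows of length `seq_len` and their next rows."""
    starts = rng.randint(0, data.shape[0] - seq_len, size=batch_size)
    x_batch = np.stack([data[s:s + seq_len] for s in starts])
    y_batch = np.stack([data[s + seq_len] for s in starts])
    return x_batch, y_batch

toy = np.eye(10)   # row k is the one-hot vector for "character" k
x_toy, y_toy = next_batch_dense(toy, seq_len=3, batch_size=4)
print(x_toy.shape, y_toy.shape)   # (4, 3, 10) (4, 10)

# Each target index should be exactly one past the last index of its window.
last_in_window = x_toy.argmax(axis=2)[:, -1]
print(np.all(last_in_window + 1 == y_toy.argmax(axis=1)))
```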

## Define network

In [10]:
#model = Sequential()
#model.add(LSTM(32, return_sequences=True, stateful=True, batch_input_shape=(batch_size, seq_len, text_array.shape[1])))
#model.add(LSTM(32, return_sequences=True, stateful=True))
#model.add(LSTM(32, stateful=True))
#model.add(Dense(text_array.shape[1], activation='softmax'))


model = Sequential()
model.add(LSTM(64, 
               input_shape=(seq_len, text_array.shape[1])
              ))
model.add(Dense(text_array.shape[1], activation='softmax'))

In [39]:
print(model.input_shape)

(None, 50, 83)


In [11]:
print(model.summary())

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
lstm_1 (LSTM)                    (None, 64)            37888       lstm_input_1[0][0]               
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 83)            5395        lstm_1[0][0]                     
Total params: 43,283
Trainable params: 43,283
Non-trainable params: 0
____________________________________________________________________________________________________
None
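
The parameter counts in the summary can be reproduced by hand. A standard LSTM layer has four gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector; the dense layer has one weight matrix and one bias vector. The function names below are illustrative only.

```python
def lstm_params(units, input_dim):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * (units * input_dim + units * units + units)

def dense_params(units, input_dim):
    # one weight per (input, output) pair, plus one bias per output
    return input_dim * units + units

lstm = lstm_params(units=64, input_dim=83)
dense = dense_params(units=83, input_dim=64)
print(lstm, dense, lstm + dense)   # 37888 5395 43283
```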


## Train

In [12]:
model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

In [13]:
x_batch, y_batch = next_batch(text_array, seq_len, batch_size)
x_valid, y_valid = next_batch(text_array, seq_len, batch_size)

In [14]:
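t0 = time.time()
for i in range(iterations):
    # x_batch and y_batch were sampled once in the previous cell,
    # so every iteration trains on the same 100 sequences.
    model.train_on_batch(x_batch, y_batch)
    # test_on_batch returns [loss, accuracy] on the validation batch.
    i_loss = model.test_on_batch(x_valid, y_valid)

    if i % 100 == 0:
        print('iteration complete: {}'.format(i))
        print('iteration loss: {}'.format(i_loss))
print("training time = {}".format(time.time() - t0))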

iteration complete: 0
iteration loss: [4.399684, 0.0]
iteration complete: 100
iteration loss: [3.1949382, 0.22]
iteration complete: 200
iteration loss: [3.2324407, 0.22]
iteration complete: 300
iteration loss: [3.3369741, 0.16999999]
iteration complete: 400
iteration loss: [3.7286751, 0.16000001]
iteration complete: 500
iteration loss: [4.0611663, 0.17999999]
iteration complete: 600
iteration loss: [4.322011, 0.16999999]
iteration complete: 700
iteration loss: [4.560277, 0.17999999]
iteration complete: 800
iteration loss: [4.9945164, 0.17999999]
iteration complete: 900
iteration loss: [5.2480221, 0.16]
iteration complete: 1000
iteration loss: [5.5035357, 0.16]
iteration complete: 1100
iteration loss: [5.6996694, 0.14]
iteration complete: 1200
iteration loss: [6.0065527, 0.12]
iteration complete: 1300
iteration loss: [6.1346369, 0.13]
iteration complete: 1400
iteration loss: [6.1613646, 0.12]
iteration complete: 1500
iteration loss: [6.5277152, 0.10999999]
iteration complete: 1600
itera

KeyboardInterrupt: 

## Evaluate

Evaluating a sequence generator 

In [47]:
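# Pick a random starting position for the seed sequence.
i = np.random.randint(0, text_array.shape[0] - seq_len)
seed = text_array[i:(i+seq_len)].toarray()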

In [48]:
def get_text(output):
    # inverse_transform returns one {character: 1.0} dict per one-hot row.
    print(''.join([list(l.keys())[0] for l in vec.inverse_transform(output)]))

get_text(seed)

s about thirty years old, squattish, squarish, mad


In [49]:
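output = seed.copy()
n_chars = text_array.shape[1]

for _ in range(10000):
    # Predict from the most recent seq_len characters only.
    window = output[-seq_len:].reshape(1, seq_len, n_chars)
    next_idx = np.argmax(model.predict(window))
    # One-hot encode the predicted character and append it as a new row.
    y_pred = np.zeros((1, n_chars), dtype=output.dtype)
    y_pred[0, next_idx] = 1
    output = np.concatenate([output, y_pred], axis=0)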
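
The generation loop can be checked without a trained network by swapping in a stand-in for `model.predict`. The sketch below is a greedy (argmax) generator over a sliding window of the last `seq_len` rows; `generate` and `cyclic_predict` are hypothetical names, and `cyclic_predict` is a toy rule rather than anything learned.

```python
import numpy as np

def generate(predict_fn, seed, n_steps, seq_len):
    """Greedy generation: append the most probable next character each step."""
    n_chars = seed.shape[1]
    output = seed.copy()
    for _ in range(n_steps):
        # Feed only the most recent seq_len one-hot rows to the predictor.
        window = output[-seq_len:].reshape(1, seq_len, n_chars)
        probs = predict_fn(window)   # expected shape: (1, n_chars)
        next_idx = int(np.argmax(probs))
        one_hot = np.zeros((1, n_chars), dtype=output.dtype)
        one_hot[0, next_idx] = 1
        output = np.concatenate([output, one_hot], axis=0)
    return output

def cyclic_predict(window):
    """Toy stand-in for model.predict: always predicts (last index + 1) mod n_chars."""
    n_chars = window.shape[2]
    last = int(np.argmax(window[0, -1]))
    probs = np.zeros((1, n_chars))
    probs[0, (last + 1) % n_chars] = 1.0
    return probs

seed_toy = np.eye(4)[[0, 1, 2]]   # three one-hot rows: characters 0, 1, 2
generated = generate(cyclic_predict, seed_toy, n_steps=5, seq_len=3)
print(generated.argmax(axis=1))   # [0 1 2 3 0 1 2 3]
```

Passing `model.predict` as `predict_fn` and the 50-character `seed` would run the same loop against the network defined above.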
    