## Train a Character-level RNN

RNNs take a long time to train, so I used a character-level model with a short window (`maxlen = 25` in the code below). I expected this to give good results while keeping the output space small: a few dozen characters (31 in this corpus) instead of thousands of words.
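
To make the window idea concrete, here is a minimal sketch of how a text can be cut into fixed-length character windows, each paired with the character that follows it. The toy string, the tiny window size, and the helper name `make_windows` are assumptions for illustration only; the notebook's real preprocessing is in the cells that follow.

```python
def make_windows(text, maxlen, step=1):
    """Return (windows, next_chars) for a character-level model.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    windows, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])   # input: maxlen consecutive characters
        next_chars.append(text[i + maxlen])  # target: the character right after
    return windows, next_chars


toy_text = "$shall i compare thee\n"  # toy example, not the real corpus
windows, next_chars = make_windows(toy_text, maxlen=5, step=3)

for w, c in zip(windows, next_chars):
    print(repr(w) + " -> " + repr(c))

# The output layer only needs one unit per distinct character.
vocab = sorted(set(toy_text))
print("vocabulary size: %d" % len(vocab))
```

Each window becomes one training input and the following character becomes its label, so the softmax layer has one unit per distinct character rather than one per word.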

In [1]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM

import tensorflow as tf
import numpy as np
import random
import sys
import re
import string

Using TensorFlow backend.


I added some other poems by Shakespeare to boost the training data. These poems have no specific rhyme or meter, though.

In [2]:
maxlen = 25 # Window size

In [4]:
files = ['../data/shakespeare.txt', '../data/shakespeare_xtra.txt']
text = ''

for filename in files:
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if len(line) > 0 and not line.isdigit():
                line = line.translate(None, ':;,.!()?')
                text += '$' + line.lower() + '\n'

In [5]:
chars = set(text)
print('total chars:', len(chars))

('total chars:', 31)


In [6]:
len(text)

240838

In [88]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [18]:
#pickle.dump(char_indices, open("char_indic.json", "w"))
#pickle.dump(indices_char, open("indic_char.json", "w"))

In [90]:
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
    
print('Number sequences:', len(sentences))

('Number sequences:', 80271)


In [92]:
# One-hot encode the windows and targets (builtin bool; np.bool is a deprecated alias)
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

In [93]:
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.4))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.4))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

In [6]:
def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    # Lower temperatures favor the most likely characters.

    with np.errstate(divide='ignore', invalid='ignore'):
        preds = np.asarray(preds).astype('float64')
        # log(0) gives -inf here; exp() below maps it back to probability 0
        preds = np.log(preds) / temperature

        # Guard against +inf; for temperature > 0 this never triggers
        preds[preds == np.inf] = 0

        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)

    return np.argmax(np.random.multinomial(1, preds, 1))
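
The `temperature` argument (called "diversity" in the generation loops) controls how adventurous the sampling is. As a rough illustration, the sketch below rescales a made-up probability vector at several temperatures using the same log/exp transformation as `sample`. The helper name `rescale` and the example probabilities are assumptions, not values from the trained model.

```python
import numpy as np


def rescale(preds, temperature):
    """Apply the log / temperature / renormalize step used by sample()."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)


toy_preds = [0.6, 0.3, 0.1]  # made-up distribution over three characters

for temperature in [0.2, 0.5, 1.0]:
    print("%.1f %s" % (temperature, np.round(rescale(toy_preds, temperature), 3)))
```

Low temperatures push most of the probability mass onto the likeliest character, which tends to produce repetitive text, while a temperature of 1.0 leaves the distribution unchanged.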

Keras is running on the TensorFlow backend, so it will automatically use a GPU if one is available. Unfortunately, my computer doesn't have a GPU it can train on.

In [95]:
# train the model, output generated text after each iteration
for iteration in range(1, 10):
    print
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X, y, batch_size=128, nb_epoch=1)

    generated = '$'*15 + 'o love is '
    sentence = generated
    print('----- Generating with start: %s \n' %generated)
    
    for diversity in [0.2, 0.5, 1.0]:
        print
        print('----- Diversity:', diversity)
        
        sys.stdout.write(generated[15:])
        for i in range(150):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            if (next_char != '$'):
                sys.stdout.write(next_char)
            sys.stdout.flush()
        print


--------------------------------------------------
('Iteration', 1)
Epoch 1/1
----- Generating with start: $$$$$$$$$$$$$$$o love is  


('----- Diversity:', 0.2)
o love is then then then the sore
then then then then then then the the sile the thee the soun the sore
then then then then then thee the the the the sore the

('----- Diversity:', 0.5)
o love is then then then the sore
$then then then then then then the the sile the thee the soun the sore
$then then then then then thee the the the the sore the thee as in ind cond
then then sore the loaks of fier
of than so then thes then sing in thee his af tone thes and tho gour
then thes non the soan i

('----- Diversity:', 1.0)
o love is then then then the sore
$then then then then then then the the sile the thee the soun the sore
$then then then then then thee the the the the sore the thee as in ind cond
$then then sore the loaks of fier
$of than so then thes then sing in thee his af tone thes and tho gour
$then thes non the soan ice gor

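The model obviously needs more training iterations, but it is starting to recognize short words by iteration 10. Online sources say all the words will be correct by iteration 35.

I messed up the printing: `generated` and `sentence` are set once per iteration, outside the `for diversity` loop, so each diversity level continues from the text produced by the previous one instead of starting again from the seed. Both assignments should have been inside the `for diversity` loop.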
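
A minimal sketch of the corrected structure, where every diversity level starts again from the same seed, might look like the following. To keep it runnable without the trained network, `toy_predict` is a hypothetical stand-in for `model.predict`, and the toy vocabulary is an assumption; only the loop structure is the point.

```python
import numpy as np

np.random.seed(0)
chars = sorted(set("$ abcdefghijklmnopqrstuvwxyz\n"))  # toy vocabulary (assumption)
maxlen = 25


def toy_predict(window):
    """Hypothetical stand-in for model.predict: a fixed, non-uniform distribution."""
    weights = np.arange(1, len(chars) + 1, dtype="float64")
    return weights / weights.sum()


def sample(preds, temperature=1.0):
    """Sample an index after reshaping the distribution by temperature."""
    preds = np.log(np.asarray(preds, dtype="float64")) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return np.argmax(np.random.multinomial(1, preds, 1))


def generate(seed, diversity, n_chars=150):
    generated = seed  # fresh copy of the seed for this diversity
    sentence = seed   # rolling window fed to the model
    for _ in range(n_chars):
        next_index = sample(toy_predict(sentence), diversity)
        next_char = chars[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char
    return generated


seed = '$' * 15 + 'o love is '
outputs = {}
for diversity in [0.2, 0.5, 1.0]:
    # State is rebuilt inside the loop, so samples do not leak across diversities.
    outputs[diversity] = generate(seed, diversity)
    print('----- Diversity: %s' % diversity)
    print(outputs[diversity][15:].replace('$', ''))
```

Moving the state into a function makes the reset explicit: each call begins from the seed, so the three samples can be compared directly.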

## Load a Char-RNN, and Train Further

Load a model that was already trained for 30 iterations and train it further. This model is trained to predict text backwards (each character from the characters that follow it), which will be useful when we want to build lines from the end for rhymes.

In [2]:
from keras.models import model_from_json

In [12]:
with open('../models/backwards_char_rnn.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)

# load weights into new model
loaded_model.load_weights("../models/backwards_char_rnn.h5")
loaded_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
print("Loaded model from disk")

Loaded model from disk


In [None]:
files = ['../data/shakespeare.txt', '../data/shakespeare_xtra.txt', 
        '../data/spenser.txt']
text = ''

for filename in files:
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if len(line) > 0 and not line.isdigit():
                line = line.translate(None, ':;,.!()?&')
                text += '$' + line.lower() + '\n'

chars = set(text)
print('Total chars:', len(chars))

char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

maxlen = 25 # Window size

# Train in reverse, so we can construct lines from the back for rhyme
step = 1
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i + maxlen: i: -1])
    next_chars.append(text[i])
    
print('Number sequences:', len(sentences))
    
# One hot encode sequences
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

('Total chars:', 31)
('Number sequences:', 294665)
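
To see what the reversed slicing produces, here is a small sketch on a toy string. Each window holds the characters that come after position `i`, in reverse order, and the target is the character at position `i`. The helper name `make_reversed_windows`, the toy text, and the short window size are assumptions for illustration.

```python
def make_reversed_windows(text, maxlen, step=1):
    """Return (windows, prev_chars) for a model that predicts text backwards.

    Hypothetical helper mirroring the slicing used in the cell above.
    """
    windows, prev_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i + maxlen:i:-1])  # text[i+1 .. i+maxlen], reversed
        prev_chars.append(text[i])             # target: the character just before
    return windows, prev_chars


toy_text = "$and love's embrace\n"  # toy example, not the real corpus
windows, prev_chars = make_reversed_windows(toy_text, maxlen=8)

for w, c in zip(windows[:3], prev_chars[:3]):
    print(repr(w) + " -> " + repr(c))
```

Because the last character of each window is the one adjacent to the target, generation can start from a fixed line ending and grow the line toward its beginning.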


In [None]:
# Train the model, output generated text after each iteration
for iteration in range(1, 10):
    print
    print('-' * 50)
    print('Iteration', iteration)
    loaded_model.fit(X, y, batch_size=128, nb_epoch=1)

    
    for diversity in [0.2, 0.5, 1.0]:
        print
        print('----- Diversity:', diversity)
        
        generated = " and love's embrace\n"
        sentence = '$'*(maxlen - len(generated)) + generated[::-1]
        print('----- Generating with end: %s' %generated)
    
        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            preds = loaded_model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]
                
            generated = next_char + generated
            sentence = sentence[1:] + next_char
        
        print(generated)
        print


--------------------------------------------------
('Iteration', 1)
Epoch 1/1

In [8]:
# serialize model to JSON
model_json = loaded_model.to_json()
with open("../models/backwards_char_rnn.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
loaded_model.save_weights("../models/backwards_char_rnn.h5")
print("Saved model to disk")

Saved model to disk


In [9]:
diversity = 0.2

generated = " and love's embrace\n"
sentence = '$'*(maxlen - len(generated)) + generated[::-1]
print('----- Generating with end: %s' %generated)
    
for i in range(400):
    x = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(sentence):
        x[0, t, char_indices[char]] = 1.

    preds = loaded_model.predict(x, verbose=0)[0]
    next_index = sample(preds, diversity)
    next_char = indices_char[next_index]
                
    generated = next_char + generated
    sentence = sentence[1:] + next_char
        
print(generated)

----- Generating with end:  and love's embrace

all in in mullers'
$with his defore have that from thence doth love and even like
$that one thigg inst me firest
$let the comprace of her power to thee
$but that my if thou blent in thee more might trous lints of feathers then to the wert of ever thee and mind
$and that excuse their rant constire
$to entous to with my how desire
$and preguales her theek her heaven yet doth gaind
$and when a hounds and love's embrace

