# Let's train an LSTM to mimic the writings of Nietzsche

We're going to use [Keras](https://keras.io) to generate Nietzsche-like text. At least 20 epochs are required before the generated text starts sounding coherent.

It is recommended to run this script on a GPU, as recurrent
networks are quite computationally intensive.

If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.


### Let's check out the corpus

In [26]:
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM
from keras.optimizers import RMSprop, Adam
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys

# Read the entire file containing Nietzsche's works
path = './data/nietzsche.txt'
with open(path) as f:
    text = f.read().lower()

# Output the length of the corpus
print('corpus length:', len(text))

# Create a sorted list of the characters
chars = sorted(list(set(text)))
print('total chars:', len(chars))

corpus length: 600901
total chars: 59


## Create the overlapping windows with target characters

In [2]:
# Create a dictionary where given a character, you can look up the index and vice versa
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []

# Step through the text 3 characters at a time, taking a sequence of 40 characters each time.
# Consecutive sequences overlap heavily
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen]) # range from current index i for max length characters 
    next_chars.append(text[i + maxlen]) # the next character after that 
print('Number of sequences:', len(sentences))

Number of sequences: 200287
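
To make the windowing concrete, here is a minimal sketch of the same loop on a toy string. The helper name `make_windows` and the toy values (`maxlen=4`, `step=3`) are assumptions for illustration and are not part of the notebook.

```python
def make_windows(text, maxlen, step):
    """Return (windows, targets): each window holds maxlen characters,
    and each target is the single character that follows its window."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])
        targets.append(text[i + maxlen])
    return windows, targets


# Toy run: windows of 4 characters, advancing 3 characters per step
toy_windows, toy_targets = make_windows("hello world", maxlen=4, step=3)
for window, target in zip(toy_windows, toy_targets):
    print(repr(window), '->', repr(target))
# 'hell' -> 'o'
# 'lo w' -> 'o'
# 'worl' -> 'd'
```

The same arithmetic explains the count printed above: with a corpus of 600901 characters, `len(range(0, 600901 - 40, 3))` is 200287.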


## Generate the one-hot vectors for each character

In [25]:
print('Vectorization...')
# Use the built-in bool: the np.bool alias was removed in NumPy 1.24
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
print('Finished creating vectors')
print('Size of patterns:', len(X[0]))


Vectorization...
Finished creating vectors
Size of patterns: 40
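
As a small illustration of what the vectorization produces, the sketch below one-hot encodes a three-character string over a toy alphabet. The names `toy_chars`, `toy_index` and `one_hot` are hypothetical; the notebook itself fills `X` and `y` in place rather than using a helper.

```python
import numpy as np

# Toy alphabet built the same way as `chars`: sorted unique characters
toy_chars = sorted(set("abca"))  # ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}


def one_hot(sequence, index):
    """Encode a string as a (len(sequence), len(index)) boolean matrix
    with exactly one True per row."""
    encoded = np.zeros((len(sequence), len(index)), dtype=bool)
    for t, char in enumerate(sequence):
        encoded[t, index[char]] = True
    return encoded


toy_encoded = one_hot("cab", toy_index)
print(toy_encoded.astype(int))
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]]
```

At full scale the input tensor has shape (200287, 40, 59), which at one byte per boolean comes to roughly 470 MB, so the boolean dtype is a sensible choice here.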


## Build the LSTM model

In [27]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

print("Compiling model complete...")


Build model...
Compiling model complete...
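
For a sense of how large this model is, the following back-of-the-envelope sketch counts its trainable parameters. It assumes the standard LSTM formulation (four gates, each with an input kernel, a recurrent kernel and one bias vector) and a plain fully connected output layer; the helper names are hypothetical, and `model.summary()` remains the authoritative check.

```python
def lstm_param_count(units, input_dim):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(inputs, outputs):
    """One weight per input/output pair plus one bias per output."""
    return inputs * outputs + outputs


n_chars = 59  # 'total chars' reported for this corpus
lstm_params = lstm_param_count(128, n_chars)
dense_params = dense_param_count(128, n_chars)
print(lstm_params, dense_params, lstm_params + dense_params)
# 96256 7611 103867
```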


### Helper function to sample an index from a probability array
The purpose of this function is to add some randomness so that the most likely character is not always chosen, and sometimes the 2nd or 3rd most likely character is chosen instead. Lower temperatures make the choice more conservative; higher temperatures make it more random.

In [5]:

def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
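
To see what the temperature does before any random draw happens, the sketch below applies only the rescaling step of `sample` to a made-up three-way distribution. The function name `apply_temperature` and the values in `toy_preds` are assumptions for illustration.

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector the same way `sample` does,
    but return the reshaped distribution instead of drawing from it."""
    preds = np.asarray(preds).astype('float64')
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)


toy_preds = [0.6, 0.3, 0.1]
for temperature in (0.2, 0.5, 1.0, 2.0):
    print(temperature, np.round(apply_temperature(toy_preds, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged. Values below 1.0 concentrate probability on the most likely character, and values above 1.0 flatten the distribution toward uniform.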

### And now the actual training...

In [6]:
diversity = 0.5
print('Diversity: ', diversity)

# The training
# Note: nb_epoch is the Keras 1 argument name; Keras 2 and later call it epochs
print('Training...')
history = model.fit(X, y, batch_size=128, nb_epoch=20)

# Save the model: weights and architecture go to separate files
model.save_weights('nietzsche.weights')
model_json = model.to_json()
with open('nietzsche.json', 'w') as f:
    f.write(model_json)

Diversity:  0.5
Training...
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Generating with random seed: "even more terrifying than other peoples "
even more terrifying than other peoples as the act of the aristically courable strengs of
the strength the consequently present accourt and soul end and merial is not be not even the art a states and soul estauthous in the schoped of the most dissintured the found and the could be fact, of the happent how the strength and love of man is not be does for all the respect of the seems
extained are such difficult of the sension us for him la

AttributeError: 'Sequential' object has no attribute 'save'

In [31]:
# Load the saved weights
model.load_weights('nietzsche.weights')

In [32]:
# Check out what our model predicts
sentence = 'those who will not inherit the kingdom o'
x = np.zeros((1, maxlen, len(chars)))
for t, char in enumerate(sentence):
    x[0, t, char_indices[char]] = 1.
    
print(model.predict(x, verbose=0)[0])
print(sum(model.predict(x, verbose=0)[0]))

[  3.77245560e-05   9.27041256e-05   7.33234060e-07   4.18211948e-06
   1.05271240e-04   3.59504284e-08   2.01290334e-07   3.80812162e-06
   1.42704766e-06   1.96695760e-06   1.09209374e-07   3.10382461e-06
   4.49063776e-07   2.25470080e-06   2.67015253e-06   1.75648040e-06
   1.06952356e-07   3.99579960e-08   1.30287196e-06   6.63608333e-08
   2.26742031e-07   2.24635201e-07   1.10115372e-09   2.13744428e-07
   4.29951001e-07   2.12691859e-07   4.26919016e-08   1.63242130e-05
   1.55780872e-03   1.91174258e-04   1.33617796e-04   2.30841706e-05
   9.20703173e-01   4.84245975e-04   4.31293120e-05   1.13168990e-05
   6.95453991e-06   7.76934394e-06   2.17955303e-03   3.47333873e-04
   4.44710515e-02   6.89590699e-04   9.03360546e-04   2.25398549e-06
   1.22935968e-02   5.78581297e-04   3.76695441e-03   4.58054664e-03
   4.46888758e-03   2.20877700e-03   6.59294019e-05   3.15023794e-06
   4.88524137e-08   3.31453476e-11   1.98800629e-11   7.59474688e-08
   3.16799781e-11   2.32910392e-12

In [33]:
generated = ''
original = sentence
# Predict the next 400 characters based on the seed
for i in range(400):
    x = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(sentence):
        x[0, t, char_indices[char]] = 1.

    preds = model.predict(x, verbose=0)[0]
    next_index = sample(preds, diversity)
    next_char = indices_char[next_index]

    generated += next_char
    sentence = sentence[1:] + next_char

print(original + generated)


those who will not inherit the kingdom of the prousing profound to whose things of the strengthing itself, which has his own and as a distinctions intellect, as the secting the been the stand is even his men of the sense is not only the sense. the religions and god and not of human interest and sufferen and about that is the spirit and its above the man in the spees as the ape forth to the extent is the seld means of religion, the "prom
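
The generation loop above depends on the seed always staying exactly `maxlen` characters long (the seed used here is 40 characters, matching `maxlen`). A minimal sketch of that rolling update, with a fixed string standing in for the characters the model would sample, might look like this. The helper `roll_seed` and the toy values are hypothetical.

```python
def roll_seed(seed, next_char):
    """Drop the oldest character and append the new one,
    so the seed keeps a fixed length."""
    return seed[1:] + next_char


seed = "abcd"
generated = ""
for next_char in "xyz":  # stand-in for characters sampled from the model
    generated += next_char
    seed = roll_seed(seed, next_char)

print(seed, generated)
# dxyz xyz
```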
