Example script to generate text from Nietzsche's writings.

At least 20 epochs are required before the generated text
starts sounding coherent.

It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.

If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.

In [1]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys

path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read().lower()    # read the whole corpus as a single lowercase string
print('corpus length:', len(text))

unique_chars = sorted(set(text))
print('total unique chars:', len(unique_chars))
char_indices = {c: i for i, c in enumerate(unique_chars)}  # char -> integer id
indices_char = {i: c for i, c in enumerate(unique_chars)}  # integer id -> char

Using TensorFlow backend.


corpus length: 600893
total unique chars: 57
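
The two dictionaries are inverse lookups: `char_indices` maps a character to an integer id and `indices_char` maps the id back. A minimal sketch on a toy string, where `toy_text`, `encode` and `decode` are illustrative names that do not appear in the notebook:

```python
toy_text = "hello world"  # hypothetical stand-in for the Nietzsche corpus

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

def encode(s):
    """Map each character of s to its integer id."""
    return [toy_char_indices[c] for c in s]

def decode(ids):
    """Map integer ids back to the characters they stand for."""
    return "".join(toy_indices_char[i] for i in ids)

ids = encode("hello")
print(ids)
print(decode(ids))
```

Because the character list is sorted before numbering, the ids are stable across runs on the same corpus.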


In [2]:
def print_sentence_array(sentences):
    print("Contents of sentences:")
    for i, s in enumerate(sentences):
        print(i, ":    ", s)

In [3]:
# cut the text into semi-redundant (overlapping) sequences of sequence_length characters
sequence_length = 60
sequence_stride = 10
sequences = [] 
next_chars = []

for i in range(0, len(text) - sequence_length, sequence_stride):
    sequences.append(text[i: i + sequence_length])
    # next_chars holds the single character that follows the sequence appended above.
    next_chars.append(text[i + sequence_length])
assert len(sequences) == len(next_chars)
print('nb sequences:', len(sequences))
print_sentence_array(sequences[:10])

nb sequences: 60084
Contents of sentences:
0 :     preface


supposing that truth is a woman--what then? is the
1 :     supposing that truth is a woman--what then? is there not gro
2 :     that truth is a woman--what then? is there not ground
for su
3 :      is a woman--what then? is there not ground
for suspecting t
4 :     n--what then? is there not ground
for suspecting that all ph
5 :     en? is there not ground
for suspecting that all philosophers
6 :     re not ground
for suspecting that all philosophers, in so fa
7 :     und
for suspecting that all philosophers, in so far as they 
8 :     specting that all philosophers, in so far as they have been

9 :     hat all philosophers, in so far as they have been
dogmatists
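
The windowing is easier to follow on a short string. The sketch below uses the same slicing as the loop in this cell; `make_windows` is a hypothetical helper name, and the toy string and sizes are made up for illustration. The last line reuses the corpus length, `sequence_length` and `sequence_stride` from the notebook to count the windows.

```python
def make_windows(text, length, stride):
    """Return (windows, next_chars) using the same slicing as the notebook loop."""
    windows, targets = [], []
    for i in range(0, len(text) - length, stride):
        windows.append(text[i:i + length])   # the input sequence
        targets.append(text[i + length])     # the character to predict
    return windows, targets

windows, targets = make_windows("abcdefghij", length=4, stride=2)
for w, t in zip(windows, targets):
    print(repr(w), "->", repr(t))

# Number of windows for the real corpus: 600893 chars, length 60, stride 10.
n_corpus_windows = len(range(0, 600893 - 60, 10))
print(n_corpus_windows)
```

With a stride smaller than the window length, consecutive windows overlap, which is why the sequences are described as semi-redundant.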


In [4]:
print('Vectorization...')
# X is a boolean grid over the unique characters: every single char in sequences is one-hot encoded.
# The builtin bool is used as dtype because the np.bool alias is not available in every NumPy version.
X = np.zeros((len(sequences), sequence_length, len(unique_chars)), dtype=bool)
y = np.zeros((len(next_chars), len(unique_chars)), dtype=bool)
for i_seq, seq in enumerate(sequences):
    for i_char, char in enumerate(seq):
        # One-hot encode. Don't mess with this, it's faster.
        X[i_seq, i_char, char_indices[char]] = 1
    y[i_seq, char_indices[next_chars[i_seq]]] = 1

Vectorization...


Note that a Keras recurrent layer expects its input as a
    3D tensor with shape `(nb_samples, timesteps, input_dim)`.

In the code below, 'input_shape' is the shape of a single sample, i.e. one of the len(sequences) boolean grids in X; the leading nb\_samples dimension is left out. Here timesteps <--> sequence\_length, and input\_dim <--> len(unique\_chars). This matches how LSTMs/RNNs work: the network consumes one input vector (here, one one-hot character) per timestep.
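
To make the shape convention concrete, here is a dependency-free sketch in which nested lists stand in for the NumPy array X. The names `toy_sequences`, `toy_index` and `toy_X` are illustrative and not part of the notebook.

```python
toy_sequences = ["abca", "bcab"]                  # 2 samples, 4 timesteps each
toy_chars = sorted(set("".join(toy_sequences)))   # 3 unique characters
toy_index = {c: i for i, c in enumerate(toy_chars)}

# One-hot encode: toy_X[sample][timestep][char_id] is 1 for the char at that position.
toy_X = [[[1 if toy_index[ch] == k else 0 for k in range(len(toy_chars))]
          for ch in seq]
         for seq in toy_sequences]

shape = (len(toy_X), len(toy_X[0]), len(toy_X[0][0]))
print(shape)              # (nb_samples, timesteps, input_dim)

input_shape = shape[1:]   # per-sample shape, without the nb_samples dimension
print(input_shape)
```

The per-sample tuple is the analogue of `(sequence_length, len(unique_chars))` passed as `input_shape` when the model is built.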

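The model is a three-stage pipeline:
1. LSTM. Its output\_dim is effectively what we'd call 'hidden\_dim': the size of the vector the LSTM emits after reading the whole sequence.
2. Dense. Projects from 'hidden\_dim' back to len(unique\_chars), so the result can be interpreted as unnormalized log probabilities (logits), one per character.
3. Softmax. Normalizes the logits into a probability distribution over the next character, as usual.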
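
Steps 2 and 3 can be sketched without Keras. The block below is a plain-Python illustration of a dense projection followed by a softmax; the hidden vector, weights and bias are made-up numbers, and `dense` and `softmax` are hypothetical helper names rather than the library's implementation.

```python
import math

def dense(hidden, weights, bias):
    """Project a hidden vector to one logit per character: W h + b."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [0.5, -1.0]                             # pretend LSTM output, hidden_dim = 2
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 characters x hidden_dim
bias = [0.0, 0.0, 0.0]

logits = dense(hidden, weights, bias)
probs = softmax(logits)
print(logits)
print(probs)
```

The largest logit always receives the largest probability, and the probabilities sum to 1, which is what lets the output be sampled as a distribution over characters.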

In [5]:
def sample_char(preds, temperature=1.0):
    """ Helper function to sample a character from a probability array. """
    
    # preds are the model's softmax outputs (floats, not booleans);
    # cast to float64 for the log/exp arithmetic and renormalization below.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(n=1, pvals=preds, size=1)
    return indices_char[np.argmax(probas)]

def one_hot(sequence):
    res = np.zeros((1, sequence_length, len(unique_chars)))
    for t, char in enumerate(sequence):
        res[0, t, char_indices[char]] = 1.
    return res
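
The effect of the `temperature` argument in `sample_char` is easier to see on a fixed distribution. This sketch repeats the same log, divide, exponentiate, renormalize steps in plain Python; `apply_temperature` is a hypothetical name and the three probabilities are made up.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale probabilities as p_i ** (1 / T), then renormalize to sum to 1."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.6, 0.3, 0.1]
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

A temperature below 1 concentrates mass on the most likely character (more repetitive, safer text), a temperature of 1 leaves the distribution unchanged, and a temperature above 1 flattens it (more varied, more error-prone text).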

In [7]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(output_dim=256, input_shape=(sequence_length, len(unique_chars))))
model.add(Dense(output_dim=len(unique_chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Build model...


In [15]:
def generate_sentence(seed="when i think about life", temperature=0.5):
    """Generate 400 characters, starting from a random window of the corpus.

    Note: `seed` is currently ignored; the commented-out line below would use it.
    """
    start_index = random.randint(0, len(text) - sequence_length - 1)

    print('\nTemperature:', temperature)
    sentence = text[start_index: start_index + sequence_length]
    #sentence = seed
    generated = sentence
    print('Generating with seed: "' + sentence + '"')

    for i in range(400):
        x = one_hot(sentence)
        preds = model.predict(x, verbose=0)[0]
        next_char = sample_char(preds, temperature)
        generated += next_char
        # Slide the window: drop the oldest character, append the new one.
        sentence = sentence[1:] + next_char
        #print(next_char, end='')
        #sys.stdout.flush()
    return generated

In [17]:
# train the model for 7 passes over the data, one epoch per fit() call
for iteration in range(1, 8):
    print('Iteration', iteration)
    model.fit(X, y, batch_size=128, nb_epoch=1)

Iteration 1
Epoch 1/1
Iteration 2
Epoch 1/1
Iteration 3
Epoch 1/1
Iteration 4
Epoch 1/1
Iteration 5
Epoch 1/1
Iteration 6
Epoch 1/1
Iteration 7
Epoch 1/1


In [18]:
print(generate_sentence(temperature=0.2))
print(generate_sentence(temperature=1.0))


Temperature: 0.2
Generating with seed: "of the emotions.

188. in contrast to laisser-aller, every s"
of the emotions.

188. in contrast to laisser-aller, every self-sorious the most still of the soul-supposent that is a perious the same time a deement has not be at the sense of the spiritual can not to be at the sentiments of the spiritually conside and socituality of the spiritual can a been moral
standandary of the sense of the spiritual
of the spiritual
of the soul-supposent and being a most stranges, and some present and socitual
-and the spiritual
of

Temperature: 1.0
Generating with seed: "nce does not dominate,
but, instead of it, the old, trite "m"
nce does not dominate,
but, instead of it, the old, trite "me"ous tolliemes, relogicism and
conside who astumpt blind.
divicutue, and hastergereni.
aldostumate of the e
fthey world"ne sensed unybut
from the nemiorable good in his commsnomes, for the world--and like its sorvanding emoughtered ty being and more emplation, in suchste 