# LSTM Text Generation
In this notebook I use a recurrent neural network (an LSTM) to generate text one character at a time. The model learns a character-level language model from the works of the German philosopher Nietzsche, translated into English.

In [1]:
import tensorflow as tf
from tensorflow.keras import models, layers
import numpy as np
import random
import sys

In [2]:
## DATA IMPORT
path = tf.keras.utils.get_file('nietzsche.txt',
                               origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')

with open(path) as file:
    text = file.read().lower()

print('Length of Corpus: ', len(text))

Length of Corpus:  600901


In [3]:
print(text[:1000])

preface


supposing that truth is a woman--what then? is there not ground
for suspecting that all philosophers, in so far as they have been
dogmatists, have failed to understand women--that the terrible
seriousness and clumsy importunity with which they have usually paid
their addresses to truth, have been unskilled and unseemly methods for
winning a woman? certainly she has never allowed herself to be won; and
at present every kind of dogma stands with sad and discouraged mien--if,
indeed, it stands at all! for there are scoffers who maintain that it
has fallen, that all dogma lies on the ground--nay more, that it is at
its last gasp. but to speak seriously, there are good grounds for hoping
that all dogmatizing in philosophy, whatever solemn, whatever conclusive
and decided airs it has assumed, may have been only a noble puerilism
and tyronism; and probably the time is at hand when it will be once
and again understood what has actually sufficed for the basis of such
imposing and abso

In [4]:
text = text.replace('\n', '')  #Remove newline chars
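
Removing newlines with an empty string joins the last word of one line to the first word of the next, which is why run-together words such as "newpsychologist" appear in the seeds further down. A minimal sketch of the difference, using a made-up two-line string (`sample_text` is hypothetical, and replacing with a space is an alternative, not what this notebook does):

```python
# Hypothetical two-line sample, not taken from the corpus
sample_text = "the new\npsychologist is"

joined = sample_text.replace('\n', '')   # what the cell above does
spaced = sample_text.replace('\n', ' ')  # alternative that keeps word boundaries

print(joined)  # the newpsychologist is
print(spaced)  # the new psychologist is
```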

In [5]:
## VECTORIZE CHARACTER SEQUENCES

maxlen = 60  #num chars per sequence
step = 3     #sequence sample frequency

sentences = []   #stores sequences
next_chars = []  #prediction targets

for i in range(0, len(text)-maxlen, step):
    sentence = text[i:i+maxlen]
    sentences.append(sentence)
    next_chars.append(text[i+maxlen])

print('Num of Sequences: ', len(sentences))

chars = sorted(set(text))  #Unique chars
print('Unique character count: ', len(chars))
char_indices = {char: idx for idx, char in enumerate(chars)}

# Vectorize input data into shape (num_sequences, maxlen, len(chars))
# One-hot encoded scheme of character sequences
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)  # np.bool was removed in NumPy 1.24
y = np.zeros((len(sentences), len(chars)))

for i, sentence in enumerate(sentences):
    for j, char in enumerate(sentence):
        char_idx = char_indices[char]
        x[i,j,char_idx] = 1
    y[i,char_indices[next_chars[i]]] = 1

Num of Sequences:  196969
Unique character count:  58
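
To make the windowing and one-hot encoding concrete, here is a small sketch of the same procedure on a toy string. The names `toy_text`, `toy_maxlen`, and `toy_step` are hypothetical and chosen only so the result is small enough to inspect by hand.

```python
import numpy as np

toy_text = "hello world"      # hypothetical stand-in for the corpus
toy_maxlen, toy_step = 4, 3   # small values so the result is easy to inspect

# Slide a window of toy_maxlen chars over the text, moving toy_step chars each time
starts = range(0, len(toy_text) - toy_maxlen, toy_step)
toy_sentences = [toy_text[i:i + toy_maxlen] for i in starts]
toy_next_chars = [toy_text[i + toy_maxlen] for i in starts]

toy_chars = sorted(set(toy_text))
toy_char_indices = {char: idx for idx, char in enumerate(toy_chars)}

# One-hot encode: x has one True per (sequence, timestep), y has one 1 per sequence
toy_x = np.zeros((len(toy_sentences), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sentences), len(toy_chars)))
for i, sentence in enumerate(toy_sentences):
    for j, char in enumerate(sentence):
        toy_x[i, j, toy_char_indices[char]] = 1
    toy_y[i, toy_char_indices[toy_next_chars[i]]] = 1

print(toy_sentences)             # ['hell', 'lo w', 'worl']
print(toy_next_chars)            # ['o', 'o', 'd']
print(toy_x.shape, toy_y.shape)  # (3, 4, 8) (3, 8)
```

Each window overlaps the previous one by `toy_maxlen - toy_step` characters, and the target is always the single character that follows the window.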


In [6]:
## DEFINE LSTM MODEL
model = models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    loss='categorical_crossentropy'
)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 128)               95744     
_________________________________________________________________
dense (Dense)                (None, 58)                7482      
Total params: 103,226
Trainable params: 103,226
Non-trainable params: 0
_________________________________________________________________
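
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard Keras LSTM layout (four gates, each with input weights, recurrent weights, and one bias vector); the helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    # 4 gates, each with (input_dim + units) * units weights and units biases
    return 4 * ((input_dim + units) * units + units)

def dense_param_count(input_dim, units):
    # One weight per input/output pair, plus one bias per output
    return input_dim * units + units

vocab_size, lstm_units = 58, 128  # values from the notebook output above
lstm_params = lstm_param_count(vocab_size, lstm_units)
dense_params = dense_param_count(lstm_units, vocab_size)

print(lstm_params, dense_params, lstm_params + dense_params)
# 95744 7482 103226
```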


Next, a helper function samples a character index from the model's predicted probability distribution (the softmax output). Before sampling, the distribution is reweighted by `entropy_factor`, commonly called the sampling temperature: each probability is raised to the power `1/entropy_factor` and the result is renormalized. Factors below 1 sharpen the distribution, so the most likely characters dominate and the generated text becomes more predictable; factors above 1 flatten it toward uniform.

In [7]:
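## Helper function for character sampling
def sample(preds, entropy_factor=1.0):
    """Return a character index sampled from preds after entropy rescaling."""
    preds = np.asarray(preds).astype('float64')
    preds = np.exp(np.log(preds) / entropy_factor)  # equals preds ** (1 / entropy_factor)
    preds = preds / np.sum(preds)                   # renormalize so the values sum to 1
    probas = np.random.multinomial(1, preds, 1)     # one draw, returned as a one-hot row
    return np.argmax(probas)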
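
To see what the rescaling does before any sampling happens, the sketch below applies the same transformation to a small, made-up distribution and reports its Shannon entropy. `rescale`, `shannon_entropy`, and `toy_preds` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def rescale(preds, entropy_factor):
    # Same transformation as in sample(): p ** (1 / entropy_factor), renormalized
    preds = np.asarray(preds, dtype='float64')
    scaled = np.exp(np.log(preds) / entropy_factor)
    return scaled / scaled.sum()

def shannon_entropy(p):
    # Entropy in nats; assumes every probability is strictly positive
    return float(-np.sum(p * np.log(p)))

toy_preds = [0.5, 0.3, 0.2]  # hypothetical softmax output over three characters

for factor in [0.2, 0.5, 1.0, 2.0]:
    p = rescale(toy_preds, factor)
    print(factor, np.round(p, 3), round(shannon_entropy(p), 3))
```

A factor of 1.0 leaves the distribution unchanged; smaller factors concentrate the probability mass on the most likely character, and larger factors spread it out.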
    

In [8]:
char_indices.keys()

dict_keys([' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¤', '¦', '©', '«', 'ã', '†'])

In [9]:
## TEXT-GENERATION
num_epochs = 60
batch_size = 128
entropy_list = [0.2, 0.5, 1.0]  #Different values to generate text with
gen_char_limit = 400  #Number of characters to generate from seed

epochs_to_print_txt = [1, 20, 40, 60]

for epoch in range(1, num_epochs+1):
    print('\n epoch #', epoch)
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    start_index = random.randint(0, len(text)-maxlen-1)
    generated_text = text[start_index : start_index+maxlen]
    
    if epoch in epochs_to_print_txt:
        print('--- Generating text with seed: "', generated_text, '"')
    
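        for entropy in entropy_list:
            # NOTE: generated_text is not reset to the seed here, so each
            # entropy factor continues from the previous run's last maxlen chars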
            print('\n----- entropy factor:', entropy)
            sys.stdout.write(generated_text)
            sys.stdout.flush()
            for i in range(gen_char_limit):
                #One-hot encoding input sequences
                sampled = np.zeros((1, maxlen, len(chars)))
                for t, char in enumerate(generated_text):
                    sampled[0, t, char_indices[char]] = 1.

                #Sample next char from prediction
                preds = model.predict(sampled, verbose=0)[0]
                #print(preds)
                next_index = sample(preds, entropy)
                next_char = chars[next_index]

                # append new char, keep input size the same
                generated_text += next_char
                generated_text = generated_text[1:]

                sys.stdout.write(next_char)
                sys.stdout.flush()
            


 epoch # 1
--- Generating text with seed: " ghts in science. in that the newpsychologist is about to put "

----- entropy factor: 0.2
ghts in science. in that the newpsychologist is about to puttent and the self the self and the self the contention and the self the self the sulfer and the such the such and the most the the self and the sulf the self the self the self the man a self the self the self the content of the self the contoth of the such the all the self the self the self the self and the content and the pristion and the simply be a the most be the the self the such and the pris
----- entropy factor: 0.5
e simply be a the most be the the self the such and the prise the for the the with the considely the simplefiently that should the intestlives which the praser itself the self the many is for the world the scoulds and meth his realing the one a a word prastion of the moral and inter intercestore and the con or the cortion effic or the a the is a philosopher the contuding who 


he wayd art along by thinks todamplentions which the seem. to by the foribly lopments abonewords. otherse the decession--notonugual justice of doculousand laking of the "humility, of equality are flow
 epoch # 21

 epoch # 22

 epoch # 23

 epoch # 24

 epoch # 25

 epoch # 26

 epoch # 27

 epoch # 28

 epoch # 29

 epoch # 30

 epoch # 31

 epoch # 32

 epoch # 33

 epoch # 34

 epoch # 35

 epoch # 36

 epoch # 37

 epoch # 38

 epoch # 39

 epoch # 40
--- Generating text with seed: " t of certainty that will and action aresomehow one; he ascri "

----- entropy factor: 0.2
t of certainty that will and action aresomehow one; he ascribed the sentically the stall and the more the most and such as the same deceives in the superiority, the soul, not be the sentically the most such as the sentiment in the problem of the same such the sentically the same and self-arright and his same case to the soul, as the sentically the sentically the sentically the same same such as the soul, and the s
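
In the output above, the 0.5 run for epoch 1 starts from the tail of the 0.2 run rather than from the printed seed, because the loop keeps mutating `generated_text`. If every entropy factor should start from the same seed, one option is to move generation into a function that works on its own copy of the window. This is only a sketch: `generate`, `toy_predict`, and `toy_chars` are hypothetical, and `toy_predict` stands in for `model.predict(...)[0]` so the example runs without a trained model.

```python
import numpy as np

def sample(preds, entropy_factor=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.exp(np.log(preds) / entropy_factor)
    preds = preds / np.sum(preds)
    return int(np.argmax(np.random.multinomial(1, preds, 1)))

def generate(predict_fn, seed, chars, n_chars, entropy_factor):
    """Generate n_chars characters from seed without modifying the caller's seed."""
    char_indices = {c: i for i, c in enumerate(chars)}
    window = seed  # local sliding window, always len(seed) characters long
    out = []
    for _ in range(n_chars):
        # One-hot encode the current window
        x = np.zeros((1, len(window), len(chars)))
        for t, c in enumerate(window):
            x[0, t, char_indices[c]] = 1.0
        next_char = chars[sample(predict_fn(x), entropy_factor)]
        out.append(next_char)
        window = window[1:] + next_char  # drop oldest char, append newest
    return ''.join(out)

# Hypothetical stand-in for the trained model: a fixed, non-uniform distribution
toy_chars = ['a', 'b', 'c']
def toy_predict(x):
    return np.array([0.6, 0.3, 0.1])

seed = 'abca'
for factor in [0.2, 0.5, 1.0]:
    # Every factor starts from the same seed
    print(factor, seed + generate(toy_predict, seed, toy_chars, 20, factor))
```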

We find that once the network has converged, low entropy factors produce highly repetitive text built from a small set of common words, as in the 0.2 samples above. At higher entropy factors the model outputs a greater variety of words, including some that are invented rather than real English.