In this notebook I will try to generate fake text using an RNN trained on text extracted from a book.

In [1]:
# Importing necessary packages
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
from keras.utils.data_utils import get_file
import io

Using TensorFlow backend.


In [2]:
# For the reproducibility of the results
np.random.seed(42)
tf.random.set_seed(42)

We start by reading the real text, in this case the book Frankenstein, which is loaded from the file 'Frankenstein.txt':

In [3]:
# Reading the text
path = 'Frankenstein.txt'
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()
# Print the text length
print('corpus length:', len(text))

corpus length: 420636


We can print a few lines from the text:

In [4]:
# Print a little extract from the text
print(text[:500])

letter 1

_to mrs. saville, england._


st. petersburgh, dec. 11th, 17—.


you will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings.  i arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.

i am already far north of london, and as i walk in the streets of
petersburgh, i feel a cold northern breeze play upon my cheeks, whi


Now we create the vocabulary, a sorted list of the unique characters present in the text:

In [5]:
# Create the vocabulary from the text
vocabulary = sorted(set(text))

To map a character to its index and back again, these two dictionaries will be useful:

In [6]:
# Character-to-index dictionary for mapping
char_to_idx = { char : idx for idx, char in enumerate(vocabulary)}
# Index-to-character dictionary for reverse mapping
idx_to_char = {idx : char for idx, char in enumerate(vocabulary)}
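
As a quick sanity check, the two dictionaries should invert each other. The sketch below uses a short made-up string (`sample_text`, an assumption for illustration, not the book) so that it runs on its own:

```python
# Toy text standing in for the book (illustrative assumption)
sample_text = "hello world"
vocabulary = sorted(set(sample_text))

# Same two mappings as in the cell above
char_to_idx = {char: idx for idx, char in enumerate(vocabulary)}
idx_to_char = {idx: char for idx, char in enumerate(vocabulary)}

# Encode a string to indices and decode it back
encoded = [char_to_idx[c] for c in "hello"]
decoded = "".join(idx_to_char[i] for i in encoded)
print(encoded, decoded)
```

If the decoded string differs from the original, the vocabulary and the dictionaries are out of sync.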

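Now we create and fill the input and target lists from the text. Each input is a window of 40 consecutive characters, and its target is the character that follows that window: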

In [7]:
# Input and target data from raw text
input_data = []
target_data = []
maxlen = 40
for i in range(0, len(text) - maxlen):
    input_data.append(text[i : i + maxlen])
    target_data.append(text[i + maxlen])
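
To see what the loop above produces, here is the same sliding window applied to a single word with a much shorter window. The word and `maxlen = 5` are only illustrative; the notebook itself uses the full book and `maxlen = 40`:

```python
# Toy text and a short window, chosen only to keep the output readable
sample_text = "frankenstein"
maxlen = 5

input_data = []
target_data = []
for i in range(0, len(sample_text) - maxlen):
    # Window of maxlen characters starting at position i
    input_data.append(sample_text[i : i + maxlen])
    # The character right after the window is the prediction target
    target_data.append(sample_text[i + maxlen])

for window, target in zip(input_data, target_data):
    print(repr(window), '->', repr(target))
```

A text of length N yields N - maxlen pairs, and consecutive windows overlap in all but one character.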

Allocate the arrays that will hold the one-hot encoded inputs and targets:

In [8]:
# Create vectors to encode input and output data
x = np.zeros((len(input_data), maxlen, len(vocabulary)), dtype='float32')
y = np.zeros((len(target_data), len(vocabulary)), dtype='float32')
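
These arrays are large, so it is worth estimating their size before allocating them. The sketch below uses the corpus length printed earlier and a vocabulary size of 60, which I am reading off the Dense layer's output shape in the model summary further down. The `one_hot_bytes` helper is hypothetical, written just for this estimate:

```python
import numpy as np

n_samples = 420636 - 40   # corpus length minus maxlen
maxlen = 40
vocab_size = 60           # assumption: taken from the model summary output

def one_hot_bytes(dtype):
    """Bytes needed for an array of shape (n_samples, maxlen, vocab_size)."""
    return n_samples * maxlen * vocab_size * np.dtype(dtype).itemsize

for dtype in ('float32', 'bool'):
    print(dtype, round(one_hot_bytes(dtype) / 1e9, 2), 'GB')
```

With 'float32' the input array alone needs about 4 GB of RAM. A one-byte dtype would need a quarter of that, although I have not checked how it interacts with training here.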

Fill the encoded vectors:

In [9]:
# One-hot encode the input sequences and the target characters
for s_idx, sequence in enumerate(input_data):
    for idx, char in enumerate(sequence):
        x[s_idx, idx, char_to_idx[char]] = 1
    y[s_idx, char_to_idx[target_data[s_idx]]] = 1
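
The nested Python loop above is easy to read but slow on a corpus of this size. A possible alternative, sketched here on a toy string rather than the book, is to integer-encode the sequences first and then index an identity matrix with them. This is a common NumPy idiom, not what the notebook actually ran:

```python
import numpy as np

# Toy data (illustrative assumption) so the example is self-contained
sample_text = "hello world"
maxlen = 3
vocabulary = sorted(set(sample_text))
char_to_idx = {char: idx for idx, char in enumerate(vocabulary)}

n = len(sample_text) - maxlen
input_data = [sample_text[i : i + maxlen] for i in range(n)]
target_data = [sample_text[i + maxlen] for i in range(n)]

# Integer-encode first: shapes (n_samples, maxlen) and (n_samples,)
x_idx = np.array([[char_to_idx[c] for c in seq] for seq in input_data])
y_idx = np.array([char_to_idx[c] for c in target_data])

# Row k of the identity matrix is the one-hot vector for index k
eye = np.eye(len(vocabulary), dtype='float32')
x = eye[x_idx]   # shape (n_samples, maxlen, vocab_size)
y = eye[y_idx]   # shape (n_samples, vocab_size)
print(x.shape, y.shape)
```

The result has the same layout as the `x` and `y` arrays built by the loop, with exactly one 1 per character position.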

We are ready to create the model:

In [10]:
# Create the model
model = keras.models.Sequential()
model.add(keras.layers.LSTM(128, return_sequences=True, input_shape=(maxlen, len(vocabulary))))
model.add(keras.layers.LSTM(128, return_sequences=True))
model.add(keras.layers.LSTM(128))
model.add(keras.layers.Dense(len(vocabulary), activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')
# View the model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 40, 128)           96768     
_________________________________________________________________
lstm_1 (LSTM)                (None, 40, 128)           131584    
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 60)                7740      
Total params: 367,676
Trainable params: 367,676
Non-trainable params: 0
_________________________________________________________________
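
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights and a bias, and a Dense layer has one weight per input-output pair plus a bias per output. The helper names below are mine, and the vocabulary size of 60 is read from the Dense output shape:

```python
def lstm_params(input_dim, units):
    # Four gates, each with (input_dim + units) weights and 1 bias per unit
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output
    return input_dim * units + units

vocab_size = 60  # from the Dense layer's output shape
counts = [
    lstm_params(vocab_size, 128),   # lstm
    lstm_params(128, 128),          # lstm_1
    lstm_params(128, 128),          # lstm_2
    dense_params(128, vocab_size),  # dense
]
print(counts, sum(counts))
```

The four values and their sum match the "Param #" column and the total reported by `model.summary()`.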


Train the model on the preprocessed text for 5 epochs (for computational reasons):

In [11]:
# Fit the model to the input and target data
model.fit(x, y, batch_size=64, epochs=5, validation_split=0.2)

Train on 336476 samples, validate on 84120 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x28c69a91e88>
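
The sample counts in the training log follow from the numbers seen so far. According to the Keras documentation, `validation_split` takes the validation samples from the end of the arrays, before shuffling. The exact rounding used below is my assumption, though it reproduces the log:

```python
corpus_length = 420636
maxlen = 40
validation_split = 0.2

# One (window, next character) pair per starting position
n_samples = corpus_length - maxlen

# Assumed rounding: truncate the size of the training part
split_at = int(n_samples * (1 - validation_split))
n_train, n_val = split_at, n_samples - split_at
print(n_train, n_val)
```

Because the validation part is the tail of the data, it corresponds to the final chapters of the book rather than to a random sample of it.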

In [13]:
# Save the trained model
model.save('frankenstein_model.h5')

To compare the results obtained with different numbers of training epochs, I trained the same model on the same text for 50 epochs. This took several hours of training, even with a GPU (an old one):

In [23]:
# Load the saved model (if needed)
model_50 = keras.models.load_model('frankenstein_model_50_epochs.h5')
model_50.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 40, 128)           96768     
_________________________________________________________________
lstm_1 (LSTM)                (None, 40, 128)           131584    
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 60)                7740      
Total params: 367,676
Trainable params: 367,676
Non-trainable params: 0
_________________________________________________________________


Now we pick a random sequence from the input data to use as the seed for the text generation:

In [12]:
# Random sentence from the input data
start = np.random.randint(0, len(input_data) -1)
seed = input_data[start]

In [13]:
# Print the seed sentence
print(seed)

ious parts of the heavens. the
most viol


This function generates 'n' new characters with the given model, starting from the seed sentence. At each step it picks the character with the highest predicted probability:

In [16]:
# Function to generate new text
def generate_text(model, sentence, n):
    # Start the output with the seed (expected to be maxlen characters long)
    generated = sentence
    for i in range(n):
        # Initialize the input vector
        X_test = np.zeros((1, maxlen, len(vocabulary)))
        for t, char in enumerate(sentence):
            X_test[0, t, char_to_idx[char]] = 1
        # Predict the output vector
        preds = model.predict(X_test, verbose=0)[0]
        # Character with max probability
        next_index = np.argmax(preds)
        # Map from index to character
        next_char = idx_to_char[next_index]
        # Slide the window: drop the first character, append the new one
        sentence = sentence[1:] + next_char
        # Add the new character to the generated text
        generated += next_char
    print(generated)

Using the 5-epoch model:

In [19]:
# Generate new text
generate_text(model, seed, 450)

ious parts of the heavens. the
most violent of the stranger and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and 


In contrast, using the model trained for 50 epochs:

In [25]:
generate_text(model_50, seed, 450)

ious parts of the heavens. the
most violent attention i had suffered to the most continual philosophers of the subject should be conceived. he had alread the stranger of the stranger, and i thought that i might be a consideration of the same subject to my father, and i then the strant misfortunes, i will not be tormented the same country of the sun distinction the subject of the same sufferings of the same time that i might over the same country in the most sense of the same country in


Obviously, the text generated by the model trained for 50 epochs makes a lot more sense, both in its structure and in its use of verb forms. But it is still confusing in places, and towards the end it starts to repeat the same phrases ("the same country in the most").

So there is still room for improvement.
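
One direction I have not tried here: `generate_text` always takes the most probable character (`np.argmax`), and greedy decoding like this is known to fall into loops. A common alternative is to sample the next character from the predicted distribution, rescaled by a "temperature". The helper below is a minimal sketch of that idea. `sample_index`, its parameters and the toy probability vector are all hypothetical and not part of the trained models above:

```python
import numpy as np

def sample_index(preds, temperature=1.0, rng=None):
    """Draw an index from a probability vector rescaled by temperature."""
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(preds, dtype='float64')
    # Work in log space; the small constant avoids log(0)
    logits = np.log(preds + 1e-12) / temperature
    # Softmax, subtracting the max for numerical stability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy distribution over three "characters" (illustrative only)
preds = np.array([0.6, 0.3, 0.1])
rng = np.random.default_rng(42)
draws = [sample_index(preds, temperature=0.8, rng=rng) for _ in range(1000)]
print(np.bincount(draws, minlength=3) / 1000)
```

In `generate_text`, a call like this would replace the `np.argmax(preds)` line. Temperatures below 1 stay close to the greedy choice and higher values give more varied (and more error-prone) text, so the value would need tuning.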