## Generating text with an RNN

This tutorial demonstrates how to generate text with a character-based RNN: given a sequence of characters, the model is trained to predict the next character in the sequence. Longer text sequences can be generated by calling the model repeatedly.

While some of the generated phrases are grammatically correct, most are nonsense. The model has not learned the meaning of the words, but consider:

1. The model is character-based. When training began, the model did not know how to spell an English word, or even that words were a unit of text.

2. The structure of the output resembles the training text: it produces word-like units separated by spaces and reuses names from the book, such as "alice", "the king", and "the gryphon".

3. As demonstrated below, the model is trained on short sequences of text (10 characters each, the value of SEQLEN) and can still generate a longer text stream with a consistent structure.
    
For this example the following book will be used:

Alice's Adventures in Wonderland, by Lewis Carroll (Project Gutenberg)

This e-book is for the use of anyone anywhere at no cost and with almost no restrictions of any kind. You may copy it, give it away, or reuse it under the terms of the Project Gutenberg License included with this e-book or online at www.gutenberg.org

Title: Alice's Adventures in Wonderland

Author: Lewis Carroll

 
**Task to do**: Given a character, or a sequence of characters, what is the most likely next character? This is the task for which you are training the model. The input to the model will be a sequence of characters, and you train the model to predict the output: the next character at each time step.

Because RNNs maintain an internal state that depends on the elements seen previously, the question becomes: given all the characters seen so far, what is the next character?



In [7]:
import numpy as np  # linear algebra
from keras.layers import SimpleRNN, Dense, Activation
from keras.models import Sequential



In [13]:
INPUT_FILE = "texto/wonderland.txt"
# extract the input as a stream of characters
print("[INFO]Extracting text from input...")
lines = []
with open(INPUT_FILE, 'rb') as fin:
    for line in fin:
        # normalize: trim whitespace, lowercase, drop non-ASCII bytes
        line = line.strip().lower()
        line = line.decode("ascii", "ignore")
        if len(line) == 0:
            continue  # skip blank lines
        lines.append(line)
text = " ".join(lines)
print("[INFO] Done!!")

[INFO]Extracting text from input...
[INFO] Done!!
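
To make the effect of the cleanup step concrete, here is a minimal sketch that applies the same normalization to a few in-memory byte strings instead of the book file. The helper name `clean_lines` and the sample lines are hypothetical and only serve as an illustration.

```python
def clean_lines(raw_lines):
    """Normalize raw byte lines into one lowercase ASCII string."""
    cleaned = []
    for raw in raw_lines:
        # Trim surrounding whitespace and lowercase while still in bytes.
        line = raw.strip().lower()
        # Silently drop any byte that is not valid ASCII.
        line = line.decode("ascii", "ignore")
        if not line:
            continue  # skip blank lines
        cleaned.append(line)
    # Join with single spaces so line breaks become word separators.
    return " ".join(cleaned)


# Hypothetical sample input, not taken from the book file.
sample = [b"  Alice was beginning\r\n", b"\r\n", b"The Caf\xc3\xa9 door\n"]
cleaned_text = clean_lines(sample)
print(cleaned_text)
```

Note that the two bytes encoding "é" are removed rather than replaced, so accented words lose a letter. For a mostly ASCII English text this is a small loss, but it is worth keeping in mind for other corpora.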


In the next cell, chars is the set of distinct characters in the text and nb_chars is its size, that is, the number of features in our character "vocabulary". We then create the inputs and labels from the text. We do this by stepping through the text $STEP$ characters at a time, extracting a sequence of $SEQLEN$ characters together with the character that follows it. For example, with $SEQLEN = 10$ and $STEP = 1$, the input text "The sky was falling" yields the following input_chars and label_chars (only the first 5 are shown):

    The sky wa -> s
    he sky was ->  
    e sky was  -> f
     sky was f -> a
    sky was fa -> l

In [9]:
chars = set(text)
nb_chars = len(chars)
char2index = dict((c, i) for i, c in enumerate(chars))
index2char = dict((i, c) for i, c in enumerate(chars))

print("[INFO] Creating labels and short input sequences...")
SEQLEN = 10
STEP = 1

input_chars = []
label_chars = []
for i in range(0, len(text) - SEQLEN, STEP):
    input_chars.append(text[i:i + SEQLEN])
    label_chars.append(text[i + SEQLEN])

print(f"[INFO] Done!! \nTotal symbols: {nb_chars}\nTotal sequences: {len(input_chars)}")

[INFO] Creating labels and short input sequences...
[INFO] Done!! 
Total symbols: 55
Total sequences: 158773
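
The "The sky was falling" example from the text can be reproduced with a small standalone sketch of the same sliding-window logic. The function name `make_windows` is hypothetical; the loop body mirrors the cell above.

```python
def make_windows(text, seqlen, step):
    """Return (inputs, labels): each input is seqlen chars, each label the next char."""
    inputs, labels = [], []
    for i in range(0, len(text) - seqlen, step):
        inputs.append(text[i:i + seqlen])
        labels.append(text[i + seqlen])
    return inputs, labels


inputs, labels = make_windows("The sky was falling", seqlen=10, step=1)
# repr() makes leading and trailing spaces visible
for chunk, label in list(zip(inputs, labels))[:5]:
    print(repr(chunk), "->", repr(label))
```

With a 19-character text and a window of 10, the loop produces 9 pairs, one per starting position that still has a following character to use as the label.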


Vectorize the input and label characters. Each row of the input holds SEQLEN characters, and each character is represented as a one-hot encoding of size nb_chars. There are len(input_chars) rows, so the dimensions of X are (len(input_chars), SEQLEN, nb_chars). Each output row is a single character, also represented as a one-hot encoding of size nb_chars, so the dimensions of y are (len(input_chars), nb_chars).

In [10]:
print("[INFO] Vectorizing input and label text...")
X = np.zeros((len(input_chars), SEQLEN, nb_chars), dtype=bool)
y = np.zeros((len(input_chars), nb_chars), dtype=bool)
for i, input_char in enumerate(input_chars):
    for j, ch in enumerate(input_char):
        X[i, j, char2index[ch]] = 1
    y[i, char2index[label_chars[i]]] = 1
print("[INFO] Done!!")      

[INFO] Vectorizing input and label text...
[INFO] Done!!
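
The shapes described above are easier to check on a toy vocabulary. The following is a minimal sketch of the same one-hot vectorization; the helper `one_hot_encode` and the three-character alphabet are assumptions made for illustration only.

```python
import numpy as np


def one_hot_encode(inputs, labels, char2index):
    """One-hot encode input windows (3-D) and their next-character labels (2-D)."""
    n, seqlen, vocab = len(inputs), len(inputs[0]), len(char2index)
    X = np.zeros((n, seqlen, vocab), dtype=bool)
    y = np.zeros((n, vocab), dtype=bool)
    for i, chunk in enumerate(inputs):
        for j, ch in enumerate(chunk):
            X[i, j, char2index[ch]] = True
        y[i, char2index[labels[i]]] = True
    return X, y


# Hypothetical toy data: two windows of length 3 over the alphabet "abc".
toy_inputs = ["abc", "bca"]
toy_labels = ["a", "b"]
toy_char2index = {c: i for i, c in enumerate(sorted(set("abc")))}

X_toy, y_toy = one_hot_encode(toy_inputs, toy_labels, toy_char2index)
print(X_toy.shape, y_toy.shape)
print(X_toy[0].astype(int))
```

The toy mapping is built from a sorted list so that the character indices are the same on every run. Iterating over a plain set of strings, as the notebook does, can give a different order in each Python session, which matters only if the mapping has to be reused with a saved model.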


Build the model. We use a single RNN layer followed by a fully connected layer with a softmax activation to compute the most likely next character.

In [11]:

HIDDEN_SIZE = 128
BATCH_SIZE = 128
NUM_ITERATIONS = 25
NUM_EPOCHS_PER_ITERATION = 1
NUM_PREDS_PER_EPOCH = 100

model = Sequential()
model.add(SimpleRNN(HIDDEN_SIZE, return_sequences=False,
                    input_shape=(SEQLEN, nb_chars),
                    unroll=True))
model.add(Dense(nb_chars))
model.add(Activation("softmax"))

model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
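
To illustrate what this architecture computes for a single input window, here is a rough NumPy sketch of the forward pass: a tanh recurrence over the time steps (tanh is the default activation of Keras's SimpleRNN), followed by a dense layer and a softmax. The weights are random rather than trained, and all names (`simple_rnn_forward`, `W_x`, `W_h`, and so on) are hypothetical, so this shows the arithmetic and shapes only, not the notebook's model.

```python
import numpy as np


def simple_rnn_forward(x_seq, W_x, W_h, b_h, W_out, b_out):
    """Run a tanh RNN over x_seq, then a dense softmax layer on the last state."""
    h = np.zeros(W_h.shape[0])  # initial hidden state
    for x_t in x_seq:
        # new state depends on the current input and the previous state
        h = np.tanh(x_t @ W_x + h @ W_h + b_h)
    logits = h @ W_out + b_out
    exp = np.exp(logits - logits.max())  # shift for numerical stability
    return exp / exp.sum()


rng = np.random.default_rng(0)
seqlen, vocab, hidden = 10, 55, 128  # SEQLEN, nb_chars, HIDDEN_SIZE

# Random, untrained weights (assumption: for illustration only).
W_x = rng.normal(scale=0.1, size=(vocab, hidden))
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
b_h = np.zeros(hidden)
W_out = rng.normal(scale=0.1, size=(hidden, vocab))
b_out = np.zeros(vocab)

# A random one-hot encoded window of 10 characters.
x_seq = np.eye(vocab)[rng.integers(0, vocab, size=seqlen)]
probs = simple_rnn_forward(x_seq, W_x, W_h, b_h, W_out, b_out)

n_params = W_x.size + W_h.size + b_h.size + W_out.size + b_out.size
print(probs.shape, n_params)
```

Because return_sequences=False, only the hidden state after the last character is passed to the dense layer, which is why the sketch applies the softmax once, after the loop.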

## Train and test

In [12]:
# Train for one epoch per iteration, then generate sample text to check progress
for iteration in range(NUM_ITERATIONS):
    print("=" * 50)
    print("Iteration #: %d" % (iteration))
    model.fit(X, y, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS_PER_ITERATION)

    # testing model
    # randomly choose a row from input_chars, then use it to
    # generate the next NUM_PREDS_PER_EPOCH (100) chars from the model
    test_idx = np.random.randint(len(input_chars))
    test_chars = input_chars[test_idx]
    print("Generating from seed: %s" % (test_chars))
    print(test_chars, end="")
    for _ in range(NUM_PREDS_PER_EPOCH):
        # one-hot encode the current SEQLEN-character window
        Xtest = np.zeros((1, SEQLEN, nb_chars))
        for j, ch in enumerate(test_chars):
            Xtest[0, j, char2index[ch]] = 1
        pred = model.predict(Xtest, verbose=0)[0]
        # greedy choice: take the most probable next character
        ypred = index2char[np.argmax(pred)]
        print(ypred, end="")
        # move forward with test_chars + ypred
        test_chars = test_chars[1:] + ypred
    print()

Iteration #: 0
Generating from seed:  the water
 the water alice said the the the the the the the the the the the the the the the the the the the the the the 
Iteration #: 1
Generating from seed:  be angry 
 be angry hat ing the the said the the said the the said the the said the the said the the said the the said t
Iteration #: 2
Generating from seed: ine feet h
ine feet her all the said the king the hard the har her alice the har her alice the har her alice the har her 
Iteration #: 3
Generating from seed: hind to ex
hind to exp of the she had she was the said the doon the dorme fing the reall sel was the said the doon the do
Iteration #: 4
Generating from seed: right hold
right holder and the hat it was so the said the cateres and the parted the parted the parted the parted the pa
Iteration #: 5
Generating from seed:  the defec
 the defection and the gryphon and all the dore the gryphon and all the dore the gryphon and all the dore the 
Iteration #: 6
Generating from seed: . and how

In [40]:
pred.size

55
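
The generation step used during training can also be written as a reusable function. The sketch below is one possible formulation, not the notebook's own code: `generate_greedy` and the stand-in predictor `toy_predict` are hypothetical names, and the toy predictor simply returns the next letter of "abc" in cyclic order so that the example runs without a trained model.

```python
import numpy as np


def generate_greedy(predict_fn, seed, num_chars, char2index, index2char):
    """Extend seed by num_chars characters, always picking the most probable one."""
    window = seed
    generated = []
    vocab = len(char2index)
    for _ in range(num_chars):
        # one-hot encode the current window as a batch of size 1
        x = np.zeros((1, len(window), vocab))
        for j, ch in enumerate(window):
            x[0, j, char2index[ch]] = 1
        probs = predict_fn(x)[0]
        next_char = index2char[int(np.argmax(probs))]
        generated.append(next_char)
        # slide the window: drop the oldest character, append the new one
        window = window[1:] + next_char
    return seed + "".join(generated)


# Hypothetical toy vocabulary and predictor, used in place of a trained model.
toy_chars = "abc"
toy_c2i = {c: i for i, c in enumerate(toy_chars)}
toy_i2c = {i: c for i, c in enumerate(toy_chars)}


def toy_predict(x):
    """Put all probability on the letter that follows the last input letter."""
    last = int(np.argmax(x[0, -1]))
    probs = np.zeros((1, len(toy_chars)))
    probs[0, (last + 1) % len(toy_chars)] = 1.0
    return probs


result = generate_greedy(toy_predict, "ab", 5, toy_c2i, toy_i2c)
print(result)
```

With the trained model, predict_fn could be something like `lambda x: model.predict(x, verbose=0)`. Since each prediction depends only on the last SEQLEN characters and argmax is deterministic, the output starts to cycle as soon as a window repeats, which is consistent with the looping phrases in the samples above.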