# Create a Text Generator Using an RNN


- The text of a book, chapter, etc. can be treated as a sequence of characters.
- An RNN is designed for sequential data, so it can learn such sequences and generate new ones.
- An LSTM is a type of RNN that helps avoid common training problems such as vanishing gradients.

The architecture is the following:
<img src="https://cdn-images-1.medium.com/max/1600/1*_YFtlUJG69dm6QLnFhYBoQ.png"/>

### To learn:
- Where to download a free corpus of text that you can use to train text generative models.
- How to frame the problem of text sequences to a recurrent neural network generative model.
- How to develop an LSTM to generate plausible text sequences for a given problem.

In [1]:
# libraries

import numpy as np
import requests
import sys
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Activation
from keras.layers import Bidirectional
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils



Using TensorFlow backend.


In [13]:
# load text 
articleUrl = requests.get("https://www.gutenberg.org/files/120/120-0.txt")
articleUrl.encoding = "utf-8"
book_text = articleUrl.text
book_text = book_text.lower()
book_text[877:1000]

'rous delightful hours, and with the kindest wishes, dedicated by his\r\naffectionate friend, the author.\r\n\r\n\r\n\r\n             '

In [14]:
book_text = book_text[4300:]

In [15]:
# create a mapping of unique chars to ints

chars = sorted(list(set(book_text)))
chars_to_int = {c:i for i,c in enumerate(chars)}
chars_to_int['a']

29

In [16]:
chars

['\n',
 '\r',
 ' ',
 '!',
 '$',
 '%',
 "'",
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '?',
 '@',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '“',
 '”']

In [17]:
# summarize the data set
n_chars = len(book_text)
n_vocab = len(chars)
print("Book with {} chars and vocab is {} unique chars".format(n_chars,n_vocab))

Book with 387306 chars and vocab is 57 unique chars


#### Choosing how to define the training data
- The book text is split into overlapping subsequences with a fixed length of 100 characters, so each input sample has 100 timesteps.
- Each sample is 100 timesteps of one character each (x), followed by a single output character (y).

Example, with a window of 5 characters:

<li>runni --> n</li>
<li>unnin --> g</li>

### In the Treasure Island example:

- After each window of 100 characters comes one output: the next character.
- Each character is one timestep.
- Each window has 100 characters, thus 100 timesteps with one feature each.
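
The framing above can be reproduced on a toy string. This is a minimal sketch, assuming the word "running" and a window of 5 characters instead of the notebook's 100; it is only meant to show how each window is paired with the character that follows it.

```python
# Toy illustration of the sliding-window framing (not the notebook's real data).
text = "running"
seq_len = 5  # the notebook uses 100; 5 keeps the example readable

# Each window of seq_len characters is an input;
# the character right after the window is the target.
pairs = [(text[i:i + seq_len], text[i + seq_len]) for i in range(len(text) - seq_len)]

for seq_in, seq_out in pairs:
    print(f"{seq_in} --> {seq_out}")
```

A string of length `n` yields `n - seq_len` pairs, which is why the book produces `n_chars - 100` patterns.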

In [6]:
# prepare the dataset of inputs to output pairs encoded as integers
# example
seq_len = 100
for i in range(0,n_chars - seq_len)[:10]:
    seq_in = book_text[i: i+seq_len]
    seq_out = book_text[i+seq_len]
    print((seq_in,seq_out))
    print([chars_to_int[char] for char in seq_in])
    print(chars_to_int[seq_out])

('\ufeffthe project gutenberg ebook of treasure island, by robert louis stevenson\r\n\r\nthis ebook is for the ', 'u')
[60, 51, 39, 36, 2, 47, 49, 46, 41, 36, 34, 51, 2, 38, 52, 51, 36, 45, 33, 36, 49, 38, 2, 36, 33, 46, 46, 42, 2, 46, 37, 2, 51, 49, 36, 32, 50, 52, 49, 36, 2, 40, 50, 43, 32, 45, 35, 11, 2, 33, 56, 2, 49, 46, 33, 36, 49, 51, 2, 43, 46, 52, 40, 50, 2, 50, 51, 36, 53, 36, 45, 50, 46, 45, 1, 0, 1, 0, 51, 39, 40, 50, 2, 36, 33, 46, 46, 42, 2, 40, 50, 2, 37, 46, 49, 2, 51, 39, 36, 2]
52
('the project gutenberg ebook of treasure island, by robert louis stevenson\r\n\r\nthis ebook is for the u', 's')
[51, 39, 36, 2, 47, 49, 46, 41, 36, 34, 51, 2, 38, 52, 51, 36, 45, 33, 36, 49, 38, 2, 36, 33, 46, 46, 42, 2, 46, 37, 2, 51, 49, 36, 32, 50, 52, 49, 36, 2, 40, 50, 43, 32, 45, 35, 11, 2, 33, 56, 2, 49, 46, 33, 36, 49, 51, 2, 43, 46, 52, 40, 50, 2, 50, 51, 36, 53, 36, 45, 50, 46, 45, 1, 0, 1, 0, 51, 39, 40, 50, 2, 36, 33, 46, 46, 42, 2, 40, 50, 2, 37, 46, 49, 2, 51, 39, 36, 2, 52]


In [18]:
## actual input to output pairs encoded as ints
seq_len = 100
data_x = [[chars_to_int[char] for char in book_text[i: i+seq_len]] for i in range(0,n_chars - seq_len)]
data_y = [chars_to_int[book_text[i+seq_len]] for i in range(0,n_chars - seq_len)]
print(len(data_x))
print(len(data_y))
print("number of patterns {}".format(len(data_x)))

387206
387206
number of patterns 387206


##### The data:
- The list of input sequences must have the shape [samples, time_steps, features].
- The integers need to be rescaled to between 0 and 1 to make it easier for the network to learn.
- The output is one-hot encoded: a sparse vector with a length of 57 (the vocabulary size).
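
The three steps above can be tried on a tiny made-up dataset before running them on the whole book. This is a sketch under stated assumptions: `toy_vocab`, `toy_x` and `toy_y` are hypothetical values, and the one-hot step uses `np.eye` indexing as a stand-in for the Keras helper used in the notebook.

```python
import numpy as np

# Hypothetical toy data: 3 integer-encoded windows of 4 timesteps, vocabulary of 5 symbols.
toy_vocab = 5
toy_x = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 0]]
toy_y = [4, 0, 1]

# [samples, time_steps, features]: one feature (the scaled char index) per timestep.
x = np.reshape(toy_x, (len(toy_x), 4, 1)).astype("float32") / toy_vocab

# One-hot targets: row k of the identity matrix encodes class k.
y = np.eye(toy_vocab, dtype="float32")[toy_y]

print(x.shape, y.shape)
```

Because the largest index is `toy_vocab - 1`, the scaled inputs stay strictly below 1, which matches the maximum of about 0.98 seen with the 57-character vocabulary.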

In [19]:
# reshape x
num_patterns = len(data_x)
x_train = np.reshape(data_x,(num_patterns,seq_len,1))
print(x_train.shape)

# scale x 
x_train = (x_train / float(n_vocab)).astype('float32')
print(x_train.min())
print(x_train.max())
print(x_train[:1].ravel())



(387206, 100, 1)
0.0
0.98245615
[0.57894737 0.57894737 0.80701756 0.01754386 0.         0.01754386
 0.         0.01754386 0.         0.01754386 0.         0.01754386
 0.         0.2631579  0.01754386 0.         0.01754386 0.
 0.84210527 0.6315789  0.57894737 0.03508772 0.75438595 0.7017544
 0.5614035  0.03508772 0.8245614  0.57894737 0.50877196 0.19298245
 0.5614035  0.75438595 0.61403507 0.03508772 0.50877196 0.84210527
 0.03508772 0.84210527 0.6315789  0.57894737 0.03508772 0.50877196
 0.5614035  0.71929824 0.64912283 0.80701756 0.50877196 0.7017544
 0.03508772 0.5263158  0.57894737 0.7368421  0.5263158  0.75438595
 0.8947368  0.01754386 0.         0.01754386 0.         0.01754386
 0.         0.8245614  0.7894737  0.8596491  0.64912283 0.80701756
 0.57894737 0.03508772 0.84210527 0.80701756 0.57894737 0.7017544
 0.50877196 0.8947368  0.7368421  0.57894737 0.9298246  0.1754386
 0.03508772 0.5614035  0.80701756 0.21052632 0.03508772 0.7017544
 0.64912283 0.877193   0.57894737 0.8245614

In [20]:
# one-hot encode the output vector
y_train = np_utils.to_categorical(data_y)
print(y_train[:2])
print(y_train.shape)

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0.]]
(387206, 57)


In [21]:
################################
        # THE MODEL #
################################
input_shape = (x_train.shape[1],x_train.shape[2])
num_labels = y_train.shape[1]
batch_size = 256
hlayers = 256
dropout = .25

# model RNN 
model = Sequential()
model.add(LSTM(units=hlayers,input_shape=input_shape,return_sequences=True))
model.add(Dropout(dropout))
model.add(LSTM(units=256))
model.add(Dropout(dropout))
model.add(Dense(num_labels))
model.add(Activation("softmax"))
model.summary()



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 100, 256)          264192    
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 256)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 57)                14649     
_________________________________________________________________
activation_1 (Activation)    (None, 57)                0         
Total params: 804,153
Trainable params: 804,153
Non-trainable params: 0
_________________________________________________________________
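
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; `lstm_params` and `dense_params` are illustrative helper names, not Keras functions.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

lstm_1 = lstm_params(input_dim=1, units=256)     # one feature per timestep
lstm_2 = lstm_params(input_dim=256, units=256)   # fed by the first LSTM's 256 outputs
dense_1 = dense_params(input_dim=256, units=57)  # one output per vocabulary char
print(lstm_1, lstm_2, dense_1, lstm_1 + lstm_2 + dense_1)
# 264192 525312 14649 804153
```

The dropout and activation layers add no parameters, so the three numbers sum to the total reported by `model.summary()`.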


In [22]:
# compile
model.compile(loss="categorical_crossentropy",optimizer="rmsprop",metrics=["accuracy"])

# Define a checkpoint that saves the weights whenever training accuracy improves at the end of an epoch,
# so the best set of weights can be loaded later
"""
file = "w-improvement-{epoch:02d}-{acc:.4f}.hdf5"
checkpoint = ModelCheckpoint(file,monitor='acc', verbose=1, save_best_only=True,mode='max')
callback = [checkpoint]


model.fit(x_train,y_train,batch_size=batch_size,epochs=1,callbacks=callback)
print("Finished Training")
"""

'\nfile = "w-improvement-{epoch:02d}-{acc:.4f}.hdf5"\ncheckpoint = ModelCheckpoint(file,monitor=\'acc\', verbose=1, save_best_only=True,mode=\'max\')\ncallback = [checkpoint]\n\n\nmodel.fit(x_train,y_train,batch_size=batch_size,epochs=1,callbacks=callback)\nprint("Finished Training")\n'

In [23]:
# generating text 

file_w = "w-improvement-93-0.5862.hdf5"
model.load_weights(file_w)
model.compile(loss="categorical_crossentropy",optimizer="rmsprop")
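
The weights file name above is typed in by hand. As a possible convenience, the best checkpoint could be picked from the file names themselves, since the template `w-improvement-{epoch:02d}-{acc:.4f}.hdf5` embeds the accuracy. This is a sketch only: `best_checkpoint` is a hypothetical helper, and the file list is made up for illustration apart from the one name used in the notebook.

```python
import re

# Pattern mirrors the checkpoint template "w-improvement-{epoch:02d}-{acc:.4f}.hdf5".
CKPT_RE = re.compile(r"^w-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(filenames):
    """Return the checkpoint name with the highest accuracy, or None if none match."""
    best_name, best_acc = None, -1.0
    for name in filenames:
        match = CKPT_RE.match(name)
        if match and float(match.group(2)) > best_acc:
            best_name, best_acc = name, float(match.group(2))
    return best_name

# Hypothetical directory listing; in practice this could come from os.listdir(".").
example_files = ["w-improvement-01-0.2710.hdf5", "w-improvement-93-0.5862.hdf5", "notes.txt"]
print(best_checkpoint(example_files))
```

The returned name could then be passed to `model.load_weights` in place of the hard-coded string.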


In [25]:
# Dictionary to convert integers back to characters

int_to_char = {i:c for i,c in enumerate(chars)}
int_to_char[0]

'\n'

##### Text Generation
- Start with a random seed sequence as input and generate the next character from it; then drop the first character, append the new one, and repeat.
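
The rolling-window idea can be sketched without a trained network. In the example below, `generate` and `toy_predict` are hypothetical names: `toy_predict` is a stand-in for the model (an assumption for illustration only) that simply returns the id following the last one in a 3-symbol cycle.

```python
def generate(seed_ids, predict_next, n_steps):
    """Greedy generation: predict one id, append it, drop the oldest id, repeat."""
    window = list(seed_ids)  # copy so the caller's seed is not modified
    generated = []
    for _ in range(n_steps):
        next_id = predict_next(window)
        generated.append(next_id)
        window = window[1:] + [next_id]  # window length stays constant
    return generated

# Stand-in for the trained model (an assumption for illustration only):
# it "predicts" the id that follows the last id in a 3-symbol cycle.
def toy_predict(window):
    return (window[-1] + 1) % 3

seed_ids = [0, 1, 2]
print(generate(seed_ids, toy_predict, n_steps=5))
```

With the real model, `predict_next` would scale the window, reshape it to `(1, seq_len, 1)`, and take the `argmax` of the softmax output.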

In [32]:
# random seed (randint's upper bound is exclusive)
seed = np.random.randint(0,len(data_x))
# copy the pattern so the training data is not modified
pat = list(data_x[seed])

# generate chars
print("".join([int_to_char[val] for val in pat]))
for i in range(500):
    x = np.reshape(pat,(1,len(pat),1))
    x = x / float(n_vocab)
    # char prediction
    prediction = model.predict(x, verbose=0)
    pred_char = int(np.argmax(prediction))
    char_to_text = int_to_char[pred_char]
    sys.stdout.write(char_to_text)
    # slide the window: append the new char and drop the oldest
    pat.append(pred_char)
    pat = pat[1:]
    


 behind me over my shoulder, began to
retrace my steps in the direction of the boats.

instantly 
i was on the streation of the ship shat had she coatt was
still silver and the same time and she was off a mone startler of his
andiorage, and the shought of the ship was a bliar of the ship's mane,
i was stre that the same might of his cape, and the shore were all was
still sirenng and shere and shen and the sea certied on the stockade,
and the shore realan was struck and sook a ship of the stockade, and
the same moment the mort coats were all was on the south- and the should
of the stoc