[View in Colaboratory](https://colab.research.google.com/github/mancap314/text_generation/blob/master/text_generation_trung.ipynb)

# LSTM Text Generation 

Based on this [blog post](https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/) by Trung Tran.

## Step 1: Import Text Data

We simply download *War and Peace* by Tolstoy (hopefully the machine can read it faster than I can...)

In [3]:
!wget https://cs.stanford.edu/people/karpathy/char-rnn/warpeace_input.txt

--2018-06-27 20:50:21--  https://cs.stanford.edu/people/karpathy/char-rnn/warpeace_input.txt
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3258246 (3.1M) [text/plain]
Saving to: ‘warpeace_input.txt’


2018-06-27 20:50:22 (15.0 MB/s) - ‘warpeace_input.txt’ saved [3258246/3258246]



## Step 2: Prepare the data

First read the data and get the set of unique characters:



In [5]:
DATA_PATH = 'warpeace_input.txt'

with open(DATA_PATH, 'r', encoding='utf-8') as f:  # the file is closed once it has been read
  data = f.read()
chars = list(set(data))  # set() keeps only the unique characters
VOCAB_SIZE = len(chars)

print('chars:\n{}\n\nVOCAB_SIZE: {}'.format(chars, VOCAB_SIZE))

chars:
[' ', '!', 'J', 'b', 'ä', 'S', '7', 'r', '?', 'k', 'B', 'Y', '2', 'g', '9', 'D', 'x', 'j', 't', 'C', 'e', 'o', 'u', 'K', '4', 'é', 'm', 'l', '/', '.', '\ufeff', 'a', 'X', 'N', 'Z', '8', '6', 'i', '5', ';', 'H', 'q', 'à', '3', 's', 'p', 'O', 'n', 'G', 'E', 'Q', 'y', 'w', 'I', '(', 'R', '"', 'P', '\n', ':', 'F', 'd', '=', ',', '0', '1', 'L', ')', '-', 'T', '*', 'U', "'", 'h', 'M', 'z', 'f', 'W', 'A', 'v', 'ê', 'c', 'V']

VOCAB_SIZE: 83


Then map the characters to integers (and vice versa):

In [0]:
idx_to_char = {i: char for i, char in enumerate(chars)}
char_to_idx = {char: i for i, char in enumerate(chars)}
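
As a quick sanity check, the two dictionaries should be inverses of each other. The sketch below rebuilds them for a short sample string and checks that encoding followed by decoding returns the original text. The names `sample`, `chars_demo`, `idx_to_char_demo` and `char_to_idx_demo` are illustrative and not part of the notebook, and `sorted` is used here only to make the mapping reproducible.

```python
# Hypothetical sample text standing in for the full corpus
sample = "Well, Prince"

# sorted() makes the index assignment deterministic; the notebook uses the raw set order
chars_demo = sorted(set(sample))
idx_to_char_demo = {i: char for i, char in enumerate(chars_demo)}
char_to_idx_demo = {char: i for i, char in enumerate(chars_demo)}

encoded = [char_to_idx_demo[c] for c in sample]          # text -> integers
decoded = ''.join(idx_to_char_demo[i] for i in encoded)  # integers -> text

print(encoded)
print(decoded)
```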

Now let's prepare the feature data **X** and the target variable **y**:

In [0]:
import numpy as np

SEQ_LENGTH = 60  # input sequence length
N_FEATURES = VOCAB_SIZE  # one-hot encoding: one feature per character (separate name kept for clarity)

# number of full sequences; the -1 leaves room for the target shifted by one character
N_SEQ = (len(data) - 1) // SEQ_LENGTH

X = np.zeros((N_SEQ, SEQ_LENGTH, N_FEATURES))
y = np.zeros((N_SEQ, SEQ_LENGTH, N_FEATURES))

for i in range(N_SEQ):
  X_sequence = data[i * SEQ_LENGTH: (i + 1) * SEQ_LENGTH]
  X_sequence_ix = [char_to_idx[c] for c in X_sequence]
  input_sequence = np.zeros((SEQ_LENGTH, N_FEATURES))
  for j in range(SEQ_LENGTH):
    input_sequence[j][X_sequence_ix[j]] = 1. #one-hot encoding of the input characters
  X[i] = input_sequence
  
  y_sequence = data[i * SEQ_LENGTH + 1: (i + 1) * SEQ_LENGTH + 1] #shifted by 1 to the right
  y_sequence_ix = [char_to_idx[c] for c in y_sequence]
  target_sequence = np.zeros((SEQ_LENGTH, N_FEATURES))
  for j in range(SEQ_LENGTH):
    target_sequence[j][y_sequence_ix[j]] = 1. #one-hot encoding of the target characters
  y[i] = target_sequence
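
To make the shift between inputs and targets concrete, here is a minimal sketch on a toy string. Everything prefixed with `toy_` is a stand-in for the notebook's variables, and indexing into `np.eye` is a vectorized alternative to the explicit loops above, not the code the notebook actually runs.

```python
import numpy as np

toy_text = "hello world!"  # hypothetical stand-in for `data`
toy_seq_length = 4         # hypothetical stand-in for SEQ_LENGTH

toy_chars = sorted(set(toy_text))
toy_char_to_idx = {c: i for i, c in enumerate(toy_chars)}
toy_n_seq = (len(toy_text) - 1) // toy_seq_length

# Each target sequence is the input sequence shifted by one character
for i in range(toy_n_seq):
    start, end = i * toy_seq_length, (i + 1) * toy_seq_length
    print(repr(toy_text[start:end]), '->', repr(toy_text[start + 1:end + 1]))

# One-hot encoding by picking rows of the identity matrix
ids = np.array([toy_char_to_idx[c] for c in toy_text])
eye = np.eye(len(toy_chars))
n_used = toy_n_seq * toy_seq_length
X_toy = eye[ids[:n_used].reshape(toy_n_seq, toy_seq_length)]
y_toy = eye[ids[1:n_used + 1].reshape(toy_n_seq, toy_seq_length)]

print(X_toy.shape, y_toy.shape)
```

Both arrays have the shape `(number of sequences, sequence length, vocabulary size)`, which is the layout the model expects.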

And now define the model:

In [0]:
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense, Activation

# constant parameters for the model
HIDDEN_DIM = 700  # number of units in each LSTM layer
LAYER_NUM = 2  # number of stacked LSTM layers
NB_EPOCHS = 200  # maximum number of training epochs
BATCH_SIZE = 128
VALIDATION_SPLIT = 0.1  # fraction of the training samples held out for validation


model = Sequential()
model.add(LSTM(HIDDEN_DIM, 
               input_shape=(None, VOCAB_SIZE), 
               return_sequences=True,
               dropout=0.3, #"Dropout ratio 0.3 at the first LSTM layer"
               recurrent_dropout=0.3))
for _ in range(LAYER_NUM - 1):
  model.add(LSTM(HIDDEN_DIM, return_sequences=True))
model.add(TimeDistributed(Dense(VOCAB_SIZE)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
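
To get a feel for the size of this network, the number of trainable parameters can be estimated by hand. This is a rough sketch assuming the standard LSTM formulation (four gates, each with an input kernel, a recurrent kernel and a bias) and the vocabulary size of 83 found above. The helper names are illustrative, and `model.summary()` remains the authoritative source.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus bias; TimeDistributed shares these across time steps
    return input_dim * units + units

vocab_size = 83   # VOCAB_SIZE from the data preparation step
hidden_dim = 700  # HIDDEN_DIM

first_lstm = lstm_params(vocab_size, hidden_dim)
second_lstm = lstm_params(hidden_dim, hidden_dim)
output_layer = dense_params(hidden_dim, vocab_size)
total = first_lstm + second_lstm + output_layer

print(first_lstm, second_lstm, output_layer, total)
# → 2195200 3922800 58183 6176183
```

Under these assumptions the model has a little over 6 million parameters, most of them in the second LSTM layer.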

Create a function to generate text from the trained model:

In [0]:
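def generate_text(model, length):
  """Generate `length` characters after a random seed character, always picking the most likely next one."""
  ix = [np.random.randint(VOCAB_SIZE)]  # random first character
  y_char = [idx_to_char[ix[-1]]]
  X = np.zeros((1, length, VOCAB_SIZE))
  for i in range(length):
    X[0, i, ix[-1]] = 1.  # one-hot encode the latest character
    # predict on the sequence so far and take the most likely character at each step
    ix = np.argmax(model.predict(X[:, :i+1, :])[0], 1)
    y_char.append(idx_to_char[ix[-1]])
  return ''.join(y_char)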
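
`generate_text` is greedy: it always takes the `argmax`, so the same context always yields the same continuation, and the output can get stuck in a loop. A common alternative, not used in this notebook, is to sample from the predicted distribution after rescaling it with a temperature. The sketch below shows the idea with NumPy only; `sample_with_temperature` and the example probabilities are hypothetical.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw an index from `probs` after rescaling by `temperature` (hypothetical helper)."""
    if rng is None:
        rng = np.random.default_rng()
    # Work in log space; the small constant avoids log(0)
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    scaled = np.exp(logits)
    scaled /= scaled.sum()  # renormalize to a probability distribution
    return int(rng.choice(len(scaled), p=scaled))

# Made-up next-character probabilities, as a softmax layer might return them
probs = [0.5, 0.3, 0.2]
rng = np.random.default_rng(0)

print('greedy :', int(np.argmax(probs)))
print('sampled:', [sample_with_temperature(probs, 1.0, rng) for _ in range(10)])
```

A temperature close to 0 behaves like `argmax`, while higher values make less likely characters more frequent.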

In [0]:
from keras.callbacks import EarlyStopping, ModelCheckpoint, Callback
# callback to save the model whenever the validation accuracy improves
filepath = "tgt_model.hdf5"
save_model_cb = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
# callback to stop training as soon as the validation loss stops improving
early_stopping_cb = EarlyStopping(monitor='val_loss', patience=0)
# callback to print generated text at the end of each epoch
class GenerateText(Callback):
    def on_epoch_end(self, epoch, logs=None):
        print(generate_text(self.model, 200))

generate_text_cb = GenerateText()

callbacks_list = [save_model_cb, early_stopping_cb, generate_text_cb]

And now train the model:

In [0]:

model.fit(X, y, batch_size=BATCH_SIZE, verbose=1, epochs=NB_EPOCHS, callbacks=callbacks_list, validation_split=VALIDATION_SPLIT)

Train on 47943 samples, validate on 5327 samples
Epoch 1/200

Epoch 00001: val_acc improved from -inf to 0.41622, saving model to tgt_model.hdf5
Rostov and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and an
Epoch 2/200
 9728/47943 [=====>........................] - ETA: 33:22 - loss: 2.1227 - acc: 0.3842

10112/47943 [=====>........................] - ETA: 33:00 - loss: 2.1201 - acc: 0.3848
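
The sample counts in the first line of the log follow from `validation_split`: Keras holds out the last fraction of the samples, before shuffling, and trains on the rest. Here is a small sketch of that arithmetic, assuming the total of 53,270 sequences implied by the log (47,943 + 5,327). `split_counts` is an illustrative helper, not a Keras function.

```python
def split_counts(n_samples, validation_split):
    # index at which the training samples end and the validation samples begin
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

n_train, n_val = split_counts(47943 + 5327, 0.1)
print('train: {}, validation: {}'.format(n_train, n_val))
```

Because the held-out samples come from the end of the data, the validation set here corresponds to the final part of the book.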