# Character Level RNN using LSTM cells.

- Trains on Star Trek episode titles
- Outputs "fake" titles.

Much of the code comes from a [Keras example](https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py).

## Setup Environment

- Import Keras
- Open up the Star Trek corpus
- Translate the textual data into a format that the RNN can accept as input.
- Give each character an index and create dictionaries that translate between characters and indices.
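
As a minimal sketch of the last two steps, the following builds both lookup tables for a toy string and checks that encoding followed by decoding returns the original text. The string `toy_text` and the `toy_*` names are stand-ins chosen for illustration; they are not the Star Trek corpus or the notebook's variables.

```python
# Toy stand-in for the corpus (an assumption for illustration only).
toy_text = "spock"

# The sorted unique characters define the vocabulary.
toy_chars = sorted(set(toy_text))

# Forward (character -> index) and inverse (index -> character) tables.
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as indices, then decode it again.
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print(toy_chars)  # ['c', 'k', 'o', 'p', 's']
print(encoded)    # [4, 3, 2, 0, 1]
print(decoded)    # spock
```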

In [None]:
## Much borrowed from https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py

from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM, Dropout
from keras.layers import Embedding
from keras.optimizers import RMSprop
from keras.utils import get_file
from keras.models import load_model

from keras.callbacks import LambdaCallback
import numpy as np
import random
import sys

text = open("startrekepisodes.txt").read().lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
vocabulary_size = len(chars)
print('total chars:', vocabulary_size)
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))


# How long is a title?
titles = text.split('\n')
lengths = np.array([len(n) for n in titles])
print("Max:", np.max(lengths))
print("Mean:", np.mean(lengths))
print("Median:", np.median(lengths))
print("Min:", np.min(lengths))

# Hence choose 30 as the sequence length to train on.
print("Character Dictionary: ", char_indices)
print("Inverse Character Dictionary: ", indices_char)

## Setup Training Data

- Cut up the corpus into semi-redundant sequences of 30 characters.
- Change indices into "one-hot" vector encodings.

<img src="figures/slicing_text.png" width=600>
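
The slicing can be sketched on a short string to make the overlap visible. The values `toy_maxlen = 5` and `toy_step = 3` are picked only to keep the output small; the notebook itself uses 30 and 3.

```python
# Toy stand-ins (assumptions for illustration only).
toy_text = "abcdefghijkl"
toy_maxlen = 5  # window length
toy_step = 3    # how far the window moves each time

windows = []  # training inputs
targets = []  # the character that follows each window

for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i:i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

for window, target in zip(windows, targets):
    print(repr(window), "->", repr(target))
# 'abcde' -> 'f'
# 'defgh' -> 'i'
# 'ghijk' -> 'l'
```

Consecutive windows share `toy_maxlen - toy_step` characters, which is what makes the sequences "semi-redundant".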

In [None]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 30
step = 3

sentences = [] #The training data
next_chars = [] #The training labels

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i+maxlen])
    next_chars.append(text[i+maxlen])
    
#TODO Printouts

### One-hot encoding:
* a -> [1, 0, 0, ..., 0]
* b -> [0, 1, 0, ..., 0]
* ...

Each training sample becomes a 2D tensor:
* "This is the text" -> X = [[0, 0, ..., 1, 0, ..., 0], ..., [0, 0, ..., 1, 0, ..., 0]]

Each target (next letter) becomes a 1D one-hot tensor:
* a -> y = [1, 0, 0, ..., 0]
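
A small sketch of the same encoding with a three-letter vocabulary. It uses plain Python lists instead of NumPy arrays to stay dependency-free, and the names `toy_chars` and `one_hot` are hypothetical helpers, not part of the notebook.

```python
# Toy three-letter vocabulary (an assumption for illustration only).
toy_chars = ["a", "b", "c"]
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

def one_hot(char):
    """Return a list of zeros with a single 1 at the character's index."""
    vec = [0] * len(toy_chars)
    vec[toy_char_indices[char]] = 1
    return vec

# A training sample is one vector per character: a 2D structure.
x = [one_hot(c) for c in "cab"]
# A target is a single one-hot vector: a 1D structure.
y = one_hot("b")

print(x)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
print(y)  # [0, 1, 0]
```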

In [None]:
#X shape: 3D tensor. First dimension is the sentences, second is each letter in each sentence, third is the onehot
#vector representing that letter.
# The np.bool alias was removed in NumPy 1.24; the built-in bool works on all versions.
X = np.zeros((len(sentences), maxlen, vocabulary_size), dtype=bool)
y = np.zeros((len(sentences), vocabulary_size), dtype=bool)
    
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
    
print("Done preparing training corpus, shapes of sets are:")
print("X shape: " + str(X.shape))
print("y shape: " + str(y.shape))
print("Vocabulary of characters:", vocabulary_size)

## Model

- Model has one hidden layer of 128 LSTM cells.
- Output layer uses the "softmax" activation function to output a probability distribution over next letters.

<img src="figures/n-in-1-out.png" width=800>
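
To illustrate what the softmax output layer does, here is a dependency-free sketch that turns arbitrary scores into a probability distribution. The function and the `toy_logits` values are illustrative assumptions; in the actual model Keras computes this inside the output layer.

```python
import math

def softmax(logits):
    """Map real-valued scores to positive values that sum to 1."""
    # Subtracting the maximum avoids overflow without changing the result.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a three-character vocabulary.
toy_logits = [2.0, 1.0, 0.1]
probs = softmax(toy_logits)

print(probs)
print(sum(probs))
```

The largest score receives the largest probability, and every character keeps a non-zero chance of being sampled.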

In [None]:
#Build model

In [None]:
#Compile and summarize model

<img src="figures/reweighting.png" width=800>

In [None]:
#Higher diversity -> more randomness in the generation.
def sample(probability_distribution, diversity=1.0):
    # helper function to sample an index from a probability distribution
    probability_distribution = np.asarray(probability_distribution).astype('float64')
    probability_distribution = np.log(probability_distribution) / diversity
    exp_preds = np.exp(probability_distribution)
    probability_distribution = exp_preds / np.sum(exp_preds)
    #Draws 1 element at random according to the new scaled probability-distribution.
    probabilities = np.random.multinomial(n=1, pvals = probability_distribution) 
    return np.argmax(probabilities)
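
The effect of `diversity` can be seen by applying the same reweighting to a small distribution without the random draw. This is a sketch in plain Python; `reweight` and `toy_probs` are hypothetical names introduced here, and the function mirrors only the log / divide / exponentiate / normalize steps of `sample`.

```python
import math

def reweight(probs, diversity):
    """Rescale a distribution as sample() does: p ** (1 / diversity), renormalized."""
    scaled = [math.exp(math.log(p) / diversity) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

# Toy distribution over three characters (an assumption for illustration).
toy_probs = [0.5, 0.3, 0.2]

for d in (0.5, 1.0, 2.0):
    print(d, [round(p, 3) for p in reweight(toy_probs, d)])
```

A diversity below 1 sharpens the distribution toward the most likely character, a diversity of 1 leaves it unchanged, and a diversity above 1 flattens it toward uniform.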

## Method for printing some example text after every epoch

In [None]:
def generate_text_segment(length, diversity, generating_model = model_train, input_sequence_length = maxlen):
    start_index = random.randint(0, len(text) - input_sequence_length - 1)

    # We need a seed to start the text generation. Since during training the ANN always experiences
    # sentences of size 30, we seed it with a sentence of length 30 to get it into a sensible state.
    generated = ''
    sentence = text[start_index: start_index + input_sequence_length]
    generated += sentence
    
    sys.stdout.write('----- Generating with seed: "' + sentence + '"')

    for i in range(length):
        x_pred = np.zeros((1, input_sequence_length, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        predictions_distribution = generating_model.predict(x_pred, verbose=0)[0]
        next_index = sample(predictions_distribution, diversity)
        next_char = indices_char[next_index]

        generated += next_char
        #Stepping one symbol forward in the sentence
        sentence = sentence[1:] + next_char

    return generated
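
The sliding-window step inside the loop can be sketched in isolation. Here `toy_predict_next` is a hypothetical, deterministic stand-in for the `predict` plus `sample` calls; it simply returns the next letter of the alphabet so that the window movement is easy to follow.

```python
def toy_predict_next(window):
    """Stand-in for the model: return the letter after the window's last letter."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return alphabet[(alphabet.index(window[-1]) + 1) % len(alphabet)]

window = "abc"      # the seed; its length plays the role of input_sequence_length
generated = window

for _ in range(4):
    next_char = toy_predict_next(window)
    generated += next_char
    # Drop the oldest character and append the newest, so the length stays fixed.
    window = window[1:] + next_char

print(generated)  # abcdefg
print(window)     # efg
```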

In [None]:
def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    for diversity in [0.5]:#[0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = generate_text_segment(400, diversity, model_train, input_sequence_length = maxlen)
        sys.stdout.write(generated)
        print()
        

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

## Training

- Train on batches of 128 examples

In [None]:
from keras_tqdm import TQDMNotebookCallback
print("Training start")
# TODO: fit model_train here and keep the returned History object as `history` for the plot below
print("Training done")

## Plotting training error

In [None]:
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.title('Training loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

## Make a Decoder model

During training, we presented sequences of 30 characters along with the correct next character.
When using the trained model, it may be more useful to feed in one character at a time and see the next
predicted one. That will also convince us that the network is actually _using_ its internal state.

- Needs an input length of 1.
- Needs a batch size of 1.
- Needs the LSTM to be stateful.
- Check that the number of parameters is the same as in `model_train`.

<img src="figures/1-in-1-out.png" width=800>
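
Why statefulness matters can be sketched with a toy scalar recurrence. This is not an LSTM and not Keras code; the `step` function and its weights `w_x` and `w_h` are assumptions chosen only to show that one-at-a-time input reproduces the full-sequence result when the state is carried over between calls.

```python
import math

def step(x, h, w_x=0.5, w_h=0.8):
    """One update of a toy recurrent cell: new state from input and old state."""
    return math.tanh(w_x * x + w_h * h)

def run_sequence(xs, h=0.0):
    """Feed a sequence through the cell, starting from state h."""
    for x in xs:
        h = step(x, h)
    return h

inputs = [1.0, -0.5, 0.25, 2.0]

# Training style: the whole sequence in one call, starting from a zero state.
h_full = run_sequence(inputs)

# Decoder style: one input per call, state carried over between calls.
h_stateful = 0.0
for x in inputs:
    h_stateful = run_sequence([x], h=h_stateful)

# Without carried state, every call starts from zero and the history is lost.
h_stateless = 0.0
for x in inputs:
    h_stateless = run_sequence([x])

print(h_full, h_stateful, h_stateless)
```

The first two values agree, while the third depends only on the last input. A stateful layer with input length 1 corresponds to the second loop.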

In [None]:
# Build a decoding model (input length 1, batch size 1, stateful)

## Test the Model

- Seed with a random character from the corpus, then generate 1000 characters.

In [None]:
# Sample 1000 characters from the decoding model using a random seed from the vocabulary.
generated = generate_text_segment(1000, diversity=0.5, generating_model = model_dec, input_sequence_length = 1)
sys.stdout.write(generated)
print()