# Character-Level Text Generation

In this lab, you will learn to use recurrent neural networks to generate text that resembles the work of Friedrich Nietzsche. You will clean the data, search for parameters that minimize the loss function, and generate a high-quality Nietzsche-like quote.

In [0]:
import keras
import numpy as np
import re, collections            # for text processing
from google.colab import files    # for downloading files

## Step 1) Download the dataset

In [0]:
path = keras.utils.get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with open(path, encoding='utf-8') as f:
    text = f.read()
print('corpus length:', len(text))


## Step 2) Text preprocessing

For simplicity, we convert everything to lower case and replace line breaks with spaces. We might also want to remove characters that occur only a few times or look redundant (a smaller alphabet means faster computation).

In [0]:
text = text.lower().replace("\n", " ")
# raw string avoids invalid-escape warnings; removes accented letters, digits and brackets
text = re.sub(r'[ëæéä0-9_\[\]=()]', '', text)

chars = sorted(list(set(text)))
num_chars = len(chars)
print('total characters in vocabulary:', num_chars)

charcounts = collections.Counter(text)
# characters sorted from rarest to most frequent
sorted(charcounts.items(), key=lambda i: i[1])
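
The character counts can guide which symbols to drop. Below is a minimal sketch of one way to do that, assuming a hypothetical threshold `MIN_COUNT` and a toy string in place of the real corpus. The helper `drop_rare_chars` is not part of the original notebook.

```python
import collections

def drop_rare_chars(text, min_count):
    """Remove every character that occurs fewer than min_count times."""
    counts = collections.Counter(text)
    rare = {c for c, n in counts.items() if n < min_count}
    return ''.join(c for c in text if c not in rare)

toy_text = "aaaa bbbb aaxb"   # 'x' occurs only once
MIN_COUNT = 2                 # hypothetical threshold, tune it for the real corpus
cleaned = drop_rare_chars(toy_text, MIN_COUNT)
print(cleaned)
```

If you apply something like this to `text`, do it before building `chars`, so that the vocabulary reflects the cleaned corpus.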

## Step 3) Cut the text in semi-redundant sequences

For training, the text is cut into smaller pieces of the same length. Longer pieces provide more context but need more time and memory for training.

In [0]:
SEQ_LENGTH = 40   # length of sequences
STEP = 10         # shift in cursor between sequences
DEPTH = 1         # number of hidden LSTM/GRU layers
UNIT_SIZE = 128   # number of units per LSTM/GRU layer
DROPOUT = 0.1     # dropout parameter

In [0]:
sentences = list()
targets = list()
for i in range(0, len(text) - SEQ_LENGTH - 1, STEP):
    sentences.append(text[i: i + SEQ_LENGTH])
    targets.append(text[i + 1: i + SEQ_LENGTH + 1])
print('number of sequences:', len(sentences))
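
To see what the loop above produces, here is the same slicing logic applied to a short toy string. It is only an illustration: the function name `make_sequences` and the small `seq_length` and `step` values are assumptions chosen to keep the output readable.

```python
def make_sequences(text, seq_length, step):
    """Return (inputs, targets); each target is its input shifted by one character."""
    inputs, targets = [], []
    for i in range(0, len(text) - seq_length - 1, step):
        inputs.append(text[i: i + seq_length])
        targets.append(text[i + 1: i + seq_length + 1])
    return inputs, targets

toy_inputs, toy_targets = make_sequences("hello world", seq_length=5, step=3)
for inp, tgt in zip(toy_inputs, toy_targets):
    print(repr(inp), '->', repr(tgt))
```

Because `step` is smaller than `seq_length`, neighbouring windows overlap, which is why the sequences are called semi-redundant.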

## Step 4) Vectorization

Each character is one-hot encoded. One reason to do this is that feeding raw integer indices into an RNN would imply an ordering among categorical values that does not exist.

In [0]:
# dictionaries to convert characters to numbers and vice-versa
char_to_indices = dict((c, i) for i, c in enumerate(chars))
indices_to_char = dict((i, c) for i, c in enumerate(chars))

# np.bool was removed from recent NumPy releases; the built-in bool works everywhere
X = np.zeros((len(sentences), SEQ_LENGTH, num_chars), dtype=bool)
y = np.zeros((len(sentences), SEQ_LENGTH, num_chars), dtype=bool)
for i in range(len(sentences)):
    sentence = sentences[i]
    target = targets[i]
    for j in range(SEQ_LENGTH):
        X[i][j][char_to_indices[sentence[j]]] = 1
        y[i][j][char_to_indices[target[j]]] = 1
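
The nested loops make the encoding explicit. As an alternative sketch, the same one-hot matrix for a single sequence can be built by indexing an identity matrix. The three-letter vocabulary and the helper `one_hot` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

toy_chars = sorted(set("abca"))                      # ['a', 'b', 'c']
toy_index = {c: i for i, c in enumerate(toy_chars)}  # character -> column

def one_hot(sequence, char_to_index):
    """Encode a string as a (len(sequence), vocabulary_size) boolean matrix."""
    ids = [char_to_index[c] for c in sequence]
    # row k of the identity matrix is the one-hot vector for index k
    return np.eye(len(char_to_index), dtype=bool)[ids]

encoded = one_hot("cab", toy_index)
print(encoded.astype(int))
```

Each row contains exactly one `True`, in the column of the corresponding character.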

## Step 5) Model definition

The model consists of one, two, or three LSTM layers, each followed by dropout, and then a densely connected layer with softmax. You can experiment and modify this code: use GRU (`keras.layers.GRU`) instead of LSTM, try the SGD or Adam optimizers instead of RMSprop, and modify the learning rate (the `lr` parameter, named `learning_rate` in newer Keras releases).

In [0]:
model = keras.models.Sequential()
for _ in range(DEPTH):
    model.add(keras.layers.LSTM(UNIT_SIZE, input_shape=(None, num_chars), return_sequences=True))
    model.add(keras.layers.Dropout(DROPOUT))
model.add(keras.layers.TimeDistributed(keras.layers.Dense(num_chars)))
model.add(keras.layers.Activation('softmax'))

In [0]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)  # use lr=0.01 on older Keras versions
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

## Helper functions: Generating text from the model

The function **sample** takes the trained model and returns a sample of text generated from it.
You can set the beginning of the text (`seed_string`) and the `temperature`. A lower temperature makes the text more confident (but also more conservative).

In [0]:
def multinomial_with_temperature(preds, temperature=1.0):
    """
    Sample an index from a multinomial distribution after rescaling by temperature.
    """
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-8) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def sample(model, char_to_indices, indices_to_char, 
           seed_string=" ", temperature=1.0, test_length=150):
    """
    Generates text of test_length length from model starting with seed_string.
    """
    num_chars = len(char_to_indices.keys())
    for i in range(test_length):
        test_in = np.zeros((1, len(seed_string), num_chars))
        for t, char in enumerate(seed_string):
            test_in[0, t, char_to_indices[char]] = 1
        entire_prediction = model.predict(test_in, verbose=0)[0]
        next_index = multinomial_with_temperature(entire_prediction[-1], temperature)
        next_char = indices_to_char[next_index]
        seed_string = seed_string + next_char
    return seed_string
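
To illustrate what the temperature does, the sketch below applies the same rescaling as `multinomial_with_temperature` to a made-up probability vector, without drawing a sample. The helper name `apply_temperature` and the toy probabilities are assumptions for illustration.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector: take logs, divide by temperature, re-normalize."""
    logits = np.log(np.asarray(probs, dtype='float64') + 1e-8) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

toy_probs = [0.6, 0.3, 0.1]                 # made-up prediction over three characters
cold = apply_temperature(toy_probs, 0.5)    # low temperature sharpens the distribution
warm = apply_temperature(toy_probs, 2.0)    # high temperature flattens it
print(np.round(cold, 3), np.round(warm, 3))
```

With a temperature below 1, the most likely character gains probability mass. Above 1, the distribution moves toward uniform, so rarer characters are sampled more often.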

## Step 6) Model training

Each time you run the code below, the model is trained for 10 more epochs (each sequence is visited 10 times). If the predictions are not good enough, run the cell again to add another 10 epochs, and so on.

In [0]:
history = model.fit(X, y,
            batch_size=1024,
            epochs=10)

## Step 7) Generate text

Is it good? Congratulations! You can save and download the model with the code below. Does it need improvement? Either the model needs more training (Step 6), or you need to change the parameters or the model definition (Steps 3 and 5).

In [0]:
sample(model, char_to_indices=char_to_indices, indices_to_char=indices_to_char, seed_string="truth", temperature=0.8)

## Step 8) Saving the model

In [0]:
model_filename = 'nietzsche.loss{0:.2f}.h5'.format(history.history['loss'][-1])
model.save(model_filename)
files.download(model_filename)

## Acknowledgement

This notebook was adapted from Michael Zhang's [Char-RNN](https://github.com/michaelrzhang/Char-RNN) and the [lstm_text_generation.py](https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py) example in the Keras GitHub repository. Both were inspired by Andrej Karpathy's blog post [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).