[TensorFlow text generation with an RNN tutorial](https://www.tensorflow.org/tutorials/text/text_generation)

In [None]:
import tensorflow as tf

import numpy as np
import os
import time

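GPU memory growth must be enabled for the model to train (training ran into memory issues without it).  With memory growth enabled, TensorFlow allocates GPU memory as it is needed instead of reserving nearly all of it up front.  The code in the cell below is borrowed from the [TensorFlow documentation](https://www.tensorflow.org/guide/gpu).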

In [None]:
# enable GPU memory growth (allocate memory as needed rather than all up front)

gpus = tf.config.experimental.list_physical_devices('GPU')

if gpus:
    try:
        # currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True) # enabling memory growth
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), 'Physical GPUs,', len(logical_gpus), 'Logical GPUs')
    except RuntimeError as e:
        # memory growth must be set before GPUs have been initialized
        print(e)

# Data

Text must all be in a single `.txt` file.

In [None]:
# # download file
# path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# # open the file
# text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# # data type?
# print(type(text))

# # number of characters
# print(f'Length of text: {len(text)} characters')

In [None]:
# open the file (the context manager closes it once it has been read)
with open('data/drake_lyrics.txt', 'r') as f:
    text = f.read()

# data type?
print(type(text))

# number of characters
print(f'Length of text: {len(text)} characters')

In [None]:
# peek into file
print(text[:250])

In [None]:
# unique characters in file
vocab = sorted(set(text))
print(type(vocab))
print(f'{len(vocab)} unique characters')

# Data Preprocessing

## Text Vectorization

Note that this is **character-level vectorization**: each unique character is mapped to an integer.  Word-level vectorization would probably produce more coherent sentences.
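
To make the difference concrete, the sketch below contrasts the two granularities on a toy string.  The `sample` string and the naive whitespace split are illustrative assumptions and are not part of this notebook's pipeline.

```python
# Toy comparison of character-level and word-level vocabularies.
# `sample` is a made-up string used only for illustration.
sample = "hold on we're going home"

# character level: every unique character becomes a token
char_vocab = sorted(set(sample))
char_ids = [char_vocab.index(c) for c in sample]

# word level (naive whitespace split): every unique word becomes a token
words = sample.split()
word_vocab = sorted(set(words))
word_ids = [word_vocab.index(w) for w in words]

print(len(char_vocab), 'character tokens,', len(char_ids), 'time steps')
print(len(word_vocab), 'word tokens,', len(word_ids), 'time steps')
```

A character model works with a small vocabulary but needs many time steps per line; a word model needs far fewer steps but a much larger vocabulary on real text.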

In [None]:
# map unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}

# reverse the map - use this to specify an index to obtain a character
idx2char = np.array(vocab)

# entire text document represented in the above character-to-indices mapping
text_as_int = np.array([char2idx[c] for c in text])

# sample
print(f'"{text[:13]}" ---- characters mapped to int ---- > {text_as_int[:13]}')

## Create Training Examples & Targets

**model input**: sequence of characters

**model output (prediction)**: the following character at each step (based on previous characters in the sequence)

Divide the text into **example sequences**.  Each input sequence will contain `seq_length` characters from the text.

**For each input sequence, the corresponding targets contain the same length of text, except shifted one character to the right.**

So, break the text into chunks of `seq_length + 1`.  e.g. if `seq_length` is 4 and our text is "Hello", the input sequence would be "Hell" and the target sequence would be "ello".
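
The "Hello" example can be checked in plain Python before any TensorFlow objects are involved.  The names `toy_text` and `toy_seq_length` are placeholders chosen so that the notebook's own `text` and `seq_length` are not overwritten.

```python
# Plain-Python sketch of the input/target split described above.
toy_seq_length = 4
toy_text = "Hello"

chunk = toy_text[:toy_seq_length + 1]  # one chunk of seq_length + 1 characters
input_text = chunk[:-1]                # everything except the last character
target_text = chunk[1:]                # everything except the first character

# at each position, the target is the character that follows the input
for inp, tgt in zip(input_text, target_text):
    print(repr(inp), '->', repr(tgt))
```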

`tf.data.Dataset.from_tensor_slices` converts the text vector into a stream of character indices.

In [None]:
# max sentence length (in number of characters) desired for single input
seq_length = 100
examples_per_epoch = len(text) // (seq_length + 1) # floored division

# create training examples/targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

# data type of train examples/targets
print(type(char_dataset))

In [None]:
# preview training examples as characters (using the indices in char_dataset)
for i in char_dataset.take(5):
    print(idx2char[i.numpy()]) # .numpy() converts into numpy data format (in this case, a numpy integer)

Use the `batch` method on `char_dataset` (type `tensorflow.python.data.ops.dataset_ops.TensorSliceDataset`) to convert the individual characters to sequences of the desired size (`seq_length`).
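
As a rough mental model, `batch(size, drop_remainder=True)` groups consecutive elements into chunks of `size` and discards a short final chunk.  The helper `batch_drop_remainder` below is a hypothetical plain-Python stand-in written for this note, not a TensorFlow function.

```python
def batch_drop_remainder(items, size):
    """Group items into consecutive chunks of `size`, discarding a short final chunk."""
    n_full = len(items) // size
    return [items[i * size:(i + 1) * size] for i in range(n_full)]

toy_ids = list(range(11))                              # stand-in for character indices
toy_sequences = batch_drop_remainder(toy_ids, 4 + 1)   # seq_length = 4
print(toy_sequences)
```

With 11 elements and chunks of 5, the eleventh element is dropped because it cannot fill a complete chunk.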

In [None]:
# create sequence batches from the char_dataset
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
print(type(sequences), '\n')

# preview some sequences
for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))

For each sequence, duplicate and shift it to form the input and target text, using the `map` method to apply a simple function to each sequence.

In [None]:
# define the shifting (splitting) function
def split_input_target(chunk):
    input_text = chunk[:-1] # up to but not including the last character
    target_text = chunk[1:] # everything except for the first character
    return input_text, target_text

In [None]:
# apply the shifting to create the input and target texts that make up our dataset
dataset = sequences.map(split_input_target)
print(type(dataset))

In [None]:
# see the first example of input and target values
for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))
    print()

As an example, take the tutorial's Shakespeare text, whose first word is "First".  During training, at time step 0, the model receives the index for "F" and tries to predict "i" as the next character.  At the next time step it does the same thing, but the RNN considers the context from the previous time step in addition to the current input character (it would consider both "F" and "i" in trying to predict "r").

**BELOW CELL CAUSES GPU MEMORY SPIKE**

In [None]:
# # first few examples of prediction time steps
# for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
#     print(f"Step {i:4d}")
#     print(f"  input: {input_idx} ({repr(idx2char[input_idx]):s})")
#     print(f"  expected output: {target_idx} ({repr(idx2char[target_idx]):s})")

## Create Training *Batches*

`tf.data` was used to split the text into _sequences_.  But before feeding this data into the model, we must _shuffle_ the data and pack it into _batches_.  The first layer of the model will be a Keras `Embedding` layer.
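
The cell below shuffles through a fixed-size buffer rather than loading the whole dataset into memory.  The sketch here shows that idea in plain Python; `buffered_shuffle` is a hypothetical helper written for this note that approximates the behaviour of `Dataset.shuffle(buffer_size)` and is not TensorFlow's actual implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a partially random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # fill the buffer first
            buffer.append(item)
        else:
            # emit a random buffered element and put the new item in its slot
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # the input is exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer

print(list(buffered_shuffle(range(10), buffer_size=3, seed=0)))
```

A small buffer only mixes nearby elements, which is why `BUFFER_SIZE` is set well above the batch size.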

In [None]:
# batch size
BATCH_SIZE = 64

# buffer size to shuffle the dataset
# (TensorFlow data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory.  Instead,
# it maintains a buffer in which it shuffles elements)
BUFFER_SIZE = 10000

dataset_sb = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset_sb

The data is now ready to be passed into an RNN model.

# Modelling

## Building the Model

Use `tf.keras.Sequential` to define the model. For this simple example, three layers are used:

- `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that will map the numbers of each character to a vector with `embedding_dim` dimensions
- `tf.keras.layers.GRU`: A type of RNN with size `units = rnn_units` (You can also use an LSTM layer here)
- `tf.keras.layers.Dense`: The output layer, with `vocab_size` outputs
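
A quick way to sanity-check these three layers is to estimate their parameter counts by hand and compare them with `rnn.summary()`.  The sketch below assumes a vocabulary of 65 characters (a placeholder; the real value depends on the text file) and assumes the GRU uses the TensorFlow 2 default `reset_after=True`, which keeps two bias vectors per gate.

```python
# Hand-computed parameter counts for Embedding -> GRU -> Dense.
example_vocab_size = 65      # assumption: placeholder vocabulary size
example_embedding_dim = 256  # same value as embedding_dim in this notebook
example_rnn_units = 1024     # same value as rnn_units in this notebook

# Embedding: one vector of size embedding_dim per character
embedding_params = example_vocab_size * example_embedding_dim

# GRU (assuming reset_after=True): 3 gates, each with an input kernel,
# a recurrent kernel, and two bias vectors
gru_params = 3 * (
    example_embedding_dim * example_rnn_units
    + example_rnn_units * example_rnn_units
    + 2 * example_rnn_units
)

# Dense: weight matrix plus one bias per output character
dense_params = example_rnn_units * example_vocab_size + example_vocab_size

print(embedding_params, gru_params, dense_params)
print('total:', embedding_params + gru_params + dense_params)
```

Almost all of the parameters sit in the GRU layer, so `rnn_units` is the setting that most affects model size and GPU memory use.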

In [None]:
# vocabulary length (number of characters)
vocab_size = len(vocab)

# embedding dimension
embedding_dim = 256

# number of RNN units
rnn_units = 1024

In [None]:
# helper function to quickly build the RNN model based on vocab size, embedding dimension, number of RNN units, and batch size
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential()
    
    model.add(tf.keras.layers.Embedding(
        input_dim = vocab_size,
        output_dim = embedding_dim,
        batch_input_shape=[batch_size, None]
    ))
    
    model.add(tf.keras.layers.GRU(
        units = rnn_units,
        return_sequences = True,
        stateful = True,
        recurrent_initializer = 'glorot_uniform'
    ))
    
    model.add(tf.keras.layers.Dense(units=vocab_size))
    
    return model

In [None]:
# build the model
rnn = build_model(
    vocab_size = vocab_size,
    embedding_dim = embedding_dim,
    rnn_units = rnn_units,
    batch_size = BATCH_SIZE
)

## Try the Model (Without Training)

First, check the shape of the output:

In [None]:
for input_example_batch, target_example_batch in dataset_sb.take(1):
    example_batch_predictions = rnn(input_example_batch)
    print(example_batch_predictions.shape, '# (batch_size, sequence_length, vocab_size)')

The sequence length (`seq_length`) was set to `100` but the model can be run on inputs of any length.

In [None]:
rnn.summary()

To get actual predictions from the model, we must sample from the output distribution to get actual character indices.  This distribution is defined by the logits over the character vocabulary.

**Note**: It is important to _sample_ from this distribution, since taking the _argmax_ of the distribution can easily get the model stuck in a loop.
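
The difference between the two strategies can be seen on a toy set of logits, without the model.  `toy_logits` and the `softmax` helper are illustrative assumptions; the notebook itself uses `tf.random.categorical` for the sampling step.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.5, 0.1]   # made-up scores for a 3-character vocabulary
probs = softmax(toy_logits)

# argmax always returns the same index for the same distribution
greedy = [max(range(len(probs)), key=probs.__getitem__) for _ in range(10)]

# sampling draws each index in proportion to its probability
rng = random.Random(0)
sampled = rng.choices(range(len(probs)), weights=probs, k=10)

print('argmax :', greedy)
print('sampled:', sampled)
```

Because argmax is deterministic, the same context always yields the same next character, which is how generation can fall into a repeating loop.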

Try it for the first example in the batch:

In [None]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy() # with axis=-1, tf.squeeze() removes only the trailing size-1 dimension

This gives us, at each timestep, a prediction of the next character index:

In [None]:
display(sampled_indices)
print(len(sampled_indices))

Decode these to see the text predicted by the untrained model:

In [None]:
print(f'Input: {repr("".join(idx2char[input_example_batch[0]]))}\n')
print(f'Output: {repr("".join(idx2char[sampled_indices]))}')

## Training the Model

We now have a classification problem: **Given the previous RNN state, and the input at this time step, predict the class of the next character.**

### Attaching an Optimizer and Loss Function

The standard `tf.keras.losses.sparse_categorical_crossentropy` loss function works in this case because it is applied across the last dimension of the predictions.

Because the model returns logits, we need to set the `from_logits` flag to `True`.
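
For a single time step, the loss with `from_logits=True` amounts to the negative log of the softmax probability assigned to the true character.  The function below is a plain-Python sketch of that calculation; `sparse_ce_from_logits` is a hypothetical name and the logits are made up.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy for one integer label, computed directly from raw logits."""
    m = max(logits)
    # log-sum-exp, shifted by the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # -log(softmax(logits)[label]) == log_sum_exp - logits[label]
    return log_sum_exp - logits[label]

# uniform logits over 2 classes give a loss of log(2)
print(sparse_ce_from_logits(0, [0.0, 0.0]))

# a confident, correct prediction gives a loss close to 0
print(sparse_ce_from_logits(1, [0.0, 10.0, 0.0]))
```

By the same reasoning, an untrained model that spreads probability roughly evenly should start with a mean loss near `log(vocab_size)`, which is a useful check on the first loss value printed below.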

In [None]:
# helper function to obtain the loss function
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

In [None]:
example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Configure the training procedure using the `tf.keras.Model.compile` method.  Use `tf.keras.optimizers.Adam` with default arguments and the loss function.

In [None]:
rnn.compile(
    optimizer = 'adam',
    loss = loss,
    metrics = ['accuracy']
)

### Configure Checkpoints

Use a `tf.keras.callbacks.ModelCheckpoint` to ensure that checkpoints are saved during training:

In [None]:
# directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'

# name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, 'checkpoint')

# create checkpoints-saving object
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath = checkpoint_prefix,
    monitor = 'loss',
    save_best_only = True,
    mode = 'min',
    save_weights_only = True
)

### Execute the Training

In [None]:
# set number of desired epochs
EPOCHS = 50

In [None]:
%%time

# training!
history = rnn.fit(
    x = dataset_sb,
    epochs = EPOCHS,
    callbacks = [checkpoint_callback]
)

## Generating Text (Making Predictions)

### Restore the Latest Checkpoint

- batch size 1 (for simplicity)
- because of the way the RNN state is passed from time step to time step, the model only accepts a fixed batch size once built
- **to run the model with a different `batch_size`, we need to rebuild the model and restore the weights from the last checkpoint**

In [None]:
# check the file in the working directory that contains the most recent checkpoint
tf.train.latest_checkpoint(checkpoint_dir)

In [None]:
# initiate a new RNN model instance
rnn_cp = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

# load the saved weights from the checkpoint into the new model instance
rnn_cp.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

# build the model with a new input shape
rnn_cp.build(tf.TensorShape([1, None]))

In [None]:
rnn_cp.summary()

### The Prediction Loop

- start by choosing a start string, initializing the RNN state and setting the number of characters to generate
- get the prediction distribution of the next character using the start string and the RNN state
- then, use a categorical distribution to calculate the index of the predicted character
- use this predicted character as our next input to the model
- the RNN state returned by the model is fed back into the model so that it now has more context, instead of only one character
- after predicting the next character, the updated RNN state is again fed back into the model; the weights do not change during generation, but the state accumulates context from the previously predicted characters
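
The categorical sampling step can be made more or less adventurous by dividing the logits by a temperature before sampling, which is what the `temperature` argument of `generate_text` below is for.  The sketch uses made-up logits and a hypothetical helper, `softmax_with_temperature`, to show the effect on the resulting distribution.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits that have been divided by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]   # made-up scores for a 3-character vocabulary
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

Temperatures below 1 concentrate probability on the most likely character, and temperatures above 1 flatten the distribution so that less likely characters are sampled more often.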

In [None]:
# text prediction function
def generate_text(model, start_string, num_generate=500, temperature=1.0):
    # num_generate: number of characters to generate
    # temperature: low values result in more predictable text,
    #              high values result in more surprising text
    #              (feel free to experiment with this parameter)
    
    # vectorizing the start string to numbers
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input=input_eval, axis=0) # returns a tensor with a length-1 axis inserted at index `axis`
    
    # empty list to store results
    text_generated = list()
    
    # the batch size (1) was fixed when the model was rebuilt for generation
    
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)
        
        # scale the logits by the temperature, then use a categorical distribution
        # to sample the character predicted by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        
        # pass the predicted character as the next input to the model along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)
        
        text_generated.append(idx2char[predicted_id])
    
    return start_string + ''.join(text_generated)

In [None]:
%%time

# text generation!
print(generate_text(rnn_cp, start_string=u'Anthony simps hard for Janet', num_generate=1000))