# Text Generation Using a Character-Based RNN

This example covers the following:

1. Import the libraries.
2. Load, read, and process a text dataset.
3. Create training examples, targets, and batches.
4. Build the model.
5. Attach a loss function and an optimizer, configure checkpoints, and train the model to predict the next character in a sequence.
6. Restore the latest checkpoint and generate text.

### Datasets

* Shakespeare dataset (see the `./datasets` directory)

## Import relevant libraries, frameworks etc.

In [42]:
!pip install -q tensorflow
!pip install -q numpy

import tensorflow as tf
from tensorflow.keras import layers

import numpy as np
import os
import time

print("Tensorflow version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

Tensorflow version:  2.1.0
Eager mode:  True
GPU is NOT AVAILABLE


## Download dataset

To download the Shakespeare dataset, uncomment the `get_file` line in the next cell and comment out the local `filepath` assignment.

In [1]:
# filepath = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

filepath = '../../datasets/shakespeare.txt'

## Read the data

Read the file, decode it, and run a few quick checks to get a sense of the data:

* Print the length of the text, which is the number of characters in it
* Print the first 250 characters of the text
* Count the unique characters in the file

In [44]:
text = open(filepath, 'rb').read().decode(encoding='utf-8')

print("Length of text: {} characters".format(len(text)))

print("First 250 characters: ", text[:250])

vocabulary = sorted(set(text))
print("There are {} unique characters in the file".format(len(vocabulary)))

Length of text: 1115394 characters
First 250 characters:  First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

There are 65 unique characters in the file


## Process the text

### Vectorize the text

Before training, we need to map strings to a numerical representation.

Create two lookup tables:
1. Characters to numbers (`char2idx`).
2. Numbers to characters (`idx2char`).

The cell below builds both tables and then encodes the whole text as an array of integers.

In [45]:
char2idx = { u:i for i, u in enumerate(vocabulary) }
print(char2idx)

idx2char = np.array(vocabulary)
print(idx2char)

text_as_int = np.array([char2idx[c] for c in text])
print(text_as_int)

{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
['\n' ' ' '!' '$' '&' "'" ',' '-' '.' '3' ':' ';' '?' 'A' 'B' 'C' 'D' 'E'
 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W'
 'X' 'Y' 'Z' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o'
 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z']
[18 47 56 ... 45  8  0]


* Now we have an integer representation for each character. Notice that each character is mapped to an index from 0 to `len(vocabulary) - 1`.

* The next cell shows how the first 32 characters of the text are mapped to integers.

In [46]:
print('{} ---- characters mapped to int ----> {}'.format(repr(text[:32]), text_as_int[:32]))

'First Citizen:\nBefore we proceed' ---- characters mapped to int ----> [18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 14 43 44 53 56 43  1 61 43
  1 54 56 53 41 43 43 42]
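
The two lookup tables are inverses of each other, so encoding a string and then decoding it should give back the original. Here is a minimal plain-Python sketch of that round trip. It uses a tiny stand-in string rather than the Shakespeare file, and the names `sample`, `encode`, and `decode` are illustrative, not part of the notebook:

```python
# Tiny stand-in corpus; the notebook uses the full Shakespeare text instead.
sample = "First Citizen"

# Same construction as above: the sorted unique characters define the vocabulary.
vocab = sorted(set(sample))
char_to_index = {ch: i for i, ch in enumerate(vocab)}
index_to_char = list(vocab)


def encode(s):
    """Map each character to its integer index."""
    return [char_to_index[ch] for ch in s]


def decode(indices):
    """Map integer indices back to a string."""
    return "".join(index_to_char[i] for i in indices)


encoded = encode(sample)
print(encoded)
print(decode(encoded) == sample)  # the round trip recovers the input
```

Any character that is not in the vocabulary raises a `KeyError` in `encode`. The same applies to `char2idx` when a start string contains characters that never appear in the training text.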


### The prediction task

Given a character, or a sequence of characters, what is the most probable next character? This is the task we're training the model to perform.

The input to the model is a sequence of characters, and the model is trained to predict the output: the following character at each time step.

Because RNNs maintain an internal state that depends on previously seen elements, the question becomes: given all the characters seen so far, what is the next character?

## Create training examples and targets

Each input sequence will contain `seq_length` characters from the text.

For each input sequence, the corresponding targets contain the same length of text, except shifted one character to the right.

So break the text into chunks of `seq_length + 1`. For example, if `seq_length` is 4 and the text is "Hello", the input sequence is "Hell" and the target sequence is "ello".
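
The "Hello" example can be checked in plain Python. This is only a sketch of the chunk-and-shift idea, and `make_examples` is a hypothetical helper, not a function from the notebook or from TensorFlow:

```python
def make_examples(text, seq_length):
    """Split text into chunks of seq_length + 1 and return (input, target) pairs."""
    chunk_size = seq_length + 1
    pairs = []
    # Step through the text one full chunk at a time, dropping any short remainder.
    for start in range(0, len(text) - chunk_size + 1, chunk_size):
        chunk = text[start:start + chunk_size]
        # Input is everything but the last character; target is shifted by one.
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


pairs = make_examples("Hello", seq_length=4)
print(pairs)  # [('Hell', 'ello')]
```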

To build these chunks, first use the `tf.data.Dataset.from_tensor_slices` function to convert the text vector into a stream of character indices.

In [47]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)
# print("examples_per_epoch {}".format(examples_per_epoch))

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

F
i
r
s
t


The `batch` method converts these individual characters to sequences of the desired size.

In [48]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
print(sequences)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

<BatchDataset shapes: (101,), types: tf.int64>
'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'


For each sequence, duplicate and shift it to form the input and target text, using the `map` method to apply a simple function to each batch:

In [49]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
dataset

<MapDataset shapes: ((100,), (100,)), types: (tf.int64, tf.int64)>

Print the first example's input and target values:

In [50]:
for input_example, target_example in dataset.take(1):
  print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


Each index of these vectors is processed as one time step.

1. For the input at time step 0, the model receives the index for "F" and tries to predict the index for "i" as the next character.
2. At the next time step, it does the same thing, but the RNN considers the context from the previous step in addition to the current input character.

In [51]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 18 ('F')
  expected output: 47 ('i')
Step    1
  input: 47 ('i')
  expected output: 56 ('r')
Step    2
  input: 56 ('r')
  expected output: 57 ('s')
Step    3
  input: 57 ('s')
  expected output: 58 ('t')
Step    4
  input: 58 ('t')
  expected output: 1 (' ')


## Create training batches

* We used `tf.data` to split the text into manageable sequences. But before feeding this data into the model, we need to shuffle the data and pack it into batches.

In [52]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
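
The comment about the shuffle buffer can be illustrated with a simplified model of buffered shuffling. This plain-Python sketch is an assumption about the general technique, not TensorFlow's actual implementation, and `buffered_shuffle` is a hypothetical name:

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        pos = rng.randrange(buffer_size)
        yield buffer[pos]
        buffer[pos] = item
    # Source exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(shuffled)
```

With a buffer much smaller than the dataset, elements only move a short distance from their original position. In this notebook the dataset holds 11043 sequences and `BUFFER_SIZE` is 10000, so the buffer covers most of the data.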

## Build the model

Use `tf.keras.Sequential` to define the model. This example uses three layers:

1. `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that maps the number of each character to a vector with `embedding_dim` dimensions.
2. `tf.keras.layers.LSTM`: A type of RNN with size `units=rnn_units`. (You can also use a GRU layer here.)
3. `tf.keras.layers.Dense`: The output layer, with `vocabulary_size` outputs.

In [53]:
# Length of the vocabulary in chars
vocabulary_size = len(vocabulary)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [54]:
def build_model(vocabulary_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocabulary_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocabulary_size)
  ])
  return model

In [55]:
model = build_model(
  vocabulary_size = len(vocabulary),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

For each character, the model looks up the embedding, runs the LSTM for one time step with the embedding as input, and applies the dense layer to generate logits predicting the log-likelihood of the next character:

In [56]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocabulary_size)")

(64, 100, 65) # (batch_size, sequence_length, vocabulary_size)


In [57]:
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (64, None, 256)           16640     
_________________________________________________________________
lstm_3 (LSTM)                (64, None, 1024)          5246976   
_________________________________________________________________
dense_5 (Dense)              (64, None, 65)            66625     
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________
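
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector:

```python
vocabulary_size = 65
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim-sized vector per vocabulary entry.
embedding_params = vocabulary_size * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + rnn_units) * rnn_units + rnn_units)

# Dense: one weight per (unit, class) pair plus one bias per class.
dense_params = rnn_units * vocabulary_size + vocabulary_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 16640 5246976 66625 5330241
```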


To get actual character indices from the model, sample from the output distribution. This distribution is defined by the logits over the character vocabulary.

In [59]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
sampled_indices

array([10, 28, 11, 55,  3, 29, 12, 59, 34, 50, 21, 42, 53, 22, 62, 60, 55,
       17, 15, 45, 55, 12, 18, 26, 13, 56, 35, 16, 40, 45, 62, 55, 10,  8,
       16, 21,  0, 49, 60, 26, 18, 56, 39, 16, 41, 60, 28, 19, 19, 57, 38,
       61, 10, 17,  8, 19, 12, 60, 10, 64, 58, 41, 43, 39, 36, 52, 63, 41,
       18, 40,  0, 51, 54, 16, 37, 23,  1, 37, 21, 34,  7, 61,  7,  7, 39,
        7, 57, 48, 30, 13, 55, 34,  6, 12, 58,  7, 42, 15, 15, 12])
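
Sampling from logits means turning the scores into probabilities with a softmax and then drawing an index at random with those probabilities. Here is a plain-Python sketch of the idea on a hypothetical three-character vocabulary. It illustrates the concept and does not reproduce `tf.random.categorical` internals:

```python
import math
import random


def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, rng):
    """Draw one index with probability proportional to exp(logit)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.0, -2.0]  # hypothetical scores for a 3-character vocabulary
draws = [sample_index(logits, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(logits))]
print(softmax(logits))
print(counts)  # index 0 is drawn most often, but not every time
```

Sampling, rather than always taking the highest-scoring index, keeps some randomness in the output.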

In [60]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 "do. Come, cousin, I'll\nDispose of you.\nGentlemen, go, muster up your men,\nAnd meet me presently at B"

Next Char Predictions: 
 ':P;q$Q?uVlIdoJxvqECgq?FNArWDbgxq:.DI\nkvNFraDcvPGGsZw:E.G?v:ztceaXnycFb\nmpDYK YIV-w--a-sjRAqV,?t-dCC?'


## Train the model

At this point the problem can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character.

### Attach an optimizer, and a loss function

* The standard `tf.keras.losses.sparse_categorical_crossentropy` loss function works in this case because it is applied across the last dimension of the predictions.

* Because our model returns logits, we need to set the `from_logits` flag.

In [61]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocabulary_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 65)  # (batch_size, sequence_length, vocabulary_size)
scalar_loss:       4.174503
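
The reported loss is close to ln(65) ≈ 4.1744, the cross-entropy of a model that gives every one of the 65 characters equal probability. This suggests the untrained model's predictions are nearly uniform. The plain-Python sketch below computes the per-character loss from logits. The function name is illustrative, and the all-zero logits are an assumed stand-in for an untrained model:

```python
import math


def sparse_cross_entropy_from_logits(label, logits):
    """Negative log-probability of the true class under softmax(logits)."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[label]


vocabulary_size = 65
uniform_logits = [0.0] * vocabulary_size  # a model with no preference yet
loss_value = sparse_cross_entropy_from_logits(label=18, logits=uniform_logits)
print(round(loss_value, 4))  # 4.1744, i.e. ln(65)
```

A loss well below this value after training indicates the model has learned something about which characters tend to follow which.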


Configure the training procedure using the `tf.keras.Model.compile` method. We'll use `tf.keras.optimizers.Adam` with default arguments and the loss function defined above.

In [62]:
model.compile(optimizer='adam', loss=loss)

### Configure checkpoints

* Use a `tf.keras.callbacks.ModelCheckpoint` to ensure that checkpoints are saved during training:

In [63]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'

# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

### Execute the training

To keep training time short, the cell below trains for a single epoch (`EPOCHS=1`). Increase `EPOCHS` for better results.

Notes:

* In Colab, set the runtime to GPU for faster training.

In [64]:
EPOCHS=1

history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Train for 172 steps
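
The 172 steps per epoch follow from the numbers used earlier: the text length, the chunk size of `seq_length + 1`, and the batch size, with `drop_remainder=True` at both stages. A short arithmetic check:

```python
text_length = 1115394  # characters in the corpus, as printed earlier
seq_length = 100
batch_size = 64

# Each chunk holds seq_length + 1 characters; drop_remainder discards the short tail.
num_sequences = text_length // (seq_length + 1)

# Batches of 64 sequences; drop_remainder again discards the incomplete last batch.
steps_per_epoch = num_sequences // batch_size

print(num_sequences, steps_per_epoch)  # 11043 172
```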


## Generate text

### Restore the latest checkpoint

To keep this prediction step simple, use a batch size of 1.

Notes:
    
* Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.

* To run the model with a different batch_size, we need to rebuild the model and restore the weights from the checkpoint.

In [66]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_1'

In [68]:
model = build_model(vocabulary_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [69]:
model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (1, None, 256)            16640     
_________________________________________________________________
lstm_4 (LSTM)                (1, None, 1024)           5246976   
_________________________________________________________________
dense_6 (Dense)              (1, None, 65)             66625     
Total params: 5,330,241
Trainable params: 5,330,241
Non-trainable params: 0
_________________________________________________________________


### The prediction loop

The following code block generates the text:
    
1. Start by choosing a start string, initializing the RNN state, and setting the number of characters to generate.
2. Get the prediction distribution of the next character using the start string and the RNN state.
3. Use a categorical distribution to sample the index of the predicted character. Use this predicted character as the next input to the model.
4. The RNN state returned by the model is fed back into the model so that it has more context than a single character. After each prediction, the updated RNN state is fed back in again, so every new character is conditioned on the characters generated before it. No weights are updated during generation.

Looking at the generated text, you'll see that the model has started to capitalize, break text into paragraphs, and imitate a Shakespeare-like vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences.

In [70]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))

In [71]:
print(generate_text(model, start_string=u"ROMEO: "))

ROMEO: ID&FZ?jXSX;&3J3X$g$Jn3xJ!!k$xQ&Kx?HXZdIGHH!BXOVQUj$$3hMkqJeZg.$Kj;fz.,AF 
isgisa, with thy thee beadivest?
Yor
Your-st am to lom were's fordande?
Are ble bency abl the sainde;
To hove way heal!

wEAkn CUINREH:
Ther siliby mo hesp reave; wish if shis?

Yos ma, dore
ther has chay,

MANIL:
Wist, with nope-in, o metollt very bien's brouss!
Nrecon:
I nalk thy mensterth ustmrayssw-
Anlt?
Ane mo she lowe:
-fail you, wa: condude he hevelest net so; her theme,
Buf. Gut wher bithtience han thim wey my her lege notheveorsse!

GULGORIO:
Thy frist br dent thoulll with mast!

Kicon,:
Ge to sto wiints ofuot soQ
TLUCUS:
My thouk:
And gomyeme'e wil kely!
Ascabbuns buf thy cormsen with. 'sw mquan hery ie then careecy.

PBARILIO:
O wired mistly mame

PLANEL:
I, une wo go dain,
That, dut stoul sher's pepim do minssler.

GOWI RASHe with GMood:
Whis ast lithe kibkmy: sher; shat. Mearry wimy.-
Dethour to my ford sithee andibrseng enstens:
Whin the treem, hel craby
KI hes my lote then, o now: -utty ith

## Notes

* The easiest thing you can do to improve the results is to train for longer (try EPOCHS=30).

* You can also experiment with a different start string, or try adding another RNN layer to improve the model's accuracy, or adjusting the temperature parameter to generate more or less random predictions.
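
The effect of the temperature parameter can be seen on a small, hypothetical logits vector. Dividing the logits by a temperature below 1 sharpens the distribution, and a temperature above 1 flattens it. This plain-Python sketch mirrors the `predictions / temperature` step in `generate_text`; the helper name is illustrative:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits divided by temperature."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

The top-scoring character gets the largest share of probability at temperature 0.5 and the smallest at 2.0, which is why lower temperatures give more predictable text.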