<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/deeplearning.ai/tf/c3_w4_tf_text_generation_with_rnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text generation with an RNN

In [1]:
import os
import time
import numpy as np
import tensorflow as tf

from tensorflow.keras.utils import get_file 
from tensorflow.keras import Sequential
from tensorflow.keras.layers import GRU, Dense, Embedding


In [2]:
url = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
path_to_file = get_file('shakespeare.txt', url)
path_to_file

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


'/root/.keras/datasets/shakespeare.txt'

In [3]:
with open(path_to_file, 'rb') as file:
    text = file.read().decode(encoding="utf-8")
    print(f'Length of text: {len(text):,}')
    vocab = sorted(set(text))

Length of text: 1,115,394


Take a look at the beginning and end of the dataset text:

In [4]:
print(text[:70])
print("...")
print(text[-70:])

First Citizen:
Before we proceed any further, hear me speak.

All:
Spe
...
et'st thy fortune sleep--die, rather; wink'st
Whiles thou art waking.



In [5]:
print(f"{len(vocab)} unique characters")

65 unique characters


### Process the text

In [6]:
char2idx = {char: index for index, char in enumerate(vocab)}
idx2char = np.array(vocab)

In [7]:
text_as_int = np.array([char2idx[char] for char in text])

In [8]:
print("{")
for char,_ in zip(char2idx, range(5)):
    print("  {:4s}: {:3d},".format(repr(char), char2idx[char]))
print("  ...\n}")

{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '$' :   3,
  '&' :   4,
  ...
}


In [9]:
# Show how the first 13 characters from the text are mapped to integers
print(f"{text[:13]} <-- mapped to int --> {text_as_int[:13]}")

First Citizen <-- mapped to int --> [18 47 56 57 58  1 15 47 58 47 64 43 52]


In [10]:
seq_length = 100
examples_per_epoch = len(text) // (seq_length + 1)
examples_per_epoch

11043

In [11]:
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
    print(idx2char[i.numpy()], end='')

First

The `batch` method lets us easily convert these individual characters to sequences of the desired size.

In [12]:
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'
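To make the effect of `drop_remainder=True` concrete, here is a minimal plain-Python sketch of the same chunking idea. The helper name `batch_chunks` and the toy string are illustrative assumptions, not part of `tf.data`; the real `batch` method works lazily on dataset elements.

```python
def batch_chunks(items, batch_size, drop_remainder=True):
    """Group consecutive items into chunks of batch_size (hypothetical helper)."""
    chunks = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    # Discard a final chunk that is shorter than batch_size, if requested.
    if drop_remainder and chunks and len(chunks[-1]) < batch_size:
        chunks.pop()
    return chunks

toy_text = "First Citizen"  # 13 characters
toy_chunks = batch_chunks(toy_text, 5)
toy_chunks_all = batch_chunks(toy_text, 5, drop_remainder=False)
print(toy_chunks)      # only the full 5-character chunks
print(toy_chunks_all)  # also keeps the short trailing chunk
```

With `seq_length + 1 = 101`, the same rule means any trailing characters that do not fill a full 101-character chunk are left out of `sequences`.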


For each sequence, duplicate it and shift the copy by one character to form the input and target text. The `map` method applies a simple function that does this to each sequence:

In [13]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [14]:
for input_example, target_example in dataset.take(1):
    print("Input data:", repr(''.join(idx2char[input_example.numpy()])))
    print("Target data:", repr(''.join(idx2char[target_example.numpy()])))

Input data: 'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


Each index of these vectors is processed as one time step. For the input at time step 0, the model receives the index for "F" and tries to predict the index for "i" as the next character. At the next time step it does the same thing, but the `RNN` also considers the context from the previous step in addition to the current input character.

In [15]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))


Step    0
  input: 18 ('F')
  expected output: 47 ('i')
Step    1
  input: 47 ('i')
  expected output: 56 ('r')
Step    2
  input: 56 ('r')
  expected output: 57 ('s')
Step    3
  input: 57 ('s')
  expected output: 58 ('t')
Step    4
  input: 58 ('t')
  expected output: 1 (' ')


## Create training batches

You used `tf.data` to split the text into manageable sequences. But before feeding this data into the model, you need to shuffle the data and pack it into batches.

In [16]:
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
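As a rough check on what this pipeline yields per epoch, the arithmetic can be sketched as follows. The variable names are illustrative; the numbers are the sequence count printed earlier and the constants from the cell above.

```python
num_sequences = 11043  # examples_per_epoch printed earlier
batch_size = 64        # BATCH_SIZE
buffer_size = 10000    # BUFFER_SIZE

# drop_remainder=True discards the final partial batch.
steps_per_epoch, dropped_sequences = divmod(num_sequences, batch_size)
print(f"{steps_per_epoch} batches per epoch, {dropped_sequences} sequences dropped")

# The shuffle buffer is smaller than the dataset, so the shuffle is
# approximate rather than a perfectly uniform permutation.
print("buffer covers whole dataset:", buffer_size >= num_sequences)
```

Because the data is reshuffled each epoch, the sequences that fall into the dropped partial batch should differ from epoch to epoch.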

## Build the model

In [17]:
vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 1024

In [18]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = Sequential([
        Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        GRU(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        Dense(vocab_size)
    ])
    return model

In [19]:
model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           16640     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 65)            66625     
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________
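The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses the two-bias layout of `reset_after=True` (the TensorFlow 2 default, as far as we know), which is consistent with the count shown above; the variable names are illustrative.

```python
# Sizes taken from the summary: 65 characters, 256-dim embedding, 1024 GRU units.
n_vocab, n_embed, n_units = 65, 256, 1024

# Embedding: one n_embed-dimensional vector per character.
embedding_params = n_vocab * n_embed

# GRU: three gates, each with input weights, recurrent weights,
# and two bias vectors (assumption: reset_after=True layout).
gru_params = 3 * n_units * (n_embed + n_units + 2)

# Dense: weight matrix plus one bias per output logit.
dense_params = n_units * n_vocab + n_vocab

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```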


<img src="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/text_generation_training.png?raw=1" alt="model" width=500>

In [20]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch, sequence, vocab)")

(64, 100, 65) # (batch, sequence, vocab)


In [21]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

In [22]:
sampled_indices

array([38, 43, 37, 61, 41, 37,  2,  0, 22, 51, 34, 25, 60, 28, 48, 33,  5,
       64, 52, 29, 13,  7, 23, 15, 38, 10, 28, 48, 32, 57, 14, 63, 16,  5,
       34, 21,  5, 14, 23, 60, 19, 24,  4, 17, 47,  1, 61, 26, 28, 54, 32,
        9, 12, 59, 19, 27, 40, 34, 13, 38, 13, 18,  1, 12, 43, 33, 36, 12,
       55,  9,  3, 58, 40, 14, 16, 20, 53,  3, 19, 32, 45, 63, 15, 18, 13,
       10, 43, 32, 55, 37, 34, 39, 38,  9, 20, 24, 44, 31, 35, 40])

In [23]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 "urage and in judgment\nThat they'll take no offence at our abuse.\n\nKING EDWARD IV:\nSuppose they take "

Next Char Predictions: 
 "ZeYwcY!\nJmVMvPjU'znQA-KCZ:PjTsByD'VI'BKvGL&Ei wNPpT3?uGObVAZAF ?eUX?q3$tbBDHo$GTgyCFA:eTqYVaZ3HLfSWb"


In [24]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 65)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.171814
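A quick sanity check on this number: a model that assigns equal probability to all 65 characters would have a cross-entropy of ln(65). The short sketch below compares that baseline with the loss printed above; an untrained model should land close to it.

```python
import math

n_chars = 65            # vocabulary size
observed_loss = 4.171814  # mean loss printed above for the untrained model

# Cross-entropy of a uniform guess over n_chars classes.
uniform_loss = math.log(n_chars)
print(f"uniform baseline: {uniform_loss:.4f}, observed: {observed_loss:.4f}")
```

A starting loss far above this baseline would suggest a problem such as badly scaled initial logits or a missing `from_logits=True`.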


In [25]:
model.compile(optimizer='adam', loss=loss)

In [26]:
EPOCHS = 10

# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [31]:
tf.train.latest_checkpoint(checkpoint_dir)
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            16640     
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_1 (Dense)              (1, None, 65)             66625     
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________


In [32]:
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 1000

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    # Low temperature results in more predictable text.
    # Higher temperature results in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # Pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))
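To see why dividing the logits by `temperature` changes how predictable the output is, here is a minimal sketch in plain Python. The helper `softmax_with_temperature` and the toy logits are assumptions for illustration; `tf.random.categorical` samples from the logits directly, which is equivalent to sampling from this softmax.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after scaling by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
top_prob = {}
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    top_prob[t] = max(probs)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the highest-scoring character, while higher temperatures flatten the distribution so that less likely characters are sampled more often.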

In [33]:
print(generate_text(model, start_string=u"ROMEO: "))

ROMEO: I may be gone,
And vicely as thou art greate to the malice:--hear me not to be
then, alack! it were a
joidne summer respecteden to the king;
Whisher, rush ambitious mine.

SICINIUS:
Come hot were, I know, I hope
My finger amongs our face and brought,
Or never he scapsalious lies must believe you thanks.
I heartes me o'er.

PETRUCHIO:
Sir, Wracerom'd and Just, Greme me, wife,
And let me see thee foolish body:
Reporting it, and his oatthal soil'd hour her see the
dream, or no must be else.

GLOUCESTER:
Gramering them acceament on thyself:
My wife: but hither with them, born up forbids;
And therefore, ho ut fair and mine,
So playngeath to me the dincy of those knighthomas
Are, farito ye with being a bragevou most,
I hold a banich loves!
They shall not hear a women
Are made ground upon a father, My noes,
Possess I may mount him.

FLIAR LICHARD:
I was thing shall I no ut Pith and your feer'd for retres, but stare up.

RICHMOND:
If the people.

MONTAGUE:
And make you so war?
Ferch, as