In [1]:
import tensorflow as tf

import numpy as np
import os
import time

In [2]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


In [3]:
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
print(f'Length of text: {len(text)} characters')

Length of text: 1115394 characters


In [4]:
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [5]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

65 unique characters


# Preprocessing

*From TensorFlow:*

The `tf.keras.layers.StringLookup` layer can convert each character into a numeric ID.

**But** the text must first be split into tokens. `tf.strings.unicode_split` does this:

In [6]:
tf.strings.unicode_split(['teste'], input_encoding='UTF-8')

<tf.RaggedTensor [[b't', b'e', b's', b't', b'e']]>

In [8]:
# convert to numeric ID
ids_from_chars = tf.keras.layers.StringLookup(vocabulary=list(vocab), mask_token=None)
# convert back to text
chars_from_ids = tf.keras.layers.StringLookup(vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

def text_from_ids(ids):
    return tf.strings.reduce_join(chars_from_ids(ids), axis=-1) # just join the chars
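
The two lookup layers above can be pictured as a pair of plain mappings. The following is a minimal pure-Python sketch of that idea, not the layer's actual implementation; `toy_vocab`, `char_to_id`, `id_to_char`, `encode` and `decode` are hypothetical names introduced for illustration. It assumes index 0 is reserved for the `[UNK]` token, which is consistent with the vocabulary size of 66 (65 characters plus one) reported later in this notebook.

```python
# Toy character vocabulary built the same way as `vocab` above.
toy_vocab = sorted(set("hello"))  # ['e', 'h', 'l', 'o']

# Index 0 is kept for unknown characters, so real characters start at 1.
char_to_id = {ch: i + 1 for i, ch in enumerate(toy_vocab)}
id_to_char = ["[UNK]"] + toy_vocab


def encode(s):
    """Map each character to its ID; unseen characters map to 0 ([UNK])."""
    return [char_to_id.get(ch, 0) for ch in s]


def decode(ids):
    """Map IDs back to characters and join them into one string."""
    return "".join(id_to_char[i] for i in ids)


print(encode("hello"))          # [2, 1, 3, 3, 4]
print(decode(encode("hello")))  # hello
print(encode("z"))              # [0]
```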

In [9]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(1115394,), dtype=int64, numpy=array([19, 48, 57, ..., 46,  9,  1], dtype=int64)>

In [10]:
# turn it into a dataset
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

In [11]:
seq_length = 100 # input size

In [12]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

Example of 2 inputs:

In [13]:
for seq in sequences.take(2):
    print(text_from_ids(seq).numpy())

b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
b'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'


## Prepare input and output

If the input is **'Hell'** and the output should be **'ello'**:

In [14]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

In [16]:
split_input_target(list("Hello"))

(['H', 'e', 'l', 'l'], ['e', 'l', 'l', 'o'])

In [17]:
dataset = sequences.map(split_input_target)

In [18]:
for input_example, target_example in dataset.take(2):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

Input : b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target: b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
Input : b'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you '
Target: b're all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'


## Batch dataset

In [19]:
BATCH_SIZE = 64
BUFFER_SIZE = 10000 # don't try to shuffle everything at once

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
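
As a quick sanity check on the pipeline, the number of training sequences and batches per epoch can be worked out by hand. This is only arithmetic on the values printed earlier in the notebook (text length, `seq_length`, `BATCH_SIZE`); `num_sequences` and `steps_per_epoch` are names introduced here for illustration.

```python
text_length = 1115394  # characters in the corpus, as printed above
seq_length = 100
BATCH_SIZE = 64

# Each sequence holds seq_length + 1 characters; drop_remainder=True
# discards the incomplete chunk at the end.
num_sequences = text_length // (seq_length + 1)

# Batching with drop_remainder=True discards the last partial batch too.
steps_per_epoch = num_sequences // BATCH_SIZE

print(num_sequences, steps_per_epoch)  # 11043 172
```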

# Model

In [20]:
# Length of the vocabulary in StringLookup Layer
vocab_size = len(ids_from_chars.get_vocabulary())

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

The model uses a **GRU** layer to process the sequence. The GRU `state` has to be returned so that it can be fed back in on the next call (iteration) during generation.

In [21]:
class MyModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()  # no arguments: `self` is bound automatically
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(rnn_units, return_sequences=True, return_state=True)
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = inputs
        x = self.embedding(x, training=training)
        if states is None:
            states = self.gru.get_initial_state(x)
        x, states = self.gru(x, initial_state=states, training=training)
        x = self.dense(x, training=training)

        if return_state:
            return x, states
        else:
            return x

In [23]:
model = MyModel(vocab_size=vocab_size, embedding_dim=embedding_dim, rnn_units=rnn_units)

The problem can be treated as a standard **classification** problem.

Given the **previous** RNN state and the **input** at this time step, predict the class of the **next** character.

In [25]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

**Important**:
*From TensorFlow:*

A newly initialized model shouldn't be too sure of itself; the output logits should all have similar magnitudes.

To confirm this, check that the exponential of the mean loss is **approximately equal** to the vocabulary size. (A model that spreads its probability evenly over N classes has a cross-entropy of ln N.) A much higher loss means the model is sure of its wrong answers and is badly initialized:

In [27]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
    
    print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
    print("Mean loss:        ", example_batch_mean_loss)
    print("Exp. mean loss: ", tf.exp(example_batch_mean_loss).numpy())

Prediction shape:  (64, 100, 66)  # (batch_size, sequence_length, vocab_size)
Mean loss:         tf.Tensor(4.1887693, shape=(), dtype=float32)
Exp. mean loss:  65.94159
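
The check above can be reproduced without TensorFlow. The sketch below is a pure-Python illustration, assuming a vocabulary of 66 classes as in this notebook; `cross_entropy_from_logits` is a hypothetical helper, not the Keras loss itself. It shows that uniform logits give a loss of ln(66), whose exponential is 66, and that a confident wrong prediction gives a much higher loss.

```python
import math


def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one example, computed from raw logits.

    Uses the log-sum-exp trick for numerical stability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


vocab_size = 66  # 65 characters plus [UNK]

# Untrained-like model: every class gets the same logit.
uniform_logits = [0.0] * vocab_size
loss_uniform = cross_entropy_from_logits(uniform_logits, target=3)

# Badly initialized model: very sure of class 5 while the target is class 3.
confident_logits = [0.0] * vocab_size
confident_logits[5] = 10.0
loss_wrong = cross_entropy_from_logits(confident_logits, target=3)

print("uniform loss:", loss_uniform, "exp:", math.exp(loss_uniform))
print("confident-but-wrong loss:", loss_wrong)
```

The value printed by the notebook, exp(4.1887693) ≈ 65.94, is close to 66, which is what a well-behaved initialization should produce.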


In [28]:
model.compile(optimizer='adam', loss=loss)

In [29]:
model.summary()

Model: "my_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     multiple                  16896     
                                                                 
 gru_1 (GRU)                 multiple                  3938304   
                                                                 
 dense_1 (Dense)             multiple                  67650     
                                                                 
Total params: 4,022,850
Trainable params: 4,022,850
Non-trainable params: 0
_________________________________________________________________
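
The parameter counts in the summary can be checked by hand. The sketch below assumes the GRU uses Keras' default variant with `reset_after=True` (three gates, each with an input kernel, a recurrent kernel and two bias vectors); the helper function names are hypothetical and introduced only for this check.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim


def gru_params(input_dim, units):
    # Assumption: reset_after=True, so each of the 3 gates has an input
    # kernel, a recurrent kernel, and two bias vectors.
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


vocab_size, embedding_dim, rnn_units = 66, 256, 1024

counts = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    "gru": gru_params(embedding_dim, rnn_units),
    "dense": dense_params(rnn_units, vocab_size),
}
total = sum(counts.values())

print(counts)  # {'embedding': 16896, 'gru': 3938304, 'dense': 67650}
print(total)   # 4022850
```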


In [32]:
EPOCHS = 5

In [33]:
history = model.fit(dataset, epochs=EPOCHS)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


# Generate text

*From TensorFlow:*

The simplest way to generate text with this model is to run it in a loop, and keep track of the model's internal state as you execute it.

Each time you call the model you pass in some text and an internal state.

The model returns a prediction for the next character and its new state. Pass the prediction and state back in to continue generating text.

In [35]:
class OneStep(tf.keras.Model):
    def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
        super().__init__()
        self.temperature = temperature
        self.model = model
        self.chars_from_ids = chars_from_ids
        self.ids_from_chars = ids_from_chars

        # Create a mask to prevent "[UNK]" from being generated.
        skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    
        sparse_mask = tf.SparseTensor(
            values=[-float('inf')]*len(skip_ids), # Put a -inf at each bad index.
            indices=skip_ids,
            dense_shape=[len(ids_from_chars.get_vocabulary())] # Match the shape to the vocabulary
        )
    
        self.prediction_mask = tf.sparse.to_dense(sparse_mask)

    @tf.function
    def generate_one_step(self, inputs, states=None):
        # Convert strings to token IDs.
        input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
        input_ids = self.ids_from_chars(input_chars).to_tensor()

        # Run the model.
        # predicted_logits.shape is [batch, char, next_char_logits]
        predicted_logits, states = self.model(inputs=input_ids, states=states, return_state=True)
        
        # Only use the last prediction.
        predicted_logits = predicted_logits[:, -1, :]
        predicted_logits = predicted_logits/self.temperature
        
        # Apply the prediction mask: prevent "[UNK]" from being generated.
        predicted_logits = predicted_logits + self.prediction_mask

        # Sample the output logits to generate token IDs.
        predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
        predicted_ids = tf.squeeze(predicted_ids, axis=-1)

        # Convert from token ids to characters
        predicted_chars = self.chars_from_ids(predicted_ids)

        # Return the characters and model state.
        return predicted_chars, states

In [36]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)
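
The effect of `temperature` and of the `-inf` prediction mask in `generate_one_step` can be illustrated without TensorFlow. The sketch below is a pure-Python approximation of the same idea, not what `tf.random.categorical` runs internally; `sample_next_id` and `banned_ids` are hypothetical names. Dividing the logits by a small temperature sharpens the distribution, and a logit of `-inf` gives the masked ID a sampling weight of exactly zero.

```python
import math
import random


def sample_next_id(logits, temperature=1.0, banned_ids=(), rng=random):
    """Sample one ID from temperature-scaled logits, never a banned ID."""
    scaled = [z / temperature for z in logits]
    for i in banned_ids:
        scaled[i] = -math.inf  # same trick as the prediction mask above

    # Softmax weights; exp(-inf) is 0.0, so banned IDs cannot be drawn.
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, 3.0]

# ID 3 has the largest logit but is masked out, so it never appears.
samples = [sample_next_id(logits, 1.0, banned_ids=[3], rng=rng) for _ in range(1000)]
print(sorted(set(samples)))

# A very low temperature makes sampling almost deterministic (argmax).
cold = [sample_next_id(logits, 0.01, banned_ids=[3], rng=rng) for _ in range(100)]
print(set(cold))
```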

In [37]:
start = time.time()

states = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(1000):
    next_char, states = one_step_model.generate_one_step(next_char, states=states)
    result.append(next_char)

result = tf.strings.join(result)
end = time.time()

print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)

print('\nRun time:', end - start)

ROMEO:
So, Gentlemen,
Who take't, they must conferr revolt
Shall knock it is with a foot.

Provost:
Northough no doubt, invocudious; which true banish'd,
hereaft reign on ease suspicions.

HARWINGHAM:
Learn boot, an angly--under, you so understoo
pursued in ectim and my trage is not of our house;
Princely guid! my party deep by traveliament with me
And I the haste seems' creature there thou returned?

QUEEN MARGARET:
Let's golding then can speak of sellows, bresh'd up with him.
Ah that we have drum'd the flatterer thrust of them;
And I thou hast disnibe to death, seal me, deliver'd
Where is Richard'd while I were go for, my lord,
Than if you think, since it was tirdd his monatrous,
Fortenst the golden bed of what
they lead the modaltices through. What's thou? This fall twict
What's a killet--lost Henry's nemble;
I do not warrant-full flight: add more
On my body sits of with these death,
So may would have shed for you.

AUTOLYCUS:
Ah, Cares you, by their own, since, 'gainst their
anvici

**Important:**

*From TensorFlow:*

The easiest thing you can do to improve the results is to train it for longer (try `EPOCHS = 30`).

If you want the model to generate text faster the easiest thing you can do is batch the text generation:
```
next_char = tf.constant(['ROMEO:', 'ROMEO:', 'ROMEO:', 'ROMEO:', 'ROMEO:'])
```

# Advanced: Customized Training

*From TensorFlow:*

The above training procedure is simple, but does not give you much control. It uses **teacher-forcing** which prevents bad predictions from being fed back to the model, so the model never learns to recover from mistakes. Using *curriculum learning* can help you to stabilize the model's open-loop output.

------

*From Machine Learning Mastery:*

## Teacher-forcing
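**Teacher-forcing** is the method of feeding the ground-truth output of the previous step, rather than the model's own prediction, as the input of the current step. In this notebook that is what `split_input_target` sets up: each target is the input shifted by one character.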

It is a fast and effective way to train a recurrent neural network that uses output from prior time steps as input to the model.

However, the approach can also produce models that are fragile or limited in practice, when the generated sequences differ from what the model saw during training.
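
A toy example can make the difference concrete. In the sketch below, `toy_model` is a hypothetical stand-in for a trained network: a lookup table with one deliberate mistake. With teacher forcing every input is the ground-truth previous character, so the mistake stays isolated; when the model is fed its own predictions, as during generation, the same mistake propagates.

```python
def toy_model(ch):
    # Hypothetical "network": predicts the next character from a table.
    # The entry 'e' -> 'x' is a deliberate error (the truth is 'l').
    table = {"H": "e", "e": "x", "l": "l", "x": "x"}
    return table.get(ch, "?")


truth = "Hello"

# Teacher forcing: inputs are the ground-truth characters 'H', 'e', 'l', 'l'.
forced = [toy_model(ch) for ch in truth[:-1]]

# Free running: each input is the model's own previous prediction.
free = []
ch = truth[0]
for _ in range(len(truth) - 1):
    ch = toy_model(ch)
    free.append(ch)

print("target :", list(truth[1:]))  # ['e', 'l', 'l', 'o']
print("forced :", forced)           # ['e', 'x', 'l', 'l']
print("free   :", free)             # ['e', 'x', 'x', 'x']
```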

## Curriculum Learning

The idea is to gradually force the model, during training, to deal with its own mistakes, as it will have to during inference.

In practice, this means randomly choosing at each time step whether to use the ground-truth output or the model's generated output from the previous time step as the input for the current time step.

The curriculum changes over time in what is called scheduled sampling: the procedure starts with fully forced learning and slowly decreases the probability of a forced input over the training epochs.
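
A minimal sketch of such a schedule is given below. It assumes a linear decay of the forcing probability over epochs, which is only one possible choice; `teacher_forcing_probability` and `choose_next_input` are hypothetical helpers and are not part of TensorFlow or of the training code above.

```python
import random


def teacher_forcing_probability(epoch, total_epochs, floor=0.0):
    """Linear decay from 1.0 at epoch 0 down to `floor` at the last epoch."""
    if total_epochs <= 1:
        return 1.0
    fraction = epoch / (total_epochs - 1)
    return max(floor, 1.0 - fraction * (1.0 - floor))


def choose_next_input(ground_truth_id, predicted_id, p_forced, rng=random):
    """Use the ground truth with probability p_forced, else the prediction."""
    return ground_truth_id if rng.random() < p_forced else predicted_id


EPOCHS = 5
for epoch in range(EPOCHS):
    p = teacher_forcing_probability(epoch, EPOCHS)
    # Inside a custom training loop, this choice would be made at every
    # time step when building the input for the next step.
    next_input = choose_next_input(ground_truth_id=7, predicted_id=9, p_forced=p)
    print(f"epoch {epoch}: p_forced={p:.2f}, example next input id={next_input}")
```

Early epochs behave like plain teacher forcing, while later epochs expose the model to more of its own predictions.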