##### Copyright 2018 The TensorFlow Authors.

Licensed under the Apache License, Version 2.0 (the "License").

# Text Generation using an RNN

<table class="tfo-notebook-buttons" align="left"><td>
<a target="_blank"  href="https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/generative_examples/text_generation.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>  
</td><td>
<a target="_blank"  href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/examples/generative_examples/text_generation.ipynb"><img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on Github</a></td></table>

This notebook demonstrates how to generate text using a character-based RNN with [tf.keras](https://www.tensorflow.org/programmers_guide/keras) and [eager execution](https://www.tensorflow.org/programmers_guide/eager). Here, we show a lower-level implementation that's useful to understand as prework before diving into deeper examples in a similar vein, like [Neural Machine Translation with Attention](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb). Also, if you like, you can write a similar [model](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/8.1-text-generation-with-lstm.ipynb) using less code.

This is an end-to-end example. We'll download a dataset of Shakespeare's writing (a collection of plays) borrowed from Andrej Karpathy's excellent [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). Then, given a sequence of characters, we'll train an RNN model to predict the next character and use it to generate similar text.

Here is an output sample (with start string 'w') after training a single GRU layer for 30 epochs with the default settings below:

```
were to the death of him
And nothing of the field in the view of hell,
When I said, banish him, I will not burn thee that would live.

HENRY BOLINGBROKE:
My gracious uncle--

DUKE OF YORK:
As much disgraced to the court, the gods them speak,
And now in peace himself excuse thee in the world.

HORTENSIO:
Madam, 'tis not the cause of the counterfeit of the earth,
And leave me to the sun that set them on the earth
And leave the world and are revenged for thee.

GLOUCESTER:
I would they were talking with the very name of means
To make a puppet of a guest, and therefore, good Grumio,
Nor arm'd to prison, o' the clouds, of the whole field,
With the admire
With the feeding of thy chair, and we have heard it so,
I thank you, sir, he is a visor friendship with your silly your bed.

SAMPSON:
I do desire to live, I pray: some stand of the minds, make thee remedies
With the enemies of my soul.

MENENIUS:
I'll keep the cause of my mistress.

POLIXENES:
My brother Marcius!

Second Servant:
Will't ple
```

Of course, while some of the sentences are grammatical, most do not make sense. But, consider:

* Our model is character based (when we began training, it did not yet know how to spell a valid English word, or that words were even a unit of text).

* The structure of the output resembles a play (blocks begin with a speaker name, in all caps similar to the original text). Sentences generally end with a period. If you look at the text from a distance (or don't read the individual words too closely), it appears as if it's an excerpt from a play.

* As we'll show, our model trains on small pieces of text (100 characters) and can still generate a long text that has coherent structure.

As a next step, you can experiment with training the model on a different dataset. Any large text file (ASCII) will do, and you can modify a single line of code below to make that change. Have fun!


## Install unidecode library

`unidecode` is a helpful library for converting Unicode text to ASCII.

In [None]:
!pip install unidecode

## Import TensorFlow and enable eager execution

In [2]:
# Import TensorFlow >= 1.10 and enable eager execution
import tensorflow as tf

# Note: Once you enable eager execution, it cannot be disabled. 
tf.enable_eager_execution()

import numpy as np
import os
import unidecode
import time

## Download the dataset

In this example, we will use the [Shakespeare dataset](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt). You can use any other dataset that you like.



In [3]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

## Read the dataset

First, we'll take a look at the text.

In [4]:
text = unidecode.unidecode(open(path_to_file).read())
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


In [5]:
# first 1000 characters in text
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [6]:
# unique contains all the unique characters in the file
unique = sorted(set(text))
print ('{} unique characters'.format(len(unique)))

65 unique characters


## Creating the input and output tensors

Our model cannot understand strings, only numbers, so we need to map the string representation to a numerical one. There are a couple of choices to make here, for instance whether to use a character-based or a word-based model. In practice, for tasks involving text it is a good idea to spend some time thinking about the best representation for your problem.

In this example we'll first map each character to a number, then we'll vectorize the number representation through an [embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding), which will be described later in this tutorial.

In [7]:
# creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(unique)}
idx2char = {i:u for i, u in enumerate(unique)}

text_as_int = [char2idx[c] for c in text]

Now we have an integer representation for each character. Notice that we mapped the characters to indexes from 0 to len(unique) - 1.

In [8]:
for char in char2idx:
    print('{:6s} ---> {:4d}'.format(repr(char), char2idx[char]), end='\t')

'\n'   --->    0	' '    --->    1	'!'    --->    2	'$'    --->    3	'&'    --->    4	"'"    --->    5	','    --->    6	'-'    --->    7	'.'    --->    8	'3'    --->    9	':'    --->   10	';'    --->   11	'?'    --->   12	'A'    --->   13	'B'    --->   14	'C'    --->   15	'D'    --->   16	'E'    --->   17	'F'    --->   18	'G'    --->   19	'H'    --->   20	'I'    --->   21	'J'    --->   22	'K'    --->   23	'L'    --->   24	'M'    --->   25	'N'    --->   26	'O'    --->   27	'P'    --->   28	'Q'    --->   29	'R'    --->   30	'S'    --->   31	'T'    --->   32	'U'    --->   33	'V'    --->   34	'W'    --->   35	'X'    --->   36	'Y'    --->   37	'Z'    --->   38	'a'    --->   39	'b'    --->   40	'c'    --->   41	'd'    --->   42	'e'    --->   43	'f'    --->   44	'g'    --->   45	'h'    --->   46	'i'    --->   47	'j'    --->   48	'k'    --->   49	'l'    --->   50	'm'    --->   51	'n'    --->   52	'o'    --->   53	'p'    --->   54	'q'    --->   55	'r'    --->   56	's'    --->   57	't'    --->   

In [9]:
# original first 13 characters and its mapped int version
print ('{} ---- characters mapped to int ---- > {}'.format(text[:13], text_as_int[:13]))

First Citizen ---- characters mapped to int ---- > [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52]
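To check that such a mapping loses no information, we can round-trip a string through it. The sketch below uses a toy string and its own small vocabulary (the names `sample`, `char_to_index` and `index_to_char` are illustrative and not part of the notebook), but the pattern is the same as with `char2idx` and `idx2char` above.

```python
# Toy round trip through a character-to-index mapping.
# `sample` is an illustrative string, not taken from the dataset.
sample = "hello world"

vocab = sorted(set(sample))                            # unique characters, sorted
char_to_index = {ch: i for i, ch in enumerate(vocab)}  # character -> integer
index_to_char = {i: ch for i, ch in enumerate(vocab)}  # integer -> character

encoded = [char_to_index[ch] for ch in sample]         # string -> list of ints
decoded = "".join(index_to_char[i] for i in encoded)   # list of ints -> string

print(vocab)
print(encoded)
print(decoded)
```

The largest index is `len(vocab) - 1`, and decoding the encoded list gives back the original string.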


Given a character, or a sequence of characters, what is the most probable next character? This is the actual task we'll train the model to be able to perform.

The model input vector will be a sequence of characters, and the expected output will be the respective following characters at each time step.
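As a small illustration of this input/output pairing, here is a sketch using the word "hello" (an arbitrary example, not part of the dataset): the input is every character except the last, and the expected output is every character except the first.

```python
# Illustrative example: pair each input character with the character that follows it.
word = "hello"

input_chars = word[:-1]   # every character except the last
target_chars = word[1:]   # every character except the first

for step, (inp, tgt) in enumerate(zip(input_chars, target_chars), start=1):
    print("step {}: input {!r} -> expected output {!r}".format(step, inp, tgt))
```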

![](input-output-vectors-rnn.png)

TODO: Add diagram from https://docs.google.com/drawings/d/1MzyjFGelcJDSRftBxfIEUOW6cd2GA7BjkKPfM5OK8r4/edit?usp=sharing

Since RNNs process sequence input, in practice the input and expected output look more like:

![](input-output-vectors-rnn-2.png)


TODO: Add diagram from https://docs.google.com/drawings/d/1MzyjFGelcJDSRftBxfIEUOW6cd2GA7BjkKPfM5OK8r4/edit?usp=sharing

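As usual when training, we'll feed the data to the model in batches, so we'll create chunks of **max_length** characters. Each target vector is its input vector shifted one character to the right: if we view an input and its target together as a window of max_length + 1 characters, the input is every character in that window except the last and the target is every character except the first.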

In [10]:
# setting the maximum length sentence we want for a single input in characters
max_length = 100

# creating chunks of length == max_length
input_text = []
target_text = []

for f in range(0, len(text_as_int)-max_length, max_length):
    inputs = text_as_int[f : f + max_length]
    targets = text_as_int[f + 1 : f + 1 + max_length]

    input_text.append(inputs)
    target_text.append(targets)
    
print (np.array(input_text).shape)
print (np.array(target_text).shape)

(11153, 100)
(11153, 100)


We can see we have 11153 examples, each exactly 100 characters long. Let's print the first 10 input characters of the first chunk together with their expected outputs.

In [11]:
for i, (input_idx, target_idx) in enumerate(zip(input_text[0][:10], target_text[0][:10])):
    print ('Step {:4d}, input: {:4s}, expected output: {:4s}'.format(i+1,
                                                            repr(idx2char[input_idx]),
                                                            repr(idx2char[target_idx])))

Step    1, input: 'F' , expected output: 'i' 
Step    2, input: 'i' , expected output: 'r' 
Step    3, input: 'r' , expected output: 's' 
Step    4, input: 's' , expected output: 't' 
Step    5, input: 't' , expected output: ' ' 
Step    6, input: ' ' , expected output: 'C' 
Step    7, input: 'C' , expected output: 'i' 
Step    8, input: 'i' , expected output: 't' 
Step    9, input: 't' , expected output: 'i' 
Step   10, input: 'i' , expected output: 'z' 


## Creating batches and shuffling them using tf.data

We'll use [tf.data](https://www.tensorflow.org/guide/datasets) to feed the data to the model.

In [None]:
# batch size 
BATCH_SIZE = 64

# buffer size to shuffle the dataset
BUFFER_SIZE = 10000

dataset = tf.data.Dataset.from_tensor_slices((input_text, target_text)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
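Because `drop_remainder=True` discards a final batch that has fewer than `BATCH_SIZE` examples, it can help to work out how many batches one epoch contains. The arithmetic below is a plain-Python sketch that uses the example count printed earlier (11153) and the batch size above; it does not call tf.data.

```python
# Number of full batches per epoch when the last partial batch is dropped.
num_examples = 11153   # number of (input, target) chunks reported above
batch_size = 64        # BATCH_SIZE in this notebook

batches_per_epoch = num_examples // batch_size   # full batches only
dropped_examples = num_examples % batch_size     # examples left out each epoch

print(batches_per_epoch, dropped_examples)
```

With these settings there are 174 batches per epoch and 17 examples are left out of each pass; since the dataset is shuffled, the skipped examples are not necessarily the same ones every epoch.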

## Implementing the model

We use the Model Subclassing API which gives us full flexibility to create the model and change it however we like. We use 3 layers to define our model.

* Embedding layer: a trainable lookup table that will map the numbers of each character to a high dimensional vector with **embedding_dim** dimensions;
* GRU layer: a type of RNN (you can use an LSTM layer here) with layer size = **units**;
* Fully connected layer with **vocab_size** cells.
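To make the "lookup table" idea behind the embedding layer concrete, here is a minimal plain-Python sketch. It is not how `tf.keras.layers.Embedding` is implemented; the table sizes and random values are toy assumptions, and in the real layer the rows are trained along with the rest of the model.

```python
import random

vocab_size = 5       # toy value; the notebook uses len(unique)
embedding_dim = 3    # toy value; the notebook uses 256

random.seed(0)
# One row per character index. Here the values are random and fixed;
# in a real embedding layer they are learned during training.
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(indices):
    """Replace each integer index with its row of the table."""
    return [embedding_table[i] for i in indices]

vectors = embed([2, 0, 2])
print(len(vectors), len(vectors[0]))
```

The same index always maps to the same vector, so the two occurrences of index 2 produce identical rows.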

In [None]:
class Model(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, units):
    super(Model, self).__init__()
    self.units = units

    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    if tf.test.is_gpu_available():
      self.gru = tf.keras.layers.CuDNNGRU(self.units, 
                                          return_sequences=True, 
                                          return_state=True, 
                                          recurrent_initializer='glorot_uniform')
    else:
      self.gru = tf.keras.layers.GRU(self.units, 
                                     return_sequences=True, 
                                     return_state=True, 
                                     recurrent_activation='sigmoid', 
                                     recurrent_initializer='glorot_uniform')

    self.fc = tf.keras.layers.Dense(vocab_size)
        
  def call(self, x, hidden):
    x = self.embedding(x)

    # output at every time step
    # output shape == (batch_size, max_length, hidden_size) 
    # states variable to preserve the state of the model
    # states shape == (batch_size, hidden_size)
    output, states = self.gru(x, initial_state=hidden)
    
    # reshaping the output so that we can pass it to the Dense layer
    # after reshaping the shape is (batch_size * max_length, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))
    
    # The dense layer will output predictions for every time_steps(max_length)
    # output shape after the dense layer == (max_length * batch_size, vocab_size)
    x = self.fc(output)
    
    # states will be used to pass at every step to the model while training
    return x, states
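The shape comments in `call` can be followed with concrete numbers. The helper below is a hypothetical, plain-Python sketch (the name `trace_shapes` is not part of the notebook) that only computes the tensor shapes described in those comments, using the settings chosen in this notebook.

```python
def trace_shapes(batch_size, seq_length, embedding_dim, units, vocab_size):
    """Return (stage name, shape) pairs for one forward pass of the model."""
    return [
        ("input indices", (batch_size, seq_length)),
        ("after embedding", (batch_size, seq_length, embedding_dim)),
        ("GRU output", (batch_size, seq_length, units)),
        ("GRU state", (batch_size, units)),
        ("after reshape", (batch_size * seq_length, units)),
        ("after dense", (batch_size * seq_length, vocab_size)),
    ]

# batch of 64 chunks, 100 characters each, 256-d embeddings, 1024 GRU units, 65 characters
for name, shape in trace_shapes(64, 100, 256, 1024, 65):
    print("{:16s} {}".format(name, shape))
```

Note that after the reshape the batch and time dimensions are merged, so the dense layer produces one row of 65 scores for each of the 6400 character positions in the batch.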

## Instantiate the model and set the optimizer and the loss function

In [None]:
# length of the vocabulary in chars
vocab_size = len(unique)

# the embedding dimension 
embedding_dim = 256

# number of RNN (here GRU) units
units = 1024

model = Model(vocab_size, embedding_dim, units)

We'll use the [Adam optimizer](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) with default arguments and [sparse softmax cross entropy](https://www.tensorflow.org/api_docs/python/tf/losses/sparse_softmax_cross_entropy) as the loss function. This loss is applicable here because at each step we predict the next character from a fixed, discrete set of characters, which makes it a classification problem.

In [None]:
# using adam optimizer with default arguments
optimizer = tf.train.AdamOptimizer()

# using sparse_softmax_cross_entropy so that we don't have to create one-hot vectors
def loss_function(real, preds):
    return tf.losses.sparse_softmax_cross_entropy(labels=real, logits=preds)
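To get a feel for the values this loss produces, here is a plain-Python sketch of the per-example formula, -log(softmax(logits)[label]). It is an illustration for a single example, not the TensorFlow implementation, which additionally reduces the per-example values to one scalar for the batch.

```python
import math

def sparse_softmax_cross_entropy(label, logits):
    """Cross entropy for one example: -log(softmax(logits)[label])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]

vocab_size = 65  # number of unique characters in this dataset

# A model with no preference between characters assigns equal scores to all of them.
uniform_logits = [0.0] * vocab_size
uniform_loss = sparse_softmax_cross_entropy(label=3, logits=uniform_logits)

# A model that strongly favors the correct character gets a much smaller loss.
confident_logits = [0.0] * vocab_size
confident_logits[3] = 10.0
confident_loss = sparse_softmax_cross_entropy(label=3, logits=confident_logits)

print(uniform_loss, math.log(vocab_size))
print(confident_loss)
```

A model that spreads its probability evenly over all 65 characters has a loss of ln(65), roughly 4.17, so a loss near that value early in training is a reasonable sanity check rather than a sign of a bug.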

## Checkpoints (Object-based saving)

We'll use [tf.train.Checkpoint](https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint) to save the weights of the model every few epochs (every 5 epochs in the training loop below).

In [None]:
# directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
# checkpoint instance
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)

## Train the model

Here we will use a custom training loop with the help of GradientTape():

* At the start of every epoch we reset the hidden state. In the code below `hidden` starts out as `None`, which the GRU layer treats as an initial state of zeros with shape == (batch_size, number of RNN units).

* Next, we iterate over the dataset (batch by batch) and calculate the **predictions and the hidden states** associated with that input.

* There are a lot of interesting things happening here.
  * The model gets the hidden state (initialized with 0), let's call that **H0**, and the first batch of input, let's call that **I0**.
  * The model then returns the predictions **P1** and the new hidden state **H1**.
  * For the next batch of input, the model receives **I1** and **H1**.
  * The interesting thing here is that we pass **H1** to the model along with **I1**, which is how context is carried from batch to batch: it is contained in the **hidden state**.
  * We continue doing this until the dataset is exhausted and then we start a new epoch and repeat this.

* After calculating the predictions, we calculate the **loss** using the loss function defined above. Then we calculate the gradients of the loss with respect to the model variables.

* Finally, we take a step that reduces the loss by passing the gradients to the optimizer's apply_gradients function.
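The way the hidden state is threaded through successive calls (H0 and I0 in, H1 out, then H1 and I1 in, and so on) can be sketched without any TensorFlow. The `toy_model` function below is a hypothetical stand-in that keeps a running sum instead of a GRU state; only the calling pattern matches the training loop.

```python
def toy_model(batch, hidden):
    """Toy stand-in for the model: returns (prediction, new_hidden)."""
    new_hidden = hidden + sum(batch)   # the state accumulates what has been seen
    prediction = new_hidden % 10       # depends on this batch and on earlier ones
    return prediction, new_hidden

batches = [[1, 2], [3, 4], [5, 6]]     # illustrative inputs I0, I1, I2

hidden = 0                             # H0: the initial state
history = []
for batch in batches:
    # the state returned by one call is passed into the next call
    prediction, hidden = toy_model(batch, hidden)
    history.append((prediction, hidden))

print(history)
```

Each prediction depends on the current batch and on everything folded into the state so far, which is the role the hidden state plays in the real model.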

Below is a diagram with all the steps from text data to training for the word "bye".

TODO: add diagram [here](https://docs.google.com/drawings/d/1Fine4lNwuU-7lDLCvagA-uZhwIlWKh9c-kgsUkCitxM/edit?usp=sharing)

![complete-example.png](complete-example.png)



```
Note: If you are running this notebook in Colab which has a **Tesla K80 GPU** it takes about 23 seconds per epoch.
```

In [None]:
# Training step

EPOCHS = 30

for epoch in range(EPOCHS):
    start = time.time()
    
    # initializing the hidden state at the start of every epoch
    # initially hidden is None
    hidden = model.reset_states()
    
    for (batch, (inp, target)) in enumerate(dataset):
        with tf.GradientTape() as tape:
            # feeding the hidden state back into the model
            # This is the interesting step
            predictions, hidden = model(inp, hidden)
            # reshaping the target because that's how the 
            # loss function expects it
            target = tf.reshape(target, (-1,))
            loss = loss_function(target, predictions)

        grads = tape.gradient(loss, model.variables)
        optimizer.apply_gradients(zip(grads, model.variables))

        if batch % 100 == 0:
            print ('Epoch {} Batch {} Loss {:.4f}'.format(epoch+1,
                                                          batch,
                                                          loss))
    # saving (checkpoint) the model every 5 epochs
    if (epoch + 1) % 5 == 0:
        checkpoint.save(file_prefix = checkpoint_prefix)

    print ('Epoch {} Loss {:.4f}'.format(epoch+1, loss))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

## Restore the latest checkpoint

In [None]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

## Predicting using our trained model

The code block below generates the text:

* We start by choosing a start string, initializing the hidden state, and setting the number of characters we want to generate.

* We get predictions using the start string and the hidden state.

* Then we use a multinomial distribution to calculate the index of the predicted character. **We use this predicted character as our next input to the model.**

* **The hidden state returned by the model is fed back into the model so that it now has more context rather than just one character.** After we predict the next character, the modified hidden state is again fed back into the model, which is how it accumulates context from the previously predicted characters.


TODO: add diagram from [here](https://docs.google.com/drawings/d/1M-7fE94Ql707YtRDV6AeW0oSqUi9PT79g9IlzkS-Jkk/edit?usp=sharing)
![](example-4.png)

If you look at the predictions, the model knows when to capitalize and when to start a new paragraph, and the text follows a Shakespeare-like style of writing, which is pretty awesome!

In [None]:
# Evaluation step (generating text using the model learned)

# number of characters to generate
num_generate = 1000

# You can change the start string to experiment
start_string = 'Q'
# converting our start string to numbers (vectorizing!) 
input_eval = [char2idx[s] for s in start_string]
input_eval = tf.expand_dims(input_eval, 0)

# empty string to store our results
text_generated = ''

# low temperatures result in more predictable text.
# higher temperatures result in more surprising text.
# experiment to find the best setting
temperature = 1.0

# hidden state shape == (batch_size, number of rnn units); here batch size == 1
hidden = [tf.zeros((1, units))]
for i in range(num_generate):
    predictions, hidden = model(input_eval, hidden)

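    # using a multinomial distribution to sample the next character;
    # tf.multinomial expects logits (unnormalized log-probabilities),
    # so the temperature-scaled predictions are passed in directly
    predictions = predictions / temperature
    # take the sample for the last time step, so that start strings
    # longer than one character also work
    predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()
    
    # We pass the predicted character as the next input to the model
    # along with the previous hidden state
    input_eval = tf.expand_dims([predicted_id], 0)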
    
    text_generated += idx2char[predicted_id]

print (start_string + text_generated)
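To see what the temperature does to the sampling distribution, here is a plain-Python sketch that divides a set of scores by the temperature, applies a softmax, and draws one index. The scores and function names are illustrative assumptions, and the sketch uses the standard library instead of `tf.multinomial`.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities after dividing them by the temperature."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng):
    """Draw one index at random according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]   # illustrative scores for three characters
rng = random.Random(0)     # seeded generator so that runs are repeatable

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs], sample_index(logits, temperature, rng))
```

Lower temperatures concentrate the probability on the highest-scoring character, while higher temperatures flatten the distribution, which is why the generated text becomes more surprising.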

## Next steps

* Change the start string to a different character, or the start of a sentence.
* Experiment with training on a different dataset, or with different parameters. [Project Gutenberg](http://www.gutenberg.org/ebooks/100), for example, contains a large collection of books.
* Experiment with the temperature parameter.
* Add another RNN layer.
