# CS470 Introduction to Artificial Intelligence
## Deep Learning Practice 
#### TA. Yechan Hwang
---

### Agenda for this practice
#### 1. Shakespeare dataset
#### 2. GRU Model
#### 3. Generating texts
---
<br/>
<br/>
<br/>

## 6-1. Text generation with an RNN 
In this practice, we will learn how to generate text using a character-based RNN. We will practice with a dataset of **Shakespeare's writing** (from Andrej Karpathy's [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)). Given a sequence of characters from this data, we will train a model to predict the next character in the sequence. For example, when given the characters 'togethe', the trained model should predict 'r' as the next character. Longer sequences of text can be generated by calling the model repeatedly.

The following is sample output from the model in this practice after it was trained for 30 epochs and started with the character 'Q'.


<pre>
QUEENE:
I had thought thou hadst a Roman; for the oracle,
Thus by All bids the man against the word,
Which are so weak of care, by old care done;
Your children were in your holy love,
And the precipitation through the bleeding throne.

BISHOP OF ELY:
Marry, and will, my lord, to weep in such a one were prettiest;
Yet now I was adopted heir
Of the world's lamentable day,
To watch the next way with his father with his face?

ESCALUS:
The cause why then we are all resolved more sons.

VOLUMNIA:
O, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, it is no sin it should be dead,
And love and pale as any will to that word.

QUEEN ELIZABETH:
But how long have I heard the soul for this world,
And show his hands of life be proved to stand.

PETRUCHIO:
I say he look'd on, if I must be content
To stay him from the fatal of our country's bliss.
His lordship pluck'd from this sentence then for prey,
And then let us twain, being the moon,
were she such a case as fills m
</pre>


While most of the sentences are grammatically correct, they do not make sense. But the model seems to have learned some attributes.

- Before the training, the model can't know the style of the training data. 
- But after training, the structure of the output resembles a play—blocks of text generally begin with a speaker name, in all capital letters similar to the dataset.

#### Import libraries

In [1]:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

import numpy as np
import os
import time

#### Download the Shakespeare dataset
Run the following lines to download data for training.

In [None]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
print(path_to_file)

#### Read the data
First, let's take a look at the length of the data.

In [None]:
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

print ('Length of text: {} characters'.format(len(text)))

Here *length of text* is the number of characters in it. We have more than one million characters.<br/>
Also, we can check the first 250 characters in training text.

In [None]:
print(text[:250])

Our dataset is formatted like the script of a play: each block of text starts with a speaker name followed by that speaker's lines.

And how many unique characters are there? Let's check it.

In [None]:
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

In [None]:
vocab

The vocabulary consists of lowercase and uppercase letters plus some special symbols, such as punctuation, the space, and the newline character.

#### Vectorize the text
Before training, we need to **map all the characters in the dataset to a numerical representation**. 

We will create two lookup tables: 
- one for mapping **characters to numbers**
- another for **numbers to characters**

In [None]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

# 1D integer vector for all characters in the text data
text_as_int = np.array([char2idx[c] for c in text])

In [None]:
print('{')
for char,_ in zip(char2idx, range(65)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('\n}')

In [None]:
print(len(text_as_int))
print(text_as_int)

<br/><br/>
Also let's check how the first 13 characters from the dataset text are mapped to integers.

In [None]:
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

#### The prediction task
Given a character or a sequence of characters, our goal is to predict **the most probable next character**. Therefore, the **input to the model is a sequence of characters**, and the model learns to **predict the following character at each time step**.



#### Create training examples and targets
For now, divide the text into training sequences. Each input sequence will contain `seq_length` characters from the text. For each input sequence, the corresponding targets contain the same length of text, but shifted one character to the right.

So break the text into chunks of `seq_length+1`. For example, let's say that `seq_length` is 4 and our training text is "HELLO".

In this example, **the input sequence would be "HELL", and the target sequence "ELLO"**.

<img src="images/teacher_forcing.png" alt="Drawing" style="width: 700px;"/>
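
The "HELLO" example can be reproduced without TensorFlow. The following is a minimal sketch in plain Python; `make_examples` is a hypothetical helper written for this illustration (it is not part of the practice code), and it drops an incomplete final chunk in the same spirit as `drop_remainder=True`.

```python
def make_examples(text, seq_length):
    """Split text into (input, target) pairs, each of length seq_length."""
    chunk_len = seq_length + 1
    examples = []
    # Walk through the text one full chunk at a time; a trailing chunk
    # shorter than seq_length + 1 is dropped.
    for start in range(0, len(text) - chunk_len + 1, chunk_len):
        chunk = text[start:start + chunk_len]
        # Input is the chunk without its last character,
        # target is the chunk without its first character.
        examples.append((chunk[:-1], chunk[1:]))
    return examples


print(make_examples("HELLO", seq_length=4))
print(make_examples("HELLO WORLD", seq_length=4))
```

Each target is the input moved forward by one position, so the target character at position `t` is the character that follows the input character at position `t`.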

To do this, first use the [`tf.data.Dataset.from_tensor_slices`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#from_tensor_slices) function to convert the text vector into a stream of character indices.
- `tf.data.Dataset.from_tensor_slices`: Creates a Dataset whose elements are slices of the given tensors.


In [None]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//seq_length

# Make char dataset (in the form of integer)
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
    print(str(i.numpy())+" : "+str(idx2char[i.numpy()]))
    

In [None]:
# Make sequences with sequence length +1
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

print(str(len(sequences))+" sequences of length "+str(seq_length+1))
print()

for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))
    print(len(idx2char[item.numpy()]))

<br/><br/>

Now we have to convert the sequences above into input data and target data. Note that the target data must be shifted one character to the right.

To do this, we will use `tf.data.Dataset.map`. When we pass a function to `tf.data.Dataset.map`, it applies the function to every element and returns a new dataset containing the results.

In [None]:
def plus_1(x):
    return x+1
    
temp_dataset = tf.data.Dataset.range(1, 6)  # ==> [ 1, 2, 3, 4, 5 ]
print(list(temp_dataset.as_numpy_iterator()))
temp_dataset = temp_dataset.map(plus_1)
print(list(temp_dataset.as_numpy_iterator()))

In [None]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [None]:
for input_example, target_example in dataset.take(1):
    print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

<br/>

During training, **each index of these vectors is processed as one time step**. For the input at time step 0, the model receives the index for "F" and tries to predict the index for "i" as the next character. At the next time step, it does the same thing, but the **RNN considers the context from the previous steps in addition to the current input character**.

In [None]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

#### Create training batches
We used `tf.data` to split the text into manageable sequences. But before feeding this data into the model, we need to **shuffle the data and pack it into batches**.

[`tf.data.Dataset.shuffle`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#shuffle)(buffer_size, seed=None, reshuffle_each_iteration=None) : Randomly shuffles the elements of this dataset.

[`tf.data.Dataset.batch`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#batch)(batch_size, drop_remainder=False) : Combines consecutive elements of this dataset into batches.

Note that `tf.data.Dataset.shuffle` **doesn't shuffle the characters inside each sequence**; it only shuffles the order of the sequences in the dataset.

<img src="images/shuffle1.png" alt="Drawing" style="width: 600px;"/>

<br/>
<br/>

<img src="images/shuffle2.png" alt="Drawing" style="width: 600px;"/>
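
To make the role of the buffer concrete, here is a simplified sketch of buffer-based shuffling in plain Python. It is an illustration of the idea only, not TensorFlow's implementation; `buffered_shuffle` and the sample `sequences` list are hypothetical names introduced for this example.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in randomized order while holding at most buffer_size items."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer first.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The input is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


sequences = ["First", " Citi", "zen:\n", "Befor", "e we "]
shuffled = list(buffered_shuffle(sequences, buffer_size=3, seed=0))
print(shuffled)
```

Every sequence comes out intact; only the order of the sequences changes. In this sketch, a buffer of size 1 leaves the order unchanged, and a larger buffer mixes elements from further apart.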

In [None]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

# Shuffle the data and create batches
# (each element is an (input, target) pair of 100 indices each:
#  characters 0..99 and characters 1..100 of a 101-character chunk)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<br/><br/>

We can see that each batch has 64 input sentences (each has 100 characters) and 64 target sentences (each has 100 characters). 
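
As a quick sanity check, the number of batches per epoch follows from integer division. The sketch below assumes that `len(text)` is 1,115,394, which is the length usually reported for this Shakespeare file; substitute the value printed by your own run if it differs.

```python
text_length = 1_115_394  # assumed value of len(text); check your own output
seq_length = 100
batch_size = 64

# Chunks of seq_length + 1 characters, dropping the incomplete last chunk.
num_sequences = text_length // (seq_length + 1)

# Batches of batch_size sequences, dropping the incomplete last batch.
steps_per_epoch = num_sequences // batch_size

print(num_sequences, steps_per_epoch)
```

Under this assumption, each epoch of `model.fit` should report 172 steps.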

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
    print(input_example_batch, target_example_batch)

<br/><br/>

We can also see that every target sequence is shifted one character to the right relative to its input sequence.

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
    print(input_example_batch[0], target_example_batch[0])

#### Build The GRU Model
We will use `tf.keras.Sequential` to define the model. For this simple example three layers are used to define our model:

- `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that will map the numbers of each character to a vector with embedding_dim dimensions;
- [`tf.keras.layers.GRU`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/GRU): A type of RNN with size `units=rnn_units` (you can also use an LSTM layer here).
<img src=https://miro.medium.com/max/2400/1*dhq14CzJijlqjf7IlDB0uw.png>

 
- `tf.keras.layers.Dense`: The output layer, with vocab_size outputs.

##### About the GRU

GRU is a variation of LSTM. GRU has some different attributes compared to vanilla LSTM.

- The two state vectors $c_t$ and $h_t$ in the LSTM Cell are merged into one vector $h_t$.
- A single gate controller $z_t$ plays the role of both the forget gate and the input gate.
- There is no output gate and the state vector $h_t$ is the output of GRU.

You can find details about the GRU at this [link](https://arxiv.org/abs/1406.1078).
In this practice, we will use GRU since its operation is faster than LSTM and it has fewer parameters.
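
The points above can be sketched as a single GRU time step in plain Python. This is an illustration under stated assumptions, not the Keras implementation: the update rule $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot g_t$ is one common convention (some references swap the roles of $z_t$ and $1 - z_t$), and `make_params`, `affine`, and `gru_step` are hypothetical helpers introduced here.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Compute W @ x + U @ h + b using plain Python lists."""
    return [
        sum(w * xi for w, xi in zip(W[i], x))
        + sum(u * hi for u, hi in zip(U[i], h))
        + b[i]
        for i in range(len(b))
    ]


def make_params(input_dim, hidden_dim, update_bias=0.0, seed=0):
    """Hypothetical helper: small random weights for the z, r and g transforms."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for name in ("z", "r", "g"):
        p["W" + name] = matrix(hidden_dim, input_dim)
        p["U" + name] = matrix(hidden_dim, hidden_dim)
        p["b" + name] = [0.0] * hidden_dim
    p["bz"] = [update_bias] * hidden_dim
    return p


def gru_step(x, h_prev, p):
    """One GRU step: a single state vector h, and no separate output gate."""
    # Update gate z: how much of the old state to keep.
    z = [sigmoid(v) for v in affine(p["Wz"], x, p["Uz"], h_prev, p["bz"])]
    # Reset gate r: how much of the old state feeds the candidate.
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h_prev, p["br"])]
    gated_h = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate state g.
    g = [math.tanh(v) for v in affine(p["Wg"], x, p["Ug"], gated_h, p["bg"])]
    # The same gate z both keeps old content and admits new content.
    return [zi * hi + (1.0 - zi) * gi for zi, hi, gi in zip(z, h_prev, g)]


x = [1.0, -1.0]
h_prev = [0.5, -0.5, 0.25]
print(gru_step(x, h_prev, make_params(2, 3)))
```

Setting a large positive `update_bias` pushes $z_t$ toward 1, so the state is copied almost unchanged; a large negative value pushes $z_t$ toward 0, so the state is replaced by the candidate.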

In [None]:
# Length of the vocabulary in chars
vocab_size = len(vocab) #65

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [None]:
from tensorflow.keras import Sequential 
from tensorflow.keras.layers import Embedding, GRU, Dense 

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    ###################
    #                 #
    # write your code #
    #                 # 
    ###################

In [None]:
model = 
    ###################
    #                 #
    # write your code #
    #                 # 
    ###################

For each character the model looks up the embedding, runs the GRU one timestep with the embedding as input, and applies the dense layer to generate logits predicting the log-likelihood of the next character:

<img src=https://www.tensorflow.org/tutorials/text/images/text_generation_training.png>

#### Try the untrained model
Now we will run the untrained model to see how it behaves.

First let's check the shape of the output.

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)") 

<br/><br/>

We can also inspect the model's raw prediction for a single time step. For example, index `[0][5]` selects the logits for the sixth time step of the first sequence in the batch, with one value per character in the vocabulary.

(Note that the model is not trained yet, so these values carry no useful information.)

In [None]:
example_batch_predictions[0][5]

And in our model, the sequence length of the input is 100 but the model can be run on inputs of any length.

This is an advantage of the recurrent neural network which can handle inputs of variable length.

In [None]:
model.summary()

To get actual predictions from the model, we need to sample from the output distribution to get actual character indices. This distribution is defined by the logits over the character vocabulary.

Try it for the first example in the batch:

In [None]:
# num_samples : determines how many characters to sample at each iteration

sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
print(sampled_indices)

This gives us a prediction for the next character index at each timestep.
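
What sampling from logits means can be sketched in plain Python. The helpers below are hypothetical stand-ins written for illustration, not the TensorFlow implementation: the logits are turned into probabilities with a softmax, and one index is drawn according to those probabilities.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_from_logits(logits, rng):
    """Draw one index with probability proportional to softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.1]  # toy logits over a 3-character vocabulary
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_from_logits(logits, rng)] += 1

print(softmax(logits))
print([c / 10000 for c in counts])
```

The empirical frequencies approach the softmax probabilities. Unlike always taking the index with the largest logit, sampling still picks less likely characters some of the time, which keeps generated text from repeating the same choice at every step.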

Now to see the predicted sentence of our untrained model, we will squeeze the `sampled_indices` and convert them into characters.
- [`tf.squeeze`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/squeeze): Removes dimensions of size 1 from the shape of a tensor.

In [None]:
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

print(sampled_indices.shape)
print(sampled_indices)

After squeezing `sampled_indices`, we get a 1D vector that contains the indices of the predicted characters.

Let's decode this vector to see the text predicted by this untrained model.

In [None]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

<br/>

Since our model is not trained yet, its predictions for the next character are essentially random.

#### Train the model
At this point the problem can be treated as a standard classification problem. **Given the previous RNN state and the input character at each time step, our model must predict the next character.**

#### Compile the model
We will use the `tf.keras.losses.sparse_categorical_crossentropy` loss function, because the task is a multi-class classification over the vocabulary and the targets are integer character indices.

Since our model returns logits, we need to set the `from_logits` flag.
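
To see what the loss computes for a single time step, here is a plain-Python sketch of cross-entropy from logits with an integer label. `sparse_ce_from_logits` is a hypothetical helper for illustration, not the Keras function.

```python
import math


def sparse_ce_from_logits(label, logits):
    """Return -log softmax(logits)[label], computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


vocab_size = 65

# Equal logits give a uniform distribution over the vocabulary.
uniform_logits = [0.0] * vocab_size
print(sparse_ce_from_logits(3, uniform_logits), math.log(vocab_size))

# A large logit at the correct index gives a loss close to zero.
confident_logits = [0.0] * vocab_size
confident_logits[3] = 20.0
print(sparse_ce_from_logits(3, confident_logits))
```

A model whose output is close to uniform over 65 characters has a loss of about $\ln 65 \approx 4.17$, so the mean loss of the untrained model should be near that value.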

In [None]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# Get loss value from untrained model
example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

In [None]:
model.compile(optimizer='adam', 
              loss=loss)

#### Configure checkpoints
Use a [`tf.keras.callbacks.ModelCheckpoint`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/callbacks/ModelCheckpoint) to ensure that checkpoints are saved during training:

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [None]:
history = model.fit(dataset, 
                    epochs=15, 
                    callbacks=[checkpoint_callback])

#### Generate text
We will restore the latest checkpoint. Then, to keep this prediction step simple, we will use 1 for batch size.

(To run the model with a different `batch_size`, we need to rebuild the model with different batch size and restore the weights from the checkpoint.)

The code below rebuilds the model with a batch size of 1 and loads the saved weights into it.

In [None]:
# Rebuild the model by changing batch size (=1) to predict new text
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

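# Load the weights of the model we trained
# (this path assumes a checkpoint stored under ./saved_ckpt; to load your own run,
#  tf.train.latest_checkpoint(checkpoint_dir) returns the most recent checkpoint path)
model.load_weights('./saved_ckpt/ckpt_15')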

# Change the batch size from 64 to 1
model.build(tf.TensorShape([1, None]))

In [None]:
model.summary()

#### The prediction loop

The following code block generates the text:

1. Start by choosing a **start string** and initialize the RNN hidden state for the first iteration.
2. Set the number of characters to generate.
3. Get the **prediction distribution of the next character using the start string and hidden state**.
4. Sample the index of the next character from the categorical distribution defined by that prediction.
5. Use this predicted character as our next input to the model.
6. Repeat steps 3-5 until we have generated the number of characters we set.

**Note that the RNN hidden state returned by the model is fed back into the model, so each prediction depends on everything generated so far and not only on the last character.**
In other words, after predicting a character, the updated RNN state is passed into the next step. No weights are updated during generation; the state only carries the context of the previously predicted characters.


![To generate text the model's output is fed back to the input](https://www.tensorflow.org/tutorials/text/images/text_generation_sampling.png)
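
The loop structure can be tried without a trained network. The sketch below replaces the GRU with a toy stand-in: a table of character counts whose "state" is simply the previous character. `fit_toy_model`, `toy_step`, and `generate` are hypothetical names for this illustration; only the feed-back pattern (prediction and state go back in as the next input) mirrors the real loop.

```python
import random
from collections import defaultdict


def fit_toy_model(text):
    """Count which character follows each pair of consecutive characters."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b, c in zip(text, text[1:], text[2:]):
        counts[(a, b)][c] += 1
    return counts


def toy_step(counts, char, state):
    """Return (weights over next characters, new state); state is the previous char."""
    followers = counts.get((state, char))
    if not followers:
        followers = {" ": 1}  # unseen context: fall back to a space
    return followers, char


def generate(counts, start_string, num_generate, seed=0):
    rng = random.Random(seed)
    state = ""  # step 1: initial state
    # Run the start string through the model to build up the state.
    for ch in start_string[:-1]:
        _, state = toy_step(counts, ch, state)
    current = start_string[-1]
    out = list(start_string)
    for _ in range(num_generate):  # step 2: number of characters to generate
        weights, state = toy_step(counts, current, state)  # step 3
        chars = list(weights)
        # Step 4: sample the next character from the predicted distribution.
        current = rng.choices(chars, weights=[weights[c] for c in chars], k=1)[0]
        out.append(current)  # step 5: the sample becomes the next input
    return "".join(out)


corpus = "ROMEO: O Romeo, Romeo! wherefore art thou Romeo?"
counts = fit_toy_model(corpus)
print(generate(counts, "RO", 20))
```

In the real model the state is the GRU hidden vector rather than a single character, but the control flow is the same.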

Looking at the generated text, you'll see the model knows when to capitalize and make paragraphs, and it imitates a Shakespeare-like writing vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences.

In [None]:
def generate_text(model, start_string, num_generate, temperature):
    # Evaluation step (generating text using the learned model)

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # scale the logits by the temperature, then sample the id of the
        # next character from the resulting categorical distribution
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))

In [None]:
# Lower temperatures result in more predictable text.
# Higher temperatures result in more surprising text.
# Experiment to find the best setting.

print(generate_text(model, start_string="ROMEO:", num_generate = 1000, temperature=1.0))

<br/><br/>
The easiest thing you can do to improve the results is to train the model for longer (e.g. try `epochs=30` in `model.fit`).

You can also experiment with a different start string, or try adding another RNN layer to improve the model's accuracy, or adjusting the temperature parameter to generate more or less random predictions.
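
The effect of the temperature can be seen on a toy set of logits. This is a plain-Python sketch for illustration; `softmax_with_temperature` is a hypothetical helper that mirrors the `predictions / temperature` step before sampling.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # toy logits over a 3-character vocabulary
for t in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the most likely character, and a temperature above 1 flattens the distribution toward uniform, which is why higher values produce more surprising text.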