# CS470 Introduction to Artificial Intelligence
## Deep Learning Practice 
#### TA. Yechan Hwang
---

### Agenda for this practice
#### 1. Shakespeare dataset
#### 2. GRU Model
#### 3. Generating texts
---
<br/>
<br/>
<br/>

## 6-1. Text generation with an RNN 
In this practice, we will learn how to generate text using a character-based RNN. Given a sequence of characters from the dataset, the model is trained to predict the next character in the sequence. For example, given the characters 'togethe', a trained model should predict 'r' as the next character. Longer sequences of text can be generated by calling the model repeatedly.

We will practice with a dataset of **Shakespeare's writing** (from Andrej Karpathy's [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)). The dataset is formatted like a play script.

https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt

Before starting the practice, upload the checkpoint file first; it will be used later when we generate text.

#### Import libraries

In [3]:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

import numpy as np
import os
import time

#### Download the Shakespeare dataset
Run the following lines to download data for training.

In [4]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
print(path_to_file)

/root/.keras/datasets/shakespeare.txt


#### Read the data
First, let's take a look at the length of the data.

In [5]:
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

print ('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


Here *length of text* is the number of characters in it. We have more than one million characters.<br/>
Also, we can check the first 250 characters in training text.

In [6]:
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



#### Sample prediction

The following is a sample of the output produced when the model in this practice is trained for 30 epochs and started with the character 'Q'.


<pre>
QUEENE:
I had thought thou hadst a Roman; for the oracle,
Thus by All bids the man against the word,
Which are so weak of care, by old care done;
Your children were in your holy love,
And the precipitation through the bleeding throne.

BISHOP OF ELY:
Marry, and will, my lord, to weep in such a one were prettiest;
Yet now I was adopted heir
Of the world's lamentable day,
To watch the next way with his father with his face?

ESCALUS:
The cause why then we are all resolved more sons.

VOLUMNIA:
O, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, it is no sin it should be dead,
And love and pale as any will to that word.

QUEEN ELIZABETH:
But how long have I heard the soul for this world,
And show his hands of life be proved to stand.

PETRUCHIO:
I say he look'd on, if I must be content
To stay him from the fatal of our country's bliss.
His lordship pluck'd from this sentence then for prey,
And then let us twain, being the moon,
were she such a case as fills m
</pre>


While most of the sentences are grammatically correct, they do not make sense. Still, the model seems to have learned some attributes of the data.

- Before training, the model knows nothing about the style of the training data.
- After training, the structure of the output resembles a play: blocks of text generally begin with a speaker name in all capital letters, as in the dataset.

And how many unique characters are there? Let's check it.

In [7]:
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

65 unique characters


In [8]:
print(vocab)

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


The vocabulary contains the newline and space characters, punctuation marks and symbols, the digit '3', and the uppercase and lowercase letters.

#### Vectorize the text
Before training, we need to **map all the characters in the dataset to a numerical representation**. 

We will create two lookup tables: 
- one for mapping **characters to numbers** (char2idx)
- another for **numbers to characters** (idx2char)

And we can vectorize the text data using char2idx.

In [9]:
# Creating a mapping from unique characters to indices
char2idx = {ch: i for i, ch in enumerate(vocab)}

idx2char = np.array(vocab)

# 1D integer vector for all characters in the text data
text_as_int = np.array([char2idx[c] for c in text])

In [10]:
print(char2idx)

{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}


In [11]:
print(idx2char)

['\n' ' ' '!' '$' '&' "'" ',' '-' '.' '3' ':' ';' '?' 'A' 'B' 'C' 'D' 'E'
 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W'
 'X' 'Y' 'Z' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o'
 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z']


In [12]:
print(text_as_int)
print(text_as_int.shape)

[18 47 56 ... 45  8  0]
(1115394,)


<br/><br/>
Also let's check how the first 13 characters from the dataset text are mapped to integers.

In [13]:
first_text_as_int = text_as_int[:13]

print ('{} ---- characters mapped to int ---- > {}'.format(text[:13], first_text_as_int))

First Citizen ---- characters mapped to int ---- > [18 47 56 57 58  1 15 47 58 47 64 43 52]


In [14]:
print ('{} ---- ints mapped to characters ---- > {}'.format(first_text_as_int, idx2char[first_text_as_int]))

[18 47 56 57 58  1 15 47 58 47 64 43 52] ---- ints mapped to characters ---- > ['F' 'i' 'r' 's' 't' ' ' 'C' 'i' 't' 'i' 'z' 'e' 'n']
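The round trip between characters and indices can also be checked on a toy string. The following is a minimal sketch in plain Python; `toy_text`, `toy_vocab`, `toy_char2idx`, and `toy_idx2char` are illustrative names that are not part of the notebook.

```python
# Toy text standing in for the Shakespeare corpus (illustrative only).
toy_text = "hello world"

# Sorted unique characters form the vocabulary, as in the notebook.
toy_vocab = sorted(set(toy_text))

# Character -> index and index -> character lookup tables.
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = toy_vocab  # a plain list works as the inverse table

# Encode the text as integers, then decode it back to a string.
encoded = [toy_char2idx[ch] for ch in toy_text]
decoded = "".join(toy_idx2char[i] for i in encoded)

print(toy_vocab)
print(encoded)
print(decoded)
```

Decoding the encoded sequence should give back the original string, which is a quick sanity check that the two tables are inverses of each other.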


#### The prediction task
Our goal is to predict **the most probable following character** given a character or a sequence of characters. Therefore, the **input to the model will be a sequence of characters**, and the model will learn to **predict the following character at each time step**.



#### Create training examples and targets
For now, we will divide the text data into training sequences. Each input sequence will contain `seq_length` characters from the text. For each input sequence, the corresponding targets contain the same length of text, but shifted one character to the right.

The steps for making the training inputs and targets are as follows:
1. Break the text into chunks of `seq_length+1` characters.
2. The input is every character in the chunk except the last.
3. The target is every character in the chunk except the first.

For example, let's say that `seq_length` is 4 and our training text is "HELLO".
In this example, **the input sequence would be "HELL", and the target sequence "ELLO"**.
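The chunk-and-shift steps can be sketched on a short string in plain Python. `make_chunks` and `split_input_target` are hypothetical helper names used only for this illustration; the notebook itself uses `tf.data` for the same job.

```python
def make_chunks(text, seq_length):
    """Split text into chunks of seq_length + 1 characters, dropping any remainder."""
    size = seq_length + 1
    n_chunks = len(text) // size
    return [text[i * size:(i + 1) * size] for i in range(n_chunks)]


def split_input_target(chunk):
    """The input drops the last character; the target drops the first."""
    return chunk[:-1], chunk[1:]


pairs = [split_input_target(c) for c in make_chunks("HELLO WORLD!", seq_length=4)]
for inp, tgt in pairs:
    print(repr(inp), "->", repr(tgt))
```

The trailing characters that do not fill a whole chunk are discarded, which mirrors the `drop_remainder=True` option used later in the notebook.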

<img src="images/teacher_forcing.png" alt="Drawing" style="width: 700px;"/>

To do this, first use the [`tf.data.Dataset.from_tensor_slices`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#from_tensor_slices) function to convert the text vector into a stream of character indices.
- `tf.data.Dataset.from_tensor_slices`: Creates a Dataset whose elements are slices of the given tensors.


In [15]:
# Make char dataset (in the form of integer)
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(20):
    print(str(int(i))+" : "+str(idx2char[int(i)]))


18 : F
47 : i
56 : r
57 : s
58 : t
1 :  
15 : C
47 : i
58 : t
47 : i
64 : z
43 : e
52 : n
10 : :
0 : 

14 : B
43 : e
44 : f
53 : o
56 : r


In [16]:
# Make sequences with sequence length +1
seq_length = 100
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

print("There are "+str(len(sequences))+" sequences of length "+str(seq_length+1))
print()

for item in sequences.take(5):
    print(len(idx2char[item.numpy()]))
    print(item.numpy())
    print(repr(''.join(idx2char[item.numpy()])))
    print()

There are 11043 sequences of length 101

101
[18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 14 43 44 53 56 43  1 61 43
  1 54 56 53 41 43 43 42  1 39 52 63  1 44 59 56 58 46 43 56  6  1 46 43
 39 56  1 51 43  1 57 54 43 39 49  8  0  0 13 50 50 10  0 31 54 43 39 49
  6  1 57 54 43 39 49  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10
  0 37 53 59  1]
'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '

101
[39 56 43  1 39 50 50  1 56 43 57 53 50 60 43 42  1 56 39 58 46 43 56  1
 58 53  1 42 47 43  1 58 46 39 52  1 58 53  1 44 39 51 47 57 46 12  0  0
 13 50 50 10  0 30 43 57 53 50 60 43 42  8  1 56 43 57 53 50 60 43 42  8
  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 18 47 56 57 58  6  1
 63 53 59  1 49]
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'

101
[52 53 61  1 15 39 47 59 57  1 25 39 56 41 47 59 57  1 47 57  1 41 46 47
 43 44  1 43 52 43 51 63  1 58 53

<br/><br/>

Now we have to convert each chunk above into input data and target data. Note that the target data must be shifted one character to the right.

(The input is every character except the last; the target is every character except the first.)

To do this, we will use `tf.data.Dataset.map`. Given a function, `tf.data.Dataset.map` applies it to every element and returns a new dataset of the results.

In [17]:
def plus_1(x):
    return x+1
    
temp_dataset = tf.data.Dataset.range(1, 6)  # ==> [ 1, 2, 3, 4, 5 ]
print(list(temp_dataset.as_numpy_iterator()))
temp_dataset = temp_dataset.map(plus_1)
print(list(temp_dataset.as_numpy_iterator()))

[1, 2, 3, 4, 5]
[2, 3, 4, 5, 6]


<br/><br/>
Here, let's define a function `construct_input_target`, which returns the input data and target data as explained above.

In [18]:
def construct_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(construct_input_target)

In [19]:
for input_example, target_example in dataset.take(5):
    print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))
    print()

Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '

Input data:  'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you '
Target data: 're all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'

Input data:  "now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us k"
Target data: "ow Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"

Input data:  "ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be "
Target data: "l him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"

Input data:  'one: awa

<br/>

During training, **each index of these vectors is processed as one time step**. For the first example above, at time step 0 the model receives the index for "F" and tries to predict the index for "i" as the next character. At the next time step, it does the same thing, but the **RNN considers the context from the previous steps in addition to the current input character**.

The cell below prints the first 15 steps of the last example from the previous loop (the one beginning 'one: awa').

In [20]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:15], target_example[:15])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))
    print()

Step    0
  input: 53 ('o')
  expected output: 52 ('n')

Step    1
  input: 52 ('n')
  expected output: 43 ('e')

Step    2
  input: 43 ('e')
  expected output: 10 (':')

Step    3
  input: 10 (':')
  expected output: 1 (' ')

Step    4
  input: 1 (' ')
  expected output: 39 ('a')

Step    5
  input: 39 ('a')
  expected output: 61 ('w')

Step    6
  input: 61 ('w')
  expected output: 39 ('a')

Step    7
  input: 39 ('a')
  expected output: 63 ('y')

Step    8
  input: 63 ('y')
  expected output: 6 (',')

Step    9
  input: 6 (',')
  expected output: 1 (' ')

Step   10
  input: 1 (' ')
  expected output: 39 ('a')

Step   11
  input: 39 ('a')
  expected output: 61 ('w')

Step   12
  input: 61 ('w')
  expected output: 39 ('a')

Step   13
  input: 39 ('a')
  expected output: 63 ('y')

Step   14
  input: 63 ('y')
  expected output: 2 ('!')



#### Create training batches
We used `tf.data` to split the text into manageable sequences. But before feeding this data into the model, we need to **shuffle the data and pack it into batches**.

[`tf.data.Dataset.shuffle`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#shuffle)(buffer_size, seed=None, reshuffle_each_iteration=None) : Randomly shuffles the elements of this dataset.

[`tf.data.Dataset.batch`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#batch)(batch_size, drop_remainder=False) : Combines consecutive elements of this dataset into batches.

Note that `tf.data.Dataset.shuffle` **doesn't shuffle characters within each sequence**; it only shuffles the order of the sequences in the dataset.

<img src="images/shuffle1.png" alt="Drawing" style="width: 600px;"/>

<br/>
<br/>

<img src="images/shuffle2.png" alt="Drawing" style="width: 600px;"/>
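The buffer-based shuffling idea can be sketched in plain Python. This is only an illustration of the general technique (fill a fixed-size buffer, emit a random element, refill from the input); `buffered_shuffle` and `toy_sequences` are hypothetical names, and the sketch is not TensorFlow's actual implementation.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and replace it with the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


toy_sequences = ["seq%d" % i for i in range(10)]
shuffled = list(buffered_shuffle(toy_sequences, buffer_size=4, seed=0))
print(shuffled)
```

Each sequence is moved as a whole and its contents are never altered. Note also that with a buffer smaller than the dataset, an element can only move a limited distance toward the front, so the shuffle is not fully uniform.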

In [21]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

# Shuffle the data and create batches.
# Each element is an (input, target) pair of 100-character sequences:
# characters 0..99 and 1..100 of a 101-character chunk.
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>

<br/><br/>

We can see that each batch has 64 input sequences (each has 100 characters) and 64 target sequences (each has 100 characters).
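As a quick check of the numbers involved, the number of batches per epoch can be worked out from the values printed earlier in this notebook. The variable names `num_sequences`, `steps_per_epoch`, and `dropped_sequences` are illustrative only.

```python
text_length = 1115394   # characters in the corpus (printed earlier)
seq_length = 100
batch_size = 64

# Each chunk holds seq_length + 1 characters; the remainder is dropped.
num_sequences = text_length // (seq_length + 1)

# Batches per epoch; drop_remainder=True discards the last partial batch.
steps_per_epoch = num_sequences // batch_size
dropped_sequences = num_sequences % batch_size

print(num_sequences, steps_per_epoch, dropped_sequences)
```

The first number matches the 11043 sequences reported above. The last number is how many sequences fall into the incomplete final batch, which is discarded in each pass.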

In [22]:
for input_example_batch, target_example_batch in dataset.take(1):
    print(input_example_batch, target_example_batch)

tf.Tensor(
[[ 1 52 53 ... 43 52 11]
 [ 7 45 53 ...  1 44 53]
 [39 52 52 ... 53 60 43]
 ...
 [43  1 51 ... 19 14 30]
 [10  0 26 ... 43  1 63]
 [ 8  0  0 ...  1 51 39]], shape=(64, 100), dtype=int64) tf.Tensor(
[[52 53  1 ... 52 11  0]
 [45 53 60 ... 44 53 56]
 [52 52 53 ... 60 43  1]
 ...
 [ 1 51 63 ... 14 30 27]
 [ 0 26 39 ...  1 63 53]
 [ 0  0 29 ... 51 39 49]], shape=(64, 100), dtype=int64)


<br/><br/>

Also, we can see that each target sequence is shifted one character to the right relative to its input sequence.

In [23]:
for input_example_batch, target_example_batch in dataset.take(1):
    print(input_example_batch[0])
    print(target_example_batch[0])

tf.Tensor(
[ 1 44 39 58 46 43 56  5 57  1 46 53 59 57 43  6  0 35 46 53  1 45 39 60
 43  1 46 47 57  1 40 50 53 53 42  1 58 53  1 50 47 51 43  1 58 46 43  1
 57 58 53 52 43 57  1 58 53 45 43 58 46 43 56  6  0 13 52 42  1 57 43 58
  1 59 54  1 24 39 52 41 39 57 58 43 56  8  1 35 46 63  6  1 58 56 53 61
  5 57 58  1], shape=(100,), dtype=int64)
tf.Tensor(
[44 39 58 46 43 56  5 57  1 46 53 59 57 43  6  0 35 46 53  1 45 39 60 43
  1 46 47 57  1 40 50 53 53 42  1 58 53  1 50 47 51 43  1 58 46 43  1 57
 58 53 52 43 57  1 58 53 45 43 58 46 43 56  6  0 13 52 42  1 57 43 58  1
 59 54  1 24 39 52 41 39 57 58 43 56  8  1 35 46 63  6  1 58 56 53 61  5
 57 58  1 58], shape=(100,), dtype=int64)


#### Build The GRU Model
We will use `tf.keras.Sequential` to define the model. The model in this practice uses three layers:

- `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that maps each character index to a vector with `embedding_dim` dimensions.
- [`tf.keras.layers.GRU`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/GRU): A type of RNN with `units=rnn_units`. (You can also use an LSTM layer here.)
- `tf.keras.layers.Dense`: The output layer, with `vocab_size` outputs.

<img src=https://miro.medium.com/max/2400/1*dhq14CzJijlqjf7IlDB0uw.png>


##### About the GRU

GRU is a simplified variant of the LSTM. Compared to a vanilla LSTM cell, it differs as follows.

- The two state vectors $c_t$ and $h_t$ in the LSTM cell are merged into one vector $h_t$.
- A single gate controller $z_t$ plays the role of both the forget gate and the input gate.
- There is no output gate; the state vector $h_t$ is the output of the GRU at each time step.

You can find details about the GRU in [this paper](https://arxiv.org/abs/1406.1078).
In this practice, we will use a GRU since it has fewer parameters than an LSTM of the same size and is typically faster to run.
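For reference, one common way to write the GRU update is given below. This is a sketch that follows the convention of the paper linked above, with bias terms added; the weight names $W$, $U$, $b$ and the reset gate $r_t$ are notation introduced here, and the Keras layer may arrange the reset gate slightly differently internally.

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
$$

Here $x_t$ is the input at time step $t$, $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication. When $z_t$ is close to 1 the previous state is kept, and when it is close to 0 the state is overwritten by the candidate $\tilde{h}_t$.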

In [24]:
# Length of the vocabulary in chars
vocab_size = 65 # len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [25]:
from tensorflow.keras import Sequential 
from tensorflow.keras.layers import Embedding, GRU, Dense 

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = Sequential([
        Embedding(vocab_size, embedding_dim,
                  batch_input_shape=[batch_size, None]
        ),
        
        GRU(rnn_units,  # Positive integer, dimensionality of the output space.
            return_sequences=True,  # Return the output at every time step, not just the last one.
            # stateful=True: the last state for each sample at index i in a batch
            # is used as the initial state for the sample at index i in the next batch.
            stateful=True,
            recurrent_initializer='glorot_uniform'
        ),
        
        Dense(vocab_size)
    ])
    return model

In [26]:
model = build_model(
    vocab_size = len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           16640     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 65)            66625     
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________
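
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU layer keeps two bias vectors per weight group (the `reset_after=True` variant, which is assumed here to be the default in TensorFlow 2); the variable names are illustrative.

```python
vocab_size = 65
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim-vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: 3 weight groups (update gate, reset gate, candidate state), each with an
# input kernel, a recurrent kernel, and (assuming reset_after=True) two bias vectors.
gru_params = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)

# Dense: a weight matrix plus one bias per output unit.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

Almost all of the parameters sit in the GRU layer, dominated by the `rnn_units * rnn_units` recurrent kernels.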


For each character the model looks up the embedding, runs the GRU one timestep with the embedding as input, and applies the dense layer to generate logits predicting the log-likelihood of the next character:

<img src=https://www.tensorflow.org/text/tutorials/images/text_generation_training.png>

#### Try the untrained model
Now we will run the untrained model to see how it behaves.

First let's check the shape of the output.

In [27]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(input_example_batch.shape, "# (batch_size, sequence_length)") 
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)") 

(64, 100) # (batch_size, sequence_length)
(64, 100, 65) # (batch_size, sequence_length, vocab_size)


<br/><br/>

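We can also inspect the model's prediction as a probability distribution. For example, the cell below prints the logits for the first sequence in the batch at time step index 5 (the sixth character), followed by their softmax.

(Note that the model is not trained yet, so the logits are all close to zero and the resulting distribution is close to uniform.)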

In [28]:
print(example_batch_predictions[0][5])
print(tf.nn.softmax(example_batch_predictions[0][5]))

tf.Tensor(
[-5.7021636e-03 -2.8552175e-03  1.9934154e-03 -1.9802842e-03
 -7.8367200e-03  1.4964208e-03  1.3193447e-02  1.1076080e-02
 -1.6079998e-02  5.9543708e-03  1.2915297e-03  1.5347846e-02
  2.4245273e-02  4.0625734e-04  6.4252410e-04  1.5342148e-02
 -1.0259328e-02  3.8642762e-04 -1.8401940e-03  1.1804763e-02
  1.6406330e-03  2.3928382e-03 -1.4672484e-03 -3.7902817e-03
  5.7984591e-03  8.7341527e-05 -1.9185720e-02 -1.1359070e-02
  4.1295560e-03  1.8590938e-02 -8.7830592e-03 -1.5000392e-02
 -1.9850604e-02  1.0131655e-02  2.2139070e-03  3.5425317e-03
 -9.0296585e-03  1.4619655e-02 -1.3905108e-02  3.5725557e-04
  1.3277170e-02 -5.8074114e-03  4.7223978e-03 -1.5757846e-02
  9.3864594e-03  9.4863204e-03 -5.8165621e-03  2.6426336e-03
 -5.0377073e-03 -2.7695575e-03  3.5837949e-03 -1.3579380e-02
 -1.7965563e-03 -1.2047352e-02 -3.1344555e-03 -3.6765996e-03
  5.9711821e-03  1.3788828e-02  1.5896419e-02 -9.6975744e-04
 -1.1437000e-03  2.3107115e-02 -9.8845093e-03  4.3003848e-03
 -6.6951304e-
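What the softmax does to logits of this size can be sketched in plain Python. The values in `toy_logits` are hypothetical numbers of the same tiny magnitude as the output above, not the model's actual logits.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits of the same tiny magnitude as the untrained model's output.
toy_logits = [-0.0057, -0.0029, 0.0020, -0.0020, -0.0078]
probs = softmax(toy_logits)
print([round(p, 4) for p in probs])
```

Because the logits differ by only about 0.01, every entry ends up close to 1/5 here, just as every character gets a probability close to 1/65 in the untrained model.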

#### Free input length

In our practice, each input sequence in the dataset has length 100, but the model can be run on inputs of any length. This is an advantage of recurrent neural networks, which can handle variable-length inputs.

To get actual predictions from the model, we need to sample from the output distribution to get actual character indices. This distribution is defined by the logits over the character vocabulary.

Try it for the first example in the batch:

In [29]:
# num_samples : determines how many characters to sample at each iteration

print(input_example_batch[0])
print()
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
print(sampled_indices)

tf.Tensor(
[53 50 53 52 45  1 39 61 46 47 50 43  1 58 46 43  1 58 56 39 47 58 53 56
  5 57  1 50 47 44 43  8  0 35 56 39 58 46  1 51 39 49 43 57  1 46 47 51
  1 42 43 39 44 10  1 57 54 43 39 49  1 58 46 53 59  6  1 26 53 56 58 46
 59 51 40 43 56 50 39 52 42  8  0  0 26 27 30 32 20 33 25 14 17 30 24 13
 26 16 10  0], shape=(100,), dtype=int64)

tf.Tensor(
[[58]
 [61]
 [30]
 [31]
 [ 0]
 [51]
 [44]
 [52]
 [22]
 [41]
 [43]
 [44]
 [ 4]
 [60]
 [48]
 [32]
 [ 2]
 [37]
 [27]
 [53]
 [51]
 [52]
 [12]
 [61]
 [20]
 [62]
 [20]
 [11]
 [29]
 [15]
 [37]
 [62]
 [35]
 [53]
 [12]
 [60]
 [35]
 [43]
 [24]
 [24]
 [35]
 [22]
 [ 1]
 [33]
 [25]
 [29]
 [38]
 [37]
 [ 8]
 [32]
 [17]
 [57]
 [21]
 [62]
 [59]
 [28]
 [26]
 [63]
 [ 1]
 [23]
 [60]
 [42]
 [21]
 [ 2]
 [23]
 [36]
 [15]
 [11]
 [43]
 [24]
 [36]
 [55]
 [18]
 [56]
 [27]
 [16]
 [55]
 [12]
 [51]
 [59]
 [58]
 [35]
 [34]
 [47]
 [15]
 [16]
 [24]
 [31]
 [42]
 [52]
 [12]
 [56]
 [59]
 [40]
 [21]
 [22]
 [49]
 [50]
 [36]
 [21]], shape=(100, 1), dtype=int64)


This gives us a prediction for the next character index at each timestep.
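To see how sampling differs from simply taking the most likely index, here is a minimal sketch in plain Python. `sample_index` is a hypothetical helper that stands in for the role `tf.random.categorical` plays above, and `toy_logits` is made-up data.

```python
import math
import random


def sample_index(logits, rng):
    """Draw one index from the categorical distribution defined by the logits."""
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)          # seeded only to make the sketch repeatable
toy_logits = [2.0, 1.0, 0.1]    # hypothetical logits for a 3-character vocabulary

# Greedy choice: always the index with the largest logit.
greedy = max(range(len(toy_logits)), key=toy_logits.__getitem__)

# Sampling: indices appear roughly in proportion to their probabilities.
samples = [sample_index(toy_logits, rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(len(toy_logits))]
print("argmax:", greedy)
print("sample counts:", counts)
```

Always taking the argmax would return the same index for the same input every time, whereas sampling still produces the less likely characters occasionally, which keeps generated text from getting stuck in a loop.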

Now in order to check the predicted sentence of our untrained model, we will squeeze the `sampled_indices` and convert them into characters.
- [`tf.squeeze`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/squeeze): Removes dimensions of size 1 from the shape of a tensor.

In [30]:
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

print(sampled_indices.shape)
print(sampled_indices)

(100,)
[58 61 30 31  0 51 44 52 22 41 43 44  4 60 48 32  2 37 27 53 51 52 12 61
 20 62 20 11 29 15 37 62 35 53 12 60 35 43 24 24 35 22  1 33 25 29 38 37
  8 32 17 57 21 62 59 28 26 63  1 23 60 42 21  2 23 36 15 11 43 24 36 55
 18 56 27 16 55 12 51 59 58 35 34 47 15 16 24 31 42 52 12 56 59 40 21 22
 49 50 36 21]


After squeezing `sampled_indices`, we get a 1D vector that contains the indices of the predicted characters.

Let's decode this vector to see the text predicted by this untrained model.

In [31]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 "olong awhile the traitor's life.\nWrath makes him deaf: speak thou, Northumberland.\n\nNORTHUMBERLAND:\n"

Next Char Predictions: 
 'twRS\nmfnJcef&vjT!YOomn?wHxH;QCYxWo?vWeLLWJ UMQZY.TEsIxuPNy KvdI!KXC;eLXqFrODq?mutWViCDLSdn?rubIJklXI'


<br/>

Since our model is not trained yet, its predictions are essentially random.

#### Train the model
At this point the problem can be treated as a standard classification problem. **Given the previous RNN state and the input character at each time step, our model must predict the next character.**

#### Compile the model
We will use the `tf.keras.losses.sparse_categorical_crossentropy` loss function: it is the standard cross-entropy loss for classification when the targets are integer class indices, as our character indices are.

Since our model returns logits, we need to set the `from_logits` flag.
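What this loss computes for a single time step can be sketched in plain Python. `sparse_cross_entropy_from_logits` is a hypothetical helper written for illustration, not the Keras function itself, and the label index 18 is an arbitrary choice.

```python
import math


def sparse_cross_entropy_from_logits(label, logits):
    """Negative log-probability of the true class, computed directly from logits."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


vocab_size = 65

# An untrained model outputs nearly equal logits, so every character gets
# probability close to 1/65 and the loss is close to ln(65).
uniform_logits = [0.0] * vocab_size
uniform_loss = sparse_cross_entropy_from_logits(label=18, logits=uniform_logits)
print(uniform_loss, math.log(vocab_size))

# A confident, correct prediction drives the loss toward zero.
confident_logits = [0.0] * vocab_size
confident_logits[18] = 10.0
confident_loss = sparse_cross_entropy_from_logits(label=18, logits=confident_logits)
print(confident_loss)
```

This gives a useful reference point when watching training: the loss should start near ln(65), roughly 4.17, and decrease from there.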

In [32]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', 
              loss=loss)

#### Configure checkpoints
Use a [`tf.keras.callbacks.ModelCheckpoint`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/callbacks/ModelCheckpoint) to ensure that checkpoints are saved during training:

In [33]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [None]:
history = model.fit(dataset, 
                    epochs=15, 
                    callbacks=[checkpoint_callback])

<img src="images/training_result.png" alt="Drawing" style="width: 600px;"/>

#### Generate text
We will restore the weights from a checkpoint. Then, to keep this prediction step simple, we will use a batch size of 1.

(Note that in order to run the model with a different `batch_size`, we need to rebuild the model with the new batch size and restore the weights from the checkpoint.)

The code below loads the weights from the provided checkpoint `./saved_ckpt/ckpt_15`. The commented-out line shows how to load the latest checkpoint from `checkpoint_dir` instead.

In [34]:
# Rebuild the model by changing batch size (=1) to predict new text
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

# Load the weight of the model we trained 
# model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.load_weights('./saved_ckpt/ckpt_15')

# Change the batch size from 64 to 1
model.build(tf.TensorShape([1, None]))

In [35]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            16640     
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_1 (Dense)              (1, None, 65)             66625     
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________


#### The prediction loop

1. Start by choosing a **start string** and initialize the RNN hidden state for the first iteration.
2. Set the number of characters to generate.
3. Get the **prediction distribution of the next character using the start string and hidden state**.
4. Sample the index of the next character from the categorical distribution defined by the prediction.
5. Use this predicted character as our next input to the model.
6. Repeat step 3-5 until we get the number of characters we set.

**Note that the RNN hidden state returned by the model is fed back into the model, so the state accumulates context as the prediction loop repeats.**
In other words, after predicting a character, the updated RNN state is carried into the next step. The weights do not change during generation; the model simply conditions each prediction on the context of the previously generated characters.


![To generate text the model's output is fed back to the input](https://www.tensorflow.org/text/tutorials/images/text_generation_sampling.png)


In [36]:
def generate_text(model, start_string, num_generate,temperature):
    # Evaluation step (generating text using the learned model)

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # divide the logits by the temperature, then sample the next character
        # index from the resulting categorical distribution
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))

In [44]:
# Lower temperatures result in more predictable text.
# Higher temperatures result in more surprising text.
# Experiment to find the best setting.

print(generate_text(model, start_string="I will go home and sleep", num_generate = 500, temperature=1.0))

I will go home and sleep,
Turn troops of quality: and, in good correlt,
That is a serina--commander, hen my reediness
He has deap to conceal intelcession of it.

MENENIUS:
Beshrew your grace, unitsul?
O, ill be with yoke more joint
You are beturn and findily and his troth,
Which often so hot revenge's young man do;
For thou hast arteen balladement than thou lack's unck,
The days show wear I seem
Savest the victor's mouth, or hadments must deward:
I would say so hold by curstabeth.

VOLUMNIA:
I kill'd! I have left their


<br/><br/>

Looking at the generated text, you'll see the model knows when to capitalize and make paragraphs, and it imitates a Shakespeare-like writing vocabulary.

<br/>
The easiest thing you can do to improve the results is to train it for longer (e.g. try `epochs=30`).

You can also experiment with a different start string, or try adding another RNN layer to improve the model's accuracy, or adjusting the temperature parameter to generate more or less random predictions.
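
The effect of the temperature parameter can be seen on a small example. The sketch below is plain Python; `softmax_with_temperature` is a hypothetical helper and `toy_logits` is made-up data, but the scaling step is the same division by `temperature` used in `generate_text`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical logits for a 3-character vocabulary

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

A temperature below 1 sharpens the distribution toward the most likely character, while a temperature above 1 flattens it, so less likely characters are sampled more often.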