# Week 9 Lab - RNNs for Text Generation

This lab was modified from an example originally written by "The TensorFlow Authors" in 2019 and distributed under Apache License 2.0. This lab creates a model that can generate text using a character-based RNN. A character-based RNN learns sequences of characters from a corpus of text. 

Given a character, or a sequence of characters, what is the most probable next character? This is the task that our RNN model will learn and then perform. The input to the model will be sequences of characters from the book we ingest at the top of this notebook.

Once the model is trained, one can present an input character sequence and the model will generate the character that it predicts would be most likely to appear next. By repeatedly calling the model for new predictions with the previously built sequence, one can create a string of text that "looks like" sentences from the original training text. Note that depending upon the amount of training material and the details of the training procedure, this generated text may look more or less like gibberish.

Also note that, depending on the specific types of layers you are training, you might want to try enabling GPU acceleration to execute this notebook faster. In Colab: *Runtime > Change runtime type > Hardware accelerator > GPU*. Do this before you start to run any code, because it generally restarts the runtime and your local variables will be lost.

# Setup

### Import TensorFlow and other libraries

In [98]:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

import numpy as np
import os # To access local files; for saving checkpoints
import time

from urllib import request # We will need this to read from a URL

In [99]:
# TensorFlow 2.0 offers "Eager Execution," a more practical model for 
# running tf code. Are we using it here?
tf.executing_eagerly()

True

### Download the Text Data

Load a plain text book from Project Gutenberg:

In [100]:
# Some collected plays by Anton Chekhov
url = "https://www.gutenberg.org/files/7986/7986-0.txt"
response = request.urlopen(url)
text = response.read().decode('utf8')
type(text), len(text)

(str, 411576)

In [101]:
#
# Exercise 9.1: Examine some of the contents of text using slicing
# Write a comment saying what you see. Add code to discard the front
# matter so as to start where the book begins: somewhere around text[1090:]. 
#
text[1390:1690]
#The slice above shows a few sentences from the book's introduction.
#The text contains words along with raw line breaks (\r\n). Punctuation and stop words are all intact.
#No preprocessing has been done to the text yet.

'nsystematic mass of\r\ntranslations from the Russian flung at the heads and hearts of English\r\nreaders. The ready acceptance of Chekhov has been one of the few\r\nsuccessful features of this irresponsible output. He has been welcomed\r\nby British critics with something like affection. Bernard Shaw has\r\ns'

In [102]:
# Additional URLs of plain text Chekhov plays for possible later use
uncle_vanya = "https://www.gutenberg.org/cache/epub/1756/pg1756.txt"
the_seagull = "https://www.gutenberg.org/files/1754/1754-0.txt"

### Read the data

First, look in the text to see what we have. Note that this is a character-based model, so we are interested to know what different characters are used throughout the whole text. Note that this should make the process largely language independent: Any language where words are formed through sequences of characters should work in training this kind of model.

In [103]:
# The unique text characters in the file
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

94 unique characters


In [104]:
#
# Exercise 9.2: Display the list of unique text characters.
# Write a comment saying what you see. What are the advantages of 
# having such a small vocabulary?
#
vocab

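#The list includes upper- and lower-case letters, digits, punctuation and symbols,
#control characters such as \n and \r, and a few accented letters and curly quotes.
#A small vocabulary keeps the embedding table and the output layer small, so the
#RNN trains faster and needs less computational power.
#With fewer classes to predict there are also fewer parameters that could overfit.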

['\n',
 '\r',
 ' ',
 '!',
 '#',
 '$',
 '%',
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '=',
 '?',
 '@',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '[',
 ']',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '£',
 'à',
 'è',
 'é',
 'ê',
 '‘',
 '’',
 '“',
 '”',
 '\ufeff']

## Process the text

### Vectorize the text

Before training, we will convert the strings to a numerical representation. The approach in this notebook uses a so-called "ragged tensor." Here's an excerpt from the TensorFlow documentation:

"Ragged tensors are the TensorFlow equivalent of nested variable-length lists. They make it easy to store and process data with non-uniform shapes, including: Variable-length features, such as the set of actors in a movie; Batches of variable-length sequential inputs, such as sentences or video clips; Hierarchical inputs, such as text documents that are subdivided into sections, paragraphs, sentences, and words."

In the cells below, the notebook uses `preprocessing.StringLookup` from TensorFlow. This layer converts each character into a numeric ID. Its input must already be split into tokens, which here are single characters stored as byte strings.

Try the character encoding with a small example first:

In [105]:
example_texts = ['abcdefg', 'xyz']

chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

In [106]:
print(chars) # In TF 2.0, the print() function can also show tensors

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>


We have now created a ragged tensor. Now or later, you might want to read some of the TensorFlow documentation describing what a tensor actually is and what it means to be ragged. There's a nice tutorial with code here: https://www.tensorflow.org/guide/tensor

Note that each element in the structure above begins with `b`. That prefix marks a Python bytes literal: the function has returned each character as a UTF-8 "byte-coded" string rather than as plain text. Also, the idea of "ragged" is demonstrated by this example. The first nested list has seven items, whereas the second list has three. Ragged means that different numbers of elements are permissible within each element of a list. The term ragged comes from publishing, where a common way of justifying text in a book is called "ragged right" (meaning that different lines are different lengths).
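The "ragged" idea can also be sketched without TensorFlow. The plain-Python illustration below stores the same two rows as one flat list of values plus a list of row boundaries, similar in spirit to the `values` and `row_splits` properties of a `tf.RaggedTensor`. The variable names are chosen for this sketch, and it is an analogy rather than a description of TensorFlow's internals.

```python
# Two rows of different lengths, as in the example above
rows = [list("abcdefg"), list("xyz")]

# Flatten the rows and remember where each row starts and ends
values = [ch for row in rows for ch in row]
row_splits = [0]
for row in rows:
    row_splits.append(row_splits[-1] + len(row))

# Rebuild the ragged rows from the flat representation
rebuilt = [values[start:end] for start, end in zip(row_splits, row_splits[1:])]

print(row_splits)       # [0, 7, 10]
print(rebuilt == rows)  # True
```

Because only the boundaries are stored, no padding is needed to make the rows the same length.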

Next, create a `preprocessing.StringLookup` layer. Note this initial diagnostic test to show what `preprocessing.StringLookup` actually is: a class, not yet an instance. Its type is reported as `abc.ABCMeta`, the metaclass that Python uses to support "abstract base classes."

In [32]:
# Confirms that preprocessing.StringLookup is a class (metaclass ABCMeta) and not an instance
type(preprocessing.StringLookup), isinstance(preprocessing.StringLookup, preprocessing.StringLookup)

(abc.ABCMeta, False)

In [33]:
# Note that we are passing in the vocab from parsing the whole book in an 
# earlier cell. Examine the first argument closely:
ids_from_chars = preprocessing.StringLookup(vocabulary=list(vocab), mask_token=None)

# Shows the resulting class type and that we now have an instance
type(ids_from_chars), isinstance(ids_from_chars, preprocessing.StringLookup)

(keras.layers.preprocessing.string_lookup.StringLookup, True)

In [34]:
#
# Exercise 9.3: Display the vocabulary associated with ids_from_chars. Hint: Use
# the bound method get_vocabulary(). What is UNK?
#
ids_from_chars.get_vocabulary()

#[UNK] is a special token that represents out-of-vocabulary input.
#Any character that the StringLookup layer does not find in its vocabulary
#is mapped to [UNK] (ID 0), indicating that it is unknown.

['[UNK]',
 '\n',
 '\r',
 ' ',
 '!',
 '#',
 '$',
 '%',
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '=',
 '?',
 '@',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '[',
 ']',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '£',
 'à',
 'è',
 'é',
 'ê',
 '‘',
 '’',
 '“',
 '”',
 '\ufeff']

Our instance of the `preprocessing.StringLookup` layer can convert from byte-coded character tokens to numeric character IDs:

In [107]:
print(chars)
ids = ids_from_chars(chars)
ids

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>


<tf.RaggedTensor [[59, 60, 61, 62, 63, 64, 65], [82, 83, 84]]>

In [108]:
# Just out of curiosity: Are these ASCII codes or something else?

#The StringLookup layer maps each vocabulary entry to its position in the vocabulary list.
#The IDs do not have to match ASCII codes, and here they do not:
#'a' gets ID 59, whereas its ASCII code is 97.
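To make the difference between lookup IDs and character codes concrete, here is a minimal plain-Python sketch of the same idea. It is not how `StringLookup` is implemented; `toy_vocab`, `id_of`, and `to_ids` are names made up for this illustration, and reserving ID 0 for unknown characters mirrors the `[UNK]` entry seen above.

```python
# A made-up vocabulary built the same way as `vocab`: sorted unique characters
toy_vocab = sorted(set("hello world"))

# ID 0 is reserved for unknown characters, so real characters start at 1
id_of = {ch: i + 1 for i, ch in enumerate(toy_vocab)}

def to_ids(s):
    """Map each character to its vocabulary ID, or 0 if it is not in the vocabulary."""
    return [id_of.get(ch, 0) for ch in s]

print(to_ids("hello"))              # [4, 3, 5, 5, 6]
print([ord(ch) for ch in "hello"])  # [104, 101, 108, 108, 111]
print(to_ids("help"))               # [4, 3, 5, 0]  ('p' is not in toy_vocab)
```

The IDs depend only on which characters occur in the training text, which is why a different book would generally produce a different mapping.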


In [109]:
#
# Exercise 9.4: Compare the IDs from the ragged tensor with the vocabulary
# list from exercise 9.3. Do this by adding code to the print statement that 
# looks up the ids using the get_vocabulary() method of ids_from_chars().
#
# Also write a comment describing what you see.
#
vocab_list = ids_from_chars.get_vocabulary()
for nest_list in ids.to_list():
  for i in nest_list:
    print(i, vocab_list[i])

#Each ID in the ragged tensor is the index of its character in the vocabulary list,
#so looking up the IDs recovers the original characters.

59 a
60 b
61 c
62 d
63 e
64 f
65 g
82 x
83 y
84 z

In the previous exercise you retrieved the character representations through direct indexing into the vocabulary list, but you can also use `preprocessing.StringLookup(..., invert=True)`. Note: Make sure to use the get_vocabulary() method of the preprocessing.StringLookup layer so that the [UNK] tokens (if any) are appropriately labeled.

In [37]:
# This creates an instance of the "inverter"
chars_from_ids = tf.keras.layers.experimental.preprocessing.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

chars = chars_from_ids(ids) # Now use the inverter to process our tiny example
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

Note that we have gotten back the byte codes of the characters from the vectors of IDs, and we are seeing them as a `tf.RaggedTensor` of characters. We can now use another utility, tf.strings.reduce_join, to join the characters back into strings. This is demonstrated here because it will be helpful later when we are ready to use our model to generate new text.

In [38]:
tf.strings.reduce_join(chars, axis=-1).numpy()

array([b'abcdefg', b'xyz'], dtype=object)

In [39]:
# Let's turn that utility into a function that we can call.
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

In [54]:
#
# Exercise 9.5: Test the utility function on the small example to make sure
# it does what we expect.
#
text_from_ids(ids)
#Yes this is what we expected

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'abcdefg', b'xyz'], dtype=object)>

In [111]:
#
# Exercise 9.6: Create a new small example of example_texts with at least
# three character strings in the list. Include some upper and lower case and
# some numerals. Run the text through the string preprocessor and then use
# the function from exercise 9.5 to recover the original strings.
#
example_texts_2 = ['hI9Kl', 'mN1p','0Lki']
chars_2 = tf.strings.unicode_split(example_texts_2, input_encoding='UTF-8')
ids_2 = ids_from_chars(chars_2)
chars_from_ids_2 = tf.keras.layers.experimental.preprocessing.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)
chars_2 = chars_from_ids_2(ids_2) # Now use the inverter to process our tiny example
chars_2
text_from_ids(ids_2)

#We took a new example, split it into characters, and converted it to IDs with preprocessing.StringLookup.
#Then we inverted the lookup to get the characters back and used the text_from_ids function
#to recombine them into strings.

<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'hI9Kl', b'mN1p', b'0Lki'], dtype=object)>

### Create training examples and targets

Next, we will divide the text into example sequences. Each input sequence will contain `seq_length` characters from the text. For each input sequence, the corresponding targets contain the same length of text, except shifted one character to the right.

So break the text into chunks of `seq_length+1`. For example, say `seq_length` is 4 and our text is "Hello". The input sequence would be "Hell", and the target sequence "ello". This gives us thousands of training examples, and every position in an input sequence has a target: the character that follows it. This is a somewhat different strategy than the word-based example we examined in class, but it uses the same principle: An input sequence is processed by the RNN to predict a target (in this case a target sequence, rather than just one word).

To do this, first use the `tf.data.Dataset.from_tensor_slices` function to convert the text vector into a stream of character indices.

In [49]:
# Here's where we process all of the characters from the book, which are stored in text.
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
type(all_ids)

tensorflow.python.framework.ops.EagerTensor

In [55]:
#
# Exercise 9.7: Use the get_shape() bound method to reveal the shape of the resulting tensor
#
all_ids.get_shape()
#The tensor is one-dimensional with 411576 elements, one ID per character of text

TensorShape([411576])

In [56]:
# This creates a dataset whose elements are slices from the original tensor.
# This would be a great moment to look at the TF documentation for the 
# from_tensor_slices() method.
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
type(ids_dataset)

tensorflow.python.data.ops.from_tensor_slices_op.TensorSliceDataset

In [57]:
#
# Exercise 9.8: Run this code and look at the output that it 
# generates. Then comment each line of the code to say what it is doing.
# Also comment on the output. What does it mean?

#Build a dataset that keeps only the first occurrence of each ID
temp = ids_dataset.unique()

#Loop over the distinct IDs
for element in temp:

  #Convert the scalar tensor to a NumPy value and print it
  print(element.numpy())

#The output lists every distinct ID found in the text, in order of first appearance.
#There are 94 of them, one per character in vocab; ID 0 ([UNK]) never occurs.

94
45
76
73
68
63
61
78
3
36
79
72
60
65
91
77
70
59
83
32
66
69
80
11
48
62
67
30
2
1
49
31
64
81
71
13
54
74
12
41
25
44
37
47
52
38
33
17
15
20
56
34
5
22
24
23
21
57
50
16
35
10
39
43
40
51
82
18
42
92
93
8
9
75
19
26
84
28
46
27
14
55
4
88
85
53
58
87
90
86
89
7
29
6


In [58]:
# The take() bound method is helpful for examining the data in a dataset.
# take(n) returns a new dataset holding at most the first n elements, so we
# can use it in a similar fashion to head() to extract a small slice of the data.
for ids in ids_dataset.take(18):
    print(chars_from_ids(ids).numpy().decode('utf-8'))

﻿
P
r
o
j
e
c
t
 
G
u
t
e
n
b
e
r
g


In [60]:
#
# Exercise 9.9: Add the skip() bound method to the expression in the for loop
# in the cell just above. To demonstrate how it works, skip 18 elements and then
# take the next 10 elements. The invocation of skip() in the expression should
# precede the invocation of take().
#
for ids in ids_dataset.skip(18).take(10):
    print(chars_from_ids(ids).numpy().decode('utf-8'))

#skip(18) drops the first 18 elements; take(10) then keeps the next 10 IDs

’
s
 
P
l
a
y
s
 
b


In [61]:
# While an RNN can theoretically handle a continuous stream of data
# here we are considering the data in small groupings whose length
# is controlled by seq_length. Leave it at its current value for now, but 
# in the future you might consider making it either shorter or longer in the
# training run.
seq_length = 80 # About one line of standard text
examples_per_epoch = len(text)//(seq_length+1)

In [64]:
#
# Exercise 9.10: Display how many examples are used per epoch. What does the
# // operator do in Python? Write a comment explaining it.
#

examples_per_epoch
#len(text) is 411576
#seq_length+1 is 81
#411576 // 81 gives 5081 examples per epoch
#// is floor division: it divides and rounds the result down to the nearest integer

5081
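The arithmetic behind this value can be checked in plain Python. The sketch below uses the built-in `divmod` to show both the number of full chunks and the characters left over; the last line is an extra illustration that `//` rounds toward negative infinity rather than toward zero.

```python
n_chars = 411576   # len(text) from the earlier cell
chunk = 80 + 1     # seq_length + 1

# divmod returns the floor-division quotient and the remainder together
full_chunks, leftover = divmod(n_chars, chunk)
print(full_chunks, leftover)   # 5081 15

# Floor division rounds down, which matters for negative numbers
print(7 // 2, -7 // 2)         # 3 -4
```

The 15 leftover characters do not fill a complete chunk, so they are not counted as a training example.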

The `batch` method lets you easily convert these individual characters to sequences of the desired size.

In [65]:
#
# Exercise 9.11: Explain why seq_length+1 is used in the next line of code
#
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

#Each chunk of seq_length+1 = 81 characters is later split into two overlapping
#sequences of 80: the input (characters 1-80) and the target (characters 2-81).
#The extra character supplies the target for the last input position.

for seq in sequences.take(2):
  print(chars_from_ids(seq))

tf.Tensor(
[b'\xef\xbb\xbf' b'P' b'r' b'o' b'j' b'e' b'c' b't' b' ' b'G' b'u' b't'
 b'e' b'n' b'b' b'e' b'r' b'g' b'\xe2\x80\x99' b's' b' ' b'P' b'l' b'a'
 b'y' b's' b' ' b'b' b'y' b' ' b'C' b'h' b'e' b'k' b'h' b'o' b'v' b','
 b' ' b'S' b'e' b'c' b'o' b'n' b'd' b' ' b'S' b'e' b'r' b'i' b'e' b's'
 b',' b' ' b'b' b'y' b' ' b'A' b'n' b't' b'o' b'n' b' ' b'C' b'h' b'e'
 b'k' b'h' b'o' b'v' b'\r' b'\n' b'\r' b'\n' b'T' b'h' b'i' b's' b' ' b'e'
 b'B'], shape=(81,), dtype=string)
tf.Tensor(
[b'o' b'o' b'k' b' ' b'i' b's' b' ' b'f' b'o' b'r' b' ' b't' b'h' b'e'
 b' ' b'u' b's' b'e' b' ' b'o' b'f' b' ' b'a' b'n' b'y' b'o' b'n' b'e'
 b' ' b'a' b'n' b'y' b'w' b'h' b'e' b'r' b'e' b' ' b'a' b't' b' ' b'n'
 b'o' b' ' b'c' b'o' b's' b't' b' ' b'a' b'n' b'd' b' ' b'w' b'i' b't'
 b'h' b'\r' b'\n' b'a' b'l' b'm' b'o' b's' b't' b' ' b'n' b'o' b' ' b'r'
 b'e' b's' b't' b'r' b'i' b'c' b't' b'i' b'o' b'n' b's'], shape=(81,), dtype=string)


It's easier to see what this is doing if you join the tokens back into strings:

In [66]:
for seq in sequences.take(2):
  print(text_from_ids(seq).numpy())

b'\xef\xbb\xbfProject Gutenberg\xe2\x80\x99s Plays by Chekhov, Second Series, by Anton Chekhov\r\n\r\nThis eB'
b'ook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions'


For training you'll need a dataset of `(input, label)` pairs, where both `input` and `label` are sequences. At each time step the input is the current character and the label is the next character. Here's a function that takes a sequence as input, duplicates it, and shifts it to align the input and label for each timestep:

In [68]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

In [69]:
# Let's test the function - We can do the test using
# a regular character string converted to a list.
split_input_target(list("Tensorflow"))

(['T', 'e', 'n', 's', 'o', 'r', 'f', 'l', 'o'],
 ['e', 'n', 's', 'o', 'r', 'f', 'l', 'o', 'w'])

In [72]:
#
# Exercise 9.12: Retest the function with a list of numbers
#
split_input_target([1,2,3,7,8])

#It is splitting into input and label targets

([1, 2, 3, 7], [2, 3, 7, 8])

In [74]:
# This is a curious construction - i.e., passing a function into the
# sequences.map() bound method:
dataset = sequences.map(split_input_target)
#
# Exercise 9.13: Look up the documentation for the map() bound 
# method and then add a comment to explain the line of code in this block.
#

#sequences.map() - applies a function to each element in the sequence
#To each element in the sequence, we are applying the split_input_target function

<MapDataset element_spec=(TensorSpec(shape=(80,), dtype=tf.int64, name=None), TensorSpec(shape=(80,), dtype=tf.int64, name=None))>

In [75]:
for input_example, target_example in dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

Input : b'\xef\xbb\xbfProject Gutenberg\xe2\x80\x99s Plays by Chekhov, Second Series, by Anton Chekhov\r\n\r\nThis e'
Target: b'Project Gutenberg\xe2\x80\x99s Plays by Chekhov, Second Series, by Anton Chekhov\r\n\r\nThis eB'
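The chunk-then-shift pipeline above can be imitated in plain Python on an ordinary string. This is a sketch for intuition only: `make_examples` is a hypothetical helper, not part of TensorFlow, and it drops the incomplete final chunk in the same way that `drop_remainder=True` does.

```python
def make_examples(text, seq_length):
    """Cut text into chunks of seq_length+1 and split each into (input, target)."""
    chunk = seq_length + 1
    n_chunks = len(text) // chunk   # incomplete final chunk is dropped
    examples = []
    for k in range(n_chunks):
        piece = text[k * chunk:(k + 1) * chunk]
        examples.append((piece[:-1], piece[1:]))  # target is input shifted by one
    return examples

examples = make_examples("Hello world", seq_length=4)
for input_text, target_text in examples:
    print(repr(input_text), "->", repr(target_text))
# 'Hell' -> 'ello'
# ' wor' -> 'worl'
```

Each target character is the character that follows the input character at the same position, which is what the model is trained to predict.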


### Create training batches

We have used `tf.data` to split the text into manageable sequences. Before using these data to train the model, we need to shuffle the data and pack it into batches. Remember from class that batching, AKA mini-batching, is a method of processing a group of input-output pairs together in a single weight update. Mini-batching facilitates parallelization and can help prevent overfitting. Because the weights are updated once per batch rather than once per example, mini-batching also reduces the total number of weight updates that need to occur during a given epoch.

In [76]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset:
# TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements.
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

dataset

<PrefetchDataset element_spec=(TensorSpec(shape=(64, 80), dtype=tf.int64, name=None), TensorSpec(shape=(64, 80), dtype=tf.int64, name=None))>

In [None]:
#
# Exercise 9.14: Examine the documentation for shuffle, batch, and prefetch.
# Try starting your exploration here: https://www.tensorflow.org/guide/data_performance
# Write a one line comment explaining each concept. Make sure you mention
# what AUTOTUNE is.
#

#These transformations are applied to the dataset produced by map() above.
#Shuffle - keeps an internal buffer of BUFFER_SIZE elements and draws the next
#element at random from that buffer, refilling it from the input as it goes.
#Batch - groups consecutive elements into batches of the size we specify.
#Prefetch - overlaps data preparation with model execution: while the model
#trains on one batch, the input pipeline prepares the next elements and buffers
#them so that they are ready for the next iteration.
#AUTOTUNE - instead of choosing the number of elements to prefetch by hand,
#we let tf.data tune that value dynamically at runtime.
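The buffer-based shuffling described in the comments can be sketched in plain Python. This is only an illustration of the idea under the assumption that elements are drawn uniformly from a fixed-size buffer; `buffered_shuffle` is a hypothetical helper and not the `tf.data` implementation.

```python
import random

def buffered_shuffle(stream, buffer_size, seed=None):
    """Yield the items of stream in a locally shuffled order using a small buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)        # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]              # emit a random buffered item
        buffer[idx] = item             # and replace it with the new one
    rng.shuffle(buffer)                # input exhausted: drain what is left
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
unshuffled = list(buffered_shuffle(range(20), buffer_size=1, seed=0))
print(shuffled)
print(unshuffled)
```

With a buffer of size 1 the order is unchanged, and an element can never be emitted more than about one buffer length early. This is why the buffer should be at least as large as the dataset if a full shuffle is wanted; here `BUFFER_SIZE = 10000` exceeds the 5081 sequences.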

# Build The Model

In this section we will create an architecturally simple vanilla RNN model. Make sure you can explain why we have an embedding layer, what the dimensions of the RNN layer are, and how the vocabulary size fits into the picture.

This section defines the model as a `keras.Model` subclass (For details see [Making new Layers and Models via subclassing](https://www.tensorflow.org/guide/keras/custom_layers_and_models)). 

This model has three layers:

* `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that will map each character-ID to a vector with `embedding_dim` dimensions.
* `tf.keras.layers.SimpleRNN`: A type of RNN with size `units=rnn_units`. (You could also use a GRU or LSTM layer here.)
* `tf.keras.layers.Dense`: The output layer, with `vocab_size` outputs. It outputs one logit for each character in the vocabulary. These are the unnormalized log-probabilities of each character according to the model.

In [77]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256 # 256 is kind of a generic choice that should work well
# for various-sized character vocabularies. There is a danger of overfitting 
# when the size of the embedding layer exceeds the size of the vocabulary.

# Number of RNN units
rnn_units = 1024 # This is tunable. Remember that each RNN unit has a little
# bit of memory for what came before. Here 1024 provides four nodes for every
# node in the embedding layer. Later, you might want to experiment with half
# as many and twice as many.

vocab_size, embedding_dim, rnn_units

(94, 256, 1024)

In [78]:
# This builds a custom class for instantiating the Keras model
#
# Exercise 9.15: Add comments on the appropriate lines of code to
# document each layer of the model.
#
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()

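    # What's this layer? # Embedding layer: maps each character ID to a
    # trainable vector of length embedding_dim
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    
    # What's this layer? # RNN layer: reads the embedded sequence one timestep
    # at a time while carrying a hidden state of size rnn_units
    self.rnn = tf.keras.layers.SimpleRNN(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    # Why do we need return_sequences=True and return_state=True?
    # How will these be used after the model is trained?

    # What's this layer? # Dense layer: turns each RNN output into one logit
    # per character in the vocabulary
    self.dense = tf.keras.layers.Dense(vocab_size)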

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.rnn.get_initial_state(x)
    x, states = self.rnn(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

We could have used a `keras.Sequential` model here, as this architecture is quite simple. However, to generate text later we will need to manage the RNN's internal state. It's simpler to include the state input and output options upfront than it is to rearrange the model architecture later. For more details see the [Keras RNN guide](https://www.tensorflow.org/guide/keras/rnn#rnn_state_reuse).

In [79]:
# Now instantiate the class defined above.

model = MyModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=len(ids_from_chars.get_vocabulary()),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

For each character the model looks up the embedding, runs the RNN one timestep with the embedding as input, and applies the dense layer to generate logits predicting the log-likelihood of the next character. A logit can be turned into a probability by exponentiating it and dividing by the sum of the exponentiated logits for all characters (the softmax function). Statisticians like to work with logits because they are unbounded and behave linearly.
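The phrase "runs the RNN one timestep" can be made concrete with a small plain-Python sketch of the simple-RNN recurrence, in which the new hidden state is `tanh` of a weighted sum of the current input and the previous state. The sizes and weights below are made up for illustration, and `rnn_step` is a hypothetical helper rather than Keras code.

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One timestep of a simple RNN: h_new = tanh(x @ w_x + h @ w_h + b)."""
    new_h = []
    for j in range(len(h)):
        total = b[j]
        total += sum(x[i] * w_x[i][j] for i in range(len(x)))
        total += sum(h[k] * w_h[k][j] for k in range(len(h)))
        new_h.append(math.tanh(total))
    return new_h

# Toy sizes: 2-dimensional inputs and a 3-unit hidden state (made-up weights)
w_x = [[0.5, -0.3, 0.1],
       [0.2, 0.4, -0.6]]
w_h = [[0.1, 0.0, 0.2],
       [-0.2, 0.3, 0.0],
       [0.0, -0.1, 0.4]]
b = [0.0, 0.1, -0.1]

h = [0.0, 0.0, 0.0]   # initial state
states = []
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step(x, h, w_x, w_h, b)   # the state carries over to the next step
    states.append(h)

for t, state in enumerate(states):
    print(t, [round(v, 3) for v in state])
```

In the notebook's model the input at each step is a 256-dimensional embedding and the state has 1024 units, but the recurrence has the same form.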

## Try the model

Now run the model to see that it behaves as expected.

First check the shape of the output:

In [80]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 80, 95) # (batch_size, sequence_length, vocab_size)


In the above example the sequence length of the input is `80` but the model can be run on inputs of any length:

In [81]:
model.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  24320     
                                                                 
 simple_rnn (SimpleRNN)      multiple                  1311744   
                                                                 
 dense (Dense)               multiple                  97375     
                                                                 
Total params: 1,433,439
Trainable params: 1,433,439
Non-trainable params: 0
_________________________________________________________________
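The parameter counts in the summary can be reproduced by hand, which is a useful check on how the vocabulary size, embedding dimension, and RNN size fit together. The sketch below assumes the usual layouts: one embedding vector per vocabulary entry, and for the RNN and dense layers one weight per input plus one bias per unit.

```python
vocab_size = 95       # 94 characters plus [UNK]
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim-long vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# SimpleRNN: input weights + recurrent weights + bias, for each unit
rnn_params = (embedding_dim + rnn_units + 1) * rnn_units

# Dense: weights from every RNN unit + bias, for each output logit
dense_params = (rnn_units + 1) * vocab_size

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 24320 1311744 97375 1433439
```

Most of the parameters sit in the recurrent layer, so changing `rnn_units` has a far larger effect on model size than changing the vocabulary.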


To get actual predictions from the model we should sample from the output nodes. The distribution of output node values for any given input is defined by the logits over the character vocabulary. The TF documentation says that it is important to _sample_ from this distribution rather than taking the _argmax_ of the distribution, because taking the argmax can easily get the model stuck in a loop.

Taking the argmax would cause the text generator to make the exact same prediction every time when provided with a particular input. As a result, we could present "to be" as the input and get "to be or not to be or not to be or not to be. . ." as the output. By sampling from the output instead, we can get a variety of somewhat random (yet high probability) responses each time we present the input, so "to be" could generate the response, "to be or not to be, that is the snorgle."

Try it for the first example in the batch:

In [85]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

This gives us, at each timestep, a prediction of the next character index. You should run the cell above and the cell below several times:

In [88]:
sampled_indices

array([67, 33, 65,  0, 36, 53, 15, 51, 71, 13,  5, 12,  9, 34, 43, 46, 10,
       92, 84, 20, 15, 63, 68, 76,  2, 12, 47, 86,  1, 26, 32, 10,  4, 93,
       56, 22, 13, 88, 39, 89, 54,  1, 68, 82, 82, 65, 36, 69, 57, 79,  0,
       60, 47, 79, 19, 72, 73, 77, 69, 66, 63, 74, 51, 45, 28, 24, 35, 47,
       29, 44,  5, 61, 37, 42, 75, 70, 12, 57, 41, 15])

Decode these to see the text predicted by this untrained model:

In [89]:
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b'n hanging about\r\nwith you people, going rusty without work. I can\xe2\x80\x99t live without'

Next Char Predictions:
 b'iDg[UNK]GX0Vm.#-)ENQ*\xe2\x80\x9cz50ejr\r-R\xc3\xa0\n;C*!\xe2\x80\x9d[7.\xc3\xa9J\xc3\xaaY\njxxgGk]u[UNK]bRu4noskhepVP?9FR@O#cHMql-]L0'


Naturally, the predicted text is nonsense because the untrained model essentially makes random predictions.
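The difference between sampling and taking the argmax can be seen in a small plain-Python sketch. The logits below are made up for a three-character vocabulary, and `softmax` and `sample_index` are hypothetical helpers that stand in for what `tf.random.categorical` does with real logits.

```python
import math
import random

def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    """Draw one index at random, weighted by the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]   # made-up scores for a three-character vocabulary
rng = random.Random(0)

greedy = max(range(len(toy_logits)), key=toy_logits.__getitem__)
samples = [sample_index(toy_logits, rng) for _ in range(1000)]

print("argmax always picks:", greedy)
print("sampled counts:", [samples.count(i) for i in range(len(toy_logits))])
```

The argmax returns index 0 every time, whereas sampling returns index 0 most often but also produces the other two indices in proportion to their probabilities.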

In [None]:
#
# Exercise 9.15: Rerun the previous three code cells. Comment on what you
# observe. In particular, why is the input different each time if you are 
# calling the same code?
#

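#The input only changes when the cell that draws the batch (dataset.take(1)) is rerun:
#the dataset is shuffled, so a different batch comes first each time.
#The predictions change on every run because tf.random.categorical samples from
#the output distribution instead of always picking the most likely character.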


## Train the model

At this point the problem can be treated as a standard classification problem. Given the previous RNN state, and the input this time step, predict the class of the next character.

### Attach an optimizer, and a loss function

The standard `tf.keras.losses.sparse_categorical_crossentropy` loss function works in this case because it is applied across the last dimension of the predictions. Categorical means that the category options are considered mutually exclusive, and sparse means that each target is given as a single integer class ID rather than as a one-hot vector. After all, the next character cannot be both a and b - one of those two options should have a higher probability.

Because your model returns logits, you need to set the `from_logits` flag.


In [90]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

In [91]:
example_batch_loss = loss(target_example_batch, example_batch_predictions)
mean_loss = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", mean_loss)

Prediction shape:  (64, 80, 95)  # (batch_size, sequence_length, vocab_size)
Mean loss:         4.572419


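A newly initialized model shouldn't be too sure of itself, so the output logits should all have similar magnitudes. In that case the predicted distribution is close to uniform, and the cross-entropy loss is close to the natural logarithm of the vocabulary size. To confirm this you can check that the exponential of the mean loss is approximately equal to the vocabulary size: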

In [92]:
tf.exp(mean_loss).numpy()

96.77795
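
The reasoning behind this check can be worked through by hand. If every logit is equal, softmax assigns probability 1/V to each of the V characters, so the loss is ln V and its exponential is V. A small sketch, assuming the vocabulary size of 95 taken from the prediction shape printed above:

```python
import math

vocab_size = 95  # last dimension of the prediction shape printed above

# Equal logits give a uniform softmax: every character has probability 1/V.
uniform_prob = 1.0 / vocab_size

# Cross-entropy of a uniform prediction is -ln(1/V) = ln(V), whatever the target is.
uniform_loss = -math.log(uniform_prob)

print(f"loss for a uniform prediction: {uniform_loss:.3f}")
print(f"exp(loss): {math.exp(uniform_loss):.1f}")
```

The untrained model's mean loss of about 4.57 sits just above ln 95 (about 4.554), which is what you would expect from logits that are nearly, but not exactly, uniform.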

##Checkpoint! Report the untrained model (exponentiated) mean_loss from the previous cell

The expression `tf.exp(mean_loss).numpy()` takes the mean loss value, which is a cross-entropy measured in nats, and exponentiates it. The result is the model's perplexity: roughly the number of characters the model is choosing among as if they were all equally likely. Write the value from the previous cell next to your name on the whiteboard. You can round off to one or two significant digits.

In [None]:
#
# Exercise 9.16: Add some code that compares the exponentiated mean loss with
# the size of the vocabulary (look in earlier cells for this value). If the 
# exp(mean_loss) is more than 10% larger than the vocab size, print a warning.
#

Configure the training procedure using the `tf.keras.Model.compile` method. Use `tf.keras.optimizers.Adam` with default arguments and the loss function.

In [None]:
model.compile(optimizer='adam', loss=loss)

### Configure checkpoints

Use a `tf.keras.callbacks.ModelCheckpoint` to ensure that checkpoints are saved during training:

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
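
The `{epoch}` placeholder in `checkpoint_prefix` is an ordinary Python format field. As a rough illustration of the names this template produces, you can fill it in yourself. The 1-based epoch numbering used here is an assumption about how the callback counts epochs, and the files on disk may carry extra suffixes added by TensorFlow.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Fill in the placeholder for the first three epochs (assumed to be numbered from 1).
example_paths = [checkpoint_prefix.format(epoch=e) for e in range(1, 4)]
for path in example_paths:
    print(path)
```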

### Execute the training

To keep training time reasonable, use 20 epochs to train the model. In Colab, set the runtime to GPU for faster training. Note that with all of the initial hyperparameters in this notebook, each epoch should take about 10 seconds, so only about 3 minutes to train this model. 

In [None]:
EPOCHS = 20

In [None]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

In [None]:
#
# Exercise 9.17: Click on the file folder in the left hand control bar.
# Open up the training_checkpoints folder. How many training checkpoints
# do you see? Add a comment saying why there are that many checkpoints.
# Try downloading one of those checkpoint files to your computer. How 
# large is it? Why does it take so much space?
#


## Generate text

The simplest way to generate text with this model is to run a set of predictions in a loop, while keeping track of the model's internal state as it runs. Each time we call the model we pass in a slice of text and an internal state. 

The model returns a prediction for the next character as well as its new state. Pass the prediction and state back in to continue generating text. The class defined below accomplishes one step in this chain of model runs. Besides the class initialization, there is only one bound method, called generate_one_step(). When the generate_one_step() bound method is called, it makes a single step prediction. Note the temperature argument on the class initializer. Temperature controls the degree of randomness in the predictions.


In [None]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    # The model also returns its internal state so that we can use that
    # the next time around the loop.
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
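
The effect of dividing by the temperature and then adding the `-inf` mask can be seen on a small example. This is a sketch in plain Python rather than TensorFlow; the four toy logits and the choice of index 3 as the "[UNK]" slot are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Turn logits into probabilities; a -inf logit becomes probability 0."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0, 3.0]          # toy scores; pretend index 3 is "[UNK]"
mask = [0.0, 0.0, 0.0, -float('inf')]  # plays the role of self.prediction_mask

for temperature in (0.5, 1.0, 2.0):
    # Same order as generate_one_step: scale by temperature, then add the mask.
    scaled = [z / temperature + k for z, k in zip(logits, mask)]
    probs = softmax(scaled)
    print(temperature, [round(p, 3) for p in probs])
```

A lower temperature concentrates probability on the highest-scoring character, a higher temperature flattens the distribution, and the masked entry gets probability zero at every temperature, so it can never be sampled.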

In [None]:
# Now we can instantiate the class. Take note of the arguments we are 
# passing in. What do the last two arguments do?
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

In [None]:
#
# Exercise 9.18: Display the type() of each of the arguments passed into
# the class initializer in the previous cell. Explain why each of the 
# three arguments is needed to initialize the OneStep class.
#

Run it in a loop to generate some text. Looking at the generated text, you'll see that the model knows when to capitalize and when to start a new paragraph, and that it imitates a Chekhov-like vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences.

In [None]:
start = time.time()
states = None
next_char = tf.constant(['NATALYA STEPANOVNA.'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)
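
The loop above follows a general pattern: feed the most recent output and the carried state back in, and collect each output along the way. The sketch below shows the same pattern with a toy stand-in for the model. `toy_step` is a hypothetical function invented for illustration; it cycles through a three-letter alphabet and has nothing to do with the trained RNN.

```python
def toy_step(text, state=None):
    """Hypothetical stand-in for generate_one_step: returns (next_char, new_state)."""
    alphabet = "abc"
    # The state here is just a counter; on the first call it is derived from the seed.
    position = len(text) if state is None else state
    next_char = alphabet[position % len(alphabet)]
    return next_char, position + 1

state = None
next_char = "ab"          # seed text
result = [next_char]

for _ in range(5):
    # Only the newest output and the state are passed back in, as in the cell above.
    next_char, state = toy_step(next_char, state)
    result.append(next_char)

generated = "".join(result)
print(generated)
```

After the first call the seed is no longer passed in. Everything the toy function knows about earlier text lives in `state`, which is the role the RNN's internal state plays in the real loop.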

#Discuss with Your Partner

Examine the generated output above closely. What linguistic tasks can the model do correctly? What does it have trouble with? Can you explain why there are spelling errors? Why do you think the model has difficulty producing a grammatically correct sentence?

Finally, make sure you can both answer the question of why we used a character generation model for this lab rather than a word-based generation model.

The easiest thing you can do to improve the results is to train it for longer. You can run model.fit() just as you did above and the model will continue training where it left off. Try doing 10 or 20 more epochs.

In [None]:
#
# Exercise 9.19: Run 20 more training epochs and then generate additional 
# characters using the code above. Use the same seed text to start the
# model. 
#
# Add a comment indicating if the model's generated text is better than before.
#

In [None]:
#
# Exercise 9.20: Plot the training history from the most recent training run.
# Comment on whether you think that even more training would improve the model.
#



If you want the model to generate text faster, the easiest thing you can do is batch the text generation. In the example below the model generates three texts in about the same time it took to generate just one above.

In [None]:
start = time.time()
states = None
next_char = tf.constant(['NATALYA STEPANOVNA.', 'NATALYA STEPANOVNA.', 'NATALYA STEPANOVNA.'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result, '\n\n' + '_'*80)
print('\nRun time:', end - start)

In [None]:
#
# Exercise 9.21: Add a comment describing why the model produces different text
# even though you have provided the same seed three times.
#

**Improving the Model**

There are several strategies that might improve the performance of the model. Try them in the following order:

* Reduce the temperature in the initialization of the OneStep class to reduce the randomness in generation of new characters 
* Increase the number of nodes in the existing SimpleRNN layer to give the model more "intelligence"
* Add an additional dense layer after the RNN layer to improve the model's capability to make sense of the output of the RNN layer

Try these techniques one at a time in the order stated. For the moment we don't have a way of documenting model quality other than the final model loss value and your own read of the generated text to see whether it is creating real words and sensible sentences. Make sure to add comments documenting what you find out.

In [None]:
#
# Exercise 9.22: Make one change at a time to the model to try to improve it.
# Run a test using the OneStep predictor class as shown in the code above
# to generate new predictions and judge for yourself whether they look more
# sensible. You may want to create some additional functions to encapsulate 
# some of the earlier steps and make it easier to run experiments.
#


## Advanced: Customized Training

The following material is for students who want to dive more deeply into TensorFlow. If you have run out of options for modifying and testing the simple model above and there is time left in the lab, you may want to learn more about improving the quality of a character-based text generation model. Examine the customized training class provided below and spend some time thinking about how it works. 

All the training we have done to this point is very straightforward and is no different from other TF/Keras models we've experimented with. Some would call this an open-loop model, because prediction mistakes are not fed back into the model to improve it. In future weeks, we will use "teacher forcing" to address this problem. You might want to consult this article for a preview: https://towardsdatascience.com/what-is-teacher-forcing-3da6217fed1c

The custom class below improves over our previous training method by closing the "open loop" that allowed prediction errors to accumulate. The most important part of a custom training loop is the train step function.

The loop uses `tf.GradientTape` to track the gradients. You can learn more about this approach by reading the [eager execution guide](https://www.tensorflow.org/guide/eager).

The basic procedure is:

1. Execute the model and calculate the loss under a `tf.GradientTape`.
2. Calculate the updates and apply them to the model using the optimizer.
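
The two steps can be illustrated without TensorFlow on a one-parameter model. In this sketch the gradient is derived by hand instead of being recorded by `tf.GradientTape`, and plain gradient descent stands in for the Adam optimizer. The data point, learning rate, and variable names are all assumptions made for illustration.

```python
x, y = 2.0, 6.0        # one training example; the ideal weight is 3.0
w = 0.0                # the single trainable parameter
learning_rate = 0.1

losses = []
for step in range(20):
    # Step 1: run the model and compute the loss.
    prediction = w * x
    loss = (prediction - y) ** 2
    losses.append(loss)

    # Step 2: compute the gradient d(loss)/dw and apply the update.
    grad = 2.0 * (prediction - y) * x
    w -= learning_rate * grad

print(f"final w = {w:.4f}, first loss = {losses[0]:.2f}, last loss = {losses[-1]:.6f}")
```

In the TensorFlow version, the tape computes `grad` automatically for every trainable variable, and the optimizer decides how large each update should be.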

In [None]:
class CustomTraining(MyModel):
  @tf.function
  def train_step(self, inputs):
      inputs, labels = inputs
      with tf.GradientTape() as tape:
          predictions = self(inputs, training=True)
          loss = self.loss(labels, predictions)
      # Use this instance's own variables rather than the global `model`.
      grads = tape.gradient(loss, self.trainable_variables)
      self.optimizer.apply_gradients(zip(grads, self.trainable_variables))

      return {'loss': loss}

The above implementation of the `train_step` method follows [Keras' `train_step` conventions](https://www.tensorflow.org/guide/keras/customizing_what_happens_in_fit). This is optional, but it allows you to change the behavior of the train step and still use Keras' `Model.compile` and `Model.fit` methods.

In [None]:
model = CustomTraining(
    vocab_size=len(ids_from_chars.get_vocabulary()),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

In [None]:
model.compile(optimizer = tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

In [None]:
model.fit(dataset, epochs=1)

Or if you need more control, you can write your own complete custom training loop:

In [None]:
EPOCHS = 10

mean = tf.metrics.Mean()

for epoch in range(EPOCHS):
    start = time.time()

    mean.reset_states()
    for (batch_n, (inp, target)) in enumerate(dataset):
        logs = model.train_step([inp, target])
        mean.update_state(logs['loss'])

        if batch_n % 50 == 0:
            template = f"Epoch {epoch+1} Batch {batch_n} Loss {logs['loss']:.4f}"
            print(template)

    # saving (checkpoint) the model every 5 epochs
    if (epoch + 1) % 5 == 0:
        model.save_weights(checkpoint_prefix.format(epoch=epoch))

    print()
    print(f'Epoch {epoch+1} Loss: {mean.result().numpy():.4f}')
    print(f'Time taken for 1 epoch {time.time() - start:.2f} sec')
    print("_"*80)

model.save_weights(checkpoint_prefix.format(epoch=epoch))

In [None]:
#
# This training code created a trained "model" object, just as the earlier, 
# simpler code did. So you can use all of the predictive machinery that
# appeared earlier in this notebook to test the results.
#