# Comparing RNNs for Text Generation

Modified from an example originally written by "The TensorFlow Authors" in 2019 and distributed under Apache License 2.0. This lab creates a model that can generate text using a character-based RNN. In this notebook, we set up a "head-to-head" comparison of a vanilla RNN, an LSTM, and a GRU. We should be able to observe differences in training time and in the level of the loss function achieved.

Recall from week 9 that a character-based RNN learns sequences of characters from a corpus of text. Once the model is trained, one can present an input character sequence and the model will generate the character that it predicts would be most likely to appear next. By repeatedly calling the model for new predictions with the previously built sequence, one can create a string of text that "looks like" sentences from the original training text. This approach has numerous disadvantages, such as an inability to know when to stop (e.g., at the end of a sentence). But character level predictions use small amounts of training data very effectively, are computationally efficient, and are highly adaptable to multilingual data.
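The generation loop described above can be sketched without any neural network. The snippet below is a minimal illustration, assuming a toy table of character-pair counts stands in for the trained RNN; `train_bigram_counts`, `predict_next`, and `generate` are hypothetical names and are not part of this lab's code.

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(corpus):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for current_char, next_char in zip(corpus, corpus[1:]):
        counts[current_char][next_char] += 1
    return counts

def predict_next(counts, sequence, rng):
    """Sample the next character, looking only at the last character so far."""
    followers = counts[sequence[-1]]
    chars = list(followers.keys())
    weights = list(followers.values())
    return rng.choices(chars, weights=weights, k=1)[0]

def generate(counts, seed, num_chars, rng):
    """Repeatedly append the predicted character to the growing sequence."""
    sequence = seed
    for _ in range(num_chars):
        sequence += predict_next(counts, sequence, rng)
    return sequence

toy_corpus = "to be or not to be, that is the question. "
bigram_counts = train_bigram_counts(toy_corpus)
generated = generate(bigram_counts, seed="to", num_chars=40, rng=random.Random(0))
print(generated)
```

Note that the loop has no notion of when to stop: it simply runs for `num_chars` steps, which is the limitation mentioned above.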

Check to make sure that GPU acceleration is enabled to execute this notebook faster. In Colab: *Runtime > Change runtime type > Hardware accelerator > GPU*. Do this before you start to run any code, because it generally restarts the runtime and your local variables will be lost.

## Setup

In this initial step you will see an easy method to load an NLP benchmarking data set from the HuggingFace repository (which we have used previously to obtain pretrained sentence summarizer models). Benchmarking data sets are crucial to progress in NLP technology because they provide a standardized basis for exploring the performance characteristics of a model.

Here we are importing the SQuAD question answering data set (version 2) initially developed at Stanford University. Go to https://huggingface.co/datasets to review the dataset discovery interface and a few of the resources stored there.

In [1]:
!pip install datasets # A package for creating a connection to HuggingFace data resources
from datasets import load_dataset

raw_dataset = load_dataset("squad_v2") # The names of the datasets can be obtained from the web-based discovery interface
raw_train_dataset = raw_dataset["train"] # We will use the training data

type(raw_dataset), type(raw_train_dataset) # Display the types

(datasets.dataset_dict.DatasetDict, datasets.arrow_dataset.Dataset)

### Import TensorFlow and other libraries

In [2]:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

import numpy as np
import os # To access local files; for saving checkpoints
import time

from urllib import request # We will need this to read from a URL

### Set up the Text Data

SQuAD is set up with triplets of questions, answers, and contextual paragraphs that contain or imply the answer. With a more sophisticated generative model we would treat the problem of predicting the answer from the question in a more formalized manner, but here we simply compile everything into a large text file for use in training our character model.

In [3]:
# Build a raw text file of Q/A/C sequences: Takes a minute
long_text = ""

for entry in raw_train_dataset:
  # Some entries may not have an answer: Skip them
  if len(entry['answers']['text']) > 0:
    long_text += entry['question']
    long_text += " "
    long_text += entry['answers']['text'][0]
    long_text += ":  "
    long_text += entry['context']
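    long_text += "/n/n"  # Note: this appends the literal characters "/n/n" (visible in the output below); "\n\n" would add real line breaks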

len(long_text), long_text[:200]

(73125480,
 'When did Beyonce start becoming popular? in the late 1990s:  Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and a')
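Building a very long string with repeated `+=` can be slow in some Python implementations, so a common alternative is to collect the pieces in a list and join them once. The sketch below shows that pattern on a few made-up entries; `toy_entries` is a hypothetical stand-in for `raw_train_dataset`, and the layout (question, first answer, context, separator) copies the loop above.

```python
# Toy stand-ins for SQuAD entries; the real entries come from raw_train_dataset.
toy_entries = [
    {"question": "Q1?", "answers": {"text": ["A1"]}, "context": "C1."},
    {"question": "Q2?", "answers": {"text": []}, "context": "C2."},  # no answer: skipped
    {"question": "Q3?", "answers": {"text": ["A3", "alt"]}, "context": "C3."},
]

pieces = []
for entry in toy_entries:
    answers = entry["answers"]["text"]
    if len(answers) > 0:
        # Same layout as the loop above: question, first answer, then context.
        pieces.append(entry["question"] + " " + answers[0] + ":  " + entry["context"] + "/n/n")

joined_text = "".join(pieces)
print(joined_text)
```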

In [6]:
#
# Exercise 10.0: Review some additional text in long_text. Examine enough so
# that you can see at least one more Q/A/C triplet.
#

long_text[:1800]
#Here we can see the second question - What areas did Beyonce compete in when she was growing up?

'When did Beyonce start becoming popular? in the late 1990s:  Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy"./n/nWhat areas did Beyonce compete in when she was growing up? singing and dancing:  Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and 

In [7]:
# We won't be able to train the models during class with this much data
# so let's subset it for purposes of getting the lab done.
PROP_TO_TAKE = 0.10 # Take a small proportion of the data to make training time realistic
lt_len = len(long_text)
st_len = int(round((lt_len * PROP_TO_TAKE),0))

text = long_text[ :st_len]
len(text)

7312548

### Examine the vocab

First, look in the text to see what we have. Note that this is a character-based model, so we are interested to know what different characters are used throughout the whole text. Note that this should make the process largely language independent: Any language where words are formed through sequences of characters should work in training this kind of model.

In [8]:
# The unique text characters in the file
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

381 unique characters


In [9]:
#
# Exercise 10.1: Review the character set. Why so many more characters than last week?
#
vocab
#The corpus is large (73,125,480 characters before subsetting) and mixes many symbols with text from several languages and scripts.
#With so many distinct symbols, the character set is much larger than last week's.

['\n',
 ' ',
 '!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 '[',
 ']',
 '`',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '~',
 '¢',
 '£',
 '¥',
 '§',
 '°',
 '±',
 '²',
 '½',
 'Å',
 'Ç',
 'É',
 'Ö',
 'Ü',
 'à',
 'á',
 'â',
 'ã',
 'ä',
 'æ',
 'ç',
 'è',
 'é',
 'ê',
 'ë',
 'ì',
 'í',
 'ï',
 'ñ',
 'ò',
 'ó',
 'ô',
 'ö',
 'ø',
 'ú',
 'ü',
 'ý',
 'Ā',
 'ā',
 'ă',
 'ć',
 'Đ',
 'ī',
 'ĭ',
 'İ',
 'Ł',
 'ł',
 'ń',
 'Ō',
 'ō',
 'ŏ',
 'ř',
 'Ś',
 'ś',
 'Š',
 'š',
 'ũ',
 'ū',
 'ů',
 'ŵ',
 'ź',
 'Ż',
 'ż',
 'ɐ',
 'ɑ',
 'ɒ',
 'ɔ',
 'ə',
 'ɛ',
 'ɜ',
 'ɡ',
 'ɪ',
 'ɹ',
 'ɾ',
 'ʁ',
 'ʃ',
 'ʊ',
 'ʌ',
 'ʏ

## Process the text

In [10]:
# Note that we are passing in the vocab from parsing the whole dataset in an 
# earlier cell. Examine the first argument closely:
ids_from_chars = preprocessing.StringLookup(vocabulary=list(vocab), mask_token=None)

# Shows the resulting class type and that we now have an instance
type(ids_from_chars), isinstance(ids_from_chars, preprocessing.StringLookup)

(keras.layers.preprocessing.string_lookup.StringLookup, True)

Create a reverse lookup using `preprocessing.StringLookup(..., invert=True)`. Note: Make sure to use the `get_vocabulary()` method of the `preprocessing.StringLookup` layer so that the [UNK] tokens (if any) are appropriately labeled.

In [11]:
# This creates an instance of the "inverter"
chars_from_ids = tf.keras.layers.experimental.preprocessing.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

# Shows the resulting class type and that we now have an instance
type(chars_from_ids), isinstance(chars_from_ids, preprocessing.StringLookup)

(keras.layers.preprocessing.string_lookup.StringLookup, True)

In [12]:
# Create a function that will piece together text
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
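What the two lookup layers do can be mimicked in plain Python. The sketch below assumes that index 0 is reserved for the `[UNK]` (unknown character) token and that real characters start at index 1, which is consistent with the models later producing 382 outputs for 381 characters. The names `toy_vocab`, `char_to_id`, `id_to_char`, `ids_from_text`, and `text_from_id_list` are hypothetical.

```python
toy_vocab = sorted(set("hello world"))

# Reserve index 0 for unknown characters; real characters start at 1.
id_to_char = ["[UNK]"] + toy_vocab
char_to_id = {ch: i for i, ch in enumerate(id_to_char)}

def ids_from_text(text):
    """Map each character to its ID, falling back to 0 for unseen characters."""
    return [char_to_id.get(ch, 0) for ch in text]

def text_from_id_list(ids):
    """Map IDs back to characters and join them into one string."""
    return "".join(id_to_char[i] for i in ids)

ids = ids_from_text("hello")
print(ids)
print(text_from_id_list(ids))
print(text_from_id_list(ids_from_text("help")))  # 'p' is not in the toy vocabulary
```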

### The prediction task

Given a character, or a sequence of characters, what is the most probable next character? This is the task that our RNN models will perform. The input to the models will be sequences of characters from the text we ingested at the top of the notebook.


### Create training examples and targets

Next, we will divide the text into example sequences. Each input sequence will contain `seq_length` characters from the text. For each input sequence, the corresponding targets contain the same length of text, except shifted one character to the right.

So break the text into chunks of `seq_length+1`. For example, say `seq_length` is 4 and our text is "Hello". The input sequence would be "Hell", and the target sequence "ello". This ensures that we have tons of training examples and that each training example captures context both around the input string and the output string. This is a somewhat different strategy than the word-based example we examined in class, but it uses the same principle: An input sequence is processed by the RNN to predict a target (in this case a target sequence, rather than just one word).

To do this, first use the `tf.data.Dataset.from_tensor_slices` function to convert the text vector into a stream of character indices.

In [13]:
# Here's where we process all of the characters from the data, which are stored in text.
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
type(all_ids)

tensorflow.python.framework.ops.EagerTensor

In [14]:
all_ids.get_shape()

TensorShape([7312548])

In [15]:
# This creates a dataset whose elements are slices from the original tensor.
# This would be a great moment to look at the TF documentation for the 
# from_tensor_slices() method.
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
type(ids_dataset)

tensorflow.python.data.ops.from_tensor_slices_op.TensorSliceDataset

In [16]:
# While an RNN can theoretically handle a continuous stream of data
# here we are considering the data in small groupings whose length
# is controlled by seq_length. Leave it at its current value for now, but 
# in the future you might consider making it either shorter or longer in the
# training run.
seq_length = 160 # LSTMs and GRUs should be able to handle longer sequences
examples_per_epoch = len(text)//(seq_length+1)

The `batch` method lets you easily convert these individual characters to sequences of the desired size.

In [17]:
#
# seq_length+1 accounts for the fact that we will be predicting a set of 
# characters from an original string that was right-shifted by one.
#
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(2):
  print(chars_from_ids(seq))

tf.Tensor(
[b'W' b'h' b'e' b'n' b' ' b'd' b'i' b'd' b' ' b'B' b'e' b'y' b'o' b'n'
 b'c' b'e' b' ' b's' b't' b'a' b'r' b't' b' ' b'b' b'e' b'c' b'o' b'm'
 b'i' b'n' b'g' b' ' b'p' b'o' b'p' b'u' b'l' b'a' b'r' b'?' b' ' b'i'
 b'n' b' ' b't' b'h' b'e' b' ' b'l' b'a' b't' b'e' b' ' b'1' b'9' b'9'
 b'0' b's' b':' b' ' b' ' b'B' b'e' b'y' b'o' b'n' b'c' b'\xc3\xa9' b' '
 b'G' b'i' b's' b'e' b'l' b'l' b'e' b' ' b'K' b'n' b'o' b'w' b'l' b'e'
 b's' b'-' b'C' b'a' b'r' b't' b'e' b'r' b' ' b'(' b'/' b'b' b'i'
 b'\xcb\x90' b'\xcb\x88' b'j' b'\xc9\x92' b'n' b's' b'e' b'\xc9\xaa' b'/'
 b' ' b'b' b'e' b'e' b'-' b'Y' b'O' b'N' b'-' b's' b'a' b'y' b')' b' '
 b'(' b'b' b'o' b'r' b'n' b' ' b'S' b'e' b'p' b't' b'e' b'm' b'b' b'e'
 b'r' b' ' b'4' b',' b' ' b'1' b'9' b'8' b'1' b')' b' ' b'i' b's' b' '
 b'a' b'n' b' ' b'A' b'm' b'e' b'r' b'i' b'c' b'a' b'n' b' ' b's' b'i'], shape=(161,), dtype=string)
tf.Tensor(
[b'n' b'g' b'e' b'r' b',' b' ' b's' b'o' b'n' b'g' b'w' b'r' b'i' b't'
 b'e' b'r' b',' b' ' b'r'

It's easier to see what this is doing if you join the tokens back into strings:

In [18]:
for seq in sequences.take(2):
  print(text_from_ids(seq).numpy())

b'When did Beyonce start becoming popular? in the late 1990s:  Beyonc\xc3\xa9 Giselle Knowles-Carter (/bi\xcb\x90\xcb\x88j\xc9\x92nse\xc9\xaa/ bee-YON-say) (born September 4, 1981) is an American si'
b'nger, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose '
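The `b'...'` strings above are raw UTF-8 bytes, which is why accented and phonetic characters show up as escape sequences such as `\xc3\xa9`. As a small illustration, decoding the bytes recovers the readable text; the byte string below is copied from the first output line.

```python
raw = b'Beyonc\xc3\xa9 Giselle Knowles-Carter (/bi\xcb\x90\xcb\x88j\xc9\x92nse\xc9\xaa/'
decoded = raw.decode("utf-8")

print(decoded)
# Multi-byte characters make the byte count larger than the character count.
print(len(raw), len(decoded))
```

In the notebook, the same `.decode("utf-8")` call can be applied to the bytes object returned by `text_from_ids(seq).numpy()`.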


For training you'll need a dataset of `(input, label)` pairs, where both `input` and `label` are sequences. At each time step the input is the current character and the label is the next character. Here's a function that takes a sequence as input, duplicates it, and shifts it to align the input and label for each timestep:

In [19]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text
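The chunk-then-shift idea is easy to check on an ordinary Python string. This is a minimal sketch using the "Hello" example with `seq_length` set to 4; `chunk_text` and `split_chunk` are hypothetical helpers that imitate `batch(seq_length+1, drop_remainder=True)` and `split_input_target` on plain strings.

```python
def chunk_text(text, seq_length):
    """Split text into chunks of seq_length + 1 characters, dropping any remainder."""
    chunk_size = seq_length + 1
    num_chunks = len(text) // chunk_size
    return [text[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]

def split_chunk(chunk):
    """Return (input, target), where the target is the input shifted by one character."""
    return chunk[:-1], chunk[1:]

chunks = chunk_text("Hello world", seq_length=4)
pairs = [split_chunk(chunk) for chunk in chunks]
print(pairs)
```

The trailing "d" is dropped because it does not fill a complete chunk of five characters.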

In [20]:
# This is a curious construction - i.e., passing a function into the
# sequences.map() bound method:
dataset = sequences.map(split_input_target)

dataset.take(1)

<TakeDataset element_spec=(TensorSpec(shape=(160,), dtype=tf.int64, name=None), TensorSpec(shape=(160,), dtype=tf.int64, name=None))>

In [21]:
for input_example, target_example in dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

Input : b'When did Beyonce start becoming popular? in the late 1990s:  Beyonc\xc3\xa9 Giselle Knowles-Carter (/bi\xcb\x90\xcb\x88j\xc9\x92nse\xc9\xaa/ bee-YON-say) (born September 4, 1981) is an American s'
Target: b'hen did Beyonce start becoming popular? in the late 1990s:  Beyonc\xc3\xa9 Giselle Knowles-Carter (/bi\xcb\x90\xcb\x88j\xc9\x92nse\xc9\xaa/ bee-YON-say) (born September 4, 1981) is an American si'


### Create training batches

We have used `tf.data` to split the text into manageable sequences. Before using these data to train the model, we need to shuffle the data and pack it into batches. Remember from class that batching, AKA mini-batching, is a method of processing a group of input-output pairs together in a single training step, with one weight update per batch. Mini-batching facilitates parallelization and can help reduce overfitting. Compared with updating after every example, mini-batching also reduces the total number of weight updates that need to occur during a given epoch.
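As a rough check on how many weight updates one epoch involves, the arithmetic can be worked out from the numbers in this notebook. This sketch assumes the sequence length of 160 and the batch size of 64 used in the surrounding cells, and that each batch triggers one weight update (the usual behavior of Keras `fit`).

```python
num_chars = 7312548   # len(text) reported earlier in the notebook
seq_length = 160
batch_size = 64

# drop_remainder=True discards the incomplete tail at both stages.
sequences_per_epoch = num_chars // (seq_length + 1)
batches_per_epoch = sequences_per_epoch // batch_size

print(sequences_per_epoch, batches_per_epoch)
```

Under these assumptions an epoch has 709 weight updates instead of one per sequence.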

In [22]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset:
# TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements.
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

dataset

<PrefetchDataset element_spec=(TensorSpec(shape=(64, 160), dtype=tf.int64, name=None), TensorSpec(shape=(64, 160), dtype=tf.int64, name=None))>

## Build The Models

This section defines the model as a `keras.Model` subclass (For details see [Making new Layers and Models via subclassing](https://www.tensorflow.org/guide/keras/custom_layers_and_models)). 

This model has three layers:

* `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that will map each character-ID to a vector with `embedding_dim` dimensions.
* `tf.keras.layers.SimpleRNN`: A type of RNN with size `units=rnn_units`. (You could also use a GRU or LSTM layer here.)
* `tf.keras.layers.Dense`: The output layer, with `vocab_size` outputs. It outputs one logit for each character in the vocabulary. These are the unnormalized log-probabilities of each character according to the model.
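To make the recurrent layer less of a black box, here is a minimal sketch of the update that a vanilla RNN applies at each time step: the new state is `tanh` of a weighted sum of the current input and the previous state. The tiny weights and sizes are hypothetical values chosen for illustration, and `simple_rnn_step` is not a Keras function.

```python
import math

def simple_rnn_step(x, h_prev, W_x, W_h, b):
    """One vanilla RNN step: h_new = tanh(x . W_x + h_prev . W_h + b)."""
    units = len(b)
    h_new = []
    for j in range(units):
        total = b[j]
        total += sum(x[i] * W_x[i][j] for i in range(len(x)))
        total += sum(h_prev[k] * W_h[k][j] for k in range(units))
        h_new.append(math.tanh(total))
    return h_new

# Hypothetical tiny sizes: 2-dimensional embeddings and 3 recurrent units.
W_x = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]                  # shape (embedding_dim, units)
W_h = [[0.2, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.2]]   # shape (units, units)
b = [0.0, 0.0, 0.0]

embedded_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # three time steps
h = [0.0, 0.0, 0.0]                                          # initial state
for x_t in embedded_sequence:
    h = simple_rnn_step(x_t, h, W_x, W_h, b)                 # state carries forward
print(h)
```

The state `h` is the "memory" that is passed from one character to the next; the GRU and LSTM replace this single update with gated versions.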

In [23]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 128 # This can probably go lower

# Number of RNN units
rnn_units = 256 # This is tunable. Remember that each RNN unit has a little
# bit of memory for what came before. Here 256 provides two nodes for every
# node in the embedding layer. Later, you might want to experiment with half
# as many and twice as many.

vocab_size, embedding_dim, rnn_units

(381, 128, 256)

In [24]:
# This builds a custom class for instantiating the Keras model
# For Vanilla and GRU
#
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units, rnn_type="vanilla"):
    super().__init__()

    # What's this layer?
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    
    # What's this layer?
    if rnn_type=="vanilla":
      self.rnn = tf.keras.layers.SimpleRNN(rnn_units, return_sequences=True, return_state=True)
    elif rnn_type=="gru":
      self.rnn = tf.keras.layers.GRU(rnn_units, return_sequences=True, return_state=True)
    else:
      raise ValueError("Please specify an RNN model type: 'vanilla' or 'gru'.")

    # What's this layer?
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)

    if states is None:
      states = self.rnn.get_initial_state(x)

    x, states = self.rnn(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

In [25]:
# For LSTM: Because it returns two internal states, we need a different call
class MyLSTM(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()

    # What's this layer?
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    
    # What's this layer?
    self.rnn = tf.keras.layers.LSTM(rnn_units, return_sequences=True, return_state=True)

    # What's this layer?
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)

    if states is None:
      states = self.rnn.get_initial_state(x)

    x, states1, states2 = self.rnn(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states1, states2
    else:
      return x

We could have used a `keras.Sequential` model here, as these architectures are quite simple. However, to generate text later we will need to manage the RNN's internal state. The vanilla RNN and the GRU have just a single state, whereas the LSTM has two states (h and c) that need to be maintained. It's simpler to include the state input and output options up front than it is to rearrange the model architecture later. For more details see the [Keras RNN guide](https://www.tensorflow.org/guide/keras/rnn#rnn_state_reuse).

In [26]:
# Now instantiate the classes defined above.

vanilla_model = MyModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=len(ids_from_chars.get_vocabulary()),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    rnn_type="vanilla")

gru_model = MyModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=len(ids_from_chars.get_vocabulary()),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    rnn_type="gru")

lstm_model = MyLSTM(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=len(ids_from_chars.get_vocabulary()),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)



In [27]:
# Send one batch through each model

for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions1 = vanilla_model(input_example_batch)
    print(example_batch_predictions1.shape, "# (batch_size, sequence_length, vocab_size)")

for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions2 = gru_model(input_example_batch)
    print(example_batch_predictions2.shape, "# (batch_size, sequence_length, vocab_size)")

for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions3 = lstm_model(input_example_batch)
    print(example_batch_predictions3.shape, "# (batch_size, sequence_length, vocab_size)")



(64, 160, 382) # (batch_size, sequence_length, vocab_size)
(64, 160, 382) # (batch_size, sequence_length, vocab_size)
(64, 160, 382) # (batch_size, sequence_length, vocab_size)


In [28]:
#
# Exercise 10.2 - Show model summaries for each of the models. Make note of the 
# number of trainable parameters. Comment on the number of extra parameters
# needed for GRU and LSTM respectively.
#
vanilla_model.summary()
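#Vanilla RNN - Total params: 245,630 - no gates, a single set of recurrent weights; fewest parameters, so likely the least time to train
#GRU - Total params: 443,518 - 2 gates plus a candidate state (3 weight sets); about 198,000 more parameters than the vanilla RNN
#LSTM - Total params: 541,310 - 3 gates plus a candidate cell state (4 weight sets); about 98,000 more parameters than the GRU, so likely the most time to train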


Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  48896     
                                                                 
 simple_rnn (SimpleRNN)      multiple                  98560     
                                                                 
 dense (Dense)               multiple                  98174     
                                                                 
Total params: 245,630
Trainable params: 245,630
Non-trainable params: 0
_________________________________________________________________


In [29]:
gru_model.summary()

Model: "my_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     multiple                  48896     
                                                                 
 gru (GRU)                   multiple                  296448    
                                                                 
 dense_1 (Dense)             multiple                  98174     
                                                                 
Total params: 443,518
Trainable params: 443,518
Non-trainable params: 0
_________________________________________________________________


In [30]:
lstm_model.summary()

Model: "my_lstm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     multiple                  48896     
                                                                 
 lstm (LSTM)                 multiple                  394240    
                                                                 
 dense_2 (Dense)             multiple                  98174     
                                                                 
Total params: 541,310
Trainable params: 541,310
Non-trainable params: 0
_________________________________________________________________


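Vanilla RNN - Total params: 245,630 - no gates, a single set of recurrent weights; fewest parameters, so likely the least time to train<br>
GRU - Total params: 443,518 - 2 gates plus a candidate state (3 weight sets); about 198,000 more parameters than the vanilla RNN<br>
LSTM - Total params: 541,310 - 3 gates plus a candidate cell state (4 weight sets); about 98,000 more parameters than the GRU, so likely the most time to train<br>
The LSTM is the most complex of the three and can be the most expensive when producing its outputs<br>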
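The parameter counts in the summaries can be reproduced by hand. The sketch below assumes the usual layer layouts: one weight set for the vanilla RNN, three for the GRU, and four for the LSTM, where each set has an input kernel, a recurrent kernel, and a bias. The GRU formula assumes two bias vectors per weight set, which is an assumption about the Keras default but matches the 296,448 shown above. The function names are hypothetical.

```python
def simple_rnn_params(input_dim, units):
    # One weight set: input kernel, recurrent kernel, and bias.
    return units * (input_dim + units + 1)

def gru_params(input_dim, units):
    # Three weight sets: update gate, reset gate, and candidate state.
    # Assumes two bias vectors per set, which matches the summary above.
    return 3 * units * (input_dim + units + 2)

def lstm_params(input_dim, units):
    # Four weight sets: input, forget, and output gates plus the candidate cell state.
    return 4 * units * (input_dim + units + 1)

embedding_dim, rnn_units, vocab_size = 128, 256, 382   # 381 characters plus [UNK]
embedding_count = vocab_size * embedding_dim            # lookup table
dense_count = (rnn_units + 1) * vocab_size              # weights plus biases

totals = {}
for name, rnn_count in [
    ("vanilla", simple_rnn_params(embedding_dim, rnn_units)),
    ("gru", gru_params(embedding_dim, rnn_units)),
    ("lstm", lstm_params(embedding_dim, rnn_units)),
]:
    totals[name] = embedding_count + rnn_count + dense_count
    print(name, rnn_count, totals[name])
```

Only the recurrent layer differs between the three models; the embedding and dense layers contribute the same 48,896 and 98,174 parameters each time.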

# Checkpoint! - Write your name and the model size on the board

Take note of which of the three models has the greatest number of trainable parameters. Write your name on the board and next to it write the number of trainable parameters for the largest model.

## Getting predictions from the models

To get actual predictions from the model we should sample from the output nodes. The distribution of output node values for any given input is defined by the logits over the character vocabulary. The TF documentation says that it is important to _sample_ from this distribution rather than taking the _argmax_ of the distribution, because taking the argmax can easily get the model stuck in a loop.

Taking the argmax means that the text generator will make the exact same prediction every time when provided with a particular input. As a result, we could present "to be" as the input and get "to be or not to be or not to be or not to be. . ." as the output. By sampling from the output instead, we can get a variety of somewhat random (yet high probability) responses each time we present the input, so "to be" could generate the response, "to be or not to be, that is the snorgle."
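The difference between the two strategies can be seen on a single made-up set of logits. This is a minimal sketch in plain Python, not the notebook's TensorFlow code: `toy_chars` and `toy_logits` are hypothetical, and `random.choices` plays the role that `tf.random.categorical` plays in the next cell.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_chars = ["a", "b", "c", "d"]
toy_logits = [2.0, 1.5, 0.5, -1.0]   # hypothetical logits for one time step
probs = softmax(toy_logits)

# Argmax: deterministic, always the single most likely character.
argmax_picks = [toy_chars[probs.index(max(probs))] for _ in range(10)]

# Sampling: random draws in proportion to the probabilities.
rng = random.Random(42)
sampled_picks = rng.choices(toy_chars, weights=probs, k=10)

print("argmax :", "".join(argmax_picks))
print("sampled:", "".join(sampled_picks))
```

The argmax line repeats the same character every time, while the sampled line favors likely characters but can vary from draw to draw.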

Try it for the first example in the batch:

In [31]:
sampled_indices = tf.random.categorical(example_batch_predictions1[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b'lled her "third child". In letters to third parties, she vented her impatience, referring to him as a "child," a "little angel", a "sufferer" and a "beloved lit'

Next Char Predictions:
 b'\xe3\x82\xbc\xe7\xb4\xb9t\xe2\x81\x84\xc3\x96\xe6\x97\x8fo\xcf\x85\xe0\xa4\xbe\xc2\xb1\xe7\xb7\xa3\xd8\xaf:\xe0\xa4\xb0\n\xcf\x87\xcc\xafA\xca\x8a8\xe8\xb5\xb7\xe0\xa4\x95\xce\xbd!E\xe5\xb8\xab\xe5\x9c\x8b\xc3\xba\xe1\xba\xbf\xe6\xb4\x9e\xe0\xa4\xac\xce\xad\xe5\x8f\xa3V\xe5\xba\xab\xc3\xa7\xe7\xa6\x85\xc3\xba\xc3\xb3\xc3\xab\xce\xac\xe5\xaf\xb9\xe3\x83\x88\xce\xba\xe6\x9c\xb1WRN\xe5\xaf\xb9\xc5\x9b:x3\xe7\xbb\x8d\xcc\xa9#\xc3\x85\xe9\x8b\xaa\xe5\xae\x89\xc3\x85\xce\xbc\xce\xb2\xca\x81\xc4\xad\xe9\xb2\x81\xe2\x82\xac\xe7\x90\x86\xe5\x80\x99\xe8\xaa\xac-\xe2\x80\x8d0\xe5\x89\x91Ne\xc9\xaac\xe3\x80\x80\xe5\xba\xabXYK$\xce\xbd\xce\x95\xc4\xab\xe0\xa4\xa7D\xe7\xbb\xb4\xe3\x83\xb3\xe6\x9c\x9d7 \xe0\xa4\xbeO\xc3\xa1E\xc9\x90\xe8\xbb\x8d\xe6\x80\x9d\xc9\x94\xcf\x85\xe6\x81\xaf\xc5\x99\xe6\xb3\x95\x

Naturally, the predicted text is nonsense because the untrained model essentially makes random predictions.

In [35]:
#
# Exercise 10.3 - Produce next character predictions for the GRU and LSTM models. 
# All you need to change is the reference to example_batch_predictions1.
#
sampled_indices = tf.random.categorical(example_batch_predictions2[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b'lled her "third child". In letters to third parties, she vented her impatience, referring to him as a "child," a "little angel", a "sufferer" and a "beloved lit'

Next Char Predictions:
 b'\xe4\xb8\x8dH\xc2\xb1\xcf\x89\xce\xa4\xc3\xab\xce\xb3\xe7\xb4\xabN+b\xe6\xa0\xa1s4\xe5\xb7\x9dnX\xe3\x83\xab\xc9\xa1\xe3\x83\x88\xe4\xbc\x9d\xe5\xb7\x9d]\xe6\xa5\xad\xcf\x89\xc5\x9a\xc9\x99\xce\xa66\xcf\x84\xc3\xa9\xe3\x82\xb9\xe3\x82\xbb\xcf\x84\xc3\xb3\xe2\x81\x84\xc2\xa2\xe6\x97\x8f\xe5\xb8\xab\xe5\xb7\x9d\xe8\xaa\xac\xc3\xad\xe5\xa4\xa7\xe3\x83\x88\xe6\x81\xaf\xd8\xa7\xd8\xa8\xe6\x80\x9d\xc5\xb5\xc4\xab\xe8\xaa\xac\xc9\xaa\xe0\xa5\x8a\xe7\xb6\xad\xe8\xb1\x86\xe6\x8b\x89\xcf\x80N\xe7\x8c\xae\xe3\x83\xa9\xc3\x96\xce\xafy\xe7\x90\x86\xe2\x80\x9cQw\xe6\xb7\xa8\xe2\x80\x93\xe7\x90\x86\xc3\xb6\xe2\x80\x8b\xc3\xba\xe0\xa4\xaf\xc4\x90\xe0\xa4\xbf\xe0\xa4\xac\xe6\x96\xb9\xcf\x81g\xe1\xbd\x80\xe8\xbb\x8dW\xc4\xab\xe0\xa4\xb0J\xe6\xa1\x88\xc9\x94\xce\xb3\xe6\x9c\x9d\xc3\xac)\xe0\xa5\x8a:\xe4\xb8\x

In [36]:
sampled_indices = tf.random.categorical(example_batch_predictions3[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b'lled her "third child". In letters to third parties, she vented her impatience, referring to him as a "child," a "little angel", a "sufferer" and a "beloved lit'

Next Char Predictions:
 b"\xe9\x95\xb7\xe3\x82\xbb\xc2\xa7\xe1\xbc\x80\xe6\xb1\xb6\xe6\x81\xaf\xe5\xa4\x8d\xe3\x82\xbc<\xc4\xb0\xe3\x80\x8b\xc2\xb2\xe9\x93\xba7[UNK]\xc2\xa5\xe6\xa1\x88\xc3\xab\xd8\xa7\xcb\x88a\xce\xbe\xe1\xbc\xb5\xca\x8a\xc9\x9c\xe2\x80\xa6\xe9\xa8\xb7\xc3\xa7\xcf\x8c\xe1\xb8\xa5\xe0\xa4\xae\xe2\x80\xa6\xc5\xab\xe6\x9e\x97\xe2\x80\x9c\xc3\xbc\xe4\xbc\x9d\xe1\xba\xbf\xce\xac\xe8\x80\xbf\xc3\xa7\xe6\x9e\x97\xe7\xb4\xb9\xe6\xb2\x90\xe5\x9f\xa0U2\xe2\x80\xa6=\xc3\xa7\xe5\xba\xab\xe2\x81\x84\xe5\x9c\x9f\xe9\xb2\x81\xe5\x9f\xa0\xe5\x9d\xaa\xe5\x8f\xa3\xcf\x83\xe7\xa6\x85\xe0\xa4\x83\xc9\xaa\xe5\xbb\xba\xcf\x85\xe5\x9b\xbd\xc5\xa0\xcf\x8e\xe0\xa5\x81\xe0\xa5\x81\xc2\xa7\xe8\x8f\xaf\xe5\x9c\x8b!\xce\xbc\xe9\xa8\xb7\xcf\x8c\xc5\x81\xc5\x81\xe0\xa5\x81O\xe2\x80\x94\xe2\x80\x8c\xce\xad\xe7\xbb\x8d\xe5\xb8\xa5

## Train the models

At this point the problem can be treated as a standard classification problem. Given the previous RNN state and the input at this time step, predict the class of the next character.

### Attach an optimizer, and a loss function

The standard `tf.keras.losses.sparse_categorical_crossentropy` loss function works in this case because it is applied across the last dimension of the predictions.

Because your model returns logits, you need to set the `from_logits` flag.


In [32]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
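For a single time step, the quantity this loss computes from logits is the negative log of the softmax probability assigned to the correct character. The sketch below works that out in plain Python for one hypothetical set of logits; the function name is made up for illustration and is not the Keras implementation.

```python
import math

def crossentropy_from_logits(logits, target_id):
    """Negative log of the softmax probability assigned to the correct class."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target_id]

toy_logits = [2.0, 0.5, -1.0]   # hypothetical logits over a 3-character vocabulary
loss_if_target_is_0 = crossentropy_from_logits(toy_logits, target_id=0)
loss_if_target_is_2 = crossentropy_from_logits(toy_logits, target_id=2)

print(loss_if_target_is_0, loss_if_target_is_2)
```

The loss is small when the largest logit belongs to the correct character and large when the correct character received a low logit. "Sparse" refers to the targets being integer IDs rather than one-hot vectors.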

In [33]:
example_batch_loss = loss(target_example_batch, example_batch_predictions1)
mean_loss = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions1.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", mean_loss)

Prediction shape:  (64, 160, 382)  # (batch_size, sequence_length, vocab_size)
Mean loss:         5.9434166


A newly initialized model shouldn't be too sure of itself; the output logits should all have similar magnitudes. To confirm this, you can check that the exponential of the mean loss is approximately equal to the vocabulary size. A much higher loss means the model is sure of its wrong answers and is badly initialized:

In [34]:
tf.exp(mean_loss).numpy()

381.23523
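
The reasoning behind this check can be worked through in plain Python. If a model assigns the same probability, 1/382, to every character, the cross-entropy is ln(382), about 5.945, and exponentiating it recovers the vocabulary size. The sketch below assumes only the vocabulary size of 382 taken from the prediction shape printed above.

```python
import math

VOCAB_SIZE = 382  # last dimension of the prediction shape printed above

def uniform_loss(vocab_size):
    """Cross-entropy when every character gets probability 1/vocab_size."""
    return -math.log(1.0 / vocab_size)

loss_value = uniform_loss(VOCAB_SIZE)
print("Loss for a uniform guess:", round(loss_value, 4))
print("exp(loss):", round(math.exp(loss_value), 2))
```

An untrained model whose exponentiated mean loss lands close to 382 is therefore behaving like a nearly uniform guesser, which is what we want before training.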

In [37]:
#
# Exercise 10.4 - Show the exponentiated mean loss for the other two untrained models.
# Comment on whether these seem appropriate.
#
example_batch_loss = loss(target_example_batch, example_batch_predictions2)
mean_loss2 = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions2.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", mean_loss2)
tf.exp(mean_loss2).numpy()
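# exp(mean loss) is about 382.1, slightly higher than the vanilla RNN (381.2).
# Both are close to the vocabulary size of 382, as expected for an untrained model.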

Prediction shape:  (64, 160, 382)  # (batch_size, sequence_length, vocab_size)
Mean loss:         5.9457016


382.10733

In [38]:
example_batch_loss = loss(target_example_batch, example_batch_predictions3)
mean_loss3 = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions3.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", mean_loss3)
tf.exp(mean_loss3).numpy()
# exp(mean loss) is about 381.6.
# For the LSTM this falls between the vanilla RNN (381.2) and the GRU (382.1);
# all three are close to the vocabulary size of 382.

Prediction shape:  (64, 160, 382)  # (batch_size, sequence_length, vocab_size)
Mean loss:         5.9444413


381.6261

Configure the training procedure using the `tf.keras.Model.compile` method. Use `tf.keras.optimizers.Adam` with default arguments and the loss function.

In [None]:
vanilla_model.compile(optimizer='adam', loss=loss)
gru_model.compile(optimizer='adam', loss=loss)
lstm_model.compile(optimizer='adam', loss=loss)

### Configure checkpoints

Use a `tf.keras.callbacks.ModelCheckpoint` to ensure that checkpoints are saved during training:

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
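
Keras fills the `{epoch}` placeholder in `filepath` using Python string formatting. The sketch below shows how the template expands. It also illustrates one point worth noticing: when the same callback is passed to all three `fit` calls, every model writes to the same file names, so later runs appear to overwrite earlier ones. The helper `checkpoint_prefix_for` and the per-model sub-directories are hypothetical, shown only as one possible way to keep the checkpoints apart.

```python
import os

checkpoint_dir = './training_checkpoints'

def checkpoint_prefix_for(model_name):
    """Hypothetical helper: build a template with one sub-directory per model."""
    return os.path.join(checkpoint_dir, model_name, "ckpt_{epoch}")

# Shared template from the cell above: identical for every model.
shared = os.path.join(checkpoint_dir, "ckpt_{epoch}")
print(shared.format(epoch=1))

# Per-model templates: the expanded paths no longer collide.
for name in ["vanilla", "gru", "lstm"]:
    print(checkpoint_prefix_for(name).format(epoch=1))
```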

### Execute the training

Because we are focusing on comparing training time, start with just two epochs to train the model. In Colab, set the runtime to GPU for faster training. Make a prediction in advance about which model will train most quickly and why. Take note of any differences you observe in model training time and think of an explanation for them. 

In [None]:
import time
EPOCHS = 2

In [None]:
# Time each of the models on a small number of epochs

t0 = time.perf_counter() # Time epochs: Capture the start time
history1 = vanilla_model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
print("Vanilla, elapsed time:", time.perf_counter() - t0, "seconds.")
print()

t0 = time.perf_counter() # Time epochs: Capture the start time
history2 = gru_model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
print("GRU, elapsed time:", time.perf_counter() - t0, "seconds.")
print()

t0 = time.perf_counter() # Time epochs: Capture the start time
history3 = lstm_model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
print("LSTM, elapsed time:", time.perf_counter() - t0, "seconds.")
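
The three timing blocks repeat the same pattern, so they could be folded into a small helper. This is only a sketch: `timed` is a hypothetical name, and the workload used here is a stand-in so that the example runs without a model.

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print the elapsed wall-clock time, and return (result, seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    print(f"{label}, elapsed time: {elapsed:.2f} seconds.")
    return result, elapsed

# Stand-in workload so the sketch runs without a model.
total, seconds = timed("Demo", sum, range(1_000_000))
```

Applied to this lab, a call might look like `history1, t1 = timed("Vanilla", vanilla_model.fit, dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])`, which keeps the elapsed time in a variable for later comparison.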



# Discuss with Your Partner

Examine the total training time for each of the three models above. Was there anything surprising? Can you infer why certain models ran faster than others? Here's a hint: when there are a lot of weights to process in parallel, the GPU is the way to get things done fast. Which TensorFlow layers have been optimized for use with a GPU?

In [None]:
#
# Exercise 10.5 - Comment on the training time for each of the models.
# Also comment on the ending value of the loss function. From efforts in
# Week 9, a loss value below 1.4 starts to look promising in terms of 
# generating morphologically correct character sequences.
#

In [None]:
history1.history, history2.history, history3.history

In [None]:
#
# Exercise 10.6 - Compare the final loss status of each model.
# Compute the difference between the initial loss and the final loss.
#


In [None]:
#
# Exercise 10.7 - Plot the loss history for each model.
#

## Generate text

The simplest way to generate text with this model is to run a set of predictions in a loop, while keeping track of the model's internal state as it runs. Each time we call the model we pass in a slice of text and an internal state. 

The model returns a prediction for the next character as well as its new state. Pass the prediction and state back in to continue generating text. The class defined below accomplishes one step in this chain of model runs. When the `generate_one_step()` method is called, it makes a single-step prediction. Note the `temperature` argument on the class initializer.


In [None]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=0.3):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    # The model also returns its internal state so that we can use that
    # the next time around the loop.
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
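
The effect of the `temperature` argument can be seen with a small calculation in plain Python. Dividing the logits by a temperature below 1 widens the gaps between them, so the softmax puts more probability on the top-scoring character and sampling becomes less random. The logits here are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Scale logits the same way generate_one_step does before sampling."""
    return [z / temperature for z in logits]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate characters

for temperature in (1.0, 0.3):
    probs = softmax(apply_temperature(logits, temperature))
    print(temperature, [round(p, 3) for p in probs])
```

At the default of 0.3 used in this class, the most likely character dominates much more strongly than at a temperature of 1.0.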

In [None]:
class OneStepLSTM(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=0.3):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states1=None, states2=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    # The model also returns its internal state so that we can use that
    # the next time around the loop.
    if states1 is None:
      predicted_logits, states1, states2 = self.model(inputs=input_ids, states=None, return_state=True)
    else:
      predicted_logits, states1, states2 = self.model(inputs=input_ids, states=[states1, states2], return_state=True)
    
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states1, states2
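
Both classes use the same prediction mask to keep "[UNK]" from being sampled. The idea can be sketched in plain Python: adding negative infinity to a logit drives its softmax probability to exactly zero, while the remaining probabilities are renormalized. The function name `masked_softmax` and the logits are assumptions made for this illustration.

```python
import math

def masked_softmax(logits, banned_ids):
    """Softmax after adding -inf to the banned positions."""
    mask = [-math.inf if i in banned_ids else 0.0 for i in range(len(logits))]
    masked = [z + m for z, m in zip(logits, mask)]
    shift = max(masked)  # assumes at least one position is not banned
    exps = [math.exp(z - shift) for z in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 1.0, 0.5]  # index 0 stands in for "[UNK]"
probs = masked_softmax(logits, banned_ids={0})
print(probs)
```

Even though the banned position has the largest raw score, it can never be drawn after masking.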

In [None]:
# Now we can instantiate the OneStep class for the vanilla model. Take note
# of the arguments we are passing in. What do the last two arguments do?
one_step_model1 = OneStep(vanilla_model, chars_from_ids, ids_from_chars)

Run it in a loop to generate some text. Looking at the generated text, you'll see that the model knows when to capitalize, how to form sentence-like units, and how to imitate the vocabulary seen in the training data. With the small number of training epochs, it has not yet learned to form coherent sentences.

In [None]:
start = time.time()
states = None
next_char = tf.constant(['When did Beyoncé start to get famous?'])
result = [next_char]

for n in range(160):
  next_char, states = one_step_model1.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

In [None]:
# Now we can instantiate the OneStep class for the GRU model. Take note
# of the arguments we are passing in. What do the last two arguments do?
one_step_model2 = OneStep(gru_model, chars_from_ids, ids_from_chars)

In [None]:
start = time.time()
states = None
next_char = tf.constant(['When did Beyoncé start to get famous?'])
result = [next_char]

for n in range(160):
  next_char, states = one_step_model2.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

In [None]:
# Now we can instantiate the OneStepLSTM class for the LSTM model. Take note
# of the arguments we are passing in. What do the last two arguments do?
one_step_model3 = OneStepLSTM(lstm_model, chars_from_ids, ids_from_chars)

In [None]:
start = time.time()
states1 = None
states2 = None
next_char = tf.constant(['When did Beyoncé start to get famous?'])
result = [next_char]

for n in range(160):
  next_char, states1, states2 = one_step_model3.generate_one_step(next_char, states1=states1, states2=states2)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

Compare the processing time of the three generation runs above. Which model generates inferences most quickly? Which is the slowest? Can you explain why?

If you want the model to generate text faster, the easiest thing you can do is batch the text generation. In the example below the model generates three texts in about the same time it took to generate just one above.

In [None]:
start = time.time()
states = None
next_char = tf.constant(['When did Beyoncé start to get famous?', 'Where is the Amazon?', 'Who won the world cup in 2019?'])
result = [next_char]

for n in range(160):
  next_char, states = one_step_model1.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result, '\n\n' + '_'*80)
print('\nRun time:', end - start)
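
In the batched loop, `result` is a list with one entry per generation step, and each entry holds one piece of text per prompt. `tf.strings.join` then concatenates those entries elementwise, giving one finished string per prompt. The plain-Python analogue below illustrates that bookkeeping; the function name `join_steps` and the sample characters are made up.

```python
def join_steps(steps):
    """Concatenate per-step batches elementwise, giving one string per prompt."""
    return ["".join(parts) for parts in zip(*steps)]

# Each inner list is one generation step: one piece of text per prompt.
steps = [
    ["Hello", "Good"],  # the prompts
    [" ", "b"],         # characters from step 1 (made up)
    ["w", "y"],         # characters from step 2 (made up)
]
print(join_steps(steps))
```

Because every step handles all prompts at once, adding prompts increases the work per step but not the number of steps, which is why batching costs little extra time.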

In [None]:
#
# Exercise 10.8 - Add timed loops for doing multiple predictions (as shown
# above) for GRU and LSTM models.
#


**Improving the LSTM and GRU Models**

Focus on improving the GRU and the LSTM. In particular, train more epochs for each and compare the progress of the loss function.

There are several additional strategies that might improve the performance of the models. Try them in the following order:

* Change the temperature in the initialization of the OneStep class to adjust the randomness in generation of new character sequences (note that the default value is already set quite low)
* Increase the number of nodes in the LSTM and GRU layers to give the model more "intelligence"
* Add a new dense layer after the LSTM or GRU layer to improve the model's ability to process the output of the RNN

Try at least one of these techniques during the remaining time in the lab. For the moment we don't have a way of documenting model quality other than the final loss value and your own read of the generated text to see whether it is creating real words and sensible sentences. But those are reasonable criteria for now. Make sure to add comments documenting what you find out.

In [None]:
#
# Exercise 10.9 - Plot the loss history for a few additional GRU training epochs.
#



In [None]:
#
# Exercise 10.10 - Conduct additional predictions using a prompt to the GRU model.
#


In [None]:
#
# Exercise 10.11 - Plot the loss history for a few additional LSTM training epochs.
#


In [None]:
#
# Exercise 10.12 - Conduct additional predictions using a prompt to the LSTM model.
#


In [None]:
#
# Exercise 10.13 - Try one additional technique for improving the GRU.
#


In [None]:
#
# Exercise 10.14 - Use the same improvement technique on the LSTM.
#
