# MAP 631 Lab 5: Text Generation using Recurrent Neural Networks
## Solution

J.B. Scoggins

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jbscoggi/teaching/blob/master/Polytechnique/MAP631/rnn.ipynb) 

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/jbscoggi/teaching/master?filepath=Polytechnique%2FMAP631%2Frnn.ipynb)

## Introduction

Recurrent neural networks (RNNs) have emerged as powerful predictive and generative models for a range of applications.  For example, take a look at the excellent blog post by Andrej Karpathy on [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).  In this lab assignment, your task is to build a generative, character-by-character RNN that predicts the next character from a given sequence.  The classic work *The Odyssey* by Homer will serve as the corpus for your model.  An example of the type of text that you can generate in this lab is shown below.  The seed text is an initial sequence of characters randomly sampled from the corpus.  Following the seed, you can see that the model has predicted a fairly realistic text sequence in the style of *The Odyssey*, including realistic line breaks and punctuation.

<b>Seed text</b>
```
h ulysses for having
blinded an eye of p
```

<b>Prediction of next 500 characters</b>
```
olypels end she darte yod mentered saw he would polden ewall by sur; for her got him eather, and he would send
them flying out of the hould not save his
men, for they perished through their own sheer folly in eating the
cattle of the sun-god hyperion; so the god prevented them from ever
reaching home. tell me, too, about all these things, oh daughter of
jove, from whatsoever source you may know them.

so now all who escaped death in battle or by shipwreck had got safely
home except ulysses, and 
```

### RNN Models in Keras

As in the CNN lab, we will use the TensorFlow Keras module to implement a simple RNN model.  Specifically, we will make use of the [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) and [GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) layers.
Before proceeding, it may be useful to familiarize yourself with the [Keras API for RNNs](https://www.tensorflow.org/guide/keras/rnn).  In particular, understanding what input shape each RNN layer expects will be crucial.

### Embeddings

Embeddings are a topic you will see later in the course, but we will use them in this lab to make model development easier.  In essence, an embedding encodes an integer index as a vector of some size.  You may think of this as a generalization of a one-hot encoding.  For example, consider the characters in the word "hello".  If our vocabulary consisted only of the characters in this word (namely "h", "e", "l", and "o"), we could generate a one-hot encoding where each character is represented by a vector of size 4, with a 1 in the element corresponding to the letter and zeros everywhere else.  For our 4-character vocabulary, this could look like the following:

| char / index | encoding   |
|--------------|------------|
|h / 0         | 1 0 0 0    |
|e / 1         | 0 1 0 0    |
|l / 2         | 0 0 1 0    |
|o / 3         | 0 0 0 1    |

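The one-hot table above can be reproduced in a few lines of NumPy.  This is only a minimal sketch for the 4-character "hello" vocabulary; the names `char_to_index` and `one_hot` are illustrative and are not used elsewhere in the lab.

```python
import numpy as np

# Vocabulary of the word "hello", in order of first appearance
vocab = ['h', 'e', 'l', 'o']
char_to_index = {c: i for i, c in enumerate(vocab)}

# Row i of the identity matrix is the one-hot vector for index i
one_hot = np.eye(len(vocab), dtype=int)

# Encode each character of "hello" as a one-hot row
indices = [char_to_index[c] for c in "hello"]
encoded = one_hot[indices]

print(indices)        # [0, 1, 2, 2, 3]
print(encoded.shape)  # (5, 4)
```

Each row of `encoded` contains a single 1, so its width grows with the vocabulary, which is the first two limitations discussed next.
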
While one-hot encodings are useful for many applications, they have a number of limitations.  For example,

- The encoding matrix is extremely sparse
- The size of the encoding depends on the size of the vocabulary or the number of categories being encoded
- There is no notion of similarity between the entities being encoded

An embedding solves these problems by using a learnable matrix of size (size of vocabulary) × (size of encoding) in place of the fixed and sparse one-hot encoding matrix.  Continuing with our previous example, an embedding for the characters in "hello" might take the form

| char / index | encoding    |
|--------------|-------------|
|h / 0         | 1.2 0.3 4.3 |
|e / 1         | 0.1 1.5 7.8 |
|l / 2         | 0.5 3.2 1.9 |
|o / 3         | 3.6 7.2 5.8 |

Note that here we have chosen an encoding represented by a vector of size 3, which is smaller than the size of the vocabulary.  This is called "embedding" our vocabulary in a 3-dimensional space, and it is extremely useful when building encodings for very large vocabularies.  In addition, the coefficients of the encoding matrix are learned during training, allowing "similar" entities (characters, in this case) to be grouped close together in the encoding space.  Further, we can visualize the embedding space through a number of techniques to help us understand how our vocabulary is encoded.  We will not do that here, but you can find a number of examples online if you are interested.

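To make the connection with one-hot encodings concrete, the sketch below stores the example matrix from the table above as a fixed NumPy array.  In a real model these values would be learned by the Keras `Embedding` layer; here they are hard-coded purely for illustration.  An embedding lookup is simply row selection, which gives the same result as multiplying a one-hot vector by the matrix.

```python
import numpy as np

# Example embedding matrix from the table: one row per character (h, e, l, o)
embedding = np.array([
    [1.2, 0.3, 4.3],  # h / 0
    [0.1, 1.5, 7.8],  # e / 1
    [0.5, 3.2, 1.9],  # l / 2
    [3.6, 7.2, 5.8],  # o / 3
])

indices = np.array([0, 1, 2, 2, 3])        # "hello" as vocabulary indices

# Lookup: select one row of the matrix per index
looked_up = embedding[indices]             # shape (5, 3)

# Equivalent: one-hot vectors (5, 4) times the embedding matrix (4, 3)
one_hot = np.eye(len(embedding))[indices]
via_matmul = one_hot @ embedding

print(looked_up.shape)                     # (5, 3)
print(np.allclose(looked_up, via_matmul))  # True
```
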
### Grading

This lab will be **optionally** graded.  If you want it graded, please submit the following files to Moodle **before midnight on Oct. 13**:

- completed Jupyter notebook
- saved model parameters file, needed to run your model (I will not retrain your model)

*Submissions after the due date will not be graded.*  Note that your code must successfully load the model parameters and run one epoch of training, showing the accuracy during that epoch.  Your grade will be based on the completeness of the tasks (your code) and the reported accuracy of your model.

## Getting started

Import packages and initialize various global variables. You may want to change these later.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Global constants
window_size = 40 # Length of character sequences
batch_size = 32  # Batch size for learning
rnn_units = 128 # Number of hidden units in the LSTM or GRU cell
epochs     = 100 # Number of training epochs

text_file = 'data/odyssey.txt'

## Task 1: Preprocess text data

Fill in the function below to read in the text file given by `text_file` and perform any preprocessing that you feel is necessary.  For example, convert the text to lower case in order to reduce the size of the vocabulary.  Other examples of preprocessing include replacing accented characters with unaccented ones, removing "unnecessary" punctuation, etc.

In [None]:
def preprocess_file(file_name):
    """Read a text file, perform preprocessing, and return text as a string.
    
    Parameters
    ----------
    file_name : str
        Name of text file to load.
    
    Returns
    -------
    text : str
        Preprocessed text from the file.
    """

    # TODO: Read in file and convert to lower case
    text = 

    # Optional: perform additional processing

    return text

# Load and prepare data
text = preprocess_file(text_file)
text = text[:10000] # Shorten text for testing
print(text[:500])

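One possible way to fill in `preprocess_file` is sketched below.  It lower-cases the text and, as an optional extra step, strips accents using the standard `unicodedata` module.  The UTF-8 encoding is an assumption about the corpus file, and the accent handling is only one of many reasonable preprocessing choices.

```python
import unicodedata

def preprocess_file(file_name):
    """Read a text file, lower-case it, and strip accents (sketch)."""
    # Assumption: the corpus is stored as UTF-8 text
    with open(file_name, encoding='utf-8') as f:
        text = f.read().lower()

    # Optional: decompose accented characters and drop the combining marks,
    # so that an accented 'e' becomes a plain 'e'
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))

    return text
```

Line breaks and punctuation are left untouched here, since the example output in the introduction suggests that the model can learn them.
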
## Task 2: Generate a dataset for training

Fill in the function below, which takes the document text and a "window" size and returns the training data as well as a list of unique characters representing the vocabulary of the document.  The training data consists of two lists.  The first is a list of lists of integers (indices into the vocabulary list), each corresponding to a sequence of `window_size` characters found in the document.  The second is a list of integers (indices into the vocabulary list) giving, for each sequence in the first list, the character that follows it.

In [None]:
def make_dataset(text, window_size=40):
    """Create the dataset used to train the RNN.
    
    Parameters
    ----------
    text : str
        String representing text to learn on.
    window_size : int
        Length of character sequence used to predict next character.
    
    Returns
    -------
    x_data : list(list(int))
        List of sequences of size window_size, containing indices into vocab.
        Each sequence represents a sequence of window_size characters found in
        the text.  The number of sequences generated will be len(text) - window_size.
    y_data : list(int)
        List of indices corresponding to the characters that follow the
        sequences in x_data.
    vocab : list(char)
        List of characters making up the vocabulary of the text.
    """
    # TODO: Determine list of unique characters
    vocab = 
    
    x_data = []
    y_data = []

    # TODO: Generate training data
    
    
    return x_data, y_data, vocab

x_data, y_data, vocab = make_dataset(text, window_size=window_size)

# Check if everything is working
print(vocab)
print(len(vocab))
print(x_data[0])
print(y_data[0])

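The sliding-window construction described in this task can be sketched as follows.  This is one possible implementation, not the only valid one: sorting the vocabulary is a choice made here so that the indices are reproducible, and the `*_demo` names and the "hello world" input are purely illustrative.

```python
def make_dataset(text, window_size=40):
    """Build (x_data, y_data, vocab) for next-character prediction (sketch)."""
    # Sorted list of unique characters, so that indices are reproducible
    vocab = sorted(set(text))
    char_to_index = {c: i for i, c in enumerate(vocab)}

    # Encode the whole text once as a list of vocabulary indices
    encoded = [char_to_index[c] for c in text]

    x_data = []
    y_data = []
    # Slide a window over the text: each window predicts the character after it
    for start in range(len(encoded) - window_size):
        x_data.append(encoded[start:start + window_size])
        y_data.append(encoded[start + window_size])

    return x_data, y_data, vocab

x_demo, y_demo, vocab_demo = make_dataset("hello world", window_size=4)
print(vocab_demo)            # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(x_demo[0], y_demo[0])  # [3, 2, 4, 4] 5
```

The first window is "hell" and its target is "o".  Before training, the lists can be converted with `np.array(x_data)` and `np.array(y_data)`.
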
## Task 3: Create the RNN model

Fill in the function below which builds and returns the RNN model, for the given size parameters and RNN layer.  The model should take as input a tensor representing batches of character index sequences and output a tensor representing the probabilities of each character in the vocabulary coming next in the sequence, for each sequence in the batch.  Use the following architecture.  

- Sequential Keras model
  - [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) with input and output dimensions equal to the vocab size (you can try using smaller encodings later)
  - [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) with `num_units` hidden units
  - [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) output, with softmax
- `sparse_categorical_crossentropy` loss
- Adam optimizer
- Metrics: accuracy

In [None]:
def rnn_model(num_units, window_size, vocab_size, rnn_layer=layers.LSTM):
    """Creates the RNN model.
    
    Parameters
    ----------
    num_units : int
        Number of hidden units in the RNN layer.
    window_size : int
        Number of characters in an input sequence.
    vocab_size : int
        Number of unique characters in the vocabulary.
    rnn_layer : Keras RNN layer class (SimpleRNN, LSTM, GRU)
        Layer class used to build the recurrent layer.
    
    Returns
    -------
    model : Keras model
        RNN model.
    """
    
    # TODO: Build the model
    model = 
    
    # TODO: Compile the model

    return model

model = rnn_model(rnn_units, window_size, len(vocab))
model.summary()

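A minimal sketch of the architecture listed above is given below, assuming a TensorFlow 2 installation.  Declaring the input shape with `tf.keras.Input` is a choice made here; the `demo_*` names and the small sizes are illustrative only.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def rnn_model(num_units, window_size, vocab_size, rnn_layer=layers.LSTM):
    """Embedding -> RNN -> softmax model for next-character prediction (sketch)."""
    model = tf.keras.Sequential([
        # Input: batches of integer index sequences, shape (batch, window_size)
        tf.keras.Input(shape=(window_size,)),
        # Map each index to a learnable vector of size vocab_size
        layers.Embedding(input_dim=vocab_size, output_dim=vocab_size),
        # The RNN layer returns only its final output, shape (batch, num_units)
        rnn_layer(num_units),
        # Probability of each vocabulary character coming next
        layers.Dense(vocab_size, activation='softmax'),
    ])
    # Targets are integer indices, so the sparse cross-entropy is used
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

demo_model = rnn_model(num_units=16, window_size=5, vocab_size=8)
demo_probs = demo_model.predict(np.zeros((2, 5), dtype='int32'), verbose=0)
print(demo_probs.shape)  # (2, 8)
```

Each row of `demo_probs` is a probability distribution over the vocabulary, one per input sequence in the batch.
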
## Task 4: Train and evaluate

Fill in the code below to train the RNN model.  After every 3 epochs, generate 500 characters of text from a random seed sequence to gauge how well the model is doing.  This can be done by using the seed to predict the next character in the sequence (take the maximum likelihood character).  Append this new character onto the sequence (dropping the first character to maintain the window size) and repeat.  Print each new character as you go to generate the text.

In [None]:
# Train the model
for epoch in range(0, epochs, 3):
    # TODO: Fit model for 3 epochs

    # TODO: Generate text
    

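The generation procedure described above (predict, take the most likely character, slide the window, repeat) can be sketched independently of Keras.  The helper `generate_text` and its `predict_fn` argument are hypothetical names introduced here, and `toy_predict` is a stand-in for a trained model so that the example runs on its own.

```python
import numpy as np

def generate_text(predict_fn, seed, vocab, n_chars=500):
    """Greedily extend `seed` (a list of vocab indices) by n_chars characters.

    predict_fn maps an integer array of shape (1, window_size) to an array of
    next-character probabilities with shape (1, len(vocab)).
    """
    window = list(seed)
    generated = []
    for _ in range(n_chars):
        probs = predict_fn(np.array([window]))[0]
        next_index = int(np.argmax(probs))   # most likely next character
        generated.append(vocab[next_index])
        window = window[1:] + [next_index]   # drop first index, append new one
    return ''.join(generated)

# Stand-in for a trained model: always predicts the index after the last one
def toy_predict(x):
    probs = np.zeros((1, 3))
    probs[0, (x[0, -1] + 1) % 3] = 1.0
    return probs

print(generate_text(toy_predict, seed=[0, 1], vocab=['a', 'b', 'c'], n_chars=5))
# cabca
```

With the Keras model from Task 3, one would first train with something like `model.fit(np.array(x_data), np.array(y_data), batch_size=batch_size, epochs=3)` and then pass `lambda x: model.predict(x, verbose=0)` as `predict_fn`, using a randomly chosen entry of `x_data` as the seed.
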
## Task 5: Repeat your experiment using a GRU layer

Repeat tasks 3 and 4 with the GRU layer in place of the LSTM.  Do you notice any differences in the performance, training, or text generation?

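One difference you can check directly is the number of trainable parameters.  The sketch below uses the standard parameter-count formulas, assuming the TensorFlow 2 Keras defaults (in particular `reset_after=True` for the GRU); the helper names and the input size of 40 are illustrative assumptions, and the results can be compared against `model.summary()`.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and one bias vector
    return 4 * (input_dim * units + units * units + units)

def gru_params(input_dim, units):
    # 3 gates; with reset_after=True each gate has two bias vectors
    return 3 * (input_dim * units + units * units + 2 * units)

# Hypothetical sizes: 40-dimensional embedding output, 128 hidden units
print(lstm_params(input_dim=40, units=128))  # 86528
print(gru_params(input_dim=40, units=128))   # 65280
```

Under these assumptions the GRU layer has roughly three quarters as many parameters as the LSTM layer of the same width.
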
## If you have time...

This lab was a small taste of the power of RNN models.  Here are some other things you can try if you want to go further with the time you have left.

- Sample the output distribution from the model to generate the next character in the sequence (instead of taking the most probable).  This will add some more randomness to your text generation.
- Build a vocabulary of words, rather than characters.  This will highlight the importance of the embedding layer (you will need to use a smaller output dimension for the embedding than the vocabulary size).
- Try other media types (e.g., sound, video, guitar tabs...)
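
For the first suggestion, sampling from the output distribution might look like the sketch below.  The `temperature` parameter is an addition not mentioned in the lab: with `temperature=1.0` this is plain sampling from the model output, and values close to zero approach the greedy choice.  The function name `sample_index` is hypothetical.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw a character index from `probs`, reshaped by a temperature."""
    rng = np.random.default_rng() if rng is None else rng
    # Work in log space; the small constant avoids log(0)
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    # Softmax, shifted by the maximum for numerical stability
    scaled = np.exp(logits - logits.max())
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

probs = [0.7, 0.2, 0.1]
rng = np.random.default_rng(0)
samples = [sample_index(probs, temperature=1.0, rng=rng) for _ in range(1000)]
counts = np.bincount(samples, minlength=3)
print(counts / 1000)  # empirical frequencies, to compare with probs
```

In the generation loop, this would replace the step that takes the most probable character.
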