<table class="tfo-notebook-buttons" align="left">
  <td>
    <a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/deeplearning.ai/nlp/c3_w2_assignment_deep_n_grams.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>    
  </td>
  <td>
    <a href="https://github.com/martin-fabbri/colab-notebooks/blob/master/deeplearning.ai/nlp/c3_w2_assignment_deep_n_grams.ipynb" target="_parent"><img src="https://raw.githubusercontent.com/martin-fabbri/colab-notebooks/master/assets/github.svg" alt="View On Github"/></a>  </td>
</table>

# Assignment 2:  Deep N-grams

Welcome to the second assignment of Course 3. In this assignment you will explore recurrent neural networks (`RNN`s).
- You will use the fundamentals of Google's [trax](https://github.com/google/trax) package to implement a deep learning model.

By completing this assignment, you will learn how to:
- Convert a line of text into a tensor
- Create an iterator to feed data to the model
- Define a GRU model using `trax`
- Train the model using `trax`
- Evaluate your model using perplexity
- Predict using your own model


## Outline

- [Overview](#0)
- [Part 1: Importing the Data](#1)
    - [1.1 Loading in the data](#1.1)
    - [1.2 Convert a line to tensor](#1.2)
        - [Exercise 01](#ex01)
    - [1.3 Batch generator](#1.3)
        - [Exercise 02](#ex02)
    - [1.4 Repeating Batch generator](#1.4)        
- [Part 2: Defining the GRU model](#2)
    - [Exercise 03](#ex03)
- [Part 3: Training](#3)
    - [3.1 Training the Model](#3.1)
        - [Exercise 04](#ex04)
- [Part 4:  Evaluation](#4)
    - [4.1 Evaluating using the deep nets](#4.1)
        - [Exercise 05](#ex05)
- [Part 5: Generating the language with your own model](#5)    
- [Summary](#6)

<a name='0'></a>
### Overview

Your task will be to predict the next set of characters using the previous characters. 
- Although this task sounds simple, it is pretty useful.
- You will start by converting a line of text into a tensor
- Then you will create a generator to feed data into the model
- You will train a neural network in order to predict the new set of characters of defined length. 
- You will use embeddings for each character and feed them as inputs to your model. 
    - Many natural language tasks rely on using embeddings for predictions. 
- Your model will convert each character to its embedding, run the embeddings through a Gated Recurrent Unit `GRU`, and pass the GRU output through a linear layer to predict the next set of characters.

<img src="https://github.com/martin-fabbri/colab-notebooks/raw/master/deeplearning.ai/nlp/images/model_deep_n_grams.png" width="600px"/>

The figure above gives you a summary of what you are about to implement. 
- You will get the embeddings;
- Stack the embeddings on top of each other;
- Run them through two layers with a relu activation in the middle;
- Finally, you will compute the softmax. 

To predict the next character:
- Use the softmax output and identify the character with the highest probability.
- The character with the highest probability is the prediction for the next character.
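
As a minimal sketch of this prediction step, the snippet below applies a softmax to a list of scores and picks the most probable entry. The vocabulary and the scores are made-up values for illustration only; they are not output from the model built in this assignment.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores, for illustration only.
vocab = ['a', 'b', 'c', ' ']
logits = [0.5, 2.0, -1.0, 0.1]

probs = softmax(logits)

# Greedy choice: index of the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
next_char = vocab[best]
print(next_char)  # b
```

If the indices are Unicode code points, as in the encoding used later in this notebook, `chr(best)` would recover the character directly instead of a lookup in `vocab`.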

In [1]:
%%capture
!pip install trax==1.3.1

In [2]:
%%capture
!wget https://github.com/martin-fabbri/colab-notebooks/raw/master/deeplearning.ai/nlp/datasets/shakespeare-data.tar.gz
!tar -xf shakespeare-data.tar.gz

In [4]:
%%capture
import os
import pickle
import random as rnd

import numpy  # plain NumPy; `np` is bound to trax.fastmath.numpy below
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import trax
import trax.fastmath.numpy as np
from trax import fastmath
from trax import layers as tl


# set random seed
trax.supervised.trainer_lib.init_random_number_generators(32)
rnd.seed(32)

In [5]:
!pip list | grep 'trax\|jax\|numpy\|torch'

jax                           0.2.7                
jaxlib                        0.1.57+cuda101       
numpy                         1.19.5               
torch                         1.7.0+cu101          
torchsummary                  1.5.1                
torchtext                     0.3.1                
torchvision                   0.8.1+cu101          
trax                          1.3.1                


In [6]:
def get_batch(source, i):
    '''
        returns a batch
    '''
    bptt = 35
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    
    return data, target


def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data


# to detach the hidden state from the graph.
def detach(hidden):
    """
    This function detaches every single tensor. 
    """
    if isinstance(hidden, torch.Tensor):
        return hidden.detach()
    else:
        return tuple(detach(v) for v in hidden)

<a name='1'></a>
# Part 1: Importing the Data

<a name='1.1'></a>
### 1.1 Loading in the data

<img src = "https://github.com/martin-fabbri/colab-notebooks/raw/master/deeplearning.ai/nlp/images/shakespeare.png" width="250px"/>

Now import the dataset and do some processing. 
- The dataset has one sentence per line.
- You will be doing character generation, so you have to process each sentence by converting each **character** (and not word) to a number. 
- You will use the `ord` function to convert a unique character to a unique integer ID. 
- Store each line in a list.
- Create a data generator that takes in the `batch_size` and the `max_length`. 
    - The `max_length` corresponds to the maximum length of the sentence.

In [7]:
dirname = 'data/'
lines = [] # storing all the lines in a variable. 
for filename in os.listdir(dirname):
    with open(os.path.join(dirname, filename)) as files:
        for line in files:
            # remove leading and trailing whitespace
            pure_line = line.strip()
            
            # if pure_line is not the empty string,
            if pure_line:
                # append it to the list
                lines.append(pure_line)

In [8]:
n_lines = len(lines)
print(f"Number of lines: {n_lines}")
print(f"Sample line at position 0 {lines[0]}")
print(f"Sample line at position 999 {lines[999]}")

Number of lines: 125097
Sample line at position 0 2 KING HENRY IV
Sample line at position 999 Page	A' calls me e'en now, my lord, through a red


Notice that the letters are both uppercase and lowercase.  In order to reduce the complexity of the task, we will convert all characters to lowercase.  This way, the model only needs to predict the likelihood that a letter is 'a' and not decide between uppercase 'A' and lowercase 'a'.
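
To see the effect on the set of symbols the model has to distinguish, the sketch below counts distinct characters before and after lowercasing. It uses only the two sample lines printed above, and `unique_chars` is a helper introduced here for illustration.

```python
# The two sample lines shown above, copied here so the snippet is self-contained.
sample_lines = [
    "2 KING HENRY IV",
    "Page\tA' calls me e'en now, my lord, through a red",
]

def unique_chars(text_lines):
    """Return the set of distinct characters across all lines."""
    return set("".join(text_lines))

before = unique_chars(sample_lines)
after = unique_chars([line.lower() for line in sample_lines])

# Lowercasing merges pairs such as 'A' and 'a', so the set shrinks.
print(len(before), len(after))
```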

In [9]:
# go through each line
for i, line in enumerate(lines):
    # convert to all lowercase
    lines[i] = line.lower()

print(f"Number of lines: {n_lines}")
print(f"Sample line at position 0 {lines[0]}")
print(f"Sample line at position 999 {lines[999]}")

Number of lines: 125097
Sample line at position 0 2 king henry iv
Sample line at position 999 page	a' calls me e'en now, my lord, through a red


In [10]:
eval_lines = lines[-1000:] # Create a holdout validation set
lines = lines[:-1000] # Leave the rest for training

print(f"Number of lines for training:   {len(lines):,}")
print(f"Number of lines for validation:   {len(eval_lines):,}")

Number of lines for training:   124,097
Number of lines for validation:   1,000


<a name='1.2'></a>
### 1.2 Convert a line to tensor

Now that you have your list of lines, you will convert each character in that list to a number. You can use Python's `ord` function to do it. 

Given a string representing one Unicode character, the `ord` function returns an integer representing the Unicode code point of that character.

In [11]:
# View the unique unicode integer associated with each character
print(f"ord('a'): {ord('a')}")
print(f"ord('b'): {ord('b')}")
print(f"ord('c'): {ord('c')}")
print(f"ord(' '): {ord(' ')}")
print(f"ord('x'): {ord('x')}")
print(f"ord('y'): {ord('y')}")
print(f"ord('z'): {ord('z')}")
print(f"ord('1'): {ord('1')}")
print(f"ord('2'): {ord('2')}")
print(f"ord('3'): {ord('3')}")

ord('a'): 97
ord('b'): 98
ord('c'): 99
ord(' '): 32
ord('x'): 120
ord('y'): 121
ord('z'): 122
ord('1'): 49
ord('2'): 50
ord('3'): 51


<a name='ex01'></a>
### Exercise 01

**Instructions:** Write a function that takes in a single line and transforms each character into its Unicode integer. This returns a list of integers, which we'll refer to as a tensor.
- Use a special integer to represent the end of the sentence (the end of the line).
- This will be the EOS_int (end of sentence integer) parameter of the function.
- Include the EOS_int as the last integer of the returned list.
- For this exercise, you will use the number `1` to represent the end of a sentence.

In [12]:
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: line_to_tensor
def line_to_tensor(line, EOS_int=1):
    """Turns a line of text into a tensor

    Args:
        line (str): A single line of text.
        EOS_int (int, optional): End-of-sentence integer. Defaults to 1.

    Returns:
        list: a list of integers (unicode values) for the characters in the `line`.
    """
    
    # Initialize the tensor as an empty list
    tensor = []
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    # for each character:
    for c in line:
        
        # convert to unicode int
        c_int = ord(c)
        
        # append the unicode integer to the tensor list
        tensor.append(c_int)
    
    # include the end-of-sentence integer
    tensor.append(EOS_int)
    ### END CODE HERE ###

    return tensor

In [13]:
# Testing your output
line_to_tensor('abc xyz')

[97, 98, 99, 32, 120, 121, 122, 1]

##### Expected Output
```CPP
[97, 98, 99, 32, 120, 121, 122, 1]
```
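
For checking results by eye, the conversion can be reversed with `chr`. The helper below, `tensor_to_line`, is not part of the assignment; it is a small sketch that assumes the same `EOS_int=1` convention and stops decoding at the first end-of-sentence integer.

```python
def tensor_to_line(tensor, EOS_int=1):
    """Decode a list of Unicode integers back to text, stopping at EOS_int."""
    chars = []
    for c_int in tensor:
        if c_int == EOS_int:
            break
        chars.append(chr(c_int))
    return "".join(chars)

decoded = tensor_to_line([97, 98, 99, 32, 120, 121, 122, 1])
print(decoded)  # abc xyz
```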

<a name='1.3'></a>
### 1.3 Batch generator 

In Natural Language Processing, and in machine learning generally, models are usually trained on batches of data. Here, you will build a data generator that takes in the text lines and yields a batch of lines (lines are sentences).
- The generator converts text lines (sentences) into numpy arrays of integers padded with zeros so that all arrays have the same length, `max_length`. Lines too long to fit within `max_length` are skipped.

Once you create the generator, you can iterate on it like this:

```
next(data_generator)
```

The generator returns data in a format that you can feed directly to your model in the forward pass. Each call yields a batch of lines and a per-token mask as a tuple of three parts: inputs, targets, and mask. The inputs and targets are identical; the targets (the second element) are used to evaluate your predictions. The mask is 1 for non-padding tokens and 0 for padding.
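
The padding and mask can be illustrated on a toy batch in plain Python. The helper `pad_and_mask` and the loss values below are made up for illustration. Using the mask to exclude padding from an averaged per-token loss is one common use, assumed here rather than taken from this notebook.

```python
def pad_and_mask(tensors, max_length, pad_id=0):
    """Pad each list of ints to max_length and build a matching 0/1 mask."""
    batch, mask = [], []
    for t in tensors:
        padded = t + [pad_id] * (max_length - len(t))
        batch.append(padded)
        mask.append([0 if x == pad_id else 1 for x in padded])
    return batch, mask

# Two already-encoded lines ("ab" and "a"), each ending with EOS_int = 1.
toy = [[97, 98, 1], [97, 1]]
batch, mask = pad_and_mask(toy, max_length=4)
inputs, targets = batch, batch  # identical, as described above

# Made-up per-token losses; values at padded positions should not count.
per_token_loss = [[0.5, 0.25, 0.25, 9.0], [1.0, 1.0, 9.0, 9.0]]
masked_sum = sum(
    loss * m
    for loss_row, mask_row in zip(per_token_loss, mask)
    for loss, m in zip(loss_row, mask_row)
)
n_tokens = sum(sum(row) for row in mask)
mean_loss = masked_sum / n_tokens
print(batch, mask, mean_loss)
```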

<a name='ex02'></a>
### Exercise 02
**Instructions:** Implement the data generator below. Here are some things you will need. 

- While True loop: this will yield one batch at a time.
- if index >= num_lines, set index to 0. 
- The generator should return shuffled batches of data. To achieve this without modifying the actual lines, a list containing the indexes of `data_lines` is created. This list can be shuffled and used to get random batches every time the index is reset.
- if len(line) < max_length append line to cur_batch.
    - Note that a line that has length equal to max_length should not be appended to the batch. 
    - This is because when converting the characters into a tensor of integers, an additional end of sentence token id will be added.  
    - So if max_length is 5, and a line has 4 characters, the tensor representing those 4 characters plus the end of sentence character will be of length 5, which is the max length.
- if len(cur_batch) == batch_size, go over every line in the batch, convert it to a tensor of integers, pad it with zeros to max_length, and store it.
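
Two of the points above, the length rule and the shuffled index list, can be sketched in isolation. The names `fits`, `rng`, and `shuffled_view` are illustrative and do not appear in the graded function.

```python
import random

def fits(line, max_length):
    """A line fits when its tensor (characters plus one EOS id) is at most max_length long."""
    return len(line) < max_length

print(fits("abcd", 5))   # True: 4 characters + EOS = 5
print(fits("abcde", 5))  # False: 5 characters + EOS = 6

# Shuffle a list of indexes instead of the lines themselves,
# so the original data is left untouched.
data_lines = ["to be", "or not", "to be again"]
lines_index = list(range(len(data_lines)))

rng = random.Random(32)  # local generator, seeded for repeatability
rng.shuffle(lines_index)

shuffled_view = [data_lines[i] for i in lines_index]
print(lines_index, shuffled_view)
```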

**Remember that when calling np you are really calling trax.fastmath.numpy which is trax’s version of numpy that is compatible with JAX. As a result of this, where you used to encounter the type numpy.ndarray now you will find the type jax.interpreters.xla.DeviceArray.**

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li>Use the line_to_tensor function above inside a list comprehension in order to pad lines with zeros.</li>
    <li>Keep in mind that the length of the tensor is always 1 + the length of the original line of characters when setting the padding of zeros.</li>
</ul>
</p>
</details>

In [15]:
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: data_generator
def data_generator(batch_size, max_length, data_lines, line_to_tensor=line_to_tensor, shuffle=True):
    """Generator function that yields batches of data

    Args:
        batch_size (int): number of examples (in this case, sentences) per batch.
        max_length (int): maximum length of the output tensor.
        NOTE: max_length includes the end-of-sentence character that will be added
                to the tensor.  
                Keep in mind that the length of the tensor is always 1 + the length
                of the original line of characters.
        data_lines (list): list of the sentences to group into batches.
        line_to_tensor (function, optional): function that converts line to tensor. Defaults to line_to_tensor.
        shuffle (bool, optional): True if the generator should generate random batches of data. Defaults to True.

    Yields:
        tuple: two copies of the batch (jax.interpreters.xla.DeviceArray) and mask (jax.interpreters.xla.DeviceArray).
        NOTE: jax.interpreters.xla.DeviceArray is trax's version of numpy.ndarray
    """
    # initialize the index that points to the current position in the lines index array
    index = 0
    
    # initialize the list that will contain the current batch
    cur_batch = []
    
    # count the number of lines in data_lines
    num_lines = len(data_lines)
    
    # create an array with the indexes of data_lines that can be shuffled
    lines_index = [*range(num_lines)]
    
    # shuffle line indexes if shuffle is set to True
    if shuffle:
        rnd.shuffle(lines_index)
    
    ### START CODE HERE (Replace instances of 'None' with your code) ###
    while True:
        
        # if the index is greater than or equal to the number of lines in data_lines
        if index >= num_lines:
            # then reset the index to 0
            index = 0
            # shuffle line indexes if shuffle is set to True
            if shuffle:
                rnd.shuffle(lines_index)
            
        # get a line at the `lines_index[index]` position in data_lines
        line = data_lines[lines_index[index]]
        
        # if the length of the line is less than max_length
        if len(line) < max_length:
            # append the line to the current batch
            cur_batch.append(line)
            
        # increment the index by one
        index += 1
        
        # if the current batch is now equal to the desired batch size
        if len(cur_batch) == batch_size:
            
            batch = []
            mask = []
            
            # go through each line (li) in cur_batch
            for li in cur_batch:
                # convert the line (li) to a tensor of integers
                tensor = line_to_tensor(li)
                
                # Create a list of zeros to represent the padding
                # so that the tensor plus padding will have length `max_length`
                pad = [0] * (max_length - len(tensor))
                
                # combine the tensor plus pad
                tensor_pad = tensor + pad
                
                # append the padded tensor to the batch
                batch.append(tensor_pad)

                # A mask for  tensor_pad is 1 wherever tensor_pad is not
                # 0 and 0 wherever tensor_pad is 0, i.e. if tensor_pad is
                # [1, 2, 3, 0, 0, 0] then example_mask should be
                # [1, 1, 1, 0, 0, 0]
                # Hint: Use a list comprehension for this
                example_mask = [1 if c !=0 else 0 for c in tensor_pad]
                mask.append(example_mask)
               
            # convert the batch (data type list) to a trax's numpy array
            batch_np_arr = np.array(batch)
            mask_np_arr = np.array(mask)
            
            ### END CODE HERE ##
            
            # Yield two copies of the batch and mask.
            yield batch_np_arr, batch_np_arr, mask_np_arr
            
            # reset the current batch to an empty list
            cur_batch = []
            

In [16]:
# Try out your data generator
tmp_lines = ['12345678901', #length 11
             '123456789', # length 9
             '234567890', # length 9
             '345678901'] # length 9

# Get a batch size of 2, max length 10
tmp_data_gen = data_generator(batch_size=2, 
                              max_length=10, 
                              data_lines=tmp_lines,
                              shuffle=False)

# get one batch
tmp_batch = next(tmp_data_gen)

# view the batch
tmp_batch

(DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
              [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32),
 DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
              [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32),
 DeviceArray([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32))

##### Expected output

```CPP
(DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
              [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32),
 DeviceArray([[49, 50, 51, 52, 53, 54, 55, 56, 57,  1],
              [50, 51, 52, 53, 54, 55, 56, 57, 48,  1]], dtype=int32),
 DeviceArray([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32))
```

Now that you have your generator, you can call `next` on it and it will return tensors that correspond to lines of Shakespeare. The first and second elements of each batch are identical. You can now go ahead and start building your neural network.