# Character level language model - Dinosaurus land

Welcome to Dinosaurus Island! 65 million years ago, dinosaurs existed, and in this assignment they are back. You are in charge of a special task. Leading biology researchers are creating new breeds of dinosaurs and bringing them to life on earth, and your job is to give names to these dinosaurs. If a dinosaur does not like its name, it might go berserk, so choose wisely!

<table>
<td>
<img src="images/dino.jpg" style="width:250;height:300px;">

</td>

</table>

Luckily you have learned some deep learning and you will use it to save the day. Your assistant has collected a list of all the dinosaur names they could find, and compiled them into this [dataset](dinos.txt). (Feel free to take a look by clicking the previous link.) To create new dinosaur names, you will build a character level language model to generate new names. Your algorithm will learn the different name patterns, and randomly generate new names. Hopefully this algorithm will keep you and your team safe from the dinosaurs' wrath! 

By completing this assignment you will learn:

- How to store text data for processing using an RNN 
- How to synthesize data, by sampling predictions at each time step and passing it to the next RNN-cell unit
- How to build a character-level text generation recurrent neural network


In [1]:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import pdb
from torch.utils.data import Dataset, DataLoader

%load_ext autoreload
%autoreload 2

torch.set_printoptions(linewidth=200)

## 1 - Problem Statement

### 1.1 - Dataset and Preprocessing

Run the following cell to read the dataset of dinosaur names `dinos.txt`, create a list of unique characters (such as a-z), and compute the dataset and vocabulary size.

The characters are a-z (26 characters) plus the "\n" (or newline character), which in this assignment plays a role similar to the `<EOS>` (or "End of sentence") token. It indicates the end of the dinosaur name.

**TO DO**: In the cell below, you need to create two python dictionaries (i.e., hash tables) to map each character to an index from 0-26 and to map each index back to the corresponding character. 

This will help you figure out what index corresponds to what character in the probability distribution output of the softmax layer. Below, `self.ch_to_idx` and `self.idx_to_ch` are the python dictionaries. 
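As a rough illustration of the two dictionaries, the sketch below builds them from a short in-memory string rather than from `dinos.txt`. The three names are only a stand-in for the real file, and the variable names `content`, `vocab` and `probs` are chosen for this example.

```python
# Stand-in for the contents of dinos.txt (assumption: one name per line).
content = "Aachenosaurus\nAardonyx\nAbelisaurus\n".lower()

# Unique characters, sorted so that "\n" comes before the letters.
vocab = sorted(set(content))
ch_to_idx = {ch: i for i, ch in enumerate(vocab)}
idx_to_ch = {i: ch for i, ch in enumerate(vocab)}

# A made-up probability vector that puts all its mass on the letter "s".
probs = [0.0] * len(vocab)
probs[ch_to_idx["s"]] = 1.0

# Look up which character the most probable index stands for.
best_idx = max(range(len(probs)), key=probs.__getitem__)
print(repr(idx_to_ch[best_idx]))
```

With the full dataset the same construction should give 27 entries, assuming the file contains only the letters a-z and newlines.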

`__getitem__` is essential when using `DataLoader` later in the algorithm. The first entry of `x_str` is a space `' '`, which is encoded as the all-zero vector, i.e. $x^{\langle 1 \rangle} = \vec{0}$. This ensures that `y` is equal to `x` shifted one step to the left, with an additional "\n" appended to signify the end of the dinosaur name.
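To make the shift between inputs and targets concrete, here is a minimal NumPy sketch of the encoding for a single name. It assumes a vocabulary of "\n" plus a-z. The helper `encode` is hypothetical and only mirrors the idea of `__getitem__`, not its exact code.

```python
import string

import numpy as np

# Assumed vocabulary: newline plus a-z, sorted so that "\n" gets index 0.
vocab = sorted(set("\n" + string.ascii_lowercase))
ch_to_idx = {ch: i for i, ch in enumerate(vocab)}


def encode(name):
    """Return one-hot inputs x and integer targets y for one name."""
    x_str = " " + name   # the leading space stands for the all-zero first input
    y_str = name + "\n"  # targets: the inputs shifted left, ending in "\n"
    x = np.zeros((len(x_str), len(vocab)), dtype=np.float32)
    for t, ch in enumerate(x_str[1:], start=1):
        x[t, ch_to_idx[ch]] = 1.0  # row 0 is left as zeros
    y = np.array([ch_to_idx[ch] for ch in y_str], dtype=np.int64)
    return x, y


x, y = encode("rex")
print(x.shape)     # (4, 27)
print(y.tolist())  # [18, 5, 24, 0]
```

Row `t` of `x` encodes the character that the network sees, and `y[t]` is the index of the character it should predict at that step.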

In [2]:
hidden_size = 100


class DinosDataset(Dataset):
    def __init__(self):
        super().__init__()
        with open('dinos.txt') as f:
            content = f.read().lower()
            self.vocab = sorted(set(content))
            self.vocab_size = len(self.vocab)
            self.lines = content.splitlines()

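        # Map each character of the vocabulary to an index and back.
        # `content` was lowercased and contains "\n", so the keys must be the
        # actual vocabulary characters rather than the uppercase letters A-Z.
        self.ch_to_idx = {ch: i for i, ch in enumerate(self.vocab)}
        self.idx_to_ch = {i: ch for i, ch in enumerate(self.vocab)}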

    def __getitem__(self, index):
        line = self.lines[index]
        x_str = ' ' + line  # add a space at the beginning, which indicates a vector of zeros.
        y_str = line + '\n'
        x = torch.zeros([len(x_str), self.vocab_size], dtype=torch.float)
        y = torch.empty(len(x_str), dtype=torch.long)

        y[0] = self.ch_to_idx[y_str[0]]
        #we start from the second character because the first character of x was nothing(vector of zeros).
        for i, (x_ch, y_ch) in enumerate(zip(x_str[1:], y_str[1:]), 1):
            x[i][self.ch_to_idx[x_ch]] = 1
            y[i] = self.ch_to_idx[y_ch]

        return x, y

    def __len__(self):
        return len(self.lines)


### 1.2 - Overview of the Recurrent Neural Network

**TO DO**: Implement your RNN model class. Your RNN model will have the following structure: 
    
<img src="images/RNN.png" style="width:450;height:300px;">
<caption><center> **Figure 1**: Recurrent Neural Network.  </center></caption>

At each time-step, the RNN tries to predict what is the next character given the previous characters. The dataset $X = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$ is a list of characters in the training set, while $Y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$ , such that at every time-step $t$, we have $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$. 

In [6]:
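class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        # nn.Module.__init__ takes no arguments.
        super().__init__()

        # Single-layer tanh RNN; inputs have shape (batch, time, input_size).
        self.rnn = nn.RNN(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=1,
            batch_first=True
        )

        # Maps each hidden state to one score (logit) per vocabulary entry.
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, x, h_state):
        r_out, h_state = self.rnn(x, h_state)

        # nn.Linear acts on the last dimension, so every time step is
        # projected at once; the result has shape (batch, time, output_size).
        return self.out(r_out), h_state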


## 2 - Sampling

In this part, you will build an important building block of the overall language model:
- Sampling: a technique used to generate characters


Now assume that your model is trained. You would like to generate new text (characters). The process of generation is explained in the picture below:

<img src="images/dinos3.png" style="width:500;height:300px;">
<caption><center> **Figure 2**: In this picture, we assume the model is already trained. We pass in $x^{\langle 1\rangle} = \vec{0}$ at the first time step, and have the network then sample one character at a time. </center></caption>

**TO DO**: Implement the `sample` function below to sample characters. You need to carry out 4 steps:

- **Step 1**: Pass the network the first "dummy" input $x^{\langle 1 \rangle} = \vec{0}$ (the vector of zeros). This is the default input before we've generated any characters. We also set $a^{\langle 0 \rangle} = \vec{0}$

- **Step 2**: Run one step of forward propagation to get $a^{\langle 1 \rangle}$ and $\hat{y}^{\langle 1 \rangle}$. Here are the equations:

$$ a^{\langle t+1 \rangle} = \tanh(W_{ax}  x^{\langle t+1 \rangle } + W_{aa} a^{\langle t \rangle } + b)\tag{1}$$

$$ z^{\langle t + 1 \rangle } = W_{ya}  a^{\langle t + 1 \rangle } + b_y \tag{2}$$

$$ \hat{y}^{\langle t+1 \rangle } = \text{softmax}(z^{\langle t + 1 \rangle })\tag{3}$$

Note that $\hat{y}^{\langle t+1 \rangle }$ is a (softmax) probability vector (its entries are between 0 and 1 and sum to 1). $\hat{y}^{\langle t+1 \rangle}_i$ represents the probability that the character indexed by "i" is the next character.

- **Step 3**: Carry out sampling: Pick the next character's index according to the probability distribution specified by $\hat{y}^{\langle t+1 \rangle }$. This means that if $\hat{y}^{\langle t+1 \rangle }_i = 0.16$, you will pick the index "i" with 16% probability. To implement it, you can use [`np.random.choice`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.choice.html).

Here is an example of how to use `np.random.choice()`:
```python
np.random.seed(0)
p = np.array([0.1, 0.0, 0.7, 0.2])
index = np.random.choice([0, 1, 2, 3], p = p.ravel())
```
This means that you will pick the `index` according to the distribution: 
$P(index = 0) = 0.1, P(index = 1) = 0.0, P(index = 2) = 0.7, P(index = 3) = 0.2$.

- **Step 4**: The last step to implement in `sample()` is to overwrite the variable `x`, which currently stores $x^{\langle t \rangle }$, with the value of $x^{\langle t + 1 \rangle }$. You will represent $x^{\langle t + 1 \rangle }$ by creating a one-hot vector corresponding to the character you've chosen as your prediction. You will then forward propagate $x^{\langle t + 1 \rangle }$ in Step 2 and keep repeating the process until you get a "\n" character, indicating you've reached the end of the dinosaur name. 
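The four steps can be sketched in plain NumPy with small random weights. This is only an illustration of the procedure, not the PyTorch `sample()` you are asked to write: the function `sample_indices`, the sizes, and the weight values are all assumptions, and an untrained model like this one produces random characters.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def sample_indices(params, newline_idx, rng, max_len=50):
    """Sample character indices until a newline or max_len characters."""
    Wax, Waa, Wya, b, by = params
    vocab_size, n_a = Wax.shape[1], Waa.shape[0]

    # Step 1: zero input and zero hidden state.
    x = np.zeros(vocab_size)
    a = np.zeros(n_a)

    indices = []
    while len(indices) < max_len:
        # Step 2: one step of forward propagation.
        a = np.tanh(Wax @ x + Waa @ a + b)
        y_hat = softmax(Wya @ a + by)

        # Step 3: draw the next index from the distribution y_hat.
        idx = int(rng.choice(vocab_size, p=y_hat))
        indices.append(idx)
        if idx == newline_idx:
            return indices

        # Step 4: the sampled character becomes the next one-hot input.
        x = np.zeros(vocab_size)
        x[idx] = 1.0

    indices.append(newline_idx)  # length cap reached: terminate the name
    return indices


rng = np.random.default_rng(0)
vocab_size, n_a = 27, 16  # assumed sizes for this sketch
params = (
    rng.standard_normal((n_a, vocab_size)) * 0.1,  # Wax
    rng.standard_normal((n_a, n_a)) * 0.1,         # Waa
    rng.standard_normal((vocab_size, n_a)) * 0.1,  # Wya
    np.zeros(n_a),                                 # b
    np.zeros(vocab_size),                          # by
)
indices = sample_indices(params, newline_idx=0, rng=rng)
print(indices)
```

In the PyTorch version, `nn.RNN` and the linear layer take the place of the explicit matrix products, and the softmax has to be applied to the model's output before sampling.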

In [5]:
def sample(model):
    model.eval()
    word_size = 0
    newline_idx = trn_ds.ch_to_idx['\n']
    indices = []
    pred_char_idx = -1

    # Step 1: initialize first input and hidden state
    # YOUR CODE HERE
    h_prev = None
    x = None

    with torch.no_grad():
        while pred_char_idx != newline_idx and word_size != 50:
            # Step 2: Forward propagate x using the equations (1), (2) and (3)
            # YOUR CODE HERE

            np.random.seed(np.random.randint(1, 5000))
            # Step 3: Sample the index of a character within the vocabulary from the probability distribution
            # YOUR CODE HERE
            idx = None
            indices.append(idx)

            # Step 4: Overwrite the input character as the one corresponding to the sampled index.
            # YOUR CODE HERE
            x = None

            pred_char_idx = idx
            word_size += 1
        if word_size == 50:
            indices.append(newline_idx)
    return indices

def print_sample(sample_idxs):
    # Decode the indices and print the name with its first letter capitalized.
    name = ''.join(trn_ds.idx_to_ch[i] for i in sample_idxs)
    print(name[:1].upper() + name[1:], end='')


## 3 - Training the language model 

It is time to train the character-level language model for text generation. 

### 3.1 - Gradient descent 

**TO DO**: In this section you will implement a function that performs one epoch of training steps (with clipped gradients). You will go through the training examples one at a time, so the optimization algorithm will be stochastic gradient descent. 
Before running the optimization loop, you need to first initialize parameters.

As a reminder, here are the steps of a common optimization loop for an RNN:

- Forward propagate through the RNN to compute the loss
- Backward propagate through time to compute the gradients of the loss with respect to the parameters
- Clip the gradients if necessary 
- Update your parameters using gradient descent 

Every 100 steps of stochastic gradient descent, you will sample 1 name to see how the algorithm is doing. 
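The clipping and update steps can be illustrated without PyTorch. The sketch below shows the general idea of clipping by global norm followed by a plain gradient-descent update. The helpers `clip_by_global_norm` and `sgd_step` are hypothetical names for this example, the gradients are made up, and this is not PyTorch's own implementation of `clip_grad_norm_`.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


def sgd_step(params, grads, lr, max_norm=5.0):
    """Clip the gradients, then apply one plain gradient-descent update."""
    clipped, _ = clip_by_global_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]


# Made-up gradients whose joint norm (about 9.17) exceeds the threshold of 5.
grads = [np.full((2, 2), 3.0), np.full(3, 4.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)

# One update on made-up parameters of matching shapes.
params = [np.ones((2, 2)), np.ones(3)]
new_params = sgd_step(params, grads, lr=0.01)
```

Because every gradient is multiplied by the same factor, clipping changes the length of the update but not its direction.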

In [5]:
def train_one_epoch(model, loss_fn, optimizer):
    # Go through the training examples one at a time
    for line_num, (x, y) in enumerate(trn_dl):
        model.train()
        loss = 0
        optimizer.zero_grad()
        
        # Initialize parameters
        # YOUR CODE HERE
        h_prev = None
        
        for i in range(x.shape[1]):
            # Forward propagate through the RNN to compute the loss
            # YOUR CODE HERE
            pass  # placeholder so the skeleton parses; replace with your code

        # Every 100 steps of stochastic gradient descent,
        # print one sampled name to see how the algorithm is doing
        if (line_num+1) % 100 == 0:
            # YOUR CODE HERE
            # HINT: print_sample()
            pass  # placeholder so the skeleton parses; replace with your code
        
        # Backpropagate through time
        # YOUR CODE HERE
        
        # Clip your gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
        
        # Update parameters
        # YOUR CODE HERE
        
        

### 3.2 - Begin Training
Remember to shuffle the dataset, so that stochastic gradient descent visits the examples in random order. 
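As a small illustration of what shuffling means here, the sketch below draws a fresh random visiting order at the start of each epoch and then walks through the examples one at a time. The list `names` is a stand-in for the dataset, and the fixed seed is used only to make the sketch repeatable. `DataLoader` with `shuffle=True` handles this reshuffling for you.

```python
import random

# Stand-in for the dataset (assumption: a handful of lowercase names).
names = ["aachenosaurus", "aardonyx", "abelisaurus", "abrictosaurus"]
rng = random.Random(0)  # seeded only so the sketch is repeatable

orders = []
for epoch in range(2):
    order = list(range(len(names)))
    rng.shuffle(order)  # new random visiting order for this epoch
    orders.append(order)
    for i in order:
        example = names[i]  # one training example per step (batch size 1)

print(orders)
```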

**TO DO**: Follow the instructions and implement `train()`. 

In [6]:
trn_ds = DinosDataset()
trn_dl = DataLoader(trn_ds, batch_size=1, shuffle=True)

def train(trn_ds, trn_dl, epochs=1):
    # Create a new model, loss_fn and optimizer.
    # YOUR CODE HERE
    model = None
    # Use cross entropy loss
    loss_fn = None
    # Use Adam
    optimizer = None
    
    for e in range(1, epochs+1):
        print(f'{"-"*20} Epoch {e} {"-"*20}')
        train_one_epoch(model, loss_fn, optimizer)

In [3]:
#Start training
train(trn_ds, trn_dl, epochs=5)


## Conclusion

You can see that your algorithm has started to generate plausible dinosaur names towards the end of the training. At first, it was generating random characters, but towards the end you could see dinosaur names with cool endings. Feel free to run the algorithm even longer and play with hyperparameters to see if you can get even better results. Our implementation generated some really cool names like `maconucon`, `marloralus` and `macingsersaurus`. Your model hopefully also learned that dinosaur names tend to end in `saurus`, `don`, `aura`, `tor`, etc.

If your model generates some non-cool names, don't blame the model entirely--not all actual dinosaur names sound cool. (For example, `dromaeosauroides` is an actual dinosaur name and is in the training set.) But this model should give you a set of candidates from which you can pick the coolest! 

This assignment used a relatively small dataset so that you could train an RNN quickly on a CPU. Training a model of the English language requires a much bigger dataset, usually needs much more computation, and could run for many hours on GPUs. We ran our dinosaur name generator for quite some time, and so far our favorite name is the great, undefeatable, and fierce: Mangosaurus!

<img src="images/mangosaurus.jpeg" style="width:250;height:300px;">

Reference: this assignment is adapted from one of Andrew Ng's Deep Learning Specialization--Sequence Models labs.