# Overview
The point of this notebook is to quickly go through the moving parts of training the model and generating samples.

## Vocabulary class
The point of this class is to set up a bijection between tokens (which in this case are characters) and natural numbers. The class has methods to look up tokens and ids. It also introduces a few special tokens: begin, end, unknown token, mask (used for padding when working with sequences of fixed length), and space.

In [3]:
from charvocabulary import charVocabulary
vocab = charVocabulary()
import os
txt_path = os.path.join('source', 'firstnames.txt')
vocab.add_txt(txt_path)
print(f"Vocabulary length is {len(vocab)}.")
for i in range(10):
    print(f"To id {i} corresponds token {vocab.lookup_idx(i)}.")
print(f"Empty space token corresponds to index {vocab.lookup_token(' ')}.")
print(f"Token 'a' corresponds to index {vocab.lookup_token('a')}.")

Vocabulary length is 31.
To id 0 corresponds token <mask>.
To id 1 corresponds token <begin>.
To id 2 corresponds token <end>.
To id 3 corresponds token <unk>.
To id 4 corresponds token  .
To id 5 corresponds token a.
To id 6 corresponds token l.
To id 7 corresponds token i.
To id 8 corresponds token y.
To id 9 corresponds token h.
Empty space token corresponds to index 4.
Token 'a' corresponds to index 5.
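
The source of `charVocabulary` is not shown in this notebook. As a rough illustration of the bijection idea, here is a minimal sketch. `ToyVocabulary`, `add_token`, and `add_text` are hypothetical names, and the order in which the special tokens are registered is an assumption chosen to match the ids printed above.

```python
class ToyVocabulary:
    """Hypothetical stand-in for charVocabulary: a two-way token <-> id map."""

    def __init__(self):
        self._token_to_idx = {}
        self._idx_to_token = {}
        # Special tokens are registered first so they get the lowest ids.
        self.mask_idx = self.add_token("<mask>")
        self.begin_idx = self.add_token("<begin>")
        self.end_idx = self.add_token("<end>")
        self.unk_idx = self.add_token("<unk>")
        self.space_idx = self.add_token(" ")

    def add_token(self, token):
        """Return the id of `token`, assigning the next free id if it is new."""
        if token not in self._token_to_idx:
            idx = len(self._token_to_idx)
            self._token_to_idx[token] = idx
            self._idx_to_token[idx] = token
        return self._token_to_idx[token]

    def add_text(self, text):
        """Register every character of `text`, skipping newlines."""
        for char in text:
            if char != "\n":
                self.add_token(char)

    def lookup_token(self, token):
        """Token -> id; unseen tokens fall back to the <unk> id."""
        return self._token_to_idx.get(token, self.unk_idx)

    def lookup_idx(self, idx):
        """Id -> token."""
        return self._idx_to_token[idx]

    def __len__(self):
        return len(self._token_to_idx)


vocab_demo = ToyVocabulary()
vocab_demo.add_text("aaliyah\n")
print(len(vocab_demo))               # 10: five special tokens plus a, l, i, y, h
print(vocab_demo.lookup_token("a"))  # 5
```

Because ids are handed out in order of first appearance, feeding the sketch 'aaliyah' reproduces ids 5 to 9 from the output above.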


## Vectorizer class
The goal of this class is to take in a sequence of tokens, and spit out a sequence of integers.
This class requires we already have a _vocabulary_ class.

The core of the vectorizer class is the `vectorize` method, which works roughly as follows.
- Given a sequence, such as 'monkey', it is encoded as a list of integers (using the `lookup_token` method from the vocabulary class), wrapped in the begin and end ids, and padded, so we get something like [1, 24, 16, 14, 25, 13, 8, 2, 0, 0, 0, 0]. The zeros come from the padding.

For training, we actually want to feed the network two versions of the sequence (one to be used as training input, the other for training labels).
* x = "begintoken" monkey "masktoken""masktoken""masktoken"
* y = monkey "endtoken""masktoken""masktoken""masktoken"

In [4]:
from charvectorizer import charVectorizer
vectorizer = charVectorizer(vocab=vocab)

In [5]:
x,y = vectorizer.vectorize('monkey', max_len=20)
print(x)
print(y)

[ 1 24 16 14 25 13  8  0  0  0  0  0  0  0  0  0  0  0  0]
[24 16 14 25 13  8  2  0  0  0  0  0  0  0  0  0  0  0  0]
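
The source of `charVectorizer` is not shown here, so the following is only a sketch of how the shifted pair could be built. `toy_vectorize` is a hypothetical name. Padding to `max_len - 1` is an inference from the printed output (length 19 for `max_len=20`), not a confirmed detail of the real class.

```python
MASK, BEGIN, END, UNK = 0, 1, 2, 3  # special ids, as printed by the vocabulary cell


def toy_vectorize(word, token_to_idx, max_len):
    """Hypothetical sketch: return (x, y) id lists, each padded to max_len - 1."""
    ids = [token_to_idx.get(ch, UNK) for ch in word]
    x = [BEGIN] + ids   # input: begin token, then the characters
    y = ids + [END]     # labels: the characters, then the end token
    length = max_len - 1
    if len(x) > length:
        raise ValueError("word does not fit in max_len")
    pad = [MASK] * (length - len(x))
    return x + pad, y + pad


# Ids for the letters of 'monkey', copied from the printed output above.
letters = {"m": 24, "o": 16, "n": 14, "k": 25, "e": 13, "y": 8}
x_demo, y_demo = toy_vectorize("monkey", letters, max_len=20)
print(x_demo)
print(y_demo)
```

At every position, `y_demo` holds the id that should follow the prefix of `x_demo` ending there, which is what makes the pair usable for next-character prediction.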


## The model
The logic behind the Embedding and RNN layers is explained in separate notebooks. You may want to look at them first.

In [6]:
from charmodel import charModel

In [7]:
# it's useful for the model to know which id is used for padding
maskid = vocab.mask_idx

# build the model, telling it the vocabulary size and the padding id
model = charModel(vocab_size=len(vocab), padding_idx=maskid)

In [8]:
# let's create some bogus inputs
# recall x comes from the vectorizer, although any sequence of valid token ids would do
import torch
x_in = torch.tensor(x)
x_in = x_in.unsqueeze(dim=0)

In [9]:
print(x_in.shape)
print(x_in)

torch.Size([1, 19])
tensor([[ 1, 24, 16, 14, 25, 13,  8,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0]])


In [11]:
# the verbose option just prints out the shapes of the tensors as they are processed by the model
y = model(x_in, verbose=True)

Input has shape torch.Size([1, 19]).
Output of embedding layer has shape torch.Size([1, 19, 10]).
Output of RNN has shape torch.Size([1, 19, 9]).
Reshaped output of RNN has shape torch.Size([19, 9]).
Output of fc has shape torch.Size([19, 31]).
Final output has shape torch.Size([1, 19, 31]).


## Decoding
We feed the model a word, and it returns a sequence of vectors (one score per vocabulary entry at each position). How do we translate this back into a word?

The idea is to create an empty list called `sequence`. We first append the `<begin>` token. Then we iteratively add tokens to it, by feeding it to the model.
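
The loop structure can be sketched in plain Python before bringing in the model. In the sketch below, `toy_next_scores` is a hypothetical stand-in for the network: it only returns one raw score per vocabulary id. The real cell instead carries an RNN hidden state from step to step. `softmax` and `sample_ids` are likewise illustrative names.

```python
import math
import random


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_next_scores(prefix, vocab_size):
    """Hypothetical stand-in for the model: one raw score per vocabulary id."""
    # Favour the id that follows the last one, just to make the scores non-uniform.
    favoured = (prefix[-1] + 1) % vocab_size
    return [2.0 if i == favoured else 0.0 for i in range(vocab_size)]


def sample_ids(begin_idx, vocab_size, sample_size, rng):
    """Start from the begin id and append one sampled id per step."""
    sequence = [begin_idx]
    for _ in range(sample_size):
        probabilities = softmax(toy_next_scores(sequence, vocab_size))
        # Draw one id with probability proportional to its weight.
        winner = rng.choices(range(vocab_size), weights=probabilities, k=1)[0]
        sequence.append(winner)
    return sequence


ids = sample_ids(begin_idx=1, vocab_size=31, sample_size=20, rng=random.Random(0))
print(len(ids))  # 21: the begin token plus 20 sampled ids
```

Sampling from the distribution, rather than always taking the most likely id, is what lets repeated runs produce different words.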

In [12]:
"""
If you run this code and the output word looks empty, it probably generated a bunch of spaces. Simply run this cell again.
"""
# let's decide on the length of the word we'll generate
sample_size = 20

# we start our sequence with the begin token
beginid = vocab.begin_idx
# we convert it to a torch tensor, and unsqueeze a batch index
begintensor = torch.tensor([beginid]).unsqueeze(dim=0)

# we initialize our sequence
sequence = [begintensor]

# base step of our iteration
t = 1
x_t = sequence[1-1]
h_t = None

# once is a dummy variable to print something only once
once = True

for t in range(1,sample_size+1):
    # input at time t is x_t
    x_t = sequence[t-1]
    emb_t = model.emb(x_t)
    
    # on the LHS h_t is the hidden state at time t, while on the RHS h_t is the old hidden state (so h_{t-1})
    rnn_t, h_t = model.rnn(emb_t, h_t)
    
    # we first squeeze rnn_t so that it has shape (batch, features), so it can be fed to the linear layer
    predictions = model.fc(rnn_t.squeeze(dim=1))
    
    # apply a softmax to translate the outputs into probabilities
    # dim=1 is the vocabulary dimension, so each row (one per batch element) sums to 1
    probabilities = torch.nn.functional.softmax(predictions, dim=1)
    if once:
        print(f"The shape of probabilities is {probabilities.shape}.")
        once = False

    # multinomial introduces an element of randomness
    # it treats each row of probabilities as weights for a categorical distribution and then samples from it
    # it returns the corresponding index
    winner = torch.multinomial(probabilities, num_samples=1)
    sequence.append(winner)

temp = ""
for i in range(len(sequence)):
    idx = sequence[i].item()
    temp += vocab.lookup_idx(idx)

# what follows is a bit ad hoc, and might not play well with the <unk> token
i = 0
while i < len(temp) and temp[i] != '>':
    i += 1

decoded = ""
i += 1
while i < len(temp) and temp[i] != '<':
    decoded += temp[i]
    i += 1

print(f"The output is: {decoded}.")

The shape of probabilities is torch.Size([1, 31]).
The output is: bcnswuaotftmjdd.
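
The string scanning at the end of the cell can also be done directly on the ids. The helper below is a sketch with a hypothetical name, `decode_ids`. Unlike the cell above, it stops only at the end id and keeps any `<unk>` token as literal text, so it is a swapped-in alternative rather than an equivalent rewrite.

```python
def decode_ids(ids, idx_to_token, begin_idx, end_idx, mask_idx):
    """Turn a list of ids into a string, stopping at the first end token."""
    chars = []
    for idx in ids:
        if idx == end_idx:
            break               # everything after <end> is discarded
        if idx in (begin_idx, mask_idx):
            continue            # skip the begin marker and any padding
        chars.append(idx_to_token[idx])
    return "".join(chars)


# Toy id -> token table using the ids printed by the vocabulary cell.
table = {0: "<mask>", 1: "<begin>", 2: "<end>", 3: "<unk>", 4: " ",
         5: "a", 6: "l", 7: "i", 8: "y", 9: "h"}
print(decode_ids([1, 5, 6, 7, 2, 9, 9], table, begin_idx=1, end_idx=2, mask_idx=0))  # ali
```

With the notebook's objects, the ids would come from `[t.item() for t in sequence]` and the table lookup from `vocab.lookup_idx`.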


## Training
Now we know how to run the model and sample from it. How do we train it?

### Loading data
First we need a way to access the training data. We use PyTorch's dataset class and the output of the vectorizer.

In [13]:
import pandas as pd
txt_path = os.path.join('source', 'firstnames.txt')
corpus = pd.read_csv(txt_path, header=None).dropna().reset_index()[0]
# charDataset is a very standard pytorch dataset class
# the getitem method returns the output of the vectorizer
from chardataset import charDataset
ds = charDataset(vectorizer=vectorizer, corpus=corpus)

# let's pick a random sample
# randint includes both endpoints, so the largest valid index is N-1
from random import randint
N = len(ds)
x,y = ds[randint(0, N-1)]

print(x)
print(y)

print(f"Recall that the mask, begin, and end tokens correspond to indices {vocab.mask_idx}, {vocab.begin_idx}, and {vocab.end_idx}.")

[ 1 10 17  8 14  6 13 13  0  0  0]
[10 17  8 14  6 13 13  2  0  0  0]
Recall that the mask, begin, and end tokens correspond to indices 0, 1, and 2.


### Loss function
We use a standard cross-entropy loss. The annoying thing is we need to do some reshaping first.

In [17]:
# convert data to torch and add a batch index
x_tensor = torch.tensor(x).unsqueeze(dim=0)
y_tensor = torch.tensor(y).unsqueeze(dim=0)

# apply model
y_pred = model(x_tensor)

# reshape to apply loss
batch_size, sequence_length, feature_dimension = y_pred.shape
y_pred_reshaped = y_pred.view(batch_size*sequence_length, feature_dimension)
print(y_pred_reshaped.shape)

# y only contains the labels
y_reshaped = y_tensor.view(batch_size*sequence_length)
# the code above is equivalent to
# y_reshaped = y_tensor.view(-1)
print(y_reshaped.shape)

# cross entropy applies a softmax first, so we don't need to
loss = torch.nn.functional.cross_entropy(y_pred_reshaped, y_reshaped, ignore_index=maskid)
print(f"The loss is {loss.item()}.")

torch.Size([11, 31])
torch.Size([11])
The loss is 3.5431160926818848.
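
To see what `ignore_index` does, here is a plain-Python sketch of the same quantity: the mean of the negative log-softmax at the label, taken only over positions whose label is not the padding id. `masked_cross_entropy` is a hypothetical helper written for illustration, not PyTorch's implementation.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index):
    """Mean of -log softmax(logits)[label] over positions whose label is not ignored."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padding positions contribute nothing
        shift = max(row)  # log-sum-exp with the max subtracted for stability
        log_norm = shift + math.log(sum(math.exp(v - shift) for v in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


# Three positions over a 4-token vocabulary; the last position is padding (label 0).
logits_demo = [[0.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.0],
               [5.0, 1.0, 1.0, 1.0]]
labels_demo = [2, 3, 0]
print(masked_cross_entropy(logits_demo, labels_demo, ignore_index=0))  # log(4) ≈ 1.386
```

The same reasoning gives a sanity check for the notebook: a model that spreads probability evenly over 31 tokens would score ln(31) ≈ 3.43, which is close to the untrained loss printed above.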


## Workflow
We conclude by looking at the full training workflow.

In [26]:
# load the dataloader
from torch.utils.data import DataLoader
dl = DataLoader(ds, batch_size=4, shuffle=True)

# let's use stochastic gradient descent with momentum as optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

num_epochs = 5
for epoch in range(num_epochs):
    print(f"\nThis is epoch number {epoch+1}.")
    model.train()
    
    # each batch unpacks directly into input and labels
    for x,y in dl:

        # zero out gradients before we forget
        optimizer.zero_grad()
        
        # no need to convert data to tensor or unsqueeze a batch index
        # 'tis the magic of the dataloader
        y_pred = model(x)

        batch_size, seq_len, feats = y_pred.shape
        y_pred_loss = y_pred.view(batch_size*seq_len,feats)
        y_loss = y.view(-1)
        
        # compute loss and gradients
        loss = torch.nn.functional.cross_entropy(y_pred_loss, y_loss, ignore_index=maskid)
        loss.backward()
        # update
        optimizer.step()

    # instead of evaluating the model by computing some accuracy score, we have it generate words
    # we use the code from the Decoding cell above,
    # but it would be more practical to wrap this in a function
    
    num_samples = 5
    sample_size = 20
    print(f"Model has completed epoch number {epoch+1}. We will now print {num_samples} names of length {sample_size}.")
    model.eval()
    for _ in range(num_samples):
        beginid = vocab.begin_idx
        begintensor = torch.tensor([beginid]).unsqueeze(dim=0)
        sequence = [begintensor]
        t = 1
        x_t = sequence[1-1]
        h_t = None

        for t in range(1,sample_size+1):
            x_t = sequence[t-1]
            emb_t = model.emb(x_t)
            rnn_t, h_t = model.rnn(emb_t, h_t)
            predictions = model.fc(rnn_t.squeeze(dim=1))
            probabilities = torch.nn.functional.softmax(predictions, dim=1)
            winner = torch.multinomial(probabilities, num_samples=1)
            sequence.append(winner)

        temp = ""
        for i in range(len(sequence)):
            idx = sequence[i].item()
            temp += vocab.lookup_idx(idx)

        i = 0
        while i < len(temp) and temp[i] != '>':
            i += 1

        decoded = ""
        i += 1
        while i < len(temp) and temp[i] != '<':
            decoded += temp[i]
            i += 1
        print(decoded)


This is epoch number 1.
Model has completed epoch number 1. We will now print 5 names of length 20.
slyfan
enaba
sabia
jaidlyne
kid

This is epoch number 2.
Model has completed epoch number 2. We will now print 5 names of length 20.
ahllia
roda
chain
aysollyi
kayya

This is epoch number 3.
Model has completed epoch number 3. We will now print 5 names of length 20.
brya
ragaralie
avie
kalinia
onla

This is epoch number 4.
Model has completed epoch number 4. We will now print 5 names of length 20.
callee
hena
alientalinel
harlenn
rene

This is epoch number 5.
Model has completed epoch number 5. We will now print 5 names of length 20.
filanty
sel
doun
lyna
mar
