| <p style="text-align: left;">Name</p>               | Matr.Nr. | <p style="text-align: right;">Date</p> |
| --------------------------------------------------- | -------- | ------------------------------------- |
| <p style="text-align: left">Lion DUNGL</p> | 01553060 | 29.05.2020                            |

<h1 style="color:rgb(0,120,170)">Hands-on AI II</h1>
<h2 style="color:rgb(0,120,170)">Unit 7 -- Introduction to Natural Language Processing II </h2>

<b>Authors</b>: Rekabsaz, Brandstetter <br>
<b>Date</b>: 11-05-2020

This file is part of the "Hands-on AI II" lecture material. The following copyright statement applies to all code within this file.

<b>Copyright statement:</b><br>
This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

# Exercise 0

- Import the same modules as discussed in the lecture notebook.
- Check if your module versions are correct.
- Use your GPU if available.

In [1]:
import u7_utils as u7

import numpy as np
import torch
import torch.utils.data
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
import dill as pickle
import sys
import os
import io
import time
import math

In [2]:
u7.check_module_versions()

Installed Python version: 3.7 (✓)
Installed numpy version: 1.18.1 (✓)
Installed matplotlib version: 3.1.3 (✓)
Installed PyTorch version: 1.5.0 (✓)


In [3]:
use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')
print("Device:", device)

Device: cuda


<h1 style="color:rgb(208,90,80)">ABOUT THIS NOTEBOOK</h1>
<span style="color:rgb(208,90,80)">In this notebook you should solve a small task on your own. <br><br> The goal is to train an LSTM network with different numbers of hidden cells on the Penn Treebank dataset. You should use the validation dataset to decide which model works best and then evaluate that model on the test dataset. This is a first example of a hyperparameter search. <br> We only evaluate how you build this hyperparameter search.</span>

<h3 style="color:rgb(0,120,170)">Defining hyper-parameters</h3>
In contrast to the lecture notebook, we do not set the parameter <i>nhid</i> here. It is the hyperparameter that we will search over later.

In [4]:
data_path = 'resources/penn/'
emsize = 200 # size of word embeddings
lr = 20 # initial learning rate
clipping = 0.25 # gradient clipping
epochs = 3 # upper epoch limit
train_batch_size = 10 # batch size for training
eval_batch_size = 5 # batch size for validation/test
max_seq_len = 35 # sequence length
seed = 1111 # random seed to facilitate reproducibility
print_interval = 1000 # report interval

In [5]:
torch.manual_seed(seed)

<torch._C.Generator at 0x7f9f11a1fbd0>

<h3 style="color:rgb(0,120,170)">Data & dictionary</h3>

In [6]:
train_corpus = u7.Corpus(os.path.join(data_path, 'train.txt'))
valid_corpus = u7.Corpus(os.path.join(data_path, 'valid.txt'))
test_corpus = u7.Corpus(os.path.join(data_path, 'test.txt'))

dictionary = u7.Dictionary()
train_corpus.fill_dictionary(dictionary)
ntokens = len(dictionary)
print (f'Number of tokens in dictionary {ntokens}')

train_data = train_corpus.words_to_ids(dictionary)
print (f'Train data: number of tokens {len(train_data)}')

valid_data = valid_corpus.words_to_ids(dictionary)
print (f'Validation data: number of tokens {len(valid_data)}')

test_data = test_corpus.words_to_ids(dictionary)
print (f'Test data: number of tokens {len(test_data)}')

with open('dictionary.pkl', 'wb') as f:
    pickle.dump(dictionary, f)

Number of tokens in dictionary 10001
Train data: number of tokens 929589
Validation data: number of tokens 73760
Test data: number of tokens 82430


In [7]:
train_data_batches = u7.batchify(train_data, train_batch_size, device)
print (f'Train batchified data shape: {train_data_batches.shape}')

val_data_batches = u7.batchify(valid_data, eval_batch_size, device)
print (f'Validation batchified data shape: {val_data_batches.shape}')

test_data_batches = u7.batchify(test_data, eval_batch_size, device)
print (f'Test batchified data shape: {test_data_batches.shape}')

Train batchified data shape: torch.Size([92958, 10])
Validation batchified data shape: torch.Size([14752, 5])
Test batchified data shape: torch.Size([16486, 5])
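
The shapes printed above are consistent with the usual batchification scheme for language modeling: drop the tokens that do not fill a complete row, then split the stream into `batch_size` contiguous columns. The sketch below reproduces that arithmetic with plain Python lists. `batchify_sketch` is a hypothetical stand-in, and that `u7.batchify` works the same way is an assumption, since its source is not shown here.

```python
def batchify_sketch(token_ids, batch_size):
    """Arrange a flat token stream into rows of `batch_size` parallel columns."""
    # Number of complete rows; leftover tokens at the end are dropped.
    n_rows = len(token_ids) // batch_size
    trimmed = token_ids[:n_rows * batch_size]
    # Column j holds the j-th contiguous chunk of the stream.
    columns = [trimmed[j * n_rows:(j + 1) * n_rows] for j in range(batch_size)]
    # Transpose so that each row contains one token from every column.
    return [list(row) for row in zip(*columns)]


# Toy example: 7 tokens, 2 columns. The last token (6) does not fit and is dropped.
toy = batchify_sketch(list(range(7)), batch_size=2)
print(toy)

# Shape arithmetic for the token counts reported in this notebook.
for name, n_tokens, bsz in [('train', 929589, 10), ('valid', 73760, 5), ('test', 82430, 5)]:
    print(name, (n_tokens // bsz, bsz))
```

With this layout, each column is an independent contiguous slice of the corpus, which is what allows the hidden state to be carried from one batch to the next during training.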


<h3 style="color:rgb(0,120,170)">Training</h3>
Nothing to do here

In [8]:
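def train(model: torch.nn.Module, dictionary: u7.Dictionary,
          max_seq_len: int, train_batch_size: int,
          train_data_batches: torch.Tensor, optimizer: torch.optim.Optimizer,
          criterion: torch.nn.Module, clipping: float, learning_rate: float,
          print_interval: int, epoch: int) -> None:
    """
    Train the model for one epoch on the batchified training data.
    `learning_rate` is only used for logging; the optimizer holds the actual rate.
    :return: None
    """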
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = len(dictionary)
    start_hidden = model.init_hidden(train_batch_size)
    for batch, i in enumerate(range(0, train_data_batches.size(0) - 1, max_seq_len)):
        data, targets = u7.get_batch(train_data_batches, i, max_seq_len)

        # forward pass
        model.zero_grad()
        start_hidden = u7.repackage_hidden(start_hidden)
        output, last_hidden = model(data, start_hidden)

        # loss computation & backward pass
        output = output.view(-1, ntokens)
        loss = criterion(output, targets.view(-1))
        loss.backward()

        start_hidden = last_hidden
        # clipping gradient
        torch.nn.utils.clip_grad_norm_(model.parameters(), clipping)
        optimizer.step()

        total_loss += loss.item()
        if batch % print_interval == 0 and batch > 0:
            cur_loss = total_loss / print_interval
            elapsed = time.time() - start_time
            print(f'| epoch {epoch :3d} | {batch :5d} /{int(len(train_data_batches)/max_seq_len) :5d} batches ' 
                  f'| lr {learning_rate :02.2f} | ms/batch {elapsed * 1000 / print_interval :5.2f} |'
                  f' loss {cur_loss :5.2f} | perplexity {math.exp(cur_loss) :8.2f}')
            total_loss = 0
            start_time = time.time()
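
The clipping step in the loop above rescales all gradients together whenever their combined L2 norm exceeds `clipping`. The following is a minimal pure-Python sketch of that rule on a flat list of numbers. `clip_by_global_norm` is a hypothetical helper for illustration, not the PyTorch implementation, and the small `eps` term is an assumption modeled on common practice.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale `grads` so that their joint L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale factor is capped at 1.0, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# A gradient with norm 5.0 is shrunk to the clipping value used in this notebook (0.25).
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.25)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Because every component is multiplied by the same factor, the direction of the update is preserved and only its length is limited.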

In [9]:
class LM_LSTMModel(nn.Module):

    def __init__(self, ntoken, ninp, nhid):
        super(LM_LSTMModel, self).__init__()
        self.ntoken = ntoken
        self.encoder = nn.Embedding(ntoken, ninp)
        self.rnn = nn.LSTM(ninp, nhid)
        self.decoder = nn.Linear(nhid, ntoken)
        self.nhid = nhid
        
    def init_hidden(self, bsz):
        weight = next(self.parameters())
        return (weight.new_zeros(1, bsz, self.nhid),
                weight.new_zeros(1, bsz, self.nhid))

    def forward(self, input, hidden):
        emb = self.encoder(input)
        hiddens, last_hidden = self.rnn(emb, hidden)
        
        decoded = self.decoder(hiddens)
        return F.log_softmax(decoded, dim=-1), last_hidden

# Exercise 1

- Train the model for three epochs and validate after each epoch. Repeat this procedure with different numbers of LSTM cells (the <i>nhid</i> parameter in the lecture notebook). Save the best model of each run.
- Which model is best? You can use the suggested parameter values, but you may also try other values. Please note that training can be quite time-consuming for larger numbers of LSTM cells.
- Load the best model and evaluate it on the test dataset.
- NOTA BENE: use the Adam optimizer to get better performance <code> optimizer = optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-5)</code>, instead of SGD as done in the lecture (you can check for it in earlier notebooks).

In [10]:
nhid = [8, 16, 32, 64, 128, 256, 512]

In [11]:
overall_start_time = time.time()
for n, cells in enumerate(nhid):
    print('-' * 89)
    print(f'{n+1} / {len(nhid)}: Training with {cells} LSTM cells')
    print('-' * 89)
    print('-' * 89)
        
    os.makedirs('models', exist_ok=True)  # make sure the output directory exists
    save_path = os.path.join('models', 'nihd'+str(cells))
    
    model = LM_LSTMModel(ntokens, emsize, cells).to(device)

    best_val_loss = None
    criterion = nn.NLLLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-5)
    
    cell_start_time = time.time()
    for epoch in range(epochs):
        epoch_start_time = time.time()
        train(model, dictionary, max_seq_len, train_batch_size, train_data_batches, optimizer, criterion, clipping, lr, print_interval, epoch)
        val_loss = u7.evaluate(model, dictionary, max_seq_len, eval_batch_size, val_data_batches, criterion)
        
        print('-' * 89)
        print(f'| end of epoch {epoch :3d} | time: {time.time() - epoch_start_time :5.2f}s' 
              f'| valid loss {val_loss :5.2f} | valid perplexity {math.exp(val_loss):8.2f}')
        print('-' * 89)
        
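        # saving best model
        if best_val_loss is None or val_loss < best_val_loss:
            with open(save_path, 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # NOTE: `lr` is only logged by train(); Adam keeps its own rate (1e-2),
            # so this decay does not change the optimization.
            lr /= 4.0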
    else:
        print('-' * 89)
        print(f'Done training after {time.time() - cell_start_time :5.2f}s !')
        print('-' * 89)
else:
    print('-' * 89)
    print('-' * 89)
    print('-' * 89)
    print(f'Done after {time.time() - overall_start_time :5.2f}s !')

-----------------------------------------------------------------------------------------
1 / 7: Training with 8 LSTM cells
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
| epoch   0 |  1000 / 2655 batches | lr 20.00 | ms/batch  9.65 | loss  6.34 | perplexity   565.14
| epoch   0 |  2000 / 2655 batches | lr 20.00 | ms/batch  9.57 | loss  5.92 | perplexity   371.62
-----------------------------------------------------------------------------------------
| end of epoch   0 | time: 26.35s| valid loss  5.83 | valid perplexity   341.97
-----------------------------------------------------------------------------------------

| epoch   1 |  1000 / 2655 batches | lr 20.00 | ms/batch  9.64 | loss  5.78 | perplexity   322.68
| epoch   1 |  2000 / 2655 batches | lr 20.00 | ms/batch 10.27 | loss  5.73 | perplexity   306.69
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 27.58s| valid loss  5.73 | valid perplexity   308.54
-----------------------------------------------------------------------------------------
| epoch   2 |  1000 / 2655 batches | lr 20.00 | ms/batch  9.66 | loss  5.69 | perplexity   296.75
| epoch   2 |  2000 / 2655 batches | lr 20.00 | ms/batch  9.53 | loss  5.67 | perplexity   289.29
-----------------------------------------------------------------------------------------
| end of epoch   2 | time: 26.18s| valid loss  5.69 | valid perplexity   296.58
-----------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------

KeyboardInterrupt: 

# Conclusion
It seems that the more LSTM cells we use, the better the corresponding model performs. However, the decrease in validation loss and validation perplexity gets smaller and smaller as the number of cells grows. <b>The best model seems to be the one with 512 LSTM cells.</b>

In [16]:
best_num_cells = list()

for n, cells in enumerate(nhid):
    print('-' * 89)
    print(f'{n+1} / {len(nhid)}: Testing model with {cells} LSTM cells')
    print('-' * 89)
    
    model_path = os.path.join('models', 'nihd'+str(cells))
    with open(model_path, 'rb') as f:
        model = torch.load(f)
    
    val_loss = u7.evaluate(model, dictionary, max_seq_len, eval_batch_size, val_data_batches, criterion)
        
    print(f'| valid loss {val_loss :5.2f} | valid perplexity {math.exp(val_loss):8.2f}')
    print('-' * 89)
    print(' ' * 89)
    
    if n == 0:
        best_num_cells = [cells, val_loss]
    else:
        if val_loss < best_num_cells[1]:
            best_num_cells = [cells, val_loss]
else:
    print(f'!! The best performing model is the one with {best_num_cells[0]} LSTM cells and a loss on the validation set of {best_num_cells[1]} !!')

-----------------------------------------------------------------------------------------
1 / 7: Testing model with 8 LSTM cells
-----------------------------------------------------------------------------------------
| valid loss  5.65 | valid perplexity   284.21
-----------------------------------------------------------------------------------------
                                                                                         
-----------------------------------------------------------------------------------------
2 / 7: Testing model with 16 LSTM cells
-----------------------------------------------------------------------------------------
| valid loss  5.45 | valid perplexity   232.77
-----------------------------------------------------------------------------------------
                                                                                         
-----------------------------------------------------------------------------------------
3 / 7: Testing mo

In [17]:
with open(os.path.join('models', 'nihd512'), 'rb') as f:
    best_model = torch.load(f)
    
test_loss = u7.evaluate(best_model, dictionary, max_seq_len, eval_batch_size, test_data_batches, criterion)

print('=' * 89)
print(f'| End of training | test loss {test_loss :5.2f} | test perplexity {math.exp(test_loss) :5.2f}')
print('=' * 89)

| End of training | test loss  4.96 | test perplexity 142.69
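
The perplexity printed above is the exponential of the loss, as in `math.exp(test_loss)`. A minimal sketch of that relationship follows, assuming the loss is the mean negative log-likelihood per token (the default reduction of `nn.NLLLoss`; `u7.evaluate` itself is not shown here). A model that guesses uniformly over the 10001 dictionary entries serves as a reference point. `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


vocab_size = 10001  # dictionary size reported earlier in this notebook

# A uniform model assigns probability 1/vocab_size to every token,
# so each token contributes an NLL of log(vocab_size).
uniform_nll = math.log(vocab_size)
baseline = perplexity([uniform_nll] * 1000)
print(f'uniform baseline perplexity: {baseline:.2f}')
```

Read this way, a perplexity of 142.69 means the model is, on average, about as uncertain as if it had to choose uniformly among roughly 143 words, compared with 10001 for the uniform baseline.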


# Exercise 2

- Count the parameters of the best model. How many parameters does it have?

In [59]:
n_params = sum(p.numel() for p in best_model.parameters() if p.requires_grad)

print(f'The LSTM model with 512 cells has {n_params} parameters')

The LSTM model with 512 cells has 8592985 parameters
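
The reported count can be checked by hand from the layer shapes in `LM_LSTMModel`. The sketch below assumes the standard single-layer `nn.LSTM` parameterization (four gates, each with input weights, recurrent weights, and two bias vectors) and uses the values `ntokens = 10001`, `emsize = 200`, and `nhid = 512` from this notebook. `count_lm_lstm_params` is a hypothetical helper.

```python
def count_lm_lstm_params(ntoken, ninp, nhid):
    """Count trainable parameters of an Embedding -> LSTM -> Linear language model."""
    # nn.Embedding(ntoken, ninp): one ninp-dimensional vector per token.
    encoder = ntoken * ninp
    # nn.LSTM(ninp, nhid), single layer: 4 gates, each with input weights,
    # recurrent weights, and two bias vectors (bias_ih and bias_hh).
    rnn = 4 * (ninp * nhid + nhid * nhid + 2 * nhid)
    # nn.Linear(nhid, ntoken): weight matrix plus bias.
    decoder = nhid * ntoken + ntoken
    return {'encoder': encoder, 'rnn': rnn, 'decoder': decoder,
            'total': encoder + rnn + decoder}


counts = count_lm_lstm_params(ntoken=10001, ninp=200, nhid=512)
for name, value in counts.items():
    print(f'{name:8s} {value:>10,d}')
```

The breakdown shows that the output layer accounts for most of the parameters (5,130,513 of 8,592,985), since its size scales with both `nhid` and the dictionary size.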
