# Assignment 3 : Sequence labelling with RNNs
In this assignement we will ask you to perform POS tagging.

You are asked to follow these steps:
*   Download the corpora and split it in training and test sets, structuring a dataframe.
*   Embed the words using GloVe embeddings
*   Create a baseline model, using a simple neural architecture
*   Experiment doing small modifications to the model
*   Evaluate your best model
*   Analyze the errors of your model

**Corpora**:
Ignore the numeric value in the third column, use only the words/symbols and its label.
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip 

**Splits**: documents 1-100 are the train set, 101-150 validation set, 151-199 test set.

**Baseline**: two layers architecture: a Bidirectional LSTM and a Dense/Fully-Connected layer on top.
**Modifications**: experiment using a GRU instead of the LSTM, adding an additional LSTM layer, and using a CRF in addition to the LSTM. Each of this change must be done by itself (don't mix these modifications).<br>
1) BiLSTM +  FC <br>
2) BiGRU + FC <br>
3) BiLSTMx2 + FC <br>
4) BiLSTM +  FC + CRF <br>
**Training and Experiments**: all the experiments must involve only the training and validation sets.

**Evaluation**: in the end, only the best model of your choice must be evaluated on the test set. The main metric must be F1-Macro computed between the various part of speech (without considering punctuation classes).

**Error Analysis** (optional) : analyze the errors done by your model, try to understand which may be the causes and think about how to improve it.

**Report**: You are asked to deliver a small report of about 4-5 lines in the .txt file that sums up your findings.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext

from torchtext import data
from torchtext import datasets

import spacy
import numpy as np

import time
import random
import os

In [251]:
def read_data(base_dir, datafields):
    train = []
    val = []
    test = []
    for filename in sorted(os.listdir(base_dir)):
        if str(filename) < 'wsj_0100.dp': # get train data
            with open(base_dir + filename, encoding='utf-8') as f:
                words = []
                labels = []
                for line in f:
                    line = line.strip()
                    if line: # if is not empty string
                        columns = line.split()
                        words.append(columns[0]) # take the word
                        labels.append(columns[-2]) # take the POS tag
                train.append(torchtext.data.Example.fromlist([words, labels], datafields))
        elif str(filename) < 'wsj_0150.dp':
            with open(base_dir + filename, encoding='utf-8') as f:
                words = []
                labels = []
                for line in f:
                    line = line.strip()
                    if line: # if is not empty string
                        columns = line.split()
                        words.append(columns[0]) # take the word
                        labels.append(columns[-2]) # take the POS tag
                val.append(torchtext.data.Example.fromlist([words, labels], datafields))
        else:
            with open(base_dir + filename, encoding='utf-8') as f:
                words = []
                labels = []
                for line in f:
                    line = line.strip()
                    if line: # if is not empty string
                        columns = line.split()
                        words.append(columns[0]) # take the word
                        labels.append(columns[-2]) # take the POS tag
                test.append(torchtext.data.Example.fromlist([words, labels], datafields))
    return torchtext.data.Dataset(train, datafields), torchtext.data.Dataset(val, datafields), torchtext.data.Dataset(test, datafields)

In [252]:
text = data.Field(lower = True)
label = data.Field(unk_token = None)
fields = [('text', text), ('label', label)]
base_dir = '../dependency_treebank/'
train_data, val_data, test_data = read_data(base_dir, fields)

In [253]:
MIN_FREQ = 2

text.build_vocab(train_data, 
                 min_freq = MIN_FREQ,
                 vectors = "glove.6B.100d",
                 unk_init = torch.Tensor.normal_)


label.build_vocab(train_data)

In [254]:
def tag_percentage(tag_counts):
    
    total_count = sum([count for tag, count in tag_counts])
    
    tag_counts_percentages = [(tag, count, count/total_count) for tag, count in tag_counts]
        
    return tag_counts_percentages

In [255]:
print("Tag\t\tCount\t\tPercentage\n")

for tag, count, percent in tag_percentage(text.vocab.freqs.most_common()):
    print(f"{tag}\t\t{count}\t\t{percent*100:4.1f}%")

Tag		Count		Percentage

,		2525		 5.4%
the		2286		 4.9%
.		1896		 4.1%
of		1155		 2.5%
to		1018		 2.2%
a		1000		 2.2%
in		932		 2.0%
and		804		 1.7%
for		448		 1.0%
's		424		 0.9%
that		413		 0.9%
``		397		 0.9%
''		388		 0.8%
$		332		 0.7%
is		315		 0.7%
said		311		 0.7%
it		301		 0.6%
on		254		 0.5%
mr.		223		 0.5%
at		217		 0.5%
by		203		 0.4%
with		202		 0.4%
as		197		 0.4%
was		193		 0.4%
are		190		 0.4%
have		185		 0.4%
n't		184		 0.4%
has		183		 0.4%
its		174		 0.4%
new		168		 0.4%
an		164		 0.4%
but		162		 0.3%
u.s.		158		 0.3%
from		157		 0.3%
be		155		 0.3%
he		155		 0.3%
%		154		 0.3%
they		150		 0.3%
will		137		 0.3%
says		132		 0.3%
--		130		 0.3%
about		126		 0.3%
million		111		 0.2%
or		109		 0.2%
this		104		 0.2%
their		101		 0.2%
company		99		 0.2%
more		95		 0.2%
year		93		 0.2%
who		93		 0.2%
were		92		 0.2%
which		91		 0.2%
japanese		89		 0.2%
;		86		 0.2%
than		85		 0.2%
had		85		 0.2%
she		84		 0.2%
one		82		 0.2%
would		80		 0.2%
also		78		 0.2%
been		75		 0.2%
n

stand		4		 0.0%
advance		4		 0.0%
club		4		 0.0%
paul		4		 0.0%
atlanta		4		 0.0%
allow		4		 0.0%
shipments		4		 0.0%
annually		4		 0.0%
23		4		 0.0%
convertible		4		 0.0%
preferred		4		 0.0%
oil		4		 0.0%
launch		4		 0.0%
whiting		4		 0.0%
brought		4		 0.0%
started		4		 0.0%
reserves		4		 0.0%
start		4		 0.0%
shearson		4		 0.0%
businesses		4		 0.0%
philippines		4		 0.0%
aide		4		 0.0%
mcalpine		4		 0.0%
expansion		4		 0.0%
quarterly		4		 0.0%
canadian		4		 0.0%
founder		4		 0.0%
mortgage		4		 0.0%
purchases		4		 0.0%
institutional		4		 0.0%
rest		4		 0.0%
hong		4		 0.0%
kong		4		 0.0%
boosted		4		 0.0%
l.		4		 0.0%
expanding		4		 0.0%
giving		4		 0.0%
tender		4		 0.0%
expire		4		 0.0%
sixth		4		 0.0%
portfolios		4		 0.0%
foot		4		 0.0%
associates		4		 0.0%
porter		4		 0.0%
analyst		4		 0.0%
smith		4		 0.0%
swing		4		 0.0%
broader		4		 0.0%
wild		4		 0.0%
premium		4		 0.0%
reason		4		 0.0%
long-term		4		 0.0%
1992		4		 0.0%
integration		4		 0.0%
jumped		4		 0.0%
increasingly		4		 0.0%


boomers		1		 0.0%
texture		1		 0.0%
salty		1		 0.0%
dogs		1		 0.0%
whistle		1		 0.0%
johnny		1		 0.0%
goode		1		 0.0%
bugs		1		 0.0%
bunny		1		 0.0%
mickey		1		 0.0%
spillane		1		 0.0%
groucho		1		 0.0%
harpo		1		 0.0%
desultory		1		 0.0%
reader		1		 0.0%
charm		1		 0.0%
engaging		1		 0.0%
recognizing		1		 0.0%
buttoned-down		1		 0.0%
lore		1		 0.0%
refreshing		1		 0.0%
author		1		 0.0%
self-aggrandizing		1		 0.0%
we-japanese		1		 0.0%
perpetuate		1		 0.0%
unique		1		 0.0%
unfathomable		1		 0.0%
outsiders		1		 0.0%
implicit		1		 0.0%
nutty		1		 0.0%
plot		1		 0.0%
rooted		1		 0.0%
reality		1		 0.0%
imaginative		1		 0.0%
disaffected		1		 0.0%
hard-drinking		1		 0.0%
nearly-30		1		 0.0%
snow		1		 0.0%
search		1		 0.0%
elusive		1		 0.0%
behest		1		 0.0%
sinister		1		 0.0%
erudite		1		 0.0%
mobster		1		 0.0%
degree		1		 0.0%
tow		1		 0.0%
prescient		1		 0.0%
girlfriend		1		 0.0%
sassy		1		 0.0%
retorts		1		 0.0%
docile		1		 0.0%
butterfly		1		 0.0%
meets		1		 0.0%
solicitous		1		 0.0%
chri

rhythm		1		 0.0%
variation		1		 0.0%
memorize		1		 0.0%
methods		1		 0.0%
odd-sounding		1		 0.0%
treble		1		 0.0%
grandsire		1		 0.0%
caters		1		 0.0%
kensington		1		 0.0%
ten		1		 0.0%
shirt-sleeved		1		 0.0%
prize-fighter		1		 0.0%
pulling		1		 0.0%
rope		1		 0.0%
disappears		1		 0.0%
hole		1		 0.0%
snaking		1		 0.0%
muffled		1		 0.0%
totally		1		 0.0%
absorbed		1		 0.0%
stare		1		 0.0%
vision		1		 0.0%
rope-sight		1		 0.0%
thus		1		 0.0%
pulls		1		 0.0%
bronze		1		 0.0%
wheels		1		 0.0%
madly		1		 0.0%
360		1		 0.0%
inverted		1		 0.0%
mouth-up		1		 0.0%
skilled		1		 0.0%
wrists		1		 0.0%
retard		1		 0.0%
well-known		1		 0.0%
detective-story		1		 0.0%
novelist		1		 0.0%
finds		1		 0.0%
satisfaction		1		 0.0%
mathematical		1		 0.0%
completeness		1		 0.0%
perfection		1		 0.0%
filled		1		 0.0%
solemn		1		 0.0%
intoxication		1		 0.0%
intricate		1		 0.0%
ritual		1		 0.0%
faultlessly		1		 0.0%
obsession		1		 0.0%
pattenden		1		 0.0%
stays		1		 0.0%
stuck		1		 0.0%
sweat		1		 0.0%
skip		1		

In [271]:
BATCH_SIZE = 4

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    batch_size = BATCH_SIZE,
    device = device,
    sort_key = lambda x: len(x.text),
    sort_within_batch=False,
    repeat=False)

### Baseline: a Bidirectional LSTM and a Dense/Fully-Connected layer on top.

In [272]:
class BiLSTM(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim, 
                 bidirectional, 
                 dropout, 
                 pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
        self.embedding.weight.requires_grad = False
        self.lstm = nn.LSTM(embedding_dim, 
                            hidden_dim, 
                            num_layers = 1, 
                            bidirectional = bidirectional)
        
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        # pass text through embedding layer
        embedded = self.dropout(self.embedding(text))
        # pass embeddings into LSTM
        outputs, (hidden, cell) = self.lstm(embedded)
        predictions = self.fc(self.dropout(outputs))
        return predictions

INPUT_DIM = len(text.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = len(label.vocab)
BIDIRECTIONAL = True
DROPOUT = 0.25
PAD_IDX = text.vocab.stoi[text.pad_token]
TAG_PAD_IDX = label.vocab.stoi[label.pad_token]

model = BiLSTM(INPUT_DIM, 
                        EMBEDDING_DIM, 
                        HIDDEN_DIM, 
                        OUTPUT_DIM, 
                        BIDIRECTIONAL, 
                        DROPOUT, 
                        PAD_IDX)

In [238]:
class BiGRU(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim, 
                 bidirectional,
                 dropout,
                 pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
        self.embedding.weight.requires_grad = False
        self.gru = nn.GRU(input_size=embedding_dim, hidden_size=hidden_dim, 
                          bidirectional=bidirectional, num_layers=1)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        # pass text through embedding layer
        embedded = self.dropout(self.embedding(text))
        # pass embeddings into LSTM
        outputs, (hidden, cell) = self.gru(embedded)
        predictions = self.fc(self.dropout(outputs))
        return predictions

INPUT_DIM = len(text.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = len(label.vocab)
BIDIRECTIONAL = True
DROPOUT = 0.25
PAD_IDX = text.vocab.stoi[text.pad_token]
TAG_PAD_IDX = label.vocab.stoi[label.pad_token]
'''
model = BiGRU(INPUT_DIM, 
                        EMBEDDING_DIM, 
                        HIDDEN_DIM, 
                        OUTPUT_DIM, 
                        BIDIRECTIONAL, 
                        DROPOUT, 
                        PAD_IDX)'''

'\nmodel = BiGRU(INPUT_DIM, \n                        EMBEDDING_DIM, \n                        HIDDEN_DIM, \n                        OUTPUT_DIM, \n                        BIDIRECTIONAL, \n                        DROPOUT, \n                        PAD_IDX)'

In [239]:
class BiLSTMx2(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim, 
                 n_layers, 
                 bidirectional, 
                 dropout, 
                 pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
        self.embedding.weight.requires_grad = False
        self.lstm = nn.LSTM(embedding_dim, 
                            hidden_dim, 
                            num_layers = n_layers, 
                            bidirectional = bidirectional,
                            dropout = dropout if n_layers > 1 else 0)
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        # pass text through embedding layer
        embedded = self.dropout(self.embedding(text))
        # pass embeddings into LSTM
        outputs, (hidden, cell) = self.lstm(embedded)
        predictions = self.fc(self.dropout(outputs))
        return predictions
    
INPUT_DIM = len(text.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = len(label.vocab)
N_LAYERS = 2 # here we will jave two LSTM layers
BIDIRECTIONAL = True
DROPOUT = 0.25
PAD_IDX = text.vocab.stoi[text.pad_token]
TAG_PAD_IDX = label.vocab.stoi[label.pad_token]
'''
model = BiLSTMx2(INPUT_DIM, 
                        EMBEDDING_DIM, 
                        HIDDEN_DIM, 
                        OUTPUT_DIM, 
                        N_LAYERS, 
                        BIDIRECTIONAL, 
                        DROPOUT, 
                        PAD_IDX)'''

'\nmodel = BiLSTMx2(INPUT_DIM, \n                        EMBEDDING_DIM, \n                        HIDDEN_DIM, \n                        OUTPUT_DIM, \n                        N_LAYERS, \n                        BIDIRECTIONAL, \n                        DROPOUT, \n                        PAD_IDX)'

In [240]:
from torchcrf import CRF
class BiLSTM_CRF(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim, 
                 bidirectional, 
                 dropout, 
                 pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
        self.embedding.weight.requires_grad = False
        self.lstm = nn.LSTM(embedding_dim, 
                            hidden_dim, 
                            num_layers = 1, 
                            bidirectional = bidirectional)
        
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        self.crf = CRF(output_dim)
        
    def forward(self, text):
        # pass text through embedding layer
        embedded = self.dropout(self.embedding(text))
        # pass embeddings into LSTM
        outputs, (hidden, cell) = self.lstm(embedded)
        predictions = self.fc(self.dropout(outputs))
        return predictions
    
INPUT_DIM = len(text.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = len(label.vocab)
BIDIRECTIONAL = True
DROPOUT = 0.25
PAD_IDX = text.vocab.stoi[text.pad_token]
TAG_PAD_IDX = label.vocab.stoi[label.pad_token]
'''
model = BiLSTM_CRF(INPUT_DIM, 
                        EMBEDDING_DIM, 
                        HIDDEN_DIM, 
                        OUTPUT_DIM, 
                        BIDIRECTIONAL, 
                        DROPOUT, 
                        PAD_IDX)'''

'\nmodel = BiLSTM_CRF(INPUT_DIM, \n                        EMBEDDING_DIM, \n                        HIDDEN_DIM, \n                        OUTPUT_DIM, \n                        BIDIRECTIONAL, \n                        DROPOUT, \n                        PAD_IDX)'

We initialize the weights from a simple Normal distribution. Again, there may be a better initialization scheme for this model and dataset.

In [258]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean = 0, std = 0.1)
        
model.apply(init_weights)

BiLSTM(
  (embedding): Embedding(3389, 100, padding_idx=1)
  (lstm): LSTM(100, 128, bidirectional=True)
  (fc): Linear(in_features=256, out_features=46, bias=True)
  (dropout): Dropout(p=0.25, inplace=False)
)

Next, a small function to tell us how many parameters are in our model. Useful for comparing different models.

In [259]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 247,342 trainable parameters


We'll now initialize our model's embedding layer with the pre-trained embedding values we loaded earlier.

This is done by getting them from the vocab's `.vectors` attribute and then performing a `.copy` to overwrite the embedding layer's current weights.

In [260]:
pretrained_embeddings = text.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([3389, 100])


In [261]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.4109,  1.8717, -0.3816,  ...,  0.2594,  1.7012,  2.2465],
        [-0.1555, -0.0150,  0.8936,  ..., -1.1725, -0.3419, -0.0849],
        [-0.1077,  0.1105,  0.5981,  ..., -0.8316,  0.4529,  0.0826],
        ...,
        [ 0.0638,  0.0505, -0.0947,  ...,  0.0947, -0.3449,  0.0679],
        [ 0.3079, -0.6757,  0.2158,  ...,  0.3156,  0.5600,  0.4885],
        [-0.3110, -0.3398,  1.0308,  ...,  0.5317,  0.2836, -0.0640]])

It's common to initialize the embedding of the pad token to all zeros. This, along with setting the `padding_idx` in the model's embedding layer, means that the embedding should always output a tensor full of zeros when a pad token is input.

In [262]:
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[-0.4109,  1.8717, -0.3816,  ...,  0.2594,  1.7012,  2.2465],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.1077,  0.1105,  0.5981,  ..., -0.8316,  0.4529,  0.0826],
        ...,
        [ 0.0638,  0.0505, -0.0947,  ...,  0.0947, -0.3449,  0.0679],
        [ 0.3079, -0.6757,  0.2158,  ...,  0.3156,  0.5600,  0.4885],
        [-0.3110, -0.3398,  1.0308,  ...,  0.5317,  0.2836, -0.0640]])


We then define our optimizer, used to update our parameters w.r.t. their gradients. We use Adam with the default learning rate.

In [263]:
optimizer = optim.Adam(model.parameters())

Next, we define our loss function, cross-entropy loss.

Even though we have no `<unk>` tokens within our tag vocab, we still have `<pad>` tokens. This is because all sentences within a batch need to be the same size. However, we don't want to calculate the loss when the target is a `<pad>` token as we aren't training our model to recognize padding tokens.

We handle this by setting the `ignore_index` in our loss function to the index of the padding token in our tag vocabulary.

In [264]:
TAG_PAD_IDX = label.vocab.stoi[label.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)

We then place our model and loss function on our GPU, if we have one.

In [265]:
model = model.to(device)
criterion = criterion.to(device)

We will be using the loss value between our predicted and actual tags to train the network, but ideally we'd like a more interpretable way to see how well our model is doing - accuracy.

The issue is that we don't want to calculate accuracy over the `<pad>` tokens as we aren't interested in predicting them.

The function below only calculates accuracy over non-padded tokens. `non_pad_elements` is a tensor containing the indices of the non-pad tokens within an input batch. We then compare the predictions of those elements with the labels to get a count of how many predictions were correct. We then divide this by the number of non-pad elements to get our accuracy value over the batch.

In [266]:
def categorical_accuracy(preds, y, tag_pad_idx):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    max_preds = preds.argmax(dim = 1, keepdim = True) # get the index of the max probability
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
    return correct.sum() / torch.FloatTensor([y[non_pad_elements].shape[0]])

Next is the function that handles training our model.

We first set the model to `train` mode to turn on dropout/batch-norm/etc. (if used). Then we iterate over our iterator, which returns a batch of examples. 

For each batch: 
- we zero the gradients over the parameters from the last gradient calculation
- insert the batch of text into the model to get predictions
- as PyTorch loss functions cannot handle 3-dimensional predictions we reshape our predictions
- calculate the loss and accuracy between the predicted tags and actual tags
- call `backward` to calculate the gradients of the parameters w.r.t. the loss
- take an optimizer `step` to update the parameters
- add to the running total of loss and accuracy

In [267]:
def train(model, iterator, optimizer, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        text = batch.text
        tags = batch.label
        
        optimizer.zero_grad()
        predictions = model(text)
        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.view(-1)
        
        loss = criterion(predictions, tags)
                
        acc = categorical_accuracy(predictions, tags, tag_pad_idx)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The `evaluate` function is similar to the `train` function, except with changes made so we don't update the model's parameters.

`model.eval()` is used to put the model in evaluation mode, so dropout/batch-norm/etc. are turned off. 

The iteration loop is also wrapped in `torch.no_grad` to ensure we don't calculate any gradients. We also don't need to call `optimizer.zero_grad()` and `optimizer.step()`.

In [268]:
def evaluate(model, iterator, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text = batch.text
            tags = batch.label
            
            predictions = model(text)
            
            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.view(-1)
            
            loss = criterion(predictions, tags)
            acc = categorical_accuracy(predictions, tags, tag_pad_idx)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Next, we have a small function that tells us how long an epoch takes.

In [269]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, we train our model!

After each epoch we check if our model has achieved the best validation loss so far. If it has then we save the parameters of this model and we will use these "best" parameters to calculate performance over our test set.

In [None]:
N_EPOCHS = 50

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion, TAG_PAD_IDX)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, TAG_PAD_IDX)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 41s
	Train Loss: 3.832 | Train Acc: 2.17%
	 Val. Loss: 3.830 |  Val. Acc: 2.22%
Epoch: 02 | Epoch Time: 0m 42s
	Train Loss: 3.831 | Train Acc: 2.42%
	 Val. Loss: 3.830 |  Val. Acc: 2.22%
Epoch: 03 | Epoch Time: 0m 41s
	Train Loss: 3.832 | Train Acc: 2.30%
	 Val. Loss: 3.830 |  Val. Acc: 2.22%


We then load our "best" parameters and evaluate performance on the test set.

In [101]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion, TAG_PAD_IDX)

print(f'Test Loss: {test_loss:.3f} |  Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.911 |  Test Acc: 74.61%


## Inference

88% accuracy looks pretty good, but let's see our model tag some actual sentences.

We define a `tag_sentence` function that will:
- put the model into evaluation mode
- tokenize the sentence with spaCy if it is not a list
- lowercase the tokens if the `Field` did
- numericalize the tokens using the vocabulary
- find out which tokens are not in the vocabulary, i.e. are `<unk>` tokens
- convert the numericalized tokens into a tensor and add a batch dimension
- feed the tensor into the model
- get the predictions over the sentence
- convert the predictions into readable tags

As well as returning the tokens and tags, it also returns which tokens were `<unk>` tokens.

In [102]:
def tag_sentence(model, device, sentence, text_field, tag_field):
    
    model.eval()
    
    if isinstance(sentence, str):
        nlp = spacy.load('en')
        tokens = [token.text for token in nlp(sentence)]
    else:
        tokens = [token for token in sentence]

    if text_field.lower:
        tokens = [t.lower() for t in tokens]
        
    numericalized_tokens = [text_field.vocab.stoi[t] for t in tokens]

    unk_idx = text_field.vocab.stoi[text_field.unk_token]
    
    unks = [t for t, n in zip(tokens, numericalized_tokens) if n == unk_idx]
    
    token_tensor = torch.LongTensor(numericalized_tokens)
    
    token_tensor = token_tensor.unsqueeze(-1).to(device)
         
    predictions = model(token_tensor)
    
    top_predictions = predictions.argmax(-1)
    
    predicted_tags = [tag_field.vocab.itos[t.item()] for t in top_predictions]
    
    return tokens, predicted_tags, unks

We'll get an already tokenized example from the training set and test our model's performance.

In [103]:
example_index = 1

sentence = vars(train_data.examples[example_index])['text']
actual_tags = vars(train_data.examples[example_index])['label']

print(sentence)

['rudolph', 'agnew', ',', '55', 'years', 'old', 'and', 'former', 'chairman', 'of', 'consolidated', 'gold', 'fields', 'plc', ',', 'was', 'named', 'a', 'nonexecutive', 'director', 'of', 'this', 'british', 'industrial', 'conglomerate', '.']


We can then use our `tag_sentence` function to get the tags. Notice how the tokens referring to subject of the sentence, the "respected cleric", are both `<unk>` tokens!

In [104]:
tokens, pred_tags, unks = tag_sentence(model, 
                                       device, 
                                       sentence, 
                                       text, 
                                       label)

print(unks)

['agnew', 'conglomerate']


We can then check how well it did. Surprisingly, it got every token correct, including the two that were unknown tokens!

In [106]:
print("Pred. Tag\tActual Tag\tCorrect?\tToken\n")

for token, pred_tag, actual_tag in zip(tokens, pred_tags, actual_tags):
    correct = '✔' if pred_tag == actual_tag else '✘'
    print(f"{pred_tag}\t\t{actual_tag}\t\t{correct}\t\t{token}")

Pred. Tag	Actual Tag	Correct?	Token

NNP		NNP		✔		rudolph
NNP		NNP		✔		agnew
,		,		✔		,
CD		CD		✔		55
CD		NNS		✘		years
NN		JJ		✘		old
CC		CC		✔		and
NNP		JJ		✘		former
NN		NN		✔		chairman
IN		IN		✔		of
NNP		NNP		✔		consolidated
NNP		NNP		✔		gold
NNP		NNP		✔		fields
NNP		NNP		✔		plc
,		,		✔		,
VBD		VBD		✔		was
NNP		VBN		✘		named
DT		DT		✔		a
NN		JJ		✘		nonexecutive
NN		NN		✔		director
IN		IN		✔		of
DT		DT		✔		this
NNP		JJ		✘		british
NN		JJ		✘		industrial
NN		NN		✔		conglomerate
.		.		✔		.


### F1 macro score

In [152]:
F1 = []
for i in range(len(test_data.examples)):
    actual_tags = vars(test_data.examples[i])['label']
    sentence = vars(test_data.examples[i])['text']
    _, pred_tags, _ = tag_sentence(model, device, sentence, text, label)
    F1.append(f1_score(actual_tags, pred_tags, average='macro'))

print("The F1 macro score is: ", np.mean(F1))

The F1 macro score is:  0.6291287181722861


In [177]:
ll = [['ciao', 'fra',','], ['ciao', ':', ',','a']]
punc = '''!()-[]{};:'"\ , <>./?@#$%^&*_~'''

In [178]:
for elem in ll:
    for e in elem:
        if e in punc:
            elem.remove(e)

In [179]:
ll

[['ciao', 'fra'], ['ciao', ',', 'a']]

In [183]:
import string
'cia,fra'.translate(str.maketrans('', '', string.punctuation))

'ciafra'