# Introduction
In `3. Part of Speech Tagging - Second attempt with RNN` I did Part of Speech (POS) tagging with several RNNs. My most successful model was inspired by [Wang et al., 2015](https://arxiv.org/pdf/1510.06168.pdf), who used a model originally from [Graves, 2002](https://www.cs.toronto.edu/~graves/preprint.pdf):
two LSTM layers with 93 hidden units, one doing a forward pass over the input and the other a backward pass.

I encoded all words using word embeddings trained by [GloVe](https://nlp.stanford.edu/projects/glove/), Stanford's embedding model.

In this notebook I will use a similar approach, but this time I will implement my model in PyTorch. Hopefully the increased control will help me learn and experiment.

# The data
I will be using the same training data for my tagger as in all my previous notebooks:
[Universal Dependencies - English Web Treebank](http://universaldependencies.org/treebanks/en_ewt/index.html), a CoNLL-U format corpus with 254 830 words and 16 622 sentences in English *taken from various web media including weblogs, newsgroups, emails, reviews, and Yahoo! answers*.

## Load the Data
First, let's load the training data and convert it to a Python dictionary.
I use the [conllu](https://github.com/EmilStenstrom/conllu) Python package to parse the CoNLL-U files to dictionaries.
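
To make the format concrete: each token in a CoNLL-U file is one line of ten tab-separated columns, with the word form in column 2 and the universal POS tag in column 4. Sentences are separated by blank lines, and comment lines start with `#`. The sketch below reads just those two columns by hand from a made-up two-token sentence. It is only meant to illustrate the format: the `read_forms_and_tags` helper and the sample text are hypothetical, and the rest of this notebook relies on `conllu.parse` instead.

```python
def read_forms_and_tags(text):
    """Return a list of sentences, each a list of (form, upos) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith('#'):
            # Sentence-level comments such as '# text = ...'
            continue
        else:
            columns = line.split('\t')
            # Keep only plain word lines; skip IDs like '1-2' or '1.1'
            if columns[0].isdigit():
                current.append((columns[1], columns[3]))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample in CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, ...)
sample = (
    "# text = Hello world\n"
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\tNN\t_\t1\tvocative\t_\t_\n"
    "\n"
)
print(read_forms_and_tags(sample))
# [[('Hello', 'INTJ'), ('world', 'NOUN')]]
```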

In [1]:
import numpy as np
import conllu

Read the data.

In [2]:
directory = 'UD/UD_English-EWT'
with open('{}/en_ewt-ud-train.conllu'.format(directory), 'r', encoding='utf-8') as f:
    train_text = f.read()

with open('{}/en_ewt-ud-dev.conllu'.format(directory), 'r', encoding='utf-8') as f:
    dev_text = f.read()

with open('{}/en_ewt-ud-test.conllu'.format(directory), 'r', encoding='utf-8') as f:
    test_text = f.read()

Convert it to a dictionary.

In [3]:
train_dict = conllu.parse(train_text)
dev_dict = conllu.parse(dev_text)
test_dict = conllu.parse(test_text)

Count sentences and tokens.

In [4]:
from functools import reduce

n_train_sentences = len(train_dict)
n_train_tokens = reduce(lambda x, y: x + len(y), train_dict, 0)

print("The training set contains {} sentences and {} tokens".format(n_train_sentences, n_train_tokens))

The training set contains 12543 sentences and 204607 tokens


In [14]:
train_sentences = [[token['form'] for token in sentence] for sentence in train_dict]
train_labels = [[token['upostag'] for token in sentence] for sentence in train_dict]

dev_sentences = [[token['form'] for token in sentence] for sentence in dev_dict]
dev_labels = [[token['upostag'] for token in sentence] for sentence in dev_dict]

test_sentences = [[token['form'] for token in sentence] for sentence in test_dict]
test_labels = [[token['upostag'] for token in sentence] for sentence in test_dict]

In [6]:
# Sort the tags so that the tag-to-index mapping is the same on every run
pos_tags = sorted(set(tag for sentence in train_labels for tag in sentence))

In [7]:
n_labels = len(pos_tags)
n_labels

17

# Feature Engineering
[Wang et al.](https://arxiv.org/pdf/1510.06168.pdf) showed that a bidirectional LSTM network could achieve state-of-the-art performance without using any morphological features. They only used these features:
* Word embedding of the word (cast to lower case). Embeddings trained by the same architecture, but on another task.
* Suffix of length two, one-hot encoded
* Whether the word is all caps, lower case, or has an initial capital letter. One-hot encoded.
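
The two hand-crafted features in this list are simple to compute. The following is a minimal sketch and not Wang et al.'s code: the helper names, the extra "other" bucket in the capitalisation feature, and the tiny suffix vocabulary are assumptions made for illustration.

```python
def capitalization_one_hot(word):
    """One-hot over [all caps, all lower case, initial capital, other]."""
    if word.isupper():
        index = 0
    elif word.islower():
        index = 1
    elif word[:1].isupper():
        index = 2
    else:
        index = 3  # e.g. digits, punctuation, or mixed case like 'iPhone'
    vector = [0, 0, 0, 0]
    vector[index] = 1
    return vector


def suffix_one_hot(word, suffix_index):
    """One-hot over known two-character suffixes; all zeros if the suffix is unseen."""
    vector = [0] * len(suffix_index)
    suffix = word.lower()[-2:]
    if suffix in suffix_index:
        vector[suffix_index[suffix]] = 1
    return vector


# Hypothetical suffix vocabulary; in practice it would be collected from the training set
suffix_index = {'ed': 0, 'ly': 1, 'ng': 2}

print(capitalization_one_hot('NASA'))           # [1, 0, 0, 0]
print(capitalization_one_hot('London'))         # [0, 0, 1, 0]
print(suffix_one_hot('Quickly', suffix_index))  # [0, 1, 0]
```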

Other papers, like [Xiao et al.](https://arxiv.org/abs/1809.01997), complement word embeddings with character embeddings. 
It would be interesting to experiment with this.

## GloVe
I will use the 100-dimensional GloVe 6B data. My vocabulary will be exactly the words inside the pretrained model.

In [9]:
glove_path = 'glove/glove.6B/glove.6B.100d.txt'
embeddings = {}
token_index = {}
index_token = {}
with open(glove_path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        tok, *vec = line.split()
        embeddings[tok] = np.array(vec, dtype='float32')
        # Reserve index 0 for padding
        token_index[tok] = i + 1
        index_token[i+1] = tok
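
Each line of the GloVe text file is a token followed by its vector components, separated by spaces. Because index 0 is reserved for padding, the token on line `i` of the file gets index `i + 1`. Any matrix built from these vectors therefore needs an extra row at position 0 for the lookups to line up. A small sketch with made-up three-dimensional vectors (the sample lines and the all-zero padding row are assumptions for illustration):

```python
import io

# Two made-up lines in the GloVe text format: token, then vector components
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

vectors = {}
token_index = {}
for i, line in enumerate(fake_glove):
    tok, *vec = line.split()
    vectors[tok] = [float(v) for v in vec]
    token_index[tok] = i + 1  # index 0 stays free for padding

# Row 0 is the padding row, so row token_index[tok] holds the vector of tok
dim = len(vectors['the'])
matrix = [[0.0] * dim] + [vectors[tok] for tok in token_index]

print(token_index)                 # {'the': 1, 'cat': 2}
print(matrix[token_index['cat']])  # [0.4, 0.5, 0.6]
```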

I know from my last notebook that my data has many words that are out of vocabulary (OOV) for GloVe. I addressed this by pre-processing my data. In this notebook I condense that pre-processing into a single function; see the previous notebook for the rationale behind each step.

I will apply the pre-processing to all three data sets: train, dev and test. As the pre-processing is not fitted to the data, it could be applied to new data in production as well.

In [10]:
import re
def allign_to_vocab(vocab, sentences):
    
    # Find all OOV tokens
    oov = find_oov(vocab, sentences)
    print("%d OOV tokens before processing" % len(oov))
    
    # Convert to lower case
    sentences = [[token.lower() for token in sentence] for sentence in sentences]
    
    oov = find_oov(vocab, sentences)
    print("%d OOV tokens after converting to lower case" % len(oov))
    
    # Replace URL's with 'url'
    sentences = [[convert_url(token) for token in sentence] for sentence in sentences]
    
    oov = find_oov(vocab, sentences)
    print("%d OOV tokens after converting urls" % len(oov))
    
    # Build spelling correction dictionary
    # Search for vocabulary words within Damerau-Levenshtein distance 1
    new_spelling = dict([(word, find_one_neighbour(word, vocab)) for word in oov])
    sentences = [[new_spelling[token] if token in new_spelling else token for token in sentence] for sentence in sentences]
    
    oov = find_oov(vocab, sentences)
    print("%d OOV tokens after spelling correction" % len(oov))
    
    # Replace OOV words with 'unk'
    # See https://stackoverflow.com/questions/49239941/what-is-unk-in-glove-6b-50d-txt
    sentences = [['unk' if token in oov else token for token in sentence] for sentence in sentences]
    
    return sentences

def find_oov(vocab, sentences):
    oov = []
    for sentence in sentences:
        for tok in sentence:
            if tok not in vocab:
                oov.append(tok)
    return oov
     
def convert_url(token):
    # Match words starting with www., http:// or https://
    if re.match(r'^(?:https?://|www\.)', token):
        return "url"
    else:
        return token

# Checks for vocabulary words within Damerau-Levenshtein distance 1 and returns the first match
# Logic inspired by http://norvig.com/spell-correct.html
def find_one_neighbour(word, vocab):
    
    ascii_vocab = [str(chr(i)) for i in range(32, 127)]
    
    # Tuples with all possible splits of word
    splits = [(word[:i], word[i:]) for i in range(len(word))]
    
    # All words generated by deleting one character
    for L, R in splits:
        candidate = L + R[1:] if R else None
        if candidate in vocab:
            return candidate
    # All words generated by swapping two adjacent characters in word
    for L, R in splits:    
        candidate = L + R[1] + R[0] + R[2:] if len(R) > 1 else None
        if candidate in vocab:
            return candidate
    # All words generated by inserting a character in word
    for L, R in splits:
        for c in ascii_vocab:    
            candidate = L + c + R 
            if candidate in vocab:
                return candidate
    # All words generated by replacing a character in word
    for L, R in splits:
        for c in ascii_vocab:
            candidate = L + c + R[1:] if R else None
            if candidate in vocab:
                return candidate
        
    return word
    

In [15]:
%%time
print("Training set:")
train_sentences = allign_to_vocab(embeddings, train_sentences)

print("\nDev set:")
dev_sentences = allign_to_vocab(embeddings, dev_sentences)

print("\nTest set:")
test_sentences = allign_to_vocab(embeddings, test_sentences)

Training set:
31192 OOV tokens before processing
2442 OOV tokens after converting to lower case
2310 OOV tokens after converting urls
1066 OOV tokens after spelling correction

Dev set:
4362 OOV tokens before processing
463 OOV tokens after converting to lower case
424 OOV tokens after converting urls
215 OOV tokens after spelling correction

Test set:
4589 OOV tokens before processing
522 OOV tokens after converting to lower case
483 OOV tokens after converting urls
267 OOV tokens after spelling correction
Wall time: 8.88 s
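
To see what the steps do to individual tokens, here is a toy per-token version of the pipeline. It is a simplified sketch that leaves out the spelling-correction step; `normalize_token` and the four-word vocabulary are hypothetical and only serve as an illustration.

```python
import re

def normalize_token(token, vocab):
    """Toy per-token version of the pipeline, without the spelling-correction step."""
    token = token.lower()
    # Tokens starting with http://, https:// or www. become 'url'
    if re.match(r'^(?:https?://|www\.)', token):
        token = 'url'
    # Anything still outside the vocabulary becomes 'unk'
    return token if token in vocab else 'unk'


toy_vocab = {'the', 'cat', 'url', 'unk'}
sentence = ['The', 'cat', 'visited', 'www.example.com']
print([normalize_token(tok, toy_vocab) for tok in sentence])
# ['the', 'cat', 'unk', 'url']
```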


## Encoding the data
I will encode the targets and the tokens as integers.

### Labels
First build a map from pos tag to an index.

In [175]:
import torch

pos_tag_index = {pos: i for i, pos in enumerate(pos_tags)}

In [176]:
Y_train = [torch.from_numpy(np.asarray([pos_tag_index[pos] for pos in sentence])).long()  for sentence in train_labels]
Y_dev = [torch.from_numpy(np.asarray([pos_tag_index[pos] for pos in sentence])).long()  for sentence in dev_labels]
Y_test = [torch.from_numpy(np.asarray([pos_tag_index[pos] for pos in sentence])).long()  for sentence in test_labels]

### Tokens
I already built the mapping from token to index. I also replaced all OOV tokens with `unk`, so all words will be in the vocabulary. 

In [159]:
X_train = [torch.from_numpy(np.asarray([token_index[tok] for tok in sentence])).long() for sentence in train_sentences]
X_dev = [torch.from_numpy(np.asarray([token_index[tok]  for tok in sentence])).long()  for sentence in dev_sentences]
X_test = [torch.from_numpy(np.asarray([token_index[tok] for tok in sentence])).long()  for sentence in test_sentences]

# RNN

## Embedding Layer
Create a frozen embedding layer from the GloVe data.

In [23]:
import torch

Load the data into a tensor.

In [34]:
embedding_dims = 100
embedding_matrix = torch.from_numpy(np.array(list(embeddings.values())))
n_words = len(embedding_matrix)

Create a frozen embedding layer.

In [35]:
import torch.nn as nn

In [36]:
embedding_matrix.shape

torch.Size([400000, 100])

In [37]:
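def create_embeddings(weights, frozen=True):
    
    # token_index reserves index 0 for padding, so prepend an all-zero row
    # to keep row i + 1 aligned with the i-th GloVe vector
    padding_row = torch.zeros(1, weights.shape[1], dtype=weights.dtype)
    weights = torch.cat([padding_row, weights], dim=0)
    
    # freeze=True sets requires_grad to False, so the pretrained vectors are not updated
    embedding_layer = nn.Embedding.from_pretrained(weights, freeze=frozen, padding_idx=0)
    
    return embedding_layer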

In [38]:
embedding_layer = create_embeddings(embedding_matrix)

## BLSTM 1
Let's implement the BLSTM introduced by Graves, with our GloVe embeddings as the first layer.

In [39]:
import torch.nn.functional as F

In [122]:
class BLSTM1(nn.Module):
    def __init__(self, lstm_dim, n_classes, embedding_weights):
        super(BLSTM1, self).__init__()
        
        # Variables
        self.lstm_dim = lstm_dim
        self.vocab_size, self.embedding_dim = embedding_weights.shape
        self.n_classes = n_classes
        
        # Layers
        self.embedding = create_embeddings(embedding_weights)
        self.lstm = nn.LSTM(self.embedding_dim, lstm_dim, batch_first=True, bidirectional=True)
        self.output = nn.Linear(self.lstm_dim * 2, self.n_classes)
        
    def forward(self, sentence):
        
        # Embeddings
        embedded = self.embedding(sentence).unsqueeze(0)
        
        # LSTM
        out, _ = self.lstm(embedded)
        
        # Fully Connected
        return F.log_softmax(self.output(out), 2)
        

In [123]:
blstm1 = BLSTM1(96, n_labels, embedding_matrix)

In [126]:
import torch.optim as optim

In [128]:
criterion = nn.NLLLoss()
optimizer = optim.Adam(blstm1.parameters())
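
For reference, `F.log_softmax` followed by `nn.NLLLoss` (with its default mean reduction) gives the average cross-entropy over the tokens of a sentence. The symbols are my own notation, not taken from the papers: $z_{t,k}$ is the output-layer score of tag $k$ at position $t$, $y_t$ the gold tag index, $T$ the sentence length and $K$ the number of tags.

$$
\log p_{t,k} = z_{t,k} - \log \sum_{j=1}^{K} \exp(z_{t,j}), \qquad
\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_{t,\,y_t}
$$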

In [187]:
from tqdm import tqdm

# Function inspired by https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
def train_model(model, X_train, y_train, optimizer, criterion, X_dev = None, y_dev = None, epochs = 2):
    
    # Only run the validation phase if dev data is provided
    phases = ['train']
    data_dict = {'train' : list(zip(X_train, y_train))} 
    if X_dev and y_dev:
        phases.append('val')
        data_dict['val'] = list(zip(X_dev, y_dev))
        
    for epoch in range(epochs):
        
        print('Epoch {}/{}'.format(epoch + 1, epochs))
        print('-' * 10)

        for phase in phases:
            
            data = data_dict[phase]
            
            # Only update model weights based on the training data
            if phase == 'train':
                model.train()
            else:
                model.eval()
                
            running_loss = 0.0
            running_corrects = 0
            
            for seq, labels in tqdm(data, total = len(data)):
                
                #labels = torch.autograd.Variable(labels).type(torch.LongTensor)

                optimizer.zero_grad()
                
                # Only track history during training
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(seq).squeeze(0)
                    loss = criterion(outputs, labels)
                    predictions = torch.argmax(outputs, dim=1)
                    
                    # Only perform backpropagation during training
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                    
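                # Save statistics (loss.item() is the mean over the tokens of one sentence)
                running_loss += loss.item() * seq.size(0)
                running_corrects += torch.sum(predictions == labels)
                
            # Note: len(data) is the number of sentences, so these are per-sentence
            # averages (summed token loss and correctly tagged tokens per sentence),
            # not per-token values; divide by the total token count for a true accuracy
            epoch_loss = running_loss / len(data)
            epoch_acc = running_corrects.double() / len(data)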

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))
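
One thing to keep in mind when reading the numbers this function prints: both sums are accumulated per token but divided by `len(data)`, the number of sentences. The printed `Acc` is therefore the average number of correctly tagged tokens per sentence and can exceed 1. A per-token accuracy would divide by the total token count instead. A minimal sketch with plain lists, where `token_accuracy` is a hypothetical helper and not part of the notebook:

```python
def token_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for pred_sentence, gold_sentence in zip(predicted, gold):
        correct += sum(p == g for p, g in zip(pred_sentence, gold_sentence))
        total += len(gold_sentence)
    return correct / total if total else 0.0


# Two toy sentences with tag indices: 3 of 5 tokens are correct
predicted = [[1, 2, 3], [4, 4]]
gold = [[1, 2, 0], [4, 5]]
print(token_accuracy(predicted, gold))  # 0.6
```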
                
            

In [188]:
train_model(blstm1, X_train[:10], Y_train[:10], optimizer, criterion, X_dev[:10], Y_dev[:10])

Epoch 1/2
----------


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:05<00:00,  1.80it/s]


train Loss: 27.4362 Acc: 12.9000


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 39.95it/s]


val Loss: 27.4377 Acc: 9.7000
Epoch 2/2
----------


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:05<00:00,  1.82it/s]


train Loss: 24.7206 Acc: 15.1000


100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 37.84it/s]


val Loss: 26.2986 Acc: 10.0000


Okay, so I constructed my model and training function. Unfortunately training is very slow now that I am processing just one sentence at a time. Maybe I should consider introducing padding to allow batch processing?
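
A possible direction, sketched under assumptions rather than tested in this notebook: `torch.nn.utils.rnn.pad_sequence` can stack variable-length index tensors into one batch. Padding tokens with 0 matches the reserved padding index, and padding labels with -100 matches the default `ignore_index` of `nn.NLLLoss`, so padded positions do not contribute to the loss. The toy tensors and the random stand-in for the model output are made up.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Two toy "sentences" of different length, as token indices and tag indices
X = [torch.tensor([5, 8, 2]), torch.tensor([7, 3])]
Y = [torch.tensor([0, 1, 2]), torch.tensor([1, 0])]

# Pad to the longest sequence in the batch
X_batch = pad_sequence(X, batch_first=True, padding_value=0)     # shape (2, 3)
Y_batch = pad_sequence(Y, batch_first=True, padding_value=-100)  # shape (2, 3)

# Stand-in for model output: log-probabilities of shape (batch, seq_len, n_classes)
n_classes = 4
log_probs = torch.log_softmax(torch.randn(2, 3, n_classes), dim=2)

# NLLLoss expects (N, C) and (N,), so flatten the batch and time dimensions
criterion = nn.NLLLoss()  # ignore_index defaults to -100
loss = criterion(log_probs.view(-1, n_classes), Y_batch.view(-1))

print(X_batch)
print(Y_batch)
print(loss.item())
```

With batches like this, the model's `forward` would no longer need `unsqueeze(0)`. Since a bidirectional LSTM would otherwise read the padding in its backward pass, `torch.nn.utils.rnn.pack_padded_sequence` is worth a look as well.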

I'll continue tomorrow.