# POS Tagging

Develop a POS tagger using PyTorch.

### Reading in data using torchtext


In [61]:
from torchtext.data import Field, NestedField
from torchtext.datasets import SequenceTaggingDataset 

PAD="<pad>"
UNK="<unk>"
START="<start>"
END="<end>"

We will use the [Universal Dependencies](https://universaldependencies.org/) (UD) dataset for English. Sentences in UD datasets are annotated for dependency syntax, which means that each token is marked for POS, morphological features, its syntactic head and its syntactic role (like predicate or subject). Below you can see an example sentence in UD annotation, also called the CoNLL-U format:
```
1   This    this    DET     DT    Number=Sing|PronType=Dem                                2   det     _       _
2   item    item    NOUN    NN    Number=Sing                                             6   nsubj   _       _
3   is      be      AUX     VBZ   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   6   cop     _       _
4   a       a       DET     DT    Definite=Ind|PronType=Art                               6   det     _       _
5   small   small   ADJ     JJ    Degree=Pos                                              6   amod    _       _
6   one     one     NOUN    NN    Number=Sing                                             0   root    _       _
7   and     and     CCONJ   CC    _                                                       9   cc      _       _
8   easily  easily  ADV     RB    _                                                       9   advmod  _       _
9   missed  miss    VERB    VBN   Tense=Past|VerbForm=Part                                6   conj    _       _
10  .       .       PUNCT   .     _                                                       6   punct   _       _
```
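
As a rough illustration of the format, the following sketch pulls the word form (second column) and the universal POS tag (fourth column) out of CoNLL-U token lines using plain Python. The helper name `read_conllu_tokens` is hypothetical and not part of torchtext. Real CoNLL-U files use tab-separated columns; the fallback to whitespace splitting is an assumption made only so that the aligned listing above can be parsed too.

```python
def read_conllu_tokens(lines):
    """Yield (form, upos) pairs from the token lines of one CoNLL-U sentence."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and sentence-level comment lines.
        if not line or line.startswith("#"):
            continue
        # Real CoNLL-U columns are tab-separated; fall back to any whitespace
        # so that an aligned listing can be parsed as well.
        columns = line.split("\t") if "\t" in line else line.split()
        # Column order: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        yield columns[1], columns[3]


sentence = """
1   This    this    DET     DT    Number=Sing|PronType=Dem    2   det     _   _
2   item    item    NOUN    NN    Number=Sing                 6   nsubj   _   _
3   is      be      AUX     VBZ   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   6   cop   _   _
"""

tokens = list(read_conllu_tokens(sentence.splitlines()))
words = [form for form, _ in tokens]
tags = [upos for _, upos in tokens]
print(words)  # ['This', 'item', 'is']
print(tags)   # ['DET', 'NOUN', 'AUX']
```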

We will now implement a `torchtext.data.Dataset` which reads data in UD format. The dataset class `UDData` is a subclass of [`torchtext.datasets.SequenceTaggingDataset`](https://torchtext.readthedocs.io/en/latest/datasets.html#sequence-tagging) because our inputs are sentences consisting of several word forms. Your task is to make `UDData.splits` return a training, development and test set containing examples which have the members `word`, `char` and `pos`. Consider the following sentence:

```
1   This       this      PRON   PRN   Number=Sing|PronType=Dem                                2   nsubj   _   _
2   is         be        AUX    AUX   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4   cop     _   _
3   a          a         DET    DET   _                                                       4   det     _   _
4   sentence   sentence  NOUN   NOUN  Number=Sing                                             0   root    _   _
```

The corresponding example should have `word` equal to `["This", "is", "a", "sentence"]`, `char` equal to the same word forms split into characters (`[["T", "h", "i", "s"], ["i", "s"], ["a"], ["s", "e", "n", "t", "e", "n", "c", "e"]]`) and `pos` equal to `["PRON", "AUX", "DET", "NOUN"]`.


#### Defining `torchtext.data.Field` objects and downloading data

In [62]:

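# Word-level field: each sentence is wrapped in START and END symbols.
WORD = Field(sequential=True,
             pad_token=PAD,
             unk_token=UNK,
             init_token=START,
             eos_token=END)
# Character-level field: tokenize=list splits a word into its characters.
NESTING_FIELD = Field(tokenize=list,
                      pad_token=PAD,
                      unk_token=UNK,
                      init_token=START,
                      eos_token=END)
# Applies NESTING_FIELD to every word of a sentence.
CHAR = NestedField(nesting_field=NESTING_FIELD, include_lengths=True)
# POS tags: one tag per word, no START or END symbols.
POS = Field(sequential=True)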


assert(WORD.preprocess("This is a sentence") == ["This", "is", "a", "sentence"]) 
assert(CHAR.preprocess("This is a sentence") == [["T", "h", "i", "s"], 
                                                 ["i", "s"], 
                                                 ["a"], 
                                                 ["s", "e", "n", "t", "e", "n", "c", "e"]])
assert(POS.preprocess("PRON AUX DET NOUN") == ["PRON", "AUX", "DET", "NOUN"])

Download and read Universal Dependencies data.

In [63]:
class UDData(SequenceTaggingDataset):
    """
    Universal Dependencies treebanks.
    Download original at http://universaldependencies.org/
    License: http://creativecommons.org/licenses/by-sa/4.0/
    
    UDData defines a data loader and reader for UD treebanks. Since we inherit 
    SequenceTaggingDataset, the only method we need to define is splits which 
    returns the training, development and test set.
    """
    
    urls = ['https://mpsilfve.github.io/assets/uddata.zip']
    dirname = 'uddata'
    name = 'uddata'

    
    @classmethod
    def splits(cls, language, root="data", **kwargs):
        """
        Downloads and reads Universal Dependencies Version 2 data. The function
        returns three torchtext.data.Dataset objects: train, dev and test, which
        contain torchtext.data.Example objects.

        The language parameter should be set to "en" for English, "es" for
        Spanish and "fi" for Finnish.

        The variable FIELDS determines how UDData treats the fields in the
        CoNLL-U format. It consists of 10 entries, one for each of the 10
        fields in the CoNLL-U format. Counting from zero, we are interested in
        field 1 (word form) and field 3 (POS tag). We don't want to extract any
        information for the remaining fields, so their entries are (None, None).
    
        Each entry in FIELDS is a pair (name, field) where field refers to the 
        torchtext.data.Field object that handles the information stored in this 
        field. The information is stored as the variable 
        torchtext.data.Example.name. 
        
        Field 1 is special because we extract two kinds of information: the word 
        form as a monolithic token and the same word form as a character sequence. 
        That is why FIELDS[1] extracts two name values: 'word' and 'char'.  
        """
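        FIELDS = ((None, None),                       # 0: ID
                  (('word', 'char'), (WORD, CHAR)),   # 1: FORM
                  (None, None),                       # 2: LEMMA
                  ('pos', POS),                       # 3: UPOS
                  (None, None),                       # 4: XPOS
                  (None, None),                       # 5: FEATS
                  (None, None),                       # 6: HEAD
                  (None, None),                       # 7: DEPREL
                  (None, None),                       # 8: DEPS
                  (None, None))                       # 9: MISC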
        
        return super(UDData, cls).splits(
                fields=FIELDS, 
                root=root, 
                train="%s-ud-train.conllu.head" % language, 
                validation="%s-ud-dev.conllu" % language,
                test="%s-ud-test.conllu" % language, **kwargs)
    
# Read Universal Dependencies v2 training, development and test sets for English.
train, dev, test = UDData.splits(language="en")

# Print the first example in the English training set.
ex = next(iter(train))
print(ex.word)
print(ex.char)
print(ex.pos)

['Al', '-', 'Zaman', ':', 'American', 'forces', 'killed', 'Shaikh', 'Abdullah', 'al', '-', 'Ani', ',', 'the', 'preacher', 'at', 'the', 'mosque', 'in', 'the', 'town', 'of', 'Qaim', ',', 'near', 'the', 'Syrian', 'border', '.']
[['A', 'l'], ['-'], ['Z', 'a', 'm', 'a', 'n'], [':'], ['A', 'm', 'e', 'r', 'i', 'c', 'a', 'n'], ['f', 'o', 'r', 'c', 'e', 's'], ['k', 'i', 'l', 'l', 'e', 'd'], ['S', 'h', 'a', 'i', 'k', 'h'], ['A', 'b', 'd', 'u', 'l', 'l', 'a', 'h'], ['a', 'l'], ['-'], ['A', 'n', 'i'], [','], ['t', 'h', 'e'], ['p', 'r', 'e', 'a', 'c', 'h', 'e', 'r'], ['a', 't'], ['t', 'h', 'e'], ['m', 'o', 's', 'q', 'u', 'e'], ['i', 'n'], ['t', 'h', 'e'], ['t', 'o', 'w', 'n'], ['o', 'f'], ['Q', 'a', 'i', 'm'], [','], ['n', 'e', 'a', 'r'], ['t', 'h', 'e'], ['S', 'y', 'r', 'i', 'a', 'n'], ['b', 'o', 'r', 'd', 'e', 'r'], ['.']]
['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'AD

#### Build Vocabularies


In [64]:
WORD.build_vocab(train)
CHAR.build_vocab(train)
POS.build_vocab(train)


In [65]:

print("WORD:",len(WORD.vocab.stoi))
print("CHAR:",len(CHAR.vocab.stoi))
print("POS:",len(POS.vocab.stoi))


WORD: 4740
CHAR: 96
POS: 19
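
Conceptually, a vocabulary is a pair of mappings between strings and integer indices, with the special symbols placed first. The sketch below mimics this in plain Python. `build_vocab_sketch` is a hypothetical helper and a simplification, not torchtext's actual implementation; in particular, the order of the special symbols and the handling of frequency ties are assumptions.

```python
from collections import Counter

def build_vocab_sketch(sentences, specials=("<unk>", "<pad>", "<start>", "<end>")):
    """Return (itos, stoi) with specials first, then tokens by descending frequency."""
    counts = Counter(token for sentence in sentences for token in sentence)
    # Sort by frequency (most frequent first); break ties alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


toy_sentences = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
itos, stoi = build_vocab_sketch(toy_sentences)

print(len(itos))     # 9 (4 specials + 5 word types)
print(stoi["the"])   # 4
# Unknown words fall back to the index of <unk>.
print(stoi.get("bird", stoi["<unk>"]))  # 0
```

Under this view, the reported sizes count the special symbols as well. For instance, the POS vocabulary size of 19 is consistent with the 17 universal POS tags plus the default `<unk>` and `<pad>` symbols of `Field`, assuming all 17 tags occur in the training data.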


### Simple baseline tagger

To be able to gauge the performance of our deep learning tagger, we'll now implement a baseline majority label classifier.

#### Counting tags

In [66]:
from collections import defaultdict, Counter

# A counter for POS tags. tag_counts[wf][pos] should denote the number of times we saw the word wf with 
# POS tag pos in the training data.
tag_counts = defaultdict(Counter)

# Populate tag_counts with the counts of different POS tags for each word type in the training data. 

for d in train:
    words = d.word
    tags = d.pos
    for word, tag in zip(words, tags):
        tag_counts[word][tag] += 1
        
# assert the answer
assert(tag_counts["this"]["DET"] == 46)
assert(tag_counts["this"]["PRON"] == 20)
assert(tag_counts["this"]["ADV"] == 1)

#### Tagging the development data

In [67]:
output_tags = []

for ex in dev:
    output_tags.append([])
    for wf in ex.word:
        if wf in tag_counts:
            # Known word: pick the tag seen most often with it in training.
            output_tags[-1].append(max(tag_counts[wf], key=tag_counts[wf].get))
        else:
            # Unknown word: fall back to NOUN.
            output_tags[-1].append("NOUN")
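
The same majority-label idea can be tried on toy data without torchtext. The sketch below is self-contained; the function names `train_majority_tagger` and `majority_tag` and the toy sentences are made up for illustration.

```python
from collections import Counter, defaultdict

def train_majority_tagger(tagged_sentences):
    """Count how often each word form occurs with each tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return counts

def majority_tag(counts, word, default="NOUN"):
    """Return the most frequent tag for word, or default for unseen words."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default


toy_train = [
    [("this", "DET"), ("book", "NOUN")],
    [("this", "PRON"), ("works", "VERB")],
    [("this", "DET"), ("works", "VERB")],
]
toy_counts = train_majority_tagger(toy_train)
predicted = [majority_tag(toy_counts, w) for w in ["this", "works", "zebra"]]
print(predicted)  # ['DET', 'VERB', 'NOUN']
```

Here "this" is seen twice as `DET` and once as `PRON`, so it is tagged `DET`, while the unseen word "zebra" receives the fallback tag.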
        

Calculate the baseline accuracy:

In [68]:
def accuracy(sys, gold):
    """
    Return the token-level tagging accuracy (in percent) of the system
    output sys, a list of tag lists, w.r.t. a gold standard data set gold
    whose examples carry the reference tags in their pos member.
    """
    assert(len(sys) == len(gold))
    corr = 0
    tot = 0
    for s, g in zip(sys, gold):
        assert(len(s) == len(g.pos))
        corr += sum(x == y for x, y in zip(s, g.pos))
        tot += len(s)
    return corr * 100.0 / tot

print("Accuracy for baseline majority class tagger:",accuracy(output_tags,dev))

Accuracy for baseline majority class tagger: 77.36599331954828
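
To make the calling convention of the accuracy function concrete, here is a small self-contained sketch of the same token-level computation on toy data. `token_accuracy` is a stand-in name, and the `SimpleNamespace` objects are an assumption used to imitate torchtext examples with a `pos` member.

```python
from types import SimpleNamespace

def token_accuracy(sys, gold):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for predicted, example in zip(sys, gold):
        correct += sum(p == g for p, g in zip(predicted, example.pos))
        total += len(predicted)
    return correct * 100.0 / total


# Toy gold data: objects with a pos member, standing in for torchtext examples.
toy_gold = [SimpleNamespace(pos=["DET", "NOUN", "VERB"]),
            SimpleNamespace(pos=["PRON", "VERB"])]
toy_sys = [["DET", "NOUN", "NOUN"], ["PRON", "VERB"]]

score = token_accuracy(toy_sys, toy_gold)
print("%.1f" % score)  # 80.0
```

Four of the five tokens are tagged correctly, so accuracy is computed over tokens rather than over sentences.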


### Numericalizing data

Define iterators for the training data, development data and test data. 

In [69]:
from torchtext.data import Iterator, BucketIterator

train_iter = BucketIterator(train,
                            batch_size=1,
                            sort_key=len,
                            shuffle=True,
                            device="cpu")

dev_iter, test_iter = Iterator.splits((dev, test),
                                      batch_sizes=(1, 1),
                                      sort=False,
                                      shuffle=False,
                                      device="cpu")

Look at the data:

In [70]:
# ex represents the first sentence in the training set. 
ex = next(iter(train_iter))

print("Here is the first example from the training set:")
print(ex)
print("\nEach example contains a vector of POS tags ex.pos having dimension (sentence_length,1):\n")
print(ex.pos)
print(ex.pos.size())
print("\nEach example contains a vector of word tokens ex.word having dimension (sentence_length+2,1)")
print("The +2 stems from START symbol at the beginning of the sentence (WORD.vocab.stoi[START] == %u)" 
      % WORD.vocab.stoi[START])
print("and END symbol at the end of the sentence (WORD.vocab.stoi[END] == %u):\n" % WORD.vocab.stoi[END])
print(ex.word)
print(ex.word.size())
print("\nEach example contains a tensor of character strings ex.char[0] having dimension (1,sentence_length,max_word_length+2)")
print("The tensor is big enough to fit all tokens in the sentence. Again, +2 stems from the sequence initial")
print("symbol START (CHAR.vocab.stoi[START] == %u) and sequence final symbol END (CHAR.vocab.stoi[END] == %u)"
      % (CHAR.vocab.stoi[START], CHAR.vocab.stoi[END]))
print("which are appended to each word. All words are padded to the same length using the symbol PAD")
print("(CHAR.vocab.stoi[PAD] == %s):\n" % CHAR.vocab.stoi[PAD])
chars, word_count, char_lengths = ex.char      
print(chars)
print(chars.size())
print("\nAdditionally we get the length of each input word form ex.char[2] in a (1,sentence_length) tensor:\n")
print(char_lengths)

Here is the first example from the training set:

[torchtext.data.batch.Batch of size 1 from UDDATA]
	[.pos]:[torch.LongTensor of size 27x1]
	[.word]:[torch.LongTensor of size 29x1]
	[.char]:('[torch.LongTensor of size 1x27x10]', '[torch.LongTensor of size 1]', '[torch.LongTensor of size 1x27]')

Each example contains a vector of POS tags ex.pos having dimension (sentence_length,1):

tensor([[11],
        [11],
        [ 9],
        [ 9],
        [10],
        [ 6],
        [11],
        [10],
        [10],
        [ 8],
        [ 5],
        [ 9],
        [12],
        [ 9],
        [ 6],
        [11],
        [ 9],
        [10],
        [ 6],
        [13],
        [ 6],
        [ 9],
        [ 3],
        [ 3],
        [ 6],
        [ 4],
        [ 3]])
torch.Size([27, 1])

Each example contains a vector of word tokens ex.word having dimension (sentence_length+2,1)
The +2 stems from START symbol at the beginning of the sentence (WORD.vocab.stoi[START] == 2)
and END symbol at the end 
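
The shapes described above can be reproduced with a plain-Python sketch of numericalization: words are mapped to indices and wrapped in START and END, while each word's characters are wrapped and then padded to a common length. The helper names and the toy vocabularies below are hypothetical. Whether the character lengths reported by torchtext include the two boundary symbols is not visible in the output above, so treat that detail as an assumption.

```python
PAD, UNK, START, END = "<pad>", "<unk>", "<start>", "<end>"

def numericalize_words(words, stoi):
    """Map a sentence to indices, wrapped in START and END."""
    tokens = [START] + words + [END]
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]

def numericalize_chars(words, stoi):
    """Map each word to START + characters + END, padded to a common length."""
    sequences = [[START] + list(word) + [END] for word in words]
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [PAD] * (max_len - len(seq)) for seq in sequences]
    ids = [[stoi.get(ch, stoi[UNK]) for ch in seq] for seq in padded]
    return ids, lengths


# Toy vocabularies (hypothetical indices, specials first).
word_stoi = {UNK: 0, PAD: 1, START: 2, END: 3, "a": 4, "cat": 5}
char_stoi = {UNK: 0, PAD: 1, START: 2, END: 3, "a": 4, "c": 5, "t": 6}

word_ids = numericalize_words(["a", "cat"], word_stoi)
char_ids, char_lengths = numericalize_chars(["a", "cat"], char_stoi)

print(word_ids)      # [2, 4, 5, 3]
print(char_ids)      # [[2, 4, 3, 1, 1], [2, 5, 4, 6, 3]]
print(char_lengths)  # [3, 5]
```

The word sequence has `sentence_length + 2` entries, whereas the character matrix has one row per word and `max_word_length + 2` columns.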

### The POS tagger

Build a basic BiLSTM POS tagger.

In [71]:
import numpy as np

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.nn.functional import log_softmax, relu
from torch.optim import Adam, SGD

from random import random, seed, shuffle

# Ensure reproducible results by setting random seeds to 0.
seed(0)
torch.manual_seed(0)
np.random.seed(0)

import re

# Hyperparameters
VOCAB_SIZE = len(WORD.vocab.stoi)
EMBEDDING_DIM=300
RNN_HIDDEN_DIM=50
RNN_LAYERS=1
BATCH_SIZE=10
EPOCHS=5

# Maximum length of generated output word forms.
MAXWFLEN=40

#### LSTM layer

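Build a bidirectional LSTM from two unidirectional LSTMs: one reads the sequence from left to right, the other reads the reversed sequence, and their outputs are concatenated position by position.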

In [72]:
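class BidirectionalLSTM(nn.Module):
    def __init__(self):
        super(BidirectionalLSTM,self).__init__()

        # Two independent unidirectional LSTMs, one per reading direction.
        self.forward_rnn = nn.LSTM(EMBEDDING_DIM, RNN_HIDDEN_DIM, RNN_LAYERS)
        self.backward_rnn = nn.LSTM(EMBEDDING_DIM, RNN_HIDDEN_DIM, RNN_LAYERS)

    def forward(self, sequence):
        # sequence: (sequence_length, batch_size, EMBEDDING_DIM)
        output_f, _ = self.forward_rnn(sequence)
        # Feed the reversed sequence to the backward LSTM.
        output_b, _ = self.backward_rnn(sequence.flip(0))

        # Flip the backward outputs back into the original order and
        # concatenate: (sequence_length, batch_size, 2 * RNN_HIDDEN_DIM)
        bi_output = torch.cat((output_f, output_b.flip(0)), dim=2)

        return bi_output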
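
The reverse, run, reverse-again pattern used in `forward` can be illustrated without PyTorch. In the sketch below, a toy recurrence simply records which items it has seen so far; this is an assumption chosen for readability and stands in for the LSTM state. The function names are hypothetical.

```python
def run_left_to_right(sequence):
    """Toy recurrence: the state at step i is the tuple of items seen so far."""
    state = ()
    outputs = []
    for item in sequence:
        state = state + (item,)
        outputs.append(state)
    return outputs

def bidirectional(sequence):
    """Pair each position's left-to-right state with its right-to-left state."""
    forward = run_left_to_right(sequence)
    # Run the same recurrence over the reversed input, then reverse the
    # outputs so that index i again refers to position i of the input.
    backward = run_left_to_right(sequence[::-1])[::-1]
    return list(zip(forward, backward))


states = bidirectional(["the", "dog", "barks"])
for word, (fwd, bwd) in zip(["the", "dog", "barks"], states):
    print(word, fwd, bwd)
# the ('the',) ('barks', 'dog', 'the')
# dog ('the', 'dog') ('barks', 'dog')
# barks ('the', 'dog', 'barks') ('barks',)
```

After the second reversal, the state at each position summarizes the word itself together with everything to its left (forward) and everything to its right (backward).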

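In order to improve tagging accuracy for out-of-vocabulary (OOV) words, we use word dropout: during training, each word token is replaced by the `UNK` symbol with a small probability, so that the model learns a useful representation for unknown words.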

In [73]:
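def drop_words(sequence, word_dropout):
    """
    Return a copy of sequence, a (sequence_length, 1) tensor of word indices,
    in which each word is replaced by UNK with probability word_dropout.
    """
    seq_len, _ = sequence.size()
    dropout_sequence = sequence.clone()
    # Skip the first and last positions, which hold START and END.
    for i in range(1, seq_len - 1):
        if random() < word_dropout:
            dropout_sequence[i, 0] = WORD.vocab.stoi[UNK]
    return dropout_sequence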
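
The behaviour of word dropout is easy to check on a plain list of indices. The following sketch is a list-based analogue of the function above, not a replacement for it; `drop_words_sketch`, the explicit `rng` argument and the toy indices are assumptions made for illustration.

```python
import random

def drop_words_sketch(token_ids, word_dropout, unk_id, rng):
    """Replace interior tokens with unk_id with probability word_dropout."""
    dropped = list(token_ids)
    # Positions 0 and -1 hold the START and END symbols and are never replaced.
    for i in range(1, len(dropped) - 1):
        if rng.random() < word_dropout:
            dropped[i] = unk_id
    return dropped


rng = random.Random(0)
sentence_ids = [2, 17, 42, 8, 3]   # START, three word indices, END (toy values)

no_dropout = drop_words_sketch(sentence_ids, 0.0, unk_id=0, rng=rng)
full_dropout = drop_words_sketch(sentence_ids, 1.0, unk_id=0, rng=rng)
print(no_dropout)    # [2, 17, 42, 8, 3]
print(full_dropout)  # [2, 0, 0, 0, 3]
```

With a dropout rate of 0 the sentence is unchanged, which corresponds to the `word_dropout=0` default used at tagging time.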

#### Sentence Encoder 

In [74]:
class SentenceEncoder(nn.Module):
    def __init__(self):
        super(SentenceEncoder,self).__init__()
        
        self.vocabulary = WORD.vocab.stoi
        self.embedding = nn.Embedding(len(self.vocabulary), EMBEDDING_DIM)
        self.rnn = BidirectionalLSTM()
        
    def forward(self, ex, word_dropout):
        # Randomly replace words by UNK, then embed and encode the sentence.
        words = drop_words(ex.word, word_dropout)
        embedded = self.embedding(words)
        output = self.rnn(embedded)
        # Discard the states for START and END so that one state remains per word.
        output = output[1:-1, :, :]
        return output


#### Prediction Layer

In [75]:
class FeedForward(nn.Module):
    def __init__(self,input_dim,output_dim):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(input_dim, input_dim)
        self.linear2 = nn.Linear(input_dim, output_dim)
        self.act1 = nn.ReLU()
        self.act2 = nn.LogSoftmax(dim = 2)
        
    def forward(self,tensor):
        output = self.linear1(tensor)
        output = self.act1(output)
        output = self.linear2(output)
        output = self.act2(output)
        
        return output
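
The prediction layer ends in a log-softmax, and the training code pairs it with `nn.NLLLoss`. The plain-Python sketch below shows what these two steps compute for a handful of toy scores: log-softmax turns raw scores into log-probabilities, and the negative log-likelihood averages the negated log-probability of the gold tag over tokens. The function names and scores are made up, and the averaging assumes the default mean reduction without class weights.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return [s - log_norm for s in scores]

def nll_loss(log_probs_per_token, gold_indices):
    """Mean negative log-probability assigned to the gold tag of each token."""
    losses = [-log_probs[gold]
              for log_probs, gold in zip(log_probs_per_token, gold_indices)]
    return sum(losses) / len(losses)


# Two tokens, three candidate tags each (toy scores).
scores = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
log_probs = [log_softmax(row) for row in scores]

# Exponentiated log-probabilities sum to one for every token.
totals = [sum(math.exp(lp) for lp in row) for row in log_probs]

# With uniform scores over three tags, the loss for that token is log(3).
uniform_loss = nll_loss([log_probs[1]], [0])
print(round(uniform_loss, 4))  # 1.0986
```

A tagger that assigns higher scores to the gold tags gets a loss closer to zero, which is what the decreasing average loss over training epochs reflects.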

#### Tagging sentences and training the model


In [76]:
class SimplePOSTagger(nn.Module):
    def __init__(self):
        super(SimplePOSTagger,self).__init__()
        self.tagset_size = len(POS.vocab.itos)
        
        self.sentence_encoder = SentenceEncoder()
        self.hidden2tag = FeedForward(2*RNN_HIDDEN_DIM,self.tagset_size)
        
    def forward(self,ex, word_dropout=0):
        states = self.sentence_encoder(ex,word_dropout)
        return self.hidden2tag(states)

    def tag(self,data):
        with torch.no_grad():
            results = []
            for ex in data:
                tags = self(ex).argmax(dim=2).squeeze(1)
                results.append([POS.vocab.itos[i] for i in tags])
            return results
        
pos_size = len(POS.vocab.itos)
assert(SimplePOSTagger()(ex).size() == (ex.word.size()[0]-2,1,pos_size))
assert(len(SimplePOSTagger().tag([ex])[0]) == ex.word.size()[0] -2) 
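
The decoding step in `tag` picks, for every token, the tag index with the highest log-probability and maps it back to a label. A plain-Python sketch of this argmax decoding, with a hypothetical helper name and toy values, might look as follows.

```python
def decode_tags(log_probs_per_token, itos):
    """Pick the highest-scoring tag index for each token and map it to a label."""
    tags = []
    for log_probs in log_probs_per_token:
        best_index = max(range(len(log_probs)), key=lambda i: log_probs[i])
        tags.append(itos[best_index])
    return tags


# Toy tag inventory and scores (hypothetical values).
toy_itos = ["<unk>", "<pad>", "NOUN", "VERB", "DET"]
toy_log_probs = [
    [-5.0, -6.0, -2.0, -3.0, -0.1],   # DET is most likely
    [-5.0, -6.0, -0.2, -2.5, -3.0],   # NOUN is most likely
    [-5.0, -6.0, -1.5, -0.3, -4.0],   # VERB is most likely
]
decoded = decode_tags(toy_log_probs, toy_itos)
print(decoded)  # ['DET', 'NOUN', 'VERB']
```

Each token is decoded independently of its neighbours. If the tag vocabulary contains special symbols such as `<unk>` and `<pad>`, as assumed in the toy inventory here, the tagger can in principle output them as well.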

Armed with the `SimplePOSTagger` class, you can now train your tagger using the following code. You should get to around 75% tagging accuracy on the development set.

In [77]:
tagger = SimplePOSTagger()
optimizer = Adam(tagger.parameters())
loss_function = nn.NLLLoss()

for epoch in range(EPOCHS):
    tot_loss = 0
    for i,ex in enumerate(train_iter):
        print("Epoch %u: Example %u of %u" % (epoch+1, i+1,len(train_iter)),end="\r")
        tagger.zero_grad()
        output = tagger(ex,word_dropout=0.05).squeeze(dim=1)
        gold = ex.pos.squeeze(dim=1)
        loss = loss_function(output,gold)
        loss.backward()
        optimizer.step()
        tot_loss += loss.item()
    print("\nAverage loss per example: %.4f" % (tot_loss/len(train_iter)))
    sys_dev = tagger.tag(dev_iter)
    print("Development accuracy: %.2f" % accuracy(sys_dev, dev))

Epoch 1: Example 1000 of 1000
Average loss per example: 1.2292
Development accuracy: 71.78
Epoch 2: Example 1000 of 1000
Average loss per example: 0.5045
Development accuracy: 75.11
Epoch 3: Example 1000 of 1000
Average loss per example: 0.2637
Development accuracy: 76.44
Epoch 4: Example 1000 of 1000
Average loss per example: 0.1891
Development accuracy: 75.29
Epoch 5: Example 1000 of 1000
Average loss per example: 0.1542
Development accuracy: 77.68
