# Example: POS Tagging

According to [Wikipedia](https://en.wikipedia.org/wiki/Part-of-speech_tagging):

> Part-of-speech tagging (POS tagging or PoS tagging or POST) is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph.

Formally, given a sequence of words $\mathbf{x} = \left< x_1, x_2, \ldots, x_t \right>$, the goal is to learn a model $P(y_i \,|\, \mathbf{x})$, where $y_i$ is the POS tag associated with $x_i$.
Note that the model is conditioned on all of $\mathbf{x}$, not just the words that occur earlier in the sentence. This is because we can assume that the entire sentence is known at the time of tagging.

### Dataset

We will train our model on the [English Dependencies Treebank](https://github.com/UniversalDependencies/UD_English).
You can download this dataset by running the following lines:

In [1]:
!pip install gdown



In [2]:
import gdown
url = "https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-dev.conllu"
output = "en_ewt-ud-dev.conllu"
gdown.download(url, output, quiet=False)

Downloading...
From: https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-dev.conllu
To: /kaggle/working/en_ewt-ud-dev.conllu
1.76MB [00:00, 114MB/s]                   


'en_ewt-ud-dev.conllu'

In [4]:
url = "https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-test.conllu"
output = "en_ewt-ud-test.conllu"
gdown.download(url, output, quiet=False)

Downloading...
From: https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-test.conllu
To: /kaggle/working/en_ewt-ud-test.conllu
1.77MB [00:00, 117MB/s]                   


'en_ewt-ud-test.conllu'

In [5]:
url = "https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-train.conllu"
output = "en_ewt-ud-train.conllu"
gdown.download(url, output, quiet=False)

Downloading...
From: https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-train.conllu
To: /kaggle/working/en_ewt-ud-train.conllu
13.9MB [00:00, 163MB/s]                    


'en_ewt-ud-train.conllu'

The individual data instances come in chunks separated by blank lines. Each chunk consists of a few starting comments, followed by lines of tab-separated fields. The fields we are interested in are the 2nd and 4th (indices 1 and 3 when counting from zero), which contain the tokenized word and the POS tag respectively. An example chunk is shown below:

```
# sent_id = answers-20111107193044AAvUYBv_ans-0023
# text = Hope you have a crapload of fun!
1	Hope	hope	VERB	VBP	Mood=Ind|Tense=Pres|VerbForm=Fin	0	root	0:root	_
2	you	you	PRON	PRP	Case=Nom|Person=2|PronType=Prs	3	nsubj	3:nsubj	_
3	have	have	VERB	VBP	Mood=Ind|Tense=Pres|VerbForm=Fin	1	ccomp	1:ccomp	_
4	a	a	DET	DT	Definite=Ind|PronType=Art	5	det	5:det	_
5	crapload	crapload	NOUN	NN	Number=Sing	3	obj	3:obj	_
6	of	of	ADP	IN	_	7	case	7:case	_
7	fun	fun	NOUN	NN	Number=Sing	5	nmod	5:nmod	SpaceAfter=No
8	!	!	PUNCT	.	_	1	punct	1:punct	_

```
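
To make the field layout concrete, here is a minimal sketch that pulls the words and POS tags out of one chunk. The helper name `parse_chunk` is hypothetical, and the chunk string contains only the comment line and the first three token lines of the example above.

```python
# Hypothetical helper for illustration; not part of the tutorial's own code.
def parse_chunk(chunk):
    """Return (tokens, tags) from one blank-line-delimited CoNLL-U chunk."""
    tokens, tags = [], []
    for line in chunk.split('\n'):
        # Skip empty lines and comment lines such as '# text = ...'.
        if not line or line.startswith('#'):
            continue
        fields = line.split('\t')
        tokens.append(fields[1])  # 2nd field: the tokenized word
        tags.append(fields[3])    # 4th field: the POS tag
    return tokens, tags


chunk = '\n'.join([
    '# text = Hope you have a crapload of fun!',
    '1\tHope\thope\tVERB\tVBP\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t0:root\t_',
    '2\tyou\tyou\tPRON\tPRP\tCase=Nom|Person=2|PronType=Prs\t3\tnsubj\t3:nsubj\t_',
    '3\thave\thave\tVERB\tVBP\tMood=Ind|Tense=Pres|VerbForm=Fin\t1\tccomp\t1:ccomp\t_',
])

tokens, tags = parse_chunk(chunk)
print(tokens)  # ['Hope', 'you', 'have']
print(tags)    # ['VERB', 'PRON', 'VERB']
```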

As with most real-world data, we are going to need to do some preprocessing before we can use it. The first thing we are going to need is a vocabulary (the `Vocab` class below) to map words/POS tags to integer ids. Here is a more full-featured implementation than what we used in the first tutorial:

In [6]:
from collections import Counter


class Vocab(object):
    def __init__(self, iter, max_size=None, sos_token=None, eos_token=None, unk_token=None):
        """Initialize the vocabulary.
        Args:
            iter: An iterable which produces sequences of tokens used to update
                the vocabulary.
            max_size: (Optional) Maximum number of tokens in the vocabulary.
            sos_token: (Optional) Token denoting the start of a sequence.
            eos_token: (Optional) Token denoting the end of a sequence.
            unk_token: (Optional) Token denoting an unknown element in a
                sequence.
        """
        self.max_size = max_size
        self.pad_token = '<pad>'
        self.sos_token = sos_token
        self.eos_token = eos_token
        self.unk_token = unk_token

        # Add special tokens.
        id2word = [self.pad_token]
        if sos_token is not None:
            id2word.append(self.sos_token)
        if eos_token is not None:
            id2word.append(self.eos_token)
        if unk_token is not None:
            id2word.append(self.unk_token)

        # Update counter with token counts.
        counter = Counter()
        for x in iter:
            counter.update(x)

        # Extract lookup tables.
        if max_size is not None:
            counts = counter.most_common(max_size)
        else:
            counts = counter.items()
            counts = sorted(counts, key=lambda x: x[1], reverse=True)
        words = [x[0] for x in counts]
        id2word.extend(words)
        word2id = {x: i for i, x in enumerate(id2word)}

        self._id2word = id2word
        self._word2id = word2id

    def __len__(self):
        return len(self._id2word)

    def word2id(self, word):
        """Map a word in the vocabulary to its unique integer id.
        Args:
            word: Word to lookup.
        Returns:
            id: The integer id of the word being looked up.
        """
        if word in self._word2id:
            return self._word2id[word]
        elif self.unk_token is not None:
            return self._word2id[self.unk_token]
        else:
            raise KeyError('Word "%s" not in vocabulary.' % word)

    def id2word(self, id):
        """Map an integer id to its corresponding word in the vocabulary.
        Args:
            id: Integer id of the word being looked up.
        Returns:
            word: The corresponding word.
        """
        return self._id2word[id]
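
To see what such a vocabulary produces, here is a stripped-down sketch of the same idea on a toy corpus. The function `build_vocab` and the example sentences are made up for illustration; it mirrors the ordering used above (special tokens first, then words by descending count) but is not a replacement for the `Vocab` class.

```python
from collections import Counter


# Hypothetical, simplified stand-in for the lookup tables built by Vocab.
def build_vocab(sequences, unk_token=None):
    """Return (id2word, word2id) with '<pad>' reserved at id 0."""
    id2word = ['<pad>']
    if unk_token is not None:
        id2word.append(unk_token)
    counter = Counter()
    for seq in sequences:
        counter.update(seq)
    # most_common() sorts by descending count; ties keep first-seen order.
    id2word.extend(word for word, _ in counter.most_common())
    word2id = {word: i for i, word in enumerate(id2word)}
    return id2word, word2id


sentences = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['the', 'end']]
id2word, word2id = build_vocab(sentences, unk_token='<unk>')

print(id2word)  # ['<pad>', '<unk>', 'the', 'sat', 'cat', 'dog', 'end']
# Unknown words fall back to the id of the unknown token.
print(word2id.get('zebra', word2id['<unk>']))  # 1
```

Reserving id 0 for padding is what later lets the embedding layer and the loss function treat pad positions specially.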

Now we need to parse the .conllu files and extract the data needed for our model. The good news is that the file is only a few megabytes so we can store everything in memory. Rather than creating a generator from scratch like we did in the previous tutorial, we will instead showcase the `torch.utils.data.Dataset` class. There are two main things that a `Dataset` must have:

1. A `__len__` method which lets you know how many data points are in the dataset.
2. A `__getitem__` method which is used to support integer indexing.

Here's an example of how to define these methods for the English Dependencies Treebank data.

In [7]:
import re
from torch.utils.data import Dataset


class Annotation(object):
    def __init__(self):
        """A helper object for storing annotation data."""
        self.tokens = []
        self.pos_tags = []


class CoNLLDataset(Dataset):
    def __init__(self, fname):
        """Initializes the CoNLLDataset.
        Args:
            fname: The .conllu file to load data from.
        """
        self.fname = fname
        self.annotations = self.process_conll_file(fname)
        self.token_vocab = Vocab([x.tokens for x in self.annotations],
                                 unk_token='<unk>')
        self.pos_vocab = Vocab([x.pos_tags for x in self.annotations])

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        annotation = self.annotations[idx]
        input = [self.token_vocab.word2id(x) for x in annotation.tokens]
        target = [self.pos_vocab.word2id(x) for x in annotation.pos_tags]
        return input, target

    def process_conll_file(self, fname):
        # Read the entire file.
        with open(fname, 'r') as f:
            raw_text = f.read()
        # Split into chunks on blank lines.
        chunks = re.split(r'^\n', raw_text, flags=re.MULTILINE)
        # Process each chunk into an annotation.
        annotations = []
        for chunk in chunks:
            annotation = Annotation()
            lines = chunk.split('\n')
            # Iterate over all lines in the chunk.
            for line in lines:
                # If line is empty ignore it.
                if len(line)==0:
                    continue
                # If line is a comment ignore it.
                if line[0] == '#':
                    continue
                # Otherwise split on tabs and retrieve the token and the
                # POS tag fields.
                fields = line.split('\t')
                annotation.tokens.append(fields[1])
                annotation.pos_tags.append(fields[3])
            if (len(annotation.tokens) > 0) and (len(annotation.pos_tags) > 0):
                annotations.append(annotation)
        return annotations

And let's see how this is used in practice.

In [8]:
dataset = CoNLLDataset('en_ewt-ud-train.conllu')

In [9]:
input, target = dataset[0]
print('Example input: %s\n' % input)
print('Example target: %s\n' % target)
print('Translated input: %s\n' % ' '.join(dataset.token_vocab.id2word(x) for x in input))
print('Translated target: %s\n' % ' '.join(dataset.pos_vocab.id2word(x) for x in target))

Example input: [266, 16, 5249, 45, 295, 703, 1154, 4233, 10099, 595, 16, 10100, 4, 3, 6865, 35, 3, 6866, 10, 3, 498, 8, 6867, 4, 758, 3, 2224, 1605, 2]

Example target: [9, 2, 9, 2, 7, 1, 3, 9, 9, 9, 2, 9, 2, 6, 1, 5, 6, 1, 5, 6, 1, 5, 9, 2, 5, 6, 7, 1, 2]

Translated input: Al - Zaman : American forces killed Shaikh Abdullah al - Ani , the preacher at the mosque in the town of Qaim , near the Syrian border .

Translated target: PROPN PUNCT PROPN PUNCT ADJ NOUN VERB PROPN PROPN PROPN PUNCT PROPN PUNCT DET NOUN ADP DET NOUN ADP DET NOUN ADP PROPN PUNCT ADP DET ADJ NOUN PUNCT



The main upshot of using the `Dataset` class is that it makes accessing training/test observations very simple. Accordingly, this makes batch generation easy since all we need to do is randomly choose numbers and then grab those observations from the dataset - PyTorch includes a `torch.utils.data.DataLoader` object which handles this for you. In fact, if we were not working with sequential data we would be able to proceed straight to the modeling step from here. However, since we are working with sequential data there is one last pesky issue we need to handle - padding.
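
The "randomly choose numbers and grab those observations" idea can be sketched in a few lines of plain Python. This is only an illustration of the concept with a hypothetical helper `iterate_batches`; it is not how `DataLoader` is actually implemented.

```python
import random


# Hypothetical helper: yields index lists that cover the dataset once.
def iterate_batches(num_examples, batch_size, seed=0):
    """Yield lists of dataset indices, shuffled, one list per batch."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        # The final batch may be smaller than batch_size.
        yield indices[start:start + batch_size]


batches = list(iterate_batches(num_examples=10, batch_size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

Each index list would then be used to look up observations, e.g. `[dataset[i] for i in batch]`, before collating them into tensors.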

The issue is that when we are given a batch of outputs from `CoNLLDataset`, the sequences in the batch are likely to all be of different length. To deal with this, we define a custom `collate_annotations` function which adds padding to the end of the sequences in the batch so that they are all the same length. In addition, we'll have this function take care of loading the data into tensors and ensuring that the tensor dimensions are in the order expected by PyTorch.

One last annoying thing: to deal with some of the issues caused by using padded data, we will be using a function called `torch.nn.utils.rnn.pack_padded_sequence` in our model later on. All you need to know now is that this function expects the sequences in the batch to be sorted by descending length, and that it needs the length of each sequence. So we will make sure that the `collate_annotations` function performs this sorting for us and returns the sequence lengths in addition to the input and target tensors.
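
Before looking at the tensor version, here is a small pure-Python walk-through of the three steps (sort by length, pad, transpose) on a toy batch. The function name `sort_pad_transpose` and the numbers are made up for illustration, and the sketch assumes a non-empty batch.

```python
# Hypothetical helper showing the collation steps on plain lists.
def sort_pad_transpose(sequences, pad_value=0):
    # Sort longest first, as the packing function used later expects.
    ordered = sorted(sequences, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]
    max_length = lengths[0]
    # Append pad_value so every sequence has max_length entries.
    padded = [seq + [pad_value] * (max_length - len(seq)) for seq in ordered]
    # Transpose from [batch_size, seq_len] to [seq_len, batch_size].
    time_major = [list(step) for step in zip(*padded)]
    return time_major, lengths


batch = [[5, 6], [7, 8, 9, 10], [11]]
time_major, lengths = sort_pad_transpose(batch)
print(lengths)     # [4, 2, 1]
print(time_major)  # [[7, 5, 11], [8, 6, 0], [9, 0, 0], [10, 0, 0]]
```

Each inner list of `time_major` holds one time step across the whole batch, which is the `[seq_len, batch_size]` layout the model expects.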

In [10]:
import torch
from torch.autograd import Variable


def pad(sequences, max_length, pad_value=0):
    """Pads a list of sequences.
    Args:
        sequences: A list of sequences to be padded.
        max_length: The length to pad to.
        pad_value: The value used for padding.
    Returns:
        A list of padded sequences.
    """
    out = []
    for sequence in sequences:
        padded = sequence + [pad_value]*(max_length - len(sequence))
        out.append(padded)
    return out


def collate_annotations(batch):
    """Function used to collate data returned by CoNLLDataset."""
    # Get inputs, targets, and lengths.
    inputs, targets = zip(*batch)
    lengths = [len(x) for x in inputs]
    # Sort by length.
    sort = sorted(zip(inputs, targets, lengths),
                  key=lambda x: x[2],
                  reverse=True)
    inputs, targets, lengths = zip(*sort)
    # Pad.
    max_length = max(lengths)
    inputs = pad(inputs, max_length)
    targets = pad(targets, max_length)
    # Transpose.
    inputs = list(map(list, zip(*inputs)))
    targets = list(map(list, zip(*targets)))
    # Convert to PyTorch tensors (wrapping in Variable is no longer needed).
    inputs = torch.LongTensor(inputs)
    targets = torch.LongTensor(targets)
    lengths = torch.LongTensor(lengths)
    if torch.cuda.is_available():
        inputs = inputs.cuda()
        targets = targets.cuda()
        lengths = lengths.cuda()
    return inputs, targets, lengths

Again let's see how this is used in practice:

In [11]:
from torch.utils.data import DataLoader


for inputs, targets, lengths in DataLoader(dataset, batch_size=16, collate_fn=collate_annotations):
    print('Inputs: %s\n' % inputs.data)
    print('Targets: %s\n' % targets.data)
    print('Lengths: %s\n' % lengths.data)

    # Usually we'd keep sampling batches, but here we'll just break
    break

Inputs: tensor([[   28,  1083,   266,    28,    30,   106,    68,   266,   499,   625,
         10103,   121,  1212,    28,    28,   108],
        [10106,     3,    16,  1713,  6874,  6878, 10115,    16,  1030,   106,
            45, 10123,     8,  3581,  1081,  1606],
        [   10,  5252,  5249,  4237,    11,    11,    46,  5249,  4239,  1712,
           555,     4,    69,    60,    19,    54],
        [  180,    19,    45,     8,    10,     3,   185,    45,    51,     8,
          1849,  6874,    60,  1370,   159,    41],
        [   11,   343,   295, 10118, 10125,   759,   138,  5253, 10121,     7,
          2018,  3111,   159,    10,   450,    19],
        [ 4234,   163,   703,  3111,   180,  1031,     8,  1154,     7, 10101,
            12,     4,   450,     3,    44, 10111],
        [    5,     5,  1154,  2018,     6,    10,     3,     7, 10122, 10102,
            31,   151,    44, 10112,     3,     3],
        [    3,   408,  4233,    12,    50,     3,  2755,   807,  3112,    

### Model

We will use the following architecture:

1. Embed the input words into a dense vector space (64-dimensional by default in the code below).
2. Feed the word embeddings into a (bidirectional) GRU.
3. Feed the GRU outputs into a fully connected layer.
4. Use a (log-)softmax activation to get the (log-)probabilities of the different labels.

There is one complication which arises during the forward computation. As was noted in the dataset section, the input sequences are padded. This causes an issue since we do not want to waste computational resources feeding these pad tokens into the RNN. In PyTorch, we can deal with this issue by converting the sequence data into a `torch.nn.utils.rnn.PackedSequence` object before feeding it into the RNN. In essence, a `PackedSequence` flattens the sequence and batch dimensions of a tensor, and also records how many sequences are still active at each time step, so that the recurrent layer never processes the padded positions. If this seems confusing, do not worry. To use the `PackedSequence` in practice you will almost always perform the following steps:

1. Before feeding data into a recurrent layer, transform it into a `PackedSequence` by using the function `torch.nn.utils.rnn.pack_padded_sequence()`.
2. Feed the `PackedSequence` into the recurrent layer.
3. Transform the output back into a regular tensor by using the function `torch.nn.utils.rnn.pad_packed_sequence()`.
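
The three steps can be tried in isolation on random data. This is a minimal sketch, assuming a time-major batch that is already sorted by descending length; the sizes and the random input are arbitrary choices for illustration.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
seq_len, batch_size, embedding_dim, hidden_size = 4, 3, 5, 6

# Stand-in for embedded, padded input: [seq_len, batch_size, embedding_dim].
embedded = torch.randn(seq_len, batch_size, embedding_dim)
lengths = [4, 2, 1]  # true lengths, longest first

rnn = nn.GRU(embedding_dim, hidden_size, bidirectional=True)

packed = pack_padded_sequence(embedded, lengths)        # step 1
packed_out, hidden = rnn(packed)                        # step 2
output, out_lengths = pad_packed_sequence(packed_out)   # step 3

print(output.shape)  # torch.Size([4, 3, 12])
print(out_lengths)   # tensor([4, 2, 1])
print(hidden.shape)  # torch.Size([2, 3, 6])
```

After unpacking, the positions beyond each sequence's true length are filled with zeros, e.g. `output[2:, 1]` for the second sequence of length 2.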

See the model implementation below for a working example:

In [12]:
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
#RNN
class Tagger(nn.Module):
    def __init__(self,
                 input_vocab_size,
                 output_vocab_size,
                 embedding_dim=64,
                 hidden_size=64,
                 bidirectional=True):
        """Initializes the tagger.

        Args:
            input_vocab_size: Size of the input vocabulary.
            output_vocab_size: Size of the output vocabulary.
            embedding_dim: Dimension of the word embeddings.
            hidden_size: Number of units in each GRU hidden layer.
            bidirectional: Whether or not to use a bidirectional rnn.
        """
        super(Tagger, self).__init__()

        # Store parameters
        self.input_vocab_size = input_vocab_size
        self.output_vocab_size = output_vocab_size
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.bidirectional = bidirectional

        # Define layers
        self.word_embeddings = nn.Embedding(input_vocab_size, embedding_dim,
                                            padding_idx=0)
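        # NOTE: nn.GRU only applies dropout between stacked layers, so a
        # dropout argument would have no effect on this single-layer GRU.
        self.rnn = nn.GRU(embedding_dim, hidden_size,
                          bidirectional=bidirectional)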
        if bidirectional:
            self.fc = nn.Linear(2*hidden_size, output_vocab_size)
        else:
            self.fc = nn.Linear(hidden_size, output_vocab_size)
        self.activation = nn.LogSoftmax(dim=2)

    def forward(self, x, lengths=None, hidden=None):
        """Computes a forward pass of the tagger.

        Args:
            x: A LongTensor w/ dimension [seq_len, batch_size].
            lengths: The lengths of the sequences in x.
            hidden: Hidden state to be fed into the GRU.

        Returns:
            net: the log-probabilities over tags for each word in the sequence.
            hidden: the hidden state at the last time step.
        """
        seq_len, batch_size = x.size()

        # If no hidden state is provided, then default to zeros.
        if hidden is None:
            if self.bidirectional:
                num_directions = 2
            else:
                num_directions = 1
            hidden = Variable(torch.zeros(num_directions, batch_size, self.hidden_size))
            if torch.cuda.is_available():
                hidden = hidden.cuda()

        net = self.word_embeddings(x)
        # Pack before feeding into the RNN.
        if lengths is not None:
            lengths = lengths.data.view(-1).tolist()
            net = pack_padded_sequence(net, lengths)
        net, hidden = self.rnn(net, hidden)
        # Unpack after
        if lengths is not None:
            net, _ = pad_packed_sequence(net)
        net = self.fc(net)
        net = self.activation(net)

        return net, hidden

In [13]:
import torch
import torch.nn as nn
import torch.nn.functional as F
#CNN
class Tagger(nn.Module):
    def __init__(self,
                 input_vocab_size,
                 output_vocab_size,
                 embedding_dim=128,
                 num_filters=128,
                 filter_sizes=(3, 4, 5),
                 dropout=0.5):
        """Initializes the tagger.

        Args:
            input_vocab_size: Size of the input vocabulary.
            output_vocab_size: Size of the output vocabulary.
            embedding_dim: Dimension of the word embeddings.
            num_filters: Number of filters for each filter size.
            filter_sizes: Tuple of filter sizes.
            dropout: Dropout probability.
        """
        super(Tagger, self).__init__()

        # Store parameters
        self.input_vocab_size = input_vocab_size
        self.output_vocab_size = output_vocab_size
        self.embedding_dim = embedding_dim
        self.num_filters = num_filters
        self.filter_sizes = filter_sizes
        self.dropout = dropout

        # Define layers
        self.word_embeddings = nn.Embedding(input_vocab_size, embedding_dim, padding_idx=0)
        self.convs = nn.ModuleList([
            nn.Conv1d(embedding_dim, num_filters, filter_size)
            for filter_size in filter_sizes
        ])
        self.fc1 = nn.Linear(len(filter_sizes) * num_filters, 256)
        self.fc2 = nn.Linear(256, output_vocab_size)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.LogSoftmax(dim=2)

    def forward(self, x, lengths=None, hidden=None):
        """Computes a forward pass of the CNN model.

        Args:
            x: A LongTensor w/ dimension [seq_len, batch_size].
            lengths: The lengths of the sequences in x (unused in CNN, kept for compatibility).
            hidden: Hidden state (unused in CNN, kept for compatibility).

        Returns:
            net: the output representation for each word in the sequence.
            hidden: None (to match the RNN output format).
        """
        seq_len, batch_size = x.size()

        net = self.word_embeddings(x)
        net = net.permute(1, 2, 0)  # Permute dimensions to [batch_size, embedding_dim, seq_len]

        conv_outputs = []
        for conv in self.convs:
            # Pad so each convolution preserves seq_len. Global max pooling
            # would instead give every token in a sentence the same prediction.
            total_pad = conv.kernel_size[0] - 1
            padded = F.pad(net, (total_pad // 2, total_pad - total_pad // 2))
            conv_outputs.append(F.relu(conv(padded)))  # [batch_size, num_filters, seq_len]

        net = torch.cat(conv_outputs, 1)  # Concatenate along the filter dimension
        net = net.permute(2, 0, 1)  # Back to [seq_len, batch_size, features]
        net = self.dropout(net)
        net = F.relu(self.fc1(net))
        net = self.dropout(net)
        net = self.fc2(net)  # [seq_len, batch_size, output_vocab_size]

        net = self.activation(net)

        return net, None  # Return None for hidden state to match RNN output format

In [14]:
!pip install pytorch-crf



In [28]:
import torch
import torch.nn as nn
from torchcrf import CRF
# BiLSTM + CRF
class Tagger(nn.Module):
    def __init__(self,
                 input_vocab_size,
                 output_vocab_size,
                 embedding_dim=128,
                 hidden_size=256,
                 num_layers=2,
                 dropout=0.5,
                 bidirectional=True):
        """Initializes the tagger.

        Args:
            input_vocab_size: Size of the input vocabulary.
            output_vocab_size: Size of the output vocabulary.
            embedding_dim: Dimension of the word embeddings.
            hidden_size: Number of units in each LSTM hidden layer.
            num_layers: Number of LSTM layers.
            dropout: Dropout probability.
            bidirectional: Whether to use a bidirectional LSTM.
        """
        super(Tagger, self).__init__()

        # Store parameters
        self.input_vocab_size = input_vocab_size
        self.output_vocab_size = output_vocab_size
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.dropout = dropout
        self.bidirectional = bidirectional

        # Define layers
        self.word_embeddings = nn.Embedding(input_vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers,
                            dropout=dropout, bidirectional=bidirectional)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden_size if bidirectional else hidden_size, output_vocab_size)
        self.crf = CRF(output_vocab_size, batch_first=True)

    def forward(self, x, lengths=None, hidden=None):
        """Computes a forward pass of the BiLSTM-CRF model.

        Args:
            x: A LongTensor w/ dimension [seq_len, batch_size].
            lengths: The lengths of the sequences in x.
            hidden: Hidden state to be fed into the LSTM.

        Returns:
            emissions: The emission scores for each tag.
            hidden: The hidden state at the last timestamp.
        """
        seq_len, batch_size = x.size()

        # If no hidden state is provided, then default to zeros.
        if hidden is None:
            num_directions = 2 if self.bidirectional else 1
            hidden = torch.zeros(self.num_layers * num_directions, batch_size, self.hidden_size)
            if torch.cuda.is_available():
                hidden = hidden.cuda()

        # Embed the input
        net = self.word_embeddings(x)

        # Pack padded sequences and feed into LSTM
        net = nn.utils.rnn.pack_padded_sequence(net, lengths.cpu())
        net, hidden = self.lstm(net, (hidden, hidden))
        net, _ = nn.utils.rnn.pad_packed_sequence(net)

        # Apply dropout and feed into fully connected layer
        net = self.dropout(net)
        emissions = self.fc(net)

        return emissions, hidden
    
    def decode(self, emissions, lengths):
        """Decodes the emission scores and returns the most likely tag sequence.

        Args:
            emissions: The emission scores for each tag.
            lengths: The lengths of the sequences.

        Returns:
            The most likely tag sequence for each input sequence.
        """
        # Transpose the emissions to match the expected shape
        emissions = emissions.transpose(0, 1)

        # Create a mask tensor based on the lengths
        mask = torch.zeros(emissions.size()[:2], dtype=torch.bool)
        if torch.cuda.is_available():
            mask = mask.cuda()
        for i, length in enumerate(lengths):
            mask[i, :length] = True

        # Ensure the mask of the first timestep is all ones
        mask[:, 0] = True

        return self.crf.decode(emissions, mask)

    def loss(self, emissions, tags, lengths):
        """Computes the negative log-likelihood loss.

        Args:
            emissions: The emission scores for each tag.
            tags: The true tags.
            lengths: The lengths of the sequences.

        Returns:
            The negative log-likelihood loss.
        """
        if torch.cuda.is_available():
            emissions = emissions.cuda()
            tags = tags.cuda()
            lengths = lengths.cuda()

        # Transpose the emissions to match the expected shape
        emissions = emissions.transpose(0, 1)

        # Create a mask tensor based on the lengths
        mask = torch.zeros(emissions.size()[:2], dtype=torch.bool)
        if torch.cuda.is_available():
            mask = mask.cuda()
        for i, length in enumerate(lengths):
            mask[i, :length] = True

        # Adjust the tags tensor to ignore padded elements
        tags = tags.transpose(0, 1)
        tags = tags[:, :emissions.size(1)].contiguous()

        return -self.crf(emissions, tags, mask=mask)

### Training

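Training is pretty much exactly the same as in the previous tutorial. There is one catch: we don't want to evaluate our loss function on pad tokens. This is easily fixed by setting the weight of the pad class (id 0) to zero. Note that `NLLLoss` expects log-probabilities, so the loop below applies to the GRU and CNN taggers, whose outputs pass through `LogSoftmax`. With log-probability inputs the loss can never be negative, so negative values indicate that the model is returning unnormalized scores. The BiLSTM-CRF tagger returns raw emission scores and is trained with its own `loss` method in the second loop.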
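
To see why a zero weight removes pad tokens from the loss, here is a minimal pure-Python sketch of a weighted mean negative log-likelihood. The helper name `weighted_nll` and the toy numbers are made up for illustration; the normalization by the summed weights follows what the PyTorch documentation describes for `NLLLoss` with the default mean reduction, but treat this as a sketch rather than the library's implementation.

```python
import math


# Hypothetical helper: weighted mean NLL over a flat list of positions.
def weighted_nll(log_probs, targets, weight):
    """log_probs[i][c] is the log-probability of class c at position i."""
    total = sum(-weight[t] * lp[t] for lp, t in zip(log_probs, targets))
    norm = sum(weight[t] for t in targets)
    return total / norm


# Three classes; class 0 is the pad class and gets weight zero.
weight = [0.0, 1.0, 1.0]
log_probs = [
    [math.log(0.1), math.log(0.7), math.log(0.2)],  # real token, target 1
    [math.log(0.2), math.log(0.3), math.log(0.5)],  # real token, target 2
    [math.log(0.6), math.log(0.2), math.log(0.2)],  # padding, target 0
]
targets = [1, 2, 0]

loss = weighted_nll(log_probs, targets, weight)
# Only the two real tokens contribute: (-log 0.7 - log 0.5) / 2.
print(round(loss, 4))  # 0.5249
```

Whatever the model predicts at the padded position, the result stays the same, because that position contributes zero to both the numerator and the normalizer.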

In [53]:
import numpy as np

# Load datasets.
train_dataset = CoNLLDataset('en_ewt-ud-train.conllu')
dev_dataset = CoNLLDataset('en_ewt-ud-dev.conllu')

dev_dataset.token_vocab = train_dataset.token_vocab
dev_dataset.pos_vocab = train_dataset.pos_vocab

# Hyperparameters / constants.
input_vocab_size = len(train_dataset.token_vocab)
output_vocab_size = len(train_dataset.pos_vocab)
batch_size = 16
epochs = 6

# Initialize the model.
model = Tagger(input_vocab_size, output_vocab_size)
if torch.cuda.is_available():
    model = model.cuda()

# Loss function weights.
weight = torch.ones(output_vocab_size)
weight[0] = 0
if torch.cuda.is_available():
    weight = weight.cuda()

# Initialize loss function and optimizer.
loss_function = torch.nn.NLLLoss(weight)
optimizer = torch.optim.Adam(model.parameters())

# Main training loop.
data_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
                         collate_fn=collate_annotations)
dev_loader = DataLoader(dev_dataset, batch_size=batch_size, shuffle=False,
                        collate_fn=collate_annotations)
losses = []
i = 0
for epoch in range(epochs):
    for inputs, targets, lengths in data_loader:
        optimizer.zero_grad()
        outputs, _ = model(inputs, lengths=lengths)

        outputs = outputs.view(-1, output_vocab_size)
        targets = targets.view(-1)

        loss = loss_function(outputs, targets)
        loss.backward()
        optimizer.step()

        losses.append(loss.item())
        if (i % 1000) == 0:
            # Compute dev loss over entire dev set.
            # NOTE: This is expensive. In your work you may want to only use a
            # subset of the dev set.
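            model.eval()  # Disable dropout while evaluating.
            dev_losses = []
            with torch.no_grad():
                for dev_inputs, dev_targets, dev_lengths in dev_loader:
                    dev_outputs, _ = model(dev_inputs, lengths=dev_lengths)
                    dev_outputs = dev_outputs.view(-1, output_vocab_size)
                    dev_loss = loss_function(dev_outputs, dev_targets.view(-1))
                    dev_losses.append(dev_loss.item())
            model.train()  # Switch back to training mode.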
            avg_train_loss = np.mean(losses)
            avg_dev_loss = np.mean(dev_losses)
            losses = []
            print('Iteration %i - Train Loss: %0.6f - Dev Loss: %0.6f' % (i, avg_train_loss, avg_dev_loss))
            torch.save(model, 'pos_tagger.pt')
        i += 1

torch.save(model, 'pos_tagger.final.pt')

Iteration 0 - Train Loss: -0.008877 - Dev Loss: -0.046099
Iteration 1000 - Train Loss: -246.874077 - Dev Loss: -483.062932
Iteration 2000 - Train Loss: -725.059759 - Dev Loss: -955.216551
Iteration 3000 - Train Loss: -1200.155136 - Dev Loss: -1425.913204
Iteration 4000 - Train Loss: -1675.247595 - Dev Loss: -1896.754818


In [24]:
# BiLSTM + CRF (trained with the model's own CRF loss)

import numpy as np
import torch

# Load datasets
train_dataset = CoNLLDataset('en_ewt-ud-train.conllu')
dev_dataset = CoNLLDataset('en_ewt-ud-dev.conllu')

dev_dataset.token_vocab = train_dataset.token_vocab
dev_dataset.pos_vocab = train_dataset.pos_vocab

# Hyperparameters and constants
input_vocab_size = len(train_dataset.token_vocab)
output_vocab_size = len(train_dataset.pos_vocab)
embedding_dim = 128
hidden_size = 256
num_layers = 2
dropout = 0.5
bidirectional = True
batch_size = 16
epochs = 6

# Initialize the model
model = Tagger(input_vocab_size, output_vocab_size, embedding_dim, hidden_size,
               num_layers, dropout, bidirectional)
if torch.cuda.is_available():
    model = model.cuda()

# Initialize optimizer
optimizer = torch.optim.Adam(model.parameters())

# Main training loop
data_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
                         collate_fn=collate_annotations)
dev_loader = DataLoader(dev_dataset, batch_size=batch_size, shuffle=False,
                        collate_fn=collate_annotations)
losses = []
best_dev_loss = float('inf')
for epoch in range(epochs):
    model.train()
    for inputs, targets, lengths in data_loader:
        optimizer.zero_grad()
        emissions, _ = model(inputs, lengths=lengths)
        loss = model.loss(emissions, targets, lengths)
        loss.backward()
        optimizer.step()

        losses.append(loss.item())

    # Evaluate on dev set
    model.eval()
    dev_losses = []
    with torch.no_grad():
        for inputs, targets, lengths in dev_loader:
            emissions, _ = model(inputs, lengths=lengths)
            loss = model.loss(emissions, targets, lengths)
            dev_losses.append(loss.item())
    avg_train_loss = np.mean(losses)
    avg_dev_loss = np.mean(dev_losses)
    losses = []
    print(f'Epoch {epoch + 1} - Train Loss: {avg_train_loss:.4f} - Dev Loss: {avg_dev_loss:.4f}')

    # Save the best model based on dev loss
    if avg_dev_loss < best_dev_loss:
        best_dev_loss = avg_dev_loss
        torch.save(model, 'pos_tagger_bilstm_crf.pt')

Epoch 1 - Train Loss: 192.3368 - Dev Loss: 82.2925
Epoch 2 - Train Loss: 82.0035 - Dev Loss: 64.2159
Epoch 3 - Train Loss: 50.8260 - Dev Loss: 57.7737
Epoch 4 - Train Loss: 32.9238 - Dev Loss: 57.5582
Epoch 5 - Train Loss: 21.9069 - Dev Loss: 61.4896
Epoch 6 - Train Loss: 15.2133 - Dev Loss: 66.3488



### Evaluation

For tagging tasks the typical evaluation metrics are accuracy and f1-score (i.e., the harmonic mean of precision and recall):

$$ \text{f1-score} = 2 \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}} $$
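
As a worked example of the formula, the following sketch computes the f1-score of a single tag from toy predictions. The helper `f1_per_class` and the label sequences are made up for illustration; returning zero when precision or recall is undefined is a convention chosen here.

```python
# Hypothetical helper: f1-score for one label, treating it as the positive class.
def f1_per_class(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Convention for this sketch: undefined precision/recall counts as zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = ['NOUN', 'VERB', 'NOUN', 'DET', 'NOUN']
y_pred = ['NOUN', 'NOUN', 'NOUN', 'DET', 'VERB']

# NOUN: 2 true positives, 1 false positive, 1 false negative,
# so precision = recall = 2/3 and the f1-score is also 2/3.
print(round(f1_per_class(y_true, y_pred, 'NOUN'), 4))  # 0.6667
print(f1_per_class(y_true, y_pred, 'DET'))             # 1.0
print(f1_per_class(y_true, y_pred, 'VERB'))            # 0.0
```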

Here are the results for our final model:

In [31]:
# Collect the predictions and targets
y_true = []
y_pred = []

model.eval()  # Disable dropout while evaluating.
with torch.no_grad():
    for inputs, targets, lengths in dev_loader:
        outputs, _ = model(inputs, lengths=lengths)
        _, preds = torch.max(outputs, dim=2)
        # Move to the CPU (a no-op if already there) before converting to numpy.
        y_true.append(targets.view(-1).cpu().numpy())
        y_pred.append(preds.view(-1).cpu().numpy())

# Stack into numpy arrays
y_true = np.concatenate(y_true)
y_pred = np.concatenate(y_pred)

# Compute accuracy
acc = np.mean(y_true[y_true != 0] == y_pred[y_true != 0])
print('Accuracy - %0.6f\n' % acc)

# Evaluate f1-score
from sklearn.metrics import f1_score
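# Pass labels explicitly so that scores[i] always corresponds to tag id i.
scores = f1_score(y_true, y_pred, average=None,
                  labels=list(range(len(dev_dataset.pos_vocab))))
print('F1-scores:\n')
for label, score in zip(dev_dataset.pos_vocab._id2word[1:], scores[1:]):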
    print('%s - %0.6f' % (label, score))

Accuracy - 0.908866

F1-scores:

NOUN - 0.867515
PUNCT - 0.991355
VERB - 0.893418
PRON - 0.985135
ADP - 0.947318
DET - 0.985302
ADJ - 0.811631
AUX - 0.978887
PROPN - 0.061173
ADV - 0.855584
CCONJ - 0.992935
PART - 0.956656
NUM - 0.833100
SCONJ - 0.843085
_ - 0.991620
SYM - 0.777070
INTJ - 0.743455
X - 0.193548


In [30]:
sum(p.numel() for p in model.parameters() if p.requires_grad)

4963618

### Inference

Now let's look at some of the model's predictions.

In [None]:
model = torch.load('pos_tagger.final.pt')

def inference(sentence):
    # Convert words to an id tensor of shape [seq_len, 1] (a batch of one).
    ids = [[dataset.token_vocab.word2id(x)] for x in sentence]
    ids = torch.LongTensor(ids)
    if torch.cuda.is_available():
        ids = ids.cuda()
    # Get model output without dropout or gradient tracking.
    model.eval()
    with torch.no_grad():
        output, _ = model(ids)
    _, preds = torch.max(output, dim=2)
    preds = preds.cpu().view(-1).numpy()
    pos_tags = [dataset.pos_vocab.id2word(x) for x in preds]
    for word, tag in zip(sentence, pos_tags):
        print('%s - %s' % (word, tag))

In [None]:
sentence = "sdfgkj asd;glkjsdg ;lkj  .".split()
inference(sentence)