# Example: An LSTM for Part-of-Speech Tagging


This is a commented version of the PyTorch LSTM tutorial. We will also use the helpful information provided by Robert Guthrie, particularly his Natural Language Processing tutorials on PyTorch, which can be found at https://github.com/rguthrie3/DeepLearningForNLPInPytorch. Additional explanations draw on Christopher Olah's blog, http://colah.github.io/.

The idea is to take a sentence and assign each word a tag that describes its grammatical role. This is called part-of-speech tagging. The tag of a word depends on its relationship with adjacent words. In this example we use three categories, Determiner (DET), Noun (NN) and Verb (V), and an LSTM to predict the tags.

The model is as follows: let our input sentence be $w_1, \dots, w_M$, where $w_i \in V$, our vocab.
Also, let $T$ be our tag set, and $y_i$ the tag of word $w_i$.  Denote our prediction of the tag of word $w_i$ by $\hat{y}_i$.

This is a structure prediction model, where our output is a sequence $\hat{y}_1, \dots, \hat{y}_M$, where $\hat{y}_i \in T$.

To do the prediction, pass an LSTM over the sentence.  Denote the hidden state at timestep $i$ as $h_i$ and assign each tag a unique index.
Then our prediction rule for $\hat{y}_i$ is
$$ \hat{y}_i = \text{argmax}_j \  (\log \text{Softmax}(Ah_i + b))_j $$
That is, take the log softmax of the affine map of the hidden state, and the predicted tag is the tag that has the maximum value in this vector.  Note this implies immediately that the dimensionality of the target space of $A$ is $|T|$.
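
The prediction rule can be tried on a single hidden state. The following is a minimal sketch, not part of the original tutorial: the sizes `hidden_dim` and `num_tags` and the random values of `h_i`, `A` and `b` are assumptions chosen only for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

hidden_dim, num_tags = 4, 3             # illustrative sizes (assumed)
h_i = torch.randn(hidden_dim)           # stand-in for the LSTM hidden state at step i
A = torch.randn(num_tags, hidden_dim)   # affine map: A has |T| rows
b = torch.randn(num_tags)

scores = F.log_softmax(A @ h_i + b, dim=0)  # one log-probability per tag
y_hat = int(torch.argmax(scores))           # index of the predicted tag

# Log softmax is monotonic, so the argmax equals that of the raw affine scores.
assert y_hat == int(torch.argmax(A @ h_i + b))
print(scores.shape, y_hat)
```

Because `A` maps from the hidden space to one score per tag, its output dimension is $|T|$, which is why `scores` has exactly `num_tags` entries.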

#### LSTMs in PyTorch

Before getting to the example, note a few things. PyTorch's LSTM expects
all of its inputs to be 3D tensors. The semantics of the axes of these
tensors are important. The first axis is the sequence itself, the second
indexes instances in the mini-batch, and the third indexes elements of
the input. We haven't discussed mini-batching, so let's just ignore that
and assume we will always have just 1 dimension on the second axis. If
we want to run the sequence model over the sentence "The cow jumped",
our input should look like

\begin{align}\begin{bmatrix}
   \overbrace{q_\text{The}}^\text{row vector} \\
   q_\text{cow} \\
   q_\text{jumped}
   \end{bmatrix}\end{align}

Except remember there is an additional 2nd dimension with size 1.

In addition, you could go through the sequence one at a time, in which
case the 1st axis will have size 1 also.
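
The axis convention can be checked with a short experiment. This is a sketch under assumed sizes (`input_dim = hidden_dim = 3`), with random vectors standing in for the word vectors; it feeds the same three-step sequence to an `nn.LSTM` first all at once and then one step at a time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_dim, hidden_dim = 3, 3                 # illustrative sizes (assumed)
lstm = nn.LSTM(input_dim, hidden_dim)

# Three random vectors standing in for q_The, q_cow, q_jumped
sentence = torch.randn(3, 1, input_dim)      # (seq_len, batch=1, input_dim)

# Option 1: feed the whole sequence at once (initial state defaults to zeros)
out_all, (h_n, c_n) = lstm(sentence)

# Option 2: feed one step at a time, carrying the hidden state forward
hidden = (torch.zeros(1, 1, hidden_dim), torch.zeros(1, 1, hidden_dim))
steps = []
for x in sentence:                           # x has shape (1, input_dim)
    out, hidden = lstm(x.view(1, 1, -1), hidden)
    steps.append(out)
out_steps = torch.cat(steps)

print(out_all.shape)                         # torch.Size([3, 1, 3])
print(torch.allclose(out_all, out_steps, atol=1e-5))
```

Both options should produce the same outputs up to floating-point error; the one-step variant is useful when each input depends on the previous output.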


In [1]:
# Author: Robert Guthrie

import torch
import torch.autograd as autograd #for automatic differentiation
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1) # set the seed of the random number generator

<torch._C.Generator at 0x7f5e1c17b150>

#### Prepare the data:

In [2]:
def prepare_sequence(seq, to_ix):
    # to_ix is a dictionary mapping each element (word or tag) to its index
    idxs = [to_ix[w] for w in seq]  # look up the index of every element in seq
    # Return a tensor of indices (the Variable wrapper is unnecessary since PyTorch 0.4)
    return torch.tensor(idxs, dtype=torch.long)

training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix) # Assign a unique index to each word.
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2} # Define the tags numerically

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}


Before we can define the model, we have to understand what a word embedding is.
A word can be represented as a binary vector whose length is the size of the vocabulary, with a single 1 at the word's index. This is called a one-hot representation. It becomes impractical once the vocabulary is large, and it treats every word as independent of all the others, so it carries no information about similarity or context. For this reason there is an alternative encoding called a word embedding. Words are mapped to real-valued vectors, $W: \text{Words} \rightarrow \mathbb{R}^n$, for example:

$$ W("cat")=(0.2, -0.4, 0.7, ...)$$
$$ W("beagle")=(0.0, 0.6, -0.1, ...)$$

This sounds great, but how do we choose the dimensions of the encoding? One could try to hand-pick the relevant features one by one, but it is unclear what is relevant: the meaning, the context, the grammatical position? This is where neural networks come in. Their benefit is that they can discover the representation on their own, because the embedding vectors are parameters learned during training.
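
The difference between the two encodings can be seen side by side. The sketch below uses a hypothetical three-word vocabulary and an assumed embedding size of 4; neither comes from the tutorial data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab = {"cat": 0, "beagle": 1, "dog": 2}   # hypothetical toy vocabulary
embedding_dim = 4                           # illustrative size (assumed)

# One-hot: one dimension per vocabulary word, a single 1 per vector
one_hot = F.one_hot(torch.tensor([vocab["cat"]]), num_classes=len(vocab))[0]
print(one_hot)                              # tensor([1, 0, 0])

# Embedding: a trainable lookup table with one dense row per word
W = nn.Embedding(len(vocab), embedding_dim)
cat_vec = W(torch.tensor([vocab["cat"]]))   # shape (1, embedding_dim)
print(cat_vec.shape)                        # torch.Size([1, 4])

# The lookup equals multiplying the one-hot vector by the weight matrix
same = torch.allclose(one_hot.float() @ W.weight, cat_vec[0])
print(same, W.weight.requires_grad)
```

The rows of `W.weight` start out random and require gradients, so the optimizer adjusts them together with the rest of the network.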

#### So let's use them to create the model:

In [3]:
# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

In [4]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        
        # Word embedding table: vocab_size rows, each of size embedding_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim) 

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the PyTorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        # Returns the pair (h_0, c_0): initial hidden state and initial cell state.
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        # Look up embeddings: input is word indices, output is one embedding per word
        embeds = self.word_embeddings(sentence)
        # Feed the whole sentence at once; view() reshapes to (seq_len, batch=1, embedding_dim)
        lstm_out, self.hidden = self.lstm(
            embeds.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        # Log-probabilities over the tags of each word (dim=1), as expected by NLLLoss
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

In [5]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss() # negative log likelihood loss
optimizer = optim.SGD(model.parameters(), lr=0.1)
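
The model returns log-probabilities because `nn.NLLLoss` expects them. A small sketch of that relationship, using random stand-in scores rather than real model output (the tensor `tag_space` and the `targets` here are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

tag_space = torch.randn(5, 3)                # stand-in raw scores: 5 words, 3 tags
targets = torch.tensor([0, 1, 2, 0, 1])      # DET NN V DET NN as indices

log_probs = F.log_softmax(tag_space, dim=1)  # normalize across tags for each word
nll = nn.NLLLoss()(log_probs, targets)

# NLLLoss averages minus the log-probability assigned to each correct tag
manual = -log_probs[torch.arange(5), targets].mean()

# CrossEntropyLoss fuses log_softmax and NLLLoss into one step
ce = nn.CrossEntropyLoss()(tag_space, targets)

print(torch.allclose(nll, manual), torch.allclose(nll, ce))
```

Keeping `log_softmax` inside the model, as this tutorial does, has the side benefit that the printed scores are directly interpretable as log-probabilities.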

In [6]:
# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
print(tag_scores)

Variable containing:
-1.1157 -1.1563 -1.0282
-1.1721 -1.1107 -1.0190
-1.0874 -1.2354 -0.9883
-0.9600 -1.2786 -1.0827
-0.9597 -1.2682 -1.0917
[torch.FloatTensor of size 5x3]



In [7]:
for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance.
        model.zero_grad()

        # Also, we need to clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

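        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        # The hidden state is re-initialized for every sentence, so the graph
        # need not be kept (the old retain_variables argument is now retain_graph).
        loss.backward()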
        optimizer.step()


In [8]:
# See what the scores are after training
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
# The sentence is "The dog ate the apple". Element i,j is the score for tag j
#  for word i. The predicted tag is the maximum scoring tag.
# Here, we can see the predicted sequence below is 0 1 2 0 1
# since 0 is the index of the maximum value of row 1,
# 1 is the index of the maximum value of row 2, etc.
# Which is DET NN V DET NN, the correct sequence!
print(tag_scores)

Variable containing:
-0.1980 -1.7581 -4.9260
-5.8300 -0.0094 -5.0523
-3.5314 -3.9384 -0.0500
-0.0316 -4.1675 -4.1612
-4.4157 -0.0316 -3.9622
[torch.FloatTensor of size 5x3]
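
Reading the tags off the score matrix can also be done programmatically. The sketch below copies the post-training scores printed above into a tensor and decodes them; the inverse dictionary `ix_to_tag` is a helper introduced here for illustration and is not part of the original notebook.

```python
import torch

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}  # inverse map (illustrative helper)

# Scores copied from the post-training output above
tag_scores = torch.tensor([
    [-0.1980, -1.7581, -4.9260],
    [-5.8300, -0.0094, -5.0523],
    [-3.5314, -3.9384, -0.0500],
    [-0.0316, -4.1675, -4.1612],
    [-4.4157, -0.0316, -3.9622],
])

# The predicted tag of each word is the column with the highest score in its row
predicted_ix = tag_scores.argmax(dim=1).tolist()
predicted_tags = [ix_to_tag[ix] for ix in predicted_ix]
print(predicted_tags)   # ['DET', 'NN', 'V', 'DET', 'NN']
```

With a live model, the same two lines of decoding can be applied directly to the output of `model(inputs)`.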

