# LSTM for Part-of-Speech Tagging
Part-of-speech tagging is the process of assigning each word a category according to its syntactic function, that is, deciding whether a word is a *noun*, *verb*, etc.<br>In this notebook I will create a simple LSTM model that determines whether a word in a given sentence is a *noun*, *verb* or *determiner*.<br>

#### Why do we even need that?
It can be used in various ways, but the most popular and useful are:
- Determining what subject someone is talking about
- Generating artificial sentences
- Understanding the context of a sentence (example: "We have a **major** advantage" vs. "**Major** Ted, report for duty")

# Preparing the Data
"The data" in this case is 4 sentences I wrote. It is a very small dataset, but it is enough for the sake of example. The training set is a list of 4 tuples, where each tuple has the following structure: `(["word1", "word2", "word3", ...],["tag1", "tag2", "tag3", ...])`. The tags are `DET`, `NN` and `V`.

In [48]:
# import needed libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [16]:
# create training data
training_data = [
    ("The princess drunk that juice".lower().split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Taylor admire the Kanye".lower().split(), ["NN", "V", "DET", "NN"]),
    ("The dog likes that rope".lower().split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Those cats eat garbage".lower().split(), ["DET", "NN", "V", "NN"])
]

# create dictionary of unique words
word2idx = {}
for words, tags in training_data:
    for word in words:
        if word not in word2idx:
            word2idx[word] = len(word2idx)
            
# create dictionary for tags also
tag2idx = {"DET" : 0, "NN" : 1, "V" : 2}

print(word2idx)
print(tag2idx)

{'the': 0, 'princess': 1, 'drunk': 2, 'that': 3, 'juice': 4, 'taylor': 5, 'admire': 6, 'kanye': 7, 'dog': 8, 'likes': 9, 'rope': 10, 'those': 11, 'cats': 12, 'eat': 13, 'garbage': 14}
{'DET': 0, 'NN': 1, 'V': 2}


Now let's define a helper function that converts a list of words into a torch tensor of indices using the previously defined `word2idx`.

In [54]:
def prepare_sequence(sequence, dictionary):
    """
    
    Convert a list of words into a torch tensor of indices.

    Parameters:
    sequence - list of words that will be mapped to a torch tensor
    dictionary - dict that maps words to indices
    
    """
    mappedwords = [dictionary[word] for word in sequence]
    return torch.LongTensor(mappedwords)

example = prepare_sequence(training_data[1][0], word2idx)
print(example)

tensor([5, 6, 0, 7])
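The same helper also works for tag lists, because it only needs a dictionary that maps items to indices. Below is a minimal sketch that repeats `tag2idx` and a compact version of `prepare_sequence` so it runs on its own; it converts the tags of the second training sentence into a target tensor.

```python
import torch

# Tag dictionary as defined earlier in the notebook
tag2idx = {"DET": 0, "NN": 1, "V": 2}

def prepare_sequence(sequence, dictionary):
    """Map each item of `sequence` to its index and return a LongTensor."""
    return torch.tensor([dictionary[item] for item in sequence], dtype=torch.long)

# Tags of the second training sentence: "taylor admire the kanye"
tags = ["NN", "V", "DET", "NN"]
targets = prepare_sequence(tags, tag2idx)
print(targets)  # tensor([1, 2, 0, 1])
```

Note that a word missing from the dictionary raises a `KeyError`, which is why the model below assumes every word is in the vocabulary.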


# The model

Assumptions:
- Input is a sequence of words, e.g. ["word1", "word2", "word3", ...]
- All words are in the previously defined vocabulary: `word2idx`
- We have 3 tags: noun (NN), verb (V) and determiner (DET)
- The goal is to predict a tag for each word

Words are discrete symbols, so before the network can process them we have to turn them into numbers. For that we use *word embeddings*: each word in our vocabulary is represented as a vector of size `n`. Each entry in a vector can be treated as a feature of the word, so words (embedded vectors) can be compared using the angle between them as a measure of similarity (more about that [here](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#word-embeddings-in-pytorch)). The number of words in a sentence can vary, which the LSTM handles by processing the sequence one word at a time.<br>
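To make the idea of comparing embedded vectors concrete, here is a small sketch. The toy vocabulary and the embedding size of 4 are assumptions made only for this illustration. The vectors are randomly initialized, so the similarity value carries no meaning until the embeddings have been trained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # only to make the random initial vectors repeatable

# Hypothetical toy vocabulary, used only for this illustration
vocab = {"dog": 0, "cat": 1, "juice": 2}

# Each of the 3 words gets a trainable vector of size n = 4
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

ids = torch.tensor([vocab["dog"], vocab["cat"]], dtype=torch.long)
vectors = embedding(ids)  # shape: (2, 4), one row per word

# Cosine similarity is the cosine of the angle between two vectors, in [-1, 1]
similarity = F.cosine_similarity(vectors[0], vectors[1], dim=0)
print(vectors.shape, similarity.item())
```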

Structure of LSTM<br>
<img src="images/LSTM3.png"><br>
Credits: Udacity Computer vision Nanodegree



In [58]:
# Model for tagging parts of speech
class LSTMTagger(nn.Module):
    
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        """
        Initialize layers of this model.
        
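        Parameters:
        embedding_dim - size of each word embedding vector
        hidden_dim - size of the LSTM hidden state
        vocab_size - number of unique words in the vocabulary
        tagset_size - number of distinct tags to predict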
        """
        super().__init__()
        # set dimension of hidden layer
        self.hidden_dim = hidden_dim
        
        # first layer - embedding: turns words into vector of size embedding_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # second layer - LSTM takes embedded word vectors as input and outputs hidden states of size hidden_dim
        # (in, out) = (embedding_dim, hidden_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        
        # third layer - linear layer that maps output of the LSTM to the number of tags we want
        # (in, out) = (hidden_dim, tagset_size)
        self.hidden2tags = nn.Linear(hidden_dim, tagset_size)
        
        # initialize the hidden state
        self.hidden = self.init_hidden()
    
    def init_hidden(self):
        """
        Initialize the hidden state of the model. At the beginning of training we set it to zeros
        because the model has not seen anything yet.
        """
        # dimensions here are (n_layers, batch_size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim), torch.zeros(1, 1, self.hidden_dim))
    
    def forward(self, sentence):
        """
        Define the forward pass of the model
        """
        # create embedded vectors for each word in a sentence
        embedding = self.word_embeddings(sentence)
        
        # get the output and hidden state of the LSTM by applying it to the embedded vectors
        lstm_out, self.hidden = self.lstm(embedding.view(len(sentence), 1, -1), self.hidden)
        
        # get log-probability scores over all tags for each word
        tag_outputs = self.hidden2tags(lstm_out.view(len(sentence),-1))
        tag_scores = F.log_softmax(tag_outputs, dim=1)
        
        return tag_scores        
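To see how the tensor shapes change from layer to layer, the sketch below runs the same three layers outside the class. The sizes are illustrative assumptions (15 words, embedding and hidden size 6, 3 tags), and the weights are random, so only the shapes are meaningful here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions for this sketch, not tuned values)
vocab_size, embedding_dim, hidden_dim, tagset_size = 15, 6, 6, 3

embed = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
linear = nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([5, 6, 0, 7], dtype=torch.long)  # 4 word indices

embedded = embed(sentence)  # (4, 6): one vector per word

# nn.LSTM expects input of shape (seq_len, batch, input_size) by default
hidden = (torch.zeros(1, 1, hidden_dim), torch.zeros(1, 1, hidden_dim))
lstm_out, hidden = lstm(embedded.view(len(sentence), 1, -1), hidden)  # (4, 1, 6)

# Flatten the batch dimension, map to tag space, normalize per word
tag_scores = F.log_softmax(linear(lstm_out.view(len(sentence), -1)), dim=1)  # (4, 3)

print(embedded.shape, lstm_out.shape, tag_scores.shape)
```

Because `log_softmax` is applied along `dim=1`, exponentiating each row of `tag_scores` gives a probability distribution over the 3 tags for that word.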

# Training

In [62]:
# define embedding and hidden dimensions; this is a simple example so we keep them small
# in more complex tasks those vectors grow to sizes like 64, 128 or even 256
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

# instantiate the model
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word2idx), len(tag2idx))

# define loss function and optimizer
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# and now a little test before training
test_sentence = "The dog drunk that juice".lower().split()

# first we need to prepare input sequence
inputs = prepare_sequence(test_sentence,word2idx)

tag_scores = model(inputs)
# here we have a torch tensor of size (5, 3) because for each word we have 3 scores, one per part-of-speech tag (DET, NN, V)
print(tag_scores, tag_scores.shape)

tensor([[-1.1463, -0.9253, -1.2525],
        [-0.9971, -0.9871, -1.3532],
        [-1.0648, -0.9699, -1.2870],
        [-1.0216, -1.0291, -1.2636],
        [-1.0290, -1.0579, -1.2193]], grad_fn=<LogSoftmaxBackward>) torch.Size([5, 3])
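Each row of this tensor holds log-probabilities for one word, so the predicted tag is the index of the largest value in the row. The sketch below decodes the scores printed above; `idx2tag` is a hypothetical reverse lookup introduced only for this example. Since the model is still untrained, the predictions are essentially arbitrary.

```python
import torch

tag2idx = {"DET": 0, "NN": 1, "V": 2}
# Hypothetical helper: reverse mapping from index back to tag name
idx2tag = {idx: tag for tag, idx in tag2idx.items()}

# Scores copied from the output above (untrained model)
tag_scores = torch.tensor([
    [-1.1463, -0.9253, -1.2525],
    [-0.9971, -0.9871, -1.3532],
    [-1.0648, -0.9699, -1.2870],
    [-1.0216, -1.0291, -1.2636],
    [-1.0290, -1.0579, -1.2193],
])

# Index of the highest score in each row, i.e. the most likely tag per word
predicted_idx = tag_scores.argmax(dim=1)
predicted_tags = [idx2tag[i] for i in predicted_idx.tolist()]
print(predicted_tags)  # ['NN', 'NN', 'NN', 'DET', 'DET']
```

The expected tags for "the dog drunk that juice" would be `DET, NN, V, DET, NN`, so the untrained model gets only one of five words right.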


## Let's train the model

In each epoch of the training loop, every sentence goes through the LSTM model. For each sentence the following actions are taken:
- zero the gradients
- zero the hidden state of the LSTM. Why? Because the hidden state is for "remembering" words within a sentence in order to establish connections between them. Not zeroing the hidden state after each sentence would cause the end state from one sentence to be an input to the first LSTM cell of the following