# LSTM for Part-of-Speech Tagging
Part-of-speech tagging is the process of determining the category of a word according to its syntactic function, that is, deciding whether a word is a *noun*, a *verb*, etc.<br>In this notebook I will create a simple LSTM model that can determine whether a word in a given sentence is a *determiner*, *noun* or *verb*.<br>

#### Why do we even need that?
It can be used in various ways, but the most popular and useful are:
- Determining what subject someone is talking about
- Generating artificial sentences
- Understanding the context of a sentence (example: "We have a **major** advantage" vs. "**Major** Ted, report for duty")

# Preparing the Data
"The data" in this case is 4 sentences I wrote, so it is a very small dataset, but it is enough for the sake of example. The training set is a list of 4 tuples, where each tuple has the following structure: `(["word1", "word2", "word3", ...], ["tag1", "tag2", "tag3", ...])`, and the tags are `DET`, `NN` and `V`.

In [17]:
# import needed libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [16]:
# create training data
training_data = [
    ("The princess drunk that juice".lower().split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Taylor admire the Kanye".lower().split(), ["NN", "V", "DET", "NN"]),
    ("The dog likes that rope".lower().split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Those cats eat garbage".lower().split(), ["DET", "NN", "V", "NN"])
]

# create dictionary of unique words
word2idx = {}
for words, tags in training_data:
    for word in words:
        if word not in word2idx:
            word2idx[word] = len(word2idx)
            
# create dictionary for tags also
tag2idx = {"DET" : 0, "NN" : 1, "V" : 2}

print(word2idx)
print(tag2idx)

{'the': 0, 'princess': 1, 'drunk': 2, 'that': 3, 'juice': 4, 'taylor': 5, 'admire': 6, 'kanye': 7, 'dog': 8, 'likes': 9, 'rope': 10, 'those': 11, 'cats': 12, 'eat': 13, 'garbage': 14}
{'DET': 0, 'NN': 1, 'V': 2}


Now let's define a helper function that converts a list of words into a torch tensor of indices using the previously defined `word2idx`.

In [28]:
def prepare_sequence(sequence, dictionary):
    """
    Map a list of words to a tensor of integer indices.

    Parameters:
    sequence - list of words that will be mapped to torch tensor
    dictionary - dict that maps words to indices

    """
    mappedwords = [dictionary[word] for word in sequence]
    # indices must be integers (long) to be usable with nn.Embedding
    return torch.tensor(mappedwords, dtype=torch.long)

example = prepare_sequence(training_data[1][0], word2idx)
print(example)

tensor([5, 6, 0, 7])
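
The same kind of mapping can be applied to the tags to build training targets. The snippet below is a minimal, self-contained sketch: `encode` is a hypothetical stand-in for the helper above, and it assumes integer (long) indices, which is what classification losses such as `nn.NLLLoss` expect as targets.

```python
import torch

# Same tag dictionary as defined earlier in the notebook
tag2idx = {"DET": 0, "NN": 1, "V": 2}

def encode(sequence, dictionary):
    # Hypothetical stand-in for the helper above: returns integer (long) indices
    return torch.tensor([dictionary[item] for item in sequence], dtype=torch.long)

# Tags of the second training sentence: "taylor admire the kanye"
targets = encode(["NN", "V", "DET", "NN"], tag2idx)
print(targets)  # tensor([1, 2, 0, 1])
```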


# The model

Assumptions:
- The input is a sequence of words: `["word1", "word2", "word3", ...]`
- All words are in the previously defined vocabulary: `word2idx`
- We have 3 tags: Noun (NN), Verb (V) and Determiner (DET)
- The goal is to predict a tag for each word

But there is a problem with the input: the model cannot work with raw words, and the number of words in a sentence can vary. The LSTM handles the varying length by processing a sentence one word at a time, and to give it numeric input of a fixed size we use *word embeddings*. Each word in our vocabulary will be represented as a vector of size `n`. Moreover, each entry in the vector can be treated as a feature of the word, so words (embedded vectors) can be compared using the angle between them as a measure of similarity (more about that [here](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#word-embeddings-in-pytorch)).<br>
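
To make the idea of embeddings more concrete, here is a minimal sketch using `nn.Embedding`. The tiny vocabulary and the embedding size of 6 are assumptions chosen only for illustration. The vectors are randomly initialized, so the similarity printed here carries no meaning until the embeddings have been trained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # make the random initial embeddings reproducible

# Small hypothetical vocabulary, for illustration only
vocab = {"the": 0, "dog": 1, "cats": 2}
embedding_dim = 6  # the size `n` of each word vector (arbitrary choice)

# One learnable vector of size `embedding_dim` per word in the vocabulary
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embedding_dim)

# Look up vectors by integer (long) index
indices = torch.tensor([vocab["dog"], vocab["cats"]], dtype=torch.long)
vectors = embedding(indices)  # shape: (2, embedding_dim)

# Cosine similarity is the cosine of the angle between the two vectors
similarity = F.cosine_similarity(vectors[0], vectors[1], dim=0)
print(vectors.shape)
print(similarity.item())
```

The cosine similarity always lies between -1 and 1, with values close to 1 meaning the two vectors point in nearly the same direction.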

Structure of an LSTM<br>
<img src="images/LSTM3.png"><br>
Credits: Udacity Computer Vision Nanodegree
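
As a rough illustration of how the pieces fit together, the sketch below passes sentences of different lengths through an embedding layer, an `nn.LSTM` and a linear layer, and checks that we get one score per tag for each word. The layer sizes and the names `hidden2tag` and `tag_scores` are assumptions made for this example, not necessarily the model used in this notebook, and the layers are untrained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes: 15 words in the vocabulary, 3 tags; the rest is arbitrary
vocab_size, embedding_dim, hidden_dim, num_tags = 15, 6, 8, 3

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)  # input shape: (seq_len, batch, embedding_dim)
hidden2tag = nn.Linear(hidden_dim, num_tags)

def tag_scores(word_indices):
    embeds = embedding(word_indices)         # (seq_len, embedding_dim)
    lstm_out, _ = lstm(embeds.unsqueeze(1))  # (seq_len, 1, hidden_dim), batch of one
    return hidden2tag(lstm_out.squeeze(1))   # (seq_len, num_tags)

short = torch.tensor([5, 6, 0, 7], dtype=torch.long)      # 4 words
longer = torch.tensor([0, 1, 2, 3, 4], dtype=torch.long)  # 5 words
print(tag_scores(short).shape)   # torch.Size([4, 3])
print(tag_scores(longer).shape)  # torch.Size([5, 3])
```

The same layers handle both sentences because the LSTM produces one hidden state per input word, regardless of how many words there are.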

