# Grammatical Tagging with LSTM

Create a Recursive Neural Network (RNN) to determine the major 9 categories of words in a sentence: 
- Noun
- Verb
- Article
- Adjective 
- Preposition 
- Pronoun 
- Adverb 
- Conjunction
- Interjection

As this is a simplified example to experiment with Long Short-Term Memory (LSTM) neural network, it will only uses a subset of the 9 categories.  Secifically juss the following 5 catecories:
- Noun (N)
- Verb (V)
- Article (ART)
- Adjective (ADJ)
- Pronoun (PRO)

With this we can just can just analyze simple sentences, such as "I like McDonalds"

## Prerequisites

In [1]:
import platform
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

%matplotlib inline

print("Python version: ", platform.python_version())
print("Torch version: ", torch.__version__)

Python version:  3.6.10
Torch version:  1.5.0


## Make up some simple sentences as Training Data

In [2]:
# Create a list of some simple sentences as training data and the category tags
training_sentences = [
    ("The cat caught the mouse".lower().split(), ["ART", "N", "V", "ART", "N"]),
    ("The mouse loves cheese".lower().split(), ["ART", "N", "V", "N"]),
    ("The dog hates the cat".lower().split(), ["ART", "N", "V", "ART", "N"]),
    ("The dog sleeps".lower().split(), ["ART", "N", "V"]),
    ("The cat is black".lower().split(), ["ART", "N", "V", "ADJ"]),
    ("The dog is white".lower().split(), ["ART", "N", "V", "ADJ"]),
    ("The cat runs".lower().split(), ["ART", "N", "V"]),
    ("I like cheese".lower().split(), ["PRO", "N", "V"]),
    ("The cheese is yellow".lower().split(), ["ART", "N", "V", "ADJ"]),
    ("You like the cat".lower().split(), ["PRO", "V", "ART", "N"]),
    ("She watches TV".lower().split(), ["PRO", "V", "N"])
]

# print(training_sentences)

# Dictionary to map words to indices
word_index = {}
for sentence, tags in training_sentences:
    for word in sentence:
        if word not in word_index:
            word_index[word] = len(word_index)
            
print(word_index)

# Dictionary to map tags to indices
tag_index = {"N": 0, "V": 1, "ART": 2, "ADJ": 3, "PRO": 4 }
print(tag_index)

{'the': 0, 'cat': 1, 'caught': 2, 'mouse': 3, 'loves': 4, 'cheese': 5, 'dog': 6, 'hates': 7, 'sleeps': 8, 'is': 9, 'black': 10, 'white': 11, 'runs': 12, 'i': 13, 'like': 14, 'yellow': 15, 'you': 16, 'she': 17, 'watches': 18, 'tv': 19}
{'N': 0, 'V': 1, 'ART': 2, 'ADJ': 3, 'PRO': 4}


In [3]:
# Convert a sentence to a numerical tensor
def sentence2tensor(sentence, to_index):
    '''Convert a word sentence to numerical tensor'''
    indexes = [to_index[word] for word in sentence]
    indexes = np.array(indexes)
    return torch.from_numpy(indexes).type(torch.LongTensor)

# Check the the tensor conversion
sample_tensor = sentence2tensor("I like cheese".lower().split(), word_index)
print(sample_tensor)

tensor([13, 14,  5])


## Define the LSTM Neural Network

Simple LSTM that takes in a sentence broken down to sqeuence of words.  The words in the sentence are all from known words list. The network will predict that categories for the words in the sentence.  The prediction is done by applying softmax to the hidden state of the LSTM.  The first layer of the model is an Embeddeding layer.

In [4]:
class GrammaticalTagger(nn.Module):
    
    def __init__(self, embedding_dim, hidden_dim, vocabulary_size, tagset_size):
        '''Init'''
        super(GrammaticalTagger, self).__init__()
        
        self.hidden_dim = hidden_dim
        
        # Embedding layer turning words into a specificied size vector
        self.word_embeddings = nn.Embedding(vocabulary_size, embedding_dim)
        
        # LSTM layer takes embedded word vectors as inputs and output hidden states
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        
        # Linear layer maps hidden layer into the output layer with the number of tags
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        
        # Initialize hidden state
        self.hidden = self.init_hidden()
        
    def init_hidden(self):
        '''Initialize the hidden state'''
        # (number of layers, batch size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim), torch.zeros(1, 1, self.hidden_dim))
    
    def forward(self, sentence):
        '''Model feedfoward inference'''
        # first create embedded word vectors
        embeds = self.word_embeddings(sentence)
        
        # Get Output and hidden states 
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        
        # Get the scores for tags
        tag_outputs = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_outputs, dim=1)
        
        return tag_scores
                        

## Instantiate model and set hyper parameters

In [5]:
# Embedding_dim defines the size of word vectors
embeddeding_dim = 6
hidden_dim = 6

# Instantiate model
tagger_model = GrammaticalTagger(embeddeding_dim, hidden_dim, len(word_index), len(tag_index))
                                
# Define loss function and optimizer
loss_function = nn.NLLLoss()
optimizer = optim.SGD(tagger_model.parameters(), lr=0.1)


## Sanity Check

Pass a test sentence thru just to check that we get a reasonable response thru forward pass

In [6]:
test_sentence = "The dog caught the cat".lower().split()

input_tensor = sentence2tensor(test_sentence, word_index)
print("Input_tensor: ", input_tensor)

tag_scores = tagger_model(input_tensor)
print(tag_scores)

Input_tensor:  tensor([0, 6, 2, 0, 1])
tensor([[-1.8660, -1.3212, -1.9169, -1.7903, -1.3301],
        [-1.8355, -1.2306, -2.0042, -1.8759, -1.3456],
        [-1.8811, -1.3204, -1.9167, -1.7908, -1.3220],
        [-1.8858, -1.2962, -1.9018, -1.7942, -1.3503],
        [-1.8558, -1.3091, -1.9910, -1.7990, -1.3035]],
       grad_fn=<LogSoftmaxBackward>)


## Training

Loop thru large number of epochs as we have such a small number of training samples. Peform typical training tasks: Zero the gradients, initialize the hidden state, feed training data forward thru network, calculate error, update weights thru backpropagation.  Rinse and repeat.  

In [7]:
num_epochs = 500

for epoch in range(num_epochs):
    
    epoch_loss = 0.0
    
    # Loop over sentences and tags in training data
    for sentence, tags in training_sentences:
                
        # Zero gradients
        tagger_model.zero_grad()
        
        # Zero the hidden state, remove history
        tagger_model.hidden = tagger_model.init_hidden()
        
        # Prepare inputs for the network
        input_tensor = sentence2tensor(sentence, word_index)
        #print("Input Tensor: ", input_tensor)
        target_tags = sentence2tensor(tags, tag_index)
        #print("Target Scores: ", target_scores)

        # Run forward pass
        result_tags = tagger_model(input_tensor)
        
        # Compute loss and gradient
        loss = loss_function(result_tags, target_tags)
        epoch_loss += loss.item()
        loss.backward()
        
        # Update network weights
        optimizer.step()
        
    # Print out loss for every 25 epochs
    if (epoch % 25 == 24):
        print("Epoch # ", epoch+1, "loss: ", epoch_loss/len(training_sentences))
        

Epoch #  25 loss:  1.0027798305858264
Epoch #  50 loss:  0.5931678847833113
Epoch #  75 loss:  0.2780598137866367
Epoch #  100 loss:  0.09165386483073235
Epoch #  125 loss:  0.042576139962131325
Epoch #  150 loss:  0.02742708292366429
Epoch #  175 loss:  0.02015267460691658
Epoch #  200 loss:  0.015891543293202467
Epoch #  225 loss:  0.013096884993666952
Epoch #  250 loss:  0.0111245309129696
Epoch #  275 loss:  0.009658936940302903
Epoch #  300 loss:  0.008527391209182415
Epoch #  325 loss:  0.007627664493735541
Epoch #  350 loss:  0.006895272717387838
Epoch #  375 loss:  0.006287553823891689
Epoch #  400 loss:  0.005775286964225498
Epoch #  425 loss:  0.0053376929665153675
Epoch #  450 loss:  0.004959619551135058
Epoch #  475 loss:  0.004629754736511545
Epoch #  500 loss:  0.004339499462565238


## Testing

In [12]:
test_sentence = "I like the dog".lower().split()

# Run thru the network 
input_tensor = sentence2tensor(test_sentence, word_index)
result_tags = tagger_model(input_tensor)
print("Result Tensor:")
print(result_tags)

# Get the maximum score for most likely result
_, predicted_tags = torch.max(result_tags, 1)
print("Predicted Categories:")
print(predicted_tags)
#tag_index = {"N": 0, "V": 1, "ART": 2, "ADJ": 3, "PRO": 4 }
print("Should have been: 4, 1, 2, 0")

Result Tensor:
tensor([[-4.8152e+00, -1.0538e+00, -5.5886e+00, -6.5666e+00, -4.4921e-01],
        [-5.5098e-02, -3.0479e+00, -8.6860e+00, -6.3506e+00, -5.4643e+00],
        [-5.5126e+00, -5.2395e+00, -1.0135e-02, -8.2610e+00, -7.6262e+00],
        [-3.1144e-03, -5.9378e+00, -8.0129e+00, -9.1729e+00, -1.0213e+01]],
       grad_fn=<LogSoftmaxBackward>)
Predicted Categories:
tensor([4, 0, 2, 0])
Should have been: 4, 1, 2, 0
