# NLP Homework 4 Programming Assignment

In this assignment, we will train and evaluate a neural model to tag the parts of speech in a sentence.
We will also implement several modifications to the model and measure how they affect its performance.

We will be using English text from the Wall Street Journal, marked with POS tags such as `NNP` (proper noun) and `DT` (determiner).

## Building a POS Tagger


### Setup

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random

random.seed(1)

In [1]:
from functions import *

### Preparing Data

The relevant data is present in the files `train.txt` and `test.txt`.

`train.txt`: The training data. The file contains sequences of words and their respective tags. We split this data 80%/20% into training and development sets, used to train the model and to tune the hyperparameters, respectively. See `load_tag_data` for details on how to read the training data.

`test.txt`: The test data. Use this file only for the final evaluation of the trained models. See `load_txt_data` for details on how to read the test data.
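
As a rough illustration of the layout that `load_tag_data` appears to expect, the sketch below parses a few in-memory lines. It assumes three whitespace-separated columns per token (word, tag, and a third field that is ignored) and a blank line after each sentence. The sample tokens and the helper name `parse_tagged_lines` are made up for this illustration and are not taken from `train.txt` or from the assignment code.

```python
def parse_tagged_lines(lines):
    """Group 'word tag extra' lines into sentences; blank lines end a sentence."""
    sentences, tag_seqs = [], []
    sent, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if sent:  # this sketch skips repeated blank lines
                sentences.append(sent)
                tag_seqs.append(tags)
                sent, tags = [], []
        else:
            word, tag, _ = line.split()  # third column is ignored
            sent.append(word)
            tags.append(tag)
    if sent:  # keep a final sentence that has no trailing blank line
        sentences.append(sent)
        tag_seqs.append(tags)
    return sentences, tag_seqs


# Hypothetical sample; "X" is a placeholder for the ignored third column.
sample = [
    "The DT X",
    "market NN X",
    "fell VBD X",
    "",
    "Mr. NNP X",
    "Smith NNP X",
]
sample_sentences, sample_tags = parse_tagged_lines(sample)
print(sample_sentences)
print(sample_tags)
```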

In [None]:
def load_tag_data(tag_file):
    all_sentences = []
    all_tags = []
    sent = []
    tags = []
    with open(tag_file, 'r') as f:
        for line in f:
            if line.strip() == "":
                all_sentences.append(sent)
                all_tags.append(tags)
                sent = []
                tags = []
            else:
                word, tag, _ = line.strip().split()
                sent.append(word)
                tags.append(tag)
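    # Keep the last sentence if the file does not end with a blank line
    if sent:
        all_sentences.append(sent)
        all_tags.append(tags)
    return all_sentences, all_tags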

def load_txt_data(txt_file):
    all_sentences = []
    sent = []
    with open(txt_file, 'r') as f:
        for line in f:
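            if line.strip() == "":
                all_sentences.append(sent)
                sent = []
            else:
                word = line.strip()
                sent.append(word)
    # Keep the last sentence if the file does not end with a blank line
    if sent:
        all_sentences.append(sent)
    return all_sentences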

train_sentences, train_tags = load_tag_data('train.txt')
test_sentences = load_txt_data('test.txt')

unique_tags = {tag for tag_seq in train_tags for tag in tag_seq}

# Create train-val split from train data
train_val_data = list(zip(train_sentences, train_tags))
random.shuffle(train_val_data)
split = int(0.8 * len(train_val_data))
training_data = train_val_data[:split]
val_data = train_val_data[split:]

print("Train Data: ", len(training_data))
print("Val Data: ", len(val_data))
print("Test Data: ", len(test_sentences))
print("Total tags: ", len(unique_tags))

### Word-to-Index and Tag-to-Index mapping
In order to work with text in Tensor format, we need to map each word to an index.
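
A toy version of this mapping, using made-up sentences rather than the WSJ data, might look like the following sketch. The names `toy_sentences`, `toy_word_to_idx`, and `encode` are hypothetical and exist only for this illustration.

```python
# Made-up sentences for illustration only.
toy_sentences = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]

# Assign each new word the next free integer index.
toy_word_to_idx = {}
for sent in toy_sentences:
    for word in sent:
        if word not in toy_word_to_idx:
            toy_word_to_idx[word] = len(toy_word_to_idx)


def encode(sent, mapping):
    """Replace every word by its integer index."""
    return [mapping[word] for word in sent]


print(toy_word_to_idx)
print(encode(["the", "cat", "barks"], toy_word_to_idx))
```

A word that was never seen when the mapping was built raises a `KeyError` in this sketch, which is presumably why the notebook also adds the words from `test.txt` to `word_to_idx`.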

In [None]:
word_to_idx = {}
for sent in train_sentences:
    for word in sent:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)

for sent in test_sentences:
    for word in sent:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)
            
tag_to_idx = {}
# Sort so that the tag-to-index mapping is the same on every run
for tag in sorted(unique_tags):
    if tag not in tag_to_idx:
        tag_to_idx[tag] = len(tag_to_idx)

idx_to_tag = {}
for tag in tag_to_idx:
    idx_to_tag[tag_to_idx[tag]] = tag

print("Total tags", len(tag_to_idx))
print("Vocab size", len(word_to_idx))

In [None]:
def prepare_sequence(sent, idx_mapping):
    idxs = [idx_mapping[word] for word in sent]
    return torch.tensor(idxs, dtype=torch.long)

### Set up model
We will build and train a basic POS tagger, an LSTM model that tags the parts of speech in a given sentence.


First we need to define some default hyperparameters.

In [None]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 3
LEARNING_RATE = 0.1
LSTM_LAYERS = 1
DROPOUT = 0
EPOCHS = 30

### Define Model

The model takes as input a sentence as a tensor in the index space. This sentence is then converted to embedding space, where each word maps to its word embedding. The word embeddings are learned as part of the model training process.

These word embeddings act as input to the LSTM, which produces a hidden state for every word. Each hidden state is then passed to a Linear layer that produces scores over the tags for that word (a probability distribution once normalized, e.g. with a softmax). The model will output the tag with the highest score for a given word.
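
To make the tensor shapes in this pipeline concrete, here is a minimal sketch that pushes one toy sentence through an embedding layer, an LSTM, and a Linear layer. It is a shape walkthrough under assumed toy sizes, not the required `BasicPOSTagger` implementation; the variable names and the choice of `log_softmax` for normalization are assumptions made for the illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes, chosen only for illustration (not the assignment's data).
vocab_size, tagset_size = 10, 4
embedding_dim, hidden_dim = 6, 3

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
hidden_to_tag = nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([1, 5, 2, 7, 3], dtype=torch.long)  # 5 word indices

embeds = embedding(sentence)                    # (5, 6)
lstm_out, _ = lstm(embeds.unsqueeze(1))         # (5, 1, 3): seq_len, batch=1, hidden
tag_space = hidden_to_tag(lstm_out.squeeze(1))  # (5, 4)
tag_scores = F.log_softmax(tag_space, dim=1)    # log-probabilities per word
predicted = tag_scores.argmax(dim=1)            # (5,) one tag index per word

print(tag_scores.shape, predicted.shape)
```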

In [None]:
# class BasicPOSTagger(nn.Module):
#     def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
#         super(BasicPOSTagger, self).__init__()
#         #############################################################################
#         # TODO: Define and initialize anything needed for the forward pass.
#         # You are required to create a model with:
#         # an embedding layer: that maps words to the embedding space
#         # an LSTM layer: that takes word embeddings as input and outputs hidden states
#         # a Linear layer: maps from hidden state space to tag space
#         #############################################################################
        
#         #############################################################################
#         #                             END OF YOUR CODE                              #
#         #############################################################################

#     def forward(self, sentence):
#         tag_scores = None
#         #############################################################################
#         # TODO: Implement the forward pass.
#         # Given a tokenized index-mapped sentence as the argument, 
#         # compute the corresponding scores for tags
#         # returns:: tag_scores (Tensor)
#         #############################################################################
        
#         #############################################################################
#         #                             END OF YOUR CODE                              #
#         #############################################################################
#         return tag_scores

### Training

We define train and evaluate procedures that allow us to train our model using our created train-val split.
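
The general pattern of one gradient step followed by an accuracy computation can be sketched as below. `ToyTagger` is a hypothetical stand-in (an embedding followed by a Linear layer, with no LSTM), and the choice of `nn.NLLLoss` with `optim.SGD` is an assumption for the illustration rather than a requirement of the assignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)


# Hypothetical stand-in model: any module that maps word indices to
# per-word log-probabilities over tags would fit here.
class ToyTagger(nn.Module):
    def __init__(self, vocab_size, embedding_dim, tagset_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.out = nn.Linear(embedding_dim, tagset_size)

    def forward(self, sentence):
        return F.log_softmax(self.out(self.embedding(sentence)), dim=1)


toy_model = ToyTagger(vocab_size=10, embedding_dim=6, tagset_size=4)
toy_loss_fn = nn.NLLLoss()  # expects log-probabilities
toy_optimizer = optim.SGD(toy_model.parameters(), lr=0.1)

sentence_in = torch.tensor([1, 5, 2], dtype=torch.long)  # word indices
targets = torch.tensor([0, 3, 1], dtype=torch.long)      # gold tag indices

# One training step.
toy_model.zero_grad()                # clear gradients from the previous step
tag_scores = toy_model(sentence_in)  # (3, 4)
loss = toy_loss_fn(tag_scores, targets)
loss.backward()                      # gradients of the loss w.r.t. parameters
toy_optimizer.step()                 # parameter update

# Evaluation on the same toy example, without tracking gradients.
with torch.no_grad():
    predictions = toy_model(sentence_in).argmax(dim=1)
    correct = (predictions == targets).sum().item()
    accuracy = 100.0 * correct / len(targets)
print(loss.item(), accuracy)
```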

In [None]:
# def train(epoch, model, loss_function, optimizer):
#     train_loss = 0
#     train_examples = 0
#     for sentence, tags in training_data:
#         #############################################################################
#         # TODO: Implement the training loop
#         # Hint: you can use the prepare_sequence method for creating index mappings 
#         # for sentences. Find the gradient with respect to the loss and update the
#         # model parameters using the optimizer.
#         #############################################################################
        
#         #############################################################################
#         #                             END OF YOUR CODE                              #
#         #############################################################################
    
#     avg_train_loss = train_loss / train_examples
#     avg_val_loss, val_accuracy = evaluate(model, loss_function, optimizer)
        
#     print("Epoch: {}/{}\tAvg Train Loss: {:.4f}\tAvg Val Loss: {:.4f}\t Val Accuracy: {:.0f}".format(epoch, 
#                                                                       EPOCHS, 
#                                                                       avg_train_loss, 
#                                                                       avg_val_loss,
#                                                                       val_accuracy))

# def evaluate(model, loss_function, optimizer):
#   # returns:: avg_val_loss (float)
#   # returns:: val_accuracy (float)
#     val_loss = 0
#     correct = 0
#     val_examples = 0
#     with torch.no_grad():
#         for sentence, tags in val_data:
#             #############################################################################
#             # TODO: Implement the evaluate loop
#             # Find the average validation loss along with the validation accuracy.
#             # Hint: To find the accuracy, argmax of tag predictions can be used.
#             #############################################################################
            
#             #############################################################################
#             #                             END OF YOUR CODE                              #
#             #############################################################################
#     val_accuracy = 100. * correct / val_examples
#     avg_val_loss = val_loss / val_examples
#     return avg_val_loss, val_accuracy

In [None]:
#############################################################################
# TODO: Initialize the model, optimizer and the loss function
#############################################################################

#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
for epoch in range(1, EPOCHS + 1): 
    train(epoch, model, loss_function, optimizer)

**Sanity Check!** After 5 epochs you should get around 53% accuracy, after 10 epochs the accuracy should be around 61%, the accuracy rises to around 70% after 20 epochs, and to around 75% accuracy after 30 epochs.

**Note 1:** If you run the notebook on a CPU, all 30 epochs may take a while. Try a small number of epochs first as a sanity check, then run the model for all 30 epochs. You are free to use a GPU for this assignment.

**Note 2:** Your accuracy may not exactly match what is reported here, but as long as it trends upward, you should be fine.

**Note 3:** The reported numbers use the default hyperparameters. Once you reach the expected accuracy, you can try different hyperparameter settings to improve it further.

Write a method to save the predictions for the test set.
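
Judging from the file-writing code in the `test` skeleton, the output file appears to hold one predicted tag per line, with an empty line after each sentence. The sketch below builds that layout in memory from made-up predictions; `format_predictions` is a hypothetical helper, not part of the assignment code.

```python
def format_predictions(tag_sequences):
    """Flatten per-sentence tag lists into lines, with "" after each sentence."""
    lines = []
    for tags in tag_sequences:
        lines.extend(tags)
        lines.append("")  # blank line marks the end of a sentence
    return lines


# Made-up predictions for two sentences.
toy_predictions = [["DT", "NN", "VBD"], ["NNP", "NNP"]]
lines = format_predictions(toy_predictions)

# Same formatting as the skeleton's f.write("%s\n" % item) loop.
text = "".join("%s\n" % item for item in lines)
print(text)
```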

In [None]:
# def test():
#     val_loss = 0
#     correct = 0
#     val_examples = 0
#     predicted_tags = []
#     with torch.no_grad():
#         for sentence in test_sentences:
#             #############################################################################
#             # TODO: Implement the test loop
#             # This method saves the predicted tags for the sentences in the test set.
#             # The tags are first added to a list which is then written to a file for
#             # submission. An empty string is added after every sequence of tags
#             # corresponding to a sentence to add a newline following file formatting
#             # convention, as has been done already.
#             #############################################################################
            
#             #############################################################################
#             #                             END OF YOUR CODE                              #
#             #############################################################################
#             predicted_tags.append("")

#     with open('test_labels.txt', 'w+') as f:
#         for item in predicted_tags:
#             f.write("%s\n" % item)

In [None]:
test()


### Test accuracy

Evaluate your performance on the test data by submitting test_labels.txt generated by the method above and **report your test accuracy here**.

Following the method above, generate predictions for the validation data.
Create lists of the words, the tags predicted by the model, and the ground-truth tags.

Use these lists to carry out error analysis to find the top-10 types of errors made by the model.
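
One possible way to count error types from three parallel lists is sketched below with `collections.Counter`. The function name `top_confusions`, its parameters, and the toy data are assumptions made for this illustration; the tuple layout follows the (model tag, ground-truth tag, frequency, example words) format requested in this section.

```python
from collections import Counter, defaultdict


def top_confusions(word_list, model_tags, gt_tags, k=10, max_examples=3):
    """Return the k most frequent (predicted, gold) mismatches with examples."""
    counts = Counter()
    examples = defaultdict(list)
    for word, pred, gold in zip(word_list, model_tags, gt_tags):
        if pred != gold:
            key = (pred, gold)
            counts[key] += 1
            # Store a few distinct example words per error type.
            if word not in examples[key] and len(examples[key]) < max_examples:
                examples[key].append(word)
    return [(pred, gold, freq, examples[(pred, gold)])
            for (pred, gold), freq in counts.most_common(k)]


# Made-up predictions for illustration.
toy_words = ["run", "the", "fast", "walk", "Paris", "jump"]
toy_pred = ["NN", "DT", "JJ", "NN", "NN", "NN"]
toy_gold = ["VB", "DT", "RB", "VB", "NNP", "VB"]
errors = top_confusions(toy_words, toy_pred, toy_gold)
for row in errors:
    print(row)
```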

In [None]:
# #############################################################################
# # TODO: Generate predictions from val data
# # Create lists of words, tags predicted by the model and ground truth tags.
# #############################################################################
# def generate_predictions(model, test_sentences):
#     # returns:: word_list (str list)
#     # returns:: model_tags (str list)
#     # returns:: gt_tags (str list)
#     # Your code here
#     return word_list, model_tags, gt_tags

# #############################################################################
# # TODO: Carry out error analysis
# # From those lists collected from the above method, find the 
# # top-10 tuples of (model_tag, ground_truth_tag, frequency, example words)
# # sorted by frequency
# #############################################################################
# def error_analysis(word_list, model_tags, gt_tags):
#     # returns: errors (list of tuples)
#     # Your code here
#     return errors

### Error analysis
**Report your findings here.**  
What kinds of errors did the model make and why do you think it made them?

## Character-level POS Tagger

Use character-level information to augment the word embeddings. Suffixes such as -ing or -ly carry a lot of information about a word's POS tag. To incorporate this information, run a character-level LSTM over every word (treated as a tensor of characters, each mapped to character-index space) to create a character-level representation of the word. Take the last hidden state of the character-level LSTM as this representation and concatenate it with the word embedding (as in the `BasicPOSTagger`) to create a new word embedding that captures more information.
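
The concatenation described above can be sketched as follows. This is an illustration under assumed toy vocabularies and sizes, not the required `CharPOSTagger`; for simplicity it runs the character LSTM on one word at a time instead of padding words to a common length, which is only one of several possible designs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy vocabularies and sizes, made up for illustration.
toy_char_to_idx = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
toy_word_to_idx = {"running": 0, "quickly": 1, "dog": 2}
word_embedding_dim, char_embedding_dim, char_hidden_dim = 6, 3, 3

word_embedding = nn.Embedding(len(toy_word_to_idx), word_embedding_dim)
char_embedding = nn.Embedding(len(toy_char_to_idx), char_embedding_dim)
char_lstm = nn.LSTM(char_embedding_dim, char_hidden_dim)

sentence = ["dog", "running", "quickly"]
combined = []
for word in sentence:
    char_idxs = torch.tensor([toy_char_to_idx[c] for c in word], dtype=torch.long)
    char_embeds = char_embedding(char_idxs).unsqueeze(1)  # (word_len, 1, 3)
    _, (h_n, _) = char_lstm(char_embeds)                  # h_n: (1, 1, 3)
    char_repr = h_n[-1, 0]                                # last hidden state, (3,)
    word_vec = word_embedding(torch.tensor(toy_word_to_idx[word]))  # (6,)
    combined.append(torch.cat([word_vec, char_repr]))     # (9,)

combined = torch.stack(combined)  # (3, 9): input to the word-level LSTM
print(combined.shape)
```

The word-level LSTM that consumes `combined` would then take `word_embedding_dim + char_hidden_dim` as its input size.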

In [None]:
# Create char to index mapping
char_to_idx = {}
unique_chars = set()
MAX_WORD_LEN = 0

for sent in train_sentences:
    for word in sent:
        for c in word:
            unique_chars.add(c)
        if len(word) > MAX_WORD_LEN:
            MAX_WORD_LEN = len(word)

for c in unique_chars:
    char_to_idx[c] = len(char_to_idx)
char_to_idx[' '] = len(char_to_idx)

# New Hyperparameters
EMBEDDING_DIM = 6
HIDDEN_DIM = 3
LEARNING_RATE = 0.1
LSTM_LAYERS = 1
DROPOUT = 0
EPOCHS = 30
CHAR_EMBEDDING_DIM = 3
CHAR_HIDDEN_DIM = 3

In [None]:
# class CharPOSTagger(nn.Module):
#     def __init__(self, embedding_dim, hidden_dim, char_embedding_dim, 
#                  char_hidden_dim, char_size, vocab_size, tagset_size):
#         super(CharPOSTagger, self).__init__()
#         #############################################################################
#         # TODO: Define and initialize anything needed for the forward pass.
#         # You are required to create a model with:
#         # an embedding layer: that maps words to the embedding space
#         # a char-level LSTM: that finds the character-level embedding for a word
#         # an LSTM layer: that takes the combined embeddings as input and outputs hidden states
#         # a Linear layer: maps from hidden state space to tag space
#         #############################################################################
        
#         #############################################################################
#         #                             END OF YOUR CODE                              #
#         #############################################################################

#     def forward(self, sentence, chars):
#         tag_scores = None
#         #############################################################################
#         # TODO: Implement the forward pass.
#         # Given a tokenized index-mapped sentence and a character sequence as the arguments, 
#         # find the corresponding scores for tags
#         # returns:: tag_scores (Tensor)
#         #############################################################################
        
#         #############################################################################
#         #                             END OF YOUR CODE                              #
#         #############################################################################
#         return tag_scores

# def train_char(epoch, model, loss_function, optimizer):
#     train_loss = 0
#     train_examples = 0
#     for sentence, tags in training_data:
#         #############################################################################
#         # TODO: Implement the training loop
#         # Hint: you can use the prepare_sequence method for creating index mappings 
#         # for sentences as well as character sequences. Find the gradient with 
#         # respect to the loss and update the model parameters using the optimizer.
#         #############################################################################
        
#         #############################################################################
#         #                             END OF YOUR CODE                              #
#         #############################################################################
    
#     avg_train_loss = train_loss / train_examples
#     avg_val_loss, val_accuracy = evaluate_char(model, loss_function, optimizer)
        
#     print("Epoch: {}/{}\tAvg Train Loss: {:.4f}\tAvg Val Loss: {:.4f}\t Val Accuracy: {:.0f}".format(epoch, 
#                                                                       EPOCHS, 
#                                                                       avg_train_loss, 
#                                                                       avg_val_loss,
#                                                                       val_accuracy))

# def evaluate_char(model, loss_function, optimizer):
#     # returns:: avg_val_loss (float)
#     # returns:: val_accuracy (float)
#     val_loss = 0
#     correct = 0
#     val_examples = 0
#     with torch.no_grad():
#         for sentence, tags in val_data:
#             #############################################################################
#             # TODO: Implement the evaluate loop
#             # Find the average validation loss along with the validation accuracy.
#             # Hint: To find the accuracy, argmax of tag predictions can be used.
#             #############################################################################
            
#             #############################################################################
#             #                             END OF YOUR CODE                              #
#             #############################################################################
#     val_accuracy = 100. * correct / val_examples
#     avg_val_loss = val_loss / val_examples
#     return avg_val_loss, val_accuracy

In [None]:
#############################################################################
# TODO: Initialize the model, optimizer and the loss function
#############################################################################

#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
for epoch in range(1, EPOCHS + 1): 
    train_char(epoch, model, loss_function, optimizer)

**Sanity Check!** After 5 epochs you should get around 57% accuracy, after 10 epochs the accuracy should be around 67%, the accuracy rises to around 74% after 20 epochs, and to around 77% accuracy after 30 epochs.


### Test accuracy
Also evaluate your performance on the test data by submitting test_labels.txt and **report your test accuracy here**.

### Error analysis

In [None]:
# #############################################################################
# # TODO: Generate predictions from val data
# # Create lists of words, tags predicted by the model and ground truth tags.
# #############################################################################
# def generate_predictions_char(model, test_sentences):
#     # returns:: word_list (str list)
#     # returns:: model_tags (str list)
#     # returns:: gt_tags (str list)
#     # Your code here
#     return word_list, model_tags, gt_tags

# #############################################################################
# # TODO: Carry out error analysis
# # From those lists collected from the above method, find the 
# # top-10 tuples of (model_tag, ground_truth_tag, frequency, example words)
# # sorted by frequency
# #############################################################################
# def error_analysis_char(word_list, model_tags, gt_tags):
#     # returns: errors (list of tuples)
#     # Your code here
#     return errors


**Report your findings here.**  
What kinds of errors does the character-level model make as compared to the original model, and why do you think it made them? 

## Modifications

Now implement one of the following three modifications and report the model's performance.
- Change the number of LSTM layers
- Change the number of hidden dimensions
- Change the number of word embedding dimensions
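
In PyTorch, these three options correspond to the `num_layers` and `hidden_size` arguments of `nn.LSTM` and the `embedding_dim` argument of `nn.Embedding`. The sketch below uses made-up sizes, not recommended settings, and only shows how the tensor shapes change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 10  # toy vocabulary size
embedding_dim, hidden_dim, num_layers = 12, 8, 2  # illustrative values only

embedding = nn.Embedding(vocab_size, embedding_dim)
# dropout between stacked layers only has an effect when num_layers > 1
lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, dropout=0.0)

sentence = torch.tensor([1, 5, 2, 7], dtype=torch.long)
embeds = embedding(sentence).unsqueeze(1)  # (4, 1, 12)
lstm_out, (h_n, c_n) = lstm(embeds)

print(lstm_out.shape)  # top layer's hidden state at every position
print(h_n.shape)       # final hidden state of every layer
```

Whatever value is chosen for the hidden size, the Linear layer that maps to tag space has to use the same value as its input size.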

### Modification choice
Which modification did you use and why?

### Test accuracy
Also evaluate your performance on the test data by submitting test_labels.txt and **report your test accuracy here**.

### Error analysis
**Report your findings here.**  
Compare the top-10 errors made by this modified model with the errors made by the original (unmodified) model.
What errors does the original model make as compared to the modified model, and why do you think it made them? 

Feel free to reuse the methods defined above for this purpose.