# NLP Homework 4 Programming Assignment

In this assignment, we will train and evaluate a neural model to tag the parts of speech in a sentence.
We will also implement several improvements to the model to test its performance.

We will be using English text from the Wall Street Journal, marked with POS tags such as `NNP` (proper noun) and `DT` (determiner).

## Building a POS Tagger

### Setup

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
random.seed(1)

### Preparing Data
We collect the data in the following cell from the `train.txt` and `test.txt` files.  
For `train.txt`, we read the word and tag sequences for each sentence. We then create an 80-20 train-val split on this data for training and evaluation purpose.

Finally, we are interested in our accuracy on `test.txt`, so we prepare test data from this file.

In [3]:
def load_tag_data(tag_file):
    all_sentences = []
    all_tags = []
    sent = []
    tags = []
    with open(tag_file, 'r') as f:
        for line in f:
            if line.strip() == "":
                all_sentences.append(sent)
                all_tags.append(tags)
                sent = []
                tags = []
            else:
                word, tag, _ = line.strip().split()
                sent.append(word)
                tags.append(tag)
    return all_sentences, all_tags

def load_txt_data(txt_file):
    all_sentences = []
    sent = []
    with open(txt_file, 'r') as f:
        for line in f:
            if(line.strip() == ""):
                all_sentences.append(sent)
                sent = []
            else:
                word = line.strip()
                sent.append(word)
    return all_sentences

train_sentences, train_tags = load_tag_data('train.txt')
test_sentences = load_txt_data('test.txt')

unique_tags = set([tag for tag_seq in train_tags for tag in tag_seq])

# Create train-val split from train data
train_val_data = list(zip(train_sentences, train_tags))
random.shuffle(train_val_data)
split = int(0.8 * len(train_val_data))
training_data = train_val_data[:split]
val_data = train_val_data[split:]

print("Train Data: ", len(training_data))
print("Val Data: ", len(val_data))
print("Test Data: ", len(test_sentences))
print("Total tags: ", len(unique_tags))

Train Data:  7148
Val Data:  1788
Test Data:  2012
Total tags:  44


### Word-to-Index and Tag-to-Index mapping
In order to work with text in Tensor format, we need to map each word to an index.

In [4]:
word_to_idx = {}
for sent in train_sentences:
    for word in sent:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)

for sent in test_sentences:
    for word in sent:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)
            
tag_to_idx = {}
for tag in unique_tags:
    if tag not in tag_to_idx:
        tag_to_idx[tag] = len(tag_to_idx)

idx_to_tag = {}
for tag in tag_to_idx:
    idx_to_tag[tag_to_idx[tag]] = tag

print("Total tags", len(tag_to_idx))
print("Vocab size", len(word_to_idx))

Total tags 44
Vocab size 21589


In [15]:
def prepare_sequence(sent, idx_mapping):
    idxs = [idx_mapping[word] for word in sent]
    return torch.tensor(idxs, dtype=torch.long)

### Set up model
We will build and train a Basic POS Tagger which is an LSTM model to tag the parts of speech in a given sentence.


First we need to define some default hyperparameters.

In [6]:
EMBEDDING_DIM = 20
HIDDEN_DIM = 10
LEARNING_RATE = 0.1
LSTM_LAYERS = 1
DROPOUT = 0
EPOCHS = 30

### Define Model

The model takes as input a sentence as a tensor in the index space. This sentence is then converted to embedding space where each word maps to its word embedding. The word embeddings is learned as part of the model training process. 

These word embeddings act as input to the LSTM which produces a hidden state. This hidden state is then passed to a Linear layer that produces the probability distribution for the tags of every word. The model will output the tag with the highest probability for a given word.

In [7]:
class BasicPOSTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(BasicPOSTagger, self).__init__()
        #############################################################################
        # TODO: Define and initialize anything needed for the forward pass.
        # You are required to create a model with:
        # an embedding layer: that maps words to the embedding space
        # an LSTM layer: that takes word embeddings as input and outputs hidden states
        # a Linear layer: maps from hidden state space to tag space
        #############################################################################
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size=hidden_dim, num_layers=LSTM_LAYERS, dropout=DROPOUT)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################

    def forward(self, sentence):
        tag_scores = None
        #############################################################################
        # TODO: Implement the forward pass.
        # Given a tokenized index-mapped sentence as the argument, 
        # compute the corresponding scores for tags
        # returns:: tag_scores (Tensor)
        #############################################################################
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
        
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
        return tag_scores

### Training

We define train and evaluate procedures that allow us to train our model using our created train-val split.

In [7]:
def train(epoch, model, loss_function, optimizer):
    train_loss = 0
    train_examples = 0
    for sentence, tags in training_data:
        #############################################################################
        # TODO: Implement the training loop
        # Hint: you can use the prepare_sequence method for creating index mappings 
        # for sentences. Find the gradient with respect to the loss and update the
        # model parameters using the optimizer.
        #############################################################################
        
        # zero the gradient
        model.zero_grad()
        # Prepare sentence into indexs
        sentence_in = prepare_sequence(sentence, word_to_idx)
        # Prepare tag into indexs
        targets = prepare_sequence(tags, tag_to_idx)
        # predictions for the tags of sentence
        tag_scores = model(sentence_in)
        
        loss = loss_function(tag_scores, targets)
        loss.backward()        
        optimizer.step()
        
        train_loss += loss.cpu().detach().numpy()
        train_examples += len(targets.cpu().detach().numpy())
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
    
    avg_train_loss = train_loss / train_examples
    avg_val_loss, val_accuracy = evaluate(model, loss_function, optimizer)
        
    print("Epoch: {}/{}\tAvg Train Loss: {:.4f}\tAvg Val Loss: {:.4f}\t Val Accuracy: {:.0f}".format(epoch, 
                                                                      EPOCHS, 
                                                                      avg_train_loss, 
                                                                      avg_val_loss,
                                                                      val_accuracy))

def evaluate(model, loss_function, optimizer):
  # returns:: avg_val_loss (float)
  # returns:: val_accuracy (float)
    val_loss = 0
    correct = 0
    val_examples = 0
    with torch.no_grad():
        for sentence, tags in val_data:
            #############################################################################
            # TODO: Implement the evaluate loop
            # Find the average validation loss along with the validation accuracy.
            # Hint: To find the accuracy, argmax of tag predictions can be used.
            #############################################################################
            # Prepare sentence into indexs
            sentence_in = prepare_sequence(sentence, word_to_idx)
            # Prepare tag into indexs
            targets = prepare_sequence(tags, tag_to_idx)
            # predictions for the tags of sentence
            tag_scores = model(sentence_in)
            # get the prediction results
            _, preds = torch.max(tag_scores, 1)
            loss = loss_function(tag_scores, targets)
            
            val_loss += loss.cpu().detach().numpy()
            correct += (torch.sum(preds == torch.LongTensor(targets)).cpu().detach().numpy())
            val_examples += len(targets.cpu().detach().numpy())
            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
    val_accuracy = 100. * correct / val_examples
    avg_val_loss = val_loss / val_examples
    return avg_val_loss, val_accuracy


In [1]:
#############################################################################
# TODO: Initialize the model, optimizer and the loss function
#############################################################################
model = BasicPOSTagger(embedding_dim=EMBEDDING_DIM, 
                       hidden_dim=HIDDEN_DIM, 
                       vocab_size = len(word_to_idx), 
                       tagset_size = len(tag_to_idx))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
import time

for epoch in range(1, EPOCHS + 1): 
    start = time.time()
    train(epoch, model, loss_function, optimizer)
    print(f"Time used for Epoch{epoch}: ",time.time() - start)

NameError: name 'BasicPOSTagger' is not defined

You should get a performance of **at least 80%** on the validation set for the BasicPOSTagger.

Let us now write a method to save our predictions for the test set.

In [None]:
def test():
    predicted_tags = []
    with torch.no_grad():
        for sentence in test_sentences:
            #############################################################################
            # TODO: Implement the test loop
            # This method saves the predicted tags for the sentences in the test set.
            # The tags are first added to a list which is then written to a file for
            # submission. An empty string is added after every sequence of tags
            # corresponding to a sentence to add a newline following file formatting
            # convention, as has been done already.
            #############################################################################
            sentence_in = prepare_sequence(sentence, word_to_idx)
            tag_scores = model(sentence_in)
            
            _, preds = torch.max(tag_scores, 1)

            for pred in preds.tolist():
                predicted_tags.append(idx_to_tag[pred])
            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
            predicted_tags.append("")

    with open('test_labels.txt', 'w+') as f:
        for item in predicted_tags:
            f.write("%s\n" % item)
    

In [None]:
test()


### Test accuracy
Evaluate your performance on the test data by submitting test_labels.txt generated by the method above and **report your test accuracy here**.

Imitate the above method to generate prediction for validation data.
Create lists of words, tags predicted by the model and ground truth tags. 

Use these lists to carry out error analysis to find the top-10 types of errors made by the model.

In [0]:
#############################################################################
# TODO: Generate predictions from val data
# Create lists of words, tags predicted by the model and ground truth tags.
#############################################################################
def generate_predictions(model, test_sentences):
    # returns:: word_list (str list)
    # returns:: model_tags (str list)
    # returns:: gt_tags (str list)
    # Your code here
    return word_list, model_tags, gt_tags

#############################################################################
# TODO: Carry out error analysis
# From those lists collected from the above method, find the 
# top-10 tuples of (model_tag, ground_truth_tag, frequency, example words)
# sorted by frequency
#############################################################################
def error_analysis(word_list, model_tags, gt_tags):
    # returns: errors (list of tuples)
    # Your code here
    return errors

### Error analysis
**Report your findings here.**  
What kinds of errors did the model make and why do you think it made them?

## Define a Character Level POS Tagger

We can use the character-level information present to augment our word embeddings. Words that end with -ing or -ly give quite a bit of information about their POS tags. To incorporate this information, we can run a character level LSTM on every word (treated as a tensor of characters, each mapped to character-index space) to create a character-level representation of the word. This representation can be concatenated with the word embedding (as in the BasicPOSTagger) to create a new word embedding that captures more information.

In [6]:
# Create char to index mapping
char_to_idx = {}
unique_chars = set()
MAX_WORD_LEN = 0

for sent in train_sentences:
    for word in sent:
        for c in word:
            unique_chars.add(c)
        if len(word) > MAX_WORD_LEN:
            MAX_WORD_LEN = len(word)

for c in unique_chars:
    char_to_idx[c] = len(char_to_idx)
char_to_idx[' '] = len(char_to_idx)

# New Hyperparameters
EMBEDDING_DIM = 6
HIDDEN_DIM = 3
LEARNING_RATE = 0.1
LSTM_LAYERS = 1
DROPOUT = 0
EPOCHS = 30
CHAR_EMBEDDING_DIM = 3
CHAR_HIDDEN_DIM = 3

In [12]:
class CharPOSTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, char_embedding_dim, 
                 char_hidden_dim, char_size, vocab_size, tagset_size):
        super(CharPOSTagger, self).__init__()
        #############################################################################
        # TODO: Define and initialize anything needed for the forward pass.
        # You are required to create a model with:
        # an embedding layer: that maps words to the embedding space
        # an char level LSTM: that finds the character level embedding for a word
        # an LSTM layer: that takes the combined embeddings as input and outputs hidden states
        # a Linear layer: maps from hidden state space to tag space
        #############################################################################
        # word embedding
        self.hidden_dim = hidden_dim
        self.word_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm_word = nn.LSTM(embedding_dim, self.hidden_dim)
        
        # char embedding
        self.char_hidden_dim = char_hidden_dim
        self.char_embedding = nn.Embedding(char_size, char_embedding_dim)
        self.lstm_char = nn.LSTM(char_embedding_dim, self.char_hidden_dim)
        
        # combine the word / character
        self.overall_hidden_dim = hidden_dim + MAX_WORD_LEN * char_hidden_dim
        
        self.hidden2tag = nn.Linear(self.overall_hidden_dim, tagset_size)
        self.hidden = self.init_hidden()
        self.char_hidden = self.init_hidden(isChar=True)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################

    def init_hidden(self, isChar=False):
        # Before we've done anything, we dont have any hidden state.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        if isChar:
            return (torch.autograd.Variable(torch.zeros(1, 1, self.char_hidden_dim)),
                torch.autograd.Variable(torch.zeros(1, 1, self.char_hidden_dim)))
        else:
            return (torch.autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                torch.autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))        

    def forward(self, sentence, chars):
        tag_scores = None
        #############################################################################
        # TODO: Implement the forward pass.
        # Given a tokenized index-mapped sentence and a character sequence as the arguments, 
        # find the corresponding scores for tags
        # returns:: tag_scores (Tensor)
        #############################################################################
        embeds = self.word_embedding(sentence)
        lstm_out, self.hidden = self.lstm_word(embeds.view(len(sentence), 1, -1), self.hidden)
        
        embeds_char = self.char_embedding(chars)
        char_lstm_out, self.char_hidden = self.lstm_char(embeds_char.view(len(chars), 1, -1), self.char_hidden)
        
        # Remember!!!!!!! You Should re-organized the characters into sentence!!!!!!!!!!
        merge_out = torch.cat((lstm_out.view(len(sentence), -1), char_lstm_out.view(len(sentence), -1)), 1)
        
        tag_space = self.hidden2tag(merge_out)
        tag_scores = F.log_softmax(tag_space, dim=1)
        
        tag_space = self.hidden2tag(merge_out)
        tag_scores = F.log_softmax(tag_scores, dim=1)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
        return tag_scores

def train_char(epoch, model, loss_function, optimizer):
    train_loss = 0
    train_examples = 0
    for sentence, tags in training_data:
        #############################################################################
        # TODO: Implement the training loop
        # Hint: you can use the prepare_sequence method for creating index mappings 
        # for sentences as well as character sequences. Find the gradient with 
        # respect to the loss and update the model parameters using the optimizer.
        #############################################################################
        model.zero_grad()
        
        # initiate the hidden state
        model.hidden = model.init_hidden()
        model.char_hidden = model.init_hidden(isChar=True)
        
        # Get input for the model
        sentence_in = prepare_sequence(sentence, word_to_idx)
        sentence_chars = []
        for w in sentence:
            spaces = ' ' * (MAX_WORD_LEN - len(w))
            sentence_chars.extend(list(spaces + w))
        char_in = prepare_sequence(sentence_chars, char_to_idx)
        targets = prepare_sequence(tags, tag_to_idx)
        
        tag_scores = model(sentence_in, char_in)
        
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
        
        train_loss += loss
        train_examples += len(targets)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
    
    avg_train_loss = train_loss / train_examples
    avg_val_loss, val_accuracy = evaluate_char(model, loss_function, optimizer)
        
    print("Epoch: {}/{}\tAvg Train Loss: {:.4f}\tAvg Val Loss: {:.4f}\t Val Accuracy: {:.0f}".format(epoch, 
                                                                      EPOCHS, 
                                                                      avg_train_loss, 
                                                                      avg_val_loss,
                                                                      val_accuracy))

def evaluate_char(model, loss_function, optimizer):
    # returns:: avg_val_loss (float)
    # returns:: val_accuracy (float)
    val_loss = 0
    correct = 0
    val_examples = 0
    with torch.no_grad():
        for sentence, tags in val_data:
            #############################################################################
            # TODO: Implement the evaluate loop
            # Find the average validation loss along with the validation accuracy.
            # Hint: To find the accuracy, argmax of tag predictions can be used.
            #############################################################################
            model.zero_grad()

            # initiate the hidden state
            model.hidden = model.init_hidden()
            model.char_hidden = model.init_hidden(isChar=True)

            # Get input for the model
            sentence_in = prepare_sequence(sentence, word_to_idx)
            sentence_chars = []
            for w in sentence:
                spaces = ' ' * (MAX_WORD_LEN - len(w))
                sentence_chars.extend(list(spaces + w))
            char_in = prepare_sequence(sentence_chars, char_to_idx)
            targets = prepare_sequence(tags, tag_to_idx)

            tag_scores = model(sentence_in, char_in)
            _, preds = torch.max(tag_scores, 1)
            
            loss = loss_function(tag_scores, targets)
            loss.backward()
            optimizer.step()        
            
            val_loss += loss
            correct += torch.sum(preds == torch.LongTensor(targets))
            val_examples += len(targets)
            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
    val_accuracy = 100. * correct / val_examples
    avg_val_loss = val_loss / val_examples
    return avg_val_loss, val_accuracy

In [None]:
#############################################################################
# TODO: Initialize the model, optimizer and the loss function
#############################################################################
model = CharPOSTagger(embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM,
                       char_hidden_dim = CHAR_HIDDEN_DIM, char_embedding_dim = CHAR_EMBEDDING_DIM,
                       char_size = len(char_to_idx), vocab_size = len(word_to_idx), tagset_size = len(tag_to_idx))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
for epoch in range(1, EPOCHS + 1):
    import time
    train_char(epoch, model, loss_function, optimizer)
    print(f"Time used for Epoch{epoch}: ",time.time() - start)

Tune your hyperparameters, to get a performance of **at least 85%** on the validation set for the CharPOSTagger.

### Test accuracy
Also evaluate your performance on the test data by submitting test_labels.txt and **report your test accuracy here**.

### Error analysis

In [0]:
#############################################################################
# TODO: Generate predictions from val data
# Create lists of words, tags predicted by the model and ground truth tags.
#############################################################################
def generate_predictions_char(model, test_sentences):
    # returns:: word_list (str list)
    # returns:: model_tags (str list)
    # returns:: gt_tags (str list)
    # Your code here
    return word_list, model_tags, gt_tags

#############################################################################
# TODO: Carry out error analysis
# From those lists collected from the above method, find the 
# top-10 tuples of (model_tag, ground_truth_tag, frequency, example words)
# sorted by frequency
#############################################################################
def error_analysis_char(word_list, model_tags, gt_tags):
    # returns: errors (list of tuples)
    # Your code here
    return errors


**Report your findings here.**  
What kinds of errors does the character-level model make as compared to the original model, and why do you think it made them? 

## Define a BiLSTM POS Tagger

A bidirectional LSTM that runs both left-to-right and right-to-left to represent dependencies between adjacent words in both directions and thus captures dependencies in both directions. 

In this part, you make your model bidirectional. 

In addition, you should implement one of these modifications to improve the model's performance:
- Tune the model hyperparameters. Try at least 5 different combinations of parameters. For example:
    - number of LSTM layers
    - number of hidden dimensions
    - number of word embedding dimensions
    - dropout rate
    - learning rate
- Switch to pre-trained Word Embeddings instead of training them from scratch. Try at least one different embedding method. For example:
    - [Glove](https://nlp.stanford.edu/projects/glove/)
    - [Fast Text](https://fasttext.cc/docs/en/english-vectors.html)
- Implement a different model architecture. Try at least one different architecture. For example:
    - adding a conditional random field on top of the LSTM
    - adding Viterbi decoding to the model

In [41]:
import torchtext as text

# Set 1: 92% acc
# EMBEDDING_DIM = 200
# HIDDEN_DIM = 32
# LEARNING_RATE = 0.01
# BIDIRECTIONAL = True
# LSTM_LAYERS = 2
# DROPOUT = 0.1
# EPOCHS = 50

# Set 2: 91% acc
# EMBEDDING_DIM = 200
# HIDDEN_DIM = 16
# LEARNING_RATE = 0.01
# BIDIRECTIONAL = True
# LSTM_LAYERS = 2
# DROPOUT = 0.1
# EPOCHS = 50

# Set 3: 86% acc
EMBEDDING_DIM = 200
HIDDEN_DIM = 16
LEARNING_RATE = 0.01
BIDIRECTIONAL = True
LSTM_LAYERS = 4
DROPOUT = 0.2
EPOCHS = 50


In [42]:
class BiLSTMPOSTagger(nn.Module):
    # NOTE: you may have to modify these function headers to include your 
    # modification, e.g. adding a parameter for embeddings data

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(BiLSTMPOSTagger, self).__init__()
        #############################################################################
        # TODO: Define and initialize anything needed for the forward pass.
        # You are required to create a model with:
        # an embedding layer: that maps words to the embedding space
        # a BiLSTM layer: that takes word embeddings as input and outputs hidden states
        # a Linear layer: maps from hidden state space to tag space
        #############################################################################
        self.vec = text.vocab.GloVe(name='6B', dim=embedding_dim)
        self.dropout = nn.Dropout(DROPOUT)
        self.lstm = nn.LSTM(embedding_dim, hidden_size=hidden_dim, 
                        num_layers=LSTM_LAYERS, 
                        dropout = DROPOUT if LSTM_LAYERS > 1 else 0,bidirectional=BIDIRECTIONAL)
        self.hidden2tag = nn.Linear(hidden_dim * 2 if BIDIRECTIONAL else hidden_dim, tagset_size)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################

    def forward(self, sentence):
        tag_scores = None
        #############################################################################
        # TODO: Implement the forward pass.
        # Given a tokenized index-mapped sentence as the argument, 
        # find the corresponding scores for tags
        # returns:: tag_scores (Tensor)
        #############################################################################
        embeds = self.vec.get_vecs_by_tokens(sentence, lower_case_backup=True)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
        return tag_scores

In [43]:
def train(epoch, model, loss_function, optimizer):
    train_loss = 0
    train_examples = 0
    for sentence, tags in training_data:
        #############################################################################
        # TODO: Implement the training loop
        # Hint: you can use the prepare_sequence method for creating index mappings 
        # for sentences. Find the gradient with respect to the loss and update the
        # model parameters using the optimizer.
        #############################################################################
        
        # zero the gradient
        model.zero_grad()
        # Prepare sentence into indexs
#         sentence_in = prepare_sequence(sentence, word_to_idx)
        sentence4embed = sentence
        # Prepare tag into indexs
        targets = prepare_sequence(tags, tag_to_idx)
        # predictions for the tags of sentence
        tag_scores = model(sentence4embed)
        
        loss = loss_function(tag_scores, targets)
        loss.backward()        
        optimizer.step()
        
        train_loss += loss.cpu().detach().numpy()
        train_examples += len(targets.cpu().detach().numpy())
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
    
    avg_train_loss = train_loss / train_examples
    avg_val_loss, val_accuracy = evaluate(model, loss_function, optimizer)
        
    print("Epoch: {}/{}\tAvg Train Loss: {:.4f}\tAvg Val Loss: {:.4f}\t Val Accuracy: {:.0f}".format(epoch, 
                                                                      EPOCHS, 
                                                                      avg_train_loss, 
                                                                      avg_val_loss,
                                                                      val_accuracy))

def evaluate(model, loss_function, optimizer):
  # returns:: avg_val_loss (float)
  # returns:: val_accuracy (float)
    val_loss = 0
    correct = 0
    val_examples = 0
    with torch.no_grad():
        for sentence, tags in val_data:
            #############################################################################
            # TODO: Implement the evaluate loop
            # Find the average validation loss along with the validation accuracy.
            # Hint: To find the accuracy, argmax of tag predictions can be used.
            #############################################################################
            # Prepare sentence into indexs
            sentence4embed = sentence
            # Prepare tag into indexs
            targets = prepare_sequence(tags, tag_to_idx)
            # predictions for the tags of sentence
            tag_scores = model(sentence4embed)
            # get the prediction results
            _, preds = torch.max(tag_scores, 1)
            loss = loss_function(tag_scores, targets)
            
            val_loss += loss.cpu().detach().numpy()
            correct += (torch.sum(preds == torch.LongTensor(targets)).cpu().detach().numpy())
            val_examples += len(targets.cpu().detach().numpy())
            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
    val_accuracy = 100. * correct / val_examples
    avg_val_loss = val_loss / val_examples
    return avg_val_loss, val_accuracy


In [44]:
#############################################################################
# TODO: Initialize the model, optimizer and the loss function
#############################################################################
model = BiLSTMPOSTagger(embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM,
                       vocab_size = len(word_to_idx), tagset_size = len(tag_to_idx))
loss_function = nn.CrossEntropyLoss()
# optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
import time

for epoch in range(1, EPOCHS + 1): 
    start = time.time()
    train(epoch, model, loss_function, optimizer)
    print(f"Time used for Epoch{epoch}: ",time.time() - start)

Epoch: 1/50	Avg Train Loss: 0.0509	Avg Val Loss: 0.0349	 Val Accuracy: 75
Time used for Epoch1:  245.82527446746826
Epoch: 2/50	Avg Train Loss: 0.0300	Avg Val Loss: 0.0289	 Val Accuracy: 80
Time used for Epoch2:  232.62984371185303
Epoch: 3/50	Avg Train Loss: 0.0260	Avg Val Loss: 0.0256	 Val Accuracy: 82
Time used for Epoch3:  238.1525263786316
Epoch: 4/50	Avg Train Loss: 0.0238	Avg Val Loss: 0.0235	 Val Accuracy: 84
Time used for Epoch4:  225.94699621200562
Epoch: 5/50	Avg Train Loss: 0.0227	Avg Val Loss: 0.0234	 Val Accuracy: 84
Time used for Epoch5:  204.00997281074524
Epoch: 6/50	Avg Train Loss: 0.0218	Avg Val Loss: 0.0231	 Val Accuracy: 84
Time used for Epoch6:  221.0171856880188
Epoch: 7/50	Avg Train Loss: 0.0210	Avg Val Loss: 0.0222	 Val Accuracy: 85
Time used for Epoch7:  224.36360263824463
Epoch: 8/50	Avg Train Loss: 0.0208	Avg Val Loss: 0.0220	 Val Accuracy: 85
Time used for Epoch8:  226.28252744674683
Epoch: 9/50	Avg Train Loss: 0.0202	Avg Val Loss: 0.0212	 Val Accuracy: 86


KeyboardInterrupt: 

Your modified model should get a performance of **at least 90%** on the validation set.

### Test accuracy
Also evaluate your performance on the test data by submitting test_labels.txt and **report your test accuracy here**.



### Error analysis
**Report your findings here.**  
Compare the top-10 errors made by this modified model with the errors made by the model from part (a). 
If you tried multiple hyperparameter combinations, choose the model with the highest validation data accuracy.
What errors does the original model make as compared to the modified model, and why do you think it made them? 

Feel free to reuse the methods defined above for this purpose.