# NLP Homework 4 Programming Assignment

In this assignment, we will train and evaluate a neural model to tag the parts of speech in a sentence.
We will also implement several improvements to the model and test their effect on its performance.

We will be using English text from the Wall Street Journal, marked with POS tags such as `NNP` (proper noun) and `DT` (determiner).

## Building a POS Tagger

### Setup

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
random.seed(1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Preparing Data
We collect the data in the following cell from the `train.txt` and `test.txt` files.  
For `train.txt`, we read the word and tag sequences for each sentence. We then create an 80-20 train-val split on this data for training and evaluation.

Finally, we are interested in our accuracy on `test.txt`, so we prepare test data from this file.

In [2]:
def load_tag_data(tag_file):
    all_sentences = []
    all_tags = []
    sent = []
    tags = []
    with open(tag_file, 'r') as f:
        for line in f:
            if line.strip() == "":
                all_sentences.append(sent)
                all_tags.append(tags)
                sent = []
                tags = []
            else:
                word, tag, _ = line.strip().split()
                sent.append(word)
                tags.append(tag)
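    # Keep the last sentence if the file does not end with a blank line
    if sent:
        all_sentences.append(sent)
        all_tags.append(tags)
    return all_sentences, all_tags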

def load_txt_data(txt_file):
    all_sentences = []
    sent = []
    with open(txt_file, 'r') as f:
        for line in f:
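            if line.strip() == "":
                all_sentences.append(sent)
                sent = []
            else:
                word = line.strip()
                sent.append(word)
    # Keep the last sentence if the file does not end with a blank line
    if sent:
        all_sentences.append(sent)
    return all_sentences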

train_sentences, train_tags = load_tag_data('train.txt')
test_sentences = load_txt_data('test.txt')

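# Sort the tags so that tag indices are reproducible across runs;
# the iteration order of a set of strings can change between Python sessions
unique_tags = sorted({tag for tag_seq in train_tags for tag in tag_seq})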

# Create train-val split from train data
train_val_data = list(zip(train_sentences, train_tags))
random.shuffle(train_val_data)
split = int(0.8 * len(train_val_data))
training_data = train_val_data[:split]
val_data = train_val_data[split:]

print("Train Data: ", len(training_data))
print("Val Data: ", len(val_data))
print("Test Data: ", len(test_sentences))
print("Total tags: ", len(unique_tags))

Train Data:  7148
Val Data:  1788
Test Data:  2012
Total tags:  44


### Word-to-Index and Tag-to-Index mapping
In order to work with text in Tensor format, we need to map each word to an index.

In [3]:
word_to_idx = {}
for sent in train_sentences:
    for word in sent:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)

for sent in test_sentences:
    for word in sent:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)
            
tag_to_idx = {}
for tag in unique_tags:
    if tag not in tag_to_idx:
        tag_to_idx[tag] = len(tag_to_idx)

idx_to_tag = {}
for tag in tag_to_idx:
    idx_to_tag[tag_to_idx[tag]] = tag

print("Total tags", len(tag_to_idx))
print("Vocab size", len(word_to_idx))

Total tags 44
Vocab size 21589


In [4]:
def prepare_sequence(sent, idx_mapping):
    idxs = [idx_mapping[word] for word in sent]
    return torch.tensor(idxs, dtype=torch.long)
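
As a quick illustration, the helper can be exercised on a toy mapping. The vocabulary `toy_word_to_idx` and the sentence below are made up for this sketch and are not part of the assignment data.

```python
import torch

def prepare_sequence(sent, idx_mapping):
    # Same helper as above: look up each token and pack the indices into a LongTensor
    idxs = [idx_mapping[word] for word in sent]
    return torch.tensor(idxs, dtype=torch.long)

# Hypothetical toy vocabulary, built the same way as word_to_idx
toy_word_to_idx = {}
for word in ["The", "dog", "saw", "the", "cat"]:
    if word not in toy_word_to_idx:
        toy_word_to_idx[word] = len(toy_word_to_idx)

# The lookup is case-sensitive: "The" and "the" get different indices
toy_tensor = prepare_sequence(["the", "cat", "saw", "The", "dog"], toy_word_to_idx)
print(toy_tensor.tolist())
```

A token that is missing from the mapping raises a `KeyError`, which is why the cell above also indexes the words in `test.txt`.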

### Set up model
We will build and train a basic POS tagger: an LSTM model that tags the parts of speech in a given sentence.


First we need to define some default hyperparameters.

In [9]:
EMBEDDING_DIM = 15
HIDDEN_DIM = 6
LEARNING_RATE = 0.1
LSTM_LAYERS = 1
DROPOUT = 0
EPOCHS = 12

### Define Model

The model takes as input a sentence as a tensor in the index space. This sentence is then converted to embedding space, where each word maps to its word embedding. The word embeddings are learned as part of the model training process.

These word embeddings act as input to the LSTM which produces a hidden state. This hidden state is then passed to a Linear layer that produces the probability distribution for the tags of every word. The model will output the tag with the highest probability for a given word.
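
The data flow described above can be traced shape by shape on a toy example. This is a minimal sketch with arbitrary small sizes, not the model used for the reported results. The layer names are standalone stand-ins chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes chosen only for illustration
vocab_size, embedding_dim, hidden_dim, tagset_size = 10, 5, 4, 3

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([1, 4, 7, 2], dtype=torch.long)    # 4 word indices

embeds = embedding(sentence)                                # (4, embedding_dim)
# nn.LSTM expects (seq_len, batch, input_size) by default; the batch size is 1 here
lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))       # (4, 1, hidden_dim)
tag_space = hidden2tag(lstm_out.view(len(sentence), -1))    # (4, tagset_size)
tag_scores = F.log_softmax(tag_space, dim=1)                # log-probabilities per word
predicted = tag_scores.argmax(dim=1)                        # most likely tag index per word

print(embeds.shape, lstm_out.shape, tag_scores.shape, predicted.shape)
```

Exponentiating each row of `tag_scores` gives a probability distribution over the tags for that word.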

In [2]:
class BasicPOSTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(BasicPOSTagger, self).__init__()
        #############################################################################
        # TODO: Define and initialize anything needed for the forward pass.
        # You are required to create a model with:
        # an embedding layer: that maps words to the embedding space
        # an LSTM layer: that takes word embeddings as input and outputs hidden states
        # a Linear layer: maps from hidden state space to tag space
        #############################################################################
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size=hidden_dim, num_layers=LSTM_LAYERS, dropout=DROPOUT)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################

    def forward(self, sentence):
        tag_scores = None
        #############################################################################
        # TODO: Implement the forward pass.
        # Given a tokenized index-mapped sentence as the argument, 
        # compute the corresponding scores for tags
        # returns:: tag_scores (Tensor)
        #############################################################################
        embeds = self.word_embeddings(sentence)
        # nn.LSTM expects input of shape (seq_len, batch, input_size); batch is 1 here
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
        return tag_scores


### Training

We define train and evaluate procedures that allow us to train our model using our created train-val split.

In [22]:
def train(epoch, model, loss_function, optimizer):
    train_loss = 0
    train_examples = 0
    for sentence, tags in training_data:
        #############################################################################
        # TODO: Implement the training loop
        # Hint: you can use the prepare_sequence method for creating index mappings 
        # for sentences. Find the gradient with respect to the loss and update the
        # model parameters using the optimizer.
        #############################################################################
        
        # Reset the gradients accumulated from the previous sentence
        model.zero_grad()
        # Map the words to indices
        sentence_in = prepare_sequence(sentence, word_to_idx)
        # Map the tags to indices
        targets = prepare_sequence(tags, tag_to_idx)
        # Log-probabilities over tags for every word in the sentence
        tag_scores = model(sentence_in)
        
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()
        train_examples += len(targets)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
    
    avg_train_loss = train_loss / train_examples
    avg_val_loss, val_accuracy = evaluate(model, loss_function, optimizer)
        
    print("Epoch: {}/{}\tAvg Train Loss: {:.4f}\tAvg Val Loss: {:.4f}\t Val Accuracy: {:.0f}".format(epoch, 
                                                                      EPOCHS, 
                                                                      avg_train_loss, 
                                                                      avg_val_loss,
                                                                      val_accuracy))

def evaluate(model, loss_function, optimizer):
    # returns:: avg_val_loss (float)
    # returns:: val_accuracy (float)
    val_loss = 0
    correct = 0
    val_examples = 0
    with torch.no_grad():
        for sentence, tags in val_data:
            #############################################################################
            # TODO: Implement the evaluate loop
            # Find the average validation loss along with the validation accuracy.
            # Hint: To find the accuracy, argmax of tag predictions can be used.
            #############################################################################
            # Map the words to indices
            sentence_in = prepare_sequence(sentence, word_to_idx)
            # Map the tags to indices
            targets = prepare_sequence(tags, tag_to_idx)
            # Log-probabilities over tags for every word in the sentence
            tag_scores = model(sentence_in)
            # The predicted tag is the argmax over the tag dimension
            preds = tag_scores.argmax(dim=1)
            loss = loss_function(tag_scores, targets)
            
            val_loss += loss.item()
            correct += (preds == targets).sum().item()
            val_examples += len(targets)
            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
    val_accuracy = 100. * correct / val_examples
    avg_val_loss = val_loss / val_examples
    return avg_val_loss, val_accuracy
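
To make the loss and accuracy bookkeeping concrete, here is a small worked example on hand-written values. The probabilities and targets are made up for illustration. Only `nn.NLLLoss` and `argmax` are taken from the procedures above.

```python
import torch
import torch.nn as nn

# Hand-written probabilities for 3 tokens over 2 tags (toy values)
probs = torch.tensor([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.6, 0.4]])
tag_scores = probs.log()                   # the model outputs log-probabilities
targets = torch.tensor([0, 1, 1])

# NLLLoss averages -log p(correct tag) over the tokens by default
loss = nn.NLLLoss()(tag_scores, targets)
preds = tag_scores.argmax(dim=1)           # predicted tag index per token
correct = (preds == targets).sum().item()  # number of tokens tagged correctly
accuracy = 100.0 * correct / len(targets)

print(loss.item(), preds.tolist(), accuracy)
```

Because `nn.NLLLoss` already averages over the tokens of a sentence by default, each iteration of the loops above adds one per-sentence mean to the running loss.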


In [23]:
#############################################################################
# TODO: Initialize the model, optimizer and the loss function
#############################################################################
model = BasicPOSTagger(embedding_dim=EMBEDDING_DIM, 
                       hidden_dim=HIDDEN_DIM, 
                       vocab_size = len(word_to_idx), 
                       tagset_size = len(tag_to_idx))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
import time

for epoch in range(1, EPOCHS + 1): 
    start = time.time()
    train(epoch, model, loss_function, optimizer)
    print(f"Time used for Epoch{epoch}: ",time.time() - start)
torch.save(model, "basic_pos_tagger.pth")

Epoch: 1/30	Avg Train Loss: 0.0787	Avg Val Loss: 0.0627	 Val Accuracy: 58
Time used for Epoch1:  29.361035585403442
Epoch: 2/30	Avg Train Loss: 0.0560	Avg Val Loss: 0.0534	 Val Accuracy: 64
Time used for Epoch2:  27.0801739692688
Epoch: 3/30	Avg Train Loss: 0.0491	Avg Val Loss: 0.0486	 Val Accuracy: 67
Time used for Epoch3:  27.373235940933228
Epoch: 4/30	Avg Train Loss: 0.0449	Avg Val Loss: 0.0452	 Val Accuracy: 70
Time used for Epoch4:  37.17160701751709
Epoch: 5/30	Avg Train Loss: 0.0416	Avg Val Loss: 0.0424	 Val Accuracy: 72
Time used for Epoch5:  69.100022315979
Epoch: 6/30	Avg Train Loss: 0.0389	Avg Val Loss: 0.0401	 Val Accuracy: 74
Time used for Epoch6:  28.22225332260132
Epoch: 7/30	Avg Train Loss: 0.0365	Avg Val Loss: 0.0382	 Val Accuracy: 75
Time used for Epoch7:  25.881731271743774
Epoch: 8/30	Avg Train Loss: 0.0344	Avg Val Loss: 0.0365	 Val Accuracy: 77
Time used for Epoch8:  35.12621259689331
Epoch: 9/30	Avg Train Loss: 0.0326	Avg Val Loss: 0.0350	 Val Accuracy: 78

You should get a performance of **at least 80%** on the validation set for the BasicPOSTagger.

Let us now write a method to save our predictions for the test set.

In [25]:
def test():
    predicted_tags = []
    with torch.no_grad():
        for sentence in test_sentences:
            #############################################################################
            # TODO: Implement the test loop
            # This method saves the predicted tags for the sentences in the test set.
            # The tags are first added to a list which is then written to a file for
            # submission. An empty string is added after every sequence of tags
            # corresponding to a sentence to add a newline following file formatting
            # convention, as has been done already.
            #############################################################################
            sentence_in = prepare_sequence(sentence, word_to_idx)
            tag_scores = model(sentence_in)
            preds = tag_scores.argmax(axis=1)
            predicted_tags.extend([idx_to_tag[preds[i].item()] for i in range(len(preds))])
            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
            predicted_tags.append("")

    with open('test_labels.txt', 'w+') as f:
        for item in predicted_tags:
            f.write("%s\n" % item)
    

In [26]:
test()


### Test accuracy
Evaluate your performance on the test data by submitting test_labels.txt generated by the method above and **report your test accuracy here**.

Imitate the above method to generate predictions for the validation data.
Create lists of the words, the tags predicted by the model, and the ground-truth tags.

Use these lists to carry out error analysis to find the top-10 types of errors made by the model.

In [27]:
#############################################################################
# TODO: Generate predictions from val data
# Create lists of words, tags predicted by the model and ground truth tags.
#############################################################################
def generate_predictions(model, data):
    # data is a list of (sentence, tags) pairs, e.g. val_data
    word_list = []
    model_tags = []
    gt_tags = []
    # Gradients are not needed for inference
    with torch.no_grad():
        for sentence, tags in data:
            sentence_in = prepare_sequence(sentence, word_to_idx)
            tag_scores = model(sentence_in)
            preds = tag_scores.argmax(dim=1)
            model_tags.extend(idx_to_tag[i.item()] for i in preds)

            word_list.extend(sentence)
            gt_tags.extend(tags)
    return word_list, model_tags, gt_tags

#############################################################################
# TODO: Carry out error analysis
# From those lists collected from the above method, find the 
# top-10 tuples of (model_tag, ground_truth_tag, frequency, example words)
# sorted by frequency
#############################################################################
def error_analysis(word_list, model_tags, gt_tags):
    from collections import Counter, defaultdict
    
    wl = []
    mt = []
    gt = []
    words = defaultdict(list)
    
    for w, m, g in zip(word_list, model_tags, gt_tags):
        if m != g:
            wl.append(w)
            mt.append(m)
            gt.append(g)
            words[(m, g)].append(w)
            
    c = Counter(zip(mt, gt))
    top10 = c.most_common(10)
    
    errors = []
    for (mtag, gt_tag), count in top10:
        errors.append((mtag, gt_tag, count / sum(c.values()), words[(mtag, gt_tag)][:5]))
    return errors

errors = error_analysis(*generate_predictions(model, val_data))

for err in errors:
    print(err)

('NN', 'NNP', 0.056300774607972795, ['Whitten', 'Arthur', 'Education', 'Herman', 'Hoyt'])
('JJ', 'NN', 0.0477989797846212, ['verge', 'worth', 'span', 'lie', 'chief'])
('NN', 'JJ', 0.04402040430757605, ['literary', 'same', 'fastest-growing', 'Western', 'peaceful'])
('NNP', 'NN', 0.03117324768562252, ['overhead', 'province', 'provision', 'democratization', 'rent'])
('JJ', 'NNP', 0.029850746268656716, ['American', 'Lyneses', 'Wolf', 'Neuhaus', 'Chris-Craft'])
('NNS', 'NNP', 0.02550538447005479, ['Thanh', 'Galicia', 'Cordis', 'Amex', 'PATOIS'])
('NNP', 'NNS', 0.024371811826941245, ['sources', 'onlookers', 'barricades', 'buildings', 'supports'])
('NNP', 'JJ', 0.023238239183827697, ['nonprofit', 'less-developed', 'nearby', 'perturbed', 'equitable'])
('NN', 'NNS', 0.02153788021915738, ['donors', 'rows', 'foundations', 'buffs', 'aspirations'])
('VBN', 'VBD', 0.019837521254487057, ['stepped', 'told', 'managed', 'ordered', 'used'])


### Error analysis
**Report your findings here.**  
What kinds of errors did the model make and why do you think it made them?

The results show that the basic word-level tagger mostly confuses `NN`, `NNP`, and `JJ`, probably because these tags often occupy similar positions in a sentence. The model also makes mistakes when distinguishing between the different noun types.
A likely reason is that we use a unidirectional LSTM, which sees only the left context and so lacks enough clues to separate tags that appear in similar structures. The model achieved an accuracy of 86.23%.

## Define a Character Level POS Tagger

We can use the character-level information present to augment our word embeddings. Words that end with -ing or -ly give quite a bit of information about their POS tags. To incorporate this information, we can run a character level LSTM on every word (treated as a tensor of characters, each mapped to character-index space) to create a character-level representation of the word. This representation can be concatenated with the word embedding (as in the BasicPOSTagger) to create a new word embedding that captures more information.
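
One possible reading of this description is sketched below: a character LSTM is run over each word separately, and its final hidden state is concatenated with the word embedding. All sizes, index values, and the helper name `char_representation` are assumptions made for illustration. The `CharPOSTagger` defined later pads every word to `MAX_WORD_LEN` instead, so this is not the notebook's implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen only for this sketch
char_vocab_size, char_embedding_dim, char_hidden_dim = 30, 4, 2
word_vocab_size, word_embedding_dim = 10, 5

char_embedding = nn.Embedding(char_vocab_size, char_embedding_dim)
char_lstm = nn.LSTM(char_embedding_dim, char_hidden_dim)
word_embedding = nn.Embedding(word_vocab_size, word_embedding_dim)

def char_representation(char_idxs):
    """Run the character LSTM over one word and return its final hidden state."""
    embeds = char_embedding(char_idxs).view(len(char_idxs), 1, -1)
    _, (h_n, _) = char_lstm(embeds)      # h_n: (1, 1, char_hidden_dim)
    return h_n.view(-1)                  # (char_hidden_dim,)

# A two-word "sentence": word indices plus per-word character indices (hypothetical values)
word_idxs = torch.tensor([3, 7], dtype=torch.long)
chars_per_word = [torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6, 7, 8])]

char_reps = torch.stack([char_representation(c) for c in chars_per_word])  # (2, 2)
combined = torch.cat((word_embedding(word_idxs), char_reps), dim=1)        # (2, 5 + 2)
print(combined.shape)
```

Under this reading, `combined` would be the input to the word-level LSTM, whose input size becomes the sum of the word and character dimensions.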

In [5]:
# Create char to index mapping
char_to_idx = {}
unique_chars = set()
MAX_WORD_LEN = 0

for sent in train_sentences:
    for word in sent:
        for c in word:
            unique_chars.add(c)
        if len(word) > MAX_WORD_LEN:
            MAX_WORD_LEN = len(word)

# Sort so that character indices are reproducible across Python sessions
for c in sorted(unique_chars):
    char_to_idx[c] = len(char_to_idx)
char_to_idx[' '] = len(char_to_idx)

# New Hyperparameters
# EMBEDDING_DIM = 16
# HIDDEN_DIM = 8
EMBEDDING_DIM = 32
HIDDEN_DIM = 16
LEARNING_RATE = 0.1
LSTM_LAYERS = 1
DROPOUT = 0
EPOCHS = 12
CHAR_EMBEDDING_DIM = 4
CHAR_HIDDEN_DIM = 2

In [6]:
from tqdm import tqdm
class CharPOSTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, char_embedding_dim, 
                 char_hidden_dim, char_size, vocab_size, tagset_size):
        super(CharPOSTagger, self).__init__()
        #############################################################################
        # TODO: Define and initialize anything needed for the forward pass.
        # You are required to create a model with:
        # an embedding layer: that maps words to the embedding space
        # an char level LSTM: that finds the character level embedding for a word
        # an LSTM layer: that takes the combined embeddings as input and outputs hidden states
        # a Linear layer: maps from hidden state space to tag space
        #############################################################################
        # word embedding
        self.hidden_dim = hidden_dim
        self.word_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm_word = nn.LSTM(embedding_dim, self.hidden_dim)
        
        # char embedding
        self.char_hidden_dim = char_hidden_dim
        self.char_embedding = nn.Embedding(char_size, char_embedding_dim)
        self.lstm_char = nn.LSTM(char_embedding_dim, self.char_hidden_dim)
        
        # combine the word / character
        self.overall_hidden_dim = hidden_dim + MAX_WORD_LEN * char_hidden_dim
        
        self.hidden2tag = nn.Linear(self.overall_hidden_dim, tagset_size)

        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################    

    def forward(self, sentence, chars):
        tag_scores = None
        #############################################################################
        # TODO: Implement the forward pass.
        # Given a tokenized index-mapped sentence and a character sequence as the arguments, 
        # find the corresponding scores for tags
        # returns:: tag_scores (Tensor)
        #############################################################################
        embeds = self.word_embedding(sentence)
        lstm_out, _ = self.lstm_word(embeds.view(len(sentence), 1, -1))
        
        embeds_char = self.char_embedding(chars)
        char_lstm_out, _ = self.lstm_char(embeds_char.view(len(chars), 1, -1))
        
        # Regroup the character outputs by word: each word contributes MAX_WORD_LEN
        # character states, which are flattened into one row per word
        merge_out = torch.cat((lstm_out.view(len(sentence), -1), char_lstm_out.view(len(sentence), -1)), 1)
        
        tag_space = self.hidden2tag(merge_out)
        tag_scores = F.log_softmax(tag_space, dim=1)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
        return tag_scores

def train_char(epoch, model, loss_function, optimizer):
    model.train()
    train_loss = 0
    train_examples = 0
    s_idx = 0
    for sentence, tags in tqdm(training_data):
        #############################################################################
        # TODO: Implement the training loop
        # Hint: you can use the prepare_sequence method for creating index mappings 
        # for sentences as well as character sequences. Find the gradient with 
        # respect to the loss and update the model parameters using the optimizer.
        #############################################################################
        model.zero_grad()
        
        # Get input for the model
        sentence_in = prepare_sequence(sentence, word_to_idx)
        sentence_chars = []
        for w in sentence:
            spaces = ' ' * (MAX_WORD_LEN - len(w))
            sentence_chars.extend(list(spaces + w))
        char_in = prepare_sequence(sentence_chars, char_to_idx)
        targets = prepare_sequence(tags, tag_to_idx)
        
        tag_scores = model(sentence_in, char_in)
        
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()
        train_examples += len(targets)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
    avg_train_loss = train_loss / train_examples
    avg_val_loss, val_accuracy = evaluate_char(model, loss_function, optimizer)

    print("Epoch: {}/{}\tAvg Train Loss: {:.4f}\tAvg Val Loss: {:.4f}\t Val Accuracy: {:.0f}".format(epoch, 
                                                                      EPOCHS, 
                                                                      avg_train_loss, 
                                                                      avg_val_loss,
                                                                      val_accuracy))

def evaluate_char(model, loss_function, optimizer):
    # returns:: avg_val_loss (float)
    # returns:: val_accuracy (float)
    val_loss = 0
    correct = 0
    val_examples = 0
    with torch.no_grad():
        for sentence, tags in val_data:
            #############################################################################
            # TODO: Implement the evaluate loop
            # Find the average validation loss along with the validation accuracy.
            # Hint: To find the accuracy, argmax of tag predictions can be used.
            #############################################################################

            # Get input for the model
            sentence_in = prepare_sequence(sentence, word_to_idx)
            sentence_chars = []
            for w in sentence:
                spaces = ' ' * (MAX_WORD_LEN - len(w))
                sentence_chars.extend(list(spaces + w))
            char_in = prepare_sequence(sentence_chars, char_to_idx)
            targets = prepare_sequence(tags, tag_to_idx)

            tag_scores = model(sentence_in, char_in)
            preds = tag_scores.argmax(dim=1)
            
            loss = loss_function(tag_scores, targets)
            
            val_loss += loss.item()
            correct += (preds == targets).sum().item()
            val_examples += len(targets)
            
            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
    val_accuracy = 100. * correct / val_examples
    avg_val_loss = val_loss / val_examples
    return avg_val_loss, val_accuracy
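
The left-padding used in the loops above can be isolated into a small helper to see what the flattened character sequence looks like. The name `pad_word` and the toy length are hypothetical. In the notebook the length would be `MAX_WORD_LEN`.

```python
def pad_word(word, max_len, pad_char=' '):
    """Left-pad a word with pad_char so that it spans exactly max_len characters."""
    return list(pad_char * (max_len - len(word)) + word)

# Toy values chosen for illustration
toy_max_len = 6
sentence = ["run", "slowly"]

sentence_chars = []
for w in sentence:
    sentence_chars.extend(pad_word(w, toy_max_len))

# Every word contributes exactly toy_max_len entries, so the flat list can be
# regrouped into one row per word after the character LSTM
print(sentence_chars)
```

A word longer than `max_len` receives no padding and contributes more than `max_len` characters, which would break the per-word regrouping. This matters for words outside the training data.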

In [7]:
#############################################################################
# TODO: Initialize the model, optimizer and the loss function
#############################################################################
model = CharPOSTagger(embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM,
                       char_hidden_dim = CHAR_HIDDEN_DIM, char_embedding_dim = CHAR_EMBEDDING_DIM,
                       char_size = len(char_to_idx), vocab_size = len(word_to_idx), tagset_size = len(tag_to_idx))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
import time

for epoch in range(1, EPOCHS + 1):
    start = time.time()
    train_char(epoch, model, loss_function, optimizer)
    print(f"Time used for Epoch{epoch}: ",time.time() - start)
    if epoch % 2 == 0:
        torch.save(model, f"char_pos_tagger_v2_{epoch}.pth")


In [8]:
model = torch.load("char_pos_tagger_8.pth")

In [9]:
def test():
    predicted_tags = []
    with torch.no_grad():
        for sentence in test_sentences:
            #############################################################################
            # TODO: Implement the test loop
            # This method saves the predicted tags for the sentences in the test set.
            # The tags are first added to a list which is then written to a file for
            # submission. An empty string is added after every sequence of tags
            # corresponding to a sentence to add a newline following file formatting
            # convention, as has been done already.
            #############################################################################
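            if sentence:  # skip empty sentences produced by consecutive blank lines
                sentence_in = prepare_sequence(sentence, word_to_idx)
                sentence_chars = []
                for w in sentence:
                    # Left-pad to MAX_WORD_LEN; for words longer than any training word,
                    # keep only the last MAX_WORD_LEN characters (a design choice)
                    padded = w[-MAX_WORD_LEN:].rjust(MAX_WORD_LEN)
                    # Characters unseen in training fall back to the padding character
                    sentence_chars.extend(c if c in char_to_idx else ' ' for c in padded)
                char_in = prepare_sequence(sentence_chars, char_to_idx)
                tag_scores = model(sentence_in, char_in)
                preds = tag_scores.argmax(dim=1)
                predicted_tags.extend(idx_to_tag[p.item()] for p in preds)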
            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
            predicted_tags.append("")

    with open('test_labels_char.txt', 'w+') as f:
        for item in predicted_tags:
            f.write("%s\n" % item)
    

In [10]:
test()

Tune your hyperparameters to get a performance of **at least 85%** on the validation set for the CharPOSTagger.

### Test accuracy
Also evaluate your performance on the test data by submitting test_labels.txt and **report your test accuracy here**.

### Error analysis

In [11]:
#############################################################################
# TODO: Generate predictions from val data
# Create lists of words, tags predicted by the model and ground truth tags.
#############################################################################
def generate_predictions(model, data):
    # data is a list of (sentence, tags) pairs, e.g. val_data
    word_list = []
    model_tags = []
    gt_tags = []
    # Gradients are not needed for inference
    with torch.no_grad():
        for sentence, tags in data:
            sentence_chars = []
            for w in sentence:
                spaces = ' ' * (MAX_WORD_LEN - len(w))
                sentence_chars.extend(list(spaces + w))
            char_in = prepare_sequence(sentence_chars, char_to_idx)
            sentence_in = prepare_sequence(sentence, word_to_idx)
            tag_scores = model(sentence_in, char_in)
            preds = tag_scores.argmax(dim=1)
            model_tags.extend(idx_to_tag[i.item()] for i in preds)

            word_list.extend(sentence)
            gt_tags.extend(tags)
    return word_list, model_tags, gt_tags

#############################################################################
# TODO: Carry out error analysis
# From those lists collected from the above method, find the 
# top-10 tuples of (model_tag, ground_truth_tag, frequency, example words)
# sorted by frequency
#############################################################################
def error_analysis(word_list, model_tags, gt_tags):
    from collections import Counter, defaultdict
    
    wl = []
    mt = []
    gt = []
    words = defaultdict(list)
    
    for w, m, g in zip(word_list, model_tags, gt_tags):
        if m != g:
            wl.append(w)
            mt.append(m)
            gt.append(g)
            words[(m, g)].append(w)
            
    c = Counter(zip(mt, gt))
    top10 = c.most_common(10)
    
    errors = []
    for (mtag, gt_tag), count in top10:
        errors.append((mtag, gt_tag, count / sum(c.values()), words[(mtag, gt_tag)][:5]))
    return errors

errors = error_analysis(*generate_predictions(model, val_data))

for err in errors:
    print(err)

('POS', 'DT', 0.08363784979324936, ['the', 'the', 'the', 'the', 'no'])
('VBG', 'IN', 0.07979132608904703, ['in', 'In', 'off', 'at', 'in'])
('WDT', 'NN', 0.06440523127223771, ['overhead', 'province', 'rice', 'program', 'program'])
('WDT', 'NNP', 0.05442831041446293, ['Peterson', 'Service', 'Whitten', 'James', 'B.A.T'])
('WRB', ',', 0.05113472449273969, [',', ',', ',', ',', ','])
('(', 'NN', 0.04459563419559573, ['suit', 'fact', 'money', 'panhandler', 'night'])
('CD', '.', 0.04228771997307433, ['.', '.', '.', '.', '.'])
('WDT', 'NNS', 0.04019617270891432, ['companies', 'officials', 'villagers', 'sponsors', 'regulators'])
('WDT', 'JJ', 0.03185402442542552, ['tax-collection', 'fiscal', 'less-developed', 'third-quarter', 'institutional'])
('RBS', 'TO', 0.0233916722761804, ['to', 'to', 'to', 'to', 'to'])



**Report your findings here.**  
What kinds of errors does the character-level model make as compared to the original model, and why do you think it made them? 

With character-level information added to the LSTM, confusion among the different noun tags is reduced. 
The final accuracy is 90.74%.
However, punctuation marks now appear among the top errors (for example, `,` tagged as `WRB` and `.` tagged as `CD`). 
Errors also occur on words that share a similar structure or a similar ending. 
One possible reason is that the model weights the character-level information more heavily than the sentence structure.

## Define a BiLSTM POS Tagger

A bidirectional LSTM runs over the sentence both left-to-right and right-to-left, so it captures dependencies between words in both directions. 

In this part, you will make your model bidirectional. 

In addition, you should implement one of these modifications to improve the model's performance:
- Tune the model hyperparameters. Try at least 5 different combinations of parameters. For example:
    - number of LSTM layers
    - number of hidden dimensions
    - number of word embedding dimensions
    - dropout rate
    - learning rate
- Switch to pre-trained Word Embeddings instead of training them from scratch. Try at least one different embedding method. For example:
    - [GloVe](https://nlp.stanford.edu/projects/glove/)
    - [fastText](https://fasttext.cc/docs/en/english-vectors.html)
- Implement a different model architecture. Try at least one different architecture. For example:
    - adding a conditional random field on top of the LSTM
    - adding Viterbi decoding to the model
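
As a rough illustration of the Viterbi decoding option in the list above, here is a minimal pure-Python sketch. It assumes per-position tag scores (`emissions`, such as the rows of `tag_scores`) and a tag-to-tag `transitions` matrix. The numbers below are made-up values rather than learned parameters, and `viterbi_decode` is a hypothetical helper, not part of the assignment code.

```python
def viterbi_decode(emissions, transitions):
    """Return (best tag index sequence, its score).

    emissions[t][j]: score of tag j at position t.
    transitions[i][j]: score of moving from tag i to tag j (assumed given).
    """
    num_tags = len(emissions[0])
    # best[j] is the best score of any path that ends in tag j so far
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(num_tags):
            # Choose the previous tag that maximizes the path score into tag j
            prev = max(range(num_tags), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Start from the best final tag and follow the backpointers to the start
    last = max(range(num_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy example with two tags; the transitions penalize switching tags.
emissions = [[-0.1, -2.0], [-1.0, -0.9], [-0.2, -1.5]]
transitions = [[0.0, -3.0], [-3.0, 0.0]]

# Per-position argmax, which is what the tagger in this notebook does
greedy = [max(range(2), key=lambda j: row[j]) for row in emissions]
path, score = viterbi_decode(emissions, transitions)
print(greedy, path)
```

In this toy case the per-position argmax switches tags in the middle of the sequence, while the Viterbi path stays on one tag because the switching penalty outweighs the small emission gain.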

In [12]:
import torchtext as text

# Set 1: 92% acc
EMBEDDING_DIM = 200
HIDDEN_DIM = 32
LEARNING_RATE = 0.01
BIDIRECTIONAL = True
LSTM_LAYERS = 2
DROPOUT = 0.1
EPOCHS = 12

# Set 2: 79% acc
# EMBEDDING_DIM = 200
# HIDDEN_DIM = 16
# LEARNING_RATE = 0.01
# BIDIRECTIONAL = True
# LSTM_LAYERS = 2
# DROPOUT = 0.5
# EPOCHS = 20

# Set 3: 86% acc
# EMBEDDING_DIM = 200
# HIDDEN_DIM = 32
# LEARNING_RATE = 0.1
# BIDIRECTIONAL = True
# LSTM_LAYERS = 2
# DROPOUT = 0.3
# EPOCHS = 30

# Set 4: 83% acc
# EMBEDDING_DIM = 200
# HIDDEN_DIM = 32
# LEARNING_RATE = 0.001
# BIDIRECTIONAL = True
# LSTM_LAYERS = 2
# DROPOUT = 0
# EPOCHS = 30

# Set 5:
# EMBEDDING_DIM = 200
# HIDDEN_DIM = 32
# LEARNING_RATE = 0.1
# BIDIRECTIONAL = True
# LSTM_LAYERS = 2
# DROPOUT = 0
# EPOCHS = 30


In [13]:
class BiLSTMPOSTagger(nn.Module):
    # NOTE: you may have to modify these function headers to include your 
    # modification, e.g. adding a parameter for embeddings data

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(BiLSTMPOSTagger, self).__init__()
        #############################################################################
        # TODO: Define and initialize anything needed for the forward pass.
        # You are required to create a model with:
        # an embedding layer: that maps words to the embedding space
        # a BiLSTM layer: that takes word embeddings as input and outputs hidden states
        # a Linear layer: maps from hidden state space to tag space
        #############################################################################
        self.vec = text.vocab.GloVe(name='6B', dim=embedding_dim)
        self.dropout = nn.Dropout(DROPOUT)
        self.lstm = nn.LSTM(embedding_dim, hidden_size=hidden_dim,
                            num_layers=LSTM_LAYERS,
                            dropout=DROPOUT if LSTM_LAYERS > 1 else 0,
                            bidirectional=BIDIRECTIONAL)
        self.hidden2tag = nn.Linear(hidden_dim * 2 if BIDIRECTIONAL else hidden_dim, tagset_size)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################

    def forward(self, sentence):
        tag_scores = None
        #############################################################################
        # TODO: Implement the forward pass.
        # Given a tokenized index-mapped sentence as the argument, 
        # find the corresponding scores for tags
        # returns:: tag_scores (Tensor)
        #############################################################################
        embeds = self.vec.get_vecs_by_tokens(sentence, lower_case_backup=True)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
        return tag_scores

In [14]:
from tqdm import tqdm
def train(epoch, model, loss_function, optimizer):
    train_loss = 0
    train_examples = 0
    for sentence, tags in tqdm(training_data):
        #############################################################################
        # TODO: Implement the training loop
        # Hint: you can use the prepare_sequence method for creating index mappings 
        # for sentences. Find the gradient with respect to the loss and update the
        # model parameters using the optimizer.
        #############################################################################
        
        # Zero the gradients
        model.zero_grad()
        # The GloVe lookup happens inside the model, so pass the raw tokens
        sentence4embed = sentence
        # Map the tags to indices
        targets = prepare_sequence(tags, tag_to_idx)
        # Predict tag scores for the sentence
        tag_scores = model(sentence4embed)
        
        loss = loss_function(tag_scores, targets)
        loss.backward()        
        optimizer.step()
        
        train_loss += loss.item()
        train_examples += len(targets)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
    
    avg_train_loss = train_loss / train_examples
    avg_val_loss, val_accuracy = evaluate(model, loss_function, optimizer)
        
    print("Epoch: {}/{}\tAvg Train Loss: {:.4f}\tAvg Val Loss: {:.4f}\t Val Accuracy: {:.0f}".format(epoch, 
                                                                      EPOCHS, 
                                                                      avg_train_loss, 
                                                                      avg_val_loss,
                                                                      val_accuracy))

def evaluate(model, loss_function, optimizer):
    # returns:: avg_val_loss (float)
    # returns:: val_accuracy (float)
    val_loss = 0
    correct = 0
    val_examples = 0
    with torch.no_grad():
        for sentence, tags in val_data:
            #############################################################################
            # TODO: Implement the evaluate loop
            # Find the average validation loss along with the validation accuracy.
            # Hint: To find the accuracy, argmax of tag predictions can be used.
            #############################################################################
            # The GloVe lookup happens inside the model, so pass the raw tokens
            sentence4embed = sentence
            # Map the tags to indices
            targets = prepare_sequence(tags, tag_to_idx)
            # Predict tag scores for the sentence
            tag_scores = model(sentence4embed)
            # Take the highest-scoring tag at each position
            _, preds = torch.max(tag_scores, 1)
            loss = loss_function(tag_scores, targets)
            
            val_loss += loss.item()
            correct += (preds == targets).sum().item()
            val_examples += len(targets)
            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
    val_accuracy = 100. * correct / val_examples
    avg_val_loss = val_loss / val_examples
    return avg_val_loss, val_accuracy


In [16]:
#############################################################################
# TODO: Initialize the model, optimizer and the loss function
#############################################################################
model = BiLSTMPOSTagger(embedding_dim=EMBEDDING_DIM, hidden_dim=HIDDEN_DIM,
                       vocab_size = len(word_to_idx), tagset_size = len(tag_to_idx))
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
#############################################################################
#                             END OF YOUR CODE                              #
#############################################################################
import time

for epoch in range(1, EPOCHS + 1): 
    start = time.time()
    train(epoch, model, loss_function, optimizer)
    print(f"Time used for Epoch{epoch}: ",time.time() - start)
    torch.save(model, "bi_pos_tagger-exp1.pth")

100%|██████████| 7148/7148 [03:34<00:00, 33.39it/s]


Epoch: 1/12	Avg Train Loss: 0.0203	Avg Val Loss: 0.0145	 Val Accuracy: 90
Time used for Epoch1:  221.84125971794128

100%|██████████| 7148/7148 [03:20<00:00, 35.66it/s]


Epoch: 2/12	Avg Train Loss: 0.0125	Avg Val Loss: 0.0132	 Val Accuracy: 91
Time used for Epoch2:  205.65346002578735


100%|██████████| 7148/7148 [02:48<00:00, 42.34it/s]


Epoch: 3/12	Avg Train Loss: 0.0111	Avg Val Loss: 0.0119	 Val Accuracy: 92
Time used for Epoch3:  175.02353739738464


100%|██████████| 7148/7148 [03:32<00:00, 33.69it/s]


Epoch: 4/12	Avg Train Loss: 0.0101	Avg Val Loss: 0.0124	 Val Accuracy: 92
Time used for Epoch4:  219.23559665679932


100%|██████████| 7148/7148 [03:33<00:00, 33.53it/s]


Epoch: 5/12	Avg Train Loss: 0.0098	Avg Val Loss: 0.0117	 Val Accuracy: 92
Time used for Epoch5:  220.40116548538208


100%|██████████| 7148/7148 [03:52<00:00, 30.72it/s]


Epoch: 6/12	Avg Train Loss: 0.0094	Avg Val Loss: 0.0113	 Val Accuracy: 92
Time used for Epoch6:  239.92650365829468


100%|██████████| 7148/7148 [03:32<00:00, 33.58it/s]


Epoch: 7/12	Avg Train Loss: 0.0090	Avg Val Loss: 0.0119	 Val Accuracy: 92
Time used for Epoch7:  219.8170702457428


100%|██████████| 7148/7148 [04:02<00:00, 29.49it/s]


Epoch: 8/12	Avg Train Loss: 0.0090	Avg Val Loss: 0.0113	 Val Accuracy: 93
Time used for Epoch8:  248.811377286911


100%|██████████| 7148/7148 [03:23<00:00, 35.18it/s]


Epoch: 9/12	Avg Train Loss: 0.0087	Avg Val Loss: 0.0109	 Val Accuracy: 93
Time used for Epoch9:  210.94725704193115


100%|██████████| 7148/7148 [04:04<00:00, 29.28it/s]


Epoch: 10/12	Avg Train Loss: 0.0087	Avg Val Loss: 0.0113	 Val Accuracy: 92
Time used for Epoch10:  251.02008819580078


100%|██████████| 7148/7148 [03:41<00:00, 32.34it/s]


Epoch: 11/12	Avg Train Loss: 0.0086	Avg Val Loss: 0.0107	 Val Accuracy: 93
Time used for Epoch11:  228.0372278690338


100%|██████████| 7148/7148 [03:38<00:00, 32.69it/s]


Epoch: 12/12	Avg Train Loss: 0.0081	Avg Val Loss: 0.0105	 Val Accuracy: 93
Time used for Epoch12:  224.1796362400055


Your modified model should get a performance of **at least 90%** on the validation set.

### Test accuracy
Also evaluate your performance on the test data by submitting test_labels.txt and **report your test accuracy here**.



In [9]:
def test():
    predicted_tags = []
    with torch.no_grad():
        for sentence in test_sentences:
            #############################################################################
            # TODO: Implement the test loop
            # This method saves the predicted tags for the sentences in the test set.
            # The tags are first added to a list which is then written to a file for
            # submission. An empty string is added after every sequence of tags
            # corresponding to a sentence to add a newline following file formatting
            # convention, as has been done already.
            #############################################################################
            sentence_in = sentence
            tag_scores = model(sentence_in)
            preds = tag_scores.argmax(axis=1)
            predicted_tags.extend([idx_to_tag[preds[i].item()] for i in range(len(preds))])
            #############################################################################
            #                             END OF YOUR CODE                              #
            #############################################################################
            predicted_tags.append("")

    with open('test_labels_bilstm.txt', 'w+') as f:
        for item in predicted_tags:
            f.write("%s\n" % item)
    


In [10]:
test()

### Error analysis
**Report your findings here.**  
Compare the top-10 errors made by this modified model with the errors made by the model from part (a). 
If you tried multiple hyperparameter combinations, choose the model with the highest validation data accuracy.
What errors does the original model make as compared to the modified model, and why do you think it made them? 

Feel free to reuse the methods defined above for this purpose.
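
One possible way to line up the two top-10 lists for this comparison is sketched below. `compare_error_lists` is a hypothetical helper, not part of the assignment code. It assumes both lists use the `(model_tag, ground_truth_tag, frequency, example_words)` tuple format returned by `error_analysis`, and the two small lists are toy values for illustration, not results from this notebook.

```python
def compare_error_lists(errors_a, errors_b):
    """Split confusion pairs into those unique to each model and those shared.

    Each list holds (model_tag, ground_truth_tag, frequency, example_words) tuples.
    """
    pairs_a = {(m, g): freq for m, g, freq, _ in errors_a}
    pairs_b = {(m, g): freq for m, g, freq, _ in errors_b}
    only_a = sorted(set(pairs_a) - set(pairs_b))
    only_b = sorted(set(pairs_b) - set(pairs_a))
    # For shared pairs, keep both frequencies side by side
    shared = {pair: (pairs_a[pair], pairs_b[pair]) for pair in pairs_a if pair in pairs_b}
    return only_a, only_b, shared


# Toy inputs (illustrative only)
errors_a = [("NN", "NNP", 0.5, ["Bush"]), ("JJ", "NN", 0.25, ["stuff"])]
errors_b = [("NN", "NNP", 0.4, ["Paper"]), ("VBD", "VBN", 0.2, ["led"])]

only_a, only_b, shared = compare_error_lists(errors_a, errors_b)
print("Only in model A:", only_a)
print("Only in model B:", only_b)
print("Shared (freq A, freq B):", shared)
```

Pairs that appear in only one list point to error types that one model fixed or introduced, while the shared pairs show how the relative share of a persistent confusion changed.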

In [12]:
#############################################################################
# TODO: Generate predictions from val data
# Create lists of words, tags predicted by the model and ground truth tags.
#############################################################################
def generate_predictions(model, test_sentences):
    word_list = []
    model_tags = []
    gt_tags = []
    for sentence, tags in test_sentences:
        # The GloVe-based model takes raw tokens, so no index mapping is needed
        tag_scores = model(sentence)
        preds = tag_scores.argmax(dim=1)
        model_tags.extend([idx_to_tag[i.item()] for i in preds])

        word_list.extend(sentence)
        gt_tags.extend(tags)
    return word_list, model_tags, gt_tags

#############################################################################
# TODO: Carry out error analysis
# From those lists collected from the above method, find the 
# top-10 tuples of (model_tag, ground_truth_tag, frequency, example words)
# sorted by frequency
#############################################################################
def error_analysis(word_list, model_tags, gt_tags):
    from collections import Counter, defaultdict

    # Count each (model_tag, ground_truth_tag) confusion and collect the words involved
    counts = Counter()
    words = defaultdict(list)
    for w, m, g in zip(word_list, model_tags, gt_tags):
        if m != g:
            counts[(m, g)] += 1
            words[(m, g)].append(w)

    # Frequency is the share of all tagging errors made up by this confusion pair
    total_errors = sum(counts.values())
    errors = []
    for (mtag, gt_tag), count in counts.most_common(10):
        errors.append((mtag, gt_tag, count / total_errors, words[(mtag, gt_tag)][:5]))
    return errors

errors = error_analysis(*generate_predictions(model, val_data))

for err in errors:
    print(err)

('NN', 'NNP', 0.1099092812281926, ['Service', 'Education', 'Bush', 'Disneyland', 'Paper'])
('JJ', 'NN', 0.061409630146545706, ['panhandler', 'stuff', 'estuarian', 'gamma', 'eclectic'])
('NN', 'JJ', 0.05722260990928123, ['nonprofit', 'necessary', 'literary', 'environmental', 'savings-and-loan'])
('NNP', 'NN', 0.05198883461270063, ['province', 'bureau', 'press', 'secretary', 'rent'])
('JJ', 'NNP', 0.047801814375436145, ['Thanh', 'Hoa', 'Extension', 'Third', 'East'])
('NNS', 'NN', 0.027564549895324496, ['list', 'worth', 'globulin', 'fluoride', 'hedge'])
('VBD', 'VBN', 0.026866713189113746, ['confiscated', 'ended', 'led', 'caused', 'ended'])
('NN', 'VBG', 0.026168876482903, ['setting', 'operating', 'operating', 'driving', 'sitting'])
('NN', 'NNS', 0.02407536636427076, ['write-downs', 'market-makers', 'doldrums', 'placements', 'woods'])
('NNP', 'JJ', 0.019190509420795535, ['universal', 'electric', '30-share', 'first', 'mainframe-class'])


I also switched to pre-trained GloVe word embeddings with 200 dimensions. After trying several hyperparameter combinations, I found that a small dropout rate (0.1) together with a learning rate of 0.01 helped avoid overfitting. The final accuracy on the validation set reached 93%.
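
The hyperparameter combinations could also be kept as data instead of commented-out blocks. The sketch below is one way to do that: the values mirror Sets 1 to 5 from the configuration cell, while `make_configs`, `best_config`, and the `run_experiment` callback are hypothetical names. The callback is assumed to train a model with one configuration and return its validation accuracy; the stand-in used here does no training.

```python
# Set 1 from the configuration cell serves as the base configuration
BASE_CONFIG = {
    "embedding_dim": 200,
    "hidden_dim": 32,
    "learning_rate": 0.01,
    "bidirectional": True,
    "lstm_layers": 2,
    "dropout": 0.1,
    "epochs": 12,
}

# Only the values that differ from the base are listed (Sets 1 to 5)
OVERRIDES = [
    {},
    {"hidden_dim": 16, "dropout": 0.5, "epochs": 20},
    {"learning_rate": 0.1, "dropout": 0.3, "epochs": 30},
    {"learning_rate": 0.001, "dropout": 0, "epochs": 30},
    {"learning_rate": 0.1, "dropout": 0, "epochs": 30},
]


def make_configs(base, overrides):
    """Return one full config per override dict, leaving `base` untouched."""
    return [{**base, **override} for override in overrides]


def best_config(configs, run_experiment):
    """Run every config and return (best_accuracy, best_config)."""
    results = [(run_experiment(config), config) for config in configs]
    return max(results, key=lambda pair: pair[0])


def placeholder_run(config):
    # Stand-in for real training: returns a dummy score, not an accuracy
    return float(config["hidden_dim"])


configs = make_configs(BASE_CONFIG, OVERRIDES)
score, chosen = best_config(configs, placeholder_run)
print(len(configs), chosen["hidden_dim"])
```

Replacing `placeholder_run` with a function that builds the model, trains it, and returns the validation accuracy would turn this into a small search loop.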