# Colx 525 Lab Assignment 3: POS Tagging

## Assignment objectives

In this assignment, you will develop a POS tagger using PyTorch. You will:

1. Read in training, development and test data using `torchtext`.
1. Implement a baseline majority class tagger.
1. Numericalize data (i.e. transform sentences and words into `torch.Tensor` objects).
1. Develop a BiLSTM POS tagger.  

The [`pytorch` documentation](https://pytorch.org/docs/stable/index.html) will be useful in this lab.

## Getting started

You will need to install the Python modules `torchtext`, `torch` and `numpy`. The easiest way to do this is using `anaconda` or `pip`.

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:

* Submit the assignment by filling in this jupyter notebook with your answers embedded
* Be sure to follow the general lab instructions

### Exercise 1: Reading in data

We will now read in the training, development and test sets using the `torchtext` library, which can simplify your data handling code. Please have a look at [Practical lecture 5-6].

Before you do anything else, please download the following file, place it in your `Lab3` directory and unzip it:

```
https://mpsilfve.github.io/assets/uddata.zip
```

We'll start by installing the `conllu` library, which can read Universal Dependencies treebank data. We'll also read the English UD training, development and test sets from the `uddata` directory.

In [3]:
# !python3 -m pip install conllu

In [2]:
# !wget https://mpsilfve.github.io/assets/uddata.zip
# !unzip uddata.zip 

In [1]:
import conllu
import os
import torch
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def read_data(dire, lang):
    """Read the train, dev and test splits for `lang` from the directory `dire`."""
    def parse_file(name):
        # Close the file after reading and decode it as UTF-8.
        with open(os.path.join(dire, name), encoding="utf-8") as f:
            return conllu.parse(f.read())

    # 1K sentences (*.head) instead of the entire train file (12K)
    train_data = parse_file(f"{lang}-ud-train.conllu.head")
    dev_data = parse_file(f"{lang}-ud-dev.conllu")
    test_data = parse_file(f"{lang}-ud-test.conllu")
    return train_data, dev_data, test_data

train_data, dev_data, test_data = read_data("uddata","en")

Let's look at the format of the UD data. We'll print the first three tokens in the first training sentence. As you can see, each token is represented by a dictionary with several fields. For our purposes, the most important ones are:

* `form` which gives the word form, and
* `upos` which gives the Universal Dependencies POS tag. 

```
# newdoc id = weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000
# sent_id = weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000-0001
# text = Al-Zaman : American forces killed Shaikh Abdullah al-Ani, the preacher at the mosque in the town of Qaim, near the Syrian border.
1       Al      Al      PROPN   NNP     Number=Sing     0       root    _       SpaceAfter=No
2       -       -       PUNCT   HYPH    _       1       punct   _       SpaceAfter=No
3       Zaman   Zaman   PROPN   NNP     Number=Sing     1       flat    _       _
4       :       :       PUNCT   :       _       1       punct   _       _
5       American        american        ADJ     JJ      Degree=Pos      6       amod    _       _
6       forces  force   NOUN    NNS     Number=Plur     7       nsubj   _       _
7       killed  kill    VERB    VBD     Mood=Ind|Tense=Past|VerbForm=Fin        1       parataxis       _       _
8       Shaikh  Shaikh  PROPN   NNP     Number=Sing     7       obj     _       _
9       Abdullah        Abdullah        PROPN   NNP     Number=Sing     8       flat    _       _
10      al      al      PROPN   NNP     Number=Sing     8       flat    _       SpaceAfter=No
11      -       -       PUNCT   HYPH    _       8       punct   _       SpaceAfter=No
...
```

In [5]:
len(train_data)
train_data[0][:3]

[{'id': 1,
  'form': 'Al',
  'lemma': 'Al',
  'upos': 'PROPN',
  'xpos': 'NNP',
  'feats': {'Number': 'Sing'},
  'head': 0,
  'deprel': 'root',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}},
 {'id': 2,
  'form': '-',
  'lemma': '-',
  'upos': 'PUNCT',
  'xpos': 'HYPH',
  'feats': None,
  'head': 1,
  'deprel': 'punct',
  'deps': None,
  'misc': {'SpaceAfter': 'No'}},
 {'id': 3,
  'form': 'Zaman',
  'lemma': 'Zaman',
  'upos': 'PROPN',
  'xpos': 'NNP',
  'feats': {'Number': 'Sing'},
  'head': 1,
  'deprel': 'flat',
  'deps': None,
  'misc': None}]
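
As a small illustration of working with these fields, the sketch below pulls `(form, upos)` pairs out of a sentence. To stay self-contained it uses plain dictionaries that mimic the tokens printed above rather than the parsed `train_data`; the helper name `form_upos_pairs` is hypothetical.

```python
# Plain dicts standing in for the first three parsed tokens shown above.
sentence = [
    {"id": 1, "form": "Al", "upos": "PROPN"},
    {"id": 2, "form": "-", "upos": "PUNCT"},
    {"id": 3, "form": "Zaman", "upos": "PROPN"},
]

def form_upos_pairs(sent):
    """Return the (word form, universal POS tag) pair for every token."""
    return [(tok["form"], tok["upos"]) for tok in sent]

pairs = form_upos_pairs(sentence)
print(pairs)
```

Because the parsed tokens are indexed by the same field names, the same comprehension should carry over to a sentence taken from `train_data`.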

### Exercise 1.1
rubric={accuracy:5}

Following the example in [`practical_lecture5`](https://github.ubc.ca/MDS-CL-2022-23/COLX_525_morphology_students/blob/master/lectures/practical_lecture5.ipynb) and this [tutorial](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class), implement a `UDDataset` class. It should be a subclass of `torch.utils.data.Dataset`.

Your `__init__` function should take a dataset (`train_data`, `dev_data` or `test_data`) as argument and assign it as `self.data`. You'll also need to define `__len__` and `__getitem__` member functions which return the length of `self.data` and the element at index `i`, respectively.
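
The two member functions are ordinary Python special methods, so the protocol can be tried out without PyTorch. The sketch below uses a hypothetical `SquaresDataset` class (not part of the lab) to show how `len()` and indexing map onto `__len__` and `__getitem__`; your `UDDataset` additionally needs to subclass `torch.utils.data.Dataset`.

```python
class SquaresDataset:
    """Toy map-style dataset: item i is the square of the i-th stored number."""

    def __init__(self, numbers):
        self.numbers = list(numbers)

    def __len__(self):
        # Called by len(dataset).
        return len(self.numbers)

    def __getitem__(self, index):
        # Called by dataset[index].
        return self.numbers[index] ** 2

squares = SquaresDataset([1, 2, 3])
size = len(squares)
last = squares[2]
print(size, last)
```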

In [48]:
from torch.utils.data import Dataset

# ## from practical_lecture5:
# class PandasDataset(Dataset):
#     def __init__(self, dataframe):
#         self.dataframe = dataframe
#         self.iloc = dataframe.iloc        

#     def __len__(self):
#         return len(self.dataframe)

#     def __getitem__(self, index):
#         return self.dataframe.iloc[index]

# Your code here
class UDDataset(Dataset):
    # Your `__init__` function should take a dataset (`train_data`, `dev_data` or `test_data`) as argument. 
    # and assign it as `self.data`. 
    def __init__(self, data):
    

    # You'll define `__len__`,
    # which returns the length of `self.data`
    def __len__(self):

    
    # and define `__getitem__`,
    # which returns the element of `self.data` at the given index
    def __getitem__(self, index):



train = UDDataset(train_data)
dev = UDDataset(dev_data)
test = UDDataset(test_data)

In [58]:
print(train[0][:3] == train_data[0][:3])
print(type(train_data))
print(type(train))

True
<class 'conllu.models.SentenceList'>
<class '__main__.UDDataset'>


### Exercise 1.2
rubric={accuracy:5}

We'll now compile three vocabularies: one for words, one for characters and one for POS tags. Let's start by initializing generators for words (`yield_tokens`), characters (`yield_chars`) and POS tags (`yield_pos`). These extract words, characters and POS tags from a batch of examples.

Note that:
* `yield_tokens` generates a list of tokens: `["the", "dog", "sleeps"]`
* `yield_chars` generates a list of characters: `["t", "h", "e", "d", "o", "g", "s", "l", "e", "e", "p", "s"]` (notice the lack of spaces between words)
* `yield_pos` generates a list of POS tags: `["DET","NOUN","VERB"]`

In [60]:
# provided code
def yield_tokens(data):
    for ex in data:
        yield([tok["form"] for tok in ex])
        
def yield_chars(data):
    for ex in data:
        yield([c for tok in ex for c in tok["form"]])
        
def yield_pos(data):
    for ex in data:
        yield([tok["upos"] for tok in ex])

# print("First training example:")
print("Tokens:", next(yield_tokens(train)))
print("Characters:", next(yield_chars(train)))
print("POS:", next(yield_pos(train)))

Tokens: ['Al', '-', 'Zaman', ':', 'American', 'forces', 'killed', 'Shaikh', 'Abdullah', 'al', '-', 'Ani', ',', 'the', 'preacher', 'at', 'the', 'mosque', 'in', 'the', 'town', 'of', 'Qaim', ',', 'near', 'the', 'Syrian', 'border', '.']
Characters: ['A', 'l', '-', 'Z', 'a', 'm', 'a', 'n', ':', 'A', 'm', 'e', 'r', 'i', 'c', 'a', 'n', 'f', 'o', 'r', 'c', 'e', 's', 'k', 'i', 'l', 'l', 'e', 'd', 'S', 'h', 'a', 'i', 'k', 'h', 'A', 'b', 'd', 'u', 'l', 'l', 'a', 'h', 'a', 'l', '-', 'A', 'n', 'i', ',', 't', 'h', 'e', 'p', 'r', 'e', 'a', 'c', 'h', 'e', 'r', 'a', 't', 't', 'h', 'e', 'm', 'o', 's', 'q', 'u', 'e', 'i', 'n', 't', 'h', 'e', 't', 'o', 'w', 'n', 'o', 'f', 'Q', 'a', 'i', 'm', ',', 'n', 'e', 'a', 'r', 't', 'h', 'e', 'S', 'y', 'r', 'i', 'a', 'n', 'b', 'o', 'r', 'd', 'e', 'r', '.']
POS: ['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP', 'DET

Following the example in [`practical_lecture5`](https://github.ubc.ca/MDS-CL-2022-23/COLX_525_morphology_students/blob/master/lectures/practical_lecture5.ipynb), use `build_vocab_from_iterator` and the **training set** to define:

* a word vocabulary `word_vocab` based on the tokens generated by `yield_tokens`. Your vocabulary should contain the special symbols: \<unk\>, \<start\> and \<end\>
* a character vocabulary `char_vocab` based on the characters generated by `yield_chars`. Your vocabulary should contain the special symbols: \<unk\>, \<start\>, \<end\> and \<pad\>
* a pos vocabulary `pos_vocab` based on the tokens generated by `yield_pos`. Your vocabulary should contain the special symbol: \<unk\>

**Hint:** Remember to call `set_default_index` to set the ID for the \<unk\> token.
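
To see why the default index matters, here is a pure-Python sketch of the behaviour a vocabulary provides. `TinyVocab` is a hypothetical stand-in written for illustration and is not the `torchtext` implementation; it assumes that special symbols get the lowest IDs and that the remaining symbols are ordered by descending frequency.

```python
from collections import Counter

class TinyVocab:
    """Illustrative string-to-ID mapping with a fallback ID for unseen symbols."""

    def __init__(self, sequences, specials):
        counts = Counter(sym for seq in sequences for sym in seq)
        # Assumption of this sketch: specials first, then by descending frequency
        # (ties broken alphabetically so that the result is deterministic).
        ordered = sorted(counts, key=lambda s: (-counts[s], s))
        self.itos = list(specials) + [s for s in ordered if s not in specials]
        self.stoi = {s: i for i, s in enumerate(self.itos)}
        self.default_index = None

    def set_default_index(self, index):
        self.default_index = index

    def __getitem__(self, symbol):
        if symbol in self.stoi:
            return self.stoi[symbol]
        if self.default_index is None:
            raise KeyError(symbol)
        return self.default_index

toy_vocab = TinyVocab([["the", "dog", "sleeps"], ["the", "cat"]], specials=["<unk>"])
toy_vocab.set_default_index(toy_vocab["<unk>"])
known_id = toy_vocab["the"]      # most frequent word, directly after the special
unseen_id = toy_vocab["zebra"]   # not in the toy data, falls back to <unk>
print(known_id, unseen_id)
```

Without the `set_default_index` call, looking up `"zebra"` in this sketch raises an error, which is the situation the hint above is meant to avoid.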

In [13]:
# ## from practical_lecture5:
# token_vocab = build_vocab_from_iterator(yield_tokens(train_data, tokenizer),
#                                         specials=["<unk>", "<start>", "<end>", "<pad>"])
# token_vocab.set_default_index(token_vocab["<unk>"])

# label_vocab = build_vocab_from_iterator(yield_labels(train_data), specials=["<unk>"])
# label_vocab.set_default_index(label_vocab["<unk>"])


# Your code here
# Note that your word, character and POS vocabularies DO NOT require the `tokenizer`

# `word_vocab` based on the tokens generated by `yield_tokens`. 
# it should contain the special symbols: <unk>, <start> and <end>
# `set_default_index` for <unk> to avoid index errors.
word_vocab = 

# `char_vocab` based on the tokens generated by `yield_chars`. 
# it should contain the special symbols: <unk>, <start>, <end> and <pad>
# `set_default_index` for <unk> to avoid index errors.
char_vocab = 

# `pos_vocab` based on the tokens generated by `yield_pos`. 
# it should contain the special symbol: <unk> (See `label_vocab` in practical_lecture5)
# `set_default_index` for <unk> to avoid index errors.
pos_vocab = 

### Exercise 1.3
rubric={accuracy:10}

We'll now write a `collate_batch` function which numericalizes a batch of examples into torch tensors.

We'll use the following transformation functions which tokenize and numericalize the tokens, characters and POS tags, respectively. Note that both `word_transform` and `char_transform` add \<start\> and \<end\> tokens to the input. 

In [14]:
# provided code:
word_transform = lambda s: [word_vocab[w] for w in ["<start>"] + s + ["<end>"]]
char_transform = lambda w: [char_vocab[c] for c in ["<start>"] + w + ["<end>"]]
pos_transform = lambda s: [pos_vocab[w] for w in s]

# `data_process` from seq2seq_tutorial: 
# [trg_vocab[token] for token in spacy_en_tokenizer(trg_raw.lower())]
# [src_vocab[token] for token in spacy_fr_tokenizer(src_raw.lower())]
# --> then, they go to `torch.tensor()`

In [70]:
print(word_vocab['forces'])
print(word_vocab.get_stoi()['forces'])
print(word_vocab.get_itos()[190])

# ---
# print(word_vocab['Al'])
# print(word_vocab.get_stoi()['Al'])
# print(word_vocab.get_itos()[61])

190
190
forces


The function `collate_batch` takes a batch of examples as input. It extracts the tokens, characters and POS tags for each example using `yield_tokens`, `yield_chars` and `yield_pos`, respectively.

Your first task is to use `word_transform` and `pos_transform` to transform the token and POS lists into torch tensors (with data type `torch.long`). You should store the tensors as `token_tensor` and `pos_tensor`. `token_tensor` should have size `(sentence_length + 2) x 1` and `pos_tensor` should have size `sentence_length x 1`. The `+ 2` comes from adding start and end symbols to `tokens`.

**Hint:** You may need to call the torch function [`unsqueeze`](https://pytorch.org/docs/stable/generated/torch.unsqueeze.html#torch.unsqueeze) to ensure the correct size.

Your second task is to convert `chars` into a list of one-dimensional tensors, one per word, each of length `word_length + 2` (again, the `+ 2` comes from the start and end symbols). Remember that `yield_chars` returns one long list containing all the characters in the input sentence. You should first split this up into a list containing individual words like `["t", "h", "e"]` using the function `split_char_sequence`. Store the result as `chars`.

You can then use `char_transform` to numericalize each word in `chars`. Again, store the resulting list as `chars`. Finally, compile each numericalized word in `chars` into a torch tensor with dtype `torch.long`. Store the resulting list of tensors as `char_tensors`.
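
The two shape manipulations involved here can be tried in isolation. The following sketch uses made-up IDs (not taken from the lab vocabularies) and a hypothetical `PAD_ID` to show how `unsqueeze` adds a dimension of size 1 and how `pad_sequence` pads tensors of different lengths to a common length.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Made-up token IDs for a three-word sentence plus start and end symbols.
ids = [1, 17, 42, 8, 2]
column = torch.tensor(ids, dtype=torch.long).unsqueeze(1)
print(column.size())  # torch.Size([5, 1])

# Made-up character IDs for two words of different length.
PAD_ID = 3  # hypothetical padding ID
words = [torch.tensor([1, 9, 2], dtype=torch.long),
         torch.tensor([1, 9, 7, 5, 2], dtype=torch.long)]
padded = pad_sequence(words, batch_first=True, padding_value=PAD_ID)
print(padded.size())               # torch.Size([2, 5])
print(padded.unsqueeze(0).size())  # torch.Size([1, 2, 5])
```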

In [71]:
# provided code:

from itertools import islice
from collections import namedtuple

Example = namedtuple("Example",["word", "pos", "char"])

def split_char_sequence(chars, tokens):
    """ Split a sequence of characters representing a sentence into 
        sequences representing the individual words in the sentence
        
        FROM    : ["t","h","e","d","o","g","s","l","e","e","p","s"] 
        TO      : [["t","h","e"], ["d","o","g"], ["s","l","e","e","p","s"]]
        
        Arguments:        
        chars: A list of chars ["t","h","e","d","o","g","s","l","e","e","p","s"]
        tokens: A list of tokens ["the", "dog", "sleeps"]
    """
    word_lens = [len(w) for w in tokens]
    chars = iter(chars)
    return [list(islice(chars, elem)) for elem in word_lens]


In [75]:
# https://docs.python.org/3/library/collections.html#collections.namedtuple

# # Basic example
# >>> Point = namedtuple('Point', ['x', 'y'])
# >>> p = Point(11, y=22)     # instantiate with positional or keyword arguments
# >>> p[0] + p[1]             # indexable like the plain tuple (11, 22)
# 33
# >>> x, y = p                # unpack like a regular tuple
# >>> x, y
# (11, 22)
# >>> p.x + p.y               # fields also accessible by name
# 33
# >>> p                       # readable __repr__ with a name=value style
# Point(x=11, y=22)

split_char_sequence(["t","h","e","d","o","g","s","l","e","e","p","s"], ["the", "dog", "sleeps"])

[['t', 'h', 'e'], ['d', 'o', 'g'], ['s', 'l', 'e', 'e', 'p', 's']]

In [6]:
# ## from practical_lecture5:
# # Build batches. Each example contains a list of tokens and a gold standard label.
# def collate_batch(batch):
#     label_list, text_list = [], []
#     for quote, person in zip(yield_tokens(batch, tokenizer), yield_labels(batch)):
#         label_list.append(label_transform(person))
#         processed_text = torch.tensor(text_transform(quote))
#         text_list.append(processed_text)

#     # We use pad_sequence to pad all examples in the batch to the same length using the padding token <pad>    
#     return (pad_sequence(text_list, padding_value=token_vocab["<pad>"], batch_first=True), 
#                          torch.tensor(label_list))

def collate_batch(batch):
    pos_list, token_list, char_list, word_lens = [], [], [], []         # we don't use word_lens (ignore it);
    for tokens, chars, pos in zip(yield_tokens(batch), 
                                  yield_chars(batch),
                                  yield_pos(batch)):
        # Your code here
        # >> token and pos should be "tensor" using `word_transform and pos_transform`;
        # >> char should be tensor using `char_transform` after using given `split_char_sequence`; 
                

                
        # Please make sure not to change the indentation
        # of the following three lines
        token_list.append(token_tensor)
        pos_list.append(pos_tensor)
        char_list += char_tensors

    return Example(token_list[0],
                   pos_list[0],
                   (pad_sequence(char_list, batch_first=True, padding_value=char_vocab["<pad>"]).unsqueeze(0),
                    len(token_list[0])-2,
                    [len(w) for w in tokens]))

![https://i.stack.imgur.com/NiJu4.png](https://i.stack.imgur.com/NiJu4.png)

https://stackoverflow.com/questions/57237352/what-does-unsqueeze-do-in-pytorch

In [80]:
print(train_data[0:1])
print("-------------------------------")
b = collate_batch(train_data[0:1])
print("-------------------------------")
print(b)


# Some assertions to check your code
# b = collate_batch(dev_data[0:1])
# assert(b.word.size()[0] == len(dev_data[0]) + 2)
# assert(b.pos.size()[0] == len(dev_data[0]))
# chars, _, _ = b.char
# chars has size 1 x sentence_length x (longest word + 2)
# assert(chars.size()[1] == len(dev_data[0]))
# assert(chars.size()[2] == max([len(tok["form"]) for tok in dev_data[0]]) + 2)

[TokenList<Al, -, Zaman, :, American, forces, killed, Shaikh, Abdullah, al, -, Ani, ,, the, preacher, at, the, mosque, in, the, town, of, Qaim, ,, near, the, Syrian, border, ., metadata={newdoc id: "weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000", sent_id: "weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000-0001", text: "Al-Zaman : American forces killed Shaikh Abdullah al-Ani, the preacher at the mosque in the town of Qaim, near the Syrian border."}>]
-------------------------------
tokens>	 ['Al', '-', 'Zaman', ':', 'American', 'forces', 'killed', 'Shaikh', 'Abdullah', 'al', '-', 'Ani', ',', 'the', 'preacher', 'at', 'the', 'mosque', 'in', 'the', 'town', 'of', 'Qaim', ',', 'near', 'the', 'Syrian', 'border', '.']
after word_tranform>	 [1, 61, 12, 921, 38, 240, 190, 287, 908, 2115, 112, 12, 2136, 4, 3, 4048, 44, 3, 3857, 9, 3, 1173, 6, 1380, 4, 611, 3, 1410, 323, 5, 2]
after tensor>	 tensor([   1,   61,   12,  921,   38,  240,  190,  287,  908, 2115,  112, 

Let's then use the function to numericalize the training, development and test data. Note that we set `shuffle=True` for the training set.

In [30]:
# provided code:
dev_iter = DataLoader(dev_data, batch_size=1, shuffle=False, collate_fn=collate_batch)
test_iter = DataLoader(test_data, batch_size=1, shuffle=False, collate_fn=collate_batch)
train_iter = DataLoader(train_data, batch_size=1, shuffle=True, collate_fn=collate_batch)

We'll then take a closer look at the examples returned by our data loaders:

In [89]:
ex = next(iter(dev_iter))

print("ex.word is a tensor containing the word tokens in the sentence:")
print(ex.word)
print("It has size sentence_length+2 x 1:")
print(ex.word.size())
print()
print("ex.pos is a tensor containing the POS tags in the sentence:")
print(ex.pos)
print("It has size sentence_length x 1:")
print(ex.pos.size())
print()
print("ex.char is a tuple having three elements:")
char_tensor, sentence_length, word_lengths = ex.char
print("char_tensor contains character-sequence representations of the tokens in the sentence")
print(char_tensor)
print("It has size 1 x sentence_length x sequence_length:")
print(char_tensor.size())
print()
print("sentence_length simply gives the sentence_length (without <start> and <end> tokens):")
print(sentence_length)
print()
print("word_lengths contains the length of each word in the sentence (without <start> and <end> tokens):")
print(word_lengths)

['<start>', 'From', 'the', '<unk>', 'comes', 'this', 'story', ':', '<end>']
ex.word is a tensor containing the word tokens in the sentence:
tensor([[   1],
        [ 670],
        [   3],
        [   0],
        [1532],
        [  36],
        [ 350],
        [  38],
        [   2]])
It has size sentence_length+2 x 1:
torch.Size([9, 1])

ex.pos is a tensor containing the POS tags in the sentence:
tensor([[4],
        [6],
        [3],
        [5],
        [6],
        [1],
        [2]])
It has size sentence_length x 1:
torch.Size([7, 1])

ex.char is a tuple having three elements:
char_tensor contains character-sequence representations of the tokens in the sentence
tensor([[[ 1, 60, 11,  9, 17,  2,  3],
         [ 1,  6, 12,  4,  2,  3,  3],
         [ 1, 29, 37,  2,  3,  3,  3],
         [ 1, 16,  9, 17,  4, 10,  2],
         [ 1,  6, 12,  8, 10,  2,  3],
         [ 1, 10,  6,  9, 11, 21,  2],
         [ 1, 54,  2,  3,  3,  3,  3]]])
It has size 1 x sentence_length x sequence_length:
t

### Exercise 2: Simple baseline tagger

To be able to gauge the performance of our deep learning tagger, we'll now implement a baseline majority label classifier.

#### Exercise 2.1: Counting tags

rubric={accuracy:5}

As a first step, you will count the occurrences of different POS tags for each word in the **training data**. These counts will be stored in `tag_counts` below. For example, `tag_counts["this"]["PRON"]` should tell you how many times the word "this" was tagged `PRON` in the training data.

In [96]:
from collections import defaultdict, Counter

# A counter for POS tags. tag_counts[wf][pos] should denote the number of times we saw the word wf with 
# POS tag pos in the training data.
tag_counts = defaultdict(Counter)

# Populate tag_counts with the counts of different POS tags for each word type in the training data. 
# your code here
# >> 


# A few assertions to make sure that your code is working properly.
assert(tag_counts["this"]["DET"] == 46)
assert(tag_counts["this"]["PRON"] == 20)
assert(tag_counts["this"]["ADV"] == 1)

#### Exercise 2.2 Tagging the development data

rubric={accuracy:5}

The next step is to tag the development data. For each example in the development set, you should append a list of predicted POS tags to `output_tags`. 

For each word in an example, output its most common tag given by `tag_counts`. For OOV (out-of-vocabulary) words which are missing from `tag_counts`, you can predict `NOUN`. 
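
The counting and lookup pattern can be sketched on toy data. The observations and the helper `majority_tag` below are made up for illustration and are not drawn from the treebank; the sketch only shows how a `defaultdict(Counter)` accumulates counts and how a fallback tag can be returned for unseen words.

```python
from collections import Counter, defaultdict

# Toy (word, tag) observations, not taken from the treebank.
observations = [("this", "DET"), ("this", "DET"), ("this", "PRON"), ("run", "VERB")]

toy_counts = defaultdict(Counter)
for word, tag in observations:
    toy_counts[word][tag] += 1

def majority_tag(word, counts, fallback="NOUN"):
    """Most frequent tag for `word`, or `fallback` if the word was never seen."""
    if word not in counts:  # a membership test does not create a new entry
        return fallback
    return counts[word].most_common(1)[0][0]

seen = majority_tag("this", toy_counts)
unseen = majority_tag("zebra", toy_counts)
print(seen, unseen)
```

Testing membership with `in` rather than indexing keeps the `defaultdict` from silently growing an empty entry for every unseen word.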

In [33]:
output_tags = []
word_itos = word_vocab.get_itos()

for ex in dev_iter:
    output_tags.append([])
    for wf in ex.word[1:-1]:
        wf = word_itos[wf]
        # your code here
        # >> predicted POS tags to `output_tags`        
        # >> for OOV, predict NOUN

        # your code here

Using the `accuracy` function below, you can now print the baseline tagging accuracy on the development set. It should be around 77%.

In [34]:
# provided code:

def accuracy(sys,gold):
    """
    Function for evaluating tagging accuracy w.r.t. a gold standard test set (gold).
    """
    assert(len(sys) == len(gold))
    corr = 0
    tot = 0
    pos_itos = pos_vocab.get_itos()
    for s, g in zip(sys,gold):
        assert(len(s) == len(g.pos))
        corr += sum([1 if x==pos_itos[y] else 0 for x,y in zip(s,g.pos)])
        tot += len(s)
    return corr * 100.0 / tot

print("Accuracy for baseline majority class tagger:",accuracy(output_tags,dev_iter))

Accuracy for baseline majority class tagger: 77.22681724192779


### Exercise 3: The POS tagger

In this exercise, you will build a basic BiLSTM POS tagger. The tagger:

1. Embeds word tokens in the input sentence.
2. Passes the embeddings through a bidirectional LSTM layer.
3. Predicts POS tags using a feed-forward network and log softmax layer.

When you are implementing the POS tagger, remember to always keep track of the input and output sizes of all of your tensors. It is very important to check that these are correct. It is also important to understand what your dimensions refer to.

Let's start by loading a few necessary libraries and setting hyper-parameters:

In [35]:
import numpy as np

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.nn.functional import log_softmax, relu
from torch.optim import Adam, SGD

from random import random, seed, shuffle

# Ensure reproducible results by setting random seeds to 0.
seed(0)
torch.manual_seed(0)
np.random.seed(0)

import re

# Hyperparameters
EMBEDDING_DIM=300
RNN_HIDDEN_DIM=50
RNN_LAYERS=1
BATCH_SIZE=1
EPOCHS=5

# Maximum length of generated output word forms.
MAXWFLEN=40

#### Exercise 3.1: LSTM layer

rubric={accuracy:15}

You should implement a `BidirectionalLSTM` class which is used to encode a sequence of word embeddings into hidden-state representations. `BidirectionalLSTM` encapsulates two LSTM networks, `self.forward_rnn` and `self.backward_rnn`, which you should initialize in `BidirectionalLSTM.__init__`. Both should have:

1. Embedding dimension `self.embedding_dim` (this is a parameter to the `__init__` function)
1. Hidden dimension `RNN_HIDDEN_DIM` 
1. Layer count `RNN_LAYERS`

Your second task is to implement the `BidirectionalLSTM.forward` function. As argument, the function takes a `torch.Tensor` `sequence` which has size `(sequence_length,1,EMBEDDING_DIM)`. This tensor contains the word embeddings for the input sentence.  

You should pass the `sequence` to `self.forward_rnn` which returns:

1. a sequence $f_1,...,f_n$ of forward hidden states represented as a tensor `fwd_hss` having size `(sequence_length,1,RNN_HIDDEN_DIM)` and
1. a pair `(fwd_hs, fwd_cs)`, where:
   1. `fwd_hs` is the final forward hidden state having dimension `(1,1,RNN_HIDDEN_DIM)`.
   1. `fwd_cs` is the final forward cell state having dimension `(1,1,RNN_HIDDEN_DIM)`.
   
You should pass the **reversed** `sequence` to `self.backward_rnn` (**HINT**: [`torch.flip`](https://pytorch.org/docs/stable/torch.html#torch.flip) can be useful here) which returns:

1. a sequence $b_n,...,b_1$ of backward hidden states represented as a tensor `bwd_hss` having size `(sequence_length,1,RNN_HIDDEN_DIM)` (NOTE! the backward states are reversed here) 
1. and a pair `(bwd_hs, bwd_cs)`, where:
   1. `bwd_hs` is the final backward hidden state having dimension `(1,1,RNN_HIDDEN_DIM)`.
   1. `bwd_cs` is the final backward cell state having dimension `(1,1,RNN_HIDDEN_DIM)`.
   
The `forward` function should return a tensor `hss` having dimension `(sequence_length, 1, 2*RNN_HIDDEN_DIM)`, where `hss[i]` represents the concatenation of the $i$th forward hidden state $f_i$ and the $i$th backward hidden state $b_i$ (**HINT**: Again `torch.flip` can be useful).
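
The building blocks mentioned above can be checked on small tensors first. The sketch below uses hypothetical sizes (`SEQ_LEN`, `EMB`, `HID`) rather than the lab's hyperparameters and only demonstrates the behaviour of `torch.flip`, the output sizes of a single `nn.LSTM`, and concatenation along the last dimension; it is not a complete `BidirectionalLSTM`.

```python
import torch
import torch.nn as nn

SEQ_LEN, EMB, HID = 4, 6, 3  # small hypothetical sizes

# torch.flip reverses a tensor along the listed dimensions; flipping twice is a no-op.
seq = torch.arange(SEQ_LEN * EMB, dtype=torch.float).reshape(SEQ_LEN, 1, EMB)
rev = torch.flip(seq, dims=[0])
print(torch.equal(rev[0], seq[-1]))                     # True
print(torch.equal(torch.flip(rev, dims=[0]), seq))      # True

# By default nn.LSTM expects input of size (sequence_length, batch, input_size).
lstm = nn.LSTM(input_size=EMB, hidden_size=HID, num_layers=1)
hss, (hs, cs) = lstm(seq)
print(hss.size(), hs.size(), cs.size())

# Concatenating two state sequences along the last dimension doubles its size.
both = torch.cat([hss, hss], dim=2)
print(both.size())
```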

In [36]:
class BidirectionalLSTM(nn.Module):

    def __init__(self, embedding_dim=EMBEDDING_DIM):
        super(BidirectionalLSTM,self).__init__()
        self.embedding_dim = embedding_dim

        # your code here
        # `BidirectionalLSTM` encapsulates two LSTM networks: 
        # `self.forward_rnn` and `self.backward_rnn` which you should initialize 
            # `self.embedding_dim` (this is a parameter to the `__init__` function)
            # `RNN_HIDDEN_DIM` 
            # `RNN_LAYERS`

    # `forward` takes a `sequence` which has size `(sequence_length,1,EMBEDDING_DIM)`. 
    # This tensor contains the word embeddings for the input sentence. 
    def forward(self,sequence):
        # your code here
        # `fwd_hss` by `forward_rnn`
        # `bwd_hss` by `backward_rnn` (your sequence should be reversed here using `torch.flip`)

        # return  concat of `fwd_hss` and `bwd_hss` where you should use again `torch.flip` for bwd_hss
        return ...
        # your code here
        
# Assertions to check that your code returns objects of the correct size (not a guarantee that your code works).
# assert(BidirectionalLSTM()(torch.zeros(10,1,EMBEDDING_DIM)).size() == (10,1,2*RNN_HIDDEN_DIM))

In order to improve tagging accuracy for OOV words, we will use word dropout, implemented by the function `drop_words` below. It takes two arguments:
1. A `torch.Tensor` `sequence` of size `(sequence_length,1)` and
1. A float `word_dropout` in the interval `[0,1]`.

During training, the function randomly replaces each word by `word_vocab["<unk>"]` with probability `word_dropout`. The first and last positions (the \<start\> and \<end\> symbols) are never replaced.

In [100]:
# provided code:
def drop_words(sequence,word_dropout):
    seq_len, _ = sequence.size()
    dropout_sequence = sequence.clone()
    for i in range(1,seq_len-1):
        if random() < word_dropout:
            dropout_sequence[i,0] = word_vocab["<unk>"]
    return dropout_sequence
    
# Assertions to check that your code returns objects of the correct size (not a guarantee that your code works).
# assert(drop_words(torch.zeros(10,1),0.5).size() == (10,1))

#### Exercise 3.2 Sentence Encoder 

rubric={accuracy:15}

Your next task is to build a class `SentenceEncoder` which takes an example (from `train_iter`, `dev_iter` or `test_iter`) as input and returns a sequence of LSTM hidden states given by `BidirectionalLSTM`.

Your first task is to initialize the `SentenceEncoder` class. You will initialize 3 class members:
1. `self.vocabulary` which is just an alias for `word_vocab`.
1. `self.embedding` which is a `torch.nn.Embedding` having input dimension `len(self.vocabulary)` and output dimension `EMBEDDING_DIM`.
1. `self.rnn` which is a `BidirectionalLSTM` object.

You should then implement `SentenceEncoder.forward` which takes an example `ex` as input. Additionally, it takes another parameter `word_dropout` which is the probability for word dropout on the sentence `ex`. The function should:
1. Perform word dropout on `ex.word` by calling the `drop_words` function above.
1. Embed the resulting tensor, giving a tensor `embedded`.
1. Run `self.rnn` on `embedded`.
1. Return the resulting representation tensor. However, `ex.word` represents a sentence where we have added an initial symbol \<start\> and a final symbol \<end\>. You therefore need to clip the first and last representation vectors before returning the output of `self.rnn`.
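
As a quick check of the sizes involved, the sketch below embeds a column of made-up word IDs and then slices off the first and last positions. The sizes `VOCAB_SIZE` and `EMB` are hypothetical, and the slicing is shown directly on the embedded tensor only to illustrate the indexing; in the encoder it applies to the output of `self.rnn`.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMB = 10, 4  # small hypothetical sizes

embedding = nn.Embedding(VOCAB_SIZE, EMB)
word_ids = torch.tensor([[1], [5], [7], [2]], dtype=torch.long)  # size (4, 1)

embedded = embedding(word_ids)
print(embedded.size())  # torch.Size([4, 1, 4])

# Slicing off the first and last positions removes the <start> and <end> rows.
inner = embedded[1:-1]
print(inner.size())  # torch.Size([2, 1, 4])
```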

In [101]:
class SentenceEncoder(nn.Module):
    def __init__(self):
        super(SentenceEncoder,self).__init__()
        # your code here
        # `self.vocabulary` which is just an alias for `word_vocab`.
        # `self.embedding` which is a `nn.Embedding` with the length of `vocabulary` and `EMBEDDING_DIM`
        # `self.rnn` which is a `BidirectionalLSTM` object.
    
        
    def forward(self,ex,word_dropout):
        # your code here
        # `drop_words` using `ex.word`, `word_dropout` 
        # -> `self.embedding`.
        # -> `self.rnn`

        # return the result of `self.rnn`. 
        # need to therefore clip the first and last representation vector (<start> and <end>) before returning the output
        return ...
       
ex = next(iter(dev_iter))
sentence_length = ex.word.size()[0] - 2
assert(SentenceEncoder()(ex,0.5).size() == (sentence_length, 1, 2*RNN_HIDDEN_DIM))

#### Exercise 3.3: Prediction Layer

rubric={accuracy:15}

Your next task is to implement a feed-forward network `FeedForward` which is used to predict tags from LSTM representations. The constructor `FeedForward.__init__` takes two arguments `input_dim` and `output_dim` representing the input and output dimension of the network, respectively. 
Your first task is to complete the function `FeedForward.__init__` by initializing two linear layers:
1. `self.linear1` having input dimension `input_dim` and output dimension `input_dim` and
2. `self.linear2` having input dimension `input_dim` and output dimension `output_dim`.

Your second task is to implement the function `FeedForward.forward`. As input, the function takes `tensor` which is a `torch.Tensor` object having size `(sequence_length, 1, input_dim)`. It then:
1. Applies `self.linear1` followed by a ReLU activation function on `tensor` and
2. then passes the result through `self.linear2` and a `log_softmax` layer and finally returns the result.
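
When applying `log_softmax`, the dimension argument decides which axis is normalised. The sketch below, using made-up scores, assumes the tag scores live in the last dimension (`dim=2`) and checks that exponentiating the result gives probabilities summing to 1 for every position.

```python
import torch
from torch.nn.functional import log_softmax

# Made-up scores of size (2, 1, 3): two positions, batch size 1, three classes.
scores = torch.tensor([[[2.0, 0.5, -1.0]],
                       [[0.0, 0.0, 0.0]]])

log_probs = log_softmax(scores, dim=2)
# Exponentiating recovers probabilities that sum to 1 for every position.
totals = log_probs.exp().sum(dim=2)
print(log_probs.size(), totals)
```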

In [39]:
class FeedForward(nn.Module):
    def __init__(self,input_dim,output_dim):
        super(FeedForward, self).__init__()
        # your code here
        # `self.linear1` having input dimension `input_dim` and output dimension `input_dim` 
        # `self.linear2` having input dimension `input_dim` and output dimension `output_dim`

        
    def forward(self,tensor):
        # your code here
        # applies `self.linear1` followed by a ReLU activation function (`relu`) on `tensor` 
        # then, passes the result through `self.linear2` and a `log_softmax` layer and finally returns the result.
        

        return ...
        # your code here
        
# Assertions to check that your code returns objects of the correct size (not a guarantee that your code works).
assert(FeedForward(2*RNN_HIDDEN_DIM,100)(torch.zeros(10,1,2*RNN_HIDDEN_DIM)).size() == (10,1,100))      

## Tagging sentences and training the model

Now it's time to put together all the components that you have built so far. `SimplePOSTagger` is a wrapper around a `SentenceEncoder` and a `FeedForward` layer. It has a `forward` method which returns a tensor `res` of size `(sentence_length, 1, tagset_size)`, where `res[i,0,j]` represents the log probability of tag `pos_vocab.get_itos()[j]` for the word at position `i` in the input sentence.

The function `tag` gets POS tags for a dataset `data`.  
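
To see what the decoding step in `tag` does, here is a minimal sketch on a hand-made tensor. The tag list `toy_itos` and the probabilities are invented for this example; in the lab, the mapping comes from `pos_vocab.get_itos()`.

```python
import torch

# Hypothetical tag inventory standing in for pos_vocab.get_itos().
toy_itos = ["NOUN", "VERB", "DET"]

# Log probabilities for a 3-word sentence: sentence_length x 1 x tagset_size.
log_probs = torch.log(torch.tensor([[[0.1, 0.2, 0.7]],
                                    [[0.8, 0.1, 0.1]],
                                    [[0.2, 0.7, 0.1]]]))

# Pick the best tag index per word, then drop the batch dimension of size 1.
tag_ids = log_probs.argmax(dim=2).squeeze(1)

tags = [toy_itos[i] for i in tag_ids.tolist()]
print(tags)  # ['DET', 'NOUN', 'VERB']
```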

In [102]:
# provided code:
class SimplePOSTagger(nn.Module):
    def __init__(self):
        super(SimplePOSTagger,self).__init__()
        self.tagset_size = len(pos_vocab)
        
        self.sentence_encoder = SentenceEncoder()
        self.hidden2tag = FeedForward(2*RNN_HIDDEN_DIM,self.tagset_size)
        
    def forward(self,ex, word_dropout=0):
        states = self.sentence_encoder(ex,word_dropout)
        return self.hidden2tag(states)

    def tag(self,data):
        with torch.no_grad():
            results = []
            pos_itos = pos_vocab.get_itos()
            for ex in data:
                tags = self(ex).argmax(dim=2).squeeze(1)
                results.append([pos_itos[i] for i in tags])
            return results
        
pos_size = len(pos_vocab)
ex = next(iter(dev_iter))
assert(SimplePOSTagger()(ex).size() == (ex.word.size()[0]-2,1,pos_size))
assert(len(SimplePOSTagger().tag([ex])[0]) == ex.word.size()[0] -2) 

Armed with the `SimplePOSTagger` class, you can now train your tagger using the following code. You should get to around 75% tagging accuracy on the development set.
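
The training loop pairs the `log_softmax` output of the tagger with `nn.NLLLoss`. As a rough illustration of what that loss computes, the sketch below uses random toy tensors (all sizes and values are invented, and it assumes the gold tags have shape `sentence_length x 1`, as the `squeeze(dim=1)` call in the loop suggests) and compares the result against a manual average of the negative log probabilities of the gold tags.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy tagger output: 4 words, batch size 1, 3 tags.
logits = torch.randn(4, 1, 3)
log_probs = F.log_softmax(logits, dim=2)
gold = torch.tensor([[0], [2], [1], [1]])   # 4 x 1, toy gold tag indices

output = log_probs.squeeze(dim=1)           # 4 x 3, as in the training loop
target = gold.squeeze(dim=1)                # 4

loss = nn.NLLLoss()(output, target)

# NLLLoss (default reduction) is the mean negative log probability of the gold tags.
manual = -output[torch.arange(4), target].mean()
print(loss.item(), manual.item())
```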

In [42]:
# provided code:

tagger = SimplePOSTagger()
optimizer = Adam(tagger.parameters())
loss_function = nn.NLLLoss()

EPOCHS = 5

for epoch in range(EPOCHS):
    tot_loss = 0
    for i,ex in enumerate(train_iter):
        print("Epoch %u: Example %u of %u" % (epoch+1, i+1,len(train_iter)),end="\r")
        tagger.zero_grad()
        output = tagger(ex,word_dropout=0.05).squeeze(dim=1)
        gold = ex.pos.squeeze(dim=1)
        loss = loss_function(output,gold)
        loss.backward()
        optimizer.step()
        tot_loss += loss.detach().numpy()
    print("\nAverage loss per example: %.4f" % (tot_loss/len(train_iter)))
    sys_dev = tagger.tag(dev_iter)
    print("Development accuracy: %.2f" % accuracy(sys_dev, dev_iter))

Epoch 1: Example 1000 of 1000
Average loss per example: 1.2512
Development accuracy: 69.94
Epoch 2: Example 1000 of 1000
Average loss per example: 0.5149
Development accuracy: 72.72
Epoch 3: Example 1000 of 1000
Average loss per example: 0.2772
Development accuracy: 73.44
Epoch 4: Example 1000 of 1000
Average loss per example: 0.1808
Development accuracy: 73.15
Epoch 5: Example 1000 of 1000
Average loss per example: 0.1451
Development accuracy: 75.57


You may notice that this is almost the same accuracy as for our baseline model. You can get a bit higher if you raise the number of epochs. If you really want better accuracy, you'll have to implement a character-level model and use pretrained embeddings. These can get you up to 85%.

### Exercise 4: Character-based tagging (optional)

In this exercise, you will extend your basic BiLSTM POS tagger to include character-based embeddings. In contrast to the basic tagger, the character-based tagger computes word embeddings as a concatenation of a token embedding derived from a regular `nn.Embedding` object and a character-based embedding which is computed by a bidirectional LSTM.

Again, always remember to keep track of the input and output sizes of all of your tensors. It is very important to check that these are correct.

#### Exercise 4.1 optional
rubric={accuracy:3}

Start by implementing the `CharEmbedding` class which computes character-based embeddings using a bidirectional LSTM. Your first task is to implement the `CharEmbedding.__init__` function. Initialize: 

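* `self.char_set` which is just an alias for the string-to-index mapping of the character vocabulary (`CHAR.vocab.stoi` in the legacy `torchtext` API; with the `char_vocab` object, presumably `char_vocab.get_stoi()`). It numericalizes individual characters like 'a'.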
* `self.embeddings` which is a `torch.nn.Embedding` having input dimension `len(self.char_set)` and output dimension `EMBEDDING_DIM`.
* `self.rnn` which is a `nn.LSTM` object (note, not `BidirectionalLSTM`!). You should set the embedding dimension to `EMBEDDING_DIM`, hidden dimension to `RNN_HIDDEN_DIM`, layer count to `RNN_LAYERS` and `bidirectional` to `True`.

After initializing `CharEmbedding`, you should implement the `CharEmbedding.forward` method. The method takes an example `ex` which represents a sentence as input. For each word in `ex`, `forward` embeds the characters in the word and passes the character embeddings through `self.rnn`. It then stores the final hidden state returned by `self.rnn` in a list and finally returns a tensor representing these states for every word in the sentence. 

The function `forward` takes two inputs:

* `ex`, an example representing the input sentence, and
* `char_dropout` a real number between 0 and 1, which represents the probability for dropping a character during training.

We are interested in `ex.char` which is a tuple containing three tensors:

1. `ex.char[0]` is a `1 x sentence_len x char_count` tensor, where `char_count` is the length of the **longest** word in the sentence. 
1. `ex.char[1]` is a scalar representing the sentence length
1. `ex.char[2]` is a tensor of shape `1 x sentence_length` which contains the lengths of each word in the sentence. E.g. `ex.char[2][0,0]` is the length of the first word.

E.g. `ex.char[0][:,0,:]` represents the first word. The tensor might look like this:
```
tensor([[ 2, 29, 13, 13,  3,  1,  1,  1,  1,  1,  1]])
```
The word has 5 actual characters and the rest of the characters are padding characters. In this case, `ex.char[2][0,0] == 5`.  
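
As a minimal sketch of how the padding can be cut away, the snippet below rebuilds the example tensor by hand. The names `chars` and `word_lengths` are stand-ins invented here for `ex.char[0]` and `ex.char[2]`; the sentence has just one word.

```python
import torch

# Stand-in for ex.char[0]: 1 x sentence_len x char_count (one word, 11 slots).
chars = torch.tensor([[[2, 29, 13, 13, 3, 1, 1, 1, 1, 1, 1]]])
# Stand-in for ex.char[2]: 1 x sentence_len, the true length of each word.
word_lengths = torch.tensor([[5]])

i = 0                                # index of the word we want
n = int(word_lengths[0, i])          # number of real characters
word = chars[:, i, :n]               # keep indices 0 .. n-1, drop the padding

print(word.shape)     # torch.Size([1, 5])
print(word.tolist())  # [[2, 29, 13, 13, 3]]
```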

To implement `forward`, you should:

1. Initialize a list `embeddings`.
2. Loop through the word indices `i` in `0 ... sentence_length - 1`. 
3. Each word `ex.char[0][:,i,:]` will contain `ex.char[2][0,i]` characters, the rest are padding. Initialize a tensor `word` which contains all the characters from index `0` up to and including `ex.char[2][0,i] - 1`. Note that `word` should be an order 3 tensor, so you'll need to call `unsqueeze(1)`. 
4. We will now perform character dropout. We can use the function `drop_words()` to accomplish this. Call the function `drop_words()` passing `word` and `char_dropout` as arguments.
5. Embed the characters in `word` and pass them through `self.rnn`. The final hidden state returned by `self.rnn` should be a tensor of size `2 x 1 x RNN_HIDDEN_DIM` (2 because `self.rnn` is bidirectional).
6. Reshape the final state into a `1 x 1 x 2*RNN_HIDDEN_DIM` tensor and append it to `embeddings`.
7. After you've looped through the words in `ex`, append a zero tensor of dimension `1 x 1 x 2*RNN_HIDDEN_DIM` at the front and end of `embeddings`. We need to do this to take into account start and end of sentence markers (which are not present in `ex.char`).
8. Now, `embeddings` should contain one character-based embedding for every word in the sentence + two embeddings for the start and end of sequence markers. Concatenate these into a `(sentence_length + 2) x 1 x 2*RNN_HIDDEN_DIM` tensor and return it.
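
The tensor shapes in steps 5 to 8 can be checked with a toy sketch like the one below. It is not the `CharEmbedding` solution: the sizes `EMB` and `HID`, the random "character embeddings" and the word lengths are all invented for illustration, and a single-layer `nn.LSTM` is assumed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

EMB, HID = 8, 5                      # toy stand-ins for EMBEDDING_DIM, RNN_HIDDEN_DIM
rnn = nn.LSTM(EMB, HID, num_layers=1, bidirectional=True)

word_lengths = [3, 6]                # a toy two-word sentence
embeddings = []
for n in word_lengths:
    # Pretend these are the embedded characters of one word: n x 1 x EMB.
    char_embs = torch.randn(n, 1, EMB)
    _, (h_n, _) = rnn(char_embs)     # h_n: 2 x 1 x HID (forward and backward)
    embeddings.append(h_n.reshape(1, 1, -1))   # 1 x 1 x 2*HID

# Zero vectors stand in for the start and end of sentence markers.
zero = torch.zeros(1, 1, 2 * HID)
result = torch.cat([zero] + embeddings + [zero], dim=0)

print(h_n.shape)     # torch.Size([2, 1, 5])
print(result.shape)  # torch.Size([4, 1, 10])
```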

In [43]:
class CharEmbedding(nn.Module):
    def __init__(self):
        super(CharEmbedding,self).__init__()

        # your code here
        # it should be the SAME as in `SentenceEncoder`, but using `char_vocab` and its length;
        # the one difference is that you use `nn.LSTM` instead of `BidirectionalLSTM`, with
        #   EMBEDDING_DIM, RNN_HIDDEN_DIM, RNN_LAYERS and `bidirectional=True`
        # your code here
        
    def forward(self,ex,char_dropout):
        # your code here
        # you iterate word by word, then
        #   it should be SAME as in `SentenceEncoder` (say, you are making a `WordEncoder`)


        # `ex.char[0]` = `1 x sentence_len x char_count` tensor
        # `ex.char[1]` = the sentence length
        # `ex.char[2]` = `1 x sentence_length` which contains the lengths of each word in the sentence. 
        #   E.g. `ex.char[2][0,0]` is the length of the first word. 

        # then, iterate `i` word by word, 
        # ex.char[2][0,i]       -> the length of the ith word;
        # ex.char[0][:,i,:]     -> the ith word
        # then you should remove 'padding' in ith word, and unsqueeze it (see what we did during `collate_batch()`)

        # now, same as in `SentenceEncoder`: 
        #     embedding and rnn;  
        
        # reshape the final state into a `1 x 1 x 2*RNN_HIDDEN_DIM` (1,1,-1) and append it to `embeddings`.
        # after iteration for `ex`, 
        #     append a zero tensor of dimension `1 x 1 x 2*RNN_HIDDEN_DIM` 
        #     at the front and end of `embeddings`. 
        #     for start and end of sentence markers (which are not present in `ex.char`)
        
        
        # finally, concatenate `embeddings` into a `(sentence_length + 2) x 1 x 2*RNN_HIDDEN_DIM` tensor and return it.
        return ...
        # your code here

ex = next(iter(dev_iter))
sentence_length = ex.word.size()[0]
assert(CharEmbedding()(ex,0.5).size() == (sentence_length, 1, 2*RNN_HIDDEN_DIM))

#### Exercise 4.2 optional
rubric={accuracy:2}

Your next task is to build a class `CharSentenceEncoder` which takes an example `ex` as input and returns a sequence of LSTM hidden states.

Your first task is to initialize the `CharSentenceEncoder` class. Start by copying your initialization code from `SentenceEncoder`, but change the initialization of `self.rnn` slightly. It should have embedding dimension `EMBEDDING_DIM+2*RNN_HIDDEN_DIM` because we are feeding in concatenated token and character-based embeddings to `self.rnn`. You should also initialize `self.char_embedding`, which is a `CharEmbedding` object.  

You should then implement `CharSentenceEncoder.forward`, which takes an example `ex` as input. Additionally, it takes two other parameters, `word_dropout` and `char_dropout`, which represent the word and character dropout probabilities, respectively. The function should:
1. Perform word dropout on `ex` by calling the `drop_words` function above.
1. Embed the resulting tensor, giving a tensor `word_embedded`.
1. Embed `ex` using `self.char_embedding`, giving a tensor `char_embedded` (remember to pass `char_dropout` as an argument).
1. Concatenate `word_embedded` and `char_embedded` into a `(sentence_length + 2) x 1 x (EMBEDDING_DIM + 2*RNN_HIDDEN_DIM)` tensor `embedded`. 
1. Run `self.rnn` on `embedded`.
1. Return the resulting representation tensor. Clip the first and last representation vector before returning the output of `self.rnn` (these correspond to the start and end of sequence tokens).
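
The concatenation and clipping steps can be sanity-checked with toy tensors. In the sketch below, the sizes and random inputs are invented, and a plain bidirectional `nn.LSTM` is used as a stand-in for the `BidirectionalLSTM` class from earlier in the lab, so treat it as a shape illustration rather than the expected implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

EMB, HID = 8, 5          # toy stand-ins for EMBEDDING_DIM, RNN_HIDDEN_DIM
n_words = 4              # words in the sentence, without start/end markers

# Pretend outputs of the word embedding and of the character embedding.
word_embedded = torch.randn(n_words + 2, 1, EMB)
char_embedded = torch.randn(n_words + 2, 1, 2 * HID)

# Concatenate along the feature dimension (the last one).
embedded = torch.cat([word_embedded, char_embedded], dim=2)

# Stand-in encoder whose input size matches the concatenated embedding.
rnn = nn.LSTM(EMB + 2 * HID, HID, bidirectional=True)
output, _ = rnn(embedded)

# Drop the states for the start and end of sequence tokens.
clipped = output[1:-1]

print(embedded.shape)  # torch.Size([6, 1, 18])
print(clipped.shape)   # torch.Size([4, 1, 10])
```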

In [45]:
class CharSentenceEncoder(nn.Module):
    def __init__(self):
        super(CharSentenceEncoder,self).__init__()

        # your code here
        # same as in the previous `SentenceEncoder`, except that `self.rnn` takes
        # inputs of dimension `EMBEDDING_DIM+2*RNN_HIDDEN_DIM`,
        # and add `self.char_embedding = CharEmbedding()`
        
    def forward(self,ex,word_dropout,char_dropout):
        # your code here
        # same as in the previous `SentenceEncoder`,
        # and add `char_embedded = self.char_embedding(ex, char_dropout)`
        
        # then, concatenate `word_embedded` and `char_embedded` as one big `embedded`,
        # run `self.rnn` on it and return the result (after removing <start> and <end> as in `SentenceEncoder`)


ex = next(iter(dev_iter))
sentence_length = ex.word.size()[0] - 2
assert(CharSentenceEncoder()(ex,0.5,0.5).size() == (sentence_length, 1, 2*RNN_HIDDEN_DIM))

The following code defines and trains a character-based POS tagger which uses the classes that you implemented above. You should get accuracy > 80% on the development data.

Note that it can be much slower to train `CharPOSTagger` than `SimplePOSTagger`. This is partly due to the looping in `CharEmbedding`. There are more efficient ways to handle this and we'll see some techniques next week.
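
One common way to avoid the per-word loop, which may or may not be the technique covered later in the course, is to pad all words to the same length and run the LSTM once on a packed batch. The sketch below uses random toy "character embeddings" (the sizes `EMB` and `HID` and the word lengths are invented) and checks that the batched final states match the looped ones.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

torch.manual_seed(0)

EMB, HID = 8, 5                       # toy sizes, not the lab's hyperparameters
rnn = nn.LSTM(EMB, HID, bidirectional=True)

# Pretend embedded characters for three words of different lengths.
words = [torch.randn(n, EMB) for n in (3, 6, 2)]

with torch.no_grad():
    # Looped version: one LSTM call per word.
    looped = []
    for w in words:
        _, (h_n, _) = rnn(w.unsqueeze(1))          # h_n: 2 x 1 x HID
        looped.append(h_n.reshape(1, -1))          # 1 x 2*HID
    looped = torch.cat(looped, dim=0)              # 3 x 2*HID

    # Batched version: pad, pack, and call the LSTM once.
    padded = pad_sequence(words)                   # max_len x 3 x EMB
    lengths = torch.tensor([w.size(0) for w in words])
    packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)
    _, (h_n, _) = rnn(packed)                      # h_n: 2 x 3 x HID
    batched = h_n.transpose(0, 1).reshape(len(words), -1)  # 3 x 2*HID

print(torch.allclose(looped, batched, atol=1e-5))
```

Packing tells the LSTM where each word really ends, so the padding positions do not influence the final hidden states.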

In [46]:
class CharPOSTagger(nn.Module):
    def __init__(self):
        super(CharPOSTagger,self).__init__()
        self.tagset_size = len(pos_vocab)
        
        self.sentence_encoder = CharSentenceEncoder()
        self.hidden2tag = FeedForward(2*RNN_HIDDEN_DIM,self.tagset_size)
        
    def forward(self,ex, word_dropout=0, char_dropout=0):
        states = self.sentence_encoder(ex,word_dropout,char_dropout)
        return self.hidden2tag(states)

    def tag(self,data):
        with torch.no_grad():
            results = []
            pos_itos=pos_vocab.get_itos()
            for ex in data:
                tags = self(ex).argmax(dim=2).squeeze(1)
                results.append([pos_itos[i] for i in tags])
            return results
        
pos_size = len(pos_vocab)
ex = next(iter(dev_iter))
assert(CharPOSTagger()(ex).size() == (ex.word.size()[0]-2,1,pos_size))
assert(len(CharPOSTagger().tag([ex])[0]) == ex.word.size()[0] -2) 

tagger = CharPOSTagger()
optimizer = Adam(tagger.parameters())
loss_function = nn.NLLLoss()

for epoch in range(EPOCHS):
    tot_loss = 0
    for i,ex in enumerate(train_iter):
        print("Epoch %u: Example %u of %u" % (epoch+1, i+1,len(train_iter)),end="\r")
        tagger.zero_grad()
        output = tagger(ex,word_dropout=0.05).squeeze(dim=1)
        gold = ex.pos.squeeze(dim=1)
        loss = loss_function(output,gold)
        loss.backward()
        optimizer.step()
        tot_loss += loss.detach().numpy()
    print("\nAverage loss per example: %.4f" % (tot_loss/len(train_iter)))
    sys_dev = tagger.tag(dev_iter)
    print("Development accuracy: %.2f" % accuracy(sys_dev, dev_iter))

Epoch 1: Example 1000 of 1000
Average loss per example: 0.9608
Development accuracy: 79.80
Epoch 2: Example 1000 of 1000
Average loss per example: 0.3046
Development accuracy: 82.46
Epoch 3: Example 1000 of 1000
Average loss per example: 0.1707
Development accuracy: 83.04
Epoch 4: Example 1000 of 1000
Average loss per example: 0.1140
Development accuracy: 83.05
Epoch 5: Example 1000 of 1000
Average loss per example: 0.0955
Development accuracy: 82.95
