# COLX 525 Lab Assignment 3: POS Tagging

## Assignment objectives

In this assignment, you will develop a POS tagger using PyTorch. You will:

1. Read in training, development and test data using `torchtext`.
1. Implement a baseline majority class tagger.
1. Numericalize data (i.e. transform sentences and words into `torch.Tensor` objects).
1. Develop a BiLSTM POS tagger.  

The [`pytorch` documentation]() will be useful in this lab.

## Getting started

You will need to install the Python modules `torchtext`, `torch` and `numpy`. The easiest way to do this is using `anaconda` or `pip`.

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:

* Submit the assignment by filling in this Jupyter notebook with your answers embedded.
* Be sure to follow the general lab instructions.

### Exercise 1: Reading in data using torchtext

We will now read in the training, development and test sets using the `torchtext` library, which can considerably simplify your code. If you need a tutorial on `torchtext`, please check [this one](https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/) and have a look at [Roger's torchtext workshop](https://github.ubc.ca/MDS-CL-2019-20/COLX_525_morphology_students/blob/master/pytorch_workshop/workshop1.ipynb).

Let's start by importing the relevant classes from `torchtext` and setting up some important constants. 

In [61]:
from torchtext.data import Field, NestedField
from torchtext.datasets import SequenceTaggingDataset 

# The padding symbol is used when words are encoded into character sequences, the unknown symbol is used for input 
# characters, which were not attested in the training set. We also have start & end of sequence symbols which are 
# appended both to the start and end of sentences and character sequences.
PAD="<pad>"
UNK="<unk>"
START="<start>"
END="<end>"

We will use the [Universal Dependencies](https://universaldependencies.org/) (UD) dataset for English. Sentences in UD datasets are annotated for dependency syntax, which means that they are marked for POS, morphological features, syntactic dependencies (i.e. head words) and syntactic roles (like predicate and subject). Below you can see an example of a sentence in UD annotation, also called CoNLL-U format:
```
1   This    this    DET     DT    Number=Sing|PronType=Dem                                2   det     _       _
2   item    item    NOUN    NN    Number=Sing                                             6   nsubj   _       _
3   is      be      AUX     VBZ   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   6   cop     _       _
4   a       a       DET     DT    Definite=Ind|PronType=Art                               6   det     _       _
5   small   small   ADJ     JJ    Degree=Pos                                              6   amod    _       _
6   one     one     NOUN    NN    Number=Sing                                             0   root    _       _
7   and     and     CCONJ   CC    _                                                       9   cc      _       _
8   easily  easily  ADV     RB    _                                                       9   advmod  _       _
9   missed  miss    VERB    VBN   Tense=Past|VerbForm=Part                                6   conj    _       _
10  .       .       PUNCT   .     _                                                       6   punct   _       _
```

You will now implement a `torchtext.Dataset` which reads data in UD format. The dataset class `UDData` is a subclass of [`torchtext.datasets.SequenceTaggingDataset`](https://torchtext.readthedocs.io/en/latest/datasets.html#sequence-tagging) because our inputs are sentences consisting of several word forms. Your task is to make `UDData.splits` return a training, development and test set containing examples which have the members `word`, `char` and `pos`. For the following sentence:

```
1   This       this      PRON   PRN   Number=Sing|PronType=Dem                                2   nsubj   _   _
2   is         be        AUX    AUX   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4   cop     _   _
3   a          a         DET    DET   _                                                       4   det     _   _
4   sentence   sentence  NOUN   NOUN  Number=Sing                                             0   root    _   _
```

your code should produce an example `ex` with the following members:
```
ex.word == ["This", "is", "a", "sentence"]
ex.char == [["T","h","i","s"], ["i","s"], ["a"], ["s","e","n","t","e","n","c","e"]]
ex.pos  == ["PRON", "AUX", "DET", "NOUN"]
```
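As a rough illustration of how the CoNLL-U columns map to these members, the following standalone sketch parses the sentence above by hand. It does not use `torchtext`; the helper name `read_conllu_sentence` and the tab-separated `SAMPLE` string are assumptions made only for this illustration.

```python
# Hypothetical helper for illustration only; the lab itself relies on torchtext.
SAMPLE = "\n".join([
    "1\tThis\tthis\tPRON\tPRN\tNumber=Sing|PronType=Dem\t2\tnsubj\t_\t_",
    "2\tis\tbe\tAUX\tAUX\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t4\tcop\t_\t_",
    "3\ta\ta\tDET\tDET\t_\t4\tdet\t_\t_",
    "4\tsentence\tsentence\tNOUN\tNOUN\tNumber=Sing\t0\troot\t_\t_",
])

def read_conllu_sentence(block):
    """Return (word, char, pos) lists for one CoNLL-U sentence."""
    word, char, pos = [], [], []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = line.split("\t")
        form, upos = columns[1], columns[3]  # zero-based columns 1 and 3
        word.append(form)
        char.append(list(form))  # split the word form into characters
        pos.append(upos)
    return word, char, pos

word, char, pos = read_conllu_sentence(SAMPLE)
print(word)
print(char)
print(pos)
```

The three lists line up position by position, which is the property the tagger relies on later.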

#### Exercise 1.1 Defining `torchtext.data.Field` objects and downloading data
rubric={accuracy:10}

The first thing you need to do is to define appropriate `torchtext.data.Field` objects for the member fields. You need to define `WORD` for the word forms in the sentence (like `This`), `CHAR` for word forms which have been split into characters (like `["T","h","i","s"]`) and `POS` for POS tags (like `NOUN`).

The field `CHAR` is special because it needs to perform two tokenizations. `CHAR` first tokenizes the sentence into a sequence of words and then tokenizes each word into a sequence of characters. This can be done using a [`torchtext.data.NestedField`](https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.NestedField). The idea here is that you will first define a regular `Field` which will tokenize a word form into characters and then define a `NestedField` which takes your word field as argument (more details in Roger's [torchtext workshop](https://github.ubc.ca/MDS-CL-2019-20/COLX_525_morphology_students/blob/master/pytorch_workshop/workshop1.ipynb)). 
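The two tokenization steps can be pictured with plain Python lists. The sketch below is not how `NestedField` is implemented; the function names `nested_tokenize` and `pad_words` are hypothetical, and the padding step only approximates what happens later when a sentence is turned into a tensor.

```python
PAD, START, END = "<pad>", "<start>", "<end>"

def nested_tokenize(sentence):
    """First split the sentence into words, then split each word into characters."""
    return [list(word) for word in sentence.split()]

def pad_words(char_lists):
    """Wrap each word in START/END and pad all words to a common length."""
    wrapped = [[START] + chars + [END] for chars in char_lists]
    max_len = max(len(chars) for chars in wrapped)
    return [chars + [PAD] * (max_len - len(chars)) for chars in wrapped]

chars = nested_tokenize("This is a sentence")
padded = pad_words(chars)
for row in padded:
    print(row)
```

Every row ends up with the length of the longest word plus two, which is why the character tensor printed later in the lab has a last dimension of `max_word_length+2`.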

In [62]:
# your code here

# Word forms: each sentence is wrapped in START and END symbols.
WORD = Field(sequential=True,
             pad_token=PAD,
             unk_token=UNK,
             init_token=START,
             eos_token=END)
# Inner field: splits a single word form into characters.
NESTING_FIELD = Field(tokenize=list,
                      pad_token=PAD,
                      unk_token=UNK,
                      init_token=START,
                      eos_token=END)
CHAR = NestedField(nesting_field=NESTING_FIELD, include_lengths=True)
# POS tags get no START/END symbols, so ex.pos has exactly one tag per word.
POS = Field(sequential=True)


# your code here

# Your fields should pass the following assertions: 
assert(WORD.preprocess("This is a sentence") == ["This", "is", "a", "sentence"]) 
assert(CHAR.preprocess("This is a sentence") == [["T", "h", "i", "s"], 
                                                 ["i", "s"], 
                                                 ["a"], 
                                                 ["s", "e", "n", "t", "e", "n", "c", "e"]])
assert(POS.preprocess("PRON AUX DET NOUN") == ["PRON", "AUX", "DET", "NOUN"])

Next you need to run the following code to download and read Universal Dependencies data. The data will be downloaded into a directory `data` the first time you run this code so you'll need a network connection.

In [63]:
class UDData(SequenceTaggingDataset):
    """
    Universal Dependencies.
    Download original at http://universaldependencies.org/
    License: http://creativecommons.org/licenses/by-sa/4.0/
    
    UDData defines a data loader and reader for UD treebanks. Since we inherit 
    SequenceTaggingDataset, the only method we need to define is splits which 
    returns the training, development and test set.
    """
    
    urls = ['https://mpsilfve.github.io/assets/uddata.zip']
    dirname = 'uddata'
    name = 'uddata'

    
    @classmethod
    def splits(cls, language, root="data", **kwargs):
        """
        Downloads and reads Universal Dependencies Version 2 data. The function 
        returns three torchtext.data.Dataset objects: train, dev and test which 
        contain torchtext.data.Example objects.
        
        The language parameter should be set to "en" for English, "es" for 
        Spanish and "fi" for Finnish.
        
        The variable FIELDS determines how UDData treats the fields in the 
        CoNLL-U format. It consists of 10 fields corresponding to each of the 10 
        fields in the CoNLL-U format. Counting from zero, we are interested in 
        field 1 (word form) and field 3 (POS tag). We don't want to extract any 
        information for the remaining fields. 
    
        Each entry in FIELDS is a pair (name, field) where field refers to the 
        torchtext.data.Field object that handles the information stored in this 
        field. The information is stored as the variable 
        torchtext.data.Example.name. 
        
        Field 1 is special because we extract two kinds of information: the word 
        form as a monolithic token and the same word form as a character sequence. 
        That is why FIELDS[1] extracts two name values: 'word' and 'char'.  
        """
        FIELDS = ((None,None), 
                  (('word','char'), (WORD, CHAR)), 
                  (None,None), 
                  ('pos', POS), 
                  (None,None), 
                  (None,None), 
                  (None,None),
                  (None,None),
                  (None,None),
                  (None,None))
        
        return super(UDData, cls).splits(
                fields=FIELDS, 
                root=root, 
                train="%s-ud-train.conllu.head" % language, 
                validation="%s-ud-dev.conllu" % language,
                test="%s-ud-test.conllu" % language, **kwargs)
    
# Read Universal Dependencies v2 training, development and test sets for English.
train, dev, test = UDData.splits(language="en")

# Print the first example in the English training set.
ex = next(iter(train))
print(ex.word)
print(ex.char)
print(ex.pos)

['Al', '-', 'Zaman', ':', 'American', 'forces', 'killed', 'Shaikh', 'Abdullah', 'al', '-', 'Ani', ',', 'the', 'preacher', 'at', 'the', 'mosque', 'in', 'the', 'town', 'of', 'Qaim', ',', 'near', 'the', 'Syrian', 'border', '.']
[['A', 'l'], ['-'], ['Z', 'a', 'm', 'a', 'n'], [':'], ['A', 'm', 'e', 'r', 'i', 'c', 'a', 'n'], ['f', 'o', 'r', 'c', 'e', 's'], ['k', 'i', 'l', 'l', 'e', 'd'], ['S', 'h', 'a', 'i', 'k', 'h'], ['A', 'b', 'd', 'u', 'l', 'l', 'a', 'h'], ['a', 'l'], ['-'], ['A', 'n', 'i'], [','], ['t', 'h', 'e'], ['p', 'r', 'e', 'a', 'c', 'h', 'e', 'r'], ['a', 't'], ['t', 'h', 'e'], ['m', 'o', 's', 'q', 'u', 'e'], ['i', 'n'], ['t', 'h', 'e'], ['t', 'o', 'w', 'n'], ['o', 'f'], ['Q', 'a', 'i', 'm'], [','], ['n', 'e', 'a', 'r'], ['t', 'h', 'e'], ['S', 'y', 'r', 'i', 'a', 'n'], ['b', 'o', 'r', 'd', 'e', 'r'], ['.']]
['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'AD

#### Exercise 1.2 Build Vocabularies

rubric={accuracy:2}

You should now construct your word, character and POS vocabularies from the **training data** by calling the `build_vocab` function for the appropriate fields. 
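Conceptually, a vocabulary is a two-way mapping between tokens and integer indices. The following sketch builds one by hand for a toy corpus; the function `build_toy_vocab`, the ordering of the special symbols and the tie-breaking rule are assumptions for illustration, and the exact ordering produced by `torchtext` may differ.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"

def build_toy_vocab(sentences, specials=(UNK, PAD)):
    """Return (itos, stoi) for a list of tokenized sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    # Specials first, then tokens by descending frequency (ties broken alphabetically).
    itos = list(specials) + sorted(counts, key=lambda tok: (-counts[tok], tok))
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi

train_sentences = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
itos, stoi = build_toy_vocab(train_sentences)

def lookup(token):
    """Map a token to its index, falling back to UNK for unseen tokens."""
    return stoi.get(token, stoi[UNK])

print(itos)
print(lookup("dog"), lookup("zebra"))
```

Building the vocabulary from the training data only means that words first seen in the development or test set map to the unknown symbol, which mirrors what the tagger will face on new text.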

In [64]:
# your code here

WORD.build_vocab(train)
CHAR.build_vocab(train)
POS.build_vocab(train)


# your code here

How many word types does the `WORD` vocabulary contain? What about `CHAR` and `POS`? Print the number of entries in each vocabulary. (There should be substantially more word types than characters or POS tags.)

In [65]:
# your code here

print("WORD:",len(WORD.vocab.stoi))
print("CHAR:",len(CHAR.vocab.stoi))
print("POS:",len(POS.vocab.stoi))

# your code here

WORD: 4740
CHAR: 96
POS: 19


### Exercise 2: Simple baseline tagger

To be able to gauge the performance of our deep learning tagger, we'll now implement a baseline majority label classifier.

#### Exercise 2.1: Counting tags

rubric={accuracy:5}

As a first step, you will count the occurrences of different POS tags for each word in the **training data**. These counts will be stored in `tag_counts` below. For example, `tag_counts["this"]["PRON"]` should tell you how many times the word "this" was tagged `PRON` in the training data.

In [66]:
from collections import defaultdict, Counter

# A counter for POS tags. tag_counts[wf][pos] should denote the number of times we saw the word wf with 
# POS tag pos in the training data.
tag_counts = defaultdict(Counter)

# Populate tag_counts with the counts of different POS tags for each word type in the training data. 
# your code here

for d in train:
    words = d.word
    tags = d.pos
    for word, tag in zip(words, tags):
        tag_counts[word][tag] += 1
    

# your code here

# A few assertions to make sure that your code is working properly.
assert(tag_counts["this"]["DET"] == 46)
assert(tag_counts["this"]["PRON"] == 20)
assert(tag_counts["this"]["ADV"] == 1)

#### Exercise 2.2 Tagging the development data

rubric={accuracy:5}

The next step is to tag the development data. For each example in the development set, you should append a list of predicted POS tags to `output_tags`. 

For each word in an example, output its most common tag given by `tag_counts`. For OOV (out-of-vocabulary) words which are missing from `tag_counts`, you can predict `NOUN`. 
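The decision rule can be tried out on a toy corpus before running it on the real data. This is a minimal sketch: the data in `toy_train` is invented, and `majority_tag` is a hypothetical helper name rather than part of the lab's scaffold.

```python
from collections import Counter, defaultdict

# Invented toy training data: (words, tags) pairs.
toy_train = [
    (["this", "is", "fun"], ["PRON", "AUX", "ADJ"]),
    (["this", "dog", "barks"], ["DET", "NOUN", "VERB"]),
    (["this", "cat", "sleeps"], ["DET", "NOUN", "VERB"]),
]

toy_counts = defaultdict(Counter)
for words, tags in toy_train:
    for word, tag in zip(words, tags):
        toy_counts[word][tag] += 1

def majority_tag(word, counts, default="NOUN"):
    """Return the most frequent tag for word, or default for OOV words."""
    if word in counts:  # membership test does not create a new entry
        return counts[word].most_common(1)[0][0]
    return default

predicted = [majority_tag(w, toy_counts) for w in ["this", "zebra", "barks"]]
print(predicted)
```

Here "this" receives `DET` because it was seen twice as `DET` and once as `PRON`, and the unseen word "zebra" falls back to `NOUN`.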

In [67]:
output_tags = []

for ex in dev:
    output_tags.append([])
    for wf in ex.word:
        # your code here
        if wf in tag_counts:
            output_tags[-1].append(max(tag_counts[wf], key = tag_counts[wf].get))
        else:
            output_tags[-1].append("NOUN")
        
        # your code here
    #print(output_tags[-1])

Using the `accuracy` function below, you can now print the baseline tagging accuracy on the development set. It should be around 77%.

In [68]:
def accuracy(sys,gold):
    """
    Function for evaluating tagging accuracy w.r.t. a gold standard test set (gold).
    """
    assert(len(sys) == len(gold))
    corr = 0
    tot = 0
    for s, g in zip(sys,gold):
        assert(len(s) == len(g.pos))
        corr += sum([1 if x==y else 0 for x,y in zip(s,g.pos)])
        tot += len(s)
    return corr * 100.0 / tot

print("Accuracy for baseline majority class tagger:",accuracy(output_tags,dev))

Accuracy for baseline majority class tagger: 77.36599331954828


### Exercise 3: Numericalizing data

rubric={accuracy:10}

Now, you should define iterators for the training data, development data and test data. For efficiency reasons, `train_iter` should be a [`BucketIterator`](https://torchtext.readthedocs.io/en/latest/data.html#bucketiterator) which will sort examples according to length. If you implement batching (which is not a part of this lab), using `BucketIterator` will result in less padding which means faster runtime. `dev_iter` and `test_iter` should be regular [`Iterator`](https://torchtext.readthedocs.io/en/latest/data.html#iterator) objects because we don't want to permute the examples in the development and test sets. Again, please check Roger's [torchtext workshop](https://github.ubc.ca/MDS-CL-2019-20/COLX_525_morphology_students/blob/master/pytorch_workshop/workshop1.ipynb) for details.

Note that `train_iter` should shuffle examples between epochs but `dev_iter` and `test_iter` should not in order to retain the correct order of development and test examples for evaluation of accuracy. None of the iterators should repeat over multiple epochs. As `device` you should use `"cpu"` unless you have access to a GPU.
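The intuition behind bucketing can be checked with a small calculation. The sketch below only counts padding symbols for invented sentence lengths; `padding_needed` is a hypothetical helper, and the actual strategy of `BucketIterator` is more involved than sorting the whole dataset.

```python
def padding_needed(lengths, batch_size):
    """Count pad symbols when consecutive sentences are batched together."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        # Every sentence is padded up to the longest one in its batch.
        total += sum(longest - length for length in batch)
    return total

sentence_lengths = [3, 25, 4, 30, 5, 28]
unsorted_padding = padding_needed(sentence_lengths, batch_size=3)
bucketed_padding = padding_needed(sorted(sentence_lengths), batch_size=3)
print(unsorted_padding, bucketed_padding)
```

With these lengths, grouping sentences of similar length cuts the padding from 70 symbols to 10. With a batch size of 1, as in this lab, no padding across sentences is needed at all.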

In [69]:
from torchtext.data import Iterator, BucketIterator

# your code here

train_iter = BucketIterator(train,
                            batch_size=1,
                            # Bucket by sentence length (number of word forms).
                            sort_key=lambda ex: len(ex.word),
                            shuffle=True,
                            repeat=False,
                            device="cpu")

dev_iter, test_iter = Iterator.splits((dev, test),
                                      batch_sizes=(1, 1),
                                      sort=False,
                                      shuffle=False,
                                      repeat=False,
                                      device="cpu")

# your code here

Make sure that you understand the contents of the fields `ex.pos`, `ex.word` and `ex.char`. You just need to run the following code: 

In [70]:
# ex represents the first sentence in the training set. 
ex = next(iter(train_iter))

print("Here is the first example from the training set:")
print(ex)
print("\nEach example contains a vector of POS tags ex.pos having dimension (sentence_length,1):\n")
print(ex.pos)
print(ex.pos.size())
print("\nEach example contains a vector of word tokens ex.word having dimension (sentence_length+2,1)")
print("The +2 stems from START symbol at the beginning of the sentence (WORD.vocab.stoi[START] == %u)" 
      % WORD.vocab.stoi[START])
print("and END symbol at the end of the sentence (WORD.vocab.stoi[END] == %u):\n" % WORD.vocab.stoi[END])
print(ex.word)
print(ex.word.size())
print("\nEach example contains a tensor of character strings ex.char[0] having dimension (1,sentence_length,max_word_length+2)")
print("The tensor is big enough to fit all tokens in the sentence. Again, +2 stems from the sequence initial")
print("symbol START (CHAR.vocab.stoi[START] == %u) and sequence final symbol END (CHAR.vocab.stoi[END] == %u)"
      % (CHAR.vocab.stoi[START], CHAR.vocab.stoi[END]))
print("which are appended to each word. All words are padded to the same length using the symbol PAD")
print("(CHAR.vocab.stoi[PAD] == %s):\n" % CHAR.vocab.stoi[PAD])
chars, word_count, char_lengths = ex.char      
print(chars)
print(chars.size())
print("\nAdditionally we get the length of each input word form ex.char[2] in a (1,sentence_length) tensor:\n")
print(char_lengths)

Here is the first example from the training set:

[torchtext.data.batch.Batch of size 1 from UDDATA]
	[.pos]:[torch.LongTensor of size 27x1]
	[.word]:[torch.LongTensor of size 29x1]
	[.char]:('[torch.LongTensor of size 1x27x10]', '[torch.LongTensor of size 1]', '[torch.LongTensor of size 1x27]')

Each example contains a vector of POS tags ex.pos having dimension (sentence_length,1):

tensor([[11],
        [11],
        [ 9],
        [ 9],
        [10],
        [ 6],
        [11],
        [10],
        [10],
        [ 8],
        [ 5],
        [ 9],
        [12],
        [ 9],
        [ 6],
        [11],
        [ 9],
        [10],
        [ 6],
        [13],
        [ 6],
        [ 9],
        [ 3],
        [ 3],
        [ 6],
        [ 4],
        [ 3]])
torch.Size([27, 1])

Each example contains a vector of word tokens ex.word having dimension (sentence_length+2,1)
The +2 stems from START symbol at the beginning of the sentence (WORD.vocab.stoi[START] == 2)
and END symbol at the end 

### Exercise 4: The POS tagger

In this exercise, you will build a basic BiLSTM POS tagger. The tagger:

1. Embeds word tokens in the input sentence.
2. Passes the embeddings through a bidirectional LSTM layer.
3. Predicts POS tags using a feed-forward network and log softmax layer.

When you are implementing the POS tagger, remember to always keep track of the input and output sizes of all of your tensors. It is very important to check that these are correct. It is also important to understand what your dimensions refer to.

Let's start by loading a few necessary libraries and setting hyper-parameters:

In [71]:
import numpy as np

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.nn.functional import log_softmax, relu
from torch.optim import Adam, SGD

from random import random, seed, shuffle

# Ensure reproducible results by setting random seeds to 0.
seed(0)
torch.manual_seed(0)
np.random.seed(0)

import re

# Hyperparameters
VOCAB_SIZE = len(WORD.vocab.stoi)
EMBEDDING_DIM=300
RNN_HIDDEN_DIM=50
RNN_LAYERS=1
BATCH_SIZE=10
EPOCHS=5

# Maximum length of generated output word forms.
MAXWFLEN=40

#### Exercise 4.1: LSTM layer

rubric={accuracy:15}

You should implement a `BidirectionalLSTM` class which is used to encode a sequence of word embeddings into hidden-state representations. `BidirectionalLSTM` encapsulates two LSTM networks: `self.forward_rnn` and `self.backward_rnn` which you should initialize in `BidirectionalLSTM.__init__`. Both should have:

1. Embedding dimension `EMBEDDING_DIM`
1. Hidden dimension `RNN_HIDDEN_DIM`
1. Layer count `RNN_LAYERS`

Your second task is to implement the `BidirectionalLSTM.forward` function. As argument, the function takes a `torch.Tensor` `sequence` which has size `(sequence_length,1,EMBEDDING_DIM)`. This tensor contains the word embeddings for the input sentence.  

You should pass the `sequence` to `self.forward_rnn` which returns:

1. a sequence $f_1,...,f_n$ of forward hidden states represented as a tensor `fwd_hss` having size `(sequence_length,1,RNN_HIDDEN_DIM)` and
1. a pair `(fwd_hs, fwd_cs)`, where:
   1. `fwd_hs` is the final forward hidden state having dimension `(1,1,RNN_HIDDEN_DIM)`.
   1. `fwd_cs` is the final forward cell state having dimension `(1,1,RNN_HIDDEN_DIM)`.
   
You should pass the **reversed** `sequence` to `self.backward_rnn` (**HINT**: [`torch.flip`](https://pytorch.org/docs/stable/torch.html#torch.flip) can be useful here) which returns:

1. a sequence $b_n,...,b_1$ of backward hidden states represented as a tensor `bwd_hss` having size `(sequence_length,1,RNN_HIDDEN_DIM)` (**NOTE**: the backward states are in reversed order here) 
1. and a pair `(bwd_hs, bwd_cs)`, where:
   1. `bwd_hs` is the final backward hidden state having dimension `(1,1,RNN_HIDDEN_DIM)`.
   1. `bwd_cs` is the final backward cell state having dimension `(1,1,RNN_HIDDEN_DIM)`.
   
The `forward` function should return a tensor `hss` having dimension `(sequence_length, 1, 2*RNN_HIDDEN_DIM)`, where `hss[i]` represents the concatenation of the $i$th forward hidden state $f_i$ and the $i$th backward hidden state $b_i$ (**HINT**: Again `torch.flip` can be useful).
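The double use of `torch.flip` is easy to get wrong, so it may help to trace it on numbers that can be checked by hand. In this sketch a cumulative sum stands in for an LSTM (an assumption made purely so the states are predictable); `toy_rnn` is a hypothetical name and the hidden size is 1.

```python
import torch

# Toy "sentence" of 4 one-dimensional embeddings, shape (sequence_length, 1, 1).
sequence = torch.tensor([1.0, 2.0, 3.0, 4.0]).view(4, 1, 1)

def toy_rnn(x):
    # Stand-in for an LSTM: the "state" at step t is the sum of inputs up to t.
    return torch.cumsum(x, dim=0)

fwd_hss = toy_rnn(sequence)                          # f_1, ..., f_n
bwd_hss = toy_rnn(torch.flip(sequence, dims=[0]))    # b_n, ..., b_1 (reversed order)
bwd_aligned = torch.flip(bwd_hss, dims=[0])          # b_1, ..., b_n
hss = torch.cat((fwd_hss, bwd_aligned), dim=2)       # shape (4, 1, 2)

print(hss[:, 0, :])
```

At position 0 the forward state has seen only the first input (value 1) while the backward state has seen the whole sentence (value 10). Without the second flip, `hss[0]` would pair $f_1$ with $b_n$ instead of $b_1$.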

In [72]:
class BidirectionalLSTM(nn.Module):
    def __init__(self):# vocab_size, emb_dim, hid_dim, n_layers):
        super(BidirectionalLSTM,self).__init__()
        # your code here
        
        self.forward_rnn = nn.LSTM(EMBEDDING_DIM, RNN_HIDDEN_DIM, RNN_LAYERS)
        self.backward_rnn = nn.LSTM(EMBEDDING_DIM, RNN_HIDDEN_DIM, RNN_LAYERS)
        #self.feedforward = nn.Linear(2*RNN_HIDDEN_DIM, len(POS.vocab.itos))
        # your code here
        
    def forward(self, sequence):
        # your code here

        output_f, (hidden_f, cell_f) = self.forward_rnn(sequence)
        output_b, (hidden_b, cell_b) = self.backward_rnn(sequence.flip(0))
        
        bi_output = torch.cat((output_f, output_b.flip(0)), dim = 2) 
        
        #bi_output = bi_output[1:-1, :, :]
        return bi_output
        
        # your code here

# Assertions to check that your code returns objects of the correct size (not a guarantee that your code works).
assert(BidirectionalLSTM()(torch.zeros(10,1,EMBEDDING_DIM)).size() == (10,1,2*RNN_HIDDEN_DIM))

To improve tagging accuracy for OOV words, we use word dropout, implemented by the function `drop_words` below. It takes two arguments:

1. A `torch.Tensor` `sequence` of size `(sequence_length,1)` and
1. A float `word_dropout` in the interval `[0,1]`.

During training, the function randomly replaces each word by `WORD.vocab.stoi[UNK]` with probability `word_dropout`. The first and last positions, which hold `START` and `END`, are never replaced.

In [73]:
def drop_words(sequence,word_dropout):
    seq_len, _ = sequence.size()
    dropout_sequence = sequence.clone()
    for i in range(1,seq_len-1):
        if random() < word_dropout:
            dropout_sequence[i,0] = WORD.vocab.stoi[UNK]
    return dropout_sequence
    
# Assertions to check that your code returns objects of the correct size (not a guarantee that your code works).
assert(drop_words(torch.zeros(10,1),0.5).size() == (10,1))

#### Exercise 4.2 Sentence Encoder 

rubric={accuracy:15}

Your next task is to build a class `SentenceEncoder` which takes an example (from `train_iter`, `dev_iter` or `test_iter`) as input and returns a sequence of LSTM hidden states given by `BidirectionalLSTM`.

Your first task is to initialize the `SentenceEncoder` class. You will initialize 3 class members:
1. `self.vocabulary` which is just an alias for `WORD.vocab.stoi`.
1. `self.embeddings` which is a `torch.nn.Embedding` having input dimension `len(self.vocabulary)` and output dimension `EMBEDDING_DIM`.
1. `self.rnn` which is a `BidirectionalLSTM` object.

You should then implement `SentenceEncoder.forward`, which takes an example `ex` as input. Additionally, it takes another parameter `word_dropout`, which is the probability for word dropout on the sentence `ex`. The function should:
1. Perform word dropout on `ex` by calling the `drop_words` function above.
1. Embed the resulting tensor resulting in a tensor `embedded`.
1. Run `self.rnn` on embedded.
1. Return the resulting representation tensor. However, `ex.word` represents a sentence where we have appended an initial symbol `START` and final symbol `END`. You need to therefore clip the first and last representation vector before returning the output of `self.rnn`.
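The shapes involved in embedding and clipping can be checked in isolation. The sketch below uses an invented vocabulary size of 10, an embedding dimension of 4 and made-up word indices; the name `toy_embeddings` is hypothetical, and the LSTM step is left out so that only the shape bookkeeping is visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_embeddings = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Made-up indices for START, two real words and END; shape (sentence_length + 2, 1).
word_ids = torch.tensor([[2], [5], [7], [3]])

embedded = toy_embeddings(word_ids)   # shape (4, 1, 4)
clipped = embedded[1:-1]              # shape (2, 1, 4): START and END rows removed

print(embedded.size(), clipped.size())
```

After clipping, the first dimension matches the number of POS tags in `ex.pos`, which is what the loss computation later requires.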

In [74]:
class SentenceEncoder(nn.Module):
    def __init__(self):
        super(SentenceEncoder,self).__init__()

        # your code here
        
        self.vocabulary = WORD.vocab.stoi
        self.embeddings = nn.Embedding(len(self.vocabulary), EMBEDDING_DIM)
        self.rnn = BidirectionalLSTM()
        
        #your code here
        
    def forward(self,ex,word_dropout):
        # your code here
        
        # Replace some word indices by UNK, then embed the (possibly modified) sentence.
        words = drop_words(ex.word,word_dropout)
        embedded = self.embeddings(words)
        output = self.rnn(embedded)
        output = output[1:-1, :, :]
        return output
        
        
        # your code here
        
sentence_length = ex.word.size()[0] - 2
assert(SentenceEncoder()(ex,0.5).size() == (sentence_length, 1, 2*RNN_HIDDEN_DIM))

#### Exercise 4.3: Prediction Layer

rubric={accuracy:15}

Your next task is to implement a feed-forward network `FeedForward` which is used to predict tags from LSTM representations. The constructor `FeedForward.__init__` takes two arguments `input_dim` and `output_dim` representing the input and output dimension of the network, respectively. 
Your first task is to complete the function `FeedForward.__init__` by initializing two linear layers:
1. `self.linear1` having input dimension `input_dim` and output dimension `input_dim` and
2. `self.linear2` having input dimension `input_dim` and output dimension `output_dim`.

Your second task is to implement the function `FeedForward.forward`. As input, the function takes `tensor` which is a `torch.Tensor` object having size `(sequence_length, 1, input_dim)`. It then:
1. Applies `self.linear1` followed by a ReLU activation function on `tensor` and
2. then passes the result through `self.linear2` and a `log_softmax` layer and finally returns the result.
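The choice of dimension for the log softmax matters: it has to normalize over the tags, which is the last dimension of a `(sequence_length, 1, output_dim)` tensor. The following sketch checks this on invented scores for two words and three hypothetical tags.

```python
import torch

# Raw scores for 2 words and 3 hypothetical tags, shape (sequence_length, 1, num_tags).
scores = torch.tensor([[[1.0, 2.0, 3.0]],
                       [[5.0, 0.0, 1.0]]])

log_probs = torch.log_softmax(scores, dim=2)     # normalize over the tag dimension
prob_sums = log_probs.exp().sum(dim=2)           # probabilities per word sum to 1
best_tags = log_probs.argmax(dim=2).squeeze(1)   # index of the most likely tag per word

print(prob_sums)
print(best_tags)
```

If a different dimension were used by mistake, the values would no longer form a distribution over tags for each word, even though the tensor would keep the same shape.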

In [75]:
class FeedForward(nn.Module):
    def __init__(self,input_dim,output_dim):
        super(FeedForward, self).__init__()
        # your code here
        self.linear1 = nn.Linear(input_dim, input_dim)
        self.linear2 = nn.Linear(input_dim, output_dim)
        self.act1 = nn.ReLU()
        self.act2 = nn.LogSoftmax(dim = 2)
        # your code here
        
    def forward(self,tensor):
        # your code here
        output = self.linear1(tensor)
        output = self.act1(output)
        output = self.linear2(output)
        output = self.act2(output)
        
        return output
        
        # your code here
        
# Assertions to check that your code returns objects of the correct size (not a guarantee that your code works).
assert(FeedForward(2*RNN_HIDDEN_DIM,100)(torch.zeros(10,1,2*RNN_HIDDEN_DIM)).size() == (10,1,100))      

## Tagging sentences and training the model

Now it's time to put together all the components that you have built so far. `SimplePOSTagger` is a wrapper around a `SentenceEncoder` and a `FeedForward` layer. It has a `forward` method which returns a tensor `res` of size `(sentence_length, 1, tagset_size)`, where `res[i,0,j]` represents the log probability of tag `POS.vocab.itos[j]` for the word at position `i` in the input sentence. 

The function `tag` gets POS tags for a dataset `data`.  
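To see how log probabilities are used for both prediction and training, consider a hand-made example with two words and three hypothetical tags. The probabilities and gold indices below are invented; the sketch only illustrates that `argmax` picks the predicted tag and that `nn.NLLLoss` averages the negative log probability of the gold tags.

```python
import math
import torch
import torch.nn as nn

# Hand-made tag distributions for 2 words over 3 hypothetical tags.
probs = torch.tensor([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])
log_probs = torch.log(probs)       # shape (sentence_length, num_tags)
gold = torch.tensor([0, 2])        # gold tag index for each word

predicted = log_probs.argmax(dim=1).tolist()
loss = nn.NLLLoss()(log_probs, gold).item()
expected = -(math.log(0.7) + math.log(0.1)) / 2

print(predicted, loss, expected)
```

The first word is tagged correctly and contributes little to the loss, while the second word is tagged incorrectly and contributes most of it.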

In [76]:
class SimplePOSTagger(nn.Module):
    def __init__(self):
        super(SimplePOSTagger,self).__init__()
        self.tagset_size = len(POS.vocab.itos)
        
        self.sentence_encoder = SentenceEncoder()
        self.hidden2tag = FeedForward(2*RNN_HIDDEN_DIM,self.tagset_size)
        
    def forward(self,ex, word_dropout=0):
        states = self.sentence_encoder(ex,word_dropout)
        return self.hidden2tag(states)

    def tag(self,data):
        with torch.no_grad():
            results = []
            for ex in data:
                tags = self(ex).argmax(dim=2).squeeze(1)
                results.append([POS.vocab.itos[i] for i in tags])
            return results
        
pos_size = len(POS.vocab.itos)
assert(SimplePOSTagger()(ex).size() == (ex.word.size()[0]-2,1,pos_size))
assert(len(SimplePOSTagger().tag([ex])[0]) == ex.word.size()[0] -2) 

Armed with the `SimplePOSTagger` class, you can now train your tagger using the following code. You should get to around 75% tagging accuracy on the development set.

In [77]:
tagger = SimplePOSTagger()
optimizer = Adam(tagger.parameters())
loss_function = nn.NLLLoss()

for epoch in range(EPOCHS):
    tot_loss = 0
    for i,ex in enumerate(train_iter):
        print("Epoch %u: Example %u of %u" % (epoch+1, i+1,len(train_iter)),end="\r")
        tagger.zero_grad()
        output = tagger(ex,word_dropout=0.05).squeeze(dim=1)
        gold = ex.pos.squeeze(dim=1)
        loss = loss_function(output,gold)
        loss.backward()
        optimizer.step()
        tot_loss += loss.item()
    print("\nAverage loss per example: %.4f" % (tot_loss/len(train_iter)))
    sys_dev = tagger.tag(dev_iter)
    print("Development accuracy: %.2f" % accuracy(sys_dev, dev))

Epoch 1: Example 1000 of 1000
Average loss per example: 1.2292
Development accuracy: 71.78
Epoch 2: Example 1000 of 1000
Average loss per example: 0.5045
Development accuracy: 75.11
Epoch 3: Example 1000 of 1000
Average loss per example: 0.2637
Development accuracy: 76.44
Epoch 4: Example 1000 of 1000
Average loss per example: 0.1891
Development accuracy: 75.29
Epoch 5: Example 1000 of 1000
Average loss per example: 0.1542
Development accuracy: 77.68


You may notice that this is almost the same accuracy as for our baseline model. You can get a bit higher if you raise the number of epochs. If you really want better accuracy, you'll have to implement a character-level model and use pretrained embeddings. These can get you up to 85%.