# Colx 525 Lab 3: POS projection

- Yarowsky, D., & Ngai, G. (2001). **Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora**. In *Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics*, p.200–207. https://aclanthology.org/N01-1026/

- Das, D., & Petrov, S. (2011). **Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections**. In *Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies*, p.600–609. http://www.aclweb.org/anthology/P11-1061




## Assignment objectives

In this assignment, you will build a POS tagger for a low-resource language: Mbyá Guaraní.
Mbyá Guaraní is an indigenous Tupian language spoken in South America.  Although its speakers have likely been influenced by the colonizing languages around them, it is not related to Spanish, Portuguese, or any other European language.  It has only about 15,000 speakers (according to a 2008 census) and very few digital tools: the Universal Dependencies treebank contains only about 1000 annotated sentences.  In this lab, we will construct a POS tagger for Mbyá Guaraní, using a parallel Bible translation to get some silver-annotated data.  The first 8 exercises are mandatory.  The final exercise is optional, but should be attempted for full marks.

1. Data handling: Answer questions about data handling.
    - No coding exercise

1. Word-based biLSTM POS-tagger: Implement a word-based LSTM POS-tagger.
    - No coding exercise
    - ACCURACY: 50.19

1. Character-based model: Augment the LSTM tagger with character-based features.
    - Coding exercise:
    - See https://github.com/jungyeul/mds-cl-2023-24/blob/main/block5/COLX_525_morphology_lab3.ipynb (Block 5)
    - `def reverse_words(tensor, lengths)`
    - `class CharacterModel(nn.Module)`
        - including `def __init__(self)` and `def forward(self,embs,rev_embs,word_lengths)`
    - `def forward(self,ex)` in class `WordEncoder(nn.Module)`
    - ACCURACY: 70.51



1. Mixing Mbyá Guaraní and German Data: Supplement the taggers with some Standard German data.
    - Not really a coding exercise
    - ACCURACY with German only: 26.73
    - ACCURACY after fine-tuning: 72.52


1. Annotation Projection (preparation): Using a small parallel corpus, align Guaraní and Standard German texts.
    - Coding exercise
    - `def prepare_data(source_file, target_file)`
    - `def align_source_target(source, target)`

1. Annotation Projection: Project Standard German tags onto the Guaraní data, and re-train the model.
    - Coding exercise:
    - `def project_tags(bitext, source_tags, target_tags, target_words, out_train, out_test)`
    - Training `train_gun_bible` (projected Guaraní from German)
    - ACCURACY: 37.64 (because of `_`)

1. Don't predict "_"!: Allow the model to ignore missing data.
    - Not really a coding exercise

1. Using both Gold and Silver data: Re-train the model, this time fine-tuning on gold data.
    - Not really a coding exercise
    - ACCURACY w/o `_`: 42.53
    - ACCURACY after fine-tuning: 72.77

1. Learning less from noise (optional): Bias the learning model towards the gold data.
    - Not really a coding exercise
    - Change the hyperparameters, including learning rate, dropout, etc.
    - ACCURACY goes up to 73.53 (you can do better)


## Getting started

You will need to install the Python modules `torchtext`, `torch`, `numpy` and `nltk`. The easiest way to do this is using `anaconda` or `pip`.

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:

* Submit the assignment by filling in this jupyter notebook with your answers embedded
* Be sure to follow the general lab instructions

## 1. Datahandling

We will first read in the data from the data folder in this week's lab.  Prior
to starting work on the lab, you will need to make a copy of the data folder (in the student repo for 581) in your own working directory.
This data is in "CoNLL-U" format, which consists of tab-separated fields 
such as words, POS, fine-grained POS, and others. 

In this lab, we are interested in the word and POS: 

```
1   This    this    DET     DT    Number=Sing|PronType=Dem                                2   det     _       _
2   item    item    NOUN    NN    Number=Sing                                             6   nsubj   _       _
3   is      be      AUX     VBZ   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   6   cop     _       _
4   a       a       DET     DT    Definite=Ind|PronType=Art                               6   det     _       _
5   small   small   ADJ     JJ    Degree=Pos                                              6   amod    _       _
6   one     one     NOUN    NN    Number=Sing                                             0   root    _       _
7   and     and     CCONJ   CC    _                                                       9   cc      _       _
8   easily  easily  ADV     RB    _                                                       9   advmod  _       _
9   missed  miss    VERB    VBN   Tense=Past|VerbForm=Part                                6   conj    _       _
10  .       .       PUNCT   .     _                                                       6   punct   _       _
```

The format consists of tab-separated lines. Each line has 10 fields:

1. ID
2. Word form
3. lemma/base form
4. POS tag according to the UD annotation standard.
5. Language Specific POS tag.
6. Morphological features.
7. ID of the syntactic head of the current word.
8. The dependency relation between this word and its head.
9. A list of dependencies (can be empty).
10. Misc. annotations.
  
The syntactic dependency trees for different sentences are separated by an empty line. See [this website](https://universaldependencies.org/format.html) for further documentation.
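To make the field layout concrete, here is a minimal sketch that pulls only the word form (field 2) and the UD POS tag (field 4) out of CoNLL-U text using plain Python. The function name `read_conllu_words_and_tags` is hypothetical and is not part of the lab code, which uses the `conllu` package instead. The sketch assumes one token per line and ignores special cases such as multiword token ranges.

```python
def read_conllu_words_and_tags(text):
    """Return a list of sentences, each a list of (form, upos) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment lines carry no tokens.
            continue
        fields = line.split("\t")
        # Field 2 (index 1) is the word form, field 4 (index 3) is the UD POS tag.
        current.append((fields[1], fields[3]))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "1\tThis\tthis\tDET\tDT\t_\t2\tdet\t_\t_\n"
    "2\titem\titem\tNOUN\tNN\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tmissed\tmiss\tVERB\tVBN\t_\t0\troot\t_\t_\n"
)
parsed = read_conllu_words_and_tags(sample)
print(parsed)
```

Each inner list corresponds to one sentence, which is the same granularity the lab's data reader works at.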

The code for data reading is given to you. There is no need to modify it. However, make sure that you understand the output format.

In [1]:
# !pip3 install conllu

In [2]:
dire = "./data"

In [3]:

import conllu
import os
import torch
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

UNK="<unk>"
START="<start>"
END="<end>"
PAD="<pad>"

word_transform = lambda s: [word_vocab[w] for w in ["<start>"] + s + ["<end>"]]
pos_transform = lambda s: [pos_vocab[w] for w in s]
char_transform = lambda w: [char_vocab[c] for c in ["<start>"] + w + ["<end>"]]

from itertools import islice
from collections import namedtuple
Example = namedtuple("Example",["word", "pos", "char"])

class UDDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        return self.data[index]
    
def yield_tokens(data):
    for ex in data:
        yield([tok["form"] for tok in ex])
        
def yield_chars(data):
    for ex in data:
        yield([c for tok in ex for c in tok["form"]])
        
def yield_pos(data):
    for ex in data:
        yield([tok["upos"] for tok in ex])
        
def read_ud_data(lan, vocabs = None):
    
    def read_data(dire, lang):
        train_data = conllu.parse(open(os.path.join(dire, f"{lang}_train")).read())
        test_data = conllu.parse(open(os.path.join(dire, f"{lang}_test")).read())
        return train_data, test_data

    train_data, test_data = read_data("data",lan)
    train = UDDataset(train_data)
    test = UDDataset(test_data)

    word_vocab = char_vocab = pos_vocab = None
    
    if vocabs:
        word_vocab, char_vocab, pos_vocab = vocabs
    else:
        word_vocab = build_vocab_from_iterator(yield_tokens(train_data),
                                               specials=["<unk>", "<start>", "<end>"])
        word_vocab.set_default_index(word_vocab["<unk>"])

        char_vocab = build_vocab_from_iterator(yield_chars(train_data), 
                                                specials=["<unk>", "<start>", "<end>", "<pad>"])
        char_vocab.set_default_index(char_vocab["<unk>"])

        pos_vocab = build_vocab_from_iterator(yield_pos(train_data), 
                                                specials=["<unk>"])
        pos_vocab.set_default_index(pos_vocab["<unk>"])
        
    def split_char_sequence(chars, tokens):
        word_lens = [len(w) for w in tokens]
        chars = iter(chars)
        return [list(islice(chars, elem)) for elem in word_lens]
    
    def collate_batch(batch):
        pos_list, token_list, char_list, word_lens = [], [], [], []
        for tokens, chars, pos in zip(yield_tokens(batch), 
                                      yield_chars(batch),
                                      yield_pos(batch)):
            # Your code here
            token_tensor = torch.tensor(word_transform(tokens), dtype=torch.long).unsqueeze(1)
            pos_tensor = torch.tensor(pos_transform(pos), dtype=torch.long).unsqueeze(1)
            chars = split_char_sequence(chars, tokens)
            chars = [char_transform(w) for w in chars]
            char_tensors = [torch.tensor(cs, dtype=torch.long) for cs in chars]

            pos_list.append(pos_tensor)
            token_list.append(token_tensor)
            char_list += char_tensors

        return Example(token_list[0],
                       pos_list[0],
                       (pad_sequence(char_list, batch_first=True, padding_value=char_vocab["<pad>"]).unsqueeze(0),
                        len(token_list[0])-2,
                        [len(w) for w in tokens]))

    test_iter = DataLoader(test_data, batch_size=1, shuffle=False, collate_fn=collate_batch)
    dev_iter = DataLoader(test_data, batch_size=1, shuffle=False, collate_fn=collate_batch)
    train_iter = DataLoader(train_data, batch_size=1, shuffle=True, collate_fn=collate_batch)
    
    return train_iter, dev_iter, test_iter, word_vocab, char_vocab, pos_vocab

We'll now read the Mbyá Guaraní training, development and test data. The `read_ud_data()` function returns both the numericalized datasets and the word, character and POS vocabularies, which map tokens, characters and POS tags to ID numbers. 

In [4]:
train_iter, dev_iter, test_iter, word_vocab, char_vocab, pos_vocab = read_ud_data("gun")

A few assertions to check that everything is working correctly:

In [5]:
print("Count of word types:", len(word_vocab))
print("Count of character types:", len(char_vocab))
print("Count of POS types:", len(pos_vocab))

assert len(word_vocab) == 221
assert len(char_vocab) == 48
assert len(pos_vocab) == 16

Count of word types: 221
Count of character types: 48
Count of POS types: 16
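As a rough illustration of what a vocabulary does, the sketch below builds a toy string-to-ID mapping in plain Python and falls back to the `<unk>` ID for unseen tokens. The helper names `build_toy_vocab` and `lookup` are hypothetical, and the ID ordering used here is an assumption of this sketch; it need not match the ordering that `torchtext` produces.

```python
from collections import Counter


def build_toy_vocab(sentences, specials=("<unk>",)):
    """Map each token to an integer ID, with special symbols first."""
    counts = Counter(tok for sent in sentences for tok in sent)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


def lookup(stoi, token):
    """Return the ID of token, or the <unk> ID if it was never seen."""
    return stoi.get(token, stoi["<unk>"])


stoi, itos = build_toy_vocab([["a", "b", "a"], ["c"]])
print(itos)
print(lookup(stoi, "a"), lookup(stoi, "zzz"))
```

The fallback to `<unk>` is what lets a trained model accept words at test time that never appeared in the training data.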


#### 1.1
rubric={reasoning:1}

The size of the word vocab is much larger than the size of character vocab.  Why would we even want to consider characters in a model?

In [6]:
# your answer here

# your answer here

#### 1.2
rubric={reasoning:1}

Given your answer to 1.1, which model do you think could best use data from another language - a word model, or a character model, and why?  Try to answer this question before moving on to the rest of the lab.

In [7]:
# your answer here

# your answer here

### Accuracy: To evaluate POS tagging, we will be using simple accuracy.

#### 1.3 
rubric={reasoning:1}

1. Why would F1-score be inappropriate for this task?

In [8]:
# your answer here

# your answer here

The function to compute tagging accuracy is given. There is no need to modify anything.

In [9]:
def accuracy(sys,gold):
    """
    Function for evaluating tagging accuracy w.r.t. a gold standard test set (gold).
    """
    assert(len(sys) == len(gold))
    pos_itos = pos_vocab.get_itos()
    corr = 0
    tot = 0
    for s, g in zip(sys,gold):
        g_pos = [pos_itos[t[0]] for t in g.pos.tolist()]
        assert(len(s) == len(g_pos))
        corr += sum([1 if x==y else 0 for x,y in zip(s,g_pos)])
        tot += len(s)
    return corr * 100.0 / tot
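The same computation can be seen on a toy example that works directly on lists of tag strings instead of tensors. The function `token_accuracy` below is a hypothetical stand-in written for illustration; the tag sequences are made up.

```python
def token_accuracy(system_tags, gold_tags):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    assert len(system_tags) == len(gold_tags)
    correct = total = 0
    for sys_sent, gold_sent in zip(system_tags, gold_tags):
        assert len(sys_sent) == len(gold_sent)
        correct += sum(s == g for s, g in zip(sys_sent, gold_sent))
        total += len(gold_sent)
    return correct * 100.0 / total


system = [["DET", "NOUN", "VERB"], ["VERB", "ADV"]]
gold = [["DET", "NOUN", "AUX"], ["VERB", "ADV"]]
score = token_accuracy(system, gold)
print(score)  # 4 of 5 tokens match, so this prints 80.0
```

Note that accuracy is computed over tokens, not sentences: a long sentence contributes more to the score than a short one.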

### Exercise 2: A Simple POS Tagger without a character-level model

In this exercise, you will run a BiLSTM POS tagger. The tagger:

1. Embeds word tokens in the input sentence.
2. Passes the embeddings through a bidirectional LSTM layer.
3. Predicts POS tags using a feed-forward network and log softmax layer.

This will serve as a baseline against which to compare future models,
and can simply be run as is.

This is the model that you implemented previously, and you will not need to
modify the code.


In [10]:
import numpy as np

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.nn.functional import log_softmax, relu
from torch.optim import Adam, SGD

from random import random, seed, shuffle

# Ensure reproducible results.
seed(0)
torch.manual_seed(0)
np.random.seed(0)

import re

# Hyperparameters
EMBEDDING_DIM=50
RNN_HIDDEN_DIM=50
RNN_LAYERS=1
BATCH_SIZE=10
EPOCHS=5

class BidirectionalLSTM(nn.Module):
    def __init__(self):
        super(BidirectionalLSTM,self).__init__()
        self.forward_rnn = nn.LSTM(EMBEDDING_DIM, RNN_HIDDEN_DIM, RNN_LAYERS)
        self.backward_rnn = nn.LSTM(EMBEDDING_DIM, RNN_HIDDEN_DIM, RNN_LAYERS)
        
    def forward(self,sentence):
        fwd_hss, _ = self.forward_rnn(sentence)
        bwd_hss, _ = self.backward_rnn(sentence.flip(0))
        return torch.cat([fwd_hss, bwd_hss.flip(0)], dim=2)
        
def drop_words(sequence,word_dropout):
    seq_len, _ = sequence.size()
    dropout_sequence = sequence.clone()
    for i in range(1,seq_len-1):
        if random() < word_dropout:
            dropout_sequence[i,0] = word_vocab[UNK]
    return dropout_sequence
        
class SentenceEncoder(nn.Module):
    def __init__(self):
        super(SentenceEncoder,self).__init__()

        self.vocabulary = word_vocab
        self.embedding = nn.Embedding(len(self.vocabulary),EMBEDDING_DIM)
        self.rnn = BidirectionalLSTM()
        
    def forward(self,ex,word_dropout):
        embedded = self.embedding(drop_words(ex.word,word_dropout))
        hss = self.rnn(embedded)
        return hss[1:-1]
        
class FeedForward(nn.Module):
    def __init__(self,input_dim,output_dim):
        super(FeedForward, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.linear1 = nn.Linear(input_dim,input_dim)
        self.linear2 = nn.Linear(input_dim,output_dim)
        
    def forward(self,tensor):
        layer1 = relu(self.linear1(tensor))
        layer2 = self.linear2(layer1).log_softmax(dim=2)
        return layer2
       

#### Exercise 2.1 Tagging sentences
rubric={accuracy:1}

Now it's time to put together all the components built so far. `WordPOSTagger` is a wrapper around a `SentenceEncoder` and a `FeedForward` layer. It has a `forward` method which returns a tensor `res` where `res[i,0,j]` represents the log probability of tag `pos_vocab.get_itos()[j]` for the word at position `i` in the input sentence. 

The function `tag` takes a dataset (development or test data) `data` as input and returns a list of tag sequences as output. For example:

```
[["DET","NOUN","VERB"],["VERB","ADV"],["PRON","VERB","PRON"]]
```

As with the previous code, you can simply run the WordPOSTagger "as is".  We will 
be modifying it in the coming sections.

In [11]:
class WordPOSTagger(nn.Module):
    def __init__(self):
        super(WordPOSTagger,self).__init__()
        self.tagset_size = len(pos_vocab)
        
        self.sentence_encoder = SentenceEncoder()
        self.hidden2tag = FeedForward(2*RNN_HIDDEN_DIM,self.tagset_size)
        
    def forward(self,ex, word_dropout=0):
        states = self.sentence_encoder(ex,word_dropout)
        scores = self.hidden2tag(states)
                
        return scores

    def tag(self,data):
        with torch.no_grad():
            pos_itos = pos_vocab.get_itos()
            results = []
            for ex in data:
                scores = self(ex)
                tags = scores.argmax(dim=2).squeeze(1)
                results.append([pos_itos[i] for i in tags])
            return results

Armed with the `WordPOSTagger` class, 
you can now train your tagger using the following code. 
For Mbyá Guaraní, you should achieve tagging accuracy around 45-50% on the test set. 
On Garrett's computer the model trains in 15 seconds (Remember, we only have 50 sentences).

In [12]:
tagger = WordPOSTagger()
optimizer = Adam(tagger.parameters())
loss_function = nn.NLLLoss()

for epoch in range(EPOCHS):
    tot_loss = 0
    for i,ex in enumerate(train_iter):
        print("Epoch %u: Example %u of %u" % (epoch+1, i+1,len(train_iter)),end="\r")
        tagger.zero_grad()
        output = tagger(ex,word_dropout=0.05).squeeze(dim=1)
        gold = ex.pos.squeeze(dim=1)
        loss = loss_function(output,gold)
        loss.backward()
        optimizer.step()
        tot_loss += loss.detach().numpy()
    print("\nAverage loss per example: %.4f" % (tot_loss/len(train_iter)))
    sys_dev = tagger.tag(dev_iter)
    print("Development accuracy: %.2f" % accuracy(sys_dev, dev_iter))

Epoch 1: Example 49 of 49
Average loss per example: 2.3946
Development accuracy: 25.22
Epoch 2: Example 49 of 49
Average loss per example: 1.7641
Development accuracy: 39.77
Epoch 3: Example 49 of 49
Average loss per example: 1.3992
Development accuracy: 38.27
Epoch 4: Example 49 of 49
Average loss per example: 1.1138
Development accuracy: 44.67
Epoch 5: Example 49 of 49
Average loss per example: 0.8576
Development accuracy: 50.19


## Exercise 3: Character-level model

#### Exercise 3.1: Reversing a batch of words
rubric={accuracy:1}

We're now going to be extending the POS-tagger to use characters as well as words.

This requires that we run a bidirectional LSTM for a padded batch of words, namely all of the words in our input sentence. Consider the following example having size `(4,3)`:

$$\begin{bmatrix} a_1 & b_1 & c_1 \\
a_2 & b_2 & c_2 \\
{\rm PAD} & b_3 & c_3 \\
{\rm PAD} & b_4 & {\rm PAD}
\end{bmatrix}$$

This batch consists of three words $a_1a_2$, $b_1b_2b_3b_4$ and $c_1c_2c_3$. They have been padded to equal length using the symbol ${\rm PAD}$. In order to run a bidirectional LSTM, we need to reverse the words in the batch. This gives the following tensor:

$$\begin{bmatrix} a_2 & b_4 & c_3 \\
a_1 & b_3 & c_2 \\
{\rm PAD} & b_2 & c_1 \\
{\rm PAD} & b_1 & {\rm PAD}
\end{bmatrix}$$

Note that the reversed tensor has the same size as the original tensor and that we haven't touched the PAD symbols. 

It is your task to implement a function `reverse_words` which takes a tensor as input and returns its reverse. The function also takes a vector of word lengths which you can use when reversing the strings. (**HINT**: you can use a Python `range` object to index tensors).

In [13]:
def reverse_words(tensor,lengths):
    # rev_tensor has the same size as tensor but is filled with padding symbols. 
    rev_tensor = torch.zeros(tensor.size(),dtype=tensor.dtype) + char_vocab[PAD]

    # Fill in the correct values in rev_tensor.
    # your code here
    for i,l in enumerate(lengths):
        ...
    # your code here

    return rev_tensor

# An assertion to check that your code works correctly.
words = [["a", "e", "g"],
         ["b", "f", "h"],
         ["c", PAD, "j"],
         [PAD, PAD, "k"]]
reversed_words = [["c", "f", "k"],
                  ["b", "e", "j"],
                  ["a", PAD, "h"],
                  [PAD, PAD, "g"]]
words = torch.LongTensor([[char_vocab[c] for c in row] for row in words])
reversed_words = torch.LongTensor([[char_vocab[c] for c in row] for row in reversed_words])
word_lengths = torch.tensor([3,2,4])
assert(torch.all(reverse_words(words,word_lengths)==reversed_words))
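The core idea behind the reversal can be sketched on a single padded word stored as a plain Python list: only the first `length` symbols are reversed, and the padding stays in place. This is an illustration with a hypothetical helper `reverse_padded_word`, not the tensor implementation asked for above; in the lab tensor each word is a column, so the same idea has to be applied column by column.

```python
PAD_SYMBOL = "<pad>"


def reverse_padded_word(symbols, length):
    """Reverse the first `length` symbols and leave the padding untouched."""
    return symbols[:length][::-1] + symbols[length:]


short_word = ["a", "b", "c", PAD_SYMBOL]
full_word = ["g", "h", "j", "k"]
print(reverse_padded_word(short_word, 3))
print(reverse_padded_word(full_word, 4))
```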

#### Exercise 3.2: Character-Level Model
rubric={accuracy:4}

Now you will implement the character-level model `CharacterModel`. You need to initialize the model with two LSTM networks: `self.forward_rnn` and `self.backward_rnn`. Both take a tensor of character embeddings (of dimension `EMBEDDING_DIM`) as input and generate final hidden states (having dimension  `RNN_HIDDEN_DIM`). The networks should be unidirectional and have layer count `RNN_LAYERS`.

Your second task is to implement the function `self.forward`. It takes three arguments. The first two arguments `embs` and `rev_embs` are tensors of embeddings representing the input words in a sentence. Both have dimension `(N,sentence_length,EMBEDDING_DIM)`, where `N` is large enough that the tensor fits all words in the input sentence. `embs[i,j,:]` is the embedding of the $(i+1)$th character in the $j$th word, or `embedding(PAD)` if the $j$th word has fewer than `i+1` characters. `rev_embs` is the reversed version of `embs`, obtained using the function `reverse_words` which you just implemented. The third argument to `self.forward` is `word_lengths`, which is an array of word lengths.

The first thing you should do is to pack the input tensors `embs` and `rev_embs` using [pack_padded_sequence](https://pytorch.org/docs/stable/nn.html?highlight=pack_padded_sequence#torch.nn.utils.rnn.pack_padded_sequence). This ensures that `self.forward_rnn` and `self.backward_rnn` will return the correct final state. Then you should run `self.forward_rnn` and `self.backward_rnn` on your packed tensors and take their final states. You should concatenate them and return a tensor of size `(1,sentence_length,2*RNN_HIDDEN_DIM)`.
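To see why packing matters for the final state, consider a toy recurrence in plain Python. This is deliberately not an LSTM: the update rule `h = 0.5 * h + x` and the names `run_toy_rnn` and `PAD_VALUE` are assumptions made for illustration. The point is only that a state read after the padding positions differs from the state at the true end of the word, which is the state packing gives you.

```python
def run_toy_rnn(inputs):
    """Apply h_t = 0.5 * h_{t-1} + x_t and return every intermediate state."""
    h, states = 0.0, []
    for x in inputs:
        h = 0.5 * h + x
        states.append(h)
    return states


PAD_VALUE = 0.0
padded = [1.0, 2.0, PAD_VALUE, PAD_VALUE]  # a length-2 "word" padded to length 4
true_length = 2

states = run_toy_rnn(padded)
state_after_padding = states[-1]             # what we get if padding is processed
state_at_true_end = states[true_length - 1]  # what we want

print(state_at_true_end, state_after_padding)  # prints 2.5 0.625
```

In the real model the padding embeddings are generally not zero, so running over them would distort the final state even more than in this toy case.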

In [14]:
class CharacterModel(nn.Module):
    def __init__(self):
        super(CharacterModel,self).__init__()
        
        # your code here
        self.forward_rnn = ... # LSTM with EMBEDDING_DIM, RNN_HIDDEN_DIM, and RNN_LAYERS; 
        self.backward_rnn = ... # same: LSTM with EMBEDDING_DIM, RNN_HIDDEN_DIM, and RNN_LAYERS; 
        #your code here
        
    def forward(self,embs,rev_embs,word_lengths):
        # your code here
        embs = ... # requires `pack_padded_sequence` for embs with `enforce_sorted=False`
        rev_embs = ... # same; requires `pack_padded_sequence` for rev_embs with `enforce_sorted=False`

        _, (fwd_hs,_) = ... # `forward_rnn`
        _, (bwd_hs,_) = ... # `backward_rnn`
        
        return ... # cat with `fwd_hs` `bwd_hs`; 
        # your code here
        
# Assertions to check that your code returns objects of the correct size (not a guarantee that your code works).
assert(CharacterModel()(torch.zeros(100,10,EMBEDDING_DIM),
                        torch.zeros(100,10,EMBEDDING_DIM),
                        torch.tensor([1,2,3,4,5,6,7,8,9,10])).size() == 
       (1,10,2*RNN_HIDDEN_DIM))

#### Exercise 3.3: Encoding words
rubric={accuracy:3}

The `WordEncoder` class encapsulates a character embedding `self.embedding` and a `CharacterModel`. Its job is to encode a batch of words such as the example below into a representation of size `(sentence_length,1,2*RNN_HIDDEN_DIM)`. 

$$C=\begin{bmatrix} t & d & b \\
h & o & a \\
e & g & r \\
{\rm PAD} & s & k
\end{bmatrix}$$

It does this by first using `self.embedding` to embed $C$ and its reversal, given by the function `reverse_words`. It then calls `self.char_model` on the embeddings. `self.char_model` returns a tensor of size `(1,sentence_len,2*RNN_HIDDEN_DIM)`. You need to rearrange the dimensions so that `WordEncoder.forward` can return a tensor of size `(sentence_length,1,2*RNN_HIDDEN_DIM)`.

In [15]:
class WordEncoder(nn.Module):
    def __init__(self):
        super(WordEncoder,self).__init__()  
        self.embedding = nn.Embedding(len(char_vocab),EMBEDDING_DIM)
        self.char_model = CharacterModel()
    
    def forward(self,ex):
        words, sentence_len, word_lens = ex.char
        
        # We need to rearrange words and word_lens a bit here to be able 
        # to feed them to self.char_model. After rearranging, words will 
        # have size (N,sentence_len), where N is the length of the longest 
        # word in our input sentence.
        words = words.squeeze(0).permute(1,0)
        word_lens = torch.tensor(word_lens,dtype=torch.long)
        
        # Reverse the words in ex.
        rev_words = reverse_words(words, word_lens)

        # your code here
        embedded = ... 
        rev_embedded = ... 
        hs = ... # requires `char_model` with embedded and rev_embedded; 

        # here... you may need `permute`  (print hs to see why)
        hs = ... # permute
        
        return hs
        # your code here

sentence_len = ex.word.size()[0] - 2
assert(WordEncoder()(ex).size() == (sentence_len,1,2*RNN_HIDDEN_DIM))

#### Exercise 3.4 Tagging sentences
rubric={accuracy:1}

Now it's time to put together all the components that you built so far. `CharacterPOSTagger` is a wrapper around a `SentenceEncoder`, a `WordEncoder` and a `FeedForward` layer. It has a `forward` method which returns a tensor `res` where `res[i,0,j]` represents the log probability of tag `pos_vocab.get_itos()[j]` for the word at position `i` in the input sentence. 

The function `tag` takes a dataset (development or test data) `data` as input and returns a list of tag sequences as output. For example:

```
[["DET","NOUN","VERB"],["VERB","ADV"],["PRON","VERB","PRON"]]

```

Note that the "tag" function is identical to the function for the WordPOSTagger - it simply calls the modified tagger.
**NOTE!** Again, `with torch.no_grad():` is required to prevent PyTorch from collecting gradients from the test examples.

In [16]:
class CharacterPOSTagger(nn.Module):
    def __init__(self):
        super(CharacterPOSTagger,self).__init__()
        self.tagset_size = len(pos_vocab)
        
        self.sentence_encoder = SentenceEncoder()
        self.word_encoder = WordEncoder()
        self.hidden2tag = FeedForward(4*RNN_HIDDEN_DIM,self.tagset_size)
        
    def forward(self,ex, word_dropout=0):
        word_states = self.sentence_encoder(ex,word_dropout)
        char_states = self.word_encoder(ex)
        scores = self.hidden2tag(torch.cat([word_states, char_states],dim=2))
        return scores
       
    def tag(self,data):
        with torch.no_grad():
            results = []
            pos_itos = pos_vocab.get_itos()
            for ex in data:
                scores = self(ex)
                tags = scores.argmax(dim=2).squeeze(1)
                results.append([pos_itos[i] for i in tags])
            return results
        
pos_size = len(pos_vocab)
assert(CharacterPOSTagger()(ex).size() == (ex.word.size()[0]-2,1,pos_size))
assert(len(CharacterPOSTagger().tag([ex])[0]) == ex.word.size()[0] -2) 

Armed with the `CharacterPOSTagger` class, you can now train your tagger. You should get accuracy about 65-70%. On Garrett's computer, the model trains in about a minute.

In [17]:
tagger = CharacterPOSTagger()
optimizer = Adam(tagger.parameters())
loss_function = nn.NLLLoss()

for epoch in range(EPOCHS):
    tot_loss = 0
    for i,ex in enumerate(train_iter):
        print("Epoch %u: Example %u of %u" % (epoch+1, i+1,len(train_iter)),end="\r")
        tagger.zero_grad()
        output = tagger(ex,word_dropout=0.05).squeeze(dim=1)
        gold = ex.pos.squeeze(dim=1)
        loss = loss_function(output,gold)
        loss.backward()
        optimizer.step()
        tot_loss += loss.detach().numpy()
    print("\nAverage loss per example: %.4f" % (tot_loss/len(train_iter)))
    sys_dev = tagger.tag(dev_iter)
    print("Development accuracy: %.2f" % accuracy(sys_dev, dev_iter))

Epoch 1: Example 49 of 49
Average loss per example: 2.0911
Development accuracy: 44.67
Epoch 2: Example 49 of 49
Average loss per example: 1.2655
Development accuracy: 57.97
Epoch 3: Example 49 of 49
Average loss per example: 0.8390
Development accuracy: 64.87
Epoch 4: Example 49 of 49
Average loss per example: 0.5694
Development accuracy: 67.38
Epoch 5: Example 49 of 49
Average loss per example: 0.3773
Development accuracy: 70.51


### Exercise 4: Mixing Mbyá Guaraní and German Data

#### Assignment 4.1
rubric={accuracy:6}

You now have a Mbyá Guaraní POS-tagger that's been trained on all the available
data, and accuracy has improved over the word model. What next?
In this section, we are going to add about 1000 Standard German 
sentences to the training data.  Although German and Guaraní are not closely
related, it's possible they share some syntactic information that could help 
pre-train the model.

Run the following cell to read in German data:

In [18]:
# Read Universal Dependencies v2 training, development and test sets for German (deu).
# Next, create a training iterator for German
# train_gua_iter, dev_gua_iter, test_gua_iter, word_vocab, char_vocab, pos_vocab = read_ud_data("gun")
train_deu_iter, dev_deu_iter, test_deu_iter, _, _, _ = read_ud_data("deu", vocabs=[word_vocab, char_vocab, pos_vocab])
assert(len(train_deu_iter) == 948)

You will now need to re-train your CharacterPOSTagger to incorporate both the Standard German and the Mbyá Guaraní data.  We will be using a method known as "fine-tuning", where we train a model on one language, and then re-train for a small number of epochs on a small amount of the data we're really interested in.  This method has wide applications in NLP, from MT, to sentiment analysis, to document classification, and beyond.
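The two-stage schedule can be illustrated with a deliberately tiny model: a single parameter `w` fitted by stochastic gradient descent to minimize $(w - t)^2$. Everything in this sketch is a made-up stand-in (the helper `sgd_fit`, the target values, the dataset sizes); it only shows that the data seen last has the strongest pull on the final parameters.

```python
def sgd_fit(w, targets, epochs, lr=0.1):
    """Minimize (w - t)^2 with plain SGD, visiting one target at a time."""
    for _ in range(epochs):
        for t in targets:
            grad = 2.0 * (w - t)
            w -= lr * grad
    return w


large_source = [1.0] * 20  # stands in for the larger pre-training set
small_target = [2.0] * 2   # stands in for the small set we actually care about

# Stage 1: pre-train on the large set.
w_pretrained = sgd_fit(0.0, large_source, epochs=5)
# Stage 2: fine-tune on the small set, starting from the pre-trained value.
w_finetuned = sgd_fit(w_pretrained, small_target, epochs=5)

print(round(w_pretrained, 3), round(w_finetuned, 3))
```

After the second stage, `w_finetuned` sits much closer to the small set's target than to the large set's target, even though the small set contributed far fewer updates.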

In [34]:
tagger = CharacterPOSTagger()
optimizer = Adam(tagger.parameters())
loss_function = nn.NLLLoss()
import itertools
    
for epoch in range(EPOCHS):
    tot_loss = 0

    #Your code here
    # Since it's same as in Ex2, i just copy here with `train_deu_iter`
    # for i,ex in enumerate(train_iter):
    #     print("Epoch %u: Example %u of %u" % (epoch+1, i+1,len(train_iter)),end="\r")
    #     tagger.zero_grad()
    #     output = tagger(ex,word_dropout=0.05).squeeze(dim=1)
    #     gold = ex.pos.squeeze(dim=1)
    #     loss = loss_function(output,gold)
    #     loss.backward()
    #     optimizer.step()
    #     tot_loss += loss.detach().numpy()

    for i,ex in enumerate(train_deu_iter):
        print("Epoch %u: Example %u of %u for German" % (epoch+1, i+1,len(train_deu_iter)),end="\r")
        tagger.zero_grad()
        output = tagger(ex,word_dropout=0.05).squeeze(dim=1)
        gold = ex.pos.squeeze(dim=1)
        loss = loss_function(output,gold)
        loss.backward()
        optimizer.step()
        tot_loss += loss.detach().numpy()
    #Your code here

    print("\nAverage loss per example: %.4f" % (tot_loss/(len(train_deu_iter))))
    sys_dev = tagger.tag(dev_iter)
    print("Development accuracy: %.2f" % accuracy(sys_dev, dev_iter))

for epoch in range(EPOCHS):
    tot_loss = 0
    #Your code here
    # It's also same as in Ex2, 
    
    # for i,ex in enumerate(train_iter):
    #     print("Epoch %u: Example %u of %u for Mbyá Guaraní" % (epoch+1, i+1,len(train_iter)),end="\r")
    #     tagger.zero_grad()
    #     output = tagger(ex,word_dropout=0.05).squeeze(dim=1)
    #     gold = ex.pos.squeeze(dim=1)
    #     loss = loss_function(output,gold)
    #     loss.backward()
    #     optimizer.step()
    #     tot_loss += loss.detach().numpy()
    #Your code here
    print("\nAverage loss per example: %.4f" % (tot_loss/(len(train_iter))))
    sys_dev = tagger.tag(dev_iter)
    print("Development accuracy: %.2f" % accuracy(sys_dev, dev_iter))
 

Epoch 1: Example 948 of 948 for German
Average loss per example: 0.9057
Development accuracy: 30.11
Epoch 2: Example 948 of 948 for German
Average loss per example: 0.4422
Development accuracy: 30.87
Epoch 3: Example 948 of 948 for German
Average loss per example: 0.3258
Development accuracy: 27.35
Epoch 4: Example 948 of 948 for German
Average loss per example: 0.2543
Development accuracy: 27.48
Epoch 5: Example 948 of 948 for German
Average loss per example: 0.2115
Development accuracy: 26.85
Epoch 1: Example 49 of 49 for Mbyá Guaraní
Average loss per example: 1.9154
Development accuracy: 59.72
Epoch 2: Example 49 of 49 for Mbyá Guaraní
Average loss per example: 0.5629
Development accuracy: 69.26
Epoch 3: Example 49 of 49 for Mbyá Guaraní
Average loss per example: 0.3556
Development accuracy: 68.88
Epoch 4: Example 49 of 49 for Mbyá Guaraní
Average loss per example: 0.2037
Development accuracy: 71.52
Epoch 5: Example 49 of 49 for Mbyá Guaraní
Average loss per example: 0.1479
Developm

#### Assignment 4.2
rubric={reasoning:1}

Why is it important that we iterate over Standard German first, and then Mbyá Guaraní?  What do you think would happen if we reversed this order?

In [20]:
# your answer here

# your answer here

#### Assignment 4.3
rubric={reasoning:1}

Can you see any potential dangers in training on 50 Mbyá Guaraní sentences, and 1000 Standard German ones?

In [21]:
# your answer here

# your answer here

### Exercise 5: Annotation Projection (preparation)

We see that a tagger trained only on German gives very low accuracy on Mbyá Guaraní, but fine-tuning recovers most of the loss, even though the Mbyá Guaraní data is only about 5% the size of the German data.  What if we could get more annotated Mbyá Guaraní data without having to annotate it by hand?

In this next section, we'll use an unsupervised word aligner to exploit the fact that translations of
words often occur in the same environments, and usually have the same POS.

We'll be using the `translate` module from nltk to produce a word alignment for each sentence of a small parallel corpus: the Bible.  Make sure that you have nltk installed.


#### Exercise 5.1: Read in the German and Mbyá Guaraní, preserving parallelism.
rubric={accuracy:4}

A tagged version of the German Bible and an untagged version of the Mbyá Guaraní Bible are in the data folder (`deu_bible` and `gun_bible`, respectively).

The files are in CONLL format, as before.

Each line of the file corresponds to a single word with its tag, and verses are separated by a blank line.

You must read in each of the Bibles, storing both the tags and words in such 
a way that the parallel nature of the Bibles is preserved.  For now, you can assign "_" as the tag of every Mbyá Guaraní word.  Your function should return 4 lists: the source words, the target words, the source tags, and the target tags.

I encourage you to check that the data really is parallel.  Although I don't speak Mbyá Guaraní, and most of you probably don't speak German, named entities can provide clues, since they are often similar across languages.  If you see "Jesu" on the German side, but no "Jesus" on the Mbyá Guaraní side, you may have lost the parallelism.

```
data % head *_bible
==> deu_bible <==
1	DAS	_	X	_	_	_	_	_	_
2	Buch	_	NOUN	_	_	_	_	_	_
3	der	_	DET	_	_	_	_	_	_
4	Abstammung	_	NOUN	_	_	_	_	_	_
5	Jesu	_	PROPN	_	_	_	_	_	_
6	Christi	_	PROPN	_	_	_	_	_	_
7	,	_	PUNCT	_	_	_	_	_	_
8	des	_	DET	_	_	_	_	_	_
9	Sohnes	_	NOUN	_	_	_	_	_	_
10	Davids	_	PROPN	_	_	_	_	_	_

==> gun_bible <==
1	Kova'e	_	_	_	_	_	_	_	_
2	kuaxia	_	_	_	_	_	_	_	_
3	re	_	_	_	_	_	_	_	_
4	ma	_	_	_	_	_	_	_	_
5	oĩ	_	_	_	_	_	_	_	_
6	Jesus	_	_	_	_	_	_	_	_
7	Cristo	_	_	_	_	_	_	_	_
8	ramoĩ	_	_	_	_	_	_	_	_
9	ypy	_	_	_	_	_	_	_	_
10	kuery	_	_	_	_	_	_	_	_
data % 

```
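A minimal sketch of reading a file in this layout, assuming tab-separated columns with the word in the second column and the tag in the fourth, and a blank line between verses, as in the excerpt above. The helper name `read_conll_blocks` and the in-memory sample are hypothetical and not part of the lab code:

```python
import io

def read_conll_blocks(handle):
    """Return (words, tags): one list per verse, in file order."""
    words, tags = [], []
    cur_words, cur_tags = [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current verse (repeated blanks are skipped).
            if cur_words:
                words.append(cur_words)
                tags.append(cur_tags)
                cur_words, cur_tags = [], []
            continue
        cols = line.split("\t")
        cur_words.append(cols[1])  # column 2: word form
        cur_tags.append(cols[3])   # column 4: POS tag ("_" if untagged)
    if cur_words:  # the last verse may not be followed by a blank line
        words.append(cur_words)
        tags.append(cur_tags)
    return words, tags

# Small in-memory sample standing in for a real file.
sample = (
    "1\tJesu\t_\tPROPN\t_\t_\t_\t_\t_\t_\n"
    "2\tChristi\t_\tPROPN\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "1\tAbraham\t_\tPROPN\t_\t_\t_\t_\t_\t_\n"
)
sample_words, sample_tags = read_conll_blocks(io.StringIO(sample))
print(sample_words)
print(sample_tags)
```

Reading both Bibles with the same routine and comparing the number of verses is a cheap first check that the parallelism survived.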

In [36]:
import os

def prepare_data(source_file, target_file):
    
    source_words = [] # list of lists; 
    target_words = []
    source_tags = []
    target_tags = []


    # your code here
    
    # ...  FUN to process data files ...

    # Your code here
    
    return source_words, target_words, source_tags, target_tags

german_words, guarani_words, german_tags, guarani_tags = prepare_data("deu_bible", "gun_bible")

print(guarani_words[0])
print(german_words[0])
print(guarani_tags[0])
print(german_tags[0])
print(len(german_tags))
print(len(guarani_tags))
# assert(len(guarani_words) == len(german_words))
# assert(len(guarani_tags) == len(german_tags))
# assert(guarani_words[0] == ["Kova'e", 'kuaxia', 're', 'ma', 'oĩ', 'Jesus', 'Cristo', 'ramoĩ', 'ypy', 'kuery', 'rery', '.', 'Jesus', 'ma', 'Davi', 'ramymino', 'oiko', ',', "ha'e", 'Davi', 'ma', 'Abraão', 'ramymino', "raka'e", '.'])
# assert(guarani_tags[0] == ['_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_'])

["Kova'e", 'kuaxia', 're', 'ma', 'oĩ', 'Jesus', 'Cristo', 'ramoĩ', 'ypy', 'kuery', 'rery', '.', 'Jesus', 'ma', 'Davi', 'ramymino', 'oiko', ',', "ha'e", 'Davi', 'ma', 'Abraão', 'ramymino', "raka'e", '.']
['DAS', 'Buch', 'der', 'Abstammung', 'Jesu', 'Christi', ',', 'des', 'Sohnes', 'Davids', ',', 'des', 'Sohnes', 'Abrahams', '.']
['_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_']
['X', 'NOUN', 'DET', 'NOUN', 'PROPN', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'PROPN', 'PUNCT']
7957
7957


In [37]:
print(guarani_words[:2])
print(german_words[:2])
print(guarani_tags[:2])
print(german_tags[:2])


[["Kova'e", 'kuaxia', 're', 'ma', 'oĩ', 'Jesus', 'Cristo', 'ramoĩ', 'ypy', 'kuery', 'rery', '.', 'Jesus', 'ma', 'Davi', 'ramymino', 'oiko', ',', "ha'e", 'Davi', 'ma', 'Abraão', 'ramymino', "raka'e", '.'], ['Abraão', "ra'y", 'ma', 'Isaque', ',', 'Isaque', "ra'y", 'ma', 'Jacó', ',', 'Jacó', "ra'y", 'ma', 'Judá', "ha'e", 'tyvy', 'kuery', '.']]
[['DAS', 'Buch', 'der', 'Abstammung', 'Jesu', 'Christi', ',', 'des', 'Sohnes', 'Davids', ',', 'des', 'Sohnes', 'Abrahams', '.'], ['Abraham', 'zeugte', 'den', 'Isaak', '.', 'Isaak', 'zeugte', 'den', 'Jakob', '.', 'Jakob', 'zeugte', 'den', 'Juda', 'und', 'seine', 'Brüder', '.']]
[['_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_'], ['_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_']]
[['X', 'NOUN', 'DET', 'NOUN', 'PROPN', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'PROPN', 'PUNCT'], ['PROPN', 'VERB', 'DET', 'NOUN', 'P

#### Exercise 5.2 - Align the two corpora
rubric={accuracy:1}

We will now use the IBMModel2 class from nltk.translate to align the two corpora.  You may recall that IBM Model 1 simply looks for words that co-occur in sentences, while Model 2 also learns an overarching re-ordering model.

In this part of the assignment, you will need to assign the german_words and guarani_words from the previous section to a bitext, which IBMModel2 will use to produce an alignment.  I suggest you look at https://www.nltk.org/_modules/nltk/translate/ibm2.html for help in how to do so.

The input to the function will be two lists: a list of source (German) sentences, and a list of target (Guarani) sentences:

[['DAS', 'Buch', 'der', 'Abstammung', 'Jesu', 'Christi', ',', 'des', 'Sohnes', 'Davids', ',', 'des', 'Sohnes', 'Abrahams', '.'], ['German', 'Sentence', 'number', '2', '.'], ['German', 'Sentence', 'number', '3', '.'], etc]

[["Kova'e", 'kuaxia', 're', 'ma', 'oĩ', 'Jesus', 'Cristo', 'ramoĩ', 'ypy', 'kuery', 'rery', '.', 'Jesus', 'ma', 'Davi', 'ramymino', 'oiko', ',', "ha'e", 'Davi', 'ma', 'Abraão', 'ramymino', "raka'e", '.'], ['Guarani', 'Sentence', 'number', '2', '.'], ['Guarani', 'Sentence', 'number', '3', '.'], etc]

Each of these sentences should be added to the bitext in parallel.

The output will be the bitext, after you run it through IBMModel2 (you can use 5 iterations for the IBMModel2 function).

After it goes through the IBMModel2 function, each item bitext[x] of the bitext will have three attributes: words, mots, and alignment:

- bitext[x].words will be the source sentence
- bitext[x].mots will be the target sentence
- bitext[x].alignment will be the alignment


This alignment is a set of index pairs of the form:

(0, 0), (1, 3), (4, 3), (6, 3), (8, 5), ..., where the first index corresponds to the source language in your bitext, and the second to the target.  A source word with no counterpart is paired with None.  Although we are using German as a source, and Mbyá Guaraní as a target, the algorithm is only concerned with how you present them to the function - be careful that you are consistent!

This step takes a bit of time, but you should only have to learn the alignment once.  Wait for it to finish... at completion, it will print out the time it took to learn.  On Garrett's computer, it takes about 6 minutes.
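As a rough sketch of the nltk calls on a toy corpus (the three sentence pairs below are invented for illustration and are far too few to learn a meaningful alignment): each `AlignedSent` takes one sentence as its first argument (`words`) and its translation as the second (`mots`), and training `IBMModel2` fills in the `alignment` attribute of every pair in place.

```python
from nltk.translate import AlignedSent, IBMModel2

# Toy parallel data, invented for illustration only.
toy_source = [["das", "Haus"], ["das", "Buch"], ["ein", "Buch"]]
toy_target = [["the", "house"], ["the", "book"], ["a", "book"]]

# Pair the sentences up in order; zip keeps them parallel.
toy_bitext = [AlignedSent(src, tgt) for src, tgt in zip(toy_source, toy_target)]

# Training updates the AlignedSent objects in the list in place.
IBMModel2(toy_bitext, 5)

for pair in toy_bitext:
    # Each alignment holds (index into words, index into mots or None) pairs.
    print(pair.words, pair.mots, sorted(pair.alignment, key=lambda p: p[0]))
```

Because the model mutates the list it is given, returning the bitext itself (rather than the model object) is enough to get at the alignments afterwards.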

In [23]:
def align_source_target(source, target):
    

    from nltk.translate import AlignedSent
    from nltk.translate import IBMModel2
    import time

    bitext = []
    #Your code here
    # iterate over the source (or target) length,
    # append an AlignedSent (source and target) to `bitext`
    
    #Your code here

    print("Starting!")
    start_time = time.time()
    # Then, IBMModel2 (given)
    alignment = IBMModel2(bitext, 5)
    print("--- %s seconds ---" % (time.time() - start_time))
    print("Done!")
    return bitext
    
german_guarani_alignment = align_source_target(german_words, guarani_words)

Starting!
--- 158.92377614974976 seconds ---
Done!


In [39]:
german_guarani_alignment[:2]

[AlignedSent(['DAS', 'Buch', 'der', 'Abstammung', 'Jesu', 'Christi', ',', 'des', 'Sohnes', 'Davids', ',', 'des', 'Sohnes', 'Abrahams', '.'], ['Kova'e', 'kuaxia', 're', 'ma', 'oĩ', 'Jesus', 'Cristo', 'ramoĩ', 'ypy', 'kuery', 'rery', '.', 'Jesus', 'ma', 'Davi', 'ramymino', 'oiko', ',', 'ha'e', 'Davi', 'ma', 'Abraão', 'ramymino', 'raka'e', '.'], Alignment([(0, 7), (1, 1), (2, 4), (3, 7), (4, 10), (5, 6), (6, 13), (7, 10), (8, 7), (9, 19), (10, 16), (11, 10), (12, 7), (13, 21), (14, None)])),
 AlignedSent(['Abraham', 'zeugte', 'den', 'Isaak', '.', 'Isaak', 'zeugte', 'den', 'Jakob', '.', 'Jakob', 'zeugte', 'den', 'Juda', 'und', 'seine', 'Brüder', '.'], ['Abraão', 'ra'y', 'ma', 'Isaque', ',', 'Isaque', 'ra'y', 'ma', 'Jacó', ',', 'Jacó', 'ra'y', 'ma', 'Judá', 'ha'e', 'tyvy', 'kuery', '.'], Alignment([(0, 0), (1, 8), (2, 16), (3, 3), (4, None), (5, 3), (6, 8), (7, 8), (8, 8), (9, None), (10, 10), (11, 10), (12, 16), (13, 13), (14, 14), (15, 15), (16, 15), (17, 17)]))]

In [41]:
# [AlignedSent(['DAS', 'Buch', 'der', 'Abstammung', 'Jesu', 'Christi', ',', 'des', 'Sohnes', 'Davids', ',', 'des', 'Sohnes', 'Abrahams', '.'], 
#              ['Kova'e', 'kuaxia', 're', 'ma', 'oĩ', 'Jesus', 'Cristo', 'ramoĩ', 'ypy', 'kuery', 'rery', '.', 'Jesus', 'ma', 'Davi', 'ramymino', 'oiko', ',', 'ha'e', 'Davi', 'ma', 'Abraão', 'ramymino', 'raka'e', '.'], 
#               Alignment([(0, 7), (1, 1), (2, 4), (3, 7), (4, 10), (5, 6), (6, 13), (7, 10), (8, 7), (9, 19), (10, 16), (11, 10), (12, 7), (13, 21), (14, None)])), 
#               ... ]


### Exercise 6: Annotation Projection

#### Assignment 6.1
rubric={accuracy:3}

Now that we have an alignment, we can use it to project German tags onto the Mbyá Guaraní data.

The alignment is now stored in bitext[x].alignment, where x is the index of a sentence in the bitext.  You should loop through the alignment for each sentence (to speed things up down the road, and to make things comparable with the Standard German results, I suggest limiting the sentences to 948), and for each alignment pair (e.g., 0-3), find the tag from the source and assign it to the target (hint: remember german_tags and guarani_tags, from above?).  You will need to decide what to do if more than one German word aligns to a single Guarani word.  Some possible suggestions:

- simply choose the tag of the first word that aligns to the Guarani word;
- randomly choose a tag from the words that align;
- if more than one word aligns, and the tags aren't the same, assign "_" as the tag.

All are reasonable, and all have repercussions.

After assigning tags, the tagged Mbyá Guaraní information will be written to a pair of files.

Your function will take six parameters - the bitext, the source (Standard German) tags, the target (Mbyá Guaraní) tags, the target (Mbyá Guaraní) words, and the file names of the training and testing files that you will write.  It will not return anything.  Remember to close your files at the end of the function!

An example projection might look like this:

source_words[x] = ["Eishockey", "ist", "ein", "wunderbarer", "Sport", "!"]
target_words[x] = ["Hockey", "is", "a", "great", "game", "!"]
source_tags[x] = ["NOUN", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]
target_tags[x] = ["_", "_", "_", "_", "_", "_"]
bitext[x].alignment = [0 0, 1 1, 2 2, 5 5]

output[x] (written to file. Don't worry about lining it up; just separate everything with tabs):

1     Hockey    _     NOUN       _     _     _     _     _     _
2     is        _     VERB       _     _     _     _     _     _
3     a         _     DET        _     _     _     _     _     _
4     great     _     _          _     _     _     _     _     _
5     game      _     _          _     _     _     _     _     _
6     !         _     PUNCT      _     _     _     _     _     _

1     Sentence  _     ...
2     2         _     ...
3     goes      _     ...
4     here      _     ...
5     .         _     ...

...
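The example above can be sketched for a single sentence as follows. This shows only one of the suggested policies (keep the tag of the first source word that aligns to a target word). The helper names `project_sentence` and `to_conll_lines` are hypothetical, and alignment pairs are assumed to be `(source index, target index)` tuples, with `None` as the target of an unaligned source word:

```python
def project_sentence(alignment, source_tags, target_length):
    """Project source tags onto one target sentence; the first aligned word wins."""
    projected = ["_"] * target_length
    # Sort by source index so that "first" means the leftmost source word.
    for src, tgt in sorted(alignment, key=lambda pair: pair[0]):
        if tgt is None:
            continue  # this source word aligned to nothing
        if projected[tgt] == "_":
            projected[tgt] = source_tags[src]
    return projected

def to_conll_lines(words, tags):
    """Format one sentence as tab-separated, 10-column lines."""
    return [
        "\t".join([str(i), word, "_", tag] + ["_"] * 6)
        for i, (word, tag) in enumerate(zip(words, tags), start=1)
    ]

source_tags = ["NOUN", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]
target_words = ["Hockey", "is", "a", "great", "game", "!"]
alignment = [(0, 0), (1, 1), (2, 2), (5, 5)]

projected = project_sentence(alignment, source_tags, len(target_words))
print(projected)
print("\n".join(to_conll_lines(target_words, projected)))
```

Swapping the `if projected[tgt] == "_"` test for a different rule is all it takes to try the random or the "disagreement means `_`" policy.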


In [52]:
def project_tags(bitext, source_tags, target_tags, target_words, out_train, out_test):
    #Your code here:
    
    # you can use the following: 

    # alignment = bitext[sentence].alignment
    # for sourceID, targetID in alignment:
    #   target_tags[sentence][targetID] = source_tags[sentence][sourceID]

    # To create CONLLU files; 

    #Your code here
                            
# print(guarani_tags[936])
project_tags(german_guarani_alignment, german_tags, guarani_tags, guarani_words, "gun_bible_train", "gun_bible_test")
# print(len(german_tags))
# print(len(guarani_tags))

# #Check that we haven't introduced any extra lines, or removed any
# assert(len(german_tags) == len(guarani_tags) and len(german_guarani_alignment) == len(guarani_tags))
# #Check that we have a tag (or _) for each word in a sentence:
# assert(len(german_tags[40]) == len(german_words[40]) and len(guarani_tags[40]) == len(guarani_words[40]))
# #Check that we've gotten some tags projected       
# assert('NOUN' in guarani_tags[936] or 'DET' in guarani_tags[936] or 'VERB' in guarani_tags[936] or 'ADP' in guarani_tags[936])

In [54]:
german_guarani_alignment[0].alignment

Alignment([(0, 7), (1, 1), (2, 4), (3, 7), (4, 10), (5, 6), (6, 13), (7, 10), (8, 7), (9, 19), (10, 16), (11, 10), (12, 7), (13, 21), (14, None)])

In [55]:
print(german_tags[0])
print(guarani_tags[0])

['X', 'NOUN', 'DET', 'NOUN', 'PROPN', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'PROPN', 'PUNCT']
['_', 'NOUN', '_', '_', 'DET', '_', 'PROPN', 'X', '_', '_', 'PROPN', '_', '_', 'PUNCT', '_', '_', 'PUNCT', '_', '_', 'PROPN', '_', 'PROPN', '_', '_', '_']


Next, we read the files in using `read_ud_data()`.

In [56]:
# Read the projected (silver) Mbyá Guaraní Bible data written above.
train_gun_bible_iter, dev_gun_bible_iter, test_gun_bible_iter, word_vocab, char_vocab, pos_vocab = read_ud_data("gun_bible")

# # Print the first example in the Mbyá Guaraní training set.
# ex = next(iter(iter(iter(train_gun_bible_iter))))
# print(len(train_gun_bible_iter))
# print(ex.word)
# print(ex.char)
# print(ex.pos)

You can now train a Mbyá Guaraní tagger from just the Bible data.  It's noisy,
and missing a lot of tags, so it likely won't be as good as the one we trained previously, even though there's more data.  You will likely get below 50% accuracy.

In [26]:
tagger = CharacterPOSTagger()
optimizer = Adam(tagger.parameters())

weights = torch.ones(len(pos_vocab))
weights[pos_vocab['_']] = 0
loss_function = nn.NLLLoss(weights)

for epoch in range(EPOCHS):
    #Your code here
    tot_loss = 0

    # Training is same as before except for `train_gun_bible_iter`; 
    # for i,ex in enumerate(train_gun_bible_iter):
    #     print("Epoch %u: Example %u of %u" % (epoch+1, i+1,len(train_gun_bible_iter)),end="\r")
    #     tagger.zero_grad()
    #     output = tagger(ex,word_dropout=0.05).squeeze(dim=1)
    #     gold = ex.pos.squeeze(dim=1)
    #     #print(output)
    #     loss = loss_function(output,gold)
    #     loss.backward()
    #     optimizer.step()
    #     tot_loss += loss.detach().numpy()
    #Your code here
    print("\nAverage loss per example: %.4f" % (tot_loss/(len(train_gun_bible_iter))))
    sys_dev = tagger.tag(dev_iter)
    print("Development accuracy: %.2f" % accuracy(sys_dev, dev_iter))

Epoch 1: Example 948 of 948
Average loss per example: 1.6749
Development accuracy: 38.02
Epoch 2: Example 948 of 948
Average loss per example: 1.3689
Development accuracy: 40.15
Epoch 3: Example 948 of 948
Average loss per example: 1.2191
Development accuracy: 35.26
Epoch 4: Example 948 of 948
Average loss per example: 1.0856
Development accuracy: 40.03
Epoch 5: Example 948 of 948
Average loss per example: 0.9434
Development accuracy: 37.14


### Exercise 7: Don't predict "_"!

Ok, so we do about half as well as our tagger trained only on 50 Mbyá Guaraní sentences (but much better than the model trained on just German).  Not bad, but we can do better.  Some of you may have noticed that the alignment was missing a lot of words on the Guaraní side, so we had no choice but to use "_" as a tag.  This is ok, but "_" will never be a tag in the test data!  We can instead bias the model so that it never predicts "_".  This will involve 2 small changes to your character-based model:

1.  In the tag function of your tagger, we need to make sure that "_" is never picked as the argmax.  This can be achieved by modifying the "scores" variable so that its value for "_" is very, very small (not 0!  These are log-likelihoods, so will be negative values).  Hint: remember vocab.stoi? 

2.  Make sure that the model ignores the loss for examples that have "_" as their tag in the training data.  If we just do 1., then every time the model predicts something other than "_" when we have "_" in the data, the model will think we have huge loss, which will be back-propagated through the network.  Luckily, the NLLLoss() function allows you to set the "weight" parameter to a tensor specifying a weight for each class: https://pytorch.org/docs/stable/nn.html.  Why not just set the weight of the "_" class to 0?

You may only observe a small gain from this step (or no gain at all), but it is necessary.
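Both changes can be illustrated on a toy example. This is a sketch with an invented four-tag vocabulary in which index 1 is assumed to be "_"; in the lab the index would come from `pos_vocab`, and the scores would come from the tagger (with an extra batch dimension) rather than being written by hand:

```python
import torch
import torch.nn as nn

UNDERSCORE = 1  # assumed index of "_" in this toy tag vocabulary

# Toy log-probabilities for 3 words over 4 tags.
scores = torch.log(torch.tensor([
    [0.1, 0.6, 0.2, 0.1],
    [0.3, 0.4, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]))

# 1. At tagging time, push the score of "_" so low that argmax never picks it.
masked = scores.clone()
masked[:, UNDERSCORE] = -1e9
predicted = masked.argmax(dim=1)
print(predicted)

# 2. At training time, give "_" zero weight so those positions add no loss.
weights = torch.ones(4)
weights[UNDERSCORE] = 0.0
loss_function = nn.NLLLoss(weight=weights)
gold = torch.tensor([UNDERSCORE, 0, 0])  # the first word has no projected tag
loss = loss_function(scores, gold)
print(loss)
```

With the default mean reduction, `NLLLoss` divides by the summed weights of the gold tags, so a sentence whose tags are all "_" would divide zero by zero; it may be worth skipping such sentences during training.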


In [27]:
class CharacterPOSTagger(nn.Module):
    def __init__(self):
        super(CharacterPOSTagger,self).__init__()
        self.tagset_size = len(pos_vocab)
        
        self.sentence_encoder = SentenceEncoder()
        self.word_encoder = WordEncoder()
        self.hidden2tag = FeedForward(4*RNN_HIDDEN_DIM,self.tagset_size)
        
    def forward(self,ex, word_dropout=0):
        word_states = self.sentence_encoder(ex,word_dropout)
        char_states = self.word_encoder(ex)
        scores = self.hidden2tag(torch.cat([word_states, char_states],dim=2))
        return scores
       
    def tag(self,data):
        with torch.no_grad():
            results = []
            pos_itos = pos_vocab.get_itos()
            for ex in data:
                scores = self(ex)
                #Your code here
                # assign a very low value for `_`
                ...
                #Your code here

                tags = scores.argmax(dim=2).squeeze(1)
                results.append([pos_itos[i] for i in tags])
            return results

#### Assignment 7.1
rubric={reasoning:1}

Even though the projected model is not as good as the gold model, can you think of any cases where we might want (or have) to use it on its own?

In [28]:
# your answer here

# your answer here

We'll also be fine-tuning on the gold data, as we did before.

### Exercise 8: Using both Gold and Silver data

#### Assignment 8.1
rubric={accuracy:3}

As a final step, we will take the Mbyá Guaraní model trained on the projected data and fine-tune it on the gold data.  In this section, we'll be combining the Silver data that you got from the projection and the Gold, hand-annotated data. This model can get up to 75-80% accuracy.

In [29]:
tagger = CharacterPOSTagger()
optimizer = Adam(tagger.parameters())
loss_function = nn.NLLLoss()
import itertools
import copy

weights = torch.ones(len(pos_vocab))

#Your code here
# assign a weight of 0.0 to `_` in `weights`

#Your code here


loss_function = nn.NLLLoss(weight = weights)

print(len(dev_iter))

best_acc = 0.0
best_model = None
best_ft = None

for epoch in range(EPOCHS):
    tot_loss = 0
    for i,ex in enumerate(train_gun_bible_iter):
    # #Your code here#
    # #SAME as before (uncomment them to run)
    #     print("Epoch %u: Example %u of %u for Silver" % (epoch+1, i+1,len(train_gun_bible_iter)),end="\r")
    #     tagger.zero_grad()
    #     output = tagger(ex,word_dropout=0.05).squeeze(dim=1)
    #     gold = ex.pos.squeeze(dim=1)
    #     loss = loss_function(output,gold)
    #     loss.backward()
    #     optimizer.step()
    #     tot_loss += loss.detach().numpy()
    # #Your code here#
    
    print("\nAverage loss per example: %.4f" % (tot_loss/(len(train_gun_bible_iter))))
    sys_dev = tagger.tag(dev_iter)
    sys_acc = accuracy(sys_dev, dev_iter)
    print("Development accuracy: %.2f" % sys_acc)
    if(sys_acc > best_acc):
        best_acc = sys_acc
        best_model = copy.deepcopy(tagger)
    
 
# This is effectively saving the model, so we don't have to re-train the silver
# part every time
fine_tuned = copy.deepcopy(best_model)
optimizer = Adam(fine_tuned.parameters())


for epoch in range(EPOCHS):
    # #Your code here#
    # #SAME as before (uncomment them to run)
    # tot_loss = 0
    # for i,ex in enumerate(train_iter):
    #     print("Epoch %u: Example %u of %u for Gold" % (epoch+1, i+1,len(train_iter)),end="\r")
    #     fine_tuned.zero_grad()
    #     output = fine_tuned(ex,word_dropout=0.05).squeeze(dim=1)
    #     gold = ex.pos.squeeze(dim=1)
    #     loss = loss_function(output,gold)
    #     loss.backward()
    #     optimizer.step()
    #     tot_loss += loss.detach().numpy()
    # #Your code here#
    
    print("\nAverage loss per example: %.4f" % (tot_loss/(len(train_iter))))
    sys_dev = fine_tuned.tag(dev_iter)
    sys_acc = accuracy(sys_dev, dev_iter)
    print("Development accuracy: %.2f" % sys_acc)
    if(sys_acc > best_acc):
        best_acc = sys_acc
        best_ft = copy.deepcopy(fine_tuned)
    
        

tensor([1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
49
Epoch 1: Example 948 of 948 for Silver
Average loss per example: 1.6665
Development accuracy: 37.26
Epoch 2: Example 948 of 948 for Silver
Average loss per example: 1.3579
Development accuracy: 42.53
Epoch 3: Example 948 of 948 for Silver
Average loss per example: 1.2133
Development accuracy: 42.53
Epoch 4: Example 948 of 948 for Silver
Average loss per example: 1.0747
Development accuracy: 37.77
Epoch 5: Example 948 of 948 for Silver
Average loss per example: 0.9388
Development accuracy: 41.28
Epoch 1: Example 49 of 49 for Gold
Average loss per example: 1.3910
Development accuracy: 60.60
Epoch 2: Example 49 of 49 for Gold
Average loss per example: 0.7847
Development accuracy: 69.89
Epoch 3: Example 49 of 49 for Gold
Average loss per example: 0.5476
Development accuracy: 73.02
Epoch 4: Example 49 of 49 for Gold
Average loss per example: 0.3795
Development accuracy: 73.27
Epoch 5: Example 49 of 49 for Go

### Exercise 9: Learning less from noise (optional)

#### Assignment 9.1 Optional
rubric={accuracy:2}

We get a gain over the model trained on just gold data - an error reduction of about 10%-20%.  We might be able to do a little bit better, though.  We could fine-tune for more epochs - since we only have 50 gold sentences, fine-tuning is fast - but we're going to try something a little bit different.  We're going to instead vary the learning rate of the algorithm.

Broadly speaking, the learning rate controls how far we want our network to
move toward our current training data at each update.  So far we have not scaled the loss at all
(a factor of 1.0), meaning that all of the loss is back-propagated through the network.  By changing this factor, we can either punish the network more
for mistakes, or let it more slowly model the training data if we don't trust it.

We can approximate a learning rate by multiplying the loss by a constant value - values greater than 1.0 will speed up training (possibly at the cost of overfitting the data), while values less than 1.0 will slow it down (but possibly underfit the data).  Because Adam rescales each update by a running estimate of the gradient magnitude, the effect of scaling the loss can be small; the `lr` argument of `Adam` (0.001 by default) adjusts the step size directly.  You are free to experiment with various values of learning rates - I was able to get the final accuracy up to about 81%, which represents almost a 10% further error reduction over the original model.

You're also free to tune some other parameters - the dropout rate is very low.  Higher dropouts tend to create more robust systems, 
but can underfit the data if there isn't a lot of it.  You might also want to compare your results from tuning the learning rate to increasing
the number of fine-tuning epochs.
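The error-reduction figures quoted in this exercise measure how much of the remaining error a new model removes, rather than the raw difference in accuracy. A small sketch, with accuracies chosen purely for illustration (they are not results from this lab) and a hypothetical helper name:

```python
def error_reduction(old_accuracy, new_accuracy):
    """Fraction of the old model's errors removed by the new model (accuracies in %)."""
    old_error = 100.0 - old_accuracy
    new_error = 100.0 - new_accuracy
    return (old_error - new_error) / old_error

# Illustrative numbers only: going from 70% to 76% accuracy.
reduction = error_reduction(70.0, 76.0)
print("%.1f%% error reduction" % (100 * reduction))
```

Comparing runs this way makes it easier to judge whether a change in the loss scale, the dropout rate, or the number of epochs is worth keeping.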

In [30]:
fine_tuned = copy.deepcopy(best_model)
optimizer = Adam(fine_tuned.parameters())


## YOU CAN CHANGE the value;
learning_rate = 1.0


import numpy

# Reset once, before the loop, so the best fine-tuned model is tracked across epochs.
best_acc = 0

for epoch in range(EPOCHS):
    #Your code here#
    # #SAME as before (uncomment them to run)

    # tot_loss = 0

    # for i,ex in enumerate(train_iter):
    #     print("Epoch %u: Example %u of %u for Gold" % (epoch+1, i+1,len(train_iter)),end="\r")
    #     fine_tuned.zero_grad()
    #     output = fine_tuned(ex,word_dropout=0.05).squeeze(dim=1)
    #     gold = ex.pos.squeeze(dim=1)
    #     loss = loss_function(output,gold)
    #     loss = loss * learning_rate
    #     loss.backward()
    #     optimizer.step()
    #     tot_loss += loss.detach().numpy()
    #Your code here#
    
    print("\nAverage loss per example: %.4f" % (tot_loss/(len(train_iter))))
    sys_dev = fine_tuned.tag(dev_iter)
    sys_acc = accuracy(sys_dev, dev_iter)
    print("Development accuracy: %.2f" % sys_acc)
    if(sys_acc > best_acc):
        best_acc = sys_acc
        best_ft = copy.deepcopy(fine_tuned)
    

Epoch 1: Example 49 of 49 for Gold
Average loss per example: 1.3507
Development accuracy: 63.36
Epoch 2: Example 49 of 49 for Gold
Average loss per example: 0.7706
Development accuracy: 70.26
Epoch 3: Example 49 of 49 for Gold
Average loss per example: 0.5319
Development accuracy: 70.51
Epoch 4: Example 49 of 49 for Gold
Average loss per example: 0.3914
Development accuracy: 72.15
Epoch 5: Example 49 of 49 for Gold
Average loss per example: 0.2900
Development accuracy: 72.77


#### Assignment 9.2 Optional
rubric={reasoning:1}

One of the problems with Projection is that many words don't align that well (particularly if we have a small parallel corpus).  Can you think of some ways that we could replace some of the "_"s?

In [31]:
# your answer here

# your answer here