# 1. Information about the submission

## 1.1 Name and number of the assignment 

**Assignment 1: Morphological Analysis, CoNLL–SIGMORPHON 2018 Shared Task** 

## 1.2 Student name

**Kundyz Onlabek** 

# 2. Technical Report

## 2.1 Methodology 

To assess how well the system generalizes across resource settings, three amounts of labeled training data (low, medium, high) were provided. The systems were evaluated separately for each language and each of the three data-quantity conditions. Two metrics were used: accuracy, and the Levenshtein distance between the prediction and the gold form, averaged over all predictions. An aggregate performance measure for each resource setting was obtained by averaging the results over the individual languages.
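
As an illustration of the two metrics and the per-setting aggregate, here is a minimal sketch in plain Python. The edit-distance routine is a small stand-in for the `Levenshtein` package used in the code section, and the language names and toy predictions are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def score(predictions, targets):
    """Returns (accuracy, average Levenshtein distance) for one language."""
    pairs = list(zip(predictions, targets))
    accuracy = sum(p == t for p, t in pairs) / len(pairs)
    distance = sum(levenshtein(p, t) for p, t in pairs) / len(pairs)
    return accuracy, distance


def macro_average(per_language_scores):
    """Averages per-language (accuracy, distance) pairs for one setting."""
    n = len(per_language_scores)
    accuracy = sum(acc for acc, _ in per_language_scores.values()) / n
    distance = sum(dist for _, dist in per_language_scores.values()) / n
    return accuracy, distance


# Toy predictions for two hypothetical languages in one resource setting.
scores = {
    "lang_a": score(["walked", "runned"], ["walked", "ran"]),
    "lang_b": score(["chantons"], ["chantons"]),
}
overall_accuracy, overall_distance = macro_average(scores)
```

Each language contributes equally to the aggregate, regardless of how many test forms it has.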

The system is based on attention-based encoder-decoder models. The lemma and the tags are encoded using two separate encoders. While decoding, the decoder reads the relevant parts of the lemma and the tags using an attention mechanism. As most of the characters in the inflected form are copied from the lemma, the system needs a strong tendency to copy. A Pointer-Generator Network is therefore used: it facilitates copying characters from the lemma and handles out-of-vocabulary tokens during prediction.

The neural network architecture is based on the Pointer-Generator Network, with some subtle differences. The characters of the lemma $c_i$, along with additional start and stop characters, are fed one by one into a bidirectional LSTM encoder, producing a sequence of hidden states $h_{l_i}$. Similarly, the tags $tg_i$ are encoded by a separate bidirectional LSTM encoder, yielding another sequence of hidden states $h_{tg_i}$.

A unidirectional LSTM is used as the decoder. The decoder's initial hidden state $s_0$ is obtained by applying an affine transformation to the concatenation of the last hidden states of the lemma and tag encoders. As the input and output sequences have different semantics, this affine transformation lets the model learn the mapping from input semantics to output semantics:
$$s_{0}=W_{\text {initial }}\left[h_{l_{N}} ; h_{t g_{N}}\right]+b.$$
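
The initialisation above can be sketched in PyTorch as a single linear layer. The sizes are assumptions for illustration: each bidirectional encoder is taken to yield a final state of `2 * hidden_size` features (forward and backward directions concatenated), and the tensors are random stand-ins for real encoder outputs.

```python
import torch
import torch.nn as nn

hidden_size = 100          # assumed encoder/decoder hidden size
batch_size = 4             # toy batch

# Last hidden states of the two bidirectional encoders (random stand-ins);
# each is assumed to hold 2 * hidden_size features.
h_lemma_last = torch.randn(batch_size, 2 * hidden_size)
h_tag_last = torch.randn(batch_size, 2 * hidden_size)

# W_initial and b of the equation above, expressed as one affine layer.
init_layer = nn.Linear(4 * hidden_size, hidden_size)

# s_0 = W_initial [h_l ; h_tg] + b
s_0 = init_layer(torch.cat([h_lemma_last, h_tag_last], dim=1))
```

The layer maps the concatenated encoder summary down to the decoder's state size, so the encoders and the decoder do not need matching dimensions.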

While decoding, at each time step $t$, the decoder computes attention distributions over the lemma and over the tags separately, denoted $a^t_l$ and $a^t_{tg}$:
$$ \begin{gathered}
e_{l_{i}}^{t}=v^{T} \tanh \left(W_{h_{l}} h_{l_{i}}+W_{s_{l}} s_{t-1}+b_{l}\right) \\
e_{t g_{i}}^{t}=v^{T} \tanh \left(W_{h_{t g}} h_{t g_{i}}+W_{s_{t g}} s_{t-1}+b_{t g}\right) \\
a_{l}^{t}=\operatorname{softmax}\left(e_{l}^{t}\right) \\
a_{t g}^{t}=\operatorname{softmax}\left(e_{t g}^{t}\right).
\end{gathered}$$

The context vectors:
$$\begin{gathered}
h_{l_{t}}^{*}=\sum_{i} a_{l_{i}}^{t} h_{l_{i}} \\
h_{t g_{t}}^{*}=\sum_{i} a_{t g_{i}}^{t} h_{t g_{i}}
\end{gathered}.$$

The combined context vector:
$$ h_{t}^{*}=\left[h_{l_{t}}^{*} ; h_{t g_{t}}^{*}\right].$$
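
A minimal PyTorch sketch of the attention scores, attention distributions, and context vectors follows. It is an illustration under assumptions, not the system's actual code: the class name and all sizes are hypothetical, each module carries its own $v$ (the equations write a single $v$), and masking of padded positions is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """e_i = v^T tanh(W_h h_i + W_s s + b), a = softmax(e)."""

    def __init__(self, encoder_size, decoder_size, attention_size):
        super().__init__()
        self.W_h = nn.Linear(encoder_size, attention_size, bias=False)
        self.W_s = nn.Linear(decoder_size, attention_size, bias=True)  # holds b
        self.v = nn.Linear(attention_size, 1, bias=False)

    def forward(self, encoder_states, decoder_state):
        # encoder_states: (batch, src_len, encoder_size)
        # decoder_state:  (batch, decoder_size), i.e. s_{t-1}
        scores = self.v(torch.tanh(
            self.W_h(encoder_states) + self.W_s(decoder_state).unsqueeze(1)
        )).squeeze(2)                                    # (batch, src_len)
        weights = F.softmax(scores, dim=1)               # attention distribution
        # Weighted sum of encoder states: the context vector.
        context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
        return weights, context


batch, lemma_len, tag_len, enc, dec = 2, 7, 4, 200, 100
lemma_attention = AdditiveAttention(enc, dec, attention_size=100)
tag_attention = AdditiveAttention(enc, dec, attention_size=100)

s_prev = torch.randn(batch, dec)
a_lemma, ctx_lemma = lemma_attention(torch.randn(batch, lemma_len, enc), s_prev)
a_tag, ctx_tag = tag_attention(torch.randn(batch, tag_len, enc), s_prev)
combined_context = torch.cat([ctx_lemma, ctx_tag], dim=1)  # [h*_l ; h*_tg]
```

Keeping the lemma and tag attentions separate means each distribution sums to one on its own, so the tags cannot crowd out the lemma characters.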

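The decoder state is then updated from the previous state, the combined context vector, and the previous output character $y_{t-1}$; at the first time step, the start character is given as input in place of $y_{t-1}$:
$$ s_{t}=f\left(s_{t-1}, h_{t}^{*}, y_{t-1}\right), $$
where $f$ is a nonlinear function.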

A probability distribution over the characters in the vocabulary is then computed; it gives how likely each character is to be generated:
$$ P_{\text {vocab }}=\operatorname{softmax}\left(V\left[s_{t} ; h_{t}^{*}\right]+b\right).$$

The probability of predicting a character $c$ is the sum of two terms: the probability of generating $c$, weighted by the generation probability $p_{\text {gen }}$, and the total attention on the lemma positions holding $c$, weighted by the copy probability $(1 - p_{\text {gen }})$:
$$P(c)=p_{\text {gen }} P_{\text {vocab }}(c)+\left(1-p_{\text {gen }}\right) \sum_{i: c_{i}=c} a_{l_{i}}^{t}.$$
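
To make the mixture concrete, the following plain-Python sketch evaluates $P(c)$ for one decoding step. The function name, the three-character vocabulary, and all numbers are made up for illustration.

```python
def final_distribution(p_gen, p_vocab, lemma, attention):
    """Mixes generation and copy probabilities for one decoding step.

    p_vocab maps each vocabulary character to its generation probability;
    attention[i] is the attention weight on lemma[i].
    """
    probabilities = {char: p_gen * prob for char, prob in p_vocab.items()}
    for char, weight in zip(lemma, attention):
        # A character that occurs several times in the lemma collects the
        # attention of every occurrence; one missing from the vocabulary
        # can still receive probability through copying.
        probabilities[char] = probabilities.get(char, 0.0) + (1 - p_gen) * weight
    return probabilities


# Toy step: "ç" is not in the vocabulary but appears in the lemma.
p_vocab = {"a": 0.5, "c": 0.3, "e": 0.2}
lemma = "çaa"
attention = [0.6, 0.3, 0.1]

p = final_distribution(p_gen=0.25, p_vocab=p_vocab, lemma=lemma, attention=attention)
```

Because both $P_{\text{vocab}}$ and the lemma attention sum to one, the mixture is again a probability distribution for any $p_{\text{gen}}$ between 0 and 1.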

The negative log-likelihood is used as the loss. The loss at time step $t$, where $c^*_t$ is the target character, is $\text{loss}_t = -\log P(c^*_t)$.

The loss for the overall sequence is
$$\text{loss}=\frac{1}{T} \sum_{t=0}^{T} \text{loss}_{t}.$$
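
A small worked example of the sequence loss, in plain Python. It assumes the per-step losses are averaged over the number of decoding steps; the probabilities are invented.

```python
import math


def sequence_loss(target_probabilities):
    """Mean negative log-likelihood over the decoding steps.

    target_probabilities[t] is P(c*_t), the probability the model gave to
    the correct character at step t.
    """
    step_losses = [-math.log(p) for p in target_probabilities]
    return sum(step_losses) / len(step_losses)


# Three decoding steps; the model is confident on the first two only.
loss = sequence_loss([0.9, 0.8, 0.1])
```

The low-probability third step dominates the result, which is the behaviour that pushes the model to fix its least confident characters first.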

The Adam optimiser is used with an initial learning rate of 0.001 and a batch size of 32 to train the neural network. To deal with the exploding-gradient problem, the gradient norm, computed over all gradients together, is clipped to 3. Dropout is applied with probability $p$ to the embeddings and the encoder hidden states.
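
The optimiser and clipping settings can be sketched in PyTorch as below. The small LSTM and the squared-output loss are placeholders chosen only so that the snippet runs on its own; they are not the network or loss described above.

```python
import torch
import torch.nn as nn

# Stand-in model; the real network is the encoder-decoder described above.
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimiser = torch.optim.Adam(model.parameters(), lr=0.001)

inputs = torch.randn(32, 5, 8)            # batch size 32, toy sequence length 5
outputs, _ = model(inputs)
loss = outputs.pow(2).mean()              # placeholder loss for the sketch

optimiser.zero_grad()
loss.backward()
# Clip the global norm, computed over all parameters together, to 3.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=3.0)
optimiser.step()
```

`clip_grad_norm_` rescales all gradients by the same factor, so the update direction is preserved and only its length is bounded.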

Early stopping is used to prevent overfitting. A portion of the development set is used as the validation set. Single-layer LSTMs are used as encoders and decoders to reduce the number of parameters.
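
One common way to implement early stopping is a patience counter on validation accuracy, sketched below in plain Python. The class name, the patience value, and the accuracy history are assumptions for illustration; the report does not state the stopping criterion that was used.

```python
class EarlyStopping:
    """Signals a stop once validation accuracy stops improving."""

    def __init__(self, patience=5):
        self.patience = patience          # assumed value, not from the report
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, validation_accuracy):
        """Records one epoch; returns True when training should stop."""
        if validation_accuracy > self.best:
            self.best = validation_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
history = [0.10, 0.25, 0.24, 0.25, 0.30]   # made-up validation accuracies
stopped_at = None
for epoch, acc in enumerate(history):
    if stopper.step(acc):
        stopped_at = epoch
        break
```

With a patience of 2, training here stops at epoch 3 and never sees the later improvement to 0.30, which shows the trade-off in picking the patience value.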

## 2.2 Discussion of results

Experiment 1 (a simple batched encoder-decoder with characters and tags as input) works, while the other three implementations (simple attention over characters and tags; pointer-generator over characters and tags; simple attention over characters and tags with a trained attention layer) did not show good results.

# 3. Code

## 3.1 Requirements

In [None]:
! pip install python-levenshtein


## 3.2 Download the data

In [None]:
! git clone https://github.com/sigmorphon/conll2018

fatal: destination path 'conll2018' already exists and is not an empty directory.


## 3.3 Experiment 1

### Simple Batched Encoder Decoder with characters and tags as input

In [None]:
START_CHAR = '⏵'
STOP_CHAR = '⏹'
UNKNOWN_CHAR = '⊗'
UNKNOWN_TAG = '⊤'
PAD_CHAR = '₮'
PAD_TAG = '<PAD>'

In [None]:
from itertools import zip_longest
import os
import pickle

import Levenshtein
import numpy as np
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
import torch.nn.functional as F

In [None]:
def load_data(file_name):
    """Loads data.

    Args:
        file_name: path to file containing the data

    Returns:
        lemmas: list of lemma
        tags: list of tags
        inflected_forms: list of inflected form
    """

    with open(file_name, 'r', encoding='utf-8') as file:
        text = file.read()

    lemmas = []
    tags = []
    inflected_forms = []

    # Each non-empty line holds: lemma <TAB> inflected form <TAB> tags
    for line in text.split('\n'):
        if not line:
            continue
        lemma, inflected_form, tag = line.split('\t')
        lemmas.append(lemma)
        inflected_forms.append(inflected_form)
        tags.append(tag) 

    return lemmas, tags, inflected_forms

def get_index_dictionaries(lemmas, tags, inflected_forms):
    """Returns char2index, index2char, tag2index

    Args:
        lemmas: list of lemma
        tags: list of tags
        inflected_forms: list of inflected form

    Returns: 
        char2index: a dictionary which maps character to index
        index2char: a dictionary which maps index to character
        tag2index: a dictionary which maps morphological tag to index 
    """

    unique_chars = set(''.join(lemmas) + ''.join(inflected_forms))
    unique_chars.update({START_CHAR, STOP_CHAR}) # special start and end symbols
    unique_chars.add(UNKNOWN_CHAR) # special character for unknown characters
    char2index = {}
    index2char = {}

    char2index[PAD_CHAR] = 0
    index2char[0] = PAD_CHAR
    
    for index, char in enumerate(unique_chars):
        char2index[char] = index + 1
        index2char[index + 1] = char

    unique_tags = set(';'.join(tags).split(';'))
    unique_tags.add(UNKNOWN_TAG) # add() keeps the tag whole; update() would split a multi-character tag
    tag2index = {tag:index+1 for index, tag in enumerate(unique_tags)}
    tag2index[PAD_TAG] = 0

    return char2index, index2char, tag2index

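def get_combined_index_dictionaries(lemmas, tags, inflected_forms):
    """Returns input2index, index2input, char_vocab_length

    Characters and tags share one index space: the characters (plus the pad,
    start, stop and unknown symbols) come first, followed by the tags.

    Args:
        lemmas: list of lemma
        tags: list of tags
        inflected_forms: list of inflected form

    Returns: 
        input2index: a dictionary which maps characters and tags to index
        index2input: a dictionary which maps index to characters and tags
        char_vocab_length: number of character entries (including padding)
    """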

    unique_chars = set(''.join(lemmas) + ''.join(inflected_forms))
    unique_chars.update(START_CHAR, STOP_CHAR, UNKNOWN_CHAR) # special start and end symbols  
    
    input2index = {}
    index2input = {}
    
    input2index[PAD_CHAR] = 0
    index2input[0] = PAD_CHAR
        
    for index, char in enumerate(unique_chars, start=1):
        input2index[char] = index
        index2input[index] = char
        
    char_vocab_length = len(input2index.keys())

    unique_tags = set(';'.join(tags).split(';'))
    unique_tags.add(UNKNOWN_TAG) # special character for unknown tags
    
    for index, char in enumerate(unique_tags, start=char_vocab_length):
        input2index[char] = index
        index2input[index] = char

    return input2index, index2input, char_vocab_length

def words_to_indices(words, char2index, tensor=False, start_char=False, stop_char=False):
    """Converts a list of words to a list of lists containing indices

    Args:
        words: list of words
        char2index: dictionary which maps character to index
        tensor: whether to return each word as a tensor instead of a list
        start_char: whether to prepend the index of START_CHAR
        stop_char: whether to append the index of STOP_CHAR

    Returns:
        list_indices: list of lists/tensors containing indices for a sequence of characters
    """

    list_indices = []
    for word in words:
        word_indices = []
        if start_char:
            word_indices.append(char2index[START_CHAR])
        for char in word:
            if char in char2index.keys():
                word_indices.append(char2index[char])
            else:
                word_indices.append(char2index[UNKNOWN_CHAR])
        if stop_char:
            word_indices.append(char2index[STOP_CHAR])
        if tensor:
            word_indices = torch.tensor(word_indices, dtype=torch.long)  # indices must be integers
        list_indices.append(word_indices)

    return list_indices

def tag_to_vector(tags, tag2index):
    """Returns a multi-hot representation for each tag string.

    Args:
        tags: list of string representation of tag (eg, V;IND;PRS;2;PL)
        tag2index: dictionary which maps tag feature to index

    Returns:
        tag_vectors: list of 1D tensors with a 1 at the index of every feature present
    """

    tag_vectors = []
    for tag in tags:
        tag_vector = torch.zeros(len(tag2index))
        for tag_feature in tag.split(';'):
            if tag_feature in tag2index:
                tag_vector[tag2index[tag_feature]] = 1
            else:
                tag_vector[tag2index[UNKNOWN_TAG]] = 1
        tag_vectors.append(tag_vector)
    return tag_vectors

def tag_to_indices(tags, tag2index):
    """Converts list of tags to a list of lists containing indices

    Args:
        tags: list of tags
        tag2index: dictionary which maps sub-tag to index

    Returns:
        list_indices: list of lists containing indices of sub_tags
    """
    
    list_indices = []
    for tag in tags:
        tag_indices = []
        for sub_tag in tag.split(';'):
            if sub_tag in tag2index.keys():
                tag_indices.append(tag2index[sub_tag])
            else:
                tag_indices.append(tag2index[UNKNOWN_TAG])
        list_indices.append(tag_indices)

    return list_indices
    

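def indices_to_word(indices, index2char):
    """Returns a word given a list containing indices of characters

    The last character of the joined string is dropped.

    Args:
        indices: list containing indices
        index2char: dictionary which maps index to character

    Returns:
        word: a string
    """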

    return ''.join([index2char[index] for index in indices])[:-1]

def pad_lists(lists, pad_int, pad_len=None):
    """Pads lists in a list to make them of equal size"""
    
    if pad_len is None:
        pad_len = max([len(lst) for lst in lists])
    new_list = []
    for lst in lists:
        if len(lst) < pad_len:
            new_list.append(torch.tensor(lst + [pad_int] * (pad_len-len(lst))))
        else:
            new_list.append(torch.tensor(lst[:pad_len]))
    return torch.stack(new_list)

def merge_lists(lists1, lists2):
    """Add two list of lists."""
    
    merged_lists = []
    for list1, list2 in zip(lists1, lists2):
        merged_lists.append(list1 + list2)
    return merged_lists

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks or blocks"""
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def accuracy(predictions, targets):
    correct_count = 0
    for prediction, target in zip(predictions, targets):
        if prediction == target:
            correct_count += 1
    return correct_count / len(predictions)

def average_distance(predictions, targets):
    total_distance = 0
    for prediction, target in zip(predictions, targets):
        total_distance += Levenshtein.distance(prediction, target)
    return total_distance / len(predictions)

def evaluate(predictions, targets):
    return accuracy(predictions, targets), average_distance(predictions, targets)
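
The batching idiom built on `grouper` can be seen in isolation in the short sketch below. The three example words and tags are made up; the point is that the last chunk is padded with `None`, which has to be filtered out before the batch is used.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collects data into fixed-length chunks, padding the last with fillvalue."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


examples = list(zip(["aimer", "finir", "aller"], ["V;PRS", "V;PST", "V;FUT"]))

batches = []
for batch in grouper(examples, 2):
    # The last chunk is padded with None; drop the padding before use.
    batch = [item for item in batch if item is not None]
    batches.append(batch)
```

Three examples with a chunk size of two give one full batch and one batch holding the single remaining example.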

In [None]:
lemmas, tags, inflected_forms = load_data('./conll2018/task1/all/middle-french-train-high')

In [None]:
lemmas_train, lemmas_val, tags_train, tags_val, inflected_forms_train, inflected_forms_val = train_test_split(lemmas, tags, inflected_forms, test_size=0.2, random_state=42)

In [None]:
input2index, index2input, char_vocab_size = get_combined_index_dictionaries(lemmas_train, tags_train, inflected_forms_train)


### Train

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
batch_size = 32
embedding_size = 300
hidden_size = 100
input_vocab_size = len(index2input.keys())

In [None]:
Embedder = nn.Embedding(input_vocab_size, embedding_size, padding_idx=input2index[PAD_CHAR]).to(device)
Encoder = nn.LSTM(embedding_size, hidden_size, batch_first=True, bidirectional=True).to(device)
Decoder = nn.LSTM(embedding_size, hidden_size, batch_first=True).to(device)
linear1 = nn.Linear(hidden_size, char_vocab_size).to(device)
log_softmax = nn.LogSoftmax(dim=2).to(device)
criterion = nn.NLLLoss(ignore_index=0)
params = list(Embedder.parameters()) + list(Encoder.parameters()) + list(Decoder.parameters()) + list(linear1.parameters())

In [None]:
@torch.no_grad()  # evaluation only: do not build autograd graphs
def test(lemmas_val, tags_val, inflected_forms_val, batch_size=32):

    inflected_predicted = []
    inflected_forms_true = []

    for batch in grouper(zip(lemmas_val, tags_val, inflected_forms_val), batch_size):
        batch = list(filter(lambda x: x is not None, batch))
        lemmas, tags, inflected_forms = zip(*batch)

        lemmas_indices = words_to_indices(lemmas, input2index, start_char=True, stop_char=True)
        tags_indices = tag_to_indices(tags, input2index)    
        input_indices = merge_lists(lemmas_indices, tags_indices)

        # Sort by length of input sequence
        input_indices, inflected_forms = zip(*sorted(zip(input_indices, inflected_forms), key=lambda x: len(x[0]), reverse=True))
        input_indices = [torch.tensor(lst) for lst in input_indices]


        input_tensor = pad_sequence(input_indices, padding_value=input2index[PAD_CHAR], batch_first=True).to(device)
        embedding = Embedder(input_tensor)
        lengths = [Tensor.shape[0] for Tensor in input_indices]
        packed_input = pack_padded_sequence(embedding, lengths, batch_first=True)
        encoded_packed_seq, (hidden, cell) = Encoder(packed_input)
        encoded_input = pad_packed_sequence(encoded_packed_seq, batch_first=True)[0]


        # Decode
        hidden_state = hidden[0,:,:] + hidden[1,:,:]
        hidden_state = hidden_state.unsqueeze(0)
        cell_state = torch.zeros(1, len(lengths), hidden_size).to(device)

        decoder_input = torch.tensor([input2index[START_CHAR]] * len(lengths)).to(device)
        decoder_input = Embedder(decoder_input).unsqueeze(1)

        outputs = []
        for seq in range(0, 25):
            output, (hidden_state, cell_state) = Decoder(decoder_input, (hidden_state, cell_state))
            output = F.relu(linear1(hidden_state))
            decoder_input = Embedder(output.argmax(dim=2)).transpose(0, 1)
            outputs.append(output.squeeze(0))  # drop only the layer dim, so a batch of size 1 keeps its batch dim

        batch_indices = torch.stack(outputs).argmax(dim=2).transpose(0, 1).cpu().numpy()
        for indices in batch_indices:
            inflected_predicted.append(''.join([index2input[index] for index in indices]).split(STOP_CHAR)[0])
        inflected_forms_true += inflected_forms
        
    return inflected_predicted, inflected_forms_true

In [None]:
optimiser = optim.Adagrad(params)

In [None]:
def train(lemmas_train, tags_train, inflected_forms_train, epochs=1):
    for epoch in range(epochs):

        epoch_loss = 0

        for batch in grouper(zip(lemmas_train, tags_train, inflected_forms_train), batch_size):
            batch = list(filter(lambda x: x is not None, batch))
            lemmas, tags, inflected_forms = zip(*batch)


            lemmas_indices = words_to_indices(lemmas, input2index, start_char=True, stop_char=True)
            tags_indices = tag_to_indices(tags, input2index)    
            inflected_forms_indices = words_to_indices(inflected_forms, input2index)
            input_indices = merge_lists(lemmas_indices, tags_indices)

            # Sort by length of input sequence
            input_indices, inflected_forms_indices = zip(*sorted(zip(input_indices, inflected_forms_indices), key=lambda x: len(x[0]), reverse=True))
            input_indices = [torch.tensor(lst) for lst in input_indices]

            optimiser.zero_grad()

            input_tensor = pad_sequence(input_indices, padding_value=input2index[PAD_CHAR], batch_first=True).to(device)
            embedding = Embedder(input_tensor)
            lengths = [Tensor.shape[0] for Tensor in input_indices]
            packed_input = pack_padded_sequence(embedding, lengths, batch_first=True)
            encoded_packed_seq, (hidden, cell) = Encoder(packed_input)
            encoded_input = pad_packed_sequence(encoded_packed_seq, batch_first=True)[0]


            # Decode
            hidden_state = hidden[0,:,:] + hidden[1,:,:]
            hidden_state = hidden_state.unsqueeze(0)
            cell_state = torch.zeros(1, len(lengths), hidden_size).to(device)

            target = pad_lists([lst + [input2index[STOP_CHAR]] for lst in inflected_forms_indices], input2index[PAD_CHAR]).to(device)    

            decoder_input = pad_lists(inflected_forms_indices, input2index[PAD_CHAR]).to(device)
            decoder_input = torch.cat([torch.tensor([input2index[START_CHAR]] * len(lengths)).unsqueeze(1).to(device), decoder_input], dim=1)
            decoder_input = Embedder(decoder_input)


            loss = 0

            max_length = target.shape[1]
            for seq in range(0, max_length):
                output, (hidden_state, cell_state) = Decoder(decoder_input[:,seq,:].unsqueeze(1), (hidden_state, cell_state))
                output = F.relu(linear1(hidden_state))
                output = log_softmax(output).squeeze(0)  # (batch, vocab); safe for a batch of size 1
                loss += criterion(output, target[:,seq])

            epoch_loss += loss.item()
            loss.backward()
            optimiser.step()

        print("Epoch: {}/{}\tLoss: {:.4f}\tAccuracy: {:4f}\tDistance: {:.4f} ".format(epoch, epochs, epoch_loss / len(lemmas_train), *evaluate(*test(lemmas_val, tags_val, inflected_forms_val))))

In [None]:
train(lemmas_train, tags_train, inflected_forms_train, epochs=150)

Epoch: 0/150	Loss: 0.7956	Accuracy: 0.000500	Distance: 6.2360 
Epoch: 1/150	Loss: 0.5525	Accuracy: 0.000500	Distance: 5.8865 
Epoch: 2/150	Loss: 0.4720	Accuracy: 0.006000	Distance: 5.3875 
Epoch: 3/150	Loss: 0.4212	Accuracy: 0.014000	Distance: 5.2285 
Epoch: 4/150	Loss: 0.3786	Accuracy: 0.019500	Distance: 4.9650 
Epoch: 5/150	Loss: 0.3401	Accuracy: 0.045000	Distance: 4.5215 
Epoch: 6/150	Loss: 0.3127	Accuracy: 0.057000	Distance: 4.3700 
Epoch: 7/150	Loss: 0.2901	Accuracy: 0.076000	Distance: 4.1155 
Epoch: 8/150	Loss: 0.2712	Accuracy: 0.100000	Distance: 3.8910 
Epoch: 9/150	Loss: 0.2542	Accuracy: 0.119000	Distance: 3.7620 
Epoch: 10/150	Loss: 0.2380	Accuracy: 0.150000	Distance: 3.6075 
Epoch: 11/150	Loss: 0.2257	Accuracy: 0.177000	Distance: 3.4765 
Epoch: 12/150	Loss: 0.2199	Accuracy: 0.186000	Distance: 3.4030 
Epoch: 13/150	Loss: 0.2073	Accuracy: 0.224000	Distance: 3.2465 
Epoch: 14/150	Loss: 0.1895	Accuracy: 0.222000	Distance: 3.1990 
Epoch: 15/150	Loss: 0.1824	Accuracy: 0.252500	Dist

In [None]:
print("Train accuracy:", evaluate(*test(lemmas_train, tags_train, inflected_forms_train)), "Validation accuracy: ", evaluate(*test(lemmas_val, tags_val, inflected_forms_val)))

Train accuracy: (0.97025, 0.142875) Validation accuracy:  (0.77, 0.7015)


In [None]:
list(zip(*test(lemmas_val, tags_val, inflected_forms_val)))

[('esmerveille', 'esmerveille'),
 ('enpoinouera', 'enpoisonna'),
 ('vous entrenchontrions', 'nous entrehurtons'),
 ('chevaucherai', 'chevauchieras'),
 ('tournoyassions', 'tournoyassions'),
 ('represente', 'represente'),
 ('delaçasse', 'delacyez'),
 ('encontreront', 'encontreront'),
 ('praepareroys', 'praepareroys'),
 ('esveiglez', 'esveiglez'),
 ('esbaucheroys', 'esbaucheroys'),
 ('remercyez', 'remercyes'),
 ('chargeoys', 'chargeoys'),
 ('laissoys', 'laissoys'),
 ('separois', 'separions'),
 ('depeschez', 'depeschez'),
 ('apuyoys', 'apuyoys'),
 ('pastureroyent', 'pastureroyent'),
 ('devoures', 'devoures'),
 ('mouschons', 'mouschons'),
 ('apatissois', 'apatissois'),
 ('contribueront', 'contribuant'),
 ('rincera', 'rinçasmes'),
 ('arroiez', 'arroiez'),
 ('cuyde', 'cuyde'),
 ('ieune', 'ieune'),
 ('quite', 'quite'),
 ('gaignez', 'gaignez'),
 ('pillent', 'pillent'),
 ('leichons', 'leichons'),
 ('alois', 'aloit'),
 ('endamyez', 'en dampnant'),
 ('accoustumassent', 'accoustumassent'),
 ('provo

### Development Dataset

In [None]:
lemmas_dev, tags_dev, inflected_forms_dev = load_data('./conll2018/task1/all/middle-french-dev')

In [None]:
evaluate(*test(lemmas_dev, tags_dev, inflected_forms_dev))

(0.795, 0.617)

In [None]:
for pred, true in list(zip(*test(lemmas_dev, tags_dev, inflected_forms_dev))):
    print(pred, true)

enconvenencierois enconvenencierois
pourpensiez pourpensiez
accompaignerez accompaignerez
allaictois allaictois
adioustoyt adioustoyent
menassoys menassoys
pourforce pourforce
approuche approuche
effectuas effectuas
evaporiez evaporiez
besoigneroient besoigneroient
desista desista
baptisastes baptisastes
essayoys essayoys
afferma afferma
armoias armoia
essilieroys essilieroys
exploreriez exploreriez
aornois aornois
esperassiez esperassiez
evitue evityez
gaigne gaigna
eslancent eslancent
baysassions baysassions
desyrez desyrent
estimerions estimerions
affieront affieront
mesleroient mesleroient
resveroyt resveroyt
oublye oublye
cryes cryes
en delassions en deslyant
esgratignasse esgratignasse
accomplist accomplissoyy
entitulyez entitulyez
allaigions allaictyez
gouvernyons gouvernyons
enchargeassions enchargeassent
accointasse accointyez
eschaufferiez eschaufferiez
trousseroyt troussois
inquietons inquietons
souspirera souspirera
esbauchons esbauchons
esquachent esquachent
esguisa esguis

## 3.4 Experiment 2

### Simple Attention over characters and tags

In [None]:
%matplotlib inline

from itertools import zip_longest
import os
import pickle
import time

import Levenshtein
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
import torch.nn.functional as F

In [None]:
language = 'middle-french'
dataset = 'high'

In [None]:
lemmas, tags, inflected_forms = load_data('./conll2018/task1/all/{}-train-{}'.format(language, dataset))

In [None]:
lemmas_train, lemmas_val, tags_train, tags_val, inflected_forms_train, inflected_forms_val = train_test_split(lemmas, tags, inflected_forms, test_size=0.2, random_state=42)

In [None]:
input2index, index2input, char_vocab_size = get_combined_index_dictionaries(lemmas_train, tags_train, inflected_forms_train)


### Train

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
batch_size = 32
embedding_size = 300
hidden_size = 100
input_vocab_size = len(index2input.keys())

In [None]:
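class Attention(nn.Module):
    """Multiplicative attention: score_i = (W_a h_i) . s"""

    def __init__(self, hidden_size, p=0.3):
        # p is kept for call compatibility; it is not used in this module.
        super(Attention, self).__init__()
        self.W_a = nn.Linear(2*hidden_size, hidden_size, bias=False)

    def forward(self, encoded_input, hidden_state, input_mask):
        # encoded_input: (batch, src_len, 2*hidden); hidden_state: (1, batch, hidden)
        query = hidden_state.transpose(0, 1).transpose(1, 2)  # (batch, hidden, 1)
        # squeeze(2) keeps the batch dimension even for a batch of size 1
        attn_weights = torch.bmm(self.W_a(encoded_input), query).squeeze(2)
        # Padded positions get -inf so that softmax gives them zero weight
        attn_weights = attn_weights.masked_fill(input_mask, float('-inf'))

        attn_weights = F.softmax(attn_weights, dim=1)
        return attn_weights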
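
The effect of filling padded positions with `-inf` before the softmax can be checked with a few lines of plain Python. This is an illustrative sketch with a hypothetical helper and made-up scores, independent of the `Attention` module.

```python
import math


def masked_softmax(scores, is_padding):
    """Softmax over scores where padded positions are forced to weight 0."""
    masked = [float("-inf") if pad else s for s, pad in zip(scores, is_padding)]
    highest = max(masked)
    exps = [math.exp(s - highest) for s in masked]   # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


# A source sequence of length 3 padded to length 5.
weights = masked_softmax([2.0, 1.0, 0.5, 0.0, 0.0],
                         [False, False, False, True, True])
```

Without the mask, the two padded positions would each receive a non-zero share of the attention even though they carry no input.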

In [None]:
Embedder = nn.Embedding(input_vocab_size, embedding_size, padding_idx=input2index[PAD_CHAR]).to(device)
Encoder = nn.LSTM(embedding_size, hidden_size, batch_first=True, bidirectional=True).to(device)
attention = Attention(hidden_size).to(device)
Decoder = nn.LSTM(embedding_size + 2 * hidden_size, hidden_size, batch_first=True).to(device)
linear1 = nn.Linear(3 * hidden_size, char_vocab_size - 1).to(device) # remove pad character option
dropout1 = torch.nn.Dropout(p=0)
log_softmax = nn.LogSoftmax(dim=2).to(device)
criterion = nn.NLLLoss(ignore_index=-1)
params = list(Embedder.parameters()) + list(Encoder.parameters()) + list(Decoder.parameters()) + list(linear1.parameters())+ list(attention.parameters())

In [None]:
def test(lemmas_val, tags_val, inflected_forms_val, batch_size=32, attn_debug=False, max_pred_len=25):

    inflected_predicted = []
    inflected_forms_true = []
    if attn_debug:
        input_seq = []
        attn_weights_all = []
    output_indices = []

    for batch in grouper(zip(lemmas_val, tags_val, inflected_forms_val), batch_size):
        batch = list(filter(lambda x: x is not None, batch))
        lemmas, tags, inflected_forms = zip(*batch)

        lemmas_indices = words_to_indices(lemmas, input2index, start_char=True, stop_char=True)
        tags_indices = tag_to_indices(tags, input2index)    
        input_indices = merge_lists(lemmas_indices, tags_indices)

        # Sort by length of input sequence
        input_indices, inflected_forms = zip(*sorted(zip(input_indices, inflected_forms), key=lambda x: len(x[0]), reverse=True))        
        if attn_debug:
            input_seq += [[index2input[index] for index in input_indices1] for input_indices1 in input_indices]
        input_indices = [torch.tensor(lst) for lst in input_indices]

        input_tensor = pad_sequence(input_indices, padding_value=input2index[PAD_CHAR], batch_first=True).to(device)
        embedding = Embedder(input_tensor)
        lengths = [Tensor.shape[0] for Tensor in input_indices]
        packed_input = pack_padded_sequence(embedding, lengths, batch_first=True)
        encoded_packed_seq, (hidden, cell) = Encoder(packed_input)
        encoded_input = pad_packed_sequence(encoded_packed_seq, batch_first=True)[0]
        
        input_mask = input_tensor == 0
        # input_mask = torch.where(input_mask > 0, torch.zeros_like(input_mask), torch.ones_like(input_mask) * float('inf')) 


        # Decode
        hidden_state = hidden[0,:,:] + hidden[1,:,:]
        hidden_state = hidden_state.unsqueeze(0)
        cell_state = torch.zeros(1, len(lengths), hidden_size).to(device)

        decoder_input = torch.tensor([input2index[START_CHAR]] * len(lengths)).to(device)
        decoder_input = Embedder(decoder_input).unsqueeze(1)


        outputs = []
        attn_weights_sequence = []
        for seq in range(0, max_pred_len):
            attn_weights = attention(encoded_input, hidden_state, input_mask)
            if attn_debug:
                attn_weights_sequence.append(attn_weights)
            context = torch.bmm(attn_weights.unsqueeze(1), encoded_input).squeeze()

            decoder_input_concat = torch.cat([context.unsqueeze(1), decoder_input], dim=2)

            output, (hidden_state, cell_state) = Decoder(decoder_input_concat, (hidden_state, cell_state))
            output = F.relu(linear1(torch.cat([hidden_state, context.unsqueeze(0)], dim=2)))            
            decoder_input = Embedder(output.argmax(dim=2) + 1).transpose(0, 1)
            outputs.append(output.squeeze())

        if attn_debug:
            attn_weights_sequence = torch.stack(attn_weights_sequence, dim=1)
            attn_weights_all.append(attn_weights_sequence)
        output_indices.append(torch.stack(outputs).argmax(dim=2))
        inflected_forms_true += inflected_forms

    output_indices = torch.cat(output_indices, dim=1).transpose(0, 1).cpu().numpy()
    for indices in output_indices:
        inflected_predicted.append(''.join([index2input[index + 1] for index in indices]).split(STOP_CHAR)[0])
            
    if attn_debug:
        max_src_len = max([x.shape[2] for x in attn_weights_all])
        for i in range(len(attn_weights_all)):
            attn_weights = attn_weights_all[i]
            batch_size = attn_weights.shape[0]
            src_len = attn_weights.shape[2]
            attn_weights_all[i] = torch.zeros(batch_size, max_pred_len, max_src_len)
            attn_weights_all[i][:,:,:src_len] = attn_weights
        attn_weights_all = torch.cat(attn_weights_all, dim=0)
        return inflected_predicted, input_seq, attn_weights_all
    else:
        return inflected_predicted, inflected_forms_true
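
The last step of `test` turns predicted class indices back into a string: it adds 1 to undo the removal of the padding class, joins the characters, and keeps only the part before the first stop symbol. A small sketch of that step, assuming a made-up four-entry vocabulary and a hypothetical stop symbol `'>'` (the notebook uses its own `STOP_CHAR` and `index2input`):

```python
STOP_CHAR = '>'  # hypothetical stop symbol; the notebook defines its own constant

# Hypothetical vocabulary: index 0 is padding, so output class k means index k + 1.
index2input = {0: '<pad>', 1: 'a', 2: 'b', 3: STOP_CHAR}

def classes_to_word(class_ids, index2input, stop_char=STOP_CHAR):
    """Map output classes back to characters and cut at the first stop symbol."""
    chars = ''.join(index2input[class_id + 1] for class_id in class_ids)
    return chars.split(stop_char)[0]

stopped = classes_to_word([0, 1, 2, 1, 1], index2input)    # stop symbol at step 3
unstopped = classes_to_word([0, 0, 0, 0, 0], index2input)  # stop symbol never emitted
print(stopped, unstopped)
```

When the stop symbol never appears, the prediction keeps every decoded step, so its length equals the number of decoding steps (`max_pred_len` in `test`).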

In [None]:
optimiser = optim.Adagrad(params, lr=0.1)

In [None]:
def train(lemmas_train, tags_train, inflected_forms_train, epochs=1):
    start_time = time.time()

    for epoch in range(epochs):

        epoch_loss = 0

        for batch in grouper(zip(lemmas_train, tags_train, inflected_forms_train), batch_size):
            batch = list(filter(lambda x: x is not None, batch))
            lemmas, tags, inflected_forms = zip(*batch)


            lemmas_indices = words_to_indices(lemmas, input2index, start_char=True, stop_char=True)
            tags_indices = tag_to_indices(tags, input2index)    
            inflected_forms_indices = words_to_indices(inflected_forms, input2index)
            input_indices = merge_lists(lemmas_indices, tags_indices)

            # Sort by length of input sequence
            input_indices, inflected_forms_indices = zip(*sorted(zip(input_indices, inflected_forms_indices), key=lambda x: len(x[0]), reverse=True))
            input_indices = [torch.tensor(lst) for lst in input_indices]

            optimiser.zero_grad()

            input_tensor = pad_sequence(input_indices, padding_value=input2index[PAD_CHAR], batch_first=True).to(device)
            embedding = Embedder(input_tensor)
            lengths = [Tensor.shape[0] for Tensor in input_indices]
            packed_input = pack_padded_sequence(embedding, lengths, batch_first=True)
            encoded_packed_seq, (hidden, cell) = Encoder(packed_input)
            encoded_input = pad_packed_sequence(encoded_packed_seq, batch_first=True)[0]
            
            input_mask = input_tensor == 0
            # input_mask = torch.where(input_mask > 0, torch.zeros_like(input_mask), torch.ones_like(input_mask) * float('inf')) 

            # Decode
            hidden_state = hidden[0,:,:] + hidden[1,:,:]
            hidden_state = hidden_state.unsqueeze(0)
            cell_state = torch.zeros(1, len(lengths), hidden_size).to(device)

            target = pad_lists([lst + [input2index[STOP_CHAR]] for lst in inflected_forms_indices], input2index[PAD_CHAR]).to(device)    
            target = target - 1

            decoder_input = pad_lists(inflected_forms_indices, input2index[PAD_CHAR]).to(device)
            decoder_input = torch.cat([torch.tensor([input2index[START_CHAR]] * len(lengths)).unsqueeze(1).to(device), decoder_input], dim=1)
            decoder_input = Embedder(decoder_input)

            loss = 0

            max_length = target.shape[1]
            for seq in range(0, max_length):
                attn_weights = attention(encoded_input, hidden_state, input_mask)
                context = torch.bmm(attn_weights.unsqueeze(1), encoded_input).squeeze()
                
                decoder_input_concat = torch.cat([context.unsqueeze(1), decoder_input[:,seq,:].unsqueeze(1)], dim=2)
                
                output, (hidden_state, cell_state) = Decoder(decoder_input_concat, (hidden_state, cell_state))
                output = F.relu(linear1(torch.cat([hidden_state, context.unsqueeze(0)], dim=2)))
                output = dropout1(output)
                output = log_softmax(output).squeeze()
                loss += criterion(output, target[:,seq])

            epoch_loss += loss.item()
            loss.backward()
            optimiser.step()

        print("Epoch: {}/{}\tTime: {:.2f}s\tLoss: {:.4f}\tAccuracy: {:.4f}\tDistance: {:.4f} ".format(epoch+1, epochs, time.time() - start_time, epoch_loss / len(lemmas_train), *evaluate(*test(lemmas_val, tags_val, inflected_forms_val))))
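
The training loop shifts every target index down by one (`target = target - 1`). Padding (index 0) thus becomes -1, which `NLLLoss(ignore_index=-1)` skips, and the remaining characters line up with the `char_vocab_size - 1` output classes. A plain-Python sketch of this masking, with a made-up three-character vocabulary and a hypothetical helper `masked_nll` that averages over the targets that are not ignored:

```python
import math

PAD = 0  # padding index, as in input2index[PAD_CHAR]

def masked_nll(log_probs, targets, ignore_index=-1):
    """Mean negative log-likelihood over the targets that are not ignored."""
    losses = [-row[t] for row, t in zip(log_probs, targets) if t != ignore_index]
    return sum(losses) / len(losses)

# Hypothetical vocabulary indices: 0 = padding, 1..3 = real characters.
targets = [2, 3, PAD]
shifted = [t - 1 for t in targets]  # padding becomes -1 and is skipped

# One row of log-probabilities per target over the 3 non-padding classes.
uniform = [math.log(1 / 3)] * 3
log_probs = [uniform, uniform, uniform]

loss = masked_nll(log_probs, shifted)
print(shifted, round(loss, 4))
```

With uniform predictions over three classes the loss is `log(3)`, and the padded position contributes nothing to it.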

In [None]:
train(lemmas_train, tags_train, inflected_forms_train, epochs=70)

Epoch: 1/70	Time: 11.73s	Loss: 1.6367	Accuracy: 0.0000	Distance: 16.6345 
Epoch: 2/70	Time: 25.15s	Loss: 1.4519	Accuracy: 0.0000	Distance: 19.5775 
Epoch: 3/70	Time: 38.63s	Loss: 1.3849	Accuracy: 0.0000	Distance: 16.6520 
Epoch: 4/70	Time: 52.03s	Loss: 1.3729	Accuracy: 0.0000	Distance: 16.1330 
Epoch: 5/70	Time: 65.65s	Loss: 1.3632	Accuracy: 0.0000	Distance: 14.3960 
Epoch: 6/70	Time: 79.01s	Loss: 1.3580	Accuracy: 0.0000	Distance: 14.6755 
Epoch: 7/70	Time: 92.29s	Loss: 1.3412	Accuracy: 0.0000	Distance: 15.3875 
Epoch: 8/70	Time: 105.72s	Loss: 1.3313	Accuracy: 0.0000	Distance: 15.1705 
Epoch: 9/70	Time: 119.14s	Loss: 1.3258	Accuracy: 0.0000	Distance: 14.9580 
Epoch: 10/70	Time: 132.53s	Loss: 1.3227	Accuracy: 0.0000	Distance: 14.6830 
Epoch: 11/70	Time: 145.83s	Loss: 1.3143	Accuracy: 0.0000	Distance: 14.7305 
Epoch: 12/70	Time: 159.10s	Loss: 1.3147	Accuracy: 0.0000	Distance: 13.5515 
Epoch: 13/70	Time: 172.54s	Loss: 1.3143	Accuracy: 0.0000	Distance: 14.5335 
Epoch: 14/70	Time: 186.05s	L

In [None]:
print("Train accuracy:", evaluate(*test(lemmas_train, tags_train, inflected_forms_train)), "Validation accuracy: ", evaluate(*test(lemmas_val, tags_val, inflected_forms_val)))

Train accuracy: (0.0, 12.706375) Validation accuracy:  (0.0, 12.95)


In [None]:
optimiser = optim.Adagrad(params, lr=0.01)

In [None]:
train(lemmas_train, tags_train, inflected_forms_train, epochs=30)

Epoch: 1/30	Time: 11.60s	Loss: 1.3157	Accuracy: 0.0000	Distance: 12.3610 
Epoch: 2/30	Time: 24.88s	Loss: 1.3131	Accuracy: 0.0000	Distance: 12.0975 
Epoch: 3/30	Time: 38.17s	Loss: 1.3130	Accuracy: 0.0000	Distance: 11.9540 
Epoch: 4/30	Time: 51.38s	Loss: 1.3123	Accuracy: 0.0000	Distance: 11.8285 
Epoch: 5/30	Time: 64.61s	Loss: 1.3122	Accuracy: 0.0000	Distance: 11.4050 
Epoch: 6/30	Time: 77.90s	Loss: 1.3119	Accuracy: 0.0000	Distance: 12.0620 
Epoch: 7/30	Time: 91.13s	Loss: 1.3119	Accuracy: 0.0000	Distance: 11.9110 
Epoch: 8/30	Time: 104.33s	Loss: 1.3116	Accuracy: 0.0000	Distance: 11.8495 
Epoch: 9/30	Time: 117.57s	Loss: 1.3112	Accuracy: 0.0000	Distance: 11.7805 
Epoch: 10/30	Time: 130.95s	Loss: 1.3110	Accuracy: 0.0000	Distance: 12.1065 
Epoch: 11/30	Time: 144.29s	Loss: 1.3103	Accuracy: 0.0000	Distance: 12.0985 
Epoch: 12/30	Time: 157.57s	Loss: 1.3100	Accuracy: 0.0000	Distance: 12.1470 
Epoch: 13/30	Time: 170.83s	Loss: 1.3097	Accuracy: 0.0000	Distance: 12.1355 
Epoch: 14/30	Time: 184.04s	L

In [None]:
print("Train accuracy:", evaluate(*test(lemmas_train, tags_train, inflected_forms_train)), "Validation accuracy: ", evaluate(*test(lemmas_val, tags_val, inflected_forms_val)))

Train accuracy: (0.0, 11.718875) Validation accuracy:  (0.0, 11.9275)


### Development Dataset

In [None]:
lemmas_dev, tags_dev, inflected_forms_dev = load_data('./middle-french-train-high')

In [None]:
evaluate(*test(lemmas_dev, tags_dev, inflected_forms_dev))

(0.0, 11.7606)

In [None]:
for prediction, truth in zip(*test(lemmas_dev, tags_dev, inflected_forms_dev)):
    print("Prediction: {}\tTruth: {}".format(prediction, truth))

Streaming output truncated to the last 5000 lines.
Prediction: ent	Truth: desrisastes
Prediction: seeeeeeeeeeeeeeeeeeeeeeee	Truth: proffiterions
Prediction: eneeeeeeeeeeeeeeeeeeeeeee	Truth: estoffassiez
Prediction: ent	Truth: surnommera
Prediction: enseeses	Truth: subiecteront
Prediction: en	Truth: impugneront
Prediction: ent	Truth: branslerions
Prediction: ent	Truth: deslyast
Prediction: eneeeeeeeeeeeeeeeeeeeeeee	Truth: pastureryez
Prediction: ent	Truth: gaboient
Prediction: eses	Truth: esleves
Prediction: ent	Truth: enortes
Prediction: es	Truth: guarde
Prediction: ent	Truth: meslast
Prediction: seses	Truth: flatterons
Prediction: ent	Truth: destablant
Prediction: ent	Truth: estourdir
Prediction: eentrares	Truth: en esprouvant
Prediction: eses	Truth: garde
Prediction: ent	Truth: aymerent
Prediction: eneeeeeeeeeeeeeeeeeeeeeee	Truth: ayderont
Prediction: ent	Truth: guide
Prediction: ent	Truth: vont
Prediction: en	Truth: eu
Prediction: ent	Truth: entredemandoyent
Prediction

## 3.5 Experiment 3

### Pointer Generator over characters and tags

In [None]:
%matplotlib inline

from itertools import zip_longest
import os
import pickle
import time

import Levenshtein
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
import torch.nn.functional as F

In [None]:
def load_data(file_name):
    """Loads data.

    Args:
        file_name: path to file containing the data

    Returns:
        lemmas: list of lemma
        tags: list of tags
        inflected_forms: list of inflected form
    """

    with open(file_name, 'r', encoding='utf-8') as file:
        text = file.read()

    lemmas = []
    tags = []
    inflected_forms = []

    for line in text.split('\n')[:-1]:
        lemma, inflected_form, tag = line.split('\t')
        lemmas.append(lemma)
        inflected_forms.append(inflected_form)
        tags.append(tag) 

    return lemmas, tags, inflected_forms

def get_index_dictionaries(lemmas, tags, inflected_forms):
    """Returns char2index, index2char, tag2index

    Args:
        lemmas: list of lemma
        tags: list of tags
        inflected_forms: list of inflected form

    Returns: 
        char2index: a dictionary which maps character to index
        index2char: a dictionary which maps index to character
        tag2index: a dictionary which maps morphological tag to index 
    """

    unique_chars = set(''.join(lemmas) + ''.join(inflected_forms))
    unique_chars.update({START_CHAR, STOP_CHAR}) # special start and end symbols
    unique_chars.add(UNKNOWN_CHAR) # special character for unknown characters
    char2index = {}
    index2char = {}

    char2index[PAD_CHAR] = 0
    index2char[0] = PAD_CHAR
    
    for index, char in enumerate(unique_chars):
        char2index[char] = index + 1
        index2char[index + 1] = char

    unique_tags = set(';'.join(tags).split(';'))
    unique_tags.add(UNKNOWN_TAG) # add the tag as a whole, not its individual characters
    tag2index = {tag:index+1 for index, tag in enumerate(unique_tags)}
    tag2index[PAD_TAG] = 0

    return char2index, index2char, tag2index

def get_combined_index_dictionaries(lemmas, tags, inflected_forms):
    """Returns input2index, index2input, char_vocab_length

    Args:
        lemmas: list of lemma
        tags: list of tags
        inflected_forms: list of inflected form

    Returns: 
        input2index: a dictionary which maps inputs (characters and tags) to index
        index2input: a dictionary which maps index to inputs
        char_vocab_length: number of character entries, including the padding character
    """

    unique_chars = set(''.join(lemmas) + ''.join(inflected_forms))
    unique_chars.update({START_CHAR, STOP_CHAR, UNKNOWN_CHAR}) # special start, end and unknown symbols
    
    input2index = {}
    index2input = {}
    
    input2index[PAD_CHAR] = 0
    index2input[0] = PAD_CHAR
        
    for index, char in enumerate(unique_chars, start=1):
        input2index[char] = index
        index2input[index] = char
        
    char_vocab_length = len(input2index.keys())

    unique_tags = set(';'.join(tags).split(';'))
    unique_tags.add(UNKNOWN_TAG) # special character for unknown tags
    
    for index, char in enumerate(unique_tags, start=char_vocab_length):
        input2index[char] = index
        index2input[index] = char

    return input2index, index2input, char_vocab_length

def words_to_indices(words, char2index, tensor=False, start_char=False, stop_char=False):
    """Converts a list of words to a list of lists containing indices

    Args:
        words: list of words
        char2index: dictionary which maps character to index
        tensor: if True, return each word as a tensor instead of a list
        start_char: if True, prepend the index of START_CHAR to each word
        stop_char: if True, append the index of STOP_CHAR to each word

    Returns:
        list_indices: list of lists/tensors containing indices for a sequence of characters
    """

    list_indices = []
    for word in words:
        word_indices = []
        if start_char:
            word_indices.append(char2index[START_CHAR])
        for char in word:
            if char in char2index.keys():
                word_indices.append(char2index[char])
            else:
                word_indices.append(char2index[UNKNOWN_CHAR])
        if stop_char:
            word_indices.append(char2index[STOP_CHAR])
        if tensor:
            word_indices = torch.tensor(word_indices, dtype=torch.long)  # indices must be integers
        list_indices.append(word_indices)

    return list_indices

def tag_to_vector(tags, tag2index):
    """Returns multi-hot representation of tags given a list of tags.

    Args:
        tags: list of string representation of tag (eg, V;IND;PRS;2;PL)
        tag2index: dictionary which maps morphological tag to index

    Returns:
        tag_vectors: list of 1D tensors with multi-hot representation of tags 
    """

    tag_vectors = []
    for tag in tags:
        tag_vector = torch.zeros(len(tag2index))
        for tag_feature in tag.split(';'):
            if tag_feature in tag2index:
                tag_vector[tag2index[tag_feature]] = 1
            else:
                tag_vector[tag2index[UNKNOWN_TAG]] = 1
        tag_vectors.append(tag_vector)
    return tag_vectors

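def tag_to_indices(tags, tag2index):
    """Converts list of tags to a list of lists containing indices

    Args:
        tags: list of string representation of tag (eg, V;IND;PRS;2;PL)
        tag2index: dictionary which maps sub-tag to index

    Returns:
        list_indices: list of lists containing indices of sub_tags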
    """
    
    list_indices = []
    for tag in tags:
        tag_indices = []
        for sub_tag in tag.split(';'):
            if sub_tag in tag2index.keys():
                tag_indices.append(tag2index[sub_tag])
            else:
                tag_indices.append(tag2index[UNKNOWN_TAG])
        list_indices.append(tag_indices)

    return list_indices
    

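def indices_to_word(indices, index2char):
    """Returns a word given a list containing indices of characters.

    The final character is dropped (presumably the stop symbol).

    Args:
        indices: list containing indices
        index2char: dictionary which maps index to character

    Returns:
        word: a string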
    """

    return ''.join([index2char[index] for index in indices])[:-1]

def pad_lists(lists, pad_int, pad_len=None):
    """Pads lists in a list to make them of equal size"""
    
    if pad_len is None:
        pad_len = max([len(lst) for lst in lists])
    new_list = []
    for lst in lists:
        if len(lst) < pad_len:
            new_list.append(torch.tensor(lst + [pad_int] * (pad_len-len(lst))))
        else:
            new_list.append(torch.tensor(lst[:pad_len]))
    return torch.stack(new_list)

def merge_lists(lists1, lists2):
    """Concatenates two lists of lists element-wise."""
    
    merged_lists = []
    for list1, list2 in zip(lists1, lists2):
        merged_lists.append(list1 + list2)
    return merged_lists

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def accuracy(predictions, targets):
    correct_count = 0
    for prediction, target in zip(predictions, targets):
        if prediction == target:
            correct_count += 1
    return correct_count / len(predictions)

def average_distance(predictions, targets):
    total_distance = 0
    for prediction, target in zip(predictions, targets):
        total_distance += Levenshtein.distance(prediction, target)
    return total_distance / len(predictions)

def evaluate(predictions, targets):
    return accuracy(predictions, targets), average_distance(predictions, targets)
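
`grouper` pads the last batch with `None`, which is why the training and test loops filter `None` out before unpacking a batch. The sketch below shows this, together with the two metrics returned by `evaluate`. It is self-contained: `edit_distance` is a hand-written stand-in for `Levenshtein.distance`, and the prediction and target strings are made up for illustration.

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# The last chunk is padded with None, so the padding is filtered out.
batches = [[x for x in batch if x is not None] for batch in grouper('ABCDEFG', 3)]
print(batches)

predictions = ['ent', 'garde']
targets = ['vont', 'garde']
acc = sum(p == t for p, t in zip(predictions, targets)) / len(predictions)
dist = sum(edit_distance(p, t) for p, t in zip(predictions, targets)) / len(predictions)
print(acc, dist)
```

Accuracy counts only exact matches, while the average distance still gives partial credit to predictions that are close to the target.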

In [None]:
language = 'middle-french'
dataset = 'high'

In [None]:
lemmas, tags, inflected_forms = load_data('./conll2018/task1/all/{}-train-{}'.format(language, dataset))

In [None]:
lemmas_train, lemmas_val, tags_train, tags_val, inflected_forms_train, inflected_forms_val = train_test_split(lemmas, tags, inflected_forms, test_size=0.2, random_state=42)

In [None]:
input2index, index2input, char_vocab_size = get_combined_index_dictionaries(lemmas_train, tags_train, inflected_forms_train)
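
`get_combined_index_dictionaries` lays out the shared vocabulary in three blocks: padding at index 0, characters at indices 1 to `char_vocab_size - 1`, and tags from `char_vocab_size` onwards. This layout is what lets later cells slice out the character range with `[:, 1:char_vocab_size]`. A simplified sketch of the layout follows. The helper `build_combined_index` is hypothetical, and it sorts the symbols, which the notebook does not do.

```python
def build_combined_index(chars, tags, pad='<pad>'):
    """Index padding first, then characters, then tags."""
    item2index = {pad: 0}
    for char in sorted(set(chars)):  # sorted only to make the order reproducible
        item2index[char] = len(item2index)
    char_vocab_size = len(item2index)  # padding plus all characters
    for tag in sorted(set(tags)):
        item2index[tag] = len(item2index)
    return item2index, char_vocab_size

item2index, char_vocab_size = build_combined_index('abca', ['V', 'PL', 'V'])
print(item2index, char_vocab_size)
```

The notebook iterates over a plain `set`, whose order for strings can differ between Python runs. Sorting, as in this sketch, is one way to keep the index assignment stable if a trained model is ever saved and reloaded.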


### Train

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
batch_size = 32
embedding_size = 300
hidden_size = 100
input_vocab_size = len(index2input.keys())

In [None]:
class Attention(nn.Module):
    def __init__(self, hidden_size, p=0.3):
        # general scoring function - global attention 

        super(Attention, self).__init__()
        self.W_a = nn.Linear(2*hidden_size, hidden_size, bias=False)
        # self.dropout = nn.Dropout(p)

    def forward(self, encoded_input, hidden_state, input_mask):
        src_len = encoded_input.shape[1]
        attn_weights = torch.bmm(self.W_a(encoded_input), hidden_state.transpose(0, 1).transpose(1, 2)).squeeze()
        attn_weights.data.masked_fill_(input_mask, float('-inf'))
        # attn_weights = dropout(attn_weights)
        attn_weights = F.softmax(attn_weights, dim=1)
        return attn_weights

In [None]:
class GenerationProbability(nn.Module):
    def __init__(self, embedding_size, hidden_size):
        super(GenerationProbability, self).__init__()
        self.linear_context = nn.Linear(2*hidden_size, 1)
        self.linear_hidden = nn.Linear(hidden_size, 1)
        self.linear_input = nn.Linear(embedding_size, 1)
        
    def forward(self, context, hidden_state, decoder_input):
        return torch.sigmoid(self.linear_context(context) + self.linear_hidden(hidden_state.squeeze(0)) + self.linear_input(decoder_input))
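
The generation probability `p_gen` weights two distributions against each other: the decoder's distribution over characters and a copy distribution built from the attention weights. The final distribution is `p_gen * p_vocab + (1 - p_gen) * p_copy`, where `p_copy` sums the attention weights of all source positions that hold the same symbol (the role of `scatter_add_` in the cells below). A plain-Python sketch for one decoding step, with made-up numbers and hypothetical helper names:

```python
def copy_distribution(source_ids, attn_weights, vocab_size):
    """Sum attention weights per source symbol, like scatter_add_ on one row."""
    p_copy = [0.0] * vocab_size
    for symbol, weight in zip(source_ids, attn_weights):
        p_copy[symbol] += weight
    return p_copy

def mix(p_gen, p_vocab, p_copy):
    """Blend the generation and copy distributions with weight p_gen."""
    return [p_gen * v + (1.0 - p_gen) * c for v, c in zip(p_vocab, p_copy)]

source_ids = [1, 2, 1]            # symbol 1 occurs twice in the source
attn = [0.5, 0.25, 0.25]          # attention weights over the source positions
p_copy = copy_distribution(source_ids, attn, vocab_size=4)
p_vocab = [0.25, 0.25, 0.25, 0.25]
p_w = mix(0.5, p_vocab, p_copy)
print(p_copy)
print(p_w)
```

In this sketch both distributions cover the same vocabulary, so the mixture sums to 1. In the notebook the copy distribution is additionally sliced to the character range, so attention mass that falls on tags is discarded.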

In [None]:
Embedder = nn.Embedding(input_vocab_size, embedding_size, padding_idx=input2index[PAD_CHAR]).to(device)
Encoder = nn.LSTM(embedding_size, hidden_size, batch_first=True, bidirectional=True).to(device)
attention = Attention(hidden_size).to(device)
Decoder = nn.LSTM(embedding_size + 2 * hidden_size, hidden_size, batch_first=True).to(device)
linear_vocab = nn.Linear(3 * hidden_size, char_vocab_size - 1).to(device) # remove pad character option
generation_prob = GenerationProbability(embedding_size, hidden_size).to(device)
log_softmax = nn.LogSoftmax(dim=2).to(device)
criterion = nn.NLLLoss(ignore_index=-1)
params = list(Embedder.parameters()) + list(Encoder.parameters()) + list(Decoder.parameters()) + \
        list(linear_vocab.parameters())+ list(attention.parameters()) + list(generation_prob.parameters())

In [None]:
def test(lemmas_val, tags_val, inflected_forms_val, batch_size=32, attn_debug=False, max_pred_len=25):

    inflected_predicted = []
    inflected_forms_true = []
    if attn_debug:
        input_seq = []
        attn_weights_all = []
        p_gen_all = []
    output_indices = []

    for batch in grouper(zip(lemmas_val, tags_val, inflected_forms_val), batch_size):
        batch = list(filter(lambda x: x is not None, batch))
        lemmas, tags, inflected_forms = zip(*batch)

        lemmas_indices = words_to_indices(lemmas, input2index, start_char=True, stop_char=True)
        tags_indices = tag_to_indices(tags, input2index)    
        input_indices = merge_lists(lemmas_indices, tags_indices)

        # Sort by length of input sequence
        input_indices, inflected_forms = zip(*sorted(zip(input_indices, inflected_forms), key=lambda x: len(x[0]), reverse=True))        
        if attn_debug:
            input_seq += [[index2input[index] for index in input_indices1] for input_indices1 in input_indices]
        input_indices = [torch.tensor(lst) for lst in input_indices]

        input_tensor = pad_sequence(input_indices, padding_value=input2index[PAD_CHAR], batch_first=True).to(device)
        embedding = Embedder(input_tensor)
        lengths = [Tensor.shape[0] for Tensor in input_indices]
        packed_input = pack_padded_sequence(embedding, lengths, batch_first=True)
        encoded_packed_seq, (hidden, cell) = Encoder(packed_input)
        encoded_input = pad_packed_sequence(encoded_packed_seq, batch_first=True)[0]

        input_mask = input_tensor == 0
        # input_mask = torch.where(input_mask > 0, torch.zeros_like(input_mask), torch.ones_like(input_mask) * float('inf')) 


        # Decode
        hidden_state = hidden[0,:,:] + hidden[1,:,:]
        hidden_state = hidden_state.unsqueeze(0)
        cell_state = torch.zeros(1, len(lengths), hidden_size).to(device)

        decoder_input = torch.tensor([input2index[START_CHAR]] * len(lengths)).to(device)
        decoder_input = Embedder(decoder_input)


        outputs = []
        attn_weights_sequence = []
        p_gen_seq = []
        for seq in range(0, max_pred_len):
            attn_weights = attention(encoded_input, hidden_state, input_mask)
            context = torch.bmm(attn_weights.unsqueeze(1), encoded_input).squeeze()

            decoder_input_concat = torch.cat([context, decoder_input], dim=1)

            output, (hidden_state, cell_state) = Decoder(decoder_input_concat.unsqueeze(1), (hidden_state, cell_state))
            p_vocab = F.relu(linear_vocab(torch.cat([hidden_state, context.unsqueeze(0)], dim=2)))    
            p_vocab = F.softmax(p_vocab.squeeze(), dim=1)
            p_gen = generation_prob(context, hidden_state, decoder_input)
            p_atten = torch.zeros(attn_weights.shape[0], len(input2index)).to(device)
            p_atten.scatter_add_(1, input_tensor, attn_weights)
            p_w = p_gen * p_vocab + (1 - p_gen) * p_atten[:, 1:char_vocab_size]
            decoder_input = Embedder(p_w.argmax(dim=1) + 1)
            outputs.append(p_w)
            if attn_debug:
                attn_weights_sequence.append(attn_weights)
                p_gen_seq.append(p_gen)

        if attn_debug:
            attn_weights_sequence = torch.stack(attn_weights_sequence, dim=1)
            attn_weights_all.append(attn_weights_sequence)
            p_gen_all.append(torch.cat(p_gen_seq, dim=1))
        output_indices.append(torch.stack(outputs).argmax(dim=2))
        inflected_forms_true += inflected_forms

    output_indices = torch.cat(output_indices, dim=1).transpose(0, 1).cpu().numpy()
    for indices in output_indices:
        inflected_predicted.append(''.join([index2input[index + 1] for index in indices]).split(STOP_CHAR)[0])

    if attn_debug:
        max_src_len = max([x.shape[2] for x in attn_weights_all])
        for i in range(len(attn_weights_all)):
            attn_weights = attn_weights_all[i]
            batch_size = attn_weights.shape[0]
            src_len = attn_weights.shape[2]
            attn_weights_all[i] = torch.zeros(batch_size, max_pred_len, max_src_len)
            attn_weights_all[i][:,:,:src_len] = attn_weights
        attn_weights_all = torch.cat(attn_weights_all, dim=0)
        p_gen_all = torch.cat(p_gen_all, dim=0)
        return inflected_predicted, input_seq, attn_weights_all, p_gen_all
    else:
        return inflected_predicted, inflected_forms_true

In [None]:
def train(lemmas_train, tags_train, inflected_forms_train, epochs=1):
    start_time = time.time()
    
    for epoch in range(epochs):

        epoch_loss = 0

        for batch in grouper(zip(lemmas_train, tags_train, inflected_forms_train), batch_size):
            batch = list(filter(lambda x: x is not None, batch))
            lemmas, tags, inflected_forms = zip(*batch)


            lemmas_indices = words_to_indices(lemmas, input2index, start_char=True, stop_char=True)
            tags_indices = tag_to_indices(tags, input2index)    
            inflected_forms_indices = words_to_indices(inflected_forms, input2index)
            input_indices = merge_lists(lemmas_indices, tags_indices)

            # Sort by length of input sequence
            input_indices, inflected_forms_indices = zip(*sorted(zip(input_indices, inflected_forms_indices), key=lambda x: len(x[0]), reverse=True))
            input_indices = [torch.tensor(lst) for lst in input_indices]

            optimiser.zero_grad()

            input_tensor = pad_sequence(input_indices, padding_value=input2index[PAD_CHAR], batch_first=True).to(device)
            embedding = Embedder(input_tensor)
            lengths = [Tensor.shape[0] for Tensor in input_indices]
            packed_input = pack_padded_sequence(embedding, lengths, batch_first=True)
            encoded_packed_seq, (hidden, cell) = Encoder(packed_input)
            encoded_input = pad_packed_sequence(encoded_packed_seq, batch_first=True)[0]

            input_mask = input_tensor == 0
            # input_mask = torch.where(input_mask > 0, torch.zeros_like(input_mask), torch.ones_like(input_mask) * float('inf')) 

            # Decode
            hidden_state = hidden[0,:,:] + hidden[1,:,:]
            hidden_state = hidden_state.unsqueeze(0)
            cell_state = torch.zeros(1, len(lengths), hidden_size).to(device)

            target = pad_lists([lst + [input2index[STOP_CHAR]] for lst in inflected_forms_indices], input2index[PAD_CHAR]).to(device)    
            target = target - 1

            decoder_input = pad_lists(inflected_forms_indices, input2index[PAD_CHAR]).to(device)
            decoder_input = torch.cat([torch.tensor([input2index[START_CHAR]] * len(lengths)).unsqueeze(1).to(device), decoder_input], dim=1)
            decoder_input = Embedder(decoder_input)

            loss = 0

            max_length = target.shape[1]
            for seq in range(0, max_length):
                attn_weights = attention(encoded_input, hidden_state, input_mask)
                context = torch.bmm(attn_weights.unsqueeze(1), encoded_input).squeeze()

                decoder_input_concat = torch.cat([context.unsqueeze(1), decoder_input[:,seq,:].unsqueeze(1)], dim=2)

                output, (hidden_state, cell_state) = Decoder(decoder_input_concat, (hidden_state, cell_state))
                p_vocab = F.relu(linear_vocab(torch.cat([hidden_state, context.unsqueeze(0)], dim=2)))
                p_vocab = F.softmax(p_vocab.squeeze(), dim=1)

                p_gen = generation_prob(context, hidden_state, decoder_input[:,seq,:])
                p_atten = torch.zeros(attn_weights.shape[0], len(input2index)).to(device)
                p_atten.scatter_add_(1, input_tensor, attn_weights)
                p_w = p_gen * p_vocab + (1 - p_gen) * p_atten[:, 1:char_vocab_size]

                p_w = torch.log(p_w)
                loss += criterion(p_w, target[:,seq])

            # Add the summed batch loss once per batch, not once per decoding step
            epoch_loss += loss.item()
            loss.backward()
            optimiser.step()
        print("Epoch: {}/{}\tTime: {:.2f}s\tLoss: {:.4f}\tAccuracy: {:.4f}\tDistance: {:.4f} ".format(epoch+1, epochs, time.time() - start_time, epoch_loss / len(lemmas_train), *evaluate(*test(lemmas_val, tags_val, inflected_forms_val))))

In [None]:
optimiser = optim.Adagrad(params, lr=0.1)

In [None]:
%prun train(lemmas_train, tags_train, inflected_forms_train, epochs=50)



Epoch: 1/50	Time: 16.64s	Loss: 14.5957	Accuracy: 0.0000	Distance: 12.7390 
Epoch: 2/50	Time: 35.58s	Loss: 13.8855	Accuracy: 0.0000	Distance: 17.2985 
Epoch: 3/50	Time: 54.31s	Loss: 13.6634	Accuracy: 0.0000	Distance: 16.7880 
Epoch: 4/50	Time: 73.19s	Loss: 13.4057	Accuracy: 0.0000	Distance: 19.1180 
Epoch: 5/50	Time: 92.14s	Loss: 13.1169	Accuracy: 0.0000	Distance: 20.7965 
Epoch: 6/50	Time: 111.04s	Loss: 12.9084	Accuracy: 0.0000	Distance: 21.7560 
Epoch: 7/50	Time: 129.62s	Loss: 12.8018	Accuracy: 0.0000	Distance: 21.1950 
Epoch: 8/50	Time: 148.41s	Loss: 12.7310	Accuracy: 0.0000	Distance: 21.3260 
Epoch: 9/50	Time: 167.23s	Loss: 12.6687	Accuracy: 0.0000	Distance: 20.6375 
Epoch: 10/50	Time: 185.90s	Loss: 12.5775	Accuracy: 0.0000	Distance: 20.2160 
Epoch: 11/50	Time: 204.67s	Loss: 12.5116	Accuracy: 0.0000	Distance: 18.8285 
Epoch: 12/50	Time: 223.43s	Loss: 12.4269	Accuracy: 0.0000	Distance: 19.6920 
Epoch: 13/50	Time: 242.21s	Loss: 12.4159	Accuracy: 0.0000	Distance: 19.8855 
Epoch: 14/50	

In [None]:
optimiser = optim.Adagrad(params, lr=0.01)

In [None]:
train(lemmas_train, tags_train, inflected_forms_train, epochs=30)



Epoch: 1/30	Time: 16.17s	Loss: 11.9899	Accuracy: 0.0000	Distance: 13.5740 
Epoch: 2/30	Time: 34.56s	Loss: 11.9382	Accuracy: 0.0000	Distance: 13.2590 
Epoch: 3/30	Time: 52.93s	Loss: 11.9349	Accuracy: 0.0000	Distance: 13.0410 
Epoch: 4/30	Time: 71.16s	Loss: 11.9377	Accuracy: 0.0000	Distance: 13.0575 
Epoch: 5/30	Time: 89.43s	Loss: 11.9283	Accuracy: 0.0000	Distance: 13.0115 
Epoch: 6/30	Time: 107.73s	Loss: 11.9314	Accuracy: 0.0000	Distance: 13.5965 
Epoch: 7/30	Time: 125.95s	Loss: 11.9341	Accuracy: 0.0000	Distance: 12.8995 
Epoch: 8/30	Time: 144.26s	Loss: 11.9303	Accuracy: 0.0000	Distance: 12.9055 
Epoch: 9/30	Time: 162.65s	Loss: 11.9297	Accuracy: 0.0000	Distance: 13.2690 
Epoch: 10/30	Time: 180.99s	Loss: 11.9307	Accuracy: 0.0000	Distance: 13.1035 
Epoch: 11/30	Time: 199.32s	Loss: 11.9325	Accuracy: 0.0000	Distance: 13.1335 
Epoch: 12/30	Time: 217.65s	Loss: 11.9293	Accuracy: 0.0000	Distance: 13.2650 
Epoch: 13/30	Time: 235.89s	Loss: 11.9288	Accuracy: 0.0000	Distance: 13.2315 
Epoch: 14/30	

In [None]:
optimiser = optim.Adagrad(params, lr=0.01)

In [None]:
train(lemmas_train, tags_train, inflected_forms_train, epochs=50)



Epoch: 1/50	Time: 16.16s	Loss: 11.9580	Accuracy: 0.0000	Distance: 13.3305 
Epoch: 2/50	Time: 34.35s	Loss: 11.9596	Accuracy: 0.0000	Distance: 13.2535 
Epoch: 3/50	Time: 52.72s	Loss: 11.9609	Accuracy: 0.0000	Distance: 13.3145 
Epoch: 4/50	Time: 71.13s	Loss: 11.9617	Accuracy: 0.0000	Distance: 13.2490 
Epoch: 5/50	Time: 89.43s	Loss: 11.9596	Accuracy: 0.0000	Distance: 13.2190 
Epoch: 6/50	Time: 107.80s	Loss: 11.9578	Accuracy: 0.0000	Distance: 13.1185 
Epoch: 7/50	Time: 126.09s	Loss: 11.9569	Accuracy: 0.0000	Distance: 13.4665 
Epoch: 8/50	Time: 144.38s	Loss: 11.9554	Accuracy: 0.0000	Distance: 13.3335 
Epoch: 9/50	Time: 162.66s	Loss: 11.9560	Accuracy: 0.0000	Distance: 13.4140 
Epoch: 10/50	Time: 180.98s	Loss: 11.9548	Accuracy: 0.0000	Distance: 13.3595 
Epoch: 11/50	Time: 199.27s	Loss: 11.9554	Accuracy: 0.0000	Distance: 13.3915 
Epoch: 12/50	Time: 217.62s	Loss: 11.9545	Accuracy: 0.0000	Distance: 13.4610 
Epoch: 13/50	Time: 235.91s	Loss: 11.9527	Accuracy: 0.0000	Distance: 13.4385 
Epoch: 14/50	

In [None]:
print("Train accuracy:", evaluate(*test(lemmas_train, tags_train, inflected_forms_train)), "Validation accuracy: ", evaluate(*test(lemmas_val, tags_val, inflected_forms_val)))



Train accuracy: (0.0, 14.392875) Validation accuracy:  (0.0, 14.1925)


### Development Dataset

In [None]:
lemmas_dev, tags_dev, inflected_forms_dev = load_data('./conll2018/task1/all/{}-dev'.format(language))

In [None]:
evaluate(*test(lemmas_dev, tags_dev, inflected_forms_dev))



(0.0, 14.3528)

In [None]:
for prediction, truth in zip(*test(lemmas_dev, tags_dev, inflected_forms_dev)):
    if prediction != truth:
        print("Prediction: {}\tTruth: {}".format(prediction, truth))



Streaming output truncated to the last 5000 lines.
Prediction: nee	Truth: desrisastes
Prediction: oonoaent	Truth: proffiterions
Prediction: esentesre	Truth: estoffassiez
Prediction: eneeeeeeeeeeeeeeeeeeeeeee	Truth: surnommera
Prediction: o	Truth: subiecteront
Prediction: eoeleeeeeeeeeeeeeeeeeeeee	Truth: impugneront
Prediction: orran	Truth: branslerions
Prediction: nio	Truth: deslyast
Prediction: eeeeeeeeeeeeeeeeeeeeeeeee	Truth: pastureryez
Prediction: ei	Truth: gaboient
Prediction: eeeeeeeeeeeeeeeeeeeeeeeee	Truth: esleves
Prediction: eeeeeeeeeeeeeeeeeeeeeeeee	Truth: enortes
Prediction: eseentrentrentrentrentren	Truth: guarde
Prediction: eoeoeoeoeoeoeoeoeoeoeoeoe	Truth: meslast
Prediction: niez	Truth: flatterons
Prediction: eesences	Truth: destablant
Prediction: eeeeeeeeeeeeeeeeeeeeeeeee	Truth: estourdir
Prediction: eso	Truth: en esprouvant
Prediction: esceeeeeeeeeeeeeeeeeeeeee	Truth: garde
Prediction: eisen	Truth: aymerent
Prediction: 	Truth: ayderont
Pre

## 3.6 Experiment 4

### Simple Attention over characters and tags with Attention layer trained 

In [None]:
aligner_path = './m2m-aligner/m2m-aligner'

In [None]:
%matplotlib inline

from itertools import zip_longest
import os
import pickle
import time
import subprocess

import Levenshtein
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
import torch.nn.functional as F

In [None]:
def load_data(file_name):
    """Loads data.

    Args:
        file_name: path to file containing the data

    Returns:
        lemmas: list of lemma
        tags: list of tags
        inflected_forms: list of inflected form
    """

    with open(file_name, 'r', encoding='utf') as file:
        text = file.read()

    lemmas = []
    tags = []
    inflected_forms = []

    for line in text.split('\n')[:-1]:
        lemma, inflected_form, tag = line.split('\t')
        lemmas.append(lemma)
        inflected_forms.append(inflected_form)
        tags.append(tag) 

    return lemmas, tags, inflected_forms

def get_index_dictionaries(lemmas, tags, inflected_forms):
    """Returns char2index, index2char, tag2index

    Args:
        lemmas: list of lemma
        tags: list of tags
        inflected_forms: list of inflected form

    Returns: 
        char2index: a dictionary which maps character to index
        index2char: a dictionary which maps index to character
        tag2index: a dictionary which maps morphological tag to index
    """

    unique_chars = set(''.join(lemmas) + ''.join(inflected_forms))
    unique_chars.update([START_CHAR, STOP_CHAR])  # special start and end symbols
    unique_chars.add(UNKNOWN_CHAR)  # special character for unknown characters
    char2index = {}
    index2char = {}

    char2index[PAD_CHAR] = 0
    index2char[0] = PAD_CHAR
    
    for index, char in enumerate(unique_chars):
        char2index[char] = index + 1
        index2char[index + 1] = char

    unique_tags = set(';'.join(tags).split(';'))
    unique_tags.add(UNKNOWN_TAG)  # add() keeps the tag whole; update() would split a string into its characters
    tag2index = {tag:index+1 for index, tag in enumerate(unique_tags)}
    tag2index[PAD_TAG] = 0

    return char2index, index2char, tag2index

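def get_combined_index_dictionaries(lemmas, tags, inflected_forms):
    """Returns input2index, index2input, char_vocab_length

    Characters and tag features share one index space: character entries
    come first, followed by the tag features.

    Args:
        lemmas: list of lemma
        tags: list of tags
        inflected_forms: list of inflected form

    Returns: 
        input2index: a dictionary which maps characters and tag features to index
        index2input: a dictionary which maps index to characters and tag features
        char_vocab_length: number of character entries, including the padding symbol
    """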

    unique_chars = set(''.join(lemmas) + ''.join(inflected_forms))
    unique_chars.update([START_CHAR, STOP_CHAR, UNKNOWN_CHAR]) # special start, end and unknown symbols
    
    input2index = {}
    index2input = {}
    
    input2index[PAD_CHAR] = 0
    index2input[0] = PAD_CHAR
        
    for index, char in enumerate(unique_chars, start=1):
        input2index[char] = index
        index2input[index] = char
        
    char_vocab_length = len(input2index.keys())

    unique_tags = set(';'.join(tags).split(';'))
    unique_tags.add(UNKNOWN_TAG) # special character for unknown tags
    
    for index, char in enumerate(unique_tags, start=char_vocab_length):
        input2index[char] = index
        index2input[index] = char

    return input2index, index2input, char_vocab_length

def words_to_indices(words, char2index, tensor=False, start_char=False, stop_char=False):
    """Converts a list of words to a list of index lists

    Args:
        words: list of words
        char2index: dictionary which maps character to index
        tensor: whether to return each index list as a tensor
        start_char: whether to prepend the index of START_CHAR
        stop_char: whether to append the index of STOP_CHAR

    Returns:
        list_indices: list of lists/tensors containing the indices of each word's characters
    """

    list_indices = []
    for word in words:
        word_indices = []
        if start_char:
            word_indices.append(char2index[START_CHAR])
        for char in word:
            if char in char2index.keys():
                word_indices.append(char2index[char])
            else:
                word_indices.append(char2index[UNKNOWN_CHAR])
        if stop_char:
            word_indices.append(char2index[STOP_CHAR])
        if tensor:
            word_indices = torch.tensor(word_indices, dtype=torch.long)  # nn.Embedding expects integer indices
        list_indices.append(word_indices)

    return list_indices

def tag_to_vector(tags, tag2index):
    """Returns multi-hot representations of tags.

    Args:
        tags: list of string representation of tag (eg, V;IND;PRS;2;PL)
        tag2index: dictionary which maps tag feature to index

    Returns:
        tag_vectors: list of 1D tensors with a 1 at the index of every feature present in the tag
    """

    tag_vectors = []
    for tag in tags:
        tag_vector = torch.zeros(len(tag2index))
        for tag_feature in tag.split(';'):
            if tag_feature in tag2index:
                tag_vector[tag2index[tag_feature]] = 1
            else:
                tag_vector[tag2index[UNKNOWN_TAG]] = 1
        tag_vectors.append(tag_vector)
    return tag_vectors

def tag_to_indices(tags, tag2index):
    """Converts list of tags to a list of lists containing indices

    Args:
        tags: list of tags
        tag2index: dictionary which maps sub-tag to index

    Returns:
        list_indices: list of lists containing indices of sub_tags
    """
    
    list_indices = []
    for tag in tags:
        tag_indices = []
        for sub_tag in tag.split(';'):
            if sub_tag in tag2index.keys():
                tag_indices.append(tag2index[sub_tag])
            else:
                tag_indices.append(tag2index[UNKNOWN_TAG])
        list_indices.append(tag_indices)

    return list_indices
    

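def indices_to_word(indices, index2char):
    """Returns a word given a list containing indices of its characters

    The last character (expected to be the stop symbol) is dropped.

    Args:
        indices: list containing indices
        index2char: dictionary which maps index to character

    Returns:
        word: a string
    """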

    return ''.join([index2char[index] for index in indices])[:-1]

def pad_lists(lists, pad_int, pad_len=None):
    """Pads lists in a list to make them of equal size"""
    
    if pad_len is None:
        pad_len = max([len(lst) for lst in lists])
    new_list = []
    for lst in lists:
        if len(lst) < pad_len:
            new_list.append(torch.tensor(lst + [pad_int] * (pad_len-len(lst))))
        else:
            new_list.append(torch.tensor(lst[:pad_len]))
    return torch.stack(new_list)

def merge_lists(lists1, lists2):
    """Add two list of lists."""
    
    merged_lists = []
    for list1, list2 in zip(lists1, lists2):
        merged_lists.append(list1 + list2)
    return merged_lists

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks or blocks"""
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def accuracy(predictions, targets):
    correct_count = 0
    for prediction, target in zip(predictions, targets):
        if prediction == target:
            correct_count += 1
    return correct_count / len(predictions)

def average_distance(predictions, targets):
    total_distance = 0
    for prediction, target in zip(predictions, targets):
        total_distance += Levenshtein.distance(prediction, target)
    return total_distance / len(predictions)

def evaluate(predictions, targets):
    return accuracy(predictions, targets), average_distance(predictions, targets)

def showAttention(input_sentence, output_words, attentions, font=None):
        
    # Set up figure with colorbar
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions, cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    font = FontProperties(fname=font, size=24)
    ax.set_xticklabels([''] + input_sentence, rotation=90, fontproperties=font)
    ax.set_yticklabels([''] + output_words, fontproperties=font)
    ax.tick_params(axis='both', which='major', pad=15)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()
    
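def get_alignments(lemmas, inflected_forms, tags=None, normalise=False, maxX=1):
    """Gets character alignments between lemmas and inflected forms from m2m-aligner
    
    Args:
        lemmas: list of lemma
        inflected_forms: list of inflected form
        tags: list of tags; if given, one zero column per tag feature is appended
        normalise: if True, output characters with no aligned lemma character get uniform weights
        maxX: value passed to the aligner's --maxX option

    Returns:
        alignments: list of arrays of shape (output length, input length)
    """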
    
    aligner_dir = os.path.split(aligner_path)[0]
    processed_path = os.path.join(aligner_dir, 'processed')
    output_path = os.path.join(aligner_dir, 'output')

    # Write one space-separated lemma/form pair per line; spaces inside words become '*'
    with open(processed_path, 'w', encoding='utf8') as processed_file:
        for lemma, inflected_form in zip(lemmas, inflected_forms):
            lemma = lemma.replace(' ', '*') + STOP_CHAR
            inflected_form = inflected_form.replace(' ', '*') + STOP_CHAR
            processed_file.write(' '.join(lemma) + '\t' + ' '.join(inflected_form) + '\n')

    # Passing the arguments as a list avoids going through the shell
    subprocess.run([aligner_path, '--errorInFile', '--delY', '--maxX', str(maxX), '--maxY', '1', '-i', processed_path, '-o', output_path])
    
    aligner_output = open(output_path, 'r', encoding='utf8')
    
    alignments = []

    for i, line in enumerate(aligner_output):
        if 'NO ALIGNMENT' in line:
            if tags is not None:
                word_alignment = np.zeros((len(inflected_forms[i]) + 1, len(lemmas[i]) + 2 + len(tags[i].split(';'))))
            else:
                word_alignment = np.zeros((len(inflected_forms[i]) + 1, len(lemmas[i]) + 2))
                
            if normalise:
                # fill in place; a plain assignment would replace the array with a scalar
                word_alignment[:] = 1 / word_alignment.shape[1]
            alignments.append(word_alignment)
            continue

        aligned_lemma, aligned_inflected_form = line.split()
        index = 1
        lemma = START_CHAR + ''.join(aligned_lemma.split('|')[:-1]).replace('_', '').replace(':', '')
        inflected_form = ''.join(aligned_inflected_form.split('|')[:-1])
        if tags is not None:
            word_alignment = np.zeros((len(inflected_form), len(lemma) + len(tags[i].split(';'))))
        else:
            word_alignment = np.zeros((len(inflected_form), len(lemma)))

        # `row` indexes the output character; a separate name avoids shadowing the outer loop index `i`
        for row, (lemma_char, word_char) in enumerate(zip(aligned_lemma.split('|')[:-1], aligned_inflected_form.split('|')[:-1])):
            if ':' in lemma_char:
                chars = len(lemma_char.split(':'))
                word_alignment[row, index:(index+chars)] = 1 / chars
                index += chars
            elif lemma_char != '_':
                word_alignment[row, index] = 1
                index += 1 
            else:
                if normalise:
                    word_alignment[row, :] = 1 / word_alignment.shape[1]
        alignments.append(word_alignment)    
        
    aligner_output.close()
            
    return alignments
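
The `evaluate` helper defined in the cell above reports two numbers: exact-match accuracy and the mean Levenshtein distance between prediction and truth. The sketch below reproduces both without the `Levenshtein` package so that the metric is easy to inspect. The names `levenshtein` and `evaluate_sketch` are illustrative and do not appear in the notebook, and unit costs for insertion, deletion and substitution are assumed.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def evaluate_sketch(predictions, targets):
    """Return (exact-match accuracy, mean edit distance)."""
    pairs = list(zip(predictions, targets))
    exact = sum(prediction == target for prediction, target in pairs)
    total = sum(levenshtein(prediction, target) for prediction, target in pairs)
    return exact / len(pairs), total / len(pairs)


print(evaluate_sketch(['garde', 'nio'], ['garde', 'deslyast']))
```

A prediction that shares no characters with the truth, such as `nio` against `deslyast`, costs as many edits as the longer of the two strings, which is why an accuracy of 0.0 can coincide with average distances above 13 in the logs.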

In [None]:
language = 'middle-french'
dataset = 'low'

In [None]:
lemmas, tags, inflected_forms = load_data('./conll2018/task1/all/{}-train-{}'.format(language, dataset))

In [None]:
lemmas_train,lemmas_val, tags_train, tags_val, inflected_forms_train, inflected_forms_val = train_test_split(lemmas, tags, inflected_forms, test_size=0.2, random_state=42)

In [None]:
%time alignments_train = get_alignments(lemmas_train, inflected_forms_train, tags_train, normalise=False)

FileNotFoundError: ignored
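
The `FileNotFoundError` above suggests that the aligner binary, its directory, or the output file it should have written was missing when `get_alignments` ran. A minimal pre-flight check, assuming only the standard library, could be run before the alignment step; `check_aligner` is a hypothetical helper that is not part of the notebook.

```python
import os


def check_aligner(aligner_path):
    """Return a list of problems that would stop the alignment step."""
    problems = []
    aligner_dir = os.path.split(aligner_path)[0]
    if not os.path.isdir(aligner_dir):
        problems.append("directory not found: " + aligner_dir)
    elif not os.path.isfile(aligner_path):
        problems.append("aligner binary not found: " + aligner_path)
    elif not os.access(aligner_path, os.X_OK):
        problems.append("aligner binary is not executable: " + aligner_path)
    return problems


# An empty list means the binary is present and executable.
print(check_aligner('./m2m-aligner/m2m-aligner'))
```

Without alignments the `train` function of this experiment cannot run, since it reads `alignments_train`.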

In [None]:
input2index, index2input, char_vocab_size = get_combined_index_dictionaries(lemmas_train, tags_train, inflected_forms_train)
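
In this experiment characters and tag features share a single index space, so that one encoder can read the lemma followed by its tags. The toy sketch below illustrates the layout under stated assumptions: the symbol strings are placeholders (the notebook's own constants are defined outside this section), the names `build_combined_vocab` and `encode` are hypothetical, and sorting is a choice of this sketch that makes the indices reproducible.

```python
# Placeholder symbols, assumed for illustration only.
PAD, START, STOP, UNK_CHAR, UNK_TAG = '<pad>', '<s>', '</s>', '<unk>', '<unk_tag>'


def build_combined_vocab(lemmas, tags, inflected_forms):
    """Map characters and tag features into one index space, characters first."""
    chars = sorted(set(''.join(lemmas) + ''.join(inflected_forms)))
    symbols = [PAD, START, STOP, UNK_CHAR] + chars
    char_vocab_size = len(symbols)  # every index below this is a character entry
    features = sorted({feature for tag in tags for feature in tag.split(';')})
    symbols += [UNK_TAG] + features
    return {symbol: index for index, symbol in enumerate(symbols)}, char_vocab_size


def encode(lemma, tag, vocab):
    """Lemma characters wrapped in start/stop symbols, followed by the tag features."""
    char_ids = [vocab.get(char, vocab[UNK_CHAR]) for char in lemma]
    tag_ids = [vocab.get(feature, vocab[UNK_TAG]) for feature in tag.split(';')]
    return [vocab[START]] + char_ids + [vocab[STOP]] + tag_ids


vocab, char_vocab_size = build_combined_vocab(['aimer'], ['V;IND;PRS'], ['aime'])
print(char_vocab_size, encode('aimer', 'V;IND;PRS', vocab))
```

Because the character entries occupy the lowest indices, the output layer only needs to score the first `char_vocab_size` entries (minus padding), which matches the size of `linear1` in the training cells.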


### Train

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
batch_size = 32
embedding_size = 300
hidden_size = 100
input_vocab_size = len(index2input.keys())
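
With `batch_size = 32`, the `train` and `test` functions form mini-batches with `grouper` and then filter out the `None` fill values from the last, shorter chunk. The sketch below shows that pattern in isolation; `batches` is a hypothetical name, and the 80 items are an arbitrary example size.

```python
from itertools import zip_longest


def batches(items, batch_size):
    """Yield lists of at most batch_size items, dropping the None padding."""
    # The same iterator repeated batch_size times makes zip_longest consume
    # the input in consecutive chunks.
    chunks = zip_longest(*[iter(items)] * batch_size)
    for chunk in chunks:
        yield [item for item in chunk if item is not None]


print([len(batch) for batch in batches(list(range(80)), 32)])
```

Filtering on `None` is safe here because the notebook's items are tuples; it would silently drop genuine `None` values in other data.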

In [None]:
class Attention(nn.Module):
    def __init__(self, hidden_size, p=0.3):
        # general scoring function - global attention 

        super(Attention, self).__init__()
        self.W_a = nn.Linear(2*hidden_size, hidden_size, bias=False)
        # self.dropout = nn.Dropout(p)

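    def forward(self, encoded_input, hidden_state, input_mask):
        # encoded_input: (batch, src_len, 2*hidden), hidden_state: (1, batch, hidden)
        # squeeze(2) removes only the trailing singleton, so a batch of size 1 keeps its batch dimension
        attn_weights = torch.bmm(self.W_a(encoded_input), hidden_state.transpose(0, 1).transpose(1, 2)).squeeze(2)
        # padded positions get a score of -inf and therefore zero weight after the softmax
        attn_weights = attn_weights.masked_fill(input_mask, float('-inf'))
        # attn_weights = dropout(attn_weights)
        attn_weights = F.softmax(attn_weights, dim=1)
        return attn_weights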
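
The `Attention` module above scores every encoder position against the current decoder state with a learned matrix, masks the padded positions and normalises with a softmax. The dependency-free sketch below walks through the same three steps on plain lists; `general_attention` is an illustrative name, the toy matrix `W` stands in for the learned `W_a`, and at least one non-padded position is assumed.

```python
import math


def general_attention(encoder_states, decoder_state, W, pad_mask):
    """weights_j = softmax_j((W h_j) . s), with padded positions forced to zero."""
    scores = []
    for h, is_pad in zip(encoder_states, pad_mask):
        if is_pad:
            scores.append(float('-inf'))  # masked: exp(-inf) is 0 after the softmax
            continue
        projected = [sum(w * x for w, x in zip(row, h)) for row in W]
        scores.append(sum(p * s for p, s in zip(projected, decoder_state)))
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


W = [[1.0, 0.0]]                                      # maps 2-dim encoder states to 1 dim
encoder_states = [[1.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
weights = general_attention(encoder_states, [1.0], W, [False, False, True])
print(weights)
```

The third position receives zero weight despite having the largest raw score, because it is marked as padding.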

In [None]:
Embedder = nn.Embedding(input_vocab_size, embedding_size, padding_idx=input2index[PAD_CHAR]).to(device)
Encoder = nn.LSTM(embedding_size, hidden_size, batch_first=True, bidirectional=True).to(device)
attention = Attention(hidden_size).to(device)
Decoder = nn.LSTM(embedding_size + 2 * hidden_size, hidden_size, batch_first=True).to(device)
linear1 = nn.Linear(3 * hidden_size, char_vocab_size - 1, ).to(device) # remove pad character option
dropout1 = torch.nn.Dropout(p=0)
log_softmax = nn.LogSoftmax(dim=2).to(device)
# attention_criterion = nn.MSELoss(size_average=False)
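# Note: nn.CrossEntropyLoss applies log-softmax to its input, while Attention.forward already returns softmax-normalised weights
attention_criterion = nn.CrossEntropyLoss(ignore_index=-1)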
criterion = nn.NLLLoss(ignore_index=-1)
params = list(Embedder.parameters()) + list(Encoder.parameters()) + list(Decoder.parameters()) + list(linear1.parameters()) + list(attention.parameters())

In [None]:
@torch.no_grad()  # inference only: no autograd graph is needed
def test(lemmas_val, tags_val, inflected_forms_val, batch_size=32, attn_debug=False, max_pred_len=25):

    inflected_predicted = []
    inflected_forms_true = []
    if attn_debug:
        input_seq = []
        attn_weights_all = []
    output_indices = []

    for batch in grouper(zip(lemmas_val, tags_val, inflected_forms_val), batch_size):
        batch = list(filter(lambda x: x is not None, batch))
        lemmas, tags, inflected_forms = zip(*batch)

        lemmas_indices = words_to_indices(lemmas, input2index, start_char=True, stop_char=True)
        tags_indices = tag_to_indices(tags, input2index)    
        input_indices = merge_lists(lemmas_indices, tags_indices)

        # Sort by length of input sequence
        input_indices, inflected_forms = zip(*sorted(zip(input_indices, inflected_forms), key=lambda x: len(x[0]), reverse=True))        
        if attn_debug:
            input_seq += [[index2input[index] for index in input_indices1] for input_indices1 in input_indices]
        input_indices = [torch.tensor(lst) for lst in input_indices]

        input_tensor = pad_sequence(input_indices, padding_value=input2index[PAD_CHAR], batch_first=True).to(device)
        embedding = Embedder(input_tensor)
        lengths = [Tensor.shape[0] for Tensor in input_indices]
        packed_input = pack_padded_sequence(embedding, lengths, batch_first=True)
        encoded_packed_seq, (hidden, cell) = Encoder(packed_input)
        encoded_input = pad_packed_sequence(encoded_packed_seq, batch_first=True)[0]
        
        input_mask = input_tensor == 0
        # input_mask = torch.where(input_mask > 0, torch.zeros_like(input_mask), torch.ones_like(input_mask) * float('inf')) 


        # Decode
        hidden_state = hidden[0,:,:] + hidden[1,:,:]
        hidden_state = hidden_state.unsqueeze(0)
        cell_state = torch.zeros(1, len(lengths), hidden_size).to(device)

        decoder_input = torch.tensor([input2index[START_CHAR]] * len(lengths)).to(device)
        decoder_input = Embedder(decoder_input).unsqueeze(1)


        outputs = []
        attn_weights_sequence = []
        for seq in range(0, max_pred_len):
            attn_weights = attention(encoded_input, hidden_state, input_mask)
            if attn_debug:
                attn_weights_sequence.append(attn_weights)
            context = torch.bmm(attn_weights.unsqueeze(1), encoded_input).squeeze(1)  # (batch, 2*hidden), also for a batch of size 1

            decoder_input_concat = torch.cat([context.unsqueeze(1), decoder_input], dim=2)

            output, (hidden_state, cell_state) = Decoder(decoder_input_concat, (hidden_state, cell_state))
            output = F.relu(linear1(torch.cat([hidden_state, context.unsqueeze(0)], dim=2)))            
            decoder_input = Embedder(output.argmax(dim=2) + 1).transpose(0, 1)
            outputs.append(output.squeeze(0))  # (batch, vocab), also for a batch of size 1

        if attn_debug:
            attn_weights_sequence = torch.stack(attn_weights_sequence, dim=1)
            attn_weights_all.append(attn_weights_sequence)
        output_indices.append(torch.stack(outputs).argmax(dim=2))
        inflected_forms_true += inflected_forms

    output_indices = torch.cat(output_indices, dim=1).transpose(0, 1).cpu().numpy()
    for indices in output_indices:
        inflected_predicted.append(''.join([index2input[index + 1] for index in indices]).split(STOP_CHAR)[0])
            
    if attn_debug:
        max_src_len = max([x.shape[2] for x in attn_weights_all])
        for i in range(len(attn_weights_all)):
            attn_weights = attn_weights_all[i]
            batch_size = attn_weights.shape[0]
            src_len = attn_weights.shape[2]
            attn_weights_all[i] = torch.zeros(batch_size, max_pred_len, max_src_len)
            attn_weights_all[i][:,:,:src_len] = attn_weights
        attn_weights_all = torch.cat(attn_weights_all, dim=0)
        return inflected_predicted, input_seq, attn_weights_all
    else:
        return inflected_predicted, inflected_forms_true

In [None]:
def train(lemmas_train, tags_train, inflected_forms_train, epochs=1, lamda=1):
    
    start_time = time.time()

    for epoch in range(epochs):

        epoch_loss = 0
        epoch_char_loss = 0
        epoch_attention_loss = 0

        for batch in grouper(zip(lemmas_train, tags_train, inflected_forms_train, alignments_train), batch_size):
            batch = list(filter(lambda x: x is not None, batch))
            lemmas, tags, inflected_forms, alignments = zip(*batch)


            lemmas_indices = words_to_indices(lemmas, input2index, start_char=True, stop_char=True)
            tags_indices = tag_to_indices(tags, input2index)    
            inflected_forms_indices = words_to_indices(inflected_forms, input2index)
            input_indices = merge_lists(lemmas_indices, tags_indices)

            # Sort by length of input sequence
            input_indices, inflected_forms_indices, alignments = zip(*sorted(zip(input_indices, inflected_forms_indices, alignments), key=lambda x: len(x[0]), reverse=True))
            input_indices = [torch.tensor(lst) for lst in input_indices]

            optimiser.zero_grad()

            input_tensor = pad_sequence(input_indices, padding_value=input2index[PAD_CHAR], batch_first=True).to(device)
            embedding = Embedder(input_tensor)
            lengths = [Tensor.shape[0] for Tensor in input_indices]
            packed_input = pack_padded_sequence(embedding, lengths, batch_first=True)
            encoded_packed_seq, (hidden, cell) = Encoder(packed_input)
            encoded_input = pad_packed_sequence(encoded_packed_seq, batch_first=True)[0]

            input_mask = input_tensor == 0
            # input_mask = torch.where(input_mask > 0, torch.zeros_like(input_mask), torch.ones_like(input_mask) * float('inf')) 

            # Decode
            hidden_state = hidden[0,:,:] + hidden[1,:,:]
            hidden_state = hidden_state.unsqueeze(0)
            cell_state = torch.zeros(1, len(lengths), hidden_size).to(device)

            target = pad_lists([lst + [input2index[STOP_CHAR]] for lst in inflected_forms_indices], input2index[PAD_CHAR]).to(device)    
            target = target - 1

            decoder_input = pad_lists(inflected_forms_indices, input2index[PAD_CHAR]).to(device)
            decoder_input = torch.cat([torch.tensor([input2index[START_CHAR]] * len(lengths)).unsqueeze(1).to(device), decoder_input], dim=1)
            decoder_input = Embedder(decoder_input)

            char_loss = 0
            attention_loss = 0

            max_length = target.shape[1]
            
            # get alignment matrix
            max_src_len = lengths[0]
            bsz = len(alignments)
            alignments_batched = np.zeros((bsz, max_length, max_src_len))
            for i, alignment in enumerate(alignments):
                alignments_batched[i, :alignment.shape[0], :alignment.shape[1]] = alignment
            alignments_batched = torch.from_numpy(alignments_batched).float().to(device)            
            
            for seq in range(0, max_length):
                attn_weights = attention(encoded_input, hidden_state, input_mask)
                attention_target = alignments_batched[:, seq, :].argmax(dim=1).detach().masked_fill(alignments_batched[:, seq, :].sum(dim=1) == 0, -1)
                # loss += lamda * attention_criterion(attn_weights, alignments_batched[:, seq, :]) MSE
                attention_loss += attention_criterion(attn_weights, attention_target)
                context = torch.bmm(attn_weights.unsqueeze(1), encoded_input).squeeze(1)  # (batch, 2*hidden), also for a batch of size 1

                decoder_input_concat = torch.cat([context.unsqueeze(1), decoder_input[:,seq,:].unsqueeze(1)], dim=2)

                output, (hidden_state, cell_state) = Decoder(decoder_input_concat, (hidden_state, cell_state))
                output = F.relu(linear1(torch.cat([hidden_state, context.unsqueeze(0)], dim=2)))
                output = dropout1(output)
                output = log_softmax(output).squeeze(0)  # (batch, vocab), also for a batch of size 1
                char_loss += criterion(output, target[:,seq])

            loss = char_loss + lamda * attention_loss
            epoch_loss += loss.item()
            epoch_attention_loss += attention_loss.item() 
            epoch_char_loss += char_loss.item()
            loss.backward()
            optimiser.step()

        print("Epoch: {}/{}\tTime: {:.2f}s\tLoss: {:.4f}\tALoss: {:.4f}\tCLoss: {:.4f}\tAcc: {:.4f}\tLD: {:.4f} ".format(epoch+1, epochs, time.time() - start_time, epoch_loss / len(lemmas_train), epoch_attention_loss / len(lemmas_train), epoch_char_loss / len(lemmas_train), *evaluate(*test(lemmas_val, tags_val, inflected_forms_val))))
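
Inside `train`, each row of an alignment matrix is turned into a classification target for the attention layer: the index of the largest entry, or `-1` (the `ignore_index` of the attention criterion) when the row is all zeros. The sketch below shows that conversion on plain lists; `attention_targets` is a hypothetical helper, and ties are resolved in favour of the first maximal index, which is an assumption of this sketch.

```python
def attention_targets(alignment, ignore_index=-1):
    """One target per output character: the aligned input position or ignore_index."""
    targets = []
    for row in alignment:
        if sum(row) == 0:
            targets.append(ignore_index)  # nothing aligned: excluded from the loss
        else:
            targets.append(max(range(len(row)), key=row.__getitem__))
    return targets


alignment = [
    [0, 1, 0, 0],      # output character 0 aligned to input position 1
    [0, 0, 0.5, 0.5],  # aligned to two input characters, first one is kept
    [0, 0, 0, 0],      # inserted character or padding row
]
print(attention_targets(alignment))
```

Rows that map to `ignore_index` contribute nothing to the attention loss, so inserted characters and padded time steps are trained through the character loss alone.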

In [None]:
optimiser = optim.Adagrad(params, lr=0.01)

In [None]:
train(lemmas_train, tags_train, inflected_forms_train, epochs=20, lamda=2)

In [None]:
optimiser = optim.Adagrad(params, lr=0.001)

In [None]:
train(lemmas_train, tags_train, inflected_forms_train, epochs=1000, lamda=0)

In [None]:
print("Train accuracy:", evaluate(*test(lemmas_train, tags_train, inflected_forms_train)), "Validation accuracy: ", evaluate(*test(lemmas_val, tags_val, inflected_forms_val)))

### Development Dataset

In [None]:
lemmas_dev, tags_dev, inflected_forms_dev = load_data('./conll2018/task1/all/{}-dev'.format(language))

In [None]:
evaluate(*test(lemmas_dev, tags_dev, inflected_forms_dev))

In [None]:
for prediction, truth in zip(*test(lemmas_dev, tags_dev, inflected_forms_dev)):
    print("Prediction: {}\tTruth: {}".format(prediction, truth))