## Q3. Translating a cryptic language (30 marks)
In a fictitious world, a group of scientists obtained a document containing a sample of texts written in a cryptic language, together with their English translations. However, they are unable to find a way of translating arbitrary texts from the cryptic language. Knowing that you have learned deep learning models for machine translation, they are approaching you for help.

They have sent you a data file containing a training set and a test set.
The code below illustrates how to use the data.

In [10]:
import pickle as pkl

data = pkl.load(open('cryptic.pkl', 'rb'))
x_tr = data['train']['x']
y_tr = data['train']['y']
x_ts = data['test']['x']
y_ts = data['test']['y']
print('%d training pairs and %d test pairs\n' % (len(x_tr), len(x_ts)))

print('The third training pair:')
print('- Cryptic:', x_tr[2])
print('- English:', y_tr[2])

29721 training pairs and 9431 test pairs

The third training pair:
- Cryptic: T:aTmUB:BTzp:;)p,:Gp,M 1iU773g rvz:Um:7;,7o,pM r:B
- English: a panic came over her. "Kitty! I'm in torture. I c


As you can see, these texts are not necessarily complete sentences, and in fact, some words may be incomplete too.

In this question, you will train a character-level machine translator using the encoder-decoder architecture as described in Lecture 18 - that is, the encoder takes in one letter at a time from the text in the cryptic language, and the decoder generates one letter at a time for its English translation.
For this dataset, each short text, whether in the cryptic language or in English, has exactly 50 letters, which is convenient for batch processing with recurrent models.
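
Because every text has the same length, a batch of texts can be stored as a single tensor. The following is a minimal sketch of that idea on a hypothetical toy alphabet (the names `alphabet`, `char2int`, and `texts` are illustrative assumptions, not part of the provided data), using `torch.nn.functional.one_hot`:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy alphabet; the real alphabets come from the dataset.
alphabet = sorted(set("abc \n"))
char2int = {ch: i for i, ch in enumerate(alphabet)}

# Two texts of equal length (4 here, 50 in the dataset).
texts = ["ab c", "cab "]

# Integer codes with shape (batch, length).
idx = torch.tensor([[char2int[ch] for ch in t] for t in texts])

# One-hot tensor with shape (batch, length, num_char).
onehot = F.one_hot(idx, num_classes=len(alphabet)).float()
print(onehot.shape)
```

Each position holds exactly one 1, so summing over the last dimension gives a tensor of ones.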

Specifically, as described in lecture, the encoder-decoder architecture has an encoder RNN $f^{e}$, a decoder RNN $f^{d}$, and a predictor $g$: 
\begin{align*}
    h^{e}_{t} &= f^{e}(h_{t-1}^{e}, x_{t}), \\
    h^{d}_{t} &= f^{d}(h_{t-1}^{d}, y_{t-1}, c), \\
    y_{t} &\sim g(\cdot | h_{t}^{d}, y_{t-1}, c).
\end{align*}
For this problem, each input $x_{t}$ is the one-hot vector for a single letter in the cryptic text,
and each output $y_{t}$ is the one-hot vector for a single letter in the English text.
As usual, the context vector $c$ is the last hidden state from the encoder, 
and the input to the decoder at the $t$-th time step is $(y_{t-1}, c)$.
The decoder RNN behaves differently during training and testing:
* During training, each $y_{t-1}$ in the input $(y_{t-1}, c)$ to the decoder is the one-hot vector representing the $(t-1)$-th letter in the given English translation (assume $t=1$ is the first letter), except that $y_{0}$ is a one-hot vector representing a special SOS (start of sentence) letter.
* During testing, each $y_{t-1}$ in the input $(y_{t-1}, c)$ to the decoder is obtained in a greedy way as the one-hot vector for the most likely letter predicted by the predictor $g$, except that $y_{0}$ represents SOS.
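
The two ways of forming the decoder input can be sketched on toy tensors as follows. This is an illustration only, under stated assumptions: the alphabet size `num_char`, the SOS index `sos`, and the probability values are hypothetical, and no RNN is involved.

```python
import torch
import torch.nn.functional as F

num_char = 6   # hypothetical English alphabet size, including SOS
sos = 0        # hypothetical integer index of the SOS letter

# Integer-coded target letters y_1..y_T for a batch of 2 texts with T = 4.
y_idx = torch.tensor([[3, 1, 4, 2],
                      [5, 2, 2, 1]])

# Training: the decoder input at step t is the true y_{t-1},
# so the input sequence is [SOS, y_1, ..., y_{T-1}].
sos_col = torch.full((y_idx.size(0), 1), sos)
dec_in_idx = torch.cat([sos_col, y_idx[:, :-1]], dim=1)
dec_in = F.one_hot(dec_in_idx, num_classes=num_char).float()  # (2, 4, 6)

# Testing: the next input is the one-hot vector of the most likely letter
# under the predictor's distribution at the current step.
probs = torch.tensor([[0.1, 0.6, 0.1, 0.1, 0.05, 0.05],
                      [0.2, 0.1, 0.1, 0.1, 0.1, 0.4]])
next_in = F.one_hot(probs.argmax(dim=-1), num_classes=num_char).float()  # (2, 6)
```

During training the whole shifted sequence is known in advance, so it can be fed to the decoder in one call. During testing each input depends on the previous prediction, so the decoder has to be run one step at a time.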

**(a)** (5 marks)
Complete the code for the `Lang` class below.
Append an EOS letter '\n' to each example cryptic/English text.
Create `Lang` objects for both the cryptic language and English, and use them to convert texts to one-hot representations for both the training and test data.
Make sure you add an SOS letter, defined by `chr(0)`, as an additional letter for the English language, as this is needed for the decoder.

**Answer**.

In [11]:
import torch

class Lang:
    def __init__(self, text):
        self.int2char = dict(enumerate(sorted(set(text))))
        self.char2int = {v: k for k, v in self.int2char.items()}
        self.num_char = len(self.char2int)

    def onehot(self, texts):
        '''
        Convert a list of strings to a list of sequences of one-hot vectors for letters.
        
        Parameters
        ----------
        texts: a list or array of str

        Returns
        -------
        outputs: a list, where output[i] is a list of one-hot vectors for the letters in texts[i]
        '''
        # Task: implement this function according to the docstring
        pass

    def text(self, onehots):
        '''
        Convert a list of sequences of one-hot vectors to a list of strings.
        
        Parameters
        ----------
        onehots: a list, where onehots[i] is a list of one-hot vectors

        Returns
        -------
        outputs: a list, where output[i] is a str corresponding to sequence of one-hot vectors in onehots[i]
        '''
        # Task: implement this function according to the docstring
        pass

# Task: create the Lang objects for the cryptic language and English, and convert the data to one-hot representations
SOS = chr(0)
EOS = '\n'

In [None]:
## HELP: create your Lang objects and run some tests on them
# cr = Lang object for the cryptic language
# en = Lang object for English 

# run the code below to check whether the fields are correctly computed
#en.num_char, en.char2int, en.int2char

# run the code below to check the implementation of text and onehot - the output should be two True's
# cr.text(cr.onehot(x_tr[:10])) == x_tr[:10], en.text(en.onehot(y_tr[:10])) == y_tr[:10]

**(b)** (10 marks)
Complete the code for the encoder-decoder model below.
Write a short paragraph to briefly describe the type of architecture used
(e.g. vanilla RNN, LSTM, GRU, MLP) for the encoder/decoder/predictor, and the hyperparameters chosen 
(e.g. number of layers, number of neurons for each layer, activation functions).

**Answer**.

In [12]:
import torch.nn as nn
from torch.nn import LSTM
import numpy as np
torch.manual_seed(1)
np.random.seed(1)

class Translator(nn.Module):
    '''
    Your code should support usages as illustrated in the code below, and you should run such 
    code to see if your model has any bug before proceeding to train it.
    
    net = Translator(en, cr)
    # input the one-hot representation of some cryptic texts and their translations, and output
    # the class probabilities at each time step.
    p = net(x, y)
    # input the one-hot representation of some cryptic texts, and output one-hot representation
    # of their translations
    yhat = net(x)
    '''
    
    def __init__(self, en: Lang, cr: Lang):
        '''
        You can add additional arguments to this function if you want.
        '''
        super().__init__()  # required before submodules can be assigned to an nn.Module
        self.en = en
        self.cr = cr
        # Task: define your encoder RNN, decoder RNN and the predictor by modifying the code below
        self.encoder = None
        self.decoder = None
        self.predictor = None

    def forward(self, x: torch.Tensor, y: torch.Tensor = None):
        '''
        Compute the class/letter distribution during training and one-hot vectors during testing.
        Note that while in general, we stop generating the text only when EOS is generated, we 
        always generate output sequences of the same lengths as the input sequences for simplicity
        in this question.

        Parameters
        ----------
        x: a batch_size x length x num_char_cr one-hot tensor representation for batch_size input 
           texts, each of the same length, and where num_char_cr is the number of letters in the
           cryptic language
        y: None during testing, and a batch_size x length x num_char_en tensor during training, 
           corresponding to the one-hot representation for the English translation of the cryptic 
           texts.

        Returns:
        output: a batch_size x length x num_char_en tensor. During training, it holds the class 
           probabilities output by the predictor g. During testing, it holds the one-hot 
           representation of the English translation obtained with the greedy strategy described in class.
        '''
        # Task: implement this function according to the docstring
        # Note: reversing the input sequence before feeding it to the encoder is often helpful

        # HELP:
        # - write down the input shape and output shape for the encoder, decoder and predictor on paper first
        # - print the input shape and output shape for the encoder, decoder and predictor in your code and 
        #   compare with your answers on paper
        
        pass

    def translate(self, x: list[str]):
        '''
        Return the list of English translations of a list of cryptic texts.
        '''
        # HELP
        # If you use the code below as it is, you need to also allow the forward function to 
        # accept a list of onehot sequences output by the onehot function. This can be achieved 
        # by calling x = torch.stack(x) at the beginning of forward.
        # Alternatively, you can change the first line below to x = torch.stack(self.cr.onehot(x))
        # In general, you can interpret list, array, and tensor in the docstring as synonymous 
        # and choose whichever you find easier to work with.
        x = self.cr.onehot(x) 
        y = self(x)
        y = self.en.text(y)
        return y

In [None]:
## HELP: initialize your Translator and do something like the following to test it
# net = Translator(en, cr)
# net(x_tr_oh[:10]) # assume x_tr_oh is the onehot tensor version of x_tr
# net(x_tr_oh[:10], y_tr_oh[:10]) # assume y_tr_oh is the onehot tensor version of y_tr

**(c)** (15 marks)
Implement a train function and train your model.
Describe the training procedure in enough detail that someone reading your
report is able to reproduce how you trained the model.
That is, you need to include details such as the choice of optimizer together
with the hyperparameters used, the batch size, and the initial hidden state.

For your final model, output its translations along with the given translations for the first three test examples.
In addition, use the `jaccard` function to score your model's translations on the training and test sets.
5 marks are allocated for your model performance: your model receives $5 \times \min(1, \text{test score}/0.85)$ marks.
Note that training a good model can take time, and it is better to figure out how to train a good model on a small subset first before trying to train a good model on the full dataset.

Save your final model to `supplements` using `torch.save`, and name the file `mt.pt`.
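
A minimal sketch of saving and restoring with `torch.save` is given below. It rests on assumptions: a small `nn.Linear` stands in for the trained `Translator`, a temporary directory stands in for `supplements`, and only the `state_dict` is saved. The question does not state whether the whole module or only its parameters is expected, so treat this as one possible option.

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Linear(4, 2)  # hypothetical stand-in for the trained Translator

# A temporary directory stands in for the real `supplements` location.
out_dir = os.path.join(tempfile.mkdtemp(), 'supplements')
os.makedirs(out_dir, exist_ok=True)
path = os.path.join(out_dir, 'mt.pt')

# Save only the parameters of the model.
torch.save(net.state_dict(), path)

# To restore, rebuild the same architecture and load the parameters into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
```

Saving the `state_dict` keeps the file independent of how the class is defined, at the cost of having to reconstruct the model with the same constructor arguments before loading.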

**Answer**.

In [None]:
## HELP: if you compute your loss using PyTorch's CrossEntropyLoss, make sure you read 
## the documentation carefully and compute your loss correctly.
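
As a hedged illustration of the point above: `nn.CrossEntropyLoss` applies log-softmax internally, so it should receive unnormalized scores rather than probabilities, and it expects the class dimension in position 1. The sizes and random tensors below are hypothetical placeholders for a predictor's output and the integer-coded targets.

```python
import torch
import torch.nn as nn

batch_size, length, num_char = 4, 50, 30  # hypothetical sizes

# Unnormalized scores from a predictor (no softmax applied).
logits = torch.randn(batch_size, length, num_char)
# Integer class indices of the true letters.
targets = torch.randint(0, num_char, (batch_size, length))

loss_fn = nn.CrossEntropyLoss()

# Option 1: flatten batch and time so the inputs are (N, C) and (N,).
loss = loss_fn(logits.reshape(-1, num_char), targets.reshape(-1))

# Option 2: move the class dimension to position 1, giving (batch, C, length).
loss_alt = loss_fn(logits.permute(0, 2, 1), targets)
```

Both options average over all letters in the batch and give the same value. If the targets are stored as one-hot vectors, `argmax(dim=-1)` recovers the integer indices.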

In [13]:
def jaccard(y_true: list[str], y_pred: list[str]):
    ncorrect = 0.
    npred = 0.
    ntrue = 0.
    for i in range(len(y_true)):
        ntrue += len(y_true[i])
        npred += len(y_pred[i])
        for t in range(min(len(y_true[i]), len(y_pred[i]))):
            ncorrect += (y_pred[i][t] == y_true[i][t])
    return 2*ncorrect/(npred + ntrue)

# example usage
y_true = ['this is a correct translation', 'this is another correct translation']
y_pred = ['this is a norrect translatiom', 'that is another dorrect sranslation']
jaccard(y_true, y_pred)

0.90625
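
As a check on the example usage, the score can be reproduced by hand. The first pair has 29 letters in each string and 27 positions that agree, and the second pair has 35 letters in each string and 31 positions that agree, so
\begin{align*}
    \text{score} &= \frac{2 \times \text{ncorrect}}{\text{npred} + \text{ntrue}}
    = \frac{2 \times (27 + 31)}{(29 + 35) + (29 + 35)}
    = \frac{116}{128}
    = 0.90625.
\end{align*}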