## English to Italian automatic translation

Automatic language translation is often regarded as the most typical sequence-to-sequence problem. Traditional approaches based on explicitly modeling languages have proven difficult. In the last decade, deep learning has proven to be a more than viable solution to this problem.

Deep learning solutions only require a large bilingual corpus and computational resources. Encoder-decoder architectures are the most widely used. They can be implemented with recurrent networks (LSTM and the like) or transformers (which are the state of the art for this problem).

In this lab activity we will build a simple English-to-Italian translation system based on a pair of LSTM networks working together in an encoder-decoder architecture.

## Data

We will use a subset of the English-Italian bilingual dataset from the [Tatoeba Project](https://www.manythings.org/anki/).

There are two files, `text-eng.txt` and `text-ita.txt`, containing 333112 lines, each one reporting one sentence in English or Italian. Sentences are paired so that the i-th sentence in the English file has its translation as the i-th sentence of the Italian file.

Each sentence has already been converted to lowercase and rewritten as space-separated tokens (words and punctuation symbols). Each sentence starts with the special `<sos>` token and is terminated by the `<eos>` token. The longest sequences are 20 tokens long.

For instance, this is one example from the English file:

`<sos> do you want me to make coffee ? <eos>`

and this is the corresponding translation in the Italian file:

`<sos> vuoi che prepari del caffè ? <eos>`


In [None]:
# Download the files
URL = "https://drive.google.com/file/d/1_npGYZk13fs5hE0kAggiSrmKkqW3OrLT/view?usp=sharing"
!gdown --fuzzy $URL -O- | tar -xz

### Vocabularies

First, we need to build separate vocabularies for English and Italian.
For each language we need the list of unique tokens and an inverse mapping from each token to its index in the list.

The vocabularies must also include the special tokens `<sos>`, `<eos>` and `<pad>` (the last one is not in the dataset, but we will need it later). It is better if these three tokens have the same position (index) in both vocabularies.
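One possible way to build such a vocabulary is sketched below. It is only an illustration: the helper `build_vocabulary` and the toy sentences are assumptions made for this example and are not part of the dataset or of the exercise template. Placing the special tokens first gives them the same indices in every language.

```python
SPECIAL = ["<sos>", "<eos>", "<pad>"]


def build_vocabulary(lines, special=SPECIAL):
    """Return the special tokens followed by all other tokens in sorted order."""
    tokens = set()
    for line in lines:
        tokens.update(line.split())
    # Special tokens go first, so their indices do not depend on the language.
    return list(special) + sorted(tokens - set(special))


# Toy stand-ins for the two files (hypothetical sentences).
eng_lines = ["<sos> do you want coffee ? <eos>", "<sos> i want tea . <eos>"]
ita_lines = ["<sos> vuoi del caffè ? <eos>", "<sos> voglio del tè . <eos>"]

eng_vocabulary = build_vocabulary(eng_lines)
ita_vocabulary = build_vocabulary(ita_lines)

# Inverse mappings from token to index.
eng_inverse = {w: n for n, w in enumerate(eng_vocabulary)}
ita_inverse = {w: n for n, w in enumerate(ita_vocabulary)}

print(eng_vocabulary[:3], ita_vocabulary[:3])
print(eng_inverse["<pad>"], ita_inverse["<pad>"])
```

With the real data, the same function could be applied to the lines read from each file. Sorting the remaining tokens is a design choice that makes the indices reproducible across runs.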

In [None]:
SPECIAL = ["<sos>", "<eos>", "<pad>"]
MAXLEN = 20

f = open("text-eng.txt")
# Define the list of all tokens in the English set ...
ENG_VOCABULARY = ...

f.close()

f = open("text-ita.txt")
# Define the list of all tokens in the Italian set ...
ITA_VOCABULARY = ...

f.close()

# Make sure that the three special tokens have the same indices in the two vocabularies.
# Assign here the three indices...
SOS = ...
EOS = ...
PAD = ...

# Inverse mappings.
ENG_INVERSE = {w: n for n, w in enumerate(ENG_VOCABULARY)}
ITA_INVERSE = {w: n for n, w in enumerate(ITA_VOCABULARY)}

print(len(ENG_VOCABULARY), len(ITA_VOCABULARY))

### Encoding/decoding functions

We now need functions to map sentence strings into lists of numerical indices, and vice versa. These functions also take the vocabularies, or their inverses, as arguments, so that we can use them for both English and Italian.

Having all sequences of the same length simplifies training.
For this reason, `encode_sentence` should add padding to make sure that the list of codes includes exactly `MAXLEN` elements.
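A minimal sketch of the two functions follows, under some assumptions: the toy vocabulary is invented for the example, every token is assumed to be in the vocabulary (an unknown token would raise `KeyError`), and the decoder drops `<pad>` tokens, which is a choice made here rather than a requirement of the exercise.

```python
MAXLEN = 20
PAD_TOKEN = "<pad>"


def encode_sentence(sentence, inverse, maxlen=MAXLEN):
    """Map a space-separated sentence to exactly `maxlen` codes, padding at the end."""
    codes = [inverse[token] for token in sentence.split()]
    return codes + [inverse[PAD_TOKEN]] * (maxlen - len(codes))


def decode_sentence(codes, voc):
    """Map codes back to a sentence, skipping the padding tokens."""
    # int() lets the function accept both plain integers and tensor elements.
    tokens = [voc[int(code)] for code in codes]
    return " ".join(token for token in tokens if token != PAD_TOKEN)


# Toy vocabulary (hypothetical) with the special tokens in the first positions.
voc = ["<sos>", "<eos>", "<pad>", "?", "are", "how", "old", "you"]
inverse = {w: n for n, w in enumerate(voc)}

sentence = "<sos> how old are you ? <eos>"
codes = encode_sentence(sentence, inverse)
print(codes)
print(decode_sentence(codes, voc))
```

Using `split()` without arguments also discards the newline at the end of each line read from a file.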

In [None]:
def encode_sentence(sentence, inverse):
    """Translate the sentence as a list of numerical codes, given the inverse mapping."""
    # ...



def decode_sentence(codes, voc):
    """Translate a list of numerical codes into a sentence, given the mapping."""
    # ...



eng = "<sos> do you want me to make coffee ? <eos>"
codes = encode_sentence(eng, ENG_INVERSE)
print(codes)
print(decode_sentence(codes, ENG_VOCABULARY))

ita = "<sos> vuoi che prepari del caffè ? <eos>"
codes = encode_sentence(ita, ITA_INVERSE)
print(codes)
print(decode_sentence(codes, ITA_VOCABULARY))

### Dataset and data loader

All the data will be loaded into memory. `torch.utils.data.TensorDataset` will make the data accessible to the data loader.

In [None]:
import torch


with open("text-eng.txt") as f:
    eng_sentences = [encode_sentence(line, ENG_INVERSE) for line in f]

with open("text-ita.txt") as f:
    ita_sentences = [encode_sentence(line, ITA_INVERSE) for line in f]

train_set = torch.utils.data.TensorDataset(torch.tensor(eng_sentences), torch.tensor(ita_sentences))
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True, drop_last=True)

eng, ita = next(iter(train_loader))
print(eng.shape, eng.dtype, ita.shape, ita.dtype)

print(decode_sentence(eng[0], ENG_VOCABULARY))
print(decode_sentence(ita[0], ITA_VOCABULARY))

## Model

We will use an encoder-decoder architecture (picture from "Dive into deep learning").

![Encoder-decoder (seq2seq) architecture](https://d2l.ai/_images/seq2seq.svg)

The encoder will read the English sentence and encode it into a vector of features (we will use both the final hidden state and cell state).

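The decoder outputs the Italian tokens one at a time, each given the previous one.
The encoded input is passed to the decoder as its initial state and as an additional input at each step (in the code below, a linear projection of the last layer's final cell state is added to every token embedding).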

In [None]:
DIM = 256
DROPOUT = 0.2
LAYERS = 2

encoder = torch.nn.Sequential(
    torch.nn.Embedding(len(ENG_VOCABULARY), DIM),
    torch.nn.LSTM(DIM, DIM, batch_first=True, dropout=DROPOUT, num_layers=LAYERS)
)

class Decoder(torch.nn.Module):
    def __init__(self, embedding_size, hidden_size):
        super().__init__()
        self.embedding = torch.nn.Embedding(len(ITA_VOCABULARY), embedding_size)
        self.cell_linear = torch.nn.Linear(hidden_size, embedding_size)
        self.lstm = torch.nn.LSTM(embedding_size, hidden_size, batch_first=True, dropout=DROPOUT, num_layers=LAYERS)
        self.linear = torch.nn.Linear(hidden_size, len(ITA_VOCABULARY))

    def forward(self, input, hidden):
        cell_state = hidden[1][-1]
        output = self.embedding(input)
        y = self.cell_linear(cell_state).unsqueeze(1)
        output = output + y
        output, _ = self.lstm(output, hidden)
        output = self.linear(output)
        return output


decoder = Decoder(DIM, DIM)

input1 = torch.zeros(7, 22, dtype=torch.long)
_, hidden = encoder(input1)
print(input1.shape, "->", hidden[0].shape, hidden[1].shape)

input2 = torch.zeros(7, 22, dtype=torch.long)
output = decoder(input2, hidden)
print(input2.shape, "->", output.shape)

## Training

During training the cross entropy is minimized.
Each output from the decoder is compared to the next token in the output sequence.

Padding should be ignored during training. `torch.nn.CrossEntropyLoss` has an optional argument for this, `ignore_index`.
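The following sketch illustrates the two ideas above on a toy example: the target is the input sequence shifted by one position, and positions whose target is padding do not contribute to the loss. The vocabulary size, the index assigned to `<pad>` and the random scores standing in for the decoder output are all assumptions made for the illustration.

```python
import torch

PAD = 2         # assumed index of the <pad> token
VOCAB_SIZE = 5  # toy vocabulary size

torch.manual_seed(0)
# One target sequence: <sos>=0, two words, <eos>=1, then padding.
sequence = torch.tensor([[0, 3, 4, 1, PAD, PAD]])

decoder_input = sequence[:, :-1]  # what the decoder reads
target = sequence[:, 1:]          # what it should predict at each step

# Stand-in for the decoder output: (batch, time, vocabulary) scores.
logits = torch.randn(1, decoder_input.shape[1], VOCAB_SIZE)

loss_fun = torch.nn.CrossEntropyLoss(ignore_index=PAD)
# CrossEntropyLoss expects the class dimension second: (batch, vocabulary, time).
loss = loss_fun(logits.permute(0, 2, 1), target)

# The same value computed by hand on the non-padding positions only.
mask = target != PAD
manual = torch.nn.functional.cross_entropy(logits[mask], target[mask])
print(loss.item(), manual.item())
```

The two printed values should coincide, since with the default mean reduction the loss is averaged over the non-ignored targets only.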

In [None]:
EPOCHS = 10
LEARNING_RATE = 0.001
DEVICE = ("cuda" if torch.cuda.is_available() else "cpu")

encoder.to(DEVICE)
decoder.to(DEVICE)

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=LEARNING_RATE)
loss_fun = torch.nn.CrossEntropyLoss(ignore_index=PAD)

In [None]:
encoder.train()
decoder.train()

steps = 0
for epoch in range(EPOCHS):
    for lq, sq in train_loader:
        lq = lq.to(DEVICE)
        sq = sq.to(DEVICE)
        _, hidden = encoder(lq)
        output = decoder(sq[:, :-1], hidden)
        loss = loss_fun(output.permute(0, 2, 1), sq[:, 1:])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        steps += 1
        if steps % 1000 == 0:
            predictions = output.argmax(2)
            # Count correct predictions on non-padding positions only.
            mask = sq[:, 1:] != PAD
            correct = ((predictions == sq[:, 1:]) & mask).sum().item()
            total = mask.sum().item()
            accuracy = 100 * correct / max(total, 1)
            print(f"{steps} [{epoch}]  Loss: {loss.item():.4f}  Acc: {accuracy:.1f}%")
            print(decode_sentence(lq[0], ENG_VOCABULARY))
            print(decode_sentence(sq[0], ITA_VOCABULARY))
            print(decode_sentence(predictions[0], ITA_VOCABULARY))
            print()

## Using the model

To translate a new sentence, you need to:

1. encode the input sentence;
2. initialize the output sentence with the `<sos>` token;
3. pass the current output into the decoder together with the encoder state;
4. take the output token with the highest score, and add it to the current output;
5. repeat from step 3 until the `<eos>` token is generated.

Implement this algorithm and use it to translate some English sentences.
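The loop in steps 2 to 5 (greedy decoding) can be sketched independently of the model. In the sketch below, `next_token_scores` is a hypothetical callable standing in for the decoder, the toy scorer simply replays a stored sequence, and the indices chosen for the special tokens are assumptions. It is not the solution for the network above, only an outline of the control flow.

```python
SOS, EOS = 0, 1  # assumed indices of the special tokens
MAXLEN = 20


def greedy_decode(next_token_scores, state, maxlen=MAXLEN):
    """Generate token indices one at a time, always taking the best-scoring one.

    `next_token_scores(output, state)` is a hypothetical stand-in for the
    decoder: it returns one score per vocabulary entry for the token that
    should follow the partial `output`.
    """
    output = [SOS]
    while len(output) < maxlen:
        scores = next_token_scores(output, state)
        best = max(range(len(scores)), key=scores.__getitem__)
        output.append(best)
        if best == EOS:
            break
    return output


def toy_scores(output, state):
    """Toy "decoder": gives the highest score to the next token stored in `state`."""
    scores = [0.0] * 6
    scores[state[len(output) - 1]] = 1.0
    return scores


result = greedy_decode(toy_scores, state=[4, 3, 5, EOS])
print(result)
```

The `maxlen` bound guarantees termination even when `<eos>` is never produced. With the actual model, the scores would presumably come from the last time step of the decoder output for the current partial sentence.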

In [None]:
encoder.eval()
decoder.eval()

eng = "<sos> how old are you ? <eos>"
# eng = "<sos> i like to play tennis . <eos>"
# eng = "<sos> i hope it snows at christmas . <eos>"
# eng = "<sos> would you like to go to the movie theater ? <eos>"

input = torch.tensor([encode_sentence(eng, ENG_INVERSE)], device=DEVICE)
_, hidden = encoder(input)

output = torch.zeros(1, MAXLEN, dtype=torch.long, device=DEVICE)
output[0, 0] = SOS

# ...

ita = ...

print(ita)