# Neural Machine Translation  
If you are looking at this notebook, you probably already know what the NMT task is. 
Nevertheless, in a few words: we have a source text in language $X$ and we need to translate it into a target language $Y$. 

Let's take a look at our imports: we are going to use the `PyTorch` package for neural networks and `nltk` for tokenization.

In [None]:
import torch

import nltk
nltk.download('punkt')

In [None]:
from src.data_parser import DataParser
from src.tokenizer import Tokenizer
from src.dataset import NMTDataset
from src.models import NMTModel, Encoder, Decoder
from src.utils import seed_all

What is the keyword all over Deep Learning? Reproducibility, of course! Let's seed everything that we can seed.

In [None]:
seed_all(42)

Now let's take a closer look at our dataset. 
First, read the data and split it by language.

In [None]:
data_parser = DataParser('./data/rus.txt')
eng, ru = data_parser.split_by_languages()

In [None]:
def get_statistic(language_data, lang_name):
    from collections import Counter
    words = []
    for sample in language_data:
        sample = sample.lower()
        tokens = nltk.word_tokenize(sample)
        words.extend(tokens)
    cntr = Counter(words)
    print('-----------------------')
    print('{} language part contains {} samples and {} unique tokens'.format(lang_name, len(language_data), len(cntr)))
    print('15 most common tokens: ')
    most_common = cntr.most_common(15)
    for token, freq in most_common:
        print('{} : {}'.format(token, freq))
    print('-----------------------')

In [None]:
get_statistic(eng, 'english')

In [None]:
get_statistic(ru, 'russian')

Okay, we see that our target-language set contains 54k unique tokens, which is more than we expected. For our "toy" dataset, a vocabulary of 54k tokens could be too big. There are several ways to solve this issue:
* use special tokenization techniques, like BPE ([paper](https://arxiv.org/abs/1508.07909), [blogpost](https://leimao.github.io/blog/Byte-Pair-Encoding/), [fastest implementation by VK team](https://github.com/VKCOM/YouTokenToMe))
* use a special `[UNK]` token to replace tokens with low frequency.

In this notebook we are going to use the second way. `threshold=0.7` for the Russian tokenizer means that our vocabulary will keep 0.7 * 54k ~ 37k tokens.
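
To make the second option concrete, here is a minimal sketch of frequency-based vocabulary truncation. It is not the code from `src/tokenizer.py`: the helper names `keep_frequent_tokens` and `replace_rare` are hypothetical, whitespace splitting stands in for `nltk` tokenization, and it assumes that `threshold` means "keep this fraction of unique tokens, most frequent first".

```python
from collections import Counter

UNK = '[UNK]'  # placeholder for every token that did not make it into the vocabulary


def keep_frequent_tokens(samples, threshold=1.0):
    """Return the `threshold` fraction of unique tokens, most frequent first (assumed semantics)."""
    counts = Counter(tok for sample in samples for tok in sample.lower().split())
    keep = int(len(counts) * threshold)
    return {tok for tok, _ in counts.most_common(keep)}


def replace_rare(sample, vocab):
    """Replace every out-of-vocabulary token with UNK."""
    return [tok if tok in vocab else UNK for tok in sample.lower().split()]


samples = ['i love cats', 'i love dogs', 'i love cats and parrots']
vocab = keep_frequent_tokens(samples, threshold=0.5)  # 6 unique tokens -> keep 3
print(sorted(vocab))                         # ['cats', 'i', 'love']
print(replace_rare('i love parrots', vocab))  # ['i', 'love', '[UNK]']
```

The trade-off is simple: a smaller vocabulary means a smaller embedding matrix and output layer, but every rare word collapses into the same `[UNK]` id and cannot be translated.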

In [None]:
Tokenizer.build_vocab(eng, './data/vocab_eng.txt')
Tokenizer.build_vocab(ru, './data/vocab_ru.txt', threshold=0.7)

eng_tokenizer = Tokenizer('eng', './data/vocab_eng.txt')
ru_tokenizer = Tokenizer('ru', './data/vocab_ru.txt')

In [None]:
test_strings = ['I love dogs and cats', 'Я люблю собак и кошек']
inform = ['---------- test eng ----------', '---------- test ru ----------']
tokenizers = [eng_tokenizer, ru_tokenizer]
for test_str, inf, tokenizer in zip(test_strings, inform, tokenizers):
    print(inf)
    print(test_str)
    tokenized = tokenizer.tokenize(test_str)
    print(tokenized)
    encoded = tokenizer.encode(tokenized)
    print(encoded)
    decoded = tokenizer.decode(encoded)
    print(decoded)

## A few words about the model

Machine translation is a classic example of a seq2seq task, and the main architecture for this problem is called `encoder-decoder`.  
The `Encoder` projects the source text into a latent space.  
The `Decoder` takes the latent vector from the encoder (for an RNN, this could be the last hidden state) and generates a sequence in the target language.  
We train the whole system like a classic autoregressive language model, trying to predict the next token.
For the encoder and decoder we can use several NN architectures: RNNs (this notebook), CNNs, Transformers (next class).  

You can read [this](https://machinelearningmastery.com/encoder-decoder-recurrent-neural-network-models-neural-machine-translation/) blog post if you are still missing something.

![seq2seq](https://pytorch.org/tutorials/_images/seq2seq.png)
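
The idea above can be sketched in a few lines of PyTorch. This is a toy illustration, not the `Encoder`/`Decoder` from `src/models.py`: the class names `TinyEncoder` and `TinyDecoder`, the sizes, and the random token ids are all made up for the example. The encoder returns its last hidden state, the decoder starts from it, and the loss compares the prediction at each step with the next target token (teacher forcing).

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.gru = nn.GRU(emb_size, hidden_size, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        _, last_hidden = self.gru(self.emb(src))
        return last_hidden                       # (1, batch, hidden)


class TinyDecoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.gru = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt_in, hidden):           # tgt_in: (batch, tgt_len)
        out, _ = self.gru(self.emb(tgt_in), hidden)
        return torch.log_softmax(self.head(out), dim=-1)  # (batch, tgt_len, vocab)


torch.manual_seed(0)
enc, dec = TinyEncoder(20, 8, 16), TinyDecoder(30, 8, 16)
src = torch.randint(0, 20, (4, 7))               # 4 source sentences, 7 tokens each
tgt = torch.randint(0, 30, (4, 6))               # 4 target sentences, 6 tokens each

# Teacher forcing: feed target tokens 0..n-2 and predict tokens 1..n-1.
log_probs = dec(tgt[:, :-1], enc(src))
# NLLLoss expects the class dimension second: (batch, vocab, tgt_len).
loss = nn.NLLLoss()(log_probs.permute(0, 2, 1), tgt[:, 1:])
print(log_probs.shape, loss.item())
```

Squeezing the whole source sentence into a single hidden vector is the main weakness of this design, which is why attention is usually the first improvement to try.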

In [None]:
config = {
    'dataset': {
        'source_pad_len': 10,
        'target_pad_len': 10
    },
    'dataloader': {
        'train_bs': 40,
        'test_bs': 40
    },
    'encoder_cfg': {
        'vocab_size': eng_tokenizer.get_vocab_size(),
        'embedding_size': 256,
        'hidden_size': 128
    },
    'decoder_cfg': {
        'vocab_size': ru_tokenizer.get_vocab_size(),
        'embedding_size': 256,
        'hidden_size': 128
    },
    'optim': {
        'lr': 5e-5
    }
}

In [None]:
train, test = data_parser.train_test_split(0.9)

train_dataset = NMTDataset(train, eng_tokenizer, ru_tokenizer, **config['dataset'])
test_dataset = NMTDataset(test, eng_tokenizer, ru_tokenizer, **config['dataset'])

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=config['dataloader']['train_bs'], shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=config['dataloader']['test_bs'], shuffle=False)

## Baseline model
Encoder baseline model contains: Embedding -> GRU -> SpatialDropout  
Decoder: Embedding -> GRU -> pre-head layer -> head layer (with [weight tying](https://arxiv.org/abs/1608.05859))
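
Weight tying means that the output layer reuses the embedding matrix instead of learning a second one. A minimal sketch is below; it is a guess at the general pattern, not a copy of the `Decoder` in `src/models.py`. It assumes the pre-head layer exists to project the GRU output (size 128) back to the embedding size (256), so that the head and the embedding have matching shapes. The vocabulary size of 1000 is arbitrary.

```python
import torch
import torch.nn as nn

vocab_size, embedding_size, hidden_size = 1000, 256, 128

embedding = nn.Embedding(vocab_size, embedding_size)       # weight: (vocab, emb)
pre_head = nn.Linear(hidden_size, embedding_size)          # GRU output -> embedding space
head = nn.Linear(embedding_size, vocab_size, bias=False)   # weight: (vocab, emb)
head.weight = embedding.weight                             # both layers now share one matrix

gru_out = torch.randn(4, 10, hidden_size)                  # stand-in for the decoder GRU output
logits = head(pre_head(gru_out))
print(logits.shape)                                        # torch.Size([4, 10, 1000])
```

With a 37k-token target vocabulary and 256-dimensional embeddings, sharing the matrix saves roughly 9.5M parameters.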

### How to improve baseline model?
* Use the `pad_packed_sequence` and `pack_padded_sequence` functions in Encoder and Decoder.
* Use attention! In the `src/models.py` file you can find a class for GlobalAttention. I **STRONGLY RECOMMEND** reading this [article](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html). (**DO NOT FORGET ABOUT ATTENTION MASKS FOR PAD TOKENS!**)
* The baseline model is trained with the [teacher forcing](https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/) method; you can also read about [Professor forcing](https://arxiv.org/abs/1610.09038).
* You can find something interesting [here](https://arxiv.org/abs/1409.3215) and [here](https://arxiv.org/abs/1409.0473).
* Implement a bidirectional GRU/LSTM in the encoder.
* You can try to use self-attention in the encoder or decoder (**DO NOT FORGET ABOUT ATTENTION MASKS FOR PAD TOKENS!**).
* Write a validation loop (for example, check the loss on the validation dataset) :)
* You can try to implement [beam-search](https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/)/[nucleus sampling](https://arxiv.org/abs/1904.09751).
* Hyper-parameter tuning.
* Read about the BLEU metric and figure out how to score better on it (see the last cell).
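
As a starting point for the first item in the list, here is a minimal sketch of running a GRU over a padded batch with `pack_padded_sequence` and `pad_packed_sequence`. The padding id `PAD_ID = 0`, the token ids, and the layer sizes are assumptions for the example, not values from this project.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
PAD_ID = 0  # assumed padding index
tokens = torch.tensor([[5, 3, 8, 0, 0],
                       [4, 2, 0, 0, 0],
                       [7, 6, 1, 9, 2]])
lengths = (tokens != PAD_ID).sum(dim=1)          # real lengths: 3, 2, 5

emb = nn.Embedding(10, 8, padding_idx=PAD_ID)
gru = nn.GRU(8, 16, batch_first=True)

# Pack so that the GRU never steps over padding; lengths must live on the CPU.
packed = pack_padded_sequence(emb(tokens), lengths.cpu(),
                              batch_first=True, enforce_sorted=False)
packed_out, last_hidden = gru(packed)

# Unpack back to a regular (batch, seq_len, hidden) tensor; padded steps are zeros.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True,
                                       total_length=tokens.size(1))
print(out.shape, out_lengths.tolist())
```

The practical benefit is that `last_hidden` now corresponds to the last real token of each sentence rather than to the state after several padding steps.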

You can edit anything you want; your main task is to get the highest BLEU score.
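
Since the list above insists on attention masks for pad tokens, here is a minimal sketch of what such a mask does. It uses plain dot-product attention and is not the `GlobalAttention` class from `src/models.py`; the function name `masked_dot_attention` and the tensor sizes are invented for the example.

```python
import torch


def masked_dot_attention(query, keys, pad_mask):
    """Dot-product attention that ignores padded source positions.

    query:    (batch, hidden)          current decoder state
    keys:     (batch, src_len, hidden) encoder outputs
    pad_mask: (batch, src_len)         True where the source token is padding
    """
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)       # (batch, src_len)
    scores = scores.masked_fill(pad_mask, float('-inf'))          # padding gets zero weight
    weights = torch.softmax(scores, dim=1)
    context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)    # (batch, hidden)
    return context, weights


torch.manual_seed(0)
keys = torch.randn(2, 4, 16)
query = torch.randn(2, 16)
pad_mask = torch.tensor([[False, False, True, True],    # last two tokens are padding
                         [False, False, False, False]])
context, weights = masked_dot_attention(query, keys, pad_mask)
print(weights)
```

Without the mask, the softmax would hand part of the probability mass to padding positions, and the context vector would depend on how much padding the batch happens to contain. Note that a row consisting only of padding would produce NaNs, so every sequence needs at least one real token.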

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

encoder = Encoder(**config['encoder_cfg'])
decoder = Decoder(**config['decoder_cfg'])

model = NMTModel(encoder, decoder).to(device)
optimizer = torch.optim.Adam(model.parameters(), config['optim']['lr'])
criterion = torch.nn.NLLLoss(ignore_index=ru_tokenizer.encode(['<PAD>'])[0])

In [None]:
def validation_loop(model, test_loader, criterion) -> float:
    pass

In [None]:
def train_epoch(model, optimizer, loader, criterion, epoch, log_step=200):
    model.train()
    loss_val = []
    avg_loss = []
    iter_step = 1
    for batch in loader:
        optimizer.zero_grad()
        for key in batch.keys():
            batch[key] = batch[key].to(device)
        preds = model(batch)
        preds = preds.permute(0, 2, 1)
        loss = criterion(preds, batch['target_for_loss'])
        avg_loss.append(loss.detach().item())
        if iter_step % log_step == 0:
            avg_loss_val = sum(avg_loss) / len(avg_loss)
            print('epoch\t{}\t[{}/{}]\tloss: {:.4f}'.format(epoch, iter_step, len(loader), avg_loss_val))
            avg_loss = []
            loss_val.append(avg_loss_val)
        iter_step += 1
        loss.backward()
        optimizer.step()
    return loss_val

In [None]:
losses = []
EPOCHS = 2
for epoch in range(1, EPOCHS + 1):
    epoch_loss = train_epoch(model, optimizer, train_loader, criterion, epoch)
    losses.extend(epoch_loss) 

In [None]:
import matplotlib.pyplot as plt

plt.plot(losses)

In [None]:
def translate(model, sent, device):
    model.eval()
    sent = eng_tokenizer.encode(eng_tokenizer.tokenize(sent))
    # Inference only: no need to track gradients
    with torch.no_grad():
        dec_sent = model.translate(sent, device)
    return ' '.join(ru_tokenizer.decode(dec_sent))

In [None]:
translate(model, 'i love cats', device)

In [None]:
import sacrebleu

translated = []
target = []

for source, target_ in test:
    translated.append(translate(model, source, device))
    target.append(target_)

bleu = sacrebleu.corpus_bleu(translated, [target])
print(bleu.score)