### Homework: Transformers Training (50 points)

In this homework you need to train several Transformer-based models on a machine translation task. For training you can either use the current project or implement your own training pipeline. If you use the project, its **TODO** tags mark the components you need to implement.
The notebook should contain only the training results and your conclusions. The model architecture (number of layers, dimensionality, etc.) is up to you.

Publish your training code on your GitHub and put a link to it in the line below. The results in the notebook are graded first; the code is needed to check that the results are plausible.

You do not need to train the models to convergence; it is enough to demonstrate that the model trains and works: val_loss decreases and bleu_score grows.

#### Link to your GitHub project (insert your own) - https://github.com/ibragg/hw-3-dl-sber/tree/main

Upload the notebook with the results to your course **Google Drive**.

### Training a Seq2seq Transformer model (25 points)

Implement a Seq2seq Transformer. As the transformer block you can use https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html. As the tokenizer, use a HuggingFace tokenizer for the source/target languages - https://huggingface.co/docs/transformers/fast_tokenizers
For the maximum length, take sentences of **up to 15 words**, without any prefixes.
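
One way to apply the 15-word limit is to filter the sentence pairs before tokenization. The snippet below is a minimal sketch, assuming that words are separated by whitespace and that both sides of a pair must fit the limit; `filter_by_length` and `MAX_WORDS` are hypothetical names that do not come from the project.

```python
MAX_WORDS = 15  # limit stated in the assignment


def filter_by_length(pairs, max_words=MAX_WORDS):
    """Keep only pairs where both sentences have at most max_words words."""
    return [
        (src, trg)
        for src, trg in pairs
        if len(src.split()) <= max_words and len(trg.split()) <= max_words
    ]


sample_pairs = [
    ("Towels and bed linen are available.", "Предоставляются полотенца и постельное белье."),
    (" ".join(["word"] * 20), "короткое предложение"),  # source side is too long
]
kept = filter_by_length(sample_pairs)
print(len(kept))
```

Whether punctuation counts as a word depends on the tokenizer; with `WordPunctTokenizer` the count would be higher than with a plain whitespace split.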

Do not forget the remaining elements of the model:
* We can use a single transformer as the encoder, with a linear layer acting as the decoder.
* Train your own BPE tokenizer - https://huggingface.co/docs/transformers/fast_tokenizers
* Token embedding matrix
* Positional embedding matrix
* Linear projection layer into the target vocabulary
* A function that masks future positions in attention, since the model is autoregressive
* Learning rate scheduler
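
For the learning rate scheduler, a common choice for Transformers is linear warmup followed by inverse square-root decay. The sketch below only computes the multiplicative factor; the name `warmup_inv_sqrt_factor` and the default values `d_model=512` and `warmup_steps=4000` are assumptions, not settings taken from this notebook.

```python
def warmup_inv_sqrt_factor(step, d_model=512, warmup_steps=4000):
    """Learning rate factor: grows linearly during warmup, then decays as 1/sqrt(step)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The factor peaks exactly at the end of the warmup phase.
for step in (1, 2000, 4000, 8000):
    print(step, warmup_inv_sqrt_factor(step))
```

In PyTorch such a function could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, with the optimizer's base learning rate set to 1.0 so that the factor itself becomes the learning rate.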


As results, attach the following:
1) Training parameters - learning rate, batch_size, epoch_num, hidden size, number of layers
2) Training curves - train loss, val loss, bleu score
3) Example translations from your model (10 of them) - source text, true target text, predicted target text

In [8]:
import os
import math
import time
import csv
import shutil
import random
import itertools
from collections import Counter
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from nltk.tokenize import WordPunctTokenizer
from nltk.translate.bleu_score import corpus_bleu

import torch
import torch.nn as nn
import torch.nn.functional as F  # used as F.softmax in the attention decoder
import torch.optim as optim

from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter

In [9]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [24]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [10]:
with open("./data/data.txt") as f:
    pairs = list(csv.reader(f, delimiter="\t", quotechar='"'))
    
print(f"Total number of pairs: {len(pairs)}")
print(f"Sample pair:\n{random.choice(pairs)}")

Total number of pairs: 50000
Sample pair:
['It offers self-catering cottages, a shared terrace with outdoor furniture and BBQ facilities.', 'К вашим услугам общая меблированная терраса, принадлежности для барбекю и коттеджи с собственной кухней.']


In [11]:
train, test = train_test_split(pairs, test_size=0.05, random_state=1)
train, val = train_test_split(train, test_size=0.15, random_state=1)

print(f"train: {len(train) / len(pairs)}")
print(f"val: {len(val) / len(pairs)}")
print(f"test: {len(test) / len(pairs)}")

train: 0.8075
val: 0.1425
test: 0.05
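
The printed fractions follow from applying the two splits in sequence: the validation split is taken from the 95% that remains after the test split. A short check of that arithmetic, using only the `test_size` values from the cell above:

```python
test_frac = 0.05         # first split: share of all pairs held out for test
val_frac_of_rest = 0.15  # second split: share of the remaining pairs used for validation

test_share = test_frac
val_share = (1 - test_frac) * val_frac_of_rest
train_share = (1 - test_frac) * (1 - val_frac_of_rest)

print(round(train_share, 4), round(val_share, 4), round(test_share, 4))
```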


In [12]:
class TextPreprocessor:
    
    def __init__(self):
        """
        """
        self.tokenizer = WordPunctTokenizer()

    def __call__(self, text, split=True):
        """
            text: input string
            split: whether to split the text into tokens
            
            returns: lowercased text, as a list of tokens if split is True
        """
        text = text.lower()
        if split:
            text = self.tokenizer.tokenize(text)
        
        return text
    
    def get_unique_vocab(self, corpus, min_freq=3):
        """
            get unique words from a list of sentences
        """
        corpus_words = list(map(self, corpus))
        words = Counter(list(itertools.chain(*corpus_words)))
        words = [word for word, freq in words.items() if freq >= min_freq]
        return words

In [13]:
class TextTokenizer:
    
    def __init__(self, vocab, text_preprocessor):
        """
            vocab: words from training set
        """
        self.pad_token = "<pad>"
        self.sos_token = "<sos>"
        self.eos_token = "<eos>"
        self.unk_token = "<unk>"
        self.additional_tokens = [self.pad_token, self.sos_token, self.eos_token, self.unk_token]
        
        self.text_preprocessor = text_preprocessor
        
        self.vocab = vocab
        self.word2index = dict(zip(
            self.additional_tokens + vocab,
            range(len(vocab) + len(self.additional_tokens))
        ))
        
        self.index2word = {v: k for k, v in self.word2index.items()}

    @property
    def vocab_size(self):
        """
            returns: number of words in dict
        """
        return len(self.word2index)
    
    def detokenize(self, indexes, remove_tech_tokens=True):
        if remove_tech_tokens:
            return " ".join([self.index2word[idx] for idx in indexes if self.index2word[idx] not in self.additional_tokens])
        else:
            return " ".join([self.index2word[idx] for idx in indexes])
    
    def __call__(self, text):
        """
            text: text
            
            returns: indexes list
        """
        text = self.text_preprocessor(text)
        indexes = [self.word2index[self.sos_token]] +\
                  [self.word2index[token] if token in self.word2index else self.word2index[self.unk_token] \
                   for token in text] +\
                  [self.word2index[self.eos_token]]
        
        return indexes

In [14]:
text_preprocessor = TextPreprocessor()

ru_vocab = text_preprocessor.get_unique_vocab([pair[1] for pair in train])
en_vocab = text_preprocessor.get_unique_vocab([pair[0] for pair in train])

ru_tokenizer = TextTokenizer(ru_vocab, text_preprocessor)
en_tokenizer = TextTokenizer(en_vocab, text_preprocessor)
print(f"number of tokens in ru vocab: {ru_tokenizer.vocab_size}")
print(f"number of tokens in en vocab: {en_tokenizer.vocab_size}")

number of tokens in ru vocab: 9367
number of tokens in en vocab: 6738


In [15]:
class TranslationDataset(Dataset):
    
    def __init__(self, src, trg, src_tokenizer, trg_tokenizer):
        """
            src: source language sentences
            trg: target language sentences
            src_tokenizer: source language tokenizer
            trg_tokenizer: target language tokenizer
        """
        self.src = src
        self.trg = trg
        
        self.src_tokenizer = src_tokenizer
        self.trg_tokenizer = trg_tokenizer
        
        self.data = []
        for s, t in zip(src, trg):
            self.data.append((self.src_tokenizer(s), self.trg_tokenizer(t)))
                
    def __len__(self):
        """
            returns: number of pairs
        """
        return len(self.data)
    
    def __getitem__(self, idx):
        """
            returns: ([indexes of source sentence], [indexes of target sentence])
        """
        return self.data[idx]

In [16]:
train_dataset = TranslationDataset(
    src=[pair[1] for pair in train],
    trg=[pair[0] for pair in train],
    src_tokenizer=ru_tokenizer,
    trg_tokenizer=en_tokenizer
)

val_dataset = TranslationDataset(
    src=[pair[1] for pair in val],
    trg=[pair[0] for pair in val],
    src_tokenizer=ru_tokenizer,
    trg_tokenizer=en_tokenizer
)

test_dataset = TranslationDataset(
    src=[pair[1] for pair in test],
    trg=[pair[0] for pair in test],
    src_tokenizer=ru_tokenizer,
    trg_tokenizer=en_tokenizer
)

In [17]:
print(train[-2])
print(train_dataset[-2])
print([ru_tokenizer.word2index[word] for word in text_preprocessor(train[-2][1])])
print(ru_tokenizer.detokenize(train_dataset[-2][0]))
print(en_tokenizer.detokenize(train_dataset[-2][1]))

['The kitchen is equipped with a dishwasher.', 'Кухня оснащена посудомоечной машиной.']
([1, 177, 1559, 427, 428, 20, 2], [1, 80, 171, 6, 170, 8, 9, 178, 22, 2])
[177, 1559, 427, 428, 20]
кухня оснащена посудомоечной машиной .
the kitchen is equipped with a dishwasher .


In [18]:
src_pad_idx = ru_tokenizer.word2index["<pad>"]
trg_pad_idx = en_tokenizer.word2index["<pad>"]

def collate_fn(batch):
    """
        batch: pair of lists of words indexes
        
        returns: list of src, list of trg
    """
    src, trg = map(list, zip(*batch))
    max_src = max(list(map(len, src)))
    max_trg = max(list(map(len, trg)))
    src = torch.tensor([s + [src_pad_idx] * (max_src - len(s)) for s in src])
    trg = torch.tensor([t + [trg_pad_idx] * (max_trg - len(t)) for t in trg])
    
    return src, trg
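
The padding step inside `collate_fn` can be illustrated without tensors. This is a minimal sketch on plain lists; `pad_batch` is a hypothetical helper, and the pad index 0 is assumed because `<pad>` is the first of the additional tokens.

```python
PAD_IDX = 0  # assumed index of "<pad>"


def pad_batch(sequences, pad_idx=PAD_IDX):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]


toy_batch = [[1, 177, 1559, 2], [1, 20, 2]]
padded = pad_batch(toy_batch)
print(padded)  # [[1, 177, 1559, 2], [1, 20, 2, 0]]
```

Because the padding length is set per batch, batches made of short sentences stay small, which is why the tensor shapes change from batch to batch.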

In [19]:
BATCH_SIZE = 32

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

batch = next(iter(train_dataloader))
src, trg = batch
print(src.shape)
print(trg.shape)
src

torch.Size([32, 47])
torch.Size([32, 39])


tensor([[  1,   4,   5,  ...,   0,   0,   0],
        [  1,  21,  22,  ...,   0,   0,   0],
        [  1,  34,   3,  ...,   0,   0,   0],
        ...,
        [  1, 259,  23,  ...,   0,   0,   0],
        [  1,  76,  77,  ...,   0,   0,   0],
        [  1, 279, 159,  ...,   0,   0,   0]])

In [20]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.output_dim = output_dim
        self.n_layers = n_layers
        self.dropout = dropout
        
        self.embedding = nn.Embedding(
            num_embeddings=output_dim,
            embedding_dim=emb_dim
        )
        
        self.rnn = nn.LSTM(
            input_size=emb_dim,
            hidden_size=hid_dim,
            num_layers=n_layers,
            dropout=dropout,
            batch_first=True
        )
        
        self.out = nn.Linear(
            in_features=hid_dim,
            out_features=output_dim
        )
        
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        #input = [batch size, 1]
        input = input.unsqueeze(1)
        
        embedded = self.embedding(input)
        embedded = self.dropout(embedded)
        
        
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        prediction = self.out(output.squeeze(1))
        
        return prediction, hidden, cell

In [21]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(
            num_embeddings=input_dim,
            embedding_dim=emb_dim
        )
        
        self.rnn = nn.LSTM(
            input_size=emb_dim,
            hidden_size=hid_dim,
            num_layers=n_layers,
            dropout=dropout,
            batch_first=True
        )
        
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, src):
        
        embedded = self.embedding(src)
        embedded = self.dropout(embedded)
        
        output, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src sent len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return output, (hidden, cell)
    

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[0]
        max_len = trg.shape[1]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(batch_size, max_len, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        encoder_outputs, (hidden, cell) = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[:, 0]
        
        for t in range(1, max_len):
            
            output, hidden, cell = self.decoder(input, hidden, cell, encoder_outputs)
            outputs[:,t,:] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1] # max on every batch, pick indexes
            input = (trg[:, t] if teacher_force else top1)
        
        return outputs

In [53]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.output_dim = output_dim
        self.n_layers = n_layers
        self.dropout = dropout
        
        self.embedding = nn.Embedding(
            num_embeddings=output_dim,
            embedding_dim=emb_dim
        )
        
        self.dropout = nn.Dropout(p=dropout)
        
        self.rnn = nn.LSTM(
            input_size=emb_dim,
            hidden_size=hid_dim,
            num_layers=n_layers,
            dropout=dropout,
            batch_first=True
        )
        
        self.W_attn = nn.Linear(hid_dim, hid_dim, bias=False)
        
        self.out = nn.Linear(
            in_features=2*hid_dim,
            out_features=output_dim
        )
        
        
    def forward(self, input, hidden, cell, encoder_outputs):
        
        input = input.unsqueeze(1)
        
        embedded = self.embedding(input)
        embedded = self.dropout(embedded)
        
        _, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        ht_x_W = self.W_attn(hidden[-1]).unsqueeze(-1) # take hidden state only from last layer [B, HID_DIM, 1]
        scores = F.softmax(torch.bmm(encoder_outputs, ht_x_W), dim=1) # encoder_outputs = [B, SEQ_LEN, HID_DIM]
        weighted_states = encoder_outputs * scores # dim same as encoder_outputs
        attention_output = weighted_states.sum(1) # [B, HID_DIM]

        # hidden[-1] is already [B, HID_DIM]; squeezing dim 0 would drop the batch dimension when B == 1
        prediction = self.out(torch.cat((hidden[-1], attention_output), dim=-1))
        
        return prediction, hidden, cell
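
The attention step in this decoder scores every encoder state against the decoder's last hidden state, normalizes the scores with a softmax over the source positions, and takes the weighted sum of the encoder states. The sketch below reproduces that computation on plain Python lists for a single example; it assumes the learned projection `W_attn` is the identity, and `attend` is a hypothetical name.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(encoder_outputs, query):
    """Dot-product attention: returns (weights over source positions, context vector)."""
    scores = [sum(e * q for e, q in zip(enc, query)) for enc in encoder_outputs]
    weights = softmax(scores)
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
        for i in range(len(query))
    ]
    return weights, context


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # SEQ_LEN = 3, HID_DIM = 2
query = [2.0, 0.0]                                       # decoder hidden state
weights, context = attend(encoder_outputs, query)
print(weights, context)
```

Positions whose encoder state aligns with the query receive most of the weight, so the context vector is dominated by those states.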

In [57]:
INPUT_DIM = ru_tokenizer.vocab_size
OUTPUT_DIM = en_tokenizer.vocab_size
MAX_SEQ_LEN = 100
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device)
outputs = model(src, trg)
print(outputs.shape)

print(f'The model has {count_parameters(model):,} trainable parameters')

torch.Size([128, 46, 6738])
The model has 18,647,890 trainable parameters


In [59]:
src, trg = next(iter(train_dataloader))

In [61]:
class Trainer:
    
    def __init__(
            self, 
            model, 
            optimizer, 
            criterion, 
            logdir="./logs", 
            device=None
    ):
        self.device = device
        self.model = model.to(self.device)
        self.optimizer = optimizer
        self.criterion = criterion.to(self.device)
        self.logdir = logdir
        self._writer = SummaryWriter(log_dir=logdir)
    
    def _calculate_loss(self, batch, train=True):
        """
            batch: 
            
            returns: batch loss
        """
        src, trg = batch
        
        src = src.to(self.device)
        trg = trg.to(self.device)
        
        if train:
            output = self.model(src, trg, self.teacher_forcing_ratio)
        else:
            output = self.model(src, trg, 0)
        
        output = output[:,1:].reshape(-1, output.shape[-1])
        trg = trg[:,1:].reshape(-1)
        
        loss = self.criterion(output, trg)
        
        return loss
    
    def _train_step(self, dataloader):
        """
            returns: loss on the training dataset
        """
        self.model.train()
        epoch_loss = 0.0
        
        for batch in tqdm(dataloader):
            self.optimizer.zero_grad()
            
            loss = self._calculate_loss(batch, train=True)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip)
            self.optimizer.step()
            
            epoch_loss += loss.item()
            
        return epoch_loss / len(dataloader)
    
    def _eval_step(self, dataloader):
        """
            dataloader: validation dataloader
            
            returns: validation loss
        """
        self.model.eval()
        
        epoch_loss = 0.0
        
        with torch.no_grad():
            for batch in dataloader:
                loss = self._calculate_loss(batch, train=False)
                epoch_loss += loss.item()  # accumulate a float, not a tensor
            
        return epoch_loss / len(dataloader)
    
    def train(self, dataloaders, n_epochs, clip=1, teacher_forcing_ratio=0.5, verbose=True):
        """
            dataloaders: dict of the form {'train': train_dataloader, 'eval': eval_dataloader}
            n_epochs: number of training epochs
            verbose: whether to print loss information after every epoch
        """
        start = time.time()
        
        self.train_loss = []
        self.val_loss = []
        
        self.clip=clip
        self.teacher_forcing_ratio=teacher_forcing_ratio
        
        self._n_epoch = 1
        for epoch in range(n_epochs):
            train_loss = self._train_step(dataloaders['train'])
            eval_loss = self._eval_step(dataloaders['eval'])
            
            self.train_loss.append(train_loss)
            self.val_loss.append(eval_loss)
            
            if self._writer is not None:
                self._writer.add_scalar('train/loss', train_loss, global_step=self._n_epoch)
                self._writer.add_scalar('eval/loss', eval_loss, global_step=self._n_epoch)
                
            if verbose:
                print(
                    'epoch: {:>2}, train loss: {:.4f}, eval loss: {:.4f}, time: {:.4f}' \
                        .format(epoch + 1, train_loss, eval_loss, time.time() - start)
                )
                    
            self._n_epoch += 1
            
        self.train_time = time.time() - start

In [62]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [63]:
INPUT_DIM = ru_tokenizer.vocab_size
OUTPUT_DIM = en_tokenizer.vocab_size
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device)

In [64]:
N_EPOCHS = 10
LR = 0.001
CLIP = 1
TEACHER_FORCING_RATIO = 0.5
BATCH_SIZE = 128

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

dataloaders = {
    "train": train_dataloader,
    "eval": val_dataloader
}

In [65]:
PAD_IDX = en_tokenizer.word2index[en_tokenizer.pad_token]

optimizer = optim.Adam(model.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)

trainer = Trainer(model, optimizer, criterion, device=device)

In [59]:
class Evaluator:
    
    def __init__(self, model, src_tokenizer, trg_tokenizer, device):
        """
            model: model
            src_tokenizer: source language tokenizer
            trg_tokenizer: target language tokenizer
        """
        self.model = model.to(device)
        
        self.src_tokenizer = src_tokenizer
        self.trg_tokenizer = trg_tokenizer
        self.device = device
        
    @staticmethod
    def _remove_tech_tokens(mystr, tokens_to_remove=['<eos>', '<sos>', '<unk>', '<pad>']):
        return [x for x in mystr if x not in tokens_to_remove]
    
    def translate_text(self, text):
        self.model.eval()
        
        one_text_dataset = TranslationDataset(
            src=[text],
            trg=[text],
            src_tokenizer=self.src_tokenizer,
            trg_tokenizer=self.trg_tokenizer
        )
        one_text_dataloader = DataLoader(one_text_dataset, batch_size=1, shuffle=False, collate_fn=collate_fn)
        
        with torch.no_grad():

            src, trg = next(iter(one_text_dataloader))

            src = src.to(self.device)
            trg = trg.to(self.device)

            output = self.model(src, src, 0)

            output = output.argmax(dim=-1)

            original_text = [self.trg_tokenizer.detokenize(x) for x in trg.detach().cpu().numpy()]
            generated_text = [self.trg_tokenizer.detokenize(x) for x in output[:,1:].detach().cpu().numpy()]
            
        return original_text, generated_text

    def translate(self, dataloader):
        
        self.model.eval()
        
        original_text = []
        generated_text = []
        with torch.no_grad():
            for batch in tqdm(dataloader):

                src, trg = batch

                src = src.to(self.device)
                trg = trg.to(self.device)

                output = self.model(src, src, 0) # trg doesn't matter

                output = output.argmax(dim=-1)

                original_text.extend([self.trg_tokenizer.detokenize(x) for x in trg.detach().cpu().numpy()])
                generated_text.extend([self.trg_tokenizer.detokenize(x) for x in output[:,1:].detach().cpu().numpy()])
                
        return original_text, generated_text

In [60]:
device = torch.device('cpu')
print(device)

evaluator = Evaluator(trainer.model, ru_tokenizer, en_tokenizer, device)

cpu


In [61]:
original_text, generated_text = evaluator.translate(test_dataloader)

100%|██████████| 10/10 [00:24<00:00,  2.44s/it]


In [62]:
test[10:15]

[['The modern rooms are air conditioned and have a balcony. They offer a flat-screen TV and a fridge.',
  'Современные номера с балконом оснащены кондиционером, телевизором с плоским экраном и холодильником.'],
 ['Vivanta by Taj is located in Chennai’s commercial district, 200 metres from Spencer’s Plaza Mall.',
  'Отель Vivanta by Taj находится в коммерческом районе города Ченнай, в 200 метрах от торгового центра Spencer’s Plaza.'],
 ['Towels and bed linen are available.',
  'Предоставляются полотенца и постельное белье.'],
 ['This holiday home is 11 km from Santorini (Thira) Airport.',
  'Расстояние до аэропорта Санторини составляет 11 км.'],
 ['The functional rooms here will provide you with cable TV and air conditioning.',
  'Номера отеля отличаются практичным оформлением и имеют телевизор с кабельными каналами и кондиционер.']]

In [63]:
original_text[10:15]

['the modern rooms are air conditioned and have a balcony . they offer a flat - screen tv and a fridge .',
 'by is located in chennai ’ s commercial district , 200 metres from ’ s plaza mall .',
 'towels and bed linen are available .',
 'this holiday home is 11 km from santorini ( ) airport .',
 'the functional rooms here will provide you with cable tv and air conditioning .']

In [64]:
generated_text[10:15]

['the air - conditioned rooms are air conditioned and feature a flat - screen tv and a seating area . .',
 'is located in the heart of , just metres from the of , and metres from the .',
 'towels and bed linen are offered . apartment . . .',
 'the nearest airport is pulkovo airport , 19 km from the property . . .',
 'rooms at the are air conditioned and feature a tv and a minibar . .']

In [65]:
original_text = [text.split(" ") for text in original_text]
generated_text = [text.split(" ") for text in generated_text]

In [66]:
bleu = corpus_bleu([[text] for text in original_text], generated_text) * 100
print(bleu)

16.56573766349486


In [67]:
model_name = "seq2seq"

results[model_name]["bleu"] = bleu
results[model_name]["n_params"] = count_parameters(trainer.model)
results[model_name]["train_time"] = trainer.train_time
results[model_name]["train_loss"] = trainer.train_loss
results[model_name]["val_loss"] = trainer.val_loss

## seq2seq

In [68]:
from torch import Tensor
from torch.nn import Transformer
import math
from torch.nn.utils.rnn import pad_sequence

In [69]:
DEVICE = torch.device('cpu')
print(DEVICE)

cpu


In [70]:
SRC_PAD_IDX = ru_tokenizer.word2index["<pad>"]
TRG_PAD_IDX = en_tokenizer.word2index["<pad>"]

def collate_fn(batch):
    """
        batch: pair of lists of words indexes
        
        returns: list of src, list of trg
    """
    src, trg = map(list, zip(*batch))
    
    src = list(map(torch.tensor, src))
    trg = list(map(torch.tensor, trg))
    
    src = pad_sequence(src, padding_value=SRC_PAD_IDX)
    trg = pad_sequence(trg, padding_value=TRG_PAD_IDX)
    
    return src, trg

In [71]:
BATCH_SIZE = 256

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

dataloaders = {
    "train": train_dataloader,
    "eval": val_dataloader
}

In [72]:
batch = next(iter(train_dataloader))
src, trg = batch
src = src.to(DEVICE)
trg = trg.to(DEVICE)
print(src.shape)
print(trg.shape)
src

torch.Size([56, 256])
torch.Size([46, 256])


tensor([[  1,   1,   1,  ...,   1,   1,   1],
        [  4,  21,  34,  ...,  27, 166, 185],
        [  5,  22,   3,  ..., 261, 460, 202],
        ...,
        [  0,   0,   0,  ...,   0,   0,   0],
        [  0,   0,   0,  ...,   0,   0,   0],
        [  0,   0,   0,  ...,   0,   0,   0]])

In [73]:
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == SRC_PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == TRG_PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask
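
The additive mask built by `generate_square_subsequent_mask` has 0 where attention is allowed and `-inf` where it is blocked, so position `i` can only attend to positions `j <= i`. A pure-Python sketch of the same pattern, with `causal_mask` as a hypothetical name:

```python
def causal_mask(size):
    """Row i holds 0.0 for columns j <= i and -inf for future positions j > i."""
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(size)] for i in range(size)]


for row in causal_mask(3):
    print(row)
# [0.0, -inf, -inf]
# [0.0, 0.0, -inf]
# [0.0, 0.0, 0.0]
```

Adding `-inf` to an attention score before the softmax drives the corresponding weight to zero, which is what keeps the decoder from looking at future target tokens during training.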

In [74]:
src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, trg)

In [79]:
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])


class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)


class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)
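
For reference, the `PositionalEncoding` module above computes the standard sinusoidal encoding. Writing $d$ for `emb_size` and $i$ for the index of a sine/cosine pair, the entries of `pos_embedding` are:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
$$

In the code, `den` equals $\exp\!\left(-2i \cdot \ln(10000) / d\right) = 10000^{-2i/d}$, so `pos * den` is exactly the argument of the sine and cosine. `TokenEmbedding` multiplies the token embeddings by $\sqrt{d}$ before the positional encoding is added.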

In [80]:
torch.manual_seed(0)

SRC_VOCAB_SIZE = ru_tokenizer.vocab_size
TGT_VOCAB_SIZE = en_tokenizer.vocab_size
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

In [81]:
logits = transformer(src, trg, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)
logits.shape

torch.Size([46, 256, 6738])

In [82]:
count_parameters(transformer)

24327250

In [83]:
class Trainer:
    
    def __init__(
            self, 
            model, 
            optimizer, 
            criterion, 
            logdir="./logs", 
            device=None
    ):
        self.device = device
        self.model = model.to(self.device)
        self.optimizer = optimizer
        self.criterion = criterion.to(self.device)
        self.logdir = logdir
        self._writer = SummaryWriter(log_dir=logdir)
    
    def _calculate_loss(self, batch, train=True):
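        """
            batch: a (src, trg) pair of token-id tensors, each shaped (seq_len, batch_size)
            train: currently unused

            returns: scalar cross-entropy loss for the batch
        """
        src, trg = batch
        
        src = src.to(self.device)
        trg = trg.to(self.device)
        
        # Teacher forcing: the decoder sees every target token except the last one...
        tgt_input = trg[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
        logits = self.model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        # ...and is scored against the same sequence shifted one step to the left.
        tgt_out = trg[1:, :]
        loss = self.criterion(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))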
        
        return loss
    
    def _train_step(self, dataloader):
        """
            returns: mean loss over the batches of the training dataloader
        """
        self.model.train()
        epoch_loss = 0.0
        
        for batch in tqdm(dataloader):
            self.optimizer.zero_grad()
            
            loss = self._calculate_loss(batch, train=True)
            loss.backward()
            self.optimizer.step()
            
            epoch_loss += loss.item()
            
        return epoch_loss / len(dataloader)
    
    def _eval_step(self, dataloader):
        """
            dataloader: dataloader used for validation
            
            returns: mean loss over the validation batches
        """
        self.model.eval()
        
        epoch_loss = 0.0
        
        with torch.no_grad():
            for batch in dataloader:
                loss = self._calculate_loss(batch, train=False)
                # Accumulate a Python float, as in _train_step, rather than a device tensor.
                epoch_loss += loss.item()
            
        return epoch_loss / len(dataloader)
    
    def train(self, dataloaders, n_epochs, verbose=True):
        """
            dataloaders: dict of the form {'train': train_dataloader, 'eval': eval_dataloader}
            n_epochs: number of training epochs
            verbose: whether to print the losses after every epoch
        """
        start = time.time()
        
        self.train_loss = []
        self.val_loss = []
        
        self._n_epoch = 1
        for epoch in range(n_epochs):
            train_loss = self._train_step(dataloaders['train'])
            eval_loss = self._eval_step(dataloaders['eval'])
            
            self.train_loss.append(train_loss)
            self.val_loss.append(eval_loss)
            
            if self._writer is not None:
                self._writer.add_scalar('train/loss', train_loss, global_step=self._n_epoch)
                self._writer.add_scalar('eval/loss', eval_loss, global_step=self._n_epoch)
                
            if verbose:
                print(
                    'epoch: {:>2}, train loss: {:.4f}, eval loss: {:.4f}, time: {:.4f}' \
                        .format(epoch + 1, train_loss, eval_loss, time.time() - start)
                )
                    
            self._n_epoch += 1
            
        self.train_time = time.time() - start
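
The shift in `_calculate_loss` is the usual teacher-forcing setup: the decoder is fed `trg[:-1]` and scored against `trg[1:]`, and `nn.CrossEntropyLoss` with `ignore_index` averages only over non-padding targets. A minimal pure-Python sketch of that idea on one toy sequence follows. The ids `PAD`, `SOS`, `EOS` and the vocabulary size are made up for illustration and are not the tokenizer's real values.

```python
import math

PAD, SOS, EOS = 0, 1, 2  # hypothetical special-token ids


def shift_for_teacher_forcing(seq):
    """Return (decoder_input, expected_output) for one target sequence."""
    return seq[:-1], seq[1:]


def masked_nll(log_probs, targets, pad_id=PAD):
    """Mean negative log-likelihood over non-padding positions.

    log_probs[t][v] is the log-probability of token v at step t.
    """
    kept = [(lp, t) for lp, t in zip(log_probs, targets) if t != pad_id]
    return -sum(lp[t] for lp, t in kept) / len(kept)


trg = [SOS, 5, 7, EOS, PAD]
tgt_input, tgt_out = shift_for_teacher_forcing(trg)

# A "model" that predicts a uniform distribution over a toy vocabulary of 8 tokens.
vocab_size = 8
uniform = [[math.log(1 / vocab_size)] * vocab_size for _ in tgt_out]
loss = masked_nll(uniform, tgt_out)

print(tgt_input, tgt_out, loss)
```

The padded position contributes nothing, so the uniform model's loss equals `log(vocab_size)` regardless of how much padding the batch contains.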

In [84]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(DEVICE)

cuda


In [85]:
BATCH_SIZE = 64

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

dataloaders = {
    "train": train_dataloader,
    "eval": val_dataloader
}

In [86]:
torch.manual_seed(0)

SRC_VOCAB_SIZE = ru_tokenizer.vocab_size
TGT_VOCAB_SIZE = en_tokenizer.vocab_size
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128  # not used by the dataloaders above, which were built with BATCH_SIZE = 64
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

In [87]:
criterion = nn.CrossEntropyLoss(ignore_index = en_tokenizer.word2index[en_tokenizer.pad_token])
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

trainer = Trainer(transformer, optimizer, criterion, device=DEVICE)
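
The assignment lists a learning rate scheduler among the model components, while the cell above trains with a fixed `lr=0.0001`. One common option is the warmup schedule from the original Transformer paper. The sketch below is an assumption about what could be plugged in, not something this notebook ran; the name `noam_lr_factor` and the value `warmup_steps=4000` are illustrative.

```python
def noam_lr_factor(step, d_model=512, warmup_steps=4000):
    """Learning-rate multiplier: linear warmup, then inverse-square-root decay."""
    step = max(step, 1)  # guard against step 0, which would divide by zero
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The factor grows until warmup_steps and shrinks afterwards.
for step in (1, 1000, 4000, 16000):
    print(step, f"{noam_lr_factor(step):.6f}")
```

With PyTorch, a function like this can be passed to `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr_factor)` with the optimizer's base `lr` set to 1.0, calling `scheduler.step()` after each `optimizer.step()`.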

In [88]:
NUM_EPOCHS = 15

In [89]:
trainer.train(dataloaders, NUM_EPOCHS)

100%|██████████| 631/631 [00:45<00:00, 13.93it/s]

epoch:  1, train loss: 4.0490, eval loss: 2.8773, time: 47.7087
epoch:  2, train loss: 2.7067, eval loss: 2.3709, time: 95.2861
epoch:  3, train loss: 2.2997, eval loss: 2.1065, time: 142.8523
epoch:  4, train loss: 2.0423, eval loss: 1.9451, time: 190.3232
epoch:  5, train loss: 1.8547, eval loss: 1.8401, time: 237.9190
epoch:  6, train loss: 1.7062, eval loss: 1.7620, time: 285.4700
epoch:  7, train loss: 1.5822, eval loss: 1.6902, time: 332.9729
epoch:  8, train loss: 1.4753, eval loss: 1.6493, time: 380.5082
epoch:  9, train loss: 1.3837, eval loss: 1.6287, time: 428.0792
epoch: 10, train loss: 1.3025, eval loss: 1.5921, time: 475.6333
epoch: 11, train loss: 1.2278, eval loss: 1.5713, time: 523.2096
epoch: 12, train loss: 1.1581, eval loss: 1.5690, time: 570.8011
epoch: 13, train loss: 1.0978, eval loss: 1.5671, time: 618.2905
epoch: 14, train loss: 1.0392, eval loss: 1.5649, time: 665.8689
epoch: 15, train loss: 0.9868, eval loss: 1.5794, time: 713.3695
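
In the log above, validation loss bottoms out at epoch 14 and rises at epoch 15 while training loss keeps falling, so the last set of weights is not necessarily the best one. A small sketch of picking the best epoch from the recorded losses follows. The values are copied from the log; `best_epoch`, `should_stop` and `patience` are hypothetical helpers for illustration and are not part of the `Trainer` above.

```python
val_loss = [2.8773, 2.3709, 2.1065, 1.9451, 1.8401, 1.7620, 1.6902, 1.6493,
            1.6287, 1.5921, 1.5713, 1.5690, 1.5671, 1.5649, 1.5794]


def best_epoch(losses):
    """Return (1-based epoch, loss) for the lowest validation loss."""
    idx = min(range(len(losses)), key=losses.__getitem__)
    return idx + 1, losses[idx]


def should_stop(losses, patience=3):
    """True if none of the last `patience` epochs beat the earlier best loss."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return all(loss >= best_before for loss in losses[-patience:])


print(best_epoch(val_loss))               # (14, 1.5649)
print(should_stop(val_loss, patience=1))  # True
```

Saving a checkpoint whenever the validation loss improves would let the BLEU evaluation run on the epoch-14 weights instead of the final ones.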


In [90]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    # Encode the source once; the decoder reuses this memory at every step.
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        # Causal mask over the tokens generated so far.
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        # Project only the last position to vocabulary logits and take the argmax.
        logits = model.generator(out[:, -1])
        _, next_word = torch.max(logits, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        # Stop as soon as the end-of-sequence token is produced.
        if next_word == en_tokenizer.word2index[en_tokenizer.eos_token]:
            break
    return ys


def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    # Tokenize the source sentence into a column of token ids, shape (num_tokens, 1).
    src = torch.tensor(ru_tokenizer(src_sentence)).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model, src, src_mask, max_len=num_tokens + 5, start_symbol=en_tokenizer.word2index[en_tokenizer.sos_token]).flatten()
    return en_tokenizer.detokenize(list(tgt_tokens.cpu().numpy()))
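
Stripped of tensors, `greedy_decode` is a loop that asks the model for next-token scores, appends the argmax, and stops at the end-of-sequence token or at the length limit. The sketch below shows only that control flow. The callable `next_token_scores` is a hypothetical stand-in for `model.decode` followed by `model.generator`, and the token ids are made up.

```python
def greedy_decode_ids(next_token_scores, sos_id, eos_id, max_len):
    """Greedy decoding over plain lists of token ids."""
    ys = [sos_id]
    for _ in range(max_len - 1):
        scores = next_token_scores(ys)
        # Argmax over the vocabulary for the next position.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys


SOS, EOS = 1, 2  # hypothetical special-token ids


def toy_scores(prefix):
    """Toy 'model' over a 5-token vocabulary: emits 3, then 4, then EOS."""
    script = {1: 3, 2: 4}
    target = script.get(len(prefix), EOS)
    return [1.0 if i == target else 0.0 for i in range(5)]


print(greedy_decode_ids(toy_scores, SOS, EOS, max_len=10))  # [1, 3, 4, 2]
```

Because each step commits to the single best token, the loop never revisits an earlier choice; beam search would be the usual alternative when that turns out to hurt translation quality.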

In [91]:
_, ru = map(list, zip(*test))

In [92]:
original_text = []
for _, trg in test_dataloader:
    original_text += [en_tokenizer.detokenize(tokens) for tokens in trg.T.detach().cpu().numpy()]

In [93]:
generated_text = []
for text in tqdm(ru):
    generated_text.append(translate(trainer.model, text))

100%|██████████| 2500/2500 [02:34<00:00, 16.13it/s]


In [94]:
test[10:15]

[['The modern rooms are air conditioned and have a balcony. They offer a flat-screen TV and a fridge.',
  'Современные номера с балконом оснащены кондиционером, телевизором с плоским экраном и холодильником.'],
 ['Vivanta by Taj is located in Chennai’s commercial district, 200 metres from Spencer’s Plaza Mall.',
  'Отель Vivanta by Taj находится в коммерческом районе города Ченнай, в 200 метрах от торгового центра Spencer’s Plaza.'],
 ['Towels and bed linen are available.',
  'Предоставляются полотенца и постельное белье.'],
 ['This holiday home is 11 km from Santorini (Thira) Airport.',
  'Расстояние до аэропорта Санторини составляет 11 км.'],
 ['The functional rooms here will provide you with cable TV and air conditioning.',
  'Номера отеля отличаются практичным оформлением и имеют телевизор с кабельными каналами и кондиционер.']]

In [95]:
original_text[10:15]

['the modern rooms are air conditioned and have a balcony . they offer a flat - screen tv and a fridge .',
 'by is located in chennai ’ s commercial district , 200 metres from ’ s plaza mall .',
 'towels and bed linen are available .',
 'this holiday home is 11 km from santorini ( ) airport .',
 'the functional rooms here will provide you with cable tv and air conditioning .']

In [96]:
generated_text[10:15]

['modern air - conditioned rooms feature a flat - screen tv and a refrigerator .',
 'by is located in the commercial centre of , 200 metres from ’ s plaza .',
 'towels and bed linen are offered in this apartment .',
 'santorini airport is 11 km away .',
 'rooms at the hotel are individually decorated and feature a tv with cable channels .']

In [97]:
original_text = [text.split(" ") for text in original_text]
generated_text = [text.split(" ") for text in generated_text]

In [98]:
bleu = corpus_bleu([[text] for text in original_text], generated_text) * 100
print(bleu)

33.42618242434841
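
`corpus_bleu` here is presumably NLTK's, judging by the call signature; the import is outside this section. Corpus-level BLEU pools clipped n-gram matches over all sentences, takes the geometric mean of the 1- to 4-gram precisions, and applies a brevity penalty. A compact pure-Python sketch of that computation, for one reference per sentence and without smoothing, is given below. It is illustrative only and is not guaranteed to reproduce the library's number digit for digit; `simple_corpus_bleu` is a hypothetical name.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(references, hypotheses, max_n=4):
    """Unsmoothed corpus BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Counter intersection clips each n-gram count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


# One reference/hypothesis pair taken from the sample outputs above.
refs = ["towels and bed linen are available .".split(" ")]
hyps = ["towels and bed linen are offered in this apartment .".split(" ")]

print(simple_corpus_bleu(refs, refs))  # 1.0
print(simple_corpus_bleu(refs, hyps))
```

Since the score depends on exact token matches, both sides must be tokenized the same way, which is why the notebook splits the detokenized reference and generated strings on spaces before scoring.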


#### Final result: BLEU ≈ 33.4 on the test set