# Assignment - Transliteration

In this task you are required to transliterate names from English to Russian. Transliterating a string means writing it in the alphabet of another language while preserving its pronunciation as closely as possible, although the pronunciation is not always preserved exactly.


## Instructions

To complete the assignment, please do the following steps (both are required to get full credit):

### 1. Complete this notebook

Upload a filled notebook with code (this file). You will be asked to implement a transformer-based approach for transliteration.

You should implement your ``train`` and ``classify`` functions in this notebook in the cells below. Your model should be implemented as a dedicated class or function in this notebook (if you add any external dependencies, make sure that everything is imported correctly and that the results are reproducible).


### 2. Submit solution to the shared task

After implementing the model architecture, you are asked to participate in the [competition](https://competitions.codalab.org/competitions/30932) and solve the **Transliteration** task using your implemented code.

You should use your code from the previous part to train, validate, and generate predictions for the public (Practice) and private (Evaluation) test sets. It will produce predictions (`preds_translit.tsv`) for the dataset and score them if the true answers are present. You can use these scores to evaluate your model on the dev set and choose the best one. Be sure to download the [dataset](https://github.com/skoltech-nlp/filimdb_evaluation/blob/master/TRANSLIT.tar.gz) with the `wget` command and unpack it, running both commands from notebook cells.

Upload the resulting TSV file with your best predictions (``preds_translit.tsv``), packed in a ``.zip`` archive, to both phases of the competition.
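
Packing the predictions can also be done from a notebook cell. The snippet below is only a sketch: the helper name `zip_predictions` is hypothetical, and placing the TSV file at the root of the archive is an assumption about what the competition expects, so check it against the submission instructions on Codalab.

```python
import zipfile
from pathlib import Path


def zip_predictions(tsv_path, zip_path):
    """Pack a single predictions file into a .zip archive (hypothetical helper)."""
    tsv_path = Path(tsv_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname drops the directory part, so the file sits at the archive root
        # (assumption: this is the layout the competition expects)
        zf.write(tsv_path, arcname=tsv_path.name)
    return zip_path
```

For example, `zip_predictions("preds_translit.tsv", "preds_translit.zip")` would create the archive next to the notebook once the predictions file exists.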


**Important: You must indicate "DL4NLP-23" as your team name in Codalab. Without it your submission will be invalid!**


## Basic algorithm

The basic algorithm is based on the following idea: alphabetic n-grams of the source language are mapped to n-grams of the same size in the target language, using the most frequent transformation rule found in the training sample statistics.
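
To make the idea concrete, here is a minimal sketch on toy data. It is not the baseline itself: the function name `learn_rules` and the three training pairs are hypothetical and chosen only so that the counts are easy to follow by hand.

```python
from collections import Counter, defaultdict


def learn_rules(pairs, n=1):
    """For every source n-gram, count the target n-grams it co-occurs with
    and keep the most frequent one (toy illustration of the idea)."""
    counts = defaultdict(Counter)
    for src, dst in pairs:
        src_ngrams = [src[i:i + n] for i in range(len(src) - n + 1)]
        dst_ngrams = [dst[i:i + n] for i in range(len(dst) - n + 1)]
        for s in src_ngrams:
            for d in dst_ngrams:
                counts[s][d] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


# Hypothetical toy pairs, not taken from the TRANSLIT dataset
toy_pairs = [("dd", "дд"), ("di", "ди"), ("ii", "ии")]
rules = learn_rules(toy_pairs, n=1)

# Apply the learnt rules character by character; unseen characters become '?'
transliterated = "".join(rules.get(ch, "?") for ch in "did")
print(rules, transliterated)
```

Here `d` co-occurs with `д` five times and with `и` once, so the rule `d -> д` wins; the same reasoning gives `i -> и`.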

To test the implementation, download the data, unzip the datasets, predict transliteration and run the evaluation script. To do this, you need to run the following commands:

In [None]:
!wget https://github.com/s-nlp/filimdb_evaluation/raw/master/TRANSLIT.tar.gz

--2023-04-11 19:13:52--  https://github.com/s-nlp/filimdb_evaluation/raw/master/TRANSLIT.tar.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/s-nlp/filimdb_evaluation/master/TRANSLIT.tar.gz [following]
--2023-04-11 19:13:53--  https://raw.githubusercontent.com/s-nlp/filimdb_evaluation/master/TRANSLIT.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1546458 (1.5M) [application/octet-stream]
Saving to: ‘TRANSLIT.tar.gz’


2023-04-11 19:13:53 (43.2 MB/s) - ‘TRANSLIT.tar.gz’ saved [1546458/1546458]



In [None]:
!gunzip TRANSLIT.tar.gz

In [None]:
!tar -xf TRANSLIT.tar

### Baseline code

In [None]:
from typing import List, Any
from random import random
import collections as col

def baseline_train(
        train_source_strings: List[str],
        train_target_strings: List[str]) -> Any:
    """
    Trains transliteration model on the given train set represented as
    parallel list of input strings and their transliteration via labels.
    :param train_source_strings: a list of strings, one str per example
    :param train_target_strings: a list of strings, one str per example
    :return: learnt parameters, or any object you like (it will be passed to the classify function)
    """

    ngram_lvl = 3
    def obtain_train_dicts(train_source_strings, train_target_strings,
                            ngram_lvl):
        ngrams_dict = col.defaultdict(lambda: col.defaultdict(int))
        for src_str,dst_str in zip(train_source_strings,
                                        train_target_strings):
            try:
                src_ngrams = [src_str[i:i+ngram_lvl] for i in
                                range(len(src_str)-ngram_lvl+1)]
                dst_ngrams = [dst_str[i:i+ngram_lvl] for i in
                                range(len(dst_str)-ngram_lvl+1)]
            except TypeError as e:
                print(src_ngrams, dst_ngrams)
                print(e)
                raise StopIteration
            for src_ngram in src_ngrams:
                for dst_ngram in dst_ngrams:
                    ngrams_dict[src_ngram][dst_ngram] += 1
        return ngrams_dict
        
    ngrams_dict = col.defaultdict(lambda: col.defaultdict(int))
    for nl in range(1, ngram_lvl+1):
        ngrams_dict.update(
            obtain_train_dicts(train_source_strings,
                            train_target_strings, nl))
    return ngrams_dict 


def baseline_classify(strings: List[str], params: Any) -> List[str]:
    """
    Classify strings given previously learnt parameters.
    :param strings: strings to classify
    :param params: parameters received from train function
    :return: list of lists of predicted transliterated strings
      (for each source string -> [top_1 prediction, .., top_k prediction]
        if it is possible to generate more than one, otherwise
        -> [prediction])
        corresponding to the given list of strings
    """
       
    def predict_one_sample(sample, train_dict, ngram_lvl=1):
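        # Split the sample into consecutive non-overlapping n-grams;
        # a shorter tail is kept if the length is not divisible by ngram_lvl
        full_len = len(sample) // ngram_lvl * ngram_lvl
        ngrams = [sample[i:i+ngram_lvl] for i in range(0, full_len, ngram_lvl)]
        if len(sample) % ngram_lvl != 0:
            ngrams.append(sample[-(len(sample) % ngram_lvl):])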
        prediction = ''
        for ngram in ngrams:
            ngram_dict = train_dict[ngram]
            if len(ngram_dict.keys()) == 0:
                prediction += '?'*len(ngram)
            else:
                prediction += max(ngram_dict, key=lambda k: ngram_dict[k])
        return prediction 
    
    ngram_lvl = 3
    predictions = []
    ngrams_dict = params
    for string in strings:
        top_1_pred = predict_one_sample(string, ngrams_dict,
                                                ngram_lvl)
        predictions.append([top_1_pred])
    return predictions

### Evaluation code

In [None]:
PREDS_FNAME = "preds_translit_baseline.tsv"
SCORED_PARTS = ('train', 'dev', 'train_small', 'dev_small', 'test')
TRANSLIT_PATH = "TRANSLIT"

In [None]:
import codecs
import os  # used by load_dataset and load_transliterations_only below
from pandas import read_csv

def load_dataset(data_dir_path=None, parts: List[str] = SCORED_PARTS):
    part2ixy = {}
    for part in parts:
        path = os.path.join(data_dir_path, f'{part}.tsv')
        with open(path, 'r', encoding='utf-8') as rf:
            # first line is a header of the corresponding columns
            lines = rf.readlines()[1:]
            col_count = len(lines[0].strip('\n').split('\t'))
            if col_count == 2:
                strings, transliterations = zip(
                    *list(map(lambda l: l.strip('\n').split('\t'), lines))
                )
            elif col_count == 1:
                strings = list(map(lambda l: l.strip('\n'), lines))
                transliterations = None
            else:
                raise ValueError("wrong amount of columns")
        part2ixy[part] = (
            [f'{part}/{i}' for i in range(len(strings))],
            strings, transliterations,
        )
    return part2ixy


def load_transliterations_only(data_dir_path=None, parts: List[str] = SCORED_PARTS):
    part2iy = {}
    for part in parts:
        path = os.path.join(data_dir_path, f'{part}.tsv')
        with open(path, 'r', encoding='utf-8') as rf:
            # first line is a header of the corresponding columns
            lines = rf.readlines()[1:]
            col_count = len(lines[0].strip('\n').split('\t'))
            n_lines = len(lines)
            if col_count == 2:
                transliterations = [l.strip('\n').split('\t')[1] for l in lines]
            elif col_count == 1:
                transliterations = None
            else:
                raise ValueError("Wrong amount of columns")
        part2iy[part] = (
            [f'{part}/{i}' for i in range(n_lines)],
            transliterations,
        )
    return part2iy


def save_preds(preds, preds_fname):
    """
    Save classifier predictions in format appropriate for scoring.
    """
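    # Write explicitly in UTF-8 so that Cyrillic predictions do not depend on the locale
    with open(preds_fname, 'w', encoding='utf-8') as outp:
        for idx, top_k_preds in preds:
            print(idx, *top_k_preds, sep='\t', file=outp)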
    print('Predictions saved to %s' % preds_fname)


def load_preds(preds_fname, top_k=1):
    """
    Load classifier predictions in format appropriate for scoring.
    """
    kwargs = {
        "filepath_or_buffer": preds_fname,
        "names": ["id", "pred"],
        "sep": '\t',
    }

    pred_ids = list(read_csv(**kwargs, usecols=["id"])["id"])

    pred_y = {
        pred_id: [y]
        for pred_id, y in zip(
            pred_ids, read_csv(**kwargs, usecols=["pred"])["pred"]
        )
    }

    for y in pred_y.values():
        assert len(y) == top_k

    return pred_ids, pred_y


def compute_hit_k(preds, k=10):
    raise NotImplementedError


def compute_mrr(preds):
    raise NotImplementedError


def compute_acc_1(preds, true):
    right_answers = 0
    bonus = 0
    for pred, y in zip(preds, true):
        if pred[0] == y:
            right_answers += 1
        elif pred[0] != pred[0] and y == 'нань':
            print('Your test file contained empty string, skipping %f and %s' % (pred[0], y))
            bonus += 1 # bugfix: skip empty line in test
    return right_answers / (len(preds) - bonus)


def score(preds, true):
    assert len(preds) == len(true), 'inconsistent amount of predictions and ground truth answers'
    acc_1 = compute_acc_1(preds, true)
    return {'acc@1': acc_1}


def score_preds(preds_path, data_dir, parts=SCORED_PARTS):
    part2iy = load_transliterations_only(data_dir, parts=parts)
    pred_ids, pred_dict = load_preds(preds_path)
    # pred_dict = {i:y for i,y in zip(pred_ids, pred_y)}
    scores = {}
    for part, (true_ids, true_y) in part2iy.items():
        if true_y is None:
            print('no labels for %s set' % part)
            continue
        pred_y = [pred_dict[i] for i in true_ids]
        score_values = score(pred_y, true_y)
        acc_1 = score_values['acc@1']
        print('%s set accuracy@1: %.2f' % (part, acc_1))
        scores[part] = score_values 
    return scores

### Train and predict results

In [None]:
from time import time
import numpy as np
import os


def train_and_predict(translit_path, scored_parts):
    top_k = 1
    part2ixy = load_dataset(translit_path, parts=scored_parts)
    train_ids, train_strings, train_transliterations = part2ixy['train']
    print('\nTraining classifier on %d examples from train set ...' % len(train_strings))
    st = time()
    params = baseline_train(train_strings, train_transliterations)
    print('Classifier trained in %.2fs' % (time() - st))

    allpreds = []
    for part, (ids, x, y) in part2ixy.items():
        print('\nClassifying %s set with %d examples ...' % (part, len(x)))
        st = time()
        preds = baseline_classify(x, params)
        print('%s set classified in %.2fs' % (part, time() - st))
        count_of_values = list(map(len, preds))
        assert np.all(np.array(count_of_values) == top_k)
        #score(preds, y)
        allpreds.extend(zip(ids, preds))

    save_preds(allpreds, preds_fname=PREDS_FNAME)
    print('\nChecking saved predictions ...')
    return score_preds(preds_path=PREDS_FNAME, data_dir=translit_path, parts=scored_parts)

In [None]:
train_and_predict(TRANSLIT_PATH, SCORED_PARTS)


Training classifier on 105371 examples from train set ...
Classifier trained in 3.68s

Classifying train set with 105371 examples ...
train set classified in 27.17s

Classifying dev set with 26342 examples ...
dev set classified in 7.33s

Classifying train_small set with 2000 examples ...
train_small set classified in 0.51s

Classifying dev_small set with 2000 examples ...
dev_small set classified in 0.50s

Classifying test set with 32926 examples ...
test set classified in 8.39s
Predictions saved to preds_translit_baseline.tsv

Checking saved predictions ...
train set accuracy@1: 0.33
dev set accuracy@1: 0.31
train_small set accuracy@1: 0.34
dev_small set accuracy@1: 0.32
no labels for test set


{'train': {'acc@1': 0.32907536229133255},
 'dev': {'acc@1': 0.3112899552046162},
 'train_small': {'acc@1': 0.3365},
 'dev_small': {'acc@1': 0.323}}

## Transformer-based approach


To implement your algorithm, use the template code below, which needs to be modified.

First, you need to fill in some details in the code of the Transformer architecture and implement the methods of the `LrScheduler` class, which is responsible for updating the learning rate during training.
Next, you need to select the hyperparameters for the model according to the proposed guide.

In [None]:
!pip install Levenshtein

In [None]:
import collections as col
import copy
import datetime
import itertools as it
import json
import os
import random
import time

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as torch_data
from tqdm import tqdm

import Levenshtein as le

### Load dataset and embeddings

In [None]:
def load_datasets(data_dir_path, parts):
    datasets = {}
    for part in parts:
        path = os.path.join(data_dir_path, f'{part}.tsv')
        datasets[part] = pd.read_csv(path, sep='\t', na_filter=False)
        print(f'Loaded {part} dataset, length: {len(datasets[part])}')
    return datasets

In [None]:
class TextEncoder:
    def __init__(self, load_dir_path=None):
        self.lang_keys = ['en', 'ru']
        self.directions = ['id2token', 'token2id']
        self.service_token_names = {
            'pad_token': '<pad>',
            'start_token': '<start>',
            'unk_token': '<unk>',
            'end_token': '<end>'
        }
        service_id2token = dict(enumerate(self.service_token_names.values()))
        service_token2id ={v:k for k,v in service_id2token.items()}
        self.service_vocabs = dict(zip(self.directions,
                                       [service_id2token, service_token2id]))
        if load_dir_path is None:
            self.vocabs = {}
            for lk in self.lang_keys:
                self.vocabs[lk] = copy.deepcopy(self.service_vocabs)
        else:
            self.vocabs = self.load_vocabs(load_dir_path)
    def load_vocabs(self, load_dir_path):
        vocabs = {}
        load_path = os.path.join(load_dir_path, 'vocabs')
        for lk in self.lang_keys:
            vocabs[lk] = {}
            for d in self.directions:
                columns = d.split('2')
                print(lk, d)
                df = pd.read_csv(os.path.join(load_path, f'{lk}_{d}'))
                vocabs[lk][d] = dict(zip(*[df[c] for c in columns]))
        return vocabs
    
    def save_vocabs(self, save_dir_path):
        save_path = os.path.join(save_dir_path, 'vocabs')
        os.makedirs(save_path, exist_ok=True)
        for lk in self.lang_keys:
            for d in self.directions:
                columns = d.split('2')
                pd.DataFrame(data=self.vocabs[lk][d].items(),
                    columns=columns).to_csv(os.path.join(save_path, f'{lk}_{d}'),
                                                index=False,
                                                sep=',')
    def make_vocabs(self, data_df):
        for lk in self.lang_keys:
            tokens = col.Counter(''.join(list(it.chain(*data_df[lk])))).keys()
            part_id2t = dict(enumerate(tokens, start=len(self.service_token_names)))
            part_t2id = {k:v for v,k in part_id2t.items()}
            part_vocabs = [part_id2t, part_t2id]
            for i in range(len(self.directions)):
                self.vocabs[lk][self.directions[i]].update(part_vocabs[i])
                
        self.src_vocab_size = len(self.vocabs['en']['id2token'])
        self.tgt_vocab_size = len(self.vocabs['ru']['id2token'])
                
    def frame(self, sample, start_token=None, end_token=None):
        if start_token is None:
            start_token=self.service_token_names['start_token']
        if end_token is None:
            end_token=self.service_token_names['end_token']
        return [start_token] + sample + [end_token]
    def token2id(self, samples, frame, lang_key):
        if frame:
            samples = list(map(self.frame, samples))
        vocab = self.vocabs[lang_key]['token2id']
        return list(map(lambda s:
                        [vocab[t] if t in vocab.keys() else vocab[self.service_token_names['unk_token']]
                         for t in s], samples))
    
    def unframe(self, sample, start_token=None, end_token=None):
        if start_token is None:
            start_token=self.service_vocabs['token2id'][self.service_token_names['start_token']]
        if end_token is None:
            end_token=self.service_vocabs['token2id'][self.service_token_names['end_token']]
        pad_token=self.service_vocabs['token2id'][self.service_token_names['pad_token']]
        return list(it.takewhile(lambda e: e != end_token and e != pad_token, sample[1:]))
    def id2token(self, samples, unframe, lang_key):
        if unframe:
            samples = list(map(self.unframe, samples))
        vocab = self.vocabs[lang_key]['id2token']
        return list(map(lambda s:
                        [vocab[idx] if idx in vocab.keys() else self.service_token_names['unk_token'] for idx in s], samples))


class TranslitData(torch_data.Dataset):
    def __init__(self, source_strings, target_strings,
                text_encoder):
        super(TranslitData, self).__init__()
        self.source_strings = source_strings
        self.text_encoder = text_encoder
        if target_strings is not None:
            assert len(source_strings) == len(target_strings)
            self.target_strings = target_strings
        else:
            self.target_strings = None
    def __len__(self):
        return len(self.source_strings)
    def __getitem__(self, idx):
        src_str = self.source_strings[idx]
        encoder_input = self.text_encoder.token2id([list(src_str)], frame=True, lang_key='en')[0]
        if self.target_strings is not None:
            tgt_str = self.target_strings[idx]
            tmp = self.text_encoder.token2id([list(tgt_str)], frame=True, lang_key='ru')[0]
            decoder_input = tmp[:-1]
            decoder_target = tmp[1:]
            return (encoder_input, decoder_input, decoder_target)
        else:
            return (encoder_input,)


class BatchSampler(torch_data.BatchSampler):
    def __init__(self, sampler, batch_size, drop_last, shuffle_each_epoch):
        super(BatchSampler, self).__init__(sampler, batch_size, drop_last)
        self.batches = []
        for b in super(BatchSampler, self).__iter__():
            self.batches.append(b)
        self.shuffle_each_epoch = shuffle_each_epoch
        if self.shuffle_each_epoch:
            random.shuffle(self.batches)
        self.index = 0
        #print(f'Batches collected: {len(self.batches)}')
    def __iter__(self):
        self.index = 0
        return self
    def __next__(self):
        if self.index == len(self.batches):
            if self.shuffle_each_epoch:
                random.shuffle(self.batches)
            raise StopIteration
        else:
            batch = self.batches[self.index]
            self.index += 1
            return batch

def collate_fn(batch_list):
    '''batch_list can store either 3 components:
        encoder_inputs, decoder_inputs, decoder_targets
        or single component: encoder_inputs'''
    components = list(zip(*batch_list))
    batch_tensors = []
    for data in components:
        max_len = max([len(sample) for sample in data])
        #print(f'Maximum length in batch = {max_len}')
        sample_tensors = [torch.tensor(s, requires_grad=False, dtype=torch.int64)
                         for s in data]
        batch_tensors.append(nn.utils.rnn.pad_sequence(
            sample_tensors,
            batch_first=True, padding_value=0))
    return tuple(batch_tensors) 


def create_dataloader(source_strings, target_strings,
                      text_encoder, batch_size,
                      shuffle_batches_each_epoch):
    '''target_strings parameter can be None'''
    dataset = TranslitData(source_strings, target_strings,
                                text_encoder=text_encoder)
    seq_sampler = torch_data.SequentialSampler(dataset)
    batch_sampler = BatchSampler(seq_sampler, batch_size=batch_size,
                                drop_last=False,
                                shuffle_each_epoch=shuffle_batches_each_epoch)
    dataloader = torch_data.DataLoader(dataset,
                                       batch_sampler=batch_sampler,
                                       collate_fn=collate_fn)
    return dataloader

### Metric function

In [None]:
def compute_metrics(predicted_strings, target_strings, metrics):
    metric_values = {}
    for m in metrics:
        if m == 'acc@1':
            # Compare pairwise so that both lists and arrays of strings are supported
            metric_values[m] = sum(p == t for p, t in zip(predicted_strings, target_strings)) / len(target_strings)
        elif m =='mean_ld@1':
            metric_values[m] =\
                np.mean(list(map(lambda e: le.distance(*e), zip(predicted_strings, target_strings))))
        else: 
            raise ValueError(f'Unknown metric: {m}')
    return metric_values

### Positional Encoding

As you may remember, the Transformer treats its input as a sequence of elements, like a time series. Because the Encoder processes the entire input sequence simultaneously, the model has no other way of knowing the position of each element, so this information must be encoded in the element's embedding. This is the role of the PositionalEncoding layer, which adds to each embedding a position-dependent vector of the same dimension.
Let $PE$ denote the matrix of these vectors, with one row per position in the sequence. Its elements are:

$$ PE_{(pos,2i)} = \sin{(pos/10000^{2i/d_{model}})}$$
$$ PE_{(pos,2i+1)} = \cos{(pos/10000^{2i/d_{model}})}$$

where $pos$ is the position, $i$ indexes the components of the corresponding vector, and $d_{model}$ is the dimension of each vector. Thus, even components hold sine values and odd components hold cosine values, with a different argument for each $i$.
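
As a sanity check of the formulas, the values can be computed by hand for a small case. The following is a sketch in plain Python rather than the PyTorch layer required by the task, and the function name `positional_encoding_table` is hypothetical.

```python
import math


def positional_encoding_table(max_len, d_model):
    """Return PE as a list of rows, computed directly from the two formulas."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(d_model):
            two_i = k - (k % 2)  # components 2i and 2i+1 share the exponent 2i
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe


pe_table = positional_encoding_table(max_len=3, d_model=4)
for row in pe_table:
    print([round(v, 4) for v in row])
```

For `max_len=3` and `d_model=4`, the row for $pos = 1$ is $(\sin 1, \cos 1, \sin 0.01, \cos 0.01)$, because $10000^{2/4} = 100$.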

In this task you are required to implement these formulas inside the constructor of the *PositionalEncoding* class in the main file ``translit.py``, which you are to upload. To run the test, use the following function:

`test_positional_encoding()`

Make sure that no `AssertionError` is raised!


In [None]:
import math

class Embedding(nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super(Embedding, self).__init__()
        self.emb_layer = nn.Embedding(vocab_size, hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x):
        return self.emb_layer(x)

class PositionalEncoding(nn.Module):
    def __init__(self, hidden_size, max_len=512):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, hidden_size, requires_grad=False)
        pos = torch.arange(0, max_len).unsqueeze(1).expand(max_len, hidden_size)
        ten_pow = torch.exp(math.log(10 ** 4) *  torch.arange(0, hidden_size, 2) / hidden_size)
        pe[:,0::2] = torch.sin(pos[:,0::2] / ten_pow)
        pe[:,1::2] = torch.cos(pos[:,1::2] / ten_pow)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: shape (batch size, sequence length, hidden size)
        x = x + self.pe[:, :x.size(1)]
        return x

In [None]:
def test_positional_encoding():
    pe = PositionalEncoding(max_len=3, hidden_size=4)
    res_1 = torch.tensor([[[ 0.0000,  1.0000,  0.0000,  1.0000],
                           [ 0.8415,  0.5403,  0.0100,  0.9999],
                           [ 0.9093, -0.4161,  0.0200,  0.9998]]])
    # print(pe.pe - res_1)
    assert torch.all(torch.abs(pe.pe - res_1) < 1e-4).item()
    print('Test is passed!')

In [None]:
test_positional_encoding()

Test is passed!


### LayerNorm

In [None]:
class LayerNorm(nn.Module):
    "Layer Normalization layer"

    def __init__(self, hidden_size, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.gain = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gain * (x - mean) / (std + self.eps) + self.bias

### SublayerConnection

In [None]:
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer normalization.
    """

    def __init__(self, hidden_size, dropout):
        super(SublayerConnection, self).__init__()
        self.layer_norm = LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        return self.layer_norm(x + self.dropout(sublayer(x)))

def padding_mask(x, pad_idx=0):
    assert len(x.size()) >= 2
    return (x != pad_idx).unsqueeze(-2)

def look_ahead_mask(size):
    "Mask out the right context"
    attn_shape = (1, size, size)
    look_ahead_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(look_ahead_mask) == 0

def compositional_mask(x, pad_idx=0):
    pm = padding_mask(x, pad_idx=pad_idx)
    seq_length = x.size(-1)
    result_mask = pm & \
                  look_ahead_mask(seq_length).type_as(pm.data)
    return result_mask

### FeedForward

In [None]:
class FeedForward(nn.Module):
    def __init__(self, hidden_size, ff_hidden_size, dropout=0.1):
        super(FeedForward, self).__init__()
        self.pre_linear = nn.Linear(hidden_size, ff_hidden_size)
        self.post_linear = nn.Linear(ff_hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.post_linear(self.dropout(F.relu(self.pre_linear(x))))

def clone_layer(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

### MultiHeadAttention


Next, you are required to implement the `attention` method in the `MultiHeadAttention` class. The MultiHeadAttention layer takes as input the query, key, and value vectors for each step of the sequence, stacked into the matrices $Q$, $K$, $V$ respectively. Each query, key, and value vector is obtained from the output of the previous layer by a linear projection with one of three trainable parameter matrices. These semantics can be written as the following formulas:
$$
Attention(Q, K, V)=softmax\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V\\
$$

$$
MultiHead(Q, K, V) = Concat\left(head_1, ... , head_h\right) W^O\\
$$

$$
head_i=Attention\left(Q W_i^Q, K W_i^K, V W_i^V\right)\\
$$
Here $h$ is the number of attention heads, that is, parallel sub-layers that apply Scaled Dot-Product Attention to vectors of smaller dimension ($d_{k} = d_{q} = d_{v} = d_{model} / h$).
The logic of `MultiHeadAttention` is presented in the picture (from the original [paper](https://arxiv.org/abs/1706.03762)):

![](https://lilianweng.github.io/lil-log/assets/images/transformer.png)


Inside the `attention` method you are required to use the dropout layer created in the constructor of the MultiHeadAttention class. The dropout layer is to be applied directly to the attention weights, that is, to the result of the softmax operation. The drop probability can be set during training via `model_config['dropout']['attention']`.

The correctness of the implementation can be checked with
`test_multi_head_attention()`
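
Before writing the batched PyTorch version, it may help to trace the attention formula on a tiny single-head example. The sketch below uses plain Python lists, omits dropout, and uses hypothetical toy matrices; `masked_attention` is an illustrative name, not part of the template.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(Q, K, V, mask):
    """Single-head scaled dot-product attention on nested lists.
    mask[i][j] is True where query i is allowed to attend to key j."""
    d_k = len(Q[0])
    weights = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Masked positions get a large negative score before the softmax
        scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    output = [[sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
              for w in weights]
    return output, weights


# Hypothetical toy inputs: 3 positions, head size 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
causal_mask = [[j <= i for j in range(3)] for i in range(3)]  # look-ahead mask

attn_output, attn_weights = masked_attention(Q, K, V, causal_mask)
print(attn_weights[0], attn_output[0])
```

With the look-ahead mask, the first position can attend only to itself, so its weights are `[1.0, 0.0, 0.0]` and its output equals the first row of `V`.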



In [None]:
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, hidden_size, dropout=None):
        super(MultiHeadAttention, self).__init__()
        assert hidden_size % n_heads == 0
        self.head_hidden_size = hidden_size // n_heads
        self.n_heads = n_heads
        self.linears = clone_layer(nn.Linear(hidden_size, hidden_size), 4)
        self.attn_weights = None
        self.dropout = dropout
        if self.dropout is not None:
            self.dropout_layer = nn.Dropout(p=self.dropout)

    def attention(self, query, key, value, mask):
        """Compute 'Scaled Dot Product Attention'
            query, key and value tensors have the same shape:
                (batch size, number of heads, sequence length, head hidden size)
            mask shape: (batch size, 1, sequence length, sequence length)
                '1' dimension value will be broadcasted to number of heads inside your operations
            mask should be applied before using softmax to get attn_weights
        """
        ## attn_weights shape: (batch size, number of heads, sequence length, sequence length)
        ## output shape: (batch size, number of heads, sequence length, head hidden size)
        ## TODO: provide your implementation here
        d = self.head_hidden_size
        scores = query @ key.transpose(-2, -1) / math.sqrt(d)
        if mask is not None:
            scores.masked_fill_(mask == 0, -1e9)
        attn_weights = F.softmax(scores, dim=-1)
        # Dropout is applied directly to attn_weights if self.dropout is not None
        if self.dropout is not None:
            attn_weights = self.dropout_layer(attn_weights)
        output = attn_weights @ value
        return output, attn_weights

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        batch_size = query.size(0)

        # Split vectors for different attention heads (from hidden_size => n_heads x head_hidden_size)
        # and do separate linear projection, for separate trainable weights
        query, key, value = \
            [l(x).view(batch_size, -1, self.n_heads, self.head_hidden_size).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

        x, self.attn_weights = self.attention(query, key, value, mask=mask)
        # x shape: (batch size, number of heads, sequence length, head hidden size)
        # self.attn_weights shape: (batch size, number of heads, sequence length, sequence length)

        # Concatenate the output of each head
        x = x.transpose(1, 2).contiguous() \
            .view(batch_size, -1, self.n_heads * self.head_hidden_size)

        return self.linears[-1](x)

In [None]:
def test_multi_head_attention():
    mha = MultiHeadAttention(n_heads=1, hidden_size=5, dropout=None)
    # batch_size == 2, sequence length == 3, hidden_size == 5
    # query = torch.arange(150).reshape(2, 3, 5)
    query = torch.tensor([[[[ 0.64144618, -0.95817388,  0.37432297,  0.58427106,
          -0.94668716]],
        [[-0.23199289,  0.66329209, -0.46507035, -0.54272512,
          -0.98640698]],
        [[ 0.07546638, -0.09277002,  0.20107185, -0.97407381,
          -0.27713414]]],
       [[[ 0.14727783,  0.4747886 ,  0.44992016, -0.2841419 ,
          -0.81820319]],
        [[-0.72324994,  0.80643179, -0.47655449,  0.45627872,
           0.60942404]],
        [[ 0.61712569, -0.62947282, -0.95215713, -0.38721959,
          -0.73289725]]]])
    key = torch.tensor([[[[-0.81759856, -0.60049991, -0.05923424,  0.51898901,
          -0.3366209 ]],
        [[ 0.83957818, -0.96361722,  0.62285191,  0.93452467,
           0.51219613]],
        [[-0.72758847,  0.41256154,  0.00490795,  0.59892503,
          -0.07202049]]],
       [[[ 0.72315339, -0.49896314,  0.94254637, -0.54356006,
          -0.04837949]],
        [[ 0.51759322, -0.43927061, -0.59924184,  0.92241702,
          -0.86811696]],
        [[-0.54322046, -0.92323003, -0.827746  ,  0.90842783,
           0.88428119]]]])
    value = torch.tensor([[[[-0.83895431,  0.805027  ,  0.22298283, -0.84849915,
          -0.34906026]],
        [[-0.02899652, -0.17456128, -0.17535998, -0.73160314,
          -0.13468061]],
        [[ 0.75234265,  0.02675947,  0.84766286, -0.5475651 ,
          -0.83319316]]],
       [[[-0.47834413,  0.34464645, -0.41921457,  0.33867964,
           0.43470836]],
        [[-0.99000979,  0.10220893, -0.4932273 ,  0.95938905,
           0.01927012]],
        [[ 0.91607137,  0.57395644, -0.90914179,  0.97212912,
           0.33078759]]]])
    query = query.float().transpose(1,2)
    key = key.float().transpose(1,2)
    value = value.float().transpose(1,2)

    x,_ = torch.max(query[:,0,:,:], axis=-1)
    mask = compositional_mask(x)
    mask.unsqueeze_(1)
    for n,t in [('query', query), ('key', key), ('value', value), ('mask', mask)]:
        print(f'Name: {n}, shape: {t.size()}')
    with torch.no_grad():
        output, attn_weights = mha.attention(query, key, value, mask=mask)
    assert output.size() == torch.Size([2,1,3,5])
    assert attn_weights.size() == torch.Size([2,1,3,3])

    truth_output = torch.tensor([[[[-0.8390,  0.8050,  0.2230, -0.8485, -0.3491],
          [-0.6043,  0.5212,  0.1076, -0.8146, -0.2870],
          [-0.0665,  0.2461,  0.3038, -0.7137, -0.4410]]],
        [[[-0.4783,  0.3446, -0.4192,  0.3387,  0.4347],
          [-0.7959,  0.1942, -0.4652,  0.7239,  0.1769],
          [-0.3678,  0.2868, -0.5799,  0.7987,  0.2086]]]])
    truth_attn_weights = torch.tensor([[[[1.0000, 0.0000, 0.0000],
          [0.7103, 0.2897, 0.0000],
          [0.3621, 0.3105, 0.3274]]],
        [[[1.0000, 0.0000, 0.0000],
          [0.3793, 0.6207, 0.0000],
          [0.2642, 0.4803, 0.2555]]]])
    # print(torch.abs(output - truth_output))
    # print(torch.abs(attn_weights - truth_attn_weights))
    assert torch.all(torch.abs(output - truth_output) < 1e-4).item()
    assert torch.all(torch.abs(attn_weights - truth_attn_weights) < 1e-4).item()
    print('Test is passed!')

In [None]:
test_multi_head_attention()

Name: query, shape: torch.Size([2, 1, 3, 5])
Name: key, shape: torch.Size([2, 1, 3, 5])
Name: value, shape: torch.Size([2, 1, 3, 5])
Name: mask, shape: torch.Size([2, 1, 3, 3])
Test is passed!


### Encoder

In [None]:
class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"

    def __init__(self, hidden_size, ff_hidden_size, n_heads, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(n_heads, hidden_size,
                                            dropout=dropout['attention'])
        self.feed_forward = FeedForward(hidden_size, ff_hidden_size,
                                        dropout=dropout['relu'])
        self.sublayers = clone_layer(SublayerConnection(hidden_size, dropout['residual']), 2)

    def forward(self, x, mask):
        x = self.sublayers[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayers[1](x, self.feed_forward)

class Encoder(nn.Module):
    def __init__(self, config):
        super(Encoder, self).__init__()
        self.embedder = Embedding(config['hidden_size'],
                                  config['src_vocab_size'])
        self.positional_encoder = PositionalEncoding(config['hidden_size'],
                                                     max_len=config['max_src_seq_length'])
        self.embedding_dropout = nn.Dropout(p=config['dropout']['embedding'])
        self.encoder_layer = EncoderLayer(config['hidden_size'],
                                          config['ff_hidden_size'],
                                          config['n_heads'],
                                          config['dropout'])
        self.layers = clone_layer(self.encoder_layer, config['n_layers'])
        self.layer_norm = LayerNorm(config['hidden_size'])

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        x = self.embedding_dropout(self.positional_encoder(self.embedder(x)))
        for layer in self.layers:
            x = layer(x, mask)
        return self.layer_norm(x)

### Decoder

In [None]:
class DecoderLayer(nn.Module):
    """
    Decoder layer is made of 3 sublayers: self-attention, encoder-decoder attention
    and feed forward.
    """

    def __init__(self, hidden_size, ff_hidden_size, n_heads, dropout):
        super(DecoderLayer, self).__init__()

        self.self_attn = MultiHeadAttention(n_heads, hidden_size,
                                            dropout=dropout['attention'])
        self.encdec_attn = MultiHeadAttention(n_heads, hidden_size,
                                              dropout=dropout['attention'])
        self.feed_forward = FeedForward(hidden_size, ff_hidden_size,
                                        dropout=dropout['relu'])
        self.sublayers = clone_layer(SublayerConnection(hidden_size, dropout['residual']), 3)

    def forward(self, x, encoder_output, encoder_mask, decoder_mask):
        x = self.sublayers[0](x, lambda x: self.self_attn(x, x, x, decoder_mask))
        x = self.sublayers[1](x, lambda x: self.encdec_attn(x, encoder_output,
                                                            encoder_output, encoder_mask))
        return self.sublayers[2](x, self.feed_forward)

class Decoder(nn.Module):
    def __init__(self, config):
        super(Decoder, self).__init__()
        self.embedder = Embedding(config['hidden_size'],
                                  config['tgt_vocab_size'])
        self.positional_encoder = PositionalEncoding(config['hidden_size'],
                                                     max_len=config['max_tgt_seq_length'])
        self.embedding_dropout = nn.Dropout(p=config['dropout']['embedding'])
        self.decoder_layer = DecoderLayer(config['hidden_size'],
                                          config['ff_hidden_size'],
                                          config['n_heads'],
                                          config['dropout'])
        self.layers = clone_layer(self.decoder_layer, config['n_layers'])
        self.layer_norm = LayerNorm(config['hidden_size'])

    def forward(self, x, encoder_output, encoder_mask, decoder_mask):
        x = self.embedding_dropout(self.positional_encoder(self.embedder(x)))
        for layer in self.layers:
            x = layer(x, encoder_output, encoder_mask, decoder_mask)
        return self.layer_norm(x)

### Transformer

In [None]:
class Transformer(nn.Module):
    def __init__(self, config):
        super(Transformer, self).__init__()
        self.config = config
        self.encoder = Encoder(config)
        self.decoder = Decoder(config)
        self.proj = nn.Linear(config['hidden_size'], config['tgt_vocab_size'])

        self.pad_idx = config['pad_idx']
        self.tgt_vocab_size = config['tgt_vocab_size']

    def encode(self, encoder_input, encoder_input_mask):
        return self.encoder(encoder_input, encoder_input_mask)

    def decode(self, encoder_output, encoder_input_mask, decoder_input, decoder_input_mask):
        return self.decoder(decoder_input, encoder_output, encoder_input_mask, decoder_input_mask)

    def linear_project(self, x):
        return self.proj(x)

    def forward(self, encoder_input, decoder_input):
        encoder_input_mask = padding_mask(encoder_input, pad_idx=self.config['pad_idx'])
        decoder_input_mask = compositional_mask(decoder_input, pad_idx=self.config['pad_idx'])
        encoder_output = self.encode(encoder_input, encoder_input_mask)
        decoder_output = self.decode(encoder_output, encoder_input_mask,
                                     decoder_input, decoder_input_mask)
        output_logits = self.linear_project(decoder_output)
        return output_logits


def prepare_model(config):
    model = Transformer(config)

    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

####  LrScheduler

The last thing you have to prepare is the class `LrScheduler`, which is in charge of updating the learning rate after every optimizer step. You are required to fill in the class constructor and the method `learning_rate`. The preferred strategy for updating the learning rate (lr) has two stages:

* "warmup" stage - lr increases linearly up to a defined peak value over a fixed number of steps (a proportion of all training steps, set by the parameter `warmup_steps_part` of `train_config` in the train function).
* "decrease" stage - lr decreases linearly to 0 over the remaining training steps.

A `learning_rate()` call should return the value of lr at the current step, whose number is stored in `self._step`. The class constructor takes not only `warmup_steps_part` but also the peak learning rate value `lr_peak`, reached at the end of the "warmup" stage, and the string name of the learning rate scheduling strategy. You can test other strategies if you want to, using the `self.type` attribute.
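
As a rough illustration of the two stages, the schedule can be written as a standalone function. This is only a sketch: the name `linear_warmup_decay` and its arguments are hypothetical and not part of the notebook, and it assumes `0 < warmup_steps_part < 1` so that neither stage is empty.

```python
def linear_warmup_decay(step, n_steps, warmup_steps_part, lr_peak):
    """Hypothetical helper: lr at `step` for linear warmup followed by linear decay."""
    warmup_steps = warmup_steps_part * n_steps
    if step <= warmup_steps:
        # "warmup" stage: grow linearly from 0 to lr_peak.
        return lr_peak * step / warmup_steps
    # "decrease" stage: fall linearly from lr_peak to 0 over the remaining steps.
    return lr_peak * (n_steps - step) / (n_steps - warmup_steps)


# 100 training steps, 10% of them spent on warmup, peak lr of 3e-4.
for s in (0, 5, 10, 55, 100):
    print(s, linear_warmup_decay(s, n_steps=100, warmup_steps_part=0.1, lr_peak=3e-4))
```

With these settings the value rises to `lr_peak` at step 10, is back at half of the peak by step 55, and reaches 0 at step 100.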

Correctness check: `test_lr_scheduler()`


In [None]:
class LrScheduler:
    def __init__(self, n_steps, **kwargs):
        self.type = kwargs['type']
        if self.type == 'warmup,decay_linear':
            ## TODO: provide your implementation here
            self.lr_peak = kwargs['lr_peak']
            self.warmup_steps = kwargs['warmup_steps_part'] * n_steps
            self.decay_steps = (1 - kwargs['warmup_steps_part']) * n_steps
            self.n_steps = n_steps
        else:
            raise ValueError(f'Unknown type argument: {self.type}')
        self._step = 0
        self._lr = 0

    def step(self, optimizer):
        self._step += 1
        lr = self.learning_rate()
        for p in optimizer.param_groups:
            p['lr'] = lr

    def learning_rate(self, step=None):
        if step is None:
            step = self._step
        if self.type == 'warmup,decay_linear':
            ## TODO: provide your implementation here
            if step <= self.warmup_steps:
                incr = self.lr_peak / self.warmup_steps
                self._lr = incr * step
            else:
                decr = self.lr_peak / self.decay_steps
                self._lr = self.lr_peak - decr * (step - self.warmup_steps)
        return self._lr

    def state_dict(self):
        sd = copy.deepcopy(self.__dict__)
        return sd

    def load_state_dict(self, sd):
        for k in sd.keys():
            self.__setattr__(k, sd[k])

In [None]:
def test_lr_scheduler():
    lrs_type = 'warmup,decay_linear'
    warmup_steps_part =  0.1
    lr_peak = 3e-4
    sch = LrScheduler(100, type=lrs_type, warmup_steps_part=warmup_steps_part,
                      lr_peak=lr_peak)
    # Compare absolute differences, otherwise any value below the target passes.
    assert abs(sch.learning_rate(step=5) - 15e-5) < 1e-6
    assert abs(sch.learning_rate(step=10) - 3e-4) < 1e-6
    assert abs(sch.learning_rate(step=50) - 166e-6) < 1e-6
    assert abs(sch.learning_rate(step=100) - 0.) < 1e-6
    print('Test is passed!')

In [None]:
test_lr_scheduler()

Test is passed!


### Run and translate

In [None]:
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))


def run_epoch(data_iter, model, lr_scheduler, optimizer, device, verbose=False):
    start = time.time()
    local_start = start
    total_tokens = 0
    total_loss = 0
    tokens = 0
    loss_fn = nn.CrossEntropyLoss(reduction='sum')
    for i, batch in tqdm(enumerate(data_iter)):
        encoder_input = batch[0].to(device)
        decoder_input = batch[1].to(device)
        decoder_target = batch[2].to(device)
        logits = model(encoder_input, decoder_input)
        loss = loss_fn(logits.view(-1, model.tgt_vocab_size),
                       decoder_target.view(-1))
        total_loss += loss.item()
        batch_n_tokens = (decoder_target != model.pad_idx).sum().item()
        total_tokens += batch_n_tokens
        if optimizer is not None:
            optimizer.zero_grad()
            lr_scheduler.step(optimizer)
            loss.backward()
            optimizer.step()

        tokens += batch_n_tokens
        if verbose and i % 1000 == 1:
            elapsed = time.time() - local_start
            print("batch number: %d, accumulated average loss: %f, tokens per second: %f" %
                  (i, total_loss / total_tokens, tokens / elapsed))
            local_start = time.time()
            tokens = 0

    average_loss = total_loss / total_tokens
    print('** End of epoch, accumulated average loss = %f **' % average_loss)
    epoch_elapsed_time = format_time(time.time() - start)
    return average_loss


def save_checkpoint(epoch, model, lr_scheduler, optimizer, model_dir_path):
    # model_dir_path acts as a filename prefix, e.g. 'bamblbi_' -> 'bamblbi_epoch_20'
    save_path = model_dir_path + f'epoch_{epoch}'
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'lr_scheduler_state_dict': lr_scheduler.state_dict()
    }, save_path)
    wandb.save(save_path)
    print(f'Saved checkpoint to {save_path}')

def load_model(epoch, model_dir_path):
    # Must match the path built in save_checkpoint (prefix + 'epoch_<n>')
    save_path = model_dir_path + f'epoch_{epoch}'
    checkpoint = torch.load(save_path)
    with open(os.path.join(model_dir_path, 'model_config.json'), 'r', encoding='utf-8') as rf:
        model_config = json.load(rf)
    model = prepare_model(model_config)
    model.load_state_dict(checkpoint['model_state_dict'])
    return model

def greedy_decode(model, device, encoder_input, max_len, start_symbol):
    batch_size = encoder_input.size()[0]
    decoder_input = torch.ones(batch_size, 1).fill_(start_symbol).type_as(encoder_input.data).to(device)

    for i in range(max_len):
        logits = model(encoder_input, decoder_input)

        _, predicted_ids = torch.max(logits, dim=-1)
        next_word = predicted_ids[:, i]
        # print(next_word)
        rest = torch.ones(batch_size, 1).type_as(decoder_input.data)
        # print(rest[:,0].size(), next_word.size())
        rest[:, 0] = next_word
        decoder_input = torch.cat([decoder_input, rest], dim=1).to(device)
        # print(decoder_input)
    return decoder_input

def generate_predictions(dataloader, max_decoding_len, text_encoder, model, device):
    # print(f'Max decoding length = {max_decoding_len}')
    model.eval()
    predictions = []
    start_token_id = text_encoder.service_vocabs['token2id'][
        text_encoder.service_token_names['start_token']]
    with torch.no_grad():
        for batch in tqdm(dataloader):
            encoder_input = batch[0].to(device)
            prediction_tensor = \
                greedy_decode(model, device, encoder_input, max_decoding_len,
                              start_token_id)

            predictions.extend([''.join(e) for e in text_encoder.id2token(prediction_tensor.cpu().numpy(),
                                                                          unframe=True, lang_key='ru')])
    return np.array(predictions)


def train(source_strings, target_strings, tune_params = None):
    '''Common training cycle for final run (fixed hyperparameters,
    no evaluation during training)'''
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f'Using GPU device: {device}')
    else:
        device = torch.device('cpu')
        print(f'GPU is not available, using CPU device {device}')

    train_df = pd.DataFrame({'en': source_strings, 'ru': target_strings})
    text_encoder = TextEncoder()
    text_encoder.make_vocabs(train_df)
    model_config = {
        'src_vocab_size': text_encoder.src_vocab_size,
        'tgt_vocab_size': text_encoder.tgt_vocab_size,
        'max_src_seq_length': max(train_df['en'].aggregate(len)) + 2, #including start_token and end_token
        'max_tgt_seq_length': max(train_df['ru'].aggregate(len)) + 2,
        'n_layers': 2,
        'n_heads': 2,
        'hidden_size': 128,
        'ff_hidden_size': 256,
        'dropout': {
            'embedding': tune_params['embedding'] if tune_params else 0.1,
            'attention': tune_params['attention'] if tune_params else 0.1,
            'residual': tune_params['residual'] if tune_params else 0.1,
            'relu': tune_params['relu'] if tune_params else 0.1
        },
        'pad_idx': 0
    }
    
    wandb.init(project="hw1-NLP", 
           group="Bamblbi",
           config=model_config)
    
    model = prepare_model(model_config)
    model.to(device)

    train_config = {'batch_size': 200, 'n_epochs': 300, 'lr_scheduler': {
        'type': 'warmup,decay_linear',
        'warmup_steps_part': 0.1,
        'lr_peak': tune_params['lr_peak'] if tune_params else 3e-4,
    }}

    #Model training procedure
    optimizer = torch.optim.Adam(model.parameters(), lr=0.)
    n_steps = (len(train_df) // train_config['batch_size'] + 1) * train_config['n_epochs']
    lr_scheduler = LrScheduler(n_steps, **train_config['lr_scheduler'])

    # prepare train data
    source_strings, target_strings = zip(*sorted(zip(source_strings, target_strings),
                                                 key=lambda e: len(e[0])))
    train_dataloader = create_dataloader(source_strings, target_strings, text_encoder,
                                         train_config['batch_size'],
                                         shuffle_batches_each_epoch=True)
    # training cycle
    for epoch in range(1,train_config['n_epochs']+1):
        print('\n' + '-'*40)
        print(f'Epoch: {epoch}')
        print(f'Run training...')
        model.train()
        av_loss = run_epoch(train_dataloader, model,
                          lr_scheduler, optimizer, device=device, verbose=False)
        wandb.log({'loss_train': av_loss})
        if epoch % 20 == 0:
            save_checkpoint(epoch, model, lr_scheduler, optimizer, f'bamblbi_')

    learnable_params = {
        'model': model,
        'text_encoder': text_encoder,
    }
    return learnable_params

def classify(source_strings, learnable_params):
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f'Using GPU device: {device}')
    else:
        device = torch.device('cpu')
        print(f'GPU is not available, using CPU device {device}')

    model = learnable_params['model']
    text_encoder = learnable_params['text_encoder']
    batch_size = 200
    dataloader = create_dataloader(source_strings, None, text_encoder,
                                   batch_size, shuffle_batches_each_epoch=False)
    max_decoding_len = model.config['max_tgt_seq_length']
    predictions = generate_predictions(dataloader, max_decoding_len, text_encoder, model, device)
    #return single top1 prediction for each sample
    return np.expand_dims(predictions, 1)

### Training

In [None]:
PREDS_FNAME = "preds_translit_bl_0.tsv"
SCORED_PARTS = ('train', 'dev', 'train_small', 'dev_small', 'test')
TRANSLIT_PATH = "TRANSLIT"

In [None]:
%%capture
!pip install wandb -qqq

In [None]:
import wandb

In [None]:
top_k = 1
part2ixy = load_dataset(TRANSLIT_PATH, parts=SCORED_PARTS)
train_ids, train_strings, train_transliterations = part2ixy['train']
print('\nTraining classifier on %d examples from train set ...' % len(train_strings))
st = time.time()
params = train(train_strings, train_transliterations)
print('Classifier trained in %.2fs' % (time.time() - st))

In [None]:
allpreds = []
for part, (ids, x, y) in part2ixy.items():
    print('\nClassifying %s set with %d examples ...' % (part, len(x)))
    st = time.time()
    preds = classify(x, params)
    print('%s set classified in %.2fs' % (part, time.time() - st))
    count_of_values = list(map(len, preds))
    assert np.all(np.array(count_of_values) == top_k)
    #score(preds, y)
    allpreds.extend(zip(ids, preds))

save_preds(allpreds, preds_fname=PREDS_FNAME)
print('\nChecking saved predictions ...')
score_preds(preds_path=PREDS_FNAME, data_dir=TRANSLIT_PATH, parts=SCORED_PARTS)

###  Hyper-parameters choice

The model is ready. Now we need to find the optimal hyper-parameters.

The quality of models with different hyperparameters should be monitored on the dev or dev_small sets (dev_small saves time, since generating transliterations is a rather time-consuming process, comparable to one training epoch).

To generate predictions, you can use the `generate_predictions` function; to calculate the accuracy@1 metric, you can then use the `compute_metrics` function.
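
For intuition, accuracy@1 is the share of names whose single top prediction exactly matches the reference transliteration. The sketch below is a hypothetical stand-in written for illustration; the notebook's own `compute_metrics` may have a different signature and return format, and the strings are toy data rather than dataset entries.

```python
def accuracy_at_1(predictions, references):
    """Share of examples whose top-1 prediction exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return hits / len(references)


# Toy data for illustration only, not taken from the dataset.
toy_predictions = ["анна", "иван", "мария"]
toy_references = ["анна", "айвен", "мария"]
print(accuracy_at_1(toy_predictions, toy_references))
```

Two of the three toy predictions match, so the metric here is 2/3.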



Hyper-parameters are stored in the dictionaries `model_config` and `train_config` in the `train` function. It is suggested to leave the following hyperparameters in `model_config` and `train_config` unmodified:

* n_layers $=$ 2
* n_heads $=$ 2
* hidden_size $=$ 128
* ff_hidden_size $=$ 256
* warmup_steps_part $=$ 0.1
* batch_size $=$ 200

You can vary the dropout value. The model has 4 types of dropout: ***embedding dropout*** applied to the embeddings before they are sent to the first layer of the Encoder or Decoder, ***attention dropout*** applied to the attention weights in the MultiHeadAttention layer, ***residual dropout*** applied to the output of each sublayer (MultiHeadAttention or FeedForward) in the Encoder and Decoder layers and, finally, ***relu dropout*** used in the FeedForward layer. For all 4 types it is suggested to test the same dropout value from the list: 0.1, 0.15, 0.2.
It is also suggested to test several peak learning rate values - **lr_peak**: 5e-4, 1e-3, 2e-3.
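
The suggested search space is small enough to enumerate: one shared dropout value times one peak learning rate gives 9 configurations. The sketch below builds them as dictionaries with the keys that `train` reads from its `tune_params` argument; the helper name `build_grid` is hypothetical.

```python
from itertools import product

DROPOUT_VALUES = [0.1, 0.15, 0.2]
LR_PEAK_VALUES = [5e-4, 1e-3, 2e-3]


def build_grid(dropout_values, lr_peak_values):
    """Return one tune_params-style dict per (dropout, lr_peak) combination."""
    grid = []
    for dropout, lr_peak in product(dropout_values, lr_peak_values):
        grid.append({
            # The same dropout value is shared by all four dropout types.
            'embedding': dropout,
            'attention': dropout,
            'residual': dropout,
            'relu': dropout,
            'lr_peak': lr_peak,
        })
    return grid


grid = build_grid(DROPOUT_VALUES, LR_PEAK_VALUES)
print(len(grid))
for params in grid[:3]:
    print(params)
```

Each dictionary could then be passed as `tune_params` to `train` and the resulting model scored on dev_small.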

Note that if you are using a GPU, training one epoch takes about 1 minute and requires up to 1 GB of video memory. On a CPU, training is about 2 times slower. If you run out of RAM / video memory, reduce the batch size, but in this case the optimal range of learning rate values will change and must be determined again. With batch_size $=$ 200, it takes at least 300 epochs to achieve accuracy 0.66 on the dev_small dataset.
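
One visible effect of changing the batch size is that the number of optimizer steps, and with it the length of the warmup stage, changes. The sketch below mirrors the `n_steps` formula from `train`; the helper name `schedule_sizes` is hypothetical and the dataset size is a made-up number used only for illustration.

```python
def schedule_sizes(n_examples, batch_size, n_epochs, warmup_steps_part):
    """Mirror the n_steps formula used in train(): (N // batch_size + 1) * n_epochs."""
    steps_per_epoch = n_examples // batch_size + 1
    n_steps = steps_per_epoch * n_epochs
    warmup_steps = warmup_steps_part * n_steps
    return steps_per_epoch, n_steps, warmup_steps


# n_examples is a made-up dataset size, not the real size of the train set.
for batch_size in (200, 100):
    print(batch_size, schedule_sizes(n_examples=100_000, batch_size=batch_size,
                                     n_epochs=300, warmup_steps_part=0.1))
```

Halving the batch size roughly doubles both the total number of steps and the number of warmup steps for the same number of epochs.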

*Question: What are the optimal hyperparameters according to your experiments? Add plots or other descriptions here.*

```
I got my best score on the leaderboard (0.66) just by training the model for 600 epochs with the initial parameters given in this notebook.

I didn't have time to choose the best hyperparameters, but here is the code that could find them.

```



Here are also some loss plots, but it is too early to draw any conclusions:

In [None]:
%%capture

!pip install optuna-dashboard

!optuna-dashboard sqlite:///example-study.db

In [None]:
import optuna

from optuna.visualization import plot_contour
from optuna.visualization import plot_edf
from optuna.visualization import plot_intermediate_values
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_parallel_coordinate
from optuna.visualization import plot_param_importances
from optuna.visualization import plot_slice

SEED = 1337

np.random.seed(SEED)

In [None]:
def objective(trial):
    SCORED_PARTS = ('train', 'dev', 'train_small', 'dev_small', 'test')

    tune_params = {
                'embedding': trial.suggest_categorical('embedding', [0.1, 0.15, 0.2]),
                'attention': trial.suggest_categorical('attention', [0.1, 0.15, 0.2]),
                'residual': trial.suggest_categorical('residual', [0.1, 0.15, 0.2]),
                'relu': trial.suggest_categorical('relu', [0.1, 0.15, 0.2]),
                'lr_peak': trial.suggest_categorical('lr_peak', [5e-4, 1e-3, 2e-3])
            }
    top_k = 1
    part2ixy = load_dataset(TRANSLIT_PATH, parts=SCORED_PARTS)
    train_ids, train_strings, train_transliterations = part2ixy['train']
    source_strings, target_strings = train_strings, train_transliterations
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f'Using GPU device: {device}')
    else:
        device = torch.device('cpu')
        print(f'GPU is not available, using CPU device {device}')

    train_df = pd.DataFrame({'en': source_strings, 'ru': target_strings})
    text_encoder = TextEncoder()
    text_encoder.make_vocabs(train_df)
    model_config = {
        'src_vocab_size': text_encoder.src_vocab_size,
        'tgt_vocab_size': text_encoder.tgt_vocab_size,
        'max_src_seq_length': max(train_df['en'].aggregate(len)) + 2, #including start_token and end_token
        'max_tgt_seq_length': max(train_df['ru'].aggregate(len)) + 2,
        'n_layers': 2,
        'n_heads': 2,
        'hidden_size': 128,
        'ff_hidden_size': 256,
        'dropout': {
            'embedding': tune_params['embedding'] if tune_params else 0.1,
            'attention': tune_params['attention'] if tune_params else 0.1,
            'residual': tune_params['residual'] if tune_params else 0.1,
            'relu': tune_params['relu'] if tune_params else 0.1
        },
        'pad_idx': 0
    }
    
    wandb.init(project="hw1-NLP", 
               name=f"preds_translit_bl_emb_{tune_params['embedding']}_att_{tune_params['attention']}"
                           +f"res_{tune_params['residual']}_relu_{tune_params['relu']}_lr_{tune_params['lr_peak']}",
               group="Bamblbi_finetuning_new",
               config=model_config)
    
    model = prepare_model(model_config)
    model.to(device)

    train_config = {'batch_size': 200, 'n_epochs': 100, 'lr_scheduler': {
        'type': 'warmup,decay_linear',
        'warmup_steps_part': 0.1,
        'lr_peak': tune_params['lr_peak'] if tune_params else 3e-4,
    }}

    #Model training procedure
    optimizer = torch.optim.Adam(model.parameters(), lr=0.)
    n_steps = (len(train_df) // train_config['batch_size'] + 1) * train_config['n_epochs']
    lr_scheduler = LrScheduler(n_steps, **train_config['lr_scheduler'])

    # prepare train data
    source_strings, target_strings = zip(*sorted(zip(source_strings, target_strings),
                                                 key=lambda e: len(e[0])))
    train_dataloader = create_dataloader(source_strings, target_strings, text_encoder,
                                         train_config['batch_size'],
                                         shuffle_batches_each_epoch=True)
    # training cycle
    for epoch in range(1, train_config['n_epochs'] + 1):
        print('\n' + '-'*40)
        print(f'Epoch: {epoch}')
        print('Run training...')
        model.train()
        av_loss = run_epoch(train_dataloader, model,
                            lr_scheduler, optimizer, device=device, verbose=False)
        wandb.log({'loss_train': av_loss})
        if epoch % 20 == 0:
            save_checkpoint(epoch, model, lr_scheduler, optimizer, 'bamblbi_f')

    params = {
        'model': model,
        'text_encoder': text_encoder,
    }
    allpreds = []
    for part, (ids, x, y) in part2ixy.items():
        if part == 'dev' or part == 'dev_small':
            st = time.time()
            preds = classify(x, params)
            count_of_values = list(map(len, preds))
            assert np.all(np.array(count_of_values) == top_k)
            #score(preds, y)
            allpreds.extend(zip(ids, preds))
    SCORED_PARTS = ('dev', 'dev_small')
    PREDS_FNAME = f"preds_translit_bl_f{tune_params['embedding']}_{tune_params['attention']}_{tune_params['residual']}_{tune_params['relu']}_{tune_params['lr_peak']}.tsv"
    save_preds(allpreds, preds_fname=PREDS_FNAME)
    score = score_preds(preds_path=PREDS_FNAME, data_dir=TRANSLIT_PATH, parts=SCORED_PARTS)
    return score['dev']['acc@1']

     

In [None]:
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=SEED),
)
study.optimize(objective, n_trials=8)

[I 2023-04-11 19:18:10,487] A new study created in memory with name: no-name-3ed3d218-dc4c-4e5c-9fc1-8ff073304dfd


Using GPU device: cuda



wandb run history: loss_train █▃▁

wandb run summary: loss_train 2.2568



----------------------------------------
Epoch: 1
Run training...


527it [00:16, 31.74it/s]


** End of epoch, accumulated average loss = 4.207470 **

----------------------------------------
Epoch: 2
Run training...


527it [00:16, 32.83it/s]


** End of epoch, accumulated average loss = 2.953771 **

----------------------------------------
Epoch: 3
Run training...


527it [00:15, 34.19it/s]


** End of epoch, accumulated average loss = 2.102429 **

----------------------------------------
Epoch: 4
Run training...


527it [00:16, 32.68it/s]


** End of epoch, accumulated average loss = 1.265408 **

----------------------------------------
Epoch: 5
Run training...


527it [00:15, 34.24it/s]


** End of epoch, accumulated average loss = 1.014860 **

----------------------------------------
Epoch: 6
Run training...


527it [00:16, 32.53it/s]


** End of epoch, accumulated average loss = 0.865737 **

----------------------------------------
Epoch: 7
Run training...


527it [00:15, 34.07it/s]


** End of epoch, accumulated average loss = 0.738598 **

----------------------------------------
Epoch: 8
Run training...


527it [00:15, 33.30it/s]


** End of epoch, accumulated average loss = 0.661671 **

----------------------------------------
Epoch: 9
Run training...


527it [00:15, 33.76it/s]


** End of epoch, accumulated average loss = 0.618672 **

----------------------------------------
Epoch: 10
Run training...


527it [00:16, 32.86it/s]


** End of epoch, accumulated average loss = 0.581310 **

----------------------------------------
Epoch: 11
Run training...


527it [00:15, 33.38it/s]


** End of epoch, accumulated average loss = 0.555280 **

----------------------------------------
Epoch: 12
Run training...


527it [00:15, 33.29it/s]


** End of epoch, accumulated average loss = 0.528049 **

----------------------------------------
Epoch: 13
Run training...


527it [00:15, 34.18it/s]


** End of epoch, accumulated average loss = 0.508095 **

----------------------------------------
Epoch: 14
Run training...


527it [00:16, 32.70it/s]


** End of epoch, accumulated average loss = 0.489967 **

----------------------------------------
Epoch: 15
Run training...


527it [00:15, 34.17it/s]


** End of epoch, accumulated average loss = 0.475790 **

----------------------------------------
Epoch: 16
Run training...


527it [00:16, 32.90it/s]


** End of epoch, accumulated average loss = 0.461721 **

----------------------------------------
Epoch: 17
Run training...


527it [00:15, 33.98it/s]


** End of epoch, accumulated average loss = 0.451526 **

----------------------------------------
Epoch: 18
Run training...


527it [00:15, 32.99it/s]


** End of epoch, accumulated average loss = 0.442905 **

----------------------------------------
Epoch: 19
Run training...


527it [00:15, 33.83it/s]


** End of epoch, accumulated average loss = 0.432661 **

----------------------------------------
Epoch: 20
Run training...


527it [00:15, 33.00it/s]


** End of epoch, accumulated average loss = 0.427170 **
Saved checkpoint to bamblbi_fepoch_20

----------------------------------------
Epoch: 21
Run training...


527it [00:15, 33.98it/s]


** End of epoch, accumulated average loss = 0.422271 **

----------------------------------------
Epoch: 22
Run training...


527it [00:16, 32.93it/s]


** End of epoch, accumulated average loss = 0.414504 **

----------------------------------------
Epoch: 23
Run training...


527it [00:15, 34.02it/s]


** End of epoch, accumulated average loss = 0.413119 **

----------------------------------------
Epoch: 24
Run training...


527it [00:16, 32.88it/s]


** End of epoch, accumulated average loss = 0.406008 **

----------------------------------------
Epoch: 25
Run training...


527it [00:15, 33.65it/s]


** End of epoch, accumulated average loss = 0.400984 **

----------------------------------------
Epoch: 26
Run training...


527it [00:15, 33.11it/s]


** End of epoch, accumulated average loss = 0.399502 **

----------------------------------------
Epoch: 27
Run training...


527it [00:15, 33.23it/s]


** End of epoch, accumulated average loss = 0.395590 **

----------------------------------------
Epoch: 28
Run training...


527it [00:15, 33.06it/s]


** End of epoch, accumulated average loss = 0.391406 **

----------------------------------------
Epoch: 29
Run training...


527it [00:15, 33.97it/s]


** End of epoch, accumulated average loss = 0.389312 **

----------------------------------------
Epoch: 30
Run training...


527it [00:16, 32.23it/s]


** End of epoch, accumulated average loss = 0.385862 **

----------------------------------------
Epoch: 31
Run training...


527it [00:15, 34.14it/s]


** End of epoch, accumulated average loss = 0.382567 **

----------------------------------------
Epoch: 32
Run training...


527it [00:15, 32.96it/s]


** End of epoch, accumulated average loss = 0.380349 **

----------------------------------------
Epoch: 33
Run training...


527it [00:15, 34.14it/s]


** End of epoch, accumulated average loss = 0.377040 **

----------------------------------------
Epoch: 34
Run training...


527it [00:16, 32.92it/s]


** End of epoch, accumulated average loss = 0.376099 **

----------------------------------------
Epoch: 35
Run training...


527it [00:15, 33.14it/s]


** End of epoch, accumulated average loss = 0.373625 **

----------------------------------------
Epoch: 36
Run training...


527it [00:15, 33.24it/s]


** End of epoch, accumulated average loss = 0.371619 **

----------------------------------------
Epoch: 37
Run training...


527it [00:15, 33.35it/s]


** End of epoch, accumulated average loss = 0.369341 **

----------------------------------------
Epoch: 38
Run training...


527it [00:15, 33.17it/s]


** End of epoch, accumulated average loss = 0.367012 **

----------------------------------------
Epoch: 39
Run training...


527it [00:15, 33.91it/s]


** End of epoch, accumulated average loss = 0.366295 **

----------------------------------------
Epoch: 40
Run training...


527it [00:15, 33.02it/s]


** End of epoch, accumulated average loss = 0.363398 **
Saved checkpoint to bamblbi_fepoch_40

----------------------------------------
Epoch: 41
Run training...


527it [00:15, 33.39it/s]


** End of epoch, accumulated average loss = 0.362143 **

----------------------------------------
Epoch: 42
Run training...


527it [00:15, 32.96it/s]


** End of epoch, accumulated average loss = 0.360022 **

----------------------------------------
Epoch: 43
Run training...


527it [00:15, 33.74it/s]


** End of epoch, accumulated average loss = 0.358536 **

----------------------------------------
Epoch: 44
Run training...


527it [00:15, 33.04it/s]


** End of epoch, accumulated average loss = 0.357044 **

----------------------------------------
Epoch: 45
Run training...


527it [00:15, 33.82it/s]


** End of epoch, accumulated average loss = 0.356197 **

----------------------------------------
Epoch: 46
Run training...


527it [00:16, 32.79it/s]


** End of epoch, accumulated average loss = 0.354834 **

----------------------------------------
Epoch: 47
Run training...


527it [00:15, 33.81it/s]


** End of epoch, accumulated average loss = 0.351553 **

----------------------------------------
Epoch: 48
Run training...


527it [00:16, 32.48it/s]


** End of epoch, accumulated average loss = 0.350657 **

----------------------------------------
Epoch: 49
Run training...


527it [00:15, 34.12it/s]


** End of epoch, accumulated average loss = 0.350113 **

----------------------------------------
Epoch: 50
Run training...


527it [00:16, 32.58it/s]


** End of epoch, accumulated average loss = 0.348711 **

----------------------------------------
Epoch: 51
Run training...


527it [00:15, 33.88it/s]


** End of epoch, accumulated average loss = 0.347494 **

----------------------------------------
Epoch: 52
Run training...


527it [00:16, 32.81it/s]


** End of epoch, accumulated average loss = 0.346574 **

----------------------------------------
Epoch: 53
Run training...


527it [00:15, 33.27it/s]


** End of epoch, accumulated average loss = 0.345277 **

----------------------------------------
Epoch: 54
Run training...


527it [00:15, 33.07it/s]


** End of epoch, accumulated average loss = 0.343887 **

----------------------------------------
Epoch: 55
Run training...


527it [00:15, 33.73it/s]


** End of epoch, accumulated average loss = 0.342055 **

----------------------------------------
Epoch: 56
Run training...


527it [00:16, 32.65it/s]


** End of epoch, accumulated average loss = 0.341018 **

----------------------------------------
Epoch: 57
Run training...


527it [00:15, 33.62it/s]


** End of epoch, accumulated average loss = 0.339621 **

----------------------------------------
Epoch: 58
Run training...


527it [00:15, 33.08it/s]


** End of epoch, accumulated average loss = 0.339104 **

----------------------------------------
Epoch: 59
Run training...


527it [00:15, 34.25it/s]


** End of epoch, accumulated average loss = 0.337610 **

----------------------------------------
Epoch: 60
Run training...


527it [00:15, 33.20it/s]


** End of epoch, accumulated average loss = 0.335890 **
Saved checkpoint to bamblbi_fepoch_60

----------------------------------------
Epoch: 61
Run training...


527it [00:15, 33.66it/s]


** End of epoch, accumulated average loss = 0.335895 **

----------------------------------------
Epoch: 62
Run training...


527it [00:16, 32.85it/s]


** End of epoch, accumulated average loss = 0.334006 **

----------------------------------------
Epoch: 63
Run training...


527it [00:15, 33.76it/s]


** End of epoch, accumulated average loss = 0.333520 **

----------------------------------------
Epoch: 64
Run training...


527it [00:15, 33.12it/s]


** End of epoch, accumulated average loss = 0.332254 **

----------------------------------------
Epoch: 65
Run training...


527it [00:15, 34.28it/s]


** End of epoch, accumulated average loss = 0.331492 **

----------------------------------------
Epoch: 66
Run training...


527it [00:16, 32.70it/s]


** End of epoch, accumulated average loss = 0.329827 **

----------------------------------------
Epoch: 67
Run training...


527it [00:15, 34.25it/s]


** End of epoch, accumulated average loss = 0.329267 **

----------------------------------------
Epoch: 68
Run training...


527it [00:15, 33.37it/s]


** End of epoch, accumulated average loss = 0.327297 **

----------------------------------------
Epoch: 69
Run training...


527it [00:15, 33.97it/s]


** End of epoch, accumulated average loss = 0.326864 **

----------------------------------------
Epoch: 70
Run training...


527it [00:15, 33.37it/s]


** End of epoch, accumulated average loss = 0.327375 **

----------------------------------------
Epoch: 71
Run training...


527it [00:15, 33.71it/s]


** End of epoch, accumulated average loss = 0.324503 **

----------------------------------------
Epoch: 72
Run training...


527it [00:15, 33.10it/s]


** End of epoch, accumulated average loss = 0.324369 **

----------------------------------------
Epoch: 73
Run training...


527it [00:15, 34.03it/s]


** End of epoch, accumulated average loss = 0.322751 **

----------------------------------------
Epoch: 74
Run training...


527it [00:16, 32.85it/s]


** End of epoch, accumulated average loss = 0.322879 **

----------------------------------------
Epoch: 75
Run training...


527it [00:15, 34.38it/s]


** End of epoch, accumulated average loss = 0.321650 **

----------------------------------------
Epoch: 76
Run training...


527it [00:16, 32.62it/s]


** End of epoch, accumulated average loss = 0.320783 **

----------------------------------------
Epoch: 77
Run training...


527it [00:15, 34.30it/s]


** End of epoch, accumulated average loss = 0.319761 **

----------------------------------------
Epoch: 78
Run training...


527it [00:15, 33.05it/s]


** End of epoch, accumulated average loss = 0.319280 **

----------------------------------------
Epoch: 79
Run training...


527it [00:15, 34.00it/s]


** End of epoch, accumulated average loss = 0.317638 **

----------------------------------------
Epoch: 80
Run training...


527it [00:15, 32.99it/s]


** End of epoch, accumulated average loss = 0.317181 **
Saved checkpoint to bamblbi_fepoch_80

----------------------------------------
Epoch: 81
Run training...


527it [00:15, 34.30it/s]


** End of epoch, accumulated average loss = 0.316667 **

----------------------------------------
Epoch: 82
Run training...


527it [00:15, 33.00it/s]


** End of epoch, accumulated average loss = 0.316174 **

----------------------------------------
Epoch: 83
Run training...


527it [00:15, 33.96it/s]


** End of epoch, accumulated average loss = 0.314330 **

----------------------------------------
Epoch: 84
Run training...


527it [00:16, 32.89it/s]


** End of epoch, accumulated average loss = 0.313323 **

----------------------------------------
Epoch: 85
Run training...


527it [00:15, 34.15it/s]


** End of epoch, accumulated average loss = 0.314244 **

----------------------------------------
Epoch: 86
Run training...


527it [00:15, 33.35it/s]


** End of epoch, accumulated average loss = 0.311951 **

----------------------------------------
Epoch: 87
Run training...


527it [00:15, 33.63it/s]


** End of epoch, accumulated average loss = 0.311941 **

----------------------------------------
Epoch: 88
Run training...


527it [00:15, 33.16it/s]


** End of epoch, accumulated average loss = 0.311273 **

----------------------------------------
Epoch: 89
Run training...


527it [00:15, 34.28it/s]


** End of epoch, accumulated average loss = 0.309695 **

----------------------------------------
Epoch: 90
Run training...


527it [00:15, 33.17it/s]


** End of epoch, accumulated average loss = 0.310241 **

----------------------------------------
Epoch: 91
Run training...


527it [00:15, 34.28it/s]


** End of epoch, accumulated average loss = 0.308384 **

----------------------------------------
Epoch: 92
Run training...


527it [00:16, 32.61it/s]


** End of epoch, accumulated average loss = 0.309103 **

----------------------------------------
Epoch: 93
Run training...


527it [00:15, 33.80it/s]


** End of epoch, accumulated average loss = 0.308574 **

----------------------------------------
Epoch: 94
Run training...


527it [00:15, 33.11it/s]


** End of epoch, accumulated average loss = 0.306948 **

----------------------------------------
Epoch: 95
Run training...


527it [00:15, 33.90it/s]


** End of epoch, accumulated average loss = 0.307469 **

----------------------------------------
Epoch: 96
Run training...


527it [00:15, 33.40it/s]


** End of epoch, accumulated average loss = 0.306775 **

----------------------------------------
Epoch: 97
Run training...


527it [00:15, 33.77it/s]


** End of epoch, accumulated average loss = 0.305821 **

----------------------------------------
Epoch: 98
Run training...


527it [00:15, 33.07it/s]


** End of epoch, accumulated average loss = 0.305666 **

----------------------------------------
Epoch: 99
Run training...


527it [00:15, 33.97it/s]


** End of epoch, accumulated average loss = 0.305067 **

----------------------------------------
Epoch: 100
Run training...


527it [00:16, 32.94it/s]


** End of epoch, accumulated average loss = 0.304366 **
Saved checkpoint to bamblbi_fepoch_100
Using GPU device: cuda


100%|██████████| 132/132 [00:14<00:00,  9.06it/s]


Using GPU device: cuda


100%|██████████| 10/10 [00:01<00:00,  7.30it/s]
[I 2023-04-11 19:45:22,090] Trial 0 finished with value: 0.633095436944803 and parameters: {'embedding': 0.2, 'attention': 0.2, 'residual': 0.15, 'relu': 0.2, 'lr_peak': 0.001}. Best is trial 0 with value: 0.633095436944803.


Predictions saved to preds_translit_bl_f0.2_0.2_0.15_0.2_0.001.tsv
dev set accuracy@1: 0.63
dev_small set accuracy@1: 0.65
Using GPU device: cuda



Run history:
loss_train: █▄▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
loss_train: 0.30437



----------------------------------------
Epoch: 1
Run training...


527it [00:16, 31.75it/s]


** End of epoch, accumulated average loss = 3.815826 **

----------------------------------------
Epoch: 2
Run training...


527it [00:15, 33.21it/s]


** End of epoch, accumulated average loss = 2.369842 **

----------------------------------------
Epoch: 3
Run training...


527it [00:15, 34.10it/s]


** End of epoch, accumulated average loss = 1.197638 **

----------------------------------------
Epoch: 4
Run training...


527it [00:15, 33.16it/s]


** End of epoch, accumulated average loss = 0.908341 **

----------------------------------------
Epoch: 5
Run training...


527it [00:15, 33.86it/s]


** End of epoch, accumulated average loss = 0.744621 **

----------------------------------------
Epoch: 6
Run training...


527it [00:15, 33.09it/s]


** End of epoch, accumulated average loss = 0.649059 **

----------------------------------------
Epoch: 7
Run training...


527it [00:15, 34.43it/s]


** End of epoch, accumulated average loss = 0.598724 **

----------------------------------------
Epoch: 8
Run training...


527it [00:16, 32.92it/s]


** End of epoch, accumulated average loss = 0.563207 **

----------------------------------------
Epoch: 9
Run training...


527it [00:15, 34.12it/s]


** End of epoch, accumulated average loss = 0.555723 **

----------------------------------------
Epoch: 10
Run training...


527it [00:16, 32.58it/s]


** End of epoch, accumulated average loss = 0.526684 **

----------------------------------------
Epoch: 11
Run training...


527it [00:15, 34.07it/s]


** End of epoch, accumulated average loss = 0.505789 **

----------------------------------------
Epoch: 12
Run training...


527it [00:15, 33.36it/s]


** End of epoch, accumulated average loss = 0.486882 **

----------------------------------------
Epoch: 13
Run training...


527it [00:15, 33.79it/s]


** End of epoch, accumulated average loss = 0.478513 **

----------------------------------------
Epoch: 14
Run training...


527it [00:15, 33.25it/s]


** End of epoch, accumulated average loss = 0.458887 **

----------------------------------------
Epoch: 15
Run training...


527it [00:15, 34.17it/s]


** End of epoch, accumulated average loss = 0.453513 **

----------------------------------------
Epoch: 16
Run training...


527it [00:16, 32.93it/s]


** End of epoch, accumulated average loss = 0.442731 **

----------------------------------------
Epoch: 17
Run training...


527it [00:15, 34.47it/s]


** End of epoch, accumulated average loss = 0.435614 **

----------------------------------------
Epoch: 18
Run training...


527it [00:16, 32.58it/s]


** End of epoch, accumulated average loss = 0.431412 **

----------------------------------------
Epoch: 19
Run training...


527it [00:15, 34.38it/s]


** End of epoch, accumulated average loss = 0.423516 **

----------------------------------------
Epoch: 20
Run training...


527it [00:16, 32.83it/s]


** End of epoch, accumulated average loss = 0.417221 **
Saved checkpoint to bamblbi_fepoch_20

----------------------------------------
Epoch: 21
Run training...


527it [00:15, 33.74it/s]


** End of epoch, accumulated average loss = 0.412255 **

----------------------------------------
Epoch: 22
Run training...


527it [00:15, 32.96it/s]


** End of epoch, accumulated average loss = 0.410746 **

----------------------------------------
Epoch: 23
Run training...


527it [00:15, 34.08it/s]


** End of epoch, accumulated average loss = 0.405322 **

----------------------------------------
Epoch: 24
Run training...


527it [00:15, 33.37it/s]


** End of epoch, accumulated average loss = 0.402569 **

----------------------------------------
Epoch: 25
Run training...


527it [00:15, 34.01it/s]


** End of epoch, accumulated average loss = 0.401167 **

----------------------------------------
Epoch: 26
Run training...


527it [00:16, 32.78it/s]


** End of epoch, accumulated average loss = 0.396862 **

----------------------------------------
Epoch: 27
Run training...


527it [00:15, 34.09it/s]


** End of epoch, accumulated average loss = 0.392966 **

----------------------------------------
Epoch: 28
Run training...


527it [00:15, 33.38it/s]


** End of epoch, accumulated average loss = 0.388029 **

----------------------------------------
Epoch: 29
Run training...


527it [00:15, 33.50it/s]


** End of epoch, accumulated average loss = 0.385230 **

----------------------------------------
Epoch: 30
Run training...


527it [00:15, 33.26it/s]


** End of epoch, accumulated average loss = 0.383785 **

----------------------------------------
Epoch: 31
Run training...


527it [00:15, 33.68it/s]


** End of epoch, accumulated average loss = 0.381893 **

----------------------------------------
Epoch: 32
Run training...


527it [00:15, 33.32it/s]


** End of epoch, accumulated average loss = 0.379155 **

----------------------------------------
Epoch: 33
Run training...


527it [00:15, 34.25it/s]


** End of epoch, accumulated average loss = 0.377888 **

----------------------------------------
Epoch: 34
Run training...


527it [00:16, 32.66it/s]


** End of epoch, accumulated average loss = 0.374351 **

----------------------------------------
Epoch: 35
Run training...


527it [00:15, 34.29it/s]


** End of epoch, accumulated average loss = 0.372308 **

----------------------------------------
Epoch: 36
Run training...


527it [00:16, 32.55it/s]


** End of epoch, accumulated average loss = 0.370228 **

----------------------------------------
Epoch: 37
Run training...


527it [00:15, 34.15it/s]


** End of epoch, accumulated average loss = 0.367873 **

----------------------------------------
Epoch: 38
Run training...


527it [00:15, 33.51it/s]


** End of epoch, accumulated average loss = 0.366432 **

----------------------------------------
Epoch: 39
Run training...


527it [00:15, 33.77it/s]


** End of epoch, accumulated average loss = 0.364590 **

----------------------------------------
Epoch: 40
Run training...


527it [00:15, 33.40it/s]


** End of epoch, accumulated average loss = 0.363080 **
Saved checkpoint to bamblbi_fepoch_40

----------------------------------------
Epoch: 41
Run training...


527it [00:15, 33.66it/s]


** End of epoch, accumulated average loss = 0.360696 **

----------------------------------------
Epoch: 42
Run training...


527it [00:15, 33.09it/s]


** End of epoch, accumulated average loss = 0.358651 **

----------------------------------------
Epoch: 43
Run training...


527it [00:15, 34.11it/s]


** End of epoch, accumulated average loss = 0.358244 **

----------------------------------------
Epoch: 44
Run training...


527it [00:15, 33.16it/s]


** End of epoch, accumulated average loss = 0.355134 **

----------------------------------------
Epoch: 45
Run training...


527it [00:15, 34.08it/s]


** End of epoch, accumulated average loss = 0.354368 **

----------------------------------------
Epoch: 46
Run training...


527it [00:15, 33.14it/s]


** End of epoch, accumulated average loss = 0.352433 **

----------------------------------------
Epoch: 47
Run training...


527it [00:15, 34.02it/s]


** End of epoch, accumulated average loss = 0.350354 **

----------------------------------------
Epoch: 48
Run training...


527it [00:15, 33.11it/s]


** End of epoch, accumulated average loss = 0.349979 **

----------------------------------------
Epoch: 49
Run training...


527it [00:15, 33.58it/s]


** End of epoch, accumulated average loss = 0.347494 **

----------------------------------------
Epoch: 50
Run training...


527it [00:15, 33.05it/s]


** End of epoch, accumulated average loss = 0.345837 **

----------------------------------------
Epoch: 51
Run training...


527it [00:15, 34.22it/s]


** End of epoch, accumulated average loss = 0.344499 **

----------------------------------------
Epoch: 52
Run training...


527it [00:16, 32.87it/s]


** End of epoch, accumulated average loss = 0.343815 **

----------------------------------------
Epoch: 53
Run training...


527it [00:15, 34.01it/s]


** End of epoch, accumulated average loss = 0.341693 **

----------------------------------------
Epoch: 54
Run training...


527it [00:15, 33.57it/s]


** End of epoch, accumulated average loss = 0.339189 **

----------------------------------------
Epoch: 55
Run training...


527it [00:15, 33.68it/s]


** End of epoch, accumulated average loss = 0.338528 **

----------------------------------------
Epoch: 56
Run training...


527it [00:15, 33.18it/s]


** End of epoch, accumulated average loss = 0.337955 **

----------------------------------------
Epoch: 57
Run training...


527it [00:15, 33.45it/s]


** End of epoch, accumulated average loss = 0.335886 **

----------------------------------------
Epoch: 58
Run training...


527it [00:15, 33.40it/s]


** End of epoch, accumulated average loss = 0.335103 **

----------------------------------------
Epoch: 59
Run training...


527it [00:15, 34.30it/s]


** End of epoch, accumulated average loss = 0.333541 **

----------------------------------------
Epoch: 60
Run training...


527it [00:16, 32.89it/s]


** End of epoch, accumulated average loss = 0.332388 **
Saved checkpoint to bamblbi_fepoch_60

----------------------------------------
Epoch: 61
Run training...


527it [00:15, 33.90it/s]


** End of epoch, accumulated average loss = 0.330945 **

----------------------------------------
Epoch: 62
Run training...


527it [00:16, 32.71it/s]


** End of epoch, accumulated average loss = 0.330093 **

----------------------------------------
Epoch: 63
Run training...


527it [00:15, 34.31it/s]


** End of epoch, accumulated average loss = 0.328253 **

----------------------------------------
Epoch: 64
Run training...


527it [00:15, 32.95it/s]


** End of epoch, accumulated average loss = 0.327482 **

----------------------------------------
Epoch: 65
Run training...


527it [00:15, 33.97it/s]


** End of epoch, accumulated average loss = 0.325332 **

----------------------------------------
Epoch: 66
Run training...


527it [00:15, 33.25it/s]


** End of epoch, accumulated average loss = 0.325503 **

----------------------------------------
Epoch: 67
Run training...


527it [00:15, 33.97it/s]


** End of epoch, accumulated average loss = 0.323579 **

----------------------------------------
Epoch: 68
Run training...


527it [00:15, 32.97it/s]


** End of epoch, accumulated average loss = 0.322336 **

----------------------------------------
Epoch: 69
Run training...


527it [00:15, 34.19it/s]


** End of epoch, accumulated average loss = 0.321286 **

----------------------------------------
Epoch: 70
Run training...


527it [00:15, 33.12it/s]


** End of epoch, accumulated average loss = 0.320610 **

----------------------------------------
Epoch: 71
Run training...


527it [00:15, 33.95it/s]


** End of epoch, accumulated average loss = 0.318713 **

----------------------------------------
Epoch: 72
Run training...


527it [00:15, 33.00it/s]


** End of epoch, accumulated average loss = 0.317463 **

----------------------------------------
Epoch: 73
Run training...


527it [00:15, 33.80it/s]


** End of epoch, accumulated average loss = 0.316273 **

----------------------------------------
Epoch: 74
Run training...


527it [00:15, 33.24it/s]


** End of epoch, accumulated average loss = 0.315246 **

----------------------------------------
Epoch: 75
Run training...


527it [00:15, 33.96it/s]


** End of epoch, accumulated average loss = 0.313930 **

----------------------------------------
Epoch: 76
Run training...


527it [00:15, 33.00it/s]


** End of epoch, accumulated average loss = 0.311713 **

----------------------------------------
Epoch: 77
Run training...


527it [00:15, 34.27it/s]


** End of epoch, accumulated average loss = 0.311391 **

----------------------------------------
Epoch: 78
Run training...


527it [00:16, 32.85it/s]


** End of epoch, accumulated average loss = 0.311052 **

----------------------------------------
Epoch: 79
Run training...


527it [00:15, 34.33it/s]


** End of epoch, accumulated average loss = 0.308834 **

----------------------------------------
Epoch: 80
Run training...


527it [00:15, 33.25it/s]


** End of epoch, accumulated average loss = 0.308853 **
Saved checkpoint to bamblbi_fepoch_80

----------------------------------------
Epoch: 81
Run training...


527it [00:15, 33.81it/s]


** End of epoch, accumulated average loss = 0.307189 **

----------------------------------------
Epoch: 82
Run training...


527it [00:15, 33.45it/s]


** End of epoch, accumulated average loss = 0.306036 **

----------------------------------------
Epoch: 83
Run training...


527it [00:15, 33.73it/s]


** End of epoch, accumulated average loss = 0.305654 **

----------------------------------------
Epoch: 84
Run training...


527it [00:15, 33.64it/s]


** End of epoch, accumulated average loss = 0.304746 **

----------------------------------------
Epoch: 85
Run training...


527it [00:15, 34.37it/s]


** End of epoch, accumulated average loss = 0.303122 **

----------------------------------------
Epoch: 86
Run training...


527it [00:15, 33.40it/s]


** End of epoch, accumulated average loss = 0.302359 **

----------------------------------------
Epoch: 87
Run training...


527it [00:15, 34.35it/s]


** End of epoch, accumulated average loss = 0.302065 **

----------------------------------------
Epoch: 88
Run training...


527it [00:16, 32.68it/s]


** End of epoch, accumulated average loss = 0.301258 **

----------------------------------------
Epoch: 89
Run training...


527it [00:15, 34.66it/s]


** End of epoch, accumulated average loss = 0.299794 **

----------------------------------------
Epoch: 90
Run training...


527it [00:15, 33.33it/s]


** End of epoch, accumulated average loss = 0.298708 **

----------------------------------------
Epoch: 91
Run training...


527it [00:15, 34.27it/s]


** End of epoch, accumulated average loss = 0.298449 **

----------------------------------------
Epoch: 92
Run training...


527it [00:15, 33.21it/s]


** End of epoch, accumulated average loss = 0.297340 **

----------------------------------------
Epoch: 93
Run training...


527it [00:15, 34.42it/s]


** End of epoch, accumulated average loss = 0.295910 **

----------------------------------------
Epoch: 94
Run training...


527it [00:15, 33.17it/s]


** End of epoch, accumulated average loss = 0.295425 **

----------------------------------------
Epoch: 95
Run training...


527it [00:15, 34.32it/s]


** End of epoch, accumulated average loss = 0.294676 **

----------------------------------------
Epoch: 96
Run training...


527it [00:15, 33.20it/s]


** End of epoch, accumulated average loss = 0.293617 **

----------------------------------------
Epoch: 97
Run training...


527it [00:15, 34.21it/s]


** End of epoch, accumulated average loss = 0.292870 **

----------------------------------------
Epoch: 98
Run training...


527it [00:15, 33.27it/s]


** End of epoch, accumulated average loss = 0.292338 **

----------------------------------------
Epoch: 99
Run training...


527it [00:15, 34.10it/s]


** End of epoch, accumulated average loss = 0.291819 **

----------------------------------------
Epoch: 100
Run training...


527it [00:15, 33.41it/s]


** End of epoch, accumulated average loss = 0.291544 **
Saved checkpoint to bamblbi_fepoch_100
Using GPU device: cuda


100%|██████████| 132/132 [00:14<00:00,  9.16it/s]


Using GPU device: cuda


100%|██████████| 10/10 [00:01<00:00,  9.12it/s]
[I 2023-04-11 20:12:26,475] Trial 1 finished with value: 0.6393212360488953 and parameters: {'embedding': 0.15, 'attention': 0.2, 'residual': 0.2, 'relu': 0.1, 'lr_peak': 0.002}. Best is trial 1 with value: 0.6393212360488953.


Predictions saved to preds_translit_bl_f0.15_0.2_0.2_0.1_0.002.tsv
dev set accuracy@1: 0.64
dev_small set accuracy@1: 0.66
Using GPU device: cuda



Run history:
loss_train: █▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
loss_train: 0.29154



----------------------------------------
Epoch: 1
Run training...


527it [00:16, 31.86it/s]


** End of epoch, accumulated average loss = 4.080651 **

----------------------------------------
Epoch: 2
Run training...


527it [00:15, 33.33it/s]


** End of epoch, accumulated average loss = 2.917995 **

----------------------------------------
Epoch: 3
Run training...


527it [00:15, 34.08it/s]


** End of epoch, accumulated average loss = 1.929560 **

----------------------------------------
Epoch: 4
Run training...


527it [00:15, 33.44it/s]


** End of epoch, accumulated average loss = 1.122450 **

----------------------------------------
Epoch: 5
Run training...


527it [00:15, 33.85it/s]


** End of epoch, accumulated average loss = 0.899653 **

----------------------------------------
Epoch: 6
Run training...


527it [00:15, 33.25it/s]


** End of epoch, accumulated average loss = 0.773904 **

----------------------------------------
Epoch: 7
Run training...


527it [00:15, 34.33it/s]


** End of epoch, accumulated average loss = 0.685507 **

----------------------------------------
Epoch: 8
Run training...


527it [00:15, 33.23it/s]


** End of epoch, accumulated average loss = 0.633948 **

----------------------------------------
Epoch: 9
Run training...


527it [00:15, 34.43it/s]


** End of epoch, accumulated average loss = 0.591570 **

----------------------------------------
Epoch: 10
Run training...


527it [00:15, 33.23it/s]


** End of epoch, accumulated average loss = 0.559699 **

----------------------------------------
Epoch: 11
Run training...


527it [00:15, 34.22it/s]


** End of epoch, accumulated average loss = 0.525923 **

----------------------------------------
Epoch: 12
Run training...


527it [00:15, 33.27it/s]


** End of epoch, accumulated average loss = 0.500096 **

----------------------------------------
Epoch: 13
Run training...


527it [00:15, 34.70it/s]


** End of epoch, accumulated average loss = 0.480750 **

----------------------------------------
Epoch: 14
Run training...


527it [00:16, 32.71it/s]


** End of epoch, accumulated average loss = 0.461272 **

----------------------------------------
Epoch: 15
Run training...


527it [00:15, 34.52it/s]


** End of epoch, accumulated average loss = 0.449063 **

----------------------------------------
Epoch: 16
Run training...


527it [00:15, 33.06it/s]


** End of epoch, accumulated average loss = 0.435489 **

----------------------------------------
Epoch: 17
Run training...


527it [00:15, 34.40it/s]


** End of epoch, accumulated average loss = 0.427128 **

----------------------------------------
Epoch: 18
Run training...


527it [00:15, 33.36it/s]


** End of epoch, accumulated average loss = 0.419301 **

----------------------------------------
Epoch: 19
Run training...


527it [00:15, 33.93it/s]


** End of epoch, accumulated average loss = 0.410888 **

----------------------------------------
Epoch: 20
Run training...


527it [00:15, 33.42it/s]


** End of epoch, accumulated average loss = 0.406228 **
Saved checkpoint to bamblbi_fepoch_20

----------------------------------------
Epoch: 21
Run training...


527it [00:15, 33.73it/s]


** End of epoch, accumulated average loss = 0.398207 **

----------------------------------------
Epoch: 22
Run training...


527it [00:15, 33.33it/s]


** End of epoch, accumulated average loss = 0.395790 **

----------------------------------------
Epoch: 23
Run training...


527it [00:15, 34.67it/s]


** End of epoch, accumulated average loss = 0.388935 **

----------------------------------------
Epoch: 24
Run training...


527it [00:15, 33.05it/s]


** End of epoch, accumulated average loss = 0.383636 **

----------------------------------------
Epoch: 25
Run training...


527it [00:15, 34.10it/s]


** End of epoch, accumulated average loss = 0.379067 **

----------------------------------------
Epoch: 26
Run training...


527it [00:16, 32.90it/s]


** End of epoch, accumulated average loss = 0.376390 **

----------------------------------------
Epoch: 27
Run training...


527it [00:15, 34.25it/s]


** End of epoch, accumulated average loss = 0.372751 **

----------------------------------------
Epoch: 28
Run training...


527it [00:15, 33.24it/s]


** End of epoch, accumulated average loss = 0.369845 **

----------------------------------------
Epoch: 29
Run training...


527it [00:15, 33.93it/s]


** End of epoch, accumulated average loss = 0.365342 **

----------------------------------------
Epoch: 30
Run training...


527it [00:15, 33.30it/s]


** End of epoch, accumulated average loss = 0.364132 **

----------------------------------------
Epoch: 31
Run training...


527it [00:15, 34.22it/s]


** End of epoch, accumulated average loss = 0.360340 **

----------------------------------------
Epoch: 32
Run training...


527it [00:15, 33.20it/s]


** End of epoch, accumulated average loss = 0.357857 **

----------------------------------------
Epoch: 33
Run training...


527it [00:15, 34.37it/s]


** End of epoch, accumulated average loss = 0.355829 **

----------------------------------------
Epoch: 34
Run training...


527it [00:15, 33.14it/s]


** End of epoch, accumulated average loss = 0.352563 **

----------------------------------------
Epoch: 35
Run training...


527it [00:15, 34.26it/s]


** End of epoch, accumulated average loss = 0.351477 **

----------------------------------------
Epoch: 36
Run training...


527it [00:15, 33.23it/s]


** End of epoch, accumulated average loss = 0.348241 **

----------------------------------------
Epoch: 37
Run training...


527it [00:15, 34.00it/s]


** End of epoch, accumulated average loss = 0.347049 **

----------------------------------------
Epoch: 38
Run training...


527it [00:15, 33.29it/s]


** End of epoch, accumulated average loss = 0.345137 **

----------------------------------------
Epoch: 39
Run training...


527it [00:15, 34.67it/s]


** End of epoch, accumulated average loss = 0.342394 **

----------------------------------------
Epoch: 40
Run training...


527it [00:16, 32.85it/s]


** End of epoch, accumulated average loss = 0.341308 **
Saved checkpoint to bamblbi_fepoch_40

----------------------------------------
Epoch: 41
Run training...


527it [00:15, 34.00it/s]


** End of epoch, accumulated average loss = 0.339594 **

----------------------------------------
Epoch: 42
Run training...


527it [00:15, 33.17it/s]


** End of epoch, accumulated average loss = 0.337609 **

----------------------------------------
Epoch: 43
Run training...


527it [00:15, 34.63it/s]


** End of epoch, accumulated average loss = 0.336173 **

----------------------------------------
Epoch: 44
Run training...


527it [00:15, 33.44it/s]


** End of epoch, accumulated average loss = 0.333449 **

----------------------------------------
Epoch: 45
Run training...


527it [00:15, 33.65it/s]


** End of epoch, accumulated average loss = 0.333082 **

----------------------------------------
Epoch: 46
Run training...


527it [00:15, 33.34it/s]


** End of epoch, accumulated average loss = 0.331455 **

----------------------------------------
Epoch: 47
Run training...


527it [00:15, 34.06it/s]


** End of epoch, accumulated average loss = 0.330049 **

----------------------------------------
Epoch: 48
Run training...


527it [00:15, 33.40it/s]


** End of epoch, accumulated average loss = 0.327556 **

----------------------------------------
Epoch: 49
Run training...


527it [00:15, 34.60it/s]


** End of epoch, accumulated average loss = 0.325936 **

----------------------------------------
Epoch: 50
Run training...


527it [00:16, 32.82it/s]


** End of epoch, accumulated average loss = 0.325643 **

----------------------------------------
Epoch: 51
Run training...


527it [00:15, 34.60it/s]


** End of epoch, accumulated average loss = 0.324661 **

----------------------------------------
Epoch: 52
Run training...


527it [00:15, 33.23it/s]


** End of epoch, accumulated average loss = 0.322460 **

----------------------------------------
Epoch: 53
Run training...


527it [00:15, 34.21it/s]


** End of epoch, accumulated average loss = 0.321279 **

----------------------------------------
Epoch: 54
Run training...


527it [00:15, 33.65it/s]


** End of epoch, accumulated average loss = 0.320153 **

----------------------------------------
Epoch: 55
Run training...


527it [00:15, 34.17it/s]


** End of epoch, accumulated average loss = 0.318773 **

----------------------------------------
Epoch: 56
Run training...


527it [00:15, 33.86it/s]


** End of epoch, accumulated average loss = 0.317585 **

----------------------------------------
Epoch: 57
Run training...


527it [00:15, 33.70it/s]


** End of epoch, accumulated average loss = 0.317460 **

----------------------------------------
Epoch: 58
Run training...


527it [00:15, 33.48it/s]


** End of epoch, accumulated average loss = 0.315807 **

----------------------------------------
Epoch: 59
Run training...


527it [00:15, 34.21it/s]


** End of epoch, accumulated average loss = 0.314496 **

----------------------------------------
Epoch: 60
Run training...


527it [00:15, 33.57it/s]


** End of epoch, accumulated average loss = 0.313317 **
Saved checkpoint to bamblbi_fepoch_60

----------------------------------------
Epoch: 61
Run training...


527it [00:15, 33.91it/s]


** End of epoch, accumulated average loss = 0.313101 **

----------------------------------------
Epoch: 62
Run training...


527it [00:15, 33.43it/s]


** End of epoch, accumulated average loss = 0.312021 **

----------------------------------------
Epoch: 63
Run training...


527it [00:15, 33.98it/s]


** End of epoch, accumulated average loss = 0.311018 **

----------------------------------------
Epoch: 64
Run training...


527it [00:15, 33.53it/s]


** End of epoch, accumulated average loss = 0.308808 **

----------------------------------------
Epoch: 65
Run training...


527it [00:15, 34.07it/s]


** End of epoch, accumulated average loss = 0.309149 **

----------------------------------------
Epoch: 66
Run training...


527it [00:15, 33.11it/s]


** End of epoch, accumulated average loss = 0.307932 **

----------------------------------------
Epoch: 67
Run training...


527it [00:15, 33.84it/s]


** End of epoch, accumulated average loss = 0.306218 **

----------------------------------------
Epoch: 68
Run training...


527it [00:15, 33.47it/s]


** End of epoch, accumulated average loss = 0.305593 **

----------------------------------------
Epoch: 69
Run training...


527it [00:15, 34.00it/s]


** End of epoch, accumulated average loss = 0.304434 **

----------------------------------------
Epoch: 70
Run training...


527it [00:15, 34.09it/s]


** End of epoch, accumulated average loss = 0.303546 **

----------------------------------------
Epoch: 71
Run training...


527it [00:15, 33.29it/s]


** End of epoch, accumulated average loss = 0.302318 **

----------------------------------------
Epoch: 72
Run training...


527it [00:15, 33.61it/s]


** End of epoch, accumulated average loss = 0.302541 **

----------------------------------------
Epoch: 73
Run training...


527it [00:15, 33.68it/s]


** End of epoch, accumulated average loss = 0.300626 **

----------------------------------------
Epoch: 74
Run training...


527it [00:15, 34.20it/s]


** End of epoch, accumulated average loss = 0.299682 **

----------------------------------------
Epoch: 75
Run training...


527it [00:15, 33.89it/s]


** End of epoch, accumulated average loss = 0.299370 **

----------------------------------------
Epoch: 76
Run training...


527it [00:15, 33.18it/s]


** End of epoch, accumulated average loss = 0.298730 **

----------------------------------------
Epoch: 77
Run training...


527it [00:15, 33.75it/s]


** End of epoch, accumulated average loss = 0.297545 **

----------------------------------------
Epoch: 78
Run training...


527it [00:15, 33.82it/s]


** End of epoch, accumulated average loss = 0.297334 **

----------------------------------------
Epoch: 79
Run training...


527it [00:15, 33.58it/s]


** End of epoch, accumulated average loss = 0.296387 **

----------------------------------------
Epoch: 80
Run training...


527it [00:15, 34.14it/s]


** End of epoch, accumulated average loss = 0.295295 **
Saved checkpoint to bamblbi_fepoch_80

----------------------------------------
Epoch: 81
Run training...


527it [00:15, 33.27it/s]


** End of epoch, accumulated average loss = 0.295429 **

----------------------------------------
Epoch: 82
Run training...


527it [00:15, 34.09it/s]


** End of epoch, accumulated average loss = 0.294001 **

----------------------------------------
Epoch: 83
Run training...


527it [00:15, 33.33it/s]


** End of epoch, accumulated average loss = 0.293457 **

----------------------------------------
Epoch: 84
Run training...


527it [00:15, 33.29it/s]


** End of epoch, accumulated average loss = 0.292055 **

----------------------------------------
Epoch: 85
Run training...


527it [00:15, 33.72it/s]


** End of epoch, accumulated average loss = 0.292159 **

----------------------------------------
Epoch: 86
Run training...


527it [00:15, 33.82it/s]


** End of epoch, accumulated average loss = 0.291041 **

----------------------------------------
Epoch: 87
Run training...


527it [00:15, 33.58it/s]


** End of epoch, accumulated average loss = 0.290949 **

----------------------------------------
Epoch: 88
Run training...


527it [00:15, 33.60it/s]


** End of epoch, accumulated average loss = 0.290250 **

----------------------------------------
Epoch: 89
Run training...


527it [00:15, 33.76it/s]


** End of epoch, accumulated average loss = 0.289624 **

----------------------------------------
Epoch: 90
Run training...


527it [00:15, 33.83it/s]


** End of epoch, accumulated average loss = 0.288499 **

----------------------------------------
Epoch: 91
Run training...


527it [00:15, 33.92it/s]


** End of epoch, accumulated average loss = 0.288202 **

----------------------------------------
Epoch: 92
Run training...


527it [00:15, 33.13it/s]


** End of epoch, accumulated average loss = 0.287305 **

----------------------------------------
Epoch: 93
Run training...


527it [00:15, 34.09it/s]


** End of epoch, accumulated average loss = 0.286813 **

----------------------------------------
Epoch: 94
Run training...


527it [00:15, 33.56it/s]


** End of epoch, accumulated average loss = 0.286695 **

----------------------------------------
Epoch: 95
Run training...


527it [00:15, 33.83it/s]


** End of epoch, accumulated average loss = 0.286399 **

----------------------------------------
Epoch: 96
Run training...


527it [00:15, 33.88it/s]


** End of epoch, accumulated average loss = 0.285481 **

----------------------------------------
Epoch: 97
Run training...


527it [00:15, 33.30it/s]


** End of epoch, accumulated average loss = 0.284186 **

----------------------------------------
Epoch: 98
Run training...


527it [00:15, 33.70it/s]


** End of epoch, accumulated average loss = 0.284236 **

----------------------------------------
Epoch: 99
Run training...


527it [00:15, 33.52it/s]


** End of epoch, accumulated average loss = 0.284176 **

----------------------------------------
Epoch: 100
Run training...


527it [00:15, 34.04it/s]


** End of epoch, accumulated average loss = 0.284043 **
Saved checkpoint to bamblbi_fepoch_100
Using GPU device: cuda


100%|██████████| 132/132 [00:14<00:00,  9.02it/s]


Using GPU device: cuda


100%|██████████| 10/10 [00:01<00:00,  9.32it/s]
[I 2023-04-11 20:39:24,583] Trial 2 finished with value: 0.6439526231873054 and parameters: {'embedding': 0.15, 'attention': 0.15, 'residual': 0.15, 'relu': 0.2, 'lr_peak': 0.001}. Best is trial 2 with value: 0.6439526231873054.


Predictions saved to preds_translit_bl_f0.15_0.15_0.15_0.2_0.001.tsv
dev set accuracy@1: 0.64
dev_small set accuracy@1: 0.66
Using GPU device: cuda



0,1
loss_train,█▄▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

0,1
loss_train,0.28404




----------------------------------------
Epoch: 1
Run training...


527it [00:17, 30.86it/s]


** End of epoch, accumulated average loss = 4.140375 **

----------------------------------------
Epoch: 2
Run training...


527it [00:15, 34.19it/s]


** End of epoch, accumulated average loss = 2.905445 **

----------------------------------------
Epoch: 3
Run training...


527it [00:15, 33.44it/s]


** End of epoch, accumulated average loss = 1.777207 **

----------------------------------------
Epoch: 4
Run training...


527it [00:15, 34.26it/s]


** End of epoch, accumulated average loss = 1.051733 **

----------------------------------------
Epoch: 5
Run training...


527it [00:15, 32.97it/s]


** End of epoch, accumulated average loss = 0.846230 **

----------------------------------------
Epoch: 6
Run training...


527it [00:15, 34.36it/s]


** End of epoch, accumulated average loss = 0.704747 **

----------------------------------------
Epoch: 7
Run training...


527it [00:15, 33.16it/s]


** End of epoch, accumulated average loss = 0.625566 **

----------------------------------------
Epoch: 8
Run training...


527it [00:15, 34.42it/s]


** End of epoch, accumulated average loss = 0.577045 **

----------------------------------------
Epoch: 9
Run training...


527it [00:15, 33.53it/s]


** End of epoch, accumulated average loss = 0.541610 **

----------------------------------------
Epoch: 10
Run training...


527it [00:15, 33.86it/s]


** End of epoch, accumulated average loss = 0.515631 **

----------------------------------------
Epoch: 11
Run training...


527it [00:15, 33.25it/s]


** End of epoch, accumulated average loss = 0.485034 **

----------------------------------------
Epoch: 12
Run training...


527it [00:15, 34.62it/s]


** End of epoch, accumulated average loss = 0.463827 **

----------------------------------------
Epoch: 13
Run training...


527it [00:15, 33.05it/s]


** End of epoch, accumulated average loss = 0.444624 **

----------------------------------------
Epoch: 14
Run training...


527it [00:15, 34.37it/s]


** End of epoch, accumulated average loss = 0.429483 **

----------------------------------------
Epoch: 15
Run training...


527it [00:16, 32.86it/s]


** End of epoch, accumulated average loss = 0.416389 **

----------------------------------------
Epoch: 16
Run training...


527it [00:15, 34.61it/s]


** End of epoch, accumulated average loss = 0.407497 **

----------------------------------------
Epoch: 17
Run training...


527it [00:15, 33.60it/s]


** End of epoch, accumulated average loss = 0.399744 **

----------------------------------------
Epoch: 18
Run training...


527it [00:15, 33.88it/s]


** End of epoch, accumulated average loss = 0.392562 **

----------------------------------------
Epoch: 19
Run training...


527it [00:15, 33.28it/s]


** End of epoch, accumulated average loss = 0.384226 **

----------------------------------------
Epoch: 20
Run training...


527it [00:15, 33.53it/s]


** End of epoch, accumulated average loss = 0.379708 **
Saved checkpoint to bamblbi_fepoch_20

----------------------------------------
Epoch: 21
Run training...


527it [00:15, 33.05it/s]


** End of epoch, accumulated average loss = 0.373595 **

----------------------------------------
Epoch: 22
Run training...


527it [00:15, 34.34it/s]


** End of epoch, accumulated average loss = 0.370009 **

----------------------------------------
Epoch: 23
Run training...


527it [00:15, 33.08it/s]


** End of epoch, accumulated average loss = 0.365998 **

----------------------------------------
Epoch: 24
Run training...


527it [00:15, 34.63it/s]


** End of epoch, accumulated average loss = 0.361224 **

----------------------------------------
Epoch: 25
Run training...


527it [00:16, 32.70it/s]


** End of epoch, accumulated average loss = 0.356718 **

----------------------------------------
Epoch: 26
Run training...


527it [00:15, 34.31it/s]


** End of epoch, accumulated average loss = 0.353296 **

----------------------------------------
Epoch: 27
Run training...


527it [00:15, 33.30it/s]


** End of epoch, accumulated average loss = 0.351587 **

----------------------------------------
Epoch: 28
Run training...


527it [00:15, 34.32it/s]


** End of epoch, accumulated average loss = 0.348763 **

----------------------------------------
Epoch: 29
Run training...


527it [00:15, 33.32it/s]


** End of epoch, accumulated average loss = 0.346149 **

----------------------------------------
Epoch: 30
Run training...


527it [00:15, 34.05it/s]


** End of epoch, accumulated average loss = 0.343021 **

----------------------------------------
Epoch: 31
Run training...


527it [00:15, 33.06it/s]


** End of epoch, accumulated average loss = 0.340358 **

----------------------------------------
Epoch: 32
Run training...


527it [00:15, 34.49it/s]


** End of epoch, accumulated average loss = 0.337245 **

----------------------------------------
Epoch: 33
Run training...


527it [00:15, 33.35it/s]


** End of epoch, accumulated average loss = 0.335790 **

----------------------------------------
Epoch: 34
Run training...


527it [00:15, 34.56it/s]


** End of epoch, accumulated average loss = 0.332585 **

----------------------------------------
Epoch: 35
Run training...


527it [00:15, 33.35it/s]


** End of epoch, accumulated average loss = 0.331654 **

----------------------------------------
Epoch: 36
Run training...


527it [00:15, 34.14it/s]


** End of epoch, accumulated average loss = 0.329549 **

----------------------------------------
Epoch: 37
Run training...


527it [00:15, 33.20it/s]


** End of epoch, accumulated average loss = 0.327364 **

----------------------------------------
Epoch: 38
Run training...


527it [00:15, 34.47it/s]


** End of epoch, accumulated average loss = 0.325509 **

----------------------------------------
Epoch: 39
Run training...


527it [00:15, 33.33it/s]


** End of epoch, accumulated average loss = 0.325070 **

----------------------------------------
Epoch: 40
Run training...


527it [00:15, 34.43it/s]


** End of epoch, accumulated average loss = 0.321986 **
Saved checkpoint to bamblbi_fepoch_40

----------------------------------------
Epoch: 41
Run training...


527it [00:16, 32.62it/s]


** End of epoch, accumulated average loss = 0.320142 **

----------------------------------------
Epoch: 42
Run training...


527it [00:15, 34.57it/s]


** End of epoch, accumulated average loss = 0.319101 **

----------------------------------------
Epoch: 43
Run training...


527it [00:15, 33.60it/s]


** End of epoch, accumulated average loss = 0.316842 **

----------------------------------------
Epoch: 44
Run training...


527it [00:15, 34.06it/s]


** End of epoch, accumulated average loss = 0.316264 **

----------------------------------------
Epoch: 45
Run training...


527it [00:15, 33.50it/s]


** End of epoch, accumulated average loss = 0.314388 **

----------------------------------------
Epoch: 46
Run training...


527it [00:15, 34.00it/s]


** End of epoch, accumulated average loss = 0.313560 **

----------------------------------------
Epoch: 47
Run training...


527it [00:15, 33.60it/s]


** End of epoch, accumulated average loss = 0.311800 **

----------------------------------------
Epoch: 48
Run training...


527it [00:15, 34.65it/s]


** End of epoch, accumulated average loss = 0.310555 **

----------------------------------------
Epoch: 49
Run training...


527it [00:15, 33.86it/s]


** End of epoch, accumulated average loss = 0.308885 **

----------------------------------------
Epoch: 50
Run training...


527it [00:15, 34.78it/s]


** End of epoch, accumulated average loss = 0.307983 **

----------------------------------------
Epoch: 51
Run training...


527it [00:15, 33.90it/s]


** End of epoch, accumulated average loss = 0.306239 **

----------------------------------------
Epoch: 52
Run training...


527it [00:15, 34.50it/s]


** End of epoch, accumulated average loss = 0.306094 **

----------------------------------------
Epoch: 53
Run training...


527it [00:15, 34.49it/s]


** End of epoch, accumulated average loss = 0.304210 **

----------------------------------------
Epoch: 54
Run training...


527it [00:15, 34.14it/s]


** End of epoch, accumulated average loss = 0.302951 **

----------------------------------------
Epoch: 55
Run training...


527it [00:15, 34.61it/s]


** End of epoch, accumulated average loss = 0.302405 **

----------------------------------------
Epoch: 56
Run training...


527it [00:15, 34.24it/s]


** End of epoch, accumulated average loss = 0.300857 **

----------------------------------------
Epoch: 57
Run training...


527it [00:15, 33.76it/s]


** End of epoch, accumulated average loss = 0.300032 **

----------------------------------------
Epoch: 58
Run training...


527it [00:15, 34.71it/s]


** End of epoch, accumulated average loss = 0.298985 **

----------------------------------------
Epoch: 59
Run training...


527it [00:15, 34.03it/s]


** End of epoch, accumulated average loss = 0.297774 **

----------------------------------------
Epoch: 60
Run training...


527it [00:15, 34.34it/s]


** End of epoch, accumulated average loss = 0.297395 **
Saved checkpoint to bamblbi_fepoch_60

----------------------------------------
Epoch: 61
Run training...


527it [00:15, 33.93it/s]


** End of epoch, accumulated average loss = 0.295727 **

----------------------------------------
Epoch: 62
Run training...


527it [00:15, 33.97it/s]


** End of epoch, accumulated average loss = 0.294818 **

----------------------------------------
Epoch: 63
Run training...


244it [00:07, 34.89it/s]

## Label smoothing

We suggest implementing an additional regularization method: **label smoothing**. Suppose that at position $t$ in the token sequence the model outputs a vector of probabilities, one for each token id in the vocabulary. Cross-entropy compares this vector with the one-hot ground-truth representation

$$[0, \dots, 0, 1, 0, \dots, 0].$$

Now suppose we slightly "smooth" the values in the ground-truth vector and obtain

$$\left[\frac{\alpha}{|V|}, \dots, \frac{\alpha}{|V|}, (1-\alpha)+\frac{\alpha}{|V|}, \frac{\alpha}{|V|}, \dots, \frac{\alpha}{|V|}\right],$$

where $\alpha$ is a parameter between 0 and 1 and $|V|$ is the vocabulary size, i.e. the number of components in the ground-truth vector. The components of this new vector still sum to 1. We then compute the cross-entropy between the prediction vector and the new ground truth. This has two consequences. First, for $\alpha > 0$ the cross-entropy can never reach 0. Second, the loss still pushes the model to give the correct token the highest probability among all tokens in the vocabulary, but not too high: once that probability exceeds the smoothed target value and approaches 1, the loss increases again. For research on the use of label smoothing, see the [paper](https://arxiv.org/abs/1906.02629).
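
The behaviour described above can be checked numerically. The following is a minimal sketch in plain Python, not the assignment's implementation; the helper names (`smooth_targets`, `cross_entropy`, `peaked_prediction`) and the toy values $|V| = 5$, $\alpha = 0.1$ are assumptions chosen only for illustration. It builds the smoothed ground-truth vector from the formula above and evaluates the loss for predictions that put more and more probability on the correct token.

```python
import math


def smooth_targets(true_id, vocab_size, alpha):
    """Smoothed ground truth: alpha/|V| everywhere, plus (1 - alpha) on the correct token."""
    off_value = alpha / vocab_size
    dist = [off_value] * vocab_size
    dist[true_id] = (1.0 - alpha) + off_value
    return dist


def cross_entropy(target_dist, pred_probs):
    """Cross-entropy H(target, pred) = -sum_i target_i * log(pred_i)."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, pred_probs) if t > 0)


def peaked_prediction(true_id, vocab_size, p_true):
    """Toy prediction: p_true on the correct token, the rest spread evenly."""
    rest = (1.0 - p_true) / (vocab_size - 1)
    probs = [rest] * vocab_size
    probs[true_id] = p_true
    return probs


vocab_size, true_id, alpha = 5, 2, 0.1
target = smooth_targets(true_id, vocab_size, alpha)

# The loss is smallest when the prediction matches the smoothed target
# (p_true = 0.92 here) and grows again as p_true approaches 1.
for p_true in (0.5, 0.92, 0.999):
    loss = cross_entropy(target, peaked_prediction(true_id, vocab_size, p_true))
    print(f"p(correct) = {p_true:.3f} -> loss = {loss:.4f}")
```

With these toy values the smoothed target puts 0.92 on the correct token, so the middle prediction coincides with the target and gives the lowest of the three losses, which is still strictly positive.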
    
Accordingly, to build label smoothing into the model, you need to apply the transformation described above to the ground-truth vectors and implement the cross-entropy calculation yourself. The `torch.nn.CrossEntropyLoss` class used in this notebook is not quite suitable as it stands: its `__call__` method takes the id of the correct token as the ground truth and builds the one-hot vector internally. However, you can implement what is required based on the internal implementation of this class, [CrossEntropyLoss](https://pytorch.org/docs/stable/_modules/torch/nn/modules/loss.html#CrossEntropyLoss).
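
One possible way to carry out this calculation is sketched below. It is only an illustration under stated assumptions: the function name `smoothed_cross_entropy`, the `ignore_index` argument, and the tensor shapes (`logits` of shape `(N, |V|)`, `target_ids` of shape `(N,)`) are hypothetical and not taken from the notebook's code. The sketch builds the smoothed target distribution with the $\alpha/|V|$ formula and computes the cross-entropy from log-probabilities. As a sanity check it compares the result with the `label_smoothing` argument of `torch.nn.functional.cross_entropy`, which is available in PyTorch 1.10 and later.

```python
import torch
import torch.nn.functional as F


def smoothed_cross_entropy(logits, target_ids, alpha, ignore_index=None):
    """Mean cross-entropy against smoothed targets (illustrative sketch).

    logits: (N, V) unnormalised scores; target_ids: (N,) correct token ids.
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Smoothed target: alpha/|V| everywhere, (1 - alpha) + alpha/|V| on the correct token.
    true_dist = torch.full_like(log_probs, alpha / vocab_size)
    true_dist.scatter_(1, target_ids.unsqueeze(1), 1.0 - alpha + alpha / vocab_size)
    loss_per_token = -(true_dist * log_probs).sum(dim=-1)
    if ignore_index is not None:
        # Drop positions whose target is the ignored id (e.g. padding).
        loss_per_token = loss_per_token[target_ids != ignore_index]
    return loss_per_token.mean()


torch.manual_seed(0)
logits = torch.randn(4, 6)              # toy batch: 4 positions, vocabulary of 6
target_ids = torch.tensor([1, 0, 3, 5])
alpha = 0.1

manual = smoothed_cross_entropy(logits, target_ids, alpha)
# Built-in label smoothing (requires PyTorch >= 1.10), used here only as a cross-check.
builtin = F.cross_entropy(logits, target_ids, label_smoothing=alpha)
print(manual.item(), builtin.item())
```

With `alpha = 0` the function reduces to ordinary cross-entropy, which gives a second easy check when experimenting with different values of $\alpha$.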
    

Test different values of $\alpha$ (e.g., 0.05, 0.1, 0.2). Describe your experiments and results.


In [None]:
import torch
import torch.nn as nn


class LabelSmoothing(nn.Module):
    """Label smoothing: KL divergence between log-probabilities and smoothed targets."""

    def __init__(self, size, padding_idx, smoothing=0.0):
        super().__init__()
        # `size_average=False` is deprecated; `reduction="sum"` is the equivalent.
        self.criterion = nn.KLDivLoss(reduction="sum")
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size  # vocabulary size
        self.true_dist = None

    def forward(self, x, target):
        # x: (N, size) log-probabilities (e.g. log_softmax output); target: (N,) token ids.
        assert x.size(1) == self.size
        true_dist = x.detach().clone()
        # The smoothing mass is spread over all tokens except the target and padding
        # (hence size - 2), which differs slightly from the alpha / |V| formula above.
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        # Rows whose target is padding contribute nothing to the loss.
        true_dist[target == self.padding_idx] = 0.0
        self.true_dist = true_dist
        return self.criterion(x, true_dist)