## Download and explore data

In [1]:
!./download_opus_100_ru_en.sh

mkdir: data: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 64.6M  100 64.6M    0     0  8463k      0  0:00:07  0:00:07 --:--:-- 11.7M000:03  0:00:13 3950k0:14  0:00:04  0:00:10 4512k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  115M  100  115M    0     0  12.8M      0  0:00:09  0:00:09 --:--:-- 13.7M05  0:00:03 14.3M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  169k  100  169k    0     0   301k      0 --:--:-- --:--:-- --:--:--  181k:-- --:--:-- --:--:--  307k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  298k  10

In [2]:
!ls data

test_en.txt  test_ru.txt  train_en.txt train_ru.txt


In [4]:
!head data/test_en.txt

If you only stay there.
I don't know how you do it, Pop, carrying these boxes around every day.
We might have a slight edge in mediation.
How long is it going to take you to get him what he needs?
On 1 April President of the Nagorno Karabagh Republic Bako Sahakyan met head of the General Staff of the Republic of Armenia's Armed forces colonel-general Yuri Khachaturov.
Mr Priesner also noted that the E-justice management system has not only improved case management, but has also led to a significant streamlining in procedures.
You don't like chicken noodle soup?
Posted: 14 May 2005, 20:31
Now, for a minute, I thought maybe he was being tailed.
« : 26 Октябрь 2017, 06:50:24 »


In [5]:
!head data/test_ru.txt

Только бы не вылететь.
И как ты только справляешься, папа, таская эти коробки взад-вперед целый день.
Возможно, у нас есть небольшое преимущество в переговорах.
Сколько времени вы будете делать то, что ему нужно?
1 апреля Президент НКР Бако Саакян принял начальника Генштаба Вооруженных сил Республики Армения генерал-полковника Юрия Хачатурова.
Г-н Приснер также упомянул, что система электронного правосудия не только позволила улучшить процесс ведения дел, но также способствует значительному упорядочению процедур.
- Неплохо, да.
Posted: 15 Dec 2006, 00:07
И на минутку я подумал, что за ним могут следить.
«: 11 Октябрь 2011, 17:15:34»


## Define dataloader

In [146]:
from collections import defaultdict
from enum import IntEnum
import re
from tqdm import tqdm


class SpecToken(IntEnum):
    START = 0
    STOP = 1
    UNK = 2
    PAD = 3


class Tokenizer:
    def __init__(self, preprocessor=None, special_tokens=SpecToken):
        self.preprocessor = preprocessor if preprocessor else self._default_preprocessor
        self.tokens = special_tokens
        self._word2id = None
        self._id2work = None

    def encode(self, word):
        if word in self._word2id:
            return self._word2id[word]
        else:
            return int(self.tokens.UNK)

    def decode(self, token):
        token = int(token)
        if token in self._id2word:
            return self._id2word[token]
        else:
            return str(self.tokens.UNK)

    def encode_line(self, line):
        return [self.encode(x) for x in self.preprocessor(line).split(" ")]

    def decode_line(self, tokens):
        return " ".join([self.decode(x) for x in tokens])

    def fit(self, data, max_words=10000, verbose=True):
        word2cnt = defaultdict(lambda: 0)
        word_list = self._extract_words(data)
        with tqdm(total=len(word_list)) as pbar:
            for idx, word in enumerate(word_list):
                if len(word) == 0:
                    continue
                word2cnt[word] += 1
                if idx % 10000 == 0:
                    pbar.set_description(f'Processed {idx + 1}/{len(word_list)}')
                pbar.update(1)
        loaded_cnt = len(word2cnt)
        if verbose:
            print(f'Loaded {loaded_cnt} unique words from {len(word_list)} corpus'
                  f'({100. * loaded_cnt / len(word_list)} %)')
        words_limit = max_words - len(self.tokens)
        popular_words = sorted(
            word2cnt.items(), reverse=True, key=lambda x: x[1])[:words_limit]
        if verbose:
            print(f'20 most popular words:\n{popular_words[:20]}')

        self._word2id = {item.name: item.value for item in self.tokens}
        for word, _ in popular_words:
            self._word2id[word] = len(self._word2id)
        assert len(self._word2id) <= max_words
        self._id2word = {v: k for k, v in self._word2id.items()}

    def _extract_words(self, data):
        words = []
        if isinstance(data, list):
            for line in data:
                assert isinstance(line, str)
                words.extend([self.preprocessor(x) for x in line.split(' ')])
        else:
            assert isinstance(data, str)
            words.extend([self.preprocessor(x) for x in data.split(' ')])
        return words

    @staticmethod
    def _default_preprocessor(word):
        return re.sub('[^A-Za-zА-Яа-я\s]+', '', word).lower().strip()

    @property
    def start_token(self):
        return int(self.tokens.START)

    @property
    def stop_token(self):
        return int(self.tokens.STOP)

    @property
    def pad_token(self):
        return int(self.tokens.PAD)

    @property
    def unk_token(self):
        return int(self.tokens.UNK)

In [147]:
import numpy as np
import os
import re
import torch
from torch.utils.data import Dataset, DataLoader
import unicodedata


class OpusTranslationDataset(Dataset):
    def __init__(
        self,
        en_data_path,
        ru_data_path,
        en_tokenizer,
        ru_tokenizer,
        fit_tokenizer=False,
        max_words=40
    ):
        self.en_lang = en_tokenizer
        self.ru_lang = ru_tokenizer
        
        with open(en_data_path, 'rt', encoding='utf-8') as f:
            self._inputs = f.readlines()
        with open(ru_data_path, 'rt', encoding='utf-8') as f:
            self._targets = f.readlines()
       
        self._max_words = max_words
        self._filter_samples()
        assert len(self._inputs) == len(self._targets)
        
        if fit_tokenizer:
            self.en_lang.fit(self._inputs)
            self.ru_lang.fit(self._targets)   

    def _filter_samples(self):
        valid_indices = []
        
        def valid(sample):
            return 10 < len(sample.split(' ')) <= self._max_words - 2

        for i in range(len(self._inputs)):
            if valid(self._inputs[i]) and valid(self._targets[i]):
                valid_indices.append(i)
        
        self._inputs = [self._inputs[x] for x in valid_indices]
        self._targets = [self._targets[x] for x in valid_indices]

    def __len__(self):
        return len(self._inputs)
    
    @staticmethod
    def _pad(seq, tokenizer, size, prepadding=False):
        pad_length = size - len(seq)
        if pad_length == 0:
            return seq
        if prepadding:
            return pad_length * [tokenizer.pad_token] + seq
        else:
            return seq + pad_length * [tokenizer.pad_token]

    def __getitem__(self, index):       
        in_sentence = self._inputs[index]
        encoder_input = self.en_lang.encode_line(in_sentence)[:self._max_words]
        encoder_input = OpusTranslationDataset._pad(encoder_input, self.en_lang, self._max_words, prepadding=True)

        target_sentence = self._targets[index]
        decoder_output = self.ru_lang.encode_line(target_sentence)[:self._max_words] + [self.ru_lang.stop_token]
        decoder_input = [self.ru_lang.start_token] + decoder_output
        decoder_output = OpusTranslationDataset._pad(decoder_output, self.ru_lang, self._max_words)
        decoder_input = OpusTranslationDataset._pad(decoder_input, self.ru_lang, self._max_words)
        
        return {
            'encoder_input': np.asarray(encoder_input),
            'decoder_input': np.asarray(decoder_input),
            'decoder_output': np.asarray(decoder_output),
        }

In [148]:
DATA_ROOT = 'data'
TRAIN_RU = os.path.join(DATA_ROOT, 'train_ru.txt')
TRAIN_EN = os.path.join(DATA_ROOT, 'train_en.txt')
VAL_RU = os.path.join(DATA_ROOT, 'test_ru.txt')
VAL_EN = os.path.join(DATA_ROOT, 'test_en.txt')
TRAIN_BATCH_SIZE = 4

en_tokenizer = Tokenizer()
ru_tokenizer = Tokenizer()

train_dataset = OpusTranslationDataset(
    en_data_path=TRAIN_EN,
    ru_data_path=TRAIN_RU,
    en_tokenizer=en_tokenizer,
    ru_tokenizer=ru_tokenizer,
    fit_tokenizer=True,
    max_words=40
)
train_dataloader = DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE, shuffle=True)

val_dataset = OpusTranslationDataset(
    en_data_path=VAL_EN,
    ru_data_path=VAL_RU,
    en_tokenizer=en_tokenizer,
    ru_tokenizer=ru_tokenizer,
    max_words=40
)
val_dataloader = DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE, shuffle=True)

Processed 4480001/4485356:  98%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉   | 4374378/4485356 [00:03<00:00, 1344229.85it/s]


Loaded 105950 unique words from 4485356 corpus(2.3621313447583647 %)
20 most popular words:
[('the', 319859), ('of', 175422), ('and', 144455), ('to', 130756), ('in', 98446), ('a', 74961), ('for', 48953), ('that', 48535), ('on', 40347), ('is', 39394), ('you', 38259), ('i', 33471), ('with', 31357), ('be', 28517), ('it', 28059), ('as', 24946), ('by', 23479), ('was', 22478), ('this', 21546), ('are', 19783)]


Processed 3950001/3951889:  98%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉   | 3855738/3951889 [00:03<00:00, 1207533.12it/s]


Loaded 231795 unique words from 3951889 corpus(5.8654228395585 %)
20 most popular words:
[('в', 151362), ('и', 139293), ('на', 56561), ('что', 47700), ('с', 45879), ('не', 42111), ('по', 35579), ('я', 28110), ('для', 25541), ('о', 20299), ('к', 19514), ('как', 18566), ('это', 16668), ('мы', 14845), ('а', 14565), ('из', 14542), ('он', 13587), ('за', 13556), ('от', 13487), ('его', 13121)]


In [149]:
item = train_dataset[1200]
print(train_dataset.en_lang.decode_line(item['encoder_input']))
print(train_dataset.ru_lang.decode_line(item['decoder_input']))
print(train_dataset.ru_lang.decode_line(item['decoder_output']))

PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD encouraging and facilitating networking among educational and training institutions in developed countries with those in developing countries particularly ldcs to enhance voluntary services in UNK
START необходимо поощрять и стимулировать развитие UNK связей между UNK и UNK UNK UNK в развитых и развивающихся странах особенно нрс в интересах расширения UNK участия в программах UNK STOP PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD
необходимо поощрять и стимулировать развитие UNK связей между UNK и UNK UNK UNK в развитых и развивающихся странах особенно нрс в интересах расширения UNK участия в программах UNK STOP PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD


https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

## Define the model

In [150]:
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

In [151]:
class Encoder(nn.Module):
    def __init__(self, num_embeddings=10000, hidden_size=256, embedding_dim=128):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(
            num_embeddings=num_embeddings,
            embedding_dim=embedding_dim,
            max_norm=True
        )
        self._enc_lstm_0 = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_size,
            batch_first=True,
            dropout=0.,
        )
        self._enc_lstm_1 = nn.LSTM(
            input_size=hidden_size,
            hidden_size=hidden_size,
            batch_first=True,
            dropout=0.,
        )

    def forward(self, tokens):
        embedded = self.embedding(tokens)
        enc0_out, (enc0_h, enc0_c) = self._enc_lstm_0(embedded)
        enc1_out, (enc1_h, enc1_c) = self._enc_lstm_1(enc0_out)
        return {
            'enc0_h': enc0_h,
            'enc0_c': enc0_c,
            'enc1_h': enc1_h,
            'enc1_c': enc1_c,
        }

In [152]:
enc = Encoder()

In [153]:
sample = torch.Tensor(item['encoder_input']).int().unsqueeze(0)
sample.shape

torch.Size([1, 40])

In [154]:
for k, v in enc(sample).items():
    print(f"{k}\t{v.shape}")

enc0_h	torch.Size([1, 1, 256])
enc0_c	torch.Size([1, 1, 256])
enc1_h	torch.Size([1, 1, 256])
enc1_c	torch.Size([1, 1, 256])


In [155]:
class Decoder(nn.Module):
    def __init__(self, num_embeddings=10000, hidden_size=256, embedding_dim=128):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(
            num_embeddings=num_embeddings,
            embedding_dim=embedding_dim,
            max_norm=True
        )
        self._dec_lstm_0 = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_size,
            batch_first=True,
            dropout=0.,
        )
        self._dec_lstm_1 = nn.LSTM(
            input_size=hidden_size,
            hidden_size=hidden_size,
            batch_first=True,
            dropout=0.,
        )
        self._dec_dense = nn.Linear(
            in_features=hidden_size,
            out_features=num_embeddings,
        )
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, tokens, context):
        dec0_in_h = context['enc0_h']
        dec0_in_c = context['enc0_c']
        dec1_in_h = context['enc1_h']
        dec1_in_c = context['enc1_c']
        
        embedded = self.embedding(tokens)
        dec0_out, (dec0_h, dec0_c) = self._dec_lstm_0(embedded, (dec0_in_h, dec0_in_c))
        dec1_out, (dec1_h, dec1_c) = self._dec_lstm_1(dec0_out, (dec1_in_h, dec1_in_c))
        
        logits = self._dec_dense(dec1_out)
        scores = self.softmax(logits)
        
        return scores

    def infer_token(self, token, context):
        dec0_in_h = context['enc0_h']
        dec0_in_c = context['enc0_c']
        dec1_in_h = context['enc1_h']
        dec1_in_c = context['enc1_c']
        
        assert tuple(token.shape) == (1, 1)
        
        embedding = self.embedding(token)
        dec0_out, (dec0_h, dec0_c) = self._dec_lstm_0(embedding, (dec0_in_h, dec0_in_c))
        dec1_out, (dec1_h, dec1_c) = self._dec_lstm_1(dec0_out, (dec1_in_h, dec1_in_c))
        
        logits = self._dec_dense(dec1_out)
        scores = self.softmax(logits)
        
        return scores, {
            'enc0_h': dec0_h,
            'enc0_c': dec0_c,
            'enc1_h': dec1_h,
            'enc1_c': dec1_c,
        }

In [156]:
class Translator(nn.Module):
    def __init__(self, max_length=40, start_token=SpecToken.START, stop_token=SpecToken.STOP):
        super(Translator, self).__init__()
        self._encoder = Encoder()
        self._decoder = Decoder()
        self._loss = nn.CrossEntropyLoss()
        self._start_token = torch.Tensor([int(start_token)]).long()
        self._stop_token = torch.Tensor([int(stop_token)]).long()
        self._max_length = max_length

    def forward(self, tokens, dec_input=None, dec_target=None):
        context = self._encoder(tokens)
        if self.training:
            scores = self._decoder(dec_input, context)
            b, n, c = scores.shape
            scores_reshaped = torch.reshape(scores, [b * n, -1])
            targets_reshaped = torch.reshape(dec_target, [b * n]).long()           
            loss = self._loss(
                scores_reshaped,
                targets_reshaped,
            )
            return loss
        else:
            b, n = tokens.shape
            assert b == 1, f"Inference is now available only for batch size 1"
            context = self._encoder(tokens)
                       
            res = []
            cur_token = self._start_token.unsqueeze(0)
            scores, context = self._decoder.infer_token(cur_token, context)
            res.append(torch.argmax(scores[0, 0]))
            
            
            for token_idx in range(1, self._max_length):
                cur_token = res[-1]
                if cur_token == self._stop_token:
                    break
                cur_token = cur_token.unsqueeze(0).unsqueeze(0)
                scores, context = self._decoder.infer_token(cur_token, context)
                res.append(torch.argmax(scores[0, 0]))
                
            return res
            

In [157]:
model = Translator()
model.train()

Translator(
  (_encoder): Encoder(
    (embedding): Embedding(10000, 128, max_norm=True)
    (_enc_lstm_0): LSTM(128, 256, batch_first=True)
    (_enc_lstm_1): LSTM(256, 256, batch_first=True)
  )
  (_decoder): Decoder(
    (embedding): Embedding(10000, 128, max_norm=True)
    (_dec_lstm_0): LSTM(128, 256, batch_first=True)
    (_dec_lstm_1): LSTM(256, 256, batch_first=True)
    (_dec_dense): Linear(in_features=256, out_features=10000, bias=True)
    (softmax): LogSoftmax(dim=1)
  )
  (_loss): CrossEntropyLoss()
)

In [158]:
sample = torch.Tensor(item['encoder_input']).int().unsqueeze(0)
sample_dec = torch.Tensor(item['decoder_input']).int().unsqueeze(0)
target = torch.Tensor(item['decoder_output']).int().unsqueeze(0)

loss = model(sample, sample_dec, target)
loss

tensor(9.2099, grad_fn=<NllLossBackward0>)

In [159]:
model.eval()
r = model(sample)

In [160]:
print(train_dataset.ru_lang.decode_line(r))

START START START START START START START START START START START START START START START START START START START START START START START START START START START START START START START START START START START START START START START START


## Train loop draft

In [169]:
from torch.optim import RMSprop


model.train()

train_iter = iter(train_dataloader)
optimizer = RMSprop(model.parameters(), lr=0.01)

for step in range(1000):
    optimizer.zero_grad()
    
    sample = next(train_iter)
    in_tokens = torch.Tensor(sample['encoder_input']).long()
    dec_inputs = torch.Tensor(sample['decoder_input']).long()
    dec_targets = torch.Tensor(sample['decoder_output']).long()
    
    
    loss = model(
        tokens=in_tokens,
        dec_input=dec_inputs,
        dec_target=dec_targets,
    )
    
    loss.backward()
    optimizer.step()
    
    if step % 100 == 0:
        print(f'Step {step} Loss: {loss}')
        print(f'Tokens shape: {in_tokens.shape}')

Step 0 Loss: 9.346098899841309
Tokens shape: torch.Size([4, 40])
Step 100 Loss: 2.2243309020996094
Tokens shape: torch.Size([4, 40])
Step 200 Loss: 2.5397326946258545
Tokens shape: torch.Size([4, 40])
Step 300 Loss: 4.161799907684326
Tokens shape: torch.Size([4, 40])
Step 400 Loss: 3.258639097213745
Tokens shape: torch.Size([4, 40])
Step 500 Loss: 2.9066083431243896
Tokens shape: torch.Size([4, 40])
Step 600 Loss: 2.0057032108306885
Tokens shape: torch.Size([4, 40])
Step 700 Loss: 3.516080379486084
Tokens shape: torch.Size([4, 40])
Step 800 Loss: 2.524286985397339
Tokens shape: torch.Size([4, 40])
Step 900 Loss: 2.571601390838623
Tokens shape: torch.Size([4, 40])


In [171]:
sample = next(iter(val_dataloader))
in_tokens = torch.Tensor(sample['encoder_input']).long()
dec_inputs = torch.Tensor(sample['decoder_input']).long()
dec_targets = torch.Tensor(sample['decoder_output']).long()

In [174]:
sample = in_tokens[0]
target = dec_targets[0]
print(train_dataset.en_lang.decode_line(sample))
print(train_dataset.ru_lang.decode_line(target))

PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD our booking system is secure search a hotel in finland compare rates and make your booking
вы UNK подтверждение UNK UNK гостиницы по UNK наша система UNK и UNK выберите отель в UNK UNK UNK цены и UNK ваш заказ STOP PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD


In [175]:
model.eval()
prediction = model(torch.Tensor(sample).unsqueeze(0))

In [176]:
prediction

[tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0),
 tensor(0)]