<a href="https://colab.research.google.com/github/hissain/mlworks/blob/main/codes/Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

This notebook is adapted from the PyTorch sequence-to-sequence translation tutorial ([source](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/032d653a4f5a9c1ec32b9fc7c989ffe1/seq2seq_translation_tutorial.ipynb#scrollTo=cE7uNTa2SEpW)) and applies it to English-to-Bengali translation.

In [21]:
# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab
%matplotlib inline

Here, two recurrent neural networks work together to transform one sequence to another. An encoder network condenses an input sequence into a vector, and a decoder network unfolds that vector into a new sequence.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/seq2seq.png)

Here, each word in a language is represented as a one-hot
vector: a giant vector of zeros except for a single one (at the index
of the word). Compared to the dozens of characters that might exist in a
language, there are far more words, so the encoding vector is much
larger. We will, however, cheat a bit and trim the data to only use a few
thousand words per language.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/word-encoding.png)
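
As a minimal illustration of the one-hot idea, the sketch below builds the vector for one word of a toy five-word vocabulary. The vocabulary and the `one_hot` helper are made up for this example. The notebook itself stores only the word index and lets `nn.Embedding` do the lookup, which is equivalent to multiplying a one-hot vector by the embedding matrix.

```python
# Toy vocabulary (hypothetical, for illustration only).
vocab = ["SOS", "EOS", "Go", ".", "Run"]
word2index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word2index[word]] = 1
    return vector

print(word2index["Run"])  # 4
print(one_hot("Run"))     # [0, 0, 0, 0, 1]
```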


### Import

In [22]:
from io import open
import unicodedata
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

import numpy as np
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

import pandas as pd

In [23]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [24]:
device

device(type='cpu')

### Dataset creation

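Download the data from [here](https://drive.google.com/file/d/12eWXVWwr3XL96w-CuKlzPfuWIuW_PvGi/view?usp=sharing). The dataset is the Bengali-English sentence-pair file from [manythings.org](http://www.manythings.org/anki/ben-eng.zip). The `read` function below expects it at `../datasets/en-bn/en-bn.txt`.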

In [25]:
SOS_token = 0
EOS_token = 1

In [26]:
MAX_LENGTH = 10

hidden_size = 256
num_layers = 3
batch_size = 256

N_EPOCHS = 150

In [27]:
class Tokenizer:
    def __init__(self, name):
        self.name = name

        self.word2index = {"SOS":0, "EOS":1}
        self.index2word = {0: "SOS", 1: "EOS"}

        self.word2count = {}
        self.n_words = 2  # Count SOS and EOS

    def add_sentence(self, sentence):
        for word in sentence.split(' '):
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word

            self.word2count[word] = 1
            self.n_words += 1
        else:
            self.word2count[word] += 1

    def get_tokens(self, sentences):
        tokens = []
        for sentence in sentences:
            sentence_token = [EOS_token] * MAX_LENGTH
            for i, word in enumerate(sentence.split()):
                if i >= MAX_LENGTH:
                    break
                sentence_token[i] = self.word2index[word]
            tokens.append(torch.tensor(sentence_token))
        return torch.stack(tokens, dim=0)


In [28]:
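# Put a space before sentence-ending punctuation and trim surrounding whitespace
# ('!' and '?' match both patterns, so they end up preceded by two spaces)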
def preprocess_text(s):
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"([।!?])", r" \1", s)
    return s.strip()
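
To see what these two substitutions do in practice, the sketch below repeats `preprocess_text` as defined above and applies it to three short strings chosen for illustration. Because `!` and `?` appear in both character classes, they receive a space from each substitution, while `.` and `।` receive only one. The `preprocess_text_single` variant is a hypothetical alternative, not part of the notebook, that merges the two classes so every mark gets exactly one space.

```python
import re

def preprocess_text(s):
    # Same two substitutions as in the cell above.
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"([।!?])", r" \1", s)
    return s.strip()

def preprocess_text_single(s):
    # Hypothetical variant: one character class, so each mark is matched once.
    return re.sub(r"([.!?।])", r" \1", s).strip()

for text in ["Go.", "Run!", "যাও।"]:
    print(repr(preprocess_text(text)), repr(preprocess_text_single(text)))
# 'Go .' 'Go .'
# 'Run  !' 'Run !'
# 'যাও ।' 'যাও ।'
```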

In [29]:
def read(lang1, lang2):
    print("Reading lines...")

    # Read the file and split into lines
    with open(f'../datasets/en-bn/{lang1}-{lang2}.txt', encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[preprocess_text(s) for s in l.split('\t')[:2]] for l in lines]

    input_lang = Tokenizer(lang1)
    output_lang = Tokenizer(lang2)

    return input_lang, output_lang, pairs

In [30]:
def prepare_data(lang1, lang2):
    input_lang, output_lang, pairs = read(lang1, lang2)
    print("Read %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.add_sentence(pair[0])
        output_lang.add_sentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    dataset = pd.DataFrame(data=pairs, columns=['input', 'output'])
    # dataset['output'] = dataset['output'].apply(lambda x: "SOS " + x + " EOS")
    return input_lang, output_lang, dataset

input_lang, output_lang, pairs = prepare_data('en', 'bn')
pairs.head()

Reading lines...
Read 6509 sentence pairs
Counting words...
Counted words:
en 3165
bn 4471


| | input | output |
|---|---|---|
| 0 | Go . | যাও । |
| 1 | Go . | যান । |
| 2 | Go . | যা । |
| 3 | Run ! | পালাও ! |
| 4 | Run ! | পালান ! |


In [31]:
inputs = ['Go .', 'Go .', 'Go .', 'Run  !', 'Run  !', 'Who  ?', 'Wow  !', 'Fire  !']
inputs = input_lang.get_tokens(inputs)
print(inputs.size())
print(inputs)

torch.Size([8, 10])
tensor([[ 2,  3,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 2,  3,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 2,  3,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 4,  6,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 4,  6,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 7,  8,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 9,  6,  1,  1,  1,  1,  1,  1,  1,  1],
        [10,  6,  1,  1,  1,  1,  1,  1,  1,  1]])
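
The index values in this tensor can be traced by hand. The sketch below is a plain-Python stand-in for `Tokenizer` (the names `build_vocab` and `encode` are made up for this example) that registers the distinct sentences from the cell above in order and then pads each one to `MAX_LENGTH` with `EOS_token`. It assumes these are the first sentences added to the vocabulary, as the `pairs.head()` output suggests. Vocabulary building splits on a single space, so the double space before `!` registers an empty-string word at index 5, whereas encoding uses `split()` and never looks that entry up.

```python
SOS_token, EOS_token, MAX_LENGTH = 0, 1, 10

def build_vocab(sentences):
    """Assign indices in first-seen order, splitting on a single space."""
    word2index = {"SOS": SOS_token, "EOS": EOS_token}
    for sentence in sentences:
        for word in sentence.split(' '):
            word2index.setdefault(word, len(word2index))
    return word2index

def encode(sentence, word2index):
    """Map words to indices, truncate to MAX_LENGTH, pad with EOS_token."""
    ids = [word2index[w] for w in sentence.split()][:MAX_LENGTH]
    return ids + [EOS_token] * (MAX_LENGTH - len(ids))

sentences = ['Go .', 'Run  !', 'Who  ?', 'Wow  !', 'Fire  !']
word2index = build_vocab(sentences)
print(word2index[''])                # 5
print(encode('Run  !', word2index))  # [4, 6, 1, 1, 1, 1, 1, 1, 1, 1]
print(encode('Fire  !', word2index)) # [10, 6, 1, 1, 1, 1, 1, 1, 1, 1]
```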


### Model Building

In [32]:
# Define the Encoder
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)

    def forward(self, input_seqs, hidden):
        # input_seqs: (batch, seq_len) token indices
        embedded = self.embedding(input_seqs)  # (batch, seq_len, hidden_size)
        output, hidden = self.rnn(embedded, hidden)
        return output, hidden

    def init_hidden(self, batch_size):
        # Zero initial (h_0, c_0) for the LSTM, created on the same device as the model
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size, device=device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size, device=device))
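
A quick shape check can make the encoder's interface concrete. The sketch below rebuilds the same two layers (`nn.Embedding` followed by a batch-first `nn.LSTM`) with small, made-up sizes rather than the notebook's hyperparameters, and passes a random batch of token indices through them.

```python
import torch
import torch.nn as nn

# Small illustrative sizes (assumptions, not the notebook's settings).
vocab_size, hidden_size, num_layers = 20, 8, 3
batch_size, seq_len = 4, 10

embedding = nn.Embedding(vocab_size, hidden_size)
rnn = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
h0 = torch.zeros(num_layers, batch_size, hidden_size)
c0 = torch.zeros(num_layers, batch_size, hidden_size)

output, (h_n, c_n) = rnn(embedding(tokens), (h0, c0))
print(output.shape)  # torch.Size([4, 10, 8]): one vector per input position
print(h_n.shape)     # torch.Size([3, 4, 8]): final hidden state per layer
print(c_n.shape)     # torch.Size([3, 4, 8]): final cell state per layer
```

The `(h_n, c_n)` pair is what the decoder receives as its initial state. The hidden and cell states keep the `(num_layers, batch, hidden)` layout even when `batch_first=True`.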


In [33]:
class Decoder(nn.Module):
    def __init__(self, hidden_size, output_size, num_layers):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_token)
        decoder_hidden = encoder_hidden
        decoder_outputs = []

        for i in range(MAX_LENGTH):
            decoder_output, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
            decoder_outputs.append(decoder_output)

            if target_tensor is not None:
                # Teacher forcing: feed the ground-truth token as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1)
            else:
                # No teacher forcing: feed the model's own top prediction as the next input
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input

        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        return decoder_outputs, decoder_hidden

    def forward_step(self, input, hidden):
        output = F.relu(self.embedding(input))
        output, hidden = self.rnn(output, hidden)
        output = self.out(output)
        return output, hidden
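
The two branches in `forward` differ only in where the next decoder input comes from. The sketch below traces the inputs for a single sequence in plain Python. The `decoder_inputs` helper and the token values are hypothetical and no model is involved: the `predicted` list simply stands in for the argmax of each step's output.

```python
SOS_token = 0

def decoder_inputs(num_steps, target=None, predicted=None):
    """Return the token fed to the decoder at each step.

    With `target` given (teacher forcing), step i+1 receives target[i];
    otherwise it receives the stand-in prediction from step i.
    """
    source = target if target is not None else predicted
    inputs = [SOS_token]
    for i in range(num_steps - 1):
        inputs.append(source[i])
    return inputs

target = [7, 8, 1, 1]      # ground-truth token ids (made up)
predicted = [7, 5, 5, 1]   # stand-in for greedy predictions (made up)

print(decoder_inputs(4, target=target))        # [0, 7, 8, 1]
print(decoder_inputs(4, predicted=predicted))  # [0, 7, 5, 5]
```

With teacher forcing, a wrong prediction at one step does not contaminate the input of the next step, which is why it is used during training only.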

In [34]:
# Define the Seq2Seq Model
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seqs, target_seqs=None):
        batch_size = input_seqs.size(0)

        encoder_hidden = self.encoder.init_hidden(batch_size)
        encoder_output, encoder_hidden = self.encoder(input_seqs, encoder_hidden)

        decoder_output, decoder_hidden = self.decoder(
            encoder_output,
            encoder_hidden,
            target_seqs
        )

        return decoder_output


In [35]:
# Initialize the model
input_size = input_lang.n_words
output_size = output_lang.n_words

encoder = Encoder(input_size, hidden_size, num_layers)
decoder = Decoder(hidden_size, output_size, num_layers)

# Move model to device
model = Seq2Seq(encoder, decoder).to(device)

In [36]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

In [37]:

for epoch in range(N_EPOCHS):
    model.train()
    epoch_loss = 0
    for i in range(len(pairs) // batch_size + 1):
        inputs = pairs.iloc[(i)*batch_size:(i+1)*batch_size, 0]
        targets = pairs.iloc[(i)*batch_size:(i+1)*batch_size, 1]

        if len(inputs) == 0:
            break
        # Convert the sentence strings to padded index tensors on the model's device
        inputs = input_lang.get_tokens(inputs.tolist()).to(device)
        targets = output_lang.get_tokens(targets.tolist()).to(device)

        optimizer.zero_grad()
        # Passing targets enables teacher forcing in the decoder
        predictions = model(inputs, targets)

        target_one_hot = F.one_hot(targets, output_lang.n_words)

        loss = criterion(predictions.float().flatten(), target_one_hot.float().flatten())
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f'Epoch: {epoch+1:02} | Train Loss: {epoch_loss:.3f}')


Epoch: 01 | Train Loss: 806441.037
Epoch: 02 | Train Loss: 771490.146
Epoch: 03 | Train Loss: 762700.184
Epoch: 04 | Train Loss: 737697.350
Epoch: 05 | Train Loss: 736923.672
Epoch: 06 | Train Loss: 731633.666
Epoch: 07 | Train Loss: 724998.471
Epoch: 08 | Train Loss: 712320.208
Epoch: 09 | Train Loss: 704486.602
Epoch: 10 | Train Loss: 693964.934
Epoch: 11 | Train Loss: 684892.390
Epoch: 12 | Train Loss: 681904.144
Epoch: 13 | Train Loss: 674734.675
Epoch: 14 | Train Loss: 668384.838
Epoch: 15 | Train Loss: 663065.026
Epoch: 16 | Train Loss: 658551.104
Epoch: 17 | Train Loss: 653281.809
Epoch: 18 | Train Loss: 650563.725
Epoch: 19 | Train Loss: 647577.294
Epoch: 20 | Train Loss: 645539.924
Epoch: 21 | Train Loss: 642609.330
Epoch: 22 | Train Loss: 647308.371
Epoch: 23 | Train Loss: 643341.818
Epoch: 24 | Train Loss: 637528.523
Epoch: 25 | Train Loss: 631946.917
Epoch: 26 | Train Loss: 626064.383
Epoch: 27 | Train Loss: 620745.490
Epoch: 28 | Train Loss: 614416.796
Epoch: 29 | Train Lo
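
The loss values in this log are large partly because of how the loss is assembled, not only because of model error. The decoder already returns log-probabilities, and flattening both tensors makes `nn.CrossEntropyLoss` treat the whole batch as a single distribution. A more conventional formulation, sketched below under the assumption that a per-token average is wanted, reshapes the predictions to `(batch * length, vocab)` and applies `nn.NLLLoss` to the integer targets. The tensor sizes are made up, and random log-probabilities stand in for the decoder output. The sketch also checks numerically that the flattened loss equals the summed per-token loss plus a constant, so the two formulations differ in scale and offset but point the gradient in the same direction.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, length, vocab = 4, 3, 5   # illustrative sizes only

# Stand-ins for the decoder output (log-probabilities) and the target ids.
log_probs = F.log_softmax(torch.randn(batch, length, vocab), dim=-1)
targets = torch.randint(0, vocab, (batch, length))

# Conventional per-token loss: one row per position, integer class targets.
per_token = nn.NLLLoss()(log_probs.reshape(-1, vocab), targets.reshape(-1))

# The flattened formulation used in the training loop above.
one_hot = F.one_hot(targets, vocab).float()
flattened = nn.CrossEntropyLoss()(log_probs.flatten(), one_hot.flatten())

# flattened == N * per_token + N * log(N), with N = batch * length positions.
n = batch * length
expected = n * per_token + n * math.log(n)
print(torch.allclose(flattened, expected, atol=1e-3))  # True
```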

In [42]:
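def evaluate(model, sentence, input_lang, output_lang):
    model.eval()
    with torch.no_grad():
        # Encode the sentence; every word must already be in input_lang's vocabulary
        inputs = input_lang.get_tokens([sentence]).to(device)
        # No targets are passed, so the decoder feeds back its own predictions
        predictions = model(inputs)
        predictions = torch.argmax(predictions, dim=2)[0]

        # Map indices back to words, stopping at the first EOS
        output = ""
        for prediction in predictions:
            word = output_lang.index2word[prediction.item()]
            if word == "EOS":
                break
            output += " " + word
        return output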


evaluate(model, "help .", input_lang, output_lang)

' কেউ কাজ শুরু করে ।'
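
The decoding rule inside `evaluate` is greedy: take the highest-scoring word at each step and stop at the first `EOS`. The sketch below isolates that rule in plain Python. The `greedy_decode` helper, the score table, and the four-word vocabulary are all made up for illustration.

```python
def greedy_decode(step_scores, index2word, eos_word="EOS"):
    """Pick the argmax word at each step, stopping at the first EOS."""
    words = []
    for scores in step_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        word = index2word[best]
        if word == eos_word:
            break
        words.append(word)
    return " ".join(words)

index2word = {0: "SOS", 1: "EOS", 2: "hello", 3: "world"}
step_scores = [
    [0.1, 0.2, 0.6, 0.1],  # argmax -> "hello"
    [0.1, 0.1, 0.2, 0.6],  # argmax -> "world"
    [0.0, 0.9, 0.1, 0.0],  # argmax -> "EOS", decoding stops here
    [0.0, 0.0, 1.0, 0.0],  # never reached
]
print(greedy_decode(step_scores, index2word))  # hello world
```

Unlike `evaluate`, this helper joins the words with `" ".join`, so the result has no leading space.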

In [43]:
evaluate(model, "what do you do .", input_lang, input_lang)

" full guys She's She's She's midnight ."

In [40]:
evaluate(model, "কেমন আছেন ?", output_lang, output_lang)

' তুমি কি ?'

In [80]:
print(evaluate(model, "are you kidding .", input_lang, output_lang))
print(evaluate(model, "Queen is nobody .", input_lang, output_lang))

 তুমি বলেন কথা করো ।
 ওটা গুরুত্বপূর্ণ ভুল ।
