# I. What We Will Do
Here, I want to try building my own French translator. I am obviously not a French speaker, but I want to learn the language, and simply using Google Translate feels too ordinary these days. So I decided to build a translator of my own. Will it be comparable to Google Translate? Only time will tell.

# II. How We Will Do It

In general, I will use an architecture called the Transformer. It is a state-of-the-art technique in Natural Language Processing right now, and Google also uses it for its translator. The rest of the process is like any other Machine Learning project: obtain the data, clean it, pick an architecture, and train it. Then look at how well it performed, and if it is not good enough, fix the issues found along the way or start over from the beginning. Simple, isn't it?

Anyway, let's get to the point. How are we going to do that? These are the steps:

1. Insert the Data
2. Building Tokenizers and Vocabularies
3. Making Dataloaders
4. Building Transformer
5. Training and Evaluation
6. Translating

# III. Build 'Em
## III.1. Insert the Data
These are the steps to get the data into this notebook:
1. Load the data, which is French text with its English translation. I use *pandas* as always, with *dask* to read the large CSV.
2. Take a subset of the data, or use all of it if you know you have the required computational power.
3. Wrap it in a Dataset so PyTorch can feed it to a DataLoader later on.

In [None]:
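!python -m spacy download fr_core_news_sm
# The English model is loaded later as well; download it too in case it is not already installed.
!python -m spacy download en_core_web_sm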

In [None]:
import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import dask.dataframe as dd
from torch.utils.data import Dataset, DataLoader
import spacy
import torch.nn as nn
from torchtext.vocab import vocab, FastText
import torch

In [None]:
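# Read the CSV lazily with dask, then keep the first 100,000 rows as a pandas DataFrame.
# The result is named df so that it does not shadow the dask module imported as dd
# (re-running the cell would otherwise fail).
dd_ = dd.read_csv('/kaggle/input/en-fr-translation-dataset/en-fr.csv')
df = pd.DataFrame(dd_.head(100000))

class words_dataset(Dataset):
    """Thin Dataset wrapper around a single text column."""
    def __init__(self, text):
        self.text = text
    def __len__(self):
        return len(self.text)
    def __getitem__(self, idx):
        return self.text[idx]
    
en_dataset = words_dataset(df['en'])
fr_dataset = words_dataset(df['fr'])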

## III.2. Building Tokenizers and Vocabularies
This is what we will do:
1. Make the tokenizers for each language.
2. Make vocabularies for each language.
3. Insert special tokens into the vocabularies.
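
To make the idea concrete before the real cells, here is a minimal sketch of a vocabulary with special tokens. It is a toy in plain Python, not the torchtext objects used below: `build_toy_vocab` and `encode` are hypothetical helpers, and splitting on whitespace stands in for the spaCy tokenizers.

```python
# Toy vocabulary sketch: special tokens take indices 0-3, unknown words map to <unk>.
SPECIAL_SYMBOLS = ['<unk>', '<pad>', '<bos>', '<eos>']

def build_toy_vocab(sentences):
    """Give every distinct whitespace token an integer index after the special tokens."""
    stoi = {tok: i for i, tok in enumerate(SPECIAL_SYMBOLS)}
    for sentence in sentences:
        for tok in sentence.split():
            stoi.setdefault(tok, len(stoi))
    return stoi

def encode(stoi, sentence):
    """Look up each token, falling back to the <unk> index for unseen words."""
    unk = stoi['<unk>']
    return [stoi.get(tok, unk) for tok in sentence.split()]

toy_vocab = build_toy_vocab(['le chat dort', 'le chien mange'])
ids = encode(toy_vocab, 'le chat mange vite')  # 'vite' was never seen, so it becomes <unk>
```

The same three ingredients appear in the real cells: a tokenizer, a token-to-index mapping, and a default index for anything outside the vocabulary.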

In [None]:
en_spacy = spacy.load('en_core_web_sm')
fr_spacy = spacy.load('fr_core_news_sm')

en_tokenizer = lambda x: [y.text for y in en_spacy(str(x))]
fr_tokenizer = lambda x: [y.text for y in fr_spacy(str(x))]

en_fasttext = FastText(language='en')
fr_fasttext = FastText(language='fr')

In [None]:
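# stoi maps each token to its index, and vocab() treats those values as frequencies.
# min_freq=0 keeps the token whose index is 0, which the default min_freq=1 would drop.
en_vocab = vocab(en_fasttext.stoi, min_freq=0)
fr_vocab = vocab(fr_fasttext.stoi, min_freq=0)

special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

# Put the special tokens at indices 0-3 of both vocabularies.
for i, s in enumerate(special_symbols):
    en_vocab.insert_token(s, i)
    fr_vocab.insert_token(s, i)
    
# Any token that is not in the vocabulary maps to <unk>.
en_vocab.set_default_index(en_vocab['<unk>'])
fr_vocab.set_default_index(fr_vocab['<unk>'])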

## III.3. Making Dataloaders
This is what we will do:
1. Split the data for training and evaluation
2. Write the *collate_fn* functions
3. Make the dataloaders
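
As a rough picture of what a *collate_fn* has to do, here is a sketch in plain Python lists. `toy_collate` is a hypothetical helper, not the function used below; it assumes the same special indices (pad 1, bos 2, eos 3) and returns the batch in (sequence length, batch size) layout, which is what `pad_sequence` produces by default.

```python
PAD_IDX, BOS_IDX, EOS_IDX = 1, 2, 3

def toy_collate(batch_of_ids):
    """Wrap each sequence in <bos>/<eos>, pad to the longest one, return time-major rows."""
    wrapped = [[BOS_IDX] + ids + [EOS_IDX] for ids in batch_of_ids]
    max_len = max(len(seq) for seq in wrapped)
    padded = [seq + [PAD_IDX] * (max_len - len(seq)) for seq in wrapped]
    # Transpose from (batch, seq_len) to (seq_len, batch).
    return [list(step) for step in zip(*padded)]

batch = toy_collate([[4, 5, 6], [7]])
# Each inner list is one time step across the batch of two sentences.
```

Padding is needed because sentences in one batch rarely have the same length, and the padding index is later ignored by both the attention masks and the loss.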

In [None]:
from torch.nn.utils.rnn import pad_sequence

pad_idx = 1
bos_idx = 2
eos_idx = 3

def make_collate_fn(vocabulary, tokenizer):
    """Build a collate_fn that tokenizes, indexes, wraps in <bos>/<eos>, and pads a batch."""
    def collate_batch(batch):
        batch_process = []
        for text in batch:
            token_ids = vocabulary(tokenizer(text))
            # dtype is set explicitly so an empty sentence does not produce a float tensor
            batch_process.append(torch.tensor([bos_idx] + token_ids + [eos_idx], dtype=torch.long))
        # Output shape: (longest sequence in the batch, batch size)
        return pad_sequence(batch_process, padding_value=pad_idx)
    return collate_batch

collate_batch_en = make_collate_fn(en_vocab, en_tokenizer)
collate_batch_fr = make_collate_fn(fr_vocab, fr_tokenizer)

In [None]:
from torch.utils.data.dataset import random_split

batch_size = 128
num_train = int(len(en_dataset) * 0.95)
num_valid = len(en_dataset) - num_train

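# Split both languages with the same seed so that item i of en_train is still the
# translation of item i of fr_train. Two unseeded calls would shuffle differently.
split_seed = 42
en_train, en_valid = random_split(en_dataset, [num_train, num_valid],
                                  generator=torch.Generator().manual_seed(split_seed))
fr_train, fr_valid = random_split(fr_dataset, [num_train, num_valid],
                                  generator=torch.Generator().manual_seed(split_seed))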

train_dataloader = {x: DataLoader(z, batch_size=batch_size, collate_fn=fn) 
                    for x, z, fn in zip(['en', 'fr'], [en_train, fr_train], [collate_batch_en, collate_batch_fr])}
valid_dataloader = {x: DataLoader(z, batch_size=batch_size, collate_fn=fn) 
                    for x, z, fn in zip(['en', 'fr'], [en_valid, fr_valid], [collate_batch_en, collate_batch_fr])}

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
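
Because the two languages live in separate datasets and separate dataloaders, any shuffle or split has to pick the same row indices for both, otherwise the sentence pairs stop matching. Here is a minimal sketch of that idea in plain Python; `aligned_split` and the four example pairs are made up for illustration.

```python
import random

def aligned_split(n_items, n_train, seed=0):
    """Shuffle the row indices once and reuse them for every parallel column."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]

en = ['hello', 'cat', 'dog', 'thanks']
fr = ['bonjour', 'chat', 'chien', 'merci']

train_idx, valid_idx = aligned_split(len(en), n_train=3)
train_pairs = [(en[i], fr[i]) for i in train_idx]
valid_pairs = [(en[i], fr[i]) for i in valid_idx]
```

Whatever the shuffle order turns out to be, every English sentence stays next to its own French translation because both columns are indexed with the same list.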

## III.4. Building Transformer
1. Make function to generate masks
2. Make object for Positional Encoding
3. Make object for transformer
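
The first item is the easiest to picture with a tiny example. Here is a sketch of the "subsequent" (causal) mask in plain Python, assuming the usual additive convention where 0.0 means "may attend" and negative infinity means "blocked". `toy_subsequent_mask` is a hypothetical stand-in for the tensor version below.

```python
NEG_INF = float('-inf')

def toy_subsequent_mask(size):
    """Row i may attend to columns 0..i only, so a position never sees later tokens."""
    return [[0.0 if col <= row else NEG_INF for col in range(size)]
            for row in range(size)]

mask = toy_subsequent_mask(3)
for row in mask:
    print(row)
```

The mask is added to the attention scores before the softmax, so blocked positions end up with zero attention weight. This is what stops the decoder from peeking at the words it is supposed to predict.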

In [None]:
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=device)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]
    src_mask = torch.zeros((src_seq_len, src_seq_len), device=device).type(torch.bool)
    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_padding_mask = (src == pad_idx).transpose(0, 1).to(device)
    tgt_padding_mask = (tgt == pad_idx).transpose(0, 1).to(device)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask

In [None]:
import torch.nn as nn
from torch import Tensor
import math

class PositionalEncoding(nn.Module):
    def __init__(self, emb_size:int, dropout:float, maxlen:int=5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(-torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)
    def forward(self, token_embedding:Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])
    
class Seq2SeqTransformer(nn.Module):
    def __init__(self, num_encoder_layers:int, num_decoder_layers:int, emb_size:int,
                 nhead:int, src_vocab_size:int, tgt_vocab_size:int, dim_feedforward:int, dropout:float=0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = nn.Transformer(d_model=emb_size, nhead=nhead, num_encoder_layers=num_encoder_layers,
                                          num_decoder_layers=num_decoder_layers, dim_feedforward=dim_feedforward,
                                          dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        # Look the vectors up in vocabulary order so that embedding row i belongs to vocabulary
        # index i. Tokens missing from FastText (such as the special tokens) get zero vectors.
        self.src_tok_emb = nn.Embedding.from_pretrained(
            fr_fasttext.get_vecs_by_tokens(fr_vocab.get_itos()), freeze=True)
        self.tgt_tok_emb = nn.Embedding.from_pretrained(
            en_fasttext.get_vecs_by_tokens(en_vocab.get_itos()), freeze=True)
        self.positional_encoding = PositionalEncoding(emb_size, dropout=dropout)
        self.emb_size = emb_size
    def forward(self, src:Tensor, tgt:Tensor, src_mask:Tensor, tgt_mask:Tensor,
                src_padding_mask:Tensor, tgt_padding_mask:Tensor, memory_key_padding_mask:Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src) * math.sqrt(self.emb_size))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(tgt) * math.sqrt(self.emb_size))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)
    # encode and decode scale the embeddings exactly like forward does, so inference
    # sees the same inputs the model was trained on.
    def encode(self, src:Tensor, src_mask:Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src) * math.sqrt(self.emb_size))
        return self.transformer.encoder(src_emb, src_mask)
    def decode(self, tgt:Tensor, memory:Tensor, tgt_mask:Tensor):
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(tgt) * math.sqrt(self.emb_size))
        return self.transformer.decoder(tgt_emb, memory, tgt_mask)
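
To see what the positional encoding table actually contains, here is a small sketch in plain Python that follows the same sine/cosine recipe as the `PositionalEncoding` class above. `toy_positional_encoding` is a hypothetical helper for illustration and uses loops instead of tensors.

```python
import math

def toy_positional_encoding(max_len, emb_size):
    """Even columns hold sin(pos * den), odd columns hold cos(pos * den)."""
    table = [[0.0] * emb_size for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, emb_size, 2):
            # den = 10000 ** (-i / emb_size), written with exp/log like the class above
            den = math.exp(-i * math.log(10000) / emb_size)
            table[pos][i] = math.sin(pos * den)
            if i + 1 < emb_size:
                table[pos][i + 1] = math.cos(pos * den)
    return table

pe = toy_positional_encoding(max_len=4, emb_size=6)
```

Position 0 always comes out as alternating zeros and ones, and every later position gets its own pattern. Adding these rows to the token embeddings is how the model learns about word order, since attention by itself ignores it.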

## III.5. Training and Evaluation
This is what we will do in this part:
1. Write the training function
2. Write the evaluation function
3. Initialize the model and its hyperparameters
4. Train and evaluate the model
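
Both functions below rely on the same trick, usually called teacher forcing: the decoder receives the target sentence without its last token and is asked to predict the target sentence without its first token. Here is a sketch with plain lists; `shift_for_teacher_forcing` and the token ids 10 to 12 are made up for illustration.

```python
BOS_IDX, EOS_IDX = 2, 3

def shift_for_teacher_forcing(tgt_ids):
    """Return (decoder input, expected output), offset by one position."""
    return tgt_ids[:-1], tgt_ids[1:]

tgt = [BOS_IDX, 10, 11, 12, EOS_IDX]
tgt_input, tgt_out = shift_for_teacher_forcing(tgt)

# At each step the model sees tgt_input up to that position and must predict tgt_out there.
steps = list(zip(tgt_input, tgt_out))
```

In the real code the same slicing is done on the first (time) dimension of the batch tensor, as `tgt[:-1, :]` and `tgt[1:, :]`.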

In [None]:
def train_epoch(model, src_dataloader, tgt_dataloader):
    model.train()
    losses = 0
    
    for src, tgt in zip(src_dataloader, tgt_dataloader):
        src = src.to(device)
        tgt = tgt.to(device)
        tgt_input = tgt[:-1, :]
        
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
        logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)
        optimizer.zero_grad()
        
        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()
        
        optimizer.step()
        losses += loss.item()
        
    return losses / len(src_dataloader)

In [None]:
def evaluate(model, src_dataloader, tgt_dataloader):
    model.eval()
    losses = 0
    
    # No gradients are needed for evaluation, which saves memory and time.
    with torch.no_grad():
        for src, tgt in zip(src_dataloader, tgt_dataloader):
            src = src.to(device)
            tgt = tgt.to(device)
            tgt_input = tgt[:-1, :]
            
            src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
            logits = model(src, tgt_input, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)
            
            tgt_out = tgt[1:, :]
            loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
            losses += loss.item()
    
    return losses / len(src_dataloader)

In [None]:
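# Use the vocabulary sizes, which include the four special tokens,
# so the output layer has one logit for every possible target index.
src_vocab_size = len(fr_vocab)
tgt_vocab_size = len(en_vocab)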
emb_size = fr_fasttext.vectors.shape[1] 
nhead = 5
ffn_hid_dim = emb_size
batch_size = 128
num_encoder_layers = 3
num_decoder_layers = 3

transformer = Seq2SeqTransformer(num_encoder_layers, num_decoder_layers, emb_size, nhead,
                                 src_vocab_size, tgt_vocab_size, ffn_hid_dim)

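# Re-initialize only the trainable weights. Without the requires_grad check,
# the frozen pretrained embeddings would be overwritten with random values.
for p in transformer.parameters():
    if p.dim() > 1 and p.requires_grad:
        nn.init.xavier_uniform_(p)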
        
transformer = transformer.to(device)
loss_fn = nn.CrossEntropyLoss(ignore_index=pad_idx)
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001)

In [None]:
import time

num_epochs = 20
for epoch in range(1, num_epochs + 1):
    start_time = time.time()
    train_loss = train_epoch(transformer, train_dataloader['fr'], train_dataloader['en'])
    end_time = time.time()
    eval_loss = evaluate(transformer, valid_dataloader['fr'], valid_dataloader['en'])
    
    # Training time of this epoch (start_time was set before train_epoch)
    elapsed = end_time - start_time
    elapsed_time = '{:.0f} m and {:.2f} s'.format(elapsed // 60, elapsed % 60)
    
    print('Epoch : {}\tTime : {}\tTrain Loss : {:.3f}\tEval Loss : {:.3f}'.format(epoch, elapsed_time, train_loss, eval_loss))

## III.6. Translating

In [None]:
def transform_text_fr(text:str):
    process = fr_vocab(fr_tokenizer(text))
    process = torch.cat((torch.tensor([bos_idx]), 
                         torch.tensor(process), 
                         torch.tensor([eos_idx])))
    return process

In [None]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(device)
    src_mask = src_mask.to(device)
    
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(device)
    
    for i in range(max_len - 1):
        memory = memory.to(device)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0)).type(torch.bool)).to(device)
        
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()
        
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        
        if next_word == eos_idx:
            break
            
    return ys
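
Stripped of the tensor handling, greedy decoding is a short loop: start from `<bos>`, repeatedly append the highest-scoring next token, and stop at `<eos>` or at the length limit. Here is a sketch in plain Python where `toy_scores` is a hypothetical stand-in for the trained model and simply follows a fixed script.

```python
BOS_IDX, EOS_IDX = 2, 3

def toy_greedy_decode(score_next, max_len, start_symbol=BOS_IDX):
    """Append the arg-max token at every step until <eos> or max_len tokens."""
    ys = [start_symbol]
    for _ in range(max_len - 1):
        scores = score_next(ys)
        next_word = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_word)
        if next_word == EOS_IDX:
            break
    return ys

def toy_scores(prefix):
    """Hypothetical model over 8 tokens: prefers 5, then 6, then <eos>."""
    script = {1: 5, 2: 6}
    best = script.get(len(prefix), EOS_IDX)
    return [1.0 if idx == best else 0.0 for idx in range(8)]

decoded = toy_greedy_decode(toy_scores, max_len=10)
```

Greedy decoding is the simplest choice because it never reconsiders a token once it is picked. Beam search is a common alternative that keeps several candidate sentences alive, at the cost of more computation.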

In [None]:
def translate(model, src_sentence:str):
    model.eval()
    # transform_text_fr is the helper defined above (tokenize, index, add <bos>/<eos>)
    src = transform_text_fr(src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(model, src, src_mask, max_len=(num_tokens + 5), start_symbol=bos_idx).flatten()
    # Join the predicted tokens and strip the <bos>/<eos> markers from the output
    return ' '.join(en_vocab.lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace('<bos>', '').replace('<eos>', '').strip()

To use the translator, call the function:

*translate(transformer, #your_text)*