# Assignment 3 (100 + 5 points)

**Name:** <br>
**Email:** <br>
**Group:** A/B <br>
**Hours spent *(optional)* :** <br>

### Question 1: Transformer model *(100 points)*

As a machine learning engineer at a tech company, you have been given the task of developing a machine translation system that translates **English (source) to German (target)**. You may use existing libraries, but the model must be trained from scratch (pretrained weights are not allowed). You are free to choose any dataset for training. Hold out a small subset of the data as a validation set and report the BLEU score on it. Also provide a short description of your transformer architecture, hyperparameters, and training procedure, including the training loss curve.

<h3> Submission </h3>

The test set **(test.txt)** will be released one week before the deadline. You should submit the output of your model on the test set separately. Name the output file as **"first name_last_name_test_result.txt"**. Each line of the submission file should contain only the translated text of the corresponding sentence from 'test.txt'.

The 'first name_last_name_test_result.txt' file will be evaluated by your instructor, and the student with the best BLEU score will receive 5 additional points.
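The one-translation-per-line format described above could be produced along these lines. This is only a sketch: `write_submission` and its arguments are hypothetical names, and the output path is left to the caller.

```python
from pathlib import Path


def write_submission(translations, out_path):
    """Write one translation per line and return the number of lines written.

    Hypothetical helper: internal newlines and repeated whitespace are
    collapsed so that line i always corresponds to sentence i of the test file.
    """
    lines = [" ".join(text.split()) for text in translations]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Comparing the returned count with the number of lines in 'test.txt' before submitting is a cheap way to catch dropped or split sentences.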

**Dataset**

Here are some of the parallel datasets (see Datasets and Resources file):
* Europarl Parallel corpus - https://www.statmt.org/europarl/v7/de-en.tgz
* News Commentary - https://www.statmt.org/wmt14/training-parallel-nc-v9.tgz (use DE-EN parallel data)
* Common Crawl corpus - https://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz (use DE-EN parallel data)

You can also use other datasets of your choice. In the datasets above, the **'.en'** file contains the English text and the **'.de'** file contains the corresponding German translations.

## Notes:

1) If the training dataset is large, consider using a small subset of it.
2) Training can run out of memory, so choose the hyperparameters carefully.
3) Training is much faster on a GPU; on a CPU it may take several hours or even days. You can also use Google Colab GPUs for training: https://colab.research.google.com/

**MY MACHINE TRANSLATION SYSTEM**

In [1]:
import torch
import torch.optim as optim
from sklearn.model_selection import train_test_split
from nltk.translate.bleu_score import corpus_bleu
from nltk.tokenize import word_tokenize
import random
import collections
import re
import string
import math
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

In [2]:
#loading training data
with open('news-commentary-v9.de-en.de', 'r', encoding='utf-8') as file1:
    german_corpus = file1.readlines()


with open('news-commentary-v9.de-en.en', 'r', encoding='utf-8') as file1:
    english_corpus = file1.readlines()

print('Data Loaded')
print(len(german_corpus))
print(len(english_corpus))

Data Loaded
201854
201995


In [3]:
#sample sentences
for sample_i in range(2):
    print('small_vocab_en Line {}:  {}'.format(sample_i + 1,
                                               english_corpus[sample_i]))
    print('small_vocab_de Line {}:  {}'.format(sample_i + 1, 
                                               german_corpus[sample_i]))

small_vocab_en Line 1:  $10,000 Gold?

small_vocab_de Line 1:  Steigt Gold auf 10.000 Dollar?

small_vocab_en Line 2:  SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold.

small_vocab_de Line 2:  SAN FRANCISCO – Es war noch nie leicht, ein rationales Gespräch über den Wert von Gold zu führen.



In [4]:

english_words = [word for sentence in english_corpus for word in sentence.split()]
english_words_counter = collections.Counter(english_words)

german_words = [word for sentence in german_corpus for word in sentence.split()]
german_words_counter = collections.Counter(german_words)

print("Total English words: ", len(english_words))
print("Unique English words: ", len(english_words_counter))
print("Total German words : ", len(german_words))
print("Unique German words: ", len(german_words_counter))

Total English words:  4386024
Unique English words:  152217
Total German words :  4500845
Unique German words:  245674


In [5]:
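# The two files yield different line counts (201854 vs. 201995), so they are not
# line-aligned as read. Truncating to the shorter length makes the lists equal in
# size but may leave some sentence pairs mismatched.
min_length = min(len(german_corpus), len(english_corpus))

german_corpus = german_corpus[:min_length]
english_corpus = english_corpus[:min_length]

# Note: here src holds German and tgt holds English; the TensorDataset built
# later feeds English to the model as input and German as the prediction target.
src = german_corpus
tgt = english_corpus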
print(len(german_corpus))
print(len(english_corpus))

# Splitting into training and validation sets with a ratio of 80-20
src_train, src_val, tgt_train, tgt_val = train_test_split(src, 
                                                          tgt, 
                                                          test_size=0.2,
                                                          random_state=42)
print("Data split into val and train set ")

201854
201854
Data split into val and train set 


In [6]:
# Preprocessing: lowercase, strip punctuation, tokenize, keep alphabetic tokens
def preprocess_and_tokenize(corpus):
    tokenized_corpus = []
    for sentence in corpus:
        # Converting to lowercase
        sentence = sentence.lower()
        # Removing non-printable characters.
        # Caution: string.printable is ASCII-only, so this also removes
        # German letters such as ä, ö, ü and ß.
        sentence = ''.join(filter(lambda x: x in string.printable, sentence))
        # Removing punctuation
        sentence = re.sub(r'[^\w\s]', '', sentence)
        # Tokenizing
        words = word_tokenize(sentence)
        # Removing non-alphabetic tokens (numbers are dropped as well)
        words = [word for word in words if word.isalpha()]
        # Sentences left empty after preprocessing stay as [] so that
        # source and target lists remain the same length
        tokenized_corpus.append(words)
    return tokenized_corpus

src_train = preprocess_and_tokenize(src_train)
tgt_train = preprocess_and_tokenize(tgt_train)
src_val = preprocess_and_tokenize(src_val)
tgt_val = preprocess_and_tokenize(tgt_val)


english_processed = preprocess_and_tokenize(english_corpus)
german_processed = preprocess_and_tokenize(german_corpus)

#vocabulary
english_vocab = [word for sentence in english_processed for word in sentence]
german_vocab = [word for sentence in german_processed for word in sentence]
english_words_counter = collections.Counter(english_vocab)
german_words_counter = collections.Counter(german_vocab)

print("Total English words after preprocessing: ", len(english_vocab))
print("Unique English words: ", len(english_words_counter))
print("Total German words after preprocessing: ", len(german_vocab))
print("Unique German words: ", len(german_words_counter))

Total English words after preprocessing:  4313195
Unique English words:  63324
Total German words after preprocessing:  4418492
Unique German words:  144204
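Because `string.printable` contains only ASCII characters, the filter in the preprocessing cell also strips ä, ö, ü and ß from the German side. A Unicode-aware variant might look like the sketch below. `preprocess_unicode` is a hypothetical name, and it uses plain whitespace splitting in place of `word_tokenize` to stay dependency-free.

```python
import re
import unicodedata

_PUNCT = re.compile(r"[^\w\s]")  # \w is Unicode-aware for str patterns


def preprocess_unicode(sentence):
    """Lowercase, drop control characters and punctuation, keep alphabetic tokens."""
    # NFC composes "a" + combining diaeresis into a single "ä" code point
    sentence = unicodedata.normalize("NFC", sentence).lower()
    # Drop control/format characters (Unicode categories C*), but keep whitespace
    sentence = "".join(
        ch for ch in sentence
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    sentence = _PUNCT.sub("", sentence)
    return [tok for tok in sentence.split() if tok.isalpha()]


print(preprocess_unicode("Ein rationales Gespräch über Gold."))
# ['ein', 'rationales', 'gespräch', 'über', 'gold']
```

Switching to this variant would change the vocabulary counts printed above, so they would need to be recomputed.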


In [7]:
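# LSTM model: predicts one German token per English input position
class SimpleLSTM(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, output_dim):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim)
        self.fc_out = nn.Linear(hid_dim, output_dim)

    def forward(self, src):
        # src: (batch, seq_len), as produced by the DataLoader
        embedded = self.embedding(src)  # (batch, seq_len, emb_dim)
        # nn.LSTM defaults to (seq_len, batch, features), so swap the first two axes
        output, (hidden, cell) = self.rnn(embedded.transpose(0, 1))
        # Swap back so the prediction is (batch, seq_len, output_dim)
        prediction = self.fc_out(output.transpose(0, 1))
        return prediction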

INPUT_DIM = len(english_words_counter) + 1
OUTPUT_DIM = len(german_words_counter) + 1
EMB_DIM = 256
HID_DIM = 512

model = SimpleLSTM(INPUT_DIM, EMB_DIM, HID_DIM, OUTPUT_DIM)
print(model)

SimpleLSTM(
  (embedding): Embedding(63325, 256)
  (rnn): LSTM(256, 512)
  (fc_out): Linear(in_features=512, out_features=144205, bias=True)
)


In [8]:
# String-to-integer mappings (index 0 is reserved for padding)
english_stoi = {word: index+1 for index, (word, _)
                in enumerate(english_words_counter.most_common())}
german_stoi = {word: index+1 for index, (word, _) 
               in enumerate(german_words_counter.most_common())}

train_english_sentences = [[english_stoi[word] for word in sentence] 
                           for sentence in tgt_train]
train_german_sentences = [[german_stoi[word] for word in sentence] 
                          for sentence in src_train]
val_english_sentences = [[english_stoi[word] for word in sentence] 
                         for sentence in tgt_val]
val_german_sentences = [[german_stoi[word] for word in sentence]
                        for sentence in src_val]


# dtype=torch.long keeps empty sentences as integer tensors; without it,
# torch.as_tensor([]) would produce a float tensor.
train_english_sentences_pad = pad_sequence([torch.as_tensor(s, dtype=torch.long) 
                                            for s in train_english_sentences], 
                                           padding_value=0)
train_german_sentences_pad = pad_sequence([torch.as_tensor(s, dtype=torch.long) 
                                           for s in train_german_sentences], 
                                          padding_value=0)

val_english_sentences_pad = pad_sequence([torch.as_tensor(s, dtype=torch.long) 
                                          for s in val_english_sentences], 
                                         padding_value=0)
val_german_sentences_pad = pad_sequence([torch.as_tensor(s, dtype=torch.long) 
                                         for s in val_german_sentences], 
                                        padding_value=0)
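The mappings above index every word in the corpus, which gives an output layer of more than 144,000 classes and leaves no room for unknown-word or sentence-boundary markers. One possible alternative is a capped vocabulary with special tokens, sketched below. The token strings, `max_size` and `min_freq` are illustrative assumptions; `<pad>` stays at index 0 to match `padding_value=0`.

```python
from collections import Counter

# Hypothetical special tokens; <pad> must stay at index 0
PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"


def build_vocab(tokenized_sentences, max_size=30000, min_freq=2):
    """Return (stoi, itos) holding the special tokens plus the most frequent words."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = [PAD, UNK, BOS, EOS]
    for tok, freq in counts.most_common(max_size - len(itos)):
        if freq < min_freq:
            break  # most_common is sorted, so all remaining words are rarer
        itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def encode(tokens, stoi, add_bos_eos=False):
    """Map tokens to ids, sending out-of-vocabulary words to <unk>."""
    ids = [stoi.get(tok, stoi[UNK]) for tok in tokens]
    if add_bos_eos:
        ids = [stoi[BOS]] + ids + [stoi[EOS]]
    return ids
```

Building the vocabulary from the training split only, with `<unk>` as the fallback, also avoids having to count words from the validation sentences.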

In [9]:
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

train_data = TensorDataset(torch.transpose(train_english_sentences_pad, 0, 1),
                           torch.transpose(train_german_sentences_pad, 0, 1))
val_data = TensorDataset(torch.transpose(val_english_sentences_pad, 0, 1), 
                         torch.transpose(val_german_sentences_pad, 0, 1))


BATCH_SIZE = 32

train_loader = DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)
# Validation data does not need shuffling; a fixed order keeps evaluation reproducible
val_loader = DataLoader(dataset=val_data, batch_size=BATCH_SIZE, shuffle=False)
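Padding the whole dataset to its longest sentence makes every batch as long as the worst case. Padding per batch inside a `collate_fn` is a common alternative. The sketch below assumes the dataset yields `(source_ids, target_ids)` pairs as plain lists; `collate_pairs` and the toy `pairs` data are hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_ID = 0  # same padding index as used elsewhere in this notebook


def collate_pairs(batch):
    """Pad source and target only to the longest sequence in this batch."""
    src_seqs, tgt_seqs = zip(*batch)
    src = pad_sequence([torch.as_tensor(s, dtype=torch.long) for s in src_seqs],
                       batch_first=True, padding_value=PAD_ID)
    tgt = pad_sequence([torch.as_tensor(t, dtype=torch.long) for t in tgt_seqs],
                       batch_first=True, padding_value=PAD_ID)
    return src, tgt


# Toy data: a list of (source_ids, target_ids) pairs works as a map-style dataset
pairs = [([5, 6, 7], [8, 9]), ([5], [8, 9, 10, 11]), ([6, 7], [9])]
loader = DataLoader(pairs, batch_size=2, shuffle=False, collate_fn=collate_pairs)

for src, tgt in loader:
    print(src.shape, tgt.shape)
```

Source and target still have different lengths within a batch, so this only fits a model whose output length does not depend on the input length, such as an encoder-decoder.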

In [10]:
def align_lengths(output, tgt):
    # The model emits one prediction per English input position, while the English
    # and German tensors are padded to different lengths (153 vs. 134 tokens here).
    # Without trimming both to a shared length, the loss fails with
    # "ValueError: Expected input batch_size (4896) to match target batch_size (4288)",
    # i.e. 32 * 153 predictions against 32 * 134 targets.
    # This is a workaround only: it pairs positions one-to-one and does not turn
    # the model into a real sequence-to-sequence translator.
    seq_len = min(output.size(1), tgt.size(1))
    return output[:, :seq_len], tgt[:, :seq_len]

def train_model(model, train_loader, optimizer, criterion, device):
    model.train()
    epoch_loss = 0
    for src, tgt in train_loader:
        src, tgt = src.to(device), tgt.to(device)
        optimizer.zero_grad()
        output, tgt = align_lengths(model(src), tgt)
        # reshape (not view) because the sliced tensors are no longer contiguous
        loss = criterion(output.reshape(-1, output.shape[-1]), tgt.reshape(-1))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(train_loader)

# Define the evaluation function
def evaluate_model(model, val_loader, criterion, device):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for src, tgt in val_loader:
            src, tgt = src.to(device), tgt.to(device)
            output, tgt = align_lengths(model(src), tgt)
            loss = criterion(output.reshape(-1, output.shape[-1]), tgt.reshape(-1))
            epoch_loss += loss.item()
    return epoch_loss / len(val_loader)

# Calculate BLEU score
def calculate_bleu_score(model, val_loader, english_itos, german_itos, device):
    model.eval()
    references, hypotheses = [], []
    with torch.no_grad():
        for src, tgt in val_loader:
            src, tgt = src.to(device), tgt.to(device)
            output = model(src).argmax(dim=-1)
            for i in range(tgt.size(0)):
                ref = [german_itos[idx] for idx in tgt[i].tolist() if idx != 0]
                # Skip predictions made at padded source positions
                hyp = [german_itos[idx]
                       for idx, src_idx in zip(output[i].tolist(), src[i].tolist())
                       if src_idx != 0 and idx != 0]
                references.append([ref])
                hypotheses.append(hyp)
    return corpus_bleu(references, hypotheses)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Define optimizer and loss function (index 0 is padding and is ignored by the loss)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=0)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    train_loss = train_model(model, train_loader, optimizer, criterion, device)
    val_loss = evaluate_model(model, val_loader, criterion, device)
    print(f'Epoch {epoch + 1}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')

# Calculate BLEU score
english_itos = {idx: word for word, idx in english_stoi.items()}
german_itos = {idx: word for word, idx in german_stoi.items()}
bleu_score = calculate_bleu_score(model, val_loader, english_itos, german_itos, device)
print(f'BLEU score = {bleu_score * 100:.2f}')
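The LSTM above produces exactly one prediction per source position, so it cannot generate a German sentence whose length differs from the English input. The assignment asks for a Transformer, and an encoder-decoder removes that coupling. The following is a minimal sketch built on `torch.nn.Transformer`, not a tuned solution. The class name, layer sizes and the special-token ids (0 = padding, 2 = `<bos>`, 3 = `<eos>`) are assumptions made for illustration.

```python
import math

import torch
import torch.nn as nn


class Seq2SeqTransformer(nn.Module):
    """Encoder-decoder Transformer with sinusoidal positions (d_model must be even)."""

    def __init__(self, src_vocab, tgt_vocab, d_model=256, nhead=4, num_layers=3,
                 dim_ff=512, dropout=0.1, max_len=512, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.d_model = d_model
        self.src_emb = nn.Embedding(src_vocab, d_model, padding_idx=pad_id)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model, padding_idx=pad_id)
        # Fixed sinusoidal position table, stored as a non-trainable buffer
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff, dropout=dropout, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def embed(self, emb, ids):
        # Scale token embeddings and add the positions for the first ids.size(1) steps
        return emb(ids) * math.sqrt(self.d_model) + self.pe[: ids.size(1)]

    def forward(self, src, tgt_in):
        # Causal mask: True marks future positions the decoder may not attend to
        t = tgt_in.size(1)
        causal = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=tgt_in.device), diagonal=1)
        src_pad = src == self.pad_id
        tgt_pad = tgt_in == self.pad_id
        hidden = self.transformer(
            self.embed(self.src_emb, src), self.embed(self.tgt_emb, tgt_in),
            tgt_mask=causal,
            src_key_padding_mask=src_pad,
            tgt_key_padding_mask=tgt_pad,
            memory_key_padding_mask=src_pad)
        return self.out(hidden)  # (batch, tgt_len, tgt_vocab)


# Toy batch: source and target lengths differ (4 vs. 5 tokens)
src = torch.tensor([[4, 5, 6, 0], [7, 8, 9, 10]])
tgt = torch.tensor([[2, 11, 12, 3, 0], [2, 13, 14, 15, 3]])

model = Seq2SeqTransformer(src_vocab=20, tgt_vocab=20, d_model=32, nhead=4,
                           num_layers=1, dim_ff=64)
# Teacher forcing: feed tgt[:, :-1] and predict the next token, tgt[:, 1:]
logits = model(src, tgt[:, :-1])
criterion = nn.CrossEntropyLoss(ignore_index=0)
loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
```

The decoder input is the target shifted right by one token, so the number of logits always matches the number of target tokens, whatever the source length. At inference time the decoder would start from `<bos>` and append one predicted token per step until it emits `<eos>`.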