
<div style="color:#254E58;margin:0;font-size:48px;font-family:Georgia;text-align:center;display:fill;border-radius:5px;overflow:hidden;font-weight:600;"> Fine-tuning for the machine translation model </div>

<h5 style="text-align: center; font-family: Verdana; font-size: 12px; font-style: normal; font-weight: bold; text-decoration: None; text-transform: none; letter-spacing: 1px; color: #7B0F2D; background-color: #ffffff;">CREATED BY: NGUYEN THI CAM LAI</h5>


<h2 style="font-family: Verdana; font-size: 30px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #0A64A2; background-color: #ffffff;"><b>General </b> introduction </h2>

**Problem:** English-to-Vietnamese machine translation with a Transformer-based model

**Project objectives:** This project fine-tunes an existing [model](https://huggingface.co/Helsinki-NLP/opus-mt-en-vi) on a [new dataset](https://huggingface.co/datasets/mt_eng_vietnamese/viewer/iwslt2015-vi-en/train). ***Fine-tuning*** updates the model `weights` under a chosen set of training `hyperparameters`, with the goal of improving translation quality on the new data.

**Training model:** (original model)

- Model name: Helsinki-NLP/opus-mt-en-vi

- Link (Hugging Face): https://huggingface.co/Helsinki-NLP/opus-mt-en-vi

- Source (Github): https://github.com/Helsinki-NLP/OPUS-MT-app/

**Training dataset:**
- Dataset name: mt_eng_vietnamese (iwslt2015-en-vi)
- Link (Hugging Face): https://huggingface.co/datasets/mt_eng_vietnamese/viewer/iwslt2015-vi-en/train



<h2 style="font-family: Verdana; font-size: 30px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #0A64A2; background-color: #ffffff;"><b>Finetuning </b> process </h2>

## 📌 Before you begin, make sure you have all the necessary libraries installed!

In [None]:
! pip install -U git+https://github.com/huggingface/transformers.git
! pip install -U git+https://github.com/huggingface/accelerate.git

In [None]:
!pip install datasets evaluate sacrebleu

## 📌 Sign in to Hugging Face so you can upload and share your models. When prompted, enter your token to log in

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## 📌 Load `mt_eng_vietnamese` dataset


**Start by loading the `iwslt2015-en-vi` subset of the `mt_eng_vietnamese` dataset from the `datasets` library:**

In [None]:
from datasets import load_dataset

data = load_dataset("mt_eng_vietnamese",'iwslt2015-en-vi')

In [None]:
data.shape

## 📌 Split the dataset into a train and test set with the `train_test_split` method:

In [None]:
data = data["train"].train_test_split(test_size=0.2)

In [None]:
data.shape

In [None]:
data["train"][2]
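To make the 80/20 split concrete, here is a minimal pure-Python sketch of a shuffled split over row indices. The helper `split_indices` is hypothetical and is not how `datasets` implements the split internally; it only illustrates the idea that rows are shuffled and then cut into two parts, and that a fixed seed makes the cut repeatable.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and cut them into train/test parts (illustrative helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed -> same split on every run
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

In the notebook itself, `train_test_split` also accepts a `seed` argument, for example `train_test_split(test_size=0.2, seed=42)`, which makes the split reproducible between runs.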

## 📌 Preprocess

**The next step is to load the `Helsinki-NLP/opus-mt-en-vi` tokenizer to process the English-Vietnamese sentence pairs:**

In [None]:
from transformers import AutoTokenizer

checkpoint = "Helsinki-NLP/opus-mt-en-vi"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

**The preprocessing function you want to create needs to:**

* Prefix the input with a prompt that names the translation task. Multi-task models such as T5 need this prefix to select the task; `Helsinki-NLP/opus-mt-en-vi` is a dedicated English-to-Vietnamese model, so the prefix is optional here, but this notebook keeps it.
* Tokenize the input (English) and the target (Vietnamese) separately: the targets are passed through the `text_target` argument so that they are encoded as labels with the target-language vocabulary.
* Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [None]:
source_lang = "en"
target_lang = "vi"
prefix = "translate English to Vietnamese: "


def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
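The sketch below mirrors the shape of `preprocess_function` on a single toy batch, to show what goes in and what comes out. It assumes a whitespace "tokenizer" (`toy_tokenize`, a hypothetical stand-in for the real SentencePiece tokenizer) and a small `max_length` so that truncation is visible.

```python
def toy_tokenize(text, max_length):
    """Whitespace 'tokenizer' for illustration only; truncates to max_length tokens."""
    return text.split()[:max_length]


def toy_preprocess(examples, prefix="translate English to Vietnamese: ", max_length=8):
    # A batch arrives as {"translation": [{"en": ..., "vi": ...}, ...]}
    inputs = [prefix + pair["en"] for pair in examples["translation"]]
    targets = [pair["vi"] for pair in examples["translation"]]
    return {
        "input_ids": [toy_tokenize(text, max_length) for text in inputs],
        "labels": [toy_tokenize(text, max_length) for text in targets],
    }


batch = {
    "translation": [
        {"en": "Thank you very much for coming today .", "vi": "Cảm ơn các bạn rất nhiều ."}
    ]
}
out = toy_preprocess(batch)
print(out["input_ids"][0])  # first 8 tokens; 4 of them belong to the prefix
print(out["labels"][0])     # 7 tokens, below the limit, so nothing is cut
```

Note that the prefix counts against the `max_length` budget of the input, which is one reason to pick `max_length` with the prefix in mind.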

**To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` method. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:**



In [None]:
tokenized_data = data.map(preprocess_function, batched=True)

**Now create a batch of examples using `DataCollatorForSeq2Seq`. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length:**

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
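As an illustration of dynamic padding, the following sketch pads a toy batch to the length of its longest sequence. `pad_batch` and `PAD_ID` are hypothetical names used only here; the one detail taken from the library is that `DataCollatorForSeq2Seq` pads labels with `-100` by default, a value the loss function ignores.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


PAD_ID = 0           # hypothetical pad token id for the inputs
IGNORE_INDEX = -100  # label padding value that the loss skips

input_ids = [[11, 12, 13], [21, 22, 23, 24, 25]]
labels = [[31, 32], [41, 42, 43]]

print(pad_batch(input_ids, PAD_ID))     # [[11, 12, 13, 0, 0], [21, 22, 23, 24, 25]]
print(pad_batch(labels, IGNORE_INDEX))  # [[31, 32, -100], [41, 42, 43]]
```

Because the padded length depends only on the current batch, a batch of short sentences stays short instead of being stretched to the dataset-wide maximum.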

## 📌 Evaluate

**Use the `evaluate` library to load an evaluation metric quickly:**

In [None]:
! pip install evaluate

In [None]:
! pip install sacrebleu

In [None]:
import evaluate

metric = evaluate.load("sacrebleu")

**Create a function that passes your predictions and labels to `compute` to calculate the `SacreBLEU` score:**

In [None]:
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
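To give some intuition for what the BLEU score measures, here is a simplified sketch of its core ingredient, the clipped n-gram precision between one prediction and one reference. This is not SacreBLEU's full algorithm: it leaves out the brevity penalty, the combination of several n-gram orders, and SacreBLEU's own tokenization. The helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction, reference, n):
    """Share of predicted n-grams found in the reference, each counted at most
    as often as it occurs in the reference."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return overlap / total if total else 0.0


pred = "the cat sat on the mat"
ref = "the cat is on the mat"
print(clipped_precision(pred, ref, 1))  # 5 of 6 unigrams match
print(clipped_precision(pred, ref, 2))  # 3 of 5 bigrams match
```

A single wrong word lowers the unigram precision slightly but breaks two bigrams, which is why higher-order n-grams make the metric sensitive to word order and fluency.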

## 📌 Train

**Load `Helsinki-NLP/opus-mt-en-vi` with `AutoModelForSeq2SeqLM`:**

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint) 

**At this point, only three steps remain:**

* Define your training hyperparameters in `Seq2SeqTrainingArguments`. The only required parameter is `output_dir`, which specifies where to save your model. To push the model to the Hub, set `push_to_hub=True` (commented out in the cell below; you need to be signed in to Hugging Face to upload your model). With `evaluation_strategy="epoch"`, the `Trainer` evaluates the SacreBLEU metric at the end of each epoch, and checkpoints are written to `output_dir` during training.
* Pass the training arguments to `Seq2SeqTrainer` along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
* Call `train()` to fine-tune your model.

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir = "en_vi_translation_1",
    evaluation_strategy = "epoch",  # named eval_strategy in newer transformers releases
    learning_rate = 2e-05,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    seed = 42,
    adam_epsilon=1e-08,
    adam_beta1=0.9,
    adam_beta2=0.999,
    weight_decay=0.01,
    save_total_limit=3,
    predict_with_generate=True,
    lr_scheduler_type = 'linear', 
    num_train_epochs = 3,
    #push_to_hub=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

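trainer.train()

# Save the final model and tokenizer to output_dir so they can be reloaded by path
trainer.save_model()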
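The setting `lr_scheduler_type = 'linear'` means the learning rate falls linearly from `learning_rate` to zero over the course of training (after an optional warmup, which is zero by default). The sketch below reproduces that schedule as a plain function; `linear_lr` and the step count of 1000 are hypothetical, since the real number of steps depends on the dataset size, batch size, and number of epochs.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; in practice steps_per_epoch * num_train_epochs
for step in (0, 500, 1000):
    # Approximately 2e-05 at the start, 1e-05 halfway, and 0.0 at the end
    print(step, linear_lr(step, total_steps))
```

With this schedule the largest updates happen early, and the last epoch makes only small adjustments to the weights.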

## 📌 Test translation on new sentences

In [None]:
text_example_1 = "Hi, call me Lai. I am studying Data Science at VNUHCM - University of Science."
text_example_2 = "I'm here to assuage my enthusiasm for Natural language processing."
text_example_3 = "Natural Language Processing (NLP) is a branch of artificial intelligence."


In [None]:
from transformers import pipeline

# Load the fine-tuned model from the training output directory
translator = pipeline("translation", model="en_vi_translation_1")

# Translate each example sentence and print the result
for text in (text_example_1, text_example_2, text_example_3):
    print(translator(text))

## 📌 Save weights

In [None]:
import torch

In [None]:
torch.save(model.state_dict(), 'fine_tuned_weights.pth')

<h2 style="font-family: Verdana; font-size: 30px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #0A64A2; background-color: #ffffff;"><b>Project </b> result </h2>

- This model is a fine-tuned version of [Helsinki-NLP/opus-mt-en-vi](https://huggingface.co/Helsinki-NLP/opus-mt-en-vi) on the [mt_eng_vietnamese](https://huggingface.co/datasets/mt_eng_vietnamese) dataset. It achieves the following results on the evaluation set:

    - Loss: 1.376056

    - Bleu: 34.515300

    - Gen Len: 27.230600
    

- The model after training is saved at: https://huggingface.co/ntclai/en_vi_translation_1


<h2 style="font-family: Verdana; font-size: 30px; font-style: normal; font-weight: normal; text-decoration: none; text-transform: none; letter-spacing: 2px; color: #0A64A2; background-color: #ffffff;"><b>Reference </b> </h2>

- https://huggingface.co/docs/transformers/tasks/translation

- https://github.com/mariaviana21/fine-tuning-machine-translation


<div style="color:#254E58;margin:0;font-size:48px;font-family:Georgia;text-align:center;display:fill;border-radius:5px;overflow:hidden;font-weight:600;"> Thank you for watching! </div>