<a href="https://colab.research.google.com/github/pavaris-pm/machine-translation-from-th/blob/main/translation_th_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install necessary packages

In [1]:
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install sacrebleu
!pip install sentencepiece
!pip install sacremoses

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.8.0-py3-none-any.whl (452 kB)

# Translation model from Thai to English using the KDE4 dataset
- To find a valid language-pair code, see https://opus.nlpl.eu/KDE4.php

In [2]:
!unzip /content/en-th.txt.zip

unzip:  cannot find or open /content/en-th.txt.zip, /content/en-th.txt.zip.zip or /content/en-th.txt.zip.ZIP.


In [3]:
from datasets import load_dataset

# Load the data in its original en-th pairing first; the translation direction (Thai to English) is chosen later, during preprocessing
raw_datasets = load_dataset("kde4", lang1="en", lang2="th")

Downloading and preparing dataset kde4/en-th to /root/.cache/huggingface/datasets/kde4/en-th-lang1=en,lang2=th/0.0.0/243129fb2398d5b0b4f7f6831ab27ad84774b7ce374cf10f60f6e1ff331648ac...

Dataset kde4 downloaded and prepared to /root/.cache/huggingface/datasets/kde4/en-th-lang1=en,lang2=th/0.0.0/243129fb2398d5b0b4f7f6831ab27ad84774b7ce374cf10f60f6e1ff331648ac. Subsequent calls will reuse this data.

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 70634
    })
})

In [5]:
# Split the data into a training set (90%) and a validation set (10%)
split_datasets = raw_datasets['train'].train_test_split(train_size=0.9, seed=100)
split_datasets["validation"] = split_datasets.pop("test")

In [6]:
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 63570
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 7064
    })
})
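The row counts above follow from the 90/10 split of 70634 examples. A minimal sketch of the arithmetic, assuming the train size is rounded down and the remainder goes to validation; `split_indices` is a hypothetical helper and its shuffle is not the one `datasets` uses, so only the sizes are comparable:

```python
import math
import random

def split_indices(n_rows, train_size, seed):
    # Hypothetical helper: shuffle row indices, then cut at floor(train_size * n_rows).
    n_train = math.floor(train_size * n_rows)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]

train_idx, val_idx = split_indices(70634, 0.9, seed=100)
print(len(train_idx), len(val_idx))
```

The two index lists are disjoint and together cover every row, which is the property the validation set relies on.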

In [7]:
# Inspect one example from the training set
split_datasets['train'][1]['translation']

{'en': "An error occurred while trying to share folder '%1 '. Make sure that the Perl script'fileshareset' is set suid root.",
 'th': "เกิดข้อผิดพลาดระหว่างปรับให้ใช้โฟลเดอร์ '% 1' ร่วมกัน โปรดตรวจสอบว่า สคริปต์คำสั่งเพิร์ล 'fileshareset' ได้ถูกตั้งให้ประมวลผลโดยใช้สิทธิ์ของ root แล้ว (suid root)"}

# Data Preprocessing
- The text must be converted to token IDs before it is passed to the model. This is done with `AutoTokenizer` from the transformers library.

In [8]:
model_checkpoint = "Helsinki-NLP/opus-mt-th-en"

In [9]:
from transformers import AutoTokenizer

# return_tensors is an argument of the tokenizer call, not of from_pretrained, so it is omitted here
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [10]:
# Pick one sentence pair to see how the tokenizer works
en_sentence = split_datasets['train'][1]['translation']['en']
th_sentence = split_datasets['train'][1]['translation']['th']

print(f"en sentence : {en_sentence}")
print(f"th sentence : {th_sentence}")

en sentence : An error occurred while trying to share folder '%1 '. Make sure that the Perl script'fileshareset' is set suid root.
th sentence : เกิดข้อผิดพลาดระหว่างปรับให้ใช้โฟลเดอร์ '% 1' ร่วมกัน โปรดตรวจสอบว่า สคริปต์คำสั่งเพิร์ล 'fileshareset' ได้ถูกตั้งให้ประมวลผลโดยใช้สิทธิ์ของ root แล้ว (suid root)


In [11]:
# Tokenize the pair; the English target is passed via text_target so that it is tokenized as the labels
tokenizer(th_sentence, text_target = en_sentence)

{'input_ids': [19987, 1795, 6230, 186, 537, 3405, 4348, 10731, 12, 5308, 46840, 12, 15589, 3722, 34427, 320, 25257, 5325, 7237, 9, 3280, 5, 43, 540, 1377, 186, 10596, 12192, 7085, 119, 2908, 35797, 101, 77, 8885, 5106, 2908, 35797, 80, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [1607, 3829, 7037, 668, 770, 7, 1709, 1502, 10483, 320, 2, 2635, 482, 16, 4, 9506, 1075, 13558, 5, 25257, 51982, 12069, 5, 19, 669, 19065, 5106, 3711, 2, 0]}

In [12]:
# Convert the IDs back to tokens to check that the tokenizer segments both languages sensibly
print(tokenizer.convert_ids_to_tokens(tokenizer(th_sentence, text_target = en_sentence)['input_ids']))
print(tokenizer.convert_ids_to_tokens(tokenizer(th_sentence, text_target = en_sentence)['labels']))

['▁เกิดข้อผิดพลาด', 'ระหว่าง', 'ปรับ', 'ให้', 'ใช้', 'โฟลเดอร์', "▁'%", "▁1'", '▁', 'ร่วมกัน', '▁โปรดตรวจสอบว่า', '▁', 'สคริปต์', 'คําสั่ง', 'เพิร์ล', "▁'", 'file', 'sh', 'are', 's', 'et', "'", '▁ได้', 'ถูก', 'ตั้ง', 'ให้', 'ประมวลผล', 'โดยใช้', 'สิทธิ์', 'ของ', '▁r', 'oot', '▁แล้ว', '▁(', 'su', 'id', '▁r', 'oot', ')', '</s>']
['▁An', '▁error', '▁occurred', '▁while', '▁trying', '▁to', '▁share', '▁folder', "▁'%1", "▁'", '.', '▁Make', '▁sure', '▁that', '▁the', '▁Per', 'l', '▁script', "'", 'file', 'share', 'set', "'", '▁is', '▁set', '▁su', 'id', '▁root', '.', '</s>']


In [13]:
# to see all keys produced by the tokenizer
sample = tokenizer(th_sentence, text_target=en_sentence)
sample.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [14]:
# Define the function that preprocesses the dataset. max_length is set to 128;
# Thai sentences seem to produce longer token sequences than other languages, so longer inputs are truncated
max_length = 128

def preprocess_function(examples):
  inputs = [txt['th'] for txt in examples['translation']]
  targets = [txt['en'] for txt in examples['translation']]

  model_inputs = tokenizer(
      inputs, text_target=targets, truncation=True, max_length=max_length
  )

  return model_inputs
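With `batched=True`, `map` passes the function a dict of lists rather than a single row, which is why `preprocess_function` loops over `examples['translation']`. A minimal sketch of that input and output shape, assuming a whitespace splitter (`toy_tokenize`, a hypothetical stand-in for the real tokenizer) and a tiny `max_length` so that truncation is visible:

```python
def toy_tokenize(text, max_length):
    # Hypothetical stand-in for the real tokenizer: whitespace split plus truncation.
    return text.split()[:max_length]

def preprocess_batch(examples, max_length=4):
    # A batch arrives as a dict of lists, one list entry per row.
    inputs = [pair["th"] for pair in examples["translation"]]
    targets = [pair["en"] for pair in examples["translation"]]
    return {
        "input_ids": [toy_tokenize(text, max_length) for text in inputs],
        "labels": [toy_tokenize(text, max_length) for text in targets],
    }

batch = {
    "id": ["0", "1"],
    "translation": [
        {"en": "open the file now please", "th": "เปิด แฟ้ม"},
        {"en": "close", "th": "ปิด"},
    ],
}
out = preprocess_batch(batch)
print(out["labels"])
```

The first target has five words and is cut to four, while the shorter sequences are left unpadded; padding is deferred to the data collator.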

In [15]:
# Apply the preprocessing function to both splits of the dataset
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets['train'].column_names,
)

  0%|          | 0/64 [00:00<?, ?ba/s]

  0%|          | 0/8 [00:00<?, ?ba/s]

In [16]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 63570
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 7064
    })
})

In [17]:
# observe the value in tokenized datasets
print(tokenized_datasets['train'][1]['input_ids'])
print(tokenized_datasets['train'][1]['labels'])

[19987, 1795, 6230, 186, 537, 3405, 4348, 10731, 12, 5308, 46840, 12, 15589, 3722, 34427, 320, 25257, 5325, 7237, 9, 3280, 5, 43, 540, 1377, 186, 10596, 12192, 7085, 119, 2908, 35797, 101, 77, 8885, 5106, 2908, 35797, 80, 0]
[1607, 3829, 7037, 668, 770, 7, 1709, 1502, 10483, 320, 2, 2635, 482, 16, 4, 9506, 1075, 13558, 5, 25257, 51982, 12069, 5, 19, 669, 19065, 5106, 3711, 2, 0]


# Train the model with Trainer API

In [18]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/307M [00:00<?, ?B/s]

In [19]:
# With the model loaded, the next step is a data collator that pads each batch dynamically
# Because this is a seq2seq model, the collator differs from the one used for other tasks: it must pad the labels too

from transformers import DataCollatorForSeq2Seq

# The model is passed as well so that the collator can prepare decoder_input_ids from the labels
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

In [20]:
# Build a small batch to see how the data collator works
batch = data_collator([tokenized_datasets['train'][i] for i in range(1,3)])

In [21]:
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

In [22]:
# Show what the data collator does behind the scenes, and why padding was not requested from the tokenizer
print(batch['labels'])
print(batch['decoder_input_ids'])

tensor([[ 1607,  3829,  7037,   668,   770,     7,  1709,  1502, 10483,   320,
             2,  2635,   482,    16,     4,  9506,  1075, 13558,     5, 25257,
         51982, 12069,     5,    19,   669, 19065,  5106,  3711,     2,     0],
        [  503, 37323, 43856,   365,    34,  1822,     7,  2036, 46128,     2,
             0,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100]])
tensor([[62306,  1607,  3829,  7037,   668,   770,     7,  1709,  1502, 10483,
           320,     2,  2635,   482,    16,     4,  9506,  1075, 13558,     5,
         25257, 51982, 12069,     5,    19,   669, 19065,  5106,  3711,     2],
        [62306,   503, 37323, 43856,   365,    34,  1822,     7,  2036, 46128,
             2,     0, 62306, 62306, 62306, 62306, 62306, 62306, 62306, 62306,
         62306, 62306, 62306, 62306, 62306, 62306, 62306, 62306, 62306, 62306]])
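The two tensors above are related by padding and a shift: labels are padded with -100 so that the loss ignores those positions, and `decoder_input_ids` are the labels shifted one step right behind a start token. A minimal pure-Python sketch of that relationship, assuming the pad ID and the decoder start ID are both 62306 as the output suggests; `pad_labels` and `shift_right` are hypothetical helpers, not the collator's own code:

```python
PAD_ID = 62306            # assumed pad token ID, taken from the padded output above
DECODER_START_ID = 62306  # assumed decoder start token ID
IGNORE_INDEX = -100       # label value that the loss ignores

def pad_labels(label_seqs, ignore_index=IGNORE_INDEX):
    # Pad every label sequence to the longest one in the batch.
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]

def shift_right(padded_labels, pad_id=PAD_ID, start_id=DECODER_START_ID):
    # Prepend the start token, drop the last position, and replace -100 with the pad ID.
    shifted = []
    for seq in padded_labels:
        row = [start_id] + seq[:-1]
        shifted.append([pad_id if tok == IGNORE_INDEX else tok for tok in row])
    return shifted

labels = pad_labels([[1607, 3829, 7037, 2, 0], [503, 2, 0]])
decoder_input_ids = shift_right(labels)
print(labels)
print(decoder_input_ids)
```

Padding per batch rather than at tokenization time keeps each batch only as long as its longest sequence.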


In [23]:
# define the metrics to evaluate performance of our translation model
import evaluate

metric = evaluate.load("sacrebleu")

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]

In [24]:
# Define the metric function passed to the Trainer so that it can score the model
import numpy as np  # np.where is used to replace the -100 label padding

def compute_metric(eval_preds):
  preds, labels = eval_preds

  # Predictions may arrive as a tuple; the generated token IDs are its first element
  if isinstance(preds, tuple):
    preds = preds[0]

  # BLEU is computed on text, so decode the predicted token IDs back to strings
  preds_decode = tokenizer.batch_decode(preds, skip_special_tokens=True)

  # -100 marks padded label positions and cannot be decoded, so replace it with the pad token ID first
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
  labels_decode = tokenizer.batch_decode(labels, skip_special_tokens=True)

  # Simple post-processing: strip surrounding whitespace
  # Each reference is wrapped in a list because sacrebleu accepts several references per prediction
  preds_decode = [pred.strip() for pred in preds_decode]
  labels_decode = [[label.strip()] for label in labels_decode]

  result = metric.compute(predictions=preds_decode, references=labels_decode)

  return {"bleu score" : result["score"]}
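BLEU is built on clipped n-gram precision: the share of predicted n-grams that also occur in the reference, with each n-gram counted at most as often as the reference contains it. The sketch below illustrates only that ingredient on whitespace tokens. It is a simplification and not the sacrebleu implementation, which also applies its own tokenization, a brevity penalty, and a geometric mean over n-gram orders; `ngrams` and `clipped_precision` are hypothetical helpers.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(prediction, reference, n):
    # Fraction of predicted n-grams found in the reference, clipped by reference counts.
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return overlap / total

pred = "the cat sat on the mat"
ref = "the cat is on the mat"
p1 = clipped_precision(pred, ref, 1)  # 5 of 6 unigrams match
p2 = clipped_precision(pred, ref, 2)  # 3 of 5 bigrams match
print(p1, p2)
```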


In [25]:
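import os
# exist_ok=True avoids an error if the cell is re-run and the directory already exists
os.makedirs("/content/th-en_translation_model", exist_ok=True)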

In [26]:
from transformers import Seq2SeqTrainingArguments

# The model is evaluated only before and after training, not during it,
# so evaluation_strategy is set to "no"
args = Seq2SeqTrainingArguments(
    "/content/th-en_translation_model",
    evaluation_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
)

In [27]:
# define our trainer
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset = tokenized_datasets['train'],
    eval_dataset = tokenized_datasets["validation"],
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metric
)

In [28]:
# Nothing has been fine-tuned yet; first measure how the pretrained model performs on our validation data
trainer.evaluate(max_length = max_length)

***** Running Evaluation *****
  Num examples = 7064
  Batch size = 64


{'eval_loss': 2.5177578926086426,
 'eval_bleu score': 48.867679986792915,
 'eval_runtime': 291.732,
 'eval_samples_per_second': 24.214,
 'eval_steps_per_second': 0.38}

In [29]:
import torch
torch.cuda.empty_cache() # release cached, unused CUDA memory before training

In [30]:
%%time
# With the baseline measured, fine-tune the model on the dataset (%%time must be the first line of the cell)
trainer.train()

***** Running training *****
  Num examples = 63570
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 19870
  Number of trainable parameters = 76039680


Step,Training Loss
500,1.9965
1000,1.9221
1500,1.888
2000,1.8352
2500,1.6295
3000,1.6315
3500,1.5928
4000,1.5905
4500,1.428
5000,1.41


Saving model checkpoint to /content/th-en_translation_model/checkpoint-500
Configuration saved in /content/th-en_translation_model/checkpoint-500/config.json
Model weights saved in /content/th-en_translation_model/checkpoint-500/pytorch_model.bin
tokenizer config file saved in /content/th-en_translation_model/checkpoint-500/tokenizer_config.json
Special tokens file saved in /content/th-en_translation_model/checkpoint-500/special_tokens_map.json
Saving model checkpoint to /content/th-en_translation_model/checkpoint-1000
Configuration saved in /content/th-en_translation_model/checkpoint-1000/config.json
Model weights saved in /content/th-en_translation_model/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in /content/th-en_translation_model/checkpoint-1000/tokenizer_config.json
Special tokens file saved in /content/th-en_translation_model/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to /content/th-en_translation_model/checkpoint-1500
Configuration saved i

CPU times: user 1h 8min 50s, sys: 12min 47s, total: 1h 21min 37s
Wall time: 1h 22min 41s


TrainOutput(global_step=19870, training_loss=1.2861123210290717, metrics={'train_runtime': 4961.5345, 'train_samples_per_second': 128.126, 'train_steps_per_second': 4.005, 'total_flos': 7384951596515328.0, 'train_loss': 1.2861123210290717, 'epoch': 10.0})
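The step count in the training log can be reproduced from the dataset size and the batch size. A minimal sketch of that arithmetic, assuming a single device, no gradient accumulation (as the log reports), a final partial batch that is kept, and a checkpoint every 500 steps as the `checkpoint-500`, `checkpoint-1000` lines show:

```python
import math

num_examples = 63570
batch_size = 32
num_epochs = 10
save_steps = 500        # assumed from the checkpoint names in the log
save_total_limit = 3    # value passed to Seq2SeqTrainingArguments above

# The last batch of each epoch is smaller than 32 but still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which a checkpoint is written, and the ones still on disk at the end.
saved_steps = list(range(save_steps, total_steps + 1, save_steps))
kept_steps = saved_steps[-save_total_limit:]

print(steps_per_epoch, total_steps, kept_steps)
```

Under these assumptions the last checkpoint written is at step 19500, since 19870 is not a multiple of 500.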

In [31]:
# finally, observe performance of the model after training
trainer.evaluate(max_length = max_length)

***** Running Evaluation *****
  Num examples = 7064
  Batch size = 64


{'eval_loss': 1.5220731496810913,
 'eval_bleu score': 49.54953075509462,
 'eval_runtime': 420.2678,
 'eval_samples_per_second': 16.808,
 'eval_steps_per_second': 0.264,
 'epoch': 10.0}

In [33]:
# Judging by the training loss we observed, checkpoint 19500 performed best, so we load it into a translation pipeline
from transformers import pipeline

checkpoint = "/content/th-en_translation_model/checkpoint-19500"
tuned_model = pipeline("translation", model=checkpoint)

loading configuration file /content/th-en_translation_model/checkpoint-19500/config.json
Model config MarianConfig {
  "_name_or_path": "/content/th-en_translation_model/checkpoint-19500",
  "activation_dropout": 0.0,
  "activation_function": "swish",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bad_words_ids": [
    [
      62306
    ]
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 512,
  "decoder_attention_heads": 8,
  "decoder_ffn_dim": 2048,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 62306,
  "decoder_vocab_size": 62307,
  "dropout": 0.1,
  "encoder_attention_heads": 8,
  "encoder_ffn_dim": 2048,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 0,
  "extra_pos_embeddings": 62307,
  "forced_eos_token_id": 0,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "

# Test the fine-tuned model

In [40]:
test_data = [
    "สวัสดีปีใหม่นะครับ ขอให้ปีนี้เป็นปีที่ดีสำหรับทุกคนนะ",
    "ในทุก ๆ วัน หากเราอยู่กับใครมาก ๆ จะเป็นคนในครอบครัว หรือทำงานร่วมกับใครบ่อย ๆ แล้วต้อง “เป็นฝ่ายขอบคุณเสมอ” นั่นหมายความว่าเราเป็นผู้รับจากเขามากกว่า เขาต้องช่วยเหลือเรามากกว่า เลยเถิดไปถึงว่าเราอาจทำอะไรได้ไม่ค่อยดีประจำเขาต้องช่วยประจำ มีศักยภาพน้อยเกินไป ใช่ว่าจะเป็นความผิดเสียทีเดียว",
    "ประเทศไทยมีประชากรประมาณ 60 ล้านคน"
]

In [41]:
# Translate the test sentences (a held-out test split of the dataset would give a better evaluation)
tuned_model(test_data)

[{'translation_text': 'Have a new year. Have a nice year for everybody.'},
 {'translation_text': 'On every day, if we are with one another, family members, or often working with one another, be sure to be "genuine". This means that we are more of them than they are for us. Even if there is no good in us, they are less likely to do so, and are less likely to be able to do so. This is not a sin.'},
 {'translation_text': 'Thailand has a population of 60 billion.'}]

In [42]:
# To get a better model, we would need to train for many more epochs
# Some inputs are extreme cases that need a domain-specific dataset to handle well
extreme_data = "วันมาฆบูชา เป็นวันขึ้น ๑๕ ค่ำ เดือน ๓ มีเหตุการณ์อัศจรรย์ที่ พระสงฆ์สาวกของพระพุทธเจ้าจำนวน ๑,๒๕๐ รูป มาเฝ้าพระพุทธเจ้า ณ วัดเวฬุวัน เมืองราชคฤห์ แคว้นมคธ โดยมิได้นัดหมายกันพระสงฆ์ ทั้งหมดเป็นพระอรหันต์"

In [43]:
# See how the model performs on text full of specialised terms (a larger, related dataset would help here)
tuned_model(extreme_data)

[{'translation_text': 'It is a new moon of the year: a new moon, a new moon, a new moon, a new moon, a day of the year, a day of the year, a day of the year, a day of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year, a year of the year of the year, when all the sun and the moon'}]

# Save the model
- Save the fine-tuned model to Google Drive

In [44]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [46]:
# Pack the model directory into a gzip-compressed tar archive
!tar chvfz th-en_model_10epch.tar.gz "/content/th-en_translation_model"

tar: Removing leading `/' from member names
/content/th-en_translation_model/
/content/th-en_translation_model/runs/
/content/th-en_translation_model/runs/Jan01_11-45-32_1ba8c0635139/
/content/th-en_translation_model/runs/Jan01_11-45-32_1ba8c0635139/events.out.tfevents.1672573828.1ba8c0635139.99.0
/content/th-en_translation_model/runs/Jan01_11-45-32_1ba8c0635139/events.out.tfevents.1672579210.1ba8c0635139.99.2
/content/th-en_translation_model/runs/Jan01_11-45-32_1ba8c0635139/1672573828.8501909/
/content/th-en_translation_model/runs/Jan01_11-45-32_1ba8c0635139/1672573828.8501909/events.out.tfevents.1672573828.1ba8c0635139.99.1
/content/th-en_translation_model/checkpoint-19500/
/content/th-en_translation_model/checkpoint-19500/config.json
/content/th-en_translation_model/checkpoint-19500/rng_state.pth
/content/th-en_translation_model/checkpoint-19500/source.spm
/content/th-en_translation_model/checkpoint-19500/scheduler.pt
/content/th-en_translation_model/checkpoint-19500/special_tokens_
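The same archive can also be built from Python with the standard `tarfile` module, which avoids shelling out. A minimal sketch, assuming a throwaway directory in place of the real model folder; `archive_directory` is a hypothetical helper, and `dereference=True` is meant to mirror the `h` flag of the `tar` command above:

```python
import os
import tarfile
import tempfile

def archive_directory(src_dir, archive_path):
    # Write src_dir into a gzip-compressed tar archive, following symlinks.
    with tarfile.open(archive_path, "w:gz", dereference=True) as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    return archive_path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the model directory, holding a single placeholder file.
    model_dir = os.path.join(tmp, "demo_model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")

    archive = archive_directory(model_dir, os.path.join(tmp, "demo_model.tar.gz"))
    with tarfile.open(archive, "r:gz") as tar:
        member_names = sorted(tar.getnames())

print(member_names)
```

Passing `arcname` stores paths relative to the model folder instead of the absolute `/content/...` path.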

In [48]:
# Copy the archive to Google Drive (it can take up to about a minute before the file appears there)
!cp -r "/content/th-en_model_10epch.tar.gz" "/content/gdrive/MyDrive/AIML/trained_model"

In [None]:
@InProceedings{TIEDEMANN12.463,
  author = {Jörg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
 }