# Understanding Our Training Strategy: Why We Use Checkpoints
**Efficient Resource Management:** Training an entire model in one uninterrupted run can take a very long time and may not fit within a single GPU session. By checkpointing, we can train in smaller segments and make better use of whatever compute is available for each one.

**Seamless Resumption:** Training can be interrupted by a power outage, a system crash, or simply the need to free up the GPU for another task. Without checkpoints, all progress since the start of the run would be lost. With them, we resume from the last saved point and lose only the steps taken after that save.

**Better Monitoring & Adjustment:** Checkpoints save the model's entire state at specific points in time. This means we can evaluate the model at those points and adjust the training setup before continuing.

**Enhanced Stability:** For very deep or complex neural networks, long training runs can sometimes become unstable (e.g., exploding or vanishing gradients). Checkpoints give us natural breakpoints to check that training is still healthy, and a saved state to return to if it is not.

# How We Implement Checkpointing
The process is straightforward and is a standard practice in most deep learning workflows:

**Saving the Model's State:** At predefined intervals (e.g., after every 'X' epochs, or every 'Y' batches), we systematically save the model's current weights, the optimizer's state (which is crucial for continuing training properly), and any relevant hyperparameters or training configurations.

**Loading the State:** When we need to continue training, we simply load the saved state from our latest checkpoint. This effectively brings the model and its entire training progression back to life exactly where it left off.
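
The two steps above can be illustrated without any deep learning framework. The following is a minimal sketch, not the code used in this notebook: the "model" is a list of two numbers, the "optimizer update" just adds 0.5, and the names `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical helpers invented for this example. It only shows the save-at-intervals and resume-from-last-save pattern.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, weights):
    """Write the training state to a temp file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)
    os.replace(tmp_path, path)  # avoids leaving a half-written checkpoint


def load_checkpoint(path):
    """Return (step, weights) from disk, or a fresh state if nothing was saved."""
    if not os.path.exists(path):
        return 0, [0.0, 0.0]
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["weights"]


def train(path, total_steps, save_every, stop_after=None):
    """Toy loop: resume from the last checkpoint and save every `save_every` steps."""
    step, weights = load_checkpoint(path)
    while step < total_steps:
        weights = [w + 0.5 for w in weights]  # stand-in for a real optimizer update
        step += 1
        if step % save_every == 0:
            save_checkpoint(path, step, weights)
        if stop_after is not None and step >= stop_after:
            break  # simulate an interruption
    return step, weights


ckpt_path = os.path.join(tempfile.mkdtemp(), "toy_checkpoint.json")
train(ckpt_path, total_steps=10, save_every=2, stop_after=5)  # interrupted at step 5
resumed_step, _ = load_checkpoint(ckpt_path)                  # last save was at step 4
final_step, final_weights = train(ckpt_path, total_steps=10, save_every=2)
print(resumed_step, final_step, final_weights)  # 4 10 [5.0, 5.0]
```

The interrupted run loses only step 5, because the last save happened at step 4. The second call picks up from there and reaches the same final state an uninterrupted run would have produced.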

Most modern deep learning frameworks, such as TensorFlow and PyTorch, provide built-in functions for saving and loading these checkpoints. This makes our training process much more resilient and flexible.

In [None]:
!pip install matplotlib torch torchvision
!pip install pandas numpy
!pip install transformers datasets accelerate sentencepiece evaluate sacrebleu
!pip install tqdm
!pip install langdetect fsspec==2025.3.0
!pip install -U transformers[torch] accelerate datasets

In [None]:
!pip install gdown
# Recent gdown versions take the file ID directly; the --id flag is deprecated
!gdown 1dh7uWJ8GnHbb-2aWpUuB3VCE5L4gRaCw

In [None]:
import os

import nltk
import numpy as np
import pandas as pd
import torch
from datasets import Dataset, load_from_disk
from huggingface_hub import login
from nltk.translate.bleu_score import sentence_bleu
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    Trainer,
    TrainingArguments,
)
from transformers.integrations import TensorBoardCallback

nltk.download("punkt")

In [None]:
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
import pandas as pd

trainDf = pd.read_csv('train.tsv',  sep='\t', names = ["src_lang", "src", "tgt"])
print(trainDf.head())
print(f'Length: {len(trainDf)}')

In [None]:
trainDf = trainDf.dropna(subset=["src_lang", "src", "tgt"])
trainDf = trainDf[
    (trainDf["src"].str.strip() != "") &
    (trainDf["tgt"].str.strip() != "") &
    (trainDf["src"].str.len() > 5) &
    (trainDf["tgt"].str.len() > 5)
]  
trainDf = trainDf.reset_index(drop=True)

In [None]:
# from datasets import Dataset
from sklearn.model_selection import train_test_split

# Step 1: split into 80% train and 20% held out (temp)
train_df, temp_df = train_test_split(trainDf, test_size=0.2, random_state=42)

# Step 2: split temp in half, giving 10% valid and 10% eval (test)
valid_df, eval_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Convert to HuggingFace Dataset
train_ds = Dataset.from_pandas(train_df)
valid_ds = Dataset.from_pandas(valid_df)
eval_ds  = Dataset.from_pandas(eval_df)

print("✅ train:", len(train_ds))
print("✅ valid:", len(valid_ds))
print("✅ eval: ", len(eval_ds))
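
To make the 80/10/10 proportions of the two-stage split concrete, here is a small standard-library sketch. It is only a stand-in for `train_test_split`: the function name `two_stage_split` is hypothetical, and the shuffling and rounding will not match scikit-learn's results row for row.

```python
import random


def two_stage_split(rows, seed=42):
    """Shuffle once, then cut into 80% train, 10% valid, 10% eval."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_temp = len(rows) // 5              # hold out 20%
    temp, train = rows[:n_temp], rows[n_temp:]
    n_eval = len(temp) // 2              # half of the held-out part
    eval_rows, valid = temp[:n_eval], temp[n_eval:]
    return train, valid, eval_rows


train_rows, valid_rows, eval_rows = two_stage_split(range(1000))
print(len(train_rows), len(valid_rows), len(eval_rows))  # 800 100 100
```

Every row ends up in exactly one of the three parts, so the validation and evaluation sets never overlap with the training data.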

In [None]:
for param in model.model.encoder.parameters():
    param.requires_grad = False

for name, param in model.model.encoder.named_parameters():
    if param.requires_grad:
        print(f"{name} is trainable")
    else:
        print(f"{name} is frozen")

In [None]:
import re

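def clean_text(text):
    # Replace runs of characters outside ASCII letters, the À-ỹ range
    # (which covers Vietnamese diacritics), digits, whitespace, and basic
    # punctuation with a single space.
    text = re.sub(r"[^a-zA-ZÀ-ỹ0-9\s.,?!:;()-]+", " ", text)
    # Collapse repeated whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove whitespace that comes right before punctuation.
    text = re.sub(r"\s+([.,?!:;()-])", r"\1", text)
    # Strip parentheses at the very start or end of the text.
    text = re.sub(r"^[()]+|[()]+$", "", text).strip()
    return text
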
def preprocess(example):
    try:
        tokenizer.src_lang = example["src_lang"]
        tokenizer.tgt_lang = "vie_Latn"

        encoded = tokenizer(
            clean_text(example["src"]),
            text_target=clean_text(example["tgt"]),
            truncation=True,
            padding="max_length",
            max_length=300
        )

        return encoded

    except Exception as e:
        print("❌ Error:", e)
        print("Example:", example)
        return None
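
A few sample inputs help show what `clean_text` does before the text reaches the tokenizer. The function body below is copied from the cell above so the snippet runs on its own. The sample strings are invented for illustration and are not taken from the dataset.

```python
import re


def clean_text(text):
    text = re.sub(r"[^a-zA-ZÀ-ỹ0-9\s.,?!:;()-]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"\s+([.,?!:;()-])", r"\1", text)
    text = re.sub(r"^[()]+|[()]+$", "", text).strip()
    return text


samples = [
    "Hello ,   world !",          # stray spaces before punctuation
    "Price: 100$ #sale",          # symbols outside the allowed set
    "(Hello world)",              # wrapping parentheses
    "call me (maybe) later",      # parenthesis in the middle of the text
]
cleaned = [clean_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Note the last sample: because `(` is part of the punctuation group in the third substitution, the space before an opening parenthesis is removed as well, giving `call me(maybe) later`. Whether that matters depends on the data, so it may be worth checking on a handful of real rows.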

In [None]:

if os.path.exists("./nllb_train_processed") and os.path.exists("./nllb_valid_processed") and os.path.exists("./nllb_eval_processed"):
    train_ds = load_from_disk("./nllb_train_processed")
    valid_ds = load_from_disk("./nllb_valid_processed")
    eval_ds = load_from_disk("./nllb_eval_processed")
else:
    # print("Error")
    train_ds = train_ds.map(preprocess, remove_columns=train_ds.column_names)
    valid_ds = valid_ds.map(preprocess, remove_columns=valid_ds.column_names)
    eval_ds = eval_ds.map(preprocess, remove_columns=eval_ds.column_names)

    train_ds.save_to_disk("./nllb_train_processed")
    valid_ds.save_to_disk("./nllb_valid_processed")
    eval_ds.save_to_disk("./nllb_eval_processed")

print(len(train_ds))
print(len(valid_ds))
print(len(eval_ds))


In [None]:
tensorboard_callback = TensorBoardCallback()

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./nllb200-finetuned-checkpoints",
    logging_dir="./logs_file",
    
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,

    logging_steps=1000,
    eval_strategy="steps",  # named evaluation_strategy in transformers < 4.41
    # num_train_epochs=4,
    learning_rate=2e-5,
    report_to="none",
    save_total_limit=2,

    predict_with_generate=False,
    greater_is_better=False,
    remove_unused_columns=False,
    load_best_model_at_end=False,
    fp16=True,

    max_steps=20000,
    save_steps=500,
    eval_steps=1000,
    save_strategy="steps",
    
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    tokenizer=tokenizer,
    callbacks=[tensorboard_callback],
)
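
The step-based settings above determine how often checkpoints are written and how much work can be lost. The arithmetic below is a rough sketch using the values from the arguments. The device count and gradient accumulation are assumptions (one GPU, no accumulation), since the notebook does not state them.

```python
# Values copied from the Seq2SeqTrainingArguments above.
max_steps = 20000
save_steps = 500
eval_steps = 1000
save_total_limit = 2
per_device_train_batch_size = 3

# Assumptions for illustration only: a single GPU and no gradient accumulation.
num_devices = 1
gradient_accumulation_steps = 1

checkpoints_written = max_steps // save_steps                   # saves over the full run
checkpoints_kept = min(save_total_limit, checkpoints_written)   # older ones are deleted
evaluations_run = max_steps // eval_steps
examples_seen = (
    max_steps * per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
max_steps_lost = save_steps - 1  # worst case: interruption just before the next save

print(checkpoints_written, checkpoints_kept, evaluations_run, examples_seen, max_steps_lost)
# 40 2 20 60000 499
```

With `save_total_limit=2`, only the two most recent checkpoint folders remain on disk at any time, which keeps storage use bounded at the cost of not being able to roll back further.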

In [None]:
import os
last_checkpoint = None

if os.path.isdir("./nllb200-finetuned-checkpoints"):
    checkpoints = [os.path.join("./nllb200-finetuned-checkpoints", d) for d in os.listdir("./nllb200-finetuned-checkpoints") if d.startswith("checkpoint")]

    if checkpoints:
        last_checkpoint = sorted(checkpoints, key = lambda x : int(x.split("-")[-1]))[-1]

if last_checkpoint:
    print(f"Resuming from {last_checkpoint}")
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    print("Training from scratch")
    trainer.train()
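
The checkpoint lookup in the cell above can be sketched as a reusable function. This version is an illustrative variant, not the notebook's code: `find_last_checkpoint` is a hypothetical name, and it uses a regular expression so that folders such as `checkpoint-tmp` are skipped instead of raising a `ValueError` in `int(...)`. The `transformers` library also ships a similar helper, `transformers.trainer_utils.get_last_checkpoint`.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    candidates = []
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    _, newest = max(candidates)  # compare by step number, not by string
    return os.path.join(output_dir, newest)


# Demo on a throwaway directory with made-up folder names.
demo_dir = tempfile.mkdtemp()
for name in ["checkpoint-500", "checkpoint-9500", "checkpoint-10000", "checkpoint-tmp", "runs"]:
    os.makedirs(os.path.join(demo_dir, name))

last = find_last_checkpoint(demo_dir)
print(os.path.basename(last))  # checkpoint-10000
```

Comparing the step as an integer matters: a plain string sort would rank `checkpoint-9500` after `checkpoint-10000` and resume from the wrong place.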

In [None]:
import os

os.environ['HF_TOKEN'] = 'MY_TOKEN'
login(token=os.getenv('HF_TOKEN'))

In [None]:
last_checkpoint = None

if os.path.isdir("./nllb200-finetuned-checkpoints"):
    checkpoints = [os.path.join("./nllb200-finetuned-checkpoints", d) for d in os.listdir("./nllb200-finetuned-checkpoints") if d.startswith("checkpoint")]

    if checkpoints:
        last_checkpoint = sorted(checkpoints, key = lambda x : int(x.split("-")[-1]))[-1]

if last_checkpoint:
    print(f"Load {last_checkpoint}")
    # last_checkpoint already includes the output directory, so use it as-is
    modelPath = last_checkpoint
    
    model = AutoModelForSeq2SeqLM.from_pretrained(modelPath)
    tokenizer = AutoTokenizer.from_pretrained(modelPath, tgt_lang="vie_Latn")

    model.push_to_hub("phuckhangne/nllb-200-600M-GDGoC-AI-Challenge")
    tokenizer.push_to_hub("phuckhangne/nllb-200-600M-GDGoC-AI-Challenge")
else:
    print('No checkpoint found, nothing to upload')