# 🤖 Model Training & Evaluation

## 🎯 Objective

## 📈 Evaluation

In this project, we use three popular metrics (BLEU-4, ROUGE-L, and BERTScore) to assess the model's performance.

**Inputs:**

- Reference captions: one or more ground-truth captions per sample.

- Hypothesis caption: the caption generated by the model.

In [45]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

### 📌 BLEU-4 (Bilingual Evaluation Understudy)

- BLEU measures n-gram overlap between the generated caption and the reference captions.

- BLEU-4 combines the precisions of 1-grams through 4-grams, so it rewards both correct word choice and correct word order.

General BLEU formula:

$$ BLEU = BP \times \exp \left( \sum_{n=1}^{4} w_n \log p_n \right) $$

Where:
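- $BP$ (brevity penalty) penalizes captions that are shorter than the reference: $BP = 1$ if $c > r$, otherwise $BP = \exp(1 - r/c)$, where $c$ is the hypothesis length and $r$ is the reference length.

- $p_n$ is the modified (clipped) n-gram precision: the fraction of hypothesis n-grams that also occur in a reference.

- $w_n$ is the weight of each n-gram order; BLEU-4 uses uniform weights $w_n = 1/4$ for $n = 1, \dots, 4$.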
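
To make the formula concrete, here is a minimal pure-Python sketch of sentence-level BLEU-4 with uniform weights and no smoothing. The helper names (`ngram_counts`, `modified_precision`, `brevity_penalty`, `bleu4`) are illustrative assumptions and are not part of NLTK. The sketch returns 0 whenever some $p_n$ is zero, which is the situation smoothing functions are meant to handle.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision p_n of one hypothesis against several references."""
    hyp_counts = ngram_counts(hypothesis, n)
    if not hyp_counts:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the single reference where it is most frequent.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


def brevity_penalty(references, hypothesis):
    """BP = 1 if the hypothesis is longer than the closest reference, else exp(1 - r/c)."""
    c = len(hypothesis)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu4(references, hypothesis):
    """Unsmoothed BLEU-4 with uniform weights w_n = 1/4."""
    precisions = [modified_precision(references, hypothesis, n) for n in range(1, 5)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 here
    log_mean = sum(0.25 * math.log(p) for p in precisions)
    return brevity_penalty(references, hypothesis) * math.exp(log_mean)


refs = ["the cat sits on the mat".split()]
hyp = "the cat sits on a mat".split()
print(f"{bleu4(refs, hyp):.4f}")  # 0.5373
```

For this toy pair the precisions are $p_1 = 5/6$, $p_2 = 3/5$, $p_3 = 2/4$, $p_4 = 1/3$ and $BP = 1$, so the score is $(1/12)^{1/4} \approx 0.5373$.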


To compute it, we use the `nltk.translate.bleu_score` module.

In [None]:
!pip install nltk

Successfully installed click-8.1.8 nltk-3.9.1 regex-2024.11.6


In [4]:
import nltk
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Ensure required NLTK packages are downloaded
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [30]:
def compute_bleu(references, hypotheses, total=True):
    """
    Compute BLEU-4 score.
    :param references: List of reference sets, one per hypothesis; each set is a list of tokenized captions
    :param hypotheses: List of tokenized hypothesis captions
                       (a caption must be a list of tokens, e.g. caption.split();
                       NLTK reads a raw string as a sequence of characters)
    :param total: If True, compute one corpus-level BLEU score over all hypotheses
                  If False, compute a separate BLEU score for each hypothesis against its own references
    :return: Corpus-level BLEU-4 score, or a list of BLEU-4 scores (one per hypothesis)
    """
    smoothie = SmoothingFunction().method1
    if total:
        # Compute BLEU score for all hypotheses against all references
        score = corpus_bleu(references, hypotheses, smoothing_function=smoothie)
        return score

    bleu_scores = []
    for ref, hyp in zip(references, hypotheses):
        # Compute BLEU score for each hypothesis against its reference
        score = corpus_bleu([ref], [hyp], smoothing_function=smoothie)
        bleu_scores.append(score)
    return bleu_scores

In [35]:
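# NOTE: the captions below are raw strings, not token lists, so NLTK counts
# character n-grams and the scores printed here are character-level BLEU.
text_references = [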
    ["vận_động_viên chuẩn_bị phát bóng", "cầu_thủ chuẩn_bị đá bóng"],
    ["cầu_thủ đang chạy", "vận_động_viên đang chạy"]
]
text_hypotheses = [
    "cầu_thủ chuẩn_bị phát bóng",
    "vận_động_viên đang chạy"
]
print("Total BLEU-4 score:")
bleu_score = compute_bleu(text_references, text_hypotheses)
print(f"BLEU-4 score: {bleu_score:.4f}")

print("\nIndividual BLEU-4 scores:")
bleu_scores = compute_bleu(text_references, text_hypotheses, total=False)
for i, score in enumerate(bleu_scores):
    print(f"Hypothesis {i+1}: BLEU-4 score: {score:.4f}")


Total BLEU-4 score:
BLEU-4 score: 0.9896

Individual BLEU-4 scores:
Hypothesis 1: BLEU-4 score: 0.9802
Hypothesis 2: BLEU-4 score: 1.0000
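
NLTK's BLEU functions treat each caption as a sequence of tokens, so a raw Python string is read one character at a time. The sketch below uses a hypothetical `ngrams` helper (not an NLTK call) to show how the 4-grams of a raw caption differ from those of the same caption split on whitespace. It assumes the captions are already word-segmented, as the underscore-joined words suggest.

```python
def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


caption = "cầu_thủ chuẩn_bị phát bóng"

char_4grams = ngrams(caption, 4)          # iterating a str yields characters
word_4grams = ngrams(caption.split(), 4)  # iterating a list yields words

print(len(char_4grams), char_4grams[0])
print(len(word_4grams), word_4grams[0])
```

Passing `caption.split()` for every hypothesis and every reference gives word-level scores, which will generally be lower than character-level ones for short captions.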


### 📌 ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation)

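- ROUGE-L scores a caption by the longest common subsequence (LCS) it shares with a reference.

ROUGE-L formula:

$$ R = \frac{LCS(reference, hypothesis)}{|reference|}, \qquad P = \frac{LCS(reference, hypothesis)}{|hypothesis|}, \qquad ROUGE\text{-}L = \frac{2PR}{P + R} $$

Where:
- Recall $R$: LCS length divided by the number of words in the reference.
- Precision $P$: LCS length divided by the number of words in the hypothesis.
- F1-score: harmonic mean of recall and precision; this is the value (`fmeasure`) reported below.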
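
As a worked example, the following pure-Python sketch computes ROUGE-L on whitespace tokens with a standard dynamic-programming LCS. The function names `lcs_length` and `rouge_l` are assumptions made for illustration; this is not the `rouge_score` implementation, and it does no stemming or normalization.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, hypothesis):
    """Return (precision, recall, F1) of ROUGE-L for whitespace-tokenized strings."""
    ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
    lcs = lcs_length(ref_tokens, hyp_tokens)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(hyp_tokens)
    recall = lcs / len(ref_tokens)
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f1 = rouge_l("cầu_thủ chuẩn_bị đá bóng", "cầu_thủ chuẩn_bị phát bóng")
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.75 R=0.75 F1=0.75
```

Here the LCS is `cầu_thủ chuẩn_bị bóng` (3 tokens) and both captions have 4 tokens, so precision, recall, and F1 are all 3/4.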

In [9]:
!pip install rouge_score

Successfully installed absl-py-2.2.1 rouge_score-0.1.2


In [18]:
from rouge_score import rouge_scorer

In [33]:
def compute_rouge(references, hypotheses, total=True):
    """
    Compute ROUGE-L score.
    :param references: List of reference sets, one per hypothesis; each set is a list of caption strings
    :param hypotheses: List of hypothesis caption strings
    :param total: If True, return the mean ROUGE-L F1 over all hypotheses
                  If False, return the ROUGE-L F1 of each hypothesis
    :return: Mean ROUGE-L score or list of ROUGE-L scores for each hypothesis
    """
    # NOTE: use_stemmer=True applies the Porter stemmer, which is designed for English.
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

    # For each hypothesis, keep the best F1 over its references
    rouge_scores = [
        max(scorer.score(ref, hyp)['rougeL'].fmeasure for ref in refs)
        for refs, hyp in zip(references, hypotheses)
    ]
    if total:
        return sum(rouge_scores) / len(rouge_scores)
    return rouge_scores

In [34]:
text_references = [
    ["vận_động_viên chuẩn_bị phát bóng", "cầu_thủ chuẩn_bị đá bóng"],
    ["cầu_thủ đang chạy", "vận_động_viên đang chạy"]
]
text_hypotheses = [
    "cầu_thủ chuẩn_bị phát bóng",
    "vận_động_viên đang chạy"
]

print("Total ROUGE-L score:")
rouge_score = compute_rouge(text_references, text_hypotheses)
print(f"ROUGE-L score: {rouge_score:.4f}")

print("\nIndividual ROUGE-L scores:")
rouge_scores = compute_rouge(text_references, text_hypotheses, total=False)
for i, score in enumerate(rouge_scores):
    print(f"Hypothesis {i+1}: ROUGE-L score: {score:.4f}")

Total ROUGE-L score:
ROUGE-L score: 0.9444

Individual ROUGE-L scores:
Hypothesis 1: ROUGE-L score: 0.8889
Hypothesis 2: ROUGE-L score: 1.0000
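
One thing to check before trusting these numbers: as far as we can tell, the default tokenizer in `rouge_score` lowercases the text and keeps only ASCII letters and digits, which would split Vietnamese words at every accented character. The regular expression below is an approximation of that behavior written for illustration (the helper `ascii_only_tokens` is hypothetical and does not call the library).

```python
import re


def ascii_only_tokens(text):
    """Approximate an ASCII-only tokenizer: lowercase, then split on any
    character that is not a-z or 0-9 (assumed behavior, for illustration)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


caption = "cầu_thủ chuẩn_bị đá bóng"
print(caption.split())             # whitespace tokens keep whole words
print(ascii_only_tokens(caption))  # accented letters act as separators
```

Under this assumption the first hypothesis breaks into 10 fragments and its second reference into 8, all of which appear in order in the hypothesis, giving precision 8/10, recall 8/8, and an F1 of 0.8889, which matches the score shown above. If the installed version behaves this way, supplying a whitespace tokenizer through the `tokenizer` argument of `RougeScorer` and dropping `use_stemmer=True` would be worth trying; both points should be verified against the library version in use.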


### 📌 BERTScore (Bidirectional Encoder Representations from Transformers Score)

- BERTScore evaluates semantic similarity by comparing contextualized embeddings from a transformer-based model.

- Instead of exact word matching, BERTScore considers deep contextual meaning.

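- It calculates:
    + Precision: each hypothesis token is matched to its most similar reference token (cosine similarity of embeddings), and the similarities are averaged.
    + Recall: each reference token is matched to its most similar hypothesis token, and the similarities are averaged.
    + F1-score: Harmonic mean of Precision and Recall.

- BERTScore can be useful for Vietnamese captions, because it gives credit to paraphrases and synonyms that exact-match metrics miss.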
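
The matching step can be sketched in plain Python. The example below uses made-up 2-D vectors in place of real contextual embeddings and omits the optional IDF weighting and baseline rescaling, so it only illustrates the greedy matching idea; `cosine` and `greedy_match_scores` are hypothetical helper names, not part of `bert_score`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_match_scores(ref_vecs, hyp_vecs):
    """Precision, recall and F1 from greedy token matching (no IDF weighting)."""
    # Precision: each hypothesis token takes its best match in the reference.
    precision = sum(max(cosine(h, r) for r in ref_vecs) for h in hyp_vecs) / len(hyp_vecs)
    # Recall: each reference token takes its best match in the hypothesis.
    recall = sum(max(cosine(r, h) for h in hyp_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy 2-D "embeddings" (made-up numbers, not real model outputs).
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
hyp_vecs = [[1.0, 0.0], [1.0, 1.0]]

p, r, f1 = greedy_match_scores(ref_vecs, hyp_vecs)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Because the second hypothesis vector sits halfway between the two reference vectors, it earns partial credit (a cosine of about 0.71) instead of the zero an exact-match metric would give.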

For the implementation, we use `evaluate.load("bertscore")`.

In [39]:
!pip install evaluate
!pip install bert_score


In [40]:
from evaluate import load

# Load BERTScore
bertscore = load("bertscore")

In [48]:
def compute_bertscore(references, hypotheses, lang="vi", total=True):
    """
    Compute BERTScore.
    :param references: List of reference sets, one per hypothesis; each set is a list of caption strings
    :param hypotheses: List of hypothesis caption strings
    :param lang: Language for BERTScore (default is Vietnamese "vi")
    :param total: If True, return the mean F1 over all hypotheses
                  If False, return the F1 of each hypothesis
    :return: Mean BERTScore F1, or a list of F1 scores (one per hypothesis)
    Note: only the longest reference of each set is scored against its hypothesis.
    """
    scores = []
    for refs, hyp in zip(references, hypotheses):
        best_ref = max(refs, key=len)  # Choose the longest reference per hypothesis
        score = bertscore.compute(predictions=[hyp], references=[best_ref], lang=lang, device="cpu")['f1'][0]
        scores.append(score)
    if total: 
        return sum(scores) / len(scores)
    return scores

In [49]:
text_references = [
    ["vận_động_viên chuẩn_bị phát bóng", "cầu_thủ chuẩn_bị đá bóng"],
    ["cầu_thủ đang chạy", "vận_động_viên đang chạy"]
]
text_hypotheses = [
    "cầu_thủ chuẩn_bị phát bóng",
    "vận_động_viên đang chạy"
]

print("Total BERTScore:")
bertscore_score = compute_bertscore(text_references, text_hypotheses)
print(f"BERTScore: {bertscore_score:.4f}")

print("\nIndividual BERTScores:")
bertscore_scores = compute_bertscore(text_references, text_hypotheses, total=False)
for i, score in enumerate(bertscore_scores):
    print(f"Hypothesis {i+1}: BERTScore: {score:.4f}")

Total BERTScore:
BERTScore: 0.9549

Individual BERTScores:
Hypothesis 1: BERTScore: 0.9099
Hypothesis 2: BERTScore: 1.0000
