# Demo: Evaluation Metrics for Vietnamese Cultural VQA



This notebook demonstrates how to use two evaluation metrics:
1. **BERTScore** - Semantic similarity
2. **LLM-as-a-Judge** - Expert scoring with the Gemini API
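
As a rough illustration of what the first metric measures: BERTScore matches each token embedding in the prediction to its most similar token embedding in the reference (and the other way round) by cosine similarity, then combines the two averages into an F1 score. The sketch below uses hand-written 2-D toy vectors instead of real BERT embeddings, and it leaves out IDF weighting and baseline rescaling. The names `cosine` and `greedy_f1` are hypothetical and are not part of `evaluation_metrics`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_f1(pred_vecs, ref_vecs):
    """BERTScore-style F1 over token embeddings (no IDF weighting, no rescaling)."""
    # Precision: each predicted token is matched to its closest reference token.
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token is matched to its closest predicted token.
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy embeddings (assumed values, for illustration only).
pred = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(round(greedy_f1(pred, ref), 4))
```

Here every predicted token has an exact match in the reference, so precision is 1, while the third reference token is only partially covered, which lowers recall and therefore F1.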

## 1. Setup

In [None]:
# Install dependencies (uncomment if not yet installed)
# !pip install bert-score google-generativeai numpy tqdm matplotlib seaborn

In [None]:
import os
import json
from evaluation_metrics import (
    BERTScoreEvaluator,
    LLMJudgeEvaluator,
    VQAEvaluator
)

# Set API key (IMPORTANT!)
# Get a free API key at: https://makersuite.google.com/app/apikey
os.environ["GEMINI_API_KEY"] = "YOUR_API_KEY_HERE"

## 2. Demo BERTScore

In [None]:
# Initialize BERTScore evaluator
bert_evaluator = BERTScoreEvaluator(
    model_name="bert-base-multilingual-cased",
    device="cuda"  # Switch to "cpu" if no GPU is available
)

In [None]:
# Test case 1: near-identical answer
prediction1 = "Chùa Một Cột được xây dựng vào năm 1049 dưới triều Lý."
ground_truth1 = "Chùa Một Cột, triều Lý, năm 1049."

score1 = bert_evaluator.evaluate_single(prediction1, ground_truth1)
print("Test 1 - Good answer:")
print(f"  Prediction: {prediction1}")
print(f"  Ground Truth: {ground_truth1}")
print(f"  BERTScore: {score1:.4f}\n")

In [None]:
# Test case 2: wrong answer
prediction2 = "Đây là Văn Miếu, được xây dựng năm 1070."
ground_truth2 = "Chùa Một Cột, triều Lý, năm 1049."

score2 = bert_evaluator.evaluate_single(prediction2, ground_truth2)
print("Test 2 - Wrong answer:")
print(f"  Prediction: {prediction2}")
print(f"  Ground Truth: {ground_truth2}")
print(f"  BERTScore: {score2:.4f}\n")

In [None]:
# Test batch evaluation
predictions = [
    "Vịnh Hạ Long là di sản thiên nhiên thế giới",
    "Phố cổ Hội An được UNESCO công nhận năm 1999",
    "Văn Miếu là trường đại học đầu tiên của Việt Nam"
]

references = [
    "Vịnh Hạ Long được UNESCO công nhận là di sản thế giới năm 1994",
    "Hội An là di sản văn hóa thế giới từ năm 1999",
    "Văn Miếu - Quốc Tử Giám, trường đại học đầu tiên VN, thành lập 1076"
]

scores = bert_evaluator.evaluate_batch(predictions, references)
stats = bert_evaluator.get_statistics(scores)

print("Batch Evaluation Results:")
for i, (pred, ref, score) in enumerate(zip(predictions, references, scores), 1):
    print(f"\n{i}. Score: {score:.4f}")
    print(f"   Pred: {pred}")
    print(f"   Ref:  {ref}")

print(f"\nStatistics:")
print(f"  Mean: {stats['mean']:.4f}")
print(f"  Std:  {stats['std']:.4f}")

## 3. Demo LLM-as-a-Judge

In [None]:
# Initialize LLM Judge
llm_evaluator = LLMJudgeEvaluator(
    api_key=None,  # Will use GEMINI_API_KEY from environment
    model_name="gemini-1.5-flash"  # Free tier model
)

In [None]:
# Test case 1: excellent answer
question1 = "Địa điểm trong ảnh có phải di sản thiên nhiên thế giới không?"
prediction1 = "Có, đây là Vịnh Hạ Long, một di sản thiên nhiên thế giới được UNESCO công nhận năm 1994."
ground_truth1 = "Có, Vịnh Hạ Long là di sản thiên nhiên thế giới, được UNESCO công nhận năm 1994."

score1, reasoning1 = llm_evaluator.evaluate_single(question1, prediction1, ground_truth1)

print("=" * 80)
print("Test 1: Excellent answer")
print("=" * 80)
print(f"Question: {question1}")
print(f"\nPrediction: {prediction1}")
print(f"\nGround Truth: {ground_truth1}")
print(f"\n📊 Score: {score1}/5")
print(f"💭 Reasoning: {reasoning1}")
print()

In [None]:
# Test case 2: answer containing a hallucination (wrong year and wrong king)
question2 = "Chùa Một Cột được xây dựng vào thời nào?"
prediction2 = "Chùa Một Cột được xây dựng vào năm 1076 dưới thời vua Lý Nhân Tông."
ground_truth2 = "Chùa Một Cột được xây dựng năm 1049 dưới thời vua Lý Thái Tông."

score2, reasoning2 = llm_evaluator.evaluate_single(question2, prediction2, ground_truth2)

print("=" * 80)
print("Test 2: Answer with errors (hallucination)")
print("=" * 80)
print(f"Question: {question2}")
print(f"\nPrediction: {prediction2}")
print(f"\nGround Truth: {ground_truth2}")
print(f"\n📊 Score: {score2}/5")
print(f"💭 Reasoning: {reasoning2}")
print()

## 4. Demo Combined Evaluator

In [None]:
# Initialize combined evaluator
evaluator = VQAEvaluator(
    use_bert_score=True,
    use_llm_judge=True,
    gemini_api_key=None,  # Use env var
    bert_model="bert-base-multilingual-cased",
    device="cuda"
)

In [None]:
# Evaluate single sample
result = evaluator.evaluate_single(
    image_id="001234",
    question="Đây là di tích gì?",
    prediction="Đây là Văn Miếu - Quốc Tử Giám, được xây dựng năm 1070.",
    ground_truth="Văn Miếu - Quốc Tử Giám, thành lập năm 1070 dưới triều Lý."
)

print("=" * 80)
print("EVALUATION RESULT")
print("=" * 80)
print(f"Image ID: {result.image_id}")
print(f"Question: {result.question}")
print(f"\nPredicted: {result.predicted_answer}")
print(f"\nGround Truth: {result.ground_truth}")
print("\n📊 Metrics:")
print(f"  - BERTScore: {result.bert_score:.4f}")
print(f"  - LLM Judge: {result.llm_judge_score}/5")
print(f"\n💭 LLM Reasoning: {result.llm_judge_reasoning}")

## 5. Evaluate Full Dataset

To run on the full dataset, prepare a JSON file in the following format:
```json
[
  {
    "image_id": "001234",
    "question": "...",
    "prediction": "...",
    "ground_truth": "..."
  }
]
```
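
Before a long evaluation run it can help to check the file against this format. The following is a minimal sketch, assuming all four fields are required non-empty strings (the format above does not say whether any field is optional). `validate_predictions` and `REQUIRED_KEYS` are hypothetical helpers, not part of `evaluation_metrics`.

```python
import json

REQUIRED_KEYS = ("image_id", "question", "prediction", "ground_truth")

def validate_predictions(records):
    """Return a list of human-readable problems; an empty list means no problems were found."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list"]
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index}: expected an object")
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            # Assumption: every field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {index}: missing or empty '{key}'")
    return problems

# Second record deliberately lacks "prediction" and "ground_truth".
sample = json.loads(
    '[{"image_id": "001234", "question": "q", "prediction": "p", "ground_truth": "g"},'
    ' {"image_id": "001235", "question": "q"}]'
)
for problem in validate_predictions(sample):
    print(problem)
```

Collecting every problem instead of stopping at the first one makes it easier to repair a large file in a single pass.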

In [None]:
# Create sample predictions file for demo
sample_data = [
    {
        "image_id": "001234",
        "question": "Đây là di tích gì?",
        "prediction": "Chùa Một Cột được xây dựng năm 1049 dưới triều Lý.",
        "ground_truth": "Chùa Một Cột, triều Lý, năm 1049."
    },
    {
        "image_id": "001235",
        "question": "Địa điểm này có phải di sản thế giới không?",
        "prediction": "Có, Vịnh Hạ Long là di sản thiên nhiên thế giới.",
        "ground_truth": "Có, được UNESCO công nhận năm 1994."
    },
    {
        "image_id": "001236",
        "question": "Văn Miếu được xây dựng khi nào?",
        "prediction": "Văn Miếu được thành lập năm 1070.",
        "ground_truth": "Văn Miếu - Quốc Tử Giám thành lập năm 1070 dưới triều Lý."
    }
]

# Save to file
with open("sample_predictions.json", "w", encoding="utf-8") as f:
    json.dump(sample_data, f, ensure_ascii=False, indent=2)

print("✓ Created sample_predictions.json")

In [None]:
# Evaluate entire dataset
stats = evaluator.evaluate_dataset(
    predictions_file="sample_predictions.json",
    output_file="evaluation_results.json",
    llm_judge_delay=2.0  # 2 seconds delay to avoid rate limit
)

print("\n" + "="*80)
print("FINAL STATISTICS")
print("="*80)
print(json.dumps(stats, indent=2, ensure_ascii=False))
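
The `llm_judge_delay` argument spaces out the judge calls to stay under the API rate limit. How `evaluate_dataset` implements this is not shown in this notebook; the sketch below is one plausible shape, with a hypothetical `judge_with_delay` helper and a stand-in `fake_judge` function in place of a real Gemini call.

```python
import time

def judge_with_delay(samples, judge_fn, delay_seconds=2.0, sleep=time.sleep):
    """Call judge_fn on each sample, pausing between consecutive calls."""
    results = []
    for index, sample in enumerate(samples):
        if index > 0:
            sleep(delay_seconds)  # wait before every call except the first
        results.append(judge_fn(sample))
    return results

def fake_judge(sample):
    """Stand-in for an API call: derives a score from 1 to 5 from the text length."""
    return len(sample) % 5 + 1

# A zero delay keeps the demo instant; a real run would use something like 2.0.
print(judge_with_delay(["a", "bb", "ccc"], fake_judge, delay_seconds=0.0))
```

With a delay of 2 seconds, a dataset of N samples adds roughly 2 × (N − 1) seconds of waiting on top of the API latency itself.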

## 6. Visualize Results

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Load results
with open("evaluation_results.json", "r", encoding="utf-8") as f:
    results_data = json.load(f)

# Extract scores
bert_scores = [r["bert_score"] for r in results_data["results"] if r["bert_score"] is not None]
llm_scores = [r["llm_judge_score"] for r in results_data["results"] if r["llm_judge_score"] is not None]

# Create plots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# BERTScore distribution
axes[0].hist(bert_scores, bins=20, color='skyblue', edgecolor='black')
axes[0].axvline(results_data["statistics"]["bert_score"]["mean"], 
                color='red', linestyle='--', label=f'Mean: {results_data["statistics"]["bert_score"]["mean"]:.4f}')
axes[0].set_xlabel('BERTScore')
axes[0].set_ylabel('Frequency')
axes[0].set_title('BERTScore Distribution')
axes[0].legend()
axes[0].grid(alpha=0.3)

# LLM Judge distribution (one bin centered on each integer score from 1 to 5)
axes[1].hist(llm_scores, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5], color='lightgreen', edgecolor='black')
axes[1].axvline(results_data["statistics"]["llm_judge"]["mean"], 
                color='red', linestyle='--', label=f'Mean: {results_data["statistics"]["llm_judge"]["mean"]:.2f}')
axes[1].set_xlabel('LLM Judge Score')
axes[1].set_ylabel('Frequency')
axes[1].set_title('LLM Judge Score Distribution')
axes[1].set_xticks([1, 2, 3, 4, 5])
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig('evaluation_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved visualization to evaluation_distributions.png")

In [None]:
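import numpy as np

# Correlation between BERTScore and LLM Judge.
# Keep only samples that have both scores so the two series stay aligned.
paired = [
    (r["bert_score"], r["llm_judge_score"])
    for r in results_data["results"]
    if r["bert_score"] is not None and r["llm_judge_score"] is not None
]
paired_bert, paired_llm = zip(*paired)

plt.figure(figsize=(8, 6))
plt.scatter(paired_bert, paired_llm, alpha=0.6)
plt.xlabel('BERTScore')
plt.ylabel('LLM Judge Score')
plt.title('Correlation: BERTScore vs LLM Judge')
plt.grid(alpha=0.3)

# Add the Pearson correlation coefficient, positioned in axes coordinates
# so the label does not stretch the x-axis range.
correlation = np.corrcoef(paired_bert, paired_llm)[0, 1]
plt.text(0.05, 0.95, f'Correlation: {correlation:.3f}',
         transform=plt.gca().transAxes, va='top',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.savefig('score_correlation.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved correlation plot to score_correlation.png")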