# Evaluating Embedding Models for a Legal Chatbot System

This notebook evaluates and compares the performance of the following embedding models:
1. **thanhtantran/Vietnamese_Embedding_v2** - a model specialized for Vietnamese
2. **BAAI/bge-m3** - a multilingual model
3. **intfloat/multilingual-e5-large** - a large multilingual model

Dataset: **anti-ai/ViNLI-Zalo-supervised** (triplet format: query, positive, hard_neg)
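
Each record is a triplet: a question, a passage that answers it, and a hard negative passage that looks relevant but is not. The evaluation later encodes the three columns separately, so it helps to think of the data column-wise. A small illustration with made-up English records (the real dataset is in Vietnamese, and these rows are hypothetical):

```python
# Two hypothetical records in the same shape as the dataset rows
records = [
    {"query": "q1", "positive": "answer to q1", "hard_neg": "near miss for q1"},
    {"query": "q2", "positive": "answer to q2", "hard_neg": "near miss for q2"},
]

# Split the rows into three aligned lists, one per column
queries = [r["query"] for r in records]
positives = [r["positive"] for r in records]
negatives = [r["hard_neg"] for r in records]

# Index i in each list refers to the same triplet
for q, p, n in zip(queries, positives, negatives):
    print(q, "|", p, "|", n)
```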

## 🚀 Quick start for Google Colab T4

**IMPORTANT - follow these steps in order:**

1. ✅ **Runtime → Restart runtime** (to clear leftover memory)
2. ✅ Run the "Install libraries" cell
3. ✅ Run the "Import libraries" cell
4. ✅ Run the "Clear GPU Memory" cell
5. ✅ Load the dataset
6. ✅ **Choose a configuration:**
   - Running all 3 models: `MAX_SAMPLES=1000, BATCH_SIZE=4`
   - Running 1 model: `MAX_SAMPLES=2000, BATCH_SIZE=8`
7. ✅ Run the remaining cells

**If you hit OOM:** restart the runtime and reduce MAX_SAMPLES

## 1. Install libraries

In [1]:
!pip install -q datasets transformers sentence-transformers torch scikit-learn numpy pandas tqdm matplotlib seaborn

## 2. Import libraries

## ⚠️ GPU Memory Management (important on Colab T4!)

**How to avoid "CUDA out of memory" errors on a Colab T4 (15GB):**

### ✅ Step 1: Restart the runtime before running
- Menu: **Runtime → Restart runtime**
- Or: **Ctrl+M .** (keyboard shortcut)

### ✅ Step 2: Clear GPU memory if needed
- Run the "Clear GPU Memory" cell in section 3

### ✅ Step 3: Adjust the configuration
**In the config cell (section 3):**
```python
MAX_SAMPLES = 2000   # Reduce to 1000 if you still hit OOM
BATCH_SIZE = 8       # Reduce to 4 if you still hit OOM
USE_FP16 = True      # Must be True
```

### ✅ Step 4: Run one model at a time
**In the "Define the models" cell (section 4):**
- Uncomment one of the lines to run only 1 model
- Example: `models_to_evaluate = {"BGE-M3": "BAAI/bge-m3"}`

### 📊 Recommended configurations for Colab T4:
- **Running all 3 models:** MAX_SAMPLES=1000, BATCH_SIZE=4, USE_FP16=True
- **Running 1 model:** MAX_SAMPLES=2000, BATCH_SIZE=8, USE_FP16=True
- **Safest:** MAX_SAMPLES=500, BATCH_SIZE=4, USE_FP16=True
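
The recommendations above can be encoded as a small helper so the configuration does not have to be edited by hand. This is only a sketch: `pick_config` and its `safest` flag are hypothetical names that are not part of the notebook, and the numbers are simply the ones listed above. Treating two models like the three-model case is an assumption.

```python
def pick_config(num_models: int, safest: bool = False) -> dict:
    """Return the T4 settings recommended above (hypothetical helper)."""
    if safest:
        return {"MAX_SAMPLES": 500, "BATCH_SIZE": 4, "USE_FP16": True}
    if num_models <= 1:
        return {"MAX_SAMPLES": 2000, "BATCH_SIZE": 8, "USE_FP16": True}
    # Assumption: any multi-model run uses the conservative "all 3 models" settings
    return {"MAX_SAMPLES": 1000, "BATCH_SIZE": 4, "USE_FP16": True}


config = pick_config(num_models=3)
print(config)
```

The returned values could then be assigned to `MAX_SAMPLES`, `BATCH_SIZE` and `USE_FP16` in the configuration cell.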

In [None]:
import torch
import numpy as np
import pandas as pd
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple
import warnings
import gc
warnings.filterwarnings('ignore')

# Set up the device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Set up CUDA memory management
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    print(f"Allocated GPU Memory: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    print(f"Cached GPU Memory: {torch.cuda.memory_reserved(0) / 1024**3:.2f} GB")


Using device: cpu


## 3. Load Dataset

## 🔄 Clear GPU Memory (run this cell if you hit OOM)

**If you get a "CUDA out of memory" error, run the cell below to clear memory:**

In [None]:
# Release GPU memory that this process no longer needs
import torch
import gc

# Note: this only frees memory held by the current Python process;
# it does not stop other processes that are using the GPU
if torch.cuda.is_available():
    gc.collect()               # Drop unreachable Python objects (and their tensors) first
    torch.cuda.synchronize()   # Wait for pending GPU work to finish
    torch.cuda.empty_cache()   # Release cached blocks held by PyTorch
    
    print("✅ GPU Memory cleared!")
    print(f"GPU Memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    print(f"GPU Memory cached: {torch.cuda.memory_reserved(0) / 1024**3:.2f} GB")
    
# If that does not help, restart the runtime:
# Runtime -> Restart runtime (in the Colab menu)

In [3]:
# Load the dataset from HuggingFace
print("Loading dataset...")
dataset = load_dataset("anti-ai/ViNLI-Zalo-supervised")

# Inspect the dataset structure
print(f"\nDataset splits: {dataset.keys()}")
print(f"\nSample data:")
print(dataset['train'][0] if 'train' in dataset else dataset[list(dataset.keys())[0]][0])

Loading dataset...


Generating train split: 100%|██████████| 32980/32980 [00:00<00:00, 45453.00 examples/s]


Dataset splits: dict_keys(['train'])

Sample data:
{'query': 'Tổ sát_hạch cấp giấy_phép lái tàu_hỏa có bao_nhiêu thành_viên ?', 'positive': 'Điều 30 . Tổ sát_hạch 1 . Tổ sát_hạch do Cục_trưởng Cục Đường_sắt Việt_Nam thành_lập , chịu sự chỉ_đạo trực_tiếp của Hội_đồng sát_hạch . \n 2 . Tổ sát_hạch có ít_nhất 05 thành_viên , bao_gồm tổ_trưởng , các sát_hạch viên lý_thuyết và sát_hạch viên thực_hành . Tổ_trưởng Tổ sát_hạch là công_chức Cục Đường_sắt Việt_Nam hoặc lãnh_đạo doanh_nghiệp có thí_sinh dự kỳ sát_hạch , các sát_hạch viên là người đang công_tác tại doanh_nghiệp có thí_sinh tham_dự kỳ sát_hạch và người đang công_tác tại các cơ_sở đào_tạo liên_quan đến lái tàu . \n 3 . Tiêu_chuẩn của sát_hạch viên : \n a ) Có tư_cách đạo_đức tốt và có chuyên_môn phù_hợp ; \n b ) 




In [None]:
# ⚙️ Configuration for Google Colab T4 GPU (15GB)
MAX_SAMPLES = 2000  # Cap the evaluation at 2000 samples to avoid OOM on T4
BATCH_SIZE = 8      # Small batch size of 8 to save memory
USE_FP16 = True     # Use FP16 (half precision) to halve the memory used by the weights

print("⚙️ CONFIGURATION FOR COLAB T4 GPU:")
print(f"  - Max samples: {MAX_SAMPLES}")
print(f"  - Batch size: {BATCH_SIZE}")
print(f"  - Use FP16: {USE_FP16}")
print(f"  - Device: {device}")
print("\n💡 If you still hit OOM, reduce MAX_SAMPLES to 1000 or BATCH_SIZE to 4")

# Prepare the data: pick the most suitable split
if 'test' in dataset:
    split_name = 'test'
elif 'validation' in dataset:
    split_name = 'validation'
elif 'train' in dataset:
    # Only a train split is available, so evaluate on a sample of it
    split_name = 'train'
else:
    # Fall back to the first split
    split_name = list(dataset.keys())[0]

# Apply the MAX_SAMPLES cap to whichever split was chosen
eval_data = dataset[split_name]
eval_data = eval_data.select(range(min(MAX_SAMPLES, len(eval_data))))

print(f"\n✅ Number of evaluation samples: {len(eval_data)}")
print(f"✅ Columns: {eval_data.column_names}")


Number of evaluation samples: 20000

Columns: ['query', 'positive', 'hard_neg']
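
To see why `USE_FP16` matters, it helps to estimate the memory taken by the weights alone: FP32 stores 4 bytes per parameter and FP16 stores 2, so the weights take half the space. The sketch below assumes a model with roughly 560 million parameters, an approximate figure for an XLM-RoBERTa-large-sized encoder and not something measured in this notebook. Activations and the CUDA context come on top of this, so actual usage will be higher.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 560_000_000  # assumed, approximate parameter count

fp32 = weight_memory_gib(NUM_PARAMS, 4)
fp16 = weight_memory_gib(NUM_PARAMS, 2)
print(f"FP32 weights: {fp32:.2f} GiB")
print(f"FP16 weights: {fp16:.2f} GiB")
```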


## 4. Define the models to evaluate

In [None]:
# Models to evaluate
models_to_evaluate = {
    "Vietnamese_Embedding_v2": "thanhtantran/Vietnamese_Embedding_v2",
    "BGE-M3": "BAAI/bge-m3",
    "Multilingual-E5-Large": "intfloat/multilingual-e5-large"
}

# 🔥 OPTION: run only 1 model to avoid OOM
# Uncomment one of the lines below to test a single model
# models_to_evaluate = {"Vietnamese_Embedding_v2": "thanhtantran/Vietnamese_Embedding_v2"}
# models_to_evaluate = {"BGE-M3": "BAAI/bge-m3"}
# models_to_evaluate = {"Multilingual-E5-Large": "intfloat/multilingual-e5-large"}

print("Models to evaluate:")
for name, model_id in models_to_evaluate.items():
    print(f"  - {name}: {model_id}")
print(f"\n📊 Total: {len(models_to_evaluate)} model(s)")

Models to evaluate:
  - Vietnamese_Embedding_v2: thanhtantran/Vietnamese_Embedding_v2
  - BGE-M3: BAAI/bge-m3
  - Multilingual-E5-Large: intfloat/multilingual-e5-large
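
One caveat when comparing these models: the model card for `intfloat/multilingual-e5-large` states that inputs should be prefixed with `query: ` or `passage: `, while the evaluation code in this notebook encodes raw text for every model. If you want to follow that convention, a sketch along these lines could be applied to the text lists before encoding. The `PREFIXES` table and the `add_prefix` helper are hypothetical and not part of the notebook; whether the other two models expect prefixes should be checked against their own model cards.

```python
from typing import Dict, List

# Hypothetical per-model prefixes; only the E5 entry reflects its model card
PREFIXES: Dict[str, Dict[str, str]] = {
    "intfloat/multilingual-e5-large": {"query": "query: ", "passage": "passage: "},
}


def add_prefix(model_id: str, texts: List[str], kind: str) -> List[str]:
    """Prepend the model-specific prefix for `kind` ("query" or "passage")."""
    prefix = PREFIXES.get(model_id, {}).get(kind, "")
    return [prefix + t for t in texts]


print(add_prefix("intfloat/multilingual-e5-large", ["example question"], "query"))
print(add_prefix("BAAI/bge-m3", ["example question"], "query"))
```

Models without an entry in the table are left unchanged, so the same call can be used for all three models.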


## 5. Evaluation functions

In [None]:
def compute_embeddings(model: SentenceTransformer, texts: List[str], batch_size: int = 16) -> np.ndarray:
    """Compute embeddings for a list of texts."""
    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True  # Unit-length vectors, so a dot product equals cosine similarity
    )
    # Cast to float32 so similarity scores are not computed in half precision when USE_FP16 is on
    return embeddings.astype(np.float32)


def evaluate_triplet_ranking(queries_emb: np.ndarray, 
                             positives_emb: np.ndarray, 
                             negatives_emb: np.ndarray) -> Dict[str, float]:
    """Evaluate a model with triplet ranking.
    
    Args:
        queries_emb: Query embeddings
        positives_emb: Positive document embeddings
        negatives_emb: Negative document embeddings
    
    Returns:
        Dictionary of metrics
    """
    # Compute similarity scores
    pos_scores = np.sum(queries_emb * positives_emb, axis=1)  # Cosine similarity (embeddings are already normalized)
    neg_scores = np.sum(queries_emb * negatives_emb, axis=1)
    
    # Metrics
    # 1. Accuracy: fraction of triplets where the positive scores higher than the negative
    accuracy = np.mean(pos_scores > neg_scores)
    
    # 2. Mean Positive Score
    mean_pos_score = np.mean(pos_scores)
    
    # 3. Mean Negative Score
    mean_neg_score = np.mean(neg_scores)
    
    # 4. Mean Score Difference (margin)
    mean_diff = np.mean(pos_scores - neg_scores)
    
    # 5. MRR (Mean Reciprocal Rank)
    # With triplets, the reciprocal rank is 1 if the positive ranks first and 0.5 if it ranks second
    ranks = np.where(pos_scores > neg_scores, 1, 2)
    mrr = np.mean(1.0 / ranks)
    
    # 6. Recall@1: fraction of triplets where the positive is ranked first
    recall_at_1 = accuracy  # Identical to accuracy in the triplet setting
    
    return {
        "accuracy": accuracy,
        "mean_positive_score": mean_pos_score,
        "mean_negative_score": mean_neg_score,
        "mean_score_difference": mean_diff,
        "mrr": mrr,
        "recall@1": recall_at_1
    }


def evaluate_model(model_name: str, model_path: str, eval_data, batch_size: int = 16) -> Dict[str, float]:
    """Evaluate one embedding model.
    
    Args:
        model_name: Display name of the model
        model_path: Local path or HuggingFace model ID
        eval_data: Dataset to evaluate on
        batch_size: Batch size used for encoding
    
    Returns:
        Dictionary of metrics
    """
    print(f"\n{'='*60}")
    print(f"Evaluating: {model_name}")
    print(f"{'='*60}")
    
    # Clear memory before loading the model
    if torch.cuda.is_available():
        gc.collect()
        torch.cuda.empty_cache()
        print(f"GPU Memory before loading: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    
    # Load model
    print(f"Loading model from {model_path}...")
    try:
        model = SentenceTransformer(model_path, device=device)
        
        # Use half precision (FP16) to halve the memory taken by the weights on T4
        if device == "cuda" and USE_FP16:
            model = model.half()
            print("✅ Using FP16 (half precision) to save memory")
        
        if torch.cuda.is_available():
            print(f"GPU Memory after loading: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    except Exception as e:
        print(f"❌ Error loading model: {e}")
        raise
    
    # Extract texts
    queries = [item['query'] for item in eval_data]
    positives = [item['positive'] for item in eval_data]
    negatives = [item['hard_neg'] for item in eval_data]
    
    # Compute embeddings
    print(f"\nComputing embeddings for {len(queries)} samples...")
    print("Encoding queries...")
    queries_emb = compute_embeddings(model, queries, batch_size)
    
    # Clear the cache after each encoding pass
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    print("Encoding positives...")
    positives_emb = compute_embeddings(model, positives, batch_size)
    
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    print("Encoding negatives...")
    negatives_emb = compute_embeddings(model, negatives, batch_size)
    
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    # Evaluate
    print("\nCalculating metrics...")
    metrics = evaluate_triplet_ranking(queries_emb, positives_emb, negatives_emb)
    
    # Print results
    print(f"\n{model_name} Results:")
    print(f"  - Accuracy: {metrics['accuracy']:.4f}")
    print(f"  - MRR: {metrics['mrr']:.4f}")
    print(f"  - Recall@1: {metrics['recall@1']:.4f}")
    print(f"  - Mean Positive Score: {metrics['mean_positive_score']:.4f}")
    print(f"  - Mean Negative Score: {metrics['mean_negative_score']:.4f}")
    print(f"  - Mean Score Difference: {metrics['mean_score_difference']:.4f}")
    
    # Clear memory - IMPORTANT!
    del model
    del queries_emb, positives_emb, negatives_emb
    gc.collect()  # Collect first so the freed tensors can be released from the cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    print(f"\n✅ Memory cleared after evaluating {model_name}")
    if torch.cuda.is_available():
        print(f"GPU Memory after cleanup: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    
    return metrics
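
Because each query is compared against exactly one positive and one negative, the positive's rank is either 1 or 2. MRR is therefore a fixed function of accuracy, MRR = accuracy + 0.5 * (1 - accuracy) = (1 + accuracy) / 2, and Recall@1 equals accuracy by definition. In this setup the three metrics always rank models identically. A minimal check with made-up similarity scores (the numbers are illustrative, not notebook results):

```python
# Toy cosine similarities for four triplets (illustrative values only)
pos_scores = [0.82, 0.40, 0.75, 0.61]
neg_scores = [0.55, 0.47, 0.30, 0.58]

wins = [p > n for p, n in zip(pos_scores, neg_scores)]
accuracy = sum(wins) / len(wins)

# Rank 1 if the positive beats the hard negative, otherwise rank 2
ranks = [1 if w else 2 for w in wins]
mrr = sum(1.0 / r for r in ranks) / len(ranks)

print(accuracy, mrr)
assert abs(mrr - (1 + accuracy) / 2) < 1e-12
```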

## 6. Run the evaluation for all models

In [None]:
import time

# Store the results
results = {}
failed_models = []  # Models that raised an error during evaluation

# Evaluate each model
for model_name, model_path in models_to_evaluate.items():
    try:
        # Clear memory before each model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        
        metrics = evaluate_model(model_name, model_path, eval_data, batch_size=BATCH_SIZE)
        results[model_name] = metrics
        
        # Wait briefly so GPU cleanup can finish
        time.sleep(2)
        
    except Exception as e:
        print(f"\n❌ Error evaluating {model_name}: {str(e)}")
        print(f"\n💡 Tip: Try reducing MAX_SAMPLES or BATCH_SIZE in the configuration cell")
        # Record the failure instead of storing None, which would break the comparison table
        failed_models.append(model_name)
        
        # Clear memory after an error
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

print("\n" + "="*60)
print("Evaluation completed!")
print("="*60)

## 7. Compare results

In [None]:
# Build a DataFrame for comparison, skipping any model that failed to evaluate
df_results = pd.DataFrame({name: m for name, m in results.items() if m is not None}).T
df_results = df_results.sort_values('accuracy', ascending=False)

print("\n" + "="*80)
print("COMPARISON OF ALL MODELS")
print("="*80)
print(df_results.to_string())
print("\n")

# Find the best model
best_model = df_results['accuracy'].idxmax()
print(f"\n🏆 BEST MODEL: {best_model}")
print(f"   Accuracy: {df_results.loc[best_model, 'accuracy']:.4f}")
print(f"   MRR: {df_results.loc[best_model, 'mrr']:.4f}")
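
With 1,000 to 2,000 triplets, small accuracy gaps between models can fall within sampling noise. A rough way to judge this, assuming the triplets are independent, is the binomial standard error sqrt(p * (1 - p) / n). The helper name and the accuracy value below are hypothetical and only illustrate the arithmetic.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Hypothetical accuracy of 0.90 measured on 2000 triplets
se = accuracy_standard_error(0.90, 2000)
print(f"standard error: {se:.4f}")
print(f"approx. 95% interval: {0.90 - 1.96 * se:.4f} to {0.90 + 1.96 * se:.4f}")
```

If two models differ by less than a couple of standard errors, the ranking between them may not be stable across different samples.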

## 8. Visualization

In [None]:
# Set the plot style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Create a figure with subplots
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Embedding Models Evaluation on ViNLI-Zalo Dataset', fontsize=16, fontweight='bold')

# 1. Accuracy comparison
ax1 = axes[0, 0]
df_results['accuracy'].plot(kind='bar', ax=ax1, color='skyblue', edgecolor='black')
ax1.set_title('Accuracy Comparison', fontsize=12, fontweight='bold')
ax1.set_ylabel('Accuracy', fontsize=10)
ax1.set_xlabel('Model', fontsize=10)
ax1.set_ylim([0, 1])
ax1.grid(axis='y', alpha=0.3)
for i, v in enumerate(df_results['accuracy']):
    ax1.text(i, v + 0.02, f'{v:.3f}', ha='center', fontweight='bold')

# 2. MRR comparison
ax2 = axes[0, 1]
df_results['mrr'].plot(kind='bar', ax=ax2, color='lightcoral', edgecolor='black')
ax2.set_title('Mean Reciprocal Rank (MRR)', fontsize=12, fontweight='bold')
ax2.set_ylabel('MRR', fontsize=10)
ax2.set_xlabel('Model', fontsize=10)
ax2.set_ylim([0, 1])
ax2.grid(axis='y', alpha=0.3)
for i, v in enumerate(df_results['mrr']):
    ax2.text(i, v + 0.02, f'{v:.3f}', ha='center', fontweight='bold')

# 3. Score comparison (Positive vs Negative)
ax3 = axes[1, 0]
x = np.arange(len(df_results))
width = 0.35
ax3.bar(x - width/2, df_results['mean_positive_score'], width, label='Positive', color='lightgreen', edgecolor='black')
ax3.bar(x + width/2, df_results['mean_negative_score'], width, label='Negative', color='salmon', edgecolor='black')
ax3.set_title('Mean Similarity Scores', fontsize=12, fontweight='bold')
ax3.set_ylabel('Cosine Similarity', fontsize=10)
ax3.set_xlabel('Model', fontsize=10)
ax3.set_xticks(x)
ax3.set_xticklabels(df_results.index, rotation=45, ha='right')
ax3.legend()
ax3.grid(axis='y', alpha=0.3)

# 4. Score Difference (Margin)
ax4 = axes[1, 1]
df_results['mean_score_difference'].plot(kind='bar', ax=ax4, color='plum', edgecolor='black')
ax4.set_title('Mean Score Difference (Margin)', fontsize=12, fontweight='bold')
ax4.set_ylabel('Score Difference', fontsize=10)
ax4.set_xlabel('Model', fontsize=10)
ax4.grid(axis='y', alpha=0.3)
for i, v in enumerate(df_results['mean_score_difference']):
    ax4.text(i, v + 0.01, f'{v:.3f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig('embedding_evaluation_results.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n📊 Visualization saved as 'embedding_evaluation_results.png'")

## 9. Detailed analysis and recommendation

In [None]:
print("\n" + "="*80)
print("DETAILED ANALYSIS & RECOMMENDATION")
print("="*80)

# Rank the models by each metric
metrics_to_analyze = ['accuracy', 'mrr', 'mean_score_difference', 'mean_positive_score']

print("\n📊 Model Rankings by Metrics:")
for metric in metrics_to_analyze:
    sorted_models = df_results[metric].sort_values(ascending=False)
    print(f"\n{metric.upper().replace('_', ' ')}:")
    for i, (model, score) in enumerate(sorted_models.items(), 1):
        print(f"  {i}. {model}: {score:.4f}")

# Compute the overall score (weighted average)
weights = {
    'accuracy': 0.4,
    'mrr': 0.3,
    'mean_score_difference': 0.2,
    'mean_positive_score': 0.1
}

df_results['overall_score'] = sum(
    df_results[metric] * weight 
    for metric, weight in weights.items()
)

best_overall = df_results['overall_score'].idxmax()

print(f"\n\n{'='*80}")
print(f"🏆 FINAL RECOMMENDATION: {best_overall}")
print(f"{'='*80}")
print(f"\nOverall Score: {df_results.loc[best_overall, 'overall_score']:.4f}")
print(f"\nKey Metrics:")
print(f"  • Accuracy: {df_results.loc[best_overall, 'accuracy']:.4f}")
print(f"  • MRR: {df_results.loc[best_overall, 'mrr']:.4f}")
print(f"  • Mean Score Difference: {df_results.loc[best_overall, 'mean_score_difference']:.4f}")
print(f"  • Mean Positive Score: {df_results.loc[best_overall, 'mean_positive_score']:.4f}")

print(f"\n💡 Recommendation for Vietnamese Legal Chatbot:")
print(f"   Use '{best_overall}' as the embedding model for your RAG system.")
print(f"\n   Model path: {models_to_evaluate[best_overall]}")
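
The overall score is a plain weighted sum of four metrics. The sketch below reproduces the calculation for a single model using made-up metric values (not results from this notebook) and the weights defined above. Keep in mind that in the triplet setting MRR is determined by accuracy, so those two terms do not contribute independent evidence, and the margin and positive-score terms depend on each model's similarity scale.

```python
weights = {
    "accuracy": 0.4,
    "mrr": 0.3,
    "mean_score_difference": 0.2,
    "mean_positive_score": 0.1,
}

# Hypothetical metrics for one model, for illustration only
metrics = {
    "accuracy": 0.90,
    "mrr": 0.95,
    "mean_score_difference": 0.20,
    "mean_positive_score": 0.70,
}

# Weighted sum: 0.4*accuracy + 0.3*mrr + 0.2*margin + 0.1*positive score
overall = sum(metrics[name] * w for name, w in weights.items())
print(f"overall_score = {overall:.4f}")
```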

## 10. Save results

In [None]:
# Save the results to CSV
output_file = 'embedding_evaluation_results.csv'
df_results.to_csv(output_file)
print(f"\n✅ Results saved to '{output_file}'")

# Save the recommendation
recommendation = f"""
EMBEDDING MODEL EVALUATION RESULTS
==================================
Date: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}
Dataset: anti-ai/ViNLI-Zalo-supervised
Number of samples: {len(eval_data)}

RECOMMENDED MODEL: {best_overall}
Model Path: {models_to_evaluate[best_overall]}

Performance Metrics:
- Accuracy: {df_results.loc[best_overall, 'accuracy']:.4f}
- MRR: {df_results.loc[best_overall, 'mrr']:.4f}
- Mean Score Difference: {df_results.loc[best_overall, 'mean_score_difference']:.4f}
- Overall Score: {df_results.loc[best_overall, 'overall_score']:.4f}

ALL MODELS COMPARISON:
{df_results.to_string()}
"""

with open('embedding_recommendation.txt', 'w', encoding='utf-8') as f:
    f.write(recommendation)

print("✅ Recommendation saved to 'embedding_recommendation.txt'")
print("\n" + recommendation)