# 🏥 Medical Chatbot Model Evaluation - PhoBERT RAG

**This notebook evaluates the performance of a Medical Chatbot system built on:**
- **PhoBERT** (vinai/phobert-base) for Vietnamese embeddings
- **Hybrid Search** (BM25 + Vector Search)
- **RAG** (Retrieval-Augmented Generation) with GPT-4

## 📊 Evaluation Metrics

### 1. **Retrieval Metrics** (search quality)
- **Precision@K**: fraction of the top K results that are relevant
- **Recall@K**: fraction of the relevant documents that appear in the top K
- **MRR (Mean Reciprocal Rank)**: mean over all queries of 1 / rank of the first relevant result
- **NDCG@K**: Normalized Discounted Cumulative Gain
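
As a quick illustration of the four retrieval metrics listed above, here is a minimal standard-library sketch that scores one toy ranking. The function name `retrieval_metrics` and the document IDs are made up for this example; the notebook's own metric functions are defined in section 7.

```python
import math

def retrieval_metrics(retrieved, relevant, k):
    """Compute Precision@K, Recall@K, reciprocal rank and NDCG@K for one query."""
    relevant = set(relevant)
    hits = [1 if doc_id in relevant else 0 for doc_id in retrieved[:k]]

    precision = sum(hits) / k if k > 0 else 0.0
    recall = sum(hits) / len(relevant) if relevant else 0.0

    # Reciprocal rank: 1 / position of the first relevant result (0 if none).
    rr = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            rr = 1.0 / rank
            break

    # DCG discounts each hit by log2(rank + 1); IDCG is the best achievable DCG.
    dcg = sum(hit / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1))
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, min(k, len(relevant)) + 1))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {"precision": precision, "recall": recall, "rr": rr, "ndcg": ndcg}

# Hypothetical ranking: the two relevant documents appear at ranks 2 and 3.
m = retrieval_metrics(["doc_9", "doc_123", "doc_456"], ["doc_123", "doc_456"], k=3)
print(m)
```

Here both relevant documents are found (recall 1.0), but the irrelevant result at rank 1 lowers precision to 2/3, the reciprocal rank to 0.5, and NDCG to about 0.69.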

### 2. **Response Quality Metrics** (answer quality)
- **BLEU Score**: n-gram overlap with the reference answer
- **ROUGE Score**: overlap with the reference answer
- **BERTScore**: semantic similarity measured with BERT
- **Semantic Similarity**: cosine similarity between embeddings

### 3. **Medical Accuracy Metrics** (medical correctness)
- **Medical Entity Accuracy**: fraction of medical entities (diseases, drugs) that are correct
- **Factual Consistency**: consistency with the sources
- **Hallucination Rate**: fraction of information that is wrong or absent from the sources

### 4. **System Performance**
- **Response Time**: response latency
- **Token Usage**: number of GPT tokens used
- **Cost Estimation**: estimated API cost

---

## 🚀 Usage Guide

1. **Upload a test dataset** (CSV/JSON) in this format:
   ```json
   {
     "question": "Sốt xuất huyết có triệu chứng gì?",
     "expected_answer": "Sốt cao, đau đầu, đau khớp...",
     "relevant_doc_ids": ["doc_123", "doc_456"]
   }
   ```

2. **Configure API keys** in the Setup section

3. **Run each cell** to perform the evaluation

4. **View the results** at the end of the notebook (charts + summary table)
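
Before running the evaluation, it can help to check that every record follows the format shown in step 1. The sketch below is one possible validator; `validate_record` and `REQUIRED_FIELDS` are hypothetical helpers, not part of the notebook.

```python
import json

# Field names and types taken from the JSON format shown in step 1.
REQUIRED_FIELDS = {"question": str, "expected_answer": str, "relevant_doc_ids": list}

def validate_record(record):
    """Return a list of problems found in one test record (empty list = valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems

good = json.loads('{"question": "q", "expected_answer": "a", "relevant_doc_ids": ["doc_123"]}')
bad = json.loads('{"question": "q", "relevant_doc_ids": "doc_123"}')
print(validate_record(good))
print(validate_record(bad))
```

The second record is rejected twice: it has no `expected_answer`, and its `relevant_doc_ids` is a string rather than a list.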


## 📦 1. Install Dependencies

In [None]:
!pip install -q transformers torch chromadb openai python-dotenv
!pip install -q nltk rouge-score bert-score sacrebleu
!pip install -q scikit-learn pandas matplotlib seaborn tqdm
!pip install -q sentence-transformers  # For BERTScore and reranking

print("✅ All packages installed successfully!")

## 📚 2. Import Libraries

In [None]:
import os
import json
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Any
from tqdm.auto import tqdm

# NLP Libraries
import torch
from transformers import AutoModel, AutoTokenizer
from openai import OpenAI

# Evaluation Metrics
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from sklearn.metrics.pairwise import cosine_similarity

# Download NLTK data
import nltk
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)

# Plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.family'] = 'DejaVu Sans'

print("✅ Libraries imported successfully!")

## ⚙️ 3. Configuration & API Setup

In [None]:
# ==================== API CONFIGURATION ====================
# Enter your API keys
OPENAI_API_KEY = "your-openai-api-key-here"  # Replace with your real key

# Or upload a .env file and uncomment the lines below:
# from google.colab import files
# uploaded = files.upload()  # Upload the .env file
# from dotenv import load_dotenv
# load_dotenv()
# OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

# ==================== MODEL CONFIGURATION ====================
PHOBERT_MODEL = "vinai/phobert-base"  # Same model as your backend
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

print(f"🔧 Device: {DEVICE}")
print(f"🤖 PhoBERT Model: {PHOBERT_MODEL}")
print("✅ Configuration completed!")
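
To avoid leaving a key in the notebook, the key can be read from an environment variable and checked up front. This is a minimal sketch; `load_api_key` is a hypothetical helper, and `DEMO_API_KEY` is a placeholder variable used only for the demonstration.

```python
import os

def load_api_key(var_name="OPENAI_API_KEY"):
    """Read the key from an environment variable and fail early if it is missing."""
    key = os.environ.get(var_name, "").strip()
    # Also reject the placeholder string used in the configuration cell.
    if not key or key == "your-openai-api-key-here":
        raise RuntimeError(f"{var_name} is not set; export it or load it from a .env file")
    return key

# Demo with a dummy variable; a real key should come from the environment, not from code.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(load_api_key("DEMO_API_KEY"))
```

In the notebook, `OPENAI_API_KEY = load_api_key()` could then replace the hardcoded assignment.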

## 🧠 4. PhoBERT Embedding Class (Same as Backend)

In [None]:
class PhoBERTEmbedding:
    """PhoBERT embedding function - same as the backend"""
    
    def __init__(self, model_name=PHOBERT_MODEL, device=DEVICE, max_length=256):
        print(f"Loading PhoBERT model: {model_name}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.to(device)
        self.model.eval()
        self.device = device
        self.max_length = max_length
        print(f"✅ PhoBERT loaded on {device}")
    
    def encode(self, texts: List[str]) -> np.ndarray:
        """Generate embeddings for texts"""
        if isinstance(texts, str):
            texts = [texts]
        
        encoded = self.tokenizer(
            texts,
            padding=True,
            truncation=True,
            return_tensors='pt',
            max_length=self.max_length
        )
        
        encoded = {k: v.to(self.device) for k, v in encoded.items()}
        
        with torch.no_grad():
            output = self.model(**encoded)
        
        # Mean pooling
        embeddings = self._mean_pooling(output.last_hidden_state, encoded['attention_mask'])
        return embeddings.cpu().numpy()
    
    def _mean_pooling(self, token_embeddings, attention_mask):
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return sum_embeddings / sum_mask

# Initialize PhoBERT
phobert = PhoBERTEmbedding()
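
The mean pooling in `_mean_pooling` averages only the token vectors whose attention mask is 1, so padding does not dilute the sentence embedding. The plain-Python sketch below reproduces that arithmetic on a tiny hand-made example; `masked_mean` is an illustrative name, and the numbers are invented.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    # Guard against an all-zero mask, like the clamp in _mean_pooling.
    return [t / max(count, 1e-9) for t in total]

# Two real tokens followed by one padding token with large values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean(tokens, mask))
```

The padding vector is ignored: the result is the average of the first two tokens only.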

## 📂 5. Load Test Dataset

**Upload a CSV or JSON file in this format:**

```csv
question,expected_answer,relevant_doc_ids
"Sốt xuất huyết có triệu chứng gì?","Sốt cao đột ngột, đau đầu, đau khớp, xuất huyết","doc_123,doc_456"
```

Or create sample test data:

In [None]:
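# ==================== OPTION 1: Upload File ====================
# Uncomment to upload a file from your computer
# from google.colab import files
# uploaded = files.upload()
# test_data_file = list(uploaded.keys())[0]
# test_df = pd.read_csv(test_data_file) if test_data_file.endswith('.csv') else pd.read_json(test_data_file)
# CSV stores relevant_doc_ids as one comma-separated string; split it into a list before scoring
# test_df['relevant_doc_ids'] = test_df['relevant_doc_ids'].apply(
#     lambda s: [d.strip() for d in s.split(',')] if isinstance(s, str) else s)

# ==================== OPTION 2: Sample Data ====================
# Create sample test data (replace with your real data)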
test_data = [
    {
        "question": "Sốt xuất huyết có triệu chứng gì?",
        "expected_answer": "Sốt cao đột ngột (39-40°C), đau đầu dữ dội, đau mỏi cơ và khớp, buồn nôn, nôn, xuất hiện ban xuất huyết dưới da.",
        "relevant_doc_ids": ["dengue_001"],
        "category": "symptoms"
    },
    {
        "question": "Cách điều trị sốt xuất huyết?",
        "expected_answer": "Nghỉ ngơi tuyệt đối, uống nhiều nước, hạ sốt bằng paracetamol, theo dõi số lượng tiểu cầu. Không dùng aspirin hay ibuprofen. Nhập viện nếu xuất huyết nặng.",
        "relevant_doc_ids": ["dengue_001", "dengue_treatment_001"],
        "category": "treatment"
    },
    {
        "question": "Viêm gan B lây qua đường nào?",
        "expected_answer": "Lây qua đường máu (kim tiêm, dụng cụ y tế), quan hệ tình dục không an toàn, từ mẹ sang con khi sinh.",
        "relevant_doc_ids": ["hepatitis_b_001"],
        "category": "transmission"
    },
    {
        "question": "Paracetamol uống liều bao nhiêu?",
        "expected_answer": "Người lớn: 500-1000mg mỗi lần, cách 4-6 giờ, tối đa 4000mg/ngày. Trẻ em: 10-15mg/kg cân nặng mỗi lần.",
        "relevant_doc_ids": ["paracetamol_001"],
        "category": "medication"
    },
    {
        "question": "Phòng ngừa COVID-19 như thế nào?",
        "expected_answer": "Tiêm vắc-xin đầy đủ, đeo khẩu trang nơi đông người, rửa tay thường xuyên, giữ khoảng cách an toàn, tránh tụ tập đông người.",
        "relevant_doc_ids": ["covid19_prevention_001"],
        "category": "prevention"
    }
]

test_df = pd.DataFrame(test_data)

print(f"✅ Loaded {len(test_df)} test questions")
print(f"\nCategories: {test_df['category'].value_counts().to_dict()}")
test_df.head()
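
When the dataset comes from CSV (Option 1), `relevant_doc_ids` arrives as a single string such as `"doc_123,doc_456"`. Passing that string to the metric functions would turn it into a set of characters, so it should be split into a list first. The sketch below shows the idea with the standard `csv` module; `parse_doc_ids` is a hypothetical helper, and the row values are placeholders.

```python
import csv
import io

def parse_doc_ids(cell):
    """Split a comma-separated cell into a clean list of document IDs."""
    return [doc_id.strip() for doc_id in cell.split(",") if doc_id.strip()]

# Same column layout as the CSV example in this section (placeholder values).
csv_text = 'question,expected_answer,relevant_doc_ids\n"q1","a1","doc_123, doc_456"\n'
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    row["relevant_doc_ids"] = parse_doc_ids(row["relevant_doc_ids"])

print(rows[0]["relevant_doc_ids"])
```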

## 🔧 6. Mock RAG System (Simulating Your Backend)

**Note:** This is a mock system for testing. In practice, you should either:
1. Connect directly to your backend API
2. Or load ChromaDB from the backend to test offline

In [None]:
class MockMedicalRAG:
    """Mock RAG system - replace with a real API call or load ChromaDB"""
    
    def __init__(self, phobert_model):
        self.phobert = phobert_model
        self.documents = self._load_mock_documents()
        self.doc_embeddings = self._create_embeddings()
    
    def _load_mock_documents(self) -> List[Dict]:
        """Load mock medical documents (replace with the real ChromaDB)"""
        return [
            {
                "id": "dengue_001",
                "disease_name": "Sốt xuất huyết",
                "symptoms": "Sốt cao đột ngột 39-40°C, đau đầu dữ dội, đau mỏi cơ khớp, buồn nôn, ban xuất huyết",
                "treatment": "Nghỉ ngơi, uống nhiều nước, paracetamol hạ sốt, theo dõi tiểu cầu",
                "prevention": "Diệt muỗi, không để nước đọng"
            },
            {
                "id": "dengue_treatment_001",
                "disease_name": "Điều trị sốt xuất huyết",
                "treatment": "Không dùng aspirin, ibuprofen. Chỉ dùng paracetamol. Nhập viện nếu xuất huyết nặng, tiểu cầu dưới 50,000",
            },
            {
                "id": "hepatitis_b_001",
                "disease_name": "Viêm gan B",
                "symptoms": "Mệt mỏi, vàng da, nước tiểu sẫm màu, đau bụng",
                "transmission": "Lây qua máu, quan hệ tình dục, mẹ sang con",
                "prevention": "Tiêm vắc-xin, không dùng chung kim tiêm"
            },
            {
                "id": "paracetamol_001",
                "medication_name": "Paracetamol",
                "dosage": "Người lớn: 500-1000mg/lần, 4-6h/lần, max 4000mg/ngày. Trẻ em: 10-15mg/kg",
                "indication": "Hạ sốt, giảm đau"
            },
            {
                "id": "covid19_prevention_001",
                "disease_name": "COVID-19",
                "prevention": "Tiêm vắc-xin, đeo khẩu trang, rửa tay, giữ khoảng cách",
                "symptoms": "Sốt, ho, khó thở, mất vị giác"
            }
        ]
    
    def _create_embeddings(self) -> np.ndarray:
        """Create embeddings for all documents"""
        texts = []
        for doc in self.documents:
            # Combine all fields into searchable text
            text = " ".join([str(v) for v in doc.values() if v and isinstance(v, str)])
            texts.append(text)
        
        return self.phobert.encode(texts)
    
    def search(self, question: str, top_k: int = 3) -> List[Dict]:
        """Search for relevant documents"""
        # Encode question
        query_embedding = self.phobert.encode([question])
        
        # Calculate cosine similarity
        similarities = cosine_similarity(query_embedding, self.doc_embeddings)[0]
        
        # Get top K
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            results.append({
                "id": self.documents[idx]["id"],
                "metadata": self.documents[idx],
                "relevance_score": float(similarities[idx]),
                "distance": 1 - float(similarities[idx])
            })
        
        return results
    
    def generate_answer(self, question: str, search_results: List[Dict]) -> str:
        """Generate answer using GPT (same as the backend)"""
        if not search_results:
            return "Xin lỗi, tôi không tìm thấy thông tin phù hợp."
        
        # Prepare context
        context_parts = []
        for idx, result in enumerate(search_results[:3], 1):
            metadata = result['metadata']
            context_parts.append(f"[Nguồn {idx}] {json.dumps(metadata, ensure_ascii=False)}")
        
        context = "\n".join(context_parts)
        
        # Generate with GPT
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "Bạn là bác sĩ AI. Trả lời dựa trên nguồn được cung cấp. Không tự suy luận."},
                    {"role": "user", "content": f"Nguồn:\n{context}\n\nCâu hỏi: {question}\n\nTrả lời:"}
                ],
                temperature=0.3,
                max_tokens=300
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            return f"Error: {str(e)}"

# Initialize mock RAG system
rag_system = MockMedicalRAG(phobert)
print(f"✅ Mock RAG system initialized with {len(rag_system.documents)} documents")
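
The mock above ranks by vector similarity only, while the overview mentions hybrid search (BM25 + vector). How the backend merges the two result lists is not stated here. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the backend's actual method. The function name, the two example rankings, and the constant `k=60` (a frequently used default) are all illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs of a keyword (BM25) ranker and the vector ranker.
bm25_ranking = ["paracetamol_001", "dengue_001", "hepatitis_b_001"]
vector_ranking = ["dengue_001", "dengue_treatment_001", "paracetamol_001"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, without needing BM25 and cosine scores to be on the same scale.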

## 📊 7. Evaluation Functions

In [None]:
# ==================== RETRIEVAL METRICS ====================

def calculate_precision_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    """Precision@K: fraction of the top K results that are relevant"""
    retrieved_k = retrieved_ids[:k]
    relevant_set = set(relevant_ids)
    correct = sum(1 for doc_id in retrieved_k if doc_id in relevant_set)
    return correct / k if k > 0 else 0.0

def calculate_recall_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    """Recall@K: fraction of the relevant documents found in the top K"""
    retrieved_k = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    if len(relevant_set) == 0:
        return 0.0
    correct = len(retrieved_k & relevant_set)
    return correct / len(relevant_set)

def calculate_mrr(retrieved_ids: List[str], relevant_ids: List[str]) -> float:
    """Reciprocal rank for one query: 1 / rank of the first relevant result (averaged over queries to get MRR)"""
    relevant_set = set(relevant_ids)
    for i, doc_id in enumerate(retrieved_ids, 1):
        if doc_id in relevant_set:
            return 1.0 / i
    return 0.0

def calculate_ndcg_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    """NDCG@K: Normalized Discounted Cumulative Gain"""
    retrieved_k = retrieved_ids[:k]
    relevant_set = set(relevant_ids)
    
    # DCG
    dcg = sum((1 if doc_id in relevant_set else 0) / np.log2(i + 2) 
              for i, doc_id in enumerate(retrieved_k))
    
    # IDCG (ideal DCG): all relevant documents ranked first; count unique IDs only
    ideal_k = min(k, len(relevant_set))
    idcg = sum(1 / np.log2(i + 2) for i in range(ideal_k))
    
    return dcg / idcg if idcg > 0 else 0.0

# ==================== RESPONSE QUALITY METRICS ====================

def calculate_bleu(generated: str, reference: str) -> float:
    """BLEU Score: N-gram overlap"""
    reference_tokens = reference.split()
    generated_tokens = generated.split()
    smoothing = SmoothingFunction().method1
    return sentence_bleu([reference_tokens], generated_tokens, smoothing_function=smoothing)

def calculate_rouge(generated: str, reference: str) -> Dict[str, float]:
    """ROUGE Scores: ROUGE-1, ROUGE-2, ROUGE-L"""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=False)
    scores = scorer.score(reference, generated)
    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }

def calculate_semantic_similarity(generated: str, reference: str, model) -> float:
    """Semantic Similarity using PhoBERT embeddings"""
    emb1 = model.encode([generated])
    emb2 = model.encode([reference])
    return float(cosine_similarity(emb1, emb2)[0][0])

# ==================== MEDICAL ACCURACY METRICS ====================

def extract_medical_entities(text: str) -> set:
    """Extract medical entities (simple keyword-based; an NER model could be used instead)"""
    # Common Vietnamese medical keywords (extend as needed)
    medical_keywords = [
        'sốt', 'đau', 'viêm', 'nhiễm', 'bệnh', 'thuốc', 'paracetamol', 'aspirin',
        'xuất huyết', 'tiểu cầu', 'gan', 'phổi', 'tim', 'vắc-xin', 'điều trị'
    ]
    text_lower = text.lower()
    return {kw for kw in medical_keywords if kw in text_lower}

def calculate_entity_accuracy(generated: str, reference: str) -> float:
    """Medical Entity Accuracy: share of the reference answer's medical entities that also appear in the generated answer"""
    gen_entities = extract_medical_entities(generated)
    ref_entities = extract_medical_entities(reference)
    
    if len(ref_entities) == 0:
        return 1.0 if len(gen_entities) == 0 else 0.0
    
    correct = len(gen_entities & ref_entities)
    return correct / len(ref_entities)

print("✅ Evaluation functions defined!")
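
`extract_medical_entities` uses substring checks (`kw in text_lower`), so a short keyword such as `gan` also matches inside an unrelated word like `ngang`. A whole-word variant is sketched below. The function name `extract_entities_whole_word` is hypothetical, and the NFC normalization step is an assumption about how the input text may be encoded.

```python
import re
import unicodedata

def extract_entities_whole_word(text, keywords):
    """Return the keywords that occur in text as whole words or phrases."""
    # NFC keeps each accented Vietnamese letter as one code point, so \w treats it as a letter.
    text_nfc = unicodedata.normalize("NFC", text).lower()
    found = set()
    for kw in keywords:
        kw_nfc = unicodedata.normalize("NFC", kw).lower()
        # Lookarounds require a non-word character (or the string edge) on both sides.
        pattern = r"(?<!\w)" + re.escape(kw_nfc) + r"(?!\w)"
        if re.search(pattern, text_nfc):
            found.add(kw_nfc)
    return found

keywords = ["gan", "viêm", "xuất huyết"]
print(extract_entities_whole_word("Viêm gan B", keywords))
print(extract_entities_whole_word("đường ngang", keywords))
```

The first call finds `viêm` and `gan`; the second finds nothing, whereas the substring check would have reported `gan`.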

## 🚀 8. Run Full Evaluation

In [None]:
def evaluate_rag_system(test_df: pd.DataFrame, rag_system, phobert_model, k_values=(1, 3, 5)):
    """
    Run a full evaluation of the RAG system.
    
    Returns:
        results_df: DataFrame with detailed per-question results
        metrics_summary: Dict of aggregated metrics
    """
    results = []
    
    print(f"🔍 Evaluating {len(test_df)} questions...\n")
    
    for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Evaluating"):
        question = row['question']
        expected_answer = row['expected_answer']
        relevant_doc_ids = row['relevant_doc_ids']
        
        # Measure response time
        start_time = time.time()
        
        # 1. RETRIEVAL
        search_results = rag_system.search(question, top_k=max(k_values))
        retrieved_ids = [r['id'] for r in search_results]
        
        # 2. GENERATION
        generated_answer = rag_system.generate_answer(question, search_results)
        
        response_time = time.time() - start_time
        
        # 3. CALCULATE METRICS
        result = {
            'question': question,
            'expected_answer': expected_answer,
            'generated_answer': generated_answer,
            'retrieved_ids': retrieved_ids,
            'relevant_ids': relevant_doc_ids,
            'response_time': response_time,
        }
        
        # Retrieval metrics
        for k in k_values:
            result[f'precision@{k}'] = calculate_precision_at_k(retrieved_ids, relevant_doc_ids, k)
            result[f'recall@{k}'] = calculate_recall_at_k(retrieved_ids, relevant_doc_ids, k)
            result[f'ndcg@{k}'] = calculate_ndcg_at_k(retrieved_ids, relevant_doc_ids, k)
        
        result['mrr'] = calculate_mrr(retrieved_ids, relevant_doc_ids)
        
        # Response quality metrics
        result['bleu'] = calculate_bleu(generated_answer, expected_answer)
        rouge_scores = calculate_rouge(generated_answer, expected_answer)
        result.update(rouge_scores)
        result['semantic_similarity'] = calculate_semantic_similarity(
            generated_answer, expected_answer, phobert_model
        )
        
        # Medical accuracy
        result['entity_accuracy'] = calculate_entity_accuracy(generated_answer, expected_answer)
        
        results.append(result)
    
    results_df = pd.DataFrame(results)
    
    # Calculate summary statistics
    metrics_summary = {}
    
    # Retrieval metrics
    for k in k_values:
        metrics_summary[f'Precision@{k}'] = results_df[f'precision@{k}'].mean()
        metrics_summary[f'Recall@{k}'] = results_df[f'recall@{k}'].mean()
        metrics_summary[f'NDCG@{k}'] = results_df[f'ndcg@{k}'].mean()
    
    metrics_summary['MRR'] = results_df['mrr'].mean()
    
    # Response quality
    metrics_summary['BLEU'] = results_df['bleu'].mean()
    metrics_summary['ROUGE-1'] = results_df['rouge1'].mean()
    metrics_summary['ROUGE-2'] = results_df['rouge2'].mean()
    metrics_summary['ROUGE-L'] = results_df['rougeL'].mean()
    metrics_summary['Semantic Similarity'] = results_df['semantic_similarity'].mean()
    
    # Medical accuracy
    metrics_summary['Entity Accuracy'] = results_df['entity_accuracy'].mean()
    
    # Performance
    metrics_summary['Avg Response Time (s)'] = results_df['response_time'].mean()
    
    return results_df, metrics_summary

# RUN EVALUATION
results_df, metrics_summary = evaluate_rag_system(test_df, rag_system, phobert)

print("\n" + "="*60)
print("📊 EVALUATION RESULTS")
print("="*60)
for metric, value in metrics_summary.items():
    print(f"{metric:.<40} {value:.4f}")
print("="*60)
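
The overview lists token usage and cost estimation, but the loop above records only response time. The OpenAI Python client reports token counts on `response.usage` (`prompt_tokens`, `completion_tokens`), so `generate_answer` could return them alongside the text. Given those counts, a cost estimate is simple arithmetic. The helper below is a sketch; `estimate_cost` is a hypothetical name, and the prices are placeholders, not real rates.

```python
def estimate_cost(prompt_tokens, completion_tokens, usd_per_1m_input, usd_per_1m_output):
    """Estimate the API cost in USD for one call from its token counts."""
    return (prompt_tokens * usd_per_1m_input + completion_tokens * usd_per_1m_output) / 1_000_000

# Placeholder numbers for illustration only; look up the current rates for your model.
cost = estimate_cost(prompt_tokens=400, completion_tokens=100,
                     usd_per_1m_input=1.0, usd_per_1m_output=2.0)
print(f"{cost:.6f} USD")
```

Summing this value over all questions would give the total cost of one evaluation run.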

## 📈 9. Visualization

In [None]:
# ==================== METRICS COMPARISON ====================
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Retrieval Metrics
retrieval_metrics = {
    'Precision@1': metrics_summary['Precision@1'],
    'Precision@3': metrics_summary['Precision@3'],
    'Recall@1': metrics_summary['Recall@1'],
    'Recall@3': metrics_summary['Recall@3'],
    'MRR': metrics_summary['MRR'],
    'NDCG@3': metrics_summary['NDCG@3']
}
axes[0, 0].bar(retrieval_metrics.keys(), retrieval_metrics.values(), color='skyblue')
axes[0, 0].set_title('Retrieval Metrics', fontsize=14, fontweight='bold')
axes[0, 0].set_ylabel('Score')
axes[0, 0].set_ylim(0, 1)
axes[0, 0].grid(axis='y', alpha=0.3)
plt.setp(axes[0, 0].xaxis.get_majorticklabels(), rotation=45, ha='right')

# 2. Response Quality Metrics
quality_metrics = {
    'BLEU': metrics_summary['BLEU'],
    'ROUGE-1': metrics_summary['ROUGE-1'],
    'ROUGE-2': metrics_summary['ROUGE-2'],
    'ROUGE-L': metrics_summary['ROUGE-L'],
    'Semantic Sim': metrics_summary['Semantic Similarity']
}
axes[0, 1].bar(quality_metrics.keys(), quality_metrics.values(), color='lightcoral')
axes[0, 1].set_title('Response Quality Metrics', fontsize=14, fontweight='bold')
axes[0, 1].set_ylabel('Score')
axes[0, 1].set_ylim(0, 1)
axes[0, 1].grid(axis='y', alpha=0.3)
plt.setp(axes[0, 1].xaxis.get_majorticklabels(), rotation=45, ha='right')

# 3. Per-Question Performance
results_df['question_short'] = results_df['question'].str[:30] + '...'
x_pos = np.arange(len(results_df))
axes[1, 0].bar(x_pos, results_df['semantic_similarity'], alpha=0.7, label='Semantic Sim', color='green')
axes[1, 0].bar(x_pos, results_df['bleu'], alpha=0.7, label='BLEU', color='orange')
axes[1, 0].set_title('Per-Question Performance', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Question Index')
axes[1, 0].set_ylabel('Score')
axes[1, 0].legend()
axes[1, 0].grid(axis='y', alpha=0.3)

# 4. Response Time Distribution
axes[1, 1].hist(results_df['response_time'], bins=10, color='purple', alpha=0.7, edgecolor='black')
axes[1, 1].axvline(results_df['response_time'].mean(), color='red', linestyle='--', 
                   label=f'Mean: {results_df["response_time"].mean():.2f}s')
axes[1, 1].set_title('Response Time Distribution', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Response Time (seconds)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].legend()
axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# ==================== DETAILED RESULTS TABLE ====================
print("\n📋 DETAILED RESULTS (First 3 Questions)\n")
display_cols = ['question', 'precision@3', 'recall@3', 'bleu', 'semantic_similarity', 'entity_accuracy']
print(results_df[display_cols].head(3).to_string(index=False))

## 💾 10. Export Results

In [None]:
# Export to CSV
results_df.to_csv('evaluation_results.csv', index=False, encoding='utf-8-sig')

# Export summary to JSON
with open('metrics_summary.json', 'w', encoding='utf-8') as f:
    json.dump(metrics_summary, f, ensure_ascii=False, indent=2)

print("✅ Results exported:")
print("   - evaluation_results.csv")
print("   - metrics_summary.json")

# Download files (for Colab)
try:
    from google.colab import files
    files.download('evaluation_results.csv')
    files.download('metrics_summary.json')
except ImportError:
    print("\n(Not in Colab - files saved locally)")

## 📖 11. Interpretation Guide

### 🎯 How to Judge Whether the Model Is Good

#### **A. Retrieval Metrics (search quality)**

| Metric | Good | Average | Weak | Meaning |
|--------|-----|------------|-----|----------|
| **Precision@3** | ≥ 0.8 | 0.5-0.8 | < 0.5 | Fraction of the top 3 results that are relevant (at most R/3 when a question has only R < 3 relevant documents) |
| **Recall@3** | ≥ 0.9 | 0.6-0.9 | < 0.6 | Share of the relevant documents that were retrieved |
| **MRR** | ≥ 0.8 | 0.5-0.8 | < 0.5 | Whether the first relevant document ranks near the top |
| **NDCG@3** | ≥ 0.85 | 0.6-0.85 | < 0.6 | Overall ranking quality |

**If retrieval is weak:**
- ✅ Improve the embeddings (fine-tune PhoBERT)
- ✅ Increase the BM25 weight (keyword matching)
- ✅ Add query expansion
- ✅ Rerank with a cross-encoder
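
When reading the Precision@3 thresholds, note that four of the five sample questions list a single relevant document, so their Precision@3 cannot exceed 1/3 even with perfect retrieval. The small sketch below computes that ceiling; `max_precision_at_k` is an illustrative helper, not part of the notebook.

```python
def max_precision_at_k(num_relevant, k):
    """Best achievable Precision@K when a question has num_relevant labelled documents."""
    return min(num_relevant, k) / k

for num_relevant in (1, 2, 3):
    print(num_relevant, round(max_precision_at_k(num_relevant, k=3), 3))
```

For datasets with few labelled documents per question, Recall@K and MRR are therefore the more informative retrieval signals.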

---

#### **B. Response Quality Metrics (answer quality)**

| Metric | Good | Average | Weak | Meaning |
|--------|-----|------------|-----|----------|
| **BLEU** | ≥ 0.4 | 0.2-0.4 | < 0.2 | Similarity in wording |
| **ROUGE-L** | ≥ 0.5 | 0.3-0.5 | < 0.3 | Overlap based on the longest common subsequence |
| **Semantic Similarity** | ≥ 0.75 | 0.6-0.75 | < 0.6 | Similarity in meaning (most important!) |
| **Entity Accuracy** | ≥ 0.85 | 0.7-0.85 | < 0.7 | Fraction of medical terms that are correct |

**If response quality is weak:**
- ✅ Improve the GPT prompt
- ✅ Provide more context (more top-K results)
- ✅ Fine-tune GPT on medical data
- ✅ Add a fact-checking layer

---

#### **C. System Performance**

| Metric | Good | Acceptable | Weak |
|--------|-----|----------------|-----|
| **Response Time** | < 2s | 2-5s | > 5s |
| **Token Usage** | < 500 tokens/query | 500-1000 | > 1000 |

---

### 🔍 Example Analyses

**Scenario 1: Good retrieval, poor response**
```
Precision@3: 0.85 ✅
Semantic Similarity: 0.45 ❌
```
→ **Likely cause:** GPT is not synthesizing the retrieved information well  
→ **Fix:** Improve the prompt, add examples

**Scenario 2: Poor retrieval, good response**
```
Precision@3: 0.40 ❌
Semantic Similarity: 0.80 ✅
```
→ **Likely cause:** GPT is answering from its own knowledge rather than from the sources (hallucination risk!)  
→ **Fix:** Improve retrieval, lower the GPT temperature

**Scenario 3: Both are good**
```
Precision@3: 0.90 ✅
Semantic Similarity: 0.85 ✅
```
→ **The model meets the production bar!** 🎉

---

### 📝 Final Evaluation Checklist

- [ ] **Retrieval Precision@3 ≥ 0.7**
- [ ] **Semantic Similarity ≥ 0.7**
- [ ] **Entity Accuracy ≥ 0.8**
- [ ] **Response Time < 3s**
- [ ] **No serious hallucinations**

If **4 of the 5 criteria** are met → the model is **GOOD** and can be deployed!
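
The checklist can also be applied programmatically to `metrics_summary`. The sketch below assumes the key names produced in section 8 and takes the hallucination check as a manual input, since no automatic metric in this notebook covers it. `checklist_verdict` is a hypothetical helper, and the example numbers are made up.

```python
def checklist_verdict(metrics, serious_hallucination_found):
    """Apply the five checklist criteria and report how many pass."""
    checks = {
        "precision@3 >= 0.7": metrics["Precision@3"] >= 0.7,
        "semantic similarity >= 0.7": metrics["Semantic Similarity"] >= 0.7,
        "entity accuracy >= 0.8": metrics["Entity Accuracy"] >= 0.8,
        "response time < 3s": metrics["Avg Response Time (s)"] < 3.0,
        # No automatic metric covers this one; it comes from manual review.
        "no serious hallucination": not serious_hallucination_found,
    }
    passed = sum(checks.values())
    return passed, passed >= 4, checks

# Made-up numbers, only to show the call.
example = {"Precision@3": 0.75, "Semantic Similarity": 0.65,
           "Entity Accuracy": 0.9, "Avg Response Time (s)": 2.1}
passed, good, checks = checklist_verdict(example, serious_hallucination_found=False)
print(passed, good)
```

In this example four criteria pass and only semantic similarity falls short, so the 4-of-5 rule is satisfied.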
