# Train Naive Bayes Classifier - Optimized

**Author:** phamlucchuong  
**Date:** 2025-01-18  
**Dataset:** raw_data_bayes.csv (with disease_id)  

Trains a Naive Bayes classifier with:
- Multiple model comparison
- Cross-validation
- Parallel processing
- Detailed evaluation

In [1]:
import pandas as pd
import json
import pickle
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import (
    classification_report, 
    confusion_matrix, 
    accuracy_score,
    precision_recall_fscore_support
)
from collections import Counter
import warnings
import time
from datetime import datetime

warnings.filterwarnings('ignore')

print("="*70)
print("🚀 NAIVE BAYES CLASSIFIER TRAINING")
print("="*70)
print(f"Start time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Author: phamlucchuong")
print("="*70)
print("\n✅ Libraries loaded")

🚀 NAIVE BAYES CLASSIFIER TRAINING
Start time: 2025-11-19 11:50:42
Author: phamlucchuong

✅ Libraries loaded


## 1. Load data

In [2]:
print("\n" + "="*70)
print("📂 LOADING DATA")
print("="*70)

# Load binary data
df = pd.read_csv("../../data/processed/naive_bayes_data.csv")
print(f"✓ Loaded binary data: {df.shape}")

# Load symptom mapping
with open("../../data/processed/symptom_mapping.json", "r", encoding="utf-8") as f:
    symptom_mapping = json.load(f)

all_symptoms = symptom_mapping["all_symptoms"]
print(f"✓ Loaded symptom mapping: {len(all_symptoms)} symptoms")

# Load disease mapping
with open("../../data/processed/disease_mapping.json", "r", encoding="utf-8") as f:
    disease_mapping = json.load(f)

print(f"✓ Loaded disease mapping: {len(disease_mapping)} diseases")

print(f"\n📊 Dataset Info:")
print(f"   Samples: {len(df)}")
print(f"   Features: {len(all_symptoms)}")
print(f"   Classes: {df['disease'].nunique()}")
print(f"   Columns: {df.columns.tolist()[:5]} ... (+{len(df.columns)-5} more)")

# Display sample
print(f"\n📋 Sample data (first 3 rows, first 8 columns):")
print(df.iloc[:3, :8])


📂 LOADING DATA
✓ Loaded binary data: (2173, 627)
✓ Loaded symptom mapping: 625 symptoms
✓ Loaded disease mapping: 20 diseases

📊 Dataset Info:
   Samples: 2173
   Features: 625
   Classes: 21
   Columns: ['disease_id', 'disease', 'Ban_đỏ', 'Bị_bít_mũi', 'Bị_nhức_đầu'] ... (+622 more)

📋 Sample data (first 3 rows, first 8 columns):
  disease_id                disease  Ban_đỏ  Bị_bít_mũi  Bị_nhức_đầu  Bị_sốt  \
0       D001  Cảm lạnh thông thường       0           0            0       0   
1       D001  Cảm lạnh thông thường       0           0            0       0   
2       D001  Cảm lạnh thông thường       0           0            0       0   

   Bị_sổ_mũi  Chảy_mũi  
0          0         0  
1          0         0  
2          0         0  
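
The summary above reports 21 distinct values in the `disease` column but only 20 entries in the disease mapping, so at least one label has no ID. The following is a minimal sketch for listing such labels, assuming the mapping is a dict of `disease_id -> disease_name` as loaded from `disease_mapping.json`. The helper name `find_unmapped_labels` and the toy data are hypothetical, not part of the notebook.

```python
import pandas as pd

def find_unmapped_labels(labels, disease_mapping):
    """Return label values (with their counts) that have no entry in the mapping."""
    known_names = set(disease_mapping.values())
    counts = pd.Series(labels).value_counts()
    return {name: int(n) for name, n in counts.items() if name not in known_names}

# Toy data: the last label differs from a mapped name only by an underscore.
toy_mapping = {"D001": "Cold", "D017": "COPD chronic"}
toy_labels = ["Cold", "Cold", "COPD chronic", "COPD_chronic"]

unmapped = find_unmapped_labels(toy_labels, toy_mapping)
print(unmapped)
```

Running a check like this right after loading would surface misspelled or inconsistently formatted labels before they reach the training step.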


## 2. Data distribution analysis

In [3]:
print("\n" + "="*70)
print("📊 DATA ANALYSIS")
print("="*70)

# Disease distribution
disease_counts = df['disease'].value_counts()

print(f"\n📈 Sample distribution by disease:")
print(f"\n{'Disease':<45} {'ID':<8} {'Samples':>8} {'%':>7}")
print("="*70)

# Build reverse mapping (disease_name -> disease_id)
reverse_disease_map = {v: k for k, v in disease_mapping.items()}

for disease, count in disease_counts.items():
    disease_id = reverse_disease_map.get(disease, "N/A")
    percentage = count / len(df) * 100
    print(f"{disease:<45} {disease_id:<8} {count:>8} {percentage:>6.1f}%")

print("\n📊 Statistics:")
print(f"   Mean:   {disease_counts.mean():.1f} samples/disease")
print(f"   Median: {disease_counts.median():.1f} samples/disease")
print(f"   Min:    {disease_counts.min()} samples")
print(f"   Max:    {disease_counts.max()} samples")
print(f"   Std:    {disease_counts.std():.1f}")

# Symptom analysis
print(f"\n🔍 Symptom analysis:")
symptom_counts = df[all_symptoms].sum().sort_values(ascending=False)
print(f"   Top 10 most common symptoms:")
for i, (symptom, count) in enumerate(symptom_counts.head(10).items(), 1):
    print(f"   {i:2d}. {symptom:<40} {count:>3} times ({count/len(df)*100:.1f}%)")


📊 DATA ANALYSIS

📈 Sample distribution by disease:

Disease                                       ID        Samples       %
Bệnh Tim mạch vành (Đau thắt ngực, Nhồi máu cơ tim) D014          158    7.3%
Bệnh Tăng huyết áp (Tim mạch)                 D013          135    6.2%
Viêm dạ dày ruột (Gastroenteritis)            D011          130    6.0%
Thủy đậu (Chickenpox)                         D012          118    5.4%
Bệnh Phổi tắc nghẽn mạn tính (COPD)           D017          117    5.4%
Viêm khớp mạn tính                            D020          115    5.3%
Đái tháo đường (Tiểu đường)                   D015          110    5.1%
Viêm loét dạ dày và Tá tràng                  D018          101    4.6%
Bệnh Lao phổi                                 D005          100    4.6%
Cảm lạnh thông thường                         D001          100    4.6%
Viêm phế quản cấp                             D004     

## 3. Prepare data for training

In [4]:
print("\n" + "="*70)
print("🔧 DATA PREPARATION")
print("="*70)

# Split features and labels
X = df[all_symptoms].values
y = df["disease"].values

print(f"\n✓ Features (X): {X.shape}")
print(f"✓ Labels (y): {y.shape}")

# Filter out classes with fewer than 2 samples
min_samples = 2
disease_counts = Counter(y)
valid_classes = [d for d, count in disease_counts.items() if count >= min_samples]
removed_classes = [d for d, count in disease_counts.items() if count < min_samples]

if removed_classes:
    print(f"\n⚠️  Removing {len(removed_classes)} classes with < {min_samples} samples:")
    for disease in removed_classes:
        disease_id = reverse_disease_map.get(disease, "N/A")
        print(f"   - {disease} ({disease_id}): {disease_counts[disease]} samples")
    
    mask = np.isin(y, valid_classes)
    X = X[mask]
    y = y[mask]
    print(f"\n✓ After filtering: {len(X)} samples, {len(valid_classes)} classes")
else:
    print(f"\n✓ All classes have >= {min_samples} samples")

# Add a small constant offset to every feature value. The log message calls this
# "Laplace smoothing", but the additive smoothing of the probability estimates is
# applied separately by each model through its own `alpha` parameter.
smoothing_value = 0.01
X = X + smoothing_value
print(f"\n✓ Applied Laplace smoothing (alpha={smoothing_value})")

print(f"\n📊 Final data shape:")
print(f"   X: {X.shape}")
print(f"   y: {y.shape}")
print(f"   Classes: {len(np.unique(y))}")


🔧 DATA PREPARATION

✓ Features (X): (2173, 625)
✓ Labels (y): (2173,)

⚠️  Removing 1 classes with < 2 samples:
   - Bệnh Phổi tắc_nghẽn_mạn_tính (COPD) (N/A): 1 samples

✓ After filtering: 2172 samples, 20 classes

✓ Applied Laplace smoothing (alpha=0.01)

📊 Final data shape:
   X: (2172, 625)
   y: (2172,)
   Classes: 20
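
The constant added to `X` above is separate from the additive smoothing that `MultinomialNB` applies through its `alpha` parameter. As a sketch on made-up count data (not the notebook's dataset), the smoothed log-likelihoods can be reproduced by hand from the fitted `feature_count_` attribute:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy binary features: 4 samples, 3 features, 2 classes.
X_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
y_toy = np.array(["a", "a", "b", "b"])

alpha = 1.0
clf = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Additive (Laplace/Lidstone) smoothing of the per-class feature likelihoods:
# P(feature i | class c) = (N_ci + alpha) / (N_c + alpha * n_features)
smoothed_counts = clf.feature_count_ + alpha
manual_log_prob = np.log(smoothed_counts / smoothed_counts.sum(axis=1, keepdims=True))

matches = np.allclose(manual_log_prob, clf.feature_log_prob_)
print(matches)
```

This is why the `alpha` values compared later in the notebook act as the smoothing strength, independently of the offset stored in `smoothing_value`.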


## 4. Split train/test

In [5]:
print("\n" + "="*70)
print("✂️  TRAIN/TEST SPLIT")
print("="*70)

test_size = 0.2
random_state = 42

# Try stratified split
try:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=test_size, 
        random_state=random_state, 
        stratify=y
    )
    split_method = "Stratified"
    print(f"✓ Using stratified split")
except ValueError as e:
    print(f"⚠️  Stratified split failed: {e}")
    print(f"   → Using random split")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=test_size, 
        random_state=random_state
    )
    split_method = "Random"

print(f"\n📊 Split summary:")
print(f"   Method: {split_method}")
print(f"   Test size: {test_size*100:.0f}%")
print(f"   Random state: {random_state}")
print(f"\n   Train: {len(X_train):>4} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"   Test:  {len(X_test):>4} samples ({len(X_test)/len(X)*100:.1f}%)")

# Class distribution
train_classes = len(np.unique(y_train))
test_classes = len(np.unique(y_test))

print(f"\n📈 Class distribution:")
print(f"   Train: {train_classes} unique classes")
print(f"   Test:  {test_classes} unique classes")

if train_classes != test_classes:
    print(f"\n⚠️  Warning: Train and test have different number of classes!")
    missing_in_test = set(y_train) - set(y_test)
    if missing_in_test:
        print(f"   Classes missing in test set: {missing_in_test}")


✂️  TRAIN/TEST SPLIT
✓ Using stratified split

📊 Split summary:
   Method: Stratified
   Test size: 20%
   Random state: 42

   Train: 1737 samples (80.0%)
   Test:   435 samples (20.0%)

📈 Class distribution:
   Train: 20 unique classes
   Test:  20 unique classes


## 5. Training Multiple Models

In [6]:
print("\n" + "="*70)
print("🚀 MODEL TRAINING")
print("="*70)

# Define models
models = {
    "MultinomialNB (α=0.1)": MultinomialNB(alpha=0.1),
    "MultinomialNB (α=0.5)": MultinomialNB(alpha=0.5),
    "MultinomialNB (α=1.0)": MultinomialNB(alpha=1.0),
    "MultinomialNB (α=2.0)": MultinomialNB(alpha=2.0),
    "ComplementNB (α=0.5)": ComplementNB(alpha=0.5),
    "ComplementNB (α=1.0)": ComplementNB(alpha=1.0),
}

results = {}
start_time = time.time()

for name, model in models.items():
    print(f"\n{'─'*70}")
    print(f"Training: {name}")
    print(f"{'─'*70}")
    
    model_start = time.time()
    
    # Train
    model.fit(X_train, y_train)
    train_time = time.time() - model_start
    
    # Predict
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    # Metrics
    train_acc = accuracy_score(y_train, y_pred_train)
    test_acc = accuracy_score(y_test, y_pred_test)
    
    # Cross-validation (if there are enough samples)
    try:
        n_splits = min(3, len(np.unique(y_train)))
        if n_splits >= 2:
            cv_scores = cross_val_score(
                model, X_train, y_train,
                cv=n_splits,
                scoring='accuracy',
                n_jobs=-1
            )
            cv_mean = cv_scores.mean()
            cv_std = cv_scores.std()
        else:
            cv_mean = 0
            cv_std = 0
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        cv_mean = 0
        cv_std = 0
    
    # Precision, Recall, F1
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred_test, average='weighted', zero_division=0
    )
    
    results[name] = {
        'model': model,
        'train_acc': train_acc,
        'test_acc': test_acc,
        'cv_mean': cv_mean,
        'cv_std': cv_std,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'train_time': train_time,
        'y_pred': y_pred_test
    }
    
    print(f"✓ Train Accuracy:  {train_acc:.4f}")
    print(f"✓ Test Accuracy:   {test_acc:.4f}")
    if cv_mean > 0:
        print(f"✓ CV Accuracy:     {cv_mean:.4f} (±{cv_std:.4f})")
    print(f"✓ Precision:       {precision:.4f}")
    print(f"✓ Recall:          {recall:.4f}")
    print(f"✓ F1-Score:        {f1:.4f}")
    print(f"⏱️  Training time:   {train_time:.4f}s")

total_time = time.time() - start_time
print(f"\n{'='*70}")
print(f"✅ All models trained in {total_time:.2f}s")
print(f"{'='*70}")


🚀 MODEL TRAINING

──────────────────────────────────────────────────────────────────────
Training: MultinomialNB (α=0.1)
──────────────────────────────────────────────────────────────────────
✓ Train Accuracy:  0.9942
✓ Test Accuracy:   0.9862
✓ CV Accuracy:     0.9856 (±0.0059)
✓ Precision:       0.9870
✓ Recall:          0.9862
✓ F1-Score:        0.9863
⏱️  Training time:   0.0185s

──────────────────────────────────────────────────────────────────────
Training: MultinomialNB (α=0.5)
──────────────────────
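
When `cv` is given as an integer for a classifier, scikit-learn uses stratified folds without shuffling. `StratifiedKFold` is imported at the top of the notebook but never used, so the following sketch shows one way it could be passed explicitly with shuffling and a fixed seed. The data here is synthetic, not the symptom matrix, and the class separation is chosen arbitrarily for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(42)

# Synthetic binary features: class 0 mostly activates the first 5 features,
# class 1 mostly activates the last 5.
n_per_class = 30
p_class0 = np.r_[np.full(5, 0.8), np.full(5, 0.1)]
p_class1 = np.r_[np.full(5, 0.1), np.full(5, 0.8)]
X0 = (rng.random((n_per_class, 10)) < p_class0).astype(int)
X1 = (rng.random((n_per_class, 10)) < p_class1).astype(int)
X_syn = np.vstack([X0, X1])
y_syn = np.array([0] * n_per_class + [1] * n_per_class)

# Explicit stratified folds with shuffling, so fold membership does not
# depend on the row order of the dataset.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    MultinomialNB(alpha=0.1), X_syn, y_syn, cv=cv, scoring="accuracy"
)
print(f"CV accuracy: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")
```

Shuffling matters most when rows are grouped by class or by source in the file, as appears to be the case in the sample rows shown earlier.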

## 6. Model Comparison

In [7]:
print("\n" + "="*70)
print("📊 MODEL COMPARISON")
print("="*70)

# Create comparison table
comparison_data = []
for name, res in results.items():
    comparison_data.append({
        'Model': name,
        'Train Acc': f"{res['train_acc']:.4f}",
        'Test Acc': f"{res['test_acc']:.4f}",
        'CV Acc': f"{res['cv_mean']:.4f}" if res['cv_mean'] > 0 else "N/A",
        'Precision': f"{res['precision']:.4f}",
        'Recall': f"{res['recall']:.4f}",
        'F1': f"{res['f1']:.4f}",
        'Time (s)': f"{res['train_time']:.4f}"
    })

comparison_df = pd.DataFrame(comparison_data)
print("\n" + comparison_df.to_string(index=False))

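# Select best model by cross-validated accuracy (test accuracy only breaks ties),
# so the held-out test set is not used for model selection
best_model_name = max(results, key=lambda x: (results[x]['cv_mean'], results[x]['test_acc']))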
best_result = results[best_model_name]
best_model = best_result['model']

print(f"\n{'='*70}")
print(f"🏆 BEST MODEL: {best_model_name}")
print(f"{'='*70}")
print(f"   Test Accuracy:  {best_result['test_acc']:.4f}")
print(f"   Precision:      {best_result['precision']:.4f}")
print(f"   Recall:         {best_result['recall']:.4f}")
print(f"   F1-Score:       {best_result['f1']:.4f}")
print(f"   Training time:  {best_result['train_time']:.4f}s")


📊 MODEL COMPARISON

                Model Train Acc Test Acc CV Acc Precision Recall     F1 Time (s)
MultinomialNB (α=0.1)    0.9942   0.9862 0.9856    0.9870 0.9862 0.9863   0.0185
MultinomialNB (α=0.5)    0.9919   0.9839 0.9799    0.9848 0.9839 0.9839   0.0098
MultinomialNB (α=1.0)    0.9908   0.9816 0.9729    0.9828 0.9816 0.9815   0.0093
MultinomialNB (α=2.0)    0.9856   0.9770 0.9603    0.9791 0.9770 0.9768   0.0070
 ComplementNB (α=0.5)    0.9620   0.9563 0.9326    0.9607 0.9563 0.9555   0.0077
 ComplementNB (α=1.0)    0.9620   0.9563 0.9315    0.9607 0.9563 0.9555   0.0078

🏆 BEST MODEL: MultinomialNB (α=0.1)
   Test Accuracy:  0.9862
   Precision:      0.9870
   Recall:         0.9862
   F1-Score:       0.9863
   Training time:  0.0185s


## 7. Detailed Evaluation

In [8]:
print("\n" + "="*70)
print(f"📊 DETAILED EVALUATION - {best_model_name}")
print("="*70)

y_pred_best = best_result['y_pred']

# Classification Report
print("\n📋 Classification Report:")
print("\n" + classification_report(y_test, y_pred_best, zero_division=0))

# Per-class accuracy (computed on rows of one true class, i.e. that class's recall)
print("\n📈 Per-class Performance:")
print(f"\n{'Disease':<45} {'ID':<8} {'Acc':>6} {'Samples':>8}")
print("="*70)

for disease in sorted(np.unique(y_test)):
    mask = y_test == disease
    if mask.sum() > 0:
        acc = accuracy_score(y_test[mask], y_pred_best[mask])
        disease_id = reverse_disease_map.get(disease, "N/A")
        n_samples = mask.sum()
        print(f"{disease:<45} {disease_id:<8} {acc:>5.1%} {n_samples:>8}")

# Confusion Matrix (top 10 classes)
print("\n📊 Confusion Matrix (Top 10 most common classes in test set):")
unique_test_classes = np.unique(y_test)

if len(unique_test_classes) > 10:
    test_counts = Counter(y_test)
    top_classes = [c for c, _ in test_counts.most_common(10)]
    
    mask = np.isin(y_test, top_classes)
    y_test_top = y_test[mask]
    y_pred_top = y_pred_best[mask]
    
    cm = confusion_matrix(y_test_top, y_pred_top, labels=top_classes)
    cm_df = pd.DataFrame(cm, index=top_classes, columns=top_classes)
else:
    cm = confusion_matrix(y_test, y_pred_best)
    cm_df = pd.DataFrame(cm, index=unique_test_classes, columns=unique_test_classes)

print("\n" + cm_df.to_string())


📊 DETAILED EVALUATION - MultinomialNB (α=0.1)

📋 Classification Report:

                                                     precision    recall  f1-score   support

                                    Bệnh Gút (Gout)       1.00      1.00      1.00        18
                                      Bệnh Lao phổi       0.95      0.95      0.95        20
                Bệnh Phổi tắc nghẽn mạn tính (COPD)       1.00      1.00      1.00        24
                            Bệnh Tay - Chân - Miệng       1.00      1.00      1.00        20
Bệnh Tim mạch vành (Đau thắt ngực, Nhồi máu cơ tim)       1.00      0.94      0.97        32
                                 Bệnh Tiêu chảy cấp       1.00      1.00      1.00        20
                      Bệnh Tăng huyết áp (Tim mạch)       0.90      1.00      0.95        27
                                            Cảm cúm       1.00      1.00      1.00        20
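
The per-class figure computed in the evaluation loop, accuracy restricted to the rows whose true label is a given disease, is the same quantity as that class's recall in the classification report. A small sketch with made-up labels shows the equivalence through `recall_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels for illustration only.
y_true = np.array(["flu", "flu", "flu", "cold", "cold", "gout"])
y_pred = np.array(["flu", "flu", "cold", "cold", "cold", "gout"])

labels = np.unique(y_true)

# Accuracy on the subset of rows belonging to each true class.
per_class_acc = np.array([
    accuracy_score(y_true[y_true == c], y_pred[y_true == c]) for c in labels
])

# Recall per class, returned in the same label order.
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

for c, acc, rec in zip(labels, per_class_acc, per_class_recall):
    print(f"{c:<6} acc={acc:.2f} recall={rec:.2f}")
```

Because the two numbers coincide, the per-class table adds the disease ID and sample count to the report rather than a new metric.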
                          

## 8. Save Model

In [9]:
import os

print("\n" + "="*70)
print("💾 SAVING MODEL")
print("="*70)

os.makedirs("../../models", exist_ok=True)

# 1. Save best model
model_path = "../../models/naive_bayes_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(best_model, f)
print(f"\n✓ Saved model: {model_path}")

# 2. Save model info
model_info = {
    "model_name": best_model_name,
    "model_type": type(best_model).__name__,
    "symptoms": all_symptoms,
    "diseases": sorted(np.unique(y_train).tolist()),
    "disease_mapping": disease_mapping,
    "metrics": {
        "test_accuracy": float(best_result['test_acc']),
        "precision": float(best_result['precision']),
        "recall": float(best_result['recall']),
        "f1_score": float(best_result['f1']),
        "cv_accuracy_mean": float(best_result['cv_mean']) if best_result['cv_mean'] > 0 else None,
        "cv_accuracy_std": float(best_result['cv_std']) if best_result['cv_std'] > 0 else None
    },
    "data_info": {
        "n_samples_train": len(X_train),
        "n_samples_test": len(X_test),
        "n_features": len(all_symptoms),
        "n_classes": len(np.unique(y_train)),
        "split_method": split_method,
        "test_size": test_size,
        "random_state": random_state,
        "smoothing_value": smoothing_value
    },
    "training_info": {
        "training_time": float(best_result['train_time']),
        "timestamp": datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        "author": "phamlucchuong"
    }
}

info_path = "../../models/model_info.json"
with open(info_path, "w", encoding="utf-8") as f:
    json.dump(model_info, f, ensure_ascii=False, indent=2)
print(f"✓ Saved info: {info_path}")

# 3. Save all model results for comparison
all_results_path = "../../models/all_models_comparison.json"
all_results_data = {
    name: {
        "train_accuracy": float(res['train_acc']),
        "test_accuracy": float(res['test_acc']),
        "cv_mean": float(res['cv_mean']) if res['cv_mean'] > 0 else None,
        "cv_std": float(res['cv_std']) if res['cv_std'] > 0 else None,
        "precision": float(res['precision']),
        "recall": float(res['recall']),
        "f1_score": float(res['f1']),
        "training_time": float(res['train_time'])
    }
    for name, res in results.items()
}

with open(all_results_path, "w", encoding="utf-8") as f:
    json.dump(all_results_data, f, ensure_ascii=False, indent=2)
print(f"✓ Saved comparison: {all_results_path}")

print(f"\n📦 Saved files:")
print(f"   1. {model_path}")
print(f"   2. {info_path}")
print(f"   3. {all_results_path}")


💾 SAVING MODEL

✓ Saved model: ../../models/naive_bayes_model.pkl
✓ Saved info: ../../models/model_info.json
✓ Saved comparison: ../../models/all_models_comparison.json

📦 Saved files:
   1. ../../models/naive_bayes_model.pkl
   2. ../../models/model_info.json
   3. ../../models/all_models_comparison.json
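
Before relying on the saved file, it can be worth checking that a pickled model reloads and predicts identically. The sketch below does an in-memory round trip on toy data and does not touch the notebook's own files. Pickles should only be loaded from trusted sources, and generally with the same scikit-learn version that produced them.

```python
import pickle
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy data standing in for the symptom matrix.
X_toy = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y_toy = np.array(["a", "a", "b", "b"])
model = MultinomialNB(alpha=0.1).fit(X_toy, y_toy)

# Serialize to bytes and restore, mirroring pickle.dump / pickle.load on a file.
restored = pickle.loads(pickle.dumps(model))

same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
same_probabilities = np.allclose(
    model.predict_proba(X_toy), restored.predict_proba(X_toy)
)
print(same_predictions, same_probabilities)
```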


## 9. Test with new data

In [10]:
print("\n" + "="*70)
print("🧪 TESTING WITH NEW DATA")
print("="*70)

def predict_disease(symptoms_list, model, all_symptoms, disease_mapping, top_k=5):
    """
    Predict a disease from a list of symptoms
    
    Args:
        symptoms_list: List of symptoms
        model: Trained model
        all_symptoms: List of all possible symptoms
        disease_mapping: Dict mapping disease_id to disease_name
        top_k: Number of top predictions
    
    Returns:
        prediction, top_diseases, disease_id
    """
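    # Convert to binary vector, then apply the same feature offset used in training
    # (relies on the notebook-level `smoothing_value`)
    vector = np.array([1 if s in symptoms_list else 0 for s in all_symptoms])
    vector = vector.reshape(1, -1) + smoothing_value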
    
    # Predict
    prediction = model.predict(vector)[0]
    probabilities = model.predict_proba(vector)[0]
    
    # Get disease_id
    reverse_map = {v: k for k, v in disease_mapping.items()}
    disease_id = reverse_map.get(prediction, "Unknown")
    
    # Top K
    top_indices = np.argsort(probabilities)[::-1][:top_k]
    top_diseases = []
    for i in top_indices:
        disease_name = model.classes_[i]
        disease_id_top = reverse_map.get(disease_name, "Unknown")
        top_diseases.append((disease_name, disease_id_top, probabilities[i]))
    
    return prediction, top_diseases, disease_id

# Test cases
test_cases = [
    {
        "name": "Case 1: Cúm",
        "symptoms": ["sốt_cao_trên_38_5_độ", "đau_đầu_dữ_dội", "đau_nhức_cơ_bắp", "ho_khan"],
        "expected": "Cảm cúm (Influenza)"
    },
    {
        "name": "Case 2: Cảm lạnh",
        "symptoms": ["hắt_hơi", "sổ_mũi", "nghẹt_mũi", "đau_họng_nhẹ"],
        "expected": "Cảm lạnh thông thường"
    },
    {
        "name": "Case 3: Viêm họng",
        "symptoms": ["đau_họng_dữ_dội", "khó_nuốt", "sốt", "amidan_sưng_đỏ"],
        "expected": "Viêm họng và Viêm Amidan cấp"
    },
    {
        "name": "Case 4: Tiêu chảy",
        "symptoms": ["tiêu_chảy_phân_lỏng", "đau_bụng_quặn", "sốt_nhẹ", "mất_nước"],
        "expected": "Bệnh Tiêu chảy cấp"
    },
    {
        "name": "Case 5: Gút",
        "symptoms": ["đau_khớp_ngón_chân_cái_dữ_dội", "sưng_đỏ_nóng", "đau_đột_ngột_ban_đêm"],
        "expected": "Bệnh Gút (Gout)"
    },
]

print("\n📝 Test Results:\n")

correct = 0
total = len(test_cases)

for i, test in enumerate(test_cases, 1):
    symptoms = test["symptoms"]
    expected = test.get("expected", "Unknown")
    
    pred, top_diseases, disease_id = predict_disease(
        symptoms, best_model, all_symptoms, disease_mapping
    )
    
    is_correct = pred == expected
    if is_correct:
        correct += 1
    
    print(f"{test['name']}:")
    print(f"   Symptoms: {', '.join(symptoms[:3])}" + (" ..." if len(symptoms) > 3 else ""))
    print(f"   Expected: {expected}")
    print(f"   Predicted: {pred} ({disease_id}) {'✅' if is_correct else '❌'}")
    print(f"   \n   Top 5 predictions:")
    for rank, (disease, did, prob) in enumerate(top_diseases, 1):
        marker = "👉" if disease == pred else "  "
        print(f"   {marker} {rank}. {disease:<45} ({did:<8}) {prob:>6.2%}")
    print()

print(f"{'='*70}")
print(f"📊 Test Accuracy: {correct}/{total} ({correct/total*100:.1f}%)")
print(f"{'='*70}")


🧪 TESTING WITH NEW DATA

📝 Test Results:

Case 1: Cúm:
   Symptoms: sốt_cao_trên_38_5_độ, đau_đầu_dữ_dội, đau_nhức_cơ_bắp ...
   Expected: Cảm cúm (Influenza)
   Predicted: Cảm cúm (D002) ❌
   
   Top 5 predictions:
   👉 1. Cảm cúm                                       (D002    ) 88.90%
      2. Sốt xuất huyết                                (D006    )  2.17%
      3. Viêm họng và Viêm Amidan cấp                  (D003    )  1.70%
      4. Sốt rét                                       (D007    )  0.99%
      5. Viêm phế quản cấp                             (D004    )  0.97%

Case 2: Cảm lạnh:
   Symptoms: hắt_hơi, sổ_mũi, nghẹt_mũi ...
   Expected: Cảm lạnh thông thường
   Predicted: Cảm cúm (D002) ❌
   
   Top 5 predictions:
   👉 1. Cảm cúm                                       (D002    ) 44.23%
      2. Viêm họng và Viêm Amidan cấp                  (D003    ) 11.02%
      3. Viêm d
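
In Case 1 the predicted class `Cảm cúm` is flagged ❌ because the expected string `Cảm cúm (Influenza)` is not one of the model's class labels, so the exact-match comparison can never succeed. The sketch below is a possible pre-check that reports expected labels missing from the model's classes and symptoms missing from the feature list. The helper `validate_test_case` and the toy values are hypothetical; in the notebook the two sets would come from `best_model.classes_` and `all_symptoms`.

```python
def validate_test_case(case, known_classes, known_symptoms):
    """Return a list of human-readable problems found in one test case."""
    problems = []
    if case["expected"] not in known_classes:
        problems.append(f"expected label not in model classes: {case['expected']!r}")
    unknown = [s for s in case["symptoms"] if s not in known_symptoms]
    if unknown:
        problems.append(f"symptoms not in feature list: {unknown}")
    return problems

# Toy stand-ins for model.classes_ and all_symptoms.
known_classes = {"Cảm cúm", "Cảm lạnh thông thường"}
known_symptoms = {"ho_khan", "sổ_mũi"}

case = {
    "name": "Case 1",
    "symptoms": ["ho_khan", "sốt"],
    "expected": "Cảm cúm (Influenza)",
}
problems = validate_test_case(case, known_classes, known_symptoms)
for p in problems:
    print(p)
```

A symptom that is absent from the feature list contributes nothing to the input vector, so a check like this also helps explain unexpectedly low-confidence predictions.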

## 10. Summary

In [11]:
print("\n" + "="*70)
print("✅ TRAINING SUMMARY")
print("="*70)

print(f"\n📊 Dataset:")
print(f"   Total samples: {len(df)}")
print(f"   Train samples: {len(X_train)}")
print(f"   Test samples: {len(X_test)}")
print(f"   Features: {len(all_symptoms)}")
print(f"   Classes: {len(np.unique(y_train))}")

print(f"\n🏆 Best Model:")
print(f"   Name: {best_model_name}")
print(f"   Test Accuracy: {best_result['test_acc']:.2%}")
print(f"   F1-Score: {best_result['f1']:.2%}")

print(f"\n💾 Saved Files:")
print(f"   - naive_bayes_model.pkl")
print(f"   - model_info.json")
print(f"   - all_models_comparison.json")

print(f"\n⏱️  Training Info:")
print(f"   Training time: {best_result['train_time']:.4f}s")
print(f"   Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"   Author: phamlucchuong")

print(f"\n" + "="*70)
print("✅ DONE!")
print("="*70)

print(f"\n📝 Next steps:")
print(f"   1. Review model performance in model_info.json")
print(f"   2. Test model with API: python api/main.py")
print(f"   3. Deploy to production")


✅ TRAINING SUMMARY

📊 Dataset:
   Total samples: 2173
   Train samples: 1737
   Test samples: 435
   Features: 625
   Classes: 20

🏆 Best Model:
   Name: MultinomialNB (α=0.1)
   Test Accuracy: 98.62%
   F1-Score: 98.63%

💾 Saved Files:
   - naive_bayes_model.pkl
   - model_info.json
   - all_models_comparison.json

⏱️  Training Info:
   Training time: 0.0185s
   Timestamp: 2025-11-19 11:50:49
   Author: phamlucchuong

✅ DONE!

📝 Next steps:
   1. Review model performance in model_info.json
   2. Test model with API: python api/main.py
   3. Deploy to production
