# 🏥 Medical Model Evaluation: Hands-On Practice

## Table of Contents
1. [Medical Benchmark Dataset Loading](#practice-1-medical-benchmark-dataset-loading)
2. [Accuracy and Basic Metrics](#practice-2-accuracy-and-basic-metrics)
3. [Hallucination Detection](#practice-3-hallucination-detection)
4. [Factuality Scoring](#practice-4-factuality-scoring)
5. [Consistency Measures](#practice-5-consistency-measures)
6. [Uncertainty Quantification](#practice-6-uncertainty-quantification)
7. [Calibration Metrics](#practice-7-calibration-metrics)
8. [Clinical Relevance Scoring](#practice-8-clinical-relevance-scoring)
9. [Safety Assessment](#practice-9-safety-assessment)
10. [Performance Dashboard Creation](#practice-10-performance-dashboard-creation)

## Installing and Importing Essential Libraries

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.calibration import calibration_curve
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
sns.set_style('whitegrid')
sns.set_palette('husl')

print("✅ All libraries loaded successfully!")

---
## Practice 1: Medical Benchmark Dataset Loading

### 🎯 Learning Objectives
- Load and explore medical Q&A benchmark data
- Understand the structure of MedQA-style datasets
- Prepare data for evaluation

### 📖 Key Concepts
**Medical Benchmarks:** MedQA (11,450 questions), USMLE-style, 4-choice format
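
To make the 4-choice format concrete, here is a minimal sketch of how one MedQA-style item might be represented and converted to the 0-3 integer labels used in the synthetic dataset of this practice. The record layout, the field names, and the `answer_letter_to_index` helper are assumptions for illustration, not the schema of any official release.

```python
# Hypothetical record layout; field names are assumptions for illustration.
example_item = {
    "question": "<clinical vignette and question stem>",
    "options": {
        "A": "<option text>",
        "B": "<option text>",
        "C": "<option text>",
        "D": "<option text>",
    },
    "answer": "C",
}


def answer_letter_to_index(letter, letters="ABCD"):
    """Map an answer letter (A-D) to an integer label (0-3)."""
    return letters.index(letter.upper())


true_answer_index = answer_letter_to_index(example_item["answer"])
print(f"Answer {example_item['answer']} maps to label {true_answer_index}")
```

Integer labels keep the evaluation code independent of how a particular benchmark names its options.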

In [None]:
# 1.1 Create synthetic medical Q&A dataset
def create_medical_qa_dataset(n_samples=100):
    """
    Create a synthetic medical Q&A dataset for evaluation practice
    Simulates MedQA-style benchmark data
    """
    np.random.seed(42)
    
    # Ground-truth answers, drawn uniformly from the four options (0-3)
    true_labels = np.random.choice([0, 1, 2, 3], size=n_samples, p=[0.25, 0.25, 0.25, 0.25])
    
    # Model predictions: copy the truth, then re-draw 15% of the answers at random.
    # A re-drawn answer can match the truth by chance, so the realized error
    # rate ends up somewhat below 15%.
    predicted_labels = true_labels.copy()
    error_indices = np.random.choice(n_samples, size=int(n_samples * 0.15), replace=False)
    predicted_labels[error_indices] = np.random.choice([0, 1, 2, 3], size=len(error_indices))
    
    # Generate confidence scores (higher for correct predictions)
    confidence_scores = np.zeros(n_samples)
    correct_mask = (true_labels == predicted_labels)
    confidence_scores[correct_mask] = np.random.beta(8, 2, size=np.sum(correct_mask))
    confidence_scores[~correct_mask] = np.random.beta(3, 5, size=np.sum(~correct_mask))
    
    # Create difficulty levels
    difficulty = np.random.choice(['Easy', 'Moderate', 'Hard', 'Expert'], 
                                   size=n_samples, 
                                   p=[0.3, 0.3, 0.25, 0.15])
    
    # Create medical specialties
    specialties = np.random.choice(['Cardiology', 'Neurology', 'Pediatrics', 'Surgery', 'Internal Medicine'],
                                    size=n_samples,
                                    p=[0.2, 0.2, 0.2, 0.2, 0.2])
    
    df = pd.DataFrame({
        'question_id': range(1, n_samples + 1),
        'true_answer': true_labels,
        'predicted_answer': predicted_labels,
        'confidence': confidence_scores,
        'difficulty': difficulty,
        'specialty': specialties,
        'is_correct': true_labels == predicted_labels
    })
    
    return df

# Load dataset
medical_data = create_medical_qa_dataset(n_samples=200)

print("📊 Medical Q&A Dataset Overview")
print("=" * 60)
print(f"Total samples: {len(medical_data)}")
print(f"\nFirst 5 rows:")
print(medical_data.head())
print(f"\nDataset Info:")
print(medical_data.info())
print(f"\nBasic Statistics:")
print(medical_data.describe())

---
## Practice 2: Accuracy and Basic Metrics

### 🎯 Learning Objectives
- Calculate accuracy, precision, recall, F1-score
- Understand the limitations of accuracy alone
- Visualize performance metrics

### 📖 Key Concepts
**Clinical Utility vs Accuracy:** High accuracy doesn't always mean clinical usefulness!
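
One way to see why accuracy alone can mislead is a toy case with a rare but important class. The sketch below uses made-up labels (not benchmark data) and a degenerate "model" that always predicts the majority class. Per-class precision and recall come from scikit-learn's `precision_recall_fscore_support`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels (illustrative only): class 1 is a rare but important finding.
y_true_toy = np.array([0] * 95 + [1] * 5)
# A degenerate "model" that always predicts the majority class.
y_pred_toy = np.zeros(100, dtype=int)

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
precision, recall, f1_per_class, support = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, labels=[0, 1], zero_division=0
)

print(f"Accuracy: {toy_accuracy:.2f}")
print(f"Recall for the rare class: {recall[1]:.2f}")
```

Accuracy looks strong here even though every rare-class case is missed, which is why per-class recall belongs next to accuracy in a clinical evaluation.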

In [None]:
# 2.1 Calculate basic evaluation metrics
def calculate_basic_metrics(df):
    """
    Calculate and display basic evaluation metrics
    """
    y_true = df['true_answer']
    y_pred = df['predicted_answer']
    
    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    f1_macro = f1_score(y_true, y_pred, average='macro')
    f1_weighted = f1_score(y_true, y_pred, average='weighted')
    
    print("📈 Basic Evaluation Metrics")
    print("=" * 60)
    print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"F1-Score (Macro): {f1_macro:.4f}")
    print(f"F1-Score (Weighted): {f1_weighted:.4f}")
    
    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Confusion Matrix
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
    axes[0].set_title('Confusion Matrix', fontsize=14, fontweight='bold')
    axes[0].set_xlabel('Predicted Label')
    axes[0].set_ylabel('True Label')
    
    # Performance by difficulty
    difficulty_acc = df.groupby('difficulty')['is_correct'].mean()
    difficulty_order = ['Easy', 'Moderate', 'Hard', 'Expert']
    difficulty_acc = difficulty_acc.reindex(difficulty_order)
    
    axes[1].bar(difficulty_acc.index, difficulty_acc.values, color=['#2ecc71', '#f39c12', '#e74c3c', '#c0392b'])
    axes[1].set_title('Accuracy by Difficulty Level', fontsize=14, fontweight='bold')
    axes[1].set_xlabel('Difficulty')
    axes[1].set_ylabel('Accuracy')
    axes[1].set_ylim([0, 1])
    axes[1].axhline(y=accuracy, color='red', linestyle='--', label=f'Overall: {accuracy:.3f}')
    axes[1].legend()
    
    # Add percentage labels on bars
    for i, v in enumerate(difficulty_acc.values):
        axes[1].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    return accuracy, f1_macro

accuracy, f1 = calculate_basic_metrics(medical_data)

---
## Practice 3: Hallucination Detection

### 🎯 Learning Objectives
- Identify low-confidence predictions
- Detect potential hallucinations
- Set confidence thresholds

### 📖 Key Concepts
**Hallucination Detection:** Models should know when they don't know!
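
"Knowing when you don't know" can be framed as selective prediction: the model abstains below a confidence threshold, and we track how often it answers (coverage) and how accurate it is when it does. The sketch below is one simple way to compute both. The helper name `selective_accuracy` and the demo values are assumptions for illustration, not real model output.

```python
import numpy as np


def selective_accuracy(confidence, is_correct, threshold):
    """Return (coverage, accuracy on answered items) when abstaining below threshold."""
    confidence = np.asarray(confidence, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    answered = confidence >= threshold
    coverage = answered.mean()
    # Accuracy is undefined if the model abstains on every item.
    accuracy = is_correct[answered].mean() if answered.any() else float("nan")
    return coverage, accuracy


# Illustrative values only.
demo_confidence = [0.95, 0.85, 0.60, 0.40, 0.90]
demo_correct = [True, True, False, False, False]

coverage, acc_answered = selective_accuracy(demo_confidence, demo_correct, threshold=0.7)
print(f"Coverage: {coverage:.2f}, accuracy when answering: {acc_answered:.2f}")
```

Raising the threshold usually trades coverage for accuracy on the answered items, so the two numbers are best reported together.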

In [None]:
# 3.1 Implement hallucination detection
def detect_hallucinations(df, confidence_threshold=0.7):
    """
    Detect potential hallucinations based on confidence scores
    """
    # Flag incorrect predictions whose confidence falls below the threshold
    df['potential_hallucination'] = (df['confidence'] < confidence_threshold) & (~df['is_correct'])
    
    hallucination_rate = df['potential_hallucination'].mean()
    
    print("🔍 Hallucination Detection Analysis")
    print("=" * 60)
    print(f"Confidence Threshold: {confidence_threshold}")
    print(f"Potential Hallucinations Detected: {df['potential_hallucination'].sum()}")
    print(f"Hallucination Rate: {hallucination_rate*100:.2f}%")
    
    # Analyze by confidence bins (include_lowest keeps a confidence of exactly 0)
    df['confidence_bin'] = pd.cut(df['confidence'], bins=[0, 0.3, 0.5, 0.7, 0.9, 1.0],
                                   labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'],
                                   include_lowest=True)
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Confidence distribution by correctness
    axes[0].hist(df[df['is_correct']]['confidence'], bins=20, alpha=0.6, label='Correct', color='green')
    axes[0].hist(df[~df['is_correct']]['confidence'], bins=20, alpha=0.6, label='Incorrect', color='red')
    axes[0].axvline(x=confidence_threshold, color='black', linestyle='--', linewidth=2, label=f'Threshold={confidence_threshold}')
    axes[0].set_xlabel('Confidence Score')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Confidence Distribution by Correctness', fontsize=14, fontweight='bold')
    axes[0].legend()
    
    # Accuracy by confidence bin
    confidence_acc = df.groupby('confidence_bin', observed=True)['is_correct'].agg(['mean', 'count'])
    
    bars = axes[1].bar(range(len(confidence_acc)), confidence_acc['mean'], 
                       color=['#e74c3c', '#e67e22', '#f39c12', '#2ecc71', '#27ae60'])
    axes[1].set_xticks(range(len(confidence_acc)))
    axes[1].set_xticklabels(confidence_acc.index, rotation=45)
    axes[1].set_xlabel('Confidence Bin')
    axes[1].set_ylabel('Accuracy')
    axes[1].set_title('Accuracy by Confidence Level', fontsize=14, fontweight='bold')
    axes[1].set_ylim([0, 1])
    
    # Add count labels
    for i, (acc, cnt) in enumerate(zip(confidence_acc['mean'], confidence_acc['count'])):
        axes[1].text(i, acc + 0.02, f'{acc*100:.1f}%\n(n={cnt})', ha='center', fontsize=9)
    
    plt.tight_layout()
    plt.show()
    
    return hallucination_rate

hallucination_rate = detect_hallucinations(medical_data, confidence_threshold=0.7)

---
## Practice 4: Factuality Scoring

### 🎯 Learning Objectives
- Implement evidence-based scoring
- Weight predictions by confidence and correctness
- Create a factuality score pyramid

### 📖 Key Concepts
**Evidence Levels:** Peer-reviewed research > Clinical guidelines > Medical textbooks > Expert opinion
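
The hierarchy above can be encoded as an ordinal ranking. The sketch below is a minimal illustration: the numeric ranks, the source-type keys, and the `strongest_evidence` helper are assumptions chosen only to preserve the stated ordering, not an established scoring standard.

```python
# Ordinal ranks that preserve the ordering stated above (values are assumptions).
EVIDENCE_RANK = {
    "peer_reviewed_research": 4,
    "clinical_guideline": 3,
    "medical_textbook": 2,
    "expert_opinion": 1,
}


def strongest_evidence(sources):
    """Return the highest-ranked recognized source type, or None if there is none."""
    known = [s for s in sources if s in EVIDENCE_RANK]
    return max(known, key=EVIDENCE_RANK.get) if known else None


supporting_sources = ["expert_opinion", "clinical_guideline"]
print(strongest_evidence(supporting_sources))
```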

In [None]:
# 4.1 Calculate factuality scores
def calculate_factuality_score(df):
    """
    Calculate factuality score based on correctness and confidence
    Score: 1 (high-confidence error) → 10 (high-confidence correct answer)
    """
    def assign_factuality(row):
        if row['is_correct']:
            # Correct predictions
            if row['confidence'] >= 0.9:
                return 10  # High confidence, correct
            elif row['confidence'] >= 0.7:
                return 8   # Medium-high confidence, correct
            elif row['confidence'] >= 0.5:
                return 6   # Medium confidence, correct
            else:
                return 4   # Low confidence, but correct
        else:
            # Incorrect predictions
            if row['confidence'] >= 0.7:
                return 1   # High confidence, wrong - dangerous!
            else:
                return 2   # Low confidence, wrong
    
    df['factuality_score'] = df.apply(assign_factuality, axis=1)
    
    # Calculate statistics
    mean_factuality = df['factuality_score'].mean()
    
    print("📊 Factuality Scoring Analysis")
    print("=" * 60)
    print(f"Mean Factuality Score: {mean_factuality:.2f} / 10")
    print(f"\nScore Distribution:")
    print(df['factuality_score'].value_counts().sort_index())
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Score distribution
    score_counts = df['factuality_score'].value_counts().sort_index()
    colors = ['#e74c3c' if s <= 3 else '#f39c12' if s <= 6 else '#2ecc71' for s in score_counts.index]
    
    axes[0].bar(score_counts.index, score_counts.values, color=colors, edgecolor='black')
    axes[0].set_xlabel('Factuality Score')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Factuality Score Distribution', fontsize=14, fontweight='bold')
    axes[0].axvline(x=mean_factuality, color='red', linestyle='--', linewidth=2, label=f'Mean={mean_factuality:.1f}')
    axes[0].legend()
    
    # Score by specialty
    specialty_scores = df.groupby('specialty')['factuality_score'].mean().sort_values()
    axes[1].barh(range(len(specialty_scores)), specialty_scores.values, 
                 color=sns.color_palette('RdYlGn', len(specialty_scores)))
    axes[1].set_yticks(range(len(specialty_scores)))
    axes[1].set_yticklabels(specialty_scores.index)
    axes[1].set_xlabel('Mean Factuality Score')
    axes[1].set_title('Factuality by Medical Specialty', fontsize=14, fontweight='bold')
    axes[1].axvline(x=mean_factuality, color='red', linestyle='--', linewidth=2, alpha=0.7)
    
    plt.tight_layout()
    plt.show()
    
    return mean_factuality

factuality = calculate_factuality_score(medical_data)

---
## Practice 5: Consistency Measures

### 🎯 Learning Objectives
- Test response consistency across multiple runs
- Measure paraphrase robustness
- Calculate agreement rates

### 📖 Key Concepts
**Consistency Target:** >90% consistency for critical clinical decisions
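
One simple agreement measure is the share of repeated answers that match the most common answer for a question. The sketch below checks that share against the 90% target. The helper names and the demo answer lists are assumptions for illustration.

```python
from collections import Counter


def majority_agreement(answers):
    """Fraction of repeated answers that match the most common answer."""
    counts = Counter(answers)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(answers)


def meets_consistency_target(answers, target=0.9):
    """True when agreement with the majority answer exceeds the target."""
    return majority_agreement(answers) > target


# Illustrative repeated answers to the same question (option indices 0-3).
unstable_runs = [2, 2, 2, 2, 1]
stable_runs = [3] * 10

print(majority_agreement(unstable_runs), meets_consistency_target(unstable_runs))
print(majority_agreement(stable_runs), meets_consistency_target(stable_runs))
```

With only five repeats, a single deviating answer already drops agreement to 80%, so a >90% target needs either perfect agreement or more repeats per question.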

In [None]:
# 5.1 Simulate consistency testing
def test_consistency(df, n_repeats=5):
    """
    Simulate repeated model queries to test consistency
    """
    np.random.seed(42)
    
    # Simulate multiple runs for a subset of questions (at most 50)
    sample_size = min(50, len(df))
    sample_df = df.sample(n=sample_size, random_state=42)
    
    consistency_results = []
    
    for idx, row in sample_df.iterrows():
        # Simulate repeated predictions with some variation
        base_pred = row['predicted_answer']
        base_conf = row['confidence']
        
        # Higher confidence = more consistent
        consistency_prob = 0.5 + (base_conf * 0.4)  # 50-90% consistency range
        
        repeated_preds = [base_pred]
        for _ in range(n_repeats - 1):
            if np.random.random() < consistency_prob:
                repeated_preds.append(base_pred)
            else:
                # Random different answer
                other_answers = [a for a in [0, 1, 2, 3] if a != base_pred]
                repeated_preds.append(np.random.choice(other_answers))
        
        # Calculate consistency for this question (the base run itself counts as a match)
        consistency = (np.array(repeated_preds) == base_pred).mean()
        consistency_results.append({
            'question_id': row['question_id'],
            'confidence': base_conf,
            'consistency': consistency,
            'is_correct': row['is_correct']
        })
    
    consistency_df = pd.DataFrame(consistency_results)
    mean_consistency = consistency_df['consistency'].mean()
    
    print("🔄 Consistency Testing Results")
    print("=" * 60)
    print(f"Number of questions tested: {sample_size}")
    print(f"Repeats per question: {n_repeats}")
    print(f"Mean consistency: {mean_consistency*100:.2f}%")
    print(f"Target: >90% for critical decisions")
    print(f"\nConsistency by correctness:")
    print(consistency_df.groupby('is_correct')['consistency'].agg(['mean', 'std', 'min', 'max']))
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Consistency histogram
    axes[0].hist(consistency_df['consistency'], bins=15, edgecolor='black', alpha=0.7, color='skyblue')
    axes[0].axvline(x=0.9, color='red', linestyle='--', linewidth=2, label='Target (90%)')
    axes[0].axvline(x=mean_consistency, color='green', linestyle='-', linewidth=2, label=f'Mean ({mean_consistency*100:.1f}%)')
    axes[0].set_xlabel('Consistency Score')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Consistency Score Distribution', fontsize=14, fontweight='bold')
    axes[0].legend()
    
    # Confidence vs Consistency scatter
    colors = ['green' if c else 'red' for c in consistency_df['is_correct']]
    axes[1].scatter(consistency_df['confidence'], consistency_df['consistency'], 
                    c=colors, alpha=0.6, s=50, edgecolors='black')
    axes[1].set_xlabel('Confidence')
    axes[1].set_ylabel('Consistency')
    axes[1].set_title('Confidence vs Consistency', fontsize=14, fontweight='bold')
    axes[1].axhline(y=0.9, color='red', linestyle='--', alpha=0.5)
    axes[1].grid(True, alpha=0.3)
    
    # Add legend
    from matplotlib.patches import Patch
    legend_elements = [Patch(facecolor='green', alpha=0.6, label='Correct'),
                       Patch(facecolor='red', alpha=0.6, label='Incorrect')]
    axes[1].legend(handles=legend_elements)
    
    plt.tight_layout()
    plt.show()
    
    return mean_consistency

consistency_score = test_consistency(medical_data, n_repeats=5)

---
## Practice 6: Uncertainty Quantification

### 🎯 Learning Objectives
- Distinguish between aleatoric and epistemic uncertainty
- Set clinical confidence thresholds
- Implement "I don't know" responses

### 📖 Key Concepts
**Uncertainty Types:** Aleatoric (data randomness) vs Epistemic (model knowledge gaps)
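
A single confidence score cannot separate the two types. One common approach, sketched here under the assumption that several predictive distributions are available per question (for example from an ensemble or repeated stochastic passes), splits the entropy of the averaged prediction into an aleatoric part (average per-member entropy) and an epistemic part (the remainder, which reflects disagreement between members). The function names and probability vectors are illustrative.

```python
import numpy as np


def entropy(p, axis=-1):
    """Shannon entropy in nats; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)


def decompose_uncertainty(member_probs):
    """member_probs has shape (n_members, n_classes); each row sums to 1."""
    member_probs = np.asarray(member_probs, dtype=float)
    total = entropy(member_probs.mean(axis=0))        # entropy of the averaged prediction
    aleatoric = entropy(member_probs, axis=-1).mean()  # average per-member entropy
    epistemic = total - aleatoric                      # disagreement between members
    return total, aleatoric, epistemic


# Members agree that the question is ambiguous: uncertainty is mostly aleatoric.
agree_unsure = [[0.25, 0.25, 0.25, 0.25]] * 3
# Members are individually confident but disagree: uncertainty is mostly epistemic.
confident_disagree = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
]

total_a, aleatoric_a, epistemic_a = decompose_uncertainty(agree_unsure)
total_b, aleatoric_b, epistemic_b = decompose_uncertainty(confident_disagree)
print(f"Agree but unsure:     aleatoric={aleatoric_a:.3f}, epistemic={epistemic_a:.3f}")
print(f"Confident, disagree:  aleatoric={aleatoric_b:.3f}, epistemic={epistemic_b:.3f}")
```

High epistemic uncertainty suggests a knowledge gap that more data or expert review may resolve, while high aleatoric uncertainty suggests the question itself is ambiguous.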

In [None]:
# 6.1 Implement uncertainty quantification
def quantify_uncertainty(df):
    """
    Categorize predictions into confidence levels and recommend actions
    """
    def categorize_confidence(conf):
        if conf >= 0.9:
            return 'High (>90%)', 'Proceed with recommendation'
        elif conf >= 0.7:
            return 'Medium (70-90%)', 'Flag for review'
        elif conf >= 0.5:
            return 'Low (50-70%)', 'Require expert consultation'
        else:
            return 'Very Low (<50%)', 'Decline to answer'
    
    df[['confidence_level', 'action']] = df['confidence'].apply(
        lambda x: pd.Series(categorize_confidence(x))
    )
    
    # Calculate statistics
    level_counts = df['confidence_level'].value_counts()
    
    print("⚖️ Uncertainty Quantification Results")
    print("=" * 60)
    print("\nDistribution by Confidence Level:")
    print(level_counts)
    print("\nAction Recommendations:")
    print(df.groupby('confidence_level')['action'].first())
    
    # Calculate accuracy by confidence level
    level_accuracy = df.groupby('confidence_level')['is_correct'].mean()
    print("\nAccuracy by Confidence Level:")
    print(level_accuracy)
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Distribution by level
    level_order = ['Very Low (<50%)', 'Low (50-70%)', 'Medium (70-90%)', 'High (>90%)']
    level_counts_ordered = level_counts.reindex(level_order, fill_value=0)
    colors = ['#e74c3c', '#e67e22', '#f39c12', '#2ecc71']
    
    axes[0, 0].bar(range(len(level_counts_ordered)), level_counts_ordered.values, color=colors)
    axes[0, 0].set_xticks(range(len(level_counts_ordered)))
    axes[0, 0].set_xticklabels(['Very Low', 'Low', 'Medium', 'High'], rotation=45)
    axes[0, 0].set_ylabel('Count')
    axes[0, 0].set_title('Distribution by Confidence Level', fontsize=13, fontweight='bold')
    
    # Accuracy by level
    level_acc_ordered = level_accuracy.reindex(level_order, fill_value=0)
    axes[0, 1].bar(range(len(level_acc_ordered)), level_acc_ordered.values, color=colors)
    axes[0, 1].set_xticks(range(len(level_acc_ordered)))
    axes[0, 1].set_xticklabels(['Very Low', 'Low', 'Medium', 'High'], rotation=45)
    axes[0, 1].set_ylabel('Accuracy')
    axes[0, 1].set_ylim([0, 1])
    axes[0, 1].set_title('Accuracy by Confidence Level', fontsize=13, fontweight='bold')
    axes[0, 1].axhline(y=0.9, color='red', linestyle='--', alpha=0.5, label='90% target')
    axes[0, 1].legend()
    
    # Confidence distribution with thresholds
    axes[1, 0].hist(df['confidence'], bins=30, edgecolor='black', alpha=0.7, color='steelblue')
    axes[1, 0].axvline(x=0.9, color='green', linestyle='--', linewidth=2, label='High (90%)')
    axes[1, 0].axvline(x=0.7, color='orange', linestyle='--', linewidth=2, label='Medium (70%)')
    axes[1, 0].axvline(x=0.5, color='red', linestyle='--', linewidth=2, label='Low (50%)')
    axes[1, 0].set_xlabel('Confidence Score')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_title('Confidence Distribution with Thresholds', fontsize=13, fontweight='bold')
    axes[1, 0].legend()
    
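    # Action recommendation pie chart (wedge colors follow the confidence-level order)
    action_order = ['Decline to answer', 'Require expert consultation',
                    'Flag for review', 'Proceed with recommendation']
    action_colors = dict(zip(action_order, colors))
    action_counts = df['action'].value_counts()
    axes[1, 1].pie(action_counts.values, labels=action_counts.index, autopct='%1.1f%%',
                   colors=[action_colors[a] for a in action_counts.index], startangle=90)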
    axes[1, 1].set_title('Recommended Actions Distribution', fontsize=13, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    return df

medical_data_with_uncertainty = quantify_uncertainty(medical_data)

---
## Practice 7: Calibration Metrics

### 🎯 Learning Objectives
- Calculate Expected Calibration Error (ECE)
- Create reliability diagrams
- Understand model calibration

### 📖 Key Concepts
**Well-Calibrated:** When model says 80% confident, it's correct 80% of the time
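
ECE makes this definition measurable: group predictions into confidence bins, then take the sample-weighted average of the gap between each bin's accuracy and its mean confidence. The sketch below is a minimal standalone version with equal-width bins. The function name and the ten-answer demo are assumptions for illustration.

```python
import numpy as np


def expected_calibration_error(confidence, is_correct, n_bins=10):
    """Sample-weighted mean gap between accuracy and confidence over equal-width bins."""
    confidence = np.asarray(confidence, dtype=float)
    is_correct = np.asarray(is_correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in 0..n_bins-1.
    bin_ids = np.digitize(confidence, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(is_correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece


# Illustrative case: ten answers, all given with 0.8 confidence.
conf_all_80 = np.full(10, 0.8)
well_calibrated = np.array([1] * 8 + [0] * 2)  # 8 of 10 correct
overconfident = np.array([1] * 5 + [0] * 5)    # 5 of 10 correct

ece_well = expected_calibration_error(conf_all_80, well_calibrated)
ece_over = expected_calibration_error(conf_all_80, overconfident)
print(f"ECE, well calibrated: {ece_well:.3f}")
print(f"ECE, overconfident:   {ece_over:.3f}")
```

In the overconfident case the model claims 80% but is right only half the time, so the ECE is roughly the 0.3 gap between the two.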

In [None]:
# 7.1 Calculate calibration metrics
def calculate_calibration(df, n_bins=10):
    """
    Calculate Expected Calibration Error (ECE) and create reliability diagram
    """
    # Prepare data for binary classification (correct vs incorrect)
    y_true = df['is_correct'].astype(int)
    y_prob = df['confidence']
    
    # Calculate calibration curve
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins, strategy='uniform')
    
    # Calculate ECE
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(y_prob, bin_edges[1:-1])
    
    ece = 0
    for i in range(n_bins):
        mask = bin_indices == i
        if mask.sum() > 0:
            avg_confidence = y_prob[mask].mean()
            avg_accuracy = y_true[mask].mean()
            ece += (mask.sum() / len(y_prob)) * abs(avg_confidence - avg_accuracy)
    
    print("🎯 Calibration Analysis")
    print("=" * 60)
    print(f"Expected Calibration Error (ECE): {ece:.4f}")
    print(f"Lower is better (0 = perfect calibration)")
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Reliability diagram
    axes[0].plot([0, 1], [0, 1], 'k--', label='Perfect Calibration', linewidth=2)
    axes[0].plot(prob_pred, prob_true, 's-', label='Model', linewidth=2, markersize=8)
    axes[0].set_xlabel('Predicted Probability (Confidence)')
    axes[0].set_ylabel('True Probability (Accuracy)')
    axes[0].set_title('Reliability Diagram', fontsize=14, fontweight='bold')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    axes[0].set_xlim([0, 1])
    axes[0].set_ylim([0, 1])
    
    # Confidence histogram
    axes[1].hist(df['confidence'], bins=20, edgecolor='black', alpha=0.7, color='lightcoral')
    axes[1].set_xlabel('Confidence Score')
    axes[1].set_ylabel('Frequency')
    axes[1].set_title('Confidence Score Distribution', fontsize=14, fontweight='bold')
    axes[1].axvline(x=df['confidence'].mean(), color='red', linestyle='--', 
                    linewidth=2, label=f'Mean={df["confidence"].mean():.3f}')
    axes[1].legend()
    
    plt.tight_layout()
    plt.show()
    
    return ece

ece = calculate_calibration(medical_data)

---
## Practice 8: Clinical Relevance Scoring

### 🎯 Learning Objectives
- Score predictions based on clinical impact
- Weight by urgency and patient outcomes
- Identify high-risk errors

### 📖 Key Concepts
**Clinical Relevance:** Would this information improve patient care and outcomes?

In [None]:
# 8.1 Calculate clinical relevance scores
def assess_clinical_relevance(df):
    """
    Assess clinical relevance of predictions
    Score: 1 (harmful) → 10 (highly valuable)
    """
    def calculate_relevance(row):
        # Harder questions are more valuable, but only when answered correctly
        difficulty_weights = {'Easy': 1.0, 'Moderate': 1.1, 'Hard': 1.2, 'Expert': 1.3}
        weight = difficulty_weights.get(row['difficulty'], 1.0)
        
        if row['is_correct']:
            # 7-10 for correct answers, scaled by difficulty and capped at 10
            base_score = 7 + (row['confidence'] * 3)
            return min(base_score * weight, 10.0)
        
        # Errors are not weighted, so a hard question never makes a wrong answer look better
        if row['confidence'] >= 0.7:
            return 1.0  # High confidence but wrong = harmful
        return 3.0      # Low confidence, wrong = not helpful
    
    df['clinical_relevance'] = df.apply(calculate_relevance, axis=1)
    
    # Categorize relevance
    def categorize_relevance(score):
        if score <= 2:
            return 'Harmful'
        elif score <= 4:
            return 'Not Helpful'
        elif score <= 7:
            return 'Somewhat Useful'
        else:
            return 'Highly Valuable'
    
    df['relevance_category'] = df['clinical_relevance'].apply(categorize_relevance)
    
    mean_relevance = df['clinical_relevance'].mean()
    
    print("💊 Clinical Relevance Assessment")
    print("=" * 60)
    print(f"Mean Clinical Relevance Score: {mean_relevance:.2f} / 10")
    print("\nDistribution by Category:")
    print(df['relevance_category'].value_counts())
    print("\nScore by Specialty:")
    print(df.groupby('specialty')['clinical_relevance'].mean().sort_values(ascending=False))
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Score distribution
    axes[0, 0].hist(df['clinical_relevance'], bins=20, edgecolor='black', alpha=0.7, color='mediumseagreen')
    axes[0, 0].axvline(x=mean_relevance, color='red', linestyle='--', linewidth=2, 
                       label=f'Mean={mean_relevance:.2f}')
    axes[0, 0].set_xlabel('Clinical Relevance Score')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].set_title('Clinical Relevance Score Distribution', fontsize=13, fontweight='bold')
    axes[0, 0].legend()
    
    # Category distribution
    category_counts = df['relevance_category'].value_counts()
    category_order = ['Harmful', 'Not Helpful', 'Somewhat Useful', 'Highly Valuable']
    category_counts = category_counts.reindex(category_order, fill_value=0)
    colors_cat = ['#e74c3c', '#e67e22', '#f39c12', '#2ecc71']
    
    axes[0, 1].bar(range(len(category_counts)), category_counts.values, color=colors_cat)
    axes[0, 1].set_xticks(range(len(category_counts)))
    axes[0, 1].set_xticklabels(category_order, rotation=45, ha='right')
    axes[0, 1].set_ylabel('Count')
    axes[0, 1].set_title('Relevance Category Distribution', fontsize=13, fontweight='bold')
    
    # Score by specialty
    specialty_scores = df.groupby('specialty')['clinical_relevance'].mean().sort_values()
    axes[1, 0].barh(range(len(specialty_scores)), specialty_scores.values, 
                    color=sns.color_palette('viridis', len(specialty_scores)))
    axes[1, 0].set_yticks(range(len(specialty_scores)))
    axes[1, 0].set_yticklabels(specialty_scores.index)
    axes[1, 0].set_xlabel('Mean Clinical Relevance')
    axes[1, 0].set_title('Clinical Relevance by Specialty', fontsize=13, fontweight='bold')
    axes[1, 0].axvline(x=mean_relevance, color='red', linestyle='--', alpha=0.7)
    
    # Score by difficulty
    difficulty_scores = df.groupby('difficulty')['clinical_relevance'].mean()
    diff_order = ['Easy', 'Moderate', 'Hard', 'Expert']
    difficulty_scores = difficulty_scores.reindex(diff_order)
    
    axes[1, 1].bar(range(len(difficulty_scores)), difficulty_scores.values, 
                   color=['#3498db', '#9b59b6', '#e74c3c', '#c0392b'])
    axes[1, 1].set_xticks(range(len(difficulty_scores)))
    axes[1, 1].set_xticklabels(diff_order)
    axes[1, 1].set_ylabel('Mean Clinical Relevance')
    axes[1, 1].set_title('Clinical Relevance by Difficulty', fontsize=13, fontweight='bold')
    axes[1, 1].axhline(y=mean_relevance, color='red', linestyle='--', alpha=0.7)
    
    plt.tight_layout()
    plt.show()
    
    return mean_relevance

clinical_relevance = assess_clinical_relevance(medical_data)

---
## Practice 9: Safety Assessment

### 🎯 Learning Objectives
- Identify safety-critical errors
- Risk stratification (Critical, High, Medium, Low)
- Flag dangerous predictions

### 📖 Key Concepts
**Safety-First:** Zero tolerance for critical safety issues
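
A zero-tolerance policy can be expressed as a simple gate that fails as soon as one critical case appears. The sketch below assumes risk labels such as `'CRITICAL'`, `'HIGH'`, `'MEDIUM'`, and `'LOW'`. The `safety_gate` helper and its `max_critical` parameter are hypothetical names for illustration.

```python
def safety_gate(risk_levels, max_critical=0):
    """Return (passed, n_critical); the default allows no critical cases at all."""
    n_critical = sum(1 for level in risk_levels if level == "CRITICAL")
    return n_critical <= max_critical, n_critical


# Illustrative risk labels only.
demo_risks = ["LOW", "LOW", "HIGH", "CRITICAL", "LOW"]
passed, n_critical = safety_gate(demo_risks)
print(f"Gate passed: {passed} (critical cases: {n_critical})")
```

Gating on a count rather than a rate means a single critical case blocks the model regardless of how large the evaluation set is.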

In [None]:
# 9.1 Implement safety assessment
def assess_safety(df):
    """
    Assess safety risk of predictions
    """
    def calculate_risk_level(row):
        if not row['is_correct']:
            # Incorrect predictions
            if row['confidence'] >= 0.9:
                return 'CRITICAL'  # High confidence but wrong
            elif row['confidence'] >= 0.7:
                return 'HIGH'      # Medium-high confidence but wrong
            elif row['confidence'] >= 0.5:
                return 'MEDIUM'    # Medium confidence but wrong
            else:
                return 'LOW'       # Low confidence, wrong (safer)
        else:
            # Correct predictions are generally safe
            return 'LOW'
    
    df['safety_risk'] = df.apply(calculate_risk_level, axis=1)
    
    # Calculate statistics
    risk_counts = df['safety_risk'].value_counts()
    critical_rate = (df['safety_risk'] == 'CRITICAL').mean()
    
    print("⚠️ Safety Assessment Results")
    print("=" * 60)
    print(f"Critical Safety Issues: {(df['safety_risk'] == 'CRITICAL').sum()}")
    print(f"Critical Rate: {critical_rate*100:.2f}%")
    print("\nRisk Level Distribution:")
    print(risk_counts)
    
    # Flag critical cases
    critical_cases = df[df['safety_risk'] == 'CRITICAL']
    if len(critical_cases) > 0:
        print("\n🚨 CRITICAL SAFETY ALERTS:")
        print(critical_cases[['question_id', 'confidence', 'difficulty', 'specialty']].head())
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Risk pyramid
    risk_order = ['CRITICAL', 'HIGH', 'MEDIUM', 'LOW']
    risk_counts_ordered = risk_counts.reindex(risk_order, fill_value=0)
    colors_risk = ['#c0392b', '#e74c3c', '#f39c12', '#2ecc71']
    
    y_pos = np.arange(len(risk_order))
    axes[0].barh(y_pos, risk_counts_ordered.values, color=colors_risk)
    axes[0].set_yticks(y_pos)
    axes[0].set_yticklabels(risk_order)
    axes[0].set_xlabel('Count')
    axes[0].set_title('Safety Risk Distribution (Pyramid)', fontsize=14, fontweight='bold')
    axes[0].invert_yaxis()
    
    # Add counts on bars
    for i, v in enumerate(risk_counts_ordered.values):
        axes[0].text(v + 1, i, str(int(v)), va='center', fontweight='bold')
    
    # Risk by specialty
    specialty_risk = df[df['safety_risk'].isin(['CRITICAL', 'HIGH'])].groupby('specialty').size()
    if len(specialty_risk) > 0:
        specialty_risk = specialty_risk.sort_values(ascending=True)
        axes[1].barh(range(len(specialty_risk)), specialty_risk.values, 
                     color=sns.color_palette('Reds_r', len(specialty_risk)))
        axes[1].set_yticks(range(len(specialty_risk)))
        axes[1].set_yticklabels(specialty_risk.index)
        axes[1].set_xlabel('Number of High-Risk Cases')
        axes[1].set_title('High-Risk Cases by Specialty', fontsize=14, fontweight='bold')
    else:
        axes[1].text(0.5, 0.5, 'No high-risk cases', ha='center', va='center', 
                     transform=axes[1].transAxes, fontsize=14)
        axes[1].set_title('High-Risk Cases by Specialty', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    return critical_rate

critical_rate = assess_safety(medical_data)

---
## Practice 10: Performance Dashboard Creation

### 🎯 Learning Objectives
- Create comprehensive evaluation dashboard
- Aggregate all metrics
- Generate executive summary

### 📖 Key Concepts
**Dashboard Features:** Real-time monitoring, alert system, trend analysis

In [None]:
# 10.1 Create comprehensive performance dashboard
def create_performance_dashboard(df):
    """
    Create a comprehensive performance dashboard
    """
    print("\n" + "="*70)
    print(" " * 15 + "📊 MEDICAL MODEL EVALUATION DASHBOARD")
    print("="*70)
    
    # Key Performance Indicators
    accuracy = df['is_correct'].mean()
    mean_confidence = df['confidence'].mean()
    mean_factuality = df['factuality_score'].mean()
    mean_relevance = df['clinical_relevance'].mean()
    critical_safety = (df['safety_risk'] == 'CRITICAL').sum()
    hallucination_count = df['potential_hallucination'].sum()
    
    print("\n📈 KEY PERFORMANCE INDICATORS")
    print("-" * 70)
    print(f"  Overall Accuracy:          {accuracy*100:>6.2f}% ")
    print(f"  Mean Confidence:           {mean_confidence:>6.3f} ")
    print(f"  Factuality Score:          {mean_factuality:>6.2f} / 10")
    print(f"  Clinical Relevance:        {mean_relevance:>6.2f} / 10")
    print(f"  Critical Safety Issues:    {critical_safety:>6d} ")
    print(f"  Potential Hallucinations:  {hallucination_count:>6d} ")
    
    # Performance by Category
    print("\n📊 PERFORMANCE BY CATEGORY")
    print("-" * 70)
    
    # By difficulty
    print("\n  By Difficulty Level:")
    diff_stats = df.groupby('difficulty')['is_correct'].agg(['mean', 'count'])
    for idx, row in diff_stats.iterrows():
        print(f"    {idx:12s}: {row['mean']*100:5.1f}%  (n={int(row['count'])})")
    
    # By specialty
    print("\n  By Medical Specialty:")
    spec_stats = df.groupby('specialty')['is_correct'].agg(['mean', 'count'])
    for idx, row in spec_stats.iterrows():
        print(f"    {idx:20s}: {row['mean']*100:5.1f}%  (n={int(row['count'])})")
    
    # Safety breakdown
    print("\n⚠️  SAFETY RISK BREAKDOWN")
    print("-" * 70)
    risk_counts = df['safety_risk'].value_counts()
    for risk, count in risk_counts.items():
        pct = count / len(df) * 100
        print(f"  {risk:10s}: {int(count):4d} ({pct:5.1f}%)")
    
    # Recommendations
    print("\n💡 RECOMMENDATIONS")
    print("-" * 70)
    
    if accuracy < 0.8:
        print("  ⚠️  Accuracy below 80% - Consider model retraining")
    if critical_safety > 0:
        print(f"  🚨 {critical_safety} critical safety issues detected - Immediate review required")
    if hallucination_count > len(df) * 0.1:
        print("  ⚠️  High hallucination rate - Review confidence calibration")
    if mean_relevance < 7:
        print("  ⚠️  Clinical relevance below target - Improve clinical utility")
    
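    # Report a pass only when the hallucination-rate warning above did not fire
    if (accuracy >= 0.85 and critical_safety == 0 and mean_relevance >= 7
            and hallucination_count <= len(df) * 0.1):
        print("  ✅ Model performance meets quality standards")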
    
    print("\n" + "="*70)
    print(" " * 25 + "End of Report")
    print("="*70 + "\n")
    
    # Create comprehensive visualization
    fig = plt.figure(figsize=(16, 10))
    gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)
    
    # 1. KPI Summary
    ax1 = fig.add_subplot(gs[0, :])
    kpis = ['Accuracy', 'Confidence', 'Factuality', 'Relevance']
    values = [accuracy*100, mean_confidence*100, mean_factuality*10, mean_relevance*10]
    colors_kpi = ['#3498db', '#9b59b6', '#e67e22', '#2ecc71']
    bars = ax1.bar(kpis, values, color=colors_kpi, edgecolor='black', linewidth=1.5)
    ax1.set_ylim([0, 100])
    ax1.set_ylabel('Score (%)')
    ax1.set_title('Key Performance Indicators', fontsize=15, fontweight='bold')
    ax1.axhline(y=80, color='red', linestyle='--', alpha=0.5, label='80% Target')
    ax1.legend()
    for bar, val in zip(bars, values):
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height + 2, f'{val:.1f}%',
                ha='center', va='bottom', fontweight='bold', fontsize=11)
    
    # 2. Accuracy by Difficulty
    ax2 = fig.add_subplot(gs[1, 0])
    diff_acc = df.groupby('difficulty')['is_correct'].mean()
    diff_order = ['Easy', 'Moderate', 'Hard', 'Expert']
    diff_acc = diff_acc.reindex(diff_order)
    ax2.bar(range(len(diff_acc)), diff_acc.values, color=['#2ecc71', '#f39c12', '#e74c3c', '#c0392b'])
    ax2.set_xticks(range(len(diff_acc)))
    ax2.set_xticklabels(diff_order, rotation=45)
    ax2.set_ylabel('Accuracy')
    ax2.set_ylim([0, 1])
    ax2.set_title('Accuracy by Difficulty', fontweight='bold')
    
    # 3. Safety Risk Distribution
    ax3 = fig.add_subplot(gs[1, 1])
    risk_order = ['LOW', 'MEDIUM', 'HIGH', 'CRITICAL']
    risk_counts_ordered = risk_counts.reindex(risk_order, fill_value=0)
    colors_risk_dash = ['#2ecc71', '#f39c12', '#e74c3c', '#c0392b']
    wedges, texts, autotexts = ax3.pie(risk_counts_ordered.values, labels=risk_order, autopct='%1.1f%%',
                                        colors=colors_risk_dash, startangle=90)
    ax3.set_title('Safety Risk Distribution', fontweight='bold')
    
    # 4. Confidence Distribution
    ax4 = fig.add_subplot(gs[1, 2])
    ax4.hist(df['confidence'], bins=20, edgecolor='black', alpha=0.7, color='steelblue')
    ax4.axvline(x=mean_confidence, color='red', linestyle='--', linewidth=2, label=f'Mean={mean_confidence:.3f}')
    ax4.set_xlabel('Confidence')
    ax4.set_ylabel('Frequency')
    ax4.set_title('Confidence Distribution', fontweight='bold')
    ax4.legend()
    
    # 5. Specialty Performance
    ax5 = fig.add_subplot(gs[2, :])
    spec_acc = df.groupby('specialty')['is_correct'].mean().sort_values()
    ax5.barh(range(len(spec_acc)), spec_acc.values, color=sns.color_palette('viridis', len(spec_acc)))
    ax5.set_yticks(range(len(spec_acc)))
    ax5.set_yticklabels(spec_acc.index)
    ax5.set_xlabel('Accuracy')
    ax5.set_xlim([0, 1])
    ax5.set_title('Performance by Medical Specialty', fontsize=15, fontweight='bold')
    ax5.axvline(x=accuracy, color='red', linestyle='--', linewidth=2, alpha=0.7, label=f'Overall={accuracy:.3f}')
    ax5.legend()
    
    plt.suptitle('Medical Model Evaluation Dashboard', fontsize=18, fontweight='bold', y=0.98)
    plt.show()
    
    return {
        'accuracy': accuracy,
        'mean_confidence': mean_confidence,
        'factuality': mean_factuality,
        'relevance': mean_relevance,
        'critical_safety': critical_safety,
        'hallucinations': hallucination_count
    }

dashboard_metrics = create_performance_dashboard(medical_data)
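
Trend analysis needs each run's metrics saved in a comparable form. The dictionary returned above holds values produced by pandas aggregations, which are typically NumPy scalars, and `json.dumps` rejects NumPy integers. The following is a minimal sketch of one way to convert them. The helper name `to_serializable` and the example values are hypothetical; they are not part of the practice code or its results.

```python
import json
import numpy as np

def to_serializable(metrics):
    """Convert NumPy scalars to plain Python numbers so json.dumps accepts them."""
    return {
        key: (value.item() if isinstance(value, np.generic) else value)
        for key, value in metrics.items()
    }

# Hypothetical values standing in for the dict returned by
# create_performance_dashboard (not real results)
example_metrics = {
    'accuracy': np.float64(0.82),
    'critical_safety': np.int64(3),
    'hallucinations': np.int64(7),
}

# One JSON line per evaluation run can be appended to a log for later comparison
snapshot = json.dumps(to_serializable(example_metrics), sort_keys=True)
print(snapshot)
```

In the notebook, the same helper could be applied to `dashboard_metrics` before the result is written to disk.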

---
## 🎯 Practice Complete!

### Summary of What We Learned:

1. **Medical Benchmark Evaluation**: Loading and exploring MedQA-style datasets
2. **Accuracy Metrics**: Beyond simple accuracy - F1, precision, recall
3. **Hallucination Detection**: Identifying unreliable predictions
4. **Factuality Scoring**: Evidence-based assessment
5. **Consistency Measures**: Testing response stability
6. **Uncertainty Quantification**: When to say "I don't know"
7. **Calibration Metrics**: ECE and reliability diagrams
8. **Clinical Relevance**: Impact on patient care
9. **Safety Assessment**: Risk stratification and critical error detection
10. **Performance Dashboards**: Comprehensive monitoring

### Key Insights:
- **Accuracy ≠ Clinical Utility**: High accuracy doesn't guarantee usefulness
- **Safety First**: Zero tolerance for critical errors
- **Calibration Matters**: Confidence should match accuracy
- **Context is Critical**: Medical specialty and difficulty affect performance

### Real-World Applications:
- Medical AI deployment decision-making
- Continuous model monitoring in production
- Regulatory compliance reporting
- Clinical validation studies

### Next Steps:
- Implement bias and fairness testing
- Add robustness evaluation
- Create automated monitoring pipelines
- Generate TRIPOD-compliant reports
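
As a starting point for the bias and fairness item, one simple check is the accuracy gap between subgroups. The sketch below assumes a DataFrame with a binary correctness column like the `is_correct` column used in the dashboard. The function name `accuracy_gap_by_group`, the `patient_sex` column, and the toy data are all hypothetical illustrations; the practice dataset does not contain demographic fields.

```python
import pandas as pd

def accuracy_gap_by_group(df, group_col, correct_col='is_correct', min_count=1):
    """Return per-group accuracy and the gap between the best and worst group."""
    per_group = df.groupby(group_col)[correct_col].agg(['mean', 'count'])
    # Drop groups too small to give a meaningful accuracy estimate
    per_group = per_group[per_group['count'] >= min_count]
    gap = per_group['mean'].max() - per_group['mean'].min()
    return per_group, gap

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'patient_sex': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'is_correct':  [1, 1, 1, 0, 1, 0, 0, 0],
})

group_stats, gap = accuracy_gap_by_group(toy, 'patient_sex')
print(group_stats)
print(f"Accuracy gap: {gap:.2f}")
```

A raw gap like this is only a screening signal. With small groups it should be read alongside the group counts or a confidence interval before drawing conclusions.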

---
## 📚 Additional Resources

### Papers and Guidelines:
- **TRIPOD**: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis
- **STARD**: Standards for Reporting of Diagnostic Accuracy Studies
- **CONSORT-AI**: Reporting guidelines for clinical trials involving AI interventions

### Datasets:
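- **MedQA**: 11,450 USMLE-style questions in the English training and development splits (12,723 including the test split)
- **PubMedQA**: About 273K biomedical research questions, of which 1,000 are expert-annotated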
- **MedMCQA**: 194K questions covering 21 medical subjects
- **MMLU Medical**: 1,089 questions across 6 medical topics

### Tools:
- **Papers with Code**: Medical AI benchmarks leaderboard
- **HuggingFace Evaluate**: Comprehensive evaluation metrics
- **scikit-learn**: Machine learning evaluation tools
- **TorchMetrics**: Deep learning metrics library