# 🏥 Biomedical ML: Essential Practice

## Table of Contents
1. [Handling High-Dimensional Data](#practice-1-handling-high-dimensional-data)
2. [Dealing with Class Imbalance](#practice-2-dealing-with-class-imbalance)
3. [Missing Data Imputation](#practice-3-missing-data-imputation)
4. [Cross-Validation Strategies](#practice-4-cross-validation-strategies)
5. [Performance Metrics for Biomedical Data](#practice-5-performance-metrics-for-biomedical-data)
6. [ROC and PR Curves](#practice-6-roc-and-pr-curves)
7. [Survival Analysis with Kaplan-Meier](#practice-7-survival-analysis-with-kaplan-meier)
8. [Cox Proportional Hazards Model](#practice-8-cox-proportional-hazards-model)
9. [Model Interpretability with SHAP](#practice-9-model-interpretability-with-shap)
10. [Complete Biomedical ML Pipeline](#practice-10-complete-biomedical-ml-pipeline)

## Installing and Importing Essential Libraries

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (confusion_matrix, classification_report, 
                             roc_curve, auc, precision_recall_curve,
                             balanced_accuracy_score, matthews_corrcoef)
from sklearn.impute import SimpleImputer, KNNImputer
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
sns.set_style('whitegrid')

print("✅ All libraries loaded successfully!")

---
## Practice 1: Handling High-Dimensional Data

### 🎯 Learning Objectives
- Understand the curse of dimensionality (P >> N problem)
- Apply feature selection methods
- Visualize data sparsity in high dimensions

### 📖 Key Concepts
**Curse of Dimensionality:** When features (P) greatly exceed samples (N), models tend to overfit, and pairwise distances concentrate so that the nearest and farthest neighbors of a point look almost equally far away.
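
The distance effect can be checked numerically. The sketch below is illustrative only: the helper `relative_contrast` and its parameters are hypothetical names, and it uses only the Python standard library. It measures how much farther the farthest point is than the nearest one, relative to the nearest, for standard-normal data in 2 versus 1000 dimensions.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (max_dist - min_dist) / min_dist from one query point to all others."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(n_points=100, n_dims=2)
high = relative_contrast(n_points=100, n_dims=1000)
print(f"Relative contrast in    2 dims: {low:.2f}")
print(f"Relative contrast in 1000 dims: {high:.2f}")
```

A contrast near zero means the nearest and farthest neighbors are almost indistinguishable, which is why distance-based methods tend to degrade when P >> N.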

In [None]:
# 1.1 Simulate high-dimensional biomedical data
def create_highdim_biomedical_data():
    """Simulate gene expression data with P >> N"""
    np.random.seed(42)
    
    # Simulate: 50 patients, 1000 genes
    n_samples = 50
    n_features = 1000
    
    # Generate expression data
    X = np.random.randn(n_samples, n_features)
    
    # Create disease labels (0: healthy, 1: disease)
    # Make 10 genes truly predictive
    true_genes = [0, 50, 100, 200, 300, 400, 500, 600, 700, 800]
    y = np.zeros(n_samples)
    
    for gene in true_genes:
        X[:, gene] += np.random.randn(n_samples) * 0.5
    
    # Assign labels based on sum of true genes
    gene_sum = X[:, true_genes].sum(axis=1)
    y[gene_sum > np.median(gene_sum)] = 1
    
    print(f"📊 Dataset Shape: {X.shape}")
    print(f"   Samples (N): {n_samples}")
    print(f"   Features (P): {n_features}")
    print(f"   Ratio P/N: {n_features/n_samples:.1f}")
    print(f"\n🎯 Class Distribution:")
    print(f"   Healthy: {np.sum(y==0)} patients")
    print(f"   Disease: {np.sum(y==1)} patients")
    
    return X, y, true_genes

X_highdim, y_highdim, important_genes = create_highdim_biomedical_data()

In [None]:
# 1.2 Feature selection using univariate statistics
from sklearn.feature_selection import SelectKBest, f_classif

def apply_feature_selection(X, y, true_genes, k=50):
    """Select top K features using the ANOVA F-statistic"""
    
    # NOTE: fitting the selector on all samples is acceptable for this demo,
    # but in a real analysis it must be fit inside each CV training fold,
    # otherwise the selected features leak information from the test data.
    selector = SelectKBest(score_func=f_classif, k=k)
    X_selected = selector.fit_transform(X, y)
    
    # Get selected feature indices
    selected_features = selector.get_support(indices=True)
    
    print(f"\n🔍 Feature Selection Results:")
    print(f"   Original features: {X.shape[1]}")
    print(f"   Selected features: {X_selected.shape[1]}")
    print(f"   Reduction: {(1 - X_selected.shape[1]/X.shape[1])*100:.1f}%")
    
    # Check how many true genes were selected
    true_selected = np.intersect1d(selected_features, true_genes)
    print(f"\n✅ True predictive genes found: {len(true_selected)}/{len(true_genes)}")
    
    return X_selected, selected_features

X_selected, selected_idx = apply_feature_selection(X_highdim, y_highdim, important_genes, k=50)

---
## Practice 2: Dealing with Class Imbalance

### 🎯 Learning Objectives
- Understand the accuracy paradox
- Apply SMOTE for oversampling
- Use appropriate metrics for imbalanced data

### 📖 Key Concepts
**Accuracy Paradox:** 99% accuracy is useless if 99% of samples are negative!
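
The arithmetic behind the paradox can be made explicit. This is a minimal sketch with a hypothetical helper, `always_negative_metrics`, describing a classifier that labels every patient as healthy. The cohort of 990 healthy and 10 diseased patients is an assumed example, not data from this notebook.

```python
def always_negative_metrics(n_negative, n_positive):
    """Metrics for a classifier that predicts 'negative' for every sample."""
    tn, fp = n_negative, 0   # every healthy patient is labelled negative
    fn, tp = n_positive, 0   # every diseased patient is missed
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return accuracy, sensitivity, balanced_accuracy

acc, sens, bal_acc = always_negative_metrics(n_negative=990, n_positive=10)
print(f"Accuracy:          {acc:.3f}")
print(f"Sensitivity:       {sens:.3f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
```

Accuracy looks excellent while sensitivity is zero, and balanced accuracy drops to chance level. This is why the notebook reports balanced accuracy and Matthews CC on imbalanced data.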

In [None]:
# 2.1 Create imbalanced dataset (rare disease scenario)
def create_imbalanced_clinical_data():
    """Simulate imbalanced clinical data: 95% healthy, 5% disease"""
    np.random.seed(42)
    
    n_samples = 1000
    n_features = 20
    
    # Generate features
    X = np.random.randn(n_samples, n_features)
    
    # Create highly imbalanced labels
    y = np.zeros(n_samples)
    disease_indices = np.random.choice(n_samples, size=50, replace=False)  # 5% disease
    y[disease_indices] = 1
    
    # Make disease samples distinguishable
    X[disease_indices, :5] += 2.0  # Increase first 5 features for disease
    
    print(f"📊 Imbalanced Dataset:")
    print(f"   Total samples: {n_samples}")
    print(f"   Healthy: {np.sum(y==0)} ({np.sum(y==0)/len(y)*100:.1f}%)")
    print(f"   Disease: {np.sum(y==1)} ({np.sum(y==1)/len(y)*100:.1f}%)")
    print(f"\n⚠️ Imbalance ratio: {np.sum(y==0)/np.sum(y==1):.1f}:1")
    
    return X, y

X_imb, y_imb = create_imbalanced_clinical_data()

In [None]:
# 2.2 Apply SMOTE and compare results
def compare_with_without_smote(X, y):
    """Compare model performance with and without SMOTE"""
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    
    # Model 1: Without SMOTE
    clf1 = LogisticRegression(random_state=42, max_iter=1000)
    clf1.fit(X_train, y_train)
    y_pred1 = clf1.predict(X_test)
    
    # Model 2: With SMOTE
    smote = SMOTE(random_state=42)
    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
    
    clf2 = LogisticRegression(random_state=42, max_iter=1000)
    clf2.fit(X_train_smote, y_train_smote)
    y_pred2 = clf2.predict(X_test)
    
    print("\n" + "="*60)
    print("WITHOUT SMOTE:")
    print("="*60)
    print(classification_report(y_test, y_pred1, target_names=['Healthy', 'Disease']))
    
    print("\n" + "="*60)
    print("WITH SMOTE:")
    print("="*60)
    print(classification_report(y_test, y_pred2, target_names=['Healthy', 'Disease']))
    
    # Compare specific metrics
    print("\n" + "="*60)
    print("COMPARISON:")
    print("="*60)
    print(f"Balanced Accuracy - Without SMOTE: {balanced_accuracy_score(y_test, y_pred1):.3f}")
    print(f"Balanced Accuracy - With SMOTE: {balanced_accuracy_score(y_test, y_pred2):.3f}")
    print(f"\nMatthews CC - Without SMOTE: {matthews_corrcoef(y_test, y_pred1):.3f}")
    print(f"Matthews CC - With SMOTE: {matthews_corrcoef(y_test, y_pred2):.3f}")
    
    return clf1, clf2

model_no_smote, model_with_smote = compare_with_without_smote(X_imb, y_imb)

---
## Practice 3: Missing Data Imputation

### 🎯 Learning Objectives
- Understand missing data mechanisms (MCAR, MAR, MNAR)
- Apply different imputation strategies
- Compare imputation methods

### 📖 Key Concepts
**MCAR:** Missing Completely At Random  
**MAR:** Missing At Random (depends on observed data)  
**MNAR:** Missing Not At Random (depends on unobserved values)
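
The three mechanisms can be told apart in a small simulation. The sketch below is an illustration under assumed numbers: a hypothetical lab value that rises with age, arbitrary 30% and 60% drop rates, and arbitrary cut-offs. It uses only the standard library and compares the mean of the values that remain observed under each mechanism.

```python
import random
import statistics

def observed_means(n=5000, seed=0):
    """Mean of a lab value after deleting entries under MCAR, MAR and MNAR."""
    rng = random.Random(seed)
    age = [rng.gauss(60, 10) for _ in range(n)]
    lab = [0.5 * a + rng.gauss(0, 5) for a in age]  # lab value rises with age

    # MCAR: every value has the same 30% chance of being missing
    mcar = [v for v in lab if rng.random() >= 0.3]
    # MAR: missingness depends on an observed variable (age), not on lab itself
    mar = [v for a, v in zip(age, lab) if not (a > 65 and rng.random() < 0.6)]
    # MNAR: missingness depends on the (unobserved) lab value itself
    mnar = [v for v in lab if not (v > 32 and rng.random() < 0.6)]

    return {
        "complete": statistics.mean(lab),
        "MCAR": statistics.mean(mcar),
        "MAR": statistics.mean(mar),
        "MNAR": statistics.mean(mnar),
    }

means = observed_means()
for name, value in means.items():
    print(f"{name:>8}: mean lab value = {value:.2f}")
```

Under MCAR the observed mean stays close to the complete-data mean. Under MAR the shift can in principle be corrected by conditioning on age. Under MNAR the cause of missingness is itself unobserved, so no observed variable can fully correct the bias.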

In [None]:
# 3.1 Simulate missing data
def create_data_with_missing_values():
    """Create clinical dataset with missing values"""
    np.random.seed(42)
    
    # Generate complete data
    n_samples = 200
    n_features = 10
    
    X_complete = np.random.randn(n_samples, n_features) * 10 + 50
    X_missing = X_complete.copy()
    
    # Introduce MCAR: randomly missing 20% of values
    missing_mask = np.random.rand(n_samples, n_features) < 0.2
    X_missing[missing_mask] = np.nan
    
    missing_count = np.isnan(X_missing).sum()
    total_values = n_samples * n_features
    
    print(f"📊 Missing Data Statistics:")
    print(f"   Total values: {total_values}")
    print(f"   Missing values: {missing_count} ({missing_count/total_values*100:.1f}%)")
    print(f"\n   Missing per feature:")
    for i in range(n_features):
        missing_in_feature = np.isnan(X_missing[:, i]).sum()
        print(f"      Feature {i+1}: {missing_in_feature} ({missing_in_feature/n_samples*100:.1f}%)")
    
    return X_complete, X_missing

X_complete, X_with_missing = create_data_with_missing_values()

In [None]:
# 3.2 Compare imputation methods
def compare_imputation_methods(X_complete, X_missing):
    """Compare mean, median, and KNN imputation"""
    
    # Method 1: Mean imputation
    imputer_mean = SimpleImputer(strategy='mean')
    X_mean = imputer_mean.fit_transform(X_missing)
    
    # Method 2: Median imputation
    imputer_median = SimpleImputer(strategy='median')
    X_median = imputer_median.fit_transform(X_missing)
    
    # Method 3: KNN imputation
    imputer_knn = KNNImputer(n_neighbors=5)
    X_knn = imputer_knn.fit_transform(X_missing)
    
    # Calculate reconstruction error (only for missing values)
    missing_mask = np.isnan(X_missing)
    
    error_mean = np.abs(X_complete[missing_mask] - X_mean[missing_mask]).mean()
    error_median = np.abs(X_complete[missing_mask] - X_median[missing_mask]).mean()
    error_knn = np.abs(X_complete[missing_mask] - X_knn[missing_mask]).mean()
    
    print("\n" + "="*60)
    print("IMPUTATION METHOD COMPARISON")
    print("="*60)
    print(f"\nMean Absolute Error (MAE) on missing values:")
    print(f"  Mean Imputation:   {error_mean:.4f}")
    print(f"  Median Imputation: {error_median:.4f}")
    print(f"  KNN Imputation:    {error_knn:.4f}")
    
    # Determine best method
    errors = {'Mean': error_mean, 'Median': error_median, 'KNN': error_knn}
    best_method = min(errors, key=errors.get)
    print(f"\n✅ Best method: {best_method} imputation")
    
    return X_mean, X_median, X_knn

X_imp_mean, X_imp_median, X_imp_knn = compare_imputation_methods(X_complete, X_with_missing)

---
## Practice 4: Cross-Validation Strategies

### 🎯 Learning Objectives
- Implement different CV strategies
- Understand when to use each strategy
- Apply stratified CV for imbalanced data

### 📖 Key Concepts
**K-Fold CV:** Split data into K folds, train on K-1, test on 1  
**Stratified CV:** Preserve class distribution in each fold  
**Group CV:** Keep samples from same patient together
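
Group CV is not exercised in the cell that follows, so here is a minimal sketch using scikit-learn's `GroupKFold`. The data are synthetic assumptions: 10 hypothetical patients with 3 visits each, random features and labels. The only point is to show that no patient ID appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_patients, visits_per_patient = 10, 3
groups = np.repeat(np.arange(n_patients), visits_per_patient)  # patient ID per row
X = rng.normal(size=(len(groups), 4))
y = rng.integers(0, 2, size=len(groups))

leaks = []
splitter = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(splitter.split(X, y, groups), start=1):
    # Patients that appear in both the training and the test part of this fold
    shared = set(groups[train_idx]) & set(groups[test_idx])
    leaks.append(len(shared))
    print(f"Fold {fold}: test patients = {sorted(set(groups[test_idx]))}, shared = {len(shared)}")
```

With plain K-Fold, repeated measurements from one patient can land in both train and test, which usually inflates the score. Passing `groups` prevents that.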

In [None]:
# 4.1 Compare different CV strategies
from sklearn.model_selection import KFold, StratifiedKFold

def compare_cv_strategies():
    """Compare K-Fold and Stratified K-Fold on imbalanced data"""
    
    # Generate imbalanced data
    np.random.seed(42)
    X = np.random.randn(100, 20)
    y = np.zeros(100)
    y[:10] = 1  # 10% positive class
    
    # Shuffle
    indices = np.random.permutation(100)
    X, y = X[indices], y[indices]
    
    model = LogisticRegression(random_state=42, max_iter=1000)
    
    print("\n" + "="*60)
    print("CROSS-VALIDATION STRATEGY COMPARISON")
    print("="*60)
    
    # Strategy 1: K-Fold CV
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)
    scores_kfold = cross_val_score(model, X, y, cv=kfold, scoring='balanced_accuracy')
    print(f"\n1. K-Fold CV (K=5):")
    print(f"   Scores: {scores_kfold}")
    print(f"   Mean: {scores_kfold.mean():.3f} (+/- {scores_kfold.std():.3f})")
    
    # Strategy 2: Stratified K-Fold CV
    skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores_skfold = cross_val_score(model, X, y, cv=skfold, scoring='balanced_accuracy')
    print(f"\n2. Stratified K-Fold CV (K=5):")
    print(f"   Scores: {scores_skfold}")
    print(f"   Mean: {scores_skfold.mean():.3f} (+/- {scores_skfold.std():.3f})")
    
    # Visualize class distribution in folds
    print(f"\n📊 Class Distribution per Fold:")
    print(f"\n   Regular K-Fold:")
    for i, (train_idx, test_idx) in enumerate(kfold.split(X, y)):
        test_pos = y[test_idx].sum()
        test_total = len(test_idx)
        print(f"      Fold {i+1}: {test_pos:.0f}/{test_total} positive ({test_pos/test_total*100:.1f}%)")
    
    print(f"\n   Stratified K-Fold:")
    for i, (train_idx, test_idx) in enumerate(skfold.split(X, y)):
        test_pos = y[test_idx].sum()
        test_total = len(test_idx)
        print(f"      Fold {i+1}: {test_pos:.0f}/{test_total} positive ({test_pos/test_total*100:.1f}%)")
    
    print(f"\n✅ Stratified CV maintains class balance across folds!")
    
    return scores_kfold, scores_skfold

scores_k, scores_sk = compare_cv_strategies()

---
## Practice 5: Performance Metrics for Biomedical Data

### 🎯 Learning Objectives
- Understand confusion matrix components
- Calculate sensitivity, specificity, PPV, NPV
- Use appropriate metrics for clinical scenarios

### 📖 Key Concepts
**Sensitivity (Recall):** TP / (TP + FN) - Fraction of actual positives that are detected  
**Specificity:** TN / (TN + FP) - Fraction of actual negatives that are correctly identified  
**PPV (Precision):** TP / (TP + FP) - Positive predictive value  
**NPV:** TN / (TN + FN) - Negative predictive value
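
Unlike sensitivity and specificity, the predictive values depend on disease prevalence. The sketch below is a worked example with assumed numbers: a test with 90% sensitivity and 90% specificity, applied at 1% and 50% prevalence. The helper name `predictive_values` is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV from test characteristics and prevalence (Bayes' rule)."""
    tp = sensitivity * prevalence              # diseased and test-positive
    fn = (1 - sensitivity) * prevalence        # diseased but test-negative
    tn = specificity * (1 - prevalence)        # healthy and test-negative
    fp = (1 - specificity) * (1 - prevalence)  # healthy but test-positive
    return tp / (tp + fp), tn / (tn + fn)

ppv_rare, npv_rare = predictive_values(0.9, 0.9, prevalence=0.01)
ppv_common, npv_common = predictive_values(0.9, 0.9, prevalence=0.50)
print(f"Prevalence  1%: PPV = {ppv_rare:.3f}, NPV = {npv_rare:.3f}")
print(f"Prevalence 50%: PPV = {ppv_common:.3f}, NPV = {npv_common:.3f}")
```

At 1% prevalence most positive results are false alarms, even though the test itself has not changed. This matters when a model validated on an enriched cohort is moved to population screening.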

In [None]:
# 5.1 Calculate and visualize confusion matrix
def detailed_performance_metrics():
    """Calculate comprehensive performance metrics"""
    
    # Train a model on imbalanced data
    X_train, X_test, y_train, y_test = train_test_split(
        X_imb, y_imb, test_size=0.3, random_state=42, stratify=y_imb
    )
    
    clf = RandomForestClassifier(random_state=42, n_estimators=100)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = cm.ravel()
    
    print("\n" + "="*60)
    print("CONFUSION MATRIX")
    print("="*60)
    print(f"\n                Predicted")
    print(f"              Negative  Positive")
    print(f"Actual Negative    {tn:4d}      {fp:4d}    (Specificity)")
    print(f"       Positive    {fn:4d}      {tp:4d}    (Sensitivity)")
    print(f"              (NPV)      (PPV)")
    
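    # Calculate metrics manually (guard against empty denominators, e.g. when
    # the model never predicts the positive class)
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0
    ppv = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    npv = tn / (tn + fn) if (tn + fn) > 0 else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    balanced_acc = (sensitivity + specificity) / 2
    f1 = 2 * (ppv * sensitivity) / (ppv + sensitivity) if (ppv + sensitivity) > 0 else 0.0
    mcc = matthews_corrcoef(y_test, y_pred)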
    
    print("\n" + "="*60)
    print("PERFORMANCE METRICS")
    print("="*60)
    print(f"\nBasic Metrics:")
    print(f"  Accuracy:           {accuracy:.3f}")
    print(f"  Balanced Accuracy:  {balanced_acc:.3f}")
    
    print(f"\nClinical Metrics:")
    print(f"  Sensitivity (Recall): {sensitivity:.3f}  [TP/(TP+FN)]")
    print(f"  Specificity:          {specificity:.3f}  [TN/(TN+FP)]")
    print(f"  PPV (Precision):      {ppv:.3f}  [TP/(TP+FP)]")
    print(f"  NPV:                  {npv:.3f}  [TN/(TN+FN)]")
    
    print(f"\nCombined Metrics:")
    print(f"  F1-Score:           {f1:.3f}")
    print(f"  Matthews CC:        {mcc:.3f}")
    
    # Visualize confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['Healthy', 'Disease'],
                yticklabels=['Healthy', 'Disease'])
    plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
    plt.ylabel('Actual Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()
    
    return clf, cm

trained_clf, conf_matrix = detailed_performance_metrics()

---
## Practice 6: ROC and PR Curves

### 🎯 Learning Objectives
- Plot and interpret ROC curves
- Plot and interpret PR curves
- Understand when to use each curve

### 📖 Key Concepts
**ROC Curve:** TPR vs FPR - good for balanced datasets  
**PR Curve:** Precision vs Recall - better for imbalanced data
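
One way to read ROC-AUC is as the probability that a randomly chosen diseased patient receives a higher score than a randomly chosen healthy one, with ties counted as one half. The sketch below computes it that way on a handful of made-up scores. It also shows the PR baseline, which is simply the class prevalence. The function name and the scores are illustrative assumptions.

```python
def roc_auc_by_pairs(pos_scores, neg_scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.4]        # model scores for diseased patients
neg = [0.7, 0.3, 0.2, 0.1]   # model scores for healthy patients
auc_value = roc_auc_by_pairs(pos, neg)
pr_baseline = len(pos) / (len(pos) + len(neg))
print(f"ROC-AUC (pairwise): {auc_value:.3f}")
print(f"PR baseline (prevalence): {pr_baseline:.3f}")
```

Because the ROC baseline is always 0.5 while the PR baseline shrinks with prevalence, a PR curve tends to expose weak performance on rare diseases that a ROC curve can hide.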

In [None]:
# 6.1 Plot ROC and PR curves
def plot_roc_and_pr_curves():
    """Generate and compare ROC and PR curves"""
    
    # Get probability predictions
    X_train, X_test, y_train, y_test = train_test_split(
        X_imb, y_imb, test_size=0.3, random_state=42, stratify=y_imb
    )
    
    clf = RandomForestClassifier(random_state=42, n_estimators=100)
    clf.fit(X_train, y_train)
    y_prob = clf.predict_proba(X_test)[:, 1]
    
    # ROC Curve
    fpr, tpr, thresholds_roc = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    
    # PR Curve
    precision, recall, thresholds_pr = precision_recall_curve(y_test, y_prob)
    pr_auc = auc(recall, precision)
    
    # Create side-by-side plots
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # ROC Curve
    axes[0].plot(fpr, tpr, color='#1E64C8', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
    axes[0].plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--', label='Random')
    axes[0].set_xlim([0.0, 1.0])
    axes[0].set_ylim([0.0, 1.05])
    axes[0].set_xlabel('False Positive Rate (1-Specificity)', fontsize=11)
    axes[0].set_ylabel('True Positive Rate (Sensitivity)', fontsize=11)
    axes[0].set_title('ROC Curve', fontsize=13, fontweight='bold')
    axes[0].legend(loc='lower right')
    axes[0].grid(True, alpha=0.3)
    
    # PR Curve
    axes[1].plot(recall, precision, color='#e74c3c', lw=2, label=f'PR curve (AUC = {pr_auc:.3f})')
    baseline = np.sum(y_test) / len(y_test)
    axes[1].axhline(y=baseline, color='gray', lw=1, linestyle='--', label=f'Baseline ({baseline:.3f})')
    axes[1].set_xlim([0.0, 1.0])
    axes[1].set_ylim([0.0, 1.05])
    axes[1].set_xlabel('Recall (Sensitivity)', fontsize=11)
    axes[1].set_ylabel('Precision (PPV)', fontsize=11)
    axes[1].set_title('Precision-Recall Curve', fontsize=13, fontweight='bold')
    axes[1].legend(loc='lower left')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\n" + "="*60)
    print("CURVE INTERPRETATION")
    print("="*60)
    print(f"\nROC-AUC: {roc_auc:.3f}")
    print(f"  - 0.5 = random classifier")
    print(f"  - 1.0 = perfect classifier")
    print(f"  - Good for balanced datasets")
    
    print(f"\nPR-AUC: {pr_auc:.3f}")
    print(f"  - Better for imbalanced data")
    print(f"  - Baseline = {baseline:.3f} (class prevalence)")
    print(f"  - More informative for rare diseases")
    
    return roc_auc, pr_auc

roc_score, pr_score = plot_roc_and_pr_curves()

---
## Practice 7: Survival Analysis with Kaplan-Meier

### 🎯 Learning Objectives
- Understand survival analysis basics
- Plot Kaplan-Meier curves
- Compare survival between groups

### 📖 Key Concepts
**Survival Analysis:** Time-to-event analysis (death, recurrence, etc.)  
**Censoring:** Patient lost to follow-up or study ends before event  
**Kaplan-Meier:** Non-parametric estimator of survival function
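
The Kaplan-Meier estimate is the running product of (1 - d_i / n_i) over the distinct event times, where d_i is the number of events at time t_i and n_i the number of patients still at risk just before it. The hand-rolled sketch below illustrates that product on five made-up patients. It is not how lifelines implements it, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (events: 1=event, 0=censored)."""
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)  # still under observation at t
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

times = [2, 3, 3, 5, 8]    # months of follow-up
events = [1, 1, 0, 1, 0]   # patients 3 and 5 are censored
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.2f}")
```

Censored patients never cause a drop in the curve, but they do leave the risk set, so later events produce larger steps.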

In [None]:
# Install lifelines if not already installed
!pip install lifelines -q

from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# 7.1 Simulate clinical trial survival data
def create_survival_data():
    """Simulate survival data for two treatment groups"""
    np.random.seed(42)
    
    n_patients = 100
    
    # Treatment A: better survival
    time_A = np.random.exponential(scale=30, size=n_patients//2)
    event_A = np.random.rand(n_patients//2) < 0.6  # 60% experienced event
    group_A = np.array(['Treatment A'] * (n_patients//2))
    
    # Treatment B: worse survival
    time_B = np.random.exponential(scale=18, size=n_patients//2)
    event_B = np.random.rand(n_patients//2) < 0.75  # 75% experienced event
    group_B = np.array(['Treatment B'] * (n_patients//2))
    
    # Combine data
    df_survival = pd.DataFrame({
        'time': np.concatenate([time_A, time_B]),
        'event': np.concatenate([event_A, event_B]).astype(int),
        'group': np.concatenate([group_A, group_B])
    })
    
    print("📊 Survival Data Summary:")
    print(f"\nTotal patients: {len(df_survival)}")
    print(f"\nTreatment A:")
    print(f"  Events: {event_A.sum()}/{len(event_A)} ({event_A.sum()/len(event_A)*100:.1f}%)")
    print(f"  Censored: {(~event_A).sum()}/{len(event_A)}")
    print(f"  Median observed time: {np.median(time_A):.1f} months")
    
    print(f"\nTreatment B:")
    print(f"  Events: {event_B.sum()}/{len(event_B)} ({event_B.sum()/len(event_B)*100:.1f}%)")
    print(f"  Censored: {(~event_B).sum()}/{len(event_B)}")
    print(f"  Median observed time: {np.median(time_B):.1f} months")
    
    return df_survival

df_surv = create_survival_data()

In [None]:
# 7.2 Plot Kaplan-Meier curves
def plot_kaplan_meier(df):
    """Plot KM curves for both treatment groups"""
    
    fig, ax = plt.subplots(figsize=(10, 6))
    
    kmf = KaplanMeierFitter()
    
    # Plot for each group
    for group in df['group'].unique():
        mask = df['group'] == group
        kmf.fit(
            durations=df.loc[mask, 'time'],
            event_observed=df.loc[mask, 'event'],
            label=group
        )
        kmf.plot_survival_function(ax=ax, ci_show=True)
    
    ax.set_xlabel('Time (months)', fontsize=12)
    ax.set_ylabel('Survival Probability', fontsize=12)
    ax.set_title('Kaplan-Meier Survival Curves by Treatment Group', 
                 fontsize=14, fontweight='bold')
    ax.legend(loc='best', fontsize=11)
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Log-rank test
    results = logrank_test(
        durations_A=df[df['group']=='Treatment A']['time'],
        durations_B=df[df['group']=='Treatment B']['time'],
        event_observed_A=df[df['group']=='Treatment A']['event'],
        event_observed_B=df[df['group']=='Treatment B']['event']
    )
    
    print("\n" + "="*60)
    print("LOG-RANK TEST RESULTS")
    print("="*60)
    print(f"\nTest statistic: {results.test_statistic:.4f}")
    print(f"p-value: {results.p_value:.4f}")
    
    if results.p_value < 0.05:
        print(f"\n✅ Significant difference in survival between groups (p < 0.05)")
    else:
        print(f"\n❌ No significant difference in survival between groups (p >= 0.05)")
    
    return kmf, results

kmf_model, logrank_results = plot_kaplan_meier(df_surv)

---
## Practice 8: Cox Proportional Hazards Model

### 🎯 Learning Objectives
- Fit Cox regression model
- Interpret hazard ratios
- Test proportional hazards assumption

### 📖 Key Concepts
**Cox Model:** h(t|X) = h₀(t) · exp(β₁X₁ + ... + βₚXₚ)  
**Hazard Ratio:** exp(β) - risk multiplier  
**PH Assumption:** Hazard ratio constant over time
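
The three statements above fit together: because the baseline hazard h₀(t) cancels in the ratio of two hazards, the hazard ratio depends only on β and the covariate difference, not on time. The sketch below illustrates this with an assumed coefficient of 0.03 per year of age and a made-up baseline hazard. Both are hypothetical values, not estimates from this notebook.

```python
import math

def hazard(t, x, beta, baseline):
    """Cox hazard h(t | x) = h0(t) * exp(beta * x) for a single covariate."""
    return baseline(t) * math.exp(beta * x)

def hazard_ratio(beta, delta=1.0):
    """Hazard ratio for a covariate difference of `delta` units."""
    return math.exp(beta * delta)

def baseline(t):
    # Hypothetical baseline hazard that increases with time
    return 0.01 * (1 + t)

beta_age = 0.03  # assumed log-hazard increase per year of age
ratios = [hazard(t, 70, beta_age, baseline) / hazard(t, 60, beta_age, baseline)
          for t in (1, 12, 60)]
print(f"HR for +10 years of age: {hazard_ratio(beta_age, delta=10):.3f}")
print("Ratio at t = 1, 12, 60 months:", [round(r, 3) for r in ratios])
```

If real data show a ratio that drifts over time, the proportional hazards assumption is violated and the single hazard ratio is no longer an adequate summary.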

In [None]:
# 8.1 Fit Cox regression model
from lifelines import CoxPHFitter

def fit_cox_regression():
    """Fit Cox model with multiple covariates"""
    
    np.random.seed(42)
    
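    # Create dataset with covariates
    # (covariates are drawn independently of time and event, so the fitted
    # hazard ratios are expected to be close to 1)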
    n = 200
    df_cox = pd.DataFrame({
        'time': np.random.exponential(scale=20, size=n),
        'event': np.random.rand(n) < 0.7,
        'age': np.random.normal(65, 10, n),
        'treatment': np.random.choice([0, 1], n),  # 0=control, 1=treatment
        'stage': np.random.choice([1, 2, 3, 4], n),  # Cancer stage
        'biomarker': np.random.normal(100, 20, n)
    })
    
    # Fit Cox model
    cph = CoxPHFitter()
    cph.fit(df_cox, duration_col='time', event_col='event')
    
    print("\n" + "="*60)
    print("COX PROPORTIONAL HAZARDS MODEL")
    print("="*60)
    cph.print_summary()
    
    # Interpret hazard ratios
    print("\n" + "="*60)
    print("HAZARD RATIO INTERPRETATION")
    print("="*60)
    
    for covariate in cph.params_.index:
        hr = np.exp(cph.params_[covariate])
        ci_lower = np.exp(cph.confidence_intervals_.loc[covariate, '95% lower-bound'])
        ci_upper = np.exp(cph.confidence_intervals_.loc[covariate, '95% upper-bound'])
        
        print(f"\n{covariate}:")
        print(f"  Hazard Ratio: {hr:.3f}")
        print(f"  95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
        
        if hr > 1:
            print(f"  → Increases risk by {(hr-1)*100:.1f}%")
        else:
            print(f"  → Decreases risk by {(1-hr)*100:.1f}%")
    
    return cph, df_cox

cox_model, df_cox_data = fit_cox_regression()

---
## Practice 9: Model Interpretability with SHAP

### 🎯 Learning Objectives
- Use SHAP values for model interpretation
- Visualize feature importance
- Explain individual predictions

### 📖 Key Concepts
**SHAP:** SHapley Additive exPlanations - unified framework  
**Shapley Values:** From game theory - fair contribution of each feature  
**TreeSHAP:** Fast algorithm for tree-based models
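
To make the "fair contribution" idea concrete, the sketch below computes exact Shapley values for a toy three-feature model by averaging each feature's marginal contribution over all feature orderings. It simplifies what SHAP does: absent features are replaced by a single fixed baseline rather than averaged over background data. `toy_model` and `shapley_values` are hypothetical names.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start with every feature "absent"
        previous = model(current)
        for i in order:
            current[i] = x[i]         # switch feature i to the patient's value
            value = model(current)
            phi[i] += value - previous
            previous = value
    return [p / len(orderings) for p in phi]

def toy_model(features):
    a, b, c = features
    return 2 * a + b + a * b          # feature c is ignored by the model

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print("Shapley values:", phi)
print("Sum:", sum(phi), "| f(x) - f(baseline):", toy_model(x) - toy_model(baseline))
```

The interaction term `a * b` is split evenly between the two features, the unused feature gets zero, and the values sum to the difference between the prediction and the baseline output. Enumerating all orderings grows factorially with the number of features, which is why fast algorithms such as TreeSHAP matter in practice.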

In [None]:
# Install shap if needed
!pip install shap -q

import shap

# 9.1 Apply SHAP to Random Forest model
def apply_shap_interpretation():
    """Use SHAP to interpret Random Forest predictions"""
    
    # Create interpretable dataset
    np.random.seed(42)
    n_samples = 500
    
    # Create features with meaningful names
    feature_names = ['Age', 'BMI', 'Blood_Pressure', 'Glucose', 'Cholesterol']
    
    X_interp = pd.DataFrame({
        'Age': np.random.normal(60, 15, n_samples),
        'BMI': np.random.normal(27, 5, n_samples),
        'Blood_Pressure': np.random.normal(130, 20, n_samples),
        'Glucose': np.random.normal(100, 25, n_samples),
        'Cholesterol': np.random.normal(200, 40, n_samples)
    })
    
    # Generate target based on features
    risk_score = (0.02 * X_interp['Age'] + 
                  0.1 * X_interp['BMI'] + 
                  0.01 * X_interp['Glucose'] +
                  np.random.randn(n_samples) * 2)
    y_interp = (risk_score > np.median(risk_score)).astype(int)
    
    # Train model
    X_train, X_test, y_train, y_test = train_test_split(
        X_interp, y_interp, test_size=0.2, random_state=42
    )
    
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_model.fit(X_train, y_train)
    
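    print(f"\n📊 Model Performance:")
    print(f"   Accuracy: {rf_model.score(X_test, y_test):.3f}")
    
    # SHAP analysis
    explainer = shap.TreeExplainer(rf_model)
    shap_values = explainer.shap_values(X_test)
    # Depending on the installed SHAP version, a binary classifier yields either
    # a list with one array per class or a single array of shape
    # (n_samples, n_features, n_classes). Normalize to the list form so that
    # shap_values[1] always refers to the positive (Disease) class.
    if isinstance(shap_values, np.ndarray) and shap_values.ndim == 3:
        shap_values = [shap_values[:, :, k] for k in range(shap_values.shape[2])]
    
    print("\n✅ SHAP values calculated successfully!")
    
    # Summary plot
    print("\n📈 Generating SHAP summary plot...")
    shap.summary_plot(shap_values[1], X_test, show=False)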
    plt.title('SHAP Feature Importance', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Feature importance comparison
    print("\n" + "="*60)
    print("FEATURE IMPORTANCE COMPARISON")
    print("="*60)
    
    # Random Forest native importance
    rf_importance = pd.DataFrame({
        'Feature': feature_names,
        'RF Importance': rf_model.feature_importances_
    }).sort_values('RF Importance', ascending=False)
    
    print("\nRandom Forest Feature Importance:")
    print(rf_importance.to_string(index=False))
    
    # SHAP importance (mean absolute SHAP value)
    shap_importance = pd.DataFrame({
        'Feature': feature_names,
        'SHAP Importance': np.abs(shap_values[1]).mean(axis=0)
    }).sort_values('SHAP Importance', ascending=False)
    
    print("\nSHAP Feature Importance:")
    print(shap_importance.to_string(index=False))
    
    return rf_model, explainer, shap_values, X_test

rf_shap, shap_explainer, shap_vals, X_test_shap = apply_shap_interpretation()

In [None]:
# 9.2 Explain individual prediction
def explain_single_prediction(model, explainer, shap_values, X_test, sample_idx=0):
    """Explain prediction for a single patient"""
    
    # Get prediction
    sample = X_test.iloc[sample_idx:sample_idx+1]
    prediction = model.predict(sample)[0]
    probability = model.predict_proba(sample)[0]
    
    print("\n" + "="*60)
    print(f"PATIENT #{sample_idx} PREDICTION EXPLANATION")
    print("="*60)
    
    print("\nPatient Features:")
    for feature, value in sample.iloc[0].items():
        print(f"  {feature}: {value:.2f}")
    
    print(f"\nPrediction:")
    print(f"  Class: {prediction} ({'Disease' if prediction == 1 else 'Healthy'})")
    print(f"  Probability: {probability[1]:.3f}")
    
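    # SHAP waterfall plot
    print("\n📊 Generating SHAP waterfall plot...")
    # shap_values may be a per-class list or a 3D array, depending on SHAP version
    values_pos = shap_values[1] if isinstance(shap_values, list) else shap_values[:, :, 1]
    shap.waterfall_plot(
        shap.Explanation(
            values=values_pos[sample_idx],
            base_values=explainer.expected_value[1],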
            data=sample.iloc[0].values,
            feature_names=sample.columns.tolist()
        ),
        show=False
    )
    plt.title(f'SHAP Explanation for Patient #{sample_idx}', 
              fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print("\n✅ Individual prediction explained!")

explain_single_prediction(rf_shap, shap_explainer, shap_vals, X_test_shap, sample_idx=0)

---
## Practice 10: Complete Biomedical ML Pipeline

### 🎯 Learning Objectives
- Build end-to-end ML pipeline
- Integrate preprocessing, modeling, and evaluation
- Apply best practices for biomedical data

### 📖 Key Concepts
**Pipeline:** Automated workflow from raw data to predictions  
**Best Practices:** Feature selection, CV, proper metrics, interpretability

In [None]:
# 10.1 Complete pipeline implementation
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif

def build_complete_pipeline():
    """Build complete ML pipeline for biomedical data"""
    
    print("\n" + "="*60)
    print("BUILDING COMPLETE BIOMEDICAL ML PIPELINE")
    print("="*60)
    
    # Step 1: Load data
    print("\n[Step 1/6] Loading and preparing data...")
    X, y = create_imbalanced_clinical_data()
    print(f"   ✓ Data loaded: {X.shape[0]} samples, {X.shape[1]} features")
    
    # Step 2: Handle missing data
    print("\n[Step 2/6] Handling missing values...")
    # Introduce some missing values
    X_with_nan = X.copy()
    missing_mask = np.random.rand(*X.shape) < 0.1
    X_with_nan[missing_mask] = np.nan
    print(f"   ✓ Missing values: {np.isnan(X_with_nan).sum()} ({np.isnan(X_with_nan).sum()/X_with_nan.size*100:.1f}%)")
    
    # Step 3: Split data
    print("\n[Step 3/6] Splitting data...")
    X_train, X_test, y_train, y_test = train_test_split(
        X_with_nan, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f"   ✓ Train: {len(X_train)} samples")
    print(f"   ✓ Test: {len(X_test)} samples")
    
    # Step 4: Build pipeline
    print("\n[Step 4/6] Building ML pipeline...")
    pipeline = Pipeline([
        ('imputer', KNNImputer(n_neighbors=5)),
        ('scaler', StandardScaler()),
        ('feature_selection', SelectKBest(f_classif, k=10)),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'))
    ])
    print("   ✓ Pipeline created with 4 steps:")
    print("     1. KNN Imputation")
    print("     2. Standard Scaling")
    print("     3. Feature Selection (K-Best)")
    print("     4. Random Forest (balanced)")
    
    # Step 5: Train with cross-validation
    print("\n[Step 5/6] Training with cross-validation...")
    cv_scores = cross_val_score(
        pipeline, X_train, y_train, 
        cv=StratifiedKFold(5, shuffle=True, random_state=42),
        scoring='balanced_accuracy'
    )
    print(f"   ✓ CV Balanced Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
    
    # Fit final model
    pipeline.fit(X_train, y_train)
    print("   ✓ Final model trained on all training data")
    
    # Step 6: Evaluate on test set
    print("\n[Step 6/6] Evaluating on test set...")
    y_pred = pipeline.predict(X_test)
    y_prob = pipeline.predict_proba(X_test)[:, 1]
    
    print("\n" + "="*60)
    print("FINAL TEST SET RESULTS")
    print("="*60)
    print(classification_report(y_test, y_pred, target_names=['Healthy', 'Disease']))
    
    # Additional metrics
    cm = confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = cm.ravel()
    
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp) if (tp + fp) > 0 else 0
    npv = tn / (tn + fn) if (tn + fn) > 0 else 0
    
    print(f"\nClinical Metrics:")
    print(f"  Sensitivity: {sensitivity:.3f}")
    print(f"  Specificity: {specificity:.3f}")
    print(f"  PPV:         {ppv:.3f}")
    print(f"  NPV:         {npv:.3f}")
    print(f"  Balanced Acc: {balanced_accuracy_score(y_test, y_pred):.3f}")
    print(f"  Matthews CC:  {matthews_corrcoef(y_test, y_pred):.3f}")
    
    # ROC-AUC
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    print(f"  ROC-AUC:     {roc_auc:.3f}")
    
    print("\n✅ Complete pipeline executed successfully!")
    
    return pipeline, X_test, y_test, y_pred, y_prob

final_pipeline, X_final_test, y_final_test, y_final_pred, y_final_prob = build_complete_pipeline()

---
## 🎯 Practice Complete!

### Summary of What We Learned:

1. **High-Dimensional Data**: Feature selection to combat curse of dimensionality
2. **Class Imbalance**: SMOTE and proper metrics (balanced accuracy, Matthews CC)
3. **Missing Data**: KNN imputation and comparison with simpler methods
4. **Cross-Validation**: Stratified K-Fold for imbalanced biomedical data
5. **Performance Metrics**: Confusion matrix, sensitivity, specificity, PPV, NPV
6. **ROC and PR Curves**: Choosing appropriate evaluation for imbalanced data
7. **Survival Analysis**: Kaplan-Meier curves and log-rank test
8. **Cox Regression**: Hazard ratios and proportional hazards model
9. **Interpretability**: SHAP values for clinical acceptance
10. **Complete Pipeline**: End-to-end ML workflow for biomedical applications

### Key Insights:
- Biomedical data requires specialized techniques (not just standard ML)
- Always use stratified CV and appropriate metrics for imbalanced data
- Model interpretability is crucial for clinical adoption
- Survival analysis is essential for time-to-event outcomes

### Next Steps:
- Apply to real clinical datasets
- External validation on independent cohorts
- Clinical deployment considerations
- Regulatory requirements (FDA approval)

### 📚 Additional Resources:
- **scikit-learn**: https://scikit-learn.org/
- **lifelines**: https://lifelines.readthedocs.io/
- **SHAP**: https://shap.readthedocs.io/
- **imbalanced-learn**: https://imbalanced-learn.org/