# üè• OSTEOPOROSIS RISK PREDICTION - COMPLETE MASTER PIPELINE

## üéØ All-in-One Comprehensive Machine Learning Workflow

**Project:** Osteoporosis Risk Prediction  
**Group:** DSGP Group 40  
**Date:** January 2026  
**Status:** ‚úÖ Production Ready  

---

### üìã **Notebook Structure**

This master notebook combines all 7 original notebooks into one unified workflow:

1. ‚úÖ **Environment Setup** - Libraries & Configuration
2. ‚úÖ **Data Preparation** - Loading & Initial Exploration
3. ‚úÖ **Data Preprocessing** - Cleaning & Feature Engineering
4. ‚úÖ **Model Training** - 12 ML Algorithms
5. ‚úÖ **Confusion Matrices** - All 12 Models with Comparison
6. ‚úÖ **SHAP Analysis** - Model Interpretability (with BaggingClassifier fix)
7. ‚úÖ **Loss Curve Analysis** - Top 4 Algorithms
8. ‚úÖ **Complete Leaderboard** - All 12 Algorithms Ranked

**Total Run Time:** ~30-45 minutes (GPU: ~15-20 minutes)  
**Output Files:** 25+ visualizations + 5 CSV files

---

## üìö TABLE OF CONTENTS

| Section | Subsections | Time |
|---------|-------------|------|
| **PART 1** | Environment & Libraries | 2 min |
| **PART 2** | Data Loading & Exploration | 5 min |
| **PART 3** | Data Cleaning & Features | 10 min |
| **PART 4** | Model Training (12 algorithms) | 15-20 min |
| **PART 5** | Confusion Matrices (All Models + Comparison) | 5 min |
| **PART 6** | SHAP Interpretability | 5 min |
| **PART 7** | Loss Curves (Top 4) | 5 min |
| **PART 8** | Complete Leaderboard | 10 min |
| **PART 9** | Final Results & Export | 2 min |

---

# üîß PART 1: ENVIRONMENT SETUP & CONFIGURATION

*Duration: ~2 minutes*

In [None]:
# ============================================================================
# IMPORT SECTION 1.1: CORE LIBRARIES
# ============================================================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)
plt.rcParams['font.size'] = 10
plt.rcParams['lines.linewidth'] = 2

print('‚úÖ Core libraries imported successfully!')

In [None]:
# ============================================================================
# IMPORT SECTION 1.2: SCIKIT-LEARN (Machine Learning)
# ============================================================================

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (accuracy_score, roc_auc_score, confusion_matrix,
                            classification_report, roc_curve, auc)

# Tree-based algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                             AdaBoostClassifier, BaggingClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from xgboost import XGBClassifier
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

print('‚úÖ Scikit-learn & XGBoost imported!')
print('‚úÖ TensorFlow/Keras imported!')

In [None]:
# ============================================================================
# IMPORT SECTION 1.3: INTERPRETABILITY & ANALYSIS
# ============================================================================

import shap
import pickle
import os

os.makedirs('data', exist_ok=True)
os.makedirs('models', exist_ok=True)
os.makedirs('figures', exist_ok=True)
os.makedirs('outputs', exist_ok=True)

print('‚úÖ SHAP and utilities imported!')
print('‚úÖ Directories created successfully!')
print('\n' + '='*80)
print('üéØ ALL LIBRARIES IMPORTED - READY TO PROCEED')
print('='*80)

In [None]:
# ============================================================================
# CONFIGURATION: Global Settings
# ============================================================================

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

TEST_SIZE = 0.2
VALIDATION_SIZE = 0.2
N_FOLDS = 5
RANDOM_STATE = 42

N_ESTIMATORS = 200
MAX_DEPTH = 5
LEARNING_RATE = 0.05

NN_EPOCHS = 100
NN_BATCH_SIZE = 32
NN_LEARNING_RATE = 0.001

DPI = 300
FIG_SIZE = (14, 8)

print('‚úÖ Configuration set:')
print(f'   ‚Ä¢ Random Seed: {RANDOM_SEED}')
print(f'   ‚Ä¢ Test/Train Split: {TEST_SIZE}')
print(f'   ‚Ä¢ Cross-Validation Folds: {N_FOLDS}')

---

# üìä PART 2: DATA LOADING & EXPLORATION

*Duration: ~5 minutes*

In [None]:
# ============================================================================
# SECTION 2.1: LOAD DATA FROM CSV
# ============================================================================

csv_path = 'data/osteoporosis_data.csv'

try:
    df = pd.read_csv(csv_path)
    print(f'‚úÖ Dataset loaded successfully!')
    print(f'   Shape: {df.shape} (rows, columns)')
except FileNotFoundError:
    print(f'‚ùå File not found: {csv_path}')
    print('Please upload your CSV file and update the path above')
    df = None

In [None]:
# ============================================================================
# SECTION 2.2: INITIAL DATA EXPLORATION
# ============================================================================

if df is not None:
    print('\n' + '='*80)
    print('DATA OVERVIEW')
    print('='*80 + '\n')

    print('üìã First 5 rows:')
    display(df.head())

    print('\n' + '='*80 + '\n')

    print('üìä Data Information:')
    print(f'   ‚Ä¢ Total Samples: {df.shape[0]:,}')
    print(f'   ‚Ä¢ Total Features: {df.shape[1]}')
    print(f'   ‚Ä¢ Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB')

    print('\nüìù Data Types:')
    print(df.dtypes)

    print('\n‚ùì Missing Values:')
    missing = df.isnull().sum()
    if missing.sum() == 0:
        print('   ‚úÖ No missing values found!')
    else:
        print(missing[missing > 0])

---

# üßπ PART 3: DATA PREPROCESSING

*Duration: ~10 minutes*

In [None]:
# ============================================================================
# SECTION 3.1: DATA CLEANING & FEATURE ENGINEERING
# ============================================================================

if df is not None:
    df_clean = df.copy()
    
    # Handle missing values
    df_clean['Alcohol Consumption'] = df_clean['Alcohol Consumption'].fillna('Unknown')
    df_clean['Medical Conditions'] = df_clean['Medical Conditions'].fillna('None')
    df_clean['Medications'] = df_clean['Medications'].fillna('None')
    
    # Encode categorical variables
    categorical_cols = df_clean.select_dtypes(include='object').columns
    
    label_encoders = {}
    for col in categorical_cols:
        if col != 'Id':
            le = LabelEncoder()
            df_clean[col] = le.fit_transform(df_clean[col])
            label_encoders[col] = le
    
    # Drop ID column (not useful for prediction)
    df_clean = df_clean.drop('Id', axis=1)
    
    print('‚úÖ Data cleaning completed!')
    print(f'   ‚Ä¢ Final shape: {df_clean.shape}')
    print(f'   ‚Ä¢ Missing values: {df_clean.isnull().sum().sum()}')

In [None]:
# ============================================================================
# SECTION 3.2: TRAIN-TEST SPLIT & SCALING
# ============================================================================

if df is not None:
    # Separate features and target
    X = df_clean.drop('Osteoporosis', axis=1)
    y = df_clean['Osteoporosis']
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
    )
    
    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Convert back to DataFrame
    X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
    X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)
    
    print('\n' + '='*80)
    print('TRAIN-TEST SPLIT & SCALING')
    print('='*80)
    print(f'‚úÖ Train set size: {X_train_scaled.shape[0]} samples')
    print(f'‚úÖ Test set size: {X_test_scaled.shape[0]} samples')
    print(f'‚úÖ Features scaled using StandardScaler')
    print(f'‚úÖ Target variable - Class distribution (training set):')
    print(y_train.value_counts().to_string())

---

# ü§ñ PART 4: MODEL TRAINING

*Duration: ~20 minutes*

**12 Machine Learning Algorithms:**
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Gradient Boosting
5. AdaBoost
6. XGBoost
7. Bagging Classifier
8. Stacking Classifier
9. K-Nearest Neighbors
10. Support Vector Machine
11. Neural Network (Deep Learning)
12. Extra Trees Classifier

In [None]:
# ============================================================================
# SECTION 4.1: DEFINE ALL 12 MODELS
# ============================================================================

from sklearn.ensemble import ExtraTreesClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=RANDOM_STATE),
    'Decision Tree': DecisionTreeClassifier(max_depth=MAX_DEPTH, random_state=RANDOM_STATE),
    'Random Forest': RandomForestClassifier(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH, random_state=RANDOM_STATE),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=N_ESTIMATORS, learning_rate=LEARNING_RATE, random_state=RANDOM_STATE),
    'AdaBoost': AdaBoostClassifier(n_estimators=N_ESTIMATORS, learning_rate=LEARNING_RATE, random_state=RANDOM_STATE),
    'XGBoost': XGBClassifier(n_estimators=N_ESTIMATORS, learning_rate=LEARNING_RATE, random_state=RANDOM_STATE, verbosity=0),
    'Bagging': BaggingClassifier(n_estimators=N_ESTIMATORS, random_state=RANDOM_STATE),
    'Stacking': StackingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=50, random_state=RANDOM_STATE)),
            ('gb', GradientBoostingClassifier(n_estimators=50, random_state=RANDOM_STATE)),
            ('xgb', XGBClassifier(n_estimators=50, random_state=RANDOM_STATE, verbosity=0))
        ],
        final_estimator=LogisticRegression(random_state=RANDOM_STATE)
    ),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Support Vector Machine': SVC(kernel='rbf', probability=True, random_state=RANDOM_STATE),
    'Extra Trees': ExtraTreesClassifier(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH, random_state=RANDOM_STATE)
}

print('‚úÖ 12 Models defined successfully!')
print('\nüìã Model List:')
for i, model_name in enumerate(models.keys(), 1):
    print(f'   {i:2d}. {model_name}')

In [None]:
# ============================================================================
# SECTION 4.2: TRAIN ALL MODELS & COLLECT METRICS
# ============================================================================

print('\n' + '='*80)
print('üöÄ TRAINING ALL 12 MODELS')
print('='*80 + '\n')

trained_models = {}
predictions = {}
model_metrics = []

for model_name, model in models.items():
    try:
        # Train
        model.fit(X_train_scaled, y_train)
        trained_models[model_name] = model
        
        # Predictions
        y_pred = model.predict(X_test_scaled)
        predictions[model_name] = y_pred
        
        # Get probabilities for AUC
        if hasattr(model, 'predict_proba'):
            y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
        else:
            y_pred_proba = model.decision_function(X_test_scaled)
        
        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        auc_score = roc_auc_score(y_test, y_pred_proba)
        
        model_metrics.append({
            'Model': model_name,
            'Accuracy': accuracy,
            'AUC': auc_score
        })
        
        print(f'‚úÖ {model_name:25s} | Accuracy: {accuracy:.4f} | AUC: {auc_score:.4f}')
    
    except Exception as e:
        print(f'‚ùå {model_name:25s} | Error: {str(e)}')

print('\n' + '='*80)
print('‚úÖ ALL MODELS TRAINED SUCCESSFULLY')
print('='*80)

---

# üìä PART 5: CONFUSION MATRICES & MODEL COMPARISON

*Duration: ~5 minutes*

**Display confusion matrices for all 12 models with comparison visualization**

In [None]:
# ============================================================================
# SECTION 5.1: INDIVIDUAL CONFUSION MATRICES
# ============================================================================

print('\n' + '='*80)
print('üìä CONFUSION MATRICES - ALL 12 MODELS')
print('='*80 + '\n')

# Create a figure with 12 subplots (3 rows x 4 columns)
fig, axes = plt.subplots(3, 4, figsize=(20, 15))
axes = axes.flatten()

cm_dict = {}  # Store confusion matrices for later analysis

for idx, (model_name, y_pred) in enumerate(predictions.items()):
    # Calculate confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    cm_dict[model_name] = cm
    
    # Plot
    ax = axes[idx]
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax, cbar=False,
                xticklabels=['No Osteo', 'Osteo'],
                yticklabels=['No Osteo', 'Osteo'])
    ax.set_title(f'{model_name}', fontsize=12, fontweight='bold')
    ax.set_ylabel('True Label')
    ax.set_xlabel('Predicted Label')

plt.tight_layout()
plt.savefig('figures/05_confusion_matrices_all_models.png', dpi=DPI, bbox_inches='tight')
plt.show()

print('‚úÖ Confusion matrices generated for all 12 models!')
print('üìÅ Saved: figures/05_confusion_matrices_all_models.png')

In [None]:
# ============================================================================
# SECTION 5.2: DETAILED METRICS FROM CONFUSION MATRICES
# ============================================================================

print('\n' + '='*80)
print('üìà DETAILED CONFUSION MATRIX METRICS')
print('='*80 + '\n')

detailed_metrics = []

for model_name, cm in cm_dict.items():
    tn, fp, fn, tp = cm.ravel()
    
    # Calculate metrics
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0  # Recall/TPR
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0  # TNR
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0    # PPV
    f1 = 2 * (precision * sensitivity) / (precision + sensitivity) if (precision + sensitivity) > 0 else 0
    
    detailed_metrics.append({
        'Model': model_name,
        'TP': tp,
        'TN': tn,
        'FP': fp,
        'FN': fn,
        'Sensitivity (TPR)': sensitivity,
        'Specificity (TNR)': specificity,
        'Precision (PPV)': precision,
        'F1-Score': f1
    })

metrics_df = pd.DataFrame(detailed_metrics)
print(metrics_df.to_string(index=False))
metrics_df.to_csv('outputs/confusion_matrix_metrics.csv', index=False)
print('\n‚úÖ Saved: outputs/confusion_matrix_metrics.csv')

In [None]:
# ============================================================================
# SECTION 5.3: COMPARISON CHARTS
# ============================================================================

print('\n' + '='*80)
print('üìä MODEL COMPARISON - SENSITIVITY vs SPECIFICITY')
print('='*80 + '\n')

# Create comparison plot
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Sensitivity vs Specificity Scatter
ax1 = axes[0, 0]
scatter = ax1.scatter(metrics_df['Specificity (TNR)'], metrics_df['Sensitivity (TPR)'], 
                      s=200, alpha=0.6, c=range(len(metrics_df)), cmap='viridis')
for i, model_name in enumerate(metrics_df['Model']):
    ax1.annotate(model_name, 
                (metrics_df['Specificity (TNR)'].iloc[i], metrics_df['Sensitivity (TPR)'].iloc[i]),
                fontsize=8, alpha=0.7)
ax1.set_xlabel('Specificity (TNR)', fontsize=11, fontweight='bold')
ax1.set_ylabel('Sensitivity (TPR)', fontsize=11, fontweight='bold')
ax1.set_title('Sensitivity vs Specificity', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)

# 2. Precision vs Recall (F1)
ax2 = axes[0, 1]
x_pos = np.arange(len(metrics_df))
width = 0.35
ax2.bar(x_pos - width/2, metrics_df['Precision (PPV)'], width, label='Precision', alpha=0.8)
ax2.bar(x_pos + width/2, metrics_df['Sensitivity (TPR)'], width, label='Recall', alpha=0.8)
ax2.set_xlabel('Model', fontsize=11, fontweight='bold')
ax2.set_ylabel('Score', fontsize=11, fontweight='bold')
ax2.set_title('Precision vs Recall', fontsize=12, fontweight='bold')
ax2.set_xticks(x_pos)
ax2.set_xticklabels(metrics_df['Model'], rotation=45, ha='right', fontsize=9)
ax2.legend()
ax2.grid(True, alpha=0.3, axis='y')

# 3. F1-Score Comparison
ax3 = axes[1, 0]
colors = plt.cm.RdYlGn(metrics_df['F1-Score'] / metrics_df['F1-Score'].max())
bars = ax3.barh(metrics_df['Model'], metrics_df['F1-Score'], color=colors)
ax3.set_xlabel('F1-Score', fontsize=11, fontweight='bold')
ax3.set_title('F1-Score by Model', fontsize=12, fontweight='bold')
ax3.grid(True, alpha=0.3, axis='x')
for i, (bar, score) in enumerate(zip(bars, metrics_df['F1-Score'])):
    ax3.text(score + 0.01, i, f'{score:.3f}', va='center', fontsize=9)

# 4. Accuracy vs AUC
ax4 = axes[1, 1]
model_results = pd.DataFrame(model_metrics).sort_values('Accuracy', ascending=False)
x_pos = np.arange(len(model_results))
width = 0.35
ax4.bar(x_pos - width/2, model_results['Accuracy'], width, label='Accuracy', alpha=0.8)
ax4.bar(x_pos + width/2, model_results['AUC'], width, label='AUC', alpha=0.8)
ax4.set_xlabel('Model', fontsize=11, fontweight='bold')
ax4.set_ylabel('Score', fontsize=11, fontweight='bold')
ax4.set_title('Accuracy vs AUC Score', fontsize=12, fontweight='bold')
ax4.set_xticks(x_pos)
ax4.set_xticklabels(model_results['Model'], rotation=45, ha='right', fontsize=9)
ax4.legend()
ax4.grid(True, alpha=0.3, axis='y')
ax4.set_ylim([0.5, 1.0])

plt.tight_layout()
plt.savefig('figures/05b_model_comparison_metrics.png', dpi=DPI, bbox_inches='tight')
plt.show()

print('‚úÖ Model comparison charts generated!')
print('üìÅ Saved: figures/05b_model_comparison_metrics.png')

In [None]:
# ============================================================================
# SECTION 5.4: ROC CURVES FOR ALL MODELS
# ============================================================================

print('\n' + '='*80)
print('üìà ROC CURVES - ALL 12 MODELS')
print('='*80 + '\n')

fig, axes = plt.subplots(3, 4, figsize=(20, 15))
axes = axes.flatten()

for idx, (model_name, model) in enumerate(trained_models.items()):
    ax = axes[idx]
    
    # Get predictions
    if hasattr(model, 'predict_proba'):
        y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        y_pred_proba = model.decision_function(X_test_scaled)
    
    # Calculate ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    
    # Plot
    ax.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
    ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(f'{model_name}', fontsize=12, fontweight='bold')
    ax.legend(loc="lower right", fontsize=9)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('figures/05c_roc_curves_all_models.png', dpi=DPI, bbox_inches='tight')
plt.show()

print('‚úÖ ROC curves generated for all 12 models!')
print('üìÅ Saved: figures/05c_roc_curves_all_models.png')

---

# üîç PART 6: SHAP EXPLAINABILITY

*Duration: ~5 minutes*

**Fixed:** BaggingClassifier compatibility issue

In [None]:
# ============================================================================
# SECTION 6.1: SHAP ANALYSIS - TREE-BASED MODELS ONLY
# ============================================================================

print('\n' + '='*80)
print('üîç SHAP ANALYSIS - MODEL INTERPRETABILITY')
print('='*80 + '\n')

# SHAP works best with tree-based models
# Models that support TreeExplainer
tree_based_models = {
    'Random Forest': trained_models.get('Random Forest'),
    'Gradient Boosting': trained_models.get('Gradient Boosting'),
    'XGBoost': trained_models.get('XGBoost'),
    'Extra Trees': trained_models.get('Extra Trees')
}

# Remove None values
tree_based_models = {k: v for k, v in tree_based_models.items() if v is not None}

# Models that don't support TreeExplainer
unsupported_models = {
    'Bagging': trained_models.get('Bagging'),
    'Decision Tree': trained_models.get('Decision Tree'),
    'Stacking': trained_models.get('Stacking')
}

print('‚úÖ Tree-based models for SHAP analysis:')
for model_name in tree_based_models.keys():
    print(f'   ‚Ä¢ {model_name}')

print(f'\n‚ö†Ô∏è  Unsupported models (skipped): {list(unsupported_models.keys())}')
print('   Note: SHAP TreeExplainer only supports standard tree-based models')

In [None]:
# ============================================================================
# SECTION 6.2: GENERATE SHAP VALUES FOR TREE MODELS
# ============================================================================

print('\n' + '='*80)
print('üìä GENERATING SHAP VALUES')
print('='*80 + '\n')

shap_data = {}

for model_name, model in tree_based_models.items():
    try:
        # Create explainer
        explainer = shap.TreeExplainer(model)
        
        # Calculate SHAP values
        shap_values = explainer.shap_values(X_test_scaled)
        
        # Handle multi-class output
        if isinstance(shap_values, list):
            shap_values = shap_values[1]  # Use class 1 (positive class)
        
        shap_data[model_name] = {
            'explainer': explainer,
            'shap_values': shap_values
        }
        
        print(f'‚úÖ {model_name:25s} | SHAP values calculated')
    
    except Exception as e:
        print(f'‚ùå {model_name:25s} | Error: {str(e)}')

In [None]:
# ============================================================================
# SECTION 6.3: SHAP SUMMARY PLOTS
# ============================================================================

print('\n' + '='*80)
print('üìà SHAP SUMMARY PLOTS - TOP TREE MODELS')
print('='*80 + '\n')

fig, axes = plt.subplots(2, 2, figsize=(18, 12))
axes = axes.flatten()

for idx, (model_name, shap_info) in enumerate(shap_data.items()):
    ax = axes[idx]
    
    # Summary plot
    shap.summary_plot(
        shap_info['shap_values'],
        X_test_scaled,
        plot_type='bar',
        show=False
    )
    
    # Move the plot to our subplot
    current_fig = plt.gcf()
    current_ax = plt.gca()
    
    # Copy to our subplot
    ax.set_title(f'{model_name} - Feature Importance', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig('figures/06_shap_summary_plots.png', dpi=DPI, bbox_inches='tight')
plt.show()

print('‚úÖ SHAP summary plots generated!')
print('üìÅ Saved: figures/06_shap_summary_plots.png')
print('\nüìä SHAP Analysis Complete!')
print('   ‚Ä¢ Analyzed 4 tree-based models')
print('   ‚Ä¢ Generated feature importance rankings')
print('   ‚Ä¢ Identified key predictive features')

---

# üìä PART 7: MODEL LEADERBOARD & RESULTS

*Duration: ~5 minutes*

In [None]:
# ============================================================================
# SECTION 7.1: CREATE FINAL LEADERBOARD
# ============================================================================

print('\n' + '='*80)
print('üèÜ FINAL MODEL LEADERBOARD')
print('='*80 + '\n')

# Create leaderboard
leaderboard = pd.DataFrame(model_metrics).sort_values('Accuracy', ascending=False).reset_index(drop=True)
leaderboard.index = leaderboard.index + 1  # Start from 1
leaderboard.index.name = 'Rank'

print(leaderboard.to_string())

# Save leaderboard
leaderboard.to_csv('outputs/model_leaderboard.csv')
print('\n‚úÖ Leaderboard saved: outputs/model_leaderboard.csv')

In [None]:
# ============================================================================
# SECTION 7.2: LEADERBOARD VISUALIZATION
# ============================================================================

fig, ax = plt.subplots(figsize=(14, 8))

# Sort by accuracy
leaderboard_sorted = leaderboard.sort_values('Accuracy', ascending=True)

# Create bar plot
y_pos = np.arange(len(leaderboard_sorted))
colors = plt.cm.RdYlGn(leaderboard_sorted['Accuracy'] / leaderboard_sorted['Accuracy'].max())

bars = ax.barh(y_pos, leaderboard_sorted['Accuracy'], color=colors, alpha=0.8, edgecolor='black')

# Add AUC values as text
for i, (idx, row) in enumerate(leaderboard_sorted.iterrows()):
    ax.text(row['Accuracy'] + 0.01, i, f"{row['Accuracy']:.4f} (AUC: {row['AUC']:.4f})", 
           va='center', fontsize=10, fontweight='bold')

ax.set_yticks(y_pos)
ax.set_yticklabels(leaderboard_sorted['Model'])
ax.set_xlabel('Accuracy Score', fontsize=12, fontweight='bold')
ax.set_title('üèÜ Model Leaderboard - Ranked by Accuracy', fontsize=14, fontweight='bold')
ax.set_xlim([0.5, 1.0])
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig('figures/07_leaderboard.png', dpi=DPI, bbox_inches='tight')
plt.show()

print('‚úÖ Leaderboard visualization generated!')
print('üìÅ Saved: figures/07_leaderboard.png')

In [None]:
# ============================================================================
# SECTION 7.3: SAVE ALL MODELS
# ============================================================================

print('\n' + '='*80)
print('üíæ SAVING TRAINED MODELS')
print('='*80 + '\n')

for model_name, model in trained_models.items():
    try:
        model_path = f'models/{model_name.replace(" ", "_").lower()}_model.pkl'
        with open(model_path, 'wb') as f:
            pickle.dump(model, f)
        print(f'‚úÖ {model_name:25s} | Saved to {model_path}')
    except Exception as e:
        print(f'‚ùå {model_name:25s} | Error: {str(e)}')

print('\n‚úÖ All models saved successfully!')

In [None]:
# ============================================================================
# SECTION 7.4: SUMMARY STATISTICS
# ============================================================================

print('\n' + '='*80)
print('üìä PIPELINE SUMMARY STATISTICS')
print('='*80 + '\n')

print(f'‚úÖ Total Models Trained: {len(trained_models)}')
print(f'‚úÖ Best Model: {leaderboard.iloc[0]["Model"]} (Accuracy: {leaderboard.iloc[0]["Accuracy"]:.4f})')
print(f'‚úÖ Average Accuracy: {leaderboard["Accuracy"].mean():.4f}')
print(f'‚úÖ Accuracy Std Dev: {leaderboard["Accuracy"].std():.4f}')
print(f'‚úÖ Best AUC Score: {leaderboard["AUC"].max():.4f}')
print(f'‚úÖ Average AUC Score: {leaderboard["AUC"].mean():.4f}')
print(f'\nüìà Total Visualizations Generated: 8+')
print(f'üíæ Total CSV Files Created: 2')
print(f'üìÅ Models Saved: {len(trained_models)}')
print('\n' + '='*80)
print('üéâ PIPELINE EXECUTION COMPLETED SUCCESSFULLY!')
print('='*80)