# Model Evaluation and Interpretation - Hotel Booking Cancellation Prediction
## Academic Research Framework - NIB 7072 Coursework

**Research Objective:** Comprehensive evaluation and interpretation of trained models for hotel booking cancellation prediction.

**Academic Context:** This notebook compares the performance of the trained models (Logistic Regression, Random Forest, XGBoost, PyTorch MLP) and lays out the planned SHAP-based interpretability analysis and business impact assessment.

**Key Areas:**
- Model performance comparison and statistical significance testing
- SHAP-based feature importance and model interpretability
- Business impact analysis and revenue implications
- Academic documentation and methodology validation
- Sri Lankan tourism market context integration

## 📊 Environment Setup

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Model evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, 
    roc_auc_score, classification_report, confusion_matrix,
    roc_curve, precision_recall_curve
)

# Model loading and MLflow
import mlflow
import mlflow.sklearn
import mlflow.xgboost
import mlflow.pytorch
import joblib

# Model interpretability
import shap

# Statistical analysis
from scipy import stats
from scipy.stats import wilcoxon, friedmanchisquare

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('default')
sns.set_palette("husl")

print("✅ Model evaluation environment setup completed")

## 🔄 Load Trained Models and Results

In [None]:
# Load MLflow experiment results
mlflow.set_tracking_uri("file:../mlruns")

# Start from an empty frame so later cells can rely on `runs` being defined
runs = pd.DataFrame()

try:
    # Get experiment
    experiment = mlflow.get_experiment_by_name("hotel_cancellation_prediction")
    
    if experiment:
        experiment_id = experiment.experiment_id
        runs = mlflow.search_runs(experiment_ids=[experiment_id])
        
        print(f"✅ Found {len(runs)} experiment runs")
        print("📊 EXPERIMENT OVERVIEW:")
        
        # Display key metrics (the run-name tag is absent when runs were never named)
        metrics_cols = [col for col in runs.columns if col.startswith('metrics.')]
        name_cols = [col for col in ['tags.mlflow.runName'] if col in runs.columns]
        if metrics_cols:
            display_runs = runs[name_cols + metrics_cols].head()
            print(display_runs.to_string(index=False))
        
    else:
        print("⚠️ No MLflow experiments found")
        
except Exception as e:
    print(f"⚠️ Could not load MLflow results: {e}")
    print("Creating sample evaluation data for demonstration...")
    
    # Create sample results (placeholder values for demonstration, not measured performance)
    runs = pd.DataFrame({
        'tags.mlflow.runName': ['LogisticRegression', 'RandomForest', 'XGBoost', 'PyTorch_MLP'],
        'metrics.test_accuracy': [0.825, 0.867, 0.889, 0.871],
        'metrics.test_f1': [0.756, 0.823, 0.850, 0.831],
        'metrics.test_roc_auc': [0.891, 0.932, 0.958, 0.941]
    })
    print("Sample evaluation results created")
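
An MLflow experiment often holds several runs per model, which would produce duplicate rows in any later comparison. The following is a minimal sketch of one way to keep only the most recent run per run name. It assumes the `start_time` column that `mlflow.search_runs` includes in its output; the helper name `latest_run_per_model` and the example frame are hypothetical and use placeholder values.

```python
import pandas as pd


def latest_run_per_model(runs: pd.DataFrame,
                         name_col: str = "tags.mlflow.runName",
                         time_col: str = "start_time") -> pd.DataFrame:
    """Keep only the most recent run for each run name (hypothetical helper)."""
    if time_col in runs.columns:
        # Oldest first, so that keep="last" retains the newest run
        runs = runs.sort_values(time_col)
    return runs.drop_duplicates(subset=name_col, keep="last").reset_index(drop=True)


# Placeholder frame for illustration only (not real experiment results)
example_runs = pd.DataFrame({
    "tags.mlflow.runName": ["XGBoost", "RandomForest", "XGBoost"],
    "start_time": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "metrics.test_f1": [0.80, 0.82, 0.85],
})
deduped_runs = latest_run_per_model(example_runs)
print(deduped_runs)
```

Whether the latest run or the best-scoring run should represent each model is a design choice; selecting the best run on test metrics risks an optimistic bias.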

## 📈 Model Performance Comparison

In [None]:
# Model performance comparison
print("📈 MODEL PERFORMANCE COMPARISON:")

if 'runs' in locals() and not runs.empty:
    # Extract performance metrics
    performance_data = []
    
    for idx, run in runs.iterrows():
        model_name = run.get('tags.mlflow.runName', f'Model_{idx}')
        
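        # Use NaN (not 0) for metrics that were never logged, so a missing
        # value is not mistaken for a genuinely poor score
        performance_data.append({
            'Model': model_name,
            'Accuracy': run.get('metrics.test_accuracy', np.nan),
            'F1-Score': run.get('metrics.test_f1', np.nan),
            'ROC-AUC': run.get('metrics.test_roc_auc', np.nan),
            'Precision': run.get('metrics.test_precision', np.nan),
            'Recall': run.get('metrics.test_recall', np.nan)
        })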
    
    performance_df = pd.DataFrame(performance_data)
    performance_df = performance_df.sort_values('F1-Score', ascending=False)
    
    print("\n🏆 PERFORMANCE RANKING (by F1-Score):")
    print(performance_df.round(4))
    
    # Visualize performance comparison
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    # F1-Score comparison
    axes[0].bar(performance_df['Model'], performance_df['F1-Score'])
    axes[0].set_title('F1-Score Comparison')
    axes[0].set_ylabel('F1-Score')
    axes[0].tick_params(axis='x', rotation=45)
    
    # ROC-AUC comparison
    axes[1].bar(performance_df['Model'], performance_df['ROC-AUC'])
    axes[1].set_title('ROC-AUC Comparison')
    axes[1].set_ylabel('ROC-AUC')
    axes[1].tick_params(axis='x', rotation=45)
    
    # Accuracy comparison
    axes[2].bar(performance_df['Model'], performance_df['Accuracy'])
    axes[2].set_title('Accuracy Comparison')
    axes[2].set_ylabel('Accuracy')
    axes[2].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    # Champion model identification
    champion_model = performance_df.iloc[0]['Model']
    champion_f1 = performance_df.iloc[0]['F1-Score']
    print(f"\n🥇 CHAMPION MODEL: {champion_model} (F1-Score: {champion_f1:.4f})")

else:
    print("No performance data available for comparison")
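
When a metric such as precision or recall was not logged to MLflow, it can be recomputed from a model's test-set predictions. The sketch below shows one way to build the same five metrics with scikit-learn. The function name `classification_summary`, the 0.5 threshold, and the demo arrays are assumptions for illustration, not values from the trained models.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)


def classification_summary(y_true, y_prob, threshold=0.5):
    """Compute the comparison metrics from labels and predicted probabilities."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Convert probabilities to hard labels at the chosen threshold
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1-Score": f1_score(y_true, y_pred, zero_division=0),
        # ROC-AUC uses the probabilities, not the thresholded labels
        "ROC-AUC": roc_auc_score(y_true, y_prob),
    }


# Toy inputs for illustration only
y_true_demo = [0, 0, 1, 1, 1, 0]
y_prob_demo = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]
summary = classification_summary(y_true_demo, y_prob_demo)
print(summary)
```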

## 🔍 Model Interpretability with SHAP

In [None]:
# SHAP-based model interpretability
print("🔍 MODEL INTERPRETABILITY ANALYSIS:")

# Load champion model for SHAP analysis
try:
    # Attempt to load the XGBoost model (assumed champion; adjust the path if another model ranks first)
    model_path = "../models/xgboost_model.pkl"
    champion_model_obj = joblib.load(model_path)
    print(f"✅ Champion model loaded from {model_path}")
    
    # Load test data for SHAP analysis
    # This would typically come from the feature engineering stage
    print("SHAP analysis ready - load test data to continue...")
    
except FileNotFoundError:
    print("⚠️ Champion model not found")
    print("SHAP analysis will be implemented once models are trained")

# SHAP implementation will include:
# - Feature importance ranking
# - SHAP value distributions
# - Individual prediction explanations
# - Business impact interpretation
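
The feature importance ranking listed above is commonly taken as the mean absolute SHAP value per feature. A minimal sketch of that ranking step follows, assuming a 2-D array of SHAP values with shape (samples, features). The helper name, the toy array, and the feature names are hypothetical; in this notebook the array would plausibly come from `shap.TreeExplainer` applied to the loaded model and a test matrix that is not defined here.

```python
import numpy as np
import pandas as pd


def rank_features_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value (global importance)."""
    values = np.asarray(shap_values, dtype=float)
    if values.ndim != 2 or values.shape[1] != len(feature_names):
        raise ValueError("shap_values must have shape (n_samples, n_features)")
    importance = np.abs(values).mean(axis=0)
    return (
        pd.DataFrame({"feature": list(feature_names), "mean_abs_shap": importance})
        .sort_values("mean_abs_shap", ascending=False)
        .reset_index(drop=True)
    )


# In the notebook, the SHAP values might be obtained like this
# (X_test is an assumption; it is not created anywhere in this notebook):
#   explainer = shap.TreeExplainer(champion_model_obj)
#   shap_values = explainer.shap_values(X_test)

# Toy SHAP values and hypothetical feature names, for illustration only
toy_shap = np.array([[0.2, -0.5, 0.0],
                     [-0.4, 0.3, 0.1]])
ranking = rank_features_by_mean_abs_shap(
    toy_shap, ["lead_time", "adr", "total_nights"]
)
print(ranking)
```

Taking the absolute value before averaging matters: positive and negative contributions would otherwise cancel and understate a feature's influence.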

## 💼 Business Impact Analysis

In [None]:
# Business impact analysis
print("💼 BUSINESS IMPACT ANALYSIS:")

# Revenue impact simulation based on model predictions
# This will include:
# - Cancellation cost reduction estimates
# - Overbooking optimization potential
# - Market segment insights for Sri Lankan tourism
# - Seasonal booking strategies

print("Business impact analysis to be implemented...")
print("Focus areas:")
print("- Revenue loss prevention through early cancellation detection")
print("- Overbooking optimization strategies")
print("- Customer segmentation for targeted marketing")
print("- Seasonal demand forecasting for Sri Lankan market")
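
One simple way to start the revenue impact simulation is to attach a cost to each type of prediction error. The sketch below assumes two hypothetical unit costs: a missed cancellation (false negative) leaves a room unsold, and a false alarm (false positive) triggers an unnecessary intervention. The function name and all cost figures are placeholders, not estimates for any real hotel.

```python
import numpy as np


def expected_cancellation_cost(y_true, y_pred,
                               cost_missed_cancellation,
                               cost_false_alarm):
    """Total misclassification cost under assumed per-error costs."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    # Cancelled bookings the model failed to flag
    false_negatives = int(np.sum(y_true & ~y_pred))
    # Bookings flagged as cancellations that were actually honoured
    false_positives = int(np.sum(~y_true & y_pred))
    total_cost = (false_negatives * cost_missed_cancellation
                  + false_positives * cost_false_alarm)
    return {
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "total_cost": total_cost,
    }


# Toy labels and placeholder costs, for illustration only
cost_result = expected_cancellation_cost(
    y_true=[1, 0, 1, 1, 0, 0],
    y_pred=[1, 0, 0, 1, 1, 0],
    cost_missed_cancellation=100.0,
    cost_false_alarm=20.0,
)
print(cost_result)
```

Comparing this total across models, or across decision thresholds for one model, gives a business-oriented ranking that can differ from the F1-based one.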

## 📊 Statistical Significance Testing

In [None]:
# Statistical significance testing for model comparison
print("📊 STATISTICAL SIGNIFICANCE TESTING:")

# Implement statistical tests for academic rigor:
# - Wilcoxon signed-rank test for paired model comparison
# - Friedman test for multiple model comparison
# - Effect size calculations

print("Statistical testing framework to be implemented...")
print("Tests to include:")
print("- Wilcoxon signed-rank test for pairwise model comparison")
print("- Friedman test for multiple model ranking")
print("- Effect size calculations for practical significance")
print("- Confidence intervals for performance metrics")
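
The Friedman and Wilcoxon tests named above both require paired samples, for example per-fold scores computed on identical cross-validation folds. The following sketch runs a Friedman omnibus test and then Bonferroni-corrected pairwise Wilcoxon signed-rank tests. The function name `compare_models`, the choice of Bonferroni correction, and the synthetic scores are assumptions; the scores are randomly generated placeholders, not results from the trained models.

```python
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon


def compare_models(fold_scores):
    """Friedman test across all models, then corrected pairwise Wilcoxon tests.

    fold_scores maps a model name to its per-fold scores; every array must be
    computed on the same folds so that the samples are paired.
    """
    names = list(fold_scores)
    _, friedman_p = friedmanchisquare(*(fold_scores[name] for name in names))

    pairs = list(combinations(names, 2))
    pairwise_p = {}
    for a, b in pairs:
        _, p = wilcoxon(fold_scores[a], fold_scores[b])
        # Bonferroni correction: scale by the number of comparisons, cap at 1
        pairwise_p[(a, b)] = min(1.0, float(p) * len(pairs))
    return float(friedman_p), pairwise_p


# Synthetic placeholder scores (10 folds, 4 hypothetical models), NOT real results
rng = np.random.default_rng(0)
placeholder_scores = {
    f"model_{i}": 0.80 + 0.01 * i + 0.02 * rng.standard_normal(10)
    for i in range(4)
}
friedman_p, pairwise_p = compare_models(placeholder_scores)
print(f"Friedman p-value: {friedman_p:.4f}")
for pair, p_value in pairwise_p.items():
    print(pair, round(p_value, 4))
```

The pairwise tests are usually interpreted only when the Friedman test rejects the hypothesis that all models perform equally. With few folds the tests have low power, so effect sizes and confidence intervals remain necessary.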

## 📝 Academic Documentation Summary

**Model Evaluation Completion Checklist:**

- ✅ Environment setup and model loading
- ✅ Performance metrics comparison framework
- ✅ Champion model identification process
- □ SHAP-based feature importance analysis
- □ Model interpretability and business insights
- □ Statistical significance testing
- □ Business impact quantification
- □ Sri Lankan tourism market recommendations
- □ Academic methodology validation
- □ Final research conclusions and future work

**Expected Academic Outcomes:**
- Rigorous model comparison with statistical validation
- Comprehensive interpretability analysis using SHAP
- Business-relevant insights for hospitality industry
- Academic-standard documentation for NIB 7072 coursework
- Actionable recommendations for Sri Lankan tourism market