# 🎯 Model Validation Analysis - NHL Elo Prediction System

**Comprehensive model validation on complete 2024-25 NHL season data**

**Scope:** Validation on 1,312 games, covering prediction accuracy and probability calibration

**Business Goal:** Go/no-go deployment decision based on model performance

---

## 📋 Analysis Overview

1. **Setup & Configuration** - Libraries, paths, constants
2. **Enhanced Data Discovery** - Robust CSV loading, fallbacks
3. **Prediction Accuracy Analysis** - Overall accuracy, classification metrics
4. **Calibration Analysis** - Probability reliability curves, Brier score
5. **Classification Performance** - Confusion matrix, detailed breakdown
6. **Temporal Analysis** - Performance stability over season
7. **Market Comparison** - Model vs betting odds (where available)
8. **Executive Summary** - Go/no-go deployment recommendations
9. **Export & Visualization** - HTML charts, summary export

---

## 1. 🚀 Setup & Configuration

In [None]:
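import os
import re
import logging
from datetime import datetime
from pathlib import Path

# Third-party libraries used by the cells below
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
from IPython.display import display
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, classification_report, confusion_matrix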

# Use environment variables for paths
if 'HOCKEY_LOGS_DIR' in os.environ:
    logs_dir = Path(os.environ['HOCKEY_LOGS_DIR'])
    project_root = Path(os.environ.get('HOCKEY_PROJECT_ROOT', '.'))
else:
    # Fallback for manual execution
    project_root = Path('../../')
    logs_dir = project_root / 'logs'

# Ensure logs directory exists
logs_dir.mkdir(parents=True, exist_ok=True)

# Configure logging
log_file = logs_dir / 'model_validation_analysis.log'
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file, encoding='utf-8'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Project paths
logger.info(f"📁 Project root: {project_root.absolute()}")
logger.info(f"📝 Log file: {log_file.absolute()}")

In [None]:
# Logging is already configured in the first cell; calling logging.basicConfig()
# again would have no effect because the root logger already has handlers.

# Project paths
PROJECT_ROOT = Path.cwd().parent.parent if 'notebooks' in str(Path.cwd()) else Path.cwd()
DATA_PATH = PROJECT_ROOT / 'models' / 'experiments'
CHARTS_PATH = DATA_PATH / 'charts'
LOGS_PATH = PROJECT_ROOT / 'logs'

# Create directories if they don't exist
CHARTS_PATH.mkdir(parents=True, exist_ok=True)
LOGS_PATH.mkdir(parents=True, exist_ok=True)

print(f"📁 Data path: {DATA_PATH}")
print(f"📊 Charts export: {CHARTS_PATH}")
print(f"📝 Logs path: {LOGS_PATH}")

In [None]:
# Analysis configuration
ANALYSIS_CONFIG = {
    'target_accuracy': 0.55,  # Minimum acceptable accuracy
    'training_benchmark': 0.588,  # Training accuracy benchmark
    'confidence_thresholds': [0.5, 0.6, 0.7, 0.8, 0.9],
    'max_brier_score': 0.25,  # Maximum acceptable Brier score
    'chart_template': 'plotly_white',
    'color_palette': ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
}

# Set Plotly template
pio.templates.default = ANALYSIS_CONFIG['chart_template']

logger.info("🚀 Model Validation Analysis initialized")
logger.info(f"📁 Data path: {DATA_PATH}")
logger.info(f"📊 Charts export: {CHARTS_PATH}")

print("✅ Configuration loaded successfully")
print(f"🎯 Target accuracy: {ANALYSIS_CONFIG['target_accuracy']:.1%}")
print(f"📊 Training benchmark: {ANALYSIS_CONFIG['training_benchmark']:.1%}")

## 2. 📂 Enhanced Data Discovery & Loading
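
The loader in this section picks the newest export by comparing the `YYYYMMDD_HHMMSS` stamp embedded in each file name. Because the stamp is fixed-width and zero-padded, plain string comparison orders files chronologically. A minimal standalone sketch of that selection, using made-up file names:

```python
import re
from pathlib import Path

# Hypothetical file names that follow the notebook's naming pattern.
candidates = [
    Path("model_validation_complete_2025_20250401_093000.csv"),
    Path("model_validation_complete_2025_20250630_181500.csv"),
    Path("model_validation_complete_2025_20250630_070000.csv"),
]


def extract_timestamp(path: Path) -> str:
    """Return the YYYYMMDD_HHMMSS stamp in the file name, or a sentinel that sorts first."""
    match = re.search(r"(\d{8}_\d{6})", path.name)
    return match.group(1) if match else "00000000_000000"


# max() compares the stamps as strings, which matches chronological order here.
latest = max(candidates, key=extract_timestamp)
print(latest.name)
```

Files without a recognizable stamp get the sentinel value, so they lose to any properly named export instead of raising an error.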

In [None]:
# =============================================================================
# 2. ENHANCED DATA DISCOVERY & LOADING
# =============================================================================

def find_latest_validation_file(data_path: Path) -> Path:
    """Find the latest model validation data file"""
    pattern = "model_validation_complete_2025_*.csv"
    files = list(data_path.glob(pattern))
    
    if not files:
        raise FileNotFoundError(f"No validation data files found matching {pattern} in {data_path}")
    
    # Sort by timestamp in filename
    def extract_timestamp(filepath):
        match = re.search(r'(\d{8}_\d{6})', str(filepath))
        return match.group(1) if match else '00000000_000000'
    
    latest_file = max(files, key=extract_timestamp)
    logger.info(f"📂 Using validation data: {latest_file.name}")
    return latest_file

def load_validation_data(file_path: Path) -> pd.DataFrame:
    """Load and validate model validation data with robust error handling"""
    try:
        # Try UTF-8 first, then fall back to legacy encodings. 'iso-8859-1' accepts
        # any byte sequence, so it goes last as the catch-all.
        encodings = ['utf-8', 'utf-8-sig', 'cp1252', 'iso-8859-1']
        df = None
        
        for encoding in encodings:
            try:
                df = pd.read_csv(file_path, encoding=encoding)
                logger.info(f"✅ Data loaded successfully with {encoding} encoding")
                break
            except UnicodeDecodeError:
                continue
        
        if df is None:
            raise ValueError("Could not load data with any supported encoding")
        
        # Data validation
        logger.info(f"📊 Dataset shape: {df.shape}")
        
        # Required columns validation
        required_cols = [
            'game_id', 'date', 'home_team_name', 'away_team_name',
            'actual_winner', 'predicted_winner', 'home_win_probability',
            'away_win_probability', 'prediction_correct', 'model_confidence'
        ]
        
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")
        
        # Convert data types
        df['date'] = pd.to_datetime(df['date'])
        df['prediction_correct'] = df['prediction_correct'].astype(bool)
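        # df.get() would return a plain bool (no .astype) when the column is absent
        if 'has_odds_data' in df.columns:
            df['has_odds_data'] = df['has_odds_data'].astype(bool)
        else:
            df['has_odds_data'] = False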
        
        # Data quality checks
        logger.info(f"🎯 Games with predictions: {len(df)}")
        logger.info(f"🎲 Games with odds data: {df['has_odds_data'].sum()}")
        logger.info(f"📅 Date range: {df['date'].min()} to {df['date'].max()}")
        
        return df
        
    except Exception as e:
        logger.error(f"❌ Error loading validation data: {e}")
        raise

print("✅ Data loading functions defined")

In [None]:
# Load the data
try:
    validation_file = find_latest_validation_file(DATA_PATH)
    df = load_validation_data(validation_file)
    logger.info("✅ Model validation data loaded successfully")
    
    print("✅ Data loaded successfully!")
    print(f"📊 Dataset shape: {df.shape}")
    print(f"📅 Date range: {df['date'].min().strftime('%Y-%m-%d')} to {df['date'].max().strftime('%Y-%m-%d')}")
    print(f"🎯 Games with predictions: {len(df)}")
    print(f"🎲 Games with odds data: {df['has_odds_data'].sum()}")
    
except Exception as e:
    logger.error(f"❌ Failed to load validation data: {e}")
    print(f"❌ Error: {e}")
    raise

In [None]:
# Data overview
print("📊 Data Overview:")
print("=" * 50)
print(f"Columns: {list(df.columns)}")
print("\n🔍 Sample data:")
display(df.head())

print("\n📈 Basic statistics:")
print(f"Overall accuracy: {df['prediction_correct'].mean():.1%}")
print(f"Average model confidence: {df['model_confidence'].mean():.3f}")
print(f"Home win rate (actual): {(df['actual_winner'] == df['home_team_name']).mean():.1%}")
print(f"Home win rate (predicted): {(df['predicted_winner'] == df['home_team_name']).mean():.1%}")

## 3. 🎯 Prediction Accuracy Analysis
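
The confidence breakdown in this section is cumulative: each threshold keeps every game at or above it, so the sample shrinks as the threshold rises and the highest thresholds can rest on very few games. A small pure-Python sketch with invented data illustrates the pattern (the helper name `accuracy_at_or_above` is hypothetical):

```python
# Toy data: (model_confidence, prediction_correct) pairs, invented for illustration.
games = [
    (0.52, False), (0.55, True), (0.61, True), (0.64, False),
    (0.71, True), (0.78, True), (0.83, True), (0.58, False),
]
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]


def accuracy_at_or_above(games, threshold):
    """Accuracy and sample size over games whose confidence is >= threshold."""
    selected = [correct for confidence, correct in games if confidence >= threshold]
    if not selected:
        return None, 0  # no games this confident; accuracy is undefined
    return sum(selected) / len(selected), len(selected)


summary = {t: accuracy_at_or_above(games, t) for t in thresholds}
for t, (acc, n) in summary.items():
    shown = "n/a" if acc is None else f"{acc:.1%}"
    print(f">= {t:.1f}: {shown} over {n} games")
```

Reporting the game count next to each accuracy, as the notebook's chart does with its secondary axis, keeps small-sample thresholds from being over-read.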

In [None]:
# =============================================================================
# 3. PREDICTION ACCURACY ANALYSIS
# =============================================================================

def analyze_prediction_accuracy(df: pd.DataFrame) -> dict:
    """Comprehensive prediction accuracy analysis"""
    logger.info("🎯 Starting prediction accuracy analysis...")
    
    results = {}
    
    # Overall accuracy
    overall_accuracy = df['prediction_correct'].mean()
    results['overall_accuracy'] = overall_accuracy
    
    # Home vs Away accuracy
    home_predictions = df[df['predicted_winner'] == df['home_team_name']]
    away_predictions = df[df['predicted_winner'] == df['away_team_name']]
    
    home_accuracy = home_predictions['prediction_correct'].mean() if len(home_predictions) > 0 else 0
    away_accuracy = away_predictions['prediction_correct'].mean() if len(away_predictions) > 0 else 0
    
    results['home_accuracy'] = home_accuracy
    results['away_accuracy'] = away_accuracy
    results['home_predictions'] = len(home_predictions)
    results['away_predictions'] = len(away_predictions)
    
    # Accuracy by confidence levels
    confidence_analysis = {}
    for threshold in ANALYSIS_CONFIG['confidence_thresholds']:
        high_conf_games = df[df['model_confidence'] >= threshold]
        if len(high_conf_games) > 0:
            conf_accuracy = high_conf_games['prediction_correct'].mean()
            confidence_analysis[threshold] = {
                'accuracy': conf_accuracy,
                'game_count': len(high_conf_games)
            }
    
    results['confidence_analysis'] = confidence_analysis
    
    logger.info(f"📊 Overall accuracy: {overall_accuracy:.3f}")
    logger.info(f"🏠 Home predictions accuracy: {home_accuracy:.3f} ({len(home_predictions)} games)")
    logger.info(f"✈️ Away predictions accuracy: {away_accuracy:.3f} ({len(away_predictions)} games)")
    
    return results

# Perform accuracy analysis
accuracy_results = analyze_prediction_accuracy(df)

print("✅ Accuracy analysis completed")
print(f"📊 Overall accuracy: {accuracy_results['overall_accuracy']:.1%}")
print(f"🏠 Home predictions: {accuracy_results['home_accuracy']:.1%} ({accuracy_results['home_predictions']} games)")
print(f"✈️ Away predictions: {accuracy_results['away_accuracy']:.1%} ({accuracy_results['away_predictions']} games)")

In [None]:
# Create accuracy visualizations
def create_accuracy_charts(df: pd.DataFrame, results: dict):
    """Create comprehensive accuracy visualizations"""
    
    # 1. Overall accuracy vs benchmarks
    fig_overview = go.Figure()
    
    accuracies = [
        results['overall_accuracy'],
        ANALYSIS_CONFIG['target_accuracy'],
        ANALYSIS_CONFIG['training_benchmark']
    ]
    labels = ['Model Performance', 'Target Threshold', 'Training Benchmark']
    colors = ['#2E86AB', '#A23B72', '#F18F01']
    
    fig_overview.add_trace(go.Bar(
        x=labels,
        y=accuracies,
        marker_color=colors,
        text=[f"{acc:.1%}" for acc in accuracies],
        textposition='auto'
    ))
    
    fig_overview.update_layout(
        title="🎯 Model Accuracy vs Benchmarks",
        yaxis_title="Accuracy",
        yaxis=dict(range=[0, 1], tickformat='.0%'),
        height=400
    )
    
    # 2. Home vs Away accuracy
    fig_home_away = go.Figure()
    
    fig_home_away.add_trace(go.Bar(
        name='Home Predictions',
        x=['Accuracy'],
        y=[results['home_accuracy']],
        marker_color='#1f77b4',
        text=f"{results['home_accuracy']:.1%}",
        textposition='auto'
    ))
    
    fig_home_away.add_trace(go.Bar(
        name='Away Predictions', 
        x=['Accuracy'],
        y=[results['away_accuracy']],
        marker_color='#ff7f0e',
        text=f"{results['away_accuracy']:.1%}",
        textposition='auto'
    ))
    
    fig_home_away.update_layout(
        title="🏠 Home vs Away Prediction Accuracy",
        yaxis_title="Accuracy",
        yaxis=dict(range=[0, 1], tickformat='.0%'),
        height=400
    )
    
    # 3. Confidence-based accuracy
    conf_thresholds = list(results['confidence_analysis'].keys())
    conf_accuracies = [results['confidence_analysis'][t]['accuracy'] for t in conf_thresholds]
    conf_counts = [results['confidence_analysis'][t]['game_count'] for t in conf_thresholds]
    
    fig_conf = make_subplots(specs=[[{"secondary_y": True}]])
    
    fig_conf.add_trace(
        go.Scatter(
            x=conf_thresholds,
            y=conf_accuracies,
            mode='lines+markers',
            name='Accuracy',
            line=dict(color='#2E86AB', width=3),
            marker=dict(size=8)
        ),
        secondary_y=False
    )
    
    fig_conf.add_trace(
        go.Bar(
            x=conf_thresholds,
            y=conf_counts,
            name='Game Count',
            opacity=0.3,
            marker_color='#F18F01'
        ),
        secondary_y=True
    )
    
    fig_conf.update_layout(title="📈 Accuracy by Model Confidence")
    fig_conf.update_xaxes(title_text="Confidence Threshold")
    fig_conf.update_yaxes(title_text="Accuracy", secondary_y=False, tickformat='.0%')
    fig_conf.update_yaxes(title_text="Number of Games", secondary_y=True)
    
    return fig_overview, fig_home_away, fig_conf

accuracy_charts = create_accuracy_charts(df, accuracy_results)

print("📊 Displaying accuracy charts...")
for i, chart in enumerate(accuracy_charts, 1):
    chart.show()
    print(f"Chart {i} displayed")

## 4. 🎲 Calibration Analysis
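
For reference, the Brier score is the mean squared gap between the predicted home-win probability and the 0/1 outcome. A constant 0.5 forecast always scores exactly 0.25, so the `max_brier_score` of 0.25 in the configuration amounts to requiring that the model do no worse than a coin flip. The sketch below, on invented data, computes the score by hand and also a count-weighted reliability term over equal-width bins. The helper names are illustrative, and the equal-width binning is an assumption that differs from the quantile bins used in the cell below.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def weighted_reliability(y_true, y_prob, n_bins=10):
    """Count-weighted squared gap between mean prediction and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((y, p))
    total = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        total += len(members) * (predicted - observed) ** 2
    return total / len(y_true)


# Toy outcomes (1 = home win) and home-win probabilities, invented for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.6, 0.6, 0.6, 0.6, 0.4, 0.4, 0.4, 0.4]

coin_flip = brier_score(y_true, [0.5] * len(y_true))
model = brier_score(y_true, y_prob)
reliability = weighted_reliability(y_true, y_prob)
print(f"coin flip: {coin_flip:.4f}, model: {model:.4f}, reliability: {reliability:.4f}")
```

Weighting each bin by its game count keeps sparsely populated bins from dominating the reliability figure.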

In [None]:
# =============================================================================
# 4. CALIBRATION ANALYSIS
# =============================================================================

def analyze_model_calibration(df: pd.DataFrame) -> dict:
    """Analyze probability calibration quality"""
    logger.info("🎲 Starting calibration analysis...")
    
    results = {}
    
    # Prepare data for calibration
    y_true = (df['actual_winner'] == df['home_team_name']).astype(int)
    y_prob = df['home_win_probability'].values
    
    # Brier score
    brier_score = brier_score_loss(y_true, y_prob)
    results['brier_score'] = brier_score
    
    # Calibration curve
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_true, y_prob, n_bins=10, strategy='quantile'
    )
    
    results['calibration_curve'] = {
        'fraction_of_positives': fraction_of_positives,
        'mean_predicted_value': mean_predicted_value
    }
    
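    # Reliability metrics (unweighted averages over the calibration bins, so they
    # approximate rather than reproduce the terms of the Brier decomposition)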
    reliability = np.mean((fraction_of_positives - mean_predicted_value) ** 2)
    resolution = np.mean((fraction_of_positives - np.mean(y_true)) ** 2)
    uncertainty = np.mean(y_true) * (1 - np.mean(y_true))
    
    results['reliability'] = reliability
    results['resolution'] = resolution
    results['uncertainty'] = uncertainty
    
    # Calibration by probability bins
    bins = np.linspace(0, 1, 11)
    bin_centers = (bins[:-1] + bins[1:]) / 2
    bin_counts = []
    bin_accuracies = []
    
    for i in range(len(bins)-1):
        mask = (y_prob >= bins[i]) & (y_prob < bins[i+1])
        if i == len(bins)-2:  # Include 1.0 in the last bin
            mask = (y_prob >= bins[i]) & (y_prob <= bins[i+1])
        
        if np.sum(mask) > 0:
            bin_counts.append(np.sum(mask))
            bin_accuracies.append(np.mean(y_true[mask]))
        else:
            # Empty bin: record NaN so the chart shows a gap instead of a false 0% rate
            bin_counts.append(0)
            bin_accuracies.append(np.nan)
    
    results['probability_bins'] = {
        'bin_centers': bin_centers,
        'bin_counts': bin_counts,
        'bin_accuracies': bin_accuracies
    }
    
    logger.info(f"🎲 Brier Score: {brier_score:.4f}")
    logger.info(f"📊 Reliability: {reliability:.4f}")
    
    return results

calibration_results = analyze_model_calibration(df)

print("✅ Calibration analysis completed")
print(f"🎲 Brier Score: {calibration_results['brier_score']:.4f}")
print(f"📊 Reliability: {calibration_results['reliability']:.4f}")
print(f"🎯 Target Brier Score: ≤ {ANALYSIS_CONFIG['max_brier_score']:.3f}")
print(f"✅ Meets target: {'YES' if calibration_results['brier_score'] <= ANALYSIS_CONFIG['max_brier_score'] else 'NO'}")

In [None]:
# Create calibration visualizations
def create_calibration_charts(results: dict):
    """Create calibration analysis charts"""
    
    # 1. Calibration curve
    fig_cal = go.Figure()
    
    # Perfect calibration line
    fig_cal.add_trace(go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode='lines',
        name='Perfect Calibration',
        line=dict(color='gray', dash='dash', width=2)
    ))
    
    # Actual calibration
    cal_data = results['calibration_curve']
    fig_cal.add_trace(go.Scatter(
        x=cal_data['mean_predicted_value'],
        y=cal_data['fraction_of_positives'],
        mode='lines+markers',
        name='Model Calibration',
        line=dict(color='#2E86AB', width=3),
        marker=dict(size=8)
    ))
    
    fig_cal.update_layout(
        title="🎯 Probability Calibration Curve",
        xaxis_title="Mean Predicted Probability",
        yaxis_title="Fraction of Positives",
        xaxis=dict(range=[0, 1]),
        yaxis=dict(range=[0, 1])
    )
    
    # 2. Reliability diagram
    bin_data = results['probability_bins']
    
    fig_rel = make_subplots(specs=[[{"secondary_y": True}]])
    
    fig_rel.add_trace(
        go.Bar(
            x=bin_data['bin_centers'],
            y=bin_data['bin_counts'],
            name='Frequency',
            opacity=0.3,
            marker_color='#F18F01',
            width=0.08
        ),
        secondary_y=True
    )
    
    fig_rel.add_trace(
        go.Scatter(
            x=bin_data['bin_centers'],
            y=bin_data['bin_accuracies'],
            mode='lines+markers',
            name='Observed Frequency',
            line=dict(color='#2E86AB', width=3),
            marker=dict(size=10)
        ),
        secondary_y=False
    )
    
    # Perfect calibration line
    fig_rel.add_trace(
        go.Scatter(
            x=[0, 1],
            y=[0, 1],
            mode='lines',
            name='Perfect Calibration',
            line=dict(color='gray', dash='dash', width=2)
        ),
        secondary_y=False
    )
    
    fig_rel.update_layout(title="📊 Reliability Diagram")
    fig_rel.update_xaxes(title_text="Predicted Probability")
    fig_rel.update_yaxes(title_text="Observed Frequency", secondary_y=False)
    fig_rel.update_yaxes(title_text="Count", secondary_y=True)
    
    return fig_cal, fig_rel

calibration_charts = create_calibration_charts(calibration_results)

print("📊 Displaying calibration charts...")
for i, chart in enumerate(calibration_charts, 1):
    chart.show()
    print(f"Calibration chart {i} displayed")

## 5. 🔍 Classification Performance Analysis
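
The confusion matrix in this section is built over team names, so it can have up to one row per team. A more compact view, offered here only as an optional sketch and not as part of the notebook's pipeline, reduces each game to home versus away and counts the four outcomes. The rows are invented and the `side` helper is hypothetical:

```python
from collections import Counter

# Toy rows: (home_team, away_team, actual_winner, predicted_winner), invented.
games = [
    ("TOR", "MTL", "TOR", "TOR"),
    ("BOS", "NYR", "NYR", "BOS"),
    ("EDM", "CGY", "EDM", "CGY"),
    ("VAN", "SEA", "SEA", "SEA"),
    ("DAL", "COL", "DAL", "DAL"),
]


def side(winner, home_team):
    """Map a team name to 'home' or 'away' relative to the game's home team."""
    return "home" if winner == home_team else "away"


counts = Counter(
    (side(actual, home), side(predicted, home))
    for home, _away, actual, predicted in games
)

# Rows are actual outcomes, columns are predictions.
labels = ["home", "away"]
matrix = [[counts[(actual, predicted)] for predicted in labels] for actual in labels]
for label, row in zip(labels, matrix):
    print(f"actual {label}: {row}")
```

The 2x2 form makes any bias toward predicting the home side visible at a glance, which a per-team matrix tends to obscure.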

In [None]:
# =============================================================================
# 5. CONFUSION MATRIX & CLASSIFICATION PERFORMANCE
# =============================================================================

def analyze_classification_performance(df: pd.DataFrame) -> dict:
    """Detailed classification performance analysis"""
    logger.info("🔍 Starting classification performance analysis...")
    
    results = {}
    
    # Prepare labels
    y_true = df['actual_winner'].values
    y_pred = df['predicted_winner'].values
    
    # Get unique teams for confusion matrix
    unique_winners = sorted(list(set(list(y_true) + list(y_pred))))
    
    # Confusion matrix over every team that appears as an actual or predicted winner
    cm = confusion_matrix(y_true, y_pred, labels=unique_winners)
    results['confusion_matrix'] = cm
    results['team_labels'] = unique_winners
    
    # Classification report
    clf_report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    results['classification_report'] = clf_report
    
    # Team-specific performance
    team_performance = {}
    for team in unique_winners:
        team_games = df[(df['home_team_name'] == team) | (df['away_team_name'] == team)]
        if len(team_games) > 0:
            team_correct = team_games['prediction_correct'].sum()
            team_total = len(team_games)
            team_performance[team] = {
                'accuracy': team_correct / team_total,
                'games': team_total,
                'correct': team_correct
            }
    
    results['team_performance'] = team_performance
    
    logger.info(f"📊 Macro avg F1-score: {clf_report['macro avg']['f1-score']:.3f}")
    logger.info(f"📊 Weighted avg F1-score: {clf_report['weighted avg']['f1-score']:.3f}")
    
    return results

classification_results = analyze_classification_performance(df)

print("✅ Classification analysis completed")
print(f"📊 Total teams analyzed: {len(classification_results['team_labels'])}")
print(f"📊 Macro avg F1-score: {classification_results['classification_report']['macro avg']['f1-score']:.3f}")
print(f"📊 Weighted avg F1-score: {classification_results['classification_report']['weighted avg']['f1-score']:.3f}")

## 6. 📅 Temporal Performance Analysis
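
The rolling metric in this section averages the most recent 30 outcomes and starts reporting once 10 games are available (`min_periods=10`). The same trailing-window logic, written out in plain Python on a short invented sequence with a smaller window so the values are easy to check by hand:

```python
def rolling_accuracy(outcomes, window, min_periods):
    """Trailing-window mean of 0/1 outcomes; None until min_periods games exist."""
    values = []
    for i in range(len(outcomes)):
        recent = outcomes[max(0, i - window + 1): i + 1]
        if len(recent) >= min_periods:
            values.append(sum(recent) / len(recent))
        else:
            values.append(None)  # too few games so far to report a value
    return values


# Toy chronological outcomes (1 = correct prediction), invented for illustration.
outcomes = [1, 0, 1, 1, 0, 1]
result = rolling_accuracy(outcomes, window=4, min_periods=2)
print(result)
```

Early values are computed on fewer games than the full window, so they swing more than later ones; that is worth keeping in mind when reading the start of the rolling chart.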

In [None]:
# =============================================================================
# 6. TEMPORAL PERFORMANCE ANALYSIS
# =============================================================================

def analyze_temporal_performance(df: pd.DataFrame) -> dict:
    """Analyze model performance over time"""
    logger.info("📅 Starting temporal performance analysis...")
    
    results = {}
    
    # Sort by date
    df_sorted = df.sort_values('date')
    
    # Monthly performance
    df_sorted['month'] = df_sorted['date'].dt.to_period('M')
    monthly_performance = df_sorted.groupby('month').agg({
        'prediction_correct': ['mean', 'count'],
        'model_confidence': 'mean'
    }).round(3)
    
    monthly_performance.columns = ['accuracy', 'games', 'avg_confidence']
    monthly_performance = monthly_performance.reset_index()
    monthly_performance['month'] = monthly_performance['month'].astype(str)
    
    results['monthly_performance'] = monthly_performance
    
    # Rolling 30-game accuracy
    window_size = 30
    rolling_accuracy = df_sorted['prediction_correct'].rolling(window=window_size, min_periods=10).mean()
    
    results['rolling_accuracy'] = {
        'values': rolling_accuracy.values,
        'dates': df_sorted['date'].values,
        'window_size': window_size
    }
    
    # Early vs Late season comparison  
    total_games = len(df_sorted)
    early_season = df_sorted.iloc[:total_games//2]
    late_season = df_sorted.iloc[total_games//2:]
    
    results['seasonal_comparison'] = {
        'early_accuracy': early_season['prediction_correct'].mean(),
        'late_accuracy': late_season['prediction_correct'].mean(),
        'early_games': len(early_season),
        'late_games': len(late_season)
    }
    
    logger.info(f"📊 Early season accuracy: {results['seasonal_comparison']['early_accuracy']:.3f}")
    logger.info(f"📊 Late season accuracy: {results['seasonal_comparison']['late_accuracy']:.3f}")
    
    return results

temporal_results = analyze_temporal_performance(df)

print("✅ Temporal analysis completed")
seasonal = temporal_results['seasonal_comparison']
print(f"📊 Early season ({seasonal['early_games']} games): {seasonal['early_accuracy']:.1%}")
print(f"📊 Late season ({seasonal['late_games']} games): {seasonal['late_accuracy']:.1%}")
print(f"📊 Monthly trends: {len(temporal_results['monthly_performance'])} months analyzed")

In [None]:
# Create temporal visualizations
def create_temporal_charts(results: dict):
    """Create temporal analysis charts"""
    
    # 1. Monthly performance
    monthly = results['monthly_performance']
    
    fig_monthly = make_subplots(specs=[[{"secondary_y": True}]])
    
    fig_monthly.add_trace(
        go.Scatter(
            x=monthly['month'],
            y=monthly['accuracy'],
            mode='lines+markers',
            name='Monthly Accuracy',
            line=dict(color='#2E86AB', width=3),
            marker=dict(size=8)
        ),
        secondary_y=False
    )
    
    fig_monthly.add_trace(
        go.Bar(
            x=monthly['month'],
            y=monthly['games'],
            name='Games per Month',
            opacity=0.3,
            marker_color='#F18F01'
        ),
        secondary_y=True
    )
    
    fig_monthly.update_layout(title="📅 Monthly Performance Trends")
    fig_monthly.update_xaxes(title_text="Month")
    fig_monthly.update_yaxes(title_text="Accuracy", secondary_y=False, tickformat='.0%')
    fig_monthly.update_yaxes(title_text="Games Count", secondary_y=True)
    
    # 2. Rolling accuracy
    rolling = results['rolling_accuracy']
    
    fig_rolling = go.Figure()
    
    fig_rolling.add_trace(go.Scatter(
        x=rolling['dates'],
        y=rolling['values'],
        mode='lines',
        name=f'{rolling["window_size"]}-Game Rolling Accuracy',
        line=dict(color='#2E86AB', width=2)
    ))
    
    # Add target line
    fig_rolling.add_trace(go.Scatter(
        x=[rolling['dates'][0], rolling['dates'][-1]],
        y=[ANALYSIS_CONFIG['target_accuracy'], ANALYSIS_CONFIG['target_accuracy']],
        mode='lines',
        name='Target Accuracy',
        line=dict(color='red', dash='dash')
    ))
    
    fig_rolling.update_layout(
        title=f"🔄 Rolling {rolling['window_size']}-Game Accuracy",
        xaxis_title="Date",
        yaxis_title="Accuracy",
        yaxis=dict(tickformat='.0%')
    )
    
    return fig_monthly, fig_rolling

temporal_charts = create_temporal_charts(temporal_results)

print("📊 Displaying temporal charts...")
for i, chart in enumerate(temporal_charts, 1):
    chart.show()
    print(f"Temporal chart {i} displayed")

## 7. 💰 Market Comparison Analysis
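
One caveat for the Brier comparison in this section: probabilities taken directly from bookmaker odds usually sum to more than 1 because of the bookmaker's margin (the overround). Whether `home_implied_prob` and `away_implied_prob` are already normalized is not stated in this notebook, so the following is a sketch under the assumption that raw decimal odds are available. The function name and the odds values are invented, and proportional normalization is only one of several ways to remove the margin.

```python
def implied_probabilities(home_decimal_odds, away_decimal_odds):
    """Convert decimal odds to probabilities that sum to 1 by proportional scaling."""
    raw_home = 1.0 / home_decimal_odds
    raw_away = 1.0 / away_decimal_odds
    overround = raw_home + raw_away  # exceeds 1.0 when the book includes a margin
    return raw_home / overround, raw_away / overround, overround


# Hypothetical two-way market: home at 1.80, away at 2.10.
home_prob, away_prob, overround = implied_probabilities(1.80, 2.10)
print(f"overround={overround:.4f} home={home_prob:.4f} away={away_prob:.4f}")
```

If the market probabilities are left unnormalized, they are systematically too high on both sides, which penalizes the market's Brier score for a reason unrelated to forecasting skill.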

In [None]:
# =============================================================================
# 7. MARKET COMPARISON ANALYSIS
# =============================================================================

def analyze_market_comparison(df: pd.DataFrame) -> dict:
    """Compare model performance vs betting market"""
    logger.info("💰 Starting market comparison analysis...")
    
    results = {}
    
    # Filter games with odds data
    odds_df = df[df['has_odds_data'] == True].copy()
    
    if len(odds_df) == 0:
        logger.warning("⚠️ No odds data available for market comparison")
        results['has_data'] = False
        return results
    
    results['has_data'] = True
    results['odds_coverage'] = len(odds_df) / len(df)
    
    # Market accuracy (using implied probabilities)
    market_predictions = []
    for _, row in odds_df.iterrows():
        if pd.notna(row.get('home_implied_prob')) and pd.notna(row.get('away_implied_prob')):
            if row['home_implied_prob'] > row['away_implied_prob']:
                market_predictions.append(row['home_team_name'])
            else:
                market_predictions.append(row['away_team_name'])
        else:
            market_predictions.append(None)  # No prediction if missing odds
    
    odds_df['market_prediction'] = market_predictions
    
    # Only analyze games with valid market predictions
    valid_market_df = odds_df.dropna(subset=['market_prediction']).copy()  # copy so the column assignment below is safe
    
    if len(valid_market_df) == 0:
        logger.warning("⚠️ No valid market predictions available")
        results['has_data'] = False
        return results
    
    valid_market_df['market_correct'] = valid_market_df['market_prediction'] == valid_market_df['actual_winner']
    
    market_accuracy = valid_market_df['market_correct'].mean()
    model_accuracy_on_odds = valid_market_df['prediction_correct'].mean()
    
    results['market_accuracy'] = market_accuracy
    results['model_accuracy_on_odds'] = model_accuracy_on_odds
    results['accuracy_difference'] = model_accuracy_on_odds - market_accuracy
    results['valid_comparisons'] = len(valid_market_df)
    
    # Calibration comparison (if we have the necessary columns)
    if 'home_implied_prob' in valid_market_df.columns and 'away_implied_prob' in valid_market_df.columns:
        market_probs = valid_market_df['home_implied_prob'].values
        model_probs = valid_market_df['home_win_probability'].values
        y_true = (valid_market_df['actual_winner'] == valid_market_df['home_team_name']).astype(int)
        
        # Only calculate if we have valid probabilities
        valid_probs_mask = pd.notna(market_probs) & pd.notna(model_probs)
        if np.sum(valid_probs_mask) > 0:
            market_brier = brier_score_loss(y_true[valid_probs_mask], market_probs[valid_probs_mask])
            model_brier = brier_score_loss(y_true[valid_probs_mask], model_probs[valid_probs_mask])
            
            results['market_brier'] = market_brier
            results['model_brier'] = model_brier
            results['brier_improvement'] = market_brier - model_brier
    
    logger.info(f"💰 Market accuracy: {market_accuracy:.3f}")
    logger.info(f"🤖 Model accuracy (on odds games): {model_accuracy_on_odds:.3f}")
    logger.info(f"📊 Accuracy difference: {results['accuracy_difference']:+.3f}")
    
    return results

market_results = analyze_market_comparison(df)

if market_results.get('has_data'):
    print("✅ Market comparison completed")
    print(f"💰 Market accuracy: {market_results['market_accuracy']:.1%}")
    print(f"🤖 Model accuracy (on odds games): {market_results['model_accuracy_on_odds']:.1%}")
    print(f"📊 Accuracy difference: {market_results['accuracy_difference']:+.1%}")
    print(f"📈 Valid comparisons: {market_results['valid_comparisons']} games")
    
    if 'brier_improvement' in market_results:
        print(f"🎲 Brier score improvement: {market_results['brier_improvement']:+.4f}")
else:
    print("⚠️ No market comparison data available")

## 8. 📋 Executive Summary & Recommendations
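
The recommendation in this section depends only on the fraction of criteria met: at least 0.75 gives GO, at least 0.5 gives CONDITIONAL GO, and anything lower gives NO GO. Enumerating the reachable cases makes the effect of missing odds data visible. The `recommend` helper below is a hypothetical condensation of that rule, not the notebook's own function:

```python
def recommend(criteria_met: int, total_criteria: int) -> str:
    """Map the share of criteria met to the three-way recommendation."""
    score = criteria_met / total_criteria
    if score >= 0.75:
        return "GO"
    if score >= 0.5:
        return "CONDITIONAL GO"
    return "NO GO"


# Every reachable outcome for 4 criteria (odds data present) and 3 (no odds data).
table = {
    total: [recommend(met, total) for met in range(total + 1)]
    for total in (4, 3)
}
for total, outcomes in table.items():
    print(total, outcomes)
```

With four criteria a GO tolerates one miss; with three (no odds data) every criterion must pass, because 2/3 falls below the 0.75 cutoff.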

In [None]:
# =============================================================================
# 8. EXECUTIVE SUMMARY & RECOMMENDATIONS
# =============================================================================

def generate_executive_summary(
    accuracy_results: dict,
    calibration_results: dict,
    temporal_results: dict,
    market_results: dict
) -> dict:
    """Generate executive summary with go/no-go recommendations.

    Note: dataset info is read from the notebook-level `df`, which must already be loaded.
    """
    logger.info("📋 Generating executive summary...")
    
    summary = {
        'timestamp': datetime.now().isoformat(),
        'dataset_info': {
            'total_games': len(df),
            'date_range': f"{df['date'].min().strftime('%Y-%m-%d')} to {df['date'].max().strftime('%Y-%m-%d')}",
            'odds_coverage': f"{market_results.get('odds_coverage', 0):.1%}" if market_results.get('has_data') else "0%"
        }
    }
    
    # Performance Assessment
    overall_accuracy = accuracy_results['overall_accuracy']
    target_accuracy = ANALYSIS_CONFIG['target_accuracy']
    training_benchmark = ANALYSIS_CONFIG['training_benchmark']
    
    performance_grade = 'A' if overall_accuracy >= training_benchmark else \
                       'B' if overall_accuracy >= target_accuracy else \
                       'C' if overall_accuracy >= 0.52 else 'D'
    
    summary['performance_assessment'] = {
        'overall_accuracy': overall_accuracy,
        'vs_target': overall_accuracy - target_accuracy,
        'vs_training': overall_accuracy - training_benchmark,
        'grade': performance_grade,
        'meets_target': overall_accuracy >= target_accuracy
    }
    
    # Calibration Assessment
    brier_score = calibration_results['brier_score']
    max_brier = ANALYSIS_CONFIG['max_brier_score']
    
    calibration_grade = 'A' if brier_score <= 0.20 else \
                       'B' if brier_score <= max_brier else \
                       'C' if brier_score <= 0.30 else 'D'
    
    summary['calibration_assessment'] = {
        'brier_score': brier_score,
        'target_brier': max_brier,
        'grade': calibration_grade,
        'well_calibrated': brier_score <= max_brier
    }
    
    # Stability Assessment
    early_acc = temporal_results['seasonal_comparison']['early_accuracy']
    late_acc = temporal_results['seasonal_comparison']['late_accuracy']
    stability_score = 1 - abs(early_acc - late_acc)
    
    summary['stability_assessment'] = {
        'early_season_accuracy': early_acc,
        'late_season_accuracy': late_acc,
        'stability_score': stability_score,
        'is_stable': stability_score >= 0.85
    }
    
    # Market Competitiveness
    if market_results.get('has_data'):
        market_edge = market_results['accuracy_difference']
        summary['market_competitiveness'] = {
            'has_data': True,  # Read by the final results display to decide whether to print this section
            'model_vs_market': market_edge,
            'has_edge': market_edge > 0,
            'brier_improvement': market_results.get('brier_improvement', 0)
        }
    else:
        summary['market_competitiveness'] = {'has_data': False}
    
    # Overall Recommendation
    criteria_met = 0
    total_criteria = 4
    
    if summary['performance_assessment']['meets_target']:
        criteria_met += 1
    if summary['calibration_assessment']['well_calibrated']:
        criteria_met += 1
    if summary['stability_assessment']['is_stable']:
        criteria_met += 1
    if market_results.get('has_data') and summary['market_competitiveness']['has_edge']:
        criteria_met += 1
    elif not market_results.get('has_data'):
        total_criteria = 3  # Adjust if no market data
    
    recommendation_score = criteria_met / total_criteria
    
    if recommendation_score >= 0.75:
        recommendation = "GO - Deploy to Production"
        recommendation_color = "green"
    elif recommendation_score >= 0.5:
        recommendation = "CONDITIONAL GO - Deploy with Monitoring"
        recommendation_color = "orange"
    else:
        recommendation = "NO GO - Needs Improvement"
        recommendation_color = "red"
    
    summary['final_recommendation'] = {
        'recommendation': recommendation,
        'score': recommendation_score,
        'color': recommendation_color,
        'criteria_met': f"{criteria_met}/{total_criteria}"
    }
    
    return summary

executive_summary = generate_executive_summary(
    accuracy_results, calibration_results, temporal_results, market_results
)

print("✅ Executive summary generated")
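
As a worked illustration of the scoring rule in `generate_executive_summary`, the sketch below reproduces the go/no-go thresholds in isolation. The helper name `recommend` is hypothetical and not part of the notebook; it only mirrors the `criteria_met / total_criteria` arithmetic and the 0.75 / 0.5 cut-offs used above.

```python
from typing import Tuple


def recommend(criteria_met: int, total_criteria: int) -> Tuple[str, float]:
    """Map the share of criteria met to a deployment decision (hypothetical helper)."""
    if total_criteria <= 0:
        raise ValueError("total_criteria must be positive")
    score = criteria_met / total_criteria
    if score >= 0.75:
        decision = "GO - Deploy to Production"
    elif score >= 0.5:
        decision = "CONDITIONAL GO - Deploy with Monitoring"
    else:
        decision = "NO GO - Needs Improvement"
    return decision, score


# With market data there are four criteria; three met reaches the 0.75 bar.
print(recommend(3, 4))
# Without market data there are three criteria; two met (about 0.667) falls short of it.
print(recommend(2, 3))
```

One consequence worth noting: when odds data are missing, the model has to meet all three remaining criteria to reach an unconditional GO, since 2/3 falls below 0.75.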

## 9. 💾 Export Visualizations & Summary

In [None]:
# =============================================================================
# 9. EXPORT VISUALIZATIONS & SUMMARY
# =============================================================================

def export_all_charts():
    """Export all visualizations to HTML files"""
    logger.info("💾 Exporting visualizations...")
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    CHARTS_PATH.mkdir(parents=True, exist_ok=True)  # Ensure the output directory exists
    
    # Export accuracy charts
    for i, chart in enumerate(accuracy_charts):
        filename = f"model_validation_accuracy_{i+1}_{timestamp}.html"
        chart.write_html(CHARTS_PATH / filename)
    
    # Export calibration charts
    for i, chart in enumerate(calibration_charts):
        filename = f"model_validation_calibration_{i+1}_{timestamp}.html"
        chart.write_html(CHARTS_PATH / filename)
    
    # Export temporal charts
    for i, chart in enumerate(temporal_charts):
        filename = f"model_validation_temporal_{i+1}_{timestamp}.html"
        chart.write_html(CHARTS_PATH / filename)
    
    logger.info(f"✅ Charts exported to {CHARTS_PATH}")
    return timestamp

def export_executive_summary(summary: dict, timestamp: str):
    """Export executive summary to JSON"""
    filename = f"model_validation_summary_{timestamp}.json"
    
    with open(CHARTS_PATH / filename, 'w', encoding='utf-8') as f:
        json.dump(summary, f, indent=2, ensure_ascii=False, default=str)
    
    logger.info(f"📋 Executive summary exported: {filename}")
    return filename

# Export everything
export_timestamp = export_all_charts()
summary_filename = export_executive_summary(executive_summary, export_timestamp)

print("✅ All exports completed")
print(f"📁 Charts exported to: {CHARTS_PATH}")
print(f"📋 Summary saved as: {summary_filename}")
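
The `default=str` argument in `export_executive_summary` is what lets `json.dump` write values it cannot serialize natively. A minimal sketch of that fallback, using only standard-library types as stand-ins (the sample dictionary is invented for illustration and is not the notebook's real summary):

```python
import json
from datetime import datetime
from pathlib import Path

# Invented sample payload; the real summary is built by generate_executive_summary.
sample = {
    "generated_at": datetime(2025, 4, 17, 12, 0, 0),
    "output_dir": Path("models") / "experiments" / "charts",
    "meets_target": True,
}

# Without default=str, json.dumps raises TypeError on the datetime and Path values.
encoded = json.dumps(sample, default=str)
decoded = json.loads(encoded)

print(decoded["generated_at"])         # the datetime comes back as a plain string
print(type(decoded["meets_target"]))   # native bools survive as JSON booleans
```

The same fallback applies to any other non-native type. If summary values happen to be NumPy scalars such as `numpy.bool_` or `numpy.int64`, they would be written as strings rather than JSON booleans or numbers; casting with `bool()` or `int()` before export avoids that.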

## 10. 🎯 Final Results Display

### Executive Summary

In [None]:
# =============================================================================
# 10. FINAL RESULTS DISPLAY
# =============================================================================

print("🎯 MODEL VALIDATION ANALYSIS - EXECUTIVE SUMMARY")
print("=" * 60)
print(f"📊 Dataset: {executive_summary['dataset_info']['total_games']} games")
print(f"📅 Period: {executive_summary['dataset_info']['date_range']}")
print(f"💰 Odds Coverage: {executive_summary['dataset_info']['odds_coverage']}")
print()

print("🎯 PERFORMANCE ASSESSMENT")
print("-" * 30)
perf = executive_summary['performance_assessment']
print(f"Overall Accuracy: {perf['overall_accuracy']:.1%} (Grade: {perf['grade']})")
print(f"vs Target ({ANALYSIS_CONFIG['target_accuracy']:.1%}): {perf['vs_target']:+.1%}")
print(f"vs Training ({ANALYSIS_CONFIG['training_benchmark']:.1%}): {perf['vs_training']:+.1%}")
print(f"Meets Target: {'✅ YES' if perf['meets_target'] else '❌ NO'}")
print()

print("🎲 CALIBRATION ASSESSMENT")
print("-" * 30)
cal = executive_summary['calibration_assessment']
print(f"Brier Score: {cal['brier_score']:.4f} (Grade: {cal['grade']})")
print(f"Target: ≤ {cal['target_brier']:.3f}")
print(f"Well Calibrated: {'✅ YES' if cal['well_calibrated'] else '❌ NO'}")
print()

print("📅 STABILITY ASSESSMENT")
print("-" * 30)
stab = executive_summary['stability_assessment']
print(f"Early Season: {stab['early_season_accuracy']:.1%}")
print(f"Late Season: {stab['late_season_accuracy']:.1%}")
print(f"Stability Score: {stab['stability_score']:.3f}")
print(f"Is Stable: {'✅ YES' if stab['is_stable'] else '❌ NO'}")
print()

if executive_summary['market_competitiveness'].get('has_data'):
    print("💰 MARKET COMPETITIVENESS")
    print("-" * 30)
    market = executive_summary['market_competitiveness']
    print(f"Model vs Market: {market['model_vs_market']:+.1%}")
    print(f"Has Edge: {'✅ YES' if market['has_edge'] else '❌ NO'}")
    if 'brier_improvement' in market:
        print(f"Brier Improvement: {market['brier_improvement']:+.4f}")
    print()

print("🚀 FINAL RECOMMENDATION")
print("=" * 30)
rec = executive_summary['final_recommendation']
print(f"Decision: {rec['recommendation']}")
print(f"Score: {rec['score']:.1%}")
print(f"Criteria Met: {rec['criteria_met']}")
print()

print("✅ Analysis Complete!")
print(f"📁 Charts exported to: {CHARTS_PATH}")
print(f"📋 Summary saved to: {summary_filename}")

logger.info("🎯 Model validation analysis completed successfully")

---

## 📋 Analysis Complete!

This comprehensive model validation analysis provides:

✅ **Prediction Accuracy Assessment** - Overall performance vs benchmarks  
✅ **Calibration Quality Analysis** - Probability reliability and Brier score  
✅ **Temporal Stability Evaluation** - Performance consistency over time  
✅ **Market Competitiveness Analysis** - Model vs betting market comparison  
✅ **Executive Summary** - Go/no-go deployment recommendations  
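
For readers who want the two headline metrics spelled out, here is a minimal sketch of the Brier score (the mean squared difference between a predicted probability and the 0/1 outcome) and of the stability score used in the executive summary. The function names and toy inputs are illustrative assumptions, not the notebook's own implementation.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted win probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def stability_score(early_acc: float, late_acc: float) -> float:
    """One minus the absolute gap between early- and late-season accuracy."""
    return 1 - abs(early_acc - late_acc)


# A coin-flip forecast (always 0.5) scores 0.25 regardless of the outcomes.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
# Toy accuracies with a 4-point gap between early and late season.
print(round(stability_score(0.60, 0.56), 2))
```

Because a constant 0.5 forecast already achieves 0.25, a Brier score is informative only to the extent that it falls below that baseline.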

**Charts and summary exported to:** `models/experiments/charts/`

---

*Model Validation Analysis - Part 4 of Specialized Notebooks Pipeline*