# üöÄ Universal ML Model Training - Multi-Purpose Machine Learning Pipeline

This notebook provides a comprehensive machine learning pipeline that automatically adapts to your dataset and problem type. It supports:

## üéØ Problem Types:
1. **Binary Classification** - Two-class prediction (e.g., pleasant/unpleasant weather)
2. **Multi-class Classification** - Multiple class prediction (e.g., weather categories)
3. **Regression** - Continuous value prediction (e.g., temperature, humidity)

## üåü Key Features:
- **Automatic problem type detection** based on target variable
- **Multi-station support** for weather or sensor data
- **Adaptive model selection** - Different algorithms for classification vs regression
- **Educational gradient descent** implementation for single-feature regression
- **Comprehensive evaluation** with appropriate metrics for each problem type
- **Interactive feature selection** with multiple strategies
- **Station-wise analysis** for multi-location datasets
- **Overfitting detection** and model recommendations

## üìÅ Expected Structure
```
Your Project/
‚îú‚îÄ‚îÄ 02_data/Processed_data/    ‚Üê Pre-scaled data
‚îú‚îÄ‚îÄ 03_notebooks/              ‚Üê Run from here
‚îî‚îÄ‚îÄ 05_results/                ‚Üê Output files
```

## ‚öôÔ∏è Configuration

Define all configuration parameters used throughout the analysis. These settings control data quality thresholds, model parameters, and analysis behavior.

In [1]:
# Configuration settings for the analysis
CONFIG = {
    'target_column': 'pleasant_weather',  # Column name pattern for pleasant weather labels
    'station_column': 'station_id',  # Column name for station identifiers
    'missing_threshold': 0.04,  # Maximum allowed missing data per station
    'critical_features': ['temp', 'humidity', 'pressure', 'wind'],  # Features to check for missing data
    'overfitting_threshold': 0.05,  # Maximum train-test score difference
    'high_accuracy_threshold': 0.95,  # Threshold for high-accuracy station reporting
    'exclude_patterns': ['date', 'time', 'year', 'month', 'day', 'hour', 'minute', 'id'],  # Temporal features to exclude
    'weather_patterns': {  # Patterns to identify weather feature types
        'temperature': ['temp'],
        'humidity': ['humid', 'moisture'],
        'pressure': ['pressure', 'press'],
        'wind': ['wind', 'gust'],
        'precipitation': ['rain', 'precip', 'snow'],
        'radiation': ['radiation', 'solar'],
        'visibility': ['visibility', 'vis'],
        'clouds': ['cloud']
    },
    'neural_network_params': {
        'hidden_layer_sizes': [(50,), (100,), (100,50), (100,75,50), (200,100,50)],
        'max_iter': [500, 1000, 2000],
        'tol': [1e-3, 1e-4, 1e-5],
        'activation': ['relu', 'tanh'],
        'solver': ['adam', 'lbfgs']
    }
}

print("‚úÖ Configuration loaded successfully!")
print("\nüìã Key Settings:")
for key, value in CONFIG.items():
    if key not in ['neural_network_params', 'weather_patterns']:
        print(f"  ‚Ä¢ {key}: {value}")

‚úÖ Configuration loaded successfully!

üìã Key Settings:
  ‚Ä¢ target_column: pleasant_weather
  ‚Ä¢ station_column: station_id
  ‚Ä¢ missing_threshold: 0.04
  ‚Ä¢ critical_features: ['temp', 'humidity', 'pressure', 'wind']
  ‚Ä¢ overfitting_threshold: 0.05
  ‚Ä¢ high_accuracy_threshold: 0.95
  ‚Ä¢ exclude_patterns: ['date', 'time', 'year', 'month', 'day', 'hour', 'minute', 'id']


## üìö 1. Setup and Imports

Import all required libraries for:
- **Data Processing**: numpy, pandas, pathlib
- **Machine Learning**: scikit-learn models and utilities
- **Visualization**: matplotlib, seaborn, plotly
- **Progress Tracking**: tqdm for training progress

In [2]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime
import json
import joblib
import time
import sys

# Progress bar imports
from tqdm import tqdm
from tqdm.notebook import tqdm as tqdm_notebook

# ML imports - Model Selection and Splitting
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV, 
    KFold, StratifiedKFold
)

# ML imports - Metrics
from sklearn.metrics import (
    # Classification metrics
    accuracy_score, balanced_accuracy_score, precision_score, 
    recall_score, f1_score, roc_auc_score, confusion_matrix,
    classification_report,
    # Regression metrics
    mean_squared_error, mean_absolute_error, r2_score,
    explained_variance_score
)

# ML imports - Preprocessing and Feature Selection
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import (
    SelectKBest, f_classif, f_regression, 
    mutual_info_classif, mutual_info_regression
)

# ML imports - Models
from sklearn.linear_model import (
    LogisticRegression, LinearRegression, Ridge, Lasso, 
    ElasticNet, SGDClassifier, SGDRegressor
)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor, 
    GradientBoostingClassifier, GradientBoostingRegressor
)
from sklearn.svm import SVC, SVR
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.naive_bayes import GaussianNB

# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)

plt.style.use('seaborn-v0_8-whitegrid')
print("‚úÖ All libraries imported successfully!")
print(f"üìÖ Analysis date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")

‚úÖ All libraries imported successfully!
üìÖ Analysis date: 2025-06-09 05:20


## üì• 2. Load and Merge Data

### Data Loading Process:
1. **Interactive path setup** - Select project root directory
2. **File selection** - Choose which CSV files to load
3. **Dataset identification** - Automatically identify features vs. target datasets
4. **Data merging** - Merge datasets on common keys (date/station columns)

In [3]:
# Import the module
sys.path.append('./src')  # Optional: adapt if running outside src
from file_handler import setup_paths, load_multiple_datasets

# 1. Interactive project path setup
project_root, input_path, output_path = setup_paths()

# 2. File selection and loading
datasets = load_multiple_datasets(input_path)

# 3. Identify features and target datasets
print("\nüîç Identifying datasets...")
features_df = None
target_df = None

for name, df in datasets.items():
    print(f"\nüìä {name}:")
    print(f"   Shape: {df.shape}")
    print(f"   Columns sample: {list(df.columns[:5])}...")
    
    # Check if this is the target dataset
    if 'answer' in name.lower() or 'target' in name.lower() or 'label' in name.lower():
        target_df = df
        print("   ‚úÖ Identified as TARGET dataset")
    elif 'processed' in name.lower() or 'scaled' in name.lower() or 'feature' in name.lower():
        features_df = df
        print("   ‚úÖ Identified as FEATURES dataset")

if features_df is None or target_df is None:
    print("\n‚ö†Ô∏è Could not automatically identify datasets. Please select manually:")
    dataset_names = list(datasets.keys())
    
    print("\nAvailable datasets:")
    for i, name in enumerate(dataset_names):
        print(f"  {i+1}. {name}")
    
    features_idx = int(input("\nüëâ Select FEATURES dataset number: ")) - 1
    target_idx = int(input("üëâ Select TARGET dataset number: ")) - 1
    
    features_df = datasets[dataset_names[features_idx]]
    target_df = datasets[dataset_names[target_idx]]

print(f"\n‚úÖ Datasets identified:")
print(f"   Features: {features_df.shape}")
print(f"   Targets: {target_df.shape}")

üìç Current directory: C:\Users\User\Dropbox\Personal\CareerFoundry\07 Machine Learning\ML\03_notebooks
üìÅ Project root: C:\Users\User\Dropbox\Personal\CareerFoundry\07 Machine Learning\ML

üì• SELECT INPUT FOLDER

üìã Available folders in project:
   1: 01_roject_management
   2: 02_data
   3: 03_notebooks
   4: 04_analysis
   5: 05_results



>>> Choose input folder number (1-5):  2



‚úÖ Selected: 02_data

----------------------------------------
üìÇ Subfolders in '02_data':
   0: Use '02_data' (parent folder)
   1: Merged_data
   2: Original_data
   3: Processed_data



>>> Choose subfolder (0-3) [Enter for 0]:  3



‚úÖ Input path set to: C:\Users\User\Dropbox\Personal\CareerFoundry\07 Machine Learning\ML\02_data\Processed_data


üì§ SELECT OUTPUT FOLDER

üìã Available folders in project:
   1: 01_roject_management
   2: 02_data
   3: 03_notebooks
   4: 04_analysis
   5: 05_results

   üí° Press Enter to use input folder: 02_data\Processed_data



>>> Choose output folder number (1-5) [Enter for input folder]:  2



‚úÖ Selected: 02_data

----------------------------------------
üìÇ Subfolders in '02_data':
   0: Use '02_data' (parent folder)
   1: Merged_data
   2: Original_data
   3: Processed_data



>>> Choose subfolder (0-3) [Enter for 0]:  3




‚úÖ PROJECT SETUP COMPLETE!

   üì• Input path:  C:\Users\User\Dropbox\Personal\CareerFoundry\07 Machine Learning\ML\02_data\Processed_data
   üì§ Output path: C:\Users\User\Dropbox\Personal\CareerFoundry\07 Machine Learning\ML\02_data\Processed_data


üìã Available data files:
   1: üìä Dataset-Answers-Weather_Prediction_Pleasant_Weather.csv (CSV)
   2: üìä Dataset-Answers-Weather_Prediction_Pleasant_Weather_test.csv (CSV)
   3: üìä Dataset-weather-prediction-dataset-processed_scaled_20250528_1500.csv (CSV)
   4: üìä Dataset-weather-prediction-dataset-processed_scaled_20250528_1500_test.csv (CSV)

üîç How would you like to select files?
   1: Select specific files
   2: Load all files
   3: Load files by type (CSV, Excel, etc.)
   4: Load files matching a pattern



>>> Choose selection mode (1-4):  1



üìå Select files (separate numbers with commas, e.g., 1,3,5)
   Or use ranges (e.g., 1-3,5,7-9)
   Press Enter to select all files



>>> Enter file numbers:  1,3



üîÑ Loading 2 files...

[1/2] Loading: Dataset-Answers-Weather_Prediction_Pleasant_Weather.csv
   ‚úÖ Loaded: 22950 rows √ó 16 columns

[2/2] Loading: Dataset-weather-prediction-dataset-processed_scaled_20250528_1500.csv
   ‚úÖ Loaded: 22950 rows √ó 171 columns

üìä LOADING SUMMARY
‚úÖ Successfully loaded: 2 files

üìã Loaded datasets:
   - Dataset-Answers-Weather_Prediction_Pleasant_Weather.csv: 22950 rows √ó 16 columns
   - Dataset-weather-prediction-dataset-processed_scaled_20250528_1500.csv: 22950 rows √ó 171 columns

üîç Identifying datasets...

üìä Dataset-Answers-Weather_Prediction_Pleasant_Weather.csv:
   Shape: (22950, 16)
   Columns sample: ['DATE', 'BASEL_pleasant_weather', 'BELGRADE_pleasant_weather', 'BUDAPEST_pleasant_weather', 'DEBILT_pleasant_weather']...
   ‚úÖ Identified as TARGET dataset

üìä Dataset-weather-prediction-dataset-processed_scaled_20250528_1500.csv:
   Shape: (22950, 171)
   Columns sample: ['id', 'DATE', 'MONTH', 'BASEL_cloud_cover', 'BASEL_wind_

### Merge Features and Target Datasets

This step merges the features and target datasets based on common columns (typically date and station identifiers). The merge operation ensures that we have aligned features and targets for each observation.

In [4]:
# 4. Merge datasets
print("\nüîÑ Merging datasets...")

# Find common columns for merging
common_cols = list(set(features_df.columns) & set(target_df.columns))
print(f"\nCommon columns: {common_cols}")

# Identify merge keys (date and station identifiers)
merge_keys = []
date_cols = [col for col in common_cols if 'date' in col.lower()]
station_cols = [col for col in common_cols if 'station' in col.lower() or 'id' in col.lower()]

if date_cols:
    merge_keys.extend(date_cols[:1])  # Use first date column
if station_cols:
    merge_keys.extend(station_cols[:1])  # Use first station column

# If no automatic detection, ask user
if not merge_keys:
    print("\n‚ö†Ô∏è Could not automatically detect merge keys.")
    print("Available columns in both datasets:")
    for i, col in enumerate(common_cols[:20]):
        print(f"  {i+1}. {col}")
    
    key_indices = input("\nüëâ Select merge key columns (comma-separated numbers): ")
    merge_keys = [common_cols[int(i)-1] for i in key_indices.split(',')]

print(f"\nüîó Merging on: {merge_keys}")

# Perform merge
df = pd.merge(features_df, target_df[merge_keys + [col for col in target_df.columns if col not in merge_keys]], 
              on=merge_keys, how='inner')

print(f"‚úÖ Merged dataset: {df.shape[0]:,} rows √ó {df.shape[1]} columns")
print(f"   Memory usage: {df.memory_usage().sum() / 1024**2:.2f} MB")

# Update station column name if needed
if CONFIG['station_column'] not in df.columns:
    station_candidates = [col for col in df.columns if 'station' in col.lower()]
    if station_candidates:
        CONFIG['station_column'] = station_candidates[0]
        print(f"\nüìç Updated station column to: {CONFIG['station_column']}")


üîÑ Merging datasets...

Common columns: ['DATE']

üîó Merging on: ['DATE']
‚úÖ Merged dataset: 22,950 rows √ó 186 columns
   Memory usage: 32.57 MB


## üßπ 3. Data Quality Control

### Quality Control Steps:
1. **Identify critical features** - Find weather-related columns (temperature, humidity, pressure, wind)
2. **Station-wise quality check** - Calculate missing data percentage per station
3. **Remove low-quality stations** - Drop stations exceeding missing data threshold
4. **Fill remaining missing values** - Use median imputation for numeric columns

In [9]:
# Identify critical features
critical_features = []
for pattern in CONFIG['critical_features']:
    matching_cols = [col for col in df.columns if pattern.lower() in col.lower()]
    critical_features.extend(matching_cols)

critical_features = list(set(critical_features))  # Remove duplicates
print(f"\nüîç Found {len(critical_features)} critical features:")
for feat in critical_features[:10]:
    print(f"   ‚Ä¢ {feat}")
if len(critical_features) > 10:
    print(f"   ... and {len(critical_features) - 10} more")

# Check for station column
if CONFIG['station_column'] in df.columns:
    # Calculate missing data per station
    print(f"\nüìä Analyzing data quality by station...")
    stations_to_drop = []
    station_quality = {}
    
    for station in df[CONFIG['station_column']].unique():
        station_data = df[df[CONFIG['station_column']] == station]
        
        # Calculate missing percentage for critical features
        if critical_features:
            missing_pct = station_data[critical_features].isnull().sum().sum() / (len(station_data) * len(critical_features))
        else:
            missing_pct = station_data.isnull().sum().sum() / (len(station_data) * len(station_data.columns))
        
        station_quality[station] = {
            'missing_pct': missing_pct,
            'row_count': len(station_data)
        }
        
        if missing_pct > CONFIG['missing_threshold']:
            stations_to_drop.append(station)
    
    # Report and drop stations
    if stations_to_drop:
        print(f"\n‚ö†Ô∏è Dropping {len(stations_to_drop)} stations with >{CONFIG['missing_threshold']*100:.0f}% missing data:")
        for station in stations_to_drop[:5]:
            print(f"   ‚Ä¢ {station}: {station_quality[station]['missing_pct']*100:.1f}% missing")
        if len(stations_to_drop) > 5:
            print(f"   ... and {len(stations_to_drop) - 5} more")
        
        # Drop stations
        df_before = len(df)
        df = df[~df[CONFIG['station_column']].isin(stations_to_drop)]
        print(f"\n‚úÖ Removed {df_before - len(df):,} rows from {len(stations_to_drop)} stations")
    else:
        print(f"\n‚úÖ All stations have acceptable data quality!")
    
    # Summary of remaining stations
    remaining_stations = df[CONFIG['station_column']].nunique()
    print(f"\nüìç Remaining stations: {remaining_stations}")
else:
    print(f"\n‚ö†Ô∏è Station column '{CONFIG['station_column']}' not found. Proceeding with overall data quality check.")
    
    # Overall missing data handling
    missing_pct = df.isnull().sum() / len(df) * 100
    cols_to_drop = missing_pct[missing_pct > CONFIG['missing_threshold']*100].index
    if len(cols_to_drop) > 0:
        df = df.drop(columns=cols_to_drop)
        print(f"   Dropped {len(cols_to_drop)} columns with >{CONFIG['missing_threshold']*100:.0f}% missing data")

# Fill remaining missing values
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

print(f"\n‚úÖ Final dataset: {df.shape[0]:,} rows √ó {df.shape[1]} columns")


üîç Found 96 critical features:
   ‚Ä¢ DEBILT_humidity
   ‚Ä¢ MADRID_wind_speed
   ‚Ä¢ SONNBLICK_temp_mean
   ‚Ä¢ MADRID_humidity
   ‚Ä¢ LJUBLJANA_temp_min
   ‚Ä¢ TOURS_pressure
   ‚Ä¢ HEATHROW_temp_min
   ‚Ä¢ ROMA_pressure
   ‚Ä¢ MUNCHENB_temp_mean
   ‚Ä¢ BASEL_humidity
   ... and 86 more

‚ö†Ô∏è Station column 'station_id' not found. Proceeding with overall data quality check.

‚úÖ Final dataset: 22,950 rows √ó 187 columns


## üéØ 4. Interactive Target Selection and Problem Type Detection

### Multi-Purpose Target Selection

This section automatically detects if the dataset contains multiple stations/targets and determines the problem type:
- **Classification**: Binary (2 classes) or Multi-class (>2 classes)
- **Regression**: Continuous values

Options:
- **Single Station/Feature Model**: Focus on one specific station or feature
- **Multi-Station/Feature Model**: Use data from all stations or features
- **Traditional Approach**: Manual target selection

In [10]:
# MULTI-STATION/TARGET DATA DETECTION
print("\nüåç DETECTING DATA STRUCTURE AND TARGET OPTIONS")
print("="*60)

# Initialize variables
single_station_mode = False
selected_station = None
target_col = None
problem_type = None

# Look for potential target columns
target_patterns = ['target', 'label', 'pleasant_weather', 'y', 'output', 'result']
potential_targets = []

for col in df.columns:
    if any(pattern in col.lower() for pattern in target_patterns):
        potential_targets.append(col)

# Detect if this is multi-station data
station_names = []
if potential_targets:
    # Check if targets have station prefixes
    for target in potential_targets:
        parts = target.split('_')
        if len(parts) > 1:
            potential_station = parts[0]
            # Check if this prefix appears in other columns
            station_cols = [col for col in df.columns if col.startswith(potential_station + '_')]
            if len(station_cols) > 5:  # Likely a station prefix
                station_names.append(potential_station)
    
    station_names = list(set(station_names))

if station_names:
    print(f"\n‚úÖ Detected multi-station/multi-source dataset with {len(station_names)} stations:")
    for i, station in enumerate(station_names, 1):
        station_features = [col for col in df.columns if col.startswith(station + '_')]
        print(f"  {i:2d}. {station} ({len(station_features)} features)")
    
    # Ask user to select approach
    print("\nüìä Model Training Approach:")
    print("  1. Single Station Model - Train separate model for one station")
    print("  2. Multi-Station Model - Use data from all stations")
    print("  3. Traditional Approach - Select target column manually")
    
    approach = input("\nüëâ Select approach (1-3): ").strip()
    
    if approach == '1':
        # Single station approach
        print("\nüéØ SELECT TARGET STATION")
        print("-"*40)
        for i, station in enumerate(station_names, 1):
            print(f"  {i:2d}. {station}")
        
        station_idx = int(input("\nüëâ Select station number: ")) - 1
        selected_station = station_names[station_idx]
        single_station_mode = True
        
        # Show all features for the selected station
        station_features = [col for col in df.columns if col.startswith(selected_station + '_')]
        
        print(f"\nüìä Available features for {selected_station}:")
        print("-"*40)
        
        # Categorize and display features
        feature_list = []
        for i, feat in enumerate(station_features, 1):
            feature_list.append(feat)
            print(f"  {i:2d}. {feat}")
            # Show basic stats
            if df[feat].dtype in [np.number]:
                unique_vals = df[feat].nunique()
                if unique_vals == 2:
                    print(f"      Type: Binary ({dict(df[feat].value_counts())})")
                elif unique_vals <= 10:
                    print(f"      Type: Categorical ({unique_vals} unique values)")
                else:
                    print(f"      Type: Continuous (range: [{df[feat].min():.2f}, {df[feat].max():.2f}])")
        
        # Ask user to select target
        print("\nüéØ SELECT TARGET VARIABLE")
        target_idx = int(input(f"\nüëâ Select target feature number (1-{len(feature_list)}): "))
        target_col = feature_list[target_idx - 1]
        
    else:
        approach = '3'  # Fall back to traditional approach

if approach == '3' or not station_names:
    # Traditional approach - show all potential targets
    print("\nüéØ SELECT TARGET VARIABLE")
    print("-"*40)
    
    # Show all columns with basic info
    all_cols = list(df.columns)
    print("\nAvailable columns:")
    for i, col in enumerate(all_cols[:50], 1):
        col_info = f"{i:3d}. {col}"
        if df[col].dtype in [np.number]:
            unique_vals = df[col].nunique()
            if unique_vals == 2:
                col_info += " (Binary)"
            elif unique_vals <= 10:
                col_info += f" (Categorical: {unique_vals} classes)"
            else:
                col_info += " (Continuous)"
        print(col_info)
    
    if len(all_cols) > 50:
        print(f"\n... and {len(all_cols) - 50} more columns")
    
    target_idx = int(input(f"\nüëâ Select target column number (1-{len(all_cols)}): "))
    target_col = all_cols[target_idx - 1]

# Determine problem type based on target
print(f"\n‚úÖ Selected target: {target_col}")

if df[target_col].dtype not in [np.number]:
    # Convert categorical to numeric if needed
    le = LabelEncoder()
    df[target_col] = le.fit_transform(df[target_col])
    print(f"   Encoded categorical target to numeric")

unique_values = df[target_col].nunique()
if unique_values == 2:
    problem_type = 'classification'
    print(f"   Problem type: Binary Classification")
    print(f"   Class distribution: {dict(df[target_col].value_counts())}")
elif unique_values <= 10:
    problem_type = 'classification'
    print(f"   Problem type: Multi-class Classification ({unique_values} classes)")
    print(f"   Class distribution: {dict(df[target_col].value_counts().head())}")
else:
    problem_type = 'regression'
    print(f"   Problem type: Regression")
    print(f"   Target statistics:")
    print(f"     ‚Ä¢ Mean: {df[target_col].mean():.2f}")
    print(f"     ‚Ä¢ Std: {df[target_col].std():.2f}")
    print(f"     ‚Ä¢ Range: [{df[target_col].min():.2f}, {df[target_col].max():.2f}]")

# Create target column alias
df['target'] = df[target_col]


üåç DETECTING DATA STRUCTURE AND TARGET OPTIONS

‚úÖ Detected multi-station/multi-source dataset with 18 stations:
   1. DUSSELDORF (12 features)
   2. LJUBLJANA (11 features)
   3. DEBILT (11 features)
   4. MADRID (11 features)
   5. VALENTIA (11 features)
   6. HEATHROW (11 features)
   7. OSLO (12 features)
   8. ROMA (6 features)
   9. BELGRADE (10 features)
  10. BUDAPEST (10 features)
  11. MAASTRICHT (11 features)
  12. GDANSK (7 features)
  13. BASEL (12 features)
  14. MUNCHENB (10 features)
  15. SONNBLICK (11 features)
  16. TOURS (8 features)
  17. KASSEL (10 features)
  18. STOCKHOLM (9 features)

üìä Model Training Approach:
  1. Single Station Model - Train separate model for one station
  2. Multi-Station Model - Use data from all stations
  3. Traditional Approach - Select target column manually



üëâ Select approach (1-3):  1



üéØ SELECT TARGET STATION
----------------------------------------
   1. DUSSELDORF
   2. LJUBLJANA
   3. DEBILT
   4. MADRID
   5. VALENTIA
   6. HEATHROW
   7. OSLO
   8. ROMA
   9. BELGRADE
  10. BUDAPEST
  11. MAASTRICHT
  12. GDANSK
  13. BASEL
  14. MUNCHENB
  15. SONNBLICK
  16. TOURS
  17. KASSEL
  18. STOCKHOLM



üëâ Select station number:  5



üìä Available features for VALENTIA:
----------------------------------------
   1. VALENTIA_cloud_cover
      Type: Categorical (9 unique values)
   2. VALENTIA_humidity
      Type: Continuous (range: [-6.27, 2.45])
   3. VALENTIA_pressure
      Type: Continuous (range: [-5.51, 2.99])
   4. VALENTIA_global_radiation
      Type: Continuous (range: [-1.31, 3.35])
   5. VALENTIA_precipitation
      Type: Continuous (range: [-0.49, 106.03])
   6. VALENTIA_snow_depth
      Type: Categorical (4 unique values)
   7. VALENTIA_sunshine
      Type: Continuous (range: [-1.04, 3.71])
   8. VALENTIA_temp_mean
      Type: Continuous (range: [-4.27, 3.87])
   9. VALENTIA_temp_min
      Type: Continuous (range: [-4.15, 3.17])
  10. VALENTIA_temp_max
      Type: Continuous (range: [-4.32, 4.28])
  11. VALENTIA_pleasant_weather

üéØ SELECT TARGET VARIABLE



üëâ Select target feature number (1-11):  11



‚úÖ Selected target: VALENTIA_pleasant_weather
   Encoded categorical target to numeric
   Problem type: Binary Classification
   Class distribution: {0: np.int64(21776), 1: np.int64(1174)}


### Feature Selection Strategy

Choose features based on the selected approach and problem type:
- **Automatic selection**: Based on correlation with target
- **Pattern-based**: Select features matching specific patterns
- **Category-based**: Select by feature type (temperature, humidity, etc.)
- **Manual selection**: Choose individual features from a list

In [12]:
# FEATURE SELECTION
print("\nüéØ FEATURE SELECTION")
print("="*60)

# Identify columns to exclude
exclude_cols = [target_col, 'target']
if CONFIG['station_column'] in df.columns:
    exclude_cols.append(CONFIG['station_column'])

# Get available features based on mode
if single_station_mode and selected_station:
    print(f"\nüéØ Single Station Mode: {selected_station}")
    available_features = [col for col in df.columns 
                         if col.startswith(selected_station + '_') 
                         and col not in exclude_cols 
                         and df[col].dtype in [np.number]]
else:
    # All numeric features
    available_features = [col for col in df.columns 
                         if col not in exclude_cols 
                         and df[col].dtype in [np.number]]

print(f"\nüìä Found {len(available_features)} available numeric features")

# Feature selection options
print("\nüìä Feature Selection Options:")
print("  1. Automatic selection (based on correlation/importance)")
print("  2. Pattern-based selection (e.g., 'temp', 'humid', 'mean')")
print("  3. Category-based selection (temperature, humidity, etc.)")
print("  4. Manual feature selection")
print("  5. Use all available features")

selection_option = input("\nüëâ Choose option (1-5): ").strip()

if selection_option == '1':
    # Automatic selection based on correlation
    print("\nüîÑ Calculating feature importance...")
    
    # Calculate correlations or mutual information
    if problem_type == 'regression':
        correlations = pd.DataFrame({
            'feature': available_features,
            'correlation': [abs(df[feat].corr(df['target'])) for feat in available_features]
        }).sort_values('correlation', ascending=False)
        
        # Select top features
        threshold = 0.1
        selected_features = correlations[correlations['correlation'] > threshold]['feature'].tolist()
        
        print(f"\n‚úÖ Selected {len(selected_features)} features with correlation > {threshold}")
        print("\nTop 10 correlated features:")
        for _, row in correlations.head(10).iterrows():
            print(f"   ‚Ä¢ {row['feature']}: {row['correlation']:.3f}")
    else:
        # Use mutual information for classification
        from sklearn.feature_selection import mutual_info_classif
        mi_scores = mutual_info_classif(df[available_features], df['target'])
        mi_df = pd.DataFrame({
            'feature': available_features,
            'mi_score': mi_scores
        }).sort_values('mi_score', ascending=False)
        
        # Select top features
        n_features = min(30, len(available_features) // 2)
        selected_features = mi_df.head(n_features)['feature'].tolist()
        
        print(f"\n‚úÖ Selected top {len(selected_features)} features by mutual information")
        print("\nTop 10 features:")
        for _, row in mi_df.head(10).iterrows():
            print(f"   ‚Ä¢ {row['feature']}: {row['mi_score']:.3f}")
    
    feature_cols = selected_features
    
elif selection_option == '2':
    # Pattern-based selection
    print("\nEnter patterns to match (comma-separated)")
    print("Example: 'mean,temp,humid' for mean values, temperature, and humidity")
    patterns = input("\nüëâ Patterns: ").strip().split(',')
    
    feature_cols = []
    for pattern in patterns:
        matching = [f for f in available_features if pattern.strip().lower() in f.lower()]
        feature_cols.extend(matching)
        print(f"  ‚Ä¢ '{pattern.strip()}' matched {len(matching)} features")
    
    feature_cols = list(set(feature_cols))  # Remove duplicates
    print(f"\n‚úÖ Selected {len(feature_cols)} features by pattern")
    
elif selection_option == '3':
    # Category-based selection
    categories = {
        '1': ('Temperature', ['temp']),
        '2': ('Humidity', ['humid', 'moisture']),
        '3': ('Pressure', ['pressure', 'press']),
        '4': ('Wind', ['wind', 'gust']),
        '5': ('Precipitation', ['rain', 'precip', 'snow']),
        '6': ('All Weather', ['temp', 'humid', 'pressure', 'wind', 'rain'])
    }
    
    print("\nüìä Feature Categories:")
    for key, (name, _) in categories.items():
        print(f"  {key}. {name}")
    
    cat_selection = input("\nüëâ Select categories (comma-separated numbers): ").strip().split(',')
    
    feature_cols = []
    for cat in cat_selection:
        if cat in categories:
            name, patterns = categories[cat]
            for pattern in patterns:
                matching = [f for f in available_features if pattern in f.lower()]
                feature_cols.extend(matching)
            print(f"  ‚Ä¢ {name}: added {len(matching)} features")
    
    feature_cols = list(set(feature_cols))  # Remove duplicates
    print(f"\n‚úÖ Selected {len(feature_cols)} features by category")
    
elif selection_option == '4':
    # Manual selection
    print("\nüìã Available features:")
    for i, feat in enumerate(available_features[:50], 1):
        print(f"  {i:3d}. {feat}")
    if len(available_features) > 50:
        print(f"  ... and {len(available_features) - 50} more")
    
    print("\nEnter feature numbers (comma-separated) or 'all'")
    selection = input("\nüëâ Selection: ").strip()
    
    if selection.lower() == 'all':
        feature_cols = available_features
    else:
        indices = [int(x.strip())-1 for x in selection.split(',')]
        feature_cols = [available_features[i] for i in indices if i < len(available_features)]
    
    print(f"\n‚úÖ Selected {len(feature_cols)} features manually")
    
else:
    # Use all features
    feature_cols = available_features
    print(f"\n‚úÖ Using all {len(feature_cols)} available features")

# Exclude temporal features
temporal_excluded = []
for pattern in CONFIG['exclude_patterns']:
    temporal_excluded.extend([f for f in feature_cols if pattern in f.lower()])

feature_cols = [f for f in feature_cols if f not in temporal_excluded]

print(f"\nüìä FINAL FEATURE SELECTION:")
print(f"   Selected features: {len(feature_cols)}")
print(f"   Excluded temporal: {len(temporal_excluded)}")

# Show sample of selected features
print("\nSample of selected features:")
for feat in feature_cols[:10]:
    print(f"   ‚Ä¢ {feat}")
if len(feature_cols) > 10:
    print(f"   ... and {len(feature_cols) - 10} more features")

# Store station information if available
station_info = None
if CONFIG['station_column'] in df.columns:
    station_info = df[CONFIG['station_column']].values
    print(f"\nüìç Station information preserved for analysis")


üéØ FEATURE SELECTION

üéØ Single Station Mode: VALENTIA

üìä Found 10 available numeric features

üìä Feature Selection Options:
  1. Automatic selection (based on correlation/importance)
  2. Pattern-based selection (e.g., 'temp', 'humid', 'mean')
  3. Category-based selection (temperature, humidity, etc.)
  4. Manual feature selection
  5. Use all available features



üëâ Choose option (1-5):  4



üìã Available features:
    1. VALENTIA_cloud_cover
    2. VALENTIA_humidity
    3. VALENTIA_pressure
    4. VALENTIA_global_radiation
    5. VALENTIA_precipitation
    6. VALENTIA_snow_depth
    7. VALENTIA_sunshine
    8. VALENTIA_temp_mean
    9. VALENTIA_temp_min
   10. VALENTIA_temp_max

Enter feature numbers (comma-separated) or 'all'



üëâ Selection:  8



‚úÖ Selected 1 features manually

üìä FINAL FEATURE SELECTION:
   Selected features: 1
   Excluded temporal: 0

Sample of selected features:
   ‚Ä¢ VALENTIA_temp_mean


## üîÑ 5. Data Preparation and Train-Test Split

### Data Preparation Steps:
1. **Feature matrix creation** - Extract selected features into X
2. **Feature reduction** - Apply SelectKBest if too many features (>50)
3. **Train-test split** - Stratified split for classification, random split for regression
4. **Data preservation** - Keep station information for later analysis

In [13]:
# Prepare features and target
X = df[feature_cols]
y = df['target']

# Feature selection if too many features
if X.shape[1] > 50:
    print(f"\n‚ö†Ô∏è Too many features ({X.shape[1]}). Applying feature selection...")
    k = min(30, X.shape[1] // 2)
    
    if problem_type == 'classification':
        selector = SelectKBest(f_classif, k=k)
    else:
        selector = SelectKBest(f_regression, k=k)
    
    X_selected = selector.fit_transform(X, y)
    selected_features = X.columns[selector.get_support()].tolist()
    X = pd.DataFrame(X_selected, columns=selected_features)
    print(f"‚úÖ Reduced features from {len(feature_cols)} to {k}")
else:
    selected_features = feature_cols

# Train-test split with appropriate strategy
if problem_type == 'classification':
    # Stratified split for classification
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f"\n‚úÇÔ∏è Data split (stratified):")
else:
    # Random split for regression
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(f"\n‚úÇÔ∏è Data split (random):")

# Also split station info if available
if station_info is not None:
    train_indices = X_train.index
    test_indices = X_test.index
    station_train = station_info[train_indices]
    station_test = station_info[test_indices]

print(f"   Training: {X_train.shape[0]:,} samples")
print(f"   Testing: {X_test.shape[0]:,} samples")
print(f"   Features: {X_train.shape[1]}")

if problem_type == 'classification':
    print(f"\n   Class balance (train): {dict(y_train.value_counts(normalize=True).round(3))}")
    print(f"   Class balance (test): {dict(y_test.value_counts(normalize=True).round(3))}")
else:
    print(f"\n   Target statistics (train):")
    print(f"     ‚Ä¢ Mean: {y_train.mean():.2f}, Std: {y_train.std():.2f}")
    print(f"   Target statistics (test):")
    print(f"     ‚Ä¢ Mean: {y_test.mean():.2f}, Std: {y_test.std():.2f}")


‚úÇÔ∏è Data split (stratified):
   Training: 18,360 samples
   Testing: 4,590 samples
   Features: 1

   Class balance (train): {0: np.float64(0.949), 1: np.float64(0.051)}
   Class balance (test): {0: np.float64(0.949), 1: np.float64(0.051)}


## üéì 6. Manual Gradient Descent Implementation (opt.)

This section demonstrates gradient descent optimization from scratch for educational purposes.

**Availability**: 
- Only runs for **regression problems** with **single feature**
- Provides visualization of cost function and convergence
- For production use, sklearn's optimized implementations are recommended

In [14]:
# Check if gradient descent demo should be run
print(f"\nüí° Current problem type: {problem_type}")
print(f"   Number of features: {X_train.shape[1]}")

if problem_type == 'regression' and X_train.shape[1] == 1:
    print("\nüéì GRADIENT DESCENT DEMONSTRATION")
    print("="*60)
    print("Running educational gradient descent implementation...")
    
    # Manual Gradient Descent Implementation
    class ManualLinearRegression:
        def __init__(self, learning_rate=0.01, n_iterations=1000):
            self.learning_rate = learning_rate
            self.n_iterations = n_iterations
            self.theta = None
            self.cost_history = []
            
        def add_intercept(self, X):
            """Add intercept term (column of ones) to feature matrix"""
            intercept = np.ones((X.shape[0], 1))
            return np.c_[intercept, X]
        
        def cost_function(self, X, y, theta):
            """Calculate Mean Squared Error cost"""
            m = len(y)
            predictions = X.dot(theta)
            cost = (1/(2*m)) * np.sum(np.square(predictions - y))
            return cost
        
        def gradient_descent(self, X, y, theta):
            """Perform one step of gradient descent"""
            m = len(y)
            predictions = X.dot(theta)
            errors = predictions - y
            gradient = (1/m) * X.T.dot(errors)
            theta = theta - self.learning_rate * gradient
            return theta
        
        def fit(self, X, y):
            """Train the model using gradient descent"""
            # Add intercept
            X = self.add_intercept(X)
            
            # Initialize parameters
            self.theta = np.zeros(X.shape[1])
            
            # Gradient descent
            for i in range(self.n_iterations):
                cost = self.cost_function(X, y, self.theta)
                self.cost_history.append(cost)
                self.theta = self.gradient_descent(X, y, self.theta)
                
                if i % 100 == 0:
                    print(f"   Iteration {i}: Cost = {cost:.4f}")
        
        def predict(self, X):
            """Make predictions"""
            X = self.add_intercept(X)
            return X.dot(self.theta)
    
    # Train manual gradient descent model
    print("\nüîÑ Training with manual gradient descent...")
    manual_model = ManualLinearRegression(learning_rate=0.01, n_iterations=1000)
    manual_model.fit(X_train.values, y_train.values)
    
    # Make predictions
    y_pred_manual = manual_model.predict(X_test.values)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred_manual)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred_manual)
    r2 = r2_score(y_test, y_pred_manual)
    
    print(f"\n‚úÖ Manual Gradient Descent Results:")
    print(f"   Final cost: {manual_model.cost_history[-1]:.4f}")
    print(f"   Test MSE: {mse:.4f}")
    print(f"   Test RMSE: {rmse:.4f}")
    print(f"   Test MAE: {mae:.4f}")
    print(f"   Test R¬≤: {r2:.4f}")
    print(f"   Coefficients: Œ∏‚ÇÄ={manual_model.theta[0]:.4f}, Œ∏‚ÇÅ={manual_model.theta[1]:.4f}")
    
    # Visualize results
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
    # 1. Cost function during training
    axes[0].plot(manual_model.cost_history)
    axes[0].set_title('Cost Function During Training')
    axes[0].set_xlabel('Iteration')
    axes[0].set_ylabel('Cost (MSE)')
    axes[0].grid(True)
    
    # 2. Predictions vs Actual
    axes[1].scatter(X_test.iloc[:, 0], y_test, alpha=0.5, label='Actual')
    axes[1].scatter(X_test.iloc[:, 0], y_pred_manual, alpha=0.5, label='Predicted')
    
    # Add regression line
    x_range = np.linspace(X_test.iloc[:, 0].min(), X_test.iloc[:, 0].max(), 100)
    y_range = manual_model.theta[0] + manual_model.theta[1] * x_range
    axes[1].plot(x_range, y_range, 'r-', linewidth=2, label='Regression Line')
    
    axes[1].set_xlabel(X_test.columns[0])
    axes[1].set_ylabel('Target')
    axes[1].set_title('Predictions vs Actual')
    axes[1].legend()
    axes[1].grid(True)
    
    # 3. Residual plot
    residuals = y_test - y_pred_manual
    axes[2].scatter(y_pred_manual, residuals, alpha=0.5)
    axes[2].axhline(y=0, color='r', linestyle='--')
    axes[2].set_xlabel('Predicted Values')
    axes[2].set_ylabel('Residuals')
    axes[2].set_title('Residual Plot')
    axes[2].grid(True)
    
    plt.tight_layout()
    plt.show()
    
    # Initialize results storage if needed
    if 'results' not in locals():
        results = {}
    if 'best_models' not in locals():
        best_models = {}
    
    # Store results for later comparison
    results['Manual Gradient Descent'] = {
        'metrics': {
            'mse': mse,
            'rmse': rmse,
            'mae': mae,
            'r2_score': r2,
            'cv_score': r2,  # Use R¬≤ as proxy for CV score
            'train_mse': mean_squared_error(y_train, manual_model.predict(X_train.values)),
            'train_r2': r2_score(y_train, manual_model.predict(X_train.values)),
            'training_time': 1.0  # Approximate
        },
        'best_params': {
            'learning_rate': 0.01,
            'n_iterations': 1000
        },
        'predictions': y_pred_manual,
        'probabilities': None
    }
    best_models['Manual Gradient Descent'] = manual_model
    
elif problem_type == 'regression' and X_train.shape[1] > 1:
    print("\n‚ö†Ô∏è Gradient descent demo is only available for single-feature regression.")
    print("   For multi-feature regression, sklearn's optimized models will handle it efficiently.")
    # Initialize results dictionaries
    results = {}
    best_models = {}
else:
    print("\nüìä Skipping gradient descent demo (only for single-feature regression problems).")
    print(f"   Current problem: {problem_type} with {X_train.shape[1]} features")
    # Initialize results dictionaries
    results = {}
    best_models = {}


üí° Current problem type: classification
   Number of features: 1

üìä Skipping gradient descent demo (only for single-feature regression problems).
   Current problem: classification with 1 features


## ü§ñ 7. Model Training with Adaptive Algorithm Selection

### Adaptive Model Selection:

The notebook automatically selects appropriate algorithms based on your problem type:

**For Classification:**
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
- Support Vector Classifier (SVC)
- Neural Network Classifier
- Naive Bayes

**For Regression:**
- Linear Regression
- Ridge Regression
- Lasso Regression
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- Support Vector Regressor (SVR)
- Neural Network Regressor

In [15]:
# Define models based on problem type
print(f"\nü§ñ CONFIGURING MODELS FOR {problem_type.upper()}")
print("="*60)

if problem_type == 'classification':
    # Check class balance for classification
    class_props = y_train.value_counts(normalize=True)
    is_balanced = class_props.min() >= 0.2
    class_weight = None if is_balanced else 'balanced'
    
    print(f"\nüìä Class Balance Analysis:")
    print(f"   Minimum class proportion: {class_props.min():.2%}")
    print(f"   Class weighting: {'Not needed (balanced)' if is_balanced else 'Applied (imbalanced)'}")
    
    # Classification models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, class_weight=class_weight, random_state=42),
        'Decision Tree': DecisionTreeClassifier(max_depth=10, class_weight=class_weight, random_state=42),
        'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10, class_weight=class_weight, n_jobs=-1, random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42),
        'SVM': SVC(kernel='rbf', probability=True, class_weight=class_weight, random_state=42),
        'Neural Network': MLPClassifier(random_state=42),
        'Naive Bayes': GaussianNB()
    }
    
    # Classification parameter grids
    param_grids = {
        'Logistic Regression': {'C': [0.1, 1, 10]},
        'Decision Tree': {'max_depth': [5, 10, 15], 'min_samples_split': [2, 10]},
        'Random Forest': {'n_estimators': [50, 100], 'max_depth': [10, 20]},
        'Gradient Boosting': {'n_estimators': [50, 100], 'learning_rate': [0.1, 0.2]},
        'SVM': {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']},
        'Neural Network': CONFIG['neural_network_params'],
        'Naive Bayes': {}
    }
    
    cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scoring = 'balanced_accuracy' if not is_balanced else 'accuracy'
    
else:  # regression
    print(f"\nüìä Target Variable Analysis:")
    print(f"   Mean: {y_train.mean():.2f}")
    print(f"   Std: {y_train.std():.2f}")
    print(f"   Skewness: {y_train.skew():.2f}")
    
    # Regression models
    models = {
        'Linear Regression': LinearRegression(),
        'Ridge Regression': Ridge(random_state=42),
        'Lasso Regression': Lasso(random_state=42),
        'Decision Tree': DecisionTreeRegressor(max_depth=10, random_state=42),
        'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, n_jobs=-1, random_state=42),
        'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42),
        'SVR': SVR(kernel='rbf'),
        'Neural Network': MLPRegressor(random_state=42)
    }
    
    # Regression parameter grids
    param_grids = {
        'Linear Regression': {},
        'Ridge Regression': {'alpha': [0.1, 1.0, 10.0]},
        'Lasso Regression': {'alpha': [0.1, 1.0, 10.0]},
        'Decision Tree': {'max_depth': [5, 10, 15], 'min_samples_split': [2, 10]},
        'Random Forest': {'n_estimators': [50, 100], 'max_depth': [10, 20]},
        'Gradient Boosting': {'n_estimators': [50, 100], 'learning_rate': [0.1, 0.2]},
        'SVR': {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']},
        'Neural Network': {
            'hidden_layer_sizes': [(50,), (100,), (100,50)],
            'max_iter': [500, 1000],
            'activation': ['relu', 'tanh']
        }
    }
    
    cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)
    scoring = 'neg_mean_squared_error'  # sklearn uses negative MSE

print(f"\nüéØ Training Configuration:")
print(f"   Models to train: {len(models)}")
print(f"   Problem type: {problem_type}")
print(f"   Cross-validation: {cv_strategy.n_splits}-fold")
print(f"   Scoring metric: {scoring}")
if problem_type == 'classification':
    print(f"   Class weight: {class_weight}")

# Calculate total parameter combinations
total_combinations = sum([
    np.prod([len(v) for v in params.values()]) if params else 1 
    for params in param_grids.values()
])
print(f"\nüß† Total parameter combinations to test: {total_combinations:,}")


ü§ñ CONFIGURING MODELS FOR CLASSIFICATION

üìä Class Balance Analysis:
   Minimum class proportion: 5.11%
   Class weighting: Applied (imbalanced)

üéØ Training Configuration:
   Models to train: 7
   Problem type: classification
   Cross-validation: 5-fold
   Scoring metric: balanced_accuracy
   Class weight: balanced

üß† Total parameter combinations to test: 204


### Model Training with Progress Tracking

Train all models with:
- **Hyperparameter tuning** using GridSearchCV
- **Cross-validation** for robust evaluation
- **Progress bars** to track training status
- **Appropriate metrics** for each problem type

In [None]:
# Model training with progress tracking
print(f"\nüöÄ TRAINING {len(models)} MODELS")
print("="*80)

# Calculate parameter combinations for progress estimation
total_params = {}
for name, params in param_grids.items():
    if params:
        n_combos = 1
        for param_values in params.values():
            n_combos *= len(param_values)
        total_params[name] = n_combos
    else:
        total_params[name] = 1

print("\nüìä Parameter combinations per model:")
for name, n_combos in total_params.items():
    print(f"   ‚Ä¢ {name}: {n_combos} combinations")
    if name == 'Neural Network':
        print(f"     üß† Total fits: {n_combos * cv_strategy.n_splits:,}")

print("\n" + "-"*80)

# Training variables
training_times = {}

# Create main progress bar
with tqdm(total=len(models), desc="Overall Progress", position=0, leave=True) as pbar_main:
    
    for model_idx, (name, model) in enumerate(models.items()):
        pbar_main.set_description(f"Training {name}")
        
        print(f"\nüîÑ Model {model_idx + 1}/{len(models)}: {name}")
        print("-"*60)
        
        start_time = time.time()
        
        # GridSearchCV with appropriate settings
        grid = GridSearchCV(
            model, 
            param_grids[name], 
            cv=cv_strategy, 
            scoring=scoring, 
            n_jobs=-1,
            verbose=0
        )
        
        # Show parameter search info
        print(f"  üìä Searching {total_params[name]} parameter combinations...")
        print(f"  üîÑ Using {cv_strategy.n_splits}-fold cross-validation")
        
        # Fit model
        try:
            grid.fit(X_train, y_train)
        except KeyboardInterrupt:
            print("\n\n‚ö†Ô∏è Training interrupted! Saving results so far...")
            break
        
        training_time = time.time() - start_time
        training_times[name] = training_time
        
        # Store best model
        best_models[name] = grid.best_estimator_
        
        # Make predictions
        print("  üìà Making predictions...", end='')
        y_pred = grid.best_estimator_.predict(X_test)
        y_train_pred = grid.best_estimator_.predict(X_train)
        
        # Get probabilities for classification
        y_pred_proba = None
        if problem_type == 'classification' and hasattr(grid.best_estimator_, 'predict_proba'):
            y_pred_proba = grid.best_estimator_.predict_proba(X_test)
            if y_train.nunique() == 2:  # Binary classification
                y_pred_proba = y_pred_proba[:, 1]
        
        print(" Done!")
        
        # Calculate metrics based on problem type
        print("  üìä Calculating metrics...", end='')
        
        if problem_type == 'classification':
            # Classification metrics
            metrics = {
                'accuracy': accuracy_score(y_test, y_pred),
                'balanced_accuracy': balanced_accuracy_score(y_test, y_pred),
                'precision': precision_score(y_test, y_pred, average='weighted', zero_division=0),
                'recall': recall_score(y_test, y_pred, average='weighted', zero_division=0),
                'f1_score': f1_score(y_test, y_pred, average='weighted', zero_division=0),
                'cv_score': grid.best_score_,
                'train_accuracy': accuracy_score(y_train, y_train_pred),
                'train_balanced_accuracy': balanced_accuracy_score(y_train, y_train_pred),
                'training_time': training_time
            }
            
            # Add AUC-ROC for binary classification
            if y_pred_proba is not None and y_train.nunique() == 2:
                metrics['auc_roc'] = roc_auc_score(y_test, y_pred_proba)
            
        else:
            # Regression metrics
            metrics = {
                'mse': mean_squared_error(y_test, y_pred),
                'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),
                'mae': mean_absolute_error(y_test, y_pred),
                'r2_score': r2_score(y_test, y_pred),
                'explained_variance': explained_variance_score(y_test, y_pred),
                'cv_score': -grid.best_score_,  # Convert back from negative MSE
                'train_mse': mean_squared_error(y_train, y_train_pred),
                'train_r2': r2_score(y_train, y_train_pred),
                'training_time': training_time
            }
        
        print(" Done!")
        
        # Store results
        results[name] = {
            'metrics': metrics,
            'best_params': grid.best_params_,
            'predictions': y_pred,
            'probabilities': y_pred_proba
        }
        
        # Display results
        print(f"\n  ‚úÖ Results for {name}:")
        print(f"     ‚Ä¢ Best CV Score: {grid.best_score_:.4f}")
        
        if problem_type == 'classification':
            print(f"     ‚Ä¢ Test Accuracy: {metrics['accuracy']:.4f}")
            print(f"     ‚Ä¢ Test Balanced Accuracy: {metrics['balanced_accuracy']:.4f}")
            if 'auc_roc' in metrics:
                print(f"     ‚Ä¢ Test AUC-ROC: {metrics['auc_roc']:.4f}")
        else:
            print(f"     ‚Ä¢ Test R¬≤: {metrics['r2_score']:.4f}")
            print(f"     ‚Ä¢ Test RMSE: {metrics['rmse']:.4f}")
            print(f"     ‚Ä¢ Test MAE: {metrics['mae']:.4f}")
        
        print(f"     ‚Ä¢ Training Time: {training_time:.2f}s")
        
        # Special reporting for Neural Network
        if name == 'Neural Network' and grid.best_params_:
            print(f"\n  üß† Best Neural Network Configuration:")
            for param, value in grid.best_params_.items():
                print(f"     ‚Ä¢ {param}: {value}")
        
        # Update progress bar
        pbar_main.update(1)
        
        # Time estimation
        if model_idx < len(models) - 1:
            elapsed_time = sum(training_times.values())
            avg_time = elapsed_time / (model_idx + 1)
            remaining = (len(models) - model_idx - 1) * avg_time
            print(f"\n  ‚è±Ô∏è Estimated time remaining: {remaining:.1f}s")

print("\n" + "="*80)
print("‚úÖ All models trained successfully!")
print(f"‚è±Ô∏è Total training time: {sum(training_times.values()):.2f}s")

# Summary table
print("\nüìä Quick Summary:")
print("-"*80)

if problem_type == 'classification':
    print(f"{'Model':<25} {'CV Score':>10} {'Test Acc':>10} {'Balanced':>10} {'Time (s)':>10}")
    print("-"*80)
    for name in models.keys():
        if name in results:
            m = results[name]['metrics']
            print(f"{name:<25} {m['cv_score']:>10.4f} {m['accuracy']:>10.4f} "
                  f"{m['balanced_accuracy']:>10.4f} {m['training_time']:>10.2f}")
else:
    print(f"{'Model':<25} {'CV MSE':>10} {'Test R¬≤':>10} {'RMSE':>10} {'Time (s)':>10}")
    print("-"*80)
    for name in models.keys():
        if name in results:
            m = results[name]['metrics']
            print(f"{name:<25} {m['cv_score']:>10.4f} {m['r2_score']:>10.4f} "
                  f"{m['rmse']:>10.4f} {m['training_time']:>10.2f}")


üöÄ TRAINING 7 MODELS

üìä Parameter combinations per model:
   ‚Ä¢ Logistic Regression: 3 combinations
   ‚Ä¢ Decision Tree: 6 combinations
   ‚Ä¢ Random Forest: 4 combinations
   ‚Ä¢ Gradient Boosting: 4 combinations
   ‚Ä¢ SVM: 6 combinations
   ‚Ä¢ Neural Network: 180 combinations
     üß† Total fits: 900
   ‚Ä¢ Naive Bayes: 1 combinations

--------------------------------------------------------------------------------


Training Logistic Regression:   0%|                                                              | 0/7 [00:00<?, ?it/s]


üîÑ Model 1/7: Logistic Regression
------------------------------------------------------------
  üìä Searching 3 parameter combinations...
  üîÑ Using 5-fold cross-validation


Training Decision Tree:  14%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå                                                   | 1/7 [00:04<00:28,  4.80s/it]

  üìà Making predictions... Done!
  üìä Calculating metrics... Done!

  ‚úÖ Results for Logistic Regression:
     ‚Ä¢ Best CV Score: 0.8556
     ‚Ä¢ Test Accuracy: 0.8244
     ‚Ä¢ Test Balanced Accuracy: 0.8431
     ‚Ä¢ Test AUC-ROC: 0.9230
     ‚Ä¢ Training Time: 4.75s

  ‚è±Ô∏è Estimated time remaining: 28.5s

üîÑ Model 2/7: Decision Tree
------------------------------------------------------------
  üìä Searching 6 parameter combinations...
  üîÑ Using 5-fold cross-validation


Training Random Forest:  29%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè                                          | 2/7 [00:09<00:22,  4.45s/it]

  üìà Making predictions... Done!
  üìä Calculating metrics... Done!

  ‚úÖ Results for Decision Tree:
     ‚Ä¢ Best CV Score: 0.8652
     ‚Ä¢ Test Accuracy: 0.7749
     ‚Ä¢ Test Balanced Accuracy: 0.8613
     ‚Ä¢ Test AUC-ROC: 0.9224
     ‚Ä¢ Training Time: 4.16s

  ‚è±Ô∏è Estimated time remaining: 22.3s

üîÑ Model 3/7: Random Forest
------------------------------------------------------------
  üìä Searching 4 parameter combinations...
  üîÑ Using 5-fold cross-validation
  üìà Making predictions... Done!
  üìä Calculating metrics...

Training Gradient Boosting:  43%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà                                | 3/7 [00:12<00:16,  4.18s/it]

 Done!

  ‚úÖ Results for Random Forest:
     ‚Ä¢ Best CV Score: 0.8575
     ‚Ä¢ Test Accuracy: 0.7850
     ‚Ä¢ Test Balanced Accuracy: 0.8565
     ‚Ä¢ Test AUC-ROC: 0.9193
     ‚Ä¢ Training Time: 3.64s

  ‚è±Ô∏è Estimated time remaining: 16.7s

üîÑ Model 4/7: Gradient Boosting
------------------------------------------------------------
  üìä Searching 4 parameter combinations...
  üîÑ Using 5-fold cross-validation


Training SVM:  57%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà                              | 4/7 [00:18<00:14,  4.83s/it]

  üìà Making predictions... Done!
  üìä Calculating metrics... Done!

  ‚úÖ Results for Gradient Boosting:
     ‚Ä¢ Best CV Score: 0.5370
     ‚Ä¢ Test Accuracy: 0.9490
     ‚Ä¢ Test Balanced Accuracy: 0.5363
     ‚Ä¢ Test AUC-ROC: 0.9201
     ‚Ä¢ Training Time: 5.70s

  ‚è±Ô∏è Estimated time remaining: 13.7s

üîÑ Model 5/7: SVM
------------------------------------------------------------
  üìä Searching 6 parameter combinations...
  üîÑ Using 5-fold cross-validation
  üìà Making predictions...

Training Neural Network:  71%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç                | 5/7 [05:47<04:03, 121.79s/it]

 Done!
  üìä Calculating metrics... Done!

  ‚úÖ Results for SVM:
     ‚Ä¢ Best CV Score: 0.8694
     ‚Ä¢ Test Accuracy: 0.7749
     ‚Ä¢ Test Balanced Accuracy: 0.8613
     ‚Ä¢ Test AUC-ROC: 0.8925
     ‚Ä¢ Training Time: 305.09s

  ‚è±Ô∏è Estimated time remaining: 129.3s

üîÑ Model 6/7: Neural Network
------------------------------------------------------------
  üìä Searching 180 parameter combinations...
  üîÑ Using 5-fold cross-validation


## üìä 8. Model Comparison and Analysis

### Analysis Components:
1. **Performance comparison** - Compare all models on relevant metrics
2. **Overfitting detection** - Identify models with large train-test gaps
3. **Complex model analysis** - Special focus on ensemble and neural network models
4. **Best model identification** - Select top performer based on problem type

In [None]:
# Create comprehensive comparison dataframe
comparison_data = []
for name, result in results.items():
    row = {'Model': name}
    row.update(result['metrics'])
    
    # Calculate overfitting metrics based on problem type
    if problem_type == 'classification':
        row['overfitting_score'] = row['train_balanced_accuracy'] - row['balanced_accuracy']
        row['is_overfitting'] = row['overfitting_score'] > CONFIG['overfitting_threshold']
    else:
        row['overfitting_score'] = row['train_r2'] - row['r2_score']
        row['is_overfitting'] = row['overfitting_score'] > CONFIG['overfitting_threshold']
    
    comparison_data.append(row)

comparison_df = pd.DataFrame(comparison_data)

# Sort by appropriate metric
if problem_type == 'classification':
    comparison_df = comparison_df.sort_values('balanced_accuracy', ascending=False)
    primary_metric = 'balanced_accuracy'
else:
    comparison_df = comparison_df.sort_values('r2_score', ascending=False)
    primary_metric = 'r2_score'

print("\nüìä MODEL PERFORMANCE COMPARISON")
print("="*80)

if problem_type == 'classification':
    display_cols = ['Model', 'balanced_accuracy', 'accuracy', 'precision', 'recall', 'f1_score', 
                    'train_balanced_accuracy', 'overfitting_score', 'training_time']
else:
    display_cols = ['Model', 'r2_score', 'rmse', 'mae', 'explained_variance',
                    'train_r2', 'overfitting_score', 'training_time']

print(comparison_df[display_cols].to_string(index=False, float_format='%.4f'))

# Overfitting analysis
print("\nüîç OVERFITTING ANALYSIS")
print("="*60)
overfitting_models = comparison_df[comparison_df['is_overfitting']]
if len(overfitting_models) > 0:
    print(f"‚ö†Ô∏è Models showing overfitting (train-test difference > {CONFIG['overfitting_threshold']*100}%):")
    for _, model in overfitting_models.iterrows():
        print(f"   ‚Ä¢ {model['Model']}: {model['overfitting_score']*100:.2f}% difference")
else:
    print("‚úÖ No models show significant overfitting!")

# Complex model analysis
complex_models = ['Neural Network', 'Random Forest', 'Gradient Boosting', 'SVM', 'SVR']
print("\nüìà COMPLEX MODEL ANALYSIS")
print("-"*60)
for model_name in complex_models:
    if model_name in comparison_df['Model'].values:
        model_data = comparison_df[comparison_df['Model'] == model_name].iloc[0]
        print(f"\n{model_name}:")
        if problem_type == 'classification':
            print(f"   ‚Ä¢ Test Balanced Accuracy: {model_data['balanced_accuracy']:.4f}")
        else:
            print(f"   ‚Ä¢ Test R¬≤: {model_data['r2_score']:.4f}")
            print(f"   ‚Ä¢ Test RMSE: {model_data['rmse']:.4f}")
        print(f"   ‚Ä¢ Overfitting: {'Yes' if model_data['is_overfitting'] else 'No'} ({model_data['overfitting_score']*100:.2f}%)")
        print(f"   ‚Ä¢ Training Time: {model_data['training_time']:.2f}s")

# Identify best model
best_model_name = comparison_df.iloc[0]['Model']
best_model = best_models[best_model_name]
best_metrics = comparison_df.iloc[0]

print(f"\nüèÜ BEST MODEL: {best_model_name}")
print("-"*60)
if problem_type == 'classification':
    print(f"   ‚Ä¢ Balanced Accuracy: {best_metrics['balanced_accuracy']:.4f}")
    print(f"   ‚Ä¢ Accuracy: {best_metrics['accuracy']:.4f}")
    print(f"   ‚Ä¢ F1 Score: {best_metrics['f1_score']:.4f}")
else:
    print(f"   ‚Ä¢ R¬≤ Score: {best_metrics['r2_score']:.4f}")
    print(f"   ‚Ä¢ RMSE: {best_metrics['rmse']:.4f}")
    print(f"   ‚Ä¢ MAE: {best_metrics['mae']:.4f}")

## üåç 9. Station/Group Analysis (if applicable)

If the dataset contains station or group information, analyze model performance across different stations/groups to identify:
- **High-performance stations** - Where the model works best
- **Challenging stations** - Where predictions are less accurate
- **Performance distribution** - Statistical summary across all stations

In [None]:
# Perform station-wise analysis if station information is available
station_results_df = None

if station_info is not None and CONFIG['station_column'] in df.columns:
    print("\nüåç STATION-WISE ANALYSIS")
    print("="*60)
    
    # Get best model predictions
    best_pred = results[best_model_name]['predictions']
    
    # Calculate per-station metrics
    unique_stations = np.unique(station_test)
    station_metrics = {}
    
    for station in unique_stations:
        station_mask = station_test == station
        if np.sum(station_mask) > 10:  # Only analyze stations with sufficient samples
            station_y_true = y_test[station_mask]
            station_y_pred = best_pred[station_mask]
            
            if problem_type == 'classification':
                station_acc = accuracy_score(station_y_true, station_y_pred)
                station_metrics[station] = {
                    'accuracy': station_acc,
                    'n_samples': np.sum(station_mask),
                    'n_correct': np.sum(station_y_true == station_y_pred)
                }
            else:
                station_r2 = r2_score(station_y_true, station_y_pred)
                station_rmse = np.sqrt(mean_squared_error(station_y_true, station_y_pred))
                station_metrics[station] = {
                    'r2_score': station_r2,
                    'rmse': station_rmse,
                    'n_samples': np.sum(station_mask)
                }
    
    # Create station results dataframe
    if problem_type == 'classification':
        station_results_df = pd.DataFrame([
            {'station': k, 'accuracy': v['accuracy'], 'n_samples': v['n_samples']} 
            for k, v in station_metrics.items()
        ]).sort_values('accuracy', ascending=False)
        
        # Report high-accuracy stations
        high_acc_stations = station_results_df[station_results_df['accuracy'] >= CONFIG['high_accuracy_threshold']]
        print(f"\nüèÜ Stations with ‚â•{CONFIG['high_accuracy_threshold']*100:.0f}% accuracy: {len(high_acc_stations)}")
        if len(high_acc_stations) > 0:
            for _, row in high_acc_stations.head(10).iterrows():
                print(f"   ‚Ä¢ {row['station']}: {row['accuracy']*100:.2f}% ({row['n_samples']} samples)")
        
        # Overall statistics
        acc_values = station_results_df['accuracy'].values
        print(f"\nüìä Station Accuracy Statistics:")
        print(f"   ‚Ä¢ Mean: {np.mean(acc_values)*100:.2f}%")
        print(f"   ‚Ä¢ Std: {np.std(acc_values)*100:.2f}%")
        print(f"   ‚Ä¢ Min: {np.min(acc_values)*100:.2f}%")
        print(f"   ‚Ä¢ Max: {np.max(acc_values)*100:.2f}%")
        
    else:  # regression
        station_results_df = pd.DataFrame([
            {'station': k, 'r2_score': v['r2_score'], 'rmse': v['rmse'], 'n_samples': v['n_samples']} 
            for k, v in station_metrics.items()
        ]).sort_values('r2_score', ascending=False)
        
        # Report best performing stations
        print(f"\nüèÜ Top performing stations by R¬≤:")
        for _, row in station_results_df.head(10).iterrows():
            print(f"   ‚Ä¢ {row['station']}: R¬≤={row['r2_score']:.3f}, RMSE={row['rmse']:.2f} ({row['n_samples']} samples)")
        
        # Overall statistics
        r2_values = station_results_df['r2_score'].values
        print(f"\nüìä Station R¬≤ Statistics:")
        print(f"   ‚Ä¢ Mean: {np.mean(r2_values):.3f}")
        print(f"   ‚Ä¢ Std: {np.std(r2_values):.3f}")
        print(f"   ‚Ä¢ Min: {np.min(r2_values):.3f}")
        print(f"   ‚Ä¢ Max: {np.max(r2_values):.3f}")
    
    print(f"\nüìç Total stations analyzed: {len(station_metrics)}")
else:
    print("\n‚ö†Ô∏è Station-wise analysis not available (no station information in dataset)")

## üìà 10. Comprehensive Visualizations

Create interactive visualizations adapted to the problem type:

**For Classification:**
- Model performance comparison
- Confusion matrix
- ROC curves (for binary classification)
- Feature importance

**For Regression:**
- Model performance comparison
- Actual vs Predicted scatter plot
- Residual analysis
- Feature importance

In [None]:
# Create comprehensive visualizations based on problem type
best_pred = results[best_model_name]['predictions']

print("\nüìä GENERATING VISUALIZATIONS")
print("="*60)

if problem_type == 'classification':
    # Classification visualizations
    n_classes = y_train.nunique()
    
    # Create subplots
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=('Model Performance Comparison', 'Confusion Matrix',
                       'Training Time Comparison', 'Overfitting Analysis'),
        specs=[[{'type': 'bar'}, {'type': 'heatmap'}],
               [{'type': 'bar'}, {'type': 'scatter'}]]
    )
    
    # 1. Model Performance Comparison
    fig.add_trace(
        go.Bar(
            x=comparison_df['Model'],
            y=comparison_df['balanced_accuracy'],
            name='Balanced Accuracy',
            marker_color='lightblue',
            text=[f"{val:.3f}" for val in comparison_df['balanced_accuracy']],
            textposition='auto'
        ),
        row=1, col=1
    )
    
    # 2. Confusion Matrix
    cm = confusion_matrix(y_test, best_pred)
    
    # Normalize confusion matrix
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    # Create labels for confusion matrix
    if n_classes <= 10:
        labels = [[f"{cm[i,j]}\n({cm_normalized[i,j]:.1%})" for j in range(n_classes)] for i in range(n_classes)]
    else:
        labels = cm
    
    fig.add_trace(
        go.Heatmap(
            z=cm_normalized,
            text=labels,
            texttemplate='%{text}',
            colorscale='Blues',
            showscale=True
        ),
        row=1, col=2
    )
    
    # 3. Training Time Comparison
    fig.add_trace(
        go.Bar(
            x=comparison_df['Model'],
            y=comparison_df['training_time'],
            name='Training Time (s)',
            marker_color='lightcoral',
            text=[f"{val:.1f}s" for val in comparison_df['training_time']],
            textposition='auto'
        ),
        row=2, col=1
    )
    
    # 4. Overfitting Analysis
    fig.add_trace(
        go.Scatter(
            x=comparison_df['train_balanced_accuracy'],
            y=comparison_df['balanced_accuracy'],
            mode='markers+text',
            text=comparison_df['Model'],
            textposition='top center',
            marker=dict(size=10, color='darkblue'),
            name='Models'
        ),
        row=2, col=2
    )
    
    # Add diagonal line for overfitting plot
    fig.add_trace(
        go.Scatter(
            x=[0, 1],
            y=[0, 1],
            mode='lines',
            line=dict(dash='dash', color='red'),
            name='No Overfitting Line',
            showlegend=False
        ),
        row=2, col=2
    )
    
    # Update layout
    fig.update_xaxes(title_text='Model', row=1, col=1)
    fig.update_yaxes(title_text='Balanced Accuracy', row=1, col=1)
    fig.update_xaxes(title_text='Predicted', row=1, col=2)
    fig.update_yaxes(title_text='Actual', row=1, col=2)
    fig.update_xaxes(title_text='Model', row=2, col=1, tickangle=45)
    fig.update_yaxes(title_text='Training Time (s)', row=2, col=1)
    fig.update_xaxes(title_text='Train Balanced Accuracy', row=2, col=2)
    fig.update_yaxes(title_text='Test Balanced Accuracy', row=2, col=2)
    
    fig.update_layout(
        height=800,
        title_text=f'Classification Model Analysis - {best_model_name}',
        showlegend=False
    )
    
    fig.show()
    
    # Additional ROC curve for binary classification
    if n_classes == 2 and any(results[name]['probabilities'] is not None for name in results):
        print("\nüìà Generating ROC curves for binary classification...")
        
        from sklearn.metrics import roc_curve, auc
        
        plt.figure(figsize=(10, 8))
        
        for name in results:
            if results[name]['probabilities'] is not None:
                fpr, tpr, _ = roc_curve(y_test, results[name]['probabilities'])
                roc_auc = auc(fpr, tpr)
                plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})')
        
        plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('ROC Curves - Model Comparison')
        plt.legend(loc="lower right")
        plt.grid(True, alpha=0.3)
        plt.show()
    
else:
    # Regression visualizations
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=('Model Performance (R¬≤)', 'Actual vs Predicted',
                       'Residual Analysis', 'Model Comparison'),
        specs=[[{'type': 'bar'}, {'type': 'scatter'}],
               [{'type': 'scatter'}, {'type': 'bar'}]]
    )
    
    # 1. Model Performance Comparison (R¬≤)
    fig.add_trace(
        go.Bar(
            x=comparison_df['Model'],
            y=comparison_df['r2_score'],
            name='R¬≤ Score',
            marker_color='lightgreen',
            text=[f"{val:.3f}" for val in comparison_df['r2_score']],
            textposition='auto'
        ),
        row=1, col=1
    )
    
    # 2. Actual vs Predicted
    fig.add_trace(
        go.Scatter(
            x=y_test,
            y=best_pred,
            mode='markers',
            marker=dict(size=5, color='blue', opacity=0.5),
            name='Predictions'
        ),
        row=1, col=2
    )
    
    # Add perfect prediction line
    min_val = min(y_test.min(), best_pred.min())
    max_val = max(y_test.max(), best_pred.max())
    fig.add_trace(
        go.Scatter(
            x=[min_val, max_val],
            y=[min_val, max_val],
            mode='lines',
            line=dict(color='red', dash='dash'),
            name='Perfect Prediction',
            showlegend=False
        ),
        row=1, col=2
    )
    
    # 3. Residual Plot
    residuals = y_test - best_pred
    fig.add_trace(
        go.Scatter(
            x=best_pred,
            y=residuals,
            mode='markers',
            marker=dict(size=5, color='purple', opacity=0.5),
            name='Residuals'
        ),
        row=2, col=1
    )
    
    # Add zero line
    fig.add_trace(
        go.Scatter(
            x=[best_pred.min(), best_pred.max()],
            y=[0, 0],
            mode='lines',
            line=dict(color='red', dash='dash'),
            showlegend=False
        ),
        row=2, col=1
    )
    
    # 4. RMSE Comparison
    fig.add_trace(
        go.Bar(
            x=comparison_df['Model'],
            y=comparison_df['rmse'],
            name='RMSE',
            marker_color='lightcoral',
            text=[f"{val:.2f}" for val in comparison_df['rmse']],
            textposition='auto'
        ),
        row=2, col=2
    )
    
    # Update layout
    fig.update_xaxes(title_text='Model', row=1, col=1, tickangle=45)
    fig.update_yaxes(title_text='R¬≤ Score', row=1, col=1)
    fig.update_xaxes(title_text='Actual', row=1, col=2)
    fig.update_yaxes(title_text='Predicted', row=1, col=2)
    fig.update_xaxes(title_text='Predicted', row=2, col=1)
    fig.update_yaxes(title_text='Residuals', row=2, col=1)
    fig.update_xaxes(title_text='Model', row=2, col=2, tickangle=45)
    fig.update_yaxes(title_text='RMSE', row=2, col=2)
    
    fig.update_layout(
        height=800,
        title_text=f'Regression Model Analysis - {best_model_name}',
        showlegend=False
    )
    
    fig.show()
    
    # Additional residual distribution plot
    print("\nüìä Generating residual distribution analysis...")
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Histogram of residuals
    axes[0].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
    axes[0].axvline(x=0, color='red', linestyle='--', linewidth=2)
    axes[0].set_xlabel('Residuals')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Distribution of Residuals')
    axes[0].grid(True, alpha=0.3)
    
    # Q-Q plot
    from scipy import stats
    stats.probplot(residuals, dist="norm", plot=axes[1])
    axes[1].set_title('Q-Q Plot')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

### Feature Importance Visualization

For models that support feature importance (tree-based models) or coefficients (linear models), visualize which features contribute most to predictions.

In [None]:
# Feature importance visualization
print("\nüîç FEATURE IMPORTANCE ANALYSIS")
print("="*60)

if hasattr(best_model, 'feature_importances_'):
    # Tree-based models
    importance_df = pd.DataFrame({
        'feature': X_train.columns,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False).head(20)
    
    fig = px.bar(importance_df, y='feature', x='importance', orientation='h',
                title=f'Top 20 Feature Importances - {best_model_name}',
                labels={'importance': 'Importance Score', 'feature': 'Feature'})
    fig.update_layout(height=600)
    fig.show()
    
    print("\nTop 10 most important features:")
    for _, row in importance_df.head(10).iterrows():
        print(f"   ‚Ä¢ {row['feature']}: {row['importance']:.4f}")
        
elif hasattr(best_model, 'coef_'):
    # Linear models
    if problem_type == 'classification' and len(best_model.coef_.shape) > 1:
        # Multi-class classification
        coef = np.abs(best_model.coef_).mean(axis=0)
    else:
        coef = np.abs(best_model.coef_).flatten()
    
    coef_df = pd.DataFrame({
        'feature': X_train.columns,
        'coefficient': coef
    }).sort_values('coefficient', ascending=False).head(20)
    
    fig = px.bar(coef_df, y='feature', x='coefficient', orientation='h',
                title=f'Top 20 Feature Coefficients (Absolute) - {best_model_name}',
                labels={'coefficient': 'Absolute Coefficient', 'feature': 'Feature'})
    fig.update_layout(height=600)
    fig.show()
    
    print("\nTop 10 most influential features:")
    for _, row in coef_df.head(10).iterrows():
        print(f"   ‚Ä¢ {row['feature']}: {row['coefficient']:.4f}")
        
else:
    print(f"\n‚ö†Ô∏è Feature importance not available for {best_model_name}")
    print("   (Only available for tree-based and linear models)")

## üìã 11. Comprehensive Analysis Report and Recommendations

Generate a detailed report with:
1. **Performance summary** across all models
2. **Overfitting analysis** to identify potential issues
3. **Model recommendations** based on multiple criteria
4. **Actionable insights** for deployment

In [None]:
print("\n" + "="*80)
print("üìã COMPREHENSIVE ANALYSIS REPORT")
print("="*80)

# 1. Performance Summary
print("\n1Ô∏è‚É£ PERFORMANCE SUMMARY")
print("-"*60)
print(f"Problem Type: {problem_type.capitalize()}")
print(f"Target Variable: {target_col}")
print(f"Number of Features: {X_train.shape[1]}")
print(f"Training Samples: {X_train.shape[0]:,}")
print(f"Test Samples: {X_test.shape[0]:,}")

if problem_type == 'classification':
    print(f"\nNumber of Classes: {y_train.nunique()}")
    print(f"Class Distribution: {dict(y_train.value_counts())}")
    print(f"\nBest Model Performance:")
    print(f"   ‚Ä¢ Model: {best_model_name}")
    print(f"   ‚Ä¢ Accuracy: {best_metrics['accuracy']:.4f}")
    print(f"   ‚Ä¢ Balanced Accuracy: {best_metrics['balanced_accuracy']:.4f}")
    print(f"   ‚Ä¢ F1 Score: {best_metrics['f1_score']:.4f}")
else:
    print(f"\nTarget Statistics:")
    print(f"   ‚Ä¢ Mean: {y_train.mean():.2f}")
    print(f"   ‚Ä¢ Std: {y_train.std():.2f}")
    print(f"   ‚Ä¢ Range: [{y_train.min():.2f}, {y_train.max():.2f}]")
    print(f"\nBest Model Performance:")
    print(f"   ‚Ä¢ Model: {best_model_name}")
    print(f"   ‚Ä¢ R¬≤ Score: {best_metrics['r2_score']:.4f}")
    print(f"   ‚Ä¢ RMSE: {best_metrics['rmse']:.4f}")
    print(f"   ‚Ä¢ MAE: {best_metrics['mae']:.4f}")

# 2. Station Analysis Summary (if applicable)
if station_results_df is not None:
    print("\n2Ô∏è‚É£ STATION ANALYSIS SUMMARY")
    print("-"*60)
    if problem_type == 'classification':
        high_perf_threshold = 0.9
        high_perf_stations = station_results_df[station_results_df['accuracy'] >= high_perf_threshold]
        low_perf_stations = station_results_df[station_results_df['accuracy'] < 0.7]
        
        print(f"High Performance Stations (‚â•{high_perf_threshold*100:.0f}%): {len(high_perf_stations)}")
        print(f"Low Performance Stations (<70%): {len(low_perf_stations)}")
        
        if len(low_perf_stations) > 0:
            print("\n‚ö†Ô∏è Stations requiring attention:")
            for _, row in low_perf_stations.head(5).iterrows():
                print(f"   ‚Ä¢ {row['station']}: {row['accuracy']*100:.1f}% accuracy")
    else:
        high_perf_stations = station_results_df[station_results_df['r2_score'] >= 0.8]
        low_perf_stations = station_results_df[station_results_df['r2_score'] < 0.5]
        
        print(f"High Performance Stations (R¬≤ ‚â• 0.8): {len(high_perf_stations)}")
        print(f"Low Performance Stations (R¬≤ < 0.5): {len(low_perf_stations)}")

# 3. Overfitting Analysis
print("\n3Ô∏è‚É£ OVERFITTING ANALYSIS")
print("-"*60)

overfitting_summary = comparison_df.groupby('is_overfitting').size()
print(f"Models without overfitting: {overfitting_summary.get(False, 0)}")
print(f"Models with overfitting: {overfitting_summary.get(True, 0)}")

if len(overfitting_models) > 0:
    print("\nOverfitting details:")
    for _, model in overfitting_models.iterrows():
        print(f"   ‚Ä¢ {model['Model']}: {model['overfitting_score']*100:.1f}% gap")
    print("\nüí° Recommendation: Consider regularization or ensemble methods")

# 4. Model Recommendations
print("\n4Ô∏è‚É£ MODEL RECOMMENDATIONS")
print("-"*60)

# Score models based on multiple criteria
recommendation_scores = []
for _, model in comparison_df.iterrows():
    score = 0
    reasons = []
    
    # Performance score (40% weight)
    if problem_type == 'classification':
        perf_score = model['balanced_accuracy'] * 40
    else:
        perf_score = model['r2_score'] * 40
    score += perf_score
    
    # Speed score (30% weight)
    if model['training_time'] < np.percentile(comparison_df['training_time'], 25):
        score += 30
        reasons.append("fast training")
    elif model['training_time'] < np.percentile(comparison_df['training_time'], 50):
        score += 20
        reasons.append("moderate speed")
    elif model['training_time'] < np.percentile(comparison_df['training_time'], 75):
        score += 10
    
    # Generalization score (30% weight)
    if not model['is_overfitting']:
        score += 30
        reasons.append("no overfitting")
    elif model['overfitting_score'] < 0.03:
        score += 20
        reasons.append("minimal overfitting")
    elif model['overfitting_score'] < 0.05:
        score += 10
        reasons.append("acceptable overfitting")
    
    # Model complexity bonus
    simple_models = ['Linear Regression', 'Logistic Regression', 'Naive Bayes']
    if model['Model'] in simple_models and score > 70:
        reasons.append("interpretable")
    
    recommendation_scores.append({
        'Model': model['Model'],
        'Score': score,
        'Performance': perf_score/40,
        'Training_Time': model['training_time'],
        'Overfitting': model['overfitting_score'],
        'Reasons': ', '.join(reasons)
    })

recommendation_df = pd.DataFrame(recommendation_scores).sort_values('Score', ascending=False)

print("\nüèÜ TOP 3 RECOMMENDATIONS:")
for i, row in enumerate(recommendation_df.head(3).iterrows(), 1):
    _, rec = row
    print(f"\n{i}. {rec['Model']} (Score: {rec['Score']:.1f}/100)")
    print(f"   ‚Ä¢ Performance: {rec['Performance']:.2%}")
    print(f"   ‚Ä¢ Training Time: {rec['Training_Time']:.2f}s")
    print(f"   ‚Ä¢ Overfitting: {rec['Overfitting']*100:.1f}%")
    if rec['Reasons']:
        print(f"   ‚Ä¢ Strengths: {rec['Reasons']}")

# Final recommendation
best_overall = recommendation_df.iloc[0]
print(f"\nüìå FINAL RECOMMENDATION: {best_overall['Model']}")
print(f"\nThis model provides the best balance of:")
print(f"   ‚Ä¢ Performance: {best_overall['Performance']:.2%}")
print(f"   ‚Ä¢ Speed: {best_overall['Training_Time']:.2f}s training time")
print(f"   ‚Ä¢ Generalization: {best_overall['Overfitting']*100:.1f}% train-test gap")

# Deployment considerations
print("\n5Ô∏è‚É£ DEPLOYMENT CONSIDERATIONS")
print("-"*60)
print("‚úì Model is ready for deployment")
print("‚úì Consider implementing:")
print("   ‚Ä¢ Input validation for feature ranges")
print("   ‚Ä¢ Model versioning and monitoring")
print("   ‚Ä¢ Regular retraining schedule")
if station_results_df is not None and len(low_perf_stations) > 0:
    print("   ‚Ä¢ Special handling for low-performance stations")

## üíæ 12. Save Results and Models

Save all analysis outputs:
- **Model files** - Serialized models in pickle format
- **Performance metrics** - CSV files with detailed results
- **Analysis summary** - JSON file with complete analysis
- **Predictions** - Actual vs predicted values for validation
- **Visualizations** - Optional saving of plots

In [None]:
# Create output directory with timestamp
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
analysis_type = 'classification' if problem_type == 'classification' else 'regression'
output_dir = output_path / f'{analysis_type}_analysis_{timestamp}'
output_dir.mkdir(parents=True, exist_ok=True)

print(f"\nüíæ SAVING RESULTS")
print("="*60)
print(f"Output directory: {output_dir}")

# 1. Save comprehensive summary
summary = {
    'analysis_info': {
        'date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'problem_type': problem_type,
        'target_variable': target_col,
        'n_features': X_train.shape[1],
        'n_train_samples': X_train.shape[0],
        'n_test_samples': X_test.shape[0]
    },
    'configuration': CONFIG,
    'feature_selection': {
        'method': 'automatic' if 'selection_option' not in locals() else selection_option,
        'n_features_selected': len(selected_features),
        'features': selected_features
    },
    'model_comparison': comparison_df.to_dict('records'),
    'best_model': {
        'name': best_model_name,
        'parameters': results[best_model_name]['best_params'],
        'metrics': results[best_model_name]['metrics']
    },
    'recommendations': recommendation_df.head(3).to_dict('records'),
    'overfitting_analysis': {
        'models_with_overfitting': list(overfitting_models['Model'].values) if len(overfitting_models) > 0 else [],
        'overfitting_threshold': CONFIG['overfitting_threshold']
    }
}

# Add station analysis if available
if station_results_df is not None:
    if problem_type == 'classification':
        summary['station_analysis'] = {
            'n_stations': len(station_results_df),
            'mean_accuracy': station_results_df['accuracy'].mean(),
            'std_accuracy': station_results_df['accuracy'].std(),
            'best_stations': station_results_df.head(5).to_dict('records'),
            'worst_stations': station_results_df.tail(5).to_dict('records')
        }
    else:
        summary['station_analysis'] = {
            'n_stations': len(station_results_df),
            'mean_r2': station_results_df['r2_score'].mean(),
            'std_r2': station_results_df['r2_score'].std(),
            'best_stations': station_results_df.head(5).to_dict('records'),
            'worst_stations': station_results_df.tail(5).to_dict('records')
        }

# Save JSON summary
with open(output_dir / 'analysis_summary.json', 'w') as f:
    json.dump(summary, f, indent=4)
print("‚úÖ Saved analysis summary")

# 2. Save detailed DataFrames
comparison_df.to_csv(output_dir / 'model_comparison.csv', index=False)
recommendation_df.to_csv(output_dir / 'model_recommendations.csv', index=False)
print("‚úÖ Saved model comparison and recommendations")

if station_results_df is not None:
    station_results_df.to_csv(output_dir / 'station_results.csv', index=False)
    print("‚úÖ Saved station-wise results")

# 3. Save predictions
predictions_df = pd.DataFrame({
    'actual': y_test,
    'predicted': results[best_model_name]['predictions']
})

if problem_type == 'regression':
    predictions_df['error'] = predictions_df['actual'] - predictions_df['predicted']
    predictions_df['abs_error'] = np.abs(predictions_df['error'])
    predictions_df['pct_error'] = (predictions_df['error'] / predictions_df['actual'] * 100).round(2)

if station_info is not None:
    predictions_df['station'] = station_test

predictions_df.to_csv(output_dir / 'predictions.csv', index=False)
print("‚úÖ Saved predictions")

# 4. Save best model
model_filename = f'best_model_{best_model_name.lower().replace(" ", "_")}.pkl'
joblib.dump(best_models[best_model_name], output_dir / model_filename)
print(f"‚úÖ Saved best model: {model_filename}")

# 5. Save all models (optional)
save_all = input("\nüíæ Save all trained models? (y/n): ").strip().lower()
if save_all == 'y':
    models_dir = output_dir / 'all_models'
    models_dir.mkdir(exist_ok=True)
    
    for name, model in best_models.items():
        filename = f"{name.lower().replace(' ', '_')}.pkl"
        joblib.dump(model, models_dir / filename)
    
    print(f"‚úÖ Saved all {len(best_models)} models")

# 6. Generate and save model card
model_card = f"""
# Model Card: {best_model_name}

## Model Details
- **Model Type**: {best_model_name}
- **Problem Type**: {problem_type.capitalize()}
- **Target Variable**: {target_col}
- **Number of Features**: {X_train.shape[1]}
- **Training Date**: {datetime.now().strftime('%Y-%m-%d')}

## Performance Metrics
"""

if problem_type == 'classification':
    model_card += f"""
- **Accuracy**: {best_metrics['accuracy']:.4f}
- **Balanced Accuracy**: {best_metrics['balanced_accuracy']:.4f}
- **Precision**: {best_metrics['precision']:.4f}
- **Recall**: {best_metrics['recall']:.4f}
- **F1 Score**: {best_metrics['f1_score']:.4f}
"""
else:
    model_card += f"""
- **R¬≤ Score**: {best_metrics['r2_score']:.4f}
- **RMSE**: {best_metrics['rmse']:.4f}
- **MAE**: {best_metrics['mae']:.4f}
- **Explained Variance**: {best_metrics['explained_variance']:.4f}
"""

model_card += f"""

## Training Information
- **Training Samples**: {X_train.shape[0]:,}
- **Test Samples**: {X_test.shape[0]:,}
- **Cross-validation**: {cv_strategy.n_splits}-fold
- **Training Time**: {best_metrics['training_time']:.2f} seconds

## Best Parameters
{json.dumps(results[best_model_name]['best_params'], indent=2)}

## Usage
```python
import joblib
import pandas as pd

# Load model
model = joblib.load('{model_filename}')

# Prepare features (ensure same order as training)
features = {selected_features[:3]} # ... etc
X = df[features]

# Make predictions
predictions = model.predict(X)
```
"""

with open(output_dir / 'model_card.md', 'w') as f:
    f.write(model_card)
print("‚úÖ Saved model card")

print(f"\nüìÅ All results saved to: {output_dir}")

## üìä 13. Final Summary and Next Steps

Display a concise summary of the entire analysis and provide guidance for next steps.

In [None]:
print("\n" + "="*80)
print("üéâ ANALYSIS COMPLETE!")
print("="*80)

print(f"\nüìä ANALYSIS SUMMARY")
print("-"*60)
print(f"Problem Type: {problem_type.capitalize()}")
print(f"Target Variable: {target_col}")
if 'selected_station' in locals() and selected_station:
    print(f"Selected Station: {selected_station}")
print(f"Features Used: {len(selected_features)}")
print(f"Models Trained: {len(results)}")
print(f"Total Training Time: {sum(training_times.values()):.2f}s")

print(f"\nüèÜ BEST MODEL: {best_model_name}")
if problem_type == 'classification':
    print(f"   ‚Ä¢ Accuracy: {best_metrics['accuracy']:.2%}")
    print(f"   ‚Ä¢ Balanced Accuracy: {best_metrics['balanced_accuracy']:.2%}")
else:
    print(f"   ‚Ä¢ R¬≤ Score: {best_metrics['r2_score']:.4f}")
    print(f"   ‚Ä¢ RMSE: {best_metrics['rmse']:.4f}")

print(f"\nüí° RECOMMENDED MODEL: {best_overall['Model']}")
print(f"   Recommendation Score: {best_overall['Score']:.1f}/100")

print(f"\nüìÅ Results saved to: {output_dir}")

print("\nüöÄ NEXT STEPS:")
print("-"*60)
print("1. Review the model_card.md for deployment instructions")
print("2. Validate model performance on new data")
print("3. Set up monitoring for model drift")
print("4. Consider ensemble methods if higher accuracy needed")
if station_results_df is not None and len(low_perf_stations) > 0:
    print("5. Investigate low-performance stations for data quality issues")

print("\n‚ú® Thank you for using the Universal ML Model Training Pipeline!")