# XGBoost Model for Bike Station Demand Prediction

This notebook implements XGBoost models for predicting bike station demand with separate models for:
- `cbike_start`: Classic bike start trips
- `cbike_end`: Classic bike end trips  
- `ebike_start`: E-bike start trips
- `ebike_end`: E-bike end trips

## ✅ GPU Error Fix Applied

**The original GPU error has been fixed!** The notebook now includes:
1. **Fixed GPU Setup** (Section 2) - Safely detects and tests GPU functionality
2. **Robust GPU Detection** (Section 2.1) - Alternative robust detection method
3. **Quick Fix Option** (Section 2.2) - Force CPU mode if needed

## Features:
- **GPU Optimization**: Automatically detects and uses GPU in Google Colab
- **Hyperparameter Tuning**: Grid search with cross-validation
- **Comprehensive Evaluation**: R², RMSE, and MAE metrics
- **Feature Engineering**: Expects preprocessed inputs in which skewed variables and collinearity have already been handled
- **Error Handling**: Robust GPU detection and CPU fallback

## Data Requirements:
- Preprocessed training and test datasets
- Standardized features
- One-hot encoded month variables

## 1. Setup and Imports

In [None]:
# Install required packages if not already installed
!pip install xgboost pandas numpy scikit-learn matplotlib seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
print('All packages imported successfully!')

## 2. GPU Setup for Google Colab (Fixed Version)

✅ **FIXED**: The original problematic GPU setup has been replaced with a safer version that properly tests GPU functionality before using it.

## 2.1. Fixed GPU Setup (Robust Detection)


In [None]:
# FIXED GPU DETECTION - Use this instead of the previous cell
def setup_gpu_robust():
    """
    Robust GPU setup for XGBoost in Google Colab.
    Returns True if GPU is available and working, False otherwise.
    """
    try:
        # Check if we're in Google Colab
        import google.colab
        print('✅ Running in Google Colab')
        
        # Check GPU availability with nvidia-smi
        import subprocess
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        
        if result.returncode == 0:
            print('✅ GPU hardware detected!')
            print('GPU Info:')
            print('\n'.join(result.stdout.splitlines()[:3]))  # Show first few lines
            
            # Test XGBoost GPU functionality with a small dataset
            try:
                import xgboost as xgb
                print('\n🔍 Testing XGBoost GPU functionality...')
                
                # Create minimal test data
                import numpy as np
                test_X = np.array([[1, 2], [3, 4], [5, 6]])
                test_y = np.array([1, 2, 3])
                
                # Test GPU XGBoost
                test_model = xgb.XGBRegressor(
                    tree_method='gpu_hist',
                    gpu_id=0,
                    n_estimators=2,
                    random_state=42,
                    verbosity=0
                )
                test_model.fit(test_X, test_y)
                test_pred = test_model.predict(test_X)
                print('✅ XGBoost GPU functionality confirmed!')
                return True
                
            except Exception as gpu_error:
                print(f'❌ GPU detected but XGBoost GPU failed: {gpu_error}')
                print('🔄 Falling back to CPU mode')
                return False
        else:
            print('⚠️  No GPU hardware detected')
            print('🔄 Using CPU mode')
            return False
            
    except ImportError:
        print('⚠️  Not running in Google Colab - using CPU mode')
        return False
    except Exception as e:
        print(f'❌ Error during GPU detection: {e}')
        print('🔄 Falling back to CPU mode')
        return False

# Setup GPU with robust detection
USE_GPU = setup_gpu_robust()
print(f'\n🚀 Final GPU Mode: {"Enabled" if USE_GPU else "Disabled"}')

# Force CPU mode if GPU detection failed
if not USE_GPU:
    print('\n⚠️  IMPORTANT: Using CPU mode for all XGBoost models')
    print('   This will be slower but more reliable')
else:
    print('\n✅ GPU mode enabled - XGBoost will use GPU acceleration')
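
The detection step above shells out to `nvidia-smi`. Below is a minimal sketch of that hardware check in isolation, using only the standard library. The helper name `nvidia_smi_available` and its `timeout_s` parameter are hypothetical and not part of the notebook. It looks the binary up with `shutil.which` first, so a machine without the NVIDIA driver yields `False` instead of a `FileNotFoundError`:

```python
import shutil
import subprocess


def nvidia_smi_available(timeout_s: float = 10.0) -> bool:
    """Return True only if nvidia-smi exists on PATH and exits with code 0."""
    exe = shutil.which("nvidia-smi")  # None when the binary is not installed
    if exe is None:
        return False
    try:
        result = subprocess.run(
            [exe], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        # The binary could not be started or did not answer in time
        return False
    return result.returncode == 0


print(f"nvidia-smi usable: {nvidia_smi_available()}")
```

A passing hardware check does not by itself show that the installed XGBoost build can train on the GPU, which is why the cell above also fits a tiny test model before enabling GPU mode.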


## 2.2. 🚨 Quick Fix for GPU Error

If you got the GPU error above, run this cell instead to force CPU mode:


In [None]:
# QUICK FIX: Force CPU mode to avoid GPU errors
print("🔧 Applying quick fix for GPU error...")
USE_GPU = False  # Force CPU mode
print("✅ CPU mode enabled - this will prevent GPU errors")
print("⚠️  Training will be slower but more reliable")
print(f"🚀 GPU Mode: {'Enabled' if USE_GPU else 'Disabled (Fixed)'}")

# Verify XGBoost works in CPU mode
try:
    import xgboost as xgb
    import numpy as np
    
    # Test CPU XGBoost
    test_X = np.array([[1, 2], [3, 4], [5, 6]])
    test_y = np.array([1, 2, 3])
    
    test_model = xgb.XGBRegressor(
        tree_method='hist',  # CPU method
        n_estimators=2,
        random_state=42,
        verbosity=0
    )
    test_model.fit(test_X, test_y)
    print("✅ XGBoost CPU mode confirmed working!")
    
except Exception as e:
    print(f"❌ Error testing XGBoost: {e}")
    print("Please restart the runtime and try again")


In [None]:
# SAFE GPU SETUP - replaces the original setup that caused GPU errors
# NOTE: this cell reassigns USE_GPU, so skip it if you forced CPU mode with the quick fix above

def setup_gpu_safe():
    """
    Safe GPU setup for XGBoost in Google Colab.
    Returns True if GPU is available and working, False otherwise.
    """
    try:
        # Check if we're in Google Colab
        import google.colab
        print('✅ Running in Google Colab')
        
        # Check GPU availability with subprocess (safer than !nvidia-smi)
        import subprocess
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        
        if result.returncode == 0:
            print('✅ GPU hardware detected!')
            
            # Test XGBoost GPU functionality with minimal test
            try:
                import xgboost as xgb
                import numpy as np
                
                # Minimal test data
                test_X = np.array([[1, 2], [3, 4]])
                test_y = np.array([1, 2])
                
                # Test GPU XGBoost
                test_model = xgb.XGBRegressor(
                    tree_method='gpu_hist',
                    gpu_id=0,
                    n_estimators=1,
                    random_state=42,
                    verbosity=0
                )
                test_model.fit(test_X, test_y)
                print('✅ XGBoost GPU functionality confirmed!')
                return True
                
            except Exception as gpu_error:
                print(f'❌ GPU detected but XGBoost GPU failed: {gpu_error}')
                print('🔄 Falling back to CPU mode')
                return False
        else:
            print('⚠️  No GPU hardware detected')
            print('🔄 Using CPU mode')
            return False
            
    except ImportError:
        print('⚠️  Not running in Google Colab - using CPU mode')
        return False
    except Exception as e:
        print(f'❌ Error during GPU detection: {e}')
        print('🔄 Falling back to CPU mode')
        return False

# Setup GPU with safe detection
USE_GPU = setup_gpu_safe()
print(f'\n🚀 GPU Mode: {"Enabled" if USE_GPU else "Disabled"}')
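
The cells above select the GPU with `tree_method='gpu_hist'` and `gpu_id=0`. XGBoost 2.0 introduced a separate `device` parameter and deprecated both of those settings, so an unpinned `pip install xgboost` may emit deprecation warnings or reject them, depending on the release. The sketch below shows one way to choose the keyword arguments from a version string. The helper `gpu_params_for` is hypothetical, and it assumes a plain `major.minor.patch` version format:

```python
def gpu_params_for(xgb_version: str, use_gpu: bool) -> dict:
    """Build tree_method/device keyword arguments for an XGBoost version string."""
    if not use_gpu:
        # CPU histogram method, as in the quick-fix cell
        return {"tree_method": "hist"}
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        # XGBoost 2.0+ style: the device is chosen separately from the tree method
        return {"tree_method": "hist", "device": "cuda"}
    # Pre-2.0 style, as used in the cells above
    return {"tree_method": "gpu_hist", "gpu_id": 0}


print(gpu_params_for("2.0.3", use_gpu=True))
print(gpu_params_for("1.7.6", use_gpu=True))
print(gpu_params_for("2.0.3", use_gpu=False))
```

In the notebook this could be applied as `xgb.XGBRegressor(**gpu_params_for(xgb.__version__, USE_GPU))`, assuming `xgb.__version__` follows the format described above.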

## 3. Data Loading and Preparation

In [None]:
# Load preprocessed datasets
print('Loading preprocessed datasets...')

try:
    training_data = pd.read_csv('result/training_dataset_preprocessed.csv')
    test_data = pd.read_csv('result/test_dataset_preprocessed.csv')
    
    print(f'✅ Training data loaded: {training_data.shape}')
    print(f'✅ Test data loaded: {test_data.shape}')
    
except FileNotFoundError as e:
    print(f'❌ Error loading data: {e}')
    print('Please ensure the preprocessed datasets are available in the result/ directory')
    raise

# Display basic info
print('\nTraining data info:')
print(training_data.info())

print('\nTest data info:')
print(test_data.info())

In [None]:
# Check for missing values
print('Missing values in training data:')
missing_train = training_data.isnull().sum()
print(missing_train[missing_train > 0])

print('\nMissing values in test data:')
missing_test = test_data.isnull().sum()
print(missing_test[missing_test > 0])

## 4. Feature Selection and Target Variables

In [None]:
# Define target variables
target_variables = ['cbike_start', 'cbike_end', 'ebike_start', 'ebike_end']

# Define features to exclude
exclude_features = ['station_id', 'year', 'total_start', 'total_end']

# Get feature columns (excluding targets and excluded features)
feature_columns = [col for col in training_data.columns 
                   if col not in target_variables + exclude_features]

print(f'Target variables: {target_variables}')
print(f'Features to exclude: {exclude_features}')
print(f'Number of feature columns: {len(feature_columns)}')
print(f'\nFeature columns:\n{feature_columns}')
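
The feature list above is derived from the training data only, while Section 8 indexes the test data with the same list. If a column is absent from the test file (for example a one-hot month that never occurs in the test period), that indexing raises a `KeyError`. The sketch below shows an early check; the helper `missing_features` and the toy column names are hypothetical:

```python
def missing_features(feature_columns, available_columns):
    """Return the feature names absent from available_columns, keeping their order."""
    available = set(available_columns)
    return [col for col in feature_columns if col not in available]


# Toy column names for illustration only; they are not the notebook's real features
train_features = ["temp", "month_1", "month_2"]
test_columns = ["station_id", "temp", "month_1", "cbike_start"]

missing = missing_features(train_features, test_columns)
print(f"Missing from test data: {missing}")
```

In the notebook the call would be `missing_features(feature_columns, test_data.columns)`, and a non-empty result signals that the two files were preprocessed differently.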

## 5. Parameter Grid Definition (Ultra-Fast Training)

⚡ **Optimized for speed!** This parameter grid has been reduced from 2,916 to 128 combinations, which is 640 fits per target with 5-fold cross-validation.

In [None]:
def define_parameter_grid(use_gpu=False):
    """
    Define parameter grid for XGBoost hyperparameter tuning.
    Optimized for faster training with fewer combinations.
    
    Args:
        use_gpu (bool): Whether to use GPU-optimized parameters
    
    Returns:
        dict: Parameter grid for GridSearchCV
    """
    # Efficient parameter grid - same for both GPU and CPU
    # Focus on most impactful parameters with fewer options
    param_grid = {
        'n_estimators': [100, 200],           # 2 options
        'max_depth': [4, 6],                  # 2 options (removed 5)
        'learning_rate': [0.05, 0.1],         # 2 options (removed 0.01)
        'subsample': [0.8, 0.9],              # 2 options
        'colsample_bytree': [0.8, 0.9],       # 2 options
        'min_child_weight': [1, 3],           # 2 options (removed 5)
        'gamma': [0, 0.1]#,                    # 2 options (removed 0.2)
        #'reg_alpha': [0, 0.1],                # 2 options (removed 0.5)
        #'reg_lambda': [0.1, 1.0]              # 2 options (removed 5.0)
    }
    
    return param_grid

# Define parameter grid based on GPU availability
param_grid = define_parameter_grid(USE_GPU)
print(f"Parameter grid ({'GPU' if USE_GPU else 'CPU'} mode):")
for param, values in param_grid.items():
    print(f'  {param}: {values}')

# Calculate total combinations
total_combinations = 1
for values in param_grid.values():
    total_combinations *= len(values)
print(f'\nTotal parameter combinations: {total_combinations}')
print(f'With 5-fold CV: {total_combinations * 5} total fits')
print(f'\n⚡ Training time: ~{total_combinations * 5 // 20}-{total_combinations * 5 // 10} minutes (estimated)')
print(f'🚀 Speed improvement: ~{2916 // total_combinations}x faster than original!')
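
The fit count printed above covers one target. Because the notebook trains a separate model for each of the four targets, the total work is four times larger. The sketch below reproduces the arithmetic with the same grid values; the variable names are illustrative:

```python
import math

# Same grid as define_parameter_grid() above: seven parameters, two values each
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [4, 6],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 0.9],
    "colsample_bytree": [0.8, 0.9],
    "min_child_weight": [1, 3],
    "gamma": [0, 0.1],
}

cv_folds = 5
n_targets = 4  # cbike_start, cbike_end, ebike_start, ebike_end

n_combinations = math.prod(len(values) for values in param_grid.values())
fits_per_target = n_combinations * cv_folds
total_fits = fits_per_target * n_targets

print(f"Combinations: {n_combinations}")
print(f"Fits per target: {fits_per_target}")
print(f"Fits across all targets: {total_fits}")
```

With its default `refit=True`, `GridSearchCV` also refits the best configuration once on the full training split, which adds one more fit per target.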

## 6. XGBoost Model Training Function

In [None]:
def train_xgboost_model(X_train, y_train, X_val, y_val, target_name, use_gpu=False):
    """
    Train XGBoost model with hyperparameter tuning.
    
    Args:
        X_train, y_train: Training data
        X_val, y_val: Validation data
        target_name (str): Name of target variable
        use_gpu (bool): Whether to use GPU
    
    Returns:
        dict: Model results and best model
    """
    print(f'\n🚀 Training XGBoost model for {target_name}...')
    
    # Define base XGBoost parameters
    base_params = {
        'objective': 'reg:squarederror',
        'random_state': 42,
        'n_jobs': 1 if use_gpu else -1  # Use single thread for GPU
    }
    
    if use_gpu:
        base_params.update({
            'tree_method': 'gpu_hist',
            'gpu_id': 0
        })
    
    # Create XGBoost regressor
    xgb_model = xgb.XGBRegressor(**base_params)
    
    # Perform grid search with cross-validation
    print('🔍 Performing grid search with 5-fold cross-validation...')
    grid_search = GridSearchCV(
        estimator=xgb_model,
        param_grid=param_grid,
        cv=5,
        scoring='r2',
        n_jobs=1,  # Single thread for GPU compatibility
        verbose=1
    )
    
    # Fit the grid search
    grid_search.fit(X_train, y_train)
    
    # Get best model
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_
    
    print(f'✅ Best parameters: {best_params}')
    print(f'✅ Best CV score: {grid_search.best_score_:.4f}')
    
    # Evaluate on validation set
    y_val_pred = best_model.predict(X_val)
    
    val_r2 = r2_score(y_val, y_val_pred)
    val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
    val_mae = mean_absolute_error(y_val, y_val_pred)
    
    print(f'📊 Validation Performance:')
    print(f'  R²: {val_r2:.4f}')
    print(f'  RMSE: {val_rmse:.4f}')
    print(f'  MAE: {val_mae:.4f}')
    
    return {
        'best_model': best_model,
        'best_params': best_params,
        'best_cv_score': grid_search.best_score_,
        'val_r2': val_r2,
        'val_rmse': val_rmse,
        'val_mae': val_mae,
        'grid_search': grid_search
    }

## 7. Model Training for All Targets

In [None]:
# Prepare data for modeling
X = training_data[feature_columns]
print(f'Feature matrix shape: {X.shape}')

# Initialize results storage
model_results = {}
trained_models = {}

# Train models for each target variable
for target in target_variables:
    print(f"\n{'='*60}")
    print(f'TRAINING MODEL FOR: {target.upper()}')
    print(f"{'='*60}")
    
    # Prepare target variable
    y = training_data[target]
    
    # Split data (80% train, 20% validation)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    print(f'Training set: {X_train.shape[0]} samples')
    print(f'Validation set: {X_val.shape[0]} samples')
    
    # Train model
    results = train_xgboost_model(
        X_train, y_train, X_val, y_val, target, USE_GPU
    )
    
    # Store results
    model_results[target] = results
    trained_models[target] = results['best_model']
    
    print(f'✅ Model training completed for {target}')

print(f'\n🎉 All models trained successfully!')
print(f'Trained models: {list(trained_models.keys())}')

## 8. Model Evaluation on Test Set

In [None]:
# Evaluate models on test set
print('\n' + '='*60)
print('TEST SET EVALUATION')
print('='*60)

test_results = {}
X_test = test_data[feature_columns]

for target in target_variables:
    print(f'\n📊 Evaluating {target} model on test set...')
    
    # Get true values
    y_test = test_data[target]
    
    # Make predictions
    y_test_pred = trained_models[target].predict(X_test)
    
    # Calculate metrics
    test_r2 = r2_score(y_test, y_test_pred)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_mae = mean_absolute_error(y_test, y_test_pred)
    
    print(f'  Test R²: {test_r2:.4f}')
    print(f'  Test RMSE: {test_rmse:.4f}')
    print(f'  Test MAE: {test_mae:.4f}')
    
    # Store test results
    test_results[target] = {
        'test_r2': test_r2,
        'test_rmse': test_rmse,
        'test_mae': test_mae,
        'y_test': y_test,
        'y_test_pred': y_test_pred
    }

print('\n✅ Test set evaluation completed!')
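
For reference, the three metrics reported above can be written out directly. This is a plain-Python sketch of the standard definitions, not the scikit-learn implementation; the helper `regression_metrics` is hypothetical, and it assumes `y_true` is not constant (otherwise the R² denominator is zero):

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², RMSE and MAE from two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared errors of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations of the targets from their mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 5])
print(metrics)
```

RMSE and MAE are in the units of the target (trips), so they are comparable across the four models only when the targets have similar scales, whereas R² is unitless.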

## 9. Results Summary and Comparison

In [None]:
# Create comprehensive results summary
print('\n' + '='*80)
print('COMPREHENSIVE MODEL PERFORMANCE SUMMARY')
print('='*80)

summary_data = []

for target in target_variables:
    # Training/validation results
    train_results = model_results[target]
    
    # Test results
    test_results_target = test_results[target]
    
    # Create summary row
    summary_row = {
        'Target': target,
        'Best_CV_Score': train_results['best_cv_score'],
        'Val_R2': train_results['val_r2'],
        'Val_RMSE': train_results['val_rmse'],
        'Val_MAE': train_results['val_mae'],
        'Test_R2': test_results_target['test_r2'],
        'Test_RMSE': test_results_target['test_rmse'],
        'Test_MAE': test_results_target['test_mae']
    }
    
    summary_data.append(summary_row)
    
    # Print detailed results
    print(f'\n🎯 {target.upper()}:')
    print(f"  Best CV Score: {train_results['best_cv_score']:.4f}")
    print(f"  Validation - R²: {train_results['val_r2']:.4f}, RMSE: {train_results['val_rmse']:.4f}, MAE: {train_results['val_mae']:.4f}")
    print(f"  Test      - R²: {test_results_target['test_r2']:.4f}, RMSE: {test_results_target['test_rmse']:.4f}, MAE: {test_results_target['test_mae']:.4f}")
    print(f"  Best Parameters: {train_results['best_params']}")

# Create summary DataFrame
summary_df = pd.DataFrame(summary_data)
print('\n📊 Performance Summary Table:')
print(summary_df.round(4))

## 10. Feature Importance Analysis

In [None]:
# Analyze feature importance for each model
print('\n' + '='*60)
print('FEATURE IMPORTANCE ANALYSIS')
print('='*60)

for target in target_variables:
    print(f'\n🔍 Feature importance for {target} model:')
    
    model = trained_models[target]
    
    # Get feature importance
    importance_df = pd.DataFrame({
        'feature': feature_columns,
        'importance': model.feature_importances_
    })
    
    # Sort by importance
    importance_df = importance_df.sort_values('importance', ascending=False)
    
    # Display top 15 features
    print(f'Top 15 most important features:')
    print(importance_df.head(15).round(4))
    
    # Plot feature importance
    plt.figure(figsize=(12, 8))
    top_features = importance_df.head(20)
    
    plt.barh(range(len(top_features)), top_features['importance'])
    plt.yticks(range(len(top_features)), top_features['feature'])
    plt.xlabel('Feature Importance')
    plt.title(f'Top 20 Feature Importance - {target}')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

## 11. Model Performance Visualization

In [None]:
# Create performance comparison plots
print('\n📈 Creating performance visualization...')

# R² comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Model Performance Comparison', fontsize=16)

metrics = ['R²', 'RMSE', 'MAE']
metric_cols = ['r2', 'rmse', 'mae']

for i, (metric, col) in enumerate(zip(metrics, metric_cols)):
    ax = axes[i//2, i%2]
    
    # Prepare data for plotting (distinct names so the test_data DataFrame is not overwritten)
    val_scores = [model_results[target][f'val_{col}'] for target in target_variables]
    test_scores = [test_results[target][f'test_{col}'] for target in target_variables]
    
    x = np.arange(len(target_variables))
    width = 0.35
    
    ax.bar(x - width/2, val_scores, width, label='Validation', alpha=0.8)
    ax.bar(x + width/2, test_scores, width, label='Test', alpha=0.8)
    
    ax.set_xlabel('Target Variable')
    ax.set_ylabel(metric)
    ax.set_title(f'{metric} Comparison')
    ax.set_xticks(x)
    ax.set_xticklabels(target_variables, rotation=45)
    ax.legend()
    ax.grid(True, alpha=0.3)

axes[1, 1].set_visible(False)  # Only three metrics are plotted; hide the unused fourth panel
plt.tight_layout()
plt.show()

# Actual vs Predicted plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Actual vs Predicted Values (Test Set)', fontsize=16)

for i, target in enumerate(target_variables):
    ax = axes[i//2, i%2]
    
    y_true = test_results[target]['y_test']
    y_pred = test_results[target]['y_test_pred']
    
    ax.scatter(y_true, y_pred, alpha=0.6)
    ax.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--', lw=2)
    
    ax.set_xlabel('Actual Values')
    ax.set_ylabel('Predicted Values')
    ax.set_title(f'{target} - R²: {test_results[target]["test_r2"]:.4f}')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 12. Save Results and Models

In [None]:
# Save results to CSV
print('\n💾 Saving results...')

# Prepare results for saving
results_data = []
for target in target_variables:
    train_results = model_results[target]
    test_results_target = test_results[target]
    
    row = {
        'target_variable': target,
        'best_cv_score': train_results['best_cv_score'],
        'validation_r2': train_results['val_r2'],
        'validation_rmse': train_results['val_rmse'],
        'validation_mae': train_results['val_mae'],
        'test_r2': test_results_target['test_r2'],
        'test_rmse': test_results_target['test_rmse'],
        'test_mae': test_results_target['test_mae'],
        'best_parameters': str(train_results['best_params']),
        'gpu_used': USE_GPU
    }
    results_data.append(row)

# Create and save results DataFrame
results_df = pd.DataFrame(results_data)
results_df.to_csv('result/xgboost_results.csv', index=False)
print(f'✅ Results saved to result/xgboost_results.csv')

# Save models (optional - for future use)
import joblib
for target in target_variables:
    model_filename = f'result/xgboost_model_{target}.joblib'
    joblib.dump(trained_models[target], model_filename)
    print(f'✅ Model saved to {model_filename}')

## 13. Summary and Next Steps

In [None]:
print('\n' + '='*80)
print('🎉 XGBOOST MODELING COMPLETED SUCCESSFULLY!')
print('='*80)

print(f'\n📊 Models Trained: {len(trained_models)}')
print(f"🎯 Target Variables: {', '.join(target_variables)}")
print(f"🚀 GPU Acceleration: {'Enabled' if USE_GPU else 'Disabled'}")
print(f'🔍 Features Used: {len(feature_columns)}')

print('\n📁 Files Generated:')
print('  - result/xgboost_results.csv (performance metrics)')
print('  - result/xgboost_model_*.joblib (trained models)')

print('\n🚀 Next Steps:')
print('  1. Analyze feature importance for insights')
print('  2. Compare with linear regression results')
print('  3. Consider ensemble methods')
print('  4. Deploy models for predictions')

print('\n✅ Notebook execution completed!')