<a href="https://colab.research.google.com/github/john-d-noble/callcenter/blob/main/CB_Step_4_Machine_Learning_Models_(With_Feature_Engineering).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
!pip install pandas numpy scikit-learn xgboost



In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.preprocessing import StandardScaler

# Load the updated dataset
df = pd.read_csv('updated_final_merged_data.csv', index_col='Date', parse_dates=True)

# Assume 'Calls' is the target column
target = 'Calls'

# Prepare data: Sort by date if not already
df = df.sort_index()

# Feature Engineering
# Lags: previous day (lag1) and previous week (lag7)
df['Lag1'] = df[target].shift(1)
df['Lag7'] = df[target].shift(7)

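# Rolling statistics: 7-day mean and std
# NOTE: these windows include the current day's target value, so both features
# leak part of y into X. For a leak-free forecast, compute them on
# df[target].shift(1) instead. The results reported below use the unshifted version.
df['Rolling_Mean_7'] = df[target].rolling(window=7).mean()
df['Rolling_Std_7'] = df[target].rolling(window=7).std()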

# Day-of-week dummies (from EDA)
df = pd.get_dummies(df, columns=['DayOfWeek'], drop_first=True)

# Feature selection: the EDA suggested keeping market features with |corr| > 0.2
# (e.g., '^VIX_Close_^VIX', 'CVOL-USD_Close_CVOL-USD'), but this cell does not apply
# that filter. It keeps every float64/int64/bool column except the target,
# including the engineered lag, rolling, and day-of-week features.
features = [col for col in df.columns if col != target and df[col].dtype in [np.float64, np.int64, bool]]

# Drop NaNs from shifting/rolling
df = df.dropna()

# X and y
X = df[features]
y = df[target]

# Time series cross-validation: 5 splits
tscv = TimeSeriesSplit(n_splits=5)

# Function to calculate metrics
def calculate_metrics(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100  # As percentage
    return {'MAE': mae, 'RMSE': rmse, 'MAPE': mape}

# Dictionary to store average metrics for each model
model_metrics = {}

# Scaler for models that need it (Ridge, SVR)
scaler = StandardScaler()

# 1. Ridge Regression (linear with L2 regularization)
ridge_preds = []
ridge_trues = []
for train_idx, test_idx in tscv.split(df):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Scale
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Fit
    model = Ridge(alpha=1.0)
    model.fit(X_train_scaled, y_train)

    # Predict
    pred = model.predict(X_test_scaled)

    ridge_preds.extend(pred)
    ridge_trues.extend(y_test)

ridge_metrics = calculate_metrics(ridge_trues, ridge_preds)
model_metrics['Ridge'] = ridge_metrics

# 2. Random Forest Regressor
rf_preds = []
rf_trues = []
for train_idx, test_idx in tscv.split(df):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Fit (no scaling needed)
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Predict
    pred = model.predict(X_test)

    rf_preds.extend(pred)
    rf_trues.extend(y_test)

rf_metrics = calculate_metrics(rf_trues, rf_preds)
model_metrics['Random Forest'] = rf_metrics

# 3. XGBoost Regressor
xgb_preds = []
xgb_trues = []
for train_idx, test_idx in tscv.split(df):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Fit (no scaling needed)
    model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
    model.fit(X_train, y_train)

    # Predict
    pred = model.predict(X_test)

    xgb_preds.extend(pred)
    xgb_trues.extend(y_test)

xgb_metrics = calculate_metrics(xgb_trues, xgb_preds)
model_metrics['XGBoost'] = xgb_metrics

# 4. Support Vector Regression (SVR)
svr_preds = []
svr_trues = []
for train_idx, test_idx in tscv.split(df):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Scale
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Fit
    model = SVR(kernel='rbf', C=100, epsilon=0.1)
    model.fit(X_train_scaled, y_train)

    # Predict
    pred = model.predict(X_test_scaled)

    svr_preds.extend(pred)
    svr_trues.extend(y_test)

svr_metrics = calculate_metrics(svr_trues, svr_preds)
model_metrics['SVR'] = svr_metrics

# Summarize performance
print("\nModel Performance Summary:")
metrics_df = pd.DataFrame(model_metrics).T
print(metrics_df)

# Pick winner: Lowest MAE (primary metric)
winner = metrics_df['MAE'].idxmin()
print(f"\nChampion ML Model: {winner}")
print(f"Metrics: {metrics_df.loc[winner].to_dict()}")





Model Performance Summary:
                       MAE         RMSE       MAPE
Ridge          1011.386531  1392.164956  10.887078
Random Forest  1080.180770  1700.625098  11.130627
XGBoost        1338.681402  1955.162291  14.519525
SVR            1925.278643  2607.626401  20.171805

Champion ML Model: Ridge
Metrics: {'MAE': 1011.3865312439392, 'RMSE': 1392.1649564183208, 'MAPE': 10.887077830494986}


### Combined Performance Summary: Baseline, Classical, and Machine Learning Models

To give a complete view of the forecasting progression, the tables below report performance for every model tier evaluated so far: baselines (simple benchmarks), classical time series models (univariate, with trend and seasonality), and machine learning models (multivariate, with engineered features such as lags, rolling statistics, day-of-week dummies, and market indicators). All were assessed using time-series cross-validation on the filled dataset, with three metrics: Mean Absolute Error (MAE, in call counts), Root Mean Squared Error (RMSE, which weights large errors more heavily), and Mean Absolute Percentage Error (MAPE, which expresses error relative to actual volume). Reading the tiers side by side shows where added complexity improved accuracy and where it did not, in light of the EDA's findings on seasonality, trends, and market correlations.
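As a quick illustration of how the three metrics differ, the sketch below computes them with NumPy on a small made-up series. The values are chosen for illustration only and are not taken from the call data.

```python
import numpy as np

# Toy values for illustration only; they are not from the call-volume data.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

errors = y_true - y_pred  # [-10, 20, 0]

mae = np.mean(np.abs(errors))                           # average absolute miss, in calls
rmse = np.sqrt(np.mean(errors ** 2))                    # squares errors, so the 20-call miss dominates
mape = np.mean(np.abs(errors) / np.abs(y_true)) * 100   # relative error, in percent

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.2f}%")
# prints: MAE=10.00  RMSE=12.91  MAPE=6.67%
```

RMSE exceeds MAE here because squaring gives the single 20-call miss more weight, while MAPE treats the 10-call and 20-call misses equally because each is 10% of its actual value.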

#### Baseline Models Performance
| Model          | MAE       | RMSE      | MAPE     |
|----------------|-----------|-----------|----------|
| Naive          | 2351.46  | 2942.38  | 24.84%  |
| Mean           | 1634.56  | 2154.49  | 18.23%  |
| Median         | 1613.91  | 2177.89  | 17.38%  |
| Seasonal Naive | 907.70   | 1359.05  | 9.67%   |

**Baseline Champion**: Seasonal Naive (excels due to strong weekly patterns from EDA).

#### Classical Models Performance
| Model   | MAE       | RMSE      | MAPE     |
|---------|-----------|-----------|----------|
| ARIMA   | 2268.08  | 2860.61  | 24.43%  |
| SARIMA  | 2560.83  | 3163.07  | 28.56%  |
| ETS     | 2233.64  | 2882.92  | 22.57%  |

**Classical Champion**: ETS (lowest MAE and MAPE in this tier, although ARIMA has a slightly lower RMSE; it beats only the Naive baseline).

#### Machine Learning Models Performance
| Model         | MAE       | RMSE      | MAPE     |
|---------------|-----------|-----------|----------|
| Ridge         | 1011.39  | 1392.16  | 10.89%  |
| Random Forest | 1080.18  | 1700.63  | 11.13%  |
| XGBoost       | 1338.68  | 1955.16  | 14.52%  |
| SVR           | 1925.28  | 2607.63  | 20.17%  |

**ML Champion**: Ridge (lowest errors, leveraging regularization for stability).

### Full Narrative Analysis
The baseline models establish the benchmark, relying on minimal assumptions about the call volume data. The Naive method, which repeats the prior day's value, yields high errors (MAE ~2,351, MAPE 25%) because of day-to-day volatility. Mean and Median forecasts do better (MAEs ~1,614-1,635, MAPEs 17-18%) by predicting a central value; the Median's lower MAE and MAPE are consistent with the slight skew seen in the EDA's distribution analysis. The standout is Seasonal Naive, with an MAE of 908 and a MAPE under 10%. It exploits the weekly seasonality and day-of-week averages identified in the EDA, showing that simple periodic repetition handles the recurring cycles in this 7-day operational context, even after imputation of weekends and holidays.
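To make the difference between the Naive and Seasonal Naive baselines concrete, here is a minimal sketch on a synthetic series with an exact weekly cycle. The `seasonal_naive` helper and the data are assumptions for illustration; the actual baselines were evaluated with time-series cross-validation in an earlier step, which this sketch does not reproduce.

```python
import pandas as pd

def seasonal_naive(series: pd.Series, season_length: int = 7) -> pd.Series:
    """Hypothetical helper: forecast each day with the value one season earlier."""
    return series.shift(season_length)

# Synthetic daily series that repeats the same weekly pattern (illustrative only).
dates = pd.date_range("2024-01-01", periods=28, freq="D")
weekly_pattern = [900, 950, 1000, 980, 940, 300, 250]
calls = pd.Series(weekly_pattern * 4, index=dates, dtype=float)

naive_fc = calls.shift(1)            # previous day's value
seasonal_fc = seasonal_naive(calls)  # same weekday, previous week

# .mean() skips the NaNs created by shifting at the start of the series.
naive_mae = (calls - naive_fc).abs().mean()
seasonal_mae = (calls - seasonal_fc).abs().mean()

print(f"Naive MAE: {naive_mae:.1f}, Seasonal Naive MAE: {seasonal_mae:.1f}")
```

On a perfectly periodic series the seasonal forecast has zero error, while the day-to-day forecast is penalized at every weekday-to-weekend transition. Real call data is noisier, but the same mechanism explains the gap between the two baselines.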

Classical models, which model trend and seasonality explicitly, show limited advancement. ARIMA addresses non-stationarity (the EDA's ADF test suggested differencing) but ignores seasonality, resulting in an MAE of 2,268 and a MAPE of 24%: comparable to Naive but far behind Seasonal Naive. SARIMA, which adds weekly terms, ranks worst (MAE 2,561, MAPE 29%), possibly because it overfits outliers or noise in the filled data and so fails to exploit the clear periodic signal found in the EDA. ETS has the lowest MAE and MAPE in this tier (MAE 2,234, MAPE 23%), using smoothing to balance trend and additive seasonality, yet its MAE is still about 2.5x that of Seasonal Naive and about 1.4x that of the Mean and Median baselines. Only the Naive baseline does worse, which indicates that these classical univariate approaches add complexity without proportional gains.
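The gap between ETS and each baseline can be checked directly from the reported MAE values. The snippet below is only arithmetic on the numbers in the tables above.

```python
import pandas as pd

# MAE values copied from the baseline and classical tables above.
baseline_mae = pd.Series(
    {"Naive": 2351.46, "Mean": 1634.56, "Median": 1613.91, "Seasonal Naive": 907.70}
)
ets_mae = 2233.64

# A ratio above 1 means ETS has a larger MAE than that baseline.
ratio = (ets_mae / baseline_mae).round(2)
print(ratio)
# Naive 0.95, Mean 1.37, Median 1.38, Seasonal Naive 2.46
```

ETS improves only on the Naive baseline; against Seasonal Naive its MAE is roughly two and a half times larger.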

The machine learning tier adds engineered features (lags for autocorrelation, rolling statistics for recent level and volatility, day-of-week dummies for seasonality, and market features such as VIX and CVOL, which the EDA flagged with |correlation| > 0.2) and improves clearly on the classical tier by integrating multivariate signals. Ridge regression, whose L2 regularization helps with collinear features, emerges as champion (MAE 1,011, MAPE 11%) and is fit on standardized inputs so that the penalty treats all features on a common scale. Random Forest follows closely (MAE 1,080, MAPE 11%), using an ensemble of trees to capture non-linearity and feature interactions. XGBoost (MAE 1,339, MAPE 15%) trails both, which suggests that boosting with these untuned settings may be overfitting noise in the data. SVR is last (MAE 1,925, MAPE 20%); its RBF kernel with fixed, untuned hyperparameters struggles in this time-series setup.
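The feature engineering described above can be sketched as follows. Unlike the notebook cell, this version computes the rolling statistics on the target shifted by one day, so no feature for day t uses day t's call count. The `build_features` helper is hypothetical, and deriving the weekday from the index (rather than from an existing `DayOfWeek` column) is an assumption made to keep the example self-contained.

```python
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame, target: str = "Calls") -> pd.DataFrame:
    """Hypothetical helper: lag, rolling, and weekday features that only use past values."""
    out = df.copy()

    # Lags: previous day and same weekday one week earlier.
    out["Lag1"] = out[target].shift(1)
    out["Lag7"] = out[target].shift(7)

    # Shift before rolling so the 7-day window ends at day t-1, not day t.
    past = out[target].shift(1)
    out["Rolling_Mean_7"] = past.rolling(window=7).mean()
    out["Rolling_Std_7"] = past.rolling(window=7).std()

    # Day-of-week dummies derived from the DatetimeIndex (Monday is the dropped level).
    weekday = pd.Series(out.index.dayofweek, index=out.index)
    dummies = pd.get_dummies(weekday, prefix="DayOfWeek", drop_first=True)

    # Drop the warm-up rows that lack a full history.
    return pd.concat([out, dummies], axis=1).dropna()

# Toy data: 21 days where the call count equals the day number (illustrative only).
dates = pd.date_range("2024-01-01", periods=21, freq="D")
toy = pd.DataFrame({"Calls": np.arange(21, dtype=float)}, index=dates)

features = build_features(toy)
print(features[["Calls", "Lag1", "Lag7", "Rolling_Mean_7"]].head(3))
```

In the first surviving row the call count is 7 and `Rolling_Mean_7` is 3.0, the mean of days 0 through 6, which confirms that the window excludes the current day.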

Cumulatively, ML models outperform classical ones (Ridge reduces ETS's MAE by ~55%) but still fall short of the Seasonal Naive baseline (MAE 908 vs. 1,011), suggesting that while external signals and engineered features add value, the dominant weekly rhythm is still best captured simply. This aligns with the EDA findings: strong seasonality trumps complex modeling unless the models are tuned further (e.g., hyperparameter optimization or hybridizing with Prophet). Errors fall from the classical tier to the ML tier with multivariate integration, but the baseline's efficiency highlights the risk of over-engineering. Next steps could involve deep learning hybrids or ensemble stacking to push below the Seasonal Naive benchmark (MAPE 9.67%), ensuring scalable, robust forecasts for call center operations.
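One way the hyperparameter optimization mentioned above might look for the Ridge model is sketched below, using scikit-learn's `GridSearchCV` with the same `TimeSeriesSplit` scheme as the notebook. The data is synthetic and the alpha grid is an arbitrary assumption, so the selected value says nothing about the call dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and call counts (illustrative only).
rng = np.random.default_rng(42)
n_days = 200
X = rng.normal(size=(n_days, 5))
true_coefs = np.array([50.0, -30.0, 0.0, 10.0, 0.0])
y = 1000 + X @ true_coefs + rng.normal(scale=20.0, size=n_days)

# Putting the scaler inside the pipeline refits it on each training fold,
# so test-fold statistics never influence the scaling.
pipeline = make_pipeline(StandardScaler(), Ridge())

# Assumed grid; make_pipeline names the Ridge step "ridge".
alpha_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    pipeline,
    alpha_grid,
    cv=TimeSeriesSplit(n_splits=5),       # expanding-window splits that respect time order
    scoring="neg_mean_absolute_error",    # scikit-learn maximizes, hence the negated MAE
)
search.fit(X, y)

print("Best alpha:", search.best_params_["ridge__alpha"])
print("Cross-validated MAE:", -search.best_score_)
```

Note that `GridSearchCV` averages the MAE over folds, whereas the notebook pools all out-of-fold predictions before computing metrics, so the two numbers are not directly comparable.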