
In [None]:

import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit

# Load the updated dataset
df = pd.read_csv('updated_final_merged_data.csv', index_col='date', parse_dates=True)

# Target column (lowercase 'calls')
target = 'calls'

# Prepare data: Sort by date if not already
df = df.sort_index()

# Forecast horizon in days (weekly). Kept for reference; the CV splits below do not use it.
horizon = 7

# Time series cross-validation: 5 splits
tscv = TimeSeriesSplit(n_splits=5)

# Function to calculate metrics
def calculate_metrics(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100  # As percentage
    return {'MAE': mae, 'RMSE': rmse, 'MAPE': mape}

# Dictionary to store average metrics for each model
model_metrics = {}

# 1. Naive Forecast (Last observed value)
naive_preds = []
naive_trues = []
for train_idx, test_idx in tscv.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[test_idx]

    # Predict last train value for all test points
    last_value = train[target].iloc[-1]
    pred = np.full(len(test), last_value)

    naive_preds.extend(pred)
    naive_trues.extend(test[target])

naive_metrics = calculate_metrics(naive_trues, naive_preds)
model_metrics['Naive'] = naive_metrics

# 2. Mean Forecast (global mean)
# Note: computed on and scored against the full series (in-sample), not per CV fold
mean_value = df[target].mean()
mean_preds = np.full(len(df), mean_value)
mean_metrics = calculate_metrics(df[target], mean_preds)
model_metrics['Mean'] = mean_metrics

# 3. Median Forecast (global median; also scored in-sample on the full series)
median_value = df[target].median()
median_preds = np.full(len(df), median_value)
median_metrics = calculate_metrics(df[target], median_preds)
model_metrics['Median'] = median_metrics

# 4. Seasonal Naive (Same day last week, lag=7)
seasonal_preds = []
seasonal_trues = []
for train_idx, test_idx in tscv.split(df):
    train = df.iloc[train_idx]
    test = df.iloc[test_idx]

    # For each test row, predict the value 7 rows earlier (7 days on a gap-free daily index).
    # Note: rows more than 7 steps into a fold read observed test values from df.
    pred = []
    for i in test_idx:
        lag_idx = i - 7
        if lag_idx >= 0:
            pred.append(df[target].iloc[lag_idx])
        else:
            pred.append(train[target].mean())  # Fallback if no lag

    seasonal_preds.extend(pred)
    seasonal_trues.extend(test[target])

seasonal_metrics = calculate_metrics(seasonal_trues, seasonal_preds)
model_metrics['Seasonal Naive'] = seasonal_metrics

# Summarize performance
print("\nModel Performance Summary:")
metrics_df = pd.DataFrame(model_metrics).T
print(metrics_df)

# Pick winner: Lowest MAE (primary metric)
winner = metrics_df['MAE'].idxmin()
print(f"\nChampion Baseline Model: {winner}")
print(f"Metrics: {metrics_df.loc[winner].to_dict()}")



Model Performance Summary:
                        MAE         RMSE       MAPE
Naive           2706.061350  3385.348788  37.198945
Mean            1968.528582  2547.250358  27.218917
Median          1960.742331  2553.890108  26.514827
Seasonal Naive   857.704294  1322.311704  10.292237

Champion Baseline Model: Seasonal Naive
Metrics: {'MAE': 857.7042944785276, 'RMSE': 1322.3117036908475, 'MAPE': 10.292236542928933}


The results above are for baseline forecasts of daily call center volume on the dataset from the previous steps, in which weekends and holidays were imputed by forward-fill. The four baselines are Naive (the last training observation, held constant over each test fold), Mean (global average), Median (global median), and Seasonal Naive (the value from the same day one week earlier). Naive and Seasonal Naive are scored with 5-fold time-series cross-validation; Mean and Median are computed on and scored against the full series. Together they establish a performance floor before moving to more complex models. The metrics are Mean Absolute Error (MAE, the average absolute deviation in calls), Root Mean Squared Error (RMSE, which penalizes large errors more heavily), and Mean Absolute Percentage Error (MAPE, the average absolute error relative to the actual value). The evaluation draws on the exploratory data analysis (EDA), which highlighted strong weekly seasonality, non-stationarity, high variability (calls ranging from roughly 2,000 to 25,000, with outliers), and potential market correlations (e.g., VIX and CVOL).
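To make the three metrics concrete, here is a minimal NumPy sketch of each one on toy numbers. The helper names and the toy values are illustrative assumptions and are not taken from the dataset; the notebook itself uses the scikit-learn implementations.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target (calls)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error; squaring weights large misses more heavily."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error (in %); assumes no actual value is zero."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Toy values chosen for easy arithmetic (not from the dataset)
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

toy_mae = mae(y_true, y_pred)    # (10 + 20 + 0) / 3
toy_rmse = rmse(y_true, y_pred)  # sqrt((100 + 400 + 0) / 3)
toy_mape = mape(y_true, y_pred)  # (10% + 10% + 0%) / 3
print(toy_mae, toy_rmse, toy_mape)
```

RMSE exceeds MAE here because the single 20-call miss dominates the squared errors, which is the same effect that separates the two columns in the results.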

### Model Performance Summary Table
Here's the results table for clarity:

| Model          | MAE (calls) | RMSE (calls) | MAPE (%) |
|----------------|-------------|--------------|----------|
| Naive          | 2706.06     | 3385.35      | 37.20    |
| Mean           | 1968.53     | 2547.25      | 27.22    |
| Median         | 1960.74     | 2553.89      | 26.51    |
| Seasonal Naive | 857.70      | 1322.31      | 10.29    |

**Champion Baseline Model**: Seasonal Naive (lowest errors across all metrics, demonstrating the value of incorporating basic periodicity).

### Narrative Evaluation
These baseline results are a critical starting point: they show how predictable the series is without sophisticated modeling. The Naive model performs worst, with an MAE of ~2,706 calls and a MAPE of ~37%. As implemented, it repeats the last observation of each training fold across the entire test fold, so it is a persistence forecast frozen at a single day and not a rolling prior-day forecast. Its high error is consistent with the EDA's time series visualization and rolling statistics, which showed significant day-to-day fluctuations and volatility clusters: call volume does not stay near any single day's level for long, possibly because of external factors such as market events or operational cycles.
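The difference between a persistence forecast frozen at the end of training and a rolling prior-day forecast can be sketched on a synthetic weekly pattern. Everything below (the series, the split point, the variable names) is an assumption made for illustration; it is not the notebook's data or its evaluation procedure.

```python
import numpy as np
import pandas as pd

# Synthetic, perfectly periodic daily series: 4 identical weeks (illustrative only)
idx = pd.date_range("2024-01-01", periods=28, freq="D")
weekly_pattern = [5000, 9000, 9500, 9200, 8800, 8000, 5200]
calls = pd.Series(np.tile(weekly_pattern, 4), index=idx, name="calls").astype(float)

split = 21  # first 3 weeks for training, last week for testing
train, test = calls.iloc[:split], calls.iloc[split:]

# Frozen persistence: one value (last training day) repeated over the whole test window
frozen_pred = np.full(len(test), train.iloc[-1])

# Rolling persistence: each test day is predicted by the day before it
rolling_pred = calls.shift(1).iloc[split:].to_numpy()

frozen_mae = float(np.mean(np.abs(test.to_numpy() - frozen_pred)))
rolling_mae = float(np.mean(np.abs(test.to_numpy() - rolling_pred)))
print(f"frozen MAE: {frozen_mae:.1f}, rolling MAE: {rolling_mae:.1f}")
```

On this toy series the frozen forecast is penalized for whichever weekday the training window happens to end on, which is one plausible reason a fold-level Naive baseline scores poorly on data with weekly cycles.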

The Mean and Median models improve on this by using the central tendency of the full dataset, with MAEs of ~1,969 and ~1,961 calls and MAPEs of ~27.2% and ~26.5%, respectively. The Median's slight edge in MAE and MAPE is consistent with the EDA's distribution analysis, which suggested moderate skew and outliers (e.g., via boxplots and Z-score/IQR detection): the median is more robust to high-volume spikes. The Mean is marginally better on RMSE (2,547 vs. 2,554), as expected, since the mean minimizes squared error. Because both statistics are computed on and scored against the full series, these figures are in-sample and not strictly comparable with the cross-validated Naive and Seasonal Naive scores. In either case, global averages ignore temporal structure, so their errors remain too high for operational forecasting; they cannot capture the trend or seasonality evident in the EDA's decomposition and autocorrelation plots.
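One way to put the Mean and Median baselines on the same footing as the cross-validated models is to compute the statistic on each training fold only and score it on the following test fold. The sketch below assumes a hypothetical helper, `constant_baseline_cv`, and a synthetic series; it is not the notebook's method and produces no figures for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

def constant_baseline_cv(series, stat, n_splits=5):
    """Score a constant forecast out-of-sample: fit `stat` on train, predict it on test."""
    trues, preds = [], []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(series):
        value = stat(series.iloc[train_idx])          # statistic from training data only
        preds.extend(np.full(len(test_idx), value))   # same value for every test day
        trues.extend(series.iloc[test_idx].to_numpy())
    return np.array(trues), np.array(preds)

# Synthetic daily series with a weekly cycle plus noise (illustrative only)
rng = np.random.default_rng(0)
n = 120
idx = pd.date_range("2024-01-01", periods=n, freq="D")
values = 8000 + 2000 * np.sin(2 * np.pi * np.arange(n) / 7) + rng.normal(0, 300, n)
demo = pd.Series(values, index=idx, name="calls")

trues, mean_preds = constant_baseline_cv(demo, np.mean)
_, median_preds = constant_baseline_cv(demo, np.median)

mean_mae = mean_absolute_error(trues, mean_preds)
median_mae = mean_absolute_error(trues, median_preds)
print(f"fold-wise Mean MAE: {mean_mae:.1f}, fold-wise Median MAE: {median_mae:.1f}")
```

Scored this way, the constant baselines never see the values they are evaluated on, so their errors can be compared directly with the fold-based Naive and Seasonal Naive results.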

The Seasonal Naive model is the clear champion, with an MAE of ~858 calls, an RMSE of ~1,322, and a MAPE of ~10.3%. Relative to the Median, the next-best baseline on MAE, this cuts MAE by about 56%, MAPE by about 61%, and RMSE by about 48%. This supports the EDA's key finding of prominent weekly seasonality (decomposition showed recurring 7-day patterns, and day-of-week averages indicated higher mid-week volumes). By repeating the value from the same day last week, the model captures these cycles, even in the imputed dataset where weekends and holidays inherit the prior business day's value. One caveat: the lookup reads `df` by position, so for test rows more than seven steps into a fold the lagged value is itself an observed test value. The score therefore reflects a rolling 7-day-ahead forecast, which gives Seasonal Naive fresher information than the frozen Naive baseline. With predictions within ~10% of actuals on average, the model has practical utility for short-term planning, and the result indicates that periodic structure dominates random noise in this series.
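For a stricter comparison, a seasonal naive forecast can be built from the training fold alone by repeating its final week across the whole test window. The function name `seasonal_naive_from_train` and the toy series are assumptions for illustration; this is a sketch of an alternative, not the code that produced the reported scores.

```python
import numpy as np
import pandas as pd

def seasonal_naive_from_train(train, n_test, season=7):
    """Repeat the last `season` training values over `n_test` future steps.

    Uses training data only, so no observed test value leaks into the forecast.
    Assumes len(train) >= season and a gap-free, evenly spaced index.
    """
    last_cycle = train.iloc[-season:].to_numpy()
    reps = int(np.ceil(n_test / season))
    return np.tile(last_cycle, reps)[:n_test]

# Toy training series 0, 1, ..., 13 so the repeated week is easy to read
toy_train = pd.Series(np.arange(14.0))
toy_forecast = seasonal_naive_from_train(toy_train, n_test=10)
print(toy_forecast.tolist())
```

The first seven forecasts match a lag-7 lookup exactly; beyond that, this version keeps reusing the last training week, while a positional lookup into the full frame would start reading actual test observations.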

Overall, the baselines confirm the EDA's emphasis on seasonality as the dominant driver, and the Seasonal Naive sets a high bar (MAPE ~10%) that more complex models must beat to justify their use. The higher errors of the non-seasonal baselines underscore the non-stationary, patterned nature of the data, while the champion's success suggests that more sophisticated techniques (e.g., incorporating market features such as VIX) could build on it for further gains. If more advanced tiers (e.g., ML or DL) do not beat this benchmark, the simple, interpretable Seasonal Naive would be the most efficient choice for call center resource allocation, avoiding overfitting risks on a dataset with potential outliers and imputed gaps. Future iterations could test sensitivity to the imputation method or add external regressors, but these baselines already provide a robust, low-effort forecast floor.
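A simple way to express "beating the benchmark" is a skill score: one minus the ratio of a model's MAE to the benchmark's MAE, so positive values mean the model improves on the benchmark. The helper name `skill_vs_benchmark` is an assumption; the MAE values are the rounded figures from the results table above.

```python
def skill_vs_benchmark(model_mae, benchmark_mae):
    """Fractional MAE improvement over a benchmark (positive is better)."""
    return 1.0 - model_mae / benchmark_mae

# Rounded MAE values from the results table
reported_mae = {
    "Naive": 2706.06,
    "Mean": 1968.53,
    "Median": 1960.74,
    "Seasonal Naive": 857.70,
}

benchmark = reported_mae["Seasonal Naive"]
skill = {name: skill_vs_benchmark(value, benchmark) for name, value in reported_mae.items()}

for name, score in skill.items():
    print(f"{name:<15} skill vs Seasonal Naive: {score:+.3f}")
```

Every other baseline has negative skill against Seasonal Naive, and any later model would need a positive score on the same evaluation splits to justify its added complexity.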