# Baseline Model

## Table of Contents
1. [Model Choice](#model-choice)
2. [Feature Selection](#feature-selection)
3. [Implementation](#implementation)
4. [Evaluation](#evaluation)


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
    r2_score,
)

## Model Choice

We use a **linear regression baseline** (ordinary least squares via `LinearRegression`) because it is transparent, fast to train, and easy to interpret. It also aligns with `MF_20251113.ipynb`, where a linear model is used with engineered calendar and lag features to explain daily revenue.

For this baseline, the target is transformed with `log1p(Umsatz)` before fitting, and predictions are transformed back with `expm1` for evaluation. The transform reduces the right skew of revenue and makes the fit less sensitive to a few very high-revenue days, while the model stays linear in its features.
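
The effect of the transform can be illustrated on a handful of made-up numbers. The sketch below is not part of the pipeline; the array `umsatz` is a hypothetical example, and the max-to-median ratio is only a rough stand-in for skewness.

```python
import numpy as np

# Hypothetical revenue values, including a zero and one very large day
umsatz = np.array([0.0, 50.0, 120.0, 300.0, 5000.0])

# Forward transform used for fitting: log(1 + x), which is defined at x = 0
y_log = np.log1p(umsatz)

# Inverse transform used for evaluation: exp(y) - 1
umsatz_back = np.expm1(y_log)

# Rough spread measure: largest value relative to the median
spread_raw = umsatz.max() / np.median(umsatz)
spread_log = y_log.max() / np.median(y_log)

print('log1p values:', np.round(y_log, 3))
print('Round trip recovers input:', np.allclose(umsatz, umsatz_back))
print('max/median raw:', round(spread_raw, 2), '| log:', round(spread_log, 2))
```

The large day dominates far less after the transform, and `expm1` recovers the original values up to floating-point error.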

## Feature Selection

Following `MF_20251113.ipynb`, we use features that are available at day/product-group level and that we expect to be related to demand:

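- **Calendar effects**: `IsWeekend`, `IsNewYears` (set on 31 December), `IsHalloween` (set on 24 to 31 October)
- **Seasonality (Fourier terms)**: `sin_1y`, `cos_1y` (annual cycle), `sin_2y`, `cos_2y` (second harmonic, i.e. a half-year cycle)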
- **Event/holiday indicators**: `holiday`, `Easter`, `KielerWoche`
- **Autoregressive signals**: `Revenue_lag1`, `Revenue_lag7` (within each `Warengruppe`)
- **Product-group fixed effects**: one-hot encoded `Warengruppe` dummies

These features provide a strong, interpretable baseline before moving to more complex models.
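
To make the Fourier and lag features concrete, here is a minimal sketch on toy data. The frame `toy` and its values are hypothetical; only the column names follow the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two product groups, ten consecutive days each
dates = pd.date_range('2024-01-01', periods=10, freq='D')
toy = pd.DataFrame({
    'Datum': np.tile(dates, 2),
    'Warengruppe': np.repeat([1, 2], len(dates)),
    'Umsatz': np.arange(20, dtype=float),
})
toy = toy.sort_values(['Warengruppe', 'Datum']).reset_index(drop=True)

# First annual harmonic: one full cycle per 365.25 days
day_of_year = toy['Datum'].dt.dayofyear
toy['sin_1y'] = np.sin(2 * np.pi * day_of_year / 365.25)
toy['cos_1y'] = np.cos(2 * np.pi * day_of_year / 365.25)

# Lags are shifted inside each group, so values never cross group boundaries
toy['Revenue_lag1'] = toy.groupby('Warengruppe')['Umsatz'].shift(1)
toy['Revenue_lag7'] = toy.groupby('Warengruppe')['Umsatz'].shift(7)

print(toy[['Datum', 'Warengruppe', 'Umsatz', 'Revenue_lag1', 'Revenue_lag7']])
```

The first row of each group has no `Revenue_lag1`, and the first seven rows have no `Revenue_lag7`; such rows are removed later when missing values are dropped.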

In [None]:
# Load and prepare data (aligned with MF_20251113.ipynb)
sales = pd.read_csv('/workspaces/TeamCPH/data/umsatzdaten_gekuerzt.csv', parse_dates=['Datum'])
wetter = pd.read_csv('/workspaces/TeamCPH/data/wetter1.csv', parse_dates=['Datum'])
kiwo = pd.read_csv('/workspaces/TeamCPH/data/kiwo.csv', parse_dates=['Datum'])
holidays = pd.read_csv('/workspaces/TeamCPH/data/school_holidays_SH.csv', parse_dates=['Datum'])

# Aggregate to daily revenue per product group
sales_daily = (
    sales
    .groupby(['Datum', 'Warengruppe'], as_index=False)['Umsatz']
    .sum()
)

# Merge exogenous data
merged = sales_daily.merge(wetter, on='Datum', how='left')
merged = merged.merge(kiwo, on='Datum', how='left')
merged = merged.merge(holidays, on='Datum', how='left')

# Ensure expected indicator columns exist and hold 0 instead of NaN.
# Left merges leave NaN on dates missing from the lookup tables; without
# filling, the later dropna() would discard those rows entirely.
for col in ['holiday', 'Easter', 'KielerWoche']:
    if col in merged.columns:
        merged[col] = merged[col].fillna(0).astype(int)
    else:
        merged[col] = 0

# Calendar features
merged = merged.sort_values(['Warengruppe', 'Datum']).reset_index(drop=True)
merged['IsWeekend'] = merged['Datum'].dt.weekday.isin([5, 6]).astype(int)
# Note: despite its name, IsNewYears flags New Year's Eve (31 December)
merged['IsNewYears'] = (merged['Datum'].dt.strftime('%m-%d') == '12-31').astype(int)
halloween_days = [f'10-{day:02d}' for day in range(24, 32)]
merged['IsHalloween'] = merged['Datum'].dt.strftime('%m-%d').isin(halloween_days).astype(int)

# Seasonality features (Fourier terms)
merged['DayOfYear'] = merged['Datum'].dt.dayofyear
merged['sin_1y'] = np.sin(2 * np.pi * merged['DayOfYear'] / 365.25)
merged['cos_1y'] = np.cos(2 * np.pi * merged['DayOfYear'] / 365.25)
merged['sin_2y'] = np.sin(4 * np.pi * merged['DayOfYear'] / 365.25)
merged['cos_2y'] = np.cos(4 * np.pi * merged['DayOfYear'] / 365.25)

# Lag features within product group.
# shift() is row-based: lag7 equals "7 days ago" only if a group has no missing dates.
merged['Revenue_lag1'] = merged.groupby('Warengruppe')['Umsatz'].shift(1)
merged['Revenue_lag7'] = merged.groupby('Warengruppe')['Umsatz'].shift(7)

# Product-group dummies
wg_dummies = pd.get_dummies(merged['Warengruppe'], prefix='WG', drop_first=True, dtype=int)
merged = pd.concat([merged, wg_dummies], axis=1)

# Define predictors
predictors = [
    'holiday', 'Easter', 'KielerWoche',
    'IsWeekend', 'IsNewYears', 'IsHalloween',
    'sin_1y', 'cos_1y', 'sin_2y', 'cos_2y',
    'Revenue_lag1', 'Revenue_lag7',
] + wg_dummies.columns.tolist()

# Build modeling table and split
model_df = merged[['Umsatz'] + predictors].replace([np.inf, -np.inf], np.nan).dropna()
X = model_df[predictors]
y = np.log1p(model_df['Umsatz'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print('Train shape:', X_train.shape)
print('Test shape:', X_test.shape)
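
Because the predictors include lagged revenue, a random split places days from the same period in both the training and the test set, which can make test scores look better than true forecasting performance. One possible alternative is a chronological split, sketched below on hypothetical toy data. The frame `toy`, the column `feature`, and the `_ts` names are assumptions for illustration; applying this to the notebook would require keeping `Datum` in the modeling table.

```python
import numpy as np
import pandas as pd

# Hypothetical modeling table that keeps the date column next to the features
toy = pd.DataFrame({
    'Datum': pd.date_range('2024-01-01', periods=100, freq='D'),
    'feature': np.arange(100, dtype=float),
    'Umsatz': np.arange(100, dtype=float) + 50.0,
})

# Use the last 20% of distinct dates as the test period
unique_dates = toy['Datum'].drop_duplicates().sort_values().reset_index(drop=True)
cutoff = unique_dates.iloc[int(len(unique_dates) * 0.8)]

train = toy[toy['Datum'] < cutoff]
test = toy[toy['Datum'] >= cutoff]

X_train_ts, y_train_ts = train[['feature']], np.log1p(train['Umsatz'])
X_test_ts, y_test_ts = test[['feature']], np.log1p(test['Umsatz'])

print('Train period:', train['Datum'].min().date(), 'to', train['Datum'].max().date())
print('Test period: ', test['Datum'].min().date(), 'to', test['Datum'].max().date())
```

Splitting on distinct dates rather than on rows keeps all product groups for a given day on the same side of the cutoff.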

## Implementation

We fit a **linear regression** model on `log1p(Umsatz)` using the engineered feature set above. This is the baseline benchmark for later, more complex models.

In [None]:
# Initialize and train baseline linear regression
model = LinearRegression()
model.fit(X_train, y_train)

# Predict in log-space and convert back to revenue space
y_pred_log = model.predict(X_test)
y_pred = np.expm1(y_pred_log)
y_true = np.expm1(y_test)

print('Baseline linear regression trained successfully.')
print('Number of features:', X_train.shape[1])

## Evaluation

For the baseline linear regression, we report:

- **MAE** (average absolute revenue error)
- **RMSE** (penalizes larger errors)
- **MAPE** (relative percentage error, easy to interpret across product groups)
- **R² in log-space** (variance explained in the fitted target space)

These metrics provide a clear benchmark for comparing later, more complex models (e.g., neural nets).
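
As a reference for how these metrics are defined, the sketch below computes them by hand with NumPy on three made-up values. The arrays and the `_toy` names are hypothetical; the formulas follow the standard definitions that the scikit-learn functions implement.

```python
import numpy as np

# Hypothetical true and predicted revenues in the original scale
y_true_toy = np.array([100.0, 200.0, 400.0])
y_pred_toy = np.array([110.0, 180.0, 400.0])

errors = y_true_toy - y_pred_toy

# MAE: mean of absolute errors
mae_toy = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error
rmse_toy = np.sqrt(np.mean(errors ** 2))

# MAPE: mean absolute error relative to the true value, in percent
mape_toy = np.mean(np.abs(errors) / np.abs(y_true_toy)) * 100

# R² in log space: 1 - residual sum of squares / total sum of squares
log_true = np.log1p(y_true_toy)
log_pred = np.log1p(y_pred_toy)
ss_res = np.sum((log_true - log_pred) ** 2)
ss_tot = np.sum((log_true - log_true.mean()) ** 2)
r2_log_toy = 1 - ss_res / ss_tot

print(f'MAE:  {mae_toy:.2f}')
print(f'RMSE: {rmse_toy:.2f}')
print(f'MAPE: {mape_toy:.2f}%')
print(f'R² (log-space): {r2_log_toy:.4f}')
```

Note that MAPE divides by the true value, so days with revenue close to zero can inflate it considerably.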

In [None]:
# Evaluate the baseline model
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = mean_absolute_percentage_error(y_true, y_pred) * 100
r2_log = r2_score(y_test, y_pred_log)

print(f'MAE (revenue):  {mae:,.2f}')
print(f'RMSE (revenue): {rmse:,.2f}')
print(f'MAPE (revenue): {mape:.2f}%')
print(f'R² (log-space): {r2_log:.4f}')