# Baseline Model

## Table of Contents
1. [Model Choice](#model-choice)
2. [Feature Selection](#feature-selection)
3. [Implementation](#implementation)
4. [Evaluation](#evaluation)


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
    r2_score,
)

## Model Choice

We use a **linear regression baseline** (ordinary least squares via `LinearRegression`) because it is transparent, fast to train, and easy to interpret. It also aligns with `MF_20251113.ipynb`, where a linear model is used with engineered calendar and lag features to explain daily revenue.

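For this baseline, the target is transformed with `log1p(Umsatz)` during fitting and transformed back with `expm1` for evaluation. This reduces the right skew of revenue and dampens the influence of very large values on the fit, while keeping the model linear in its features. Because `log1p(x)` computes `log(1 + x)`, rows with zero revenue are handled without error.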
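The round trip between `log1p` and `expm1` can be sketched on a few made-up revenue values (the numbers below are illustrative, not taken from the dataset). The sketch also shows one consequence worth keeping in mind: an average taken in log-space and transformed back is never larger than the arithmetic mean, so a model fit on `log1p(Umsatz)` may under-predict revenue levels after back-transformation.

```python
import numpy as np

# Made-up revenue values, including a zero, for illustration only
revenue = np.array([0.0, 50.0, 120.0, 900.0])

log_revenue = np.log1p(revenue)    # log(1 + x), defined at x = 0
recovered = np.expm1(log_revenue)  # exp(x) - 1, the inverse of log1p

# Averaging in log-space and transforming back gives a value that is
# never above the arithmetic mean (Jensen's inequality; log1p is concave)
back_transformed_mean = np.expm1(log_revenue.mean())
arithmetic_mean = revenue.mean()

print('Round trip recovers input:', np.allclose(recovered, revenue))
print('Back-transformed log mean:', round(back_transformed_mean, 2))
print('Arithmetic mean:          ', round(arithmetic_mean, 2))
```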

## Feature Selection

For this baseline model, we use only two explanatory variables:
- `Temperatur` (daily temperature from weather data)
- `KielerWoche` (binary indicator for Kieler Woche event days)

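This provides a simple, interpretable benchmark focused on weather and event effects. Because neither predictor varies by `Warengruppe`, the model predicts the same revenue for every product group on a given day.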

In [9]:
# Load and prepare data for a simple baseline: Umsatz ~ Temperatur + KielerWoche
sales = pd.read_csv('/workspaces/TeamCPH/data/umsatzdaten_gekuerzt.csv', parse_dates=['Datum'])
wetter = pd.read_csv('/workspaces/TeamCPH/data/wetter1.csv', parse_dates=['Datum'])
kiwo = pd.read_csv('/workspaces/TeamCPH/data/kiwo.csv', parse_dates=['Datum'])

# Aggregate to daily revenue per product group
sales_daily = (
    sales
    .groupby(['Datum', 'Warengruppe'], as_index=False)['Umsatz']
    .sum()
)

# Merge weather and Kieler Woche indicator
merged = sales_daily.merge(wetter, on='Datum', how='left')
merged = merged.merge(kiwo, on='Datum', how='left')

# Make sure KielerWoche exists and is numeric
if 'KielerWoche' in merged.columns:
    merged['KielerWoche'] = merged['KielerWoche'].fillna(0).astype(int)
else:
    merged['KielerWoche'] = 0

# Keep only requested baseline features
predictors = ['Temperatur', 'KielerWoche']
model_df = merged[['Umsatz'] + predictors].replace([np.inf, -np.inf], np.nan).dropna()

X = model_df[predictors]
y = np.log1p(model_df['Umsatz'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print('Train shape:', X_train.shape)
print('Test shape:', X_test.shape)
print('Features used:', predictors)

Train shape: (7454, 2)
Test shape: (1864, 2)
Features used: ['Temperatur', 'KielerWoche']
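The left joins above keep exactly one row per sales record only if each `Datum` appears at most once in the weather and Kieler Woche tables. A minimal sketch of how this could be guarded with the `validate` argument of `DataFrame.merge` follows. The toy frames and their values are made up for illustration, and whether the real CSV files contain duplicate dates is not known from this notebook.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration)
sales_toy = pd.DataFrame({
    'Datum': pd.to_datetime(['2013-07-01', '2013-07-01', '2013-07-02']),
    'Warengruppe': [1, 2, 1],
    'Umsatz': [148.8, 535.9, 159.8],
})
weather_toy = pd.DataFrame({
    'Datum': pd.to_datetime(['2013-07-01', '2013-07-02']),
    'Temperatur': [17.8, 17.3],
})

# validate='many_to_one' raises MergeError if a date is duplicated
# on the right-hand side, which would otherwise silently add rows
merged_toy = sales_toy.merge(
    weather_toy, on='Datum', how='left', validate='many_to_one'
)
rows_preserved = len(merged_toy) == len(sales_toy)

# Simulate a weather table with a duplicated date
weather_dup = pd.concat([weather_toy, weather_toy.iloc[[0]]], ignore_index=True)
try:
    sales_toy.merge(weather_dup, on='Datum', how='left', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print('Row count preserved:', rows_preserved)
print('Duplicate date detected:', duplicate_detected)
```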


## Implementation

We fit a **linear regression** model on `log1p(Umsatz)` using only `Temperatur` and `KielerWoche` as predictors.

In [10]:
# Initialize and train baseline linear regression
model = LinearRegression()
model.fit(X_train, y_train)

# Predict in log-space and convert back to revenue space
y_pred_log = model.predict(X_test)
y_pred = np.expm1(y_pred_log)
y_true = np.expm1(y_test)

print('Baseline linear regression trained successfully.')
print('Number of features:', X_train.shape[1])

Baseline linear regression trained successfully.
Number of features: 2
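As an alternative to transforming the target by hand, scikit-learn's `TransformedTargetRegressor` can wrap the same `log1p`/`expm1` pair so that `predict` returns values in revenue space directly. The sketch below uses synthetic data (the coefficients and noise level are assumptions chosen for illustration) and checks that the wrapped model matches the manual approach used in this notebook. It is not the notebook's own method, only an equivalent formulation.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: a temperature-like column and a binary flag
rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.uniform(0, 30, size=200),   # hypothetical temperature
    rng.integers(0, 2, size=200),   # hypothetical event indicator
])
y_toy = np.exp(
    4.0 + 0.02 * X_toy[:, 0] + 0.1 * X_toy[:, 1] + rng.normal(0, 0.3, size=200)
)

# Wrapped version: log1p is applied before fitting, expm1 after predicting
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
wrapped.fit(X_toy, y_toy)
pred_wrapped = wrapped.predict(X_toy)

# Manual version, mirroring the cell above
manual = LinearRegression().fit(X_toy, np.log1p(y_toy))
pred_manual = np.expm1(manual.predict(X_toy))

print('Predictions match:', np.allclose(pred_wrapped, pred_manual))
```

The wrapper keeps the transform and its inverse together, which reduces the chance of evaluating log-space predictions against revenue-space targets by mistake.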


## Evaluation

For the baseline linear regression, we report:

- **MAE** (average absolute revenue error)
- **RMSE** (penalizes larger errors)
- **MAPE** (relative percentage error, easy to interpret across product groups)
- **R² in log-space** (variance explained in the fitted target space)

These metrics provide a clear benchmark for comparing later, more complex models (e.g., neural nets).
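To make the four metrics concrete, they can be computed by hand with NumPy on a small made-up example. The values below are illustrative only. Note that this hand-written MAPE divides by the true value and is therefore undefined when the true revenue is zero.

```python
import numpy as np

# Made-up true and predicted revenues for illustration
y_true_toy = np.array([100.0, 200.0, 400.0])
y_pred_toy = np.array([110.0, 180.0, 440.0])

errors = y_true_toy - y_pred_toy

mae_toy = np.mean(np.abs(errors))                            # average absolute error
rmse_toy = np.sqrt(np.mean(errors ** 2))                     # squaring weights large errors more
mape_toy = np.mean(np.abs(errors) / np.abs(y_true_toy)) * 100  # error relative to true value, in %

# R² in log-space: 1 - residual sum of squares / total sum of squares,
# both computed on log1p-transformed values
log_true = np.log1p(y_true_toy)
log_pred = np.log1p(y_pred_toy)
ss_res = np.sum((log_true - log_pred) ** 2)
ss_tot = np.sum((log_true - log_true.mean()) ** 2)
r2_log_toy = 1 - ss_res / ss_tot

print(f'MAE:  {mae_toy:.2f}')
print(f'RMSE: {rmse_toy:.2f}')
print(f'MAPE: {mape_toy:.2f}%')
print(f'R² (log-space): {r2_log_toy:.4f}')
```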

In [8]:
# Evaluate the baseline model
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = mean_absolute_percentage_error(y_true, y_pred) * 100
r2_log = r2_score(y_test, y_pred_log)

print(f'MAE (revenue):  {mae:,.2f}')
print(f'RMSE (revenue): {rmse:,.2f}')
print(f'MAPE (revenue): {mape:.2f}%')
print(f'R² (log-space): {r2_log:.4f}')

MAE (revenue):  105.10
RMSE (revenue): 147.10
MAPE (revenue): 63.83%
R² (log-space): 0.0373
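An R² of 0.0373 in log-space means the two predictors explain roughly 4% of the variance in `log1p(Umsatz)` on the test split. A useful reference point is a constant predictor that always returns the mean, which scores an R² of 0 on the data it was fit on. The sketch below illustrates this with scikit-learn's `DummyRegressor` on synthetic values (the data is made up, and the features are ignored by the dummy model by design).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Synthetic log-revenue values; the features are placeholders
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 30, size=(100, 2))
y_log_toy = rng.normal(5.0, 0.8, size=100)

# Always predicts the mean of the targets seen during fit
dummy = DummyRegressor(strategy='mean').fit(X_toy, y_log_toy)
r2_dummy = r2_score(y_log_toy, dummy.predict(X_toy))

print(f'R² of mean predictor: {r2_dummy:.4f}')
```

Fitting the same dummy model on `X_train`, `y_train` and scoring it on the test split would give a like-for-like floor for the metrics reported above.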
