# Baseline Model

## Table of Contents
1. [Model Choice](#model-choice)
2. [Feature Selection](#feature-selection)
3. [Implementation](#implementation)
4. [Evaluation](#evaluation)


In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Import the chosen baseline model (OLS from statsmodels)
import statsmodels.api as sm


## Model Choice

We chose multiple linear regression as our baseline because it is a natural starting point for predicting a target that depends on several explanatory variables. It is simple to understand, cheap to fit, and we are confident using it. Its R² gives a clear signal of whether a new feature improves the model, which makes it a useful benchmark for the neural network.
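
For reference, the general form of a multiple linear regression and the R² used to judge it can be written as below. The symbols ($y_i$ for the target, $x_{ij}$ for feature $j$ of observation $i$, $\beta_j$ for the coefficients, $\varepsilon_i$ for the error term, $\hat{y}_i$ for the prediction, $\bar{y}$ for the mean of the target) are notation introduced here for illustration, not taken from the notebook.

```latex
\begin{aligned}
y_i &= \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, \dots, n \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\end{aligned}
```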

In [3]:
# Load the dataset
df = pd.read_csv('/workspaces/ml-project-template/final_dataset.csv')

print("Dataset shape:", df.shape)
df.head()

Dataset shape: (10896, 22)


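| Unnamed: 0 | Date | Holiday | NextDayHoliday | IsWeekend | Month | KielerWeek | IsNewYearsEve | IsHalloween | t | lag_1 | ... | year_sin1 | year_cos1 | year_sin2 | year_cos2 | Revenue | Product_2 | Product_3 | Product_4 | Product_5 | Product_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-07-01 | 1 | 1 | 0 | 7 | 0 | 0 | 0 | 0 | 1269.2491 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 148.82835 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2013-07-01 | 1 | 1 | 0 | 7 | 0 | 0 | 0 | 0 | 1269.2491 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 535.85626 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2013-07-01 | 1 | 1 | 0 | 7 | 0 | 0 | 0 | 0 | 1269.2491 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 201.19843 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2013-07-01 | 1 | 1 | 0 | 7 | 0 | 0 | 0 | 0 | 1269.2491 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 2013-07-01 | 1 | 1 | 0 | 7 | 0 | 0 | 0 | 0 | 1269.2491 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 317.47586 | 0 | 0 | 0 | 1 | 0 |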


## Feature Selection

For this simple model we chose `Holiday`, because it was one of our first intuitions and it increased the R², and `lag_1`, one of the features we created later. (Note: the model also includes the one-hot encoded product columns; without them, `Holiday` and `lag_1` would have very little explanatory power.)
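
The product columns in the dataset start at `Product_2`, which is consistent with one product category having been dropped as the reference level. The sketch below shows one way such columns could be produced. It assumes a hypothetical raw `Product` column with integer codes 1 to 6 and made-up revenue values; the preprocessing actually used for `final_dataset.csv` is not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw data: one row per product, integer product codes (made-up values)
raw = pd.DataFrame({
    "Product": [1, 2, 3, 6, 5],
    "Revenue": [100.0, 500.0, 200.0, 0.0, 300.0],
})

# Declare all six levels so every dummy column exists even if a product is absent
raw["Product"] = pd.Categorical(raw["Product"], categories=[1, 2, 3, 4, 5, 6])

# drop_first=True removes Product_1, which becomes the reference category
dummies = pd.get_dummies(raw["Product"], prefix="Product", drop_first=True, dtype=int)
encoded = pd.concat([raw.drop(columns="Product"), dummies], axis=1)

print(encoded.columns.tolist())
```

With this encoding, a row whose product dummies are all zero belongs to the reference product, so in a linear model each product coefficient is an offset relative to that product.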

In [4]:
# Feature selection: Holiday and lag_1, plus the one-hot encoded product columns
X = df[['Holiday', 'lag_1',
        'Product_2', 'Product_3',
        'Product_4', 'Product_5', 'Product_6']]

y = df['Revenue']

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
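
The split above is random. Because the data is ordered by `Date` and includes a lag feature, a random split can place training rows later in time than some test rows, which may make the test score somewhat optimistic. A chronological split is a common alternative. The sketch below is an assumption-laden illustration: `chronological_split` is a hypothetical helper, the demo frame is made up, and this is not the split that produced the results reported in this notebook.

```python
import pandas as pd

def chronological_split(df, date_col="Date", test_size=0.2):
    """Split so that every test date is later than every training date.

    Hypothetical helper for illustration; all rows that share a date
    stay on the same side of the split.
    """
    dates = pd.to_datetime(df[date_col])
    unique_dates = dates.drop_duplicates().sort_values()
    # Number of distinct dates reserved for the test period (at least one)
    n_test_dates = max(1, int(round(len(unique_dates) * test_size)))
    cutoff = unique_dates.iloc[-n_test_dates]  # first date of the test period
    is_test = dates >= cutoff
    return df[~is_test], df[is_test]

# Made-up example: 10 days with 2 rows (products) per day
demo = pd.DataFrame({
    "Date": pd.date_range("2013-07-01", periods=10).repeat(2),
    "Revenue": range(20),
})
train_df, test_df = chronological_split(demo, test_size=0.2)
print(train_df["Date"].max(), "<", test_df["Date"].min())
```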


## Implementation

In [5]:
# Initialize and train the baseline model: ordinary least squares (OLS)
import statsmodels.api as sm

# statsmodels does not add an intercept automatically, so add a constant column
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

# Fit on the training set, then predict on the held-out test set
model = sm.OLS(y_train, X_train_const).fit()
y_pred = model.predict(X_test_const)
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                Revenue   R-squared:                       0.748
Model:                            OLS   Adj. R-squared:                  0.748
Method:                 Least Squares   F-statistic:                     3699.
Date:                Fri, 27 Feb 2026   Prob (F-statistic):               0.00
Time:                        12:01:45   Log-Likelihood:                -50115.
No. Observations:                8716   AIC:                         1.002e+05
Df Residuals:                    8708   BIC:                         1.003e+05
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         77.3933      2.560     30.229      0.0

## Evaluation

We chose MSE, RMSE, and R² because they are standard metrics for evaluating regression models.
**MSE** measures the average squared prediction error and penalizes large deviations more heavily, making it sensitive to large mistakes. **RMSE** is the square root of MSE, which puts the error back in the units of Revenue and makes it easier to interpret.
**R²** measures the share of the variance in the target that the model explains.

On the test set, the model yields an MSE of 5901.34 and an RMSE of 76.82, so a typical prediction misses the actual Revenue value by roughly 77 units. The test R² of 0.754 indicates that the model explains about 75.4% of the variance in Revenue. This is close to the training R² of 0.748 reported in the OLS summary, which suggests little overfitting and makes the model a solid baseline.
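
To make the three metrics concrete, the sketch below computes them directly with NumPy on a few made-up values (not the notebook's data). `regression_metrics` is a hypothetical helper name; the notebook itself uses `mean_squared_error` and `r2_score` from scikit-learn.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R²) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                    # average squared error
    rmse = np.sqrt(mse)                              # back in the target's units
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2

# Made-up values for illustration only
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
print("MSE:", mse, "RMSE:", rmse, "R²:", r2)
```

The same relationship holds for the reported results: the RMSE of 76.82 is the square root of the MSE of 5901.34.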

In [6]:
# Evaluate the baseline model on the held-out test set
from sklearn.metrics import mean_squared_error, r2_score

# Predictions on the test set (same as in the previous cell)
y_pred = model.predict(X_test_const)

# Evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("MSE:", mse)
print("RMSE:", rmse)
print("R²:", r2)


MSE: 5901.337781775943
RMSE: 76.8201652027379
R²: 0.7544480786043978
