**Multiple Regression Model**


*Build a multiple regression model using all engineered features.*

**train.csv**:
* The training data, comprising time series of the features store_nbr, family, and onpromotion, as well as the target sales.
* store_nbr identifies the store at which the products are sold.
* family identifies the type of product sold.
* sales gives the total sales for a product family at a particular store on a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
* onpromotion gives the total number of items in a product family that were being promoted at a store on a given date.



**holidays_events.csv**:
* Holidays and events, with metadata.
* A holiday that is transferred officially falls on that calendar day, but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day on which it was actually celebrated, look for the corresponding row where type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12.
* Days of type Bridge are extra days added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up for by a day of type Work Day, which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge.
* Additional holidays are days added to a regular calendar holiday, as typically happens around Christmas (making Christmas Eve a holiday).
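
The rules above can be turned into a row filter. The following is a minimal sketch on a toy frame, not the filter used in the cells further down: it assumes the file has a boolean `transferred` column (as described on the Kaggle data page), and the rows are illustrative. Under that assumption, it keeps the dates on which a holiday was actually celebrated.

```python
import pandas as pd

# Toy rows mimicking holidays_events.csv (illustrative, not the real file).
# The `transferred` column name is an assumption taken from the data description.
toy_holidays = pd.DataFrame({
    "date": pd.to_datetime(["2012-10-09", "2012-10-12", "2012-12-24", "2013-01-05"]),
    "type": ["Holiday", "Transfer", "Additional", "Work Day"],
    "transferred": [True, False, False, False],
})

# Keep only days that were actually celebrated:
#  - drop rows whose holiday was moved away (transferred == True)
#  - drop 'Work Day' rows, which are make-up working days
#  - keep 'Transfer' rows, since those are the dates of actual celebration
celebrated = toy_holidays[
    (~toy_holidays["transferred"]) & (toy_holidays["type"] != "Work Day")
]

celebrated_dates = sorted(celebrated["date"].dt.strftime("%Y-%m-%d"))
print(celebrated_dates)
```

Applying a rule like this would change which dates are flagged as holidays, so any downstream metrics would need to be recomputed.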

**stores.csv**:
* Store metadata, including city, state, type, and cluster.
* cluster is a grouping of similar stores.

Dataset link: https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data

In [14]:
#importing all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [15]:
#Reading and loading datasets
train_df = pd.read_csv('train.csv', parse_dates=['date'])
stores_df = pd.read_csv('stores.csv')
holidays_df = pd.read_csv('holidays_events.csv', parse_dates=['date'])

In [16]:
# Merge train_df with stores_df to get the 'type' (store type)
df = pd.merge(train_df, stores_df, on='store_nbr', how='left')
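
A left merge silently produces NaN for stores that have no metadata row, and duplicates rows if a store appears twice in the lookup table. A hedged sketch of how both cases could be checked, using toy frames (`toy_train` and `toy_stores` are hypothetical stand-ins for the real data):

```python
import pandas as pd

# Hypothetical stand-ins for train_df and stores_df.
toy_train = pd.DataFrame({"store_nbr": [1, 1, 2, 3], "sales": [0.0, 2.5, 1.0, 4.0]})
toy_stores = pd.DataFrame({"store_nbr": [1, 2], "type": ["D", "B"]})

# validate="many_to_one" raises if store_nbr is duplicated in toy_stores;
# indicator=True adds a `_merge` column showing where each row came from.
merged = pd.merge(
    toy_train, toy_stores,
    on="store_nbr", how="left",
    validate="many_to_one", indicator=True,
)

# Rows present in the left frame only have no store metadata (type is NaN).
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"rows: {len(merged)}, rows without store metadata: {unmatched}")
```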

In [17]:
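# Select the holiday rows used to build the binary holiday flag.
# This keeps every row except those of type 'Work Day' and 'Transfer'.
# Note: rows whose holiday was moved to another date are still kept, while
# 'Transfer' rows (the dates of actual celebration) are dropped.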
actual_holidays = holidays_df[
    (holidays_df['type'] != 'Work Day') &
    (holidays_df['type'] != 'Transfer')
]

In [18]:
# Create a unique list of actual holiday dates
holiday_dates = actual_holidays['date'].unique()

In [19]:
#Creating the IsHoliday feature in the main dataframe
df['IsHoliday'] = df['date'].isin(holiday_dates).astype(int)

In [20]:
# Drop every row that has a missing value in any column.
df.dropna(inplace=True)
print(f"Total rows after dropping NaNs: {len(df)}")

Total rows after dropping NaNs: 3000888


In [21]:
# Apply log1p, i.e. log(1 + x), to the right-skewed sales column; zero sales stay at 0
df['sales'] = np.log1p(df['sales'])
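
Because the target is now on a log1p scale, predictions have to be mapped back with `np.expm1` before they can be read as sales. A small sketch with made-up values showing that the two functions are inverses:

```python
import numpy as np

# Made-up sales values, including a zero and a large peak.
sales = np.array([0.0, 1.5, 10.0, 1000.0])

log_sales = np.log1p(sales)      # log(1 + x); maps 0 to 0
recovered = np.expm1(log_sales)  # exp(x) - 1; inverse of log1p

print(log_sales.round(3))
print(recovered.round(3))
```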

In [22]:
#selecting features and target column for training the model
features = ['onpromotion', 'IsHoliday', 'type', 'store_nbr', 'family']
target = 'sales'

In [23]:
#Handling Categorical Features using One-Hot Encoding
X = df[features]
X = pd.get_dummies(X, columns=['type', 'store_nbr', 'family'], dummy_na=False)
y = df[target]
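
To make the encoding step concrete, here is a minimal sketch on a toy frame (the rows are invented). It also shows the optional `drop_first=True` variant: with an intercept in the model, a full set of dummy columns per category is perfectly collinear, so individual coefficients are not uniquely identified, even though predictions from a least-squares fit are typically unaffected.

```python
import pandas as pd

# Invented rows with the same kind of columns as the notebook's feature set.
toy = pd.DataFrame({
    "onpromotion": [0, 3, 1],
    "type": ["A", "B", "A"],
    "family": ["DAIRY", "DAIRY", "BREAD/BAKERY"],
})

# Full one-hot encoding: one column per category level.
encoded = pd.get_dummies(toy, columns=["type", "family"], dtype=int)
print(encoded.columns.tolist())

# Reference-level encoding: the first level of each category is dropped.
reduced = pd.get_dummies(toy, columns=["type", "family"], drop_first=True, dtype=int)
print(reduced.columns.tolist())
```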

In [24]:
# Split the data into training and test sets (random 80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"\nTraining on {len(X_train)} rows.")


Training on 2400710 rows.
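
The split above is random, so test rows can come from dates earlier than some training rows. For forecasting, a chronological split is a common alternative. The sketch below is an illustration on a toy frame, not what this notebook does; the 80% cutoff and the frame itself are assumptions.

```python
import pandas as pd

# Toy daily series standing in for the merged dataframe.
toy = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=10, freq="D"),
    "sales": range(10),
})

# Pick the date that separates the first 80% of distinct dates from the rest.
dates = toy["date"].sort_values().unique()
cutoff = dates[int(len(dates) * 0.8)]

# Train on everything before the cutoff, test on the cutoff date and later.
train_part = toy[toy["date"] < cutoff]
test_part = toy[toy["date"] >= cutoff]
print(len(train_part), len(test_part))
```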


In [25]:
#Fitting the model
model = LinearRegression(n_jobs=-1)
model.fit(X_train, y_train)

In [26]:
y_pred = model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("\n--- Model Evaluation Results ---")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f} (on log-transformed sales)")
print(f"R-squared (R2 Score): {r2:.4f}")


--- Model Evaluation Results ---
Mean Squared Error (MSE): 1.8813
Root Mean Squared Error (RMSE): 1.3716 (on log-transformed sales)
R-squared (R2 Score): 0.7413
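
The RMSE above is measured in log1p units, not in units of sales. The sketch below uses invented numbers to show how the two scales differ, and gives one rough way to read a log-scale error: an error of 1.3716 in log1p units corresponds to the ratio (1 + predicted) / (1 + actual) being off by a factor of about `exp(1.3716)`.

```python
import numpy as np

# Invented actual and predicted sales on the original scale.
y_true_sales = np.array([0.0, 4.0, 20.0, 150.0])
y_pred_sales = np.array([1.0, 3.0, 30.0, 100.0])

# RMSE on log1p-transformed values (the scale used in this notebook).
log_errors = np.log1p(y_pred_sales) - np.log1p(y_true_sales)
rmse_on_logs = np.sqrt(np.mean(log_errors ** 2))

# RMSE on the original scale, dominated by the largest values.
rmse_original = np.sqrt(np.mean((y_pred_sales - y_true_sales) ** 2))

# Multiplicative reading of a log-scale error of 1.3716.
typical_factor = np.exp(1.3716)

print(f"RMSE on log1p scale: {rmse_on_logs:.4f}")
print(f"RMSE on original scale: {rmse_original:.4f}")
print(f"exp(1.3716) = {typical_factor:.2f}")
```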


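The linear model reaches $R^2 = 0.7413$ on the held-out split, which makes it a reasonable baseline. Both $R^2$ and the RMSE of 1.3716 are measured on log-transformed sales, and the split is random rather than chronological, so these scores may be optimistic for forecasting future dates. A linear model of this kind is likely to struggle with zero-sales rows and high peaks, although this notebook does not measure that directly. Possible next steps are a non-linear model (such as XGBoost) and time-series features.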
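
As an illustration of what time-series features could look like, the sketch below adds a day-of-week column and a one-day sales lag per store and product family. The toy rows and the column names `day_of_week` and `sales_lag_1` are hypothetical; this is one possible starting point, not a tested improvement.

```python
import pandas as pd

# Toy rows: two stores, one product family, three consecutive days.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"] * 2),
    "store_nbr": [1, 1, 1, 2, 2, 2],
    "family": ["DAIRY"] * 6,
    "sales": [5.0, 6.0, 7.0, 1.0, 0.0, 2.0],
})

# Sort so that shifting within each group moves along time.
toy = toy.sort_values(["store_nbr", "family", "date"])

# Calendar feature: Monday is 0, Sunday is 6.
toy["day_of_week"] = toy["date"].dt.dayofweek

# Previous day's sales for the same store and family; the first day of each
# group has no history and is left as NaN.
toy["sales_lag_1"] = toy.groupby(["store_nbr", "family"])["sales"].shift(1)

print(toy)
```

Lag features only make sense with a chronological train/test split, since a random split would let the model see values from after the dates it is tested on.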