<a href="https://colab.research.google.com/github/jblcky/retail-inventory-forecasting-2/blob/main/notebooks/model_dev.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Loading data from GitHub

In [4]:
import pandas as pd

url = 'https://raw.githubusercontent.com/jblcky/retail-inventory-forecasting-2/refs/heads/main/raw/sales_data.csv'
sales_df = pd.read_csv(url, parse_dates=["date"])

sales_df.head()



,date,sku_id,store_id,quantity_sold
0,2024-01-01,SKU_001,Store_A,45
1,2024-01-02,SKU_001,Store_A,51
2,2024-01-03,SKU_001,Store_A,51
3,2024-01-04,SKU_001,Store_A,57
4,2024-01-05,SKU_001,Store_A,46


In [6]:
url = 'https://raw.githubusercontent.com/jblcky/retail-inventory-forecasting-2/refs/heads/main/raw/events_calendar.csv'
events_df = pd.read_csv(url, parse_dates=["date"])
events_df.head()

,date,event
0,2024-01-01,Public Holiday
1,2024-01-15,Public Holiday
2,2024-01-29,Flu Season
3,2024-02-12,Public Holiday
4,2024-02-26,Public Holiday


In [7]:
url = 'https://raw.githubusercontent.com/jblcky/retail-inventory-forecasting-2/refs/heads/main/raw/inventory_levels.csv'
inventory_df = pd.read_csv(url, parse_dates=['date'])
inventory_df.head()

,date,sku_id,store_id,inventory_level
0,2024-01-01,SKU_001,Store_A,77
1,2024-01-02,SKU_001,Store_A,180
2,2024-01-03,SKU_001,Store_A,124
3,2024-01-04,SKU_001,Store_A,127
4,2024-01-05,SKU_001,Store_A,139


In [8]:
url = 'https://raw.githubusercontent.com/jblcky/retail-inventory-forecasting-2/refs/heads/main/raw/sku_metadata.csv'
sku_df = pd.read_csv(url)
sku_df.head()

,sku_id,category,supplier,lead_time_days,unit_cost
0,SKU_001,Medicine,Supplier_1,7,4.53
1,SKU_002,Supplement,Supplier_1,21,7.14
2,SKU_003,Supplement,Supplier_2,7,4.74
3,SKU_004,Medicine,Supplier_4,7,4.58
4,SKU_005,Personal Care,Supplier_2,14,9.47


**Data cleaning steps**
- Inspect shapes and basic information
- Check for missing values
- Check for duplicates
- Fix dtypes if needed

In [9]:
# Check shapes
print("Sales shape:", sales_df.shape)
print("Inventory shape:", inventory_df.shape)
print("Events shape:", events_df.shape)
print("SKU Meta shape:", sku_df.shape)

# Check missing values
print("\nMissing values in each dataframe:")
print("Sales:\n", sales_df.isnull().sum())
print("Inventory:\n", inventory_df.isnull().sum())
print("Events:\n", events_df.isnull().sum())
print("SKU Meta:\n", sku_df.isnull().sum())

# Check data types
print("\nData types:")
print(sales_df.dtypes)


Sales shape: (3600, 4)
Inventory shape: (3600, 4)
Events shape: (10, 2)
SKU Meta shape: (10, 5)

Missing values in each dataframe:
Sales:
 date             0
sku_id           0
store_id         0
quantity_sold    0
dtype: int64
Inventory:
 date               0
sku_id             0
store_id           0
inventory_level    0
dtype: int64
Events:
 date     0
event    0
dtype: int64
SKU Meta:
 sku_id            0
category          0
supplier          0
lead_time_days    0
unit_cost         0
dtype: int64

Data types:
date             datetime64[ns]
sku_id                   object
store_id                 object
quantity_sold             int64
dtype: object


Basic summary statistics and duplicates

In [10]:
# Basic summary stats
print("\nSales quantity summary:")
display(sales_df['quantity_sold'].describe())

# Check for duplicates
print("\nDuplicates:")
print("Sales:", sales_df.duplicated().sum())
print("Inventory:", inventory_df.duplicated().sum())
print("Events:", events_df.duplicated().sum())
print("SKU Meta:", sku_df.duplicated().sum())



Sales quantity summary:


,quantity_sold
count,3600.0
mean,30.744444
std,15.206232
min,0.0
25%,19.0
50%,28.0
75%,42.0
max,71.0



Duplicates:
Sales: 0
Inventory: 0
Events: 0
SKU Meta: 0
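
The `duplicated()` calls above compare entire rows, so two rows with the same date, SKU and store but different quantities would not be counted. A minimal sketch of a key-level check follows, assuming that `date`, `sku_id` and `store_id` together should identify a sales row. The toy frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for sales_df; values are illustrative only.
toy_sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
    "sku_id": ["SKU_001", "SKU_001", "SKU_001"],
    "store_id": ["Store_A", "Store_A", "Store_A"],
    "quantity_sold": [45, 47, 51],  # first two rows share a key but differ in quantity
})

# Assumed natural key of the sales table.
key_cols = ["date", "sku_id", "store_id"]

# Full-row check: misses the conflict because quantity_sold differs.
full_row_dupes = toy_sales.duplicated().sum()

# Key-level check: flags every repeat of an already-seen key.
key_dupes = toy_sales.duplicated(subset=key_cols).sum()

print("Full-row duplicates:", full_row_dupes)
print("Key-level duplicates:", key_dupes)
```

The same `subset=` argument can be applied to `sales_df` and `inventory_df` if the key assumption holds for them.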


Explore unique values

In [11]:
print("Unique SKUs:", sales_df['sku_id'].nunique())
print("Unique Stores:", sales_df['store_id'].nunique())
print("Date range:", sales_df['date'].min(), "to", sales_df['date'].max())

# Peek at top selling SKUs
top_skus = sales_df.groupby('sku_id')['quantity_sold'].sum().sort_values(ascending=False)
print("\nTop selling SKUs:")
display(top_skus.head(5))


Unique SKUs: 10
Unique Stores: 2
Date range: 2024-01-01 00:00:00 to 2024-06-28 00:00:00

Top selling SKUs:


sku_id,quantity_sold
SKU_001,17697
SKU_010,14784
SKU_003,13815
SKU_008,12679
SKU_009,11185
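
The unique counts and date range printed above allow a quick completeness check: if every SKU and store pair has one row per calendar day, the row count should equal SKUs × stores × days. The sketch below works through that arithmetic and adds a small helper for spotting gaps. `missing_days` is a hypothetical helper name, and the toy frame contains made-up values.

```python
import pandas as pd

# Counts and date range as printed by the cell above.
n_skus, n_stores = 10, 2
dates = pd.date_range("2024-01-01", "2024-06-28", freq="D")
expected_rows = n_skus * n_stores * len(dates)
print("Days in range:", len(dates))
print("Expected rows for a complete panel:", expected_rows)


def missing_days(df, key_cols=("sku_id", "store_id"), date_col="date"):
    """Hypothetical helper: days missing per key within the frame's overall date range."""
    full_range = pd.date_range(df[date_col].min(), df[date_col].max(), freq="D")
    observed = df.groupby(list(key_cols))[date_col].nunique()
    return (len(full_range) - observed).rename("missing_days")


# Toy frame (made-up values): SKU_B has no row for 2024-01-02.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-03"]
    ),
    "sku_id": ["SKU_A"] * 3 + ["SKU_B"] * 2,
    "store_id": ["Store_A"] * 5,
})
gaps = missing_days(toy)
print(gaps)
```

The range covers 180 days, so a complete panel would have 10 × 2 × 180 = 3600 rows, which matches the reported sales shape of (3600, 4) and is consistent with one row per SKU, store and day.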


In [12]:
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


**Feature engineering**
- Time features
- Event flags
- Rolling averages (7d, 14d, etc.)
- Merge SKU metadata (optional)

Add time features

In [13]:
sales_df['day_of_week'] = sales_df['date'].dt.dayofweek  # 0 = Monday
sales_df['weekofyear'] = sales_df['date'].dt.isocalendar().week.astype(int)  # ISO week; cast from UInt32 to plain int
sales_df['month'] = sales_df['date'].dt.month
sales_df['is_weekend'] = sales_df['day_of_week'] >= 5


Merge events (public holidays, school holidays, flu season)

In [14]:
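# Merge events onto sales (left join on date); dates without an event get NaN in 'event'
sales_df = sales_df.merge(events_df, on='date', how='left')

# One-hot encode event type; with dummy_na=False, rows with no event are False in every event_* column
sales_df = pd.get_dummies(sales_df, columns=['event'], prefix='event', dummy_na=False)

# get_dummies leaves no NaNs behind, so cast the indicator columns to 0/1 integers instead of filling
event_cols = [col for col in sales_df.columns if col.startswith("event_")]
sales_df[event_cols] = sales_df[event_cols].astype(int)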
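
A left join on `date` silently duplicates sales rows if the events table ever lists two events on the same day. The sketch below shows one way to guard against that using the `validate` argument of `DataFrame.merge`, and one possible workaround that pivots events to a single row per date first. The toy frames are made up, and nothing here implies that the actual `events_df` contains repeated dates.

```python
import pandas as pd

# Toy frames (illustrative values only).
toy_sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "quantity_sold": [45, 51],
})
# Two events on the same date: a plain left join would duplicate that sales row.
toy_events = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01"]),
    "event": ["Public Holiday", "Flu Season"],
})

# validate="many_to_one" raises if dates are not unique in the right-hand frame.
try:
    toy_sales.merge(toy_events, on="date", how="left", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError as exc:
    merge_ok = False
    print("Merge rejected:", exc)

# Workaround: build one row per date with a 0/1 column per event type, then merge.
event_flags = (
    pd.crosstab(toy_events["date"], toy_events["event"])
    .clip(upper=1)
    .add_prefix("event_")
    .reset_index()
)
merged = toy_sales.merge(event_flags, on="date", how="left", validate="many_to_one")

# Dates with no event come back as NaN after the join; here fillna(0) is needed.
event_cols = [c for c in merged.columns if c.startswith("event_")]
merged[event_cols] = merged[event_cols].fillna(0).astype(int)
print(merged)
```

With this ordering (encode first, merge second) the row count of the sales frame cannot change, and a day with several events gets a 1 in each matching column.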


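Add rolling sales features. A 7-day rolling mean of past sales captures short-term trends; shifting by one day keeps the current day's sales out of its own feature.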

In [15]:
# Sort to ensure correct rolling
sales_df = sales_df.sort_values(['sku_id', 'store_id', 'date'])

# 7-day rolling average of past sales per SKU+Store
sales_df['rolling_7d_sales'] = (
    sales_df.groupby(['sku_id', 'store_id'])['quantity_sold']
    .transform(lambda x: x.shift(1).rolling(window=7).mean())
)

# The first 7 rows of each SKU+store series lack a full window of past sales; fill those NaNs with 0
sales_df['rolling_7d_sales'] = sales_df['rolling_7d_sales'].fillna(0)
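
The feature list also mentions a 14-day window and an optional SKU metadata merge, which the cells up to this point do not cover. The following is a minimal sketch of both on toy data, not the notebook's own implementation. It deliberately differs from the cell above in one respect: it uses `min_periods=1`, so early rows get the mean of whatever history exists rather than a filled-in 0. The frames `toy_sales` and `toy_sku` and their values are assumptions for illustration.

```python
import pandas as pd

# Toy data standing in for sales_df and sku_df; values are illustrative only.
toy_sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=4, freq="D").tolist() * 2,
    "sku_id": ["SKU_001"] * 4 + ["SKU_002"] * 4,
    "store_id": ["Store_A"] * 8,
    "quantity_sold": [10, 20, 30, 40, 1, 2, 3, 4],
})
toy_sku = pd.DataFrame({
    "sku_id": ["SKU_001", "SKU_002"],
    "category": ["Medicine", "Supplement"],
    "lead_time_days": [7, 21],
})

# Sort so that rolling windows run in date order within each SKU+store series.
toy_sales = toy_sales.sort_values(["sku_id", "store_id", "date"]).reset_index(drop=True)

for window in (7, 14):
    toy_sales[f"rolling_{window}d_sales"] = (
        toy_sales.groupby(["sku_id", "store_id"])["quantity_sold"]
        # shift(1) excludes the current day; min_periods=1 uses partial history
        .transform(lambda s, w=window: s.shift(1).rolling(window=w, min_periods=1).mean())
    )

# Attach SKU attributes; validate guards against duplicate sku_id rows in the metadata.
toy_sales = toy_sales.merge(toy_sku, on="sku_id", how="left", validate="many_to_one")
print(toy_sales)
```

Only the first row of each series remains NaN here, because it has no prior day at all. Whether to fill it with 0, drop it, or leave it for a model that handles missing values is a modelling choice.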
