## Step 2: Feature Engineering


#### In this step we convert the date columns to datetime and create time‑based features (Days_to_Expire, Product_Age_Days, Days_Since_Last_Order) that capture freshness and lifecycle information.

#### Create expiry‑risk flags (Is_Near_Expiry, Is_Expired) to easily identify products in the danger zone.

#### Build stock vs. sales features (Stock_to_Sales_Ratio, Stock_Value, Revenue_per_Unit) to capture overstock and profitability.

#### Add category‑level context (category mean/std stock and z‑scores) so each product is compared to its category peers.

#### Define a High_Risk label (combining perishability, expiry, turnover, and overstock) that can be used later for a classification model and manager alerts.

In [44]:
import pandas as pd
from datetime import datetime

# Load cleaned data
df = pd.read_csv("grocery_inventory_cleaned.csv")

In [47]:
# parse dates


date_cols = ["Date_Received", "Expiration_Date", "Last_Order_Date"]
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors="coerce")
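
Because `errors="coerce"` silently turns unparseable values into `NaT`, it can be worth counting how many cells were lost in the conversion. The following is a minimal sketch on a made-up toy frame (the values and the `failed` helper are illustrative assumptions, not part of the notebook's data):

```python
import pandas as pd

# Toy frame standing in for the real data; column names match the notebook,
# the values are invented for illustration.
toy = pd.DataFrame({
    "Date_Received": ["2024-08-16", "not a date", "2024-11-01"],
    "Expiration_Date": ["2024-09-01", "2024-12-15", None],
})

parsed = toy.copy()
for col in toy.columns:
    parsed[col] = pd.to_datetime(toy[col], errors="coerce")

# Total missing dates per column after parsing.
nat_counts = parsed.isna().sum()

# Cells that became NaT although the raw cell was not empty are parse failures.
failed = parsed.isna() & toy.notna()

print(nat_counts)
print(failed.sum())
```

If `failed.sum()` is non-zero for a column on the real data, the raw strings in that column probably need cleaning before the time-based features can be trusted.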

In [48]:
# Choose a reference date: the most recent receipt date acts as "today" for this snapshot

reference_date = df["Date_Received"].max()


# Time‑based features

df["Days_to_Expire"] = (df["Expiration_Date"] - reference_date).dt.days
df["Product_Age_Days"] = (reference_date - df["Date_Received"]).dt.days
df["Days_Since_Last_Order"] = (reference_date - df["Last_Order_Date"]).dt.days

# Clip extreme values to a fixed window (at most 30 days past expiry, at most 365 days otherwise)
df["Days_to_Expire"] = df["Days_to_Expire"].clip(lower=-30, upper=365)
df["Product_Age_Days"] = df["Product_Age_Days"].clip(lower=0, upper=365)
df["Days_Since_Last_Order"] = df["Days_Since_Last_Order"].clip(lower=0, upper=365)

df[["Days_to_Expire", "Product_Age_Days", "Days_Since_Last_Order"]].describe()

Unnamed: 0,Days_to_Expire,Product_Age_Days,Days_Since_Last_Order
count,989.0,989.0,989.0
mean,-28.776542,185.0273,182.287159
std,4.756799,104.767538,107.040045
min,-30.0,0.0,0.0
25%,-30.0,93.0,87.0
50%,-30.0,189.0,188.0
75%,-30.0,273.0,271.0
max,0.0,365.0,365.0


In [49]:
# Sanity check: confirm that clipping worked.
# If every count is 0, no values remain outside the bounds.
# If any count is > 0, the clipping step needs to be re-checked.

def clipping_report(series, lower, upper):

    below = (series < lower).sum()
    above = (series > upper).sum()
    total = len(series)
    print(f"{series.name}: {below} below {lower}, {above} above {upper}, out of {total} rows")

clipping_report(df["Days_to_Expire"], -30, 365)
clipping_report(df["Product_Age_Days"], 0, 365)
clipping_report(df["Days_Since_Last_Order"], 0, 365)


Days_to_Expire: 0 below -30, 0 above 365, out of 989 rows
Product_Age_Days: 0 below 0, 0 above 365, out of 989 rows
Days_Since_Last_Order: 0 below 0, 0 above 365, out of 989 rows
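
A check run after `clip()` will always report zeros, so it does not show how much data the clipping actually changed. A possible complement, sketched here on made-up values, is to count out-of-range values before clipping and the share of rows sitting exactly on a bound afterwards (`would_clip` is a hypothetical helper, not part of the notebook):

```python
import pandas as pd

def would_clip(series, lower, upper):
    """Count values outside [lower, upper] in a series that has not been clipped yet."""
    return {
        "below": int((series < lower).sum()),
        "above": int((series > upper).sum()),
        "total": len(series),
    }

# Invented raw values for illustration only.
raw = pd.Series([-120, -45, -30, 0, 10, 400], name="Days_to_Expire")

report = would_clip(raw, -30, 365)
print(report)

# After clipping, the share of rows equal to the lower bound shows how much
# of the column has been flattened to a single value.
clipped = raw.clip(lower=-30, upper=365)
share_at_lower = (clipped == -30).mean()
print(share_at_lower)
```

A large share at one bound (as the `describe()` output for Days_to_Expire suggests, where the 25%, 50% and 75% quantiles all equal -30) means the feature carries little variation after clipping.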


In [50]:
# Expiry risk flag


df["Is_Near_Expiry"] = (df["Days_to_Expire"] <= 7).astype(int)
df["Is_Expired"] = (df["Days_to_Expire"] < 0).astype(int)

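# astype(int) converts the Boolean condition into numeric 1/0:
# 1 - meets the condition
# 0 - does not meet the condition
# Note: Is_Near_Expiry also covers already expired products (Days_to_Expire < 0).
# In the describe() output above, the maximum Days_to_Expire is 0, so every row is flagged as near expiry.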
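
To see how the two thresholds behave at their edges, here is a small sketch on made-up Days_to_Expire values. The third column, `Is_Expiring_Soon`, is a hypothetical addition (not in the notebook) that separates products that are close to expiry from those that have already expired:

```python
import pandas as pd

# Made-up Days_to_Expire values covering each side of the thresholds.
toy = pd.DataFrame({"Days_to_Expire": [-5, 0, 7, 8]})

# Same rules as the notebook.
toy["Is_Near_Expiry"] = (toy["Days_to_Expire"] <= 7).astype(int)
toy["Is_Expired"] = (toy["Days_to_Expire"] < 0).astype(int)

# Hypothetical extra flag: expiring within 7 days but not yet expired.
# Series.between is inclusive on both ends by default.
toy["Is_Expiring_Soon"] = toy["Days_to_Expire"].between(0, 7).astype(int)

print(toy)
```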




In [51]:
# Stock vs sales ratios

df["Stock_to_Sales_Ratio"] = df["Stock_Quantity"] / (df["Sales_Volume"] + 1)
df["Stock_Value"] = df["Stock_Quantity"] * df["Unit_Price"]
df["Revenue_per_Unit"] = df["Sales_Revenue"] / (df["Sales_Volume"] + 1)

# These features show whether we are holding too much stock for the sales we get, and how valuable each item is.
# The +1 in the denominators avoids division by zero when Sales_Volume is 0.
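
The `+ 1` in the denominator slightly shrinks every ratio, which matters most for low-volume products. The sketch below compares the notebook's formula with one possible alternative that leaves zero-sales rows as `NaN`; the alternative and the column name `Revenue_per_Unit_exact` are assumptions for illustration, and the numbers are made up:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Stock_Quantity": [22, 40],
    "Sales_Volume": [32, 0],
    "Sales_Revenue": [144.0, 0.0],
})

# Notebook version: +1 in the denominator avoids division by zero.
toy["Revenue_per_Unit"] = toy["Sales_Revenue"] / (toy["Sales_Volume"] + 1)

# Alternative (an assumption, not the notebook's method):
# replace zero volume with NaN so the ratio is exact where it is defined.
safe_volume = toy["Sales_Volume"].replace(0, np.nan)
toy["Revenue_per_Unit_exact"] = toy["Sales_Revenue"] / safe_volume

print(toy)
```

Which version is preferable depends on the downstream model: the `+ 1` form never produces missing values, while the exact form keeps the ratio equal to the true average revenue per unit sold.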

In [54]:
#  Category‑level context

cat_stats = df.groupby("Category")["Stock_Quantity"].agg(["mean", "std"]).rename(
    columns={"mean": "Cat_Stock_Mean", "std": "Cat_Stock_Std"}
)

df = df.join(cat_stats, on="Category")


df["Stock_Zscore_in_Category"] = (df["Stock_Quantity"] - df["Cat_Stock_Mean"]) / (
    df["Cat_Stock_Std"] + 1e-3
)

# Puts each product in the context of its category (unusually high or low stock).
# The 1e-3 term guards against division by zero when a category's std is 0.
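
The same z-score can also be computed without a join by using `groupby(...).transform`, which returns values already aligned to the original rows. This is a sketch of that alternative on a made-up frame, not the notebook's own code:

```python
import pandas as pd

# Invented categories and stock levels for illustration.
toy = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B"],
    "Stock_Quantity": [10, 20, 30, 5, 15],
})

grouped = toy.groupby("Category")["Stock_Quantity"]
cat_mean = grouped.transform("mean")
cat_std = grouped.transform("std")  # sample std (ddof=1), the same default as agg("std")

# Same formula as the notebook, including the small constant in the denominator.
toy["Stock_Zscore_in_Category"] = (toy["Stock_Quantity"] - cat_mean) / (cat_std + 1e-3)

print(toy)
```

Note that a category with a single product has an undefined sample standard deviation (`NaN`), so its z-score would be `NaN` with either approach.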


In [55]:
df[["Category","Stock_Quantity","Cat_Stock_Mean","Cat_Stock_Std","Stock_Zscore_in_Category"]].head()

Unnamed: 0,Category,Stock_Quantity,Cat_Stock_Mean,Cat_Stock_Std,Stock_Zscore_in_Category
0,Grains & Pulses,22,50.864198,26.034498,-1.108648
1,Beverages,45,50.96,23.804065,-0.250367
2,Grains & Pulses,30,50.864198,26.034498,-0.801375
3,Grains & Pulses,12,50.864198,26.034498,-1.492739
4,Fruits & Vegetables,37,55.858006,26.38654,-0.714656


In [56]:
df

Unnamed: 0,Product_ID,Product_Name,Category,Supplier_ID,Supplier_Name,Stock_Quantity,Reorder_Level,Reorder_Quantity,Unit_Price,Date_Received,...,Product_Age_Days,Days_Since_Last_Order,Is_Near_Expiry,Is_Expired,Stock_to_Sales_Ratio,Stock_Value,Revenue_per_Unit,Cat_Stock_Mean,Cat_Stock_Std,Stock_Zscore_in_Category
0,29-205-1132,Sushi Rice,Grains & Pulses,38-037-1699,Jaxnation,22,72,70,4.5,2024-08-16,...,192,240,1,1,0.666667,99.0,4.363636,50.864198,26.034498,-1.108648
1,40-681-9981,Arabica Coffee,Beverages,54-470-2479,Feedmix,45,77,2,20.0,2024-11-01,...,115,271,1,1,0.523256,900.0,19.767442,50.960000,23.804065,-0.250367
2,06-955-3428,Black Rice,Grains & Pulses,54-031-2945,Vinder,30,38,83,6.0,2024-08-03,...,205,259,1,1,0.937500,180.0,5.812500,50.864198,26.034498,-0.801375
3,71-594-6552,Long Grain Rice,Grains & Pulses,63-492-7603,Brightbean,12,59,62,1.5,2024-12-08,...,78,5,1,1,0.125000,18.0,1.484375,50.864198,26.034498,-1.492739
4,57-437-1828,Plum,Fruits & Vegetables,54-226-4308,Topicstorm,37,30,74,4.0,2024-07-03,...,236,136,1,1,0.587302,148.0,3.936508,55.858006,26.386540,-0.714656
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
984,82-977-7752,Spinach,Fruits & Vegetables,57-473-8672,Shuffledrive,88,78,17,2.5,2024-09-06,...,171,58,1,1,1.491525,220.0,2.457627,55.858006,26.386540,1.218075
985,62-393-9939,Cheddar Cheese,Dairy,93-877-9384,Gabcube,60,9,89,9.0,2024-06-01,...,268,267,1,1,0.625000,540.0,8.906250,58.522222,25.864377,0.057133
986,31-745-6850,Cabbage,Fruits & Vegetables,96-215-2767,Lajo,94,90,12,0.9,2024-10-03,...,144,123,1,1,0.949495,84.6,0.890909,55.858006,26.386540,1.445455
987,86-692-2312,Avocado Oil,Oils & Fats,77-783-4107,Dazzlesphere,30,48,52,10.0,2024-06-11,...,258,79,1,1,1.304348,300.0,9.565217,53.545455,26.802664,-0.878442


In [72]:
# High‑risk label (for classification)

turnover_median = df["Inventory_Turnover_Rate"].median()
df["High_Risk"] = (
    (df["Perishable"] == 1) &
    (df["Is_Near_Expiry"] == 1) &
    (df["Inventory_Turnover_Rate"] < turnover_median) &
    (df["Stock_to_Sales_Ratio"] > 1)
).astype(int)

# Creates the target for a high‑risk classifier using business logic: perishable + near expiry + slow moving + relatively overstocked.
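
Since the label is the conjunction of four conditions, it can help to look at each condition separately and see which one is the most restrictive. A minimal sketch on made-up rows follows; the `conditions` frame and its column names are hypothetical helpers, while the rules themselves mirror the cell above:

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Perishable": [1, 1, 0, 1],
    "Is_Near_Expiry": [1, 1, 1, 0],
    "Inventory_Turnover_Rate": [10, 60, 20, 40],
    "Stock_to_Sales_Ratio": [1.5, 1.5, 1.5, 1.5],
})

turnover_median = toy["Inventory_Turnover_Rate"].median()

# One Boolean column per business rule.
conditions = pd.DataFrame({
    "perishable": toy["Perishable"] == 1,
    "near_expiry": toy["Is_Near_Expiry"] == 1,
    "slow_moving": toy["Inventory_Turnover_Rate"] < turnover_median,
    "overstocked": toy["Stock_to_Sales_Ratio"] > 1,
})

# A row is high risk only if all four rules hold.
toy["High_Risk"] = conditions.all(axis=1).astype(int)

# Share of rows passing each rule on its own.
print(conditions.mean())
print(toy["High_Risk"].tolist())
```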

In [71]:
print(df["High_Risk"].value_counts())

High_Risk
0    861
1    128
Name: count, dtype: int64


In [70]:
# Saving engineered dataset

df.to_csv("grocery_inventory_featured.csv", index=False)
print("Finished feature engineering. Shape:", df.shape)

Finished feature engineering. Shape: (989, 30)


In [65]:
df

Unnamed: 0,Product_ID,Product_Name,Category,Supplier_ID,Supplier_Name,Stock_Quantity,Reorder_Level,Reorder_Quantity,Unit_Price,Date_Received,...,Days_Since_Last_Order,Is_Near_Expiry,Is_Expired,Stock_to_Sales_Ratio,Stock_Value,Revenue_per_Unit,Cat_Stock_Mean,Cat_Stock_Std,Stock_Zscore_in_Category,High_Risk
0,29-205-1132,Sushi Rice,Grains & Pulses,38-037-1699,Jaxnation,22,72,70,4.5,2024-08-16,...,240,1,1,0.666667,99.0,4.363636,50.864198,26.034498,-1.108648,0
1,40-681-9981,Arabica Coffee,Beverages,54-470-2479,Feedmix,45,77,2,20.0,2024-11-01,...,271,1,1,0.523256,900.0,19.767442,50.960000,23.804065,-0.250367,0
2,06-955-3428,Black Rice,Grains & Pulses,54-031-2945,Vinder,30,38,83,6.0,2024-08-03,...,259,1,1,0.937500,180.0,5.812500,50.864198,26.034498,-0.801375,0
3,71-594-6552,Long Grain Rice,Grains & Pulses,63-492-7603,Brightbean,12,59,62,1.5,2024-12-08,...,5,1,1,0.125000,18.0,1.484375,50.864198,26.034498,-1.492739,0
4,57-437-1828,Plum,Fruits & Vegetables,54-226-4308,Topicstorm,37,30,74,4.0,2024-07-03,...,136,1,1,0.587302,148.0,3.936508,55.858006,26.386540,-0.714656,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
984,82-977-7752,Spinach,Fruits & Vegetables,57-473-8672,Shuffledrive,88,78,17,2.5,2024-09-06,...,58,1,1,1.491525,220.0,2.457627,55.858006,26.386540,1.218075,0
985,62-393-9939,Cheddar Cheese,Dairy,93-877-9384,Gabcube,60,9,89,9.0,2024-06-01,...,267,1,1,0.625000,540.0,8.906250,58.522222,25.864377,0.057133,0
986,31-745-6850,Cabbage,Fruits & Vegetables,96-215-2767,Lajo,94,90,12,0.9,2024-10-03,...,123,1,1,0.949495,84.6,0.890909,55.858006,26.386540,1.445455,0
987,86-692-2312,Avocado Oil,Oils & Fats,77-783-4107,Dazzlesphere,30,48,52,10.0,2024-06-11,...,79,1,1,1.304348,300.0,9.565217,53.545455,26.802664,-0.878442,0


## A short overview of how the feature engineering step will help our modeling

#### Demand regression model

Uses numeric features like Stock_Quantity, Perishable, Days_to_Expire, Product_Age_Days, and ratios to predict future Sales_Volume (demand) more accurately.

#### High‑risk classification model

Uses Is_Near_Expiry, Stock_to_Sales_Ratio, Inventory_Turnover_Rate, Stock_Zscore_in_Category, etc., to learn which products are likely to be wasted or overstocked.

#### Inventory optimization & dashboards

Ratios and z‑scores give you interpretable KPIs (e.g., “products with Stock_to_Sales_Ratio > 3 and near expiry”), well suited to reorder rules, alerts, and visual highlights in your Tableau/plot dashboards.
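
As an illustration of the KPI rule quoted above, the following sketch filters a made-up frame for products with Stock_to_Sales_Ratio > 3 that are also near expiry, and sorts them by the value of stock at stake. The product names and numbers are invented, and the threshold of 3 is simply the example value from the text:

```python
import pandas as pd

# Invented products for illustration only.
toy = pd.DataFrame({
    "Product_Name": ["Item A", "Item B", "Item C", "Item D"],
    "Stock_to_Sales_Ratio": [3.4, 0.9, 4.1, 5.0],
    "Is_Near_Expiry": [1, 1, 0, 1],
    "Stock_Value": [220.0, 84.6, 300.0, 500.0],
})

# Alert rule: heavily overstocked and near expiry, most valuable stock first.
alerts = (
    toy.query("Stock_to_Sales_Ratio > 3 and Is_Near_Expiry == 1")
       .sort_values("Stock_Value", ascending=False)
)

print(alerts[["Product_Name", "Stock_to_Sales_Ratio", "Stock_Value"]])
```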




## Step 2: Feature Engineering done.

#### Next Step: Modeling