# Feature Engineering 

## FEATURES FOR REGRESSION
* Extracting features for predicting next-month product sales, forecasting revenue, and predicting future demand

In [27]:
import numpy as np
import pandas as pd

In [28]:
online_retail_df=pd.read_csv("../data/cleaned/online_retail_cleaned.csv")

`InvoiceDate` is NO LONGER datetime after reloading:
👉 CSV cannot store the datetime dtype
👉 pandas reloads it as object (string)
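
A minimal sketch of this round trip, using an in-memory buffer instead of the project's CSV file (the toy frame and its values are assumptions for illustration). It shows that the column is not datetime after a plain reload, and that `parse_dates` (or `pd.to_datetime` afterwards) restores it.

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned dataset (values are made up).
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A2"],
    "InvoiceDate": pd.to_datetime(["2010-12-07 14:57:00", "2011-01-18 10:01:00"]),
})

# Write to CSV text and read it back without any date parsing.
csv_text = toy.to_csv(index=False)
reloaded = pd.read_csv(io.StringIO(csv_text))
is_datetime_plain = pd.api.types.is_datetime64_any_dtype(reloaded["InvoiceDate"])

# Ask pandas to parse the column while reading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["InvoiceDate"])
is_datetime_parsed = pd.api.types.is_datetime64_any_dtype(parsed["InvoiceDate"])

print(is_datetime_plain, is_datetime_parsed)
```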

In [29]:
online_retail_df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country', 'Revenue', 'Year', 'Month_num',
       'day_of_week', 'week', 'CohortMonth', 'date', 'countryorregion',
       'holidayname', 'month', 'is_weekend', 'month_end', 'month_start',
       'isholiday', 'HolidayQuantity', 'YearMonth'],
      dtype='object')

In [30]:
clean_online_retail=online_retail_df.copy()

In [31]:
clean_online_retail['InvoiceDate']=pd.to_datetime(clean_online_retail['InvoiceDate'],errors='coerce')

In [32]:

clean_online_retail['YearMonth']=clean_online_retail['InvoiceDate'].dt.to_period('M')

#Monthly aggregation of quantity per product 
#Monthly average of unit price per product
#Monthly holiday count
#monthly revenue per product
Monthly = (
    clean_online_retail.groupby(['StockCode', 'YearMonth']).agg(
        MonthlyQuantity=('Quantity', 'sum'),
        AvgUnitPrice=('UnitPrice', 'mean'),
        HolidayCount=('isholiday', 'sum'),
        MonthlyRevenue=('Quantity',lambda x:(x * clean_online_retail.loc[x.index,'UnitPrice']).sum())
    ).reset_index()
)


Goal: forecast future product demand and revenue from historical sales patterns, using lag-based time-series features.

Lag features (Lag-1, Lag-2, Lag-3) are engineered from past monthly sales to capture temporal dependencies such as trend persistence, seasonality, and momentum effects.

In [33]:
monthly1=Monthly.copy()
monthly1=monthly1.sort_values(['StockCode','YearMonth'])
for lag in [1,2,3]:
    monthly1[f'lag{lag}_quantity']=monthly1.groupby('StockCode')['MonthlyQuantity'].shift(lag)
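
A small sketch of what the grouped `shift` produces, on toy data with two hypothetical products (the values are assumptions for illustration). The first month of each product has no lag value, and product B never receives a value from product A.

```python
import pandas as pd

# Toy monthly table: product A has three months, product B has two.
toy = pd.DataFrame({
    "StockCode": ["A", "A", "A", "B", "B"],
    "YearMonth": pd.PeriodIndex(
        ["2011-01", "2011-02", "2011-03", "2011-01", "2011-02"], freq="M"
    ),
    "MonthlyQuantity": [10, 20, 30, 5, 7],
})

# Sort first so that "previous row" means "previous month" within a product.
toy = toy.sort_values(["StockCode", "YearMonth"])
toy["lag1_quantity"] = toy.groupby("StockCode")["MonthlyQuantity"].shift(1)

print(toy)
```

Note that `shift` moves values by rows, not by calendar months, so if a product has no row for some month, its lag-1 value comes from the most recent month that does have a row.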

The rolling mean quantity is a product's average monthly sales over the previous k months (here k = 3, excluding the current month because of `shift(1)`), calculated for each month to capture the recent trend.

In [34]:
monthly1['rolling_mean_3']=monthly1.groupby('StockCode')['MonthlyQuantity'].shift(1).rolling(window=3).mean()
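
A sketch of the same feature with the window made explicitly per-product, on toy data (the values are assumptions for illustration). In the one-liner above, `.rolling()` runs over the whole shifted column rather than inside each group. With the default `min_periods` and rows sorted by product, the NaN that `shift(1)` places at the start of each product keeps windows from mixing products, but a lower `min_periods` would remove that protection, as the second half of the sketch shows.

```python
import pandas as pd

toy = pd.DataFrame({
    "StockCode": ["A"] * 4 + ["B"] * 4,
    "MonthlyQuantity": [10, 20, 30, 40, 1, 2, 3, 4],
})

# Explicit per-product version: shift and roll inside each group.
toy["rolling_mean_3"] = toy.groupby("StockCode")["MonthlyQuantity"].transform(
    lambda s: s.shift(1).rolling(window=3).mean()
)

# Ungrouped rolling with min_periods=1: the window at B's second month
# also covers A's last shifted value, so the two products get mixed.
shifted = toy.groupby("StockCode")["MonthlyQuantity"].shift(1)
leaky = shifted.rolling(window=3, min_periods=1).mean()

print(toy)
print(leaky)
```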

In [11]:
# Despite its name, this column holds the month-over-month % change in average unit price
monthly1['avg_unitprice']=(monthly1.groupby('StockCode')['AvgUnitPrice'].pct_change())

Growth Rate (Momentum)

 Month-over-month demand change

In [35]:
monthly1['GrowthRate']=monthly1.groupby('StockCode')['MonthlyQuantity'].pct_change()
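
A short sketch of one edge case worth checking, on toy data (the values and the handling shown are assumptions, not the notebook's choice). If a product's monthly quantity is ever zero, the growth rate for the following month is infinite, and `dropna()` does not remove infinite values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "StockCode": ["A", "A", "A", "A"],
    "MonthlyQuantity": [10, 0, 5, 10],
})

# 10 -> 0 gives -1.0, 0 -> 5 gives +inf, 5 -> 10 gives +1.0
toy["GrowthRate"] = toy.groupby("StockCode")["MonthlyQuantity"].pct_change()
n_inf = int(np.isinf(toy["GrowthRate"]).sum())

# One possible handling: treat infinite growth as missing.
toy["GrowthRate"] = toy["GrowthRate"].replace([np.inf, -np.inf], np.nan)

print(n_inf)
print(toy)
```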

Month (Seasonality)

In [36]:
monthly1['Month']=monthly1['YearMonth'].dt.month

Lagged revenue for the past 3 months

Revenue from a previous time period, shifted forward and used as a feature to predict the current or a future period.

In [14]:
monthly1['LagRevenue1']=monthly1.groupby('StockCode')['MonthlyRevenue'].shift(1)
monthly1['LagRevenue2']=monthly1.groupby('StockCode')['MonthlyRevenue'].shift(2)
monthly1['LagRevenue3']=monthly1.groupby('StockCode')['MonthlyRevenue'].shift(3)


In [37]:
monthly1=monthly1.dropna()
monthly1.columns

Index(['StockCode', 'YearMonth', 'MonthlyQuantity', 'AvgUnitPrice',
       'HolidayCount', 'MonthlyRevenue', 'lag1_quantity', 'lag2_quantity',
       'lag3_quantity', 'rolling_mean_3', 'GrowthRate', 'Month'],
      dtype='object')

In [38]:
monthly1.to_csv("../data/processed/regression_features.csv",index=False)
monthly1.to_csv("regression_features_backup.csv",index=False)


# Product Metrics

## 1. KPIs AND FEATURES FOR PRODUCT-LEVEL CLASSIFICATION DATASET



### 1. TotalQuantitySold
**Definition:**  
Total number of units sold for a product over a given time period (monthly, quarterly, or yearly).  
**Purpose:**  
Measures overall product demand.

### 2. TotalRevenue
**Definition:**  
Total revenue generated by a product over a given period.  
**Formula:**  
TotalRevenue = Σ (Quantity × UnitPrice)


### 3. AvgMonthlyQuantity
**Definition:**  
Average number of units sold per month for a product.  
**Purpose:**  
Represents typical monthly demand.

### 4. AvgMonthlyRevenue
**Definition:**  
Average revenue generated per month by a product.

### 5. ActiveMonthCount
**Definition:**  
Number of months in which a product recorded at least one sale (Quantity > 0).  
**Purpose:**  
Indicates how consistently the product sells over time.

### 6. StdMonthlyQuantity
**Definition:**  
Standard deviation of monthly quantities sold for a product.  
**Purpose:**  
Measures variability and stability of product demand.

### 7. AvgUnitPrice
**Definition:**  
Average selling price of a product (based on StockCode) over time.  
**Purpose:**  
Represents the typical pricing level of the product.

### 8. PriceVariance
**Definition:**  
Variance in unit price of a product (StockCode) over time.  
**Purpose:**  
Captures price fluctuations that may influence demand and revenue.

### 9. PeakMonthSales
**Definition:**  
Maximum number of units sold in a single month for a product.  
**Logic:**  
Aggregate monthly quantity per product and take the maximum value.

### 10. HolidaySalesRatio
**Definition:**  
Proportion of a product’s total sales that occurred during holidays or special occasions.  


### 11. UniqueCustomerCount
**Definition:**
Total number of unique customers who purchased a product over a given period.
**Purpose:**
Measures the product’s reach and popularity among different customers.

In [39]:
clean_online_retail[['Month_num', 'YearMonth']].head()

|   | Month_num | YearMonth |
|---|-----------|-----------|
| 0 | 1         | 2011-01   |
| 1 | 12        | 2010-12   |
| 2 | 12        | 2010-12   |
| 3 | 12        | 2010-12   |
| 4 | 12        | 2010-12   |


In [40]:
clean_online_retail[['InvoiceDate', 'YearMonth']].head()


|   | InvoiceDate         | YearMonth |
|---|---------------------|-----------|
| 0 | 2011-01-18 10:01:00 | 2011-01   |
| 1 | 2010-12-07 14:57:00 | 2010-12   |
| 2 | 2010-12-07 14:57:00 | 2010-12   |
| 3 | 2010-12-07 14:57:00 | 2010-12   |
| 4 | 2010-12-07 14:57:00 | 2010-12   |


In [43]:
monthly_product=clean_online_retail.groupby(['StockCode','YearMonth']).agg(
     MonthlyQuantity=('Quantity','sum'),
     MonthlyRevenue=('Revenue','sum'),
     TotalHolidayQuantity=('HolidayQuantity','sum'),
     
     
).reset_index()

product_level=monthly_product.groupby('StockCode').agg(
    TotalQuantitySold=('MonthlyQuantity','sum'),
    TotalRevenue=('MonthlyRevenue','sum'),
    AvgMonthlyQuantity=('MonthlyQuantity','mean'),
    AvgMonthlyRevenue=('MonthlyRevenue','mean'),
    ActiveMonthCount=('YearMonth','nunique'),
    StdMonthlyQuantity=('MonthlyQuantity','std'),
    PeakMonthSale=('MonthlyQuantity','max'),
    TotalHolidayQuantity=('TotalHolidayQuantity','sum')
    
    ).reset_index()

product_level['HolidayRatio']=(product_level['TotalHolidayQuantity']/product_level['TotalQuantitySold'])

PricingFeatures=clean_online_retail.groupby('StockCode').agg(
    AvgUnitPrice=('UnitPrice','mean'),
    PriceVariance=('UnitPrice','var'),
   
).reset_index()

CustomerEngagement=clean_online_retail.groupby('StockCode').agg(
     UniqueCustomersCount=('CustomerID','nunique')
).reset_index()
    
product_kpis=(product_level.merge(PricingFeatures,on='StockCode',how='left')
                                 .merge(CustomerEngagement,on='StockCode',how='left'))

product_kpis['HolidayRatio']=product_kpis['HolidayRatio'].fillna(0)



In [44]:
product_kpis.columns

Index(['StockCode', 'TotalQuantitySold', 'TotalRevenue', 'AvgMonthlyQuantity',
       'AvgMonthlyRevenue', 'ActiveMonthCount', 'StdMonthlyQuantity',
       'PeakMonthSale', 'TotalHolidayQuantity', 'HolidayRatio', 'AvgUnitPrice',
       'PriceVariance', 'UniqueCustomersCount'],
      dtype='object')

In [45]:
product_kpis.to_csv("../data/processed/product_classification_features.csv",index=False)

## 2. KPIs AND FEATURES FOR CUSTOMER CHURN CLASSIFICATION DATASET

### 1. TotalOrders
**Definition:**  
Each unique `InvoiceNo` counts as one order.  
A customer may appear in multiple rows per invoice (if they bought multiple products).  
**Conceptual Definition:**  
Count of unique `InvoiceNo` per `CustomerID`.

---

### 2. Recency (Days Since Last Purchase)
**Business Meaning:**  
Measures how recently a customer last purchased.  
- Low recency → customer is active  
- High recency → customer may be at risk of churn

**Data Meaning:**  
- Based on `InvoiceDate`  
- Uses the most recent purchase date per customer  
- Calculated as the number of days between a reference date (usually the dataset’s last date) and the customer’s last `InvoiceDate`.
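
The recency calculation described above can be sketched on toy data as follows (the customers and dates are made up for illustration, and the reference date is taken to be the latest date in the toy frame).

```python
import pandas as pd

toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceDate": pd.to_datetime(["2011-11-01", "2011-12-01", "2011-10-02"]),
})

# Reference date: the last date present in the data (2011-12-01 here).
reference_date = toy["InvoiceDate"].max()

# Most recent purchase per customer, then the gap to the reference date in days.
last_purchase = toy.groupby("CustomerID")["InvoiceDate"].max()
recency = (reference_date - last_purchase).dt.days

print(recency)
```

Customer 1 bought on the reference date itself, so their recency is 0. Customer 2 last bought on 2011-10-02, which is 60 days earlier.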

---

### 3. AvgOrderValue (AOV)
**Business Meaning:**  
Represents how much a customer spends per order on average.  
Helps identify high-value or premium buyers.

**Data Meaning:**  
- Order value = sum of (`Quantity × UnitPrice`) per invoice  

**Conceptual Definition:** 

 AvgOrderValue = Total Revenue of customer ÷ TotalOrders
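
A toy check of this definition (the invoice numbers, quantities, and prices are assumptions for illustration). Summing line items per invoice and then averaging per customer gives the same number as total revenue divided by the count of unique invoices.

```python
import pandas as pd

# One customer, two invoices: invoice 100 has two line items, invoice 101 has one.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1],
    "InvoiceNo": ["100", "100", "101"],
    "Quantity": [2, 1, 4],
    "UnitPrice": [5.0, 10.0, 2.5],
})
toy["OrderValue"] = toy["Quantity"] * toy["UnitPrice"]

# Route 1: order value per invoice, then the mean per customer.
per_invoice = toy.groupby(["CustomerID", "InvoiceNo"])["OrderValue"].sum()
avg_order_value = per_invoice.groupby(level="CustomerID").mean()

# Route 2: total revenue divided by the number of unique invoices.
total_revenue = toy.groupby("CustomerID")["OrderValue"].sum()
total_orders = toy.groupby("CustomerID")["InvoiceNo"].nunique()
ratio = total_revenue / total_orders

print(avg_order_value, ratio)
```

The two routes only agree when order values are grouped by `InvoiceNo`, the same key that defines `TotalOrders`.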



---

### 4. ActiveMonths
**Definition:**  
Number of distinct months in which a customer made purchases.  
**Purpose:**  
Captures the customer’s activity duration over time.

---

### 5. PurchaseFrequency
**Definition:**  
Measures how frequently a customer orders relative to their activity months.

**Formula:**  

PurchaseFrequency = TotalOrders ÷ ActiveMonths
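
A small sketch of this ratio on toy data (the customers, invoices, and dates are made up for illustration). The first customer has three orders spread over two distinct months, and the second has a single order.

```python
import pandas as pd

toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["100", "101", "102", "200"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-20", "2011-03-02", "2011-02-10"]
    ),
})
toy["YearMonth"] = toy["InvoiceDate"].dt.to_period("M")

# Unique invoices and unique active months per customer.
per_customer = toy.groupby("CustomerID").agg(
    TotalOrders=("InvoiceNo", "nunique"),
    ActiveMonths=("YearMonth", "nunique"),
)
per_customer["PurchaseFrequency"] = (
    per_customer["TotalOrders"] / per_customer["ActiveMonths"]
)

print(per_customer)
```

Customer 1 ends up with 3 ÷ 2 = 1.5 orders per active month and customer 2 with 1 ÷ 1 = 1.0. Because every customer in the data has at least one active month, the denominator is never zero.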


In [48]:
clean_online_retail=clean_online_retail.copy()
TotalOrdersDS=clean_online_retail.groupby('CustomerID')['InvoiceNo'].nunique().reset_index(name='TotalOrders')
reference_date=clean_online_retail['InvoiceDate'].max()
recencyDS=clean_online_retail.groupby('CustomerID')['InvoiceDate'].max().reset_index()
recencyDS['Recency']=(reference_date - recencyDS['InvoiceDate']).dt.days
clean_online_retail['OrderValue']=clean_online_retail['Quantity']*clean_online_retail['UnitPrice']
order_revenue=clean_online_retail.groupby(['InvoiceNo','CustomerID'])['OrderValue'].sum().reset_index()
avg_order_value = order_revenue.groupby('CustomerID')['OrderValue'].mean().reset_index(name='AvgOrderValue')



clean_online_retail['YearMonth']=clean_online_retail['InvoiceDate'].dt.to_period('M')
grouped=clean_online_retail.groupby('CustomerID')
activemonthDS=grouped['YearMonth'].nunique().reset_index(name='ActiveMonth')

customer_kpis=TotalOrdersDS.merge(activemonthDS,on='CustomerID').merge(avg_order_value,on='CustomerID').merge(recencyDS,on='CustomerID')
customer_kpis['PurchaseFrequency']=(customer_kpis['TotalOrders']/customer_kpis['ActiveMonth'])

In [49]:
customer_kpis.isnull().sum()


CustomerID           0
TotalOrders          0
ActiveMonth          0
AvgOrderValue        0
InvoiceDate          0
Recency              0
PurchaseFrequency    0
dtype: int64

In [None]:
customer_kpis.columns

Index(['CustomerID', 'TotalOrders', 'ActiveMonth', 'AvgOrderValue',
       'InvoiceDate', 'Recency', 'PurchaseFrequency'],
      dtype='object')

In [50]:
for i in ['AvgOrderValue','PurchaseFrequency','Recency']:
    upper=customer_kpis[i].quantile(0.99)
    lower=customer_kpis[i].quantile(0.01)
    customer_kpis[i]=customer_kpis[i].clip(upper=upper,lower=lower)

In [51]:
customer_kpis.to_csv('../data/processed/customer_churn_features.csv',index=False)