# 03 - Feature Engineering

## Overview
1. **Load data:** import the dataset from the previous stage.
2. **Additional features:** add new feature columns for future analysis.
3. **Aggregated features:** generate aggregated features and key KPIs.
4. **Save changes:** export the featured dataset to
   `data/processed/03_gold_features.parquet` and the aggregated KPIs to
   `data/processed/03_gold_aggregated_kpis.csv`.
    
**Goal:** Enhance the cleaned dataset from the Silver stage with new analytical and business-relevant variables to support KPI calculation and downstream analysis.

### Load data
> Import the Stage 02 Parquet file.

In [4]:
# Import Pandas
import pandas as pd

# Display all columns
pd.set_option('display.max_columns', None)

# Load Silver stage data
file_path = '../data/interim/02_silver_cleaned.parquet'
df = pd.read_parquet(file_path)

# Check results
print('Data loaded successfully.')
df.head()

Data loaded successfully.


Unnamed: 0,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,quantity,discount,profit
0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,Henderson,Kentucky,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,Henderson,Kentucky,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,CA-2016-138688,2016-06-12,2016-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,Los Angeles,California,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,Fort Lauderdale,Florida,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,Fort Lauderdale,Florida,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164


### Additional features
> Add new columns to enrich the data for analysis and machine learning. 

Extract the year, quarter, month, day, and day of week (`order_dow`, where Monday = 0) from `order_date` to capture potential time-based patterns

In [7]:
# Process order_date
df['order_year'] = df['order_date'].dt.year
df['order_quarter'] = df['order_date'].dt.quarter
df['order_month'] = df['order_date'].dt.month
df['order_day'] = df['order_date'].dt.day
df['order_dow'] = df['order_date'].dt.dayofweek

# Check results
df.head(1)

Unnamed: 0,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,quantity,discount,profit,order_year,order_quarter,order_month,order_day,order_dow
0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,Henderson,Kentucky,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136,2016,4,11,8,1
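
The `.dt` accessor used above only works on datetime-typed columns. The following is a minimal sketch on a toy frame (the `toy` frame and the `order_dow_name` column are illustrative, not part of the dataset) showing a defensive conversion and how the numeric day of week maps to day names:

```python
import pandas as pd

# Toy frame standing in for the Silver-stage data (illustrative values only).
toy = pd.DataFrame({'order_date': ['2016-11-08', '2016-06-12']})

# If the column arrived as strings, convert it before using the .dt accessor.
if not pd.api.types.is_datetime64_any_dtype(toy['order_date']):
    toy['order_date'] = pd.to_datetime(toy['order_date'])

# dayofweek is numeric: Monday = 0 ... Sunday = 6.
toy['order_dow'] = toy['order_date'].dt.dayofweek

# Hypothetical helper column: a readable label for reports.
toy['order_dow_name'] = toy['order_date'].dt.day_name()

print(toy)
```

Keeping the numeric `order_dow` is usually convenient for modelling, while a name column like the one sketched here tends to be easier to read in dashboards.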


Calculate shipping delay in days to support logistics performance analysis

In [9]:
df['delay_in_days'] = (df.ship_date - df.order_date).dt.days

df.delay_in_days.head()

0    3
1    3
2    4
3    7
4    7
Name: delay_in_days, dtype: int64
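
Before relying on the delay for logistics analysis, it may be worth checking for negative values, which would indicate a ship date earlier than the order date. A minimal sketch on toy data (the `toy` frame and the `suspicious` name are assumptions for illustration):

```python
import pandas as pd

# Toy data; the third row has a ship date before its order date on purpose.
toy = pd.DataFrame({
    'order_date': pd.to_datetime(['2016-11-08', '2015-10-11', '2016-01-05']),
    'ship_date': pd.to_datetime(['2016-11-11', '2015-10-18', '2016-01-03']),
})

# Same calculation as in the cell above: timedelta converted to whole days.
toy['delay_in_days'] = (toy['ship_date'] - toy['order_date']).dt.days

# Rows with a negative delay are likely data-entry errors and deserve a look.
suspicious = toy[toy['delay_in_days'] < 0]

print(f'{len(suspicious)} row(s) with a negative shipping delay')
```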

Calculate profit margin to evaluate sales efficiency and profitability

In [11]:
df['profit_margin'] = (df.profit / df.sales) * 100

df.profit_margin.head()

0    16.00
1    30.00
2    47.00
3   -40.00
4    11.25
Name: profit_margin, dtype: float64
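
The margin is profit divided by sales, expressed as a percentage, so a row with zero sales would produce `inf` or `NaN`. Whether such rows exist in this dataset is not verified here; the sketch below, on toy data, shows one possible guard that turns zero sales into missing values before dividing (`safe_sales` is a hypothetical helper name):

```python
import numpy as np
import pandas as pd

# Toy data; the last row has zero sales to exercise the guard.
toy = pd.DataFrame({
    'sales': [261.96, 957.5775, 0.0],
    'profit': [41.9136, -383.031, 0.0],
})

# Replace zero sales with NaN so the division yields NaN instead of inf.
safe_sales = toy['sales'].replace(0, np.nan)

# Profit margin as a percentage of sales.
toy['profit_margin'] = toy['profit'] / safe_sales * 100

print(toy)
```

Because `mean()` skips `NaN` by default, rows guarded this way would not distort an average margin computed later.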

Create `has_discount` flag to identify discounted orders for segmentation and impact analysis

In [13]:
df['has_discount'] = (df.discount > 0).astype(int)

df.has_discount.head()

0    0
1    0
2    0
3    1
4    1
Name: has_discount, dtype: int32
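
As an illustration of how the flag could support impact analysis, the sketch below compares average profit for discounted and non-discounted rows on toy data (the values and the `impact` name are assumptions, not results from the dataset):

```python
import pandas as pd

# Toy data with illustrative discounts and profits.
toy = pd.DataFrame({
    'discount': [0.0, 0.0, 0.45, 0.2],
    'profit': [40.0, 20.0, -30.0, 10.0],
})

# Same flag as in the cell above: 1 if any discount was applied, else 0.
toy['has_discount'] = (toy['discount'] > 0).astype(int)

# Average profit per group, indexed by the flag value.
impact = toy.groupby('has_discount')['profit'].mean()

print(impact)
```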

Estimate cost as sales minus profit

In [15]:
df['cost'] = df.sales - df.profit

### Aggregated features
> Generate aggregated features and KPIs.

Compute key aggregated KPIs: total revenue, total profit, average profit margin, repeat customer rate, and average shipping delay to assess overall business performance

In [18]:
# Calculate aggregated KPIs
total_revenue = df.sales.sum() # Overall performance
total_profit = df.profit.sum() # Profit summary
avg_profit_margin = df.profit_margin.mean() # Benchmark for dashboard

# Flag customers that appear in more than one row (line item), then take their share
is_repeat_customer = df.customer_id.value_counts() > 1
repeat_customer_rate = is_repeat_customer.mean() # Loyalty metric

avg_shipping_delay = df.delay_in_days.mean() # SLA monitoring

# Create summary series
aggregated_kpis = pd.Series({
    'total_revenue': total_revenue,
    'total_profit': total_profit,
    'avg_profit_margin': avg_profit_margin,
    'repeat_customer_rate': repeat_customer_rate,
    'avg_shipping_delay': avg_shipping_delay})

aggregated_kpis

total_revenue           2.297201e+06
total_profit            2.863970e+05
avg_profit_margin       1.203139e+01
repeat_customer_rate    9.936948e-01
avg_shipping_delay      3.958175e+00
dtype: float64
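
Note that `value_counts()` counts rows, and each row in this dataset is a line item, so a customer with a single multi-item order is counted as a repeat customer. If the intent is "customers with more than one order", an order-based variant may be closer. The sketch below contrasts the two definitions on toy data (the `toy` frame, `row_based`, and `order_based` are illustrative names, and the result for the real dataset is not computed here):

```python
import pandas as pd

# Toy data: customer A has one order with two line items,
# customer B has two separate orders, customer C has one order.
toy = pd.DataFrame({
    'customer_id': ['A', 'A', 'B', 'B', 'C'],
    'order_id': ['o1', 'o1', 'o2', 'o3', 'o4'],
})

# Row-based definition: more than one row per customer.
row_based = (toy['customer_id'].value_counts() > 1).mean()

# Order-based definition: more than one distinct order per customer.
orders_per_customer = toy.groupby('customer_id')['order_id'].nunique()
order_based = (orders_per_customer > 1).mean()

print(f'row-based: {row_based:.2f}, order-based: {order_based:.2f}')
```

On this toy data the row-based rate is 2/3 while the order-based rate is 1/3, which shows how much the choice of definition can matter.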

In [19]:
df.shape

(9994, 28)

### Save changes 
> Since this stage is complete, export the featured dataset and the aggregated KPIs.

Save the dataset as Parquet to preserve dtypes.

In [22]:
df.to_parquet('../data/processed/03_gold_features.parquet', index = False)
print('Featured copy is saved.')

aggregated_kpis.to_csv('../data/processed/03_gold_aggregated_kpis.csv')
print('Aggregated KPIs are saved.')

Featured copy is saved.
Aggregated KPIs are saved.
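
Since CSV does not carry an index name or a column name for a Series, it can help to label both when exporting so the file reads back cleanly. A minimal sketch with toy KPI values, a temporary directory, and hypothetical `kpi` / `value` labels (none of these are the notebook's actual output paths or numbers):

```python
import os
import tempfile

import pandas as pd

# Toy KPI series standing in for aggregated_kpis (illustrative values only).
kpis = pd.Series({'total_revenue': 100.0, 'avg_shipping_delay': 3.5})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'kpis.csv')

    # Name the index and value columns explicitly in the CSV header.
    kpis.to_csv(path, header=['value'], index_label='kpi')

    # Read the file back into a Series indexed by KPI name.
    restored = pd.read_csv(path, index_col='kpi')['value']

print(restored)
```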


A summary of this stage is documented separately in `/reports/03_feature_engineering.md`.