## Feature Engineering - Food Delivery Times Dataset
### Introduction
This notebook engineers features for predicting food delivery time. It bins distance and courier experience into categorical features, adds two interaction terms, and then keeps the 10 features ranked highest by Random Forest importance scores.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**Feature Engineering Steps:**
1. Import Libraries and Load Data
2. Feature Creation
3. Feature Selection
4. Final Feature Set

### 1. Import Libraries and Load Data

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("../data/processed/processed_data.csv")
print(f"Processed dataset: {df.shape[0]} rows and {df.shape[1]} columns")

df.head()

Processed dataset: 1000 rows and 20 columns


Unnamed: 0,Order_ID,Distance_km,Preparation_Time_min,Courier_Experience_yrs,Delivery_Time_min,Weather_Clear,Weather_Foggy,Weather_Rainy,Weather_Snowy,Weather_Windy,Traffic_Level_High,Traffic_Level_Low,Traffic_Level_Medium,Time_of_Day_Afternoon,Time_of_Day_Evening,Time_of_Day_Morning,Time_of_Day_Night,Vehicle_Type_Bike,Vehicle_Type_Car,Vehicle_Type_Scooter
0,522,-0.374085,-0.691853,-1.251672,43,False,False,False,False,True,False,True,False,True,False,False,False,False,False,True
1,738,1.117008,0.419111,-0.903211,84,True,False,False,False,False,False,False,True,False,True,False,False,True,False,False
2,741,-0.094835,1.530076,-1.251672,59,False,True,False,False,False,False,True,False,False,False,False,True,False,False,True
3,661,-0.460144,-1.663947,-1.251672,37,False,False,True,False,False,False,False,True,True,False,False,False,False,False,True
4,412,1.575401,-0.136371,0.142172,68,True,False,False,False,False,False,True,False,False,False,True,False,True,False,False
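
The head output shows four one-hot groups (`Weather_`, `Traffic_Level_`, `Time_of_Day_`, `Vehicle_Type_`) with every level present, so each row should have exactly one `True` per group. The following is a minimal sanity-check sketch on the first three rows copied from the output above; the helper name `one_hot_group_sums` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# First three rows of the one-hot columns, copied from the df.head() output.
sample = pd.DataFrame({
    "Weather_Clear": [False, True, False],
    "Weather_Foggy": [False, False, True],
    "Weather_Rainy": [False, False, False],
    "Weather_Snowy": [False, False, False],
    "Weather_Windy": [True, False, False],
    "Traffic_Level_High": [False, False, False],
    "Traffic_Level_Low": [True, False, True],
    "Traffic_Level_Medium": [False, True, False],
    "Time_of_Day_Afternoon": [True, False, False],
    "Time_of_Day_Evening": [False, True, False],
    "Time_of_Day_Morning": [False, False, False],
    "Time_of_Day_Night": [False, False, True],
    "Vehicle_Type_Bike": [False, True, False],
    "Vehicle_Type_Car": [False, False, False],
    "Vehicle_Type_Scooter": [True, False, True],
})

prefixes = ["Weather_", "Traffic_Level_", "Time_of_Day_", "Vehicle_Type_"]


def one_hot_group_sums(frame, prefix):
    """Hypothetical helper: count the True flags per row within one group."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    return frame[cols].sum(axis=1)


for prefix in prefixes:
    sums = one_hot_group_sums(sample, prefix)
    # A row sum other than 1 would point to a missing or duplicated category.
    print(prefix, sums.tolist())
```

Running the same loop on the full `df` would flag any row where a category was missing or encoded twice.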


### 2. Feature Creation

In [12]:
def create_features(df):
    """Return a copy of df with two binned categories and two interaction terms."""
    df = df.copy()

    # Bin distance (the values appear to be standardized upstream).
    # pd.cut intervals are right-closed: (-3, -1], (-1, 0], (0, 1], (1, 3].
    # Values outside that range are left as NaN.
    df['Distance_Category'] = pd.cut(df['Distance_km'],
                                     bins=[-3, -1, 0, 1, 3],
                                     labels=['Very_Short', 'Short', 'Medium', 'Long'])

    # Bin courier experience with the same edges.
    df['Experience_Category'] = pd.cut(df['Courier_Experience_yrs'],
                                       bins=[-3, -1, 0, 1, 3],
                                       labels=['Newbie', 'Junior', 'Mid', 'Senior'])

    # Interaction terms: element-wise products of the numeric columns.
    df['Distance_x_Experience'] = df['Distance_km'] * df['Courier_Experience_yrs']
    df['Distance_x_Prep'] = df['Distance_km'] * df['Preparation_Time_min']

    return df

df_engineered = create_features(df)
print(f"After feature engineering: {df_engineered.shape}")
print("New features created:")
new_features = ['Distance_Category', 'Experience_Category', 'Distance_x_Experience', 'Distance_x_Prep']
for feat in new_features:
    print(f"- {feat}")

df_engineered[new_features].head()

After feature engineering: (1000, 24)
New features created:
- Distance_Category
- Experience_Category
- Distance_x_Experience
- Distance_x_Prep


Unnamed: 0,Distance_Category,Experience_Category,Distance_x_Experience,Distance_x_Prep
0,Short,Newbie,0.468232,0.258812
1,Long,Junior,-1.008894,0.468151
2,Short,Newbie,0.118702,-0.145104
3,Short,Newbie,0.575949,0.765655
4,Long,Mid,0.223978,-0.214839
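
The bin edges used in `create_features` cover only the range (-3, 3]. As a hedged illustration of what that implies, the sketch below runs `pd.cut` on a few hypothetical values (not taken from the dataset) and compares the notebook's fixed edges with open-ended outer bins. The open-ended variant is an alternative shown for comparison, not the notebook's method.

```python
import pandas as pd

# Hypothetical values; -3.0 and 3.4 fall outside the (-3, 3] range.
values = pd.Series([-3.0, -1.5, -0.4, 0.7, 2.1, 3.4])
labels = ["Very_Short", "Short", "Medium", "Long"]

# Same edges as the notebook: (-3, -1], (-1, 0], (0, 1], (1, 3].
fixed = pd.cut(values, bins=[-3, -1, 0, 1, 3], labels=labels)

# Open-ended outer bins: every finite value lands in some category.
open_ended = pd.cut(
    values,
    bins=[float("-inf"), -1, 0, 1, float("inf")],
    labels=labels,
)

comparison = pd.DataFrame(
    {"value": values, "fixed_bins": fixed, "open_ended_bins": open_ended}
)
print(comparison)
print("Rows left unbinned by the fixed edges:", int(fixed.isna().sum()))
```

With the fixed edges, -3.0 (the left edge is exclusive) and 3.4 are left as NaN, whereas the open-ended bins assign them to `Very_Short` and `Long`. Checking `df_engineered['Distance_Category'].isna().sum()` on the real data would show whether this matters here.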


### 3. Feature Selection

In [14]:
# Drop the identifier and the target to get the candidate predictors.
X = df_engineered.drop(['Order_ID', 'Delivery_Time_min'], axis=1)
y = df_engineered['Delivery_Time_min']

# One-hot encode the two binned categories; drop_first removes one redundant level from each.
X_encoded = pd.get_dummies(X, columns=['Distance_Category', 'Experience_Category'], drop_first=True)

# Fit a Random Forest and rank the columns by impurity-based importance.
rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X_encoded, y)

feature_importance = pd.DataFrame({
    'feature': X_encoded.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 10 important features:")
print(feature_importance.head(10))

# Keep the 10 highest-ranked columns.
selected_features = feature_importance.head(10)['feature'].tolist()
print(f"\nSelected {len(selected_features)} features")

Top 10 important features:
                   feature  importance
0              Distance_km    0.634014
19         Distance_x_Prep    0.147961
1     Preparation_Time_min    0.053068
18   Distance_x_Experience    0.043168
8       Traffic_Level_High    0.016341
9        Traffic_Level_Low    0.014813
2   Courier_Experience_yrs    0.012767
3            Weather_Clear    0.012368
4            Weather_Foggy    0.006729
12     Time_of_Day_Evening    0.006277

Selected 10 features
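
The scores above come from `feature_importances_`, which scikit-learn computes from impurity reduction on the training data and normalizes to sum to 1. A common cross-check, which is not part of the original notebook, is permutation importance on a held-out split. The sketch below uses synthetic stand-in data (an assumption, not the delivery dataset) so that it runs on its own; the column names are reused only for readability.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the target depends strongly on Distance_km,
# weakly on Preparation_Time_min, and not at all on Noise.
rng = np.random.default_rng(42)
n = 400
X_demo = pd.DataFrame({
    "Distance_km": rng.normal(size=n),
    "Preparation_Time_min": rng.normal(size=n),
    "Noise": rng.normal(size=n),
})
y_demo = (
    10 * X_demo["Distance_km"]
    + 3 * X_demo["Preparation_Time_min"]
    + rng.normal(scale=1.0, size=n)
)

# Hold out a test split so the permutation scores reflect unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=42)
rf_demo.fit(X_train, y_train)

# Shuffle each column in turn and measure the drop in the model's score.
perm = permutation_importance(
    rf_demo, X_test, y_test, n_repeats=10, random_state=42
)

comparison = pd.DataFrame({
    "feature": X_demo.columns,
    "impurity_importance": rf_demo.feature_importances_,
    "permutation_importance": perm.importances_mean,
}).sort_values("permutation_importance", ascending=False)
print(comparison)
```

Applied to the notebook's data, the same pattern would replace `X_demo` and `y_demo` with `X_encoded` and `y`. If both rankings agree on the top features, the top-10 selection is less likely to be an artifact of the impurity-based measure.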


### 4. Final Feature Set

In [16]:
# Restrict the encoded matrix to the selected columns.
X_final = X_encoded[selected_features]
print(f"Final dataset shape: {X_final.shape}")
print("\nFinal features:")
for i, feat in enumerate(selected_features, 1):
    print(f"{i}. {feat}")

# Save features and target separately; index=False keeps the row index out of the files.
X_final.to_csv("../data/processed/X_engineered.csv", index=False)
pd.DataFrame(y).to_csv("../data/processed/y_target.csv", index=False)

Final dataset shape: (1000, 10)

Final features:
1. Distance_km
2. Distance_x_Prep
3. Preparation_Time_min
4. Distance_x_Experience
5. Traffic_Level_High
6. Traffic_Level_Low
7. Courier_Experience_yrs
8. Weather_Clear
9. Weather_Foggy
10. Time_of_Day_Evening
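
A downstream notebook would typically reload the two saved files. The sketch below shows that round trip under stated assumptions: it writes to a temporary directory rather than `../data/processed/`, and it uses only two columns and the first two rows from the outputs above as stand-in data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the df.head() output; a stand-in for X_final and y.
X_demo = pd.DataFrame({
    "Distance_km": [-0.374085, 1.117008],
    "Traffic_Level_High": [False, False],
})
y_demo = pd.Series([43, 84], name="Delivery_Time_min")

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X_engineered.csv"
    y_path = Path(tmp) / "y_target.csv"

    # Same call pattern as the notebook: index=False omits the row index.
    X_demo.to_csv(x_path, index=False)
    pd.DataFrame(y_demo).to_csv(y_path, index=False)

    # Reload; the target comes back as a one-column frame, so select the column.
    X_loaded = pd.read_csv(x_path)
    y_loaded = pd.read_csv(y_path)["Delivery_Time_min"]

print(list(X_loaded.columns))
print(X_loaded.dtypes)
print(y_loaded.tolist())
```

Because the files are written with `index=False`, the reloaded frame has no extra `Unnamed: 0` column, and the boolean one-hot flags are parsed back as `bool`.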


### Conclusion
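Four new features were created: the binned categories Distance_Category and Experience_Category, and the interaction terms Distance_x_Experience and Distance_x_Prep. Feature selection then kept the 10 highest-ranked columns of the one-hot-encoded feature matrix, which was built from the 24-column engineered dataframe after dropping Order_ID and the target. Distance_km remains the dominant predictor (63.4% importance), followed by Distance_x_Prep (14.8%). Both interaction terms made the top 10, while none of the binned category columns did.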