# Feature Engineering

The art and science of transforming raw data into meaningful features that improve model performance.

🎯 Goal:

Simulate a realistic dataset and apply the major real-world feature engineering techniques to it, step by step.

| Category                         | Techniques                                                            |
| -------------------------------- | --------------------------------------------------------------------- |
| **1. Missing Value Imputation**  | Mean, median, mode, custom values, conditional imputation             |
| **2. Handling Outliers**         | Capping (winsorization), removal, transformation                      |
| **3. Encoding Categorical**      | Label encoding, one-hot, ordinal, frequency encoding                  |
| **4. Date/Time Features**        | Extract day/month/year/week, weekday/weekend, recency                 |
| **5. Text Features**             | Word count, length, TF-IDF, sentiment (optional preview)              |
| **6. Binning**                   | Age → bins (youth, adult, senior), income brackets                    |
| **7. Group-based Aggregates**    | Mean spending per region, median income by product category           |
| **8. Feature Interactions**      | Multiply or combine 2+ features (e.g. Age × Income)                   |
| **9. Scaling**                   | MinMaxScaler, StandardScaler                                          |
| **10. Log Transformations**      | For skewed variables (e.g. income, purchases)                         |
| **11. Domain-Specific Features** | E.g., `Days_Since_Signup`, `Is_High_Spender`, `Repeat_Customer`, etc. |


## Simulating Sample Dataset



In [9]:
import pandas as pd
import numpy as np
np.random.seed(42)

n = 300

df = pd.DataFrame({
    'Customer_ID': [f'C{i:04d}' for i in range(1, n+1)],
    'Age': np.random.randint(18, 70, n),
    'Gender': np.random.choice(['Male', 'Female'], n),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'Signup_Date': pd.date_range(start='2022-01-01', periods=n, freq='D'),
    'Annual_Income': np.random.normal(60000, 20000, n).astype(int),
    'Product_Category': np.random.choice(['Grocery', 'Electronics', 'Clothing', 'Home'], n),
    'Spending_Score': np.random.randint(1, 100, n),
    'Purchased': np.random.choice([0, 1], n, p=[0.7, 0.3]),
})

# Add missing values and outliers
df.loc[np.random.randint(0, n, 10), 'Annual_Income'] = np.nan
df.loc[np.random.randint(0, n, 5), 'Age'] = np.nan
df.loc[np.random.randint(0, n, 3), 'Spending_Score'] = 999  # outliers

df.head()

Unnamed: 0,Customer_ID,Age,Gender,Region,Signup_Date,Annual_Income,Product_Category,Spending_Score,Purchased
0,C0001,56.0,Female,South,2022-01-01,69254.0,Electronics,26,0
1,C0002,69.0,Male,South,2022-01-02,39182.0,Home,47,0
2,C0003,46.0,Male,North,2022-01-03,32950.0,Grocery,32,0
3,C0004,32.0,Male,East,2022-01-04,48831.0,Electronics,10,0
4,C0005,60.0,Male,South,2022-01-05,44035.0,Electronics,16,1


## Step-by-Step Feature Engineering

## ✅ 1. Missing Value Imputation




In [10]:
# Impute Age with median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Impute Income with mean
df['Annual_Income'] = df['Annual_Income'].fillna(df['Annual_Income'].mean())
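The overview table also lists mode and conditional imputation, which the cell above does not cover. A minimal sketch on a small hypothetical frame (the `demo` values are invented for illustration and are not the tutorial's `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the tutorial's df
demo = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South', 'South'],
    'Gender': ['Male', None, 'Female', 'Female', 'Female'],
    'Annual_Income': [40000.0, np.nan, 70000.0, np.nan, 90000.0],
})

# Mode imputation for a categorical column
demo['Gender'] = demo['Gender'].fillna(demo['Gender'].mode()[0])

# Conditional imputation: fill each missing income with its own region's median
region_median = demo.groupby('Region')['Annual_Income'].transform('median')
demo['Annual_Income'] = demo['Annual_Income'].fillna(region_median)
```

Group-wise filling tends to preserve differences between segments that a single global mean would flatten.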


## 🚨 2. Handle Outliers

In [11]:
# Cap spending score at 95th percentile
cap = df['Spending_Score'].quantile(0.95)
df['Spending_Score'] = np.where(df['Spending_Score'] > cap, cap, df['Spending_Score'])
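The cell above caps only the upper tail at a fixed percentile. As a hedged alternative, the sketch below flags outliers with the common 1.5 × IQR rule and shows both two-sided capping and removal. The `scores` series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one extreme value
scores = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 999], dtype=float)

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (scores < lower) | (scores > upper)

# Two-sided capping (winsorization) at the IQR fences
capped = scores.clip(lower=lower, upper=upper)

# Alternative: drop the flagged rows instead of capping them
removed = scores[~is_outlier]
```

Capping keeps every row, which matters for small datasets. Removal is simpler but discards the other columns of those rows as well.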


## 🎯 3. Encoding Categorical Features

In [14]:
from sklearn.preprocessing import LabelEncoder

# Label encoding Gender (binary)
le = LabelEncoder()
df['Gender_Code'] = le.fit_transform(df['Gender'])

# One-hot encoding Region & Product_Category
df = pd.get_dummies(df, columns=['Region', 'Product_Category'], drop_first=True)
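Frequency and ordinal encoding appear in the overview table but not in the cell above. A minimal sketch on hypothetical data follows; the `Education` column and its ordering are assumptions made for illustration and do not exist in the tutorial's `df`.

```python
import pandas as pd

# Hypothetical toy data; 'Education' is not a column in the tutorial's df
demo = pd.DataFrame({
    'Region': ['North', 'South', 'South', 'East', 'South', 'North'],
    'Education': ['High School', 'Master', 'Bachelor', 'Bachelor', 'High School', 'Master'],
})

# Frequency encoding: replace each category with its share of rows
freq = demo['Region'].value_counts(normalize=True)
demo['Region_Freq'] = demo['Region'].map(freq)

# Ordinal encoding: map categories to integers that respect a known order
education_order = {'High School': 0, 'Bachelor': 1, 'Master': 2}
demo['Education_Code'] = demo['Education'].map(education_order)
```

Frequency encoding keeps a single column for high-cardinality features, while ordinal encoding is only appropriate when the categories have a genuine order.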


In [15]:
df.head()

Unnamed: 0,Customer_ID,Age,Gender,Signup_Date,Annual_Income,Spending_Score,Purchased,Gender_Code,Region_North,Region_South,Region_West,Product_Category_Electronics,Product_Category_Grocery,Product_Category_Home
0,C0001,56.0,Female,2022-01-01,69254.0,26.0,0,0,False,True,False,True,False,False
1,C0002,69.0,Male,2022-01-02,39182.0,47.0,0,1,False,True,False,False,False,True
2,C0003,46.0,Male,2022-01-03,32950.0,32.0,0,1,True,False,False,False,True,False
3,C0004,32.0,Male,2022-01-04,48831.0,10.0,0,1,False,False,False,True,False,False
4,C0005,60.0,Male,2022-01-05,44035.0,16.0,1,1,False,True,False,True,False,False


## 🕒 4. Date/Time Feature Extraction



In [16]:
df['Signup_Year'] = df['Signup_Date'].dt.year
df['Signup_Month'] = df['Signup_Date'].dt.month
df['Signup_Day'] = df['Signup_Date'].dt.day
df['Signup_Weekday'] = df['Signup_Date'].dt.weekday
# Recency relative to the run date: these values change each time the notebook is re-run
df['Days_Since_Signup'] = (pd.to_datetime('today') - df['Signup_Date']).dt.days
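The overview table also mentions a weekday/weekend flag. The sketch below derives it and computes recency against a fixed reference date, so the result does not depend on when the code runs. The reference date is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical signup dates; 2022-01-01 was a Saturday
signup = pd.Series(pd.date_range(start='2022-01-01', periods=5, freq='D'))

# Weekday numbering runs Monday=0 .. Sunday=6, so 5 and 6 are the weekend
is_weekend = (signup.dt.weekday >= 5).astype(int)

# A fixed reference date (an arbitrary choice here) keeps recency reproducible
reference_date = pd.Timestamp('2023-01-01')
days_since_signup = (reference_date - signup).dt.days
```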


In [17]:
df.head()

Unnamed: 0,Customer_ID,Age,Gender,Signup_Date,Annual_Income,Spending_Score,Purchased,Gender_Code,Region_North,Region_South,Region_West,Product_Category_Electronics,Product_Category_Grocery,Product_Category_Home,Signup_Year,Signup_Month,Signup_Day,Signup_Weekday,Days_Since_Signup
0,C0001,56.0,Female,2022-01-01,69254.0,26.0,0,0,False,True,False,True,False,False,2022,1,1,5,1291
1,C0002,69.0,Male,2022-01-02,39182.0,47.0,0,1,False,True,False,False,False,True,2022,1,2,6,1290
2,C0003,46.0,Male,2022-01-03,32950.0,32.0,0,1,True,False,False,False,True,False,2022,1,3,0,1289
3,C0004,32.0,Male,2022-01-04,48831.0,10.0,0,1,False,False,False,True,False,False,2022,1,4,1,1288
4,C0005,60.0,Male,2022-01-05,44035.0,16.0,1,1,False,True,False,True,False,False,2022,1,5,2,1287


## 📝 5. Text Feature Engineering (assuming a Name column exists)
The simulated dataset has no text column, so the code below is commented out. It would apply if a `Name` column existed:

In [18]:
# df['Name_Length'] = df['Name'].str.len()
# df['Word_Count'] = df['Name'].str.split().str.len()  # .str.len() tolerates missing values


## 🔢 6. Binning

In [19]:
df['Age_Bin'] = pd.cut(df['Age'], bins=[0, 25, 45, 65, 100], labels=['Youth', 'Adult', 'Mid-age', 'Senior'])
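`pd.cut` uses fixed edges that are right-inclusive by default, so an age of exactly 25 lands in `Youth`. For the income brackets mentioned in the overview table, quantile-based bins are a common alternative. The sketch below uses hypothetical incomes and label names.

```python
import pandas as pd

# Hypothetical incomes
income = pd.Series([20000, 35000, 50000, 65000, 80000, 120000])

# Quantile-based bins: each bracket receives roughly the same number of rows
income_bracket = pd.qcut(income, q=3, labels=['Low', 'Medium', 'High'])
```

Fixed edges suit variables with meaningful thresholds, such as age groups. Quantile bins adapt to the data, which helps with skewed variables.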


In [20]:
df.head()

Unnamed: 0,Customer_ID,Age,Gender,Signup_Date,Annual_Income,Spending_Score,Purchased,Gender_Code,Region_North,Region_South,Region_West,Product_Category_Electronics,Product_Category_Grocery,Product_Category_Home,Signup_Year,Signup_Month,Signup_Day,Signup_Weekday,Days_Since_Signup,Age_Bin
0,C0001,56.0,Female,2022-01-01,69254.0,26.0,0,0,False,True,False,True,False,False,2022,1,1,5,1291,Mid-age
1,C0002,69.0,Male,2022-01-02,39182.0,47.0,0,1,False,True,False,False,False,True,2022,1,2,6,1290,Senior
2,C0003,46.0,Male,2022-01-03,32950.0,32.0,0,1,True,False,False,False,True,False,2022,1,3,0,1289,Mid-age
3,C0004,32.0,Male,2022-01-04,48831.0,10.0,0,1,False,False,False,True,False,False,2022,1,4,1,1288,Adult
4,C0005,60.0,Male,2022-01-05,44035.0,16.0,1,1,False,True,False,True,False,False,2022,1,5,2,1287,Mid-age


## 📊 7. Group-based Aggregates



In [21]:
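# Note: 'Region' was replaced by dummy columns in step 3, so this groups on the
# boolean Region_West flag and yields only two means (West vs. all other regions).
# For a true per-region mean, compute the aggregate before one-hot encoding.
region_income = df.groupby('Region_West')['Annual_Income'].mean()
df['Region_Avg_Income'] = df['Region_West'].map(region_income)
df.head()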

Unnamed: 0,Customer_ID,Age,Gender,Signup_Date,Annual_Income,Spending_Score,Purchased,Gender_Code,Region_North,Region_South,...,Product_Category_Electronics,Product_Category_Grocery,Product_Category_Home,Signup_Year,Signup_Month,Signup_Day,Signup_Weekday,Days_Since_Signup,Age_Bin,Region_Avg_Income
0,C0001,56.0,Female,2022-01-01,69254.0,26.0,0,0,False,True,...,True,False,False,2022,1,1,5,1291,Mid-age,58104.461296
1,C0002,69.0,Male,2022-01-02,39182.0,47.0,0,1,False,True,...,False,False,True,2022,1,2,6,1290,Senior,58104.461296
2,C0003,46.0,Male,2022-01-03,32950.0,32.0,0,1,True,False,...,False,True,False,2022,1,3,0,1289,Mid-age,58104.461296
3,C0004,32.0,Male,2022-01-04,48831.0,10.0,0,1,False,False,...,True,False,False,2022,1,4,1,1288,Adult,58104.461296
4,C0005,60.0,Male,2022-01-05,44035.0,16.0,1,1,False,True,...,True,False,False,2022,1,5,2,1287,Mid-age,58104.461296
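Because `Region` was one-hot encoded in step 3, grouping on `Region_West` only separates West from everything else, which is why all five rows above share the same value. A minimal sketch of a true per-region aggregate, computed on a hypothetical frame that still has the original `Region` column:

```python
import pandas as pd

# Hypothetical toy data with the original (not yet encoded) Region column
demo = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'West', 'South'],
    'Annual_Income': [40000.0, 60000.0, 50000.0, 70000.0, 80000.0],
})

# transform() returns one value per row, aligned with the original index
demo['Region_Avg_Income'] = demo.groupby('Region')['Annual_Income'].transform('mean')

# Deviation from the group mean is often a useful companion feature
demo['Income_Vs_Region'] = demo['Annual_Income'] - demo['Region_Avg_Income']
```

In a real pipeline this would mean running the group-based step before `pd.get_dummies`, or keeping a copy of the raw categorical column.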


## 🔁 8. Feature Interactions

In [22]:
df['Income_X_Spending'] = df['Annual_Income'] * df['Spending_Score']
df.head()

Unnamed: 0,Customer_ID,Age,Gender,Signup_Date,Annual_Income,Spending_Score,Purchased,Gender_Code,Region_North,Region_South,...,Product_Category_Grocery,Product_Category_Home,Signup_Year,Signup_Month,Signup_Day,Signup_Weekday,Days_Since_Signup,Age_Bin,Region_Avg_Income,Income_X_Spending
0,C0001,56.0,Female,2022-01-01,69254.0,26.0,0,0,False,True,...,False,False,2022,1,1,5,1291,Mid-age,58104.461296,1800604.0
1,C0002,69.0,Male,2022-01-02,39182.0,47.0,0,1,False,True,...,False,True,2022,1,2,6,1290,Senior,58104.461296,1841554.0
2,C0003,46.0,Male,2022-01-03,32950.0,32.0,0,1,True,False,...,True,False,2022,1,3,0,1289,Mid-age,58104.461296,1054400.0
3,C0004,32.0,Male,2022-01-04,48831.0,10.0,0,1,False,False,...,False,False,2022,1,4,1,1288,Adult,58104.461296,488310.0
4,C0005,60.0,Male,2022-01-05,44035.0,16.0,1,1,False,True,...,False,False,2022,1,5,2,1287,Mid-age,58104.461296,704560.0


## 📏 9. Scaling Features

In [23]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age_scaled', 'Annual_Income_scaled']] = scaler.fit_transform(df[['Age', 'Annual_Income']])
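The cell above fits the scaler on the full dataset, which is fine for exploration. When a model is evaluated on held-out data, the usual practice is to fit the scaler on the training rows only. The sketch below shows that pattern with `MinMaxScaler`, the other scaler named in the overview table, on a hypothetical feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature
X = np.arange(10, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Learn min and max from the training rows only, then reuse them on the test rows
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Test values can fall slightly outside the 0 to 1 range when they exceed the training minimum or maximum. That is expected behavior.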


In [24]:
df.head()

Unnamed: 0,Customer_ID,Age,Gender,Signup_Date,Annual_Income,Spending_Score,Purchased,Gender_Code,Region_North,Region_South,...,Signup_Year,Signup_Month,Signup_Day,Signup_Weekday,Days_Since_Signup,Age_Bin,Region_Avg_Income,Income_X_Spending,Age_scaled,Annual_Income_scaled
0,C0001,56.0,Female,2022-01-01,69254.0,26.0,0,0,False,True,...,2022,1,1,5,1291,Mid-age,58104.461296,1800604.0,0.841949,0.476498
1,C0002,69.0,Male,2022-01-02,39182.0,47.0,0,1,False,True,...,2022,1,2,6,1290,Senior,58104.461296,1841554.0,1.703335,-1.003203
2,C0003,46.0,Male,2022-01-03,32950.0,32.0,0,1,True,False,...,2022,1,3,0,1289,Mid-age,58104.461296,1054400.0,0.179345,-1.30985
3,C0004,32.0,Male,2022-01-04,48831.0,10.0,0,1,False,False,...,2022,1,4,1,1288,Adult,58104.461296,488310.0,-0.748301,-0.528421
4,C0005,60.0,Male,2022-01-05,44035.0,16.0,1,1,False,True,...,2022,1,5,2,1287,Mid-age,58104.461296,704560.0,1.106991,-0.76441


## 🔄 10. Log Transformations

In [25]:
df['Log_Income'] = np.log1p(df['Annual_Income'])

df.head()

Unnamed: 0,Customer_ID,Age,Gender,Signup_Date,Annual_Income,Spending_Score,Purchased,Gender_Code,Region_North,Region_South,...,Signup_Month,Signup_Day,Signup_Weekday,Days_Since_Signup,Age_Bin,Region_Avg_Income,Income_X_Spending,Age_scaled,Annual_Income_scaled,Log_Income
0,C0001,56.0,Female,2022-01-01,69254.0,26.0,0,0,False,True,...,1,1,5,1291,Mid-age,58104.461296,1800604.0,0.841949,0.476498,11.145551
1,C0002,69.0,Male,2022-01-02,39182.0,47.0,0,1,False,True,...,1,2,6,1290,Senior,58104.461296,1841554.0,1.703335,-1.003203,10.575998
2,C0003,46.0,Male,2022-01-03,32950.0,32.0,0,1,True,False,...,1,3,0,1289,Mid-age,58104.461296,1054400.0,0.179345,-1.30985,10.402777
3,C0004,32.0,Male,2022-01-04,48831.0,10.0,0,1,False,False,...,1,4,1,1288,Adult,58104.461296,488310.0,-0.748301,-0.528421,10.796141
4,C0005,60.0,Male,2022-01-05,44035.0,16.0,1,1,False,True,...,1,5,2,1287,Mid-age,58104.461296,704560.0,1.106991,-0.76441,10.692763
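Two details about `np.log1p` are worth a sketch: it returns NaN for inputs below -1 (which normally distributed simulated incomes could in principle produce), and `np.expm1` inverts it. The example clips at zero first. That clipping is an assumption about how negative incomes should be treated, not a step from the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, including one invalid negative value
income = pd.Series([-500.0, 20000.0, 60000.0, 1000000.0])

# Clip negatives to zero so the logarithm is defined for every row
safe_income = income.clip(lower=0)

# log1p computes log(1 + x), so zero maps to zero instead of -inf
log_income = np.log1p(safe_income)

# expm1 inverts log1p, e.g. to map predictions back to the original scale
recovered = np.expm1(log_income)
```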


## 📌 11. Domain-Specific Features



In [26]:
df['Is_High_Spender'] = (df['Spending_Score'] > 75).astype(int)
df['Repeat_Customer'] = np.random.choice([0, 1], size=len(df))  # random placeholder; real data would derive this from purchase history
df.head()

Unnamed: 0,Customer_ID,Age,Gender,Signup_Date,Annual_Income,Spending_Score,Purchased,Gender_Code,Region_North,Region_South,...,Signup_Weekday,Days_Since_Signup,Age_Bin,Region_Avg_Income,Income_X_Spending,Age_scaled,Annual_Income_scaled,Log_Income,Is_High_Spender,Repeat_Customer
0,C0001,56.0,Female,2022-01-01,69254.0,26.0,0,0,False,True,...,5,1291,Mid-age,58104.461296,1800604.0,0.841949,0.476498,11.145551,0,0
1,C0002,69.0,Male,2022-01-02,39182.0,47.0,0,1,False,True,...,6,1290,Senior,58104.461296,1841554.0,1.703335,-1.003203,10.575998,0,0
2,C0003,46.0,Male,2022-01-03,32950.0,32.0,0,1,True,False,...,0,1289,Mid-age,58104.461296,1054400.0,0.179345,-1.30985,10.402777,0,1
3,C0004,32.0,Male,2022-01-04,48831.0,10.0,0,1,False,False,...,1,1288,Adult,58104.461296,488310.0,-0.748301,-0.528421,10.796141,0,0
4,C0005,60.0,Male,2022-01-05,44035.0,16.0,1,1,False,True,...,2,1287,Mid-age,58104.461296,704560.0,1.106991,-0.76441,10.692763,0,0
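`Repeat_Customer` is filled with random values above because the simulated data has no purchase history. With a transaction log, the flag could be derived as sketched below. The `orders` table and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical purchase log: one row per order
orders = pd.DataFrame({
    'Customer_ID': ['C0001', 'C0001', 'C0002', 'C0003', 'C0003', 'C0003'],
    'Amount': [20.0, 35.0, 15.0, 50.0, 10.0, 25.0],
})

# Aggregate to one row per customer
per_customer = orders.groupby('Customer_ID')['Amount'].agg(
    Order_Count='count',
    Total_Spend='sum',
)

# A customer with more than one order counts as a repeat customer
per_customer['Repeat_Customer'] = (per_customer['Order_Count'] > 1).astype(int)
```

The resulting table can then be joined back to the customer-level frame on `Customer_ID`.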


## 🔍 Final Overview

In [29]:
df.head(5)


Unnamed: 0,Customer_ID,Age,Gender,Signup_Date,Annual_Income,Spending_Score,Purchased,Gender_Code,Region_North,Region_South,...,Signup_Weekday,Days_Since_Signup,Age_Bin,Region_Avg_Income,Income_X_Spending,Age_scaled,Annual_Income_scaled,Log_Income,Is_High_Spender,Repeat_Customer
0,C0001,56.0,Female,2022-01-01,69254.0,26.0,0,0,False,True,...,5,1291,Mid-age,58104.461296,1800604.0,0.841949,0.476498,11.145551,0,0
1,C0002,69.0,Male,2022-01-02,39182.0,47.0,0,1,False,True,...,6,1290,Senior,58104.461296,1841554.0,1.703335,-1.003203,10.575998,0,0
2,C0003,46.0,Male,2022-01-03,32950.0,32.0,0,1,True,False,...,0,1289,Mid-age,58104.461296,1054400.0,0.179345,-1.30985,10.402777,0,1
3,C0004,32.0,Male,2022-01-04,48831.0,10.0,0,1,False,False,...,1,1288,Adult,58104.461296,488310.0,-0.748301,-0.528421,10.796141,0,0
4,C0005,60.0,Male,2022-01-05,44035.0,16.0,1,1,False,True,...,2,1287,Mid-age,58104.461296,704560.0,1.106991,-0.76441,10.692763,0,0


In [30]:
print(df.dtypes)

Customer_ID                             object
Age                                    float64
Gender                                  object
Signup_Date                     datetime64[ns]
Annual_Income                          float64
Spending_Score                         float64
Purchased                                int64
Gender_Code                              int64
Region_North                              bool
Region_South                              bool
Region_West                               bool
Product_Category_Electronics              bool
Product_Category_Grocery                  bool
Product_Category_Home                     bool
Signup_Year                              int32
Signup_Month                             int32
Signup_Day                               int32
Signup_Weekday                           int32
Days_Since_Signup                        int64
Age_Bin                               category
Region_Avg_Income                      float64
Income_X_Spen

**✅ Conclusion: Feature Engineering**

You now have a feature-rich dataset with:

- Cleaned numerical features
- Encoded categorical features
- Time-based and domain-aware variables
- Interactions and scaled features

