# Feature Engineering & Preprocessing

Objective:
Prepare customer data for clustering by selecting relevant features, encoding variables, and applying scaling.

Key Focus:
- Justify feature inclusion/exclusion
- Encode categorical variables if used
- Apply scaling
- Explain why K-Means is sensitive to scaling
- Experiment with at least two feature combinations

In [44]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

In [45]:
df = pd.read_csv(r'D:\Bridgeon\TASK\Customer_Segmentation_Clustering\data\Mall_Customers.csv')
df.head()

| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |


## Feature Selection 

### Excluded Features
- **CustomerID**: Unique identifier with no behavioral meaning; including it would add an arbitrary numeric dimension to the distance calculation.

### Considered Features
- **Age**: Reflects life stage and purchasing behavior
- **Annual Income (k$)**: Represents purchasing power
- **Spending Score (1–100)**: Direct measure of spending behavior


## Encoding Categorical Variables

Gender is encoded as a binary variable (Male = 0, Female = 1) using a manual label mapping.
This lets us test whether adding a demographic feature produces more clearly separated customer groups.

In [46]:
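# Keep an untouched copy of the raw data for later reference
df_orginal = df.copy()

# Binary encoding: Male -> 0, Female -> 1
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

# CustomerID is an identifier, not a feature
df = df.drop(columns="CustomerID")
df.head()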

| | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|
| 0 | 0 | 19 | 15 | 39 |
| 1 | 0 | 21 | 15 | 81 |
| 2 | 1 | 20 | 16 | 6 |
| 3 | 1 | 23 | 16 | 77 |
| 4 | 1 | 31 | 17 | 40 |
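
One caveat with a dictionary-based `map`: any value that is not a key in the dictionary is silently turned into `NaN` instead of raising an error. The following is a minimal sketch of a guard for that case. The `toy` frame, its `"Other"` value, and the `gender_map` name are hypothetical and only serve the illustration; they are not part of the dataset shown above.

```python
import pandas as pd

# Hypothetical toy data; "Other" is deliberately absent from the mapping.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Other"]})

gender_map = {"Male": 0, "Female": 1}
encoded = toy["Gender"].map(gender_map)

# Values missing from the mapping come back as NaN rather than raising.
unmapped = toy.loc[encoded.isna(), "Gender"].unique().tolist()
print("Unmapped categories:", unmapped)
```

Running a check like this right after encoding makes it easier to catch unexpected categories before they reach the scaler, which would otherwise propagate the missing values.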


## Why Scaling is Required

K-Means relies on Euclidean distance.
Features with larger numeric ranges dominate distance calculations.

Example:
- Annual Income ranges up to ~140
- Spending Score ranges up to 100
- Age ranges up to ~70

Without scaling, income would disproportionately influence cluster formation.
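
To make the scale sensitivity concrete, here is a small sketch that uses the income and spending score values of rows 0, 3 and 4 from the `df.head()` output above. It finds the nearest neighbour of the first customer twice: once with income in k$, and once with the same income expressed in dollars. The helper `nearest_to_first` is a hypothetical name introduced for this illustration.

```python
import numpy as np

# (Annual Income, Spending Score) for rows 0, 3 and 4 of df.head() above.
income_k = np.array([15.0, 16.0, 17.0])   # income in thousands of dollars
score = np.array([39.0, 77.0, 40.0])

def nearest_to_first(points):
    """Index of the point closest (Euclidean) to points[0], excluding itself."""
    d = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(d)) + 1

in_k = np.column_stack([income_k, score])
in_dollars = np.column_stack([income_k * 1000, score])

nearest_k = nearest_to_first(in_k)
nearest_dollars = nearest_to_first(in_dollars)
print(nearest_k, nearest_dollars)
```

The underlying customers are identical in both cases; only the unit of one feature changes, yet the nearest neighbour of the first customer changes with it. Since K-Means assigns points to centroids by the same Euclidean distance, its clusters depend on feature units in the same way.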

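StandardScaler rescales each feature to a mean of 0 and a standard deviation of 1. Because this is a linear transformation applied per feature, the relative ordering and spacing of values within each feature are preserved, while every feature contributes to the distance on a comparable scale.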
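
As a quick sanity check, the sketch below compares `StandardScaler` against a manual z-score, (x - mean) / std, on the Age, Annual Income and Spending Score values of the five rows shown earlier. The array `X` and the variable `manual` are illustrative names. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is NumPy's default but differs from the pandas default of `ddof=1`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: Age, Annual Income (k$), Spending Score; rows from df.head() above.
X = np.array([
    [19, 15, 39],
    [21, 15, 81],
    [20, 16, 6],
    [23, 16, 77],
    [31, 17, 40],
], dtype=float)

X_scaled = StandardScaler().fit_transform(X)

# Manual z-score with the population standard deviation (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```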

In [47]:
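scaler = StandardScaler()

# fit_transform returns a NumPy array; wrap it to keep column names and the original index
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)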


### Preprocessing Conclusion

- Encoded categorical data for experimentation  
- Applied scaling to ensure fair distance computation  

⚠️ Note: Feature selection is not final yet. We will experiment with at least two different feature combinations to evaluate their impact on clustering quality.
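
One possible way to organise that experiment is sketched below: each candidate feature combination gets its own scaler and its own scaled frame. The two combinations, the dictionary keys, and the `sample` frame (the five encoded rows shown earlier, used as stand-in data) are assumptions made for illustration, not the final choice of features.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five encoded rows from the output above, used as stand-in data.
sample = pd.DataFrame({
    "Gender": [0, 0, 1, 1, 1],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

# Hypothetical feature combinations; names and choices are illustrative.
feature_sets = {
    "income_spending": ["Annual Income (k$)", "Spending Score (1-100)"],
    "age_income_spending": ["Age", "Annual Income (k$)", "Spending Score (1-100)"],
}

scaled_sets = {}
for name, cols in feature_sets.items():
    # Fit a separate scaler per combination so each one matches its own columns.
    scaler = StandardScaler()
    scaled_sets[name] = pd.DataFrame(
        scaler.fit_transform(sample[cols]), columns=cols, index=sample.index
    )

for name, frame in scaled_sets.items():
    print(name, frame.shape)
```

Keeping the scaled frames in a dictionary keyed by combination name makes it straightforward to loop over them later and compare clustering results side by side.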

Next Step:  
K-Means clustering and model selection