# 4. Feature Engineering

In this notebook, we transform the cleaned dataset into a modeling-ready format.

### Objectives:
- Encode categorical variables
- Scale numerical features (optional)
- Prepare final modeling dataframe
- Split data into training and testing sets

This step ensures that machine learning models receive properly formatted and meaningful inputs.

## 4.1 Import Clean Dataset

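We load the dataset carried over from the previous EDA notebook. The path below points at the `data/raw` directory, so confirm it matches the location where the cleaned file was saved.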

In [1]:
import pandas as pd

df = pd.read_csv("../data/raw/Bank Customer Churn Prediction.csv")
df.head()

|   | customer_id | credit_score | country | gender | age | tenure | balance | products_number | credit_card | active_member | estimated_salary | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 15647311 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 15619304 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 15701354 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 15737888 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


## 4.2 Encode Categorical Variables

We convert categorical variables into numerical format required by machine learning models.

Encoding steps:
- **Gender** → Binary mapping (Male = 0, Female = 1)
- **Country** → One-hot encoding with `drop_first=True`: France becomes the baseline category, leaving `country_Germany` and `country_Spain`

In [3]:
df['gender'] = df['gender'].map({'Male': 0, 'Female': 1})
df[['gender']].head()

|   | gender |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
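
`Series.map` with a dictionary returns `NaN` for any value that is missing from the dictionary, so a stray label (different casing, a blank) would silently become a missing value. A minimal sketch of a check that could follow the mapping, using a small made-up frame; `toy`, `gender_map`, and the lowercase `"female"` entry are illustrative assumptions, not part of the dataset:

```python
import pandas as pd

# Toy data (hypothetical): one label deliberately falls outside the mapping.
toy = pd.DataFrame({"gender": ["Male", "Female", "female", "Male"]})

gender_map = {"Male": 0, "Female": 1}
encoded = toy["gender"].map(gender_map)

# Labels that were not in the mapping come back as NaN after .map().
unmapped = toy.loc[encoded.isna(), "gender"].unique().tolist()
print("Unmapped labels:", unmapped)
```

If the list is non-empty, the mapping or the upstream cleaning step probably needs another look before modeling.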


In [4]:
df = pd.get_dummies(df, columns=['country'], drop_first=True)
df.head()

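|   | customer_id | credit_score | gender | age | tenure | balance | products_number | credit_card | active_member | estimated_salary | churn | country_Germany | country_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | 619 | 1 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | False | False |
| 1 | 15647311 | 608 | 1 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | False | True |
| 2 | 15619304 | 502 | 1 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | False | False |
| 3 | 15701354 | 699 | 1 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | False | False |
| 4 | 15737888 | 850 | 1 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | False | True |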
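
The output above shows the dummy columns as `True`/`False`, and France has no column of its own because `drop_first=True` removes the first category in sorted order. As a sketch of how this behaves, assuming a tiny made-up frame (`toy` is hypothetical), the `dtype` argument of `pd.get_dummies` can be used if 0/1 integers are preferred:

```python
import pandas as pd

# Toy data (hypothetical) with the same three country labels as the dataset.
toy = pd.DataFrame({"country": ["France", "Spain", "Germany", "France"]})

# drop_first=True drops the first category in sorted order (France),
# which then acts as the baseline; dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(toy, columns=["country"], drop_first=True, dtype=int)
print(dummies)
```

A row with zeros in both columns therefore represents a customer from France.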


## 4.3 Define Features (X) and Target (y)

We separate:
- **X:** Independent variables (every column except `churn` and the identifier `customer_id`)  
- **y:** Target variable (churn)

In [5]:
X = df.drop(['churn', 'customer_id'], axis=1)
y = df['churn']

X.head(), y.head()

(   credit_score  gender  age  tenure    balance  products_number  credit_card  \
 0           619       1   42       2       0.00                1            1   
 1           608       1   41       1   83807.86                1            0   
 2           502       1   42       8  159660.80                3            1   
 3           699       1   39       1       0.00                2            0   
 4           850       1   43       2  125510.82                1            1   
 
    active_member  estimated_salary  country_Germany  country_Spain  
 0              1         101348.88            False          False  
 1              1         112542.58            False           True  
 2              0         113931.57            False          False  
 3              0          93826.63            False          False  
 4              1          79084.10            False           True  ,
 0    1
 1    0
 2    1
 3    0
 4    0
 Name: churn, dtype: int64)

## 4.4 Train-Test Split

We split the data into:
- **80% training data**
- **20% testing data**

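This lets us evaluate the model on data it has not seen during training. The split is stratified on `y`, so both sets keep the same churn rate, and `random_state=42` makes it reproducible.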

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train.shape, X_test.shape

((8000, 11), (2000, 11))
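
To illustrate what `stratify=y` does, the following sketch compares the positive-class rate before and after a split. It uses synthetic data as a stand-in for the real frame; `X_toy`, `y_toy`, and the 20% positive rate are assumptions made for the example only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical): 1,000 rows, 20% of them positive.
y_toy = pd.Series([1] * 200 + [0] * 800, name="churn")
X_toy = pd.DataFrame({"feature": np.arange(1000)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, the positive rate should match across all three.
print("full:", y_toy.mean(), "train:", y_tr.mean(), "test:", y_te.mean())
```

The same comparison can be run on `y`, `y_train`, and `y_test` from the cell above.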

## 4.5 Feature Scaling (Standardization)

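We fit `StandardScaler` on the training set only and then apply the same transformation to the test set, so no test-set statistics leak into training. Only the continuous columns listed in `num_cols` are scaled; the binary columns and the small-integer columns (`tenure`, `products_number`) are left unchanged.

Standardization puts these features on comparable scales, which matters for distance-based models such as k-nearest neighbors and for gradient-based models. Tree-based models are largely insensitive to it.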

In [8]:
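from sklearn.preprocessing import StandardScaler

# Continuous columns to standardize; all other columns are left as-is
num_cols = ['age', 'credit_score', 'balance', 'estimated_salary']

# Work on copies so the assignments below do not trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

scaler = StandardScaler()
# Fit on the training set only, then reuse those statistics for the test set
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])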
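
An alternative way to scale only selected columns is scikit-learn's `ColumnTransformer`, which is not what this notebook uses but expresses the same idea in a single object. The sketch below assumes two tiny made-up frames (`train` and `test` are hypothetical) and shows that the test rows are scaled with the training mean and standard deviation:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frames (hypothetical): one continuous column, one binary column.
train = pd.DataFrame({"age": [20.0, 40.0, 60.0], "gender": [0, 1, 0]})
test = pd.DataFrame({"age": [40.0, 80.0], "gender": [1, 1]})

# Scale "age" only; every other column is passed through untouched.
ct = ColumnTransformer(
    [("num", StandardScaler(), ["age"])],
    remainder="passthrough",
)

train_scaled = ct.fit_transform(train)  # learns mean and std from train
test_scaled = ct.transform(test)        # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Note that the result is a NumPy array in which the transformed columns come first and the passed-through columns follow, so column order can differ from the input frame.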

## 4.6 Final Modeling Dataset Ready

We have completed:
- Encoding  
- Train-test split  
- Scaling  

`X_train`, `X_test`, `y_train`, and `y_test` are now ready for machine learning modeling.