# Bank Customer Churn - Feature Engineering and Feature Selection

With our model defined and initial preprocessing done, it is now time for Feature Engineering and Feature Selection.

## Feature Engineering

Recalling from the first notebook the possible Feature Engineering operations that can be applied:

- Since we have the salary, relate income to Balance, Age, etc.
- If IsActiveMember refers to active account movement, relate the account balance to being an active member.
- Considering Tenure is how long the customer has been with the bank, it can be related to features like NumOfProducts and IsActiveMember.
- Group numerical features into categorical features: Age, CreditScore, Tenure.

Giving it more thought rather than leaving it that broad, the list can be rearranged as:

- Balance to Income Ratio
- Income to Age Ratio
- Engagement Score: relate IsActiveMember to NumOfProducts and Tenure on a scale from 1 to 0, with 1 being an active member with many products and a long tenure, and 0 being one with few products and a short tenure, or simply an inactive member.
- Product Tenure Score: product of Tenure and NumOfProducts.
- Bin CreditScore, since it can be grouped into defined ranges.
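
As a small illustration of the engagement rule in the list above, the sketch below scores four made-up customers. The thresholds are the ones used in the `fe_pipeline` cell further down; the toy rows and the use of `np.select` (instead of nested `np.where`) are assumptions made only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical customers, invented for illustration only
toy = pd.DataFrame({
    "IsActiveMember": [1, 1, 1, 0],
    "NumOfProducts":  [3, 2, 1, 4],
    "Tenure":         [6, 1, 2, 9],
})

active = toy["IsActiveMember"] == 1
high = (toy["NumOfProducts"] >= 3) & (toy["Tenure"] >= 5)
mid = (toy["NumOfProducts"] >= 2) | (toy["Tenure"] >= 3)

# Conditions are checked in order; rows matching none of them get the default
toy["Engagement_Score"] = np.select(
    [active & high, active & mid],
    [1.0, 0.5],
    default=0.0,
)
print(toy["Engagement_Score"].tolist())  # [1.0, 0.5, 0.0, 0.0]
```

The last row shows that an inactive member scores 0 regardless of products and tenure.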


In [20]:
import pickle
import numpy as np
import pandas as pd

from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import f1_score, accuracy_score
from xgboost import XGBClassifier

All new features are created up front, but their impact is evaluated with a greedy forward search rather than all at once: starting from the original feature space, the new feature that yields the best cross-validated score is added at each step, until the feature space comprises all new features. The feature set with the best score along the way is the one used in the pipeline.
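
To give a sense of the cost of this search, the sketch below counts how many candidate feature sets are scored when features are added one by one, compared with trying every non-empty combination of the new features. The helper names are hypothetical and exist only for this illustration.

```python
def greedy_evaluations(k: int) -> int:
    """Candidate sets scored when adding k new features one at a time."""
    return sum(range(1, k + 1))  # k + (k - 1) + ... + 1


def exhaustive_evaluations(k: int) -> int:
    """Non-empty subsets of k new features an exhaustive search would score."""
    return 2 ** k - 1


k = 5  # number of engineered features listed above
greedy_count = greedy_evaluations(k)
exhaustive_count = exhaustive_evaluations(k)
print(greedy_count, exhaustive_count)  # 15 31
```

Each candidate set is scored with cross-validation, so the number of model fits is the count above multiplied by the number of folds.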


In [3]:
data_path = '../data/interim/churn_customer_preprocessing.csv'
df = pd.read_csv(data_path)

In [4]:
df

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,France,Germany,Spain
0,619,1,42,2,0.00,1,1,1,101348.88,1,1,0,0
1,608,1,41,1,83807.86,1,0,1,112542.58,0,0,0,1
2,502,1,42,8,159660.80,3,1,0,113931.57,1,1,0,0
3,699,1,39,1,0.00,2,0,0,93826.63,0,1,0,0
4,850,1,43,2,125510.82,1,1,1,79084.10,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,771,0,39,5,0.00,2,1,0,96270.64,0,1,0,0
9996,516,0,35,10,57369.61,1,1,1,101699.77,0,1,0,0
9997,709,1,36,7,0.00,1,0,1,42085.58,1,1,0,0
9998,772,0,42,3,75075.31,2,1,0,92888.52,1,0,1,0


In [5]:
x = df.drop(columns='Exited')
y = df.Exited

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y, random_state=0)

In [7]:
original_feature_space = list(x.columns)

In [8]:
def fe_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with the five engineered features appended."""
    df_copy = df.copy()

    # Ratio and interaction features
    df_copy['Balance_Income_ratio'] = df_copy['Balance'] / df_copy['EstimatedSalary']
    df_copy['Income_Age_ratio'] = df_copy['EstimatedSalary'] / df_copy['Age']
    df_copy['Products_Tenure_relation'] = df_copy['NumOfProducts'] * df_copy['Tenure']

    # Engagement score: 1 (high), 0.5 (mid) or 0 (low); inactive members always get 0
    high_engagement_mask = (df_copy['NumOfProducts'] >= 3) & (df_copy['Tenure'] >= 5)
    mid_engagement_mask = (df_copy['NumOfProducts'] >= 2) | (df_copy['Tenure'] >= 3)
    df_copy['Engagement_Score'] = np.where(
        df_copy['IsActiveMember'] == 1,
        np.where(high_engagement_mask, 1, np.where(mid_engagement_mask, 0.5, 0)),
        0,
    )

    # Left-closed credit score bins: [300, 580), [580, 670), [670, 740), [740, 800), [800, 870)
    score_bins = [300, 580, 670, 740, 800, 870]
    score_labels = [0, 1, 2, 3, 4]
    df_copy['Credit_Score_bins'] = pd.cut(
        df_copy['CreditScore'], bins=score_bins, labels=score_labels, right=False
    ).astype(int)

    return df_copy

In [9]:
x_train = fe_pipeline(x_train)
x_test = fe_pipeline(x_test)
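
The credit score binning relies on `right=False`, which makes each bin include its left edge and exclude its right edge. The sketch below applies the same edges and labels as `fe_pipeline` to a few made-up scores chosen to sit on the boundaries; the sample values are assumptions for illustration.

```python
import pandas as pd

score_bins = [300, 580, 670, 740, 800, 870]
score_labels = [0, 1, 2, 3, 4]

# Hypothetical scores, picked to fall on or next to the bin edges
sample_scores = pd.Series([350, 579, 580, 669, 670, 850])

sample_bins = pd.cut(
    sample_scores, bins=score_bins, labels=score_labels, right=False
).astype(int)
print(sample_bins.tolist())  # [0, 0, 1, 1, 2, 4]
```

A score outside the range 300 to 869 would produce a missing value, and the `astype(int)` call would then fail, so the edges need to cover the data.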

In [10]:
def evaluate_model(model, x_features: pd.DataFrame, y: pd.Series, skf: StratifiedKFold) -> float:
    """Mean cross-validated F1 score of `model` on the given features."""
    return cross_val_score(model, x_features, y, cv=skf, scoring='f1', error_score='raise').mean()

def evaluate_feature_list(feature_performance_list: list[tuple[list[str], float]]) -> list[str]:
    """Return the feature list with the highest score."""
    sorted_feature_list = sorted(feature_performance_list, key=lambda x: x[1], reverse=True)
    best_feature_space = sorted_feature_list[0][0]
    return best_feature_space

def greedy_feature_selection(
    model: XGBClassifier,
    skf: StratifiedKFold,
    x: pd.DataFrame,
    y: pd.Series,
    selected_features: list[str],
    remaining_features: list[str]) -> list[str]:
    """Greedy forward selection: add the best-scoring candidate feature at each step."""
    # Work on a copy so the caller's list of candidates is not mutated
    remaining_features = list(remaining_features)
    # Store the feature space and score reached at each step
    feature_performance = []

    while remaining_features:
        best_score = -np.inf
        best_feature = None
        best_features = selected_features.copy()

        # Evaluate the addition of each remaining feature
        for feature in remaining_features:
            current_features = selected_features + [feature]
            score = evaluate_model(model, x[current_features], y, skf)
            if score > best_score:
                best_score = score
                best_feature = feature
                best_features = current_features

        # Add the best feature to the selected list
        selected_features = best_features
        remaining_features.remove(best_feature)

        # Store the result for this step
        feature_performance.append((selected_features, best_score))

    # Pick the step with the highest score across the whole search
    best_feature_space = evaluate_feature_list(feature_performance)

    return best_feature_space


In [11]:
xgb_model = XGBClassifier()
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

new_feature_space = list(x_train.columns[-5:])
best_feature_space = greedy_feature_selection(
    model=xgb_model,
    skf=skf,
    x=x_train,
    y=y_train,
    selected_features=original_feature_space,
    remaining_features=new_feature_space,
    )

best_feature_space   

['CreditScore',
 'Gender',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'HasCrCard',
 'IsActiveMember',
 'EstimatedSalary',
 'France',
 'Germany',
 'Spain',
 'Products_Tenure_relation']

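This gives the best-scoring feature space that includes new features, or a single new feature in this case: only Products_Tenure_relation was kept. Note that the original feature space on its own is not among the candidates compared, so at least one new feature is always retained.

Even though this can be considered a feature selection step, it was applied only to the new features. A more thorough selection method should be applied to ensure that only features that provide useful information to the model are used.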

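Here, recursive feature elimination (RFE) is used. It fits the model, ranks the features by importance (the feature importances, for a tree-based model such as XGBClassifier), removes the least important one, and repeats until the desired number of features is reached. With 13 features in the best feature space and n_features_to_select=12, a single feature is eliminated.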
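
The elimination loop can be seen on a small synthetic problem. The sketch below is an assumption-laden illustration, not the notebook's setup: it swaps in scikit-learn's `RandomForestClassifier` for XGBoost and uses generated data, keeping 3 of 6 features so that several elimination rounds take place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 6 features, of which 3 carry signal
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

demo_rfe = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=3,  # one feature is dropped per round by default
)
demo_rfe.fit(X_demo, y_demo)

print(demo_rfe.support_)  # boolean mask of the features that were kept
print(demo_rfe.ranking_)  # 1 = kept; larger values were eliminated earlier
```

The `support_` mask is what the notebook uses to map the selection back to column names.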

In [12]:
rfe = RFE(xgb_model, n_features_to_select=12)

x_train_selected = rfe.fit_transform(x_train[best_feature_space], y_train)

In [13]:
filtered_feature_space = [col for col, selected in zip(best_feature_space, rfe.support_) if selected]

In [14]:
filtered_feature_space

['CreditScore',
 'Gender',
 'Age',
 'Balance',
 'NumOfProducts',
 'HasCrCard',
 'IsActiveMember',
 'EstimatedSalary',
 'France',
 'Germany',
 'Spain',
 'Products_Tenure_relation']

In [15]:
# Model with all features

model = XGBClassifier()
model.fit(x_train, y_train)

print(f' F1 Score: {f1_score(y_test, model.predict(x_test))}')
print(f'Accuracy: {accuracy_score(y_test, model.predict(x_test))}')

 F1 Score: 0.5637583892617449
Accuracy: 0.8483333333333334


In [16]:
# Model with selected features

model = XGBClassifier()
model.fit(x_train[best_feature_space], y_train)

print(f' F1 Score: {f1_score(y_test, model.predict(x_test[best_feature_space]))}')
print(f'Accuracy: {accuracy_score(y_test, model.predict(x_test[best_feature_space]))}')

 F1 Score: 0.5719806763285025
Accuracy: 0.8523333333333334


In [17]:
# Model with selected features via RFE

model_new_features = XGBClassifier()
model_new_features.fit(x_train[filtered_feature_space], y_train)

print(f' F1 Score: {f1_score(y_test, model_new_features.predict(x_test[filtered_feature_space]))}')
print(f'Accuracy: {accuracy_score(y_test, model_new_features.predict(x_test[filtered_feature_space]))}')

 F1 Score: 0.5575992255566312
Accuracy: 0.8476666666666667


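Applying RFE does not improve performance on the test set: the F1 score drops from 0.572 with the greedy-selected feature space to 0.558 once RFE removes Tenure, slightly below the 0.564 obtained with all features. The differences are small and measured on a single split, so they may reflect the limited amount of data rather than a real effect.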
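
One way to judge whether gaps of this size are meaningful is to compare the feature sets across cross-validation folds and look at the spread as well as the mean. The sketch below shows the idea on synthetic data with a scikit-learn `RandomForestClassifier`; the data, the model, and the two feature subsets are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem with 8 features
X_cv, y_cv = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)
cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_model = RandomForestClassifier(n_estimators=50, random_state=0)

# Two hypothetical feature subsets, given as column indices
feature_sets = {"all_features": list(range(8)), "first_six": list(range(6))}

cv_summary = {}
for name, cols in feature_sets.items():
    scores = cross_val_score(cv_model, X_cv[:, cols], y_cv, cv=cv_folds, scoring="f1")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the difference between two means is smaller than the fold-to-fold standard deviation, the ranking of the feature sets is not well supported.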

In any case, this process defined which features make up the final arrangement of the data used in the model. The next step is to apply hyperparameter tuning to improve performance further.

Before that, it is useful to prepare the data for the training step. I'll also split the test data in half into validation and test sets to be used there.

In [19]:
x_valid, x_test, y_valid, y_test = train_test_split(x_test, y_test, test_size=0.5, stratify=y_test, random_state=0)
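
Combined with the earlier `test_size=0.3` split, halving the held-out data gives a 70/15/15 partition. The sketch below reproduces the two stratified splits on placeholder data, assuming 10,000 rows and a positive rate of roughly 20%; both figures are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y_all = (rng.random(10_000) < 0.2).astype(int)  # placeholder target
X_all = np.arange(10_000).reshape(-1, 1)         # placeholder features

# First split: 70% train, 30% held out
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=0
)
# Second split: held-out data halved into validation and test
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0
)

sizes = (len(X_tr), len(X_val), len(X_te))
print(sizes)  # (7000, 1500, 1500)
print(round(y_tr.mean(), 3), round(y_val.mean(), 3), round(y_te.mean(), 3))
```

Stratifying both splits keeps the share of positive cases nearly identical across the three sets.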

In [21]:
x_train = x_train[best_feature_space]
x_valid = x_valid[best_feature_space]
x_test = x_test[best_feature_space]

training_data = {
    'x_train': x_train,
    'x_valid': x_valid, 
    'x_test': x_test,
    'y_train': y_train,
    'y_valid': y_valid, 
    'y_test': y_test,    
}

with open('../data/processed/training_data.pkl','wb') as f:
    pickle.dump(training_data, f)
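
A later notebook can read the dictionary back with `pickle.load`. The sketch below shows the round trip in a temporary directory with a tiny placeholder dictionary, so it does not touch the real file; the contents are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder data standing in for the real training dictionary
demo_data = {
    "x_train": pd.DataFrame({"Age": [42, 41], "Tenure": [2, 1]}),
    "y_train": pd.Series([1, 0], name="Exited"),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "training_data.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump(demo_data, f)
    with open(demo_path, "rb") as f:
        restored = pickle.load(f)

print(restored["x_train"].equals(demo_data["x_train"]))  # True
```

Pickle files should only be loaded from trusted sources, and they are generally safest to read with the same pandas version that wrote them.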

With everything prepared, the remaining steps are **hyperparameter tuning** and model **inference**.