# Feature engineering

**Considerations for Feature Engineering:**

When performing feature engineering, it's important to carefully analyze and manipulate the data to improve the performance of a churn prediction model. Here are a few points to consider:

1. **One-Hot Encoding**: One-hot encoding represents a categorical variable as a set of binary features. If you have categorical variables that are not ordinal, such as gender or geographical region, consider applying one-hot encoding to create a separate binary feature for each category. This lets the model learn a separate effect for each category without imposing an artificial order on them.

2. **Variable Transformation**: Sometimes, transforming variables can reveal hidden patterns or improve model performance. For example, taking the logarithm or square root of a skewed numerical variable like "Income" might help normalize its distribution. Additionally, creating interaction or polynomial features by combining existing variables can capture complex relationships.

3. **Binning or Discretization**: Continuous variables can be converted into categorical variables by dividing them into bins or discrete intervals. This can help capture non-linear relationships and patterns that might not be apparent when treating the variable as continuous.

4. **Feature Scaling**: Scaling puts numerical features on a comparable range. Standardization (scaling to zero mean and unit variance) or normalization (scaling to a specified range, e.g., [0, 1]) can be applied to numerical features. This keeps features with larger values from dominating models that are sensitive to feature magnitude, such as distance-based or regularized linear models.

5. **Domain Knowledge**: Utilize domain knowledge or conduct thorough exploratory data analysis to identify potential features that may be relevant for churn prediction. Understanding the business context and factors that could influence customer churn can guide the selection and creation of meaningful features.
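
As a minimal sketch of the variable transformation point above, the snippet below applies a log transform to a skewed column and builds one interaction feature. The data is synthetic, and the column names `Income`, `IncomeLog` and `IncomeLog_x_Tenure` are illustrative assumptions, not columns of the churn dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "Income" column (illustrative only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Income": rng.lognormal(mean=10, sigma=1, size=1000),
    "Tenure": rng.integers(0, 11, size=1000),
})

# log1p compresses the long right tail and is defined at zero.
toy["IncomeLog"] = np.log1p(toy["Income"])

# A simple interaction feature: the product of two existing columns.
toy["IncomeLog_x_Tenure"] = toy["IncomeLog"] * toy["Tenure"]

skew_before = toy["Income"].skew()
skew_after = toy["IncomeLog"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Whether a transformed or combined feature actually helps depends on the model and the data, so it is worth checking each one against a validation score.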

Remember that feature engineering is an iterative process, and it often requires experimentation and evaluation of different approaches. Regularly assess the impact of feature engineering on the model's performance through cross-validation or other validation techniques.
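
One way to assess a feature engineering change is to compare cross-validated scores with and without it. The sketch below does this on synthetic data with a logistic regression; the data, the `AgeSquared` feature and the helper `mean_cv_accuracy` are assumptions made for illustration, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the churn dataset.
rng = np.random.default_rng(0)
n = 500
X_base = pd.DataFrame({
    "Age": rng.normal(40, 10, n),
    "Tenure": rng.integers(0, 11, n),
})
# Synthetic target: more likely positive far from age 40 (a non-linear effect).
y = ((X_base["Age"] - 40) ** 2 + rng.normal(0, 30, n) > 100).astype(int)

# Candidate engineered feature set: add a squared term.
X_eng = X_base.assign(AgeSquared=X_base["Age"] ** 2)


def mean_cv_accuracy(X, y):
    # Scaling lives inside the pipeline, so it is re-fit on each training fold.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()


baseline_score = mean_cv_accuracy(X_base, y)
engineered_score = mean_cv_accuracy(X_eng, y)
print(f"baseline: {baseline_score:.3f}  engineered: {engineered_score:.3f}")
```

Keeping the scaler inside the pipeline means each fold is scaled using only its own training portion, which avoids leaking information from the validation fold into the score.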


**First Loop of feature engineering**

**_Drop:_**

Reason: these columns have no effect on "Exited".

- RowNumber: number of the data row. 

- CustomerId: random values of customer ID.

- Surname: customer surname.

**_Scaling:_**

- Age: customer age.
    - Action: standardize values.
    - Reason: integer value assumed to be roughly normally distributed; standardizing puts it on a scale comparable to the other features.

**_One-hot encoding:_**

Reason: categorical variables.

- Geography: country of the customer.

- Gender: customer gender.

**_Binning into intervals:_**

Reason: can help capture non-linear relationships and patterns that might not be apparent when treating the variable as continuous.

- CreditScore: credit score of customer.

- EstimatedSalary: customer estimated salary. Note that the code below standardizes this column instead of binning it.

- Balance: balance of customer in the bank account.

**_No transformation for now:_**

- IsActiveMember: whether the customer is an active member.

- Tenure: number of years the customer has been a client of the bank.

- NumOfProducts: number of products that a customer has purchased through the bank.

- HasCrCard: denotes whether or not a customer has a credit card.

**To do in a second loop of feature engineering**
- Tenure: number of years the customer has been a client of the bank.
    - Action: create new features by combining tenure with other features.
    - Reason: on its own, the variable shows little influence on the target variable "Exited".

- NumOfProducts: number of products that a customer has purchased through the bank.
    - Action: one-hot encoding.
    - Reason: evaluate whether there is any benefit in treating this as a categorical feature.

- Balance: expected to be a strong indicator of customer churn, on the assumption that people with a higher balance in their accounts are less likely to leave the bank than those with lower balances.
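
The second-loop ideas for Tenure and NumOfProducts could be sketched as follows. The rows are illustrative, and the combined features `TenurePerProduct` and `TenureByAge` are hypothetical examples of "combining tenure with other features", not choices the notebook has made.

```python
import pandas as pd

# Small illustrative frame; values are made up for the example.
toy = pd.DataFrame({
    "Tenure": [2, 1, 8, 1, 2],
    "NumOfProducts": [1, 1, 3, 2, 1],
    "Age": [42, 41, 42, 39, 43],
})

# Hypothetical combined features built from Tenure.
# NumOfProducts is at least 1 here, so the division is safe.
toy["TenurePerProduct"] = toy["Tenure"] / toy["NumOfProducts"]
toy["TenureByAge"] = toy["Tenure"] / toy["Age"]

# Treat NumOfProducts as categorical: one binary column per distinct value.
toy = pd.get_dummies(toy, columns=["NumOfProducts"], prefix="NumOfProducts")

print(toy.columns.tolist())
```

Each candidate feature can then be kept or dropped depending on whether it improves the validation score.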






In [2]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

In [26]:
# Load data
df = pd.read_csv('data/Abandono_clientes.csv')

# Drop irrelevant info
df = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

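## Scaling ##
# Standardize the "EstimatedSalary" and "Age" columns (zero mean, unit variance).
# Fitting one scaler on both columns keeps the fitted parameters for both.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['EstimatedSalary', 'Age']])
df['EstimatedSalaryScaled'] = scaled[:, 0]
df['AgeScaled'] = scaled[:, 1]
df = df.drop(['Age', 'EstimatedSalary'], axis=1)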

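## Binning ##
# CreditScore column
bins = [350, 500, 650, 800, 850]  # Define the bin edges
labels = ['Low', 'Medium', 'High', 'Very High']  # Define the bin labels

# include_lowest=True keeps a score equal to the lowest edge (350) in the 'Low' bin
# instead of leaving it as NaN
df['CreditScoreBins'] = pd.cut(df['CreditScore'], bins=bins, labels=labels, include_lowest=True)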

# Balance
# pd.cut intervals are right-closed by default, so Balance == 0 (and any value
# above 250000) gets NaN here; zero balances are flagged separately below.
bin_edges = [0, 50000, 100000, 150000, 200000, 250000]
bin_labels = ['0-50k', '50k-100k', '100k-150k', '150k-200k', '200k-250k']

df['Balance_Bins'] = pd.cut(df['Balance'], bins=bin_edges, labels=bin_labels)

# Create a boolean column for when Balance is zero
df['Balance_IsZero'] = df['Balance'] == 0

df = df.drop(['Balance', 'CreditScore'], axis=1)

## One-hot encoding for categorical variables ##
df = pd.get_dummies(df, columns=['Geography', 'Gender', 'CreditScoreBins', 'Balance_Bins'], drop_first=True)


index,Tenure,NumOfProducts,HasCrCard,IsActiveMember,Exited,EstimatedSalaryScaled,AgeScaled,Balance_IsZero,Geography_Germany,Geography_Spain,Gender_Male,CreditScoreBins_Medium,CreditScoreBins_High,CreditScoreBins_Very High,Balance_Bins_50k-100k,Balance_Bins_100k-150k,Balance_Bins_150k-200k,Balance_Bins_200k-250k
0,2,1,1,1,1,0.021886,0.293517,True,False,False,False,True,False,False,False,False,False,False
1,1,1,0,1,0,0.216534,0.198164,False,False,True,False,True,False,False,True,False,False,False
2,8,3,1,0,1,0.240687,0.293517,False,False,False,False,True,False,False,False,False,True,False
3,1,2,0,0,0,-0.108918,0.007457,True,False,False,False,False,True,False,False,False,False,False
4,2,1,1,1,0,-0.365276,0.388871,False,False,True,False,False,False,True,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,5,2,1,0,0,-0.066419,0.007457,True,False,False,True,False,True,False,False,False,False,False
9996,10,1,1,1,0,0.027988,-0.373958,False,False,False,True,True,False,False,True,False,False,False
9997,7,1,0,1,1,-1.008643,-0.278604,True,False,False,False,False,True,False,False,False,False,False
9998,3,2,1,0,1,-0.125231,0.293517,False,True,False,True,False,True,False,True,False,False,False
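
`pd.cut` builds right-closed intervals by default, which matters for values that sit exactly on the lowest bin edge. A minimal sketch with the credit score edges and labels from this notebook and a few illustrative scores:

```python
import pandas as pd

# Illustrative scores; 350 sits exactly on the lowest bin edge.
scores = pd.Series([350, 351, 500, 850])
bins = [350, 500, 650, 800, 850]
labels = ['Low', 'Medium', 'High', 'Very High']

# Default: intervals are (350, 500], (500, 650], ... so 350 itself is not covered.
default_cut = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print("unbinned with defaults:", default_cut.isna().sum())
print("unbinned with include_lowest:", inclusive_cut.isna().sum())
```

Rows left as NaN produce all-False dummy columns after `pd.get_dummies`, so with `drop_first=True` they become indistinguishable from rows in the dropped first category.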


In [27]:
df.describe()

statistic,Tenure,NumOfProducts,HasCrCard,IsActiveMember,Exited,EstimatedSalaryScaled,AgeScaled
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5.0128,1.5302,0.7055,0.5151,0.2037,-2.8776980000000004e-17,2.318146e-16
std,2.892174,0.581654,0.45584,0.499797,0.402769,1.00005,1.00005
min,0.0,1.0,0.0,0.0,0.0,-1.740268,-1.994969
25%,3.0,1.0,0.0,0.0,0.0,-0.8535935,-0.6600185
50%,5.0,1.0,1.0,1.0,0.0,0.001802807,-0.1832505
75%,7.0,2.0,1.0,1.0,0.0,0.8572431,0.4842246
max,10.0,4.0,1.0,1.0,1.0,1.7372,5.061197


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   Tenure                     10000 non-null  int64   
 1   Balance                    10000 non-null  float64 
 2   NumOfProducts              10000 non-null  int64   
 3   HasCrCard                  10000 non-null  int64   
 4   IsActiveMember             10000 non-null  int64   
 5   Exited                     10000 non-null  int64   
 6   EstimatedSalaryScaled      10000 non-null  float64 
 7   AgeScaled                  10000 non-null  float64 
 8   Geography_Germany          10000 non-null  bool    
 9   Geography_Spain            10000 non-null  bool    
 10  Gender_Male                10000 non-null  bool    
 11  CreditScoreBins            9995 non-null   category
 12  Tenure                     10000 non-null  int64   
 13  Balance                    10000

In [29]:
df.to_csv('data/features.csv', index=False)