<div>
    <h3><a href="#Introduction">1. Introduction</a></h3>
    <ul>
        <li><a href="#Overview">1.1 Project Overview</a></li>
        <li><a href="#Objective">1.2 Objective</a></li>
        <li><a href="#DataSource">1.3 Data Source</a></li>
    </ul>
    <h3><a href="#Libraries">2. Import Libraries</a></h3>
    <h3><a href="#LoadCleanData">3. Load and Clean Data</a></h3>
    <ul>
        <li><a href="#LoadData">3.1 Load Data</a></li>
        <li><a href="#DataReview">3.2 Initial Data Review</a>
            <ul>
                <li><a href="#DataStructure">3.2.1 Data Structure Review</a></li>
                <li><a href="#StatSummaries">3.2.2 Basic Statistical Summaries</a></li>
                <li><a href="#Duplicate">3.2.3 Duplicate Check</a></li>
                <li><a href="#Missing">3.2.4 Missing Data Check</a></li>
                <li><a href="#CatConsistency">3.2.5 Category Consistency Check</a></li>
            </ul>
        </li>
    </ul>
    <h3><a href="#EDA">3. Exploratory Data Analysis</a></h3>
    <ul>
        <li><a href="#UnivariateContinuous">3.1 Univariate Analysis - Continuous Variables</a>
            <ul>
                <li><a href="#CreditScore">3.1.1 CreditScore</a></li>
                <li><a href="#Age">3.1.2 Age</a></li>
                <li><a href="#Balance">3.1.3 Balance</a></li>
                <li><a href="#EstimatedSalary">3.1.4 EstimatedSalary</a></li>
                <li><a href="#Tenure">3.1.5 Tenure</a></li>
            </ul>
        </li>
        <li><a href="#UnivariateCategorical">3.2 Univariate Analysis - Categorical Variables</a>
            <ul>
                <li><a href="#Gender">3.2.1 Gender</a></li>
                <li><a href="#Geography">3.2.2 Geography</a></li>
                <li><a href="#NumOfProducts">3.2.3 NumOfProducts</a></li>
                <li><a href="#HasCrCard">3.2.4 HasCrCard</a></li>
                <li><a href="#IsActiveMember">3.2.5 IsActiveMember</a></li>
            </ul>
        </li>
        <li><a href="#BivariateTarget">3.3 Bivariate Analysis - Features and Target</a>
            <ul>
                <li><a href="#CreditScoreExited">3.3.1 CreditScore and Exited</a></li>
            </ul>
        </li>
    </ul>
    <h3><a href="#PreprocessData">4. Preprocessing</a></h3>
    <h3><a href="#FeatureEngineering">5. Feature Engineering</a></h3>
    <ul>
        <li><a href="#NewVariables">5.1 Create New Variables</a></li>
        <li><a href="#ConvertCategorical">5.2 Convert to Categorical</a></li>
        <li><a href="#Encoding">5.3 One Hot Encoding</a></li>
        <li><a href="#Transformations">5.4 Feature Transformations</a></li>
        <li><a href="#FEPipeline">5.5 Feature Engineering Pipeline</a></li>
    </ul>
    <h3><a href="#ModelSelection">6. Model Selection</a></h3>
</div>

<a id='Introduction'></a>
# 1. Introduction

<a id='Overview'></a>
## 1.1 Project Overview

<a id='Objective'></a>
## 1.2 Objective

<a id='DataSource'></a>
## 1.3 Data Source

The test set is the unlabeled set we were given and on which we will be evaluated (not to be confused with df_test)

<a id='Libraries'></a>
# 2. Import Libraries and Helpers

In [1]:
# Helper notebook with functions to build charts
%run Charts_for_EDA.ipynb

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.datasets import make_blobs
from pandas.plotting import scatter_matrix

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
from imblearn.combine import SMOTETomek
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, r2_score, classification_report, roc_auc_score, roc_curve
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from category_encoders import CatBoostEncoder
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Sequential
from tensorflow.keras.losses import SparseCategoricalCrossentropy
# Note: use a linear activation for the output layer and pass from_logits=True to the loss in compile
# remove warnings
import warnings
warnings.filterwarnings('ignore')

import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)
from scipy import stats
from scipy.stats import randint, boxcox, yeojohnson
import optuna
from sklearn.linear_model import LogisticRegression



<a id='LoadCleanData'></a>

# 3. Load and Clean Data

<a id='LoadData'></a>

## 3.1 Load Data

In [2]:
df_train = pd.read_csv("/Users/gracebarringer/Machine Learning Projects/Kaggle/Bank Churn - Binary Classification/Raw Data/train.csv").astype({'IsActiveMember' : np.uint8, 'HasCrCard' : np.uint8})
df_test = pd.read_csv("/Users/gracebarringer/Machine Learning Projects/Kaggle/Bank Churn - Binary Classification/Raw Data/test.csv").astype({'IsActiveMember' : np.uint8, 'HasCrCard' : np.uint8})

In [3]:
#Creating Copies
df_train_copy = df_train.copy()
df_test_copy = df_test.copy()

<a id='DataReview'></a>

## 3.2 Initial Data Review

<a id='DataStructure'></a>

### 3.2.1 Data Structure Review

In [None]:
display(df_train_copy.head())
display(df_test_copy.head())

Insights & Observations
- id and CustomerId columns are unique identifiers and will not be used as predictors
- Columns are in the same order in the train and test sets (minus the target variable in test) 
- At first glance, columns look to be in the same format across both datasets
- Seems to be some incorrect data in the Surname column in the test set ("K?") but we will not be using Surname as a predictor and will be using id or CustomerId to identify unique customers

In [None]:
display(df_train_copy.dtypes.to_frame().T)
display(df_test_copy.dtypes.to_frame().T)

Insights & Observations
- Data has both continuous and categorical features 
- Categorical features include: Geography and Gender
- Categorical features will need to be encoded - since there are not a ton of classes in each we can use one-hot encoding
- Feature data types in training and testing set are the same

<a id='StatSummaries'></a>
### 3.2.2 Basic Statistical Summaries

In [None]:
display(df_train_copy.drop(columns = ['id', 'CustomerId'], axis = 1).describe())
display(df_test_copy.drop(columns = ['id', 'CustomerId'], axis = 1).describe())

Insights & Observations
- The mean and standard deviation between the train and test set are very similar 
- The distribution of each feature appears to be very similar between the train and test set 
- The value ranges of features vary quite a bit - will likely need to scale the features depending on the algorithm selected
- HasCrCard and IsActiveMember appear to be binary - possibly convert to categorical variables
- Given the geographies are European countries, Balance and EstimatedSalary are likely in Euros
- Age and Tenure both seem to be in years

<a id='Duplicate'></a>
### 3.2.3 Duplicate Check

In [None]:
# Checking for duplicates 
duplicate_count_train = df_train_copy.duplicated().sum()
duplicate_count_test = df_test_copy.duplicated().sum()

print('Train: ', duplicate_count_train)
print('Test: ', duplicate_count_test)

Insights & Observations
- No duplicates found in either dataset 

<a id='Missing'></a>
### 3.2.4 Missing Data Check

In [None]:
# Checking for missing values 
display(df_train_copy.isnull().sum().to_frame().T)
display(df_test_copy.isnull().sum().to_frame().T)

Insights & Observations
- No null values in either dataset

<a id='CatConsistency'></a>
### 3.2.5 Category Consistency Check

In [None]:
# Counting the unique values of each categorical (object) column in both datasets

df_train_copy_unique_values_dict = {}
df_test_copy_unique_values_dict = {}

for data in [df_train_copy, df_test_copy]:
    for column in data.columns:
        # Check the dtype on the frame being iterated, not always on the train set
        if data[column].dtype == 'O':
            # Only the train set contains the target column
            if 'Exited' in data.columns.tolist():
                df_train_copy_unique_values_dict[column] = data[column].nunique()
            else:
                df_test_copy_unique_values_dict[column] = data[column].nunique()
                
df_train_copy_unique_values_dict['Dataset'] = 'Train'
df_test_copy_unique_values_dict['Dataset'] = 'Test'

df_combined = pd.DataFrame([df_train_copy_unique_values_dict, df_test_copy_unique_values_dict])

df_combined

Insights & Observations
- Surname is not unique per customer, so it will not be used as a customer identifier
- Both datasets have the same number of classes for Geography and Gender, which suggests (though does not strictly guarantee) that the test options are a subset of the train options
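
Equal class counts alone do not show that the labels themselves match, so a direct set comparison is a cheap extra check. The sketch below uses two small stand-in frames (`train_demo` and `test_demo` are hypothetical, as is the helper `unseen_categories`); on the real data the same function could be called with `df_train_copy` and `df_test_copy`.

```python
import pandas as pd

# Hypothetical stand-ins for df_train_copy / df_test_copy
train_demo = pd.DataFrame({"Geography": ["France", "Spain", "Germany"],
                           "Gender": ["Male", "Female", "Male"]})
test_demo = pd.DataFrame({"Geography": ["France", "Germany", "France"],
                          "Gender": ["Female", "Male", "Female"]})

def unseen_categories(train, test, columns):
    """Return, per column, the labels that appear in test but not in train."""
    return {col: set(test[col].unique()) - set(train[col].unique())
            for col in columns}

unseen = unseen_categories(train_demo, test_demo, ["Geography", "Gender"])
# An empty set for a column means its test labels are a subset of the train labels
print(unseen)
```

If any set is non-empty, the encoder would meet labels at prediction time that it never saw during fitting.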

In [4]:
# Splitting Categorical vs. Continuous Variables
cat_vars = df_train_copy.select_dtypes(include = 'object').columns.tolist()
cat_vars.remove('Surname')
target_var = 'Exited'
cont_vars = df_train_copy.select_dtypes(exclude = 'object').columns.tolist()
cont_vars.remove('CustomerId')
cont_vars.remove('id')
cont_vars.remove('Exited')

<a id='EDA'></a>

# 3. Exploratory Data Analysis

<a id='UnivariateContinuous'></a>
## 3.1 Univariate Analysis - Continuous Variables

#### Credit Score

In [None]:
cont_dist(df_train_copy, 'CreditScore')

Insights & Observations 
- Appears to have multimodal distribution - will consider segmenting the customer base
- Slightly skewed to the left, indicating the presence of customers with lower credit scores (though these seem relatively few in number) - will consider transforming
- Spike in density around the max credit score (850)
- CreditScore provides insight into financial health and risk profiles of the customers - might reflect customers' financial stability and satisfaction with the bank's services, therefore influencing their decision to stay with or leave the bank - will consider using in a financial stability metric 



#### Age

In [None]:
cont_dist(df_train_copy, 'Age')

Insights & Observations
- Right-skewed distribution, with a higher concentration of younger customers and a tail extending into older ages - will consider a transformation
- Late 20s and early-mid 30s represents a significant portion of customer base - might make sense to segment the customers into age groups as age-related features might help capture different behaviours across customer life stages (e.g., younger customers might be more prone to switching services due to seeking better deals or being less established with the bank, whereas older customers might value stability and have a longer-standing relationship with the bank) 
- Quite a lot of outliers at older ages which might indicate specific segments of customer base
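
One way the age segmentation mentioned above could look is a simple binning with `pd.cut`. This is only a sketch: the ages, bin edges, and labels below are illustrative assumptions, not values chosen from the data.

```python
import pandas as pd

# Hypothetical ages; the bin edges and labels are illustrative assumptions
ages = pd.Series([18, 25, 33, 41, 58, 72], name="Age")

age_bins = [0, 30, 40, 50, 60, 120]
age_labels = ["<=30", "31-40", "41-50", "51-60", "60+"]

# right=True (the default) makes each bin include its upper edge, e.g. (30, 40]
age_group = pd.cut(ages, bins=age_bins, labels=age_labels)
print(age_group.tolist())
```

The resulting column is categorical, so it would need encoding like Geography and Gender before being passed to most models.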

#### Balance

In [None]:
cont_dist(df_train_copy, 'Balance')

In [None]:
zero_balance_count = (df_train_copy['Balance'] == 0).sum()
total_count = df_train_copy.shape[0]
zero_balance_percentage = (zero_balance_count / total_count) * 100
print(zero_balance_percentage)

Insights & Observations
- Appears to have a bi-modal distribution with one significant peak at 0 and another broader peak around 100-150k range - might make sense to segment customers 
- A significant portion of customers have a balance close to 0 which could indicate that the bank is not the customer's primary bank or they are experiencing financial instability - will consider using to create a financial stability metric 
- Over half of the customers have 0 balances 


#### EstimatedSalary

In [None]:
cont_dist(df_train_copy, 'EstimatedSalary')

Insights & Observations
- Roughly uniform distribution with small fluctuations, indicating that customers' estimated salaries vary widely without a dominant concentration in a specific range
- Broad spread of customer salaries - will normalize or standardize 
- Uniform distribution suggests that EstimatedSalary might not be a strong predictor for churn on its own - might make sense to analyze interactions with other features which might yield more meaningful insights into churn - consider using to create a financial stability metric

#### Tenure

In [None]:
cont_dist(df_train_copy, 'Tenure')

Insights & Observations
- Quite uniform distribution across 0 to 10 years range - might normalize or standardize
- Seems to be in whole years
- Tenure by itself might not show a strong direct correlation with churn but combining it with other features might - will consider using to create a loyalty metric 

#### NumOfProducts

In [None]:
cont_dist(df_train_copy, 'NumOfProducts')

Insights & Observations
- Discrete distribution with two primary peaks at 1 and 2 products - might convert to categorical variable
- Customers with 3+ products might exhibit different loyalty patterns compared to those who only use 1 or 2 products 
- NumOfProducts could be a significant feature for predicting churn as the number of products a customer uses might correlate with their engagement level and satisfaction with the bank - will consider using in an engagement/satisfaction metric

#### HasCrCard and IsActiveMember

In [None]:
for i in ['HasCrCard', 'IsActiveMember']:
    cont_dist(df_train_copy, i)

Insights & Observations
- HasCrCard and IsActiveMember are both binary variables - might convert to categorical variables
- A higher portion of customers have a credit card compared to those who don't
- The number of active members is relatively balanced between active and inactive members
- Understanding how HasCrCard and IsActiveMember influence churn could provide actionable insights (significant weights on these features might indicate that they are key determinants of customer loyalty and satisfaction) - will consider using to create loyalty and customer engagement/satisfaction metrics


<a id='UnivariateCategorical'></a>
## 3.2 Univariate Analysis - Categorical Variables

#### Gender

In [None]:
cat_dist(df_train_copy, 'Gender')

Insights & Observations
- Slightly imbalanced class with the majority being male 
- Categorical variable so will need to encode

#### Geography

In [None]:
cat_dist(df_train_copy, 'Geography')

Insights & Observations 
- France has significantly more customers than Germany or Spain - need to ensure the model doesn't overfit characteristics of the dominant French market at the expense of generalizability to Germany and Spain
- Different countries may have different customer behaviour patterns, economic conditions, and competition levels that influence churn rates 
- Categorical variable so will need to encode

<a id='BivariateTarget'></a>
## 3.3 Bivariate Analysis - Features and Target

#### CreditScore and Exited

In [None]:
cat_cont_dist(df_train_copy, 'Exited', 'CreditScore')

Insights & Observations
- The mean and median CreditScores of customers who exited versus those who did not are relatively close, which indicates that CreditScore alone might not be a strong predictor of churn
- Both boxplots show a similar range and similar outliers - there doesn't appear to be a significant difference in the spread of credit scores between those who exited and those who didn't
- The credit scores of customers who exited versus those who didn't show very similar statistical profiles, which suggests that CreditScore alone may not be a decisive factor in predicting churn

#### Age and Exited

In [None]:
cat_cont_dist(df_train_copy, 'Exited', 'Age')

Insights & Observations
- Customers who exited have a higher mean and median age than those who stayed, suggesting older customers are more likely to churn on average (which is not what I was expecting)
- Customers who stayed show a narrower IQR centered around a younger age group, indicating less variability in age among retained customers
- Customers who exited show a wider IQR centered around an older age, suggesting a broader spread of ages among churned customers but generally skewed towards older individuals
- Both distributions show outliers, with the churned group showing outliers on both the lower and higher ends, suggesting younger customers also churn but the majority are older
- The analysis indicates that age is quite a significant factor in customer churn, with older customers being more likely to leave, which could be due to changing financial needs, dissatisfaction with services, or better offerings from competitors targeting older demographics

#### Balance and Exited

In [None]:
churn_groups = df_train_copy.groupby('Exited')['Balance'].describe()
print(churn_groups)

In [None]:
cat_cont_dist(df_train_copy, 'Exited', 'Balance')

Insights & Observations 
- On average, customers who churn tend to have higher balances than those who did not
- The median balance for churned customers is significantly higher than the mean which is somewhat unexpected 
- The median balance for retained customers is 0, which suggests potential issues with the data - will be cautious using Balance as a predictor

#### NumOfProducts and Exited

In [None]:
cat_target_dist(df_train_copy, 'Exited', 'NumOfProducts')

Insights & Observations
- Customers with fewer products (1 or 2) have lower churn rates compared to those with more products (3 to 4), though there is quite a significant decrease in churn rate when customers move from 1 product to 2 - this could possibly indicate diminishing returns beyond 2 products


#### HasCrCard and Exited

In [None]:
cat_target_dist(df_train_copy, 'Exited', 'HasCrCard')

Insights & Observations
- Having a credit card seems to be associated with a marginally lower churn rate, though the difference in churn rates is relatively small indicating that having a credit card is not a decisive factor by itself 

#### IsActiveMember and Exited

In [None]:
cat_target_dist(df_train_copy, 'Exited', 'IsActiveMember')

Insights & Observations
- Active members have a lower churn rate (13%) than inactive members, suggesting that the feature could facilitate churn prediction

#### Gender and Exited

In [None]:
cat_target_dist(df_train_copy, 'Exited', 'Gender')

Insights & Observations
- Males have a lower churn rate than females 

#### Geography and Exited

In [None]:
cat_target_dist(df_train_copy, 'Exited', 'Geography')

Insights & Observations
- Germany's churn rate is significantly higher than France's or Spain's

<a id='BivariateCorrelation'></a>

## 3.4 Bivariate Analysis - Feature Correlation

#### Scatter Plot

In [None]:
sns.pairplot(df_train_copy[cont_vars+[target_var]], kind='scatter', plot_kws={'alpha':0.1})

Insights & Observations 
- There are a lot of data points, but there don't appear to be any clear-cut relationships between variables

#### Correlation Heat Map

In [None]:
corr_heat_map(df_train_copy[cont_vars+[target_var]])

Insights & Observations 
- No pair of features is highly correlated, so pairwise multicollinearity does not appear to be a concern
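
A heat map only shows pairwise correlation, whereas a feature can also be close to a linear combination of several others. Variance inflation factors (VIFs) are a common complementary check, and they can be read off the diagonal of the inverse correlation matrix. The sketch below runs on synthetic data (the columns `a` to `d` and the helper `vif` are hypothetical); on the real data the input would be `df_train_copy[cont_vars]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic features (assumption): c is almost exactly a + b, d is independent noise
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)
d = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "c": c, "d": d})

def vif(frame):
    """Variance inflation factors: the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)

# Large values flag features that the other features can largely reconstruct
print(vif(demo).round(1))
```

A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, though the threshold is a convention rather than a hard rule.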

## 3.5 Multivariate Analysis - Feature Interactions with Target

#### IsActiveMember and NumOfProducts on Exited

In [None]:
target_cat_cat_dist(df_train_copy, 'Exited', 'IsActiveMember', 'NumOfProducts')

Insights & Observations
- Churn rate skyrockets for customers with 3+ products, particularly for inactive members
- These features are important for predicting churn - adding an interaction variable might increase the model performance

#### CreditScore and EstimatedSalary on Exited

In [None]:
sns.kdeplot(data=df_train_copy, x='CreditScore', y='EstimatedSalary', hue='Exited', fill=True)


Insights & Observations
- Significant overlap in contours for customers who exited and customers who remained suggests that there is not a distinctly separable pattern that clearly differentiates between those who exited vs. those who remained
- The widespread contours, especially for customers who remained, indicate a broad distribution of CreditScore and EstimatedSalary among customers who remained
- There isn't a simple or direct relationship between CreditScore and EstimatedSalary, and the likelihood of exiting - will likely need more sophisticated modeling techniques to capture complex interactions

<a id='PreprocessData'></a>
# 4. Preprocessing

In [5]:
# Split the features and the target
X = df_train_copy.iloc[:,:-1]
y = df_train_copy.iloc[:,-1:]

In [6]:
# Create a training and validation set
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.2, random_state=42)
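
Since churners are the minority class, it may be worth keeping the class ratio identical in both splits by passing `stratify` to `train_test_split`. A minimal sketch on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins, and the 20% positive rate is an assumption for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption): 1,000 rows, 20% positives
X_demo = pd.DataFrame({"feature": np.arange(1000)})
y_demo = pd.Series([1] * 200 + [0] * 800, name="Exited")

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the positive rate of the full data
print(y_tr.mean(), y_val.mean())
```

Here the target is a 1-D Series rather than a one-column DataFrame, which is the shape scikit-learn estimators expect for `y`.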

<a id='Scaling'></a>
## 4.1 Feature Scaling

In [7]:
# Class for feature scaling
from sklearn.preprocessing import MaxAbsScaler, RobustScaler  # not included in the imports above

class FeatureScaling(BaseEstimator, TransformerMixin):
    def __init__(self, features_to_scale=None, scaler_method='standard'):
        self.features_to_scale = features_to_scale
        self.scaler_method = scaler_method
        self.scalers = {}

    def fit(self, X, y=None):
        # Fit one scaler per feature so each can be applied independently in transform()
        for feature in self.features_to_scale:
            if self.scaler_method == 'standard':
                self.scalers[feature] = StandardScaler()
            elif self.scaler_method == 'minmax':
                self.scalers[feature] = MinMaxScaler()
            elif self.scaler_method == 'maxabsscaler':
                self.scalers[feature] = MaxAbsScaler()
            elif self.scaler_method == 'robust':
                self.scalers[feature] = RobustScaler()
            else:
                raise ValueError(f"Unknown scaler_method: {self.scaler_method}")
            self.scalers[feature].fit(X[[feature]])

        return self

    def transform(self, X):
        X_scaled = X.copy()
        for feature in self.features_to_scale:
            # Skip features that are missing from the incoming frame
            if feature in X_scaled.columns:
                X_scaled[feature] = self.scalers[feature].transform(X_scaled[[feature]])

        return X_scaled

<a id='FeatureEngineering'></a>
# 5. Feature Engineering

<a id='NewVariables'></a>
## 5.1 Create New Features

In [28]:
# Class for feature creation
class CreateNewFeatures(BaseEstimator, TransformerMixin):
    # Columns combined into the composite features below
    base_cols = ['Tenure', 'HasCrCard', 'IsActiveMember', 'NumOfProducts', 'Balance', 'CreditScore', 'EstimatedSalary']

    def __init__(self, scale = False):
        self.scale = scale

    def fit(self, X, y=None):
        # Fit the min-max scaler on the training data only, so transform() never refits on new data
        if self.scale:
            self.scaler_ = MinMaxScaler().fit(X[self.base_cols])
        return self

    def transform(self, X):
        X = X.copy()  # do not modify the caller's DataFrame in place
        if self.scale:
            # Put the components on a common [0, 1] scale before summing them
            parts = pd.DataFrame(self.scaler_.transform(X[self.base_cols]), columns=self.base_cols, index=X.index)
        else:
            parts = X[self.base_cols]
        X['Loyalty'] = parts['Tenure'] + parts['HasCrCard'] + parts['IsActiveMember']
        X['Satisfaction'] = parts['NumOfProducts'] + parts['IsActiveMember'] + parts['HasCrCard']
        X['FinancialStability'] = parts['Balance'] + parts['CreditScore'] + parts['EstimatedSalary']
        X['TenureAgeRatio'] = X['Tenure'].astype(float)/X['Age']
        X['NumOfProducts_IsActiveMember'] = X['NumOfProducts']*X['IsActiveMember']

        return X

<a id='Encoding'></a>
## 5.2 One Hot Encoding

In [9]:
# Class for categorical encoding
class OHEncoding(BaseEstimator, TransformerMixin):
    def __init__(self, columns_to_encode = None):
        self.columns_to_encode = columns_to_encode
        self.encoders = {}
        
    def fit(self, X, y=None):
        if self.columns_to_encode is not None:
            for col in self.columns_to_encode:
                if col in X.columns:
                    # sparse_output replaces the deprecated sparse argument (scikit-learn >= 1.2)
                    oh_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
                    oh_encoder.fit(X[[col]])
                    self.encoders[col] = oh_encoder
                else:
                    raise ValueError(f"Column {col} not found in DataFrame")  

        return self

    def transform(self, X):
        X_encoded = X.copy()
        for col, encoder in self.encoders.items():
            encoded_col = encoder.transform(X_encoded[[col]])
            df_encoded = pd.DataFrame(encoded_col, columns=[f"{col}_{category}" for category in encoder.categories_[0]], index = X_encoded.index)
            X_encoded = pd.concat([X_encoded, df_encoded], axis=1)

        X_encoded = X_encoded.drop(columns = self.columns_to_encode, axis=1)

        return X_encoded

<a id='Transformations'></a>
## 5.3 Feature Transformations

In [10]:
# Class for feature transformations
class CustomTransformation(BaseEstimator, TransformerMixin):
    def __init__(self, feat_func):
        # feat_func maps a column name to 'log', 'sqrt', or 'boxcox'
        self.feat_func = feat_func

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_transformed = X.copy()
        if self.feat_func is not None:
            for col, func in self.feat_func.items():
                if func == 'log':
                    X_transformed[col+'_log'] = np.log1p(X_transformed[col])
                elif func == 'sqrt':
                    X_transformed[col+'_sqrt'] = np.sqrt(X_transformed[col])
                else:
                    # Any other value falls back to Box-Cox; the +1 shift allows zero values
                    X_transformed[col+'_bc'], _ = boxcox(X_transformed[col]+1)

            # Drop the original columns once their transformed versions exist
            X_transformed = X_transformed.drop(columns = list(self.feat_func.keys()))

        return X_transformed
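
In the class above, Box-Cox is re-estimated each time `transform` is called, so a training fold and a validation fold can end up transformed with different lambdas. One possible alternative, sketched here under assumptions, is to learn the lambda in `fit` and reuse it in `transform`. The class name `FittedBoxCox`, its `shift` parameter, and the synthetic `Age` values are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox
from sklearn.base import BaseEstimator, TransformerMixin

class FittedBoxCox(BaseEstimator, TransformerMixin):
    """Hypothetical variant: learn the Box-Cox lambda in fit() and reuse it in transform()."""

    def __init__(self, column, shift=1.0):
        self.column = column
        self.shift = shift  # added before transforming so zero values are allowed

    def fit(self, X, y=None):
        # With lmbda omitted, scipy returns (transformed values, fitted lambda)
        _, self.lambda_ = boxcox(X[self.column].to_numpy(dtype=float) + self.shift)
        return self

    def transform(self, X):
        X_out = X.copy()
        values = X_out[self.column].to_numpy(dtype=float) + self.shift
        # With lmbda given, scipy applies that fixed lambda and returns only the values
        X_out[self.column + "_bc"] = boxcox(values, lmbda=self.lambda_)
        return X_out.drop(columns=[self.column])

# Synthetic, right-skewed stand-in for Age (assumption)
rng = np.random.default_rng(0)
train_demo = pd.DataFrame({"Age": rng.gamma(shape=9.0, scale=4.0, size=1000)})
new_demo = pd.DataFrame({"Age": [25.0, 40.0, 70.0]})

bc = FittedBoxCox("Age").fit(train_demo)
out = bc.transform(new_demo)
print(bc.lambda_)
print(out)
```

Because the lambda is stored on the fitted object, the same mapping is applied to every fold and to the final test set.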


# 6. Model Selection

## 6.1 Algorithm Selection

### Random Forest

In [11]:
rf_pipeline = Pipeline([
    ('encoder', OHEncoding(columns_to_encode = ['Gender', 'Geography'])),
    ('feature_creator', CreateNewFeatures(scale = True)),
    ('classifier', RandomForestClassifier())
])


In [29]:
predictors = [x for x in X_train.columns if x not in ['id', 'CustomerId', 'Surname']]

rf_scores = cross_val_score(rf_pipeline, X_train[predictors], y_train, cv=5, scoring='roc_auc')
print("Random Forest AUC: %0.4f (+/- %0.4f)" % (rf_scores.mean(), rf_scores.std() * 2))

Random Forest AUC: 0.8711 (+/- 0.0039)


### XGBoost

In [13]:
xgb_pipeline = Pipeline([
    ('encoder', OHEncoding(columns_to_encode = ['Gender', 'Geography'])),
    ('feature_creator', CreateNewFeatures(scale = True)),
    ('classifier', XGBClassifier())
])

In [14]:
xgb_scores = cross_val_score(xgb_pipeline, X_train[predictors], y_train, cv=5, scoring='roc_auc')
print("XGBoost AUC: %0.4f (+/- %0.4f)" % (xgb_scores.mean(), xgb_scores.std() * 2))

XGBoost AUC: 0.8849 (+/- 0.0040)


### LightGBM

In [15]:
lgbm_pipeline = Pipeline([
    ('encoder', OHEncoding(columns_to_encode = ['Gender', 'Geography'])),
    ('feature_creator', CreateNewFeatures(scale = True)),
    ('classifier', LGBMClassifier())
])

In [16]:
lgbm_scores = cross_val_score(lgbm_pipeline, X_train[predictors], y_train, cv=5, scoring='roc_auc')
print("LGBM AUC: %0.4f (+/- %0.4f)" % (lgbm_scores.mean(), lgbm_scores.std() * 2))

[LightGBM] [Info] Number of positive: 22373, number of negative: 83248
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001472 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1410
[LightGBM] [Info] Number of data points in the train set: 105621, number of used features: 18
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.211823 -> initscore=-1.313969
[LightGBM] [Info] Start training from score -1.313969
[LightGBM] [Info] Number of positive: 22372, number of negative: 83249
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001669 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1412
[LightGBM] [Info] Number of data points in the train set: 105621, number of used features: 18
[LightGBM] [Info

### CatBoost

In [17]:
cat_features = ['Geography', 'Gender', 'NumOfProducts', 'IsActiveMember', 'HasCrCard'] 

In [18]:
cb_pipeline = Pipeline([
    ('feature_creator', CreateNewFeatures(scale = True)),
    ('encoder', CatBoostEncoder(cols = cat_features)),
    ('classifier', CatBoostClassifier()),
])

In [19]:
cb_scores = cross_val_score(cb_pipeline, X_train[predictors], y_train, cv=5, scoring='roc_auc')
print("CatBoost AUC: %0.4f (+/- %0.4f)" % (cb_scores.mean(), cb_scores.std() * 2))

Learning rate set to 0.075349
0:	learn: 0.6249028	total: 68.9ms	remaining: 1m 8s
1:	learn: 0.5709931	total: 75.3ms	remaining: 37.6s
2:	learn: 0.5278661	total: 81.8ms	remaining: 27.2s
3:	learn: 0.4912332	total: 88.5ms	remaining: 22s
4:	learn: 0.4638078	total: 95ms	remaining: 18.9s
5:	learn: 0.4406232	total: 140ms	remaining: 23.2s
6:	learn: 0.4215169	total: 157ms	remaining: 22.2s
7:	learn: 0.4070692	total: 193ms	remaining: 23.9s
8:	learn: 0.3954813	total: 202ms	remaining: 22.3s
9:	learn: 0.3846837	total: 211ms	remaining: 20.9s
10:	learn: 0.3757234	total: 219ms	remaining: 19.7s
11:	learn: 0.3686460	total: 226ms	remaining: 18.6s
12:	learn: 0.3623941	total: 235ms	remaining: 17.8s
13:	learn: 0.3571206	total: 242ms	remaining: 17.1s
14:	learn: 0.3525021	total: 251ms	remaining: 16.5s
15:	learn: 0.3491908	total: 258ms	remaining: 15.9s
16:	learn: 0.3461750	total: 266ms	remaining: 15.4s
17:	learn: 0.3440413	total: 273ms	remaining: 14.9s
18:	learn: 0.3415932	total: 280ms	remaining: 14.5s
19:	learn:

### Logistic Regression

In [20]:
new_cont_vars = X_train.select_dtypes(exclude = 'object').columns.tolist()
new_cont_vars.remove('id')
new_cont_vars.remove('CustomerId')

In [26]:
lr_pipeline = Pipeline([
    ('scaler', FeatureScaling(features_to_scale = new_cont_vars, scaler_method = 'minmax')),
    ('encoder', OHEncoding(columns_to_encode = ['Geography', 'Gender'])),
    ('feature_creator', CreateNewFeatures(scale = False)),
    ('feature_transformer', CustomTransformation(feat_func = {'Age':'boxcox'})),
    ('classifier', LogisticRegression())
])

In [27]:
lr_scores = cross_val_score(lr_pipeline, X_train[predictors], y_train, cv=5, scoring='roc_auc')
print("Logistic Regression AUC: %0.4f (+/- %0.4f)" % (lr_scores.mean(), lr_scores.std() * 2))

ValueError: 
All the 5 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/gracebarringer/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/gracebarringer/anaconda3/lib/python3.11/site-packages/sklearn/pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/Users/gracebarringer/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py", line 1196, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/gracebarringer/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 584, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gracebarringer/anaconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1106, in check_X_y
    X = check_array(
        ^^^^^^^^^^^^
  File "/Users/gracebarringer/anaconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 921, in check_array
    _assert_all_finite(
  File "/Users/gracebarringer/anaconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 161, in _assert_all_finite
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values


In [23]:
X_train[predictors].isnull().values.any()

False

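Insights & Observations
- The raw training features contain no NaN values, so the NaN reported by LogisticRegression must be introduced by one of the pipeline steps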
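
One plausible source of the NaN, offered as an assumption to verify against the real data rather than a confirmed diagnosis: the logistic regression pipeline min-max scales Age and Tenure before `CreateNewFeatures` computes `Tenure / Age`. Min-max scaling maps the smallest value of each column to exactly 0, so a row holding both minima becomes 0/0. The toy frame below is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy rows (assumption): the first row holds the minimum of both columns,
# the second row holds the minimum Age with a non-zero Tenure
toy = pd.DataFrame({"Tenure": [0, 3, 10], "Age": [18.0, 18.0, 60.0]})

scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
ratio_after_scaling = scaled["Tenure"] / scaled["Age"]
# First entry is NaN (0/0); second is inf (positive value divided by 0)
print(ratio_after_scaling.tolist())

# Computing the ratio on the raw values avoids the division by zero,
# because raw ages are strictly positive
ratio_raw = toy["Tenure"] / toy["Age"]
print(np.isfinite(ratio_raw).all())
```

If this is indeed the cause, moving the scaler after the feature-creation step (or building the ratio from unscaled columns) would be one way to avoid it.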

## Addressing Skewed Distribution 

Insights & Observations
- Age seems to be the primary one to address

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, r2_score, classification_report, roc_auc_score, roc_curve
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from scipy import stats
from scipy.stats import randint, boxcox, yeojohnson
import sklearn.metrics as metrics

In [None]:
# Transforming Age - Deciding on the best transformation method
df_train_copy['Age_log'] = np.log(df_train_copy['Age'] + 1)  # Adding 1 to avoid log(0)
df_train_copy['Age_sqrt'] = np.sqrt(df_train_copy['Age'])
df_train_copy['Age_bc'], bc_lambda = boxcox(df_train_copy['Age'])
df_train_copy['Age_yj'], yj_lambda = yeojohnson(df_train_copy['Age'])

# Compare the skew of the raw column with each transformed version
print("Age skew: ", df_train_copy['Age'].skew())
print("Age_log skew: ", df_train_copy['Age_log'].skew())
print("Age_sqrt skew: ", df_train_copy['Age_sqrt'].skew())
print("Age_bc skew: ", df_train_copy['Age_bc'].skew())
print("Age_yj skew: ", df_train_copy['Age_yj'].skew())


Insights & Observations
- Box-Cox leads to the lowest skew value - use this transformation in data preprocessing

# Model Selection and Hyperparameter Tuning

In [None]:
df_train_copy[cont_vars].skew()