## Summary:

The target variable records whether a customer accepted a personal loan, which makes this a binary classification problem. The classes are imbalanced: 480 instances are positive (1) and 4,520 are negative (0). This directly affects the choice of evaluation metric, since labeling every customer as 0 would achieve a misleading accuracy of 90.4%.

The most relevant evaluation metric depends on the bank's specific needs. For instance, if the goal is to maximize ROI, precision would be a suitable choice, because it limits the offers wasted on customers who decline. Conversely, if the bank prioritizes aggressive expansion, recall would be the better metric, because it limits the number of potential borrowers who are missed.
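
As a minimal sketch of how the two metrics differ, the snippet below computes both on a small set of hypothetical labels (the values are invented for illustration and are not taken from the loan data):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = accepted the loan, 0 = declined
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Precision = TP / (TP + FP): of the customers flagged, how many accepted
precision = precision_score(y_true, y_pred)

# Recall = TP / (TP + FN): of the customers who accepted, how many were flagged
recall = recall_score(y_true, y_pred)

print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Here two of the three flagged customers are true acceptors, while only two of the four true acceptors are found, so the model looks better on precision than on recall.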

My objective for this exercise is not to obtain the highest possible score but to engage with different Naive Bayes models and understand their benefits and limitations.

As expected, the Bernoulli classifier achieved the highest accuracy (0.91) and the highest precision on the positive class (0.54), although its recall on that class was only 0.26. It is not an ideal model, since the variables are not independent. The grid search selected alpha = 0, which is understandable given the number of zeros in the DataFrame, and binarize = 0.9, which is again expected given the imbalanced target variable.
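
To make the `binarize` threshold concrete, the sketch below applies `sklearn.preprocessing.binarize`, which maps values strictly greater than the threshold to 1 and everything else to 0. The three rows are hypothetical values for `Income`, `CCAvg` and `CD Account`, chosen only for illustration:

```python
import numpy as np
from sklearn.preprocessing import binarize

# Hypothetical rows with columns: Income, CCAvg, CD Account
rows = np.array([
    [49.0, 1.6, 0.0],
    [11.0, 0.5, 1.0],
    [100.0, 0.9, 0.0],
])

# Values strictly greater than 0.9 become 1, all others become 0
binary_rows = binarize(rows, threshold=0.9)
print(binary_rows)
```

Because the pipeline disables scaling, the threshold acts on raw values. A column such as `Income`, whose minimum is 8, becomes 1 for every row and therefore carries no information for the Bernoulli model.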

The second most relevant model was Gaussian Naive Bayes, with a remarkably similar weighted-average F1 score (0.89 for both models) and a higher F1 score on the positive class (0.48 versus 0.35). This is intriguing given that most numerical columns deviate from a normal distribution (no Shapiro-Wilk test was conducted to confirm this).
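
The normality check mentioned above could be run with `scipy.stats.shapiro`. The sketch below uses a synthetic right-skewed sample as a stand-in for a column such as `Income`; the sample and its parameters are assumptions, not the notebook's data:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Synthetic right-skewed sample (assumed stand-in, not the real column)
skewed = rng.lognormal(mean=4.0, sigma=0.6, size=500)

# Null hypothesis: the sample was drawn from a normal distribution
statistic, p_value = shapiro(skewed)
print(f"W={statistic:.3f}, p={p_value:.3g}")

# A p-value below the chosen significance level (for example 0.05)
# is evidence against normality
is_plausibly_normal = p_value >= 0.05
```

On the real data, the same call could be applied to each numerical column, for example `shapiro(df["Income"])`.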

The Naive Bayes models were fast to train and required little hyperparameter tuning, which makes them particularly valuable for problems with many variables. For example, Multinomial Naive Bayes can be advantageous for text classification tasks.
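
As an illustration of the text classification use case, the sketch below trains a Multinomial Naive Bayes model on word counts. The toy corpus and its labels are hypothetical and exist only to show the mechanics:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = loan-related message, 0 = other
texts = [
    "apply for a personal loan today",
    "low interest loan offer",
    "loan approval in minutes",
    "weekend football match results",
    "new recipe for pasta dinner",
    "football team wins the match",
]
labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer produces non-negative word counts, which is the
# kind of input MultinomialNB is designed for
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(texts, labels)

prediction = text_clf.predict(["personal loan offer"])
print(prediction)
```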


### Libraries


In [17]:
#Imports:
#General imports
import numpy as np
import pandas as pd

# Plotting
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import plot_tree

In [30]:
#Imports:
#Sklearn
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, binarize
from sklearn.feature_selection import SelectPercentile, SelectKBest, f_regression
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB

#Other models
from xgboost import XGBClassifier

#Sklearn metrics
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import RocCurveDisplay, roc_curve, roc_auc_score, recall_score, precision_score, f1_score

In [48]:
#Loading the DF
df = pd.read_csv("../data/clean_df.csv")
df.drop("Unnamed: 0", axis=1, inplace=True)
#sanity check:
df.head(5)

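| | Experience | Income | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 49 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 19 | 34 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 15 | 11 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9 | 100 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 8 | 45 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |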


In [49]:
df.describe()

| | Experience | Income | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 20.1354 | 73.7742 | 2.3964 | 1.937913 | 1.881 | 56.4988 | 0.096 | 0.1044 | 0.0604 | 0.5968 | 0.294 |
| std | 11.414672 | 46.033729 | 1.147663 | 1.747666 | 0.839869 | 101.713802 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 |
| min | 0.0 | 8.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 10.0 | 39.0 | 1.0 | 0.7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 20.0 | 64.0 | 2.0 | 1.5 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 30.0 | 98.0 | 3.0 | 2.5 | 3.0 | 101.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 43.0 | 224.0 | 4.0 | 10.0 | 3.0 | 635.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [50]:
df.columns


Index(['Experience', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage',
       'Personal Loan', 'Securities Account', 'CD Account', 'Online',
       'CreditCard'],
      dtype='object')

In [69]:
#Defining the variables

X =  df.drop(["Personal Loan"], axis=1)

y = df["Personal Loan"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=24)

In [52]:
#Sanity check:
print(f"""
- X train shape: {X_train.shape}
- X test shape: {X_test.shape}
- y train shape: {y_train.shape}
- y test shape: {y_test.shape}
""")


- X train shape: (4000, 10)
- X test shape: (1000, 10)
- y train shape: (4000,)
- y test shape: (1000,)



## Baseline model

In [67]:
print(f'''
- Count of target variable ==1: {df["Personal Loan"].sum()}
- Total rows: {df["Personal Loan"].count()}
- Baseline accuracy: {1-(df["Personal Loan"].sum()/(df["Personal Loan"].count()))}
''')


- Count of target variable ==1: 480
- Total rows: 5000
- Baseline accuracy: 0.904
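
The same baseline can be expressed as a model with scikit-learn's `DummyClassifier`, which makes it easy to score with any metric. This is a minimal sketch: `y_demo` is a stand-in array rebuilt from the class counts printed above, and `X_demo` is a placeholder feature matrix that the `most_frequent` strategy ignores:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Stand-in target with the same class counts as above (480 ones, 4520 zeros)
y_demo = np.array([1] * 480 + [0] * 4520)
X_demo = np.zeros((y_demo.size, 1))  # placeholder features, ignored by this strategy

# Always predicts the majority class (0)
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_hat)
baseline_recall = recall_score(y_demo, y_hat)
print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
```

The accuracy matches the 0.904 computed above, while the recall on the positive class is zero, which is why accuracy alone is a weak yardstick for this dataset.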



## Bernoulli Naive Bayes:

In [70]:
#Bernoulli Naive Bayes gridsearch:
estimators = [
    ("normalise", StandardScaler()),
    ("select", PCA()),
    ("model", BernoulliNB())
]

pipe = Pipeline(estimators)

param_grid = [
    {
        'model__alpha': [0, 0.001, 0.0001, 0.00001, 0.1, 0.5, 0.8, 0.9, 1],
        'model__binarize': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
        'normalise': [None],
        'select': [None]
        
    }
]


grid = GridSearchCV(pipe, param_grid, cv=5, verbose=1)

#Fitting the grid
bnb_fitted_grid = grid.fit(X_train, y_train)

# Print best parameters and corresponding score
print("Best Parameters:", bnb_fitted_grid.best_params_)
print("Best Score:", bnb_fitted_grid.best_score_)

Fitting 5 folds for each of 99 candidates, totalling 495 fits




Best Parameters: {'model__alpha': 0, 'model__binarize': 0.9, 'normalise': None, 'select': None}
Best Score: 0.90825




In [71]:
# Note: this report covers the full dataset (train + test rows), not only the held-out test set
print(classification_report(y, bnb_fitted_grid.predict(X)))

              precision    recall  f1-score   support

           0       0.93      0.98      0.95      4520
           1       0.54      0.26      0.35       480

    accuracy                           0.91      5000
   macro avg       0.73      0.62      0.65      5000
weighted avg       0.89      0.91      0.89      5000
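
The report above is computed on all 5000 rows, and `GridSearchCV` ranks candidates by accuracy unless a `scoring` argument is given. The sketch below shows one possible alternative: selecting hyperparameters by F1 on the positive class and reporting on the held-out split only. It uses synthetic data from `make_classification` as a stand-in for the loan data, and the `_demo`, `_tr` and `_te` names and the reduced parameter grid are assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the loan data: roughly 10% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9], random_state=24
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=24
)

# Rank candidates by F1 on the positive class instead of accuracy
param_grid = {"alpha": [0.1, 0.5, 1.0], "binarize": [0.0, 0.5, 1.0]}
search = GridSearchCV(BernoulliNB(), param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

# Evaluate only on rows the model has not seen during fitting
y_te_pred = search.predict(X_te)
test_f1 = f1_score(y_te, y_te_pred)
print("Best Parameters:", search.best_params_)
print(classification_report(y_te, y_te_pred, zero_division=0))
```

In the notebook itself, the equivalent change would be to pass `scoring="f1"` to the existing grid and call `classification_report(y_test, bnb_fitted_grid.predict(X_test))`.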



## Gaussian Naive Bayes

In [53]:
# Model:
gnb = GaussianNB()

y_pred = gnb.fit(X_train, y_train).predict(X_test)

print("Number of mislabeled points out of a total %d points : %d"
% (X_test.shape[0], (y_test != y_pred).sum()))

Number of mislabeled points out of a total 1000 points : 114


In [54]:
print(f"""
- The train classification accuracy is: {gnb.score(X_train, y_train)}
- The test classification accuracy is: {gnb.score(X_test, y_test)}
""")


- The train classification accuracy is: 0.8795
- The test classification accuracy is: 0.886



In [55]:
# Note: this report covers the full dataset (train + test rows), not only the held-out test set
print(classification_report(y, gnb.predict(X)))

              precision    recall  f1-score   support

           0       0.95      0.91      0.93      4520
           1       0.41      0.57      0.48       480

    accuracy                           0.88      5000
   macro avg       0.68      0.74      0.71      5000
weighted avg       0.90      0.88      0.89      5000



## Multinomial Naive Bayes

In [56]:
#Multinomial Naive Bayes gridsearch:
estimators = [
    ("normalise", StandardScaler()),
    ("select", PCA()),
    ("model", MultinomialNB())
]

pipe = Pipeline(estimators)

param_grid = [
    {
        'model__alpha': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
        'normalise': [None],
        'select': [None]
        
    }
]


grid = GridSearchCV(pipe, param_grid, cv=5, verbose=1)

#Fitting the grid
mnb_fitted_grid = grid.fit(X_train, y_train)

# Print best parameters and corresponding score
print("Best Parameters:", mnb_fitted_grid.best_params_)
print("Best Score:", mnb_fitted_grid.best_score_)

Fitting 5 folds for each of 11 candidates, totalling 55 fits
Best Parameters: {'model__alpha': 0, 'normalise': None, 'select': None}
Best Score: 0.76




In [57]:
# Note: this report covers the full dataset (train + test rows), not only the held-out test set
print(classification_report(y, mnb_fitted_grid.predict(X)))

              precision    recall  f1-score   support

           0       0.96      0.77      0.85      4520
           1       0.25      0.72      0.37       480

    accuracy                           0.76      5000
   macro avg       0.60      0.74      0.61      5000
weighted avg       0.89      0.76      0.81      5000

