# **AMEX Default Prediction**
- In this notebook we build a model to predict credit default for American Express.
- It follows the data exploration done in the notebook **Amex Feature Engineering**.
- It is for the Kaggle competition at https://www.kaggle.com/competitions/amex-default-prediction.

## **Libraries**
- Now we import the libraries needed in this notebook.
- We will be using scikit-learn algorithms.

In [11]:
import pandas as pd
import numpy as np

import plotly.graph_objs as go
import matplotlib.pyplot as plt

import seaborn as sb

from sklearn.model_selection import cross_val_score,train_test_split,KFold,GridSearchCV,RandomizedSearchCV
from sklearn.metrics import recall_score,roc_curve, roc_auc_score,f1_score,classification_report, confusion_matrix 
from sklearn.tree import DecisionTreeClassifier,export_graphviz
from sklearn.preprocessing import scale,Binarizer,MinMaxScaler
from sklearn.linear_model import LogisticRegression

from IPython.display import SVG,Image
from itertools import compress
from sklearn import tree

from imblearn.over_sampling import SMOTE

import joblib

## **Import Data**
- The final train features data produced by the notebook **Amex Feature Engineering** was uploaded to Google Drive, so we mount the drive to access it.
- We then copy it to our working directory and import the data.

In [2]:
!cp drive/MyDrive/'Colab Notebooks'/projects/kaggle/amex-churn/data/train_features.csv ./

In [3]:
features_data = pd.read_csv('train_features.csv')
features_data.head()

,customer_id,avg_S_26_impt,avg_B_40_impt,avg_B_18_impt,avg_B_25_impt,avg_R_20_impt,avg_D_125_impt,avg_R_5_impt,avg_D_145_impt,avg_R_22_impt,...,mode_D_114,mode_D_116,mode_D_117,mode_D_120,mode_D_126,mode_D_68,months_tenure,avg_lag_days,std_lag_days,target
0,000678921d09c5503d34055ab96b150a972f59a96471b9...,0.00506,0.014434,1.005112,0.002797,0.004248,0.004535,0.005242,0.005781,0.00418,...,1.0,0.0,3.0,0.0,1.0,6.0,11.967742,30.333333,0.887625,0
1,00093b69756b1afe3029c79b981e8d699b2a48bf4464a9...,0.420474,0.248442,0.075081,0.242857,0.004118,0.005617,0.004395,0.004326,0.004673,...,1.0,0.0,3.0,0.0,1.0,6.0,11.870968,30.083333,8.743396,0
2,0012e41fe6caa3ba31b55b3de2030cbb77b01203aeb4a5...,0.003805,0.037281,0.851478,0.00553,0.004607,0.003603,0.006291,0.005324,0.006016,...,1.0,0.0,2.0,1.0,0.0,4.0,9.032258,30.444444,1.130388,0
3,001cde1044b029fab66773573e6c69c7270b0c0c4b9475...,0.004603,0.005775,0.319829,0.124411,0.004355,0.00583,0.00497,0.004151,0.005892,...,1.0,0.0,3.0,0.0,1.0,6.0,11.967742,30.333333,10.093502,1
4,003491636f5638e541423c45a53830303c05dc785e8e67...,0.00506,0.018252,0.789428,0.004095,0.005759,0.004312,0.003435,0.004304,0.006357,...,1.0,0.0,4.0,0.0,1.0,5.0,12.0,30.416667,0.900337,0


In [4]:
features_data.shape

(458913, 345)

__Comments__
- Now we make sure our data does not have null values.

In [5]:
features_data.describe()

,avg_S_26_impt,avg_B_40_impt,avg_B_18_impt,avg_B_25_impt,avg_R_20_impt,avg_D_125_impt,avg_R_5_impt,avg_D_145_impt,avg_R_22_impt,avg_D_71_impt,...,mode_D_114,mode_D_116,mode_D_117,mode_D_120,mode_D_126,mode_D_68,months_tenure,avg_lag_days,std_lag_days,target
count,458913.0,458913.0,458913.0,458913.0,458913.0,458913.0,458913.0,458913.0,458913.0,458913.0,...,458913.0,458913.0,458913.0,458913.0,458913.0,458913.0,458913.0,458913.0,458913.0,458913.0
mean,0.063837,0.201876,0.598427,0.102031,0.03661586,0.08873233,0.035157,0.06414677,0.008439,0.06751,...,0.638424,0.05011,2.129068,0.133356,0.698104,4.73814,11.098712,30.250433,10.331743,0.258934
std,0.411714,5.578416,0.334547,0.189717,0.2224204,0.2205368,0.161141,0.1943208,0.027667,0.270267,...,0.480457,0.218172,2.353139,0.33996,0.558506,1.723081,2.621421,6.122461,7.650738,0.43805
min,2e-06,6e-06,2.1e-05,-2.128732,6.533815e-07,3.321086e-07,4e-06,3.944042e-07,2e-06,9e-06,...,0.0,0.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0
25%,0.004844,0.026382,0.281129,0.005747,0.004466648,0.004565676,0.004512,0.004574254,0.004435,0.009731,...,0.0,0.0,-1.0,0.0,0.0,4.0,11.580645,29.75,7.0367,0.0
50%,0.005889,0.073542,0.621126,0.030877,0.005067436,0.005265437,0.005145,0.005261208,0.005019,0.011818,...,1.0,0.0,3.0,0.0,1.0,5.0,12.0,30.416667,10.906529,0.0
75%,0.032713,0.252218,0.954132,0.112173,0.005713529,0.006341005,0.005891,0.006282124,0.005619,0.031658,...,1.0,0.0,4.0,0.0,1.0,6.0,12.129032,31.0,13.531949,1.0
max,84.508014,3162.277041,1.009988,11.65605,13.00948,5.850759,13.001722,4.767686,1.008719,42.218672,...,1.0,1.0,6.0,1.0,1.0,6.0,12.967742,392.0,248.901587,1.0


In [6]:
sum(features_data.isnull().sum() > 0)

0
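
The check above counts the columns that contain at least one null value. If that count were ever non-zero, the next step would be to find out which columns are affected. A minimal sketch on a toy frame (the column names `a`, `b`, `c` are hypothetical and not part of the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names; not the competition data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [0.5, 0.2, 0.1],
    "c": [np.nan, np.nan, 1.0],
})

# Number of columns that contain at least one null (same check as above).
n_null_columns = int((toy.isnull().sum() > 0).sum())

# Per-column null counts, keeping only the affected columns.
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

print(n_null_columns)
print(null_counts)
```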

## **Datasets**
- First we apply min-max scaling to our features so that they take values between 0 and 1.
- We will split our data into train, test and validation datasets in an 80:10:10 ratio.
- Due to the class imbalance in our target column, we will oversample our train dataset.
- From here onwards, the test dataset provided by the competition will be referred to as the submission dataset to avoid confusion.

In [7]:
scaler = MinMaxScaler()

y = np.array(features_data['target'])
X = np.array(features_data.drop(['customer_id', 'target'], axis = 1))

X = scaler.fit_transform(X)
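
For reference, min-max scaling maps every column with `(x - min) / (max - min)`, using that column's own minimum and maximum. The sketch below checks this against `MinMaxScaler` on a small made-up array (`toy_X` is an illustrative name, not a variable from this notebook):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up array: 3 rows, 2 features.
toy_X = np.array([[1.0, 10.0],
                  [2.0, 30.0],
                  [4.0, 20.0]])

# Manual min-max scaling, column by column.
col_min = toy_X.min(axis=0)
col_max = toy_X.max(axis=0)
manual = (toy_X - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(toy_X)

print(np.allclose(manual, scaled))
```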

__Comments__
- Now we split our dataset.
- We first hold out 20% of the data, then split that portion equally into test and validation sets.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
X_valid, X_test, y_valid , y_test = train_test_split(X_test, y_test, test_size = 0.5)

In [10]:
X_train.shape, X_test.shape, X_valid.shape

((367130, 343), (45892, 343), (45891, 343))
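
Note that the split above is unseeded and unstratified, and that the scaler was fitted on all rows before splitting, so validation and test rows influence the scaling. One possible variation is sketched below on synthetic data: it fixes a seed, stratifies on the target, and fits the scaler on the training rows only. The seed value and all `*_toy`, `X_tr`, `X_va`, `X_te` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the features and target (roughly 26% positives).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 5))
y_toy = (rng.random(1000) < 0.26).astype(int)

# 80% train, then split the remaining 20% evenly into validation and test.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

# Fit the scaler on the training rows only, then apply it to every split.
toy_scaler = MinMaxScaler().fit(X_tr)
X_tr, X_va, X_te = (toy_scaler.transform(part) for part in (X_tr, X_va, X_te))

print(X_tr.shape, X_va.shape, X_te.shape)
print(y_tr.mean(), y_va.mean(), y_te.mean())
```

With this variation, validation and test values can fall slightly outside the 0 to 1 range, because their minimum and maximum were not seen when fitting the scaler.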

__Comments__
- Now we oversample our train dataset to address the class imbalance.

In [13]:
y_train.mean()

0.25888922180154167

In [14]:
X_train, y_train = SMOTE().fit_resample(X_train, y_train)
y_train.mean()

0.5
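
SMOTE balances the classes by creating synthetic minority rows: it takes a minority sample, picks one of its nearest minority neighbours, and places a new point somewhere on the line between the two. The sketch below is a simplified NumPy illustration of that idea, not imblearn's implementation. The function name `smote_like_oversample` and the default `k=5` are assumptions for the example.

```python
import numpy as np

def smote_like_oversample(X, y, k=5, seed=0):
    """Simplified SMOTE-style oversampling for a binary target (illustration only)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = counts.max() - counts.min()

    X_min = X[y == minority]
    # Pairwise distances between minority rows; a row is never its own neighbour.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random minority row and one of its k neighbours, then interpolate.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, neighbours.shape[1], size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])

    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority)])
    return X_out, y_out

# Toy data: 50 positives and 150 negatives.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 50 + [0] * 150)

X_bal, y_bal = smote_like_oversample(X_toy, y_toy)
print(y_toy.mean(), y_bal.mean())
```

Because the new rows are interpolations of existing minority rows, oversampling is applied to the train dataset only; the validation and test sets keep the original class ratio.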

## **Modelling**
- Now we build our model.
- Since default prediction is a binary classification problem, we will try the following algorithms:
  1. Decision tree
  2. Logistic regression
  3. Random forest

### **1. Decision Tree**
- We first try a decision tree to model our data.
- We start by identifying which hyperparameters to use through hyperparameter optimization.

#### **Hyperparameter optimization**
- Here we look for the best values for
  1. Criterion
  2. Splitter
  3. Maximum depth of the tree
  4. Minimum number of samples required to split a node
  5. Minimum number of samples required at each leaf node
  6. Number of features to consider at every split
- For this we will try both random search and grid search.

In [23]:
# Criterion to be considered
criterion = ['gini', 'entropy']


# Splitter to be considered
splitter = ['best', 'random']

# Maximum number of levels in tree
max_depth =list(np.arange(1,300))
max_depth.append(None)

# Minimum number of samples required to split a node
# (scikit-learn requires an integer value of at least 2)
min_samples_split = list(np.arange(2,1000))

# Minimum number of samples required at each leaf node
min_samples_leaf = list(np.arange(1,1000))

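# Number of features to consider at every split
# ('auto' is left out: for classification trees it was an alias of 'sqrt'
# and recent scikit-learn releases no longer accept it)
max_features = list(np.arange(1,50))
max_features.append('sqrt')
max_features.append('log2')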

##### **Random Search**
- We first try to identify the best hyperparameters using random search.
- This allows us to narrow the range of hyperparameters to consider when we perform grid search.
- First we create our random grid from the lists defined above.
- Then we create a model to pass into RandomizedSearchCV.

In [24]:
random_grid = {'criterion': criterion,
               'splitter': splitter,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

In [26]:
hyper_model1 = DecisionTreeClassifier()

hyper_random_model = RandomizedSearchCV(estimator = hyper_model1,
                                        param_distributions = random_grid,
                                        n_iter = 2000, cv = 5, verbose = 2,
                                        n_jobs = -1, scoring = 'accuracy')

__Comments__
- Now we fit our train data in the random search to identify the best hyperparameters.

In [None]:
hyper_random_model.fit(X_train,y_train)

Fitting 5 folds for each of 2000 candidates, totalling 10000 fits


In [None]:
hyper_random_model.best_params_
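
The outputs of the two cells above are not recorded here. To show what the fitted search object exposes, here is a small self-contained sketch on synthetic data. The reduced grid `toy_grid`, the smaller `n_iter` and `cv`, and the fixed seeds are assumptions chosen so that it runs in seconds; they are not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data as a stand-in for the real features.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# Much smaller search space than random_grid, for illustration only.
toy_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 8, None],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
}

toy_search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions=toy_grid,
    n_iter=10,           # number of random parameter combinations to try
    cv=3,
    scoring="accuracy",
    random_state=0,
)
toy_search.fit(X_toy, y_toy)

# Best combination found and its mean cross-validated accuracy.
print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))
```

`best_params_` is a dictionary with one entry per searched hyperparameter, which is what the grid search ranges are then centred on.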

##### **Grid Search**
- Starting from the best hyperparameters found by the random search, we now perform a grid search over a range of values around them.

In [None]:
max_depth = list(np.arange(1,11))
min_samples_leaf = list(np.arange(1,11))
# min_samples_split must be at least 2; note that it is not included in param_grid below
min_samples_split = list(np.arange(2,11))
criterion = ['gini','entropy']
max_features = list(np.arange(1,32))

param_grid = {'criterion':criterion,'max_depth': max_depth,'min_samples_leaf':min_samples_leaf,'max_features':max_features}
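
Unlike random search, grid search fits every combination, so it helps to count the work before starting. The sketch below rebuilds the same grid under the hypothetical name `grid_sketch` and counts the candidates with `ParameterGrid`. It assumes the 5-fold cross-validation and the three scoring metrics used in the loop that follows.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same ranges as param_grid above, rebuilt so the example is self-contained.
grid_sketch = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(np.arange(1, 11)),
    "min_samples_leaf": list(np.arange(1, 11)),
    "max_features": list(np.arange(1, 32)),
}

n_candidates = len(ParameterGrid(grid_sketch))   # 2 * 10 * 10 * 31
n_fits_per_metric = n_candidates * 5             # cv = 5
n_fits_total = n_fits_per_metric * 3             # one grid search per metric

print(n_candidates, n_fits_per_metric, n_fits_total)
# → 6200 31000 93000
```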

__Comments__
- For our grid search we will find the best hyperparameters for each of the metrics accuracy, precision and recall.

In [None]:
scoring = ['accuracy', 'precision','recall']

for metric in scoring:
  grid_model = DecisionTreeClassifier()
  grid = GridSearchCV(grid_model, param_grid = param_grid, n_jobs = -1, cv = 5, scoring = metric)
  grid.fit(X_train,y_train)
  print(f'Using the {metric} metric, the best parameters are {grid.best_params_}\nClassification report on the training data:\n{classification_report(y_train, grid.predict(X_train))}')
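
The loop above runs the full grid search once per metric and reports on the training data. As a possible alternative, `GridSearchCV` accepts a list of metrics and scores all of them in a single pass, and the held-out validation set gives a less optimistic report. A minimal sketch on synthetic data follows. The tiny `toy_grid`, the choice of `refit="recall"` and the split sizes are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a train/validation split as stand-ins for the real sets.
X_toy, y_toy = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0)

toy_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

multi_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=toy_grid,
    cv=3,
    scoring=["accuracy", "precision", "recall"],
    refit="recall",   # the final model is refitted with the best-recall parameters
)
multi_search.fit(X_tr, y_tr)

# Best parameter combination for each metric, read from the shared results.
for metric in ["accuracy", "precision", "recall"]:
    best = multi_search.cv_results_[f"mean_test_{metric}"].argmax()
    print(metric, multi_search.cv_results_["params"][best])

# Report on rows the search never saw.
print(classification_report(y_va, multi_search.predict(X_va)))
```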