
# Capstone Project 

# Author : Hamidreza Salahi

# Notebook : 3

# Baseline models

After completing EDA and producing a clean dataset, the next step is baseline modeling. The goal of this notebook is to find the most accurate classification model among *Logistic Regression, SVC, Decision tree* using a pipeline and grid search. The grid search in this notebook covers the Decision Tree and Logistic Regression models.

## Contents:
* [Train-Test Split](#Train-Test-Split)
* [Scaling Data](#Scaling-Data)
* [Pipeline and Gridsearch](#Pipeline-and-Gridsearch)


In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
#Importing clean data
loan_df = pd.read_csv('C:\\Users\\hamid\\Desktop\\Capstone\\Data\\loan_sample_after_EDA.csv')

loan_df.head()

| | loan_status | last_fico_avg | int_rate | term | fico_avg | acc_open_past_24mths | funded_amnt | loan_amnt | tot_hi_cred_lim | dti | ... | home_improvement | house | major_purchase | medical | moving | other | renewable_energy | small_business | vacation | wedding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 697.0 | 20.55 | 60 | 702.0 | 7.0 | 32025.0 | 32025.0 | 210073.0 | 39.97 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 682.0 | 9.99 | 36 | 687.0 | 4.0 | 11200.0 | 11200.0 | 97239.0 | 28.19 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0 | 692.0 | 15.05 | 36 | 662.0 | 2.0 | 20000.0 | 20000.0 | 32716.0 | 19.01 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 507.0 | 11.53 | 36 | 672.0 | 2.0 | 10000.0 | 10000.0 | 14200.0 | 3.13 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 507.0 | 17.27 | 60 | 662.0 | 5.0 | 11050.0 | 11050.0 | 245250.0 | 8.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [3]:
loan_df.shape

(228958, 79)

### Train-Test Split

The first step in modeling is to separate the dependent variable, y = `loan_status`, from the independent variables, X.

In [4]:
# Separating the dependent variable (y) from the independent variables (X)
X = loan_df.drop(columns='loan_status')
y = loan_df['loan_status']

In [5]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228958 entries, 0 to 228957
Data columns (total 78 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   last_fico_avg          228958 non-null  float64
 1   int_rate               228958 non-null  float64
 2   term                   228958 non-null  int64  
 3   fico_avg               228958 non-null  float64
 4   acc_open_past_24mths   228958 non-null  float64
 5   funded_amnt            228958 non-null  float64
 6   loan_amnt              228958 non-null  float64
 7   tot_hi_cred_lim        228958 non-null  float64
 8   dti                    228958 non-null  float64
 9   inq_last_6mths         228958 non-null  float64
 10  mo_sin_rcnt_tl         228958 non-null  float64
 11  mo_sin_rcnt_rev_tl_op  228958 non-null  float64
 12  mths_since_recent_bc   228958 non-null  float64
 13  earliest_cr_line_year  228958 non-null  int64  
 14  revol_util             228958 non-nu

In [27]:
y.info()

<class 'pandas.core.series.Series'>
RangeIndex: 228958 entries, 0 to 228957
Series name: loan_status
Non-Null Count   Dtype
--------------   -----
228958 non-null  int64
dtypes: int64(1)
memory usage: 1.7 MB


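The next step is to split the dataset into training and test sets. No separate validation set is created here; validation is handled later by cross-validation inside the grid search.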

In [28]:
# import train_test_split
from sklearn.model_selection import train_test_split

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                      y, 
                                                      test_size = 0.2, 
                                                      random_state = 15)

In [29]:
# check dataframes shapes
print(f"The shape of the X_train dataframe is: {X_train.shape}.")
print(f"The shape of the X_test dataframe is: {X_test.shape}.\n")
print(f"The shape of the y_train dataframe is: {y_train.shape}.")
print(f"The shape of the y_test dataframe is: {y_test.shape}.\n")

The shape of the X_train dataframe is: (183166, 78).
The shape of the X_test dataframe is: (45792, 78).

The shape of the y_train dataframe is: (183166,).
The shape of the y_test dataframe is: (45792,).
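If an explicit validation set is wanted in addition to cross-validation, one option is a second call to `train_test_split` on the training portion. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for `loan_df`, and the 60/20/20 proportions, the `stratify` argument, and all `*_demo` names are assumptions that do not appear in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (hypothetical class weights)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=15
)

# Step 1: hold out 20% of all rows as the test set
X_rem, X_test_demo, y_rem, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=15
)

# Step 2: take 25% of the remaining 80% (20% of the total) as validation
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rem, y_rem, test_size=0.25, stratify=y_rem, random_state=15
)

split_sizes = (len(X_train_demo), len(X_val_demo), len(X_test_demo))
print(split_sizes)
```

Passing `stratify` keeps the class proportions roughly equal across the three subsets, which can matter when one loan status is much rarer than the other.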



### Scaling Data

Next, I apply MinMaxScaler to the dataset. The scaler is fit *after* the train-test split, on the training data only, to avoid data leakage: the test data must not influence the scaling parameters. The fitted scaler is then used to transform both the training and the test set.

In [30]:
from sklearn.preprocessing import MinMaxScaler

# instantiate the scaler
scaler = MinMaxScaler()

# fit the scaler on the training data only
scaler = scaler.fit(X_train)

# transform both sets using the training-set minimum and maximum
X_train_scaled = scaler.transform(X_train)
X_test_sclaed = scaler.transform(X_test)
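The same leakage argument applies inside cross-validation: when the scaler is fit on the full training set before grid search, each validation fold has already contributed to the scaling parameters. A possible alternative, sketched here under assumptions, is to make the scaler a pipeline step so it is refit on the training part of every fold. The data is synthetic, and the names `scaled_pipe`, `demo_grid` and the small `C` grid are hypothetical and not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the loan data
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=15)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15
)

# The scaler is a pipeline step, so GridSearchCV refits it per fold
scaled_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

demo_grid = GridSearchCV(scaled_pipe, {'model__C': [0.01, 1, 100]}, cv=5)
demo_grid.fit(X_tr, y_tr)  # pass the unscaled training data

# The refit pipeline scales the test data with training-set statistics
test_accuracy = demo_grid.score(X_te, y_te)
print(demo_grid.best_params_, test_accuracy)
```

With this layout the raw `X_train` and `X_test` go straight into `fit` and `score`, and no separately scaled copies need to be kept around.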


### Pipeline and Gridsearch

In [31]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

In [32]:
# Placeholder step; the 'model' key in param_grid swaps in each candidate estimator
estimators = [('model', DecisionTreeClassifier())]

pipe = Pipeline(estimators)

param_grid = [
    # Decision tree candidates
    {'model': [DecisionTreeClassifier()],
     'model__max_depth': [1, 4, 8],
     'model__splitter': ['best', 'random'],
     'model__min_samples_leaf': [1, 3, 5],
     'model__max_features': ['sqrt', 'log2']},
    # Logistic regression candidates (the saga solver supports both l1 and l2 penalties)
    {'model': [LogisticRegression(solver='saga')],
     'model__C': [.01, 1, 100, 1e5],
     'model__max_iter': [400],
     'model__penalty': ['l1', 'l2']}
]

In [33]:
# 5-fold cross-validated grid search; with no scoring argument, classifiers are scored by accuracy
grid = GridSearchCV(pipe, param_grid, cv=5)
fittedgrid = grid.fit(X_train_scaled, y_train)

In [34]:
fittedgrid.best_params_

{'model': LogisticRegression(C=1, max_iter=400, penalty='l1', solver='saga'),
 'model__C': 1,
 'model__max_iter': 400,
 'model__penalty': 'l1'}
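`best_params_` reports only the winning combination. To see how the candidate models compare with each other, the `cv_results_` attribute of a fitted `GridSearchCV` can be loaded into a DataFrame. The following is a minimal sketch on synthetic data with a reduced grid; the names `demo_grid`, `results`, `summary` and `best_per_model` are hypothetical, and the scores it produces say nothing about the loan dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=15)

demo_pipe = Pipeline([('model', DecisionTreeClassifier())])
demo_param_grid = [
    {'model': [DecisionTreeClassifier(random_state=15)],
     'model__max_depth': [1, 4]},
    {'model': [LogisticRegression(max_iter=1000)],
     'model__C': [0.01, 1]},
]
demo_grid = GridSearchCV(demo_pipe, demo_param_grid, cv=3).fit(X_demo, y_demo)

# One row per parameter combination
results = pd.DataFrame(demo_grid.cv_results_)

# Label each row with the class name of the estimator it used
results['model_name'] = results['params'].apply(lambda p: type(p['model']).__name__)

# Candidates ordered from best to worst mean cross-validated accuracy
summary = results[
    ['model_name', 'params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')

# Best mean cross-validated accuracy reached by each model family
best_per_model = results.groupby('model_name')['mean_test_score'].max()
print(best_per_model)
```

Applied to `fittedgrid`, the same steps would show how close the best decision tree comes to the selected logistic regression.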

In [35]:
# accuracy of the best estimator (refit on the full training set) on the held-out test set
fittedgrid.score(X_test_sclaed, y_test)

0.8869671558350803
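An accuracy figure is easier to interpret next to the accuracy of always predicting the most frequent class. The class balance of `loan_status` is not shown in this notebook, so the sketch below uses synthetic data with hypothetical class weights; `DummyClassifier` is a real scikit-learn estimator, while the `*_demo` data and the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (hypothetical weights)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=15
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15
)

# Always predicts the most frequent class seen during training
dummy = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
majority_class = dummy.predict(X_te)[0]

# Accuracy any useful model should beat
baseline_accuracy = dummy.score(X_te, y_te)
print(majority_class, baseline_accuracy)
```

Running the same check with `X_train`, `y_train`, `X_test` and `y_test` would give the reference point against which the test accuracy above can be judged.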