# About Company
Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility for the loan.

# Problem

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate the process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that it can target those customers specifically. A partial data set is provided.

## Data

| Variable | Description |
|---|---|
| Loan_ID | Unique loan ID |
| Gender | Male / Female |
| Married | Applicant married (Y/N) |
| Dependents | Number of dependents |
| Education | Applicant education (Graduate / Not Graduate) |
| Self_Employed | Self-employed (Y/N) |
| ApplicantIncome | Applicant income |
| CoapplicantIncome | Coapplicant income |
| LoanAmount | Loan amount in thousands |
| Loan_Amount_Term | Term of loan in months |
| Credit_History | Credit history meets guidelines |
| Property_Area | Urban / Semiurban / Rural |
| Loan_Status | Loan approved (Y/N) |

Note: The evaluation metric is accuracy, i.e. the percentage of loan approvals you predict correctly.

In [1]:
import pandas as pd
import numpy as np
import statistics as st
import seaborn as sns

In [2]:
df = pd.read_csv('Loan Prediction_train.csv')
df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [3]:
df.drop('Loan_ID',axis=1,inplace=True)

In [4]:
df.shape

(614, 12)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
Gender               601 non-null object
Married              611 non-null object
Dependents           599 non-null object
Education            614 non-null object
Self_Employed        582 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null object
dtypes: float64(4), int64(1), object(7)
memory usage: 57.7+ KB


In [6]:
(df.isnull().sum()/df.shape[0])*100

Gender               2.117264
Married              0.488599
Dependents           2.442997
Education            0.000000
Self_Employed        5.211726
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           3.583062
Loan_Amount_Term     2.280130
Credit_History       8.143322
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

In [7]:
df.describe()

| | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History |
|---|---|---|---|---|---|
| count | 614.0 | 614.0 | 592.0 | 600.0 | 564.0 |
| mean | 5403.459283 | 1621.245798 | 146.412162 | 342.0 | 0.842199 |
| std | 6109.041673 | 2926.248369 | 85.587325 | 65.12041 | 0.364878 |
| min | 150.0 | 0.0 | 9.0 | 12.0 | 0.0 |
| 25% | 2877.5 | 0.0 | 100.0 | 360.0 | 1.0 |
| 50% | 3812.5 | 1188.5 | 128.0 | 360.0 | 1.0 |
| 75% | 5795.0 | 2297.25 | 168.0 | 360.0 | 1.0 |
| max | 81000.0 | 41667.0 | 700.0 | 480.0 | 1.0 |


In [8]:
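# Fill missing Dependents with the most frequent value; assigning the result back
# avoids relying on chained inplace modification of a column
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])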

In [9]:
# Strip the trailing '+' from values such as '3+' so the column can be cast to int
df['Dependents'] = df['Dependents'].str.replace('+', '', regex=False).astype('int32')

In [10]:
for i in df.select_dtypes('object').columns:
    print('-----------------',i,'-------------------')
    print(df[i].value_counts())

----------------- Gender -------------------
Male      489
Female    112
Name: Gender, dtype: int64
----------------- Married -------------------
Yes    398
No     213
Name: Married, dtype: int64
----------------- Education -------------------
Graduate        480
Not Graduate    134
Name: Education, dtype: int64
----------------- Self_Employed -------------------
No     500
Yes     82
Name: Self_Employed, dtype: int64
----------------- Property_Area -------------------
Semiurban    233
Urban        202
Rural        179
Name: Property_Area, dtype: int64
----------------- Loan_Status -------------------
Y    422
N    192
Name: Loan_Status, dtype: int64


In [11]:
#from sklearn.preprocessing import LabelEncoder
#le = LabelEncoder()
#for i in df.select_dtypes('object').columns:
#    df[i] = le.fit_transform(df[i])

In [12]:
# Encode each categorical column as integers; map() leaves missing values as NaN
df['Gender'] = df['Gender'].map({'Female':0,'Male':1})
df['Married'] = df['Married'].map({'No':0,'Yes':1})
df['Education'] = df['Education'].map({'Graduate':0,'Not Graduate':1})
df['Self_Employed'] = df['Self_Employed'].map({'No':0,'Yes':1})
df['Property_Area'] = df['Property_Area'].map({'Rural':0,'Semiurban':1,'Urban':2})
df['Loan_Status'] = df['Loan_Status'].map({'N':0,'Y':1})

In [13]:
from sklearn.impute import KNNImputer
# Note: each call below passes a single column, so the imputer has no other
# feature to measure neighbour distance on and fills gaps with the column mean
imputer = KNNImputer(n_neighbors=3)
df['Gender'] = imputer.fit_transform(df[['Gender']])
df['Married'] = imputer.fit_transform(df[['Married']])
df['Self_Employed'] = imputer.fit_transform(df[['Self_Employed']])
df['LoanAmount'] = imputer.fit_transform(df[['LoanAmount']])
df['Loan_Amount_Term'] = imputer.fit_transform(df[['Loan_Amount_Term']])
df['Credit_History'] = imputer.fit_transform(df[['Credit_History']])
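
The imputed LoanAmount in the first row of the next `df.head()` output (146.412162) equals the column mean reported by `describe()`, which is what `KNNImputer` produces when a row has no observed feature to compare on. The sketch below illustrates that fallback on made-up numbers, then shows the alternative of passing several columns at once so that neighbours can actually be found. The arrays and variable names are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# One column at a time: the row with the gap has nothing else to measure
# distance on, so the imputer falls back to the mean of the column.
single = np.array([[1.0], [2.0], [np.nan], [6.0]])
single_filled = KNNImputer(n_neighbors=3).fit_transform(single)
print(single_filled.ravel())  # the gap becomes mean(1, 2, 6) = 3.0

# Several columns together: neighbours are chosen using the observed column.
joint = np.array([
    [0.0, 10.0],
    [0.0, 12.0],
    [1.0, 100.0],
    [1.0, 104.0],
    [1.0, np.nan],  # closest donors are the two rows with 1.0 in column 0
    [0.0, np.nan],  # closest donors are the two rows with 0.0 in column 0
])
joint_filled = KNNImputer(n_neighbors=2).fit_transform(joint)
print(joint_filled[4:, 1])  # averages of the matching donors: 102.0 and 11.0
```

For binary columns such as Gender or Credit_History, either approach can produce fractional values, so rounding or a most-frequent strategy may be worth considering.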

In [14]:
df.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [15]:
df['Loan_Status'].value_counts()

1    422
0    192
Name: Loan_Status, dtype: int64
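
Since accuracy is the evaluation metric and the classes are imbalanced (422 approved against 192 rejected), a useful reference point is the accuracy obtained by always predicting the majority class. A minimal sketch using only the counts printed above; `majority_baseline_accuracy` is a hypothetical helper name introduced for illustration.

```python
import pandas as pd

def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    counts = labels.value_counts()
    return counts.max() / counts.sum()

# Rebuild a label column with the same class counts as the training set.
loan_status = pd.Series([1] * 422 + [0] * 192, name="Loan_Status")
baseline = majority_baseline_accuracy(loan_status)
print(f"Majority-class baseline accuracy: {baseline:.4f}")  # 422 / 614
```

Any model worth submitting should beat this figure of roughly 0.69.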

In [16]:
df.head()

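| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0 | 0 | 0.0 | 5849 | 0.0 | 146.412162 | 360.0 | 1.0 | 2 | 1 |
| 1 | 1.0 | 1.0 | 1 | 0 | 0.0 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0 | 0 |
| 2 | 1.0 | 1.0 | 0 | 0 | 1.0 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 2 | 1 |
| 3 | 1.0 | 1.0 | 0 | 1 | 0.0 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 2 | 1 |
| 4 | 1.0 | 0.0 | 0 | 0 | 0.0 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 2 | 1 |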


In [17]:
df_test = pd.read_csv('Loan Prediction_test.csv')

In [18]:
test = df_test['Loan_ID']

In [19]:
df_test.drop('Loan_ID',axis=1,inplace=True)
df_test.head()

| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | Yes | 0 | Graduate | No | 5720 | 0 | 110.0 | 360.0 | 1.0 | Urban |
| 1 | Male | Yes | 1 | Graduate | No | 3076 | 1500 | 126.0 | 360.0 | 1.0 | Urban |
| 2 | Male | Yes | 2 | Graduate | No | 5000 | 1800 | 208.0 | 360.0 | 1.0 | Urban |
| 3 | Male | Yes | 2 | Graduate | No | 2340 | 2546 | 100.0 | 360.0 | NaN | Urban |
| 4 | Male | No | 0 | Not Graduate | No | 3276 | 0 | 78.0 | 360.0 | 1.0 | Urban |


In [20]:
(df_test.isnull().sum()/df_test.shape[0])*100

Gender               2.997275
Married              0.000000
Dependents           2.724796
Education            0.000000
Self_Employed        6.267030
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           1.362398
Loan_Amount_Term     1.634877
Credit_History       7.901907
Property_Area        0.000000
dtype: float64

In [21]:
# Same Dependents clean-up as for the training data: fill gaps with the most
# frequent value, strip the trailing '+' and cast to int
df_test['Dependents'] = df_test['Dependents'].fillna(df_test['Dependents'].mode()[0])
df_test['Dependents'] = df_test['Dependents'].str.replace('+', '', regex=False).astype('int32')

In [22]:
for i in df_test.select_dtypes('object').columns:
    print('-----------------',i,'-------------------')
    print(df_test[i].value_counts())

----------------- Gender -------------------
Male      286
Female     70
Name: Gender, dtype: int64
----------------- Married -------------------
Yes    233
No     134
Name: Married, dtype: int64
----------------- Education -------------------
Graduate        283
Not Graduate     84
Name: Education, dtype: int64
----------------- Self_Employed -------------------
No     307
Yes     37
Name: Self_Employed, dtype: int64
----------------- Property_Area -------------------
Urban        140
Semiurban    116
Rural        111
Name: Property_Area, dtype: int64


In [23]:
# Apply the same integer encoding as for the training data
df_test['Gender'] = df_test['Gender'].map({'Female':0,'Male':1})
df_test['Married'] = df_test['Married'].map({'No':0,'Yes':1})
df_test['Education'] = df_test['Education'].map({'Graduate':0,'Not Graduate':1})
df_test['Self_Employed'] = df_test['Self_Employed'].map({'No':0,'Yes':1})
df_test['Property_Area'] = df_test['Property_Area'].map({'Rural':0,'Semiurban':1,'Urban':2})

In [24]:
#df_test['Gender'].fillna(method = 'ffill',inplace=True)
#df_test['Dependents'].fillna(method = 'ffill',inplace=True)
#df_test['Self_Employed'].fillna(method = 'ffill',inplace=True)
#df_test['LoanAmount'].fillna(value = df['LoanAmount'].mean(),inplace=True)
#df_test['Loan_Amount_Term'].fillna(value = df['Loan_Amount_Term'].mean(),inplace=True)
#df_test['Credit_History'].fillna(method = 'ffill',inplace=True)

In [25]:
df_test['Gender'] = imputer.fit_transform(df_test[['Gender']])
df_test['Self_Employed'] = imputer.fit_transform(df_test[['Self_Employed']])
df_test['LoanAmount'] = imputer.fit_transform(df_test[['LoanAmount']])
df_test['Loan_Amount_Term'] = imputer.fit_transform(df_test[['Loan_Amount_Term']])
df_test['Credit_History'] = imputer.fit_transform(df_test[['Credit_History']])
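
Calling `fit_transform` on the test frame re-learns the fill values from the test data itself. A common alternative is to fit the imputer on the training data only and reuse it on the test data, so both sets are filled with the same statistics. The sketch below assumes `SimpleImputer` is an acceptable substitute for the imputer used above and runs on made-up toy frames; the names `train`, `holdout`, `amount_imputer` and `history_imputer` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for the training and test frames (values are made up).
train = pd.DataFrame({"LoanAmount": [100.0, 120.0, np.nan, 140.0],
                      "Credit_History": [1.0, 1.0, 0.0, np.nan]})
holdout = pd.DataFrame({"LoanAmount": [np.nan, 500.0],
                        "Credit_History": [np.nan, 0.0]})

# Mean for the continuous column, most frequent value for the binary flag.
amount_imputer = SimpleImputer(strategy="mean")
history_imputer = SimpleImputer(strategy="most_frequent")

# Fit on the training data only ...
train[["LoanAmount"]] = amount_imputer.fit_transform(train[["LoanAmount"]])
train[["Credit_History"]] = history_imputer.fit_transform(train[["Credit_History"]])

# ... then reuse the learned statistics on the unseen data.
holdout[["LoanAmount"]] = amount_imputer.transform(holdout[["LoanAmount"]])
holdout[["Credit_History"]] = history_imputer.transform(holdout[["Credit_History"]])
print(holdout)
```

Here the missing LoanAmount in `holdout` is filled with the training mean (120.0), not with a value derived from the 500.0 in the unseen data.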

In [26]:
df_test.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
dtype: int64

In [27]:
df_test.head()

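| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 0 | 0 | 0.0 | 5720 | 0 | 110.0 | 360.0 | 1.0 | 2 |
| 1 | 1.0 | 1 | 1 | 0 | 0.0 | 3076 | 1500 | 126.0 | 360.0 | 1.0 | 2 |
| 2 | 1.0 | 1 | 2 | 0 | 0.0 | 5000 | 1800 | 208.0 | 360.0 | 1.0 | 2 |
| 3 | 1.0 | 1 | 2 | 0 | 0.0 | 2340 | 2546 | 100.0 | 360.0 | 0.825444 | 2 |
| 4 | 1.0 | 0 | 0 | 1 | 0.0 | 3276 | 0 | 78.0 | 360.0 | 1.0 | 2 |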


In [28]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

## Train on the full training data and predict on the test dataset:

In [29]:
X = df.drop('Loan_Status',axis=1)
y = df['Loan_Status']

In [30]:
#xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=0.3,random_state=42)
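
The commented-out split hints at a hold-out check that the notebook never runs, so no local accuracy estimate exists before the predictions are written out. A minimal sketch of such a check follows. It uses synthetic data from `make_classification` because the CSV is not reproduced here; the `_demo` names, the class weights and `max_iter=1000` are illustrative assumptions rather than the author's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 614 rows, 11 features, mostly positive labels.
X_demo, y_demo = make_classification(n_samples=614, n_features=11,
                                     weights=[0.31, 0.69], random_state=42)

# stratify keeps the class ratio the same in both parts of the split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y` from the cell above.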

## Logistic Regression:

In [31]:
lr = LogisticRegression()
lr.fit(X,y)
#ypred = lr.predict(xtest)
#accuracy_score(ytest,ypred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
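
The convergence warning above typically appears when the lbfgs solver is given features on very different scales; here ApplicantIncome reaches 81000 while the encoded flags are 0 or 1. One common remedy, which the warning itself suggests, is to scale the data and allow more iterations. The sketch below assumes a `StandardScaler` in front of the classifier is acceptable and uses synthetic data with one column stretched to an income-like range; it is not the configuration the notebook ran.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=614, n_features=11, random_state=0)
# Stretch one column to mimic an income-like feature on a much larger scale.
X_demo[:, 0] = X_demo[:, 0] * 5000 + 5000

# The scaler is fitted together with the model, so new data passed to
# predict() is scaled with the statistics learned during fit().
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

n_iter = scaled_lr.named_steps["logisticregression"].n_iter_
print("lbfgs iterations used:", n_iter)
```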

In [32]:
y_pred_test_lr = lr.predict(df_test)
y_pred_test_lr = pd.DataFrame(y_pred_test_lr)
y_pred_test_lr.replace({0:'N',1:'Y'},inplace=True)

In [33]:
y_pred_test_lr.rename(columns={0: 'Loan_Status'},inplace=True)
df_lr = pd.concat([test,y_pred_test_lr],axis=1)
df_lr.to_csv('Logistic Reg copy.csv', header=True, index=True)
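
The predict, relabel, concatenate and save steps are repeated for every model in this notebook, so they could be wrapped in one function. A minimal sketch follows; `build_submission` is a hypothetical helper, the IDs are made up, and writing without the index column is an assumption, since the source does not state which layout the submission expects.

```python
import pandas as pd

def build_submission(loan_ids: pd.Series, predictions) -> pd.DataFrame:
    """Pair each Loan_ID with its predicted label, mapped back to 'N'/'Y'."""
    labels = pd.Series(predictions, name="Loan_Status").map({0: "N", 1: "Y"})
    return pd.DataFrame({"Loan_ID": loan_ids.to_numpy(),
                         "Loan_Status": labels.to_numpy()})

# Made-up IDs and predictions, only to show the shape of the result.
demo_ids = pd.Series(["LP000001", "LP000002", "LP000003"], name="Loan_ID")
submission = build_submission(demo_ids, [1, 0, 1])

# to_csv() without a path returns the CSV text instead of writing a file.
print(submission.to_csv(index=False))
```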

## Random Forest Classifier:

In [34]:
rf = RandomForestClassifier()
rf.fit(X,y)
#ypred = rf.predict(xtest)
#accuracy_score(ytest,ypred)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [35]:
y_pred_test_rf = rf.predict(df_test)
y_pred_test_rf = pd.DataFrame(y_pred_test_rf)
y_pred_test_rf.replace({0:'N',1:'Y'},inplace=True)

In [36]:
y_pred_test_rf.rename(columns={0: 'Loan_Status'},inplace=True)
df_rf = pd.concat([test,y_pred_test_rf],axis=1)
df_rf.to_csv('Random Forest copy.csv', header=True, index=True)

## Gradient Boosting Classifier:

In [37]:
gb = GradientBoostingClassifier()
gb.fit(X,y)
#ypred = gb.predict(xtest)
#accuracy_score(ytest,ypred)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [38]:
y_pred_test_gb = gb.predict(df_test)
y_pred_test_gb = pd.DataFrame(y_pred_test_gb)
y_pred_test_gb.replace({0:'N',1:'Y'},inplace=True)

In [39]:
y_pred_test_gb.rename(columns={0: 'Loan_Status'},inplace=True)
df_gb = pd.concat([test,y_pred_test_gb],axis=1)
df_gb.to_csv('Gradient Boosting copy.csv', header=True, index=True)

## Extra Trees Classifier:

In [40]:
ets = ExtraTreesClassifier()
ets.fit(X,y)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

In [41]:
y_pred_test_ets = ets.predict(df_test)
y_pred_test_ets = pd.DataFrame(y_pred_test_ets)
y_pred_test_ets.replace({0:'N',1:'Y'},inplace=True)

In [42]:
y_pred_test_ets.rename(columns={0: 'Loan_Status'},inplace=True)
df_ets = pd.concat([test,y_pred_test_ets],axis=1)
df_ets.to_csv('Extra Trees copy.csv', header=True, index=True)
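
All four models above are fitted on the full training set and sent straight to the test set, so the notebook gives no way to compare them locally. One option is k-fold cross-validation with `cross_val_score`, sketched here on synthetic data; the dataset, the reduced `n_estimators` and the `random_state` values are assumptions made to keep the example small and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=42),
}

cv_means = {}
for name, model in models.items():
    # Five accuracy scores per model, one for each held-out fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```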

# Oversampling:

In [43]:
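# Recombine features and target so that whole rows are resampled together
Xy_train = pd.concat([X,y],axis=1)
Xy_train0 = Xy_train.loc[Xy_train['Loan_Status']==0]   # minority class (192 rows)
Xy_train1 = Xy_train.loc[Xy_train['Loan_Status']==1]   # majority class (422 rows)

len1 = len(Xy_train1)
len0 = len(Xy_train0)

# Draw minority rows with replacement until they match the majority count
Xy_train0_os = Xy_train0.sample(len1,replace=True)

Xy_train_os = pd.concat([Xy_train0_os,Xy_train1],axis=0)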

In [44]:
X_train_os = Xy_train_os.drop(['Loan_Status'],axis=1)
y_train_os = Xy_train_os['Loan_Status']

In [45]:
y_train_os.value_counts()

1    422
0    422
Name: Loan_Status, dtype: int64
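
The oversampling above draws a different random sample on every run because no seed is set. A small reusable variant with a fixed `random_state` is sketched below; `oversample_minority` is a hypothetical helper, it assumes a binary and imbalanced target, and the toy frame is made up.

```python
import pandas as pd

def oversample_minority(frame: pd.DataFrame, target: str,
                        random_state: int = 42) -> pd.DataFrame:
    """Duplicate minority-class rows at random until both classes are equal."""
    counts = frame[target].value_counts()
    majority = frame[frame[target] == counts.idxmax()]
    minority = frame[frame[target] == counts.idxmin()]
    resampled = minority.sample(n=len(majority), replace=True,
                                random_state=random_state)
    # Shuffle so the two classes are not stored as two contiguous blocks.
    combined = pd.concat([majority, resampled])
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)

toy = pd.DataFrame({"ApplicantIncome": range(10),
                    "Loan_Status": [1] * 7 + [0] * 3})
balanced = oversample_minority(toy, "Loan_Status")
print(balanced["Loan_Status"].value_counts())
```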

## Logistic Regression:

In [46]:
lr = LogisticRegression()
lr.fit(X_train_os,y_train_os)
#ypred = lr.predict(xtest)
#accuracy_score(ytest,ypred)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [47]:
y_pred_test_lr = lr.predict(df_test)
y_pred_test_lr = pd.DataFrame(y_pred_test_lr)
y_pred_test_lr.replace({0:'N',1:'Y'},inplace=True)

In [48]:
y_pred_test_lr.rename(columns={0: 'Loan_Status'},inplace=True)
df_lr = pd.concat([test,y_pred_test_lr],axis=1)
df_lr.to_csv('Logistic Reg os.csv', header=True, index=True)

## Random Forest Classifier:

In [49]:
rf = RandomForestClassifier()
rf.fit(X_train_os,y_train_os)
#ypred = rf.predict(xtest)
#accuracy_score(ytest,ypred)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [50]:
y_pred_test_rf = rf.predict(df_test)
y_pred_test_rf = pd.DataFrame(y_pred_test_rf)
y_pred_test_rf.replace({0:'N',1:'Y'},inplace=True)

In [51]:
y_pred_test_rf.rename(columns={0: 'Loan_Status'},inplace=True)
df_rf = pd.concat([test,y_pred_test_rf],axis=1)
df_rf.to_csv('Random Forest os.csv', header=True, index=True)

## Gradient Boosting Classifier:

In [52]:
gb = GradientBoostingClassifier()
gb.fit(X_train_os,y_train_os)
#ypred = gb.predict(xtest)
#accuracy_score(ytest,ypred)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [53]:
y_pred_test_gb = gb.predict(df_test)
y_pred_test_gb = pd.DataFrame(y_pred_test_gb)
y_pred_test_gb.replace({0:'N',1:'Y'},inplace=True)

In [54]:
y_pred_test_gb.rename(columns={0: 'Loan_Status'},inplace=True)
df_gb = pd.concat([test,y_pred_test_gb],axis=1)
df_gb.to_csv('Gradient Boosting os.csv', header=True, index=True)

## Extra Trees Classifier:

In [55]:
ets = ExtraTreesClassifier()
ets.fit(X_train_os,y_train_os)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

In [56]:
y_pred_test_ets = ets.predict(df_test)
y_pred_test_ets = pd.DataFrame(y_pred_test_ets)
y_pred_test_ets.replace({0:'N',1:'Y'},inplace=True)

In [57]:
y_pred_test_ets.rename(columns={0: 'Loan_Status'},inplace=True)
df_ets = pd.concat([test,y_pred_test_ets],axis=1)
df_ets.to_csv('Extra Trees os.csv', header=True, index=True)

# Hyperparameter Tuning:

## Gradient Boosting Classifier Tuned:

In [58]:
# Note: newer scikit-learn releases name the 'deviance' loss 'log_loss'
params = {"loss": ['deviance', 'exponential'],
          "min_samples_split": [10, 20, 40],
          "max_depth": [2, 6, 8],
          "min_samples_leaf": [20, 40, 100],
          "max_leaf_nodes": [5, 20, 100]
          }

In [59]:
grid_cv_gb = GridSearchCV(gb,param_grid=params,cv=5)
grid_cv_gb.fit(X,y)

GridSearchCV(cv=5, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_c...
                 

In [60]:
print(grid_cv_gb.best_params_)

{'loss': 'exponential', 'max_depth': 2, 'max_leaf_nodes': 5, 'min_samples_leaf': 40, 'min_samples_split': 10}
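
Only `best_params_` is printed above, but a fitted `GridSearchCV` also exposes the mean cross-validated score of that combination (`best_score_`) and, with the default `refit=True`, a model already refitted on all rows (`best_estimator_`). The sketch below shows both on synthetic data with a deliberately tiny grid; the grid values, `n_estimators=30` and the `_demo` names are assumptions chosen for speed, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

small_grid = {"max_depth": [2, 3], "min_samples_leaf": [20, 40]}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid=small_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("mean CV accuracy of best params:", round(search.best_score_, 3))

# With refit=True (the default) this estimator is already trained on all rows,
# so it can be used for prediction without constructing a new model.
tuned_model = search.best_estimator_
```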


In [61]:
gb_hyp = GradientBoostingClassifier(**grid_cv_gb.best_params_)
gb_hyp.fit(X,y)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='exponential', max_depth=2,
                           max_features=None, max_leaf_nodes=5,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=40, min_samples_split=10,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [62]:
y_pred_test_gb = gb_hyp.predict(df_test)
y_pred_test_gb = pd.DataFrame(y_pred_test_gb)
y_pred_test_gb.replace({0:'N',1:'Y'},inplace=True)

In [63]:
y_pred_test_gb.rename(columns={0: 'Loan_Status'},inplace=True)
df_gb = pd.concat([test,y_pred_test_gb],axis=1)
df_gb.to_csv('Gradient Boosting Hyp.csv', header=True, index=True)

## Gradient Boosting Classifier Tuned with Oversampling:

In [64]:
params = {"loss": ['deviance', 'exponential'],
              "min_samples_split": [10, 20, 40],
              "max_depth": [2, 6, 8],
              "min_samples_leaf": [20, 40, 100],
              "max_leaf_nodes": [5, 20, 100]
              }

In [65]:
grid_cv_gb = GridSearchCV(gb,param_grid=params,cv=5)
grid_cv_gb.fit(X_train_os,y_train_os)

GridSearchCV(cv=5, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_c...
                 

In [66]:
print(grid_cv_gb.best_params_)

{'loss': 'deviance', 'max_depth': 8, 'max_leaf_nodes': 20, 'min_samples_leaf': 20, 'min_samples_split': 40}
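
One caveat with this search: the oversampled frame contains duplicated rows, and `GridSearchCV` splits it into folds after the duplication, so copies of the same row can land in both the training and the validation part of a fold. That tends to inflate the cross-validated score and may favour deeper trees. A hedged alternative is to oversample inside each training fold only, sketched here on synthetic data with a manual `StratifiedKFold` loop; the dataset, the model settings and the assumption that label 0 is the minority class are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data: label 0 is the minority class by construction.
X_demo, y_demo = make_classification(n_samples=300, n_features=11,
                                     weights=[0.3, 0.7], random_state=0)

fold_scores = []
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in splitter.split(X_demo, y_demo):
    X_tr, y_tr = X_demo[train_idx], y_demo[train_idx]

    # Add resampled minority rows to the training fold until classes match.
    minority = y_tr == 0
    n_extra = int((~minority).sum() - minority.sum())
    X_extra, y_extra = resample(X_tr[minority], y_tr[minority], replace=True,
                                n_samples=n_extra, random_state=0)
    X_bal = np.vstack([X_tr, X_extra])
    y_bal = np.concatenate([y_tr, y_extra])

    model = GradientBoostingClassifier(n_estimators=30, random_state=0)
    model.fit(X_bal, y_bal)

    # The validation fold keeps its original rows, with no duplicates added.
    fold_scores.append(accuracy_score(y_demo[val_idx],
                                      model.predict(X_demo[val_idx])))

print("fold accuracies:", np.round(fold_scores, 3))
```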


In [67]:
gb_hyp = GradientBoostingClassifier(**grid_cv_gb.best_params_)
gb_hyp.fit(X_train_os,y_train_os)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=8,
                           max_features=None, max_leaf_nodes=20,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=20, min_samples_split=40,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [68]:
y_pred_test_gb = gb_hyp.predict(df_test)
y_pred_test_gb = pd.DataFrame(y_pred_test_gb)
y_pred_test_gb.replace({0:'N',1:'Y'},inplace=True)

In [69]:
y_pred_test_gb.rename(columns={0: 'Loan_Status'},inplace=True)
df_gb = pd.concat([test,y_pred_test_gb],axis=1)
df_gb.to_csv('Gradient Boosting os Hyp.csv', header=True, index=True)