# Vehicle loan default prediction

# Introduction

Financial institutions incur significant losses when vehicle loans default. This has led to tighter vehicle loan underwriting and higher rejection rates, and these institutions have also called for a better credit risk scoring model. This warrants a study of the determinants of vehicle loan default. A financial institution has hired you to accurately predict the probability that a loanee/borrower defaults on the first EMI (Equated Monthly Instalment) of a vehicle loan on its due date.

# Data source

The data was obtained from Kaggle, a third-party data website, and originates from L&T Financial Services.

The datasets provide the following information about the loan and the loanee:
* Loanee Information (Demographic data like age, Identity proof etc.)
* Loan Information (Disbursal details, loan to value ratio etc.)
* Bureau data & history (Bureau score, number of active accounts, the status of other loans, credit history etc.)

Link to the data source: [L&T vehicle loan default data](https://www.kaggle.com/datasets/mamtadhaker/lt-vehicle-loan-default-prediction)

# Data import

In [4]:
import pandas as pd
import re

loan_train_df = pd.read_csv("./data/train.csv")
loan_test_df = pd.read_csv("./data/test.csv")

pd.set_option('display.max_columns', None)

# Structure of data

In [5]:
loan_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233154 entries, 0 to 233153
Data columns (total 41 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   UniqueID                             233154 non-null  int64  
 1   disbursed_amount                     233154 non-null  int64  
 2   asset_cost                           233154 non-null  int64  
 3   ltv                                  233154 non-null  float64
 4   branch_id                            233154 non-null  int64  
 5   supplier_id                          233154 non-null  int64  
 6   manufacturer_id                      233154 non-null  int64  
 7   Current_pincode_ID                   233154 non-null  int64  
 8   Date.of.Birth                        233154 non-null  object 
 9   Employment.Type                      225493 non-null  object 
 10  DisbursalDate                        233154 non-null  object 
 11  State_ID     

There are 41 columns and 233154 rows in total.

In [6]:
loan_train_df.head(7)

Unnamed: 0,UniqueID,disbursed_amount,asset_cost,ltv,branch_id,supplier_id,manufacturer_id,Current_pincode_ID,Date.of.Birth,Employment.Type,DisbursalDate,State_ID,Employee_code_ID,MobileNo_Avl_Flag,Aadhar_flag,PAN_flag,VoterID_flag,Driving_flag,Passport_flag,PERFORM_CNS.SCORE,PERFORM_CNS.SCORE.DESCRIPTION,PRI.NO.OF.ACCTS,PRI.ACTIVE.ACCTS,PRI.OVERDUE.ACCTS,PRI.CURRENT.BALANCE,PRI.SANCTIONED.AMOUNT,PRI.DISBURSED.AMOUNT,SEC.NO.OF.ACCTS,SEC.ACTIVE.ACCTS,SEC.OVERDUE.ACCTS,SEC.CURRENT.BALANCE,SEC.SANCTIONED.AMOUNT,SEC.DISBURSED.AMOUNT,PRIMARY.INSTAL.AMT,SEC.INSTAL.AMT,NEW.ACCTS.IN.LAST.SIX.MONTHS,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,AVERAGE.ACCT.AGE,CREDIT.HISTORY.LENGTH,NO.OF_INQUIRIES,loan_default
0,420825,50578,58400,89.55,67,22807,45,1441,01-01-84,Salaried,03-08-18,6,1998,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,0,0
1,537409,47145,65550,73.23,67,22807,45,1502,31-07-85,Self employed,26-09-18,6,1998,1,1,0,0,0,0,598,I-Medium Risk,1,1,1,27600,50200,50200,0,0,0,0,0,0,1991,0,0,1,1yrs 11mon,1yrs 11mon,0,1
2,417566,53278,61360,89.63,67,22807,45,1497,24-08-85,Self employed,01-08-18,6,1998,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,0,0
3,624493,57513,66113,88.48,67,22807,45,1501,30-12-93,Self employed,26-10-18,6,1998,1,1,0,0,0,0,305,L-Very High Risk,3,0,0,0,0,0,0,0,0,0,0,0,31,0,0,0,0yrs 8mon,1yrs 3mon,1,1
4,539055,52378,60300,88.39,67,22807,45,1495,09-12-77,Self employed,26-09-18,6,1998,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,1,1
5,518279,54513,61900,89.66,67,22807,45,1501,08-09-90,Self employed,19-09-18,6,1998,1,1,0,0,0,0,825,A-Very Low Risk,2,0,0,0,0,0,0,0,0,0,0,0,1347,0,0,0,1yrs 9mon,2yrs 0mon,0,0
6,529269,46349,61500,76.42,67,22807,45,1502,01-06-88,Salaried,23-09-18,6,1998,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,0,0


Checking the test dataset.

In [7]:
loan_test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112392 entries, 0 to 112391
Data columns (total 40 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   UniqueID                             112392 non-null  int64  
 1   disbursed_amount                     112392 non-null  int64  
 2   asset_cost                           112392 non-null  int64  
 3   ltv                                  112392 non-null  float64
 4   branch_id                            112392 non-null  int64  
 5   supplier_id                          112392 non-null  int64  
 6   manufacturer_id                      112392 non-null  int64  
 7   Current_pincode_ID                   112392 non-null  int64  
 8   Date.of.Birth                        112392 non-null  object 
 9   Employment.Type                      108949 non-null  object 
 10  DisbursalDate                        112392 non-null  object 
 11  State_ID     

The test set contains 40 columns because it doesn't have the `loan_default` column.
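
The difference between the two column sets can also be checked programmatically instead of by reading the `info()` output. The following is a minimal sketch on toy frames (`train_df` and `test_df` are hypothetical stand-ins, not the project data) using `Index.difference`:

```python
import pandas as pd

# Toy stand-ins for the train and test frames (hypothetical values).
train_df = pd.DataFrame(
    {"UniqueID": [1, 2], "ltv": [89.55, 73.23], "loan_default": [0, 1]}
)
test_df = pd.DataFrame({"UniqueID": [3, 4], "ltv": [80.10, 65.00]})

# Columns present in one frame but missing from the other.
only_in_train = train_df.columns.difference(test_df.columns).tolist()
only_in_test = test_df.columns.difference(train_df.columns).tolist()

print(only_in_train)  # ['loan_default']
print(only_in_test)   # []
```

Applied to the real frames, the same two lines should list `loan_default` as the only column missing from the test set, if the description above holds.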

# Preprocessing

Creating a separate label set "Y_train".

In [12]:
Y_train = loan_train_df["loan_default"]

In [13]:
Y_train

0         0
1         1
2         0
3         1
4         1
         ..
233149    0
233150    0
233151    0
233152    0
233153    0
Name: loan_default, Length: 233154, dtype: int64

In [14]:
# create a separate feature set without the label column
X_train = loan_train_df.copy()
X_train = X_train.drop("loan_default", axis=1)

## Removing null values

In [8]:
def fill_emp_nan(data_df):
    data_df = data_df.copy()
    
    # replace missing values (NaN) with "Unemployed"
    data_df["Employment.Type"] = data_df["Employment.Type"].fillna("Unemployed")
    return data_df


Checking for missing values.

In [16]:
loan_test_df.isna().sum()

UniqueID                                  0
disbursed_amount                          0
asset_cost                                0
ltv                                       0
branch_id                                 0
supplier_id                               0
manufacturer_id                           0
Current_pincode_ID                        0
Date.of.Birth                             0
Employment.Type                        3443
DisbursalDate                             0
State_ID                                  0
Employee_code_ID                          0
MobileNo_Avl_Flag                         0
Aadhar_flag                               0
PAN_flag                                  0
VoterID_flag                              0
Driving_flag                              0
Passport_flag                             0
PERFORM_CNS.SCORE                         0
PERFORM_CNS.SCORE.DESCRIPTION             0
PRI.NO.OF.ACCTS                           0
PRI.ACTIVE.ACCTS                

In [15]:
X_train.isna().sum()

UniqueID                                  0
disbursed_amount                          0
asset_cost                                0
ltv                                       0
branch_id                                 0
supplier_id                               0
manufacturer_id                           0
Current_pincode_ID                        0
Date.of.Birth                             0
Employment.Type                        7661
DisbursalDate                             0
State_ID                                  0
Employee_code_ID                          0
MobileNo_Avl_Flag                         0
Aadhar_flag                               0
PAN_flag                                  0
VoterID_flag                              0
Driving_flag                              0
Passport_flag                             0
PERFORM_CNS.SCORE                         0
PERFORM_CNS.SCORE.DESCRIPTION             0
PRI.NO.OF.ACCTS                           0
PRI.ACTIVE.ACCTS                

Only `Employment.Type` has missing values: 7661 in the training set. The test set also contains missing values in this column (3443).

Let's check which values `Employment.Type` takes.

In [17]:
loan_train_df["Employment.Type"].unique()

array(['Salaried', 'Self employed', nan], dtype=object)

So the column holds two categories, "Salaried" and "Self employed", plus missing values (`nan`). Since both categories describe employed people and `Employment.Type` is the only column with missing values, we assume that `nan` refers to unemployed people.
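
The data alone cannot confirm this assumption. One hedged sanity check is to compare the group size and default rate of the filled-in group with the two known categories. The sketch below uses a small made-up frame (`toy` and its values are hypothetical), not the project data:

```python
import pandas as pd

# Hypothetical toy data: None marks a missing employment type.
toy = pd.DataFrame(
    {
        "Employment.Type": ["Salaried", "Self employed", None, "Salaried", None, "Self employed"],
        "loan_default": [0, 1, 1, 0, 0, 1],
    }
)

# Label the missing entries the same way fill_emp_nan does.
filled = toy["Employment.Type"].fillna("Unemployed")

# Group size and default rate per employment type.
summary = toy.groupby(filled)["loan_default"].agg(["count", "mean"])
print(summary)
```

If the filled-in group behaved very differently from what one would expect of unemployed borrowers, that would be a reason to treat it simply as a separate "missing" category.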

In [18]:
X_train = fill_emp_nan(X_train)
X_test = fill_emp_nan(loan_test_df)

In [19]:
print("missing values in training data: ", X_train.isna().sum().sum())
print("missing values in test data: ", X_test.isna().sum().sum())

missing values in training data:  0
missing values in test data:  0


## Dropping non-usable columns

In [9]:
def drop_cols(data_df, columns):
    data_df = data_df.copy()
    
    # drop the given columns (e.g. UniqueID, supplier_id, Current_pincode_ID, Employee_code_ID)
    data_df = data_df.drop(columns, axis=1)
    
    return data_df

Let's first focus on the categorical columns (mainly those with many distinct values), leaving out the binary flag columns.

In [21]:
cat_columns = ["UniqueID", "branch_id", "supplier_id", "manufacturer_id", "Current_pincode_ID", "Employment.Type", "State_ID", "Employee_code_ID", "PERFORM_CNS.SCORE.DESCRIPTION"]

{column: len(X_train[column].unique()) for column in cat_columns}

{'UniqueID': 233154,
 'branch_id': 82,
 'supplier_id': 2953,
 'manufacturer_id': 11,
 'Current_pincode_ID': 6698,
 'Employment.Type': 3,
 'State_ID': 22,
 'Employee_code_ID': 3270,
 'PERFORM_CNS.SCORE.DESCRIPTION': 20}

In [20]:
X_train.head()

Unnamed: 0,UniqueID,disbursed_amount,asset_cost,ltv,branch_id,supplier_id,manufacturer_id,Current_pincode_ID,Date.of.Birth,Employment.Type,DisbursalDate,State_ID,Employee_code_ID,MobileNo_Avl_Flag,Aadhar_flag,PAN_flag,VoterID_flag,Driving_flag,Passport_flag,PERFORM_CNS.SCORE,PERFORM_CNS.SCORE.DESCRIPTION,PRI.NO.OF.ACCTS,PRI.ACTIVE.ACCTS,PRI.OVERDUE.ACCTS,PRI.CURRENT.BALANCE,PRI.SANCTIONED.AMOUNT,PRI.DISBURSED.AMOUNT,SEC.NO.OF.ACCTS,SEC.ACTIVE.ACCTS,SEC.OVERDUE.ACCTS,SEC.CURRENT.BALANCE,SEC.SANCTIONED.AMOUNT,SEC.DISBURSED.AMOUNT,PRIMARY.INSTAL.AMT,SEC.INSTAL.AMT,NEW.ACCTS.IN.LAST.SIX.MONTHS,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,AVERAGE.ACCT.AGE,CREDIT.HISTORY.LENGTH,NO.OF_INQUIRIES
0,420825,50578,58400,89.55,67,22807,45,1441,01-01-84,Salaried,03-08-18,6,1998,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,0
1,537409,47145,65550,73.23,67,22807,45,1502,31-07-85,Self employed,26-09-18,6,1998,1,1,0,0,0,0,598,I-Medium Risk,1,1,1,27600,50200,50200,0,0,0,0,0,0,1991,0,0,1,1yrs 11mon,1yrs 11mon,0
2,417566,53278,61360,89.63,67,22807,45,1497,24-08-85,Self employed,01-08-18,6,1998,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,0
3,624493,57513,66113,88.48,67,22807,45,1501,30-12-93,Self employed,26-10-18,6,1998,1,1,0,0,0,0,305,L-Very High Risk,3,0,0,0,0,0,0,0,0,0,0,0,31,0,0,0,0yrs 8mon,1yrs 3mon,1
4,539055,52378,60300,88.39,67,22807,45,1495,09-12-77,Self employed,26-09-18,6,1998,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,1


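Drop the `UniqueID` column, since it is unique for every row and carries no information the model can generalize from.

Drop `supplier_id`, `Current_pincode_ID`, and `Employee_code_ID` because of their high cardinality (2953, 6698, and 3270 unique values respectively). One-hot encoding them would add thousands of columns and make training the model take too much time, which is better avoided for this project.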
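
To make the cost of high cardinality concrete, the sketch below one-hot encodes a small made-up frame (`toy` is hypothetical, not the project data) and shows that the number of resulting columns equals the total number of distinct values:

```python
import pandas as pd

# Hypothetical toy data: one low-cardinality and one high-cardinality column.
toy = pd.DataFrame(
    {
        "manufacturer_id": [45, 45, 86, 86, 51, 45],
        "supplier_id": [22807, 22808, 22809, 22810, 22811, 22812],
    }
)

# Each distinct value becomes one dummy column.
expected_width = int(toy.nunique().sum())
wide = pd.get_dummies(toy, columns=["manufacturer_id", "supplier_id"], dtype="int")

print(expected_width)  # 9
print(wide.shape)      # (6, 9)
```

By the same arithmetic, keeping the three dropped ID columns would add 2953 + 6698 + 3270 = 12921 dummy columns to the training set.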

In [22]:
X_train = drop_cols(X_train, ["UniqueID", "supplier_id", "Current_pincode_ID", "Employee_code_ID"])
X_test = drop_cols(X_test, ["UniqueID", "supplier_id", "Current_pincode_ID", "Employee_code_ID"])

In [23]:
X_train.head()

Unnamed: 0,disbursed_amount,asset_cost,ltv,branch_id,manufacturer_id,Date.of.Birth,Employment.Type,DisbursalDate,State_ID,MobileNo_Avl_Flag,Aadhar_flag,PAN_flag,VoterID_flag,Driving_flag,Passport_flag,PERFORM_CNS.SCORE,PERFORM_CNS.SCORE.DESCRIPTION,PRI.NO.OF.ACCTS,PRI.ACTIVE.ACCTS,PRI.OVERDUE.ACCTS,PRI.CURRENT.BALANCE,PRI.SANCTIONED.AMOUNT,PRI.DISBURSED.AMOUNT,SEC.NO.OF.ACCTS,SEC.ACTIVE.ACCTS,SEC.OVERDUE.ACCTS,SEC.CURRENT.BALANCE,SEC.SANCTIONED.AMOUNT,SEC.DISBURSED.AMOUNT,PRIMARY.INSTAL.AMT,SEC.INSTAL.AMT,NEW.ACCTS.IN.LAST.SIX.MONTHS,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,AVERAGE.ACCT.AGE,CREDIT.HISTORY.LENGTH,NO.OF_INQUIRIES
0,50578,58400,89.55,67,45,01-01-84,Salaried,03-08-18,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,0
1,47145,65550,73.23,67,45,31-07-85,Self employed,26-09-18,6,1,1,0,0,0,0,598,I-Medium Risk,1,1,1,27600,50200,50200,0,0,0,0,0,0,1991,0,0,1,1yrs 11mon,1yrs 11mon,0
2,53278,61360,89.63,67,45,24-08-85,Self employed,01-08-18,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,0
3,57513,66113,88.48,67,45,30-12-93,Self employed,26-10-18,6,1,1,0,0,0,0,305,L-Very High Risk,3,0,0,0,0,0,0,0,0,0,0,0,31,0,0,0,0yrs 8mon,1yrs 3mon,1
4,52378,60300,88.39,67,45,09-12-77,Self employed,26-09-18,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,1


In [24]:
# remove "Date.of.Birth", "DisbursalDate" columns
X_train = drop_cols(X_train, ["Date.of.Birth", "DisbursalDate"])
X_test = drop_cols(X_test, ["Date.of.Birth", "DisbursalDate"])
X_train.head()

Unnamed: 0,disbursed_amount,asset_cost,ltv,branch_id,manufacturer_id,Employment.Type,State_ID,MobileNo_Avl_Flag,Aadhar_flag,PAN_flag,VoterID_flag,Driving_flag,Passport_flag,PERFORM_CNS.SCORE,PERFORM_CNS.SCORE.DESCRIPTION,PRI.NO.OF.ACCTS,PRI.ACTIVE.ACCTS,PRI.OVERDUE.ACCTS,PRI.CURRENT.BALANCE,PRI.SANCTIONED.AMOUNT,PRI.DISBURSED.AMOUNT,SEC.NO.OF.ACCTS,SEC.ACTIVE.ACCTS,SEC.OVERDUE.ACCTS,SEC.CURRENT.BALANCE,SEC.SANCTIONED.AMOUNT,SEC.DISBURSED.AMOUNT,PRIMARY.INSTAL.AMT,SEC.INSTAL.AMT,NEW.ACCTS.IN.LAST.SIX.MONTHS,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,AVERAGE.ACCT.AGE,CREDIT.HISTORY.LENGTH,NO.OF_INQUIRIES
0,50578,58400,89.55,67,45,Salaried,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,0
1,47145,65550,73.23,67,45,Self employed,6,1,1,0,0,0,0,598,I-Medium Risk,1,1,1,27600,50200,50200,0,0,0,0,0,0,1991,0,0,1,1yrs 11mon,1yrs 11mon,0
2,53278,61360,89.63,67,45,Self employed,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,0
3,57513,66113,88.48,67,45,Self employed,6,1,1,0,0,0,0,305,L-Very High Risk,3,0,0,0,0,0,0,0,0,0,0,0,31,0,0,0,0yrs 8mon,1yrs 3mon,1
4,52378,60300,88.39,67,45,Self employed,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0yrs 0mon,1


## Extracting year, month in separate columns

In [10]:
def extract_month_year(data_df, column):
    data_df = data_df.copy()
    # raw strings avoid invalid-escape warnings for "\d"
    pattern_year = r"(\d+)yrs"
    pattern_month = r"(\d+)mon"
    data_df[column + "_year"] = data_df[column].apply(lambda x: re.findall(pattern_year, x)[0])
    data_df[column + "_month"] = data_df[column].apply(lambda x: re.findall(pattern_month, x)[0])
    data_df[column + "_year"] = data_df[column + "_year"].astype("int")
    data_df[column + "_month"] = data_df[column + "_month"].astype("int")
    data_df = data_df.drop(column, axis=1)
    return data_df

The `Date.of.Birth` and `DisbursalDate` columns were dropped in the previous step because, as raw date strings, they cannot be used for generalization.

In [25]:
# extract year and month in separate columns from "AVERAGE.ACCT.AGE"
X_train = extract_month_year(X_train, "AVERAGE.ACCT.AGE")
X_test = extract_month_year(X_test, "AVERAGE.ACCT.AGE")
X_train.head()

Unnamed: 0,disbursed_amount,asset_cost,ltv,branch_id,manufacturer_id,Employment.Type,State_ID,MobileNo_Avl_Flag,Aadhar_flag,PAN_flag,VoterID_flag,Driving_flag,Passport_flag,PERFORM_CNS.SCORE,PERFORM_CNS.SCORE.DESCRIPTION,PRI.NO.OF.ACCTS,PRI.ACTIVE.ACCTS,PRI.OVERDUE.ACCTS,PRI.CURRENT.BALANCE,PRI.SANCTIONED.AMOUNT,PRI.DISBURSED.AMOUNT,SEC.NO.OF.ACCTS,SEC.ACTIVE.ACCTS,SEC.OVERDUE.ACCTS,SEC.CURRENT.BALANCE,SEC.SANCTIONED.AMOUNT,SEC.DISBURSED.AMOUNT,PRIMARY.INSTAL.AMT,SEC.INSTAL.AMT,NEW.ACCTS.IN.LAST.SIX.MONTHS,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,CREDIT.HISTORY.LENGTH,NO.OF_INQUIRIES,AVERAGE.ACCT.AGE_year,AVERAGE.ACCT.AGE_month
0,50578,58400,89.55,67,45,Salaried,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0,0,0
1,47145,65550,73.23,67,45,Self employed,6,1,1,0,0,0,0,598,I-Medium Risk,1,1,1,27600,50200,50200,0,0,0,0,0,0,1991,0,0,1,1yrs 11mon,0,1,11
2,53278,61360,89.63,67,45,Self employed,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,0,0,0
3,57513,66113,88.48,67,45,Self employed,6,1,1,0,0,0,0,305,L-Very High Risk,3,0,0,0,0,0,0,0,0,0,0,0,31,0,0,0,1yrs 3mon,1,0,8
4,52378,60300,88.39,67,45,Self employed,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0yrs 0mon,1,0,0


In [26]:
# extract year and month in separate columns from "CREDIT.HISTORY.LENGTH"
X_train = extract_month_year(X_train, "CREDIT.HISTORY.LENGTH")
X_test = extract_month_year(X_test, "CREDIT.HISTORY.LENGTH")
X_train.head()

Unnamed: 0,disbursed_amount,asset_cost,ltv,branch_id,manufacturer_id,Employment.Type,State_ID,MobileNo_Avl_Flag,Aadhar_flag,PAN_flag,VoterID_flag,Driving_flag,Passport_flag,PERFORM_CNS.SCORE,PERFORM_CNS.SCORE.DESCRIPTION,PRI.NO.OF.ACCTS,PRI.ACTIVE.ACCTS,PRI.OVERDUE.ACCTS,PRI.CURRENT.BALANCE,PRI.SANCTIONED.AMOUNT,PRI.DISBURSED.AMOUNT,SEC.NO.OF.ACCTS,SEC.ACTIVE.ACCTS,SEC.OVERDUE.ACCTS,SEC.CURRENT.BALANCE,SEC.SANCTIONED.AMOUNT,SEC.DISBURSED.AMOUNT,PRIMARY.INSTAL.AMT,SEC.INSTAL.AMT,NEW.ACCTS.IN.LAST.SIX.MONTHS,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,NO.OF_INQUIRIES,AVERAGE.ACCT.AGE_year,AVERAGE.ACCT.AGE_month,CREDIT.HISTORY.LENGTH_year,CREDIT.HISTORY.LENGTH_month
0,50578,58400,89.55,67,45,Salaried,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,47145,65550,73.23,67,45,Self employed,6,1,1,0,0,0,0,598,I-Medium Risk,1,1,1,27600,50200,50200,0,0,0,0,0,0,1991,0,0,1,0,1,11,1,11
2,53278,61360,89.63,67,45,Self employed,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,57513,66113,88.48,67,45,Self employed,6,1,1,0,0,0,0,305,L-Very High Risk,3,0,0,0,0,0,0,0,0,0,0,0,31,0,0,0,1,0,8,1,3
4,52378,60300,88.39,67,45,Self employed,6,1,1,0,0,0,0,0,No Bureau History Available,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
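
The same split can be written without a per-row `apply`. The sketch below is an alternative, not the function used in this notebook: it uses `Series.str.extract` with named groups on a toy column (the frame `toy` is hypothetical) and assumes every value follows the `<n>yrs <m>mon` pattern seen in the data.

```python
import pandas as pd

# Hypothetical toy column in the same "<n>yrs <m>mon" format.
toy = pd.DataFrame({"AVERAGE.ACCT.AGE": ["0yrs 0mon", "1yrs 11mon", "0yrs 8mon"]})

# One vectorized regex pass; the named groups become column names.
parts = (
    toy["AVERAGE.ACCT.AGE"]
    .str.extract(r"(?P<year>\d+)yrs\s+(?P<month>\d+)mon")
    .astype(int)
)

toy["AVERAGE.ACCT.AGE_year"] = parts["year"]
toy["AVERAGE.ACCT.AGE_month"] = parts["month"]
toy = toy.drop(columns="AVERAGE.ACCT.AGE")
print(toy)
```

A value that does not match the pattern would yield `NaN` here and make `astype(int)` fail, which surfaces malformed entries early.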


## One hot encoding

In [11]:
def one_hot_encode(data_df, columns):
    for column in columns:
        dummies_df = pd.get_dummies(data_df[column], prefix=column, dtype="int")
        data_df = pd.concat([data_df, dummies_df], axis=1)
        data_df = data_df.drop(column, axis=1)
    return data_df
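
For comparison, `pd.get_dummies` can encode several columns in one call through its `columns` parameter. The sketch below runs it on a small made-up frame (`toy` is hypothetical); it is an equivalent shortcut under that assumption, not the code used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with one numeric and two categorical columns.
toy = pd.DataFrame(
    {
        "ltv": [89.55, 73.23, 88.48],
        "Employment.Type": ["Salaried", "Self employed", "Unemployed"],
        "State_ID": [6, 6, 4],
    }
)

# Encode both categorical columns at once; the column name is used as prefix.
encoded = pd.get_dummies(toy, columns=["Employment.Type", "State_ID"], dtype="int")
print(encoded.columns.tolist())
```

The untouched columns come first and the dummy columns follow in the order of the `columns` list, which matches the layout the loop above produces.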

In [27]:
X_train = one_hot_encode(X_train, ["branch_id", "manufacturer_id", "Employment.Type", "State_ID", "PERFORM_CNS.SCORE.DESCRIPTION"])
X_test = one_hot_encode(X_test, ["branch_id", "manufacturer_id", "Employment.Type", "State_ID", "PERFORM_CNS.SCORE.DESCRIPTION"])

X_train.head()

Unnamed: 0,disbursed_amount,asset_cost,ltv,MobileNo_Avl_Flag,Aadhar_flag,PAN_flag,VoterID_flag,Driving_flag,Passport_flag,PERFORM_CNS.SCORE,PRI.NO.OF.ACCTS,PRI.ACTIVE.ACCTS,PRI.OVERDUE.ACCTS,PRI.CURRENT.BALANCE,PRI.SANCTIONED.AMOUNT,PRI.DISBURSED.AMOUNT,SEC.NO.OF.ACCTS,SEC.ACTIVE.ACCTS,SEC.OVERDUE.ACCTS,SEC.CURRENT.BALANCE,SEC.SANCTIONED.AMOUNT,SEC.DISBURSED.AMOUNT,PRIMARY.INSTAL.AMT,SEC.INSTAL.AMT,NEW.ACCTS.IN.LAST.SIX.MONTHS,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,NO.OF_INQUIRIES,AVERAGE.ACCT.AGE_year,AVERAGE.ACCT.AGE_month,CREDIT.HISTORY.LENGTH_year,CREDIT.HISTORY.LENGTH_month,branch_id_1,branch_id_2,branch_id_3,branch_id_5,branch_id_7,branch_id_8,branch_id_9,branch_id_10,branch_id_11,branch_id_13,branch_id_14,branch_id_15,branch_id_16,branch_id_17,branch_id_18,branch_id_19,branch_id_20,branch_id_29,branch_id_34,branch_id_35,branch_id_36,branch_id_42,branch_id_43,branch_id_48,branch_id_61,branch_id_62,branch_id_63,branch_id_64,branch_id_65,branch_id_66,branch_id_67,branch_id_68,branch_id_69,branch_id_70,branch_id_72,branch_id_73,branch_id_74,branch_id_76,branch_id_77,branch_id_78,branch_id_79,branch_id_82,branch_id_84,branch_id_85,branch_id_97,branch_id_100,branch_id_101,branch_id_103,branch_id_104,branch_id_105,branch_id_111,branch_id_117,branch_id_120,branch_id_121,branch_id_130,branch_id_135,branch_id_136,branch_id_138,branch_id_142,branch_id_146,branch_id_147,branch_id_152,branch_id_153,branch_id_158,branch_id_159,branch_id_160,branch_id_162,branch_id_165,branch_id_202,branch_id_207,branch_id_217,branch_id_248,branch_id_249,branch_id_250,branch_id_251,branch_id_254,branch_id_255,branch_id_257,branch_id_258,branch_id_259,branch_id_260,branch_id_261,manufacturer_id_45,manufacturer_id_48,manufacturer_id_49,manufacturer_id_51,manufacturer_id_67,manufacturer_id_86,manufacturer_id_120,manufacturer_id_145,manufacturer_id_152,manufacturer_id_153,manufacturer_id_156,Employment.Type_Salaried,Employment.Type_Self employed,Employment.Type_Unemployed,State_ID_1,State_ID_2,State_ID_3,State_ID_4,State_ID_5,State_ID_6,State_ID_7,State_ID_8,State_ID_9,State_ID_10,State_ID_11,State_ID_12,State_ID_13,State_ID_14,State_ID_15,State_ID_16,State_ID_17,State_ID_18,State_ID_19,State_ID_20,State_ID_21,State_ID_22,PERFORM_CNS.SCORE.DESCRIPTION_A-Very Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_B-Very Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_C-Very Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_D-Very Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_E-Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_F-Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_G-Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_H-Medium Risk,PERFORM_CNS.SCORE.DESCRIPTION_I-Medium Risk,PERFORM_CNS.SCORE.DESCRIPTION_J-High Risk,PERFORM_CNS.SCORE.DESCRIPTION_K-High Risk,PERFORM_CNS.SCORE.DESCRIPTION_L-Very High Risk,PERFORM_CNS.SCORE.DESCRIPTION_M-Very High Risk,PERFORM_CNS.SCORE.DESCRIPTION_No Bureau History Available,PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: More than 50 active Accounts found,PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: No Activity seen on the customer (Inactive),PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: No Updates available in last 36 months,PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: Not Enough Info available on the customer,PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: Only a Guarantor,PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: Sufficient History Not Available
0,50578,58400,89.55,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,47145,65550,73.23,1,1,0,0,0,0,598,1,1,1,27600,50200,50200,0,0,0,0,0,0,1991,0,0,1,0,1,11,1,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,53278,61360,89.63,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
3,57513,66113,88.48,1,1,0,0,0,0,305,3,0,0,0,0,0,0,0,0,0,0,0,31,0,0,0,1,0,8,1,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
4,52378,60300,88.39,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


## Adjusting non-identical columns in train set and test set

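Because the train and test sets were one-hot encoded separately, a category that appears in only one of them produces a column the other lacks. Check for such non-identical columns and align the test set to the train set.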

In [34]:
def adjust_columns(X_train, X_test):
    
    X_train = X_train.copy()
    X_test = X_test.copy()

    extra_train_set_col = []
    extra_test_set_col = []
    
    # print("--columns in train set but not in test set--")
    for column in X_train.columns:
        if column not in X_test.columns:
            # print(column)
            extra_train_set_col.append(column)
    #print("\n")
    
    
    # print("--columns in test set but not in train set--")
    for column in X_test.columns:
        if column not in X_train.columns:
            # print(column)
            extra_test_set_col.append(column)

    # Remove columns that are present in the test set but not in the train set, because the model will be trained on the train set.
    X_test = X_test.drop(extra_test_set_col, axis=1)

    # Add columns to the test set that are present in the train set but not in the test set.
    for column in extra_train_set_col:
        X_test[column] = 0

    X_test = X_test.reindex(X_train.columns, axis=1)

    return X_train, X_test


In [35]:
X_train, X_test = adjust_columns(X_train, X_test)
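
One-hot encoding the two sets separately can produce different columns whenever a category occurs in only one of them. The toy sketch below (the frames `train` and `test` are hypothetical, not the project data) reproduces that mismatch and shows a shorter way to get the same alignment: `DataFrame.reindex` with `fill_value=0` drops test-only columns and adds train-only columns filled with zeros in one call. It is an alternative to the `adjust_columns` function above, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy frames: branch 1 occurs only in train, branch 3 only in test.
train = pd.DataFrame({"ltv": [89.55, 73.23], "branch_id": [1, 2]})
test = pd.DataFrame({"ltv": [80.0, 65.0], "branch_id": [2, 3]})

train_enc = pd.get_dummies(train, columns=["branch_id"], dtype="int")
test_enc = pd.get_dummies(test, columns=["branch_id"], dtype="int")
print(train_enc.columns.tolist())  # ['ltv', 'branch_id_1', 'branch_id_2']
print(test_enc.columns.tolist())   # ['ltv', 'branch_id_2', 'branch_id_3']

# Align the test set to the train columns: drop extras, add missing as zeros.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.columns.equals(train_enc.columns))  # True
```

`Index.equals` compares both the names and their order, so it also serves as a compact check that the two sets line up.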

In [39]:
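def column_check(X_train, X_test):
    # compare the column names (and their order) of both sets;
    # tolist must be called, otherwise the methods themselves are compared
    return X_train.columns.tolist() == X_test.columns.tolist()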
            

In [40]:
column_check(X_train, X_test)

True

## Scaling

In [46]:
from sklearn.preprocessing import StandardScaler

stscaler = StandardScaler()

stscaler.fit(X_train)

X_train_scaled = pd.DataFrame(stscaler.transform(X_train), index=X_train.index, columns=X_train.columns)
X_test_scaled = pd.DataFrame(stscaler.transform(X_test), index=X_test.index, columns=X_test.columns)
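
To show what the scaling step computes, the sketch below redoes it by hand in plain pandas on toy data (the frames `train` and `test` are hypothetical). It assumes the behaviour of `StandardScaler` as documented: each column is centered on the training mean and divided by the training population standard deviation (`ddof=0`), and a constant column is left with a scale of 1 so that it maps to 0. The statistics come from the train set only and are then reused for the test set.

```python
import pandas as pd

# Hypothetical toy data; MobileNo_Avl_Flag is constant, as a flag column can be.
train = pd.DataFrame({"ltv": [60.0, 80.0, 100.0], "MobileNo_Avl_Flag": [1, 1, 1]})
test = pd.DataFrame({"ltv": [80.0, 120.0], "MobileNo_Avl_Flag": [1, 1]})

# "Fit": statistics are computed on the train set only.
mean = train.mean()
std = train.std(ddof=0).replace(0, 1.0)  # avoid dividing by zero for constant columns

# "Transform": the same train statistics are applied to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.round(3))
print(test_scaled.round(3))
```

A constant training column ends up as all zeros after scaling, which is one possible reading of the `0.0` values in a scaled flag column.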

In [47]:
X_train_scaled.head()

Unnamed: 0,disbursed_amount,asset_cost,ltv,MobileNo_Avl_Flag,Aadhar_flag,PAN_flag,VoterID_flag,Driving_flag,Passport_flag,PERFORM_CNS.SCORE,PRI.NO.OF.ACCTS,PRI.ACTIVE.ACCTS,PRI.OVERDUE.ACCTS,PRI.CURRENT.BALANCE,PRI.SANCTIONED.AMOUNT,PRI.DISBURSED.AMOUNT,SEC.NO.OF.ACCTS,SEC.ACTIVE.ACCTS,SEC.OVERDUE.ACCTS,SEC.CURRENT.BALANCE,SEC.SANCTIONED.AMOUNT,SEC.DISBURSED.AMOUNT,PRIMARY.INSTAL.AMT,SEC.INSTAL.AMT,NEW.ACCTS.IN.LAST.SIX.MONTHS,DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS,NO.OF_INQUIRIES,AVERAGE.ACCT.AGE_year,AVERAGE.ACCT.AGE_month,CREDIT.HISTORY.LENGTH_year,CREDIT.HISTORY.LENGTH_month,branch_id_1,branch_id_2,branch_id_3,branch_id_5,branch_id_7,branch_id_8,branch_id_9,branch_id_10,branch_id_11,branch_id_13,branch_id_14,branch_id_15,branch_id_16,branch_id_17,branch_id_18,branch_id_19,branch_id_20,branch_id_29,branch_id_34,branch_id_35,branch_id_36,branch_id_42,branch_id_43,branch_id_48,branch_id_61,branch_id_62,branch_id_63,branch_id_64,branch_id_65,branch_id_66,branch_id_67,branch_id_68,branch_id_69,branch_id_70,branch_id_72,branch_id_73,branch_id_74,branch_id_76,branch_id_77,branch_id_78,branch_id_79,branch_id_82,branch_id_84,branch_id_85,branch_id_97,branch_id_100,branch_id_101,branch_id_103,branch_id_104,branch_id_105,branch_id_111,branch_id_117,branch_id_120,branch_id_121,branch_id_130,branch_id_135,branch_id_136,branch_id_138,branch_id_142,branch_id_146,branch_id_147,branch_id_152,branch_id_153,branch_id_158,branch_id_159,branch_id_160,branch_id_162,branch_id_165,branch_id_202,branch_id_207,branch_id_217,branch_id_248,branch_id_249,branch_id_250,branch_id_251,branch_id_254,branch_id_255,branch_id_257,branch_id_258,branch_id_259,branch_id_260,branch_id_261,manufacturer_id_45,manufacturer_id_48,manufacturer_id_49,manufacturer_id_51,manufacturer_id_67,manufacturer_id_86,manufacturer_id_120,manufacturer_id_145,manufacturer_id_152,manufacturer_id_153,manufacturer_id_156,Employment.Type_Salaried,Employment.Type_Self employed,Employment.Type_Unemployed,State_ID_1,State_ID_2,State_ID_3,State_ID_4,State_ID_5,State_ID_6,State_ID_7,State_ID_8,State_ID_9,State_ID_10,State_ID_11,State_ID_12,State_ID_13,State_ID_14,State_ID_15,State_ID_16,State_ID_17,State_ID_18,State_ID_19,State_ID_20,State_ID_21,State_ID_22,PERFORM_CNS.SCORE.DESCRIPTION_A-Very Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_B-Very Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_C-Very Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_D-Very Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_E-Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_F-Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_G-Low Risk,PERFORM_CNS.SCORE.DESCRIPTION_H-Medium Risk,PERFORM_CNS.SCORE.DESCRIPTION_I-Medium Risk,PERFORM_CNS.SCORE.DESCRIPTION_J-High Risk,PERFORM_CNS.SCORE.DESCRIPTION_K-High Risk,PERFORM_CNS.SCORE.DESCRIPTION_L-Very High Risk,PERFORM_CNS.SCORE.DESCRIPTION_M-Very High Risk,PERFORM_CNS.SCORE.DESCRIPTION_No Bureau History Available,PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: More than 50 active Accounts found,PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: No Activity seen on the customer (Inactive),PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: No Updates available in last 36 months,PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: Not Enough Info available on the customer,PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: Only a Guarantor,PERFORM_CNS.SCORE.DESCRIPTION_Not Scored: Sufficient History Not Available
0,-0.291335,-0.921895,1.292133,0.0,0.435916,-0.285929,-0.411719,-0.154257,-0.046172,-0.855453,-0.467804,-0.535617,-0.285264,-0.176064,-0.09201,-0.091711,-0.094259,-0.087651,-0.065216,-0.031884,-0.039835,-0.039323,-0.086581,-0.020784,-0.399782,-0.253566,-0.29245,-0.454227,-0.735941,-0.505382,-0.705464,-0.158432,-0.244364,-0.203025,-0.202888,-0.118376,-0.116952,-0.104697,-0.134204,-0.140382,-0.113629,-0.08202,-0.113687,-0.16889,-0.070712,-0.148521,-0.160566,-0.139186,-0.081675,-0.18597,-0.0546,-0.198424,-0.108524,-0.050111,-0.143822,-0.146609,-0.054521,-0.091932,-0.08432,-0.123613,-0.036723,4.425166,-0.100121,-0.059044,-0.087138,-0.074706,-0.079217,-0.137025,-0.060668,-0.07897,-0.097196,-0.121885,-0.066775,-0.025875,-0.090064,-0.04088,-0.037705,-0.03976,-0.130054,-0.084449,-0.103349,-0.019541,-0.04898,-0.135605,-0.061692,-0.067868,-0.118021,-0.18645,-0.137916,-0.045087,-0.153629,-0.134783,-0.14702,-0.05456,-0.017206,-0.108746,-0.123541,-0.109411,-0.06632,-0.092994,-0.054442,-0.028027,-0.089115,-0.060775,-0.079954,-0.129473,-0.085677,-0.084423,-0.073595,-0.040083,-0.038551,-0.039976,-0.027485,1.765627,-0.277853,-0.21411,-0.363442,-0.102091,-0.941304,-0.207878,-0.057862,-0.005073,-0.007174,-0.002071,1.175829,-1.099815,-0.184322,-0.199635,-0.134783,-0.41374,-0.48817,-0.213639,2.441062,-0.173141,-0.254635,-0.271642,-0.125318,-0.172285,-0.135605,-0.288231,-0.205123,-0.148777,-0.107936,-0.131968,-0.154155,-0.066775,-0.02818,-0.025875,-0.018057,-0.253938,-0.202693,-0.271851,-0.226295,-0.160018,-0.194336,-0.131918,-0.174045,-0.156256,-0.12782,-0.191851,-0.069911,-0.197769,0.996806,-0.003587,-0.111932,-0.081381,-0.126496,-0.064836,-0.128114
1,-0.555997,-0.544482,-0.132372,0.0,0.435916,-0.285929,-0.411719,-0.154257,-0.046172,0.911822,-0.276131,-0.020549,1.536941,-0.146773,-0.070871,-0.070599,-0.094259,-0.087651,-0.065216,-0.031884,-0.039835,-0.039323,-0.073427,-0.020784,-0.399782,2.347632,-0.29245,0.415299,2.322195,-0.064243,2.392561,-0.158432,-0.244364,-0.203025,-0.202888,-0.118376,-0.116952,-0.104697,-0.134204,-0.140382,-0.113629,-0.08202,-0.113687,-0.16889,-0.070712,-0.148521,-0.160566,-0.139186,-0.081675,-0.18597,-0.0546,-0.198424,-0.108524,-0.050111,-0.143822,-0.146609,-0.054521,-0.091932,-0.08432,-0.123613,-0.036723,4.425166,-0.100121,-0.059044,-0.087138,-0.074706,-0.079217,-0.137025,-0.060668,-0.07897,-0.097196,-0.121885,-0.066775,-0.025875,-0.090064,-0.04088,-0.037705,-0.03976,-0.130054,-0.084449,-0.103349,-0.019541,-0.04898,-0.135605,-0.061692,-0.067868,-0.118021,-0.18645,-0.137916,-0.045087,-0.153629,-0.134783,-0.14702,-0.05456,-0.017206,-0.108746,-0.123541,-0.109411,-0.06632,-0.092994,-0.054442,-0.028027,-0.089115,-0.060775,-0.079954,-0.129473,-0.085677,-0.084423,-0.073595,-0.040083,-0.038551,-0.039976,-0.027485,1.765627,-0.277853,-0.21411,-0.363442,-0.102091,-0.941304,-0.207878,-0.057862,-0.005073,-0.007174,-0.002071,-0.850464,0.909244,-0.184322,-0.199635,-0.134783,-0.41374,-0.48817,-0.213639,2.441062,-0.173141,-0.254635,-0.271642,-0.125318,-0.172285,-0.135605,-0.288231,-0.205123,-0.148777,-0.107936,-0.131968,-0.154155,-0.066775,-0.02818,-0.025875,-0.018057,-0.253938,-0.202693,-0.271851,-0.226295,-0.160018,-0.194336,-0.131918,-0.174045,6.399751,-0.12782,-0.191851,-0.069911,-0.197769,-1.003205,-0.003587,-0.111932,-0.081381,-0.126496,-0.064836,-0.128114
2,-0.083183,-0.765651,1.299116,0.0,0.435916,-0.285929,-0.411719,-0.154257,-0.046172,-0.855453,-0.467804,-0.535617,-0.285264,-0.176064,-0.09201,-0.091711,-0.094259,-0.087651,-0.065216,-0.031884,-0.039835,-0.039323,-0.086581,-0.020784,-0.399782,-0.253566,-0.29245,-0.454227,-0.735941,-0.505382,-0.705464,-0.158432,-0.244364,-0.203025,-0.202888,-0.118376,-0.116952,-0.104697,-0.134204,-0.140382,-0.113629,-0.08202,-0.113687,-0.16889,-0.070712,-0.148521,-0.160566,-0.139186,-0.081675,-0.18597,-0.0546,-0.198424,-0.108524,-0.050111,-0.143822,-0.146609,-0.054521,-0.091932,-0.08432,-0.123613,-0.036723,4.425166,-0.100121,-0.059044,-0.087138,-0.074706,-0.079217,-0.137025,-0.060668,-0.07897,-0.097196,-0.121885,-0.066775,-0.025875,-0.090064,-0.04088,-0.037705,-0.03976,-0.130054,-0.084449,-0.103349,-0.019541,-0.04898,-0.135605,-0.061692,-0.067868,-0.118021,-0.18645,-0.137916,-0.045087,-0.153629,-0.134783,-0.14702,-0.05456,-0.017206,-0.108746,-0.123541,-0.109411,-0.06632,-0.092994,-0.054442,-0.028027,-0.089115,-0.060775,-0.079954,-0.129473,-0.085677,-0.084423,-0.073595,-0.040083,-0.038551,-0.039976,-0.027485,1.765627,-0.277853,-0.21411,-0.363442,-0.102091,-0.941304,-0.207878,-0.057862,-0.005073,-0.007174,-0.002071,-0.850464,0.909244,-0.184322,-0.199635,-0.134783,-0.41374,-0.48817,-0.213639,2.441062,-0.173141,-0.254635,-0.271642,-0.125318,-0.172285,-0.135605,-0.288231,-0.205123,-0.148777,-0.107936,-0.131968,-0.154155,-0.066775,-0.02818,-0.025875,-0.018057,-0.253938,-0.202693,-0.271851,-0.226295,-0.160018,-0.194336,-0.131918,-0.174045,-0.156256,-0.12782,-0.191851,-0.069911,-0.197769,0.996806,-0.003587,-0.111932,-0.081381,-0.126496,-0.064836,-0.128114
3,0.243307,-0.514764,1.198738,0.0,0.435916,-0.285929,-0.411719,-0.154257,-0.046172,0.045917,0.107215,-0.535617,-0.285264,-0.176064,-0.09201,-0.091711,-0.094259,-0.087651,-0.065216,-0.031884,-0.039835,-0.039323,-0.086376,-0.020784,-0.399782,-0.253566,1.122986,-0.454227,1.488158,-0.064243,0.139452,-0.158432,-0.244364,-0.203025,-0.202888,-0.118376,-0.116952,-0.104697,-0.134204,-0.140382,-0.113629,-0.08202,-0.113687,-0.16889,-0.070712,-0.148521,-0.160566,-0.139186,-0.081675,-0.18597,-0.0546,-0.198424,-0.108524,-0.050111,-0.143822,-0.146609,-0.054521,-0.091932,-0.08432,-0.123613,-0.036723,4.425166,-0.100121,-0.059044,-0.087138,-0.074706,-0.079217,-0.137025,-0.060668,-0.07897,-0.097196,-0.121885,-0.066775,-0.025875,-0.090064,-0.04088,-0.037705,-0.03976,-0.130054,-0.084449,-0.103349,-0.019541,-0.04898,-0.135605,-0.061692,-0.067868,-0.118021,-0.18645,-0.137916,-0.045087,-0.153629,-0.134783,-0.14702,-0.05456,-0.017206,-0.108746,-0.123541,-0.109411,-0.06632,-0.092994,-0.054442,-0.028027,-0.089115,-0.060775,-0.079954,-0.129473,-0.085677,-0.084423,-0.073595,-0.040083,-0.038551,-0.039976,-0.027485,1.765627,-0.277853,-0.21411,-0.363442,-0.102091,-0.941304,-0.207878,-0.057862,-0.005073,-0.007174,-0.002071,-0.850464,0.909244,-0.184322,-0.199635,-0.134783,-0.41374,-0.48817,-0.213639,2.441062,-0.173141,-0.254635,-0.271642,-0.125318,-0.172285,-0.135605,-0.288231,-0.205123,-0.148777,-0.107936,-0.131968,-0.154155,-0.066775,-0.02818,-0.025875,-0.018057,-0.253938,-0.202693,-0.271851,-0.226295,-0.160018,-0.194336,-0.131918,-0.174045,-0.156256,-0.12782,-0.191851,14.303957,-0.197769,-1.003205,-0.003587,-0.111932,-0.081381,-0.126496,-0.064836,-0.128114
4,-0.152567,-0.821604,1.190882,0.0,0.435916,-0.285929,-0.411719,-0.154257,-0.046172,-0.855453,-0.467804,-0.535617,-0.285264,-0.176064,-0.09201,-0.091711,-0.094259,-0.087651,-0.065216,-0.031884,-0.039835,-0.039323,-0.086581,-0.020784,-0.399782,-0.253566,1.122986,-0.454227,-0.735941,-0.505382,-0.705464,-0.158432,-0.244364,-0.203025,-0.202888,-0.118376,-0.116952,-0.104697,-0.134204,-0.140382,-0.113629,-0.08202,-0.113687,-0.16889,-0.070712,-0.148521,-0.160566,-0.139186,-0.081675,-0.18597,-0.0546,-0.198424,-0.108524,-0.050111,-0.143822,-0.146609,-0.054521,-0.091932,-0.08432,-0.123613,-0.036723,4.425166,-0.100121,-0.059044,-0.087138,-0.074706,-0.079217,-0.137025,-0.060668,-0.07897,-0.097196,-0.121885,-0.066775,-0.025875,-0.090064,-0.04088,-0.037705,-0.03976,-0.130054,-0.084449,-0.103349,-0.019541,-0.04898,-0.135605,-0.061692,-0.067868,-0.118021,-0.18645,-0.137916,-0.045087,-0.153629,-0.134783,-0.14702,-0.05456,-0.017206,-0.108746,-0.123541,-0.109411,-0.06632,-0.092994,-0.054442,-0.028027,-0.089115,-0.060775,-0.079954,-0.129473,-0.085677,-0.084423,-0.073595,-0.040083,-0.038551,-0.039976,-0.027485,1.765627,-0.277853,-0.21411,-0.363442,-0.102091,-0.941304,-0.207878,-0.057862,-0.005073,-0.007174,-0.002071,-0.850464,0.909244,-0.184322,-0.199635,-0.134783,-0.41374,-0.48817,-0.213639,2.441062,-0.173141,-0.254635,-0.271642,-0.125318,-0.172285,-0.135605,-0.288231,-0.205123,-0.148777,-0.107936,-0.131968,-0.154155,-0.066775,-0.02818,-0.025875,-0.018057,-0.253938,-0.202693,-0.271851,-0.226295,-0.160018,-0.194336,-0.131918,-0.174045,-0.156256,-0.12782,-0.191851,-0.069911,-0.197769,0.996806,-0.003587,-0.111932,-0.081381,-0.126496,-0.064836,-0.128114


# Training

## Evaluation of model

In [53]:
from sklearn.model_selection import KFold
from sklearn.linear_model import SGDClassifier
import numpy as np

# 4-fold cross-validation on the scaled training features
kf = KFold(n_splits=4)
sgd_model = SGDClassifier(random_state=2)
result = []

for train_idx, test_idx in kf.split(X_train_scaled):
    # Refit on the training folds, then record accuracy on the held-out fold
    sgd_model.fit(X_train_scaled.iloc[train_idx, :], Y_train.iloc[train_idx])
    result.append(sgd_model.score(X_train_scaled.iloc[test_idx, :], Y_train.iloc[test_idx]))

result = np.array(result)
print(result)
print("Avg score: ", result.mean())

[0.7974575  0.79056083 0.77489363 0.76799684]
Avg score:  0.7827271993449659


Accuracy achieved: about 78% (mean over the 4 folds).
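Accuracy alone can hide how well the minority class is handled, so it may help to report a second metric and to use stratified folds. The sketch below is one possible way to do that, not the notebook's actual method: it runs on synthetic data from `make_classification` (the 80/20 class split is an assumption, not a figure from the loan data), and the names `X_demo`, `y_demo`, `pipeline` and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumed 80/20 class split).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.8, 0.2], random_state=2
)

# The scaler sits inside the pipeline, so it is refit on each training
# fold and never sees the held-out fold.
pipeline = make_pipeline(StandardScaler(), SGDClassifier(random_state=2))

# Stratified folds keep the class ratio similar in every split.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=2)
cv_scores = cross_validate(
    pipeline, X_demo, y_demo, cv=skf, scoring=["accuracy", "roc_auc"]
)

print("Accuracy per fold:", cv_scores["test_accuracy"])
print("ROC AUC per fold: ", cv_scores["test_roc_auc"])
print("Avg accuracy:", cv_scores["test_accuracy"].mean())
```

To try this on the notebook's data, the synthetic arrays would be replaced by the unscaled training features and `Y_train`, since the pipeline does its own scaling.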

## Generating predictions

In [56]:
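sgd_model = SGDClassifier(random_state=2)

# Note: the cross-validation above was run on X_train_scaled, while this fit uses X_train.
# Unless X_train and X_test were scaled in the same way beforehand, the 78% estimate
# may not carry over to these predictions.
sgd_model.fit(X_train, Y_train)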

In [57]:
Y_predict = sgd_model.predict(X_test)
Y_predict

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [63]:
pd.Series(Y_predict).value_counts()

1    101055
0     11337
Name: count, dtype: int64
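A quick sanity check on the predictions is to compare the share of each predicted class with the share of each label in the training data. A large gap can point to a preprocessing mismatch between training and prediction. The sketch below illustrates the idea on synthetic data only. The 80/20 class split and the names `final_model`, `demo_predictions` and `comparison` are assumptions for illustration, and the pipeline is one possible way to keep scaling consistent, not the notebook's own approach.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumed 80/20 class split).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.8, 0.2], random_state=2
)
X_fit, X_new, y_fit, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=2
)

# One pipeline object applies the same scaling at fit and predict time.
final_model = make_pipeline(StandardScaler(), SGDClassifier(random_state=2))
final_model.fit(X_fit, y_fit)
demo_predictions = final_model.predict(X_new)

# Class shares in the training labels versus the predictions.
train_share = pd.Series(y_fit).value_counts(normalize=True).sort_index()
pred_share = pd.Series(demo_predictions).value_counts(normalize=True).sort_index()
comparison = pd.DataFrame(
    {"train_labels": train_share, "predictions": pred_share}
).fillna(0.0)
print(comparison)
```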