# Home Credit Default Risk - Exploration + Baseline Model

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders. Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities. 

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

## A Different Explanation Of The Task:

<a href="http://www.homecredit.net/">Home Credit</a> is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.

Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).

The company uses a variety of alternative data - including telco and transactional information - to predict their clients' repayment abilities.

They made their data available to the Kaggle community and are challenging Kagglers to help them unlock its full potential.

**Contents**   
1. Dataset Preparation  



## 1. Dataset Preparation 

## Load data

In [1]:
import pandas as pd 
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from six.moves import cPickle as pickle

In [None]:
%qtconsole

In [2]:
import os
data_path="../data/original/pickles"

In [3]:
#application_train = pd.read_csv(PATH+"/application_train.csv")
#application_test = pd.read_csv(PATH+"/application_test.csv")
#bureau = pd.read_csv(PATH+"/bureau.csv")
#bureau_balance = pd.read_csv(PATH+"/bureau_balance.csv")
#credit_card_balance = pd.read_csv(PATH+"/credit_card_balance.csv")
#installments_payments = pd.read_csv(PATH+"/installments_payments.csv")
#previous_application = pd.read_csv(PATH+"/previous_application.csv")
#POS_CASH_balance = pd.read_csv(PATH+"/POS_CASH_balance.csv")


# Use context managers so the file handles are closed after loading
with open(os.path.join(data_path, "application_train.csv.pickle"), "rb") as f:
    application_train = pickle.load(f)
with open(os.path.join(data_path, "application_test.csv.pickle"), "rb") as f:
    application_test = pickle.load(f)

## Check the data

## Data model

The structure of the data is explained in the following image (from the data description on the competition page)

<img src="https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png" width="800"></img>

The dataframes *application_train* and *application_test* contain the loans and loan applicants. The dataframe *bureau* contains data on loans that the client took from other credit institutions and that were reported to the credit bureau. The dataframe *previous_application* contains information about previous loans at **Home Credit** by the same client, including loan and client information at the time of the application (one row per previous loan application).

**SK_ID_CURR** connects the dataframes *application_train*|*test* with *bureau* and *previous_application*, and also with the dataframes *POS_CASH_balance*, *installments_payments* and *credit_card_balance*. **SK_ID_PREV** connects the dataframe *previous_application* with *POS_CASH_balance*, *installments_payments* and *credit_card_balance*. **SK_ID_BUREAU** connects the dataframe *bureau* with the dataframe *bureau_balance*.
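
As a rough illustration of how these keys can be used, the sketch below aggregates a toy *bureau* table to one row per client and joins it onto a toy application table via **SK_ID_CURR**. All values are made up, and the output column names `BUREAU_LOAN_COUNT` and `BUREAU_CREDIT_SUM` are hypothetical; this notebook itself only uses the application tables.

```python
import pandas as pd

# Toy stand-ins for application_train and bureau (values are made up).
application = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002, 100003],
    "AMT_CREDIT": [250000.0, 120000.0, 90000.0],
})
bureau = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100003],
    "SK_ID_BUREAU": [5001, 5002, 5003],
    "AMT_CREDIT_SUM": [10000.0, 30000.0, 7000.0],
})

# Collapse the one-to-many bureau rows to one row per client.
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .agg(
        BUREAU_LOAN_COUNT=("SK_ID_BUREAU", "count"),
        BUREAU_CREDIT_SUM=("AMT_CREDIT_SUM", "sum"),
    )
    .reset_index()
)

# Left join keeps every application, including clients without bureau records.
merged = application.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

Clients without any bureau record end up with NaN in the aggregated columns, so such a join would feed into the missing-value handling further down.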

## Shape of Data

In [4]:
print("application_train -  rows:",application_train.shape[0]," columns:", application_train.shape[1])
print("application_test -  rows:",application_test.shape[0]," columns:", application_test.shape[1])
#print("bureau -  rows:",bureau.shape[0]," columns:", bureau.shape[1])
#print("bureau_balance -  rows:",bureau_balance.shape[0]," columns:", bureau_balance.shape[1])
#print("credit_card_balance -  rows:",credit_card_balance.shape[0]," columns:", credit_card_balance.shape[1])
#print("installments_payments -  rows:",installments_payments.shape[0]," columns:", installments_payments.shape[1])
#print("previous_application -  rows:",previous_application.shape[0]," columns:", previous_application.shape[1])
#print("POS_CASH_balance -  rows:",POS_CASH_balance.shape[0]," columns:", POS_CASH_balance.shape[1])

application_train -  rows: 307511  columns: 122
application_test -  rows: 48744  columns: 121


In [5]:
application_train.head(3)


Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
173139,300643,0,Cash loans,F,N,N,0,67500.0,706410.0,68944.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
288815,434593,1,Cash loans,M,N,Y,0,342000.0,751500.0,26622.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
237980,375624,0,Revolving loans,F,N,N,1,225000.0,270000.0,13500.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0


In [6]:
application_test.head(3)

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
40534,398352,Cash loans,F,N,N,0,90000.0,52128.0,5283.0,45000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,2.0,4.0
3254,122862,Cash loans,F,N,Y,1,54000.0,100246.5,9324.0,81000.0,...,0,0,0,0,,,,,,
18118,231542,Cash loans,M,Y,Y,0,405000.0,900000.0,38263.5,900000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,6.0


## Move TARGET from X_Train to Y_train

In [7]:
# remove the target column from application_train so it has the same structure as application_test
X_train = application_train.drop(['TARGET'], axis=1)
# save train labels in Y_train
Y_train = application_train.TARGET

# define test data
X_test = application_test

print("Done!")

Done!


## Handle missing values

In [8]:
def missing_data(data, threshold):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    total_table = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return total_table[total_table['Percent'] > threshold]

threshold_missing_values = 51   
missing_values_df = missing_data(X_train, threshold_missing_values)

In [9]:
missing_values_df

Unnamed: 0,Total,Percent
COMMONAREA_MEDI,214865,69.872297
COMMONAREA_AVG,214865,69.872297
COMMONAREA_MODE,214865,69.872297
NONLIVINGAPARTMENTS_MODE,213514,69.432963
NONLIVINGAPARTMENTS_MEDI,213514,69.432963
NONLIVINGAPARTMENTS_AVG,213514,69.432963
FONDKAPREMONT_MODE,210295,68.386172
LIVINGAPARTMENTS_MEDI,210199,68.354953
LIVINGAPARTMENTS_MODE,210199,68.354953
LIVINGAPARTMENTS_AVG,210199,68.354953


In [10]:
missing_values_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, COMMONAREA_MEDI to ELEVATORS_MODE
Data columns (total 2 columns):
Total      30 non-null int64
Percent    30 non-null float64
dtypes: float64(1), int64(1)
memory usage: 720.0+ bytes


All columns with more than 51% missing data (currently 30 columns) will be deleted. This is just for now, to have a simple baseline. Later on, we should analyse whether these columns have a positive effect on the results.
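
The threshold rule can also be sketched on a toy frame. This is a minimal illustration with made-up values; it computes the missing share with `isnull().mean()` instead of the `missing_data` helper, which should give the same percentages.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): column "b" is 75% missing, "c" is 25% missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, 5.0, 6.0, 7.0],
})

# The mean of a boolean mask is the fraction of True values, i.e. the share of NaNs.
missing_percent = df.isnull().mean() * 100

threshold = 51  # same cut-off as threshold_missing_values
cols_to_drop = missing_percent[missing_percent > threshold].index
reduced = df.drop(columns=cols_to_drop)

print(list(cols_to_drop))
print(list(reduced.columns))
```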

In [11]:
# Delete columns with a missing data percentage over the threshold
cols_to_drop = missing_values_df.index
X_train = X_train.drop(columns=cols_to_drop)
X_test = X_test.drop(columns=cols_to_drop)

In [12]:
print("# Columns in X_train : " + str(len(X_train.columns)))
print("# Columns in X_test  : " + str(len(X_test.columns)))

# Columns in X_train : 91
# Columns in X_test  : 91


In [13]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 307511 entries, 173139 to 185339
Data columns (total 91 columns):
SK_ID_CURR                      307511 non-null int64
NAME_CONTRACT_TYPE              307511 non-null object
CODE_GENDER                     307511 non-null object
FLAG_OWN_CAR                    307511 non-null object
FLAG_OWN_REALTY                 307511 non-null object
CNT_CHILDREN                    307511 non-null int64
AMT_INCOME_TOTAL                307511 non-null float64
AMT_CREDIT                      307511 non-null float64
AMT_ANNUITY                     307499 non-null float64
AMT_GOODS_PRICE                 307233 non-null float64
NAME_TYPE_SUITE                 306219 non-null object
NAME_INCOME_TYPE                307511 non-null object
NAME_EDUCATION_TYPE             307511 non-null object
NAME_FAMILY_STATUS              307511 non-null object
NAME_HOUSING_TYPE               307511 non-null object
REGION_POPULATION_RELATIVE      307511 non-null float64
...


## Encode non numerical values

First, let's see how many non-numerical columns we have.

In [14]:
X_train.select_dtypes(include=['object']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 307511 entries, 173139 to 185339
Data columns (total 15 columns):
NAME_CONTRACT_TYPE            307511 non-null object
CODE_GENDER                   307511 non-null object
FLAG_OWN_CAR                  307511 non-null object
FLAG_OWN_REALTY               307511 non-null object
NAME_TYPE_SUITE               306219 non-null object
NAME_INCOME_TYPE              307511 non-null object
NAME_EDUCATION_TYPE           307511 non-null object
NAME_FAMILY_STATUS            307511 non-null object
NAME_HOUSING_TYPE             307511 non-null object
OCCUPATION_TYPE               211120 non-null object
WEEKDAY_APPR_PROCESS_START    307511 non-null object
ORGANIZATION_TYPE             307511 non-null object
HOUSETYPE_MODE                153214 non-null object
WALLSMATERIAL_MODE            151170 non-null object
EMERGENCYSTATE_MODE           161756 non-null object
dtypes: object(15)
memory usage: 37.5+ MB


For any categorical variable (dtype == object) with at most 2 unique values, we will use label encoding.

In [15]:
from sklearn.preprocessing import LabelEncoder

# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in X_train:
    if X_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(X_train[col].unique())) <= 2:
            # Train on the training data
            le.fit(X_train[col])
            # Transform both training and testing data
            X_train[col] = le.transform(X_train[col])
            X_test[col] = le.transform(X_test[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

3 columns were label encoded.
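
The fit-on-train, transform-both pattern behind label encoding can be sketched on toy data. The sketch below uses `pd.Categorical` as a stand-in for `LabelEncoder` (an assumption made to keep the example dependency-light); the values are made up.

```python
import pandas as pd

# Toy binary column (made-up values), mirroring a flag such as FLAG_OWN_CAR.
train_col = pd.Series(["N", "Y", "N", "Y"])
test_col = pd.Series(["Y", "N", "Y"])

# "Fit": learn the sorted category list from the training data only.
categories = sorted(train_col.unique())

# "Transform": map both splits with the same category list, so that
# a given label gets the same integer code in train and test.
train_codes = pd.Categorical(train_col, categories=categories).codes
test_codes = pd.Categorical(test_col, categories=categories).codes

print(dict(zip(categories, range(len(categories)))))
print(train_codes.tolist(), test_codes.tolist())
```

With this stand-in, a test value that never appeared in training is coded as -1, whereas `LabelEncoder.transform` raises a `ValueError` for unseen labels. Either way, categories that exist only in the test split need a deliberate decision.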


In [16]:
X_train.select_dtypes(include=['object']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 307511 entries, 173139 to 185339
Data columns (total 12 columns):
CODE_GENDER                   307511 non-null object
NAME_TYPE_SUITE               306219 non-null object
NAME_INCOME_TYPE              307511 non-null object
NAME_EDUCATION_TYPE           307511 non-null object
NAME_FAMILY_STATUS            307511 non-null object
NAME_HOUSING_TYPE             307511 non-null object
OCCUPATION_TYPE               211120 non-null object
WEEKDAY_APPR_PROCESS_START    307511 non-null object
ORGANIZATION_TYPE             307511 non-null object
HOUSETYPE_MODE                153214 non-null object
WALLSMATERIAL_MODE            151170 non-null object
EMERGENCYSTATE_MODE           161756 non-null object
dtypes: object(12)
memory usage: 30.5+ MB


There are still 12 columns that need to be converted. Surprisingly, the column CODE_GENDER contains 3 unique values (only 2 were expected). Let's count the values per category.

In [17]:
X_train.CODE_GENDER.value_counts()

F      202448
M      105059
XNA         4
Name: CODE_GENDER, dtype: int64

We see that the value XNA appears in only 4 rows. This looks like erroneous data or a lack of information. Let's see what percentage of the data it represents.

In [18]:
# Share of rows (not cells) with CODE_GENDER == 'XNA', expressed in percent
xna_percent = X_train.CODE_GENDER.value_counts()['XNA'] / len(X_train) * 100
print("XNA in Percentage: {:.4f}%".format(xna_percent))

XNA in Percentage: 0.0013%


Because of the small percentage we decide to remove those rows.

In [19]:
index_xna_rows = X_train.CODE_GENDER != 'XNA'
X_train = X_train[index_xna_rows]
Y_train = Y_train[index_xna_rows]

index_xna_rows = X_test.CODE_GENDER != 'XNA'
X_test = X_test[index_xna_rows]
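
When rows are removed like this, features and labels have to be filtered with the same mask so that they stay aligned. A minimal sketch on made-up data, where the names `keep`, `X_clean` and `y_clean` are illustrative only:

```python
import pandas as pd

# Toy features and labels sharing one index (made-up values).
X = pd.DataFrame(
    {"CODE_GENDER": ["F", "M", "XNA", "F"], "CNT_CHILDREN": [0, 1, 0, 2]},
    index=[10, 11, 12, 13],
)
y = pd.Series([0, 1, 0, 0], index=X.index, name="TARGET")

# True for rows to keep; the same mask filters features and labels.
keep = X["CODE_GENDER"] != "XNA"
X_clean = X[keep].copy()  # .copy() avoids SettingWithCopyWarning on later assignments
y_clean = y[keep]

print(X_clean.shape, y_clean.shape)
```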

Now let's encode the CODE_GENDER column.

In [20]:
# Train on the training data
le.fit(X_train.CODE_GENDER)
# Transform both training and testing data
X_train.CODE_GENDER = le.transform(X_train.CODE_GENDER)
X_test.CODE_GENDER = le.transform(X_test.CODE_GENDER)

In [21]:
X_train.select_dtypes(include=['object']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 307507 entries, 173139 to 185339
Data columns (total 11 columns):
NAME_TYPE_SUITE               306215 non-null object
NAME_INCOME_TYPE              307507 non-null object
NAME_EDUCATION_TYPE           307507 non-null object
NAME_FAMILY_STATUS            307507 non-null object
NAME_HOUSING_TYPE             307507 non-null object
OCCUPATION_TYPE               211118 non-null object
WEEKDAY_APPR_PROCESS_START    307507 non-null object
ORGANIZATION_TYPE             307507 non-null object
HOUSETYPE_MODE                153211 non-null object
WALLSMATERIAL_MODE            151167 non-null object
EMERGENCYSTATE_MODE           161753 non-null object
dtypes: object(11)
memory usage: 28.2+ MB


We still have 11 columns to encode.

In [22]:
for col in X_train.select_dtypes(include=['object']):
    print(X_train[col].describe())
    print("---------------------------")

count            306215
unique                7
top       Unaccompanied
freq             248523
Name: NAME_TYPE_SUITE, dtype: object
---------------------------
count      307507
unique          8
top       Working
freq       158771
Name: NAME_INCOME_TYPE, dtype: object
---------------------------
count                            307507
unique                                5
top       Secondary / secondary special
freq                             218389
Name: NAME_EDUCATION_TYPE, dtype: object
---------------------------
count      307507
unique          6
top       Married
freq       196429
Name: NAME_FAMILY_STATUS, dtype: object
---------------------------
count                307507
unique                    6
top       House / apartment
freq                 272865
Name: NAME_HOUSING_TYPE, dtype: object
---------------------------
count       211118
unique          18
top       Laborers
freq         55186
Name: OCCUPATION_TYPE, dtype: object
---------------------------
...

For the rest of the categorical columns we use one-hot encoding.

In [23]:
# one-hot encoding of categorical variables
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)

print('Training Features shape: ', X_train.shape)
print('Testing Features shape: ', X_test.shape)

Training Features shape:  (307507, 207)
Testing Features shape:  (48744, 205)


The train and test data can contain different category values, which leads to a different number of dummy columns for the same attribute. Therefore we need to align both data sets so that they have the same columns.

In [24]:
# Align the training and testing data; join='outer' keeps the union of columns,
# and a column missing from one dataframe is added there filled with NaN
X_train, X_test = X_train.align(X_test, join = 'outer', axis = 1)

print('Training Features shape: ', X_train.shape)
print('Testing Features shape: ', X_test.shape)

Training Features shape:  (307507, 207)
Testing Features shape:  (48744, 207)
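
The effect of the join type can be seen on a toy example. This is a sketch with made-up values in which one category occurs only in the training split; it contrasts the outer join used here with an inner join.

```python
import pandas as pd

# Toy data (made-up): "Unknown" occurs only in the training split.
train = pd.DataFrame({"STATUS": ["Married", "Single", "Unknown"]})
test = pd.DataFrame({"STATUS": ["Single", "Married"]})

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# Outer join: union of columns; the column missing in test is filled with NaN.
train_o, test_o = train_d.align(test_d, join="outer", axis=1)

# Inner join: intersection of columns; nothing to impute afterwards.
train_i, test_i = train_d.align(test_d, join="inner", axis=1)

print(list(test_o.columns), test_o["STATUS_Unknown"].isna().all())
print(list(test_i.columns))
```

With the outer join, dummy columns that exist only in the training data show up as completely missing in the test data and have to be imputed afterwards. An inner join would drop them from both sets instead.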


In [25]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 307507 entries, 173139 to 185339
Columns: 207 entries, AMT_ANNUITY to YEARS_BEGINEXPLUATATION_MODE
dtypes: float64(36), int64(44), uint8(127)
memory usage: 227.3 MB


After the alignment some columns contain missing values, so we need to run the imputer to handle them.

In [26]:
# Show number of missing values per column
missing_data(X_train, 0)

Unnamed: 0,Total,Percent
APARTMENTS_AVG,156060,50.750064
APARTMENTS_MEDI,156060,50.750064
APARTMENTS_MODE,156060,50.750064
ENTRANCES_AVG,154827,50.349098
ENTRANCES_MODE,154827,50.349098
ENTRANCES_MEDI,154827,50.349098
LIVINGAREA_MODE,154349,50.193654
LIVINGAREA_MEDI,154349,50.193654
LIVINGAREA_AVG,154349,50.193654
FLOORSMAX_MODE,153019,49.761144


In [27]:
missing_data(X_test, 0)

Unnamed: 0,Total,Percent
NAME_INCOME_TYPE_Maternity leave,48744,100.0
NAME_FAMILY_STATUS_Unknown,48744,100.0
APARTMENTS_AVG,23887,49.005006
APARTMENTS_MEDI,23887,49.005006
APARTMENTS_MODE,23887,49.005006
ENTRANCES_AVG,23579,48.373133
ENTRANCES_MEDI,23579,48.373133
ENTRANCES_MODE,23579,48.373133
LIVINGAREA_MODE,23552,48.317742
LIVINGAREA_MEDI,23552,48.317742


In [28]:
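# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer is its replacement
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='most_frequent')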
imp.fit(X_train)
X_train = imp.transform(X_train)
X_test = imp.transform(X_test)

In [29]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
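
What the most-frequent strategy does can be sketched with plain pandas. This is an illustrative stand-in on made-up values, not the scikit-learn implementation: the fill values are learned on the training split only and then applied to both splits.

```python
import numpy as np
import pandas as pd

# Toy splits (made-up values) with gaps in both.
train = pd.DataFrame({"x": [1.0, 1.0, 2.0, np.nan], "y": [0.0, np.nan, 0.0, 1.0]})
test = pd.DataFrame({"x": [np.nan, 3.0], "y": [np.nan, np.nan]})

# "Fit": most frequent value per column, computed on the training data only.
fill_values = train.mode().iloc[0]

# "Transform": fill both splits with the training statistics.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values.to_dict())
print(test_filled)
```

A column that is entirely missing in the test split, like `y` here, is filled with the most frequent training value.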

## Scale the data

In [30]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
std_scaler.fit(X_train)
X_train = pd.DataFrame(std_scaler.transform(X_train))
X_test = pd.DataFrame(std_scaler.transform(X_test))

In [31]:
X_train.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,197,198,199,200,201,202,203,204,205,206
count,307507.0,307507.0,307507.0,307507.0,307507.0,307507.0,307507.0,307507.0,307507.0,307507.0,...,307507.0,307507.0,307507.0,307507.0,307507.0,307507.0,307507.0,307507.0,307507.0,307507.0
mean,-3.30659e-16,3.4261420000000004e-17,-2.031976e-16,-1.150737e-15,-3.798998e-15,-2.794988e-15,1.407373e-16,-2.457334e-15,-2.24685e-15,-1.03834e-16,...,2.929409e-15,6.24383e-15,-2.384004e-15,-2.884168e-15,-5.602487e-16,2.100237e-15,2.5830600000000002e-17,9.662275e-15,-2.464681e-15,1.374864e-15
std,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,...,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002,1.000002
min,-1.758837,-1.376496,-1.348043,-0.6036822,-0.05876637,-0.07098789,-0.2699416,-0.3086179,-0.1558385,-0.8855573,...,-0.4424135,-0.4443983,-0.3517147,-0.2356748,-0.4437475,-0.4610136,-0.4507839,-23.0358,-22.78036,-21.12355
25%,-0.7302338,-0.8174787,-0.8118782,-0.2374185,-0.05876637,-0.07098789,-0.2699416,-0.3086179,-0.1558385,-0.8855573,...,-0.4424135,-0.4443983,-0.3517147,-0.2356748,-0.4437475,-0.4610136,-0.4507839,-0.01649207,-0.01651579,-0.01850761
50%,-0.1521299,-0.2124206,-0.2391562,-0.09129259,-0.05876637,-0.07098789,-0.2699416,-0.3086179,-0.1558385,-0.3467079,...,-0.4424135,-0.4443983,-0.3517147,-0.2356748,-0.4437475,-0.4610136,-0.4507839,0.1124873,0.1110322,0.110562
75%,0.5166327,0.5208089,0.382308,0.1421293,-0.05876637,-0.07098789,-0.2699416,-0.3086179,-0.1558385,0.7309908,...,-0.4424135,-0.4443983,-0.3517147,-0.2356748,-0.4437475,-0.4610136,-0.4507839,0.1124873,0.1110322,0.110562
max,15.93201,8.574014,9.509302,492.7004,87.28797,51.20153,31.24269,350.4691,41.78853,12.58568,...,2.260329,2.250234,2.843214,4.243136,2.253534,2.169133,2.218358,0.4150027,0.4101904,0.3880618
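
The `std` row shows 1.000002 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). The two differ by a factor of sqrt(n / (n - 1)), which is about 1.0000016 for n = 307507. The sketch below reproduces the effect on a synthetic feature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=1000)  # synthetic feature

# Standardize with the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)            # 1 up to floating-point rounding
sample_std = pd.Series(z).std()    # pandas default is ddof=1

n = len(z)
print(pop_std, sample_std, np.sqrt(n / (n - 1)))
```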


In [32]:
X_train.head(50)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,197,198,199,200,201,202,203,204,205,206
0,2.886486,0.266791,0.382308,-0.427192,-0.058766,-0.070988,-0.269942,-0.308618,-0.155839,-0.346708,...,-0.442413,-0.444398,-0.351715,-0.235675,-0.443747,2.169133,-0.450784,-0.039943,-0.039706,-0.029263
1,-0.033529,0.378819,0.577277,0.730429,-0.058766,-0.070988,-0.269942,-0.308618,-0.155839,-0.346708,...,-0.442413,2.250234,-0.351715,-0.235675,-0.443747,-0.461014,-0.450784,0.112487,0.111032,0.110562
2,-0.938873,-0.817479,-0.726579,0.237016,-0.058766,-0.070988,-0.269942,-0.308618,-0.155839,-0.346708,...,-0.442413,-0.444398,-0.351715,-0.235675,2.253534,-0.461014,-0.450784,0.112487,0.111032,0.110562
3,1.506426,0.68355,0.73569,-0.085599,-0.058766,-0.070988,-0.269942,-0.308618,-0.155839,0.192141,...,2.260329,-0.444398,-0.351715,-0.235675,-0.443747,-0.461014,-0.450784,0.006959,0.006675,0.01376
4,-0.155545,-0.818597,-0.848435,0.237016,-0.058766,-0.070988,-0.269942,-0.308618,-0.155839,-0.885557,...,-0.442413,-0.444398,-0.351715,-0.235675,2.253534,-0.461014,-0.450784,-0.122021,-0.120873,-0.102403
5,1.570694,1.428754,1.466824,0.237016,-0.058766,-0.070988,-0.269942,-0.308618,-0.155839,0.192141,...,-0.442413,-0.444398,2.843214,-0.235675,-0.443747,-0.461014,-0.450784,-0.039943,-0.039706,-0.029263
6,-0.303952,-0.154316,-0.2026,-0.142532,-0.058766,-0.070988,-0.269942,-0.308618,-0.155839,1.808689,...,-0.442413,2.250234,-0.351715,-0.235675,-0.443747,-0.461014,-0.450784,0.112487,0.111032,0.110562
7,0.86157,0.188753,0.370122,0.388835,-0.058766,-0.070988,-0.269942,-0.308618,-0.155839,0.192141,...,2.260329,-0.444398,-0.351715,-0.235675,-0.443747,-0.461014,-0.450784,0.112487,0.111032,0.110562
8,-0.928627,-1.096987,-1.031218,0.42679,-0.058766,-0.070988,-0.269942,-0.308618,-0.155839,-0.885557,...,-0.442413,-0.444398,-0.351715,-0.235675,-0.443747,2.169133,-0.450784,0.112487,0.111032,0.110562
9,-1.131367,-1.22252,-1.201816,-0.37026,-0.058766,-0.070988,0.897193,-0.308618,-0.155839,1.808689,...,2.260329,-0.444398,-0.351715,-0.235675,-0.443747,-0.461014,-0.450784,0.34465,0.340619,0.323527


## Train the Model

In [33]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict

In [None]:
# Train a simple SGDClassifier
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42, n_jobs=4) 

In [34]:
# RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(random_state=42)

In [36]:
Y_train_pred = cross_val_predict(rf_clf, X_train, Y_train, cv=10, verbose=10, method='predict_proba')

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   19.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   37.7s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   56.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.2min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.6min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  1.9min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  2.2min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  2.5min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  2.8min remaining:    0.0s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  3.1min finished


In [37]:
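from sklearn.metrics import confusion_matrix
# Y_train_pred holds class probabilities (method='predict_proba'),
# so derive hard labels from the positive-class column first
Y_train_labels = (Y_train_pred[:, 1] > 0.5).astype(int)
confusion_matrix(Y_train, Y_train_labels)

array([[282278,    404],
       [ 24582,    243]])

In [39]:
from sklearn.metrics import precision_score, recall_score
precision_score(Y_train, Y_train_labels)

0.3755795981452859

In [None]:
recall_score(Y_train, Y_train_labels)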
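
Precision and recall follow directly from the confusion matrix above. The sketch below recomputes them from the four counts, assuming the scikit-learn layout (rows are actual classes, columns are predicted classes).

```python
# Counts copied from the confusion matrix above
# (rows: actual class, columns: predicted class).
tn, fp = 282278, 404
fn, tp = 24582, 243

precision = tp / (tp + fp)   # share of predicted defaults that are real defaults
recall = tp / (tp + fn)      # share of real defaults that were caught
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} accuracy={accuracy:.4f}")
```

Recall is below 1% even though accuracy is around 92%: the classes are heavily imbalanced, so a model that almost always predicts "no default" still looks accurate.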

In [None]:
from sklearn.metrics import roc_curve

# roc_curve expects scores for the positive class, not the full probability matrix
fpr, tpr, thresholds = roc_curve(Y_train, Y_train_pred[:, 1])

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # diagonal = random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')

plot_roc_curve(fpr, tpr)
plt.show()

In [None]:
from sklearn.svm import SVC
#params = [
#  {'kernel':('linear', 'rbf'), 'C':[1, 10, 100]}
#]

params = [
  {'kernel':['linear'], 'C':[1]}
]

svc = SVC()
clf = GridSearchCV(svc, param_grid=params, n_jobs=5)
clf.fit(X_train, Y_train)

In [None]:
from sklearn.linear_model import LogisticRegression

# Make the model with the specified regularization parameter
log_reg = LogisticRegression(C = 0.0001)

Y_train_pred = cross_val_predict(log_reg, X_train, Y_train, cv=10, method='predict_proba')