# BIG DATA FINAL PROJECT

## Credit Risk Modeling (Lending Club)

+ We will focus on credit modelling, a well-known data science problem concerned with estimating a borrower's credit risk. Credit has played a key role in the economy for centuries, and some form of credit has existed since the beginning of commerce. We'll be working with financial lending data from Lending Club, a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return.

+ Each borrower fills out a comprehensive application, providing their past financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using past historical data (and their own data science process!) and assigns an interest rate to the borrower. The interest rate is the percent in addition to the requested loan amount the borrower has to pay back. Lending Club also tries to verify each piece of information the borrower provides, but it can't always verify all of the information (usually for regulatory reasons).

+ A higher interest rate means that the borrower is riskier and less likely to pay back the loan, while a lower interest rate means that the borrower has a good credit history and is more likely to pay back the loan. The interest rates range from 5.32% all the way to 30.99%, and each borrower is given a grade according to the interest rate they were assigned. If the borrower accepts the interest rate, then the loan is listed on the Lending Club marketplace.

+ Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application. Once they're ready to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested minus the origination fee that Lending Club charges.

+ The borrower then makes monthly payments back to Lending Club over either 36 months or 60 months. Lending Club redistributes these payments to the investors. This means that investors don't have to wait until the full amount is paid off to start to see money back. If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition to the requested amount. Many loans aren't completely paid off on time, however, and some borrowers default on the loan.
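
To make the monthly payment arithmetic concrete, here is a minimal sketch that assumes loans follow the standard fixed-payment amortization formula. That assumption is ours, and `monthly_installment` is a hypothetical helper, not something taken from Lending Club. The inputs mirror the first row of the data preview shown later in this notebook (a 5000 loan over 36 months at 10.65%), and the result lands close to the installment of 162.87 listed there.

```python
def monthly_installment(principal, annual_rate_pct, n_months):
    """Fixed monthly payment for a fully amortizing loan (annuity formula)."""
    r = annual_rate_pct / 100 / 12  # monthly interest rate
    if r == 0:
        return principal / n_months  # no interest: spread the principal evenly
    return principal * r / (1 - (1 + r) ** -n_months)


payment = monthly_installment(5000, 10.65, 36)
total_interest = payment * 36 - 5000  # what investors earn if every payment is made
print(round(payment, 2), round(total_interest, 2))
```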

+ While Lending Club has to be extremely savvy and rigorous with its credit modelling, investors on Lending Club need to be equally savvy about determining which loans are more likely to be paid off. At first, you may wonder why investors would put money into anything but low-interest loans. The incentive to back higher-interest loans is, well, the higher interest! If investors believe the borrower can pay back the loan, even if he or she has a weak financial history, then they can make more money through the larger additional amount the borrower has to pay.

+ Most investors use a portfolio strategy to invest small amounts in many loans, with healthy mixes of low-, medium-, and high-interest loans. In this project, we'll focus on the mindset of a conservative investor who only wants to invest in the loans that have a good chance of being paid off on time. To do that, we'll need to first understand the features in the dataset and then experiment with building machine learning models that reliably predict if a loan will be paid off or not.

### Data Cleaning

In this project, we'll focus on approved loans data from 2007 to 2011, since a good number of the loans have already finished. In the datasets for later years, many of the loans are current and still being paid off. The data has been sourced from Lending Club's website.

In [1]:
# Importing all the necessary libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import tests as t
import seaborn as sns
sns.set(style='ticks')
%matplotlib inline

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Let us perform some basic data exploration to understand what the dataset looks like.

In [3]:
loans_2007=pd.read_csv('loans_2007.csv',low_memory = False) # Reading the dataset on to a DataFrame

In [4]:
loans_2007.head() # Gives the first 5 rows of the dataset with all columns
pd.DataFrame(loans_2007.columns)  # This will print out a list of all columns

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,0
0,id
1,member_id
2,loan_amnt
3,funded_amnt
4,funded_amnt_inv
5,term
6,int_rate
7,installment
8,grade
9,sub_grade


The Dataframe contains many columns and can be cumbersome to try to explore all at once. Let's break up the columns into 3 groups of 18 columns and use the data dictionary to become familiar with what each column represents. As we understand each feature, we want to pay attention to any features that:

+ leak information from the future (after the loan has already been funded)
+ don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club)
+ are formatted poorly and need to be cleaned up
+ require more data or a lot of processing to turn into a useful feature
+ contain redundant information
+ We need to especially pay attention to data leakage, since it can cause our model to overfit. This is because the model would be using data about the target column that wouldn't be available when we're using the model on future loans. 

+ After analyzing the first 18 columns, we can conclude that the following features need to be removed:

+ id: randomly generated field by Lending Club for unique identification purposes only
+ member_id: also a randomly generated field by Lending Club for unique identification purposes only
+ funded_amnt: leaks data from the future (after the loan has already started to be funded)
+ funded_amnt_inv: also leaks data from the future (after the loan has already started to be funded)
+ grade: contains information that is redundant with the interest rate column (int_rate)
+ sub_grade: also contains information that is redundant with the interest rate column (int_rate)
+ emp_title: requires other data and a lot of processing to potentially be useful
+ issue_d: leaks data from the future (after the loan has already been completely funded)
+ Recall that Lending Club assigns a grade and a sub-grade based on the borrower's interest rate. While the grade and sub_grade values are categorical, the int_rate column contains continuous values, which are better suited for machine learning.

+ Let's now drop these columns from the Dataframe before moving onto the next group of columns.

In [5]:
loans_2007=loans_2007.drop(['id','member_id','funded_amnt','funded_amnt_inv','grade','sub_grade','emp_title','issue_d'],axis=1)

+ Within the next group of 18 columns, we need to drop the following columns:

+ zip_code: redundant with the addr_state column since only the first 3 digits of the 5 digit zip code are visible (which only can be used to identify the state the borrower lives in)
+ out_prncp: leaks data from the future (after the loan has already started to be paid off)
+ out_prncp_inv: also leaks data from the future (after the loan has already started to be paid off)
+ total_pymnt: also leaks data from the future (after the loan has already started to be paid off)
+ total_pymnt_inv: also leaks data from the future (after the loan has already started to be paid off)
+ total_rec_prncp: also leaks data from the future (after the loan has already started to be paid off)
+ The out_prncp and out_prncp_inv both describe the outstanding principal amount for a loan, which is the remaining amount the borrower still owes. These 2 columns as well as the total_pymnt column describe properties of the loan after it's fully funded and started to be paid off. This information isn't available to an investor before the loan is fully funded and we don't want to include it in our model.

+ Let's go ahead and remove these columns from the Dataframe.

In [6]:
loans_2007=loans_2007.drop(['zip_code','out_prncp','out_prncp_inv','total_pymnt','total_pymnt_inv','total_rec_prncp'],axis=1)

+ In the last group of columns, we need to drop the following columns:

+ total_rec_int: leaks data from the future (after the loan has already started to be paid off)
+ total_rec_late_fee: also leaks data from the future (after the loan has already started to be paid off)
+ recoveries: also leaks data from the future (after the loan has already started to be paid off)
+ collection_recovery_fee: also leaks data from the future (after the loan has already started to be paid off)
+ last_pymnt_d: also leaks data from the future (after the loan has already started to be paid off)
+ last_pymnt_amnt: also leaks data from the future (after the loan has already started to be paid off)
+ All of these columns leak data from the future, meaning that they're describing aspects of the loan after it's already been fully funded and started to be paid off by the borrower.

In [7]:
loans_2007=loans_2007.drop(['total_rec_int','total_rec_late_fee','recoveries','collection_recovery_fee','last_pymnt_d','last_pymnt_amnt'],axis=1)
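
The three drop steps above could also be recorded in one place, keyed by the reason each column is excluded. The sketch below is only an illustration: `drop_by_reason` is a hypothetical helper, and the tiny DataFrame holds made-up values rather than rows from the Lending Club file.

```python
import pandas as pd


def drop_by_reason(df, reasons):
    """Drop columns grouped by exclusion reason; return the frame and what was dropped."""
    dropped = {}
    for reason, cols in reasons.items():
        present = [c for c in cols if c in df.columns]  # skip names that are absent
        df = df.drop(columns=present)
        dropped[reason] = present
    return df, dropped


# Toy data with made-up values, for illustration only.
demo = pd.DataFrame({
    "id": [1, 2],
    "loan_amnt": [5000.0, 2500.0],
    "total_rec_int": [100.0, 50.0],
    "grade": ["B", "C"],
})
reasons = {
    "identifier": ["id", "member_id"],
    "leaks future information": ["total_rec_int", "recoveries"],
    "redundant": ["grade", "sub_grade"],
}
cleaned, dropped = drop_by_reason(demo, reasons)
print(list(cleaned.columns), dropped)
```

Keeping the reason next to each column name makes it easier to revisit the decision later, for example if we want to bring back a redundant column but never a leaking one.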

In [8]:
loans_2007.head(1)
loans_2007.shape

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,initial_list_status,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,Fully Paid,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


(42538, 32)

+ Just by becoming familiar with the columns in the dataset, we were able to reduce the number of columns from 52 to 32 columns. We now need to decide on a target column that we want to use for modeling.

+ We should use the loan_status column, since it's the only column that directly describes if a loan was paid off on time, had delayed payments, or was defaulted on by the borrower. Currently, this column contains text values and we need to convert it to a numerical one for training a model. Let's explore the different values in this column and come up with a strategy for converting them.

In [9]:
pd.DataFrame(loans_2007['loan_status'].value_counts())

loan_status,count
Fully Paid,33136
Charged Off,5634
Does not meet the credit policy. Status:Fully Paid,1988
Current,961
Does not meet the credit policy. Status:Charged Off,761
Late (31-120 days),24
In Grace Period,20
Late (16-30 days),8
Default,3


+ From the investor's perspective, we're interested in trying to predict which loans will be paid off on time and which ones won't be. Only the Fully Paid and Charged Off values describe the final outcome of the loan. The other values describe loans that are still ongoing, where the jury is still out on whether the borrower will pay back the loan on time. While the Default status resembles the Charged Off status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while defaulted ones have a small chance.
+ Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as a binary classification one. Let's remove all the loans that don't contain either Fully Paid or Charged Off as the loan's status and then transform the Fully Paid values to 1 for the positive case and the Charged Off values to 0 for the negative case.

In [10]:
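# Keep only loans with a final outcome; copy so later in-place edits don't touch a view
loans_2007=loans_2007.loc[loans_2007['loan_status'].isin(['Fully Paid','Charged Off'])].copy()
# Restrict the replacement to the loan_status column instead of the whole DataFrame
loans_2007=loans_2007.replace({'loan_status':{'Fully Paid':1,'Charged Off':0}})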

+ Let's look for any columns that contain only one unique value and remove them. These columns won't be useful for the model since they don't add any information to each loan application. 
+ In addition, removing these columns will reduce the number of columns we'll need to explore further.

In [11]:
drop_columns=[]
for column in loans_2007.columns:
    non_null_unique_values=len(loans_2007[column].dropna().unique())
    if non_null_unique_values<=1:
        drop_columns.append(column)
loans_2007.drop(drop_columns,axis=1,inplace=True)
drop_columns

['pymnt_plan',
 'initial_list_status',
 'collections_12_mths_ex_med',
 'policy_code',
 'application_type',
 'acc_now_delinq',
 'chargeoff_within_12_mths',
 'delinq_amnt',
 'tax_liens']
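
The loop above can also be written with pandas' `nunique`, which ignores nulls by default. The following is a small sketch on made-up data (the column names are borrowed from the dataset, the values are not):

```python
import numpy as np
import pandas as pd

# Toy frame: two columns carry a single non-null value, one carries real variation.
demo = pd.DataFrame({
    "policy_code": [1.0, 1.0, 1.0],
    "tax_liens": [0.0, np.nan, 0.0],
    "loan_amnt": [5000.0, 2500.0, 2400.0],
})

n_unique = demo.nunique(dropna=True)            # unique non-null values per column
constant_cols = n_unique[n_unique <= 1].index.tolist()
reduced = demo.drop(columns=constant_cols)
print(constant_cols, list(reduced.columns))
```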

## Preparing Features

Let's start by computing the number of missing values and come up with a strategy for handling them. Then, we'll focus on the categorical columns.

In [12]:
loans = loans_2007
null_counts=loans.isnull().sum()
pd.DataFrame(null_counts)

column,null_count
loan_amnt,0
term,0
int_rate,0
installment,0
emp_length,1036
home_ownership,0
annual_inc,0
verification_status,0
loan_status,0
purpose,0


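+ While most of the columns have 0 missing values, a few columns have 50 or fewer rows with missing values, and 1 column, pub_rec_bankruptcies, contains 697 rows with missing values. Let's remove columns entirely where more than 1% of the rows for that column contain a null value. In addition, we'll remove the remaining rows containing null values. Note that emp_length is an exception in this run: the table above shows 1,036 missing values for it (more than 1% of the rows), but the code below keeps the column and drops those rows instead.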

+ This means that we'll keep the following columns and just remove rows containing missing values for them:

+ title
+ revol_util
+ last_credit_pull_d
+ and drop the pub_rec_bankruptcies column entirely since more than 1% of the rows have a missing value for this column.
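
The 1% rule described above can be sketched as a small function. This is an illustration under our own assumptions: `drop_sparse_then_rows` and the `max_null_frac` parameter are hypothetical names, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd


def drop_sparse_then_rows(df, max_null_frac=0.01):
    """Drop columns whose null share exceeds the threshold, then drop rows with any null."""
    null_frac = df.isnull().mean()  # share of missing values per column
    sparse_cols = null_frac[null_frac > max_null_frac].index.tolist()
    return df.drop(columns=sparse_cols).dropna(axis=0), sparse_cols


demo = pd.DataFrame({
    "a": np.arange(200, dtype=float),
    "b": [np.nan] * 10 + [1.0] * 190,  # 5% missing: the whole column is dropped
    "c": [np.nan] + [0.0] * 199,       # 0.5% missing: only the affected row is dropped
})
cleaned, sparse_cols = drop_sparse_then_rows(demo)
print(sparse_cols, cleaned.shape)
```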

In [13]:
loans.drop('pub_rec_bankruptcies',axis=1,inplace=True)
loans.dropna(axis=0,inplace=True)
loans.dtypes.value_counts()

object     11
float64    10
int64       1
dtype: int64

While the numerical columns can be used natively with scikit-learn, the object columns that contain text need to be converted to numerical data types. Let's return a new Dataframe containing just the object columns so we can explore them in more depth.

In [14]:
object_columns_df= loans.select_dtypes(include=['object'])
object_columns_df.head(1)

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Jun-2016


+ Some of the columns seem like they represent categorical values, but we should confirm by checking the number of unique values in those columns:

+ home_ownership: home ownership status, can only be 1 of 4 categorical values according to the data dictionary,
+ verification_status: indicates if income was verified by Lending Club,
+ emp_length: number of years the borrower was employed upon time of application,
+ term: number of payments on the loan, either 36 or 60,
+ addr_state: borrower's state of residence,
+ purpose: a category provided by the borrower for the loan request,
+ title: loan title provided by the borrower,
+ There are also some columns that represent numeric values, that need to be converted:

+ int_rate: interest rate of the loan in %,
+ revol_util: revolving line utilization rate, or the amount of credit the borrower is using relative to all available credit.
+ Based on the first row's values for purpose and title, it seems like these columns could reflect the same information. Let's explore the unique value counts separately to confirm if this is true.

+ Lastly, some of the columns contain date values that would require a good amount of feature engineering for them to be potentially useful:

+ earliest_cr_line: The month the borrower's earliest reported credit line was opened,
+ last_credit_pull_d: The most recent month Lending Club pulled credit for this loan.
+ Since these date features require some feature engineering for modeling purposes, let's remove these date columns from the Dataframe.
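
For reference, the kind of feature engineering these date columns would need might look like the sketch below, which turns a month string such as `Jan-1985` into a length of credit history in months. The reference date `Dec-2011` and the helper `months_between` are hypothetical choices made only for this illustration, and parsing `%b` assumes English month abbreviations.

```python
from datetime import datetime


def months_between(earlier, later, fmt="%b-%Y"):
    """Whole months between two 'Mon-YYYY' strings."""
    a = datetime.strptime(earlier, fmt)
    b = datetime.strptime(later, fmt)
    return (b.year - a.year) * 12 + (b.month - a.month)


# 'Jan-1985' is the earliest_cr_line value of the first row shown above;
# 'Dec-2011' is a made-up reference date.
history_months = months_between("Jan-1985", "Dec-2011")
print(history_months)
```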

In [15]:
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    loans[c].value_counts()


RENT        18112
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64

Not Verified       16281
Verified           11856
Source Verified     9538
Name: verification_status, dtype: int64

10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64

 36 months    28234
 60 months     9441
Name: term, dtype: int64

CA    6776
NY    3614
FL    2704
TX    2613
NJ    1776
IL    1447
PA    1442
VA    1347
GA    1323
MA    1272
OH    1149
MD    1008
AZ     807
WA     788
CO     748
NC     729
CT     711
MI     678
MO     648
MN     581
NV     466
SC     454
WI     427
OR     422
AL     420
LA     420
KY     311
OK     285
UT     249
KS     249
AR     229
DC     209
RI     194
NM     180
WV     164
HI     162
NH     157
DE     110
MT      77
WY      76
AK      76
SD      60
VT      53
MS      19
TN      17
IN       9
ID       6
NE       5
IA       5
ME       3
Name: addr_state, dtype: int64

+ The home_ownership, verification_status, emp_length, term, and addr_state columns all contain multiple discrete values. We should clean the emp_length column and treat it as a numerical one since the values have ordering (2 years of employment is less than 8 years).

+ First, let's look at the unique value counts for the purpose and title columns to understand which column we want to keep.

In [16]:
loans["purpose"].value_counts()
loans["title"].value_counts()

debt_consolidation    17751
credit_card            4911
other                  3711
home_improvement       2808
major_purchase         2083
small_business         1719
car                    1459
wedding                 916
medical                 655
moving                  552
house                   356
vacation                348
educational             312
renewable_energy         94
Name: purpose, dtype: int64

Debt Consolidation                          2068
Debt Consolidation Loan                     1599
Personal Loan                                624
Consolidation                                488
debt consolidation                           466
Credit Card Consolidation                    345
Home Improvement                             336
Debt consolidation                           314
Small Business Loan                          298
Credit Card Loan                             294
Personal                                     290
Consolidation Loan                           250
Home Improvement Loan                        228
personal loan                                219
Loan                                         202
Wedding Loan                                 199
personal                                     198
Car Loan                                     188
consolidation                                186
Other Loan                                   168
Wedding             

+ The home_ownership, verification_status, emp_length, and term columns each contain a few discrete categorical values. We should encode these columns as dummy variables and keep them.

+ It seems like the purpose and title columns do contain overlapping information but we'll keep the purpose column since it contains a few discrete values. In addition, the title column has data quality issues since many of the values are repeated with slight modifications (e.g. Debt Consolidation and Debt Consolidation Loan and debt consolidation).

+ We can use the following mapping to clean the emp_length column:

+ "10+ years": 10
+ "9 years": 9
+ "8 years": 8
+ "7 years": 7
+ "6 years": 6
+ "5 years": 5
+ "4 years": 4
+ "3 years": 3
+ "2 years": 2
+ "1 year": 1
+ "< 1 year": 0
+ "n/a": 0
+ We erred on the side of being conservative with the 10+ years, < 1 year and n/a mappings. We assume that people who may have been working more than 10 years have only really worked for 10 years. We also assume that people who've worked less than a year, or whose employment length is not available, have worked for 0 years. This is a general heuristic, but it's not perfect.

+ Lastly, the addr_state column contains many discrete values and we'd need to add 49 dummy variable columns to use it for classification. This would make our Dataframe much larger and could slow down how quickly the code runs. Let's remove this column from consideration.

In [17]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
loans.drop(['last_credit_pull_d','addr_state','title','earliest_cr_line'],axis=1,inplace=True)
loans['int_rate']=loans['int_rate'].str.rstrip('%')
loans['int_rate']=loans['int_rate'].astype(float)
loans['revol_util']=loans['revol_util'].str.rstrip('%')
loans['revol_util']=loans['revol_util'].astype(float)
loans=loans.replace(mapping_dict)
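
As an alternative to the explicit dictionary, the same conservative mapping can be expressed as a small parser. This is a sketch with hypothetical helper names (`parse_emp_length`, `parse_percent`), and it follows the same conventions as the mapping above: `10+ years` becomes 10, while `< 1 year` and `n/a` become 0.

```python
import re


def parse_emp_length(value):
    """Map strings such as '10+ years' or '< 1 year' to whole years."""
    text = value.strip().lower()
    if text == "n/a" or text.startswith("<"):
        return 0  # unknown or under a year: treated as 0
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0


def parse_percent(value):
    """Turn a string such as '10.65%' into the float 10.65."""
    return float(value.strip().rstrip("%"))


examples = ["10+ years", "9 years", "1 year", "< 1 year", "n/a"]
parsed = [parse_emp_length(v) for v in examples]
print(parsed, parse_percent("10.65%"), parse_percent("83.7%"))
```

With pandas, functions like these could be applied column-wise through `Series.map`.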

Let's now encode the home_ownership, verification_status, purpose, and term columns as dummy variables so we can use them in our model.

In [18]:
dummy_dataframe=pd.get_dummies(loans[['home_ownership','verification_status','purpose','term']])
loans=pd.concat([loans,dummy_dataframe],axis=1)
loans.drop(['home_ownership','verification_status','purpose','term'],axis=1,inplace=True)
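
The encode, concatenate, and drop steps can also be done in one call by passing `columns=` to `pd.get_dummies`. Here is a small sketch on made-up rows. Note that the term values in this dataset carry a leading space (as the value counts above show), which ends up in the dummy column names.

```python
import pandas as pd

# Toy rows for illustration; only the category labels mirror the dataset.
demo = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "term": [" 36 months", " 60 months", " 36 months"],
    "home_ownership": ["RENT", "RENT", "OWN"],
})

cat_cols = ["term", "home_ownership"]
encoded = pd.get_dummies(demo, columns=cat_cols)  # originals are replaced by dummies
print(list(encoded.columns))
```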

## Making Predictions

+ We established earlier in this project that this is a binary classification problem, and we converted the loan_status column to 0s and 1s as a result. Before diving in and selecting an algorithm to apply to the data, we should select an error metric.

+ An error metric will help us figure out when our model is performing well, and when it's performing poorly. To tie error metrics all the way back to the original question we wanted to answer, let's say we're using a machine learning model to predict whether or not we should fund a loan on the Lending Club platform. Our objective in this is to make money -- we want to fund enough loans that are paid off on time to offset our losses from loans that aren't paid off. An error metric will help us determine if our algorithm will make us money or lose us money.

+ In this case, we're primarily concerned with false positives and false negatives. Both of these are different types of misclassifications. With a false positive, we predict that a loan will be paid off on time, but it actually isn't. This costs us money, since we fund loans that lose us money. With a false negative, we predict that a loan won't be paid off on time, but it actually would be paid off on time. This loses us potential money, since we didn't fund a loan that actually would have been paid off.

+ Since we're viewing this problem from the standpoint of a conservative investor, we need to treat false positives differently than false negatives. A conservative investor would want to minimize risk, and avoid false positives as much as possible. They'd be more okay with missing out on opportunities (false negatives) than they would be with funding a risky loan (false positives).
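
To pin down the terms, the rates discussed above can be computed from the four cells of a confusion matrix. The sketch below uses made-up counts, and `classification_rates` is a hypothetical helper. With scikit-learn's `confusion_matrix` and labels 0 and 1, the layout is `[[tn, fp], [fn, tp]]`.

```python
def classification_rates(tn, fp, fn, tp):
    """Rates that matter to a conservative investor, from confusion-matrix counts."""
    return {
        "fpr": fp / (fp + tn),        # share of bad loans we would wrongly fund
        "tpr": tp / (tp + fn),        # share of good loans we would fund
        "precision": tp / (tp + fp),  # share of funded loans that get paid off
    }


# Made-up counts, for illustration only.
example = classification_rates(tn=80, fp=20, fn=300, tp=600)
print(example)
```

A conservative investor wants a low `fpr` and a high `precision`, and is willing to give up some `tpr` to get them.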

#### Class Imbalances
+ There is a significant class imbalance in the loan_status column: there are roughly 6 times as many loans that were paid off on time (1) as loans that weren't (0). This causes a major issue when we use accuracy as a metric, because a classifier can predict 1 for every row and still achieve high accuracy.
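
A quick calculation shows why accuracy misleads here. It uses the Fully Paid and Charged Off counts from the value counts shown earlier (taken before rows with missing values were dropped, so the exact figures for the modeling data differ slightly).

```python
n_paid, n_charged_off = 33136, 5634  # counts from the loan_status table above

imbalance_ratio = n_paid / n_charged_off

# A "classifier" that predicts 1 (paid off) for every loan:
accuracy_always_one = n_paid / (n_paid + n_charged_off)
fpr_always_one = n_charged_off / n_charged_off  # every bad loan gets funded

print(round(imbalance_ratio, 2), round(accuracy_always_one, 4), fpr_always_one)
```

The accuracy looks respectable even though this strategy funds every single loan that ends up charged off.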

+ A good first algorithm to apply to binary classification problems is logistic regression, for the following reasons:

+ it's quick to train and we can iterate more quickly,
+ it's less prone to overfitting than more complex models like decision trees,
+ it's easy to interpret.

In [19]:
features = loans.drop('loan_status',axis=1)
target = loans['loan_status']

In [20]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
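
With imbalanced classes, one option worth considering is a stratified split, so that the train and test sets keep the same share of each class. The sketch below uses synthetic data and the `stratify` argument of `train_test_split`. It is a suggestion on our part, not what the split above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # synthetic features
y = np.array([1] * 850 + [0] * 150)   # 85% positives, 15% negatives

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())  # share of positives in each split
```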

### Logistic Regression

In [21]:
penalty= {0:10,1:1}
lr = LogisticRegression(class_weight=penalty)
lr.fit(X_train,y_train)
predictions=lr.predict(X_test)



LogisticRegression(C=1.0, class_weight={0: 10, 1: 1}, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)

In [22]:
from sklearn.metrics import confusion_matrix

In [23]:
pd.DataFrame(confusion_matrix(y_test, predictions))

actual \ predicted,0,1
0,1006,96
1,4826,1607


In [24]:
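# Rates derived from the confusion matrix (the values printed below used hard-coded counts that differ slightly from this matrix)
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
fpr_logist = fp/(fp+tn)*100
tpr_logist = tp/(tp+fn)*100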

In [25]:
fpr_logist
tpr_logist

8.892921960072595

25.19819679776154

In [26]:
fone_logist = f1_score(y_test,predictions)
fone_logist

0.397693817468106

In [27]:
precision_logist = precision_score(y_test,predictions)
precision_logist

0.9429901105293775

+ The logistic regression model had a false positive rate of about 9% and a true positive rate of about 25%. In other words, it would fund about 9% of the loans that end up charged off and about 25% of the loans that end up fully paid. Its precision of about 94% means that roughly 6% of the loans it chooses to fund are charged off. A conservative investor makes money as long as the interest earned is high enough to offset the losses from that share of defaults, and as long as the pool of funded loans (about 25% of the good borrowers) is large enough to earn meaningful interest.

+ If we had randomly picked loans to fund, borrowers would have defaulted on 14.5% of them, so our model does better than that, although we're excluding more loans than a random strategy would. Given this, there's still quite a bit of room to improve.
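
To get a rough feel for "high enough to offset the losses", here is a deliberately simplified break-even calculation. It assumes a single period, that a defaulted loan loses its full principal, and that fees and partial repayments can be ignored. None of these assumptions come from Lending Club, and `break_even_return` is a hypothetical helper. The precision value is the one reported above.

```python
def break_even_return(default_share, loss_given_default=1.0):
    """Return needed on repaid loans so that gains offset losses (toy one-period model)."""
    return default_share * loss_given_default / (1 - default_share)


precision = 0.9429901105293775          # precision_logist from the cell above
model_default_share = 1 - precision     # share of funded loans that are charged off
random_default_share = 0.145            # share when funding loans at random

needed_with_model = break_even_return(model_default_share)
needed_at_random = break_even_return(random_default_share)
print(round(needed_with_model, 4), round(needed_at_random, 4))
```

Under these assumptions, the loans selected by the model need a noticeably lower return to break even than randomly chosen loans do.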

### XGBoost Classifier

In [24]:
from xgboost import XGBClassifier

In [25]:
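# NOTE: class_weight is not an XGBClassifier parameter (scale_pos_weight is its class-imbalance setting),
# so the weights below are likely ignored. Candidates with learning_rate=0 cannot learn anything.
xgb_classifier = XGBClassifier(class_weight={0:50,1:1})
param_xgboost = {'gamma':[0,0.01,0.05,0.1,1,5,10,20],'learning_rate':[0,0.01,0.05,0.1,0.5],'max_depth':[3,4,5,6,7,8,9,10,20],'n_estimators':[100,150,200,300,400]}
# No scoring argument is passed, so the search ranks candidates by accuracy.
xgb_search = RandomizedSearchCV(xgb_classifier, param_distributions = param_xgboost)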

In [26]:
xgb_search.fit(X_train,y_train)



RandomizedSearchCV(cv='warn', error_score='raise-deprecating',
          estimator=XGBClassifier(base_score=0.5, booster='gbtree', class_weight={0: 50, 1: 1},
       colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=1),
          fit_params=None, iid='warn', n_iter=10, n_jobs=None,
          param_distributions={'gamma': [0, 0.01, 0.05, 0.1, 1, 5, 10, 20], 'learning_rate': [0, 0.01, 0.05, 0.1, 0.5], 'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 20], 'n_estimators': [100, 150, 200, 300, 400]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=0)

In [27]:
new_preds = xgb_search.predict(X_test)

In [28]:
pd.DataFrame(confusion_matrix(y_test, new_preds))

actual \ predicted,0,1
0,8,1094
1,13,6420


In [29]:
pd.DataFrame(xgb_search.cv_results_)



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_max_depth,param_learning_rate,param_gamma,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
0,15.514804,0.098985,0.081051,0.017921,400,3,0.5,1.0,"{'n_estimators': 400, 'max_depth': 3, 'learnin...",0.842441,0.854384,0.848099,0.848308,0.004878,7,0.903847,0.87528,0.894944,0.891357,0.011935
1,11.89732,0.054904,0.10718,0.002459,300,3,0.05,0.0,"{'n_estimators': 300, 'max_depth': 3, 'learnin...",0.857768,0.857968,0.857157,0.857631,0.000345,2,0.861345,0.860748,0.860954,0.861015,0.000248
2,65.352883,0.304598,1.222553,0.019668,300,20,0.05,0.0,"{'n_estimators': 300, 'max_depth': 20, 'learni...",0.853986,0.854783,0.855266,0.854678,0.000528,6,1.0,1.0,1.0,1.0,0.0
3,13.689645,0.205856,0.16091,0.00097,200,6,0.05,0.1,"{'n_estimators': 200, 'max_depth': 6, 'learnin...",0.856076,0.856276,0.857655,0.856669,0.000702,4,0.877221,0.872145,0.872101,0.873822,0.002403
4,16.199216,0.444288,0.095127,0.003202,400,3,0.0,5.0,"{'n_estimators': 400, 'max_depth': 3, 'learnin...",0.142232,0.142232,0.142246,0.142236,7e-06,9,0.142239,0.142239,0.142232,0.142236,3e-06
5,21.082724,1.184759,0.041996,0.002952,200,10,0.5,5.0,"{'n_estimators': 200, 'max_depth': 10, 'learni...",0.845825,0.846919,0.843818,0.845521,0.001284,8,0.899716,0.899716,0.910521,0.903318,0.005093
6,19.989028,0.113226,0.280649,0.002459,200,9,0.05,1.0,"{'n_estimators': 200, 'max_depth': 9, 'learnin...",0.853489,0.855778,0.856261,0.855176,0.001209,5,0.921266,0.916687,0.914353,0.917435,0.002872
7,5.970976,0.019282,0.063407,0.000601,100,5,0.05,0.05,"{'n_estimators': 100, 'max_depth': 5, 'learnin...",0.857569,0.858067,0.857456,0.857697,0.000265,1,0.862987,0.860897,0.8618,0.861894,0.000856
8,24.56199,0.144382,0.203596,0.002015,300,7,0.0,0.1,"{'n_estimators': 300, 'max_depth': 7, 'learnin...",0.142232,0.142232,0.142246,0.142236,7e-06,9,0.142239,0.142239,0.142232,0.142236,3e-06
9,47.271279,0.152171,0.480116,0.005393,400,10,0.01,5.0,"{'n_estimators': 400, 'max_depth': 10, 'learni...",0.856276,0.857171,0.857456,0.856967,0.000503,3,0.873986,0.870403,0.87011,0.8715,0.001762


In [31]:
xgb_search_fone = f1_score(y_test, new_preds)
xgb_search_fone

0.9206280920628093

### Random forests

In [36]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
rf= RandomForestClassifier(random_state=1, class_weight={0:50,1:1})
parameters_rf = {'n_estimators':[10,50,100,200],'max_depth':[3,5,6,7,8,9,20]}
rf_search = GridSearchCV(rf,param_grid = parameters_rf)
rf_search.fit(X_train,y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight={0: 50, 1: 1},
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators='warn', n_jobs=None, oob_score=False,
            random_state=1, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_estimators': [10, 50, 100, 200], 'max_depth': [3, 5, 6, 7, 8, 9, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [37]:
preds_rf=rf_search.predict(X_test)

In [38]:
pd.DataFrame(confusion_matrix(y_test, preds_rf))

actual \ predicted,0,1
0,664,438
1,2561,3872


### Naive Bayes (Multinomial)

In [32]:
from sklearn.naive_bayes import MultinomialNB
classifier4 = MultinomialNB()

In [33]:
classifier4.fit(X_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [34]:
preds4 = classifier4.predict(X_test)

In [35]:
pd.DataFrame(confusion_matrix(y_test, preds4))

actual \ predicted,0,1
0,634,468
1,2781,3652


Next, we scale the features, because the algorithm we apply next relies on Euclidean distances between rows and is therefore sensitive to the scale of each feature.

In [29]:
scaler = StandardScaler()
X_train_transform = scaler.fit_transform(X_train)
X_test_transform = scaler.transform(X_test)
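
What the scaler does can be written out by hand: the mean and standard deviation are estimated on the training set only, and the same values are then applied to the test set. The NumPy sketch below uses synthetic data with two features on very different scales.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features: one in the thousands, one in the tens.
train = rng.normal(loc=[10000.0, 12.0], scale=[5000.0, 4.0], size=(500, 2))
test = rng.normal(loc=[10000.0, 12.0], scale=[5000.0, 4.0], size=(100, 2))

mu = train.mean(axis=0)      # statistics come from the training data only
sigma = train.std(axis=0)    # population standard deviation (ddof=0)

train_z = (train - mu) / sigma
test_z = (test - mu) / sigma  # the test set reuses the training statistics
print(train_z.mean(axis=0), train_z.std(axis=0))
```

Fitting on the training data alone keeps information about the test set from leaking into the model.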



### Support Vector Classifier

In [38]:
from sklearn.svm import SVC
classifier5 = SVC(class_weight={0:0.9,1:0.1},C = 7.0,kernel = 'rbf',gamma=100)

In [47]:
features.shape

(37675, 37)

In [None]:
classifier5.fit(X_train_transform,y_train)

In [36]:
preds5 = classifier5.predict(X_test_transform)

In [37]:
pd.DataFrame(confusion_matrix(y_test, preds5))

actual \ predicted,0,1
0,0,1102
1,0,6433


### Ensemble

+ We decided to create an ensemble of our logistic regression model, the random forest model, and the Naive Bayes model.
+ The ensemble predictions are calculated by majority vote.

In [40]:
from scipy.stats import mode

In [61]:
final_pred =[]
for i in range(0,len(X_test)):
    final_pred.append(mode([predictions[i], preds_rf[i], preds4[i]])[0])
final_pred = np.array(final_pred)

In [64]:
pd.DataFrame(confusion_matrix(y_test, final_pred))

actual \ predicted,0,1
0,826,276
1,3493,2940
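
Because every model outputs 0 or 1, the majority vote can also be computed without a Python loop by stacking the prediction arrays and counting the votes. This is a minimal sketch with made-up predictions, and `majority_vote` is a hypothetical helper that assumes an odd number of models.

```python
import numpy as np


def majority_vote(*binary_predictions):
    """Element-wise majority vote over an odd number of 0/1 prediction arrays."""
    stacked = np.vstack(binary_predictions)      # shape: (n_models, n_samples)
    votes_for_one = stacked.sum(axis=0)
    return (votes_for_one * 2 > stacked.shape[0]).astype(int)


a = np.array([1, 0, 1, 0])
b = np.array([1, 1, 0, 0])
c = np.array([0, 1, 1, 0])
voted = majority_vote(a, b, c)
print(voted)
```

In the notebook, the same function could be called with `predictions`, `preds_rf`, and `preds4`.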
