# Applying Classification Modeling
The goal of this week's assessment is to find the model which best predicts whether or not a person will default on their bank loan. In doing so, we want to utilize all of the different tools we have learned over the course: data cleaning, EDA, feature engineering/transformation, feature selection, hyperparameter tuning, and model evaluation. 


#### Data Set Information:

This research aimed at the case of customers default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods. From the perspective of risk management, the result of predictive accuracy of the estimated probability of default will be more valuable than the binary result of classification - credible or not credible clients. Because the real probability of default is unknown, this study presented the novel Sorting Smoothing Method to estimate the real probability of default. With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and regression coefficient (B) to one. Therefore, among the six data mining techniques, artificial neural network is the only one that can accurately estimate the real probability of default. 

- NT is the abbreviation for New Taiwan, as in the New Taiwan dollar. 


#### Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: 
- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
- X2: Gender (1 = male; 2 = female). 
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
- X4: Marital status (1 = married; 2 = single; 3 = others). 
- X5: Age (year). 
- X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: 
    - X6 = the repayment status in September, 2005; 
    - X7 = the repayment status in August, 2005; . . .;
    - etc...
    - X11 = the repayment status in April, 2005. 
    - The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
- X12-X17: Amount of bill statement (NT dollar). 
    - X12 = amount of bill statement in September, 2005;
    - X13 = amount of bill statement in August, 2005; . . .; 
    - etc...
    - X17 = amount of bill statement in April, 2005. 
- X18-X23: Amount of previous payment (NT dollar). 
    - X18 = amount paid in September, 2005; 
    - X19 = amount paid in August, 2005; . . .;
    - etc...
    - X23 = amount paid in April, 2005. 




You will fit three different models (KNN, Logistic Regression, and Decision Tree Classifier) to predict credit card defaults and use grid search to find the best hyperparameters for each. You will then compare the performance of the three models on a test set to find the best one.


## Process/Expectations

- You will be working in pairs for this assessment

### Please have ONE notebook and be prepared to explain how you worked in your pair.

1. Clean up your data set so that you can perform an EDA. 
    - This includes handling null values, categorical variables, removing unimportant columns, and removing outliers.
2. Perform EDA to identify opportunities to create new features.
    - [Great Example of EDA for classification](https://www.kaggle.com/stephaniestallworth/titanic-eda-classification-end-to-end) 
    - [Using Pairplots with Classification](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)
3. Engineer new features. 
    - Create polynomial and/or interaction features. 
    - Additionally, you must create **at least 2 new features** that are not interactions or polynomial transformations. 
        - *For example, you can create a new dummy variable based on the value of a continuous variable (billamount6 > 2000) or take the average of some past amounts.*
4. Perform some feature selection. 
    
5. You must fit **three** models to your data and tune **at least 1 hyperparameter** per model. 
6. Using the F-1 Score, evaluate how well your models perform and identify your best model.
7. Using information from your EDA process and your model(s) output, provide insight as to which borrowers are more likely to default.
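
Since step 6 relies on the F-1 Score, here is a small worked example of how it relates to precision and recall. The labels are invented for illustration and are not taken from the assessment data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels, invented for illustration: 1 = default, 0 = no default.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# 2 true positives, 1 false positive, 1 false negative.
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(precision, recall, f1)
```

Because defaults are the minority class, F1 is more informative here than accuracy, which a model could inflate by predicting "no default" for everyone.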


In [506]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pickle 
from itertools import combinations
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest
from sklearn import metrics  # the cells below call metrics.f1_score
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import set_config
set_config(print_changed_only=False, display=None)
pd.set_option('display.max_columns', None)
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# 1. Data Cleaning

In [348]:
df = pd.read_csv('datasets/training_data.csv') #, index_col=0

### Rename columns using data dictionary

In [349]:

rename_dict= {'Unnamed: 0':'id','X1': 'credit_given', 'X2': 'gender', 'X3': 'education',
              'X4': 'married','X5': 'age', 'X6': 'repayment_sept', 'X7': 'repayment_aug', 
              'X8':'repayment_jul', 'X9': 'repayment_jun', 'X10': 'repayment_may', 
              'X11':'repayment_apr', 'X12': 'bill_sept', 'X13': 'bill_aug', 'X14':'bill_jul',
              'X15': 'bill_jun', 'X16':'bill_may', 'X17':'bill_apr', 'X18': 'payment_sept',
              'X19': 'payment_aug', 'X20':'payment_jul', 'X21': 'payment_jun',
              'X22': 'payment_may', 'X23': 'payment_apr'}
df=df.rename(columns=rename_dict)

In [350]:
# Drop the headers row
to_drop =df[df.id=='ID'].index
df.drop(to_drop, inplace=True)

In [351]:
# Change datatypes to float
cont_features = ['credit_given', 'age', 'bill_sept', 'bill_aug', 'bill_jul',
       'bill_jun', 'bill_may', 'bill_apr', 'payment_sept', 'payment_aug',
       'payment_jul', 'payment_jun', 'payment_may', 'payment_apr']

df[cont_features] = df[cont_features].astype(float)

### Separate features and target variable

In [352]:

X = df.drop('Y', axis = 1) # grabs everything else but target

# Create target variable
y = df['Y'].astype(int) # y is the column we're trying to predict

cat_features = ['gender', 'education', 'married','repayment_sept', 
         'repayment_aug', 'repayment_jul', 'repayment_jun',
       'repayment_may', 'repayment_apr']

### Impute Outliers

In [353]:
# Cap values more than 5 standard deviations above the mean at that threshold
for feat in cont_features:
    abv_5_std = X[feat].mean() + 5 * X[feat].std()
    X[feat] = np.where(X[feat] > abv_5_std, abv_5_std, X[feat])
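
The same upper-tail capping can also be written with pandas' `Series.clip`. The sketch below uses a toy column invented for illustration; only the column name is borrowed from the notebook.

```python
import pandas as pd

# Toy column, invented for illustration: 99 ordinary values and one extreme one.
toy = pd.DataFrame({'payment_sept': [1000.0] * 99 + [1_000_000.0]})

for feat in ['payment_sept']:
    # Threshold: mean + 5 standard deviations, as in the cell above.
    upper = toy[feat].mean() + 5 * toy[feat].std()
    # Values above the threshold are replaced by the threshold itself.
    toy[feat] = toy[feat].clip(upper=upper)

print(toy['payment_sept'].max())
```

Note that only the upper tail is capped in either version; unusually low (for example, negative bill) values are left unchanged.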

# 2. EDA

In [354]:
# sns.barplot(X.credit_given,y, orient='h');

In [355]:
print(X.age[y.values ==0].mean())
print(X.age[y.values ==1].mean())
print('\n')
print(X.credit_given[y.values ==0].mean())
print(X.credit_given[y.values ==1].mean())

35.3756510789308
35.699085123309466


177909.01375548774
129301.44789180589
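
The per-class means printed above can also be produced in one call with `groupby`. This is a minimal sketch on a toy frame invented for illustration, assuming the features and the target share the same index.

```python
import pandas as pd

# Toy data, invented for illustration; column names mirror the notebook's.
toy = pd.DataFrame({
    'age': [25.0, 40.0, 35.0, 50.0],
    'credit_given': [200000.0, 100000.0, 50000.0, 150000.0],
})
target = pd.Series([0, 0, 1, 1], name='Y')

# One row per class (0 = no default, 1 = default), one column per feature.
class_means = toy.groupby(target)[['age', 'credit_given']].mean()
print(class_means)
```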


In [356]:
# sns.distplot(X.age[y.values ==0]);
# sns.distplot(X.age[y.values ==1]);

In [357]:
# sns.distplot(X.credit_given[y.values ==0]);
# sns.distplot(X.credit_given[y.values ==1]);

# 3. Feature Engineering

### Dummy Variables

In [358]:
dummies = pd.get_dummies(X[cat_features])

### Interaction Features

In [359]:
# Generate all pairs of continuous features
interactions = list(combinations(cont_features, 2))
interaction_dict = {}

for interaction in interactions:
    X_copy = X[cont_features].copy()
    X_copy['interact'] = X_copy[interaction[0]] * X_copy[interaction[1]]  
    logreg = LogisticRegression(C=1e5, class_weight='balanced')
    logreg.fit(X_copy, y) # run the model with each candidate interaction added
    y_pred = logreg.predict(X_copy)
    # Map training F1 -> interaction (two interactions with identical F1 would overwrite each other)
    interaction_dict[metrics.f1_score(y, y_pred)] = interaction
sorted(interaction_dict.items(), reverse = True)[:5]

[(0.3671787829370869, ('payment_sept', 'payment_aug')),
 (0.3542783219367954, ('payment_sept', 'payment_apr')),
 (0.35408631772268145, ('payment_jul', 'payment_apr')),
 (0.3515395586878094, ('bill_jul', 'payment_jul')),
 (0.3512912604754575, ('bill_sept', 'bill_may'))]
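
Keying a dictionary by F1 score means two interactions with exactly the same score would collide. One alternative, sketched here on synthetic data with hypothetical feature names `a`, `b`, `c`, is to collect `(score, pair)` tuples in a list and sort that instead.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic data, invented for illustration: the target depends on a * b.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
toy_y = (toy['a'] * toy['b'] > 0).astype(int)

results = []
for name1, name2 in combinations(toy.columns, 2):
    X_copy = toy.copy()
    X_copy['interact'] = X_copy[name1] * X_copy[name2]
    model = LogisticRegression(class_weight='balanced', max_iter=1000)
    model.fit(X_copy, toy_y)
    score = f1_score(toy_y, model.predict(X_copy))
    # A list of tuples keeps every pair, even when scores tie.
    results.append((score, (name1, name2)))

results.sort(reverse=True)
print(results)
```

As in the cell above, the scores are computed on the data the model was fit on, so they rank interactions but do not estimate out-of-sample performance.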

In [496]:
# Add best interactions to new features dataframe
top_interactions = sorted(interaction_dict.keys(), reverse = True)[:25]
new_features = pd.DataFrame()

for interaction in top_interactions:
    feature1 = interaction_dict[interaction][0]
    feature2 = interaction_dict[interaction][1]
    new_features[feature1+'_X_'+feature2] = df[feature1] * df[feature2]

### Polynomial Features

In [498]:
for feat in cont_features:
    new_features[feat+'^2'] = df[feat]**2
    new_features[feat+'^3'] = df[feat]**3
new_features.head()

Unnamed: 0,payment_sept_X_payment_aug,payment_sept_X_payment_apr,payment_jul_X_payment_apr,bill_jul_X_payment_jul,bill_sept_X_bill_may,bill_aug_X_bill_apr,payment_sept_X_payment_jun,payment_aug_X_payment_jul,bill_aug_X_payment_jun,bill_sept_X_payment_jul,bill_sept_X_payment_may,payment_jul_X_payment_jun,payment_sept_X_payment_jul,payment_aug_X_payment_may,payment_may_X_payment_apr,payment_jun_X_payment_may,payment_jun_X_payment_apr,age_X_payment_jul,credit_given_X_age,age_X_payment_may,age_X_payment_jun,payment_aug_X_payment_apr,payment_jul_X_payment_may,age_X_payment_aug,age_X_bill_jun,credit_given^2,credit_given^3,age^2,age^3,bill_sept^2,bill_sept^3,bill_aug^2,bill_aug^3,bill_jul^2,bill_jul^3,bill_jun^2,bill_jun^3,bill_may^2,bill_may^3,bill_apr^2,bill_apr^3,payment_sept^2,payment_sept^3,payment_aug^2,payment_aug^3,payment_jul^2,payment_jul^3,payment_jun^2,payment_jun^3,payment_may^2,payment_may^3,payment_apr^2,payment_apr^3
0,80180000.0,1437790000.0,1455187000.0,2205366000.0,40481450000.0,41013320000.0,60060000.0,81150178.0,1334341000.0,2252914000.0,2445684000.0,60786726.0,101210000.0,88093766.0,1579700000.0,65987922.0,863536674.0,364356.0,7920000.0,395532.0,216216.0,1152820000.0,111199427.0,288648.0,7962948.0,48400000000.0,1.0648e+16,1296.0,46656.0,49549870000.0,1.10297e+16,49358620000.0,1.096591e+16,47480410000.0,1.034598e+16,48926340000.0,1.082216e+16,33072700000.0,6014567000000000.0,34079010000.0,6291155000000000.0,100000000.0,1000000000000.0,64288324.0,515463800000.0,102434641.0,1036741000000.0,36072036.0,216648600000.0,120714169.0,1326287000000.0,20672400000.0,2972257000000000.0
1,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,9454.0,5800000.0,9454.0,9454.0,106276.0,106276.0,9454.0,9454.0,40000000000.0,8000000000000000.0,841.0,24389.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4860000.0,0.0,0.0,0.0,0.0,0.0,0.0,32400000000.0,5832000000000000.0,729.0,19683.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3150100.0,2779500.0,2283000.0,72436550.0,2170775000.0,2205961000.0,2868444.0,2587400.0,80297860.0,78188180.0,76441540.0,2356056.0,2820266.0,2529600.0,2232000.0,2303424.0,2322000.0,48704.0,2560000.0,47616.0,49536.0,2550000.0,2264736.0,54400.0,1404224.0,6400000000.0,512000000000000.0,1024.0,32768.0,2639082000.0,135574900000000.0,2690704000.0,139572200000000.0,2265094000.0,107802600000000.0,1925630000.0,84500490000000.0,1785570000.0,75451030000000.0,1808546000.0,76912020000000.0,3433609.0,6362477000.0,2890000.0,4913000000.0,2316484.0,3525689000.0,2396304.0,3709479000.0,2214144.0,3294646000.0,2250000.0,3375000000.0
4,2200000.0,2000000.0,600000.0,2926800.0,21790220.0,21562520.0,600000.0,660000.0,2398500.0,4954200.0,2477100.0,180000.0,1200000.0,330000.0,300000.0,90000.0,300000.0,16200.0,270000.0,8100.0,8100.0,1100000.0,180000.0,29700.0,146988.0,100000000.0,1000000000000.0,729.0,19683.0,68178050.0,562946200000.0,63920020.0,511040600000.0,23794880.0,116071400000.0,29637140.0,161344600000.0,6964321.0,18378840000.0,7273809.0,19617460000.0,4000000.0,8000000000.0,1210000.0,1331000000.0,360000.0,216000000.0,90000.0,27000000.0,90000.0,27000000.0,1000000.0,1000000000.0


### Log Features

In [499]:
# log(0) gives -inf and log of a negative value gives NaN; both are set to 0 below
for feat in cont_features:
    new_features['log_'+feat] = np.log(df[feat])
new_features.replace([np.inf, -np.inf], 0, inplace=True)
new_features.fillna(0, inplace=True)
new_features.head()

Unnamed: 0,payment_sept_X_payment_aug,payment_sept_X_payment_apr,payment_jul_X_payment_apr,bill_jul_X_payment_jul,bill_sept_X_bill_may,bill_aug_X_bill_apr,payment_sept_X_payment_jun,payment_aug_X_payment_jul,bill_aug_X_payment_jun,bill_sept_X_payment_jul,bill_sept_X_payment_may,payment_jul_X_payment_jun,payment_sept_X_payment_jul,payment_aug_X_payment_may,payment_may_X_payment_apr,payment_jun_X_payment_may,payment_jun_X_payment_apr,age_X_payment_jul,credit_given_X_age,age_X_payment_may,age_X_payment_jun,payment_aug_X_payment_apr,payment_jul_X_payment_may,age_X_payment_aug,age_X_bill_jun,credit_given^2,credit_given^3,age^2,age^3,bill_sept^2,bill_sept^3,bill_aug^2,bill_aug^3,bill_jul^2,bill_jul^3,bill_jun^2,bill_jun^3,bill_may^2,bill_may^3,bill_apr^2,bill_apr^3,payment_sept^2,payment_sept^3,payment_aug^2,payment_aug^3,payment_jul^2,payment_jul^3,payment_jun^2,payment_jun^3,payment_may^2,payment_may^3,payment_apr^2,payment_apr^3,log_credit_given,log_age,log_bill_sept,log_bill_aug,log_bill_jul,log_bill_jun,log_bill_may,log_bill_apr,log_payment_sept,log_payment_aug,log_payment_jul,log_payment_jun,log_payment_may,log_payment_apr
0,80180000.0,1437790000.0,1455187000.0,2205366000.0,40481450000.0,41013320000.0,60060000.0,81150178.0,1334341000.0,2252914000.0,2445684000.0,60786726.0,101210000.0,88093766.0,1579700000.0,65987922.0,863536674.0,364356.0,7920000.0,395532.0,216216.0,1152820000.0,111199427.0,288648.0,7962948.0,48400000000.0,1.0648e+16,1296.0,46656.0,49549870000.0,1.10297e+16,49358620000.0,1.096591e+16,47480410000.0,1.034598e+16,48926340000.0,1.082216e+16,33072700000.0,6014567000000000.0,34079010000.0,6291155000000000.0,100000000.0,1000000000000.0,64288324.0,515463800000.0,102434641.0,1036741000000.0,36072036.0,216648600000.0,120714169.0,1326287000000.0,20672400000.0,2972257000000000.0,12.301383,3.583519,12.313123,12.311189,12.291792,12.306791,12.110987,12.125974,9.21034,8.989444,9.222368,8.700514,9.304468,11.876033
1,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,106276.0,9454.0,5800000.0,9454.0,9454.0,106276.0,106276.0,9454.0,9454.0,40000000000.0,8000000000000000.0,841.0,24389.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,106276.0,34645980.0,12.206073,3.367296,5.786897,5.786897,5.786897,5.786897,5.786897,5.786897,5.786897,5.786897,5.786897,5.786897,5.786897,5.786897
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4860000.0,0.0,0.0,0.0,0.0,0.0,0.0,32400000000.0,5832000000000000.0,729.0,19683.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.100712,3.295837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3150100.0,2779500.0,2283000.0,72436550.0,2170775000.0,2205961000.0,2868444.0,2587400.0,80297860.0,78188180.0,76441540.0,2356056.0,2820266.0,2529600.0,2232000.0,2303424.0,2322000.0,48704.0,2560000.0,47616.0,49536.0,2550000.0,2264736.0,54400.0,1404224.0,6400000000.0,512000000000000.0,1024.0,32768.0,2639082000.0,135574900000000.0,2690704000.0,139572200000000.0,2265094000.0,107802600000000.0,1925630000.0,84500490000000.0,1785570000.0,75451030000000.0,1808546000.0,76912020000000.0,3433609.0,6362477000.0,2890000.0,4913000000.0,2316484.0,3525689000.0,2396304.0,3709479000.0,2214144.0,3294646000.0,2250000.0,3375000000.0,11.289782,3.465736,10.846849,10.856534,10.770441,10.689259,10.651502,10.657894,7.524561,7.438384,7.327781,7.344719,7.305188,7.31322
4,2200000.0,2000000.0,600000.0,2926800.0,21790220.0,21562520.0,600000.0,660000.0,2398500.0,4954200.0,2477100.0,180000.0,1200000.0,330000.0,300000.0,90000.0,300000.0,16200.0,270000.0,8100.0,8100.0,1100000.0,180000.0,29700.0,146988.0,100000000.0,1000000000000.0,729.0,19683.0,68178050.0,562946200000.0,63920020.0,511040600000.0,23794880.0,116071400000.0,29637140.0,161344600000.0,6964321.0,18378840000.0,7273809.0,19617460000.0,4000000.0,8000000000.0,1210000.0,1331000000.0,360000.0,216000000.0,90000.0,27000000.0,90000.0,27000000.0,1000000.0,1000000000.0,9.21034,3.295837,9.018817,8.986572,8.492491,8.602269,7.878155,7.899895,7.600902,7.003065,6.39693,5.703782,5.703782,6.907755
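
Replacing `-inf` and `NaN` with 0 makes zero and negative amounts indistinguishable from an amount of 1 (whose log is also 0). A possible alternative, shown here as a sketch rather than what this notebook does, is a signed `log1p` transform. The values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy bill amounts, invented for illustration: zero, positive, and negative.
toy = pd.Series([0.0, 99.0, -50.0], name='bill_sept')

# log1p(|x|) is defined at 0, and multiplying by the sign keeps negatives negative.
signed_log = np.sign(toy) * np.log1p(toy.abs())
print(signed_log)
```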


# 4. Feature Selection

In [500]:
# Concatenate all engineered features together
X = pd.concat([X[cont_features], new_features, dummies], axis=1)
X.shape

(22499, 158)

### Train/Test Split

In [501]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)

### Standardizing

In [502]:
scaler = StandardScaler()  
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)  
X_test_scaled = scaler.transform(X_test)  
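
When scaling is combined with cross-validation or grid search, wrapping the scaler and the model in a pipeline re-fits the scaler on each training fold only. A minimal sketch, using synthetic data from `make_classification` rather than the loan data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target.
toy_X, toy_y = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is fit inside each fold, so validation rows never influence it.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=1000),
)
scores = cross_val_score(pipe, toy_X, toy_y, cv=5, scoring='f1')
print(scores.mean())
```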

## Logistic Regression Feature Selection

### All Features

In [503]:
logreg = LogisticRegression(C=1e5, class_weight='balanced')
logreg.fit(X_train, y_train)
y_train_pred = logreg.predict(X_train)
y_test_pred = logreg.predict(X_test)
print('Train F1:',metrics.f1_score(y_train, y_train_pred))
print('Test F1:',metrics.f1_score(y_test, y_test_pred))

Train F1: 0.2660499018783291
Test F1: 0.26479352916134524


### Scaled Features

In [504]:
logreg = LogisticRegression(C=1e5, class_weight='balanced')
logreg.fit(X_train_scaled, y_train)
y_train_pred = logreg.predict(X_train_scaled)
y_test_pred = logreg.predict(X_test_scaled)
print('Train F1:',metrics.f1_score(y_train, y_train_pred))
print('Test F1:',metrics.f1_score(y_test, y_test_pred))

Train F1: 0.536552193131588
Test F1: 0.5429250084832032


### Select K-Best

In [505]:
# Select K-Best - use loop to determine best K value, which is 7
for k in range(5,9):
    selector = SelectKBest(k=k)
    selector.fit(X_train, y_train)
    kbest_features = X.columns[selector.get_support()]
    logreg = LogisticRegression(C=1e5, class_weight='balanced')
    logreg.fit(X_train[kbest_features], y_train)
    y_train_pred = logreg.predict(X_train[kbest_features])
    print(k, metrics.f1_score(y_train, y_train_pred))


5 0.5185939887926643
6 0.519625682349883
7 0.523354703650399
8 0.5163528245787908
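
`SelectKBest` with no arguments scores each feature with the ANOVA F-test (`f_classif`) and keeps the `k` highest. The sketch below, on synthetic data with hypothetical column names `f0` to `f5`, shows how to inspect those scores alongside the selected columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data, invented for illustration.
toy_X, toy_y = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
toy_X = pd.DataFrame(toy_X, columns=[f'f{i}' for i in range(6)])

selector = SelectKBest(score_func=f_classif, k=2).fit(toy_X, toy_y)

# Per-feature F statistics, highest first.
ranking = pd.Series(selector.scores_, index=toy_X.columns).sort_values(ascending=False)
# Column names kept by the selector.
kept = list(toy_X.columns[selector.get_support()])
print(ranking)
print(kept)
```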


### Recursive Feature Elimination

In [368]:
logreg = LogisticRegression(C=1e5, class_weight='balanced')
selector = RFECV(estimator=logreg, step=1, cv=5, scoring='f1', n_jobs=-1, verbose=2)
selector.fit(X, y)
rfe_features = X.columns[selector.support_]
rfe_features

In [411]:
logreg = LogisticRegression(C=1e5, class_weight='balanced')
logreg.fit(X_train[rfe_features], y_train)
y_train_pred = logreg.predict(X_train[rfe_features])
y_test_pred = logreg.predict(X_test[rfe_features])
print('Train F1:',metrics.f1_score(y_train, y_train_pred))
print('Test F1:',metrics.f1_score(y_test, y_test_pred))

Train F1: 0.3691510226954329
Test F1: 0.36699978736976396


## Decision Tree Feature Selection
(Decision trees choose features implicitly through their splits, so this step isn't strictly necessary, but it's included for comparison.)

### All Features

In [516]:
tree = DecisionTreeClassifier()
tree.fit(X_train,y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
print('Train F1:',metrics.f1_score(y_train, y_train_pred))
print('Test F1:',metrics.f1_score(y_test, y_test_pred))

Train F1: 0.9993376606173002
Test F1: 0.39009044435705853


### Select K-Best

In [417]:
for k in range(1,10): # Determine best k - it's 3
    selector = SelectKBest(k=k)
    selector.fit(X_train, y_train)
    kbest_features = X.columns[selector.get_support()]
    tree = DecisionTreeClassifier()
    tree.fit(X_train[kbest_features],y_train)
    y_train_pred = tree.predict(X_train[kbest_features])
    y_test_pred = tree.predict(X_test[kbest_features])
    print(k, metrics.f1_score(y_train, y_train_pred), metrics.f1_score(y_test, y_test_pred))


1 0.3937216338880484 0.42099322799097066
2 0.3937216338880484 0.42099322799097066
3 0.46500672947510097 0.4812563323201621
4 0.4422569871682194 0.45741831815747186
5 0.4405693950177936 0.4500537056928035
6 0.45149973688826517 0.4538174052322477
7 0.44833687190375093 0.4560085836909871
8 0.5191295191295192 0.42309711286089235
9 0.8320382546323968 0.37719298245614036


# 5. Model Fitting and Hyperparameter Tuning
Logistic Regression, KNN, Decision Tree, Random Forest, Boosting

### Class Imbalance

## KNN Model Tuning

In [401]:
# Loop to find optimal n value for KNN - determined to be 19
# for n in range(1,21):
#     knn = KNeighborsClassifier(n_neighbors=n)
#     knn.fit(X_train_scaled, y_train)
#     # y_train_pred = knn.predict(X_train_scaled)
#     y_test_pred = knn.predict(X_test_scaled)
#     # print('Train F1:',metrics.f1_score(y_train, y_train_pred))
#     print('Test F1:',metrics.f1_score(y_test, y_test_pred), 'for n:', n)

Test F1: 0.39285714285714285 for n: 1
Test F1: 0.3122119815668203 for n: 2
Test F1: 0.4108956602031395 for n: 3
Test F1: 0.35214446952595935 for n: 4
Test F1: 0.4299802761341223 for n: 5
Test F1: 0.39667590027700833 for n: 6
Test F1: 0.4514811031664964 for n: 7
Test F1: 0.4069119286510591 for n: 8
Test F1: 0.44881075491209926 for n: 9
Test F1: 0.41883656509695294 for n: 10
Test F1: 0.44992134242265336 for n: 11
Test F1: 0.42951541850220265 for n: 12
Test F1: 0.4513742071881607 for n: 13
Test F1: 0.4269911504424779 for n: 14
Test F1: 0.44728434504792336 for n: 15
Test F1: 0.43886462882096067 for n: 16
Test F1: 0.4509594882729211 for n: 17
Test F1: 0.43910431458219545 for n: 18
Test F1: 0.4527495995728777 for n: 19
Test F1: 0.4464964693101576 for n: 20
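
The loop above picks `n_neighbors` by looking at test-set F1, which makes the reported test score optimistic. A possible alternative, sketched here on synthetic data, is to tune `n_neighbors` with `GridSearchCV` on the training set and touch the test set only once at the end. The step names `scale` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target.
toy_X, toy_y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.25, random_state=5)

pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
param_grid = {'knn__n_neighbors': list(range(1, 21, 2))}

# 5-fold cross-validation on the training split only.
grid_knn = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')
grid_knn.fit(X_tr, y_tr)

print(grid_knn.best_params_, grid_knn.best_score_)
test_f1 = grid_knn.score(X_te, y_te)  # uses the 'f1' scorer on held-out data
print(test_f1)
```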


## Decision Tree Grid Search

In [481]:
# parameters={'max_depth': range(1,8),
#             'criterion': ['gini', 'entropy'],
#             'class_weight': ['balanced', None],
#             'max_features':['auto', 'sqrt', 'log2', None],
#             'max_leaf_nodes': range(5, 20),
#             'min_samples_leaf': range(1,8)
#             }

# dtc = DecisionTreeClassifier()
# grid_tree = GridSearchCV(dtc, parameters, cv=10, scoring='f1', verbose=1, n_jobs=-1)
# grid_tree.fit(X_train, y_train)

# print(grid_tree.best_score_)
# print(grid_tree.best_params_)
# print(grid_tree.best_estimator_)

Fitting 10 folds for each of 11760 candidates, totalling 117600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.0s
[Parallel(n_jobs=-1)]: Done 340 tasks      | elapsed:   10.5s
[Parallel(n_jobs=-1)]: Done 840 tasks      | elapsed:   23.7s
[Parallel(n_jobs=-1)]: Done 1540 tasks      | elapsed:   41.9s
[Parallel(n_jobs=-1)]: Done 2440 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 3540 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 4840 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 6340 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 7730 tasks      | elapsed:  4.7min
[Parallel(n_jobs=-1)]: Done 8680 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done 9730 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done 10880 tasks      | elapsed:  7.2min
[Parallel(n_jobs=-1)]: Done 12130 tasks      | elapsed:  9.1min
[Parallel(n_jobs=-1)]: Done 13480 tasks      | elapsed: 10.8min
[Parallel(n_jobs=-1)]: Done 14930 tasks   

0.5136521021406051
{'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 5, 'max_features': None, 'max_leaf_nodes': 11, 'min_samples_leaf': 1}
DecisionTreeClassifier(ccp_alpha=0.0, class_weight='balanced',
                       criterion='entropy', max_depth=5, max_features=None,
                       max_leaf_nodes=11, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       presort='deprecated', random_state=None,
                       splitter='best')


In [492]:
# pickle_out = open("tree_model.pickle","wb")
# pickle.dump(grid_tree, pickle_out)
# pickle_out.close()

## Random Forest

In [528]:
forest = RandomForestClassifier(random_state = 1, n_estimators=100, max_depth=3, max_features=None,n_jobs=-1)
forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
print('Train F1:',metrics.f1_score(y_train, y_train_pred))
print('Test F1:',metrics.f1_score(y_test, y_test_pred))


Train F1: 0.42334949645654607
Test F1: 0.43997766610831945


### Grid Search

In [None]:
params = { 
    'n_estimators': [100,300,500,700,1000],
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(2,10)),
    # 'auto' is omitted: for RandomForestClassifier it was an alias of 'sqrt'
    'max_features': ['sqrt', None],
    'class_weight': ['balanced', 'balanced_subsample'],
    'max_leaf_nodes': range(5, 20),
    'min_samples_leaf': range(1,8)
}
grid_forest = GridSearchCV(RandomForestClassifier(), params, cv=5, scoring='f1', verbose=1, n_jobs=-1)
grid_forest.fit(X_train, y_train)

print(grid_forest.best_score_)
print(grid_forest.best_params_)
print(grid_forest.best_estimator_)

# Save the fitted random forest grid search
with open("forest_model.pickle", "wb") as pickle_out:
    pickle.dump(grid_forest, pickle_out)

# 6. Model Evaluation

### Best Logistic Regression Model

In [418]:
logreg = LogisticRegression(C=1e5, class_weight='balanced')
logreg.fit(X_train_scaled, y_train)
y_train_pred = logreg.predict(X_train_scaled)
y_test_pred = logreg.predict(X_test_scaled)
print('Train F1:',metrics.f1_score(y_train, y_train_pred))
print('Test F1:',metrics.f1_score(y_test, y_test_pred))

Train F1: 0.5361315938740783
Test F1: 0.5456410256410257


### Best KNN Model

In [None]:
knn = KNeighborsClassifier(n_neighbors=19)
knn.fit(X_train_scaled, y_train)
y_train_pred = knn.predict(X_train_scaled)
y_test_pred = knn.predict(X_test_scaled)
print('Train F1:',metrics.f1_score(y_train, y_train_pred))
print('Test F1:',metrics.f1_score(y_test, y_test_pred))

### Best Decision Tree

In [491]:
pickle_in = open("tree_model.pickle","rb")
decision_tree = pickle.load(pickle_in)
pickle_in.close()
y_train_pred = decision_tree.best_estimator_.predict(X_train)
print("Train F1:",metrics.f1_score(y_train, y_train_pred))
y_test_pred = decision_tree.best_estimator_.predict(X_test)
print("Test F1:",metrics.f1_score(y_test, y_test_pred))

Train F1: 0.4986434108527132
Test F1: 0.5013177159590043


## Voting Classifier

## Bagging Classifier

### Best Random Forest

## 7. Final Model