# Applying Classification Modeling
The goal of this week's assessment is to find the model which best predicts whether a person will default on their loan. In doing so, we want to utilize all of the different tools we have learned over the course: data cleaning, EDA, feature engineering/transformation, feature selection, hyperparameter tuning, and model evaluation. 

Dataset: The dataset covers customers' default payments in Taiwan. More information about the dataset and its columns can be found at the link below.

https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#


You will fit three different models (KNN, Logistic Regression, and Decision Tree Classifier) and use grid search to find the best hyperparameters for each. Then you will compare the performance of those three models on a test set to find the best one.
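
The workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in generated with `make_classification` (not the credit dataset), and the parameter grids, the F1 scoring choice, and the names `candidates`, `test_scores`, and `best_name` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the credit data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.78, 0.22], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=1)

# One (estimator, grid) pair per model; the grids are illustrative only
candidates = {
    'knn': (Pipeline([('scale', StandardScaler()), ('clf', KNeighborsClassifier())]),
            {'clf__n_neighbors': [3, 7, 15]}),
    'logreg': (Pipeline([('scale', StandardScaler()),
                         ('clf', LogisticRegression(solver='liblinear'))]),
               {'clf__C': [0.01, 0.1, 1, 10]}),
    'tree': (DecisionTreeClassifier(random_state=1),
             {'max_depth': [2, 4, 8]}),
}

test_scores = {}
for name, (estimator, grid) in candidates.items():
    # Hyperparameters are tuned with cross-validation on the training split only
    search = GridSearchCV(estimator, grid, scoring='f1', cv=5)
    search.fit(X_tr, y_tr)
    # The test split is used once per model, for the final comparison
    test_scores[name] = f1_score(y_te, search.predict(X_te))
    print(name, search.best_params_, round(test_scores[name], 3))

best_name = max(test_scores, key=test_scores.get)
print('Best on test set:', best_name)
```

Wrapping the scaler and the classifier in a `Pipeline` means the scaler is refit inside each cross-validation fold, so the validation folds do not influence the scaling.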


## Process/Expectations

#### You will be working in pairs for this assessment; please have ONE notebook and be prepared to explain how you worked in your pair.
1. Clean up your data set so that you can do EDA. This includes handling null values, categorical variables, removing unimportant columns, and removing outliers.
2. Perform EDA to identify opportunities to create new features.
    - [Great Example of EDA for classification](https://www.kaggle.com/stephaniestallworth/titanic-eda-classification-end-to-end) 
    - [Using Pairplots with Classification](https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166)
3. Create polynomial and/or interaction features. You must also create at least 2 new features that are not interactions or polynomial transformations. For example, you can create a new dummy variable based on the value of a continuous variable (e.g., BILL_AMT6 > 2000) or take the average of some past amounts.
4. Perform some feature selection. This can happen beforehand using F-scores, or you can do it as part of your model building process by looking at the weights of your regularized logistic regression or the feature importances of your decision tree.
5. You must fit each of the three models to your data and tune at least 1 hyperparameter per model. 
6. After identifying the best hyperparameters for each model, evaluate those models on the test set and identify the best model overall using the evaluation metric of your choice.
7. Present your best model.
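
As an illustration of step 3, here is a minimal sketch of the two kinds of non-polynomial features mentioned there. It assumes a small hand-made frame that borrows the dataset's column names; the new feature names `high_bill6`, `avg_bill`, and `bill_to_limit` are hypothetical.

```python
import pandas as pd

# Tiny hand-made frame using the dataset's column names (illustrative values)
demo = pd.DataFrame({
    'BILL_AMT4': [466, 13026, 4800],
    'BILL_AMT5': [466, 13268, 9810],
    'BILL_AMT6': [316, 13497, 660],
    'LIMIT_BAL': [350000, 50000, 50000],
})

bill_cols = ['BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']

# Dummy variable from a continuous column: 1 if BILL_AMT6 exceeds 2000
demo['high_bill6'] = (demo['BILL_AMT6'] > 2000).astype(int)

# Average of the past bill amounts
demo['avg_bill'] = demo[bill_cols].mean(axis=1)

# Ratio of the average bill to the credit limit (hypothetical extra feature)
demo['bill_to_limit'] = demo['avg_bill'] / demo['LIMIT_BAL']

print(demo)
```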

In [1]:
# import libraries
import itertools
import warnings
from math import exp

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from scipy import stats
from scipy.stats import norm
import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn import metrics
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import (accuracy_score, f1_score, make_scorer,
                             mean_squared_error, roc_auc_score)
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

import xgboost as xgb
from xgboost import XGBClassifier

## 1. Data Cleaning

In [2]:
df = pd.read_csv('student_data.csv')
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,2873,350000,1,1,2,37,-2,-2,-2,-2,...,466,466,316,316,316,466,466,316,316,0
1,3598,50000,2,2,1,37,2,2,2,0,...,13026,13268,13497,5500,0,580,600,600,600,0
2,27623,50000,2,1,2,23,-1,-1,-1,-1,...,4800,9810,660,2548,2321,4800,9810,660,2980,0
3,6874,20000,1,3,1,56,0,0,0,0,...,13784,13420,13686,1508,1216,1116,0,490,658,0
4,6444,110000,2,2,2,32,0,0,0,0,...,108829,110557,106082,5400,5400,4100,4100,4100,4200,0


In [None]:
df.drop(columns = ['ID'], inplace = True)
df.rename(columns = {'default payment next month':'DEFAULT'}, inplace = True) 

## 2. EDA

In [None]:
df.boxplot('LIMIT_BAL')

In [None]:
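# Inspect the extreme credit-limit values flagged by the boxplot
df.loc[df['LIMIT_BAL']>900000]
# Drop the outlier row found above (index 13774)
df.drop(index=[13774], inplace=True)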
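
A hard-coded row index only works for one particular copy of the data. As a hedged alternative, the same 900000 threshold can be applied through a boolean mask. The frame below is made up for illustration, and `outlier_mask` and `cleaned` are hypothetical names.

```python
import pandas as pd

# Made-up frame with one extreme credit limit
demo = pd.DataFrame({'LIMIT_BAL': [50000, 120000, 1000000, 20000]},
                    index=[10, 11, 12, 13])

# Flag rows above the same 900000 threshold used in the cell above
outlier_mask = demo['LIMIT_BAL'] > 900000
print(demo[outlier_mask])

# Keep every row that is not flagged, whatever its index happens to be
cleaned = demo[~outlier_mask]
print(cleaned)
```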

In [None]:
df.boxplot(figsize=(8,8))
plt.xticks(rotation=90)

## 3. Feature Engineering

In [None]:
education_dummies = pd.get_dummies(df['EDUCATION'], prefix='education')
marriage_dummies = pd.get_dummies(df['MARRIAGE'], prefix='marriage')
df = pd.concat([df, education_dummies, marriage_dummies], axis=1)
df.drop(columns = ['EDUCATION','MARRIAGE'], inplace = True)
df.columns

In [None]:
corrmat = df[['DEFAULT','LIMIT_BAL', 'AGE','SEX', 'education_0', 'education_1', 'education_2',
       'education_3', 'education_4', 'education_5', 'education_6',
       'marriage_0', 'marriage_1', 'marriage_2', 'marriage_3']].corr()

sns.set(font_scale=1)
fig,ax= plt.subplots()
fig.set_size_inches(10,10)
plt.tight_layout()
sns.heatmap(corrmat,square=True,annot=True, cbar = True)

In [None]:
corrmat = df[['DEFAULT','LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',\
              'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', \
              'BILL_AMT6', 'PAY_AMT1','PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5',\
              'PAY_AMT6']].corr()
sns.set(font_scale=1)
fig,ax= plt.subplots()
fig.set_size_inches(10,10)
plt.tight_layout()
sns.heatmap(corrmat,square=True,annot=True, cbar = True)

In [None]:
sns.countplot(x="SEX", data=df,hue="DEFAULT", palette="muted")

In [None]:
# Optional: binarize each repayment-status column (1 = payment delayed, 0 = otherwise).
# Each column is recoded from its own values.
# pay_cols = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
# for col in pay_cols:
#     df[col] = np.where(df[col] >= 1, 1, 0)

In [None]:
X = df.drop(['DEFAULT'],axis=1)
y = df['DEFAULT']

X.corrwith(df['DEFAULT']).plot.bar(
        figsize = (20, 10), title = "Correlation with Default", fontsize = 20,
        rot = 90, grid = True)

## 4. Feature Selection

In [None]:
# recursive feature selection
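
The placeholder above mentions recursive feature selection. A minimal sketch of what that could look like with scikit-learn's `RFE` follows. It assumes synthetic data from `make_classification` rather than the credit dataset, and the choice of a logistic regression estimator and of five retained features is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 12 features, 4 of them informative)
X_demo, y_demo = make_classification(n_samples=400, n_features=12,
                                     n_informative=4, random_state=1)

# Scale first so the coefficient magnitudes used for ranking are comparable
X_scaled = StandardScaler().fit_transform(X_demo)

# Repeatedly fit the model and drop the weakest feature until 5 remain
selector = RFE(LogisticRegression(solver='liblinear'), n_features_to_select=5)
selector.fit(X_scaled, y_demo)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print('Selected feature indices:', selected)
print('Ranking (1 = kept):', selector.ranking_)
```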

In [None]:
df.columns

In [None]:
features=df[['LIMIT_BAL', 'SEX', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5',
       'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4',
       'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3',
       'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'education_0',
       'education_1', 'education_2', 'education_3', 'education_4',
       'education_5', 'education_6', 'marriage_0', 'marriage_1', 'marriage_2',
       'marriage_3']]

In [None]:
# min_train = X_train.min()
# range_train = (X_train - min_train).max()
# X_train_scaled = (X_train - min_train)/range_train

In [None]:
# min_test = X_test.min()
# range_test = (X_test - min_test).max()
# X_test_scaled = (X_test - min_test)/range_test
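
The commented-out cells above scale the test set with its own minimum and range. A common alternative is to learn the scaling from the training data and reuse it on the test data, so both sets share one scale. The sketch below uses `MinMaxScaler` on made-up arrays; `demo_scaler` and the array names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test arrays with two features
X_train_demo = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test_demo = np.array([[2.5, 40.0]])

# Learn min and range from the training data only
demo_scaler = MinMaxScaler()
X_train_scaled_demo = demo_scaler.fit_transform(X_train_demo)

# Apply the training min and range to the test data
X_test_scaled_demo = demo_scaler.transform(X_test_demo)

print(X_train_scaled_demo)
print(X_test_scaled_demo)
```

Test values can fall outside the [0, 1] interval when they lie outside the range seen in training, which is expected with this approach.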

## 5. Model Fitting and Hyperparameter Tuning
KNN, Logistic Regression, Decision Tree

In [None]:
X = df.drop('DEFAULT', axis = 1)
y = df['DEFAULT']
feature_cols = X.columns

In [None]:
# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

### Logistic Regression

In [None]:
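# Add an intercept column; statsmodels does not add one automatically
X_train_const = sm.add_constant(X_train)

# Fit the model on the training data only, so the test set stays unseen
logit_model = sm.Logit(y_train, X_train_const)

# Get results of the fit
result = logit_model.fit()

In [None]:
# A very large C effectively turns off regularization;
# the intercept comes from the constant column added above
logreg = LogisticRegression(fit_intercept = False, C = 1e15, solver='liblinear')
model_log = logreg.fit(X_train_const, y_train)
model_log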

### KNN

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
scaler = StandardScaler()  
scaler.fit(X_train)

X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test) 

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
print(knn)

In [None]:
knn.fit(X_train, y_train)

In [None]:
# make class predictions for the testing set
y_pred_class = knn.predict(X_test)

In [None]:
# calculate accuracy and F1 score
print('Accuracy:' + str(metrics.accuracy_score(y_test, y_pred_class)))
print('F1:' + str(metrics.f1_score(y_test, y_pred_class)))
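
Accuracy alone can be misleading when one class dominates. One way to put the scores above in context is a majority-class baseline built with `DummyClassifier`, which the notebook imports. The sketch below uses made-up labels with an 80/20 split, which is an assumption and not the real class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 80 non-defaults and 20 defaults
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))  # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, baseline_pred)
# zero_division=0 avoids a warning when no positives are predicted
baseline_f1 = f1_score(y_demo, baseline_pred, zero_division=0)

print('Baseline accuracy:', baseline_acc)
print('Baseline F1:', baseline_f1)
```

A model is only useful if it beats this baseline on the chosen metric. Here the baseline reaches high accuracy while never identifying a single default, which is why F1 is reported alongside accuracy.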

## Final Model: XGBoost

In [None]:
df['DEFAULT'].value_counts()

In [None]:
X = df.drop('DEFAULT', axis = 1)
y = df['DEFAULT']
feature_cols = X.columns

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=23)

In [None]:
training = pd.concat([X_train,y_train], axis =1)

In [None]:
ndefault = training[training['DEFAULT']==0]
default = training[training['DEFAULT']==1]

In [None]:
upsampled = resample(default,
                          replace=True, # sample with replacement
                          n_samples=len(ndefault), # match number in majority class
                          random_state=23) # reproducible results

upsampled = pd.concat([ndefault, upsampled])
upsampled['DEFAULT'].value_counts()

In [None]:
upsampled

In [None]:
# ndefault = df[df['DEFAULT']==0]
# default = df[df['DEFAULT']==1]

# upsampled = resample(default,
#                           replace=True, # sample with replacement
#                           n_samples=len(ndefault), # match number in majority class
#                           random_state=23) # reproducible results

# upsampled = pd.concat([ndefault, upsampled])
# upsampled['DEFAULT'].value_counts()

In [None]:
X_train = upsampled.drop('DEFAULT', axis = 1)
y_train = upsampled['DEFAULT']
feature_cols = X_train.columns

In [None]:
X_train.columns

In [None]:
# Gradient-boosted trees trained on the upsampled training set
xg_clf = xgb.XGBClassifier(objective ='binary:logistic', 
                           colsample_bytree = 0.3, 
                           subsample = 0.5,
                           learning_rate = 0.1,
                           max_depth = 4, 
                           alpha = 1, 
                           n_estimators = 10000)
xg_clf.fit(X_train,y_train)
preds = xg_clf.predict(X_test)

test_f1 = f1_score(y_test, preds)
test_acc = accuracy_score(y_test, preds)

print("Accuracy: %f" % (test_acc))
print("F1: %f" % (test_f1))

## TEST

In [None]:
test = pd.read_csv('hold_out_features.csv')

In [None]:
test.head()

In [None]:
test.drop(columns = ['Unnamed: 0'], inplace = True)

In [None]:
education_dummies = pd.get_dummies(test['EDUCATION'], prefix='education')
marriage_dummies = pd.get_dummies(test['MARRIAGE'], prefix='marriage')
test = pd.concat([test, education_dummies, marriage_dummies], axis=1)
test.drop(columns = ['EDUCATION','MARRIAGE'], inplace = True)
test.columns
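
One-hot encoding the hold-out data separately can yield different columns from the training data, for example when a category never appears in the hold-out set. A hedged sketch of aligning the two with `DataFrame.reindex` is shown below. The column list `train_cols` and the `holdout` frame are made up for illustration; in the notebook the training column list would come from the training features.

```python
import pandas as pd

# Hypothetical training columns after one-hot encoding
train_cols = ['LIMIT_BAL', 'education_1', 'education_2', 'education_3',
              'marriage_1', 'marriage_2']

# Made-up hold-out frame in which education level 3 never appears
holdout = pd.DataFrame({'LIMIT_BAL': [20000, 90000],
                        'EDUCATION': [1, 2],
                        'MARRIAGE': [2, 1]})
holdout = pd.get_dummies(holdout, columns=['EDUCATION', 'MARRIAGE'],
                         prefix=['education', 'marriage'])

# Add missing dummy columns as 0, drop unseen ones, and match the training order
aligned = holdout.reindex(columns=train_cols, fill_value=0)
print(aligned)
```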

In [None]:
final_results=xg_clf.predict(test)

In [None]:
final_results=pd.Series(final_results)
final_results.to_csv('MR.csv', index=False)

In [None]:
final_results.value_counts()

In [None]:
# import pickle

# with open('MandR_Pickle.pickle','wb') as f:
#     pickle.dump(xg_clf, f)

### Notes: clean-up code for the project

In [None]:
# def fit_predict(model, x_train, y_train, x_test, y_test):
#     model.fit(x_train, y_train)
#     predictions = model.predict(x_test)
#     print('Test Accuracy Score', accuracy_score(y_test, predictions))
#     print('F1 Score', f1_score(y_test, predictions))
#     return predictions