## Startup task: Train and test a classification model #2

This is Mozilla's task for Outreachy applicants in the spring 2020 round, described in the issue: https://github.com/mozilla/PRESC/issues/2.

Things to do:
* Choose a dataset
* Load a dataset from the repo
* Train a classification model from scikit-learn
* Compute an evaluation metric on a held-out test set

Additional:
* Basic exploratory analysis of the dataset
* Data preprocessing
* Hyperparameter tuning

## Choose a dataset

Information about the dataset is available in the [UCI repository](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#).

The dataset consists of the clients' credit card information and the history of their past payments.
This information may help to predict which clients are creditworthy and which are likely to default.

Attribute information (as given in the dataset description):

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:
* X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
* X2: Gender (1 = male; 2 = female).
* X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
* X4: Marital status (1 = married; 2 = single; 3 = others).
* X5: Age (year).
* X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
* X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
* X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.

## Importing libraries

In [2]:
import pandas as pd
import numpy as np
# import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## Load a dataset from the repo

In [3]:
dataset = pd.read_csv('datasets/defaults.csv')
# preview first 5 lines
dataset.head()

,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,...,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,defaulted
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


## Exploratory data analysis

In [4]:
dataset.shape

(30000, 25)

In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   id         30000 non-null  int64
 1   limit_bal  30000 non-null  int64
 2   sex        30000 non-null  int64
 3   education  30000 non-null  int64
 4   marriage   30000 non-null  int64
 5   age        30000 non-null  int64
 6   pay_0      30000 non-null  int64
 7   pay_2      30000 non-null  int64
 8   pay_3      30000 non-null  int64
 9   pay_4      30000 non-null  int64
 10  pay_5      30000 non-null  int64
 11  pay_6      30000 non-null  int64
 12  bill_amt1  30000 non-null  int64
 13  bill_amt2  30000 non-null  int64
 14  bill_amt3  30000 non-null  int64
 15  bill_amt4  30000 non-null  int64
 16  bill_amt5  30000 non-null  int64
 17  bill_amt6  30000 non-null  int64
 18  pay_amt1   30000 non-null  int64
 19  pay_amt2   30000 non-null  int64
 20  pay_amt3   30000 non-null  int64
 21  pay_amt4   3

In [6]:
dataset.describe()

,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,...,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,defaulted
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,...,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,15000.5,167484.322667,1.603733,1.853133,1.551867,35.4855,-0.0167,-0.133767,-0.1662,-0.220667,...,43262.948967,40311.400967,38871.7604,5663.5805,5921.163,5225.6815,4826.076867,4799.387633,5215.502567,0.2212
std,8660.398374,129747.661567,0.489129,0.790349,0.52197,9.217904,1.123802,1.197186,1.196868,1.169139,...,64332.856134,60797.15577,59554.107537,16563.280354,23040.87,17606.96147,15666.159744,15278.305679,17777.465775,0.415062
min,1.0,10000.0,1.0,0.0,0.0,21.0,-2.0,-2.0,-2.0,-2.0,...,-170000.0,-81334.0,-339603.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7500.75,50000.0,1.0,1.0,1.0,28.0,-1.0,-1.0,-1.0,-1.0,...,2326.75,1763.0,1256.0,1000.0,833.0,390.0,296.0,252.5,117.75,0.0
50%,15000.5,140000.0,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,...,19052.0,18104.5,17071.0,2100.0,2009.0,1800.0,1500.0,1500.0,1500.0,0.0
75%,22500.25,240000.0,2.0,2.0,2.0,41.0,0.0,0.0,0.0,0.0,...,54506.0,50190.5,49198.25,5006.0,5000.0,4505.0,4013.25,4031.5,4000.0,0.0
max,30000.0,1000000.0,2.0,6.0,3.0,79.0,8.0,8.0,8.0,8.0,...,891586.0,927171.0,961664.0,873552.0,1684259.0,896040.0,621000.0,426529.0,528666.0,1.0


In [6]:
# transpose the table for a more convenient view
dataset.describe().T

,count,mean,std,min,25%,50%,75%,max
id,30000.0,15000.5,8660.398374,1.0,7500.75,15000.5,22500.25,30000.0
limit_bal,30000.0,167484.322667,129747.661567,10000.0,50000.0,140000.0,240000.0,1000000.0
sex,30000.0,1.603733,0.489129,1.0,1.0,2.0,2.0,2.0
education,30000.0,1.853133,0.790349,0.0,1.0,2.0,2.0,6.0
marriage,30000.0,1.551867,0.52197,0.0,1.0,2.0,2.0,3.0
age,30000.0,35.4855,9.217904,21.0,28.0,34.0,41.0,79.0
pay_0,30000.0,-0.0167,1.123802,-2.0,-1.0,0.0,0.0,8.0
pay_2,30000.0,-0.133767,1.197186,-2.0,-1.0,0.0,0.0,8.0
pay_3,30000.0,-0.1662,1.196868,-2.0,-1.0,0.0,0.0,8.0
pay_4,30000.0,-0.220667,1.169139,-2.0,-1.0,0.0,0.0,8.0
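
The summary above shows education values from 0 to 6 and marriage values starting at 0, while the attribute description only documents 1 to 4 and 1 to 3. A minimal sketch of how such undocumented codes could be counted is given here; the `documented_codes` mapping, the `count_undocumented` helper and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Documented category codes, taken from the attribute description.
documented_codes = {
    "education": {1, 2, 3, 4},
    "marriage": {1, 2, 3},
}


def count_undocumented(df, codes):
    """Return, per column, how many rows hold a value outside the documented set."""
    return {col: int((~df[col].isin(allowed)).sum()) for col, allowed in codes.items()}


# Tiny made-up frame standing in for the real dataset (assumption).
toy = pd.DataFrame({
    "education": [1, 2, 0, 6, 3],
    "marriage": [1, 2, 3, 0, 1],
})

undocumented = count_undocumented(toy, documented_codes)
print(undocumented)
```

On the real data, the same helper could be called as `count_undocumented(dataset, documented_codes)` to decide whether the extra codes should be merged into "others" or kept as they are.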


### Checking for the duplicates 

In [7]:
# drop_duplicates() returns a new DataFrame without exact duplicate rows;
# the result is only displayed here and is not assigned back to dataset
dataset.drop_duplicates()

,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,...,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,defaulted
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29996,220000,1,3,1,39,0,0,0,0,...,88004,31237,15980,8500,20000,5003,3047,5000,1000,0
29996,29997,150000,1,3,2,43,-1,-1,-1,-1,...,8979,5190,0,1837,3526,8998,129,0,0,0
29997,29998,30000,1,2,2,37,4,3,2,-1,...,20878,20582,19357,0,0,22000,4200,2000,3100,1
29998,29999,80000,1,3,1,41,1,-1,0,0,...,52774,11855,48944,85900,3409,1178,1926,52964,1804,1


### Separate Data Types

In [8]:
dataset.dtypes

id           int64
limit_bal    int64
sex          int64
education    int64
marriage     int64
age          int64
pay_0        int64
pay_2        int64
pay_3        int64
pay_4        int64
pay_5        int64
pay_6        int64
bill_amt1    int64
bill_amt2    int64
bill_amt3    int64
bill_amt4    int64
bill_amt5    int64
bill_amt6    int64
pay_amt1     int64
pay_amt2     int64
pay_amt3     int64
pay_amt4     int64
pay_amt5     int64
pay_amt6     int64
defaulted    int64
dtype: object

Some columns are categorical even though they are stored as integers:
* sex
* education
* marriage

We will change their datatype from numerical to categorical (pandas object dtype).

We also set id as the index.

In [9]:
dataset['sex']=dataset['sex'].astype(object)
dataset['education']=dataset['education'].astype(object)
dataset['marriage']=dataset['marriage'].astype(object)
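# NOTE: this line copies the values of sex into age (see the age column in the dataset.head() output below).
# Age is numeric, so the line was probably not intended; the results below were produced with it in place.
dataset['age']=dataset['sex'].astype(object)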
dataset=dataset.set_index('id')

In [10]:
dataset.dtypes

limit_bal     int64
sex          object
education    object
marriage     object
age          object
pay_0         int64
pay_2         int64
pay_3         int64
pay_4         int64
pay_5         int64
pay_6         int64
bill_amt1     int64
bill_amt2     int64
bill_amt3     int64
bill_amt4     int64
bill_amt5     int64
bill_amt6     int64
pay_amt1      int64
pay_amt2      int64
pay_amt3      int64
pay_amt4      int64
pay_amt5      int64
pay_amt6      int64
defaulted     int64
dtype: object

In [11]:
dataset.head()

id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,...,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,defaulted
1,20000,2,2,1,2,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
2,120000,2,2,2,2,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,90000,2,2,2,2,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,50000,2,2,1,2,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
5,50000,1,2,1,1,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0
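
As an alternative to repeating one `astype` call per column, the conversion could be done in a single step. The sketch below is only an illustration on a made-up frame: it uses the pandas `category` dtype instead of `object` (a swapped-in choice, not what the notebook does) and leaves age numeric.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset (assumption).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "sex": [2, 2, 1],
    "education": [2, 2, 1],
    "marriage": [1, 2, 1],
    "age": [24, 26, 57],
})

categorical_columns = ["sex", "education", "marriage"]

# Convert only the listed columns; every other column keeps its dtype.
toy = toy.astype({col: "category" for col in categorical_columns})
toy = toy.set_index("id")

print(toy.dtypes)
```

Keeping the column names in one list makes it harder to convert the wrong column by accident.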


### Checking nulls 

In [12]:
dataset.isnull().sum()

limit_bal    0
sex          0
education    0
marriage     0
age          0
pay_0        0
pay_2        0
pay_3        0
pay_4        0
pay_5        0
pay_6        0
bill_amt1    0
bill_amt2    0
bill_amt3    0
bill_amt4    0
bill_amt5    0
bill_amt6    0
pay_amt1     0
pay_amt2     0
pay_amt3     0
pay_amt4     0
pay_amt5     0
pay_amt6     0
defaulted    0
dtype: int64

### Visualization

Types of visualization (U = univariate, B = bivariate, M = multivariate):
- Scatter plot (B)
- Pair plot (M)
- Box plot (U)
- Violin plot (U)
- Distribution plot (U)
- Joint plot (U) & (B)
- Bar chart (B)
- Line plot (B)

In [None]:
# to be added later, when it becomes necessary
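
As a possible starting point, two of the plot types listed above could be drawn with matplotlib alone. The sketch below uses a small made-up frame (an assumption) and draws a bar chart of the class counts next to a box plot of limit_bal per class.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows standing in for the real dataset (assumption).
toy = pd.DataFrame({
    "limit_bal": [20000, 120000, 90000, 50000, 50000, 220000],
    "defaulted": [1, 1, 0, 0, 0, 0],
})

fig, (ax_bar, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: number of clients in each class.
class_counts = toy["defaulted"].value_counts().sort_index()
ax_bar.bar(class_counts.index.astype(str), class_counts.values)
ax_bar.set_xlabel("defaulted")
ax_bar.set_ylabel("number of clients")

# Box plot: distribution of the credit limit within each class.
groups = [toy.loc[toy["defaulted"] == label, "limit_bal"].to_numpy()
          for label in class_counts.index]
ax_box.boxplot(groups)
ax_box.set_xticklabels(class_counts.index.astype(str))
ax_box.set_xlabel("defaulted")
ax_box.set_ylabel("limit_bal")

fig.tight_layout()
```

On the real data, `toy` would be replaced by `dataset`; the class-count chart in particular shows how imbalanced the target is.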

## Train/Test Split

The train_test_split function splits a dataset into a training part and a test part in a single call. The split ratio should be adjusted to the size of the dataset and the complexity of the model.

In [13]:
#from sklearn import datasets, linear model
from sklearn.model_selection import train_test_split

The test_size=0.3 argument sets the fraction of the data that is held out for testing (typical splits are 80/20 or 70/30).

In [14]:
# split into X and Y
dataset_X = dataset.iloc[:,:-1]
dataset_Y = dataset.iloc[:,-1]

# create training and testing vars
X_train, X_test, y_train, y_test = train_test_split(dataset_X, dataset_Y, test_size=0.3, random_state = 0)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

(21000, 23) (21000,)
(9000, 23) (9000,)
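
Because only about 22% of the clients defaulted, a stratified split may be worth considering so that both parts keep the same class ratio. The sketch below is an illustration on synthetic data (an assumption); it is not the split used for the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 clients, 22% of them labelled as defaulted (assumption).
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"limit_bal": rng.integers(10000, 500000, size=1000)})
toy_y = pd.Series([1] * 220 + [0] * 780, name="defaulted")

# stratify=toy_y keeps the share of defaults (almost) equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the class ratio in the test set depends on the random seed, which adds noise to any metric computed on it.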


## Train a classification model from scikit-learn  

### Choosing a classification model
From the available classification models I chose the following algorithms:
* K-nearest neighbors (KNN)
* Support vector machine (SVM)
* Gaussian naive Bayes (NB)
* Decision tree (DT)

In [15]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

models = [
    ('KNN', KNeighborsClassifier()),
    ('SVM', SVC()),
    ('NB', GaussianNB()),
    ('DT', DecisionTreeClassifier()),
]

### Comparing Models 

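The metric for this comparison is the accuracy score, which is a rather naive metric here: the mean of defaulted is 0.2212, so a model that always predicts "no default" would already reach roughly 78% accuracy on the full dataset.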
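
To put accuracy numbers in context, a majority-class baseline can be scored with the same metric. The sketch below uses synthetic labels with 22% positives (an assumption mirroring the class ratio of the dataset) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Synthetic stand-in: the features are irrelevant for the baseline (assumption).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 220 + [0] * 780)

# Always predicts the most frequent class, here 0 ("no default").
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

baseline_accuracy = accuracy_score(y_toy, y_pred)
baseline_balanced = balanced_accuracy_score(y_toy, y_pred)
baseline_recall = recall_score(y_toy, y_pred)  # recall for the "defaulted" class

print(baseline_accuracy, baseline_balanced, baseline_recall)
```

The baseline reaches 0.78 accuracy while finding no defaulting client at all (recall 0.0, balanced accuracy 0.5), so a model is only useful if it clearly beats these numbers.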

In [16]:
for name, model in models:
    clf = model
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    print(name, accuracy)

KNN 0.7603333333333333
SVM 0.7857777777777778
NB 0.35288888888888886
DT 0.7393333333333333



The results show that SVM has the best accuracy, but it was very slow to train, so KNN is the more practical choice here.

Let's improve the results:
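
The speed difference mentioned above could be measured rather than estimated by eye. The following sketch times fitting and scoring separately for two of the models; the synthetic data from `make_classification` and the `timings` dictionary are assumptions made for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Small synthetic problem so that the example runs quickly (assumption).
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

timings = {}
for name, model in [("KNN", KNeighborsClassifier()), ("SVM", SVC())]:
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    accuracy = model.score(X_te, y_te)
    score_seconds = time.perf_counter() - start

    timings[name] = {"accuracy": accuracy, "fit_s": fit_seconds, "score_s": score_seconds}
    print(f"{name}: accuracy={accuracy:.3f} fit={fit_seconds:.4f}s score={score_seconds:.4f}s")
```

Timing both phases matters because KNN is cheap to fit but does most of its work at prediction time, whereas SVM spends most of its time in fitting.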

## Cross Validation
1. The dataset is split into K smaller sets
2. A model is trained using K-1 of the folds(smaller sets) as training data and the remaining fold is used for validation
3. Step 2 is repeated until every fold has been used for validation once
4. Use the average testing accuracy as the estimate of out-of-sample accuracy.
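
The four steps above can be written out as an explicit loop. This is a minimal sketch on synthetic data (an assumption); `cross_val_score` performs the same procedure in one call and, for classifiers with an integer `cv`, uses stratified folds by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the features and labels (assumption).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Step 1: split the data into K folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X_toy):
    # Step 2: train on K-1 folds, validate on the remaining fold.
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    fold_scores.append(model.score(X_toy[val_idx], y_toy[val_idx]))
    # Step 3: the loop repeats this until every fold has been the validation fold once.

# Step 4: average the per-fold accuracies.
mean_score = float(np.mean(fold_scores))
print(fold_scores, mean_score)
```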


In [17]:
from sklearn.model_selection import cross_val_score
from sklearn import metrics

# single train/test evaluation of KNN with n_neighbors=5 (no cross-validation yet)
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)
dataset_Y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, dataset_Y_pred))
#scores = cross_val_score(knn, X_)

0.7603333333333333


In [18]:
# 10-fold cross-validation of KNN with n_neighbors=20
knn = KNeighborsClassifier(n_neighbors=20)
print(cross_val_score(knn, dataset_X, dataset_Y, cv=10, scoring='accuracy').mean())

0.7786003136074423


We can also use cross_validate, which reports fit and score times in addition to the test scores:

In [19]:
from sklearn.model_selection import cross_validate
from sklearn import metrics

# NOTE: this list is defined but not passed to cross_validate; the call below scores accuracy only
scoring = ['precision_macro', 'recall_macro']
scores = cross_validate(knn, dataset_X, dataset_Y, cv=10, scoring='accuracy', return_train_score=False)

In [20]:
print(scores)

{'fit_time': array([0.17434502, 0.19045591, 0.20136547, 0.19538784, 0.19370341,
       0.21280622, 0.1946981 , 0.19268274, 0.20665193, 0.1927855 ]), 'score_time': array([1.19669247, 1.23386455, 1.17443275, 1.21468759, 1.20644259,
       1.16122532, 1.2161572 , 1.21706462, 1.17596412, 1.17013335]), 'test_score': array([0.77440853, 0.77740753, 0.78207264, 0.77374209, 0.78066667,
       0.78066667, 0.78359453, 0.77859286, 0.78126042, 0.7735912 ])}
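
The output above contains a single `test_score` entry because only accuracy was requested. A sketch of how several metrics could be collected in one run follows; the synthetic, imbalanced data is an assumption, and passing a list to `scoring` makes `cross_validate` return one `test_<metric>` array per metric.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the dataset (assumption).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.78, 0.22], random_state=0
)

scoring = ["accuracy", "precision_macro", "recall_macro"]
scores = cross_validate(
    KNeighborsClassifier(n_neighbors=5), X_toy, y_toy, cv=5, scoring=scoring
)

# One array of per-fold values for each requested metric.
for metric in scoring:
    print(metric, scores[f"test_{metric}"].mean())
```

Macro-averaged precision and recall weight both classes equally, which makes them more informative than accuracy when one class dominates.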


## Improve result
### Hyperparameter Tuning

In [24]:
# Randomized search on hyper parameters
# RandomizedSearchCV implements a “fit” and a “score” method. 
# It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.
#%%time
from sklearn.model_selection import RandomizedSearchCV

parameters = {'leaf_size': list(range(1,50)), 'n_neighbors': list(range(1,50)), 'p': [1,2]}

clf = RandomizedSearchCV(KNeighborsClassifier(), parameters)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.best_params_)

0.7833333333333333
{'p': 2, 'n_neighbors': 41, 'leaf_size': 28}
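
The search above samples candidates at random, so repeated runs may return different parameters. The sketch below shows one way to make the search reproducible and to add feature scaling, which distance-based models such as KNN are sensitive to. The pipeline, the step names `scale` and `knn`, and the synthetic data are assumptions; this is not the configuration that produced the result above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (assumption).
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# Scaling is fitted inside each CV fold, so no information leaks from the validation data.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_distributions = {
    "knn__n_neighbors": list(range(1, 50)),
    "knn__p": [1, 2],
}

search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=10, cv=5, random_state=0
)
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print(test_accuracy)
print(search.best_params_)
```

`leaf_size` is left out of the search space here because it mainly affects the speed and memory use of the tree-based neighbour search.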


## Conclusion

Of the chosen models, the tuned KNN model gave the best practical result, with a test accuracy of:
* 0.7833333333333333

and hyperparameters:
* {'p': 2, 'n_neighbors': 41, 'leaf_size': 28}

SVM had slightly better accuracy (0.7858 with default settings), but its training was extremely slow.
I need to work more on splitting the data into training and test sets and on trying other models.

Thank you for your time.