# Supervised Learning

Author: Nirta Ika Yunita
<br> Date: November 11, 2019

Given the train.csv and test.csv, perform the following actions.

<br>1\. Create classifiers from the train.csv data to predict att10 (the label) from the remaining attributes.
<br>Use at least 5 algorithms (k-NN, Decision Tree, Logistic Regression, Voting, Averaging, Bagging, Random Forest, AdaBoost, XGBoost, LightGBM, CatBoost, or Stacking).
<br>Use AUC as the model evaluation metric.

<br>2\. Choose the best classifier based on the highest AUC and use it to predict the test.csv data.

## Load Library

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

## Import Data

In [2]:
d_train = pd.read_csv('train.csv')
d_test = pd.read_csv('test.csv')

In [3]:
d_train.head()

Unnamed: 0,att1,att2,att3,att4,att5,att6,att7,att8a,att8b,att8c,...,att8e,att8f,att8g,att8h,att8i,att8j,att9a,att9b,att9c,att10
0,0.16,0.82,6,202,4,1,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0
1,0.43,0.48,2,153,3,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,1
2,0.39,0.54,2,127,3,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,1
3,0.73,1.0,5,253,6,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,1
4,0.44,0.46,2,149,3,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1


In [4]:
d_test.head()

Unnamed: 0,att1,att2,att3,att4,att5,att6,att7,att8a,att8b,att8c,att8d,att8e,att8f,att8g,att8h,att8i,att8j,att9a,att9b,att9c
0,0.66,0.62,4,250,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
1,0.66,0.5,4,263,3,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
2,0.32,0.74,3,211,3,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0
3,0.37,0.57,2,155,3,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
4,0.41,0.49,2,130,3,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0


In [5]:
print('Train data shape {}' .format(d_train.shape))
print('Test data shape {}' .format(d_test.shape))

Train data shape (8000, 21)
Test data shape (2000, 20)


There is no **att10** column in **d_test**. All other columns are the same in **d_train** and **d_test**.
<br>We will split the **d_train** dataset into train and test subsets for model evaluation, then use the selected model to predict on the **d_test** dataset.

In [6]:
d_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 21 columns):
att1     8000 non-null float64
att2     8000 non-null float64
att3     8000 non-null int64
att4     8000 non-null int64
att5     8000 non-null int64
att6     8000 non-null int64
att7     8000 non-null int64
att8a    8000 non-null int64
att8b    8000 non-null int64
att8c    8000 non-null int64
att8d    8000 non-null int64
att8e    8000 non-null int64
att8f    8000 non-null int64
att8g    8000 non-null int64
att8h    8000 non-null int64
att8i    8000 non-null int64
att8j    8000 non-null int64
att9a    8000 non-null int64
att9b    8000 non-null int64
att9c    8000 non-null int64
att10    8000 non-null int64
dtypes: float64(2), int64(19)
memory usage: 1.3 MB


**att10** is the target. **att8a** through **att9c** are the result of one-hot encoding.
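A quick way to sanity-check a one-hot group is to confirm that its columns sum to 1 in every row. The sketch below does this on a small made-up frame; the helper name `is_one_hot` and the toy values are assumptions for illustration, not part of the original notebook, and the real data may follow a different encoding convention (for example, a dropped reference column).

```python
import pandas as pd

# Toy frame reusing the att9* column names; the values are made up.
toy = pd.DataFrame({
    "att9a": [0, 0, 1],
    "att9b": [0, 1, 0],
    "att9c": [1, 0, 0],
})

def is_one_hot(frame, prefix):
    """Return True if the columns starting with `prefix` sum to 1 in every row."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    return bool((frame[cols].sum(axis=1) == 1).all())

one_hot_ok = is_one_hot(toy, "att9")
print(one_hot_ok)
```

The same helper could be called as `is_one_hot(d_train, "att8")` to check the larger group.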

In [7]:
d_train[d_train.columns[:7]].describe()

Unnamed: 0,att1,att2,att3,att4,att5,att6,att7
count,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0
mean,0.61226,0.717594,3.791375,201.245,3.486375,0.145375,0.020625
std,0.248338,0.170957,1.230463,49.837947,1.446055,0.352501,0.142134
min,0.09,0.36,2.0,96.0,2.0,0.0,0.0
25%,0.44,0.56,3.0,156.0,3.0,0.0,0.0
50%,0.64,0.72,4.0,201.0,3.0,0.0,0.0
75%,0.82,0.87,5.0,245.0,4.0,0.0,0.0
max,1.0,1.0,7.0,310.0,10.0,1.0,1.0


In [8]:
# split d_train dataset
from sklearn.model_selection import train_test_split
X = d_train.iloc[:, 0:-1]
y = d_train.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
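The split above is a plain random hold-out. With an imbalanced target, passing `stratify` keeps the class ratio the same in both subsets. The following is a minimal sketch on synthetic data (the `X_demo` and `y_demo` arrays are assumptions, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 25% positives.
rng = np.random.RandomState(0)
X_demo = rng.rand(1000, 3)
y_demo = np.array([1] * 250 + [0] * 750)

# stratify=y_demo preserves the positive rate in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```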

## k-NN

In [9]:
from sklearn.neighbors import KNeighborsClassifier

model_knn = KNeighborsClassifier(n_neighbors = 5)
model_knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [10]:
y_pred_train = model_knn.predict(X_train)
y_pred_test = model_knn.predict(X_test)

In [11]:
from sklearn.metrics import classification_report

print('Classification report for train')
print(classification_report(y_train, y_pred_train))

print('\nClassification report for test')
print(classification_report(y_test, y_pred_test))

Classification report for train
              precision    recall  f1-score   support

           0       0.98      0.95      0.97      4848
           1       0.87      0.94      0.90      1552

    accuracy                           0.95      6400
   macro avg       0.92      0.95      0.93      6400
weighted avg       0.95      0.95      0.95      6400


Classification report for test
              precision    recall  f1-score   support

           0       0.97      0.94      0.95      1233
           1       0.81      0.92      0.86       367

    accuracy                           0.93      1600
   macro avg       0.89      0.93      0.91      1600
weighted avg       0.94      0.93      0.93      1600



In [12]:
from sklearn.metrics import accuracy_score

acc_train_knn = round(accuracy_score(y_train, y_pred_train), 4)
acc_test_knn = round(accuracy_score(y_test, y_pred_test), 4)

print('Accuracy score train = {}' .format(acc_train_knn))
print('Accuracy score test = {}' .format(acc_test_knn))

Accuracy score train = 0.9498
Accuracy score test = 0.9319


In [13]:
from sklearn.metrics import roc_auc_score

auc_train_knn = round(roc_auc_score(y_train, y_pred_train), 4)
auc_test_knn = round(roc_auc_score(y_test, y_pred_test), 4)

print('AUC train = {}' .format(auc_train_knn))
print('AUC test = {}' .format(auc_test_knn))

AUC train = 0.9461
AUC test = 0.9271
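Note that the AUC values in this notebook are computed from hard class predictions (`predict`), which reduces the ROC curve to a single operating point. AUC is more commonly computed from scores or probabilities. A minimal sketch of both variants on synthetic data follows; the dataset from `make_classification` and the variable names are assumptions for illustration, so the numbers it prints are not comparable to the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced binary problem (stand-in for the real data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: a single threshold only.
auc_from_labels = roc_auc_score(y_te, knn.predict(X_te))

# AUC from the predicted probability of class 1: uses every threshold.
auc_from_scores = roc_auc_score(y_te, knn.predict_proba(X_te)[:, 1])

print(auc_from_labels, auc_from_scores)
```

The two values generally differ, so model comparisons should use the same variant throughout.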


## Decision Tree

In [14]:
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier(max_depth = 5)
model_dt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [15]:
y_pred_train = model_dt.predict(X_train)
y_pred_test = model_dt.predict(X_test)

In [16]:
print('Classification report for train')
print(classification_report(y_train, y_pred_train))

print('\nClassification report for test')
print(classification_report(y_test, y_pred_test))

Classification report for train
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      4848
           1       0.97      0.92      0.95      1552

    accuracy                           0.97      6400
   macro avg       0.97      0.96      0.97      6400
weighted avg       0.97      0.97      0.97      6400


Classification report for test
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1233
           1       0.97      0.93      0.95       367

    accuracy                           0.98      1600
   macro avg       0.97      0.96      0.97      1600
weighted avg       0.98      0.98      0.98      1600



In [17]:
acc_train_dt = round(accuracy_score(y_train, y_pred_train), 4)
acc_test_dt = round(accuracy_score(y_test, y_pred_test), 4)

print('Accuracy score train = {}' .format(acc_train_dt))
print('Accuracy score test = {}' .format(acc_test_dt))

Accuracy score train = 0.975
Accuracy score test = 0.9781


In [18]:
auc_train_dt = round(roc_auc_score(y_train, y_pred_train), 4)
auc_test_dt = round(roc_auc_score(y_test, y_pred_test), 4)

print('AUC train = {}' .format(auc_train_dt))
print('AUC test = {}' .format(auc_test_dt))

AUC train = 0.9577
AUC test = 0.9628


## Logistic Regression

In [19]:
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [20]:
y_pred_train = model_lr.predict(X_train)
y_pred_test = model_lr.predict(X_test)

In [21]:
print('Classification report for train')
print(classification_report(y_train, y_pred_train))

print('\nClassification report for test')
print(classification_report(y_test, y_pred_test))

Classification report for train
              precision    recall  f1-score   support

           0       0.83      0.93      0.87      4848
           1       0.64      0.40      0.49      1552

    accuracy                           0.80      6400
   macro avg       0.73      0.66      0.68      6400
weighted avg       0.78      0.80      0.78      6400


Classification report for test
              precision    recall  f1-score   support

           0       0.84      0.92      0.88      1233
           1       0.61      0.40      0.48       367

    accuracy                           0.80      1600
   macro avg       0.72      0.66      0.68      1600
weighted avg       0.79      0.80      0.79      1600



The recall for class 1 is low (0.40), well below 0.5. The classes are imbalanced (1552 positives versus 4848 negatives in the training split), so we over-sample the minority class, here using ADASYN.
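The degree of imbalance can be read off the class counts. The sketch below rebuilds a label series from the training supports shown in the classification reports (4848 for class 0 and 1552 for class 1); the series `y_demo` is a stand-in, and in the notebook the same call would presumably be applied to `y_train` before resampling.

```python
import pandas as pd

# Stand-in labels rebuilt from the supports in the training report.
y_demo = pd.Series([0] * 4848 + [1] * 1552, name="att10")

# Share of each class; the minority share shows how imbalanced the target is.
class_share = y_demo.value_counts(normalize=True)
minority_share = class_share.loc[1]

print(class_share)
```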

In [22]:
acc_train_lr = round(accuracy_score(y_train, y_pred_train), 4)
acc_test_lr = round(accuracy_score(y_test, y_pred_test), 4)

print('Accuracy score train = {}' .format(acc_train_lr))
print('Accuracy score test = {}' .format(acc_test_lr))

Accuracy score train = 0.7992
Accuracy score test = 0.8031


In [23]:
auc_train_lr = round(roc_auc_score(y_train, y_pred_train), 4)
auc_test_lr = round(roc_auc_score(y_test, y_pred_test), 4)

print('AUC train = {}' .format(auc_train_lr))
print('AUC test = {}' .format(auc_test_lr))

AUC train = 0.6649
AUC test = 0.6617


### Over-sampling with ADASYN

In [24]:
from imblearn.over_sampling import ADASYN

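# fit_resample replaces the older fit_sample alias, which newer imbalanced-learn releases no longer provide
# Note: X_train and y_train are overwritten, so all later models are trained on the resampled data
X_train, y_train = ADASYN().fit_resample(X_train, y_train)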

In [25]:
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [26]:
y_pred_train = model_lr.predict(X_train)
y_pred_test = model_lr.predict(X_test)

In [27]:
print('Classification report for train')
print(classification_report(y_train, y_pred_train))

print('\nClassification report for test')
print(classification_report(y_test, y_pred_test))

Classification report for train
              precision    recall  f1-score   support

           0       0.80      0.73      0.76      4848
           1       0.75      0.81      0.78      4742

    accuracy                           0.77      9590
   macro avg       0.77      0.77      0.77      9590
weighted avg       0.77      0.77      0.77      9590


Classification report for test
              precision    recall  f1-score   support

           0       0.94      0.72      0.82      1233
           1       0.47      0.85      0.61       367

    accuracy                           0.75      1600
   macro avg       0.71      0.79      0.71      1600
weighted avg       0.84      0.75      0.77      1600



Recall for class 1 on the test split improves from 0.40 to 0.85, although accuracy drops slightly (from about 0.80 to 0.75) and precision for class 1 falls from 0.61 to 0.47.

In [28]:
acc_train_lr = round(accuracy_score(y_train, y_pred_train), 4)
acc_test_lr = round(accuracy_score(y_test, y_pred_test), 4)

print('Accuracy score train = {}' .format(acc_train_lr))
print('Accuracy score test = {}' .format(acc_test_lr))

Accuracy score train = 0.7709
Accuracy score test = 0.7494


In [29]:
auc_train_lr = round(roc_auc_score(y_train, y_pred_train), 4)
auc_test_lr = round(roc_auc_score(y_test, y_pred_test), 4)

print('AUC train = {}' .format(auc_train_lr))
print('AUC test = {}' .format(auc_test_lr))

AUC train = 0.7714
AUC test = 0.7857


## Voting

In [30]:
from sklearn.ensemble import VotingClassifier

model_voting = VotingClassifier(estimators = [('knn', model_knn), ('dt', model_dt), ('lr', model_lr)])
model_voting.fit(X_train, y_train)

VotingClassifier(estimators=[('knn',
                              KNeighborsClassifier(algorithm='auto',
                                                   leaf_size=30,
                                                   metric='minkowski',
                                                   metric_params=None,
                                                   n_jobs=None, n_neighbors=5,
                                                   p=2, weights='uniform')),
                             ('dt',
                              DecisionTreeClassifier(class_weight=None,
                                                     criterion='gini',
                                                     max_depth=5,
                                                     max_features=None,
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_imp

In [31]:
y_pred_train = model_voting.predict(X_train)
y_pred_test = model_voting.predict(X_test)

In [32]:
print('Classification report for train')
print(classification_report(y_train, y_pred_train))

print('\nClassification report for test')
print(classification_report(y_test, y_pred_test))

Classification report for train
              precision    recall  f1-score   support

           0       0.97      0.89      0.93      4848
           1       0.90      0.97      0.93      4742

    accuracy                           0.93      9590
   macro avg       0.93      0.93      0.93      9590
weighted avg       0.93      0.93      0.93      9590


Classification report for test
              precision    recall  f1-score   support

           0       0.99      0.88      0.93      1233
           1       0.70      0.96      0.81       367

    accuracy                           0.90      1600
   macro avg       0.84      0.92      0.87      1600
weighted avg       0.92      0.90      0.90      1600



In [33]:
acc_train_voting = round(accuracy_score(y_train, y_pred_train), 4)
acc_test_voting = round(accuracy_score(y_test, y_pred_test), 4)

print('Accuracy score train = {}' .format(acc_train_voting))
print('Accuracy score test = {}' .format(acc_test_voting))

Accuracy score train = 0.9308
Accuracy score test = 0.8975


In [34]:
auc_train_voting = round(roc_auc_score(y_train, y_pred_train), 4)
auc_test_voting = round(roc_auc_score(y_test, y_pred_test), 4)

print('AUC train = {}' .format(auc_train_voting))
print('AUC test = {}' .format(auc_test_voting))

AUC train = 0.9312
AUC test = 0.9211


## Random Forest

In [35]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [36]:
y_pred_train = model_rf.predict(X_train)
y_pred_test = model_rf.predict(X_test)

In [37]:
print('Classification report for train')
print(classification_report(y_train, y_pred_train))

print('\nClassification report for test')
print(classification_report(y_test, y_pred_test))

Classification report for train
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4848
           1       1.00      1.00      1.00      4742

    accuracy                           1.00      9590
   macro avg       1.00      1.00      1.00      9590
weighted avg       1.00      1.00      1.00      9590


Classification report for test
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1233
           1       0.96      0.95      0.95       367

    accuracy                           0.98      1600
   macro avg       0.97      0.97      0.97      1600
weighted avg       0.98      0.98      0.98      1600



In [38]:
acc_train_rf = round(accuracy_score(y_train, y_pred_train), 4)
acc_test_rf = round(accuracy_score(y_test, y_pred_test), 4)

print('Accuracy score train = {}' .format(acc_train_rf))
print('Accuracy score test = {}' .format(acc_test_rf))

Accuracy score train = 0.9989
Accuracy score test = 0.9794


In [39]:
auc_train_rf = round(roc_auc_score(y_train, y_pred_train), 4)
auc_test_rf = round(roc_auc_score(y_test, y_pred_test), 4)

print('AUC train = {}' .format(auc_train_rf))
print('AUC test = {}' .format(auc_test_rf))

AUC train = 0.9988
AUC test = 0.9694


## Each Algorithm AUC Summary

In [40]:
print("Summary of each algorithm's AUC (train data).\n")
print("AUC train k-NN = {}" .format(auc_train_knn))
print("AUC train Decision Tree = {}" .format(auc_train_dt))
print("AUC train Logistic Regression = {}" .format(auc_train_lr))
print("AUC train Voting = {}" .format(auc_train_voting))
print("AUC train Random Forest = {}" .format(auc_train_rf))

Summary of each algorithm's AUC (train data).

AUC train k-NN = 0.9461
AUC train Decision Tree = 0.9577
AUC train Logistic Regression = 0.7714
AUC train Voting = 0.9312
AUC train Random Forest = 0.9988


In [41]:
print("Summary of each algorithm's AUC (test data).\n")
print("AUC test k-NN = {}" .format(auc_test_knn))
print("AUC test Decision Tree = {}" .format(auc_test_dt))
print("AUC test Logistic Regression = {}" .format(auc_test_lr))
print("AUC test Voting = {}" .format(auc_test_voting))
print("AUC test Random Forest = {}" .format(auc_test_rf))

Summary of each algorithm's AUC (test data).

AUC test k-NN = 0.9271
AUC test Decision Tree = 0.9628
AUC test Logistic Regression = 0.7857
AUC test Voting = 0.9211
AUC test Random Forest = 0.9694


Random Forest achieves the highest test AUC (0.9694), slightly ahead of the Decision Tree (0.9628), so we will use it to predict test.csv.
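The selection can also be done programmatically rather than by eye. A small sketch, using the test AUC values printed above (the dictionary name `auc_test` is an assumption for illustration):

```python
# Test AUC per model, copied from the summary output above.
auc_test = {
    "k-NN": 0.9271,
    "Decision Tree": 0.9628,
    "Logistic Regression": 0.7857,
    "Voting": 0.9211,
    "Random Forest": 0.9694,
}

# Pick the model name with the highest test AUC.
best_name = max(auc_test, key=auc_test.get)
print(best_name, auc_test[best_name])
```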

## Predicting Test Dataset

In [42]:
d_test.head()

Unnamed: 0,att1,att2,att3,att4,att5,att6,att7,att8a,att8b,att8c,att8d,att8e,att8f,att8g,att8h,att8i,att8j,att9a,att9b,att9c
0,0.66,0.62,4,250,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
1,0.66,0.5,4,263,3,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
2,0.32,0.74,3,211,3,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0
3,0.37,0.57,2,155,3,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
4,0.41,0.49,2,130,3,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0


In [43]:
predict = model_rf.predict_proba(d_test)
d_predict = pd.DataFrame(predict[:, 1], columns = ['att10'])
d_predict.head(10)

Unnamed: 0,att10
0,0.0
1,0.0
2,0.2
3,1.0
4,1.0
5,0.0
6,0.0
7,0.1
8,0.0
9,0.0
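The **att10** column above holds the predicted probability of class 1, not a class label. If 0/1 labels are needed, the probabilities can be thresholded. The sketch below uses the ten values shown above and a cutoff of 0.5, which is an assumption rather than a choice made in the notebook:

```python
import numpy as np

# Predicted probabilities of class 1 for the first ten test rows (from the output above).
probs = np.array([0.0, 0.0, 0.2, 1.0, 1.0, 0.0, 0.0, 0.1, 0.0, 0.0])

# Assumed cutoff: probabilities at or above 0.5 become class 1.
threshold = 0.5
labels = (probs >= threshold).astype(int)
print(labels)
```

A different cutoff trades precision against recall, so it is worth choosing deliberately when the classes are imbalanced.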


In [44]:
d_predict.to_csv('answer.csv')

In [45]:
d_result_test = pd.concat([d_test, d_predict], axis = 1)
d_result_test.head()

Unnamed: 0,att1,att2,att3,att4,att5,att6,att7,att8a,att8b,att8c,...,att8e,att8f,att8g,att8h,att8i,att8j,att9a,att9b,att9c,att10
0,0.66,0.62,4,250,2,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0.0
1,0.66,0.5,4,263,3,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0.0
2,0.32,0.74,3,211,3,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0.2
3,0.37,0.57,2,155,3,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,1.0
4,0.41,0.49,2,130,3,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,1.0
