# Classification using scikit-learn

- **Classification Problem**: assign a discrete class label to an example based on its input features
    - *Binary Classification*: two classes, e.g. cancer vs no cancer, cats vs dogs
    - *Multiclass Classification*: more than two classes, e.g. handwritten digits, cats vs dogs vs monkeys
- **scikit-learn**: a simple Python machine learning library, https://scikit-learn.org/

### Dataset
**UCI Heart Disease Dataset**: https://www.kaggle.com/ronitf/heart-disease-uci<br/>
Goal: predict the presence or absence of heart disease from health-related features

**Data Preprocessing Techniques**:
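- *One-Hot Encoding*: replace each categorical feature (e.g. `cp`, `thal`, `slope`) with one 0/1 column per category
- *Feature Normalization*: rescale each numeric feature to the range 0 to 1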
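
The two preprocessing techniques above can be sketched on a tiny hand-made table. This is only an illustration: the values are made up, and `load_data.py` may implement these steps differently. It uses `pandas.get_dummies` for one-hot encoding and a plain min-max formula for normalization.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the heart disease data.
toy = pd.DataFrame({
    'age': [29, 54, 77],
    'chol': [200, 300, 250],
    'cp': [0, 2, 3],  # categorical: chest pain type
})

# One-hot encoding: replace 'cp' with one 0/1 column per category seen in the data.
encoded = pd.get_dummies(toy, columns=['cp'], dtype=float)

# Min-max normalization: rescale each numeric column to the range [0, 1].
for col in ['age', 'chol']:
    lo, hi = encoded[col].min(), encoded[col].max()
    encoded[col] = (encoded[col] - lo) / (hi - lo)

print(encoded)
```

In a real pipeline the minimum and maximum would usually be computed on the training data only and then reused for the testing data, so that no information leaks from the test set.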
    
**Data for Models**:
- *Training Data*: examples used to train the model
- *Testing Data*: examples used to test the model, separate from training data
- We randomly choose 75% of all data for training, and the remaining 25% for testing. 
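
A 75% / 25% random split like the one described above could be produced with `train_test_split`, which this notebook imports. The sketch below uses synthetic stand-in data and assumed names (`X`, `y`, `X_tr`, ...); the actual split is done inside `load_data.py` and may use different settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 examples, 3 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = np.array([0, 1] * 50)

# test_size=0.25 holds out 25% of the examples for testing.
# stratify=y keeps the class proportions similar in both parts.
# random_state fixes the shuffle so the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

print(X_tr.shape, X_te.shape)
```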

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [3]:
# External script provides X_train, y_train, X_test, y_test as pandas DataFrames.
# X_train: features of training examples  
# y_train: labels of training examples
# X_test: features of testing examples  
# y_test: labels of testing examples

%run -i load_data.py

In [4]:
# Original Dataset 
data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [5]:
# Features of training examples, after preprocessing 
X_train.head()

Unnamed: 0,age,sex,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,ca,...,cp_1,cp_2,cp_3,thal_0,thal_1,thal_2,thal_3,slope_0,slope_1,slope_2
153,0.770833,0.0,0.490566,0.347032,0.0,0.0,0.618321,0.0,0.0,0.25,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
23,0.666667,1.0,0.528302,0.267123,1.0,0.5,0.503817,1.0,0.16129,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
263,0.708333,0.0,0.132075,0.326484,0.0,0.5,0.748092,1.0,0.290323,0.5,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
110,0.729167,0.0,0.811321,0.454338,0.0,0.5,0.633588,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
81,0.333333,1.0,0.320755,0.415525,0.0,0.0,0.755725,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


### Simple Classifiers

We have implemented k-NN for you here! How do other simple classifiers compare? 

- **K-Nearest Neighbors (k-NN)**: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
- **Naive Bayes**: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
- **Decision Trees**: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
- **Logistic Regression**: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
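
One way to compare the four classifiers listed above is to fit each on the same training data and report the same metrics. The sketch below runs on synthetic data from `make_classification` so that it is self-contained; the hyperparameter values are arbitrary choices for illustration, and the numbers it prints say nothing about the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (a stand-in for X_train / X_test).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

classifiers = {
    'k-NN': KNeighborsClassifier(n_neighbors=10, weights='distance'),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(max_depth=3, random_state=0),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}

results = {}
for name, model in classifiers.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)                # hard class labels, used for accuracy
    scores = model.predict_proba(X_te)[:, 1]  # probability of class 1, used for AUROC
    results[name] = (accuracy_score(y_te, pred), roc_auc_score(y_te, scores))
    print(f'{name}: accuracy={results[name][0]:.3f}, AUROC={results[name][1]:.3f}')
```

To run the same comparison on the notebook's data, swap in `X_train`, `y_train`, `X_test`, and `y_test`.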

In [6]:
# k-NN classifier example

from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=10, weights='distance')
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:,1]

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Accuracy:  0.7850877192982456
AUROC:  0.8818471833177716
              precision    recall  f1-score   support

           0       0.76      0.76      0.76       102
           1       0.81      0.80      0.80       126

    accuracy                           0.79       228
   macro avg       0.78      0.78      0.78       228
weighted avg       0.79      0.79      0.79       228



In [7]:
# ToDo: Implement your own simple classifier here and see how it compares to k-NN :) 

from sklearn import linear_model

lr = linear_model.LogisticRegression()
lr.fit(X_train, y_train)

pred = lr.predict(X_test)
scores = lr.predict_proba(X_test)[:,1]

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Accuracy:  0.7763157894736842
AUROC:  0.8765950824774354
              precision    recall  f1-score   support

           0       0.75      0.75      0.75       102
           1       0.80      0.80      0.80       126

    accuracy                           0.78       228
   macro avg       0.77      0.77      0.77       228
weighted avg       0.78      0.78      0.78       228





### Gridsearch
**Gridsearch**: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html


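Gridsearch tunes hyperparameters by trying every combination of values in a grid that you specify, scoring each combination with cross-validation on the training data, and keeping the best-scoring one. The result is only optimal among the values in the grid. What hyperparameters are relevant for the classifier(s) you implemented above? What hyperparameter values are optimal according to gridsearch?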
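
To make the search concrete, the loop below does by hand roughly what `GridSearchCV` automates: enumerate every combination with `ParameterGrid`, score each with `cross_val_score`, and keep the best. It is a simplified sketch on synthetic data, not a replacement for `GridSearchCV`, which by default also refits the best model on the full training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 3 values x 2 values = 6 hyperparameter combinations to try.
grid = ParameterGrid({'n_neighbors': [1, 5, 10],
                      'weights': ['uniform', 'distance']})

best_score, best_params = -1.0, None
for params in grid:
    # Mean accuracy over 3 cross-validation folds for this combination.
    score = cross_val_score(KNeighborsClassifier(**params), X, y, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, params

print('Best Hyperparameters: ', best_params)
print('Mean CV accuracy: ', best_score)
```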

In [180]:
# k-NN classifier with gridsearch example

from sklearn import neighbors
from sklearn.model_selection import GridSearchCV

base_clf = neighbors.KNeighborsClassifier()
parameters = {'n_neighbors': [1, 2, 5, 10, 15, 25], 'weights': ['uniform', 'distance']}

clf = GridSearchCV(base_clf, parameters, cv=3)
clf.fit(X_train, y_train)
print('Best Hyperparameters: ', clf.best_params_, '\n')

pred = clf.predict(X_test)
scores = clf.predict_proba(X_test)[:,1]   

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Best Hyperparameters:  {'weights': 'distance', 'n_neighbors': 10} 

Accuracy:  0.7850877192982456
AUROC:  0.8818471833177716
              precision    recall  f1-score   support

           0       0.76      0.76      0.76       102
           1       0.81      0.80      0.80       126

   micro avg       0.79      0.79      0.79       228
   macro avg       0.78      0.78      0.78       228
weighted avg       0.79      0.79      0.79       228



In [12]:
# ToDo: Implement a different classifier here with gridsearch :) 
from sklearn.model_selection import GridSearchCV

base_lg = linear_model.LogisticRegression()
parameters = {'max_iter': [75, 100, 150, 200, 300, 400],
              'fit_intercept': [True, False],
              'solver': ['newton-cg', 'lbfgs', 'liblinear']}

lg = GridSearchCV(base_lg, parameters, cv=10)
lg.fit(X_train, y_train)
print('Best Hyperparameters: ', lg.best_params_, '\n')

pred = lg.predict(X_test)
scores = lg.predict_proba(X_test)[:,1]   

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Best Hyperparameters:  {'fit_intercept': True, 'max_iter': 75, 'solver': 'newton-cg'} 

Accuracy:  0.7807017543859649
AUROC:  0.8786959228135699
              precision    recall  f1-score   support

           0       0.76      0.75      0.75       102
           1       0.80      0.81      0.80       126

    accuracy                           0.78       228
   macro avg       0.78      0.78      0.78       228
weighted avg       0.78      0.78      0.78       228





## Ensemble Methods (Advanced)

**Ensemble Methods**: https://scikit-learn.org/stable/modules/ensemble.html <br/>
Ensemble methods combine the predictions of several base classifiers, which often improves generalization and robustness compared with any single classifier. There are two common approaches:
- *Averaging Methods*: build several classifiers independently and average their predictions
- *Boosting Methods*: build several classifiers sequentially, where each new classifier tries to improve the previous one

- **Random Forest**: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
- **Gradient Boosted Decision Trees (GBDT)**: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
- **Voting**: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
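
As a rough illustration of the three ensemble classes linked above, the sketch below fits each one on synthetic data and prints its AUROC. The settings (`n_estimators=100`, soft voting over logistic regression and naive Bayes) are arbitrary choices for the example, not recommendations, and the scores do not reflect the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (a stand-in for X_train / X_test).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

ensembles = {
    # Averaging: many trees built independently, predictions averaged.
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
    # Boosting: trees built sequentially, each correcting the previous ones.
    'GBDT': GradientBoostingClassifier(n_estimators=100, random_state=0),
    # Voting: 'soft' averages the predicted probabilities of different models.
    'Voting': VotingClassifier(
        estimators=[('lr', LogisticRegression(max_iter=1000)),
                    ('nb', GaussianNB())],
        voting='soft'),
}

aurocs = {}
for name, model in ensembles.items():
    model.fit(X_tr, y_tr)
    aurocs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f'{name}: AUROC={aurocs[name]:.3f}')
```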

In [7]:
# ToDo: Implement an ensemble method here! Use gridsearch to tune hyperparameters. 
from sklearn import ensemble

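# One possible completion (an assumption, not the reference solution): GBDT tuned with gridsearch
gbdt = GridSearchCV(ensemble.GradientBoostingClassifier(random_state=0),
                    {'n_estimators': [50, 100, 200], 'max_depth': [2, 3]}, cv=3)
gbdt.fit(X_train, y_train)
print('Best Hyperparameters: ', gbdt.best_params_, '\n')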

















