# Classification

Let us pick up where we left off.

In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

In [3]:
dataset = pd.read_csv("https://raw.githubusercontent.com/CodeOp-tech/DA-ML-classification-part01/master/datasets/synth_covid.csv?token=AHM3F3H2SPNCQ2T5OFAGPA3APKVY4")
dataset

Unnamed: 0,blood_pressure,lung_capacity,body_temperature,has_covid
0,132.894691,6.931665,39.270112,0
1,117.128239,6.715135,37.005833,1
2,108.982006,6.580677,38.079465,0
3,112.337762,5.482720,37.662576,0
4,113.165263,6.664360,36.922810,1
...,...,...,...,...
995,116.208860,7.408413,37.088040,0
996,108.632769,6.854598,36.226869,1
997,137.732933,3.548004,35.543415,0
998,108.552490,2.931925,37.007822,0


In [4]:
dataset.columns

Index(['blood_pressure', 'lung_capacity', 'body_temperature', 'has_covid'], dtype='object')

In [5]:
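from sklearn.model_selection import train_test_split
# Hold out 40% of the rows for testing. No random_state is set, so the split
# (and every score below) changes from run to run.
feature_columns = ['blood_pressure', 'lung_capacity', 'body_temperature']
X_train, X_test, y_train, y_test = train_test_split(
    dataset[feature_columns], dataset['has_covid'], test_size=0.4)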

To compare the models, we will use the **accuracy score**: the fraction of test samples whose predicted label matches the true label.

For the other classification metrics available in scikit-learn, see the [model evaluation guide](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics).
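
To make the metric concrete, here is a minimal pure-Python sketch of what an accuracy score computes. The helper name `accuracy` and the demo labels are made up for illustration and are not part of the dataset or of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels, not taken from the dataset above.
y_true_demo = [0, 1, 0, 0, 1, 1, 0, 0]
y_pred_demo = [0, 0, 0, 0, 1, 0, 0, 1]

# 5 of the 8 predictions match the true labels.
demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print(demo_accuracy)
```

On label arrays like these, `sklearn.metrics.accuracy_score(y_true_demo, y_pred_demo)` should return the same fraction.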

In [6]:
from sklearn.metrics import accuracy_score

# Despite its name, this dict stores accuracy scores (higher is better), keyed by model.
losses = {}

## Logistic Regression

 - If you use linear regression for classification, the predicted y is a continuous value that is not guaranteed to lie between 0 and 1.
 - Logistic regression keeps the predicted y between 0 and 1, so it can be read as the probability of `has_covid`. This is why we use it here.
 - Further reading: [Difference between linear regression and logistic classifier](https://www.analyticsvidhya.com/blog/2020/12/beginners-take-how-logistic-regression-is-related-to-linear-regression/#:~:text=The%20Differences%20between%20Linear%20Regression,Logistic%20regression%20provides%20discreet%20output.)
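
The points above can be sketched with the logistic (sigmoid) function, which squashes any real-valued linear score into the interval (0, 1). The scores below are hypothetical, and thresholding the probability at 0.5 is a common convention assumed here for illustration.

```python
import math


def sigmoid(z):
    """Map any real-valued score z to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical linear scores (w . x + b); a plain linear model would output these directly.
linear_scores = [-4.0, -1.0, 0.0, 1.0, 4.0]

# Passing the scores through the sigmoid turns them into values usable as probabilities.
probabilities = [sigmoid(z) for z in linear_scores]

# Assumed decision rule: predict class 1 when the probability is at least 0.5.
predicted_labels = [1 if p >= 0.5 else 0 for p in probabilities]

for z, p, label in zip(linear_scores, probabilities, predicted_labels):
    print(f"score={z:+.1f}  probability={p:.3f}  label={label}")
```

In the notebook, `lr.predict_proba(X_test)` exposes the fitted model's class probabilities, while `lr.predict(X_test)` returns the class labels.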

In [7]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)

LogisticRegression()

In [8]:
lr.predict(X_test)

array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,

In [9]:
# accuracy_score was already imported above; predict() returns class labels, so no rounding is needed.
losses['Logistic'] = accuracy_score(y_test, lr.predict(X_test))
print(losses)

{'Logistic': 0.69}


## Support Vector Machines (SVM)


The advantages of support vector machines are:

 - Effective in high dimensional spaces.
 - Still effective in cases where number of dimensions is greater than the number of samples.
 - Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
 - Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

 - If the number of features is much greater than the number of samples, choosing the kernel function and the regularization term carefully is crucial to avoid over-fitting.
 - SVMs do not directly provide probability estimates; in scikit-learn these are calculated using an expensive five-fold cross-validation (see "Scores and probabilities" in the scikit-learn SVM user guide).
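
To illustrate what "different kernel functions" means, here is a minimal pure-Python sketch of three common kernels. The formulas follow the usual definitions (the same forms scikit-learn documents for `'linear'`, `'poly'`, and `'rbf'`), but the parameter defaults below are chosen for illustration and are not scikit-learn's defaults.

```python
import math


def linear_kernel(x, y):
    """Plain dot product <x, y>."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=0.0):
    """(gamma * <x, y> + coef0) ** degree."""
    return (gamma * linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * ||x - y||^2); equals 1 when x == y and decays with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical 2-D points, used only to show the kernels side by side.
u = [1.0, 2.0]
v = [3.0, 0.5]

print("linear:", linear_kernel(u, v))
print("poly  :", polynomial_kernel(u, v))
print("rbf   :", rbf_kernel(u, v))
```

A kernel acts as a similarity measure between two samples; swapping the kernel changes the shape of the decision boundary the SVM can learn.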

In [10]:
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train, y_train)

SVC()

In [11]:
clf.predict(X_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [12]:
# Note: the key says "linear", but clf above was fit with SVC()'s default RBF kernel.
losses['SVM (linear)'] = accuracy_score(y_test, clf.predict(X_test).round())
for key, value in losses.items():
    print(key, ' : ', value)

Logistic  :  0.69
SVM (linear)  :  0.6125
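
A useful reference point for these scores is the accuracy of a model that always predicts the most frequent class. The SVM predictions printed above are all 0 in the portion shown, which is what such a constant predictor would output. The sketch below uses a hypothetical helper and made-up labels to show how that baseline is computed.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical labels: five samples of class 0 and three of class 1.
y_demo = [0] * 5 + [1] * 3
baseline = majority_baseline_accuracy(y_demo)
print(baseline)
```

In the notebook, `majority_baseline_accuracy(list(y_test))` would give the score to beat; a classifier whose accuracy equals this value has not necessarily learned anything from the features.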


In [13]:
clf = svm.SVC(kernel='poly')
clf.fit(X_train, y_train)
losses['SVM (polynomial)'] = accuracy_score(y_test, clf.predict(X_test).round())
for key, value in losses.items():
    print(key, ' : ', value)

Logistic  :  0.69
SVM (linear)  :  0.6125
SVM (polynomial)  :  0.615


## Nearest Neighbors

We are going to go over the k-nearest neighbors (k-NN) algorithm.
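
As a rough sketch of the idea, k-NN classifies a new point by finding the k closest training points and taking a majority vote over their labels. The function name, the points, and the labels below are hypothetical. scikit-learn's `KNeighborsClassifier` uses `n_neighbors=5` by default and much faster neighbor search than this brute-force version.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label, closest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D training data: two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

near_first_cluster = knn_predict(points, labels, (1.1, 1.0), k=3)
near_second_cluster = knn_predict(points, labels, (5.0, 5.0), k=3)
print(near_first_cluster, near_second_cluster)
```

Because the vote depends on raw distances, features on larger scales (such as blood pressure versus lung capacity) can dominate unless the features are standardized first.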

In [14]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [15]:
knn.predict(X_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,

In [16]:
losses['KNN'] = accuracy_score(y_test, knn.predict(X_test).round())
for key, value in losses.items():
    print(key, ' : ', value)

Logistic  :  0.69
SVM (linear)  :  0.6125
SVM (polynomial)  :  0.615
KNN  :  0.605


In [17]:
knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(X_train, y_train)
losses['KNN (13)'] = accuracy_score(y_test, knn.predict(X_test).round())
for key, value in losses.items():
    print(key, ' : ', value)

Logistic  :  0.69
SVM (linear)  :  0.6125
SVM (polynomial)  :  0.615
KNN  :  0.605
KNN (13)  :  0.6425


## XGBoost
XGBoost is a library that has been widely used in applied machine learning and Kaggle competitions for structured (tabular) data.

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

[Further reading](https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/)
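
To give a feel for "gradient boosted decision trees", here is a minimal sketch of plain gradient boosting with squared error on a single feature, using one-split trees (stumps). This is not XGBoost's actual algorithm, which adds regularization, second-order gradient information, and many engineering optimizations. All names and data below are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    # Skip the largest value so the right-hand side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Hypothetical 1-D regression data with a jump between x=3 and x=4.
xs_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys_demo = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

mean_y = sum(ys_demo) / len(ys_demo)
mse_before = sum((y - mean_y) ** 2 for y in ys_demo) / len(ys_demo)

model = fit_boosted_stumps(xs_demo, ys_demo)
mse_after = sum((y - boosted_predict(model, x)) ** 2
                for x, y in zip(xs_demo, ys_demo)) / len(ys_demo)
print(mse_before, mse_after)
```

Each round corrects part of the error left by the previous rounds, which is the core idea that `XGBClassifier` builds on with deeper trees and a classification loss.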

In [18]:
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train, y_train)



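ModuleNotFoundError: No module named 'xgboost'

This error means the `xgboost` package is not installed in the environment, which is why the XGBoost cells below show no output. Install it (for example with `pip install xgboost`) and rerun them.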

In [None]:
xgb.predict(X_test)

In [None]:
losses['XGBoost'] = accuracy_score(y_test, xgb.predict(X_test).round())
for key, value in losses.items():
    print(key, ' : ', value)

In [None]:
xgb = XGBClassifier(max_depth=3)
xgb.fit(X_train, y_train)
losses['XGBoost: maxdepth3'] = accuracy_score(y_test, xgb.predict(X_test).round())
for key, value in losses.items():
    print(key, ' : ', value)

Reference:
- [User Guide for scikit-learn](https://scikit-learn.org/stable/user_guide.html)