<div align="center">
  <h1>Classification Problems</h1>
  "Does an object belong to a set or not."
</div>

In [38]:
from IPython.display import Image

Image(url="http://scikit-learn.org/stable/_images/sphx_glr_plot_classification_0011.png", width=500)

## 1 Useful Algorithms

### KNN

KNN predicts the class of a point from its k nearest neighbors in the training data. With uniform weighting, the point gets the majority class among those neighbors; with distance weighting, each neighbor's vote is weighted by the inverse of its distance, so closer neighbors count more.

sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)

**n_neighbors** ... default is 5 <br>
**weights** ... 'uniform' or 'distance', default is ‘uniform’ (= all points in each neighborhood are weighted equally) <br>
**algorithm** ... ‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’, default is 'auto' <br>
**leaf_size** ... default is 30 <br>
**metric** ... chosen distance metric, default is ‘minkowski’ <br>
**p** ... power parameter for Minkowski metric, default is 2 (euclidean distance) <br>
**metric_params** ... additional keyword arguments for the metric function, default is None <br>
**n_jobs** ... number of parallel jobs to run, does not affect fit method. Default is 1

#### Useful Functions

fit(X,y) <br>
kneighbors(X=None, n_neighbors=None, return_distance=True) <br> 
kneighbors_graph(X=None, n_neighbors=None, mode='connectivity') <br>
predict(X) <br>
predict_proba(X) <br>
score(X, y, sample_weight=None)

http://scikit-learn.org/stable/modules/neighbors.html#classification <br>
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [39]:
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

n = KNeighborsClassifier(n_neighbors=3)
n.fit(X, y)

print(n.predict([[1.1]]))
print(n.predict_proba([[0.9]]))

[0]
[[ 0.66666667  0.33333333]]
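
The difference between the two `weights` settings shows up when the closest neighbor disagrees with the majority. The sketch below uses made-up 1-D data (not from the original example) chosen so that the query point has one very close class-1 neighbor and two more distant class-0 neighbors; `kneighbors` is used to inspect which training points were selected.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data, invented for illustration: two class-0 points and two class-1 points.
X = [[0.0], [1.0], [2.0], [6.0]]
y = [0, 0, 1, 1]
query = [[1.9]]

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

# kneighbors returns the distances to, and the row indices of, the k nearest training points.
distances, indices = uniform.kneighbors(query)
print("neighbor indices:", indices[0])
print("neighbor classes:", [y[i] for i in indices[0]])

# Uniform voting follows the majority of the three neighbors;
# distance weighting lets the very close neighbor dominate.
print("uniform :", uniform.predict(query))
print("distance:", distance.predict(query))
```

With these values the three nearest neighbors are the points at 2.0, 1.0 and 0.0, so uniform voting picks class 0 (two votes to one) while distance weighting picks class 1, because the neighbor at distance 0.1 outweighs the other two.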


### K-Fold

K-Fold is a cross-validation splitter rather than a classifier. It provides train/test indices that split the dataset into k consecutive folds (without shuffling by default).

Each fold is then used once as the validation set while the k - 1 remaining folds form the training set.

sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)

**n_splits** ... number of folds, must be >= 2, default is 3 in the version shown here (5 from scikit-learn 0.22 onward) <br>
**shuffle** ... shuffle data before splitting, optional, default is False <br>
**random_state** ... None, int or RandomState; seeds the pseudo-random number generator used for shuffling, so it only matters when shuffle is True. If None, the default numpy RNG is used

#### Useful Functions

get_n_splits(X=None, y=None, groups=None)<br>
split(X, y=None, groups=None)

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

In [41]:
from sklearn.model_selection import KFold
import numpy as np

X = np.array([[5, 6], [7, 8], [5, 6], [7, 8]])
y = np.array([5, 6, 7, 8])
kf = KFold(n_splits=2)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]
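
To make the "k consecutive folds" idea concrete, here is a minimal sketch that rebuilds the unshuffled split with plain NumPy and compares it with `KFold`. The helper name `consecutive_folds` is hypothetical and not part of scikit-learn; it only illustrates the default, unshuffled behavior.

```python
import numpy as np
from sklearn.model_selection import KFold

def consecutive_folds(n_samples, k):
    """Yield (train_idx, test_idx) for k consecutive, unshuffled folds (illustrative helper)."""
    folds = np.array_split(np.arange(n_samples), k)  # earlier folds get the extra samples
    for i, test_idx in enumerate(folds):
        # Training indices are all folds except the i-th one.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

n_samples, k = 10, 3
X_dummy = np.zeros((n_samples, 1))  # KFold only needs the number of rows

manual = list(consecutive_folds(n_samples, k))
reference = list(KFold(n_splits=k).split(X_dummy))

for (train_m, test_m), (train_r, test_r) in zip(manual, reference):
    print("TEST:", test_m, "matches KFold:", np.array_equal(test_m, test_r))
```

Every sample lands in exactly one test fold, which is why each point is used for validation once and for training k - 1 times.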


## 2 Important Functions

### Precision

pr = tp / (tp + fp)
                                
tp ... number of true positives <br>
fp ... number of false positives.

sklearn.metrics.precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)

**y_true** ... correct target values <br>
**y_pred** ... predicted target values <br>
**labels** ... optional <br>
**pos_label** ... str or int, 1 by default; the class treated as positive when average='binary' <br>
**average** ... None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted' <br>
**sample_weight** ... optional

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html


In [21]:
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print(precision_score(y_true, y_pred, average=None))
print(precision_score(y_true, y_pred, average='weighted'))
print(precision_score(y_true, y_pred, average='micro'))
# print(precision_score(y_true, y_pred)) # error because input is multiclass

[ 0.66666667  0.          0.        ]
0.222222222222
0.333333333333
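
The per-class values printed above can be reproduced from the formula pr = tp / (tp + fp). The following sketch derives tp and fp from a confusion matrix (rows are true classes, columns are predicted classes) and checks the result against `precision_score`; it assumes every class is predicted at least once, so no division by zero occurs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)  # cm[i, j] = samples of true class i predicted as class j
tp = np.diag(cm)                       # correctly predicted samples per class
fp = cm.sum(axis=0) - tp               # predicted as the class but actually another class

manual_precision = tp / (tp + fp)
print(manual_precision)
print(precision_score(y_true, y_pred, average=None))
```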


### Recall

re = tp / (tp + fn)

tp ... number of true positives <br>
fn ... number of false negatives

sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)

**y_true** ... correct target values <br>
**y_pred** ... predicted target values <br>
**labels** ... optional <br>
**pos_label** ... str or int, 1 by default; the class treated as positive when average='binary' <br>
**average** ... None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted' <br>
**sample_weight** ... optional

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

In [22]:
from sklearn.metrics import recall_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print(recall_score(y_true, y_pred, average=None))
print(recall_score(y_true, y_pred, average='weighted'))
print(recall_score(y_true, y_pred, average='micro'))
# print(recall_score(y_true, y_pred)) # error because input is multiclass

[ 1.  0.  0.]
0.333333333333
0.333333333333
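
In the example above all classes have the same support, so the averaging modes are hard to tell apart. The sketch below uses a small imbalanced label set (invented for illustration) and computes the 'macro', 'weighted' and 'micro' recall by hand. For single-label problems like this one, weighted and micro recall both reduce to the fraction of correctly classified samples.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Imbalanced toy labels, made up for this illustration: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tp = np.diag(cm)
support = cm.sum(axis=1)          # tp + fn = number of true samples per class

per_class = tp / support          # re = tp / (tp + fn)
macro = per_class.mean()          # unweighted mean over classes
weighted = np.average(per_class, weights=support)  # mean weighted by class support
micro = tp.sum() / support.sum()  # pooled counts over all classes

print(per_class, macro, weighted, micro)
print(recall_score(y_true, y_pred, average='macro'))
```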


### F1 scores

F1 = 2 \* (precision \* recall) / (precision + recall)

sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)

**y_true** ... correct target values <br>
**y_pred** ... predicted target values <br>
**labels** ... optional <br>
**pos_label** ... str or int, 1 by default; the class treated as positive when average='binary' <br>
**average** ... None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted' <br>
**sample_weight** ... optional

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

In [79]:
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print(f1_score(y_true, y_pred, average=None))
print(f1_score(y_true, y_pred, average='weighted'))
print(f1_score(y_true, y_pred, average='micro'))

[ 0.8  0.   0. ]
0.266666666667
0.333333333333
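
As a quick check of the F1 formula, this sketch combines precision and recall by hand for a small binary example (labels invented for illustration) and compares the result with `f1_score`. A binary case is used so that the default average='binary' applies and neither precision nor recall is zero.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Binary toy labels, made up for this illustration; class 1 is the positive class.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # tp / (tp + fp) = 2 / 3
recall = recall_score(y_true, y_pred)        # tp / (tp + fn) = 2 / 4

# Harmonic mean of precision and recall.
manual_f1 = 2 * (precision * recall) / (precision + recall)

print(manual_f1)
print(f1_score(y_true, y_pred))
```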


### Classification Report

Builds a text report showing precision, recall, F1 score and support for each class.

sklearn.metrics.classification_report(y_true, y_pred, labels=None, target_names=None, sample_weight=None, digits=2)

**y_true** ... correct target values <br>
**y_pred** ... predicted target values <br>
**labels** ... optional <br>
**target_names** ... optional <br>
**sample_weight** ... optional <br>
**digits** ... for formatting output

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

In [24]:
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print(classification_report(y_true, y_pred))

# Note: support is the number of occurrences of each class in y_true.

             precision    recall  f1-score   support

          0       0.67      1.00      0.80         2
          1       0.00      0.00      0.00         2
          2       0.00      0.00      0.00         2

avg / total       0.22      0.33      0.27         6
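
The `target_names` and `digits` parameters only affect how the report is displayed. A minimal sketch, assuming three hypothetical display names for the classes 0, 1 and 2:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# Hypothetical display names, listed in the same order as the sorted labels 0, 1, 2.
names = ["class A", "class B", "class C"]

report = classification_report(y_true, y_pred, target_names=names, digits=3)
print(report)
```

The exact layout of the summary rows depends on the installed scikit-learn version, but the per-class rows use the given names and three decimal places.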



## 3 Student Exercise

Write a program that uses KFold to split data into a training and a test set.

Then use KNN to classify the data into classes.

Finally, print the classification report.

Hint: To get data, use "from sklearn import datasets" and then "datasets.load_digits()" or "datasets.load_iris()".

In [78]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn import datasets

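data = datasets.load_digits()
X = data.data
y = data.target
np.unique(y)  # class labels 0-9; not displayed because it is not the cell's last expression

kf = KFold(n_splits=2)

# Each iteration overwrites the previous split, so only the last one is kept:
# the first half of the data for training and the second half for testing.
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Fit on the training half and evaluate on the held-out half.
n = KNeighborsClassifier(n_neighbors=7)
n.fit(X_train, y_train)
score = n.score(X_test, y_test)  # mean accuracy on the test set
pred = n.predict(X_test)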

print(classification_report(y_test, pred))
print(score)


             precision    recall  f1-score   support

          0       0.99      1.00      0.99        88
          1       0.94      0.98      0.96        91
          2       0.98      0.92      0.95        86
          3       0.89      0.90      0.90        91
          4       1.00      0.92      0.96        92
          5       0.96      0.98      0.97        91
          6       0.99      1.00      0.99        91
          7       0.94      1.00      0.97        89
          8       0.97      0.89      0.93        87
          9       0.89      0.93      0.91        92

avg / total       0.95      0.95      0.95       898

0.952115812918
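
The solution above evaluates a single train/test split. One possible variation, shown here as a sketch rather than as the expected answer, fits and scores the classifier inside the loop so that every fold serves as the test set once. The choices of 5 folds, shuffle=True and random_state=0 are assumptions made for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

data = datasets.load_digits()
X, y = data.data, data.target

# Shuffling avoids folds that depend on the order of the samples;
# a fixed random_state keeps the split reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_index, test_index in kf.split(X):
    clf = KNeighborsClassifier(n_neighbors=7)
    clf.fit(X[train_index], y[train_index])
    scores.append(clf.score(X[test_index], y[test_index]))  # accuracy on this fold

print("accuracy per fold:", np.round(scores, 3))
print("mean accuracy:", np.mean(scores))
```

Averaging over the folds gives a more stable estimate of accuracy than a single split, at the cost of fitting the model k times.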


### More Helpful Resources:

http://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html <br>
https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/