# Chapter 3 Classification

The two most common tasks in **supervised machine learning** are **regression** (predicting values) and **classification** (predicting classes). In this chapter, we focus on building classification systems.

## Data Preparation

We will use the MNIST dataset of handwritten digit images as an example.

- Load the MNIST dataset using sklearn.datasets.fetch_openml() (the older fetch_mldata() has been removed from scikit-learn), or from http://yann.lecun.com/exdb/mnist/ with the python-mnist package.
- Construct a training set and a test set. We will use the training set to build the classifier and the test set to evaluate its performance.
- Explore the dataset (find the size of the dataset, show a random image, show multiple images).

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
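# Attempt 1: get MNIST from OpenML
# (fetch_mldata and mldata.org are no longer available; fetch_openml replaces them)
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist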

In [None]:
mnist['DESCR']

In [None]:
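images = mnist['data']
# Cast labels to integers so that comparisons such as labels == 5 work as expected
labels = mnist['target'].astype(np.uint8)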

In [None]:
images.shape

In [None]:
some_digit = images[12345]
some_digit

In [None]:
# Use imshow() from matplotlib to show the image
some_digit = images[54321]
some_digit = some_digit.reshape([28, 28])
plt.imshow(some_digit,
           cmap=matplotlib.cm.binary)
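
The "construct training set and test set" step can be sketched as a shuffled split. This is a minimal illustration on placeholder arrays rather than the real MNIST data; the helper name `split_train_test`, the 20% test ratio, and the seed are assumptions made for the example.

```python
import numpy as np

def split_train_test(images, labels, test_ratio=0.2, seed=42):
    """Shuffle the row indices once, then cut them into a test part and a training part."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(images))
    n_test = int(len(images) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return images[train_idx], images[test_idx], labels[train_idx], labels[test_idx]

# Placeholder data with the same row layout as MNIST (784 pixels per image)
demo_images = np.zeros((100, 784), dtype=np.uint8)
demo_labels = np.arange(100) % 10

X_train, X_test, y_train, y_test = split_train_test(demo_images, demo_labels)
print(X_train.shape, X_test.shape)
```

scikit-learn also provides `sklearn.model_selection.train_test_split`, which performs the same kind of split.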

## Build a Binary Classifier

To start, we aim to build a binary classifier that identifies whether a handwritten digit is a five.

- Create the labels for binary classification (1 for five, and 0 for all other digits)
- Apply the **k-nearest-neighbor algorithm** using sklearn.neighbors.KNeighborsClassifier.
- Use its fit() method to train the model, and its predict() method to make predictions on given images.
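
The three steps above might look as follows. To keep the sketch small and runnable offline, it assumes scikit-learn's bundled `load_digits` dataset (8x8 images) as a stand-in for MNIST; the choice of `n_neighbors=3` and the 80/20 split are also assumptions, not tuned values.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small bundled digits dataset, used here instead of MNIST
digits = load_digits()
X, y = digits.data, digits.target

# Binary labels: 1 for the digit five, 0 for every other digit
y_is_five = (y == 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_is_five, test_size=0.2, random_state=42, stratify=y_is_five
)

# Train the k-nearest-neighbor classifier, then predict on a few test images
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test[:10])
print(predictions)
```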

## K-Nearest Neighbor (KNN) Method

- Used to classify new data points based on "distance" to known data
- Find the K nearest neighbors, based on your distance metric
- Let the K neighbors vote on the classification; the class with the most votes wins
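
To make the voting idea concrete, here is a from-scratch sketch of KNN for a single new point. It assumes Euclidean distance as the metric and uses a tiny made-up 2-D dataset; the function name `knn_predict` is hypothetical and is not part of scikit-learn.

```python
from collections import Counter

import numpy as np

def knn_predict(X_known, y_known, x_new, k=3):
    """Classify x_new by majority vote among its k nearest known points."""
    # Euclidean distance from x_new to every known point
    distances = np.linalg.norm(X_known - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Each neighbor votes with its own label
    votes = Counter(y_known[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well-separated clusters: class 0 near the origin, class 1 near (5, 5)
X_known = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_known = np.array([0, 0, 0, 1, 1, 1])

label_near_origin = knn_predict(X_known, y_known, np.array([0.3, 0.3]))
label_near_five = knn_predict(X_known, y_known, np.array([4.9, 5.0]))
print(label_near_origin, label_near_five)
```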

## Evaluate Performance of a Classifier

- Use sklearn.metrics.accuracy_score to calculate classification accuracy on the training set and on the test set.
- Display the images that the model misclassifies.
- Use cross-validation to evaluate the performance of the model on various training and test sets.
- Use a confusion matrix to show the number of **false positives** and **false negatives**.
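
As a small illustration of the first two items, the sketch below computes accuracy and locates the misclassified examples. The label arrays are made up for the example; with a real model, `y_pred` would come from its predict() method, and the indices in `wrong_indices` could be used to select the images to display.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for eight images
y_true = np.array([0, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 0, 0])

# Fraction of predictions that match the true labels
accuracy = accuracy_score(y_true, y_pred)

# Positions where the model is wrong
wrong_indices = np.flatnonzero(y_true != y_pred)

print(accuracy)
print(wrong_indices)
```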

### Cross-Validation
- partition the dataset into k mutually exclusive subsets
- perform training on all but the 1st set, test the performance on the 1st set.
- perform training on all but the 2nd set, test the performance on the 2nd set.
- perform training on all but the 3rd set, test the performance on the 3rd set.
- ....
- perform training on all but the last set, test the performance on the last set.

In this way, the model is trained on k different training sets and tested on k different test sets. If the performance is acceptable on every test set, we can have high confidence in the model.
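
The partitioning procedure can be sketched as a generator of index pairs. The helper name `k_fold_indices` is hypothetical, and the sketch does not shuffle the data first, which real use usually would.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs, holding out each of the k subsets in turn."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        # Train on every subset except the i-th one
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

In scikit-learn, `sklearn.model_selection.cross_val_score` wraps this loop together with model fitting and scoring.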

### Confusion Matrix
For each pair of classes A and B:
- count the number of instances of class A being classified as B
- count the number of instances of class B being classified as A

The numbers will form an $n\times n$ matrix, where $n$ is the number of classes.

**For binary classifiers**:
confusion matrix = [[TN, FP], [FN, TP]]

- TN: true negative
- FP: false positive
- FN: false negative
- TP: true positive
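
The layout above matches what `sklearn.metrics.confusion_matrix` returns for labels 0 and 1 (rows are true classes, columns are predicted classes). A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for eight images
y_true = np.array([0, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 0, 0])

cm = confusion_matrix(y_true, y_pred)

# Flatten [[TN, FP], [FN, TP]] into its four entries
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```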

**Precision** = TP / (TP + FP)

- What does precision represent?
- Can a bad model have high precision?


**Recall** = TP / (TP + FN)
- What does recall represent?
- Can a bad model have high recall?

$F_1$ **score**

$F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}} = \frac{2 \cdot precision \cdot recall}{precision + recall}$
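
A short worked example may help with the questions above. It computes precision, recall, and $F_1$ (the harmonic mean of the two) from raw counts. All counts are made up, and the helper `precision_recall_f1` is hypothetical. The second call models a classifier that labels every image as positive, which suggests one answer to "can a bad model have high recall?".

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from counts (assumes tp > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A reasonable model: 30 true positives, 10 false positives, 20 false negatives
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)

# A model that predicts "positive" for everything on 50 positives and 950 negatives:
# it misses nothing (recall 1.0) but almost all of its positive predictions are wrong
p_all, r_all, f1_all = precision_recall_f1(tp=50, fp=950, fn=0)

print(p, r, f1)
print(p_all, r_all, f1_all)
```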

## More Performance Measures

- Precision-Recall tradeoff
- Distribution of scores
- ROC curve: false positive rate vs. true positive rate
- AUC (area under the ROC curve)
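
These measures all work on classifier scores rather than hard predictions. The sketch below uses four made-up scores to show the scikit-learn calls involved; with a real classifier, the scores would typically come from a method such as predict_proba().

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

# Hypothetical true labels and classifier scores (higher means "more likely positive")
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Precision and recall at each candidate threshold (the precision-recall tradeoff)
precisions, recalls, pr_thresholds = precision_recall_curve(y_true, scores)

# ROC curve: false positive rate vs. true positive rate at each threshold
fpr, tpr, roc_thresholds = roc_curve(y_true, scores)

# Area under the ROC curve
auc = roc_auc_score(y_true, scores)

print(precisions, recalls)
print(fpr, tpr)
print(auc)
```

Plotting `fpr` against `tpr` with plt.plot() gives the ROC curve; plotting `recalls` against `precisions` gives the precision-recall curve.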