## Chapter 3 – Classification

_This notebook contains sample code adapted from chapter 3._

### Setup

First, let's make sure this notebook works well in both Python 2 and 3, import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures:

In [1]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

from time import time

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

### Load MNIST Dataset

In [2]:
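# fetch_mldata was removed in scikit-learn 0.22; fetch_openml replaces it
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist["target"] = mnist["target"].astype(np.uint8)  # OpenML stores the labels as strings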

In [None]:
X, y = mnist["data"], mnist["target"]
X.shape

In [None]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

In [None]:
import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
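To see why one permutation index is applied to both arrays, here is a minimal sketch on toy data. The names `X_toy` and `y_toy` are illustrative and not part of the notebook. Indexing both arrays with the same permutation reorders the rows while keeping each row paired with its label.

```python
import numpy as np

rng = np.random.RandomState(42)

# Toy data: row i is [i, i] and its label is i, so pairing is easy to check
X_toy = np.array([[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]])
y_toy = np.arange(5)

perm = rng.permutation(len(X_toy))        # one random ordering of the row indices
X_shuf, y_shuf = X_toy[perm], y_toy[perm]  # same ordering applied to both arrays

# Every shuffled row still matches its label
print((X_shuf[:, 0] == y_shuf).all())
```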

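Set up a scaler for the data. The fit_transform call below is commented out, so the models in this notebook train on the raw pixel values.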

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#X_train = scaler.fit_transform(X_train)
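As a sketch of how the scaler would typically be applied if the commented line were enabled: fit it on the training set only, then reuse the learned mean and standard deviation on the test set. The small arrays below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
X_train_toy = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
X_test_toy = np.array([[2.0, 20.0]])

scaler_toy = StandardScaler()
X_train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns mean and std from the training data
X_test_scaled = scaler_toy.transform(X_test_toy)        # reuses the training statistics

print(X_train_scaled.mean(axis=0))  # each column is centered on 0
print(X_train_scaled.std(axis=0))   # each column has unit standard deviation
print(X_test_scaled)                # the test row equals the training mean, so it maps to 0
```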

### A Binary Classifier: 5 or Not 5

Try several models on a simpler problem: binary classification.

First, set up the training and test labels for the 5-or-not-5 classifier. The input features (X_train and X_test) remain the same.

In [None]:
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
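The comparison above produces a boolean array with one entry per label. A minimal sketch on a made-up label vector (`y_toy` is illustrative only):

```python
import numpy as np

# Made-up digit labels
y_toy = np.array([5, 0, 4, 5, 9, 5])

# Element-wise comparison: True where the digit is 5, False elsewhere
y_toy_5 = (y_toy == 5)

print(y_toy_5)        # [ True False False  True False  True]
print(y_toy_5.sum())  # 3 positive instances
```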

### 1. Logistic Regression Classifier 

Logistic regression is a linear classification model.

In [None]:
from sklearn.linear_model import LogisticRegression
logit_clf = LogisticRegression(solver='lbfgs')  # 'liblinear', the default before scikit-learn 0.22, is very slow here

In [None]:
start_time = time()
logit_clf.fit(X_train, y_train_5)
print('Time elapsed: %.2fs' % (time()-start_time))
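To make "linear" concrete, the sketch below fits a logistic regression on a small synthetic dataset (an assumption for illustration, not MNIST) and rebuilds its outputs by hand: a weighted sum of the features plus a bias, passed through the sigmoid function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem standing in for the real data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)
clf = LogisticRegression(solver='lbfgs').fit(X_toy, y_toy)

# Linear score for each instance: w . x + b
scores = X_toy.dot(clf.coef_.ravel()) + clf.intercept_[0]

# The sigmoid turns the score into the estimated probability of the positive class
proba = 1.0 / (1.0 + np.exp(-scores))

print(np.allclose(scores, clf.decision_function(X_toy)))
print(np.allclose(proba, clf.predict_proba(X_toy)[:, 1]))
```

The predicted class is positive exactly when the linear score is above zero, which is what makes the decision boundary a hyperplane.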

Use 3-fold cross-validation to evaluate the model

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(logit_clf, X_train, y_train_5, cv=3, scoring="accuracy")
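For intuition, here is a rough sketch of what 3-fold cross-validation does, written out by hand on synthetic data (an assumption for illustration). For a classifier and an integer `cv`, scikit-learn uses stratified folds, so the manual loop uses `StratifiedKFold`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=42)
clf = LogisticRegression(solver='lbfgs')

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=3).split(X_toy, y_toy):
    fold_clf = clone(clf)                             # fresh, unfitted copy for each fold
    fold_clf.fit(X_toy[train_idx], y_toy[train_idx])  # train on two thirds of the data
    y_val_pred = fold_clf.predict(X_toy[val_idx])     # predict the held-out third
    manual_scores.append(np.mean(y_val_pred == y_toy[val_idx]))

library_scores = cross_val_score(clf, X_toy, y_toy, cv=3, scoring="accuracy")
print(manual_scores)
print(library_scores)
```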

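Get out-of-fold predictions for the training set using cross_val_predict. Each prediction comes from a model that did not see that instance during training, so the test data stays untouched.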

In [None]:
from sklearn.model_selection import cross_val_predict
y_train_pred_logit_clf = cross_val_predict(logit_clf, X_train, y_train_5, cv=3)
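The sketch below illustrates why out-of-fold predictions matter. It swaps in a decision tree and synthetic data (both assumptions for illustration), because an unconstrained tree memorizes its training set and makes the gap easy to see.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=42)
tree = DecisionTreeClassifier(random_state=42)

# Predicting on the very data the model was fit on: the tree has memorized it
train_acc = np.mean(tree.fit(X_toy, y_toy).predict(X_toy) == y_toy)

# Out-of-fold predictions: one per instance, each from a model that never saw it
y_oof = cross_val_predict(tree, X_toy, y_toy, cv=3)
oof_acc = np.mean(y_oof == y_toy)

print("accuracy on the data used for fitting: %.3f" % train_acc)
print("accuracy of out-of-fold predictions:   %.3f" % oof_acc)
```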

Print the confusion matrix $C$. Rows correspond to actual classes and columns to predicted classes: the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$, and false positives is $C_{0,1}$.

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred_logit_clf)
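A small hand-checkable example of this layout, using made-up labels and predictions. Flattening the 2x2 matrix row by row gives the four counts in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions
y_true = np.array([False, False, False, True, True,  True, True, False])
y_pred = np.array([False, True,  False, True, False, True, True, False])

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # row-major order: C[0,0], C[0,1], C[1,0], C[1,1]

print(cm)
print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))  # TN=3 FP=1 FN=1 TP=3
```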

Get the precision score 

In [None]:
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred_logit_clf)

And the recall score

In [None]:
recall_score(y_train_5, y_train_pred_logit_clf)
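Both scores follow directly from the confusion matrix counts: precision is TP / (TP + FP) and recall is TP / (TP + FN). A minimal check on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up data: 4 actual positives, 3 predicted positives
y_true = np.array([True, True, True,  True,  False, False, False, False, False, False])
y_pred = np.array([True, True, False, False, True,  False, False, False, False, False])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / float(tp + fp)  # of the instances predicted positive, the share that really are
recall = tp / float(tp + fn)     # of the actual positives, the share that were found

print("precision=%.3f recall=%.3f" % (precision, recall))  # precision=0.667 recall=0.500
```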

Now evaluate the model on the test set.

In [None]:
y_pred_logit_clf = logit_clf.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test_5, y_pred_logit_clf)

Check the precision and recall of the model on the test data, starting with precision:

In [None]:
from sklearn.metrics import precision_score, recall_score

precision_score(y_test_5, y_pred_logit_clf)

And the recall score

In [None]:
recall_score(y_test_5, y_pred_logit_clf)

### 2. Stochastic Gradient Descent Classifier 

SGDClassifier also fits a linear classification model, training it with stochastic gradient descent.

In [None]:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(max_iter=5, random_state=42, verbose=2)

In [None]:
start_time = time()
sgd_clf.fit(X_train, y_train_5)
print('Time elapsed: %.2fs' % (time()-start_time))
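To give a feel for the training procedure, here is a simplified NumPy sketch of stochastic gradient descent on the hinge loss, which is `SGDClassifier`'s default loss. It is not scikit-learn's implementation: the fixed learning rate `eta`, the toy data, and the update details are assumptions of this sketch, and scikit-learn uses a decaying learning-rate schedule by default.

```python
import numpy as np

rng = np.random.RandomState(42)

# Toy linearly separable data: label is +1 when x0 + x1 > 0, else -1
X_toy = rng.randn(200, 2)
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 1, -1)

w = np.zeros(2)  # weights
b = 0.0          # bias
eta = 0.1        # fixed learning rate (assumption of this sketch)
alpha = 1e-4     # L2 regularization strength

for epoch in range(5):                     # 5 passes over the data, like max_iter=5
    for i in rng.permutation(len(X_toy)):  # one instance at a time, in random order
        margin = y_toy[i] * (X_toy[i].dot(w) + b)
        w *= (1.0 - eta * alpha)           # step along the L2 penalty gradient
        if margin < 1:                     # hinge loss is nonzero for this instance
            w += eta * y_toy[i] * X_toy[i]
            b += eta * y_toy[i]

accuracy = np.mean(np.sign(X_toy.dot(w) + b) == y_toy)
print("weights:", w, "bias:", b)
print("training accuracy: %.3f" % accuracy)
```

Each update looks at a single instance, which is what makes the method cheap per step and well suited to large datasets.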

Cross-validate the model

In [None]:
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")

In [None]:
y_train_pred_sgd_clf = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

In [None]:
confusion_matrix(y_train_5, y_train_pred_sgd_clf)

Compared with the logistic regression model, the SGD model produces more false positives. The logistic regression confusion matrix is shown again for comparison:

In [None]:
confusion_matrix(y_train_5, y_train_pred_logit_clf)

In [None]:
precision_score(y_train_5, y_train_pred_sgd_clf)

In [None]:
recall_score(y_train_5, y_train_pred_sgd_clf)

Try the trained model on our test data

In [None]:
y_pred_sgd_clf = sgd_clf.predict(X_test)

In [None]:
confusion_matrix(y_test_5, y_pred_sgd_clf)

In [None]:
precision_score(y_test_5, y_pred_sgd_clf)

In [None]:
recall_score(y_test_5, y_pred_sgd_clf)

### 3. Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(random_state=42)

In [None]:
start_time = time()
dt_clf.fit(X_train, y_train_5)
print('Time elapsed: %.2fs' % (time()-start_time))

In [None]:
cross_val_score(dt_clf, X_train, y_train_5, cv=3, scoring="accuracy")

In [None]:
y_train_pred_dt_clf = cross_val_predict(dt_clf, X_train, y_train_5, cv=3)

In [None]:
confusion_matrix(y_train_5, y_train_pred_dt_clf)

In [None]:
precision_score(y_train_5, y_train_pred_dt_clf)

In [None]:
recall_score(y_train_5, y_train_pred_dt_clf)

Try on the test data

In [None]:
y_pred_dt_clf = dt_clf.predict(X_test)

In [None]:
confusion_matrix(y_test_5, y_pred_dt_clf)

In [None]:
precision_score(y_test_5, y_pred_dt_clf)

In [None]:
recall_score(y_test_5, y_pred_dt_clf)

### 4. Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)

In [None]:
start_time = time()
forest_clf.fit(X_train, y_train_5)
print('Time elapsed: %.2fs' % (time()-start_time))

In [None]:
cross_val_score(forest_clf, X_train, y_train_5, cv=3, scoring="accuracy")

In [None]:
y_train_pred_forest_clf = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)

In [None]:
confusion_matrix(y_train_5, y_train_pred_forest_clf)

In [None]:
precision_score(y_train_5, y_train_pred_forest_clf)

In [None]:
recall_score(y_train_5, y_train_pred_forest_clf)

Try on the test data

In [None]:
y_pred_forest_clf = forest_clf.predict(X_test)

In [None]:
confusion_matrix(y_test_5, y_pred_forest_clf)

In [None]:
precision_score(y_test_5, y_pred_forest_clf)

In [None]:
recall_score(y_test_5, y_pred_forest_clf)

### 5. KNN (K-Nearest Neighbors) Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_jobs=-1, n_neighbors=3)

In [None]:
start_time = time()
knn_clf.fit(X_train, y_train_5)
print('Time elapsed: %.2fs' % (time()-start_time))

In [None]:
cross_val_score(knn_clf, X_train, y_train_5, cv=3, scoring="accuracy")

In [None]:
y_train_pred_knn_clf = cross_val_predict(knn_clf, X_train, y_train_5, cv=3)

In [None]:
confusion_matrix(y_train_5, y_train_pred_knn_clf)

In [None]:
precision_score(y_train_5, y_train_pred_knn_clf)

In [None]:
recall_score(y_train_5, y_train_pred_knn_clf)

Try on the test data

In [None]:
y_pred_knn_clf = knn_clf.predict(X_test)

In [None]:
confusion_matrix(y_test_5, y_pred_knn_clf)

In [None]:
from sklearn.metrics import precision_score, recall_score
precision_score(y_test_5, y_pred_knn_clf)

In [None]:
recall_score(y_test_5, y_pred_knn_clf)
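The cross-validate-then-score pattern repeated for each classifier above could be wrapped in a small helper. The function name `evaluate_binary_clf` is hypothetical, and the sketch runs on synthetic data with two of the models so that it stays quick.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def evaluate_binary_clf(clf, X, y, cv=3):
    """Hypothetical helper: out-of-fold predictions, then the metrics used in this notebook."""
    y_pred = cross_val_predict(clf, X, y, cv=cv)
    return {
        "confusion_matrix": confusion_matrix(y, y_pred),
        "precision": precision_score(y, y_pred),
        "recall": recall_score(y, y_pred),
    }


# Synthetic stand-in for the 5-or-not-5 problem
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=42)

models = [
    ("logit", LogisticRegression(solver='lbfgs')),
    ("tree", DecisionTreeClassifier(random_state=42)),
]

results = {}
for name, clf in models:
    results[name] = evaluate_binary_clf(clf, X_toy, y_toy)
    print("%s: precision=%.3f recall=%.3f"
          % (name, results[name]["precision"], results[name]["recall"]))
```

Collecting the results in one dictionary makes it easier to compare the models side by side than scrolling between cells.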