<h1 align="center"> PCA to Speed Up Machine Learning Algorithms </h1>

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
<br>
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

Parameters | Number
--- | ---
Classes | 10
Samples per class | ~7000
Samples total | 70000
Dimensionality | 784
Features | integer values from 0 to 255

The MNIST database of handwritten digits is available on the following website: [MNIST Dataset](http://yann.lecun.com/exdb/mnist/)

In [30]:
import pandas as pd
import numpy as np 
# Suppress scientific notation
#np.set_printoptions(suppress=True)

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# Used for Downloading MNIST
# Note: fetch_mldata was removed in scikit-learn 0.22; newer versions provide fetch_openml instead
from sklearn.datasets import fetch_mldata

# Used for Splitting Training and Test Sets
from sklearn.model_selection import train_test_split

%matplotlib inline

## Downloading MNIST Dataset

In [31]:
# Change data_home to wherever you want to download your data
mnist = fetch_mldata('MNIST original', data_home='~/Desktop/alternativeData')
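On scikit-learn 0.22 and later, `fetch_mldata` is no longer available. A minimal sketch of an alternative, assuming the OpenML copy of the dataset (named `mnist_784` there) is acceptable; the helper name `load_mnist` is hypothetical, and the row order of the OpenML copy may differ from the one shown in this notebook:

```python
from sklearn.datasets import fetch_openml


def load_mnist(data_home=None):
    """Download MNIST from OpenML and return (images, labels) as NumPy arrays.

    load_mnist is a hypothetical helper, not part of the original notebook.
    """
    # as_frame=False requests NumPy arrays rather than a pandas DataFrame
    mnist = fetch_openml('mnist_784', version=1, as_frame=False,
                         data_home=data_home)
    # OpenML stores the labels as strings ('0'..'9'); convert them to floats
    # so they resemble the float labels printed elsewhere in this notebook
    return mnist.data, mnist.target.astype(float)
```

Calling `images, labels = load_mnist('~/Desktop/alternativeData')` would then download the data on first use and read it from the cache afterwards.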

In [32]:
mnist

{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([ 0.,  0.,  0., ...,  9.,  9.,  9.])}

In [33]:
# These are the images
mnist.data.shape

(70000, 784)

In [34]:
# These are the labels
mnist.target.shape

(70000,)

## Standardizing the Data

Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if the features were measured on different scales. Otherwise, features with large numeric ranges dominate the principal components.

Notebook going over the importance of feature scaling: http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py


In [35]:
# Standardize features by removing the mean and scaling to unit variance
mnist.data = StandardScaler().fit_transform(mnist.data)
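As a rough illustration of what standardization does, here is a NumPy-only sketch on toy data (not MNIST). It computes the mean and standard deviation on a training portion only and reuses them for the test portion, which is a stricter variant than the cell above and mirrors how PCA is fitted on the training set later on. The function names and the toy array are assumptions made for this example.

```python
import numpy as np


def fit_standardizer(train):
    """Compute per-feature mean and standard deviation from training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Constant features (e.g. pixels that are always 0) have std == 0;
    # use 1.0 there to avoid dividing by zero
    std[std == 0] = 1.0
    return mean, std


def standardize(data, mean, std):
    """Remove the mean and scale to unit variance using the given statistics."""
    return (data - mean) / std


rng = np.random.default_rng(0)
# Toy stand-in for pixel data: 100 samples, 5 features, last feature constant
toy = rng.integers(0, 256, size=(100, 5)).astype(float)
toy[:, -1] = 0.0
train, test = toy[:80], toy[80:]

mean, std = fit_standardizer(train)
train_std = standardize(train, mean, std)
test_std = standardize(test, mean, std)

print(train_std.mean(axis=0).round(6))  # close to 0 for every feature
print(train_std.std(axis=0).round(6))   # close to 1, except the constant feature
```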

## Splitting Data into Training and Test Sets

In [36]:
# test_size: what proportion of original data is used for test set
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)

In [37]:
print(train_img.shape)

(60000, 784)


In [38]:
print(train_lbl.shape)

(60000,)


In [39]:
print(test_img.shape)

(10000, 784)


In [40]:
print(test_lbl.shape)

(10000,)


## PCA to Speed up Machine Learning Algorithms (Logistic Regression)

<b>Step 0:</b> Import and use PCA. After PCA, we will apply a machine learning algorithm of our choice to the transformed data.

In [41]:
from sklearn.decomposition import PCA

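Make an instance of the model. Passing `.95` asks PCA to choose the minimum number of principal components such that 95% of the variance is retained.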

In [42]:
pca = PCA(.95)

Fit PCA on the training set. <b>Note: we are fitting PCA on the training set only.</b>

In [51]:
pca.fit(train_img)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
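To see how many components a 95% variance setting actually keeps, one can inspect the fitted object. The sketch below uses the small 8x8 digits dataset bundled with scikit-learn as a stand-in for MNIST (an assumption made so the example runs without a download); the name `pca_demo` is chosen here to avoid overwriting the notebook's `pca`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small 8x8 digits dataset bundled with scikit-learn (stand-in for MNIST)
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)

# A float between 0 and 1 asks PCA to keep enough components to explain
# at least that fraction of the variance
pca_demo = PCA(n_components=0.95)
pca_demo.fit(X)

# Running total of the variance explained by the kept components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print("original features:", X.shape[1])
print("components kept:", pca_demo.n_components_)
print("variance explained:", cumulative[-1])
```

On the notebook's own objects, the same attributes (`pca.n_components_` and `pca.explained_variance_ratio_`) should be available after `pca.fit(train_img)`.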

Apply the mapping (transform) to <b>both</b> the training set and the test set. 

In [15]:
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)
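The fit-on-training, transform-both pattern can also be delegated to a scikit-learn pipeline, which fits every preprocessing step on the training data only. This is an alternative arrangement, not the approach taken in this notebook; the sketch assumes the bundled 8x8 digits dataset as a stand-in for MNIST and uses the logistic regression classifier that the notebook applies after PCA.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled digits dataset (stand-in for MNIST so no download is needed)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scaling and PCA are fitted on the training data inside pipe.fit,
# then reapplied unchanged whenever the pipeline predicts or scores
pipe = make_pipeline(
    StandardScaler(),
    PCA(0.95),
    LogisticRegression(solver='lbfgs', max_iter=1000),
)
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
n_kept = pipe.named_steps['pca'].n_components_
print("components kept:", n_kept)
print("test accuracy:", test_accuracy)
```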

<b>Step 1: </b> Import the model you want to use

In sklearn, all machine learning models are implemented as Python classes

In [16]:
from sklearn.linear_model import LogisticRegression

<b>Step 2:</b> Make an instance of the Model

In [17]:
# all parameters not specified are set to their defaults
# the default solver is incredibly slow, that's why we change it to 'lbfgs'
logisticRegr = LogisticRegression(solver = 'lbfgs')

<b>Step 3:</b> Train the model on the data, storing the information learned from it

The model learns the relationship between x (digits) and y (labels)

In [18]:
logisticRegr.fit(train_img, train_lbl)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)
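The notebook does not report timings, so here is one way the speed-up could be checked: time the fit with and without PCA. This is a sketch on the bundled 8x8 digits dataset (an assumption, so it runs quickly and offline); on such a small dataset the difference may be negligible or even reversed, and `timed_fit` is a hypothetical helper.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)


def timed_fit(train_x, test_x):
    """Fit logistic regression; return (seconds spent in fit, test accuracy)."""
    model = LogisticRegression(solver='lbfgs', max_iter=1000)
    start = time.perf_counter()
    model.fit(train_x, y_train)
    elapsed = time.perf_counter() - start
    return elapsed, model.score(test_x, y_test)


# Baseline: all original features
full_time, full_acc = timed_fit(X_train, X_test)

# Reduced: components explaining 95% of the training-set variance
pca_demo = PCA(0.95).fit(X_train)
pca_time, pca_acc = timed_fit(pca_demo.transform(X_train),
                              pca_demo.transform(X_test))

print(f"all {X_train.shape[1]} features: {full_time:.3f}s, accuracy {full_acc:.3f}")
print(f"{pca_demo.n_components_} components: {pca_time:.3f}s, accuracy {pca_acc:.3f}")
```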

<b>Step 4:</b> Predict the labels of new data (new images)

Uses the information the model learned during the model training process

In [21]:
# Returns a NumPy Array
# Predict for One Observation (image)
logisticRegr.predict(test_img[0].reshape(1,-1))

array([ 1.])

In [22]:
# Predict for Multiple Observations (images) at Once
logisticRegr.predict(test_img[0:10])

array([ 1.,  9.,  2.,  2.,  7.,  1.,  8.,  3.,  3.,  7.])

## Measuring Model Performance

Accuracy is the fraction of correct predictions: correct predictions / total number of data points.

It measures how the model performs on new data (the test set).

In [30]:
score = logisticRegr.score(test_img, test_lbl)
print(score)

0.9195
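The accuracy reported by `score` can be reproduced by hand. A tiny pure-Python sketch with made-up labels (both lists are hypothetical, not taken from the test set):

```python
# Hypothetical true labels and predictions for ten images
y_true = [1, 9, 2, 2, 7, 1, 8, 3, 3, 7]
y_pred = [1, 9, 2, 2, 7, 1, 8, 3, 5, 7]

# Count positions where the prediction matches the true label
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

print(accuracy)  # 0.9 (9 of 10 predictions are correct)
```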


## F1 Score 

The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 and its worst at 0.
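Written out for a single class, with $TP$, $FP$, and $FN$ denoting true positives, false positives, and false negatives (notation introduced here for illustration):

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

With `average='weighted'`, the per-class scores $F_{1,k}$ are averaged using each class's share of the true labels, where $n_k$ is the number of true samples of class $k$ and $N$ is the total number of samples:

$$
F_1^{\text{weighted}} = \sum_{k} \frac{n_k}{N} \, F_{1,k}
$$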

If you are curious about why accuracy is not always a great metric, see these notes:
https://github.com/mGalarnyk/datasciencecoursera/blob/master/Stanford_Machine_Learning/Week6/MachineLearningSystemDesign.md

In [31]:
pred_label = logisticRegr.predict(test_img)

TODO: consider changing the metric to http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html, which reports precision, recall, and F1 per class.

TODO: make a problem similar to the Coursera one to show the issues that can arise from relying on accuracy alone with default sklearn settings.

In [36]:
metrics.f1_score(test_lbl, pred_label, average='weighted')

0.91923082375011567
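To make the `average='weighted'` option more concrete, the following pure-Python sketch computes a support-weighted F1 by hand on a toy example. The function `weighted_f1` and the label lists are hypothetical; this illustrates the idea and is not scikit-learn's implementation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)


# Toy labels: class 0 and class 1 each score 0.8, class 2 scores 1.0
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

score = weighted_f1(y_true, y_pred)
print(round(score, 4))  # 0.8333, i.e. (3*0.8 + 2*0.8 + 1*1.0) / 6
```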