# GA Data Science (DAT19) - Lab 15

### Dimensionality Reduction and Principal Component Analysis (PCA)

In [None]:
# usual imports
import numpy as np
import matplotlib.pyplot as plt
import pylab as pl
import pandas as pd
from bokeh.plotting import figure,show,output_notebook

output_notebook()
%matplotlib inline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
# scikit-learn algorithm that is new!
from sklearn.decomposition import PCA


### Iris Dataset (i.e. scikit-learn iris)
Load the sklearn `iris` dataset.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

In [None]:
print(iris)

In [None]:
X = iris.data
y = iris.target
target_names = iris.target_names

The PCA algorithm takes an argument `n_components` which specifies how many of the principal components we want to keep.  This dataset has only 4 features, so let's try keeping 2 to start: 

In [None]:
# create the model and fit the data
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

In [None]:
X_r

How much of the variance do the first two principal components explain?  The PCA class has an attribute `explained_variance_ratio_` that reports this information:

In [None]:
# Percentage of variance explained (first two components):
print("First component: " + str(pca.explained_variance_ratio_[0]))
print("Second component: " + str(pca.explained_variance_ratio_[1]))
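
To see where these numbers come from, the ratios can be recomputed by hand. The sketch below assumes the standard relationship between PCA and the eigenvalues of the feature covariance matrix; the variable names (`cov`, `eigenvalues`, `manual_ratios`) are illustrative and not part of the lab.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Eigenvalues of the feature covariance matrix are the variances along
# each principal direction; eigvalsh returns them in ascending order.
cov = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Each ratio is one eigenvalue divided by the total variance.
manual_ratios = eigenvalues / eigenvalues.sum()

pca = PCA(n_components=2).fit(X)
print(manual_ratios[:2])
print(pca.explained_variance_ratio_)
```

The two printed arrays should agree up to floating-point error.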

We can see that the first principal component explains most of the variance.  Since we kept only 2 components we can use a simple 2-dimensional plot to view the datapoints in the new coordinate system.  We'll label them using our known target info:

In [None]:
color_mapping = {0:'red',1:'blue',2:'orange'}

colors = list()

for value in y:
    new_color = color_mapping[value]
    colors.append(new_color)


In [None]:
p = figure(title="PCA in Iris Dataset",tools='')

x_values =  X_r[:,0]
y_values =  X_r[:,1]
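# scatter() draws circle markers by default; newer Bokeh releases deprecate circle() with a size argument
p.scatter(x=x_values, y=y_values, size=5, color=colors)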

show(p)

We can use a plot to help validate our choice of `n_components`.  Let's refit the model, but this time keep all components - this is the default behavior if `n_components` is not specified:

In [None]:
# create the model and fit the data - no n_components set:
pca = PCA()
X_r = pca.fit(X).transform(X)

As before, the explained variance ratios are in `pca.explained_variance_ratio_`, but this time there should be 4, one per original feature:

In [None]:
ratios = pca.explained_variance_ratio_
print(ratios)

In [None]:
print(pca.components_)
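
Each row of `pca.components_` is one principal direction expressed in the original feature space. As a rough check of what `transform` does with them, the sketch below assumes the default (non-whitened) PCA, where the projection is just centering followed by a matrix product; `gram` and `manual` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA().fit(X)

# Rows of components_ are unit-length, mutually orthogonal directions,
# so their Gram matrix should be the identity.
gram = np.dot(pca.components_, pca.components_.T)
print(np.allclose(gram, np.eye(4)))

# transform() centers the data, then projects it onto those directions.
manual = np.dot(X - pca.mean_, pca.components_.T)
print(np.allclose(manual, pca.transform(X)))
```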

### Exercise: Plot the explained_variance_ratio

This is called a scree plot. It is used to choose a reasonable number of components to keep.
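
One possible starting point, if you get stuck, is sketched here with matplotlib. Plotting the cumulative ratio alongside the per-component ratio is an addition of this sketch, not a requirement of the exercise, and the labels and figure size are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

ratios = PCA().fit(load_iris().data).explained_variance_ratio_
component_numbers = np.arange(1, len(ratios) + 1)
cumulative = np.cumsum(ratios)

# In a notebook with %matplotlib inline, the figure renders when the cell finishes.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(component_numbers, ratios, marker="o", label="per component")
ax.plot(component_numbers, cumulative, marker="s", label="cumulative")
ax.set_xticks(component_numbers)
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance ratio")
ax.set_title("Scree plot (iris)")
ax.legend()
```

A common reading of such a plot is to look for the "elbow" where the per-component curve flattens out.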

### Let's see how the model performs
When we drop features, one concern is whether we lose too much information. And if we do lose information, do we at least gain speed?

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=0)

We can use the `%%timeit` magic function to time how long a cell takes.

In [None]:
%%timeit model = KNeighborsClassifier(2)
model.fit(X_train, y_train)
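
Outside a notebook the `%%timeit` magic is not available. A rough equivalent, assuming the standard library `timeit` module is acceptable, is sketched below; the repeat and loop counts are arbitrary, and it fits on the full iris data rather than the lab's training split to stay self-contained.

```python
import timeit
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Like the setup statement on the %%timeit line, the model is created once
# and only the fit() call is timed.
model = KNeighborsClassifier(2)

# repeat() returns one total time per round; the minimum is the least noisy.
timings = timeit.repeat(lambda: model.fit(X, y), repeat=5, number=100)
best_per_fit = min(timings) / 100
print("best time per fit: %.6f s" % best_per_fit)
```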

In [None]:
model = KNeighborsClassifier(2)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

In [None]:
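# X_r was last refit with all 4 components, so keep only the first 2 for the reduced comparison
X_r_train, X_r_test, y_r_train, y_r_test = train_test_split(X_r[:, :2], y, test_size=0.2, random_state=0)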

In [None]:
%%timeit model = KNeighborsClassifier(2)
model.fit(X_r_train, y_r_train)

Repeated runs can benefit from caching, which tends to bring the two timings closer together.

In [None]:
model = KNeighborsClassifier(2)
model.fit(X_r_train, y_r_train)
print(classification_report(y_r_test, model.predict(X_r_test)))
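
The cells above fit PCA on the full dataset before splitting, so the test rows influence the projection. A variant that avoids this, assuming scikit-learn's `make_pipeline` helper is acceptable here, fits PCA on the training rows only. This is a sketch for comparison, not the lab's prescribed method; `baseline` and `reduced` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Baseline: k-NN on all four original features.
baseline = KNeighborsClassifier(2).fit(X_train, y_train)

# Reduced: PCA is fit on the training rows only, then k-NN runs on
# the two retained components.
reduced = make_pipeline(PCA(n_components=2), KNeighborsClassifier(2))
reduced.fit(X_train, y_train)

print("4 features:   %.3f" % baseline.score(X_test, y_test))
print("2 components: %.3f" % reduced.score(X_test, y_test))
```

Comparing the two accuracies gives a quick sense of how much information the dropped components carried for this classifier.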

### Handwritten Digits Dataset (i.e. scikit-learn digits)

Load the sklearn `digits` dataset, which contains a set of 8x8 pixel images of handwritten digits.  This is one of the built-in datasets included in scikit-learn.

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

Take a look at the dataset:

In [None]:
print(digits.DESCR)


Notice that each row in the dataset has 64 features, one for each of the individual pixels making up the image, where the value of each feature is the greyscale level (0 to 16).
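
To make the row-to-image mapping concrete, the short sketch below reshapes the first row back into an 8x8 grid and compares it with `digits.images`, the unflattened copy that scikit-learn ships alongside `digits.data`. The name `first_image` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of digits.data is one 8x8 image flattened row by row.
first_image = digits.data[0].reshape((8, 8))
print(first_image)

# digits.images holds the same pixels in their original 8x8 layout.
print(np.array_equal(first_image, digits.images[0]))
print("min=%d max=%d" % (digits.data.min(), digits.data.max()))
```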

In [None]:
# print digits

In [None]:
X, y = pd.DataFrame(digits.data), pd.DataFrame(digits.target)

print("data shape: %r, target shape: %r" % (X.shape, y.shape))
print("classes: %r" % list(np.unique(y)))

In [None]:
n_samples, n_features = X.shape
print("n_samples=%d" % n_samples)
print("n_features=%d" % n_features)

In [None]:
X.head()

In [None]:
n_img_per_row = 20 # number of digits per row
img = np.zeros((10*n_img_per_row, 10*n_img_per_row)) # generate a new 200x200 array filled with zeros
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        # set each 8x8 area of img to one row of X (reshaped from 1x64 to 8x8);
        # .iloc replaces the removed .ix indexer, and .values gives a NumPy array to reshape
        img[ix:ix+8, iy:iy+8] = X.iloc[i*n_img_per_row + j].values.reshape((8, 8))

plt.figure(figsize=(8, 8), dpi=250) # define a figure, with size (width and height) and resolution
#axes(frameon = 0) # remove the frame/border from the axes
plt.imshow(img, cmap=plt.cm.binary) # show the image using a binary color map
plt.xticks([]) # no x ticks
plt.yticks([]) # no y ticks
plt.show()

### EXERCISE: Principal Component Analysis for Digits Data Set

#### 1. Fit and transform digits data set using PCA

#### 2. What are the explained variance ratios?

#### 3. Plot the variances and determine appropriate number of components to use.

#### 4. Show the digits of the PCA transformed version of the digits data!
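
For step 4, one possible reading is to map the reduced data back into pixel space with `inverse_transform` and display the result. The sketch below shows only the round trip; `n_keep = 16` is an arbitrary illustrative value, not the answer to step 3, and the variable names are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data
n_keep = 16  # arbitrary illustrative choice, not a recommended value

pca = PCA(n_components=n_keep).fit(X)
X_reduced = pca.transform(X)                   # shape (n_samples, n_keep)
X_restored = pca.inverse_transform(X_reduced)  # back to 64 pixel values

# Mean squared pixel error introduced by dropping the other components.
mse = np.mean((X - X_restored) ** 2)
print("reduced shape: %r, restored shape: %r" % (X_reduced.shape, X_restored.shape))
print("reconstruction MSE: %.3f" % mse)

# Any restored row can be reshaped to 8x8 and passed to plt.imshow.
first_restored = X_restored[0].reshape((8, 8))
```

The restored rows can be tiled with the same plotting loop used for the original digits to compare them side by side.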

### Bonus Round:
Pick your favorite algorithm and time how long it takes to train.