<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Principal Component Analysis (PCA) Examples

---

In [None]:
# usual imports
import numpy as np
import matplotlib.pyplot as plt
import pylab as pl
import pandas as pd

%matplotlib inline

# new import!
from sklearn.decomposition import PCA

### Iris Dataset (i.e. scikit-learn iris)  
Load the sklearn `iris` dataset.  This is one of the built-in datasets included in scikit-learn (and one we've seen before).

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

Take a look at the dataset:

In [None]:
print(iris.DESCR)

In [None]:
print(iris.data);

In [None]:
X = iris.data
y = iris.target
target_names = iris.target_names
target_names

# standardize each feature to zero mean and unit variance (PCA is sensitive to feature scale)
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
X_scaled

The `PCA` class takes an argument `n_components`, which specifies how many principal components to keep.  This dataset has only 4 features, so let's start by keeping 2:

In [None]:
# create the model and fit the data
pca = PCA(n_components=2)
X_r = pca.fit_transform(X_scaled)
X_r;
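
As a side note, `n_components` does not have to be an integer. The sketch below is a minimal, self-contained illustration (the names `pca_2` and `pca_95` are made up for this example) of passing either a count or a fraction of variance to retain. It assumes `svd_solver="full"`, which scikit-learn documents as supporting the fractional form.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# same standardized iris data as above, rebuilt so this cell runs on its own
X_scaled = scale(load_iris().data)

# integer: keep exactly 2 components
pca_2 = PCA(n_components=2)
X_2 = pca_2.fit_transform(X_scaled)
print(X_2.shape)                     # one row per sample, one column per component

# float in (0, 1): keep as many components as needed to explain
# at least that fraction of the variance
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_95 = pca_95.fit_transform(X_scaled)
print(pca_95.n_components_)          # number of components actually kept
```

The fitted attribute `n_components_` reports how many components the fractional form ended up keeping.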

How much of the variance do the first two principal components explain?  The PCA class has an attribute `explained_variance_ratio_` that reports this information:

In [None]:
# show the fraction of variance explained (first two components):
print("First component: {}".format(pca.explained_variance_ratio_[0]))
print("Second component: {}".format(pca.explained_variance_ratio_[1]))
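
To make the meaning of these ratios concrete, here is a small sketch (using a hypothetical model name `pca_full`) that recomputes them by hand: each ratio is the variance along one component divided by the total variance of the data. It keeps all components so that the ratios sum to 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

X_scaled = scale(load_iris().data)

# keep every component so the explained variances add up to the total variance
pca_full = PCA().fit(X_scaled)

# total variance of the data (sample variance, ddof=1, summed over features)
total_variance = X_scaled.var(axis=0, ddof=1).sum()

# ratio for component k = variance along component k / total variance
manual_ratios = pca_full.explained_variance_ / total_variance

print(np.allclose(manual_ratios, pca_full.explained_variance_ratio_))
print(pca_full.explained_variance_ratio_.sum())
```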

In [None]:
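# only the last expression in a cell is displayed, so print both tables
# the original features are correlated with each other ...
print(pd.DataFrame(X).corr())
# ... while the principal components are uncorrelated
print(pd.DataFrame(X_r).corr())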

In [None]:
pca.explained_variance_

In [None]:
print(pca.components_)
print(iris['feature_names'])

# projecting the first (scaled) sample onto each component reproduces the first row of X_r
print(X_scaled[0].dot(pca.components_[0]))
print(X_scaled[0].dot(pca.components_[1]))

X_r
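
The dot products above generalize to the whole dataset. The following sketch (with illustrative names `manual_projection` and `gram`) checks that, with the default `whiten=False`, the transformed data is the centered data multiplied by the transposed component matrix, that the component vectors are orthonormal, and that the resulting columns are uncorrelated.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

X_scaled = scale(load_iris().data)

pca = PCA(n_components=2)
X_r = pca.fit_transform(X_scaled)

# project every sample at once: center with pca.mean_, then rotate
manual_projection = (X_scaled - pca.mean_) @ pca.components_.T
print(np.allclose(manual_projection, X_r))

# the rows of components_ are orthonormal directions in feature space
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(2)))

# the two projected columns are uncorrelated with each other
print(abs(np.corrcoef(X_r, rowvar=False)[0, 1]) < 1e-8)
```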

We can see that the first principal component explains most of the variance.  Since we kept only 2 components we can use a simple 2-dimensional plot to view the datapoints in the new coordinate system.  We'll label them using our known target info:

In [None]:
pl.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    pl.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
pl.legend()
pl.title('PCA of IRIS dataset')

pl.show()

We can use a plot to help validate our choice of `n_components`.  Let's refit the model, but this time keep all components, which is the default behavior if `n_components` is not specified:

In [None]:
# create the model and fit the data - no n_components set
# (fit on the scaled data, as before, so the results are comparable):
pca = PCA()
X_r = pca.fit_transform(X_scaled)

As before, the explained variance ratios are in `pca.explained_variance_ratio_`, but this time there should be 4 of them, one per component:

In [None]:
ratios = pca.explained_variance_ratio_
print(ratios)

In [None]:
print(pca.components_)

In [None]:
comp_id = np.arange(1, len(ratios) + 1)  # id number of each component (1-based)

fig = plt.figure(figsize=(8,5));

plt.plot(comp_id, ratios, 'ro-', linewidth=2);
plt.title('Scree Plot');
plt.xlabel('Principal Component');
plt.xticks(comp_id);
plt.ylabel('Explained Variance Ratio');

There is a clear 'elbow in the curve', so it looks like our choice of `2` components was ok.  Let's look at another dataset that has more features per record.  
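
Reading an elbow off a plot is a judgment call, so it can help to cross-check it numerically. One common heuristic, sketched here on the standardized iris data, is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The 95% threshold and the names `threshold` and `k` are assumptions for illustration, not a rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

X_scaled = scale(load_iris().data)

ratios = PCA().fit(X_scaled).explained_variance_ratio_
cumulative = np.cumsum(ratios)       # variance explained by the first 1, 2, ... components

threshold = 0.95                     # assumed target; adjust to taste
# index of the first cumulative value that reaches the threshold, converted to a count
k = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative)
print("components needed for {:.0%} of the variance: {}".format(threshold, k))
```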

### Handwritten Digits Dataset (i.e. scikit-learn digits)  

Load the sklearn `digits` dataset, which contains a set of 8x8 pixel images of handwritten digits.  This is one of the built-in datasets included in scikit-learn.

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

Take a look at the dataset:

In [None]:
print(digits.DESCR)

Notice that each row in the dataset has 64 features, one for each of the individual pixels making up the image, where the value of each feature is the greyscale level (0 to 16).
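
To see how a 64-feature row maps back to a picture, the sketch below reshapes the first row into an 8x8 grid and compares it with the matching entry in `digits.images`, which scikit-learn provides alongside the flattened `digits.data`. The names `first_row` and `first_image` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

first_row = digits.data[0]               # flat vector of 64 pixel values
first_image = first_row.reshape(8, 8)    # same values laid out as an 8x8 image

print(first_row.shape, first_image.shape)
print(np.array_equal(first_image, digits.images[0]))

# smallest and largest greyscale level found anywhere in the dataset
print(digits.data.min(), digits.data.max())
```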

In [None]:
# print digits

In [None]:
X, y = digits.data, digits.target

print("data shape: {}, target shape: {}".format(X.shape, y.shape))
print("classes: {}".format(list(np.unique(y))))
y

In [None]:
n_samples, n_features = X.shape
print("n_samples = {}".format(n_samples))
print("n_features = {}".format(n_features))

Load the pixel data into a pandas DataFrame to inspect the first few rows:

In [None]:
# note: this uses pandas indexing, so temporarily load into pandas dataframe
Xpd = pd.DataFrame(digits.data) # explanatory (or independent or feature) variables
Xpd.head()

### EXERCISE: Principal Component Analysis for Digits Data Set

#### 1. Fit and transform digits data set using PCA

#### 2. What are the explained variance ratios?

#### 3. Plot the variances and determine appropriate number of components to use.

In [None]:
def plot_gallery(data, labels, shape, interpolation='nearest'):
    '''helper function to plot images of the digits side by side'''
    for i in range(data.shape[0]):
        plt.subplot(1, data.shape[0], (i + 1))
        plt.imshow(data[i].reshape(shape), interpolation=interpolation, cmap=pl.cm.binary)
        plt.title(labels[i])
        plt.xticks(())   # hide the axis ticks
        plt.yticks(())

In [None]:
subsample = np.random.permutation(X.shape[0])[:4]      # pick 4 random records 
images = X[subsample]
labels = ['True class: %d' % l for l in y[subsample]]  # label with the true (known) value

plot_gallery(images, labels, shape=(8, 8))             # plot them in grayscale

In [None]:
pca = PCA(n_components=30)

X_pca = pca.fit_transform(Xpd)

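ratios = pca.explained_variance_ratio_
# cumulative explained variance versus number of components kept (1-based)
plt.plot(range(1, len(ratios) + 1), ratios.cumsum());
plt.xlabel('Number of components');
plt.ylabel('Cumulative explained variance ratio');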

In [None]:
# reconstruct the same 4 digits from their 30 principal components
images2 = pca.inverse_transform(X_pca[subsample])
labels = ['True class: %d' % l for l in y[subsample]]  # label with the true (known) value

plot_gallery(images2, labels, shape=(8, 8))             # plot the reconstructions in grayscale
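
The reconstructions above lose some detail because only 30 of the 64 components are kept. As a rough way to quantify this, the sketch below measures the mean squared reconstruction error for a few values of `n_components`. The specific values tried and the names `errors` and `X_hat` are assumptions for illustration, and `svd_solver="full"` is set only to keep the result deterministic.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

errors = {}
for n in (5, 15, 30, 64):
    pca_n = PCA(n_components=n, svd_solver="full")
    X_reduced = pca_n.fit_transform(X)            # compress to n components
    X_hat = pca_n.inverse_transform(X_reduced)    # map back to 64 pixel values
    errors[n] = np.mean((X - X_hat) ** 2)         # mean squared error per pixel

for n, err in errors.items():
    print("n_components = {:2d}  ->  reconstruction MSE = {:.4f}".format(n, err))
```

The error shrinks as more components are kept and essentially vanishes when all 64 are retained, since the transform is then just a rotation of the original data.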

### Extra: Visualization of Digits

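Adapted from: https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html

The cells below tile the first 400 rows (i.e. digits) of the dataset into a single 20x20 grid image: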

In [None]:
n_img_per_row = 20                                    # number of digits per row

img = np.zeros((10*n_img_per_row, 10*n_img_per_row))  # generate a new 200x200 array filled with zeros

In [None]:
# set each 8x8 area of the img to the values of each row (reshaped from 1x64 to 8x8)       
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix+8, iy:iy+8] = X[i*n_img_per_row + j].reshape((8, 8)) 

In [None]:
pl.figure(figsize=(8, 8), dpi=250);     # define a figure, with size (width and height) and resolution
pl.imshow(img, cmap=pl.cm.binary);      # show the image using a binary color map
pl.xticks([]); # no x ticks
pl.yticks([]); # no y ticks
pl.title('A Selection from the 64-Dimensional Digits Dataset\n', fontsize=16);
