## PCA Example

Let’s get a better sense for how PCA works by applying it to some image data. The MNIST dataset contains images of handwritten digits from 0 to 9; the original images are 28 × 28 pixels. scikit-learn ships a smaller, lower-resolution digits dataset of the same kind (derived from the UCI optical recognition of handwritten digits data rather than from MNIST itself), in which each image is 8 × 8 pixels. Flattened, each image is therefore a point in 64 dimensions. In Example 6-1, we apply PCA and visualize the dataset using the first three principal components.

In [1]:
from sklearn import datasets
from sklearn.decomposition import PCA

In [2]:
# load the data
digits_data = datasets.load_digits()
n = len(digits_data.images)

In [8]:
digits_data.images.shape

(1797, 8, 8)

In [7]:
# each image is represented as an 8-by-8 array.
# flatten this array as input to PCA
image_data = digits_data.images.reshape((n, -1))
image_data.shape

(1797, 64)

In [9]:
# Ground-truth label of the digit appearing in each image
labels = digits_data.target
labels

array([0, 1, 2, ..., 8, 9, 8])

In [11]:
# Fit a PCA transformer to the dataset
# The number of components is automatically chosen to account for
# at least 80% of the total variance.
pca_transformer = PCA(n_components=0.8)
pca_images = pca_transformer.fit_transform(image_data)
pca_transformer.explained_variance_ratio_

array([0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415,
       0.0491691 , 0.04315987, 0.03661373, 0.03353248, 0.03078806,
       0.02372341, 0.02272697, 0.01821863])

In [19]:
pca_transformer.explained_variance_ratio_[:3].sum()
# The first three principal components account for roughly 40% of the total variance in the dataset

0.4030395858767508
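
The 80% threshold passed to `PCA(n_components=0.8)` can be cross-checked by hand. The following is a minimal sketch, not part of the original example: it fits a PCA with all 64 components, accumulates the explained variance ratios, and finds the smallest number of components whose cumulative ratio reaches 0.8. The names `pca_full`, `cumulative`, and `n_for_80` are introduced here purely for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

# Reload and flatten the digits so the snippet is self-contained.
digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))

# Keep every component so the ratios sum to 1.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Index of the first cumulative ratio >= 0.8, converted to a component count.
n_for_80 = int(np.searchsorted(cumulative, 0.8)) + 1

# Compare with the count scikit-learn selects for the same threshold.
pca_80 = PCA(n_components=0.8).fit(X)
print(n_for_80, pca_80.n_components_)
print(cumulative[:3])
```

The count found this way should agree with the length of the `explained_variance_ratio_` array printed above, and the third entry of `cumulative` corresponds to the three-component sum computed in the previous cell.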

In [27]:
pca_images[0]

array([-1.25946645, 21.27488348, -9.46305462, 13.01418869, -7.12882278,
       -7.44065876,  3.25283716,  2.55347036, -0.58184214,  3.62569695,
        2.58595688,  1.55160708,  0.85449671])

In [20]:
# Visualize the results
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib notebook
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for i in range(100):
    ax.scatter(pca_images[i, 0], pca_images[i, 1], pca_images[i, 2],
              marker=r'${}$'.format(labels[i]), s=64)

ax.set_xlabel('Principal component 1')
ax.set_ylabel('Principal component 2')
ax.set_zlabel('Principal component 3')
# PCA projections of the first 100 digit images; markers correspond to image labels

Text(0.5,0,'Principal component 3')

Since there is a fair amount of overlap between numbers, it would be difficult to tell them apart using a linear classifier in the projected space. Hence, if the task is to classify the handwritten digits and the chosen model is a linear classifier, then the first three principal components are not sufficient as features. Nevertheless, it is interesting to see how much of a 64-dimensional dataset can be captured in just 3 dimensions.
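
One way to probe this claim is to train a linear classifier on the projected data and compare test accuracy for different numbers of components. The sketch below is an illustration under stated assumptions rather than part of the original example: the choice of logistic regression as the linear classifier, the 75/25 split, the `random_state` value, and the helper `pca_accuracy` are all introduced here.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Hold out a test set; the split parameters are arbitrary illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)


def pca_accuracy(n_components):
    """Fit PCA followed by a linear classifier; return held-out accuracy."""
    model = make_pipeline(
        PCA(n_components=n_components, svd_solver="full"),
        LogisticRegression(max_iter=5000),
    )
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


acc_3 = pca_accuracy(3)    # the three components used in the plot
acc_13 = pca_accuracy(13)  # the number of components kept at the 80% threshold
print("3 components:  {:.3f}".format(acc_3))
print("13 components: {:.3f}".format(acc_13))
```

Wrapping PCA and the classifier in one pipeline ensures the projection is fit on the training split only, so the held-out accuracy is not inflated by information from the test images.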

The two main things to remember about PCA are its mechanism (linear projection) and its objective (to maximize the variance of the projected data). The solution involves the eigendecomposition of the covariance matrix, which is closely related to the SVD of the data matrix. One can also remember PCA with the mental picture of squashing the data into a pancake that is as fluffy as possible.
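
The relationship between the covariance eigendecomposition and the SVD can be checked numerically. The sketch below uses plain NumPy on the same digits data and compares both routes against scikit-learn's `PCA`. It assumes the usual convention of dividing by n − 1 when estimating the covariance, and the variable names are illustrative only.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1)).astype(float)

# PCA operates on centered data.
X_centered = X - X.mean(axis=0)
k = 3

# Route 1: eigendecomposition of the 64 x 64 covariance matrix.
# eigh returns eigenvalues in ascending order; the columns of eigvecs
# are the corresponding principal directions.
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top_eigvals = eigvals[::-1][:k]

# Route 2: SVD of the centered data matrix.
# Squared singular values divided by (n - 1) give the same variances,
# and the rows of Vt are the same directions up to sign.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
svd_variances = S[:k] ** 2 / (X.shape[0] - 1)
svd_projection = X_centered @ Vt[:k].T

# Reference: scikit-learn's PCA with an exact solver.
pca = PCA(n_components=k, svd_solver="full")
sk_projection = pca.fit_transform(X)

print(np.allclose(top_eigvals, svd_variances))
print(np.allclose(svd_variances, pca.explained_variance_))
# Each component's sign is arbitrary, so compare absolute values.
print(np.allclose(np.abs(svd_projection), np.abs(sk_projection), atol=1e-6))
```

The projections are compared in absolute value because each principal direction is only defined up to a sign flip, and different routines may pick different signs.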

PCA is a well-known dimensionality reduction method, but it has limitations: it can be computationally expensive, and the resulting components are hard to interpret. It is useful as a preprocessing step, especially when there are linear correlations between features.

In [None]:
# References and credits to
# Feature Engineering for Machine Learning