# Principal Component Analysis

One of the simplest and most widely used techniques in machine learning is principal component analysis (PCA). PCA is a tool for understanding the variance in a dataset; that is, it identifies the main ways in which the observations in the dataset vary.

To introduce the basic idea, imagine that you are given a dataset that contains three variables (columns) and 1,000 observations (rows). The dataset contains the measurements of the sizes (in square centimeters) of three different spots that appear on the wings of a particular kind of butterfly. Each row represents a different butterfly whose spots were measured, and each column represents one of the spots.

At first glance, the data do not appear unusual. You examine the mean and standard deviation of each spot's size and find the following:

|spot|mean|standard deviation|
|:--:|:--:|:------:|
|1 ($x$-axis)|1.2 cm$^2$| 0.095 cm$^2$|
|2 ($y$-axis)|0.7 cm$^2$| 0.128 cm$^2$|
|3 ($z$-axis)|0.4 cm$^2$| 0.065 cm$^2$|
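
As a rough sketch of how summary statistics like those in the table might be computed, the block below rebuilds a dataset using the same recipe as this notebook's data-generating cell (same seed, scales, mixing matrix, and offsets) and then takes the per-column mean and sample standard deviation. The names `rs`, `hidden`, `mixing`, and `spots` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

# Rebuild the data with the notebook's recipe: independent noise along three
# hidden axes, mixed by a fixed 3x3 matrix, then shifted to each spot's mean.
rs = np.random.RandomState(0)
hidden = np.stack([rs.randn(1000) * 0.15,
                   rs.randn(1000) * 0.09,
                   rs.randn(1000) * 0.01], axis=0)
mixing = np.array([[0.408248, -0.816497, 0.408248],
                   [0.816497, 0.526599, 0.236701],
                   [-0.408248, 0.236701, 0.88165]])
# Transpose so that rows are butterflies and columns are spots: (1000, 3).
spots = (mixing @ hidden).T + np.array([1.2, 0.7, 0.4])

# Per-spot (per-column) mean and sample standard deviation.
spot_means = spots.mean(axis=0)
spot_stds = spots.std(axis=0, ddof=1)

for ii, (mu, sd) in enumerate(zip(spot_means, spot_stds), start=1):
    print(f"spot {ii}: mean = {mu:.3f} cm^2, std = {sd:.3f} cm^2")
```

Using `ddof=1` gives the sample standard deviation; whether the table above used that or the population formula (`ddof=0`) is not stated, and with 1,000 rows the difference is negligible.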

You might suspect that the spot sizes covary, so you plot each spot's size against each of the others:

In [None]:
%matplotlib inline

# This code-block generates random data-points and makes a set of plots in
# line with the textual description of the notebook. Understanding this
# code-block is not required for understanding the lesson.

import matplotlib.pyplot as plt
import numpy as np

n = 1000
seed = 0

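np.random.seed(seed)
# Independent noise along three hidden axes with very different spreads.
x0 = np.random.randn(n)*0.15
y0 = np.random.randn(n)*0.09
z0 = np.random.randn(n)*0.01
coords0 = np.stack([x0, y0, z0], axis=0)
# Mix the hidden axes into the measured spot sizes using a fixed
# (approximately orthonormal) 3x3 matrix.
coords = np.dot(
    [[0.408248, -0.816497, 0.408248],
     [0.816497, 0.526599, 0.236701],
     [-0.408248, 0.236701, 0.88165]],
    coords0)
# Shift each spot's sizes to its mean (1.2, 0.7, and 0.4 cm^2).
coords += [[1.2],[0.7],[0.4]]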

(fig,axs) = plt.subplots(1, 3, figsize=(7,1.5), dpi=288)
fig.subplots_adjust(0,0,1,1,0.4,0)

(x,y,z) = coords
axlbls = [r'Spot {ii} Size [cm$^2$]'.format(ii=ii) for ii in (1,2,3)]

axs[0].plot(x, y, 'k.')
axs[0].set_xlabel(axlbls[0])
axs[0].set_ylabel(axlbls[1])

axs[1].plot(y, z, 'k.')
axs[1].set_xlabel(axlbls[1])
axs[1].set_ylabel(axlbls[2])

axs[2].plot(z, x, 'k.')
axs[2].set_xlabel(axlbls[2])
axs[2].set_ylabel(axlbls[0])

for ax in axs:
    ax.axis('equal')

plt.show()
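
The pairwise plots can also be summarized numerically. As a minimal sketch, assuming the same data recipe as the cell above, the block below computes the covariance and correlation matrices of the three spot sizes with `np.cov` and `np.corrcoef`. The names `rs`, `hidden`, `mixing`, and `spots` are illustrative and are chosen so that they do not overwrite the notebook's own variables.

```python
import numpy as np

# Rebuild the data with the notebook's recipe (same seed, scales, and matrix).
rs = np.random.RandomState(0)
hidden = np.stack([rs.randn(1000) * 0.15,
                   rs.randn(1000) * 0.09,
                   rs.randn(1000) * 0.01], axis=0)
mixing = np.array([[0.408248, -0.816497, 0.408248],
                   [0.816497, 0.526599, 0.236701],
                   [-0.408248, 0.236701, 0.88165]])
spots = (mixing @ hidden).T + np.array([1.2, 0.7, 0.4])  # shape (1000, 3)

# rowvar=False tells NumPy that each column (not each row) is a variable.
cov = np.cov(spots, rowvar=False)        # 3x3 covariance matrix, in cm^4
corr = np.corrcoef(spots, rowvar=False)  # 3x3 correlation matrix, unitless

np.set_printoptions(precision=4, suppress=True)
print("covariance:\n", cov)
print("correlation:\n", corr)
```

The diagonal of `cov` holds each spot's variance, and each off-diagonal entry describes how one pair of spots varies together. Like the scatter plots, these numbers describe the data only one pair of spots at a time.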

Next, imagine that you make a 3D animation of the data to get a better sense of its overall structure. When you view the animation, it looks like this:

In [None]:
%matplotlib widget

# We need mpld3 to get an HTML version of the 3D figure.
import mpld3

# This code-block generates a 3D plot of the points generated in the code cell
# above this one. Understanding this code-block is not required for the lesson.

(fig,ax) = plt.subplots(1, 1, figsize=(2,2), dpi=144)
fig.subplots_adjust(0,0,1,1,0,0)
framecount = 600
coordplot = ax.scatter(coords[0], coords[2], c='k', ec=None, s=4)
ax.set_xlim([-0.6,0.6])
ax.set_ylim([-0.6,0.6])
y0 = coords[1] - np.mean(coords[1])
xz0 = coords[[0,2],:]
xz0 = xz0 - np.mean(xz0, axis=1)[:,None]
def _draw_frame(frame):
    # Rotate the centered (x, z) points about the vertical (y) axis.
    th = 2*np.pi*frame / (framecount - 1)
    rmtx = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    points = rmtx @ xz0
    # Plot the rotated horizontal coordinate against the unrotated y values.
    coordplot.set_offsets(np.c_[points[0], y0])
    return (coordplot,)

from matplotlib.animation import FuncAnimation
from IPython.display import HTML, display
anim = FuncAnimation(
    fig,
    _draw_frame,
    frames=framecount,
    interval=30,
    blit=True)

display(HTML(anim.to_jshtml(default_mode='loop')))

plt.close(fig)