# Principal Components Lab

In this lab we will apply principal component analysis (PCA) to the Auto-MPG dataset that we studied in the chapter on linear regression.  Before diving into the real data, we will work with the simulated data from the notes to show how to use Python and numpy to
calculate the information we need.

In [None]:
%setup
import holoviews as hv
hv.extension('bokeh')
from holoviews.operation import gridmatrix
from bokeh.palettes import Spectral10, Category10

## Simulated data for demo purposes

First we load the data matrix.

In [None]:
data = pd.read_csv('../data/simulated_pca_data.csv',index_col=0)
data

The data consists of 200 samples, with 15 features per sample.  To carry out principal component analysis,
we must:

1. center the data
2. compute the covariance matrix D
3. find the eigenvectors and eigenvalues of D

Then we can:

4. project the data into the space spanned by the first two eigenvectors of the covariance matrix and plot this
5. draw the loading axes on the plot.


### Step 1. Centering the data

To center the data, we subtract the mean of each column from that column.  The `mean` method computes the mean of each column of the data:

In [None]:
data.mean()

Subtracting this from the data centers it.  `data.mean()` is a Series indexed by column name, and pandas aligns it with the column labels of the data, so each column's mean is subtracted from every entry in that column.

In [None]:
data_centered = data - data.mean()
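
As a quick sanity check, the following minimal sketch centers a small made-up table and confirms that every column mean becomes zero. The DataFrame `toy` and its column names are hypothetical and are not part of the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 samples, 2 features (not the lab's dataset).
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 6.0],
                    "b": [10.0, 20.0, 30.0, 40.0]})

# toy.mean() is a Series indexed by column name; the subtraction
# aligns on those labels, so each column loses its own mean.
toy_centered = toy - toy.mean()

# After centering, every column mean should be zero (up to rounding).
column_means = toy_centered.mean().to_numpy()
print(column_means)
```

The same check, `data_centered.mean()`, can be run on the simulated data after centering it.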

### Step 2. Computing the Covariance Matrix

Remember that the covariance matrix is $D = X_{0}^{\intercal}X_{0}/N$, where $X_{0}$ is the centered data matrix and $N$ is the number of samples.

In [None]:
D = np.dot(data_centered.transpose(),data_centered)/200
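
One way to check a hand-computed covariance matrix is to compare it with `np.cov`. The sketch below does this on randomly generated data (an assumption made for illustration, not the lab's file). Note that `np.cov` divides by $N-1$ unless `bias=True` is passed, and treats rows as variables unless `rowvar=False` is passed.

```python
import numpy as np

# Hypothetical random data standing in for a data matrix: 200 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Center the columns, then form X0^T X0 / N by hand.
X0 = X - X.mean(axis=0)
N = X0.shape[0]
D_manual = X0.T @ X0 / N

# np.cov with rowvar=False (columns are features) and bias=True (divide by N).
D_numpy = np.cov(X, rowvar=False, bias=True)

matches = np.allclose(D_manual, D_numpy)
print(matches)
```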

### Step 3.  Finding the eigenvalues and eigenvectors of D

The command `np.linalg.eigh` returns a pair consisting of the vector of eigenvalues and the matrix of eigenvectors.
By default, the eigenvalues are returned in increasing order, but we like them in decreasing order, so we reverse the list. 

In [None]:
L, P = np.linalg.eigh(D)
L = L[::-1]
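
To see the ordering convention concretely, here is a small sketch on a hypothetical $2\times 2$ symmetric matrix. Unlike the cell above, it reverses the columns of the eigenvector matrix along with the eigenvalues, so that column $j$ still matches eigenvalue $j$; that is one possible convention, not the one used in the rest of this section.

```python
import numpy as np

# Hypothetical symmetric matrix with eigenvalues 1 and 3.
D_toy = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

# eigh returns eigenvalues in ascending order; eigenvectors are the columns.
evals, evecs = np.linalg.eigh(D_toy)

# Reverse both so the largest eigenvalue and its eigenvector come first.
evals_desc = evals[::-1]
evecs_desc = evecs[:, ::-1]

# Check D v = lambda v column by column (broadcasting scales column j by evals_desc[j]).
residual = D_toy @ evecs_desc - evecs_desc * evals_desc
print(evals_desc)
```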

We can plot the eigenvalues.

In [None]:
eigenvalue_plot = figure(title='Eigenvalues of Covariance Matrix')
eigenvalue_plot.scatter(x=range(L.shape[0]),y=L,size=8)
eigenvalue_plot.line(x=range(L.shape[0]),y=L,color='red')
show(eigenvalue_plot)

### Step 4. Projecting the data into the first two principal components

The columns of the matrix `P` are the eigenvectors, but they are ordered like the eigenvalues (from smallest to largest).
So the two most significant principal components are the *last two* columns of the matrix, and we need to reverse their order.

In [None]:
# take the last two columns of P in reverse order, so the largest eigenvalue comes first
PC2 = np.dot(data_centered,P[:,:-3:-1])

In [None]:
scatter_plot = figure(title='Plot of First Two Principal Components',x_range=(-3,3),y_range=(-3,3))
scatter_plot.scatter(x=PC2[:,0],y=PC2[:,1])
show(scatter_plot)
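
A useful check on the projection is that the covariance matrix of the projected coordinates is diagonal, with the two largest eigenvalues on the diagonal. The sketch below verifies this on randomly generated data (hypothetical, not the lab's file); the slice `[:, :-3:-1]` selects the last column and then the second-to-last.

```python
import numpy as np

# Hypothetical data: 200 samples, 4 features with different spreads.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X0 = X - X.mean(axis=0)
D = X0.T @ X0 / X0.shape[0]

# Eigenvalues ascend, so the principal directions are the last columns.
evals, evecs = np.linalg.eigh(D)
top2 = evecs[:, :-3:-1]          # last column first, then second-to-last

# Project the centered data and compute the covariance of the scores.
scores = X0 @ top2
score_cov = scores.T @ scores / scores.shape[0]

# The scores are uncorrelated and their variances are the top two eigenvalues.
expected = np.diag(evals[:-3:-1])
print(np.allclose(score_cov, expected))
```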

### Step 5. Drawing the loading directions

To project the axis of the $i^{th}$ feature into the space spanned by the two principal eigenvectors, we draw a line in the direction of the
vector we obtain by projecting the row vector $[0,\ldots, 1,\ldots, 0]$, where the $1$ is in position $i$, into that space.  But multiplying
that vector by the matrix $P$ just picks out the $i^{th}$ row of $P$, so we want to draw a line in the direction of the point given by the entries of the $i^{th}$
row of $P$ in the two principal columns.  For example, since the two principal eigenvectors are the last two columns of `P`, the $0^{th}$ feature is in the direction of $(P[0,-1],P[0,-2])$.

In [None]:
# the loading direction of feature 0 is row 0 of P, read off in the two principal (last two) columns
scatter_plot.line(x=[-100*P[0,-1],100*P[0,-1]],y=[-100*P[0,-2],100*P[0,-2]],color='green')
scatter_plot.title.text = 'Plot of First Two Principal Components with Feature 0 Axis'
show(scatter_plot)
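
The claim that projecting a coordinate axis just picks out a row of the eigenvector matrix can be checked directly. This is a minimal sketch on randomly generated data (hypothetical, not the lab's file); `top2` holds the two principal eigenvectors as columns.

```python
import numpy as np

# Hypothetical data: 100 samples, 3 features.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X0 = X - X.mean(axis=0)
D = X0.T @ X0 / X0.shape[0]

# The two principal eigenvectors, largest eigenvalue first.
_, evecs = np.linalg.eigh(D)
top2 = evecs[:, :-3:-1]

# Unit vector along feature i, projected into the principal plane.
i = 0
e_i = np.zeros(3)
e_i[i] = 1.0
loading_i = e_i @ top2

# The projection is exactly row i of the eigenvector matrix.
same = np.allclose(loading_i, top2[i])
print(same)
```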

## PCA for Auto Data

Let's look at what PCA can tell us about the auto data.  

In [None]:
# we load the data file, and drop the rows with ? for the horsepower
data = pd.read_csv('../../data/auto-mpg/auto-mpg.csv',na_values='?')
data = data.dropna()

In the section on
linear regression we explored the relationship between the gas mileage and various other properties of each 
car model.  We'll continue that analysis from the perspective of principal component analysis in this lab, focusing in particular on:

- mileage (mpg)
- vehicle weight (weight)
- acceleration 
- horsepower
- displacement

Because these features are measured on very different scales, we will not only center the data but also rescale it, to make it easier to work with.
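
To see what rescaling does, the sketch below standardizes a small made-up table whose columns differ in scale by several orders of magnitude. The DataFrame `toy` and its column names are hypothetical. It uses the population standard deviation (`ddof=0`) as an assumption, so that dividing by $N$ in the covariance gives diagonal entries of exactly 1; with that choice the covariance matrix of the standardized data is the correlation matrix of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with very different scales per column.
toy = pd.DataFrame({"grams": [1000.0, 2000.0, 3000.0, 4000.0],
                    "meters": [0.4, 0.1, 0.3, 0.2]})

# Center, then divide each column by its population standard deviation.
centered = toy - toy.mean()
standardized = centered / toy.std(ddof=0)

# Covariance of the standardized data, dividing by N.
Z = standardized.to_numpy()
cov_standardized = Z.T @ Z / Z.shape[0]

# It agrees with the correlation matrix of the original columns.
corr = np.corrcoef(toy.to_numpy(), rowvar=False)
print(np.allclose(cov_standardized, corr))
```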

Display the data so you can see what it looks like.

In [None]:
display(data)


Since we're only interested in mileage, weight, acceleration, horsepower, and displacement in this lab, let's just keep those features.

In [None]:
# fill in the []
data = data[]

Next, we'll create a density plot that shows how the different features are related.  See Figure 9 in the notes for a similar
plot of the simulated data that we considered there.



In [None]:
density_plot = gridmatrix(hv.Dataset(data),chart_type=hv.Points)
density_plot

We see from this that, for example, increasing horsepower means a lower value of the acceleration feature.  Acceleration is measured as the time to reach 60mph,
so low numbers correspond to more acceleration.  On the other hand, weight and acceleration, while also somewhat correlated,
are less strongly so than weight and engine displacement.

Remember the steps:

1. Center the data (and rescale it)
2. Find the covariance matrix
3. Compute its eigenvalues and eigenvectors and plot the eigenvalues
4. Select the two largest eigenvalues and corresponding eigenvectors
5. Draw a scatter plot of the data projected into the span of these two principal directions
6. Draw the loadings.

In [None]:
# Step 1: center the data and rescale it
data_centered = # subtract the mean from each column
data_centered = # scale to standard deviation 1

In [None]:
# Step 2: Find the covariance matrix.  Hint: data.shape[0] is the number of samples
D = #

In [None]:
# Step 3: Find the eigenvalues and eigenvectors and plot them
L, P = #
L = #
P = #

eigenvalue_plot = figure(title='Eigenvalues')
eigenvalue_plot.scatter()
eigenvalue_plot.line()
show(eigenvalue_plot)

In [None]:
# Step 4: Project and plot the data
PC2 = #
scatter_plot = figure(title='Principal Components')
scatter_plot.scatter()
show(scatter_plot)

In [None]:
# Step 5: add the loading directions
names = data.columns
for i in range(5):
    scatter_plot.line(x=#,
                      y=#,
                      color=Category10[5][i],line_width=3,legend_label=names[i])
scatter_plot.title.text = 'Principal Components with Loadings'
show(scatter_plot)

In looking at the figure above, notice that weight and mileage are almost perfect opposites -- so there is an unavoidable tradeoff with higher weight vehicles having lower mileage.  Moving to the lower right of the graph, you have better acceleration and also higher horsepower.  Horsepower and displacement point in roughly the same direction, though not perfectly.

So:

- the bottom left quadrant contains small, high-mileage cars with better acceleration
- the bottom right quadrant contains smaller, high-horsepower, fast cars
- the upper right quadrant contains big, heavy, low-mileage cars that are still powerful and relatively fast
- the upper left quadrant contains low-acceleration, lower-horsepower cars.

