# Principal Components Lab

In this lab we will apply Principal Components Analysis to the Auto-MPG dataset that we studied in the chapter on linear regression.  Before diving into the real data, we will work with the simulated data from the notes to show how to use Python and numpy to calculate the information we need.

In [None]:
import numpy as np
import pandas as pd
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool
import seaborn as sns
from bokeh.palettes import Spectral10, Category10
output_notebook()

## Simulated data for demo purposes

First we load the data matrix.

In [None]:
data = np.genfromtxt('simulated_pca_data.csv',delimiter=',')
# the last column holds the group label; split it off from the features
secret_label = data[:,-1]
data = data[:,:-1]
data.shape

The data consists of 200 samples, with 15 features per sample. It was constructed out of 4 different groups with different characteristics.  These groups are labeled, but we will ignore the labels for the moment. To carry out principal component analysis, we must:

1. center the data
2. compute the covariance matrix D
3. find the eigenvectors and eigenvalues of D

Then we can:

4. project the data into the space spanned by the first two eigenvectors of the covariance matrix and plot this
5. draw the loading axes on the plot.


### Step 1. Centering the data

To center the data, we subtract the mean of each column from that column.  The function `np.mean`, called with `axis=0`, computes the mean of each column of the data:

In [None]:
np.mean(data,axis=0)

In [None]:
data_centered = data - np.mean(data,axis=0)

Subtracting this from the data centers it -- numpy broadcasts the row of column means across the rows of the data, so each column's mean is subtracted from every entry in that column.
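
As a small illustration of the centering step, here is a sketch on a made-up 3-by-2 matrix (the array `X` and its values are invented for this example and are not part of the lab data). It shows that the row of column means is subtracted from every row, and that the centered columns have mean zero.

```python
import numpy as np

# Toy data matrix: 3 samples (rows), 2 features (columns). Values are made up.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

col_means = np.mean(X, axis=0)   # shape (2,): one mean per column
X0 = X - col_means               # the row of means is subtracted from every row

print(col_means)
print(X0)
print(np.mean(X0, axis=0))       # each column of the centered matrix has mean zero
```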

Here we use a couple of libraries we haven't discussed (seaborn and pandas) just to generate a grid of pairwise scatter plots for discussion.

In [None]:
sns.pairplot(pd.DataFrame(data_centered))

### Step 2. Computing the Covariance Matrix

Remember that the covariance matrix is $D = X_{0}^{\intercal}X_{0}/N$, where $X_{0}$ is the centered data matrix and $N$ is the number of samples.

In [None]:
D = np.dot(data_centered.transpose(),data_centered)/200
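
As a sanity check on this formula, the following sketch compares the hand-computed covariance matrix with `np.cov` on randomly generated data. The random matrix `X` is a hypothetical stand-in for the lab data; `bias=True` asks `np.cov` to divide by $N$ rather than $N-1$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # hypothetical stand-in for the data matrix
X0 = X - X.mean(axis=0)                # centered data
N = X0.shape[0]

D_manual = X0.T @ X0 / N
# rowvar=False: columns are variables; bias=True: divide by N instead of N-1
D_numpy = np.cov(X, rowvar=False, bias=True)

print(np.allclose(D_manual, D_numpy))
```

Taking $N$ from `X0.shape[0]` rather than hard-coding it keeps the computation valid if the number of samples changes.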

### Correlation Coefficients

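Remember that the squared correlation coefficient $R^2$ of two variables is
$$
R^{2}_{XY}=\frac{\sigma_{XY}^2}{\sigma_{X}^2\sigma_{Y}^2}
$$
where $\sigma_{XY}$ is the covariance of $X$ and $Y$ and $\sigma_{X}^2$, $\sigma_{Y}^2$ are their variances.

We can extract this information from the covariance matrix. The function below returns the signed correlation $r_{ij}=D_{ij}/\sqrt{D_{ii}D_{jj}}$; its square is $R^2$.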

In [None]:
def r(i,j):
    return D[i,j]/np.sqrt(D[i,i]*D[j,j])

In [None]:
for i in range(15):
    print(i, r(0,i))
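
To see that this ratio of covariance entries agrees with the usual correlation coefficient, here is a sketch on two made-up variables, compared against `np.corrcoef`. The data are random and only illustrative; the function `r` is redefined locally so the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)      # y is correlated with x by construction
X = np.column_stack([x, y])

X0 = X - X.mean(axis=0)
D = X0.T @ X0 / X0.shape[0]             # covariance matrix of the toy data

def r(i, j):
    # signed correlation of features i and j, read off the covariance matrix
    return D[i, j] / np.sqrt(D[i, i] * D[j, j])

R = np.corrcoef(X, rowvar=False)        # numpy's correlation matrix
print(r(0, 1), R[0, 1])                 # the two values should agree
print(r(0, 1) ** 2)                     # the squared correlation
```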

### Step 3.  Finding the eigenvalues and eigenvectors of D

The command `np.linalg.eigh` returns a pair consisting of the vector of eigenvalues and the matrix of eigenvectors.
By default, the eigenvalues are returned in increasing order, but we like them in decreasing order, so we reverse the list. 

In [None]:
L, P = np.linalg.eigh(D)
L = L[::-1]
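
The ordering convention of `np.linalg.eigh` can be seen on a tiny symmetric matrix. This is only a sketch: the 2-by-2 matrix is made up, and the names `L_desc` and `P_desc` are introduced here for illustration. It reverses the eigenvalues and the eigenvector columns together, so that column `k` of `P_desc` still belongs to entry `k` of `L_desc`.

```python
import numpy as np

# A small symmetric matrix standing in for a covariance matrix (made up).
D = np.array([[2.0, 1.0],
              [1.0, 2.0]])

L, P = np.linalg.eigh(D)      # eigenvalues come back in ascending order
L_desc = L[::-1]              # largest eigenvalue first
P_desc = P[:, ::-1]           # reverse the columns so they stay paired with L_desc

# Check the defining relation D v = lambda v for each pair.
for k in range(2):
    assert np.allclose(D @ P_desc[:, k], L_desc[k] * P_desc[:, k])

print(L)
print(L_desc)
```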

We can plot the eigenvalues.

In [None]:
eigenvalue_plot = figure(title='Eigenvalues of Covariance Matrix')
eigenvalue_plot.scatter(x=range(L.shape[0]),y=L,size=8)
eigenvalue_plot.line(x=range(L.shape[0]),y=L,color='red')
show(eigenvalue_plot)
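
One common way to read an eigenvalue plot is to ask what fraction of the total variance each eigenvalue accounts for, since the eigenvalues of the covariance matrix add up to the total variance of the data. The sketch below uses made-up eigenvalues, not the ones from the simulated data.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in decreasing order.
L = np.array([4.0, 3.0, 2.0, 1.0])

explained = L / L.sum()            # fraction of total variance per component
cumulative = np.cumsum(explained)  # fraction captured by the first k components

print(explained)
print(cumulative)
```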

### Step 4. Projecting the data into the first two principal components

The columns of the matrix `P` are the eigenvectors, but they are ordered like the eigenvalues (from smallest to largest).
So the two most significant principal components are the *last two* columns of the matrix, and we need to reverse their order.

In [None]:
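# select the last two columns of P in reverse order: largest eigenvalue first, then second largest
PC2 = np.dot(data_centered,P[:,[-1,-2]])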
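
Picking out "the last two columns, reversed" is easy to get wrong with slice syntax, so here is a sketch of how a few candidate selections behave. The toy matrix `M` is made up: every entry in column `k` equals `k`, so the printed values show which columns were picked.

```python
import numpy as np

# Toy matrix with 5 columns: every entry in column k equals k.
M = np.tile(np.arange(5), (3, 1))

print(M[0, [-1, -2]])     # [4 3]      last two columns, last one first
print(M[0, :-3:-1])       # [4 3]      same selection written as a slice
print(M[0, -2::-1])       # [3 2 1 0]  starts at the second-to-last column and runs to the first

print(M[:, [-1, -2]].shape)   # (3, 2)
```

The explicit index list `[-1, -2]` is arguably the most readable of these, since it names exactly the columns wanted.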

In [None]:

scatter_plot = figure(title='Plot of First Two Principal Components',x_range=(-3,3),y_range=(-3,3))
scatter_plot.scatter(x=PC2[:,0],y=PC2[:,1])
show(scatter_plot)

In [None]:
colors=['red','green','blue','orange','black']
color_list = [colors[int(secret_label[i])] for i in range(200)]
scatter_plot = figure(title='Plot of First Two Principal Components with secret labels',x_range=(-3,3),y_range=(-3,3))
scatter_plot.scatter(x=PC2[:,0],y=PC2[:,1],color=color_list)
show(scatter_plot)

### Step 5. Draw the loading directions

To project the axis of the $i^{th}$ feature into the space spanned by the two principal eigenvectors, we take the row vector $[0,\ldots, 1,\ldots, 0]$, with the $1$ in position $i$, and multiply it by the matrix $P$.  That product just picks out the $i^{th}$ row of $P$, so we want to draw a line through the origin in the direction of the point given by the entries of the $i^{th}$ row of $P$ in the two principal columns.  Since the eigenvectors are still in increasing order here, those are the last two columns: for example, the $0^{th}$ feature is in the direction of $(P[0,-1],P[0,-2])$.

In [None]:
for i in range(P.shape[0]):
    # row i of P, restricted to the two leading eigenvectors (the last two columns of P)
    lx, ly = P[i,-1], P[i,-2]
    scatter_plot.line(x=[-100*lx,100*lx],y=[-100*ly,100*ly],color='gray',line_dash='dashed')
scatter_plot.title.text = 'Plot of First Two Principal Components with Feature Loadings'
show(scatter_plot)
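
The claim that multiplying a unit row vector by $P$ picks out a row of $P$ can be checked directly. In this sketch the symmetric matrix is random and purely illustrative, and `top2` is a name introduced here for the two leading eigenvector columns.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
_, P = np.linalg.eigh(A + A.T)     # columns of P: orthonormal eigenvectors of a symmetric matrix

i = 2
e_i = np.zeros(4)
e_i[i] = 1.0                       # unit vector along feature axis i

# Multiplying by P picks out row i of P.
print(np.allclose(e_i @ P, P[i, :]))

# Restricting to the two leading eigenvectors (the last two columns, since
# eigh returns them in increasing order) gives the 2D loading direction.
top2 = P[:, [-1, -2]]
print(e_i @ top2, top2[i, :])
```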

## PCA for Auto Data

Let's look at what PCA can tell us about the auto data.  

In [None]:
# we load the data file; non-numeric entries, such as the ? marking a missing horsepower, are read as nan
data = np.genfromtxt('auto-mpg.csv',delimiter=',',skip_header=1)
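
To see how `np.genfromtxt` treats entries it cannot parse as numbers, here is a sketch using an in-memory file. The three-column text `csv_text` is a made-up miniature and not the real `auto-mpg.csv`.

```python
import io
import numpy as np

# Hypothetical three-column file with a header row and one '?' entry.
csv_text = "mpg,horsepower,weight\n18.0,130,3504\n25.0,?,2046\n"

toy = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

print(toy)             # the '?' entry shows up as nan
print(np.isnan(toy))   # boolean mask marking the missing entry
```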


In the section on linear regression we explored the relationship between the gas mileage and various other properties of each car model.  We'll continue that analysis from the perspective of principal component analysis in this lab, focusing in particular on:

- mileage (mpg) (column 0)
- vehicle weight (column 4)
- acceleration (column 5) -- note that big numbers mean poor acceleration!
- horsepower  (column 3)
- displacement (column 2)

Display the data so you can see what it looks like.

Since we're only interested in mileage, displacement, horsepower, weight, and acceleration in this lab, let's just keep those features.

In [None]:
data = data[:,[0,2,3,4,5]]
data.shape

In [None]:
data[:4,:]

This array has some missing data in it, indicated by `nan` (Not a Number) entries.  Here's how we find the bad spots.



In [None]:
np.argwhere(np.isnan(data))

This expression returns True for each row of the matrix (axis=1) in which any element is nan. We look at its first twenty entries.

In [None]:
np.isnan(data).any(axis=1)[:20]

`np.argwhere` returns the indices of the entries that are True:

In [None]:
np.argwhere(np.isnan(data).any(axis=1))

In [None]:
good_rows = ~np.isnan(data).any(axis=1)

In [None]:
data = data[good_rows,:]
data.shape
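
The whole filtering pattern can be seen at a glance on a tiny array. This is a sketch with made-up values; `toy`, `bad_rows` and `clean` are names used only here.

```python
import numpy as np

# Made-up 3x2 array with one missing entry.
toy = np.array([[1.0, 2.0],
                [np.nan, 4.0],
                [5.0, 6.0]])

bad_rows = np.isnan(toy).any(axis=1)   # True for rows containing at least one nan
clean = toy[~bad_rows, :]              # keep only the rows without nan

print(bad_rows)
print(np.argwhere(bad_rows))           # indices of the bad rows
print(clean)
```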

Now let's go ahead and do PCA on this data.

Remember the steps:

1. Center the data (and rescale it)
2. Find the covariance matrix
3. Compute its eigenvalues and eigenvectors and plot the eigenvalues
4. Select the two largest eigenvalues and corresponding eigenvectors
5. Draw a scatter plot of the data projected into the span of these two principal directions
6. Draw the loadings.

In [None]:
# Step 1: center the data and rescale it
data_centered = data - np.mean(data,axis=0)
data_centered = data_centered/np.std(data,axis=0)

In [None]:
sns.pairplot(pd.DataFrame(data_centered))

In [None]:
# Step 2: Find the covariance matrix.  Hint: data.shape[0] is the number of samples
D = np.dot(data_centered.transpose(),data_centered)/data.shape[0]

In [None]:
D
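
Because the data were rescaled to unit standard deviation, the matrix `D` here should have ones on its diagonal and coincide with the correlation matrix of the original features. The following sketch checks this on made-up data whose columns have very different scales; the names `X` and `Z` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical features on very different scales.
X = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 1000.0])

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # center and rescale each column
D = Z.T @ Z / Z.shape[0]                   # covariance matrix of the standardized data

print(np.diag(D))                                    # every feature has unit variance
print(np.allclose(D, np.corrcoef(X, rowvar=False)))  # D is the correlation matrix
```

Rescaling matters for the auto data because weight is measured in thousands while mileage is in tens; without it, the feature with the largest numerical scale would dominate the principal components.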

In [None]:
# Step 3: Find the eigenvalues and eigenvectors and plot the eigenvalues
L, P = np.linalg.eigh(D)
L = L[::-1]
P = P[:,::-1]

eigenvalue_plot = figure(title='Eigenvalues')
eigenvalue_plot.scatter(x=range(L.shape[0]),y=L,size=8)
eigenvalue_plot.line(x=range(L.shape[0]),y=L,color='green')
show(eigenvalue_plot)

In [None]:
# Steps 4 and 5: project onto the two leading eigenvectors and plot the data
PC2 = np.dot(data_centered, P[:,:2])
scatter_plot = figure(title='Principal Components')

scatter_plot.scatter(x=PC2[:,0],y=PC2[:,1])
show(scatter_plot)

In [None]:
# Step 6: add the loading directions
names = ['mpg','displacement','hp','weight','accel']
for i in range(5):
    scatter_plot.line(x=[0,P[i,0]],y=[0,P[i,1]],color=Category10[5][i],line_width=3,legend_label=names[i])
scatter_plot.title.text = 'Principal Components with Loadings'
show(scatter_plot)

Looking at the figure above, notice that weight and mileage are almost perfect opposites -- so there is an unavoidable tradeoff, with heavier vehicles having lower mileage.  Moving to the lower right of the graph, you have better acceleration and also higher horsepower.  Horsepower and displacement point in roughly the same direction, though not perfectly.

So:

- cars in the bottom left quadrant are bigger, relatively high mileage cars with poor acceleration
- cars in the bottom right quadrant are bigger, high-horsepower, slow cars
- cars in the upper right quadrant are bigger, faster, relatively high mileage cars
- cars in the upper left quadrant are smaller, faster, relatively high mileage cars.



In [None]:
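# the car name is the last field of each line of the csv file
names=[]
with open('auto-mpg.csv') as f:
    for line in f:
        fields = line.rstrip().split(',')
        names.append(fields[-1])
names = names[1:]  # drop the header line
# keep only the names of the rows that survived the nan filter
names = [names[i] for i in range(len(names)) if good_rows[i]]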


In [None]:
source=ColumnDataSource({'pc0':PC2[:,0],'pc1':PC2[:,1],'type':names,'mpg':data[:,0],'disp':data[:,1],'hp':data[:,2],'wt':data[:,3],'accel':data[:,4]})

In [None]:
scatter_plot=figure(title='PCA plot with labels (hover to see car type)')
scatter_plot.scatter(x='pc0',y='pc1',source=source)
scatter_plot.add_tools(HoverTool(tooltips=[("type","@type"),('mpg','@mpg'),('disp','@disp'),('hp','@hp'),('wt','@wt'),('accel','@accel')]))
loadings = ['mpg','displacement','hp','weight','accel']
for i in range(5):
    scatter_plot.line(x=[0,P[i,0]],y=[0,P[i,1]],color=Category10[5][i],line_width=3,legend_label=loadings[i])
show(scatter_plot)

## Using sklearn

In [None]:
from sklearn.decomposition import PCA

In [None]:
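pca = PCA(n_components=2)  # named pca so that it does not overwrite the eigenvector matrix P

In [None]:
PC = pca.fit_transform(data_centered)

In [None]:
PC.shape

In [None]:
scatter_plot=figure(title='sklearn version')
scatter_plot.scatter(x=PC[:,0],y=PC[:,1])
show(scatter_plot)

The loadings are in the `components_` attribute, an array with one row per principal component and one column per feature (each column is the projection of the corresponding feature into the space spanned by the principal components).  The signs may differ from those of the eigenvectors computed above, since an eigenvector is only determined up to sign.

In [None]:
pca.components_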
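
To connect the sklearn output with the hand computation, here is a sketch that runs both routes on made-up data and compares them up to sign. The random matrix `X` and the names `V`, `V2`, `pca` and `scores` are introduced only for this example; the column scales are chosen so that the two leading eigenvalues are well separated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Hypothetical data: 4 features with clearly different variances.
X = rng.normal(size=(300, 4)) * np.array([5.0, 3.0, 1.0, 0.5])
X0 = X - X.mean(axis=0)

# Manual route: eigenvectors of the covariance matrix, largest eigenvalues first.
D = X0.T @ X0 / X0.shape[0]
L, V = np.linalg.eigh(D)
V2 = V[:, ::-1][:, :2]

# sklearn route.
pca = PCA(n_components=2)
scores = pca.fit_transform(X0)

# components_ has shape (2, 4): one row per component, one column per feature.
print(pca.components_.shape)
# Compare absolute values, because each eigenvector is only defined up to sign.
print(np.allclose(np.abs(pca.components_.T), np.abs(V2), atol=1e-6))
print(np.allclose(np.abs(scores), np.abs(X0 @ V2), atol=1e-6))
```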