# Performing Principal Component Analysis (PCA) - Lab

## Introduction

Now that you have a high-level overview of PCA, as well as some of the details of the algorithm itself, it's time to practice implementing PCA on your own using the NumPy package. 

## Objectives

You will be able to:
    
* Implement PCA from scratch using NumPy

## Import the data

- Import the data stored in the file `'foodusa.csv'` (set `index_col=0`)
- Transpose the DataFrame so that each row is a food item and each column is a city
- Print the first five rows of the DataFrame

In [38]:
import pandas as pd

# Load the prices and transpose: rows become food items, columns become cities
data = pd.read_csv('./foodusa.csv', index_col=0).T
data.head()


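| City | ATLANTA | BALTIMORE | BOSTON | BUFFALO | CHICAGO | CINCINNATI | CLEVELAND | DALLAS | DETROIT | HONALULU | HOUSTON | KANSAS CITY | LOS ANGELES | MILWAUKEE | MINNEAPOLIS | NEW YORK | PHILADELPHIA | PITTSBURGH | ST LOUIS | SAN DIEGO | SAN FRANCISCO | SEATTLE | WASHINGTON DC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bread | 24.5 | 26.5 | 29.7 | 22.8 | 26.7 | 25.3 | 22.8 | 23.3 | 24.1 | 29.3 | 22.3 | 26.1 | 26.9 | 20.3 | 24.6 | 30.8 | 24.5 | 26.2 | 26.5 | 25.5 | 26.3 | 22.5 | 24.2 |
| Burger | 94.5 | 91.0 | 100.8 | 86.6 | 86.7 | 102.5 | 88.8 | 85.5 | 93.7 | 105.9 | 83.6 | 88.9 | 89.3 | 89.6 | 92.2 | 110.7 | 92.3 | 95.4 | 92.4 | 83.7 | 87.1 | 77.7 | 93.8 |
| Milk | 73.9 | 67.5 | 61.4 | 65.3 | 62.7 | 63.3 | 52.4 | 62.5 | 51.5 | 80.2 | 67.8 | 65.4 | 56.2 | 53.8 | 51.9 | 66.0 | 66.7 | 60.2 | 60.8 | 57.0 | 58.3 | 62.0 | 66.0 |
| Oranges | 80.1 | 74.6 | 104.0 | 118.4 | 105.9 | 99.3 | 110.9 | 117.9 | 109.7 | 133.2 | 108.6 | 100.9 | 82.7 | 111.8 | 106.0 | 107.3 | 98.0 | 117.1 | 115.1 | 92.8 | 101.8 | 91.1 | 81.6 |
| Tomatoes | 41.6 | 53.3 | 59.6 | 51.2 | 51.2 | 45.6 | 46.8 | 41.8 | 52.4 | 61.7 | 42.4 | 43.2 | 38.4 | 53.9 | 50.7 | 62.6 | 61.7 | 49.3 | 46.2 | 35.4 | 41.5 | 44.9 | 46.2 |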


## Normalize the data

Next, center your data by subtracting the mean from each of the columns. Because the DataFrame was transposed on import, each column holds the prices for one city.

In [39]:
def center(column):
    """Return the column with its mean subtracted."""
    return column - column.mean()

In [40]:
# Center every column (city) in place
for column in data.columns:
    data[column] = center(data[column])
data.head()

| City | ATLANTA | BALTIMORE | BOSTON | BUFFALO | CHICAGO | CINCINNATI | CLEVELAND | DALLAS | DETROIT | HONALULU | HOUSTON | KANSAS CITY | LOS ANGELES | MILWAUKEE | MINNEAPOLIS | NEW YORK | PHILADELPHIA | PITTSBURGH | ST LOUIS | SAN DIEGO | SAN FRANCISCO | SEATTLE | WASHINGTON DC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bread | -38.42 | -36.08 | -41.4 | -46.06 | -39.94 | -41.9 | -41.54 | -42.9 | -42.18 | -52.76 | -42.64 | -38.8 | -31.8 | -45.58 | -40.48 | -44.68 | -44.14 | -43.44 | -41.7 | -33.38 | -36.7 | -37.14 | -38.16 |
| Burger | 31.58 | 28.42 | 29.7 | 17.74 | 20.06 | 35.3 | 24.46 | 19.3 | 27.42 | 23.84 | 18.66 | 24.0 | 30.6 | 23.72 | 27.12 | 35.22 | 23.66 | 25.76 | 24.2 | 24.82 | 24.1 | 18.06 | 31.44 |
| Milk | 10.98 | 4.92 | -9.7 | -3.56 | -3.94 | -3.9 | -11.94 | -3.7 | -14.78 | -1.86 | 2.86 | 0.5 | -2.5 | -12.08 | -13.18 | -9.48 | -1.94 | -9.44 | -7.4 | -1.88 | -4.7 | 2.36 | 3.64 |
| Oranges | 17.18 | 12.02 | 32.9 | 49.54 | 39.26 | 32.1 | 46.56 | 51.7 | 43.42 | 51.14 | 43.66 | 36.0 | 24.0 | 45.92 | 40.92 | 31.82 | 29.36 | 47.46 | 46.9 | 33.92 | 38.8 | 31.46 | 19.24 |
| Tomatoes | -21.32 | -9.28 | -11.5 | -17.66 | -15.44 | -21.6 | -17.54 | -24.4 | -13.88 | -20.36 | -22.54 | -21.7 | -20.3 | -11.98 | -14.38 | -12.88 | -6.94 | -20.34 | -22.0 | -23.48 | -21.5 | -14.74 | -16.16 |
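
The axis you center along matters. In the conventional PCA setup, observations sit in rows and features in columns, and each feature is centered across the observations. The following is a minimal sketch of that convention, assuming a small made-up array `X` (its values are illustrative and do not come from `foodusa.csv`):

```python
import numpy as np

# Hypothetical toy data: 4 observations (rows) x 3 features (columns)
X = np.array([[2.0, 10.0, 1.0],
              [4.0, 14.0, 3.0],
              [6.0, 18.0, 5.0],
              [8.0, 22.0, 7.0]])

# Subtract each column's mean; broadcasting applies it to every row
X_centered = X - X.mean(axis=0)

# After centering, every feature averages to zero
col_means = X_centered.mean(axis=0)
print(col_means)
```

If your features are stored in rows instead, the same idea applies with `axis=1` and `keepdims=True`.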


## Calculate the covariance matrix

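The next step is to calculate the covariance matrix for your centered data. `np.cov()` treats each row as a variable, so passing the 5 x 23 DataFrame returns the 5 x 5 covariance matrix of the food items.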

In [41]:
import numpy as np

In [42]:
cov_mat = np.cov(data)
cov_mat

array([[ 20.22148458,   4.25353202,   9.41788775, -31.09978814,
         -2.79311621],
       [  4.25353202,  25.82522372,  -1.38808854, -32.31274071,
          3.62207352],
       [  9.41788775,  -1.38808854,  41.76654387, -42.92338498,
         -6.8729581 ],
       [-31.09978814, -32.31274071, -42.92338498, 123.99893913,
        -17.6630253 ],
       [ -2.79311621,   3.62207352,  -6.8729581 , -17.6630253 ,
         23.70702609]])
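
It is easy to get the orientation wrong with `np.cov()`, because by default it expects variables in rows. As a rough illustration, the sketch below uses a made-up random array `X` (an assumption, not the lab data) with observations in rows, and checks `np.cov()` against the covariance formula computed by hand:

```python
import numpy as np

# Hypothetical data: 6 observations (rows) x 3 features (columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))

# rowvar=False tells np.cov that each column is a variable
cov_np = np.cov(X, rowvar=False)

# The same quantity by hand: center, then X_c^T X_c / (n - 1)
X_c = X - X.mean(axis=0)
cov_manual = X_c.T @ X_c / (X.shape[0] - 1)

print(np.allclose(cov_np, cov_manual))
```

Passing the transposed array, `np.cov(X.T)`, gives the same 3 x 3 matrix, which is the orientation this lab relies on.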

## Calculate the eigenvectors

Next, calculate the eigenvectors and eigenvalues for your covariance matrix. 

In [43]:
# Each column eig_vectors[:, i] is the eigenvector paired with eig_values[i]
eig_values, eig_vectors = np.linalg.eig(cov_mat)
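
Covariance matrices are symmetric, so `np.linalg.eigh()` is a possible alternative to `np.linalg.eig()`: it returns real eigenvalues in ascending order. The sketch below uses a small made-up symmetric matrix `A` (an assumption for illustration) and checks the defining property of each eigenpair:

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a covariance matrix
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is specialized for symmetric matrices: eigenvalues come back real
# and in ascending order, with eigenvectors stored as columns
vals, vecs = np.linalg.eigh(A)

# Check the defining property A v = lambda v for each pair
for i in range(len(vals)):
    print(np.allclose(A @ vecs[:, i], vals[i] * vecs[:, i]))
```

Either function stores the eigenvectors as columns, so `vecs[:, i]` (not `vecs[i]`) is the vector that belongs to `vals[i]`.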

## Sort the eigenvectors 

Great! Now that you have the eigenvectors and their associated eigenvalues, sort the eigenvectors by eigenvalue, from largest to smallest, to determine the principal components!

In [44]:
eig_vectors;

In [45]:
# Get the indices that sort the eigenvalues from largest to smallest
e_indices = np.argsort(eig_values)[::-1]

# Reorder the eigenvector columns to match
eigenvectors_sorted = eig_vectors[:, e_indices]
eigenvectors_sorted

array([[ 0.22616666,  0.10836979,  0.34474078, -0.78629411,  0.4472136 ],
       [ 0.22151802, -0.43352679,  0.55712543,  0.50258883,  0.4472136 ],
       [ 0.3344393 ,  0.715219  , -0.24405121,  0.34212738,  0.4472136 ],
       [-0.88201504,  0.13108103,  0.05504807,  0.04285946,  0.4472136 ],
       [ 0.09989106, -0.52114303, -0.71286308, -0.10128157,  0.4472136 ]])
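
Sorting by eigenvalue means computing the order from the eigenvalue array and then applying that same order to the eigenvector columns. A minimal sketch, assuming made-up eigenvalues `vals` and an identity matrix `vecs` as stand-in eigenvectors so the reordering is easy to see:

```python
import numpy as np

# Hypothetical eigenvalues and eigenvectors (columns of the identity)
vals = np.array([2.0, 5.0, 0.5])
vecs = np.eye(3)

# argsort is ascending, so reverse it to go from largest to smallest
order = np.argsort(vals)[::-1]

# Apply the same order to the values and to the columns of vecs
vals_sorted = vals[order]
vecs_sorted = vecs[:, order]

print(order)
print(vals_sorted)
```

Indexing with `vecs[:, order]` moves whole columns, which keeps every eigenvector paired with its eigenvalue.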

## Reprojecting the data

Finally, reproject the dataset down to 2 dimensions using the two eigenvectors with the largest eigenvalues.

In [51]:
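# Keep the two eigenvectors (columns) with the largest eigenvalues
top_two = np.argsort(eig_values)[::-1][:2]
pca = eig_vectors[:, top_two]

# Project the centered data (5 foods x 23 cities) onto those two components
projected = pca.T.dot(data.values)
projected.shape

(2, 23)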
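
To see the whole projection step in one place, here is a rough end-to-end sketch on made-up random data (the array `X` and the names `W` and `Z` are assumptions for illustration, with observations in rows). It also checks a useful property: the projected coordinates are uncorrelated, and their variances equal the two largest eigenvalues.

```python
import numpy as np

# Hypothetical correlated data: 50 observations (rows) x 4 features (columns)
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

# Center each feature, then decompose the covariance matrix
X_c = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(X_c, rowvar=False))

# Keep the two eigenvectors with the largest eigenvalues as columns of W
order = np.argsort(vals)[::-1]
W = vecs[:, order[:2]]

# Project: each observation gets two coordinates
Z = X_c @ W
print(Z.shape)

# Covariance of the projection is diagonal with the top two eigenvalues
print(np.allclose(np.cov(Z, rowvar=False), np.diag(vals[order[:2]])))
```

With features in rows, as in this lab, the equivalent product is `W.T @ data`, which yields one column per observation instead of one row.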

## Summary

Well done! You've now coded PCA on your own using NumPy! With that, it's time to look at further applications of PCA.
