# Performing Principal Component Analysis (PCA) - Lab

## Introduction

Now that you have a high-level overview of PCA, as well as some of the details of the algorithm itself, it's time to practice implementing PCA on your own using the NumPy package. 

## Objectives

You will be able to:
    
* Implement PCA from scratch using NumPy

## Import the data

- Import the data stored in the file `'foodusa.csv'` (set `index_col=0`)
- Print the first five rows of the DataFrame with `data.head()` (the cell below displays the full DataFrame instead)

In [10]:
import pandas as pd
data = pd.read_csv('foodusa.csv', index_col=0)
data

| City | Bread | Burger | Milk | Oranges | Tomatoes |
|---|---|---|---|---|---|
| ATLANTA | 24.5 | 94.5 | 73.9 | 80.1 | 41.6 |
| BALTIMORE | 26.5 | 91.0 | 67.5 | 74.6 | 53.3 |
| BOSTON | 29.7 | 100.8 | 61.4 | 104.0 | 59.6 |
| BUFFALO | 22.8 | 86.6 | 65.3 | 118.4 | 51.2 |
| CHICAGO | 26.7 | 86.7 | 62.7 | 105.9 | 51.2 |
| CINCINNATI | 25.3 | 102.5 | 63.3 | 99.3 | 45.6 |
| CLEVELAND | 22.8 | 88.8 | 52.4 | 110.9 | 46.8 |
| DALLAS | 23.3 | 85.5 | 62.5 | 117.9 | 41.8 |
| DETROIT | 24.1 | 93.7 | 51.5 | 109.7 | 52.4 |
| HONALULU | 29.3 | 105.9 | 80.2 | 133.2 | 61.7 |


## Normalize the data

Next, normalize your data by subtracting the mean from each column and dividing by that column's standard deviation.

In [3]:
data = (data-data.mean())/ data.std()
data.head()

| City | Bread | Burger | Milk | Oranges | Tomatoes |
|---|---|---|---|---|---|
| ATLANTA | -0.315653 | 0.349901 | 1.669632 | -1.60762 | -0.942461 |
| BALTIMORE | 0.482151 | -0.113372 | 0.748801 | -1.993876 | 0.596473 |
| BOSTON | 1.758636 | 1.183792 | -0.128866 | 0.070839 | 1.425129 |
| BUFFALO | -0.993785 | -0.695773 | 0.432265 | 1.082128 | 0.320254 |
| CHICAGO | 0.561931 | -0.682536 | 0.058178 | 0.204273 | 0.320254 |
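
The same standardization can be sketched directly in NumPy. The array below is only an illustrative subset (six cities, three foods) of the prices shown earlier, and the names `X`, `X_std`, `col_means`, and `col_stds` are assumptions made for this example. Note the `ddof=1` argument: pandas' `.std()` uses the sample standard deviation by default, while NumPy's `.std()` does not.

```python
import numpy as np

# Illustrative subset of the price table: 6 cities x 3 foods (Bread, Burger, Milk)
X = np.array([
    [24.5, 94.5, 73.9],
    [26.5, 91.0, 67.5],
    [29.7, 100.8, 61.4],
    [22.8, 86.6, 65.3],
    [26.7, 86.7, 62.7],
    [25.3, 102.5, 63.3],
])

# Subtract each column's mean and divide by its sample standard deviation.
# ddof=1 matches the pandas default used in the cell above.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# After standardizing, every column should have mean 0 and standard deviation 1
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0, ddof=1)
print(np.round(col_means, 10))
print(np.round(col_stds, 10))
```

Because this uses only a subset of the rows, the standardized values will not match the table above; the point is the operation, not the numbers.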


## Calculate the covariance matrix

The next step is to calculate the covariance matrix for your normalized data. 

In [4]:
cov_mat = data.cov()
cov_mat

| | Bread | Burger | Milk | Oranges | Tomatoes |
|---|---|---|---|---|---|
| Bread | 1.0 | 0.6817 | 0.328239 | 0.036709 | 0.382241 |
| Burger | 0.6817 | 1.0 | 0.333422 | 0.210937 | 0.631898 |
| Milk | 0.328239 | 0.333422 | 1.0 | -0.002779 | 0.254417 |
| Oranges | 0.036709 | 0.210937 | -0.002779 | 1.0 | 0.358061 |
| Tomatoes | 0.382241 | 0.631898 | 0.254417 | 0.358061 | 1.0 |
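
The ones on the diagonal are no accident: once each column has unit variance, the covariance matrix of the standardized data is the correlation matrix of the original data. A minimal NumPy sketch of the computation follows, using synthetic random data as a stand-in for the food prices (the data and the names `cov_manual`, `cov_numpy`, and `corr` are assumptions for illustration).

```python
import numpy as np

# Synthetic stand-in for the price table: 10 observations, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))

# Standardize with the sample standard deviation (ddof=1), as pandas does
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance "by hand": the columns are already centered, so this is X^T X / (n - 1)
n = X_std.shape[0]
cov_manual = X_std.T @ X_std / (n - 1)

# The same quantity from NumPy; rowvar=False treats columns as variables
cov_numpy = np.cov(X_std, rowvar=False)

# Correlation matrix of the original, unstandardized data
corr = np.corrcoef(X, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
print(np.allclose(cov_manual, corr))
```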


## Calculate the eigenvectors

Next, calculate the eigenvectors and eigenvalues for your covariance matrix. 

In [5]:
import numpy as np
eig_values, eig_vectors = np.linalg.eig(cov_mat)

In [8]:
eig_values

array([2.42246795, 1.10467489, 0.2407653 , 0.73848053, 0.49361132])
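
Notice that `np.linalg.eig` does not return the eigenvalues in any particular order. Since a covariance matrix is symmetric, `np.linalg.eigh` is an alternative worth knowing: it is designed for symmetric matrices and returns real eigenvalues in ascending order. The sketch below re-enters the covariance matrix from the previous step (rounded to the digits displayed there) and checks the defining property of each eigenpair; the `residual` name is an assumption for this example.

```python
import numpy as np

# Covariance matrix copied from the output above (rounded as displayed)
cov_mat = np.array([
    [1.0, 0.6817, 0.328239, 0.036709, 0.382241],
    [0.6817, 1.0, 0.333422, 0.210937, 0.631898],
    [0.328239, 0.333422, 1.0, -0.002779, 0.254417],
    [0.036709, 0.210937, -0.002779, 1.0, 0.358061],
    [0.382241, 0.631898, 0.254417, 0.358061, 1.0],
])

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
eig_values, eig_vectors = np.linalg.eigh(cov_mat)

# Each column v of eig_vectors should satisfy cov_mat @ v = lambda * v
residual = cov_mat @ eig_vectors - eig_vectors * eig_values

print(eig_values)
print(np.abs(residual).max())
```

The eigenvalues also sum to the trace of the matrix, which is 5 here because each of the five standardized columns contributes a variance of 1.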

## Sort the eigenvectors 

Great! Now that you have the eigenvectors and their associated eigenvalues, sort the eigenvectors by their eigenvalues, from largest to smallest, to determine the principal components!

In [7]:
# Get the indices that sort the eigenvalues in descending order
e_indices = eig_values.argsort()[::-1]

# Reorder the eigenvector columns to match that order
eigenvectors_sorted = eig_vectors[:, e_indices]
eigenvectors_sorted

array([[ 0.49614868,  0.30861972,  0.38639398, -0.50930459,  0.49989887],
       [ 0.57570231,  0.04380176,  0.26247227,  0.02813712, -0.77263501],
       [ 0.33956956,  0.43080905, -0.83463952, -0.0491    , -0.00788224],
       [ 0.22498981, -0.79677694, -0.29160659, -0.47901574,  0.0059668 ],
       [ 0.50643404, -0.28702846,  0.01226602,  0.71270629,  0.39120139]])

In [9]:
# Keep the eigenvalues in the same order as the sorted eigenvectors
eigenvalues_sorted = eig_values[e_indices]
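
The sorted eigenvalues also tell you how much variance each principal component captures. The following sketch reuses the eigenvalues printed earlier and computes the share of total variance per component; `explained_ratio` and `cumulative` are names introduced here for illustration.

```python
import numpy as np

# Eigenvalues copied from the output above, in the order np.linalg.eig returned them
eig_values = np.array([2.42246795, 1.10467489, 0.2407653, 0.73848053, 0.49361132])

# Indices that sort the eigenvalues from largest to smallest
e_indices = eig_values.argsort()[::-1]
eigenvalues_sorted = eig_values[e_indices]

# Fraction of the total variance explained by each component, and the running total
explained_ratio = eigenvalues_sorted / eigenvalues_sorted.sum()
cumulative = np.cumsum(explained_ratio)

print(np.round(explained_ratio, 3))
print(np.round(cumulative, 3))
```

For these eigenvalues, the first two components together account for roughly 70% of the total variance, which is one way to judge whether reducing to 2 dimensions is reasonable.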

## Reprojecting the data

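Finally, reproject the dataset down to 2 dimensions. With your own eigenvectors, this means multiplying the normalized data by the first two columns of `eigenvectors_sorted`. The cell below uses scikit-learn's `PCA` instead. Note that `PCA` centers the data but does not rescale it, and the cell numbers indicate that the data was re-imported (`In [10]`) before this cell ran (`In [12]`), so the scores shown are on the scale of the raw prices rather than the standardized values.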

In [12]:
from sklearn.decomposition import PCA

# Instantiate PCA
pca = PCA(n_components=2)

principalComponents = pca.fit_transform(data)

df_pca = pd.DataFrame(principalComponents,
                      columns=['PC{}'.format(i + 1) for i in range(2)])
df_pca

| | PC1 | PC2 |
|---|---|---|
| 0 | -22.476271 | 10.084571 |
| 1 | -25.325818 | 13.278372 |
| 2 | 5.810981 | 11.389537 |
| 3 | 14.139856 | -5.965021 |
| 4 | 2.426889 | -2.477207 |
| 5 | -2.165798 | 6.663567 |
| 6 | 5.78854 | -10.243124 |
| 7 | 10.757365 | -12.621018 |
| 8 | 7.18531 | -3.99488 |
| 9 | 35.597059 | 14.789443 |
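
For comparison, here is a minimal sketch of the same reprojection done with NumPy only, following the steps of this lab from start to finish. It runs on synthetic data as a stand-in for `foodusa.csv`, so its numbers will not match the table above; the data and the names `W`, `projected`, and `proj_var` are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in for the price table: 10 observations, 5 correlated features
rng = np.random.default_rng(42)
X = rng.normal(size=(10, 5)) @ rng.normal(size=(5, 5))

# 1. Standardize each column (sample standard deviation, as pandas does)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized data
cov_mat = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh, since the covariance matrix is symmetric)
eig_values, eig_vectors = np.linalg.eigh(cov_mat)

# 4. Sort eigenvalues and eigenvector columns from largest to smallest
order = eig_values.argsort()[::-1]
eigenvalues_sorted = eig_values[order]
eigenvectors_sorted = eig_vectors[:, order]

# 5. Keep the first two eigenvectors and project the data onto them
W = eigenvectors_sorted[:, :2]   # shape (5, 2)
projected = X_std @ W            # shape (10, 2)

# The variance of each projected column equals the corresponding eigenvalue
proj_var = projected.var(axis=0, ddof=1)
print(projected.shape)
print(np.allclose(proj_var, eigenvalues_sorted[:2]))
```

The sign of each eigenvector is arbitrary, so a projection computed this way may differ from another library's result by a sign flip in one or both columns.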


## Summary

Well done! You've now coded PCA on your own using NumPy! With that, it's time to look at further applications of PCA.