# Performing Principal Component Analysis (PCA) - Lab

## Introduction

Now that you have a high-level overview of PCA, as well as some of the details of the algorithm itself, it's time to practice implementing PCA on your own using the NumPy package. 

## Objectives

You will be able to:
    
* Implement PCA from scratch using NumPy

## Import the data

- Import the data stored in the file `'foodusa.csv'` (set `index_col=0`)
- Print the first five rows of the DataFrame 

In [2]:
import pandas as pd
data = pd.read_csv('foodusa.csv', index_col=0)
data.head()


Unnamed: 0_level_0,Bread,Burger,Milk,Oranges,Tomatoes
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ATLANTA,24.5,94.5,73.9,80.1,41.6
BALTIMORE,26.5,91.0,67.5,74.6,53.3
BOSTON,29.7,100.8,61.4,104.0,59.6
BUFFALO,22.8,86.6,65.3,118.4,51.2
CHICAGO,26.7,86.7,62.7,105.9,51.2


## Normalize the data

Next, normalize your data by subtracting the mean from each of the columns.

In [3]:
data = data - data.mean()
data.head()

Unnamed: 0_level_0,Bread,Burger,Milk,Oranges,Tomatoes
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ATLANTA,-0.791304,2.643478,11.604348,-22.891304,-7.165217
BALTIMORE,1.208696,-0.856522,5.204348,-28.391304,4.534783
BOSTON,4.408696,8.943478,-0.895652,1.008696,10.834783
BUFFALO,-2.491304,-5.256522,3.004348,15.408696,2.434783
CHICAGO,1.408696,-5.156522,0.404348,2.908696,2.434783


## Calculate the covariance matrix

The next step is to calculate the covariance matrix for your normalized data. 

In [4]:
cov_mat = np.cov([data.Bread, data.Burger, data.Milk, data.Oranges, data.Tomatoes])
cov_mat

array([[  6.2844664 ,  12.91096838,   5.71905138,   1.31037549,
          7.28513834],
       [ 12.91096838,  57.07711462,  17.50752964,  22.69187747,
         36.29478261],
       [  5.71905138,  17.50752964,  48.30588933,  -0.27503953,
         13.44347826],
       [  1.31037549,  22.69187747,  -0.27503953, 202.75628458,
         38.76241107],
       [  7.28513834,  36.29478261,  13.44347826,  38.76241107,
         57.80055336]])

## Calculate the eigenvectors

Next, calculate the eigenvectors and eigenvalues for your covariance matrix. 

In [6]:
import numpy as np
eig_values, eig_vectors = np.linalg.eig(cov_mat)

In [7]:
eig_values

array([218.99867893,  91.72316894,   3.02922934,  20.81054128,
        37.66268981])

In [8]:
eig_vectors

array([[-0.02848905, -0.16532108, -0.96716354, -0.18972574,  0.02135748],
       [-0.2001224 , -0.63218494,  0.24877074, -0.65862454,  0.25420475],
       [-0.0416723 , -0.44215032,  0.03606094,  0.10765906, -0.88874949],
       [-0.93885906,  0.31435473, -0.01521357, -0.06904699, -0.12135003],
       [-0.27558389, -0.52791603, -0.03429221,  0.71684022,  0.36100184]])

## Sort the eigenvectors 

Great! Now that you have the eigenvectors and their associated eigenvalues, sort the eigenvectors based on their eigenvalues to determine primary components!

In [13]:
# Get the index values of the sorted eigenvalues
e_indices = np.argsort(eig_values)[::-1] # [::-1] creates a view of the np.array in reverse order
e_indices

array([0, 1, 4, 3, 2])

In [16]:
# Sort 
eigenvectors_sorted = eig_vectors[:,e_indices] # [:, e_indices] sorts the rows by the index indicated by e_indices
eigenvectors_sorted

array([[-0.02848905, -0.16532108,  0.02135748, -0.18972574, -0.96716354],
       [-0.2001224 , -0.63218494,  0.25420475, -0.65862454,  0.24877074],
       [-0.0416723 , -0.44215032, -0.88874949,  0.10765906,  0.03606094],
       [-0.93885906,  0.31435473, -0.12135003, -0.06904699, -0.01521357],
       [-0.27558389, -0.52791603,  0.36100184,  0.71684022, -0.03429221]])

## Reprojecting the data

Finally, reproject the dataset using your eigenvectors. Reproject this dataset down to 2 dimensions.

In [17]:
transformed = eigenvectors_sorted.dot(data.T).T
transformed

array([[ 1.11063671e+01,  1.47313493e+01, -1.41720384e+01,
         1.85530932e+00, -1.31519693e+01],
       [ 1.21900301e+00,  2.14498941e+01, -7.19007196e+00,
        -1.44250494e-01, -1.85096828e+01],
       [-1.22936562e+01, -4.73286484e+00, -2.84276415e+00,
        -1.45351480e+00, -5.90817824e+00],
       [-4.27410508e+00, -4.95746329e+00,  1.50456490e+00,
        -7.78972241e-01,  1.55082192e+01],
       [-2.08570219e+00,  1.77071529e+00,  2.26283823e+00,
        -3.23048970e+00,  4.48154292e+00],
       [ 2.02322957e+00, -4.83130671e+00, -6.11053639e+00,
         3.51881382e+00, -7.79622091e+00],
       [ 7.65139684e-01, -5.78241861e+00,  1.10305914e+01,
         2.06281561e+00,  4.46446433e+00],
       [ 5.01990281e+00, -7.08302707e+00,  4.06579025e+00,
        -1.07688088e+00,  1.49042564e+01],
       [-5.28963606e+00, -7.18560259e+00,  9.68250581e+00,
         2.48951246e+00,  1.42270390e-01],
       [-2.02949124e+01, -2.18073279e+01, -1.85701795e+01,
        -3.80426303e+00

In [18]:
two_d = eigenvectors_sorted[:2].dot(data.T).T

In [19]:
two_d

array([[ 11.10636706,  14.73134932],
       [  1.21900301,  21.44989411],
       [-12.29365615,  -4.73286484],
       [ -4.27410508,  -4.95746329],
       [ -2.08570219,   1.77071529],
       [  2.02322957,  -4.83130671],
       [  0.76513968,  -5.78241861],
       [  5.01990281,  -7.08302707],
       [ -5.28963606,  -7.18560259],
       [-20.29491244, -21.8073279 ],
       [  6.65984778,   1.94000258],
       [  6.31128707,   2.48929146],
       [ 14.12127203,  10.53050299],
       [ -6.30360841,  -4.25846371],
       [ -2.70119271,  -4.22170394],
       [-17.39101398, -11.4694373 ],
       [-11.5197783 ,   7.50279463],
       [ -3.85046231, -12.11400365],
       [  0.02742717,  -9.57890292],
       [ 16.08930141,   7.15585369],
       [  7.92495397,   0.76667379],
       [  8.40796253,  16.30333807],
       [  6.32837355,  13.38210659]])

## Summary

Well done! You've now coded PCA on your own using NumPy! With that, it's time to look at further applications of PCA.