## Why do we use PCA?

We can use PCA as a preprocessing step or in our modelling phase. Essentially, PCA (Principal Component Analysis) is a form of Dimensionality Reduction.

#### Dimensionality Reduction
- More easily manage data/visualize data
- Identify what is most important

**Types of Dimensionality Reduction**
- Feature Extraction (PCA)
- Feature Elimination 

**Risks of Dimensionality Reduction**
- Underfitting
- Losing Information



#### Principal Component Analysis
- Identify independent features

With a PCA we are trying to identify the directions with the highest variability.  

- Directions/dimensions with high variability likely indicate Signal (Predictability).
- Directions/dimensions with low variability likely indicate noise.

These directions will translate to Principal Components later.

With a PCA we are first trying to identify **the** first Principal Component.
Principal components will always be expressed in terms of your original features.

Let's think of the first principal component as a line of best fit. Once the first Principal Component (PC1) is established, PC2 is going to be orthogonal (perpendicular) to PC1.

An EigenValue tells you how important an EigenVector is.

- The EigenPair (EigenValue & EigenVector) that explains the **most** variance is Principal Component 1
- The EigenPair (EigenValue & EigenVector) that explains the **second** most variance is Principal Component 2
- *and so on and so forth.*

Maximum number of Principal Components is equal to your number of features.

__PCA Process__
- Original Data
- Center it (standardize: mean 0, unit variance)
- Compute the covariance matrix of the standardized original data.
- Compute the EigenValues/EigenVectors of that matrix.
- Decide how much explained variance you want in your final model and select the number of Principal Components that are needed to explain said amount of variance.
- Keep the EigenVectors whose EigenValues are needed to explain said variance.
- Go back and multiply the standardized data by the EigenVectors of the selected Principal Components.
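
The steps above can be sketched with NumPy alone. This is a minimal illustration, not the lab's solution: the synthetic data, the variable names, and the 90% target are all assumptions made for the example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the second feature is strongly correlated with the first.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
data = np.column_stack([
    base,
    2.0 * base + rng.normal(scale=0.3, size=200),
    rng.normal(size=200),
])

# 1. Standardize: zero mean and unit variance per feature.
standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized data (features are in columns).
cov = np.cov(standardized, rowvar=False)

# 3. Eigendecomposition. eigh is intended for symmetric matrices and
#    returns eigenvalues in ascending order, so reverse them.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Keep enough components to reach the target explained variance.
target = 0.90
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
n_keep = int(np.argmax(cumulative >= target)) + 1

# 5. Project the standardized data onto the kept eigenvectors.
projected = standardized @ eigenvectors[:, :n_keep]
print(n_keep, projected.shape)
```

Note that the projection multiplies by eigenvectors (the directions); the eigenvalues are only used to rank those directions and decide how many to keep.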

PCA works best when the features are correlated with one another, because it summarizes those relationships in fewer directions.
A dataset of entirely uncorrelated features will not show much benefit from a PCA.
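
To see why correlation matters, here is a small sketch comparing the explained-variance ratios of uncorrelated and strongly correlated synthetic data. The helper `explained_ratios` and both data sets are hypothetical, made up for this comparison.

```python
import numpy as np

rng = np.random.default_rng(1)

def explained_ratios(data):
    """Eigenvalues of the correlation matrix as fractions of their sum, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]
    return eigenvalues / eigenvalues.sum()

# Four independent features: no single direction dominates.
uncorrelated = rng.normal(size=(1000, 4))

# Four features that all share one underlying factor plus a little noise.
factor = rng.normal(size=(1000, 1))
correlated = factor + 0.2 * rng.normal(size=(1000, 4))

ratios_uncorrelated = explained_ratios(uncorrelated)
ratios_correlated = explained_ratios(correlated)
print(ratios_uncorrelated)
print(ratios_correlated)
```

With uncorrelated features each component should explain roughly an equal share, so dropping any of them loses about as much as dropping an original feature. With correlated features the first component should carry most of the variance.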

#### Some Resources.
Georgia Tech PCA Part 1
https://www.youtube.com/watch?v=kw9R0nD69OU

Georgia Tech PCA Part 2
https://www.youtube.com/watch?v=_nZUhV-qhZA

Khan Academy: Eigen Everything
https://www.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors

Now, to the lab!

*as of right now.  This Lab only covers the Congressional Voting Data problem.*


In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA 
from sklearn import metrics


%matplotlib inline


# Case #1: Congressional Voting Data

After you've downloaded the data from the repository, go ahead and load it with Pandas and handle any preprocessing that it may need.

I want to make use of the PCA and not just do it, so I am going to set my threshold of required explained variance to 90%.

In [3]:
votes = pd.read_csv('datasets/votes.csv')

In [4]:
# The CSV was exported with its index, which pandas reads back in as 'Unnamed: 0'.
# We could keep this column as garbage noise if we wanted to, but here we drop it.
votes.drop('Unnamed: 0', axis=1, inplace=True)
votes.head()

Unnamed: 0,Class,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [4]:
votes.isnull().sum()/len(votes)
# wooo weee! we got some null values.

Class    0.000000
V1       0.027586
V2       0.110345
V3       0.025287
V4       0.025287
V5       0.034483
V6       0.025287
V7       0.032184
V8       0.034483
V9       0.050575
V10      0.016092
V11      0.048276
V12      0.071264
V13      0.057471
V14      0.039080
V15      0.064368
V16      0.239080
dtype: float64

In [5]:
for item in votes:
    print(item)

Class
V1
V2
V3
V4
V5
V6
V7
V8
V9
V10
V11
V12
V13
V14
V15
V16


In [6]:
# these need to be numerical to run a correlation matrix
# replace "y" with 1, "n" with -1, and null with 0
# we also want our Class to be numeric, and since there are only 2 classes, we can convert it to binary
# let's make 'democrat' = 1

# votemap is a dictionary of conversion terms.
votemap = {'y':1, 'n':-1, 'democrat':1, 'republican':0}

# we're going to use this column logic for our mapping function
for example in votes:
    print(example)
# returns the column names.  Wicked!

Class
V1
V2
V3
V4
V5
V6
V7
V8
V9
V10
V11
V12
V13
V14
V15
V16


In [7]:
for col in votes:
    # with those column names we can index columns and then 'map' our conversion dict to them
    votes[col] = votes[col].map(votemap)

# now, handle the NaN values that were left because they had no conversion value.
votes.fillna(0, inplace=True)

In [8]:
# what does it look like now.
votes.head()

Unnamed: 0,Class,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
0,0,-1.0,1.0,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,1.0,0.0,1.0,1.0,1.0,-1.0,1.0
1,0,-1.0,1.0,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,1.0,-1.0,0.0
2,1,0.0,1.0,1.0,0.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,1.0,-1.0,-1.0
3,1,-1.0,1.0,1.0,-1.0,0.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,1.0,-1.0,-1.0,1.0
4,1,1.0,1.0,1.0,-1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,1.0,0.0,1.0,1.0,1.0,1.0


Next, let's define the x and y variables: 

In [9]:
X = votes.drop('Class', axis = 1)
y = votes.Class

The next step requires standardized values. Let's just do that now. **There are 2 different methods below.**

**Note: this data is already on a common scale between -1 and 1, so standardizing matters less here than it would for features with different units. The columns are not yet centered at 0 with unit variance, though.**

In [None]:
# Method 1: StandardScaler returns the result directly as a NumPy array
x_stand = StandardScaler().fit_transform(X)
x_stand

In [None]:
# Method 2: This keeps the result as a pandas DataFrame

X_2 = (X - X.mean()) / X.std()
# pandas broadcasts the column means and standard deviations across every row
X_2.head()
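
The two methods above do not give numerically identical results: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`). A small sketch of the difference, using a hypothetical toy frame in place of `X`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for X.
toy = pd.DataFrame({"a": [1.0, -1.0, 1.0, 0.0], "b": [-1.0, -1.0, 1.0, 1.0]})

# Method 1: divides by the population standard deviation (ddof=0).
scaled_sklearn = StandardScaler().fit_transform(toy)

# Method 2: pandas .std() uses the sample standard deviation (ddof=1).
scaled_pandas = ((toy - toy.mean()) / toy.std()).to_numpy()

n = len(toy)
# The two results differ only by the constant factor sqrt(n / (n - 1)).
factor = np.sqrt(n / (n - 1.0))
print(np.allclose(scaled_sklearn, scaled_pandas * factor))
```

Because the difference is a single constant factor applied to every column, it does not change the correlation matrix or the directions of the principal components.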

#### Next, create the covariance matrix from the standardized x-values and decompose these values to find the eigenvalues and eigenvectors

In [11]:
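X_2CM = X_2.corr()
# .corr() is a pandas method that computes pairwise correlation of columns, excluding NA/null values.
# For standardized data the correlation matrix and the covariance matrix are the same.
# The principal components are the eigenvectors of this matrix.
X_2CM.head()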

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
V1,1.0,0.023232,0.39768,-0.421307,-0.3691,-0.402215,0.362022,0.399039,0.33984,-0.086469,0.105277,-0.41112,-0.34836,-0.370628,0.201782,0.11657
V2,0.023232,1.0,-0.054237,0.076274,0.133882,0.149569,-0.203465,-0.103966,-0.190123,-0.122931,0.188786,-0.019364,0.223338,-0.016535,-0.11094,-0.09144
V3,0.39768,-0.054237,1.0,-0.725232,-0.651244,-0.43196,0.579655,0.69815,0.603294,0.022112,0.218328,-0.645382,-0.526661,-0.585085,0.47833,0.311423
V4,-0.421307,0.076274,-0.725232,1.0,0.753347,0.47629,-0.580509,-0.694025,-0.639042,0.04436,-0.282151,0.690901,0.593952,0.647853,-0.538417,-0.270164
V5,-0.3691,0.133882,-0.651244,0.753347,1.0,0.624175,-0.694744,-0.827431,-0.782799,0.009348,-0.146776,0.634723,0.645797,0.695011,-0.558103,-0.274914


In [25]:
# This is the correlation matrix of our variables without being scaled.
X_3 = X.corr()
X_3.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
V1,1.0,0.023232,0.39768,-0.421307,-0.3691,-0.402215,0.362022,0.399039,0.33984,-0.086469,0.105277,-0.41112,-0.34836,-0.370628,0.201782,0.11657
V2,0.023232,1.0,-0.054237,0.076274,0.133882,0.149569,-0.203465,-0.103966,-0.190123,-0.122931,0.188786,-0.019364,0.223338,-0.016535,-0.11094,-0.09144
V3,0.39768,-0.054237,1.0,-0.725232,-0.651244,-0.43196,0.579655,0.69815,0.603294,0.022112,0.218328,-0.645382,-0.526661,-0.585085,0.47833,0.311423
V4,-0.421307,0.076274,-0.725232,1.0,0.753347,0.47629,-0.580509,-0.694025,-0.639042,0.04436,-0.282151,0.690901,0.593952,0.647853,-0.538417,-0.270164
V5,-0.3691,0.133882,-0.651244,0.753347,1.0,0.624175,-0.694744,-0.827431,-0.782799,0.009348,-0.146776,0.634723,0.645797,0.695011,-0.558103,-0.274914


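While the correlation matrices of the formally standardized and the original data sets appear to be the same, Python does not see most of their entries as equal. Correlation is unaffected by centering and rescaling, so the two matrices are mathematically identical; the mismatches come from floating-point rounding in the last decimal places.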

In [15]:
# Why are these not the same, but look the same?
# == compares every entry exactly, so tiny floating-point differences show up as False.
X_3 == X_2CM

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
V1,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
V2,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
V3,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
V4,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
V5,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False
V6,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False
V7,False,False,False,False,True,True,True,False,False,False,False,False,False,False,False,False
V8,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
V9,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
V10,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False
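
When comparing float matrices like the two above, a tolerance-based check is usually more informative than `==`. A minimal sketch with synthetic data (the frame `raw` and its column names are assumptions for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
raw = pd.DataFrame(rng.normal(size=(50, 3)), columns=["V1", "V2", "V3"])
standardized = (raw - raw.mean()) / raw.std()

corr_raw = raw.corr()
corr_standardized = standardized.corr()

# Exact equality can fail on individual entries because of rounding...
exactly_equal = bool((corr_raw == corr_standardized).all().all())
# ...but the matrices agree to within floating-point tolerance.
close_enough = np.allclose(corr_raw, corr_standardized)
print(exactly_equal, close_enough)
```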


#### Now, let's check the eigenvalues: 

In [17]:
# Man isn't Numpy great!
# Note: np.linalg.eig does not return the eigenvalues in sorted order.
eig_vals, eig_vecs = np.linalg.eig(X_2CM)
eig_vals1, eig_vecs1 = np.linalg.eig(X_3)
print(eig_vals)
print(eig_vals1)

[ 7.40236313  1.42718114  1.13099577  0.86295103  0.80236765  0.75097129
  0.13309252  0.21540746  0.24037679  0.5749521   0.30529835  0.33103556
  0.52533026  0.47190968  0.39281719  0.43295007]
[ 7.40236313  1.42718114  1.13099577  0.86295103  0.80236765  0.75097129
  0.13309252  0.21540746  0.24037679  0.5749521   0.30529835  0.33103556
  0.52533026  0.47190968  0.39281719  0.43295007]


#### And the eigenvectors: 

In [19]:
print(eig_vecs)

[[ -1.87809330e-01  -1.81722119e-01  -1.55275769e-01   5.53363824e-01
    3.95918006e-01   4.94792150e-01  -6.20394464e-02   1.06978258e-02
   -5.85601301e-02   1.72339485e-01   5.45292676e-02   1.61018808e-01
    1.17913902e-01  -3.23472153e-01  -1.21564992e-01   2.24123205e-02]
 [  5.38655080e-02  -6.10752223e-01   1.37255837e-01   4.08772744e-01
   -1.11659131e-01  -5.32813707e-01  -3.65539275e-04   1.00665084e-01
    9.38662181e-02   7.18097848e-02  -1.10895527e-01   7.01216422e-02
   -1.22742773e-01  -7.37275759e-03   4.54129677e-02   2.88598767e-01]
 [ -2.93251619e-01  -8.58088375e-02   1.83349755e-01   2.74308742e-02
    3.39469693e-02   2.87636785e-02  -1.84410234e-01   1.37147698e-01
   -1.66809452e-01  -2.70406816e-01  -6.13057352e-01  -4.33207839e-01
    3.69655169e-01   3.77171990e-02  -1.00605249e-01  -6.95114324e-02]
 [  3.10693742e-01   1.35055455e-01  -1.01923087e-01   1.02105209e-01
   -4.98940153e-02  -6.73630916e-02  -2.87738053e-01  -4.43949692e-02
   -7.87331778e-0

**Great!** We have some EigenVectors and EigenValues, but what do they mean?
These EigenValues and EigenVectors are not computed from our original data matrix directly; they come from our square feature-correlation matrix.


So we have our EigenValues, which act like keys for each EigenVector. The number of EigenPairs right now is equal to our original number of features because we have not done any EigenExtraction. (EigenExtractionIsNotARealThingIMadeItUpForThisSentence.)

An EigenValue is essentially the scale factor of its EigenVector: multiplying the correlation matrix by an EigenVector returns that same vector stretched by the EigenValue.

The EigenPairs are used to calculate the explained variance, which is useful in determining principal components.
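
As a quick sanity check on what an EigenPair is, the sketch below verifies the defining relation `A @ v == lam * v` on a small symmetric matrix. The 2x2 matrix is a made-up stand-in for the correlation matrix.

```python
import numpy as np

# A small symmetric matrix standing in for the correlation matrix.
A = np.array([[1.0, 0.8],
              [0.8, 1.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Each column of `eigenvectors` pairs with the eigenvalue at the same index.
checks = []
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    # Multiplying by A only rescales v by lam; the direction is unchanged.
    checks.append(np.allclose(A @ v, lam * v))

print(checks)
print(eigenvalues.sum())  # equals the trace of A, i.e. the number of features
```

The last line also shows why explained variance divides by the sum of the EigenValues: for a correlation matrix that sum is the number of features, which is the total variance of the standardized data.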

#### To find the principal components, find the eigenpairs, and sort them from highest to lowest.

In [20]:
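value_vector_pairs = [[eig_vals[i], eig_vecs[:,i]] for i in range(len(eig_vals))]
# sort on the eigenvalue only, so the arrays are never compared with each other
value_vector_pairs.sort(key=lambda pair: pair[0], reverse=True)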

What are principal components?

In [21]:
value_vector_pairs


[[7.4023631295238799,
  array([-0.18780933,  0.05386551, -0.29325162,  0.31069374,  0.32982429,
          0.26111754, -0.29052391, -0.32161235, -0.30007697,  0.01129995,
         -0.06852963,  0.28732779,  0.27557143,  0.28569924, -0.24614685,
         -0.13753029])],
 [1.4271811422429181,
  array([-0.18172212, -0.61075222, -0.08580884,  0.13505545, -0.03476445,
         -0.08521068,  0.18229466,  0.04488362,  0.14629295,  0.38175157,
         -0.50627296,  0.15733647, -0.08472336,  0.14338653, -0.02048274,
          0.21972732])],
 [1.1309957702438529,
  array([-0.15527577,  0.13725584,  0.18334976, -0.10192309,  0.01447065,
          0.31285733,  0.01316406,  0.06646765, -0.00834646,  0.63236536,
          0.44793651, -0.04989752,  0.11856934,  0.13103759, -0.04185746,
          0.41748102])],
 [0.8629510277123037,
  array([ 0.55336382,  0.40877274,  0.02743087,  0.10210521,  0.07099146,
         -0.11745149,  0.06972849,  0.02339107,  0.02011706,  0.10313823,
         -0.41333262, -

#### Now, calculate the explained variance. Recall the methods we learned in the lesson!

The explained variance for an EigenPair is its EigenValue divided by the sum of all EigenValues, multiplied by 100 to give a percentage.

In [22]:
EVSum = sum(eig_vals)
# EV = eigenvalue divided by the sum of all eigenvalues times 100
var_exp = [(i / EVSum)*100 for i in sorted(eig_vals, reverse=True)]

# Explained variance of all eigen pairs should add up to 100, as in 100% of explained variance.

In [23]:
print(var_exp)

[46.26476955952424, 8.9198821390182363, 7.0687235640240793, 5.3934439232018967, 5.0147978038499632, 4.6935705686746969, 3.5934506470415686, 3.2833141196630709, 2.9494355201732434, 2.7059379593370236, 2.4551074120212992, 2.0689722479974804, 1.9081147096158195, 1.5023549598944306, 1.3462966051801244, 0.83182826078280425]


How do the explained variance values relate to one another?

#### Now, calculate the explained variance and the Cumulative explained variance

In [24]:
# np.cumsum returns the cumulative (running) sum of the elements.
cum_var_exp = np.cumsum(var_exp)
print(cum_var_exp)

# you can see that the explained variance of all the eigenvalues adds up to 100.
# in other words, the full set of eigenpairs explains all of the variance.

[ 46.26476956  55.1846517   62.25337526  67.64681919  72.66161699
  77.35518756  80.94863821  84.23195232  87.18138785  89.8873258
  92.34243322  94.41140546  96.31952017  97.82187513  99.16817174 100.        ]


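#### Now, find and interpret the principal components

A long time ago we set our required explained variance to 90%. Using the cumulative sum of explained variance, we can see that the threshold is crossed between the 10th and 11th principal components: the first 10 explain about 89.9% of the variance and the first 11 about 92.3%. Ten components fall just short of 90%, so 11 are needed to strictly meet it.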
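
Rather than reading the cutoff off the printed array by eye, the component count can be computed. A short sketch, with the cumulative values copied from the output above and `threshold` as an assumed variable name:

```python
import numpy as np

# Cumulative explained variance (percent), copied from the output above.
cum_var_exp = np.array([
    46.26476956, 55.1846517, 62.25337526, 67.64681919, 72.66161699,
    77.35518756, 80.94863821, 84.23195232, 87.18138785, 89.8873258,
    92.34243322, 94.41140546, 96.31952017, 97.82187513, 99.16817174, 100.0,
])

threshold = 90.0
# argmax returns the index of the first True; add 1 to turn it into a count.
n_components = int(np.argmax(cum_var_exp >= threshold)) + 1
print(n_components)  # 11
```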

Now, repeat the process with sklearn.

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [26]:
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
skl_pca = pca.fit_transform(X_2)
skl_pca

array([[-3.49236025,  0.08607588,  1.37104206, ..., -0.59099576,
         0.21114272, -0.06781472],
       [-3.730348  ,  0.61701382, -0.94909933, ..., -0.15029384,
         0.05639808,  0.16833472],
       [-2.05574679,  2.82392803, -0.13849764, ...,  0.84693551,
         0.35413561, -0.41179291],
       ..., 
       [-3.33971434,  0.74628917,  0.42819691, ..., -0.89081185,
         0.08251796, -0.4308882 ],
       [-2.50383963, -1.74407999,  0.04065769, ..., -0.54598294,
        -0.40164608, -0.20759022],
       [-3.68429367,  0.16971379, -0.28952655, ...,  0.1711483 ,
        -0.50490591,  0.3397782 ]])

In [27]:
pca.components_

array([[ 0.18780933, -0.05386551,  0.29325162, -0.31069374, -0.32982429,
        -0.26111754,  0.29052391,  0.32161235,  0.30007697, -0.01129995,
         0.06852963, -0.28732779, -0.27557143, -0.28569924,  0.24614685,
         0.13753029],
       [ 0.18172212,  0.61075222,  0.08580884, -0.13505545,  0.03476445,
         0.08521068, -0.18229466, -0.04488362, -0.14629295, -0.38175157,
         0.50627296, -0.15733647,  0.08472336, -0.14338653,  0.02048274,
        -0.21972732],
       [-0.15527577,  0.13725584,  0.18334976, -0.10192309,  0.01447065,
         0.31285733,  0.01316406,  0.06646765, -0.00834646,  0.63236536,
         0.44793651, -0.04989752,  0.11856934,  0.13103759, -0.04185746,
         0.41748102],
       [ 0.55336382,  0.40877274,  0.02743087,  0.10210521,  0.07099146,
        -0.11745149,  0.06972849,  0.02339107,  0.02011706,  0.10313823,
        -0.41333262, -0.06282039,  0.14603368,  0.00404089, -0.39109413,
         0.36213577],
       [-0.39591801,  0.11165913, -0

In [28]:
pca.explained_variance_ratio_

array([ 0.4626477 ,  0.08919882,  0.07068724,  0.05393444,  0.05014798,
        0.04693571,  0.03593451,  0.03283314,  0.02949436,  0.02705938])
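
Comparing the outputs above, the sklearn components match the hand-computed EigenVectors except that some have every sign flipped. The sign of an EigenVector is arbitrary, so both describe the same direction. The sketch below checks this on synthetic data (an assumed stand-in for the standardized votes), comparing absolute values to ignore the sign.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Synthetic stand-in for the standardized votes: 300 samples, 4 mixed features.
mixing = rng.normal(size=(4, 4))
data = rng.normal(size=(300, 4)) @ mixing
data = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Manual route: eigendecomposition of the covariance matrix, largest first.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# sklearn route.
pca = PCA(n_components=4).fit(data)

# Rows of components_ correspond to the eigenvector columns, up to sign.
same_directions = np.allclose(np.abs(pca.components_), np.abs(eigenvectors.T))
same_ratios = np.allclose(pca.explained_variance_ratio_,
                          eigenvalues / eigenvalues.sum())
print(same_directions, same_ratios)
```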

I'm pretty sure our data has now been projected onto its principal components (`skl_pca`), so we can load it straight into a model or something.
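
One possible way to do that, sketched here on synthetic data rather than the votes, is to chain scaling, PCA, and a classifier in a single pipeline. The choice of `LogisticRegression`, the 5-fold cross-validation, and the generated `features`/`labels` are all assumptions for illustration. Passing a float to `n_components` with `svd_solver="full"` asks sklearn to keep as many components as are needed to explain that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)

# Synthetic stand-in for the votes: a binary class shifts all 16 noisy features.
labels = rng.integers(0, 2, size=400)
signal = (2 * labels - 1)[:, None]  # -1 or +1 per sample
features = signal * rng.uniform(0.5, 1.0, size=16) + rng.normal(size=(400, 16))

# Scale, keep enough components for 90% of the variance, then classify.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90, svd_solver="full"),
    LogisticRegression(),
)

scores = cross_val_score(model, features, labels, cv=5)
print(scores.mean())
```

Keeping the PCA step inside the pipeline means it is refit on each training fold, so the held-out fold never influences the components.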

# Case #2: Airport Delays

In [None]:
airports = pd.read_csv('datasets/airport.csv')

In [None]:
airports.head()

First, let's define the x and y variables: Airport is going to be our "x" variable

Then, standardize the x variable for analysis

Next, create the covariance matrix from the standardized x-values and decompose these values to find the eigenvalues and eigenvectors

Then, check your eigenvalues and eigenvectors:

To find the principal components, find the eigenpairs, and sort them from highest to lowest.

Next, calculate the explained variance