# Principal Component Analysis 

_August 24, 2020_

Agenda today:
- Dimensionality reduction: motivation and intuition for PCA
- Constructing principal components using singular value decomposition and interpreting them
- Example implementation in Python

## Part I. Motivation and Intuition for PCA
We can easily visualize data with two or three dimensions, but as the number of dimensions increases, making sense of the data through visualization becomes challenging. One partial solution is a pairplot:
<img src="attachment:Screen%20Shot%202019-03-24%20at%207.02.22%20PM.png" style="width:500px;">

Pairplots are a great way of visualizing the pairwise relationships between variables, and so are heatmaps. However, as the number of dimensions grows to hundreds or even thousands, pairplots and heatmaps are no longer practical. In addition, they only show pairwise correlations between variables, not linear combinations of variables. Therefore, we need an algorithm that performs dimensionality reduction.

## Part II. Constructing PCA

#### What is PCA?
PCA is a method for summarizing data: it constructs new characteristics from the original dataset that summarize the original variables. In other words, unlike dimensionality reduction methods that select a subset of the original features, PCA linearly transforms the original variables. It outputs principal components that are linear combinations of the original variables, chosen so that projecting the data onto the first few components captures as much of its variability as possible. PCA by itself does not discard any information or any variables; it is up to us to select how many components to include in our model. PCA outputs $p$ principal components, where $p$ equals the number of features in our dataset (assuming there are at least as many observations as features).

#### What are principal components?
The principal components are linear combinations of the original variables. The first principal component is the combination of variables that explains the largest share of the variance in the data. The second component explains the second-largest share while being uncorrelated with the first, and so on. For example,
<img src="attachment:Screen%20Shot%202019-03-25%20at%2011.18.58%20AM.png" style="width:500px;">
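
As a sketch in symbols (the notation $z_k$, $\phi_{jk}$ and $x_j$ is introduced here for illustration and does not come from the figure), the $k$-th principal component of the standardized variables $x_1, \dots, x_p$ can be written as

$$
z_k = \phi_{1k} x_1 + \phi_{2k} x_2 + \dots + \phi_{pk} x_p, \qquad \sum_{j=1}^{p} \phi_{jk}^2 = 1
$$

Under this notation the weights $\phi_{1k}, \dots, \phi_{pk}$ are the entries of the $k$-th eigenvector, and the first component is the choice of unit-length weights that maximizes $\mathrm{Var}(z_1)$.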


#### PCA Step-by-step
1. It is important to **center** and standardize your data. PCA is driven by the covariances (or correlations) in your data, so variables measured on wildly different scales can receive inflated weights in the linear combinations. Let's call this centered and standardized matrix **Z**. 
2. Calculate the p x p covariance matrix of **Z**, where p corresponds to the number of predictors. 
3. Calculate the eigenvectors and eigenvalues of the covariance matrix. 
4. Arrange the eigenvalues from largest to smallest. You should obtain p eigenvalues, one for each component. 
5. Choose the number of components you want to keep based on the proportion of variance explained.
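
The five steps above can be sketched with plain numpy. This is a minimal illustration, not the only way to do it: the synthetic data, the variable names, and the choice of `k = 2` are all assumptions made for the example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the first two features are strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.5 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Step 1: center and standardize each column to get Z
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: p x p covariance matrix of the standardized data
cov = np.cov(Z, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh is meant for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: sort the eigenpairs from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 5: proportion of variance explained, then keep the first k components
explained_ratio = eigenvalues / eigenvalues.sum()
k = 2  # hypothetical choice for this example
scores = Z @ eigenvectors[:, :k]

print(explained_ratio)
print(scores.shape)
```

Because **Z** is standardized, the eigenvalues sum to p (here 3), and the sample variance of each column of `scores` equals the corresponding eigenvalue.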

#### PCA Terminology 
- Eigenvectors 
    - Eigenvectors are unit-length vectors in the p-dimensional feature space that give the directions of the principal components. 
- Eigenvalues
    - Eigenvalues give the amount of variance along each component, denoted as $\lambda$
- Loadings
    - Not to be confused with eigenvectors
    - $\text{Loadings} = \text{Eigenvectors} \cdot \sqrt{\text{Eigenvalues}}$
    - Loadings are the covariances/correlations between the original variables and the unit-scaled components
    - Properties of Loadings:
        - Their sums of squares within each component are the eigenvalues (components' variances).
        - Loadings are the coefficients in the linear combination that describes a variable in terms of the (standardized) components.
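
The loading properties listed above can be checked numerically. The snippet below is a small sketch on synthetic data (the data and all variable names are assumptions for illustration). It builds loadings from the eigen-decomposition of a correlation matrix and compares them with the directly computed correlations between variables and components.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 300 samples, 4 features
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
X[:, 1] += 0.8 * X[:, 0]   # make feature 1 correlate with feature 0
X[:, 3] -= 0.5 * X[:, 2]   # make feature 3 correlate with feature 2

# Standardize, so the covariance matrix of Z is the correlation matrix of X
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.cov(Z, rowvar=False)

# Eigen-decomposition, sorted from largest to smallest eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Loadings: each eigenvector (column) scaled by the square root of its eigenvalue
loadings = eigenvectors * np.sqrt(eigenvalues)

# Property: sums of squared loadings within each component equal the eigenvalues
col_sums_of_squares = (loadings ** 2).sum(axis=0)

# Loadings as correlations between original variables and component scores
scores = Z @ eigenvectors
corr_var_pc = np.array([
    [np.corrcoef(Z[:, j], scores[:, k])[0, 1] for k in range(4)]
    for j in range(4)
])

print(np.allclose(col_sums_of_squares, eigenvalues))
print(np.allclose(corr_var_pc, loadings))
```

Rows of `loadings` correspond to the original variables and columns to the components, which is the layout the comparisons above rely on.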

## Part III. PCA Example
In this example, we want to reduce the dimensionality of the `mpg` dataset and see which features contribute most to the variability in the data. Let's find out!

In [None]:
import numpy as np

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# load the mpg (auto fuel economy) dataset that ships with seaborn
df = sns.load_dataset('mpg')

In [None]:
df.head()

In [None]:
df.dropna(axis = 0, inplace = True)

In [None]:
df.columns

In [None]:
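# guard against '?' placeholders for missing horsepower (used in the raw UCI file)
df = df[df.horsepower != '?'].copy()
# horsepower values are whole numbers, so store them as integers
df['horsepower'] = df['horsepower'].astype('int')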

In [None]:
# six numeric predictors; 'horsepower' should appear only once
features = ['cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration','model_year']

In [None]:
# preprocess the data 
from sklearn.preprocessing import StandardScaler
# Separating out the features
x = df.loc[:, features].values
# Separating out the target
y = df.loc[:,['mpg']].values

In [None]:
# Standardizing the features
x = StandardScaler().fit_transform(x)

In [None]:
# create principal components 
from sklearn.decomposition import PCA
pca = PCA()
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2',
                                                                 'principal component 3','principal component 4',
                                                                 'principal component 5','principal component 6'])

In [None]:
pca.explained_variance_ratio_

In [None]:
pca.explained_variance_

In [None]:
principalDf.iloc[:,:3].head()

In [None]:
eig_values = pca.explained_variance_
eig_vectors = pca.components_
print(eig_values)
print(eig_vectors)

In [None]:
# examine the first principal component
eig_vectors[0]

In [None]:
# examine the components
pc1 = pca.components_[0]
pc2 = pca.components_[1]
# the components_ attribute holds the principal axes in feature space, i.e. the directions of maximum variance in the data
# each row is one component, and the rows are sorted by explained_variance_ (largest first)


In [None]:
print(pc1)
print(pc2)

In [None]:
# get the loadings
structure_loading_1 = pc1* np.sqrt(eig_values[0])
str_loading_1 = pd.Series(structure_loading_1, index=features)
str_loading_1

In [None]:
str_loading_1.sort_values(ascending=False)

In [None]:
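# scree plot: proportion of variance explained by each component
index = np.arange(1, len(pca.explained_variance_ratio_) + 1)
plt.bar(index, pca.explained_variance_ratio_)
plt.title('Scree plot for PCA')
plt.xlabel('Principal component')
plt.ylabel('Proportion of explained variance')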

In [None]:
# plot the cumulative proportion of explained variance
n_components = np.arange(1, len(pca.explained_variance_ratio_) + 1)
plt.plot(n_components, np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')
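
One common way to turn a cumulative explained-variance curve into a decision is to pick the smallest number of components that reaches a chosen threshold. The helper below is a hypothetical sketch (the function name `n_components_for`, the 0.95 default, and the example ratios are all made up for illustration and are not results from the mpg data).

```python
import numpy as np

def n_components_for(explained_variance_ratio, threshold=0.95):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted gives the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # never ask for more components than exist
    return min(k, len(cumulative))

# Illustrative ratios only (not output from the mpg example)
ratios = np.array([0.60, 0.25, 0.10, 0.04, 0.01])
k = n_components_for(ratios, threshold=0.90)
print(k)
```

In the notebook, the same helper could be called with `pca.explained_variance_ratio_` in place of `ratios`.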

#### Using numpy 

In [None]:
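# you can also solve for the eigenvalues and eigenvectors directly with numpy
import numpy as np 
corr_mat = pd.DataFrame(x).corr()
# note: np.linalg.eig does not return the eigenvalues in sorted order
eigenvalues, eigenvectors = np.linalg.eig(corr_mat)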

In [None]:
eigenvalues

In [None]:
eigenvectors
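
Two details matter when comparing the numpy route with scikit-learn. `np.linalg.eig` returns eigenpairs in no particular order. Also, `StandardScaler` divides by the population standard deviation while `explained_variance_` uses an n - 1 denominator, so the two sets of values differ by a factor of n / (n - 1). The sketch below illustrates both points on synthetic data (the data and variable names are assumptions) and cross-checks the result against a singular value decomposition.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 150 samples, 3 features
rng = np.random.default_rng(2)
n = 150
X = rng.normal(size=(n, 3))
X[:, 1] += X[:, 0]  # introduce correlation between the first two features

# Standardize with the population standard deviation (ddof=0)
x_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the correlation matrix
corr_mat = np.corrcoef(x_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eig(corr_mat)

# Sort eigenpairs from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Rescale to the n - 1 denominator convention
sample_variances = eigenvalues * n / (n - 1)

# Cross-check with SVD: squared singular values divided by (n - 1)
singular_values = np.linalg.svd(x_std, compute_uv=False)
svd_variances = singular_values ** 2 / (n - 1)

print(sample_variances)
print(svd_variances)
```

The explained variance *ratios* are unaffected by this rescaling, since the factor n / (n - 1) cancels when each value is divided by the total.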

### Conclusions
PCA is one of the most versatile algorithms, used not only in data science research but also in the social sciences, natural sciences, and other fields. It allows us to further examine the relationships between the variables in our dataset and how much each variable contributes to the principal components. Some practical applications of PCA are:
- [Facial Recognition](https://en.wikipedia.org/wiki/Eigenface)
- General purpose dimensionality reduction 
- Clinical psychology -- distinguishing patients with schizophrenia from healthy patients [link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2566788/)
- Data Visualization 