# Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a standard tool in modern data analysis. PCA provides a roadmap for reducing a complex data set to a lower dimension, revealing the sometimes hidden, simplified structures that often underlie it. The goal of this post is to build an intuitive understanding of PCA through coding examples. To understand what is happening behind the scenes (the mathematics), you need a sound knowledge of linear algebra (matrix algebra).

# Reading

There is plenty of literature on PCA. Of everything I found, the following paper and video lecture are the clearest and explain the nuances of PCA well.

Paper: https://arxiv.org/pdf/1404.1100.pdf

Lecture: [video](https://www.youtube.com/watch?v=a9jdQGybYmE "link title")

In [8]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from IPython.display import display, HTML
from sympy import init_printing, Matrix, symbols, sqrt
init_printing(use_latex = 'mathjax')

## Load Data
The data is taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:

- Climate & Terrain
- Housing
- Health Care & Environment
- Crime
- Transportation
- Education
- The Arts
- Recreation
- Economics

In [9]:
data = pd.read_csv('places.csv')
data = data[[
            'location',
            'climate_log10',
            'housing_log10',
            'health_log10',
            'crime_log10',
            'transportation_log10',
            'education_log10',
            'arts_log10',
            'recreation_log10',
            'economy_log10'
        ]]

data = data.rename(columns={
            'climate_log10': '$x_1$',
            'housing_log10': '$x_2$',
            'health_log10': '$x_3$',
            'crime_log10': '$x_4$',
            'transportation_log10': '$x_5$',
            'education_log10': '$x_6$',
            'arts_log10': '$x_7$',
            'recreation_log10': '$x_8$',
            'economy_log10': '$x_9$'
    })
print(len(data))
data.head()

329


| | location | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $x_7$ | $x_8$ | $x_9$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene,TX | 2.716838 | 3.792392 | 2.374748 | 2.965202 | 3.605413 | 3.440437 | 2.998259 | 3.147676 | 3.882695 |
| 1 | Akron,OH | 2.759668 | 3.910518 | 3.21906 | 2.947434 | 3.688687 | 3.387034 | 3.745387 | 3.420286 | 3.638489 |
| 2 | Albany,GA | 2.670246 | 3.865637 | 2.790988 | 2.986772 | 3.403292 | 3.40824 | 2.374748 | 2.933993 | 3.720159 |
| 3 | Albany-Schenectady-Troy,NY | 2.677607 | 3.898067 | 3.15564 | 2.78533 | 3.837778 | 3.531351 | 3.66792 | 3.20871 | 3.768194 |
| 4 | Albuquerque,NM | 2.818885 | 3.923917 | 3.267875 | 3.171141 | 3.816771 | 3.480869 | 3.652826 | 3.416973 | 3.757927 |


# Naive Basis

Each sample (row) $\vec{X}$ is an $m=9$ dimensional vector; each dimension is one measurement of the location. There are 329 such vectors in the data set. Since each vector is $m$-dimensional, it lies in an $m$-dimensional vector space spanned by some orthonormal basis. Let us call the basis vectors $[b_1 \dots b_9]^T$.

Now we can write a vector $\vec{X} =b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4 + b_5x_5 + b_6x_6 + b_7x_7 + b_8x_8 + b_9x_9$


We can write the basis as an $m\times m$ matrix $B$ whose rows are the basis vectors. We could use any orthonormal basis, but the naive choice is the identity matrix. This $B$ is called the **naive basis**; it reflects the way we gathered the data.


In [20]:
# Naive basis: the 9x9 identity matrix (one basis vector per row)
B = Matrix.eye(9)
# display(B)

$$ B = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \\ b_8 \\ b_9 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 &0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 &0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 &0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 &0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 &0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 &0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} = I$$
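As a quick numerical check of the statements above, here is a minimal sketch that uses NumPy (rather than the SymPy `Matrix` from the previous cell) to rebuild a sample from its coordinates in the naive basis. The sample values are the first row of the table above (Abilene, TX); the names `x`, `B_np`, and `x_rebuilt` are illustrative and not part of the original notebook.

```python
import numpy as np

# First sample from the table above (Abilene, TX): x_1 ... x_9
x = np.array([2.716838, 3.792392, 2.374748, 2.965202, 3.605413,
              3.440437, 2.998259, 3.147676, 3.882695])

# Naive basis: b_1 ... b_9 are the rows of the 9x9 identity matrix
B_np = np.eye(9)

# Rebuild the vector as the linear combination b_1*x_1 + ... + b_9*x_9
x_rebuilt = sum(x[i] * B_np[i] for i in range(9))

# An orthonormal basis satisfies B B^T = I
is_orthonormal = np.allclose(B_np @ B_np.T, np.eye(9))

print(np.allclose(x_rebuilt, x))
print(is_orthonormal)
```

In the naive basis the coordinates of a sample are simply the recorded measurements, which is why this basis "reflects the way we gathered the data".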

# Change of Basis

_Question_: **Is there another basis, which is a linear combination of the original basis, that best re-expresses the data set?**

$X$ is an $m\times n$ matrix with $m=9$ rows (one per measurement) and $n=329$ columns (one per location). It is the transpose of the table above, so each column of $X$ is one sample. We can apply a linear transformation to this data set to generate a new matrix, which we call $Y$. In other words, a matrix $P$ transforms $X$ into $Y$. Geometrically, $P$ is a rotation and a stretch that takes $X$ to $Y$. The rows of $P$ are the new basis vectors used to express the columns of $X$.

$$PX=Y$$

- $P$: linear transformation
- $X$: original data
- $Y$: transformed data
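To make $PX=Y$ concrete, here is a minimal sketch on a small toy data set. The numbers are made up for illustration and are not the Places Rated data, and the 45-degree rotation is just one arbitrary choice of $P$. Samples are stored one per column, and the rows of `P` are the new basis vectors.

```python
import numpy as np

# Toy data (hypothetical): m = 2 measurements, n = 4 samples, one sample per column
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.0, 2.0, 3.0, 4.0]])

# P: a rotation by 45 degrees; each row of P is a new basis vector
theta = np.pi / 4
P = np.array([[ np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

# Transformed data, same shape as X
Y = P @ X

# Entry (i, j) of Y is the dot product of basis vector i with sample j
y_00 = P[0] @ X[:, 0]

print(Y.round(3))
```

Because every toy sample lies on the line $x_1 = x_2$, the first new basis vector points along the data and the second coordinate of every sample is zero up to floating-point error. The same samples are described with one number each instead of two.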


The row vectors of $P$ will become the **principal components** of $X$. Infinitely many choices of $P$ exist: we can rotate and stretch the original data set $X$ in any number of ways. To pick one of these transformations, we need to ask two fundamental questions:

- What is the best way to re-express the original data set $X$?
- What is a good choice of basis $P$?
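The following sketch illustrates why a selection criterion is needed. It builds two different bases for the same toy data and shows that both re-express the data without losing any information. The data are random numbers, not the Places Rated data, and the helper `random_orthonormal_basis` is a hypothetical name introduced here. For simplicity it only produces rotations and reflections (orthonormal rows), with no stretch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): m = 3 measurements, n = 5 samples, one sample per column
X = rng.normal(size=(3, 5))

def random_orthonormal_basis(m, rng):
    """Return an m x m matrix with orthonormal rows (from the QR factorization of a random matrix)."""
    Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
    return Q.T

P1 = random_orthonormal_basis(3, rng)
P2 = random_orthonormal_basis(3, rng)

Y1 = P1 @ X
Y2 = P2 @ X

# The two bases give different coordinates for the same samples
same_coordinates = np.allclose(Y1, Y2)

# No information is lost: for orthonormal rows the inverse of P is P^T
recovered_1 = np.allclose(P1.T @ Y1, X)
recovered_2 = np.allclose(P2.T @ Y2, X)

print(same_coordinates, recovered_1, recovered_2)
```

Both `Y1` and `Y2` are valid re-expressions of `X`, so nothing in the algebra alone prefers one over the other. That is exactly why the two questions above have to be answered before a particular $P$ can be chosen.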
 
 
 
