## PCA for Dimensionality Reduction

In this notebook we explore Principal Component Analysis (PCA) and how it can be used to interpret data.
We first review how PCA is defined, what exactly the principal components are, and how PCA allows us to reduce the dimensionality of our dataset in a *good* way. 
We then see how to use PCA in Python by analyzing a [diabetes dataset](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html) provided by NCSU
and examining what its principal components are.


### Principal Components of a Dataset 

Suppose we have a dataset of $n$ samples, where each sample is described by $m$ attributes.
We can interpret this as a matrix $X$ of size $n \times m$ where each row vector corresponds to a sample.
For the purpose of dimensionality reduction, we would like to find a matrix $Y$ of size $n \times d$ where $d \ll m$ and $Y$ approximates $X$
(that is, $Y$ still captures the underlying properties of the dataset).
In this way, the $d$ remaining attributes would be the components of $X$ that capture the most relevant information; these are the **principal components** of $X$.
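
To make the shapes concrete, here is a minimal sketch that treats scikit-learn's `PCA` as a black box for now. The data is random and the sizes `n`, `m`, `d` are arbitrary choices made only for illustration; they are not taken from the diabetes dataset. Note that the columns of `Y` produced this way are new attributes built as linear combinations of the original ones, not a subset of the original columns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative sizes only (assumed for this sketch).
n, m, d = 100, 10, 3

rng = np.random.default_rng(0)
X = rng.normal(size=(n, m))      # n samples, m attributes

# Reduce each sample from m attributes to d attributes.
pca = PCA(n_components=d)
Y = pca.fit_transform(X)         # n samples, d attributes

print(X.shape, Y.shape)
```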

Before we see how we can find these values, let's consider some scenarios where such a reduction can occur. 
One possibility is that a set of attributes is the same for every sample: these attributes carry no information that lets us distinguish one sample from another.
Therefore we could simply remove this set of attributes, shrinking the size of $X$.
Typically this is not the case, however, and there is some variability between sample values for a particular attribute.
An attribute that has less variability amongst its values may contain less information than an attribute with high variability and therefore could be removed from $X$ to generate $Y$. 
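
As a rough sketch of this idea, we can compute the variance of each column of a small synthetic matrix and drop the columns that never change. The three columns and their spreads are made up for this example. Keep in mind that comparing variances across attributes is only meaningful when the attributes are on comparable scales.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Three hypothetical attributes: constant, low variability, high variability.
X = np.column_stack([
    np.full(n, 5.0),
    rng.normal(0.0, 0.1, n),
    rng.normal(0.0, 3.0, n),
])

# Variance of each attribute (column) across the samples.
variances = X.var(axis=0)
print(variances)

# A constant attribute has zero variance and cannot distinguish samples,
# so we keep only the columns that actually vary.
keep = variances > 0
X_reduced = X[:, keep]
print(X_reduced.shape)
```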

Another possibility is that there is some dependency between attributes within the data.
For example, if we find $x_i = x_j + x_k$ in every sample for some attributes $i, j, k$, then attribute $i$ is fully determined by $j$ and $k$ and could be removed.
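
One way to see such a dependency numerically is through the rank of $X$: a column that is a sum of two others adds no new direction to the data. The sketch below builds a synthetic three-attribute matrix (the names `x_i`, `x_j`, `x_k` mirror the example above and are otherwise arbitrary) and checks its rank with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two independent attributes and a third that is their sum.
x_j = rng.normal(size=n)
x_k = rng.normal(size=n)
x_i = x_j + x_k

X = np.column_stack([x_i, x_j, x_k])

# Three columns, but only two of them are linearly independent.
rank = np.linalg.matrix_rank(X)
print(X.shape, rank)

# Dropping attribute i loses nothing: it can be rebuilt from j and k.
X_reduced = X[:, 1:]
rebuilt = X_reduced.sum(axis=1)
print(np.allclose(rebuilt, X[:, 0]))
```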


In [4]:
from sklearn.datasets import load_diabetes
dia = load_diabetes()


In [8]:
dia.data.shape

(442, 10)

In [15]:
dia.target.shape

(442,)

In [19]:
dia.data[:, 0].min()

-0.1072256316073538

In [20]:
dia.data[:5]

array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187239, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632753, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567042, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286131, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665608,  0.01219057,
         0.02499059, -0.03603757,  0.03430886,  0.02268774, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469,  0.02187239,  0.00393485,
         0.01559614,  0.00814208, -0.00259226, -0.03198764, -0.04664087]])
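
The small values centered around zero are consistent with how scikit-learn documents its copy of this dataset: each feature column is mean-centered and scaled so that its squared entries sum to 1. The following quick check is a sketch of how we might confirm this; the variable names are our own.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X = load_diabetes().data

# Mean of each attribute across the samples (expected to be close to 0).
col_means = X.mean(axis=0)

# Sum of squared values of each attribute (expected to be close to 1).
col_sq_sums = (X ** 2).sum(axis=0)

print(np.round(col_means, 6))
print(np.round(col_sq_sums, 6))
```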