# Principal Component Analysis (PCA)
This notebook explains the basic concepts of Principal Component Analysis (PCA).

Farhad Kamangar
Feb. 10, 2017

## What is PCA?
The main goal of Principal Component Analysis (PCA) is to transform a set of multi-dimensional data points from their original space into another space such that the variables in the transformed space are uncorrelated.
Once the data is projected into the new space, the dimensions with low variance can be dropped, so the original data can be represented in a lower-dimensional space without losing much information.
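
As a minimal sketch of this idea, the code below builds a small synthetic 2-D data set, projects it onto the eigenvectors of its covariance matrix, and checks that the projected variables are uncorrelated. The synthetic data, the random seed, and all variable names are assumptions made for illustration; this is one common way to compute PCA, not necessarily the route used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # Fixed seed (an assumption) so the run is repeatable

# Synthetic data: 2 correlated attributes (rows) measured on 500 samples (columns).
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=500)
data = np.vstack([x1, x2])

# Center each attribute, then estimate the covariance matrix.
centered = data - data.mean(axis=1, keepdims=True)
covariance = np.cov(centered)

# Eigenvectors of the covariance matrix are the principal directions.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]  # Sort by decreasing variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the data onto the principal directions.
projected = eigenvectors.T @ centered
projected_covariance = np.cov(projected)

print("Covariance before:\n", covariance)
print("Covariance after:\n", projected_covariance)  # Off-diagonal terms are ~0

# Keep only the first component to get a 1-D representation.
reduced = projected[:1, :]
explained = eigenvalues[0] / eigenvalues.sum()
print("Fraction of variance kept by one component:", explained)
```

Because the two synthetic attributes are strongly correlated, almost all of the variance ends up in the first component, which is why the second one can be dropped at little cost.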

## Data Presentation

Suppose we have a set of $N$ data points in a $D$-dimensional space. Each data point is then a vector with $D$ components, and the data set can be presented as a $D$ by $N$ matrix with one column per data point.

For example, assume that we measure the height (inches), weight (pounds), and waist size (inches) of 5 people and present the measurements as a data set:
$$\large \left( {\matrix{ 65 & 72 & 68 & 80 & 66 \cr  150 & 180 & 156 & 190 & 152  \cr   30 & 34 & 32 & 38 & 36  \cr  } } \right)$$

Notes:
* Each column represents one person (one data point).
* Each row represents an attribute (a dimension), such as height.
* Each person can be represented as a single point, $X$, in a 3-dimensional space.

$$\large X= \left[ {\matrix{ x_1 \cr  x_2  \cr   x_3  \cr  } } \right]$$
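
In NumPy, this layout corresponds to a 2-D array with one row per attribute and one column per person. The short sketch below (variable names are illustrative) reads the number of dimensions and data points from the array shape and pulls out a single person as a point $X$.

```python
import numpy as np

# Rows: height (in), weight (lb), waist (in); columns: the 5 people.
data = np.array([[65, 72, 68, 80, 66],
                 [150, 180, 156, 190, 152],
                 [30, 34, 32, 38, 36]])

num_dims, num_points = data.shape  # D = 3 attributes, N = 5 people
first_person = data[:, 0]          # One column is one point X = [x_1, x_2, x_3]
heights = data[0, :]               # One row is one attribute across all people

print(num_dims, num_points)        # 3 5
print(first_person.tolist())       # [65, 150, 30]
print(heights.tolist())            # [65, 72, 68, 80, 66]
```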



## Variance and Covariance
Variance of an attribute (a dimension) is defined as: 

$$\large {\mathop{\rm var}} ({x_i}) = {{\sum\limits_{k = 1}^N {{{({x_{i,k}} - \mu_i)}^2}} } \over {(N - 1)}}$$

where $x_{i,k}$ is the value of the $i$-th attribute in sample $k$, and $\mu_i$ is the sample mean of the $i$-th attribute.
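
As a quick check of this formula, the sketch below computes the variance of the height attribute from the example above by hand and compares it with NumPy's `np.var`. Passing `ddof=1` asks NumPy to divide by $N - 1$; the variable names are illustrative.

```python
import numpy as np

heights = np.array([65, 72, 68, 80, 66], dtype=float)  # Heights from the example

n = heights.size
mu = heights.sum() / n                                   # Sample mean: 70.2
variance_by_hand = float(((heights - mu) ** 2).sum() / (n - 1))
variance_numpy = float(np.var(heights, ddof=1))          # ddof=1 divides by N - 1

print(round(variance_by_hand, 4))  # 37.2
print(round(variance_numpy, 4))    # 37.2
```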

Covariance between two attributes (two dimensions) is defined as:

$$\large {\mathop{\rm cov}} ({x_i},{x_j}) = {{\sum\limits_{k = 1}^N {{{({x_{i,k}} - \mu_i)({x_{j,k}} - \mu_j)}}} } \over {(N - 1)}}$$

Notes:
* If the covariance between two attributes is positive, the two attributes tend to increase together.
* If the covariance between two attributes is negative, the two attributes tend to move in opposite directions (when one increases, the other decreases).
* If the covariance between two attributes is zero, the two attributes are uncorrelated, i.e., they have no linear relationship. Zero covariance alone does not imply that they are independent.
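
The sketch below illustrates these three cases with small hand-made arrays (the values of `x` and all names are illustrative assumptions). It computes the covariance directly from the definition above. Its last case is a pair of attributes with zero covariance even though one is a deterministic function of the other, a reminder that covariance only measures linear association.

```python
import numpy as np


def covariance(a, b):
    """Sample covariance of two equal-length 1-D arrays (divides by N - 1)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(((a - a.mean()) * (b - b.mean())).sum() / (a.size - 1))


heights = [65, 72, 68, 80, 66]
weights = [150, 180, 156, 190, 152]
cov_positive = covariance(heights, weights)  # Taller people tend to be heavier

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
cov_negative = covariance(x, -3.0 * x)  # One goes up as the other goes down
cov_zero = covariance(x, x ** 2)        # y depends on x, yet the covariance is 0

print(round(cov_positive, 4))  # 106.1
print(cov_negative)            # -7.5
print(cov_zero)                # 0.0
```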



## What is a Covariance Matrix?
The covariance matrix collects the variance of every dimension (on its diagonal) and the covariance between every pair of dimensions (off the diagonal) of a multi-dimensional data set.
As before, suppose we have a set of $N$ data points in a $D$-dimensional space, presented as a $D$ by $N$ matrix with one column per data point.
 
The covariance of the data set can be calculated as:


$$\large \Sigma = \left( {\matrix{   {E{{({x_1} - {\mu _1})}^2}} &  \ldots  & {E\left( {({x_1} - {\mu _1})({x_D} - {\mu _D})} \right)}  \cr     \vdots  &  \ddots  &  \vdots   \cr    {E\left( {({x_D} - {\mu _D})({x_1} - {\mu _1})} \right)} &  \ldots  & {E{{({x_D} - {\mu _D})}^2}}  \cr  } } \right)$$


where $E$ denotes the expected value, $x_i$ is the $i$-th dimension, and ${\mu _i}$ is the mean of the $i$-th dimension.

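The covariance matrix can also be written in matrix form. With ${\bf{X}}_k$ the $k$-th data point (a column vector) and ${\bf{\mu}}$ the vector of attribute means, the sample estimate is:

$$\large \Sigma={1 \over {N - 1}}\sum\limits_{k = 1}^N {({{\bf{X}}_k} - {\bf{ \mu}}){{({{\bf{X}}_k} - {\bf{ \mu}})}^T}} $$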
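
The matrix form can be checked numerically. The sketch below accumulates one outer product per person for the height, weight, and waist example, then compares the result with a single matrix product on the zero-mean data and with `np.cov`. It divides by $N - 1$ so that the result agrees with the variance and covariance definitions given earlier and with NumPy's default. That normalization, like the variable names, is an assumption of this sketch.

```python
import numpy as np

# Rows: height, weight, waist; columns: the 5 people of the running example.
data = np.array([[65, 72, 68, 80, 66],
                 [150, 180, 156, 190, 152],
                 [30, 34, 32, 38, 36]], dtype=float)
num_dims, num_points = data.shape
mu = data.mean(axis=1, keepdims=True)  # Column vector of attribute means

# Sum of outer products, one term per data point.
sigma_sum = np.zeros((num_dims, num_dims))
for k in range(num_points):
    deviation = data[:, [k]] - mu        # (D, 1) column for person k
    sigma_sum += deviation @ deviation.T
sigma_sum /= num_points - 1

# Same computation as one matrix product on the zero-mean data.
zero_mean = data - mu
sigma_matrix = zero_mean @ zero_mean.T / (num_points - 1)

print(np.round(sigma_matrix, 1))
print(np.allclose(sigma_sum, sigma_matrix), np.allclose(sigma_matrix, np.cov(data)))
```

The single matrix product is the form usually preferred in practice, since it avoids the explicit Python loop.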

Let's calculate the covariance matrix of the above example, i.e., the height, weight, and waist size of 5 people:


In [1]:
# imports
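try:
    if __IPYTHON__:  # Defined only when running under IPython/Jupyter
        from IPython import get_ipython

        # Same effect as the `%matplotlib` line magic (the older .magic() call is deprecated).
        get_ipython().run_line_magic('matplotlib', '')
        from ipython_utilities import *  # Local helper module for this notebook
        from ipywidgets import interact, interactive, fixed, \
            FloatSlider, IntSlider, FloatRangeSlider, Label
        from IPython.display import display, HTML
        in_ipython_flag = True
except Exception:
    # Not running under IPython, or a notebook-only package is unavailable.
    in_ipython_flag = False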
import cv2 as cv
from matplotlib import pyplot as plt
import numpy as np
from threading import Thread

Using matplotlib backend: Qt4Agg


In [2]:
# Rows are attributes (height, weight, waist); columns are the 5 people.
data = np.array([[65, 72, 68, 80, 66],
                 [150, 180, 156, 190, 152],
                 [30, 34, 32, 38, 36]])
mean = np.mean(data, axis=1, keepdims=True)  # Mean of each attribute, shape (3, 1)
zero_mean_data = data - mean
covariance_matrix = np.cov(data)  # Rows are treated as variables; divides by N - 1
display_as_html_table(data, "Original Data Set")
display_as_html_table(mean, "Mean")
display_as_html_table(zero_mean_data, "Zero mean data set")
display_as_html_table(covariance_matrix, "Covariance Matrix")

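Original Data Set

| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 65.0 | 72.0 | 68.0 | 80.0 | 66.0 |
| 150.0 | 180.0 | 156.0 | 190.0 | 152.0 |
| 30.0 | 34.0 | 32.0 | 38.0 | 36.0 |

Mean

| 0 |
|---|
| 70.2 |
| 165.6 |
| 34.0 |

Zero mean data set

| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| -5.2 | 1.8 | -2.2 | 9.8 | -4.2 |
| -15.6 | 14.4 | -9.6 | 24.4 | -13.6 |
| -4.0 | 0.0 | -2.0 | 4.0 | 2.0 |

Covariance Matrix

| 0 | 1 | 2 |
|---|---|---|
| 37.2 | 106.1 | 14.0 |
| 106.1 | 330.8 | 38.0 |
| 14.0 | 38.0 | 10.0 |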


In [3]:
from numpy.random import standard_normal
from matplotlib.patches import Ellipse
from numpy.linalg import svd


def plot_2d_pca(mean_x,
                mean_y,
                sigma_x,
                sigma_y,
                rotation_angle,
                center=True):
    rotation_angle = np.deg2rad(rotation_angle)  # The slider value is in degrees
    mean = np.array([mean_x, mean_y])
    sigma = np.array([sigma_x, sigma_y])
    rotation_matrix = np.array([[np.cos(rotation_angle), -np.sin(rotation_angle)],
                                [np.sin(rotation_angle), np.cos(rotation_angle)]])
    # Draw 1000 standard-normal points, scale each axis by sigma, rotate, then shift by the mean.
    data_set = np.dot(standard_normal((1000, 2)) * sigma[np.newaxis, :], rotation_matrix.T) + mean[np.newaxis, :]

    # plt.figure reuses the "PCA Demo" figure if it already exists, so the
    # axes are always defined before they are cleared and redrawn.
    fig = plt.figure("PCA Demo", figsize=(8, 8))
    ax = fig.gca()
    ax.clear()
    ax.scatter(data_set[:200, 0], data_set[:200, 1], marker='*')
    ax.grid()
    limit = 10.0
    ax.set_xlim([-limit, limit])
    ax.set_ylim([-limit, limit])
#     ellipse = Ellipse(xy=np.array([mean_x, mean_y]), width=sigma_x * 3, height=sigma_y * 3, angle=rotation_angle / np.pi * 180,
#                 facecolor=[1.0, 0, 0], alpha=0.3)
#     ax.add_artist(ellipse)
    if center:
        X_mean = data_set.mean(axis=0, keepdims=True)
    else:
        X_mean = np.zeros((1, 2))
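    # numpy's svd returns V transposed: the rows of Vt are the principal directions,
    # and s / sqrt(N) is the spread of the data along each of them.
    U, s, Vt = svd(data_set - X_mean, full_matrices=False)
    for v in np.dot(np.diag(s / np.sqrt(data_set.shape[0])), Vt):  # Each scaled principal direction
        ax.arrow(X_mean[0, 0], X_mean[0, 1], -v[0], -v[1], width=0.02,
                 head_width=0.1, head_length=0.1, fc='r', ec='b')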
    plt.show()


controls = interactive(plot_2d_pca,
                       mean_x=FloatSlider(min=-10.0, max=10.0, value=0),
                       mean_y=FloatSlider(min=-10.0, max=10.0, value=0),
                       sigma_x=FloatSlider(min=0.1, max=4, value=1.0),
                       sigma_y=FloatSlider(min=0.1, max=4, value=0.5),
                       rotation_angle=FloatSlider(min=0.0, max=180, value=30.0),
                       center=True);
arrange_widgets_in_grid(controls)
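
The `plot_2d_pca` function above obtains the principal directions from a singular value decomposition of the centered data rather than from an explicit covariance matrix. The non-interactive sketch below generates data the same way and checks two facts: the rows of the third SVD factor match the eigenvectors of the covariance matrix up to sign, and the singular values divided by $\sqrt{N - 1}$ are the standard deviations along those directions. The sample size, seed, and variable names are assumptions. The function itself divides by $\sqrt{N}$, which is nearly identical for 1000 points.

```python
import numpy as np

rng = np.random.default_rng(0)  # Fixed seed (an assumption) for repeatability
num_points = 20000
sigma = np.array([1.0, 0.5])    # Standard deviations before rotation
angle = np.deg2rad(30.0)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])

# Same construction as plot_2d_pca: scale, then rotate. Rows are data points.
data_set = (rng.standard_normal((num_points, 2)) * sigma) @ rotation.T
centered = data_set - data_set.mean(axis=0, keepdims=True)

# SVD route: rows of vt are the principal directions.
_, s, vt = np.linalg.svd(centered, full_matrices=False)
std_from_svd = s / np.sqrt(num_points - 1)

# Covariance route: eigen-decomposition, sorted by decreasing eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("Std from SVD:       ", std_from_svd)
print("Sqrt of eigenvalues:", np.sqrt(eigenvalues))

# Directions agree only up to sign, so compare absolute dot products.
alignment = np.abs(np.sum(vt * eigenvectors.T, axis=1))
print("Alignment (1 = same direction):", alignment)
```

The recovered standard deviations should be close to the `sigma` values used to generate the data, and the first principal direction should lie along the 30 degree rotation, up to sampling noise.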
