__The idea of dimensionality reduction__

Suppose that $X$ is an $m\times n$ data matrix, where the columns $X^i$ of $X$ represent variables and the rows $X_i$ of $X$ represent the sample elements.

The aim of dimensionality reduction is to introduce new variables such that each $X_i$ is (approximately) represented via a __smaller number of parameters__.

To do this, we want to introduce __new co-ordinates__ in $\mathbb{R}^n$.

__First step: centering the data__

Let $\overline{X} = (X_1 + \dotsm + X_m)/m = (1/m)(1,\dots,1)X$ be the mean row of $X$. Then $\overline{X}$ will be the origin of our new coordinate system.

Let $A$ be the matrix such that each row $A_i$ is $X_i - \overline{X}$. If $\widetilde{X}$ is the matrix of the same size as $X$ where each row is replaced by $\overline{X}$, then $A=X-\widetilde{X}=X-(1/m)(1,\dots,1)^T(1,\dots,1)X$.

Now the mean row of $A$ is 0.

(Sometimes people also standardize the data, but this is not necessary.)
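
The centering step can be sketched in NumPy as follows. This is a minimal illustration, assuming a small example matrix with $m=4$ rows and $n=3$ columns; the variable names `ones`, `X_bar` and `A` are chosen here only to mirror the formulas above.

```python
import numpy as np

# Example data: m = 4 sample elements (rows), n = 3 variables (columns).
X = np.array([
    [-1, -1, 0],
    [-2, -1, 2],
    [-3, -2, 3],
    [1, 1, 5],
], dtype=float)

m = X.shape[0]
ones = np.ones((m, 1))            # the column vector (1, ..., 1)^T

# Mean row: (1/m) (1, ..., 1) X
X_bar = (ones.T @ X) / m          # shape (1, n)

# Centered matrix: A = X - (1/m) (1, ..., 1)^T (1, ..., 1) X
A = X - (ones @ ones.T @ X) / m

print('mean row of X:', X_bar)
print('mean row of A:', A.mean(axis=0))
```

In practice the shorter `A = X - X.mean(axis=0)` gives the same matrix; the explicit matrix form is used here only to match the formula.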

__Dimensionality reduction and the basis change__

Suppose we want to reduce the dimensionality of the problem (that is, the number of variables) from $n$ to $r$.

We want to find an $r$-dimensional subspace $L$ of $\mathbb{R}^n$ and project our data onto $L$. A basis of $L$ will give new co-ordinate axes, that is, new variables.

Problem: how do we find $L$ and the new axes so as to minimize the information lost?

__Second step: SVD and principal axes__

Consider the SVD $A=U\Sigma V^T$.

Then the columns of $V$ are called _principal axes_.

The first $r$ columns $V^1, \dots , V^r$ form an orthonormal basis of $L$.  
_The columns $V^1, \dots, V^n$ are ordered so that the corresponding singular values are in descending order. Roughly speaking, the eigenvectors of $A^TA$ with the lowest eigenvalues bear the least information about the distribution of the data._
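
As a rough sketch of this step, the principal axes can be read off from NumPy's SVD. The example matrix and the choice $r=2$ are assumptions made for illustration; `np.linalg.svd` returns $V^T$ (here `Vt`), so the principal axes are its rows.

```python
import numpy as np

X = np.array([
    [-1, -1, 0],
    [-2, -1, 2],
    [-3, -2, 3],
    [1, 1, 5],
], dtype=float)
A = X - X.mean(axis=0)            # centered data

# Thin SVD: A = U @ diag(s) @ Vt, singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T                          # columns of V are the principal axes

r = 2
V_r = V[:, :r]                    # orthonormal basis of the subspace L

# The squared singular values are the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # sorted in descending order

print('singular values:', s)
print('eigenvalues of A^T A:', eigvals)
```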

Let $x \in \mathbb{R}^n$, and let $p=pr_Lx$ be the projection of $x$ onto $L$. The coordinates of the vector $p$ are called _principal components_ of $x$.

_They express the vector $x$, more or less, in terms of our new variables only. The new variables correspond to the vectors $V^1, \dots, V^r$, which are our new axes. These new variables express all elements of our sample more or less precisely._

__The new coordinates via principal components__

Let $x \in \mathbb{R}^n$, and let $p=pr_Lx$. The coordinates of the vector $p$ in the basis $v = \{V^1,\dots,V^r\}$ are called _principal components_ of $x$.

They are the first $r$ coordinates of the least squares solution of the system $Vy = x$,
that is, of the vector $y=(V^TV)^{-1}V^Tx=V^Tx$ (the last equality holds because $V^TV=I$).

So, the principal components of the $t$-th row $A_t$ of $A$ are the first $r$ components of the vector
$V^TA^T_t=(A_tV)^T$, that is, of the $t$-th row of the matrix $AV$.

Note that $AV=(U\Sigma V^T)V = U\Sigma$.
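
The two identities above can be checked numerically. The following is only a sketch under the same assumptions as before (a small example matrix and $r=2$); the names `y_lstsq`, `y_direct` and `scores` are illustrative.

```python
import numpy as np

X = np.array([
    [-1, -1, 0],
    [-2, -1, 2],
    [-3, -2, 3],
    [1, 1, 5],
], dtype=float)
A = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T
r = 2

# Least squares solution of V y = x for one centered row x = A_t (t = 0 here).
x = A[0]
y_lstsq = np.linalg.lstsq(V, x, rcond=None)[0]
y_direct = V.T @ x                # closed form, valid because V^T V = I
print('lstsq agrees with V^T x:', np.allclose(y_lstsq, y_direct))

# Principal components of all rows at once: the first r columns of A V.
scores = (A @ V)[:, :r]
# U * s scales the j-th column of U by s[j], which is the product U Sigma.
print('A V agrees with U Sigma:', np.allclose(scores, U[:, :r] * s[:r]))
```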

In [None]:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([
        [-1,-1,0],
        [-2,-1,2],
        [-3,-2,3],
        [1,1,5],
    ])

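# Keep the first two principal axes (r = 2).
pca = PCA(n_components = 2)
# fit_transform centers X and returns the principal components of each row.
print(pca.fit_transform(X))
print('the mean row of X is', pca.mean_)
print('the singular values are', pca.singular_values_)
# The rows of components_ are the principal axes V^1, V^2.
print('the principal axes are', pca.components_)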

PCA can be thought of as an unsupervised learning algorithm. The whole process of obtaining principal components from a raw dataset can be summarized in six steps.

- Take the whole dataset consisting of d+1 dimensions and ignore the labels such that our new dataset becomes d dimensional.
- Compute the mean for every dimension of the whole dataset.
- Compute the covariance matrix of the whole dataset.
- Compute the eigenvectors and the corresponding eigenvalues of the covariance matrix.
- Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest eigenvalues to form a d × k dimensional matrix W.
- Use this d × k eigenvector matrix to transform the samples onto the new subspace.
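
The six steps above can be sketched as follows. This is an illustrative implementation, not a reference one: the helper `pca_via_covariance`, the example data and its label column are all hypothetical, and the sketch assumes that the samples are centered by the mean from step 2 before they are projected.

```python
import numpy as np

def pca_via_covariance(data, k):
    """Hypothetical helper: PCA of an (m x d) array via the covariance matrix."""
    mean = data.mean(axis=0)                  # step 2: mean of every dimension
    centered = data - mean
    cov = np.cov(centered, rowvar=False)      # step 3: d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # step 4: eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]         # step 5: sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                        # d x k matrix of the top k eigenvectors
    projected = centered @ W                  # step 6: samples in the new subspace
    return projected, W, eigvals

# Step 1: a (d+1)-dimensional dataset whose last column is a made-up label.
labeled = np.array([
    [-1, -1, 0, 0],
    [-2, -1, 2, 0],
    [-3, -2, 3, 1],
    [1, 1, 5, 1],
], dtype=float)
features = labeled[:, :-1]                    # ignore the labels: d = 3

projected, W, eigvals = pca_via_covariance(features, k=2)
print(projected)
print('eigenvalues:', eigvals)
```

The result agrees with the SVD approach up to the signs of the axes, because the eigenvalues of the covariance matrix are the squared singular values of the centered matrix divided by $m-1$.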

How do we interpret the result from PCA?
- The direction given by $\vec{v}_1$ (called the first principal direction) in $\mathbb{R}^m$ accounts for an amount $\lambda_1$ of the total variance $T$, that is, a fraction $\frac{\lambda_1}{T}$ of the total variance. Similarly, the second principal direction $\vec{v}_2$ accounts for a fraction $\frac{\lambda_2}{T}$.
- The vector $\vec{v}_1 \in \mathbb{R}^m$ is the most significant/important direction of the data set.
- Among the directions that are orthogonal to $\vec{v}_1$, $\vec{v}_2$ is the most significant direction. Similarly, among the directions orthogonal to both $\vec{v}_1$ and $\vec{v}_2$, $\vec{v}_3$ is the most significant direction, and so on.
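
To make the fractions $\frac{\lambda_i}{T}$ concrete, here is a small sketch that computes them for an example matrix. It assumes that the $\lambda_i$ are the eigenvalues of the sample covariance matrix and that $T$ is their sum; the data and variable names are illustrative.

```python
import numpy as np

X = np.array([
    [-1, -1, 0],
    [-2, -1, 2],
    [-3, -2, 3],
    [1, 1, 5],
], dtype=float)

# Sample covariance matrix of the variables (columns of X).
cov = np.cov(X, rowvar=False)

# Eigenvalues lambda_1 >= lambda_2 >= ... of the covariance matrix.
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Total variance T: the sum of the eigenvalues, equal to the trace of cov.
T = eigvals.sum()
fractions = eigvals / T

print('total variance T:', T)
print('fraction of variance per principal direction:', fractions)
```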

__What exactly has PCA done for the data set?__

It has essentially reduced the dimension of the data set. It is often the case that the largest few eigenvalues of the covariance matrix $S$ are much greater than the rest. For example, suppose that $m=15$ and the total variance is $T=110$, with $\lambda_1=100.1$, $\lambda_2=9.5$, and $\lambda_3,\lambda_4,\dots, \lambda_{15}$ all less than 0.1. Then, even though our data points form some impossible-to-imagine figure in $\mathbb{R}^{15}$, PCA tells us that these points cluster near a 2D plane spanned by $\vec{v}_1$ and $\vec{v}_2$. To be more precise, the data points are clustered around the plane passing through the mean $\mu$ and spanned by the orthogonal vectors $\vec{v}_1$ and $\vec{v}_2$. The data points will look like a rectangular strip inside the plane. Therefore, we have reduced the problem from 15 dimensions to 2, which is in some sense easier to visualize.
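
As a quick check of this example, the share of the total variance captured by the first two principal directions works out as

$$\frac{\lambda_1+\lambda_2}{T} = \frac{100.1+9.5}{110} = \frac{109.6}{110} \approx 0.996,$$

so the plane retains roughly 99.6% of the variance, and the remaining 13 directions together account for only $110-109.6=0.4$.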