# Compressing Data via Dimensionality Reduction

- An alternative to **Feature selection** is **Feature extraction**
- Here we learn how to summarize the information content of a dataset by transforming it onto a new feature subspace of lower dimensionality


## Unsupervised dimensionality reduction via principal component analysis

- feature extraction transforms or projects the data into a new feature subspace
- in the context of dimensionality reduction, feature extraction can be understood as data compression, which can improve
  - storage space requirements
  - computational efficiency
  - predictive performance (by reducing the curse of dimensionality), especially for non-regularized models

### The main steps in principal component analysis

- principal component analysis (PCA) is an unsupervised linear transformation technique used for
  - feature extraction
  - exploratory data analysis
  - denoising of signals in stock market trading
  - analysis of genome data 
- PCA helps identify patterns based on correlation between features
- PCA aims to find the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original one.
- The orthogonal axes (principal components) of the new subspace can be interpreted as the
  - directions of maximum variance
  - given the constraint that the new feature axes are orthogonal to each other
- PC1 and PC2 denote the first and second principal components
- When using PCA for dimensionality reduction, we
  - construct a $d \times k$-dimensional transformation matrix $\mathbf{W}$
  - that maps the feature vector $\mathbf{x}$ of a training example onto a new $k$-dimensional feature subspace with fewer dimensions than the original $d$-dimensional feature space

$$
\mathbf{x} = [x_1, x_2, \dots, x_d], \quad \mathbf{x} \in \Reals^d \\
\mathbf{W} \in \Reals^{d \times k} \\
\mathbf{x}\mathbf{W} = \mathbf{z} \\
\mathbf{z} = [z_1, z_2, \dots, z_k], \quad \mathbf{z} \in \Reals^k
$$

- typically $k \ll d$
- the first principal component will have the largest possible variance
- each succeeding principal component will have the largest variance possible given that it is uncorrelated (orthogonal) to the other principal components
- PCA directions are highly sensitive to data scaling, so we need to standardize the features prior to PCA if they were measured on different scales and we want to assign equal importance to all features
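
The sensitivity to feature scaling can be illustrated with a minimal NumPy sketch. The synthetic data, the factor of 1000, and the helper name `first_principal_direction` are assumptions made only for this example; they are not part of the original notes.

```python
import numpy as np


def first_principal_direction(X):
    """Return the unit eigenvector of the covariance matrix of X
    that belongs to the largest eigenvalue (hypothetical helper)."""
    cov = np.cov(X, rowvar=False)           # features are in columns
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    return eigvecs[:, -1]                   # last column = largest eigenvalue


# Two correlated synthetic features; the second is recorded on a
# scale roughly 1000 times larger than the first.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 0.8 * a + 0.6 * rng.normal(size=500)
X_raw = np.column_stack([a, 1000.0 * b])

# Standardize: zero mean and unit variance per feature.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

pc1_raw = first_principal_direction(X_raw)
pc1_std = first_principal_direction(X_std)

# On the raw data the first direction is dominated by the large-scale
# feature; after standardization both features contribute equally.
print("PC1 on raw data:         ", pc1_raw)
print("PC1 on standardized data:", pc1_std)
```

Without standardization, the first principal direction mostly reflects the unit of measurement of the second feature rather than the correlation structure of the data.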

**PCA algorithm**

1. Standardize the $d$-dimensional dataset
2. Construct the covariance matrix
3. Decompose the covariance matrix into its eigenvectors and eigenvalues
4. Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors
5. Select $k$ eigenvectors corresponding to the $k$ largest eigenvalues ($k$ is the dimensionality of the new feature subspace)
6. Construct a projection matrix $\mathbf{W}$ from the "top" $k$ eigenvectors
7. Transform the $d$-dimensional input dataset $\mathbf{X}$ using the projection matrix $\mathbf{W}$ to obtain the new $k$-dimensional feature subspace
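
The seven steps above can be sketched in a few lines of NumPy. This is one possible reading of the algorithm, not a reference implementation: the function name `pca_project`, the random test data, and the use of `np.linalg.eigh` (chosen because a covariance matrix is symmetric) are assumptions made for illustration.

```python
import numpy as np


def pca_project(X, k):
    """Project X (n examples x d features) onto its top-k principal
    components. Hypothetical helper that follows the seven steps."""
    # 1. Standardize each feature to zero mean and unit variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Construct the d x d covariance matrix.
    cov = np.cov(X_std, rowvar=False)
    # 3. Decompose it into eigenvalues and eigenvectors (columns).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort the eigenvalues in decreasing order.
    order = np.argsort(eigvals)[::-1]
    # 5.-6. Stack the k top-ranked eigenvectors into the d x k matrix W.
    W = eigvecs[:, order[:k]]
    # 7. Transform the data: Z = X_std W has shape n x k.
    Z = X_std @ W
    return Z, W, eigvals[order]


# Synthetic data with correlated features (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

Z, W, eigvals = pca_project(X, k=2)
print("projected shape:", Z.shape)
print("W has orthonormal columns:", np.allclose(W.T @ W, np.eye(2)))
```

Because the columns of $\mathbf{W}$ are eigenvectors of the covariance matrix, the sample variance of each projected column equals the corresponding eigenvalue, so the first column of `Z` carries the largest variance.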

### Extracting the principal components step by step

1. Standardizing the data
2. Constructing the covariance matrix
3. Obtaining the eigenvalues and eigenvectors of the covariance matrix
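
To make these three steps concrete, the sketch below walks through them on a tiny hand-made array and checks that each eigenpair satisfies $\Sigma \mathbf{v} = \lambda \mathbf{v}$. The data values and variable names (`cov_mat`, `eigen_vals`, `eigen_vecs`) are made up for this example.

```python
import numpy as np

# Toy data: 6 examples, 3 features (values invented for illustration).
X = np.array([
    [2.5, 2.4, 1.0],
    [0.5, 0.7, 3.1],
    [2.2, 2.9, 0.8],
    [1.9, 2.2, 1.5],
    [3.1, 3.0, 0.3],
    [2.3, 2.7, 1.1],
])

# Step 1: standardize each feature (column).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: construct the 3 x 3 covariance matrix.
cov_mat = np.cov(X_std, rowvar=False)

# Step 3: eigendecomposition; column j of eigen_vecs pairs with eigen_vals[j].
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

# Sanity check: cov_mat @ v should equal lambda * v for every eigenpair.
# Multiplying eigen_vecs by eigen_vals scales column j by eigen_vals[j].
residual = np.abs(cov_mat @ eigen_vecs - eigen_vecs * eigen_vals).max()

print("covariance matrix:\n", cov_mat)
print("eigenvalues:", eigen_vals)
print("largest eigen-equation residual:", residual)
```

The eigenvalues sum to the trace of the covariance matrix, that is, to the total variance of the standardized features.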