# Example 30: PCA for dimensionality reduction

## Contents
* [Acknowledgements](#ackw)
* [Overview](#overview) 
    * [Principal Component Analysis](#ekf)
    * [Summary of PCA](#sumekf)
    * [Test Case](#motion_model)
* [Include files](#include_files)
* [The main function](#m_func)
* [Results](#results)
* [Source Code](#source_code)

## <a name="overview"></a> Overview


In this notebook we will discuss <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis</a> for dimensionality reduction.

Dimensionality reduction refers to a number of techniques for reducing the number of dimensions associated with a data set. Consider, for example, a data set where each input point has five features. A dimensionality reduction technique can help us reduce the number of features to three or two.

One perhaps apparent reason for reducing the number of features in a data set is visualization. It is easy to visualize two or even three dimensional data sets. However, as the number of dimensions increases, so does the difficulty of doing so, both in terms of the computing power required and in terms of conceptual understanding.

Apart from visualization, another reason to reduce the number of dimensions of a data set is that a large number of input features can cause poor performance of ML algorithms. This is frequently referred to as the curse of dimensionality. Furthermore, fewer input dimensions often translate to fewer model parameters or a simpler model structure in general. A model with too many parameters is likely to overfit the training set and therefore may not perform well on new data. Finally, dimensionality reduction may also be appropriate when the variables in a data set are noisy.

In general, there are two main classes of dimensionality reduction techniques, namely feature selection and feature extraction. With feature selection, as the name implies, we select a subset of the original features. With feature extraction, on the other hand, we derive information from the data set in order to build new features altogether.

### <a name="ekf"></a> Principal Component Analysis

PCA can be thought of as fitting a p-dimensional ellipsoid to the data (see figure below). Each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small, and by omitting that axis and its corresponding principal component from our representation of the dataset, we lose only an equally small amount of information. 

![PCA eigenvectors](pca_vectors.jpeg)

Thus, the major question is how we can find the axes of the ellipsoid. It turns out that there is a structured approach for doing so:

1. We first center the data around the origin by subtracting the mean of each variable from the dataset. 
2. Then, we compute the covariance matrix of the data and calculate its eigenvalues and corresponding eigenvectors, i.e. we perform an eigenvalue decomposition. 
3. Finally, we normalize each of the orthogonal eigenvectors so that they become unit vectors. 



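The vectors from step 2 are, equivalently, the eigenvectors of the matrix $\mathbf{X}^T\mathbf{X}$: for centered data with $n$ rows the sample covariance matrix is $\mathbf{X}^T\mathbf{X}/(n-1)$, and scaling a matrix by a constant does not change its eigenvectors. Thus they should satisfy the following equation: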

$$\mathbf{X}^T  \mathbf{X}\mathbf{w}_j = \lambda_j\mathbf{w}_{j}$$

where $\lambda_j$ is the eigenvalue corresponding to the eigenvector $\mathbf{w}_j$.

Note that the matrix $\mathbf{X}^T  \mathbf{X}$ is symmetric and therefore its eigenvectors can be chosen to be mutually orthogonal. At step three we normalize $\mathbf{w}_j$ to have length one. Once this is done, each of the mutually orthogonal, unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data.

This choice of basis transforms the covariance matrix into diagonal form, with the diagonal elements representing the variance along each axis. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues.
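
The three steps above can be sketched for a two-feature data set, where the eigenvalues of the symmetric $2 \times 2$ covariance matrix are available in closed form. This is only an illustrative sketch using the standard library; the names `Matrix`, `Pca2D` and `pca_2d` are hypothetical and are not part of the code base used in this example.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, n x 2

// Hypothetical result type: eigenvalues sorted in descending order,
// eigenvectors[j] is the unit eigenvector w_j of the covariance matrix.
struct Pca2D {
    std::array<double, 2> eigenvalues;
    std::array<std::array<double, 2>, 2> eigenvectors;
};

Pca2D pca_2d(Matrix X) {
    assert(X.size() > 1);
    const double n = static_cast<double>(X.size());

    // Step 1: center both columns around the origin.
    double mean_x = 0.0, mean_y = 0.0;
    for (const auto& row : X) {
        assert(row.size() == 2);
        mean_x += row[0];
        mean_y += row[1];
    }
    mean_x /= n;
    mean_y /= n;
    for (auto& row : X) {
        row[0] -= mean_x;
        row[1] -= mean_y;
    }

    // Step 2a: sample covariance matrix [[a, b], [b, c]].
    double a = 0.0, b = 0.0, c = 0.0;
    for (const auto& row : X) {
        a += row[0] * row[0];
        b += row[0] * row[1];
        c += row[1] * row[1];
    }
    a /= (n - 1.0);
    b /= (n - 1.0);
    c /= (n - 1.0);

    // Step 2b: eigenvalues of a symmetric 2 x 2 matrix in closed form.
    const double half_trace = 0.5 * (a + c);
    const double root = std::sqrt(0.25 * (a - c) * (a - c) + b * b);
    Pca2D result;
    result.eigenvalues = {half_trace + root, half_trace - root};

    // Step 3: unit eigenvectors. For b != 0 the vector (b, lambda - a)
    // solves (C - lambda I) w = 0; it is then scaled to length one.
    for (std::size_t j = 0; j < 2; ++j) {
        double wx = b;
        double wy = result.eigenvalues[j] - a;
        if (b == 0.0) {  // already diagonal: the axes are the coordinate axes
            const bool along_x = (j == 0) == (a >= c);
            wx = along_x ? 1.0 : 0.0;
            wy = along_x ? 0.0 : 1.0;
        }
        const double length = std::hypot(wx, wy);
        result.eigenvectors[j] = {wx / length, wy / length};
    }
    return result;
}
```

The fraction of variance carried by the first axis is then `eigenvalues[0] / (eigenvalues[0] + eigenvalues[1])`. Note that an eigenvector is only defined up to its sign, so a library routine may return the same axis pointing in the opposite direction.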

Another way to perform PCA is via a <a href="https://en.wikipedia.org/wiki/Singular_value_decomposition">Singular Value Decomposition</a> (SVD).

### Singular Value Decomposition

The SVD of a matrix $\mathbf{A}$ has the following form

$$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$$

where $\mathbf{\Sigma}$ is  a diagonal-like matrix containing the singular values $\sigma_i$ and $\mathbf{V}$ is an orthogonal matrix. 

Schematically, SVD is shown in the figure below

![SVD Schematics](svd.png)

The eigenvalues $\lambda_i$ of $\mathbf{A}^T\mathbf{A}$ and the singular values $\sigma_i$ of $\mathbf{A}$ are connected via:

$$\lambda_i = \sigma_{i}^2$$

Furthermore, the columns of $\mathbf{V}$ are the eigenvectors for $\mathbf{A}^T\mathbf{A}$. Moreover, the SVD process orders the singular values according to their size i.e.: 

$$\sigma_1 \ge \sigma_2 \ge \dots \ge 0$$

Hence, we can use the first column of $\mathbf{V}$ as $\mathbf{w}_1$, since this is the eigenvector associated with the largest singular value $\sigma_1 = \sqrt{\lambda_1}$. The remaining $\mathbf{w}_i$ follow in the same way from the subsequent columns. The columns of $\mathbf{V}$ are called the principal components.
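
To make the link between $\mathbf{A}^T\mathbf{A}$ and the leading singular pair more tangible, the sketch below uses power iteration on $\mathbf{A}^T\mathbf{A}$ to approximate $\mathbf{w}_1$ and then recovers $\sigma_1 = \|\mathbf{A}\mathbf{w}_1\|$. This is a different technique from a full SVD routine and is shown only for illustration; all names (`multiply`, `multiply_transposed`, `dominant_singular_pair`, `DominantPair`) are hypothetical helpers, and the fixed iteration count is an assumption that works when $\sigma_1$ is clearly larger than $\sigma_2$.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, n x d
using Vector = std::vector<double>;

// y = A x
Vector multiply(const Matrix& A, const Vector& x) {
    Vector y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

// y = A^T x
Vector multiply_transposed(const Matrix& A, const Vector& x) {
    Vector y(A.front().size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < y.size(); ++j)
            y[j] += A[i][j] * x[i];
    return y;
}

double norm(const Vector& v) {
    double sum = 0.0;
    for (double x : v) sum += x * x;
    return std::sqrt(sum);
}

struct DominantPair {
    double sigma;  // approximation of the largest singular value
    Vector v;      // approximation of the first column of V (unit length)
};

DominantPair dominant_singular_pair(const Matrix& A, std::size_t iterations = 100) {
    assert(!A.empty() && !A.front().empty());
    // Starting guess; it must not be orthogonal to the sought eigenvector.
    Vector v(A.front().size(), 1.0);
    for (std::size_t k = 0; k < iterations; ++k) {
        v = multiply_transposed(A, multiply(A, v));  // v <- A^T A v
        const double length = norm(v);
        assert(length > 0.0);
        for (double& x : v) x /= length;  // keep v a unit vector
    }
    return {norm(multiply(A, v)), v};  // sigma_1 = ||A v_1||
}
```

Squaring the returned `sigma` gives the corresponding eigenvalue $\lambda_1$ of $\mathbf{A}^T\mathbf{A}$, in line with $\lambda_i = \sigma_i^2$.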

#### Total variance

When we discuss PCA, the terms total variance and variance explained frequently come up. Let's briefly see how these terms are defined.

In simple words, the total variance is the sum of variances of all individual principal components. Moreover, the fraction of variance explained by a principal component is the ratio between the variance of that principal component and the total variance.

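But how do we calculate these quantities? This is simple once we know the principal components, i.e. the eigenvectors of $\mathbf{X}^T  \mathbf{X}$; concretely, each column of $\mathbf{V}$ is such an eigenvector. For centered data with $n$ rows, the sample variance along principal component $j$ is $\lambda_j/(n-1) = \sigma_j^2/(n-1)$, and the total variance is the sum of these values.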

As a side note, if we compute the total variance over the original data set and the total variance over the transformed data set then these should be the same. However, in the latter case, the total variance is redistributed among the new variables unequally. Specifically, the first variable not only explains the most variance among the new variables, but the most variance a single variable can possibly explain. 
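
The invariance of the total variance can be checked numerically. The sketch below computes the total sample variance of a two-feature data set before and after expressing it in a rotated orthonormal basis; any rotation angle should give the same total, only the split between the two variables changes. The helper names `total_variance` and `rotate_2d` are hypothetical and the restriction to two features is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, n x d

// Sum of the sample variances (n - 1 denominator) of all columns.
double total_variance(const Matrix& X) {
    assert(X.size() > 1);
    const std::size_t n = X.size();
    const std::size_t d = X.front().size();
    double total = 0.0;
    for (std::size_t j = 0; j < d; ++j) {
        double mean = 0.0;
        for (std::size_t i = 0; i < n; ++i) mean += X[i][j];
        mean /= static_cast<double>(n);

        double sum_sq = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            sum_sq += (X[i][j] - mean) * (X[i][j] - mean);
        total += sum_sq / static_cast<double>(n - 1);
    }
    return total;
}

// Expresses two-feature data in a basis rotated by `angle` radians.
// A rotation is an orthogonal transformation, just like multiplying by V.
Matrix rotate_2d(const Matrix& X, double angle) {
    const double c = std::cos(angle);
    const double s = std::sin(angle);
    Matrix T;
    T.reserve(X.size());
    for (const auto& row : X) {
        assert(row.size() == 2);
        T.push_back({c * row[0] + s * row[1], -s * row[0] + c * row[1]});
    }
    return T;
}
```

For instance, `total_variance(X)` and `total_variance(rotate_2d(X, 0.7))` should agree up to rounding for any `X` with two columns.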

As a second side note, the fraction of variance explained by principal component $j$ is also equal to:

$$\frac{\lambda_j}{\sum_i \lambda_i}$$
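
Since $\lambda_i = \sigma_i^2$, this ratio can be evaluated directly from the singular values. A minimal sketch follows; the function name `explained_variance_ratio` is a hypothetical helper and not the interface of the `PCA` class used later in this example.

```cpp
#include <cassert>
#include <vector>

// Returns lambda_j / sum_i lambda_i for every component, with lambda_i = sigma_i^2.
std::vector<double> explained_variance_ratio(const std::vector<double>& singular_values) {
    double sum_eigenvalues = 0.0;
    for (double s : singular_values) sum_eigenvalues += s * s;
    assert(sum_eigenvalues > 0.0);

    std::vector<double> ratios;
    ratios.reserve(singular_values.size());
    for (double s : singular_values) ratios.push_back(s * s / sum_eigenvalues);
    return ratios;
}
```

The common factor $1/(n-1)$ that turns eigenvalues into variances cancels in the ratio, which is why the number of samples does not appear.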

#### Dimensionality reduction

So by now, we have a process that allows us to calculate the principal directions in the data set $\mathbf{X}$. How can we use that in order to reduce the dimensions of the data set? The answer is that we don't have to retain the whole $\mathbf{V}$ but instead only the eigenvectors that correspond to the largest eigenvalues $\lambda_i$ or equivalently to the largest singular values $\sigma_i$. The new data set then is given by the following transformation:


$$\mathbf{X}_{new} = \mathbf{X}\mathbf{V}_{L}$$

where $\mathbf{V}_{L}$ means that only the first $L$ columns of $\mathbf{V}$ are retained and $L \ll d$, with $d$ being the number of original features. Thus, $\mathbf{V}_{L}$ contains the $L$ principal components with the largest variance. Even if the input features are correlated, the resulting principal components are mutually orthogonal and the transformed variables are uncorrelated.
This transformation therefore reduces the number of features from $d$ to $L$.
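
The transformation is just a matrix product with the truncated matrix. A small sketch, assuming row-major storage in nested `std::vector`s; the function name `project` is hypothetical and it expects the principal directions as the columns of `V`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major storage

// Computes X_new = X * V_L, where V_L holds the first L columns of V.
// X is n x d; V is d x d with the principal directions as its columns.
Matrix project(const Matrix& X, const Matrix& V, std::size_t L) {
    assert(!X.empty() && !V.empty());
    const std::size_t d = V.size();
    assert(L >= 1 && L <= V.front().size());

    Matrix X_new(X.size(), std::vector<double>(L, 0.0));
    for (std::size_t i = 0; i < X.size(); ++i) {
        assert(X[i].size() == d);
        for (std::size_t j = 0; j < L; ++j)
            for (std::size_t k = 0; k < d; ++k)
                X_new[i][j] += X[i][k] * V[k][j];
    }
    return X_new;
}
```

Calling `project(X, V, 1)` on a six-by-two data set returns a six-by-one matrix, i.e. each point is replaced by its coordinate along the first principal direction.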

### <a name="sumekf"></a> Summary of PCA

Principal Component Analysis or PCA is a technique to obtain the so-called principal components of a data set. The principal components correspond to the eigenvectors of the matrix $\mathbf{X}^T  \mathbf{X}$. This matrix is symmetric and thus the components are orthogonal to each other. Therefore, we can use them to transform the data.

The first $k$ principal components (where $k$ can be 1, 2, 3 etc.) explain the most variance any $k$ variables can explain, and the last $d-k$ components explain the least variance any $d-k$ variables can explain. By retaining only the first $k$ components we can reduce a $d$-dimensional data set to a $k$-dimensional one.
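
One common way to pick $k$ in practice is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below illustrates this rule; the threshold, the function name `components_for_variance` and the requirement that the singular values are sorted in descending order are assumptions of this illustration, not something prescribed by the example code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the smallest k such that the first k components explain at least
// `threshold` (a fraction in (0, 1]) of the total variance.
// The singular values are assumed to be sorted from largest to smallest.
std::size_t components_for_variance(const std::vector<double>& singular_values,
                                    double threshold) {
    assert(threshold > 0.0 && threshold <= 1.0);
    double total = 0.0;
    for (double s : singular_values) total += s * s;
    assert(total > 0.0);

    double cumulative = 0.0;
    for (std::size_t k = 0; k < singular_values.size(); ++k) {
        cumulative += singular_values[k] * singular_values[k];
        if (cumulative / total >= threshold) return k + 1;
    }
    // Only reachable through rounding when threshold is (almost) one.
    return singular_values.size();
}
```

The threshold trades off compression against lost information, so its value depends on the application.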

When performing dimensionality reduction, one should bear in mind that we project from a higher dimensional space to a lower dimensional one. Thus, information is unavoidably lost. When it comes to PCA, if the number of variables is large, it becomes hard to interpret the principal components. Furthermore, the technique is mostly suitable when the associated features have a linear relationship among them. Finally, PCA is sensitive to outliers.

### <a name="motion_model"></a> Test Case

Let us first consider the following toy example. The data set $\mathbf{X}$ is:

```
DynMat<real_t> X(6, 2);
X(0,0) = -1.;
X(0,1) = -1.;
    
X(1,0) = -2.;
X(1,1) = -1.;
    
X(2,0) = -3.;
X(2,1) = -2.;
    
X(3,0) = 1.;
X(3,1) = 1.;
    
X(4,0) = 2.;
X(4,1) = 1.;
    
X(5,0) = 3.;
X(5,1) = 2.;
```

Observe that the empirical mean for each of the two columns is zero. The <a href="https://bitbucket.org/blaze-lib/blaze/src/master/">Blaze</a> library that we use to represent matrices and vectors has support for SVD. We will use the following function in the code below:


```
template< typename MT1, bool SO, typename VT, bool TF, typename MT2, typename MT3 >
void svd( const DenseMatrix<MT1,SO>& A, DenseMatrix<MT2,SO>& U,
          DenseVector<VT,TF>& s, DenseMatrix<MT3,SO>& V );
```

The example above is rather simple. We will use the <a href="https://archive.ics.uci.edu/ml/datasets/wine">wine</a> data set as a more complicated example. This data set has 178 examples and 12 features. We can load the data set by issuing 

```
auto data = kernel::load_wine_data_set(false);
```

Furthermore, we will use the ```PCA``` class that helps us with maintaining the relevant information. Note that the class only transforms the supplied data set according to the transformation given above. However, it is the application's responsibility to scale the data appropriately if necessary.
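
As an illustration of what such scaling could look like, the sketch below standardizes every column to zero mean and unit sample variance (a z-score). This is only one possible choice, and `standardize_columns` is a hypothetical helper that is not part of the library used here. Whether to standardize depends on the data: when features are measured in very different units, the features with the largest raw variance tend to dominate the principal components.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, n x d

// Rescales every column in place to zero mean and unit sample variance.
// Constant columns (zero variance) are only centered.
void standardize_columns(Matrix& X) {
    assert(X.size() > 1);
    const std::size_t n = X.size();
    const std::size_t d = X.front().size();
    for (std::size_t j = 0; j < d; ++j) {
        double mean = 0.0;
        for (std::size_t i = 0; i < n; ++i) mean += X[i][j];
        mean /= static_cast<double>(n);

        // Center the column and accumulate the sum of squares.
        double sum_sq = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            X[i][j] -= mean;
            sum_sq += X[i][j] * X[i][j];
        }

        const double stddev = std::sqrt(sum_sq / static_cast<double>(n - 1));
        if (stddev > 0.0)
            for (std::size_t i = 0; i < n; ++i) X[i][j] /= stddev;
    }
}
```

After this step every feature contributes the same variance, so PCA effectively works on the correlation structure of the data rather than on the raw scales.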

## <a name="include_files"></a> Include files

```
#include "cubic_engine/base/cubic_engine_types.h"
#include "kernel/maths/matrix_utilities.h"
#include "kernel/utilities/data_set_loaders.h"
#include "kernel/maths/pca.h"

#include <iostream>
#include <stdexcept>
```

## <a name="m_func"></a> The main function

```
namespace example
{

using cengine::uint_t;
using cengine::real_t;
using cengine::DynMat;
using cengine::DynVec;

void test_case_1(){

    DynMat<real_t> X(6, 2);
    X(0,0) = -1.;
    X(0,1) = -1.;

    X(1,0) = -2.;
    X(1,1) = -1.;

    X(2,0) = -3.;
    X(2,1) = -2.;

    X(3,0) = 1.;
    X(3,1) = 1.;

    X(4,0) = 2.;
    X(4,1) = 1.;

    X(5,0) = 3.;
    X(5,1) = 2.;

    // calculate the sample variance
    // of each of the 2 variables (columns)
    auto col1 = kernel::get_column(X, 0);
    auto col2 = kernel::get_column(X, 1);

    auto col1_var = var(col1);
    auto col2_var = var(col2);

    std::cout<<"Variable 1 variance: "<<col1_var<<std::endl;
    std::cout<<"Variable 2 variance: "<<col2_var<<std::endl;

    // compute the total variance
    auto total_var = col1_var + col2_var;

    std::cout<<"Total variance: "<<total_var<<std::endl;

    DynMat<real_t> U;
    DynVec<real_t> s;
    DynMat<real_t> V;

    std::cout<<"Variable 1 explains: "<<col1_var/total_var<<std::endl;
    std::cout<<"Variable 2 explains: "<<col2_var/total_var<<std::endl;

    svd(X, U, s, V );

    std::cout<<"Singular values: "<<s<<std::endl;

    auto sum_eigen_values = 0.0;
    for(uint_t v=0; v<s.size(); ++v){
       sum_eigen_values += s[v]*s[v];
    }

    std::cout<<"Sum eignenvalies: "<<sum_eigen_values<<std::endl;
    //std::cout<<"Variable 1 variance: "<<s[0]*s[0]<<std::endl;
    //std::cout<<"Variable 2 variance: "<<s[1]*s[1]<<std::endl;
    std::cout<<"Variable 1 explains: "<<(s[0]*s[0])/sum_eigen_values<<std::endl;
    std::cout<<"Variable 2 explains: "<<(s[1]*s[1])/sum_eigen_values<<std::endl;

    // Principal axes in feature space,
    // representing the directions of maximum variance in the data.
    // these are the columns of the V matrix
    std::cout<<"V matrix: "<<V<<std::endl;

    // transform the data set with PCA
    // The full principal components decomposition of
    // X can be given as T = XV
    DynMat<real_t> T = X*V;

    // calculate the sample variance
    // of each of the 2 transformed variables (columns)
    auto pca_col1 = kernel::get_column(T, 0);
    auto pca_col2 = kernel::get_column(T, 1);

    auto pca_col1_var = var(pca_col1);
    auto pca_col2_var = var(pca_col2);

    std::cout<<"PCA variable 1 variance: "<<pca_col1_var<<std::endl;
    std::cout<<"PCA variable 2 variance: "<<pca_col2_var<<std::endl;

    // compute the total variance; this should be
    // the same as the total variance of the original data
    auto pca_total_var = pca_col1_var + pca_col2_var;

    std::cout<<"PCA Total variance: "<<pca_total_var<<std::endl;

    std::cout<<"PCA Variable 1 explains: "<<pca_col1_var/pca_total_var<<std::endl;
    std::cout<<"PCA Variable 2 explains: "<<pca_col2_var/pca_total_var<<std::endl;

}

void test_case_2(){

    using  kernel::PCA;

    // load the wine data set
    auto data = kernel::load_wine_data_set(false);

    // extract the column means
    auto means = kernel::get_column_means(data.first);

    // center the columns
    kernel::center_columns(data.first, means);

    auto variances = kernel::get_column_variances(data.first);
    auto total_var = sum(variances);
    std::cout<<"Total variance: "<<total_var<<std::endl;

    for(uint_t c=0; c<variances.size(); ++c){
        std::cout<<"Variable: "<<c<<" explains: "<<variances[c]/total_var<<std::endl;
    }

    // keep the first three components
    // with the largest variance
    PCA pca(3);

    // transform the data
    pca.fit(data.first);

    auto singular_vals = pca.get_singular_values();
    std::cout<<"Singular values: "<<singular_vals<<std::endl;

    auto explained_var = pca.get_explained_variance();

    for(uint_t c=0; c<explained_var.size(); ++c){
        std::cout<<"Component: "<<c<<" explains: "<<explained_var[c]<<std::endl;
    }

}

}

int main() {
   
    using namespace example;
    
    try{

        std::cout<<"========================="<<std::endl;
        std::cout<<"Doing test 1"<<std::endl;
        std::cout<<"========================="<<std::endl;
        test_case_1();

        std::cout<<"========================="<<std::endl;
        std::cout<<"Doing test 2"<<std::endl;
        std::cout<<"========================="<<std::endl;
        test_case_2();
    }
    catch(std::runtime_error& e){
        std::cerr<<"Runtime error: "
                 <<e.what()<<std::endl;
    }
    catch(std::logic_error& e){
        std::cerr<<"Logic error: "
                 <<e.what()<<std::endl;
    }
    catch(...){
        std::cerr<<"Unknown exception was raised whilst running simulation."<<std::endl;
    }
   
    return 0;
}

```

## <a name="results"></a> Results


Upon running the driver code above we get:

```
=========================
Doing test 1
=========================
Variable 1 variance: 5.6
Variable 2 variance: 2.4
Total variance: 8
Variable 1 explains: 0.7
Variable 2 explains: 0.3
Singular values: (     6.30061 )
(    0.549804 )

Sum eignenvalies: 40
Variable 1 explains: 0.992443
Variable 2 explains: 0.00755711
V matrix: (    -0.838492    -0.544914 )
(    -0.544914     0.838492 )

PCA variable 1 variance: 7.93954
PCA variable 2 variance: 0.0604569
PCA Total variance: 8
PCA Variable 1 explains: 0.992443
PCA Variable 2 explains: 0.00755711
=========================
Doing test 2
=========================
Total variance: 224.788
Variable: 0 explains: 0.00293193
Variable: 1 explains: 0.00555198
Variable: 2 explains: 0.000334826
Variable: 3 explains: 0.0496143
Variable: 4 explains: 0.907476
Variable: 5 explains: 0.00174249
Variable: 6 explains: 0.00443849
Variable: 7 explains: 6.89034e-05
Variable: 8 explains: 0.00145735
Variable: 9 explains: 0.023909
Variable: 10 explains: 0.000232419
Variable: 11 explains: 0.0022425
Singular values: (     190.221 )
(     45.1776 )
(     31.5194 )
(     16.7921 )
(     12.5484 )
(     7.59523 )
(      5.1764 )
(     4.45796 )
(     3.56478 )
(     2.65211 )
(     1.94634 )
(     1.20831 )

Component: 0 explains: 0.909437
Component: 1 explains: 0.0512981
Component: 2 explains: 0.0249696
```

## <a name="source_code"></a> Source Code


<a href="../exe.cpp">exe.cpp</a>