# 9.5  Feature scaling via PCA sphering

- In [Section 9.3](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_3_Scaling.html) we saw how *feature scaling* via *standard normalization* significantly improves the topology of a machine learning cost function.


- This enables much more rapid minimization via first-order methods such as the generic gradient descent algorithm.


- In this Section we describe how PCA is used to perform a more advanced form of standard normalization, commonly called *PCA sphering* (also referred to as *whitening*).


- With this improvement on standard normalization we use PCA to rotate the mean-centered dataset so that its largest orthogonal directions of variance align with the coordinate axes, prior to scaling each input by its standard deviation.



- This typically allows us to better compactify the data, resulting in a cost function whose contours are even more 'circular' than those provided by standard normalization, and which is thus even easier to optimize.


In [1]:
# This code cell will not be shown in the HTML version of this notebook
# imports from custom library
import sys
sys.path.append('../../')
import matplotlib.pyplot as plt
import numpy as np

# custom libs
from mlrefined_libraries import unsupervised_library as unsuplib
from mlrefined_libraries import basics_library as baslib
from mlrefined_libraries import math_optimization_library as optlib
from mlrefined_libraries import superlearn_library as superlearn

unsup_datapath = '../../mlrefined_datasets/unsuperlearn_datasets/'
sup_datapath = '../../mlrefined_datasets/superlearn_datasets/'

normalizers = unsuplib.normalizers
optimizers = optlib.optimizers
cost_lib = superlearn.cost_functions
classification_plotter = superlearn.classification_static_plotter.Visualizer();


# this is needed to compensate for matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

##  The PCA sphering scheme

- Before discussing PCA-sphering, let us recall some key notation and ideas from our discussion of PCA itself (see Section 8.3).


- Stacking our input data together column-wise we create our $N\times P$ data matrix $\mathbf{X}$.  



- We then denote by $\frac{1}{P}\mathbf{X}\mathbf{X}^T + \lambda \mathbf{I}_{N\times N}$ the regularized covariance matrix of this data.


- We denote by $\frac{1}{P}\mathbf{X}\mathbf{X}^T +\lambda \mathbf{I}_{N\times N}= \mathbf{V}\mathbf{D}\mathbf{V}^T$ its eigenvalue/eigenvector decomposition.

- Also recall that when performing PCA we first *mean-center* our dataset.


- We then aim to represent each of our mean-centered datapoints $\mathbf{x}_p$ by $\mathbf{w}_p = \mathbf{V}^T\mathbf{x}_p$.


- In the space spanned by the principal components we can represent the entire set of transformed mean-centered data as 

\begin{equation}
\text{(PCA transformed data)}\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\mathbf{W} = \mathbf{V}^T\mathbf{X}^{\,}.
\end{equation}
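
- The rotation above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the library code used in these notes: the toy dataset, the variable names, and the value of the regularization parameter `lam` are all assumptions made for the example.

```python
import numpy as np

# hypothetical toy dataset: N = 2 inputs, P = 200 points stored column-wise (N x P)
rng = np.random.default_rng(0)
X = np.array([[2.0, 1.5], [0.0, 0.5]]) @ rng.standard_normal((2, 200))
X = X + np.array([[3.0], [-1.0]])

# mean-center each input dimension (each row of X)
X_centered = X - X.mean(axis=1, keepdims=True)

# regularized covariance matrix; lam is an assumed small value
N, P = X_centered.shape
lam = 1e-7
cov = X_centered @ X_centered.T / P + lam * np.eye(N)

# eigendecomposition cov = V D V^T (eigh is intended for symmetric matrices)
d, V = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; put the largest-variance direction first
order = np.argsort(d)[::-1]
d, V = d[order], V[:, order]

# PCA-transformed data W = V^T X
W = V.T @ X_centered
```

- Since $\mathbf{V}$ is orthogonal, `W` holds the same mean-centered points expressed in rotated coordinates, so distances between points are unchanged by this step.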

- Remember that this PCA transformation *rotates* the mean-centered data so that its largest orthogonal directions of variance align with the coordinate axes of the input space.


- "PCA-sphering" takes this process one step further.


- To *sphere* the data we simply divide off the standard deviation along each coordinate of the PCA-transformed (mean-centered) data $\mathbf{W}$. 

- PCA-sphering is simply standard normalization with one step inserted between mean centering and the dividing off of standard deviations.


- In between these two steps we rotate the data using PCA.  


- By rotating the data prior to scaling we can typically shrink the space consumed by the data considerably more than standard normalization.


- This simultaneously makes the associated cost function considerably more "circular" and much easier to minimize properly.

- In the Figure below we show a generic comparison of how standard normalization and PCA sphering affect a prototypical dataset and its associated cost function.


- Because PCA sphering rotates the data prior to scaling, it typically results in more compact transformed data and a transformed cost function with more 'circular' contours.

  <img src= '../../mlrefined_images/unsupervised_images/standard_normal_vs_pca_sphereing.png' width="60%"  height="auto" alt=""/>
      <img src= '../../mlrefined_images/unsupervised_images/standard_vs_sphereing_contours.png' width="60%"  height="auto" alt=""/>

- Formally, if the *standard normalization* scheme applied to a single datapoint $\mathbf{x}_p$ can be written in two steps as...

---
**Standard normalization scheme:**
1.  **(mean center)** for each $n$ replace $x_{p,n} \longleftarrow \left({x_{p,n} - \mu_n}\right)$ where $\mu_n = \frac{1}{P}\sum_{p=1}^{P}x_{p,n}$
2.  **(divide off std)** for each $n$ replace $x_{p,n} \longleftarrow \frac{x_{p,n}}{\sigma_n}$ where $\sigma_n = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(x_{p,n}\right)^2}$
---

- ... then the PCA-sphering scheme can be written in three closely related steps as follows.

---
**PCA-sphering scheme:**
1.  **(mean center)** for each $n$ replace $x_{p,n} \longleftarrow \left({x_{p,n} - \mu_n}\right)$ where $\mu_n = \frac{1}{P}\sum_{p=1}^{P}x_{p,n}$
2.  **(PCA rotation)** transform $\mathbf{w}_p = \mathbf{V}^T\mathbf{x}_p$ where $\mathbf{V}$ is the full set of eigenvectors of the regularized covariance matrix
3.  **(divide off std)** for each $n$ replace $w_{p,n} \longleftarrow \frac{w_{p,n}}{\sigma_n}$ where $\sigma_n = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(w_{p,n}\right)^2}$
---
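
- The two schemes above can be sketched side by side as follows. This is a minimal illustration under stated assumptions: the function names `standard_normalize` and `pca_sphere`, the default value of `lam`, and the correlated toy dataset are hypothetical and are not taken from the library used in these notes.

```python
import numpy as np

def standard_normalize(X):
    """Mean-center each input (row of the N x P matrix X), then divide off its std."""
    X_centered = X - X.mean(axis=1, keepdims=True)
    sigma = np.sqrt(np.mean(X_centered ** 2, axis=1, keepdims=True))
    return X_centered / sigma

def pca_sphere(X, lam=1e-7):
    """Mean-center, rotate with PCA, then divide off the std along each rotated axis."""
    # step 1: mean center
    X_centered = X - X.mean(axis=1, keepdims=True)
    # step 2: PCA rotation using the eigenvectors of the regularized covariance matrix
    N, P = X_centered.shape
    cov = X_centered @ X_centered.T / P + lam * np.eye(N)
    _, V = np.linalg.eigh(cov)
    W = V.T @ X_centered
    # step 3: divide off the std along each axis of the rotated data
    sigma = np.sqrt(np.mean(W ** 2, axis=1, keepdims=True))
    return W / sigma

# hypothetical strongly correlated toy data, stored column-wise (N x P)
rng = np.random.default_rng(1)
X = np.array([[1.0, 0.9], [0.9, 1.0]]) @ rng.standard_normal((2, 500))

# covariance of the transformed data under each scheme (normalized by P)
cov_standard = np.cov(standard_normalize(X), bias=True)
cov_sphered = np.cov(pca_sphere(X), bias=True)
print(np.round(cov_standard, 3))
print(np.round(cov_sphered, 3))
```

- On correlated data like this, standard normalization gives each input unit variance but leaves the correlation between inputs in place, while sphering produces (numerically) an identity covariance matrix. For a least squares cost on a linear model the Hessian is built from this matrix, which is one way to see why the sphered contours are closer to circular.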

- We can express step 3 of PCA-sphering more efficiently using the *eigenvalues* of the regularized covariance matrix.


- The Rayleigh quotient definition of the $n^{th}$ eigenvalue $d_n$ of this matrix states that, numerically speaking (that is, up to the small regularization term $\lambda$),

\begin{equation}
d_n = \frac{1}{P}\mathbf{v}_n^T \mathbf{X} \mathbf{X}^T \mathbf{v}_n
\end{equation}


- Here $\mathbf{v}_n$ is the corresponding $n^{th}$ eigenvector.

- In terms of our PCA transformed data this is equivalently written as


\begin{equation}
d_n = \frac{1}{P}\left\Vert \mathbf{v}_n^T \mathbf{X} \right \Vert_2^2 = {\frac{1}{P}\sum_{p=1}^{P}\left(w_{p,n}\right)^2}
\end{equation}


- In other words the $n^{th}$ eigenvalue is the *variance* along the $n^{th}$ axis of the PCA-transformed data.  
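
- This identity is easy to check numerically. The sketch below uses a hypothetical toy dataset and sets the regularization to zero so that the equality is exact up to floating-point error; with $\lambda > 0$ each eigenvalue is larger than the corresponding variance by exactly $\lambda$.

```python
import numpy as np

# hypothetical toy data: N = 3 inputs, P = 1000 points, stored column-wise (N x P)
rng = np.random.default_rng(2)
X = np.diag([3.0, 1.0, 0.2]) @ rng.standard_normal((3, 1000))
X_centered = X - X.mean(axis=1, keepdims=True)
P = X_centered.shape[1]

# eigenvalues d and eigenvectors V of the unregularized (lam = 0) covariance matrix
d, V = np.linalg.eigh(X_centered @ X_centered.T / P)

# variance along each axis of the PCA-transformed data W = V^T X
W = V.T @ X_centered
variances = np.mean(W ** 2, axis=1)

# compare the eigenvalues with the per-axis variances
print(np.allclose(d, variances))
```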


- Since the final step of PCA-sphering has us divide off the standard deviation along each axis of the transformed data, we can write it equivalently in terms of the eigenvalues as


---
**PCA-sphering scheme:**
1.  **(mean center)** for each $n$ replace $x_{p,n} \longleftarrow \left({x_{p,n} - \mu_n}\right)$ where $\mu_n = \frac{1}{P}\sum_{p=1}^{P}x_{p,n}$
2.  **(PCA rotation)** transform $\mathbf{w}_p = \mathbf{V}^T\mathbf{x}_p$ where $\mathbf{V}$ is the full set of eigenvectors of the regularized covariance matrix
3.  **(divide off std)**  for each $n$ replace $w_{p,n} \longleftarrow \frac{w_{p,n}}{d_n^{1/2}}$ where $d_n^{1/2}$ is the square root of the $n^{th}$ eigenvalue of the regularized covariance matrix
---

- While expressing PCA-sphering in this way may seem largely cosmetic, it is indeed computationally advantageous to simply use the eigenvalues in step 3 of the method...


- ...instead of re-computing the standard deviations along each transformed input axis, since we compute the eigenvalues anyway when performing PCA in step 2.
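
- A minimal sketch of this eigenvalue-based version is given below. The function name `pca_sphere_eig`, the default `lam`, and the toy dataset are assumptions made for illustration and are not the library implementation used in these notes.

```python
import numpy as np

def pca_sphere_eig(X, lam=1e-7):
    """PCA-sphere the N x P data matrix X, reusing the eigenvalues for the scaling step."""
    # step 1: mean center each input (row of X)
    X_centered = X - X.mean(axis=1, keepdims=True)
    # step 2: eigendecomposition of the regularized covariance matrix, then rotate
    N, P = X_centered.shape
    d, V = np.linalg.eigh(X_centered @ X_centered.T / P + lam * np.eye(N))
    W = V.T @ X_centered
    # step 3: divide each rotated axis by the square root of its eigenvalue
    return W / np.sqrt(d)[:, np.newaxis]

# hypothetical toy data: correlated, with a nonzero mean, stored column-wise
rng = np.random.default_rng(3)
X = np.array([[2.0, 1.0], [1.0, 1.5]]) @ rng.standard_normal((2, 400))
X = X + np.array([[5.0], [-2.0]])

S = pca_sphere_eig(X)
cov_S = S @ S.T / S.shape[1]
print(np.round(cov_S, 4))
```

- The covariance of the sphered data should be close to the identity. It is not exactly the identity because the eigenvalues include the regularization term, and that same term keeps the division in step 3 well defined when the data has little or no variance along some direction.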