# ZCA-sphereing

In this short post we describe *Zero Phase Component Analysis Sphereing* - or *ZCA sphereing* for short.  ZCA sphereing is a popular *input-normalization* technique akin to [standard normalization](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_3_Scaling.html) and [PCA-sphereing](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_5_PCA_sphereing.html), but made especially for image, video, and other naturally ordered data types (here we will focus on its application to image-based data).  As with these other input-normalization schemes, when properly applied ZCA-sphereing conditions data in such a way as to substantially accelerate the training of supervised and unsupervised learners (and convolutional networks in general).  Note: this post assumes

- basic familiarity with edge-based feature extractors (e.g., convolutional networks or even [basic edge-based histogram features](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_2_Histogram.html)) - i.e., that you understand the importance of extracting edge information from images for un/supervised learning tasks involving image data

- familiarity with [standard normalization](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_3_Scaling.html) and [PCA-sphereing](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_5_PCA_sphereing.html) techniques

You can skip around this document to particular subsections via the hyperlinks below.

-  [Contrast normalization](#contrast-normalization)
-  [ZCA-sphereing](#ZCA-sphereing)
-  [Python implementation](#python-implementation)

In [1]:
# This code cell will not be shown in the HTML version of this notebook
# imports from custom library
import sys
sys.path.append('../../')
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import ortho_group  # Requires version 0.18 of scipy

# custom libs
from mlrefined_libraries import unsupervised_library as unsuplib
from mlrefined_libraries import basics_library as baslib
datapath = '../../mlrefined_datasets/unsuperlearn_datasets/'
from mlrefined_libraries import superlearn_library as superlearn
normalizers = superlearn.normalizers 

# this is needed to compensate for matplotlib notebook's tendency to blow up images when plotted inline
%matplotlib notebook
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

<a id='contrast-normalization'></a>
## Contrast normalization

Before discussing ZCA-sphereing let's talk about *contrast normalization*.  Contrast normalization is a standard pre-processing technique applied to almost all image-based datasets which simply involves *[standard normalizing](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_3_Scaling.html) each image in a dataset*, that is, subtracting its mean (pixel) value and dividing off its standard deviation.  This helps adjust for *contrast differences* in the various images in a dataset and is virtually always done prior to any *feature-wise normalization technique* like ZCA sphereing.

Why normalize the image contrast first?  Because we almost never want an image-based learner to be sensitive to contrast, since this would make natural image-based learning tasks (e.g., object detection) much more difficult to solve (see e.g., the simple [example described in Figure 4.26](https://github.com/jermwatt/machine_learning_refined/blob/gh-pages/sample_chapters/1st_ed/chapter_4.pdf) of the first edition of our textbook).  In other words, contrast normalization helps our learners treat all images equally by normalizing their average intensity and spread.  Below we show an example image on the left and its contrast normalized version on the right.

In [25]:
## This code cell will not be shown in the HTML version of this notebook
from PIL import Image
import matplotlib.gridspec as gridspec

# load image
image = Image.open('../../images/zca_images/dog.jpg').convert('L')

# contrast normalize: subtract the mean pixel value and divide off the pixel standard deviation
image = np.asarray(image, dtype=float)
image_contrast_normalized = (image - image.mean()) / image.std()

# plot images
fig = plt.figure(figsize=(8,4)); gs=gridspec.GridSpec(1,2)
fig.add_subplot(gs[0]); plt.imshow(image, cmap='gray')
fig.add_subplot(gs[1]); plt.imshow(image_contrast_normalized, cmap='gray')

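To make the per-image nature of contrast normalization concrete, here is a minimal sketch applied to a whole batch of images at once. It assumes the images are stored as the columns of an array (the same layout used for the data later in this post); the function name `contrast_normalize` and the small constant `eps` are hypothetical choices made for this illustration and are not part of the original notebook.

```python
import numpy as np

def contrast_normalize(images, eps=1e-8):
    """Standard normalize each image (each column of `images`) on its own.

    `images` is assumed to have shape (num_pixels, num_images).  `eps` is a
    hypothetical safeguard against dividing by zero for constant images.
    """
    means = images.mean(axis=0, keepdims=True)   # one mean per image
    stds = images.std(axis=0, keepdims=True)     # one standard deviation per image
    return (images - means) / (stds + eps)

# two toy "images" with the same content but different brightness and contrast
rng = np.random.default_rng(0)
base = rng.random(64)
batch = np.stack([base, 3.0 * base + 10.0], axis=1)   # shape (64, 2)

normalized = contrast_normalize(batch)

# after contrast normalization the two images should be (nearly) identical
print(np.abs(normalized[:, 0] - normalized[:, 1]).max())
```

Note that the statistics here are computed per image (down each column), whereas the feature-wise schemes discussed below compute statistics per pixel across the dataset.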

<a id='ZCA-sphereing'></a>
## Zero Phase Component Analysis (ZCA) sphereing

In [this set of notes](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_3_Scaling.html) you can see how the simple idea of *standard normalizing* the input of a machine learning dataset could drastically improve the learning ability of a zero or first order local optimization technique by improving the contours of any associated cost function.  The natural evolution of this idea - [PCA sphereing](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_5_PCA_sphereing.html) - can help even further improve the nature of a cost function, allowing for even easier learning.  

While there is nothing preventing us from applying standard normalization or PCA-based sphereing to image-based data, doing so *destroys the natural spatial-correlation of image-based data* and thus much of the *edge information* in an image.  To get a sense of this we show the result of each input-normalization procedure applied to a large set of handwritten digits from the MNIST dataset - a random subset of which are shown below.

In [9]:
# This code cell will not be shown in the HTML version of this notebook
# load data
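# note: fetch_mldata has been removed from scikit-learn, so we use fetch_openml instead
from sklearn.datasets import fetch_openml
MNIST = fetch_openml('mnist_784', version=1, as_frame=False)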
x = MNIST.data.astype('float64')
y = np.reshape(MNIST.target, (-1, 1))
ind = np.random.permutation(len(y))
P = 70
x = x[ind[:P],:].T
y = y[ind[:P]]

# plot a sample of the images
unsuplib.PCA_functionality.show_images(x)


Below we show the result of [*standard normalizing*](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_3_Scaling.html) a batch of these handwritten digits - that is, we normalize each feature (here each individual pixel) by mean centering and scaling by the standard deviation of that pixel across the dataset.  Notice how doing this *introduces* edge-based artifacts into these images!  So standard normalizing image data like this will make edge-based feature extractors (like a convolutional network) much less effective.

In [12]:
# This code cell will not be shown in the HTML version of this notebook
# standard normalization function 
def standard_normalizer(x):
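    # compute the mean and standard deviation of each input feature
    x_means = np.mean(x,axis = 1)[:,np.newaxis]
    x_stds = np.std(x,axis = 1)[:,np.newaxis]

    # guard against division by zero for constant features (e.g., always-black border pixels)
    x_stds[x_stds == 0] = 1.0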

    # create standard normalizer function
    normalizer = lambda data: (data - x_means)/x_stds

    # create inverse standard normalizer
    inverse_normalizer = lambda data: data*x_stds + x_means

    # return normalizer 
    return normalizer, inverse_normalizer

# standard normalize data
standard_func,inverse_standard = standard_normalizer(x)
x_standard = standard_func(x)

# plot standard-normalized data
unsuplib.PCA_functionality.show_images(x_standard)


Below we show the result of [*PCA-sphereing*](https://jermwatt.github.io/machine_learning_refined/notes/9_Feature_engineer_select/9_5_PCA_sphereing.html) this handwritten digit dataset, and plot a random subset of the resulting PCA-sphered data.  Note: these are the same images (transformed by the sphereing process) we showed above.  

In this instance PCA-sphereing completely destroys the spatial structure of the original images - none of the original numbers are visible in these transformed versions.  There is little point to using edge-based feature extractors (like a convolutional network) if there are no longer any (useful) edges in our images!  

Note: the phenomenon displayed here for the MNIST dataset holds more generally: standard normalization and PCA-sphereing tend to destroy the spatial correlation of images, video, and other naturally ordered data.

In [245]:
# This code cell will not be shown in the HTML version of this notebook
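# mean-center the data
X = x - np.mean(x,axis = 1)[:,np.newaxis]

# eigendecomposition of the (lightly regularized) covariance matrix of the centered data
d,V = np.linalg.eigh(np.dot(X,X.T)/X.shape[1] + 10**(-7)*np.eye(X.shape[0]))

# PCA-sphere: rotate / reflect by V^T, then divide off each standard deviation
S = np.dot(V.T,X)/np.sqrt(d)[:,np.newaxis]

# plot PCA-sphered data
unsuplib.PCA_functionality.show_images(S)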


## What went wrong with PCA-sphereing?

As detailed in the previous Example, in terms of the actions we perform on the data itself with PCA-sphereing (or whitening) we:

- **rotate and reflect** the data by multiplying it by $\mathbf{V}^T$ (where the columns of $\mathbf{V}$ are the eigenvectors of the data's covariance matrix), so that its largest orthogonal directions of variance align with the coordinate axes (this is done by the standard PCA transformation)

- **normalize** along these coordinate axes by dividing off their individual standard deviations, i.e., by multiplying the rotated/reflected data by $\mathbf{D}^{-^1/_2}$, the diagonal matrix of inverse square roots of the eigenvalues of the data's covariance matrix (this extra bit of normalization is what turns the PCA transformation into the PCA-sphereing operation)
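The two steps listed above can be sketched on a small toy dataset, which makes it easy to check what each step does to the covariance matrix. This is only an illustration under stated assumptions: the two-dimensional data, the mixing matrix `A`, and all variable names are made up for this example, and the data is stored with one point per column as elsewhere in this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dataset: N = 2 features, P = 500 points, stored as columns (N x P)
A = np.array([[2.0, 1.2],
              [0.0, 0.5]])                      # hypothetical mixing matrix
X = A @ rng.standard_normal((2, 500))
X = X - X.mean(axis=1, keepdims=True)           # mean-center each feature

# eigendecomposition of the covariance matrix
P = X.shape[1]
cov = X @ X.T / P
d, V = np.linalg.eigh(cov)

# step 1: rotate / reflect so directions of variance align with the axes
rotated = V.T @ X

# step 2: normalize each axis by its standard deviation (sqrt of eigenvalue)
sphered = rotated / np.sqrt(d)[:, np.newaxis]

cov_rotated = rotated @ rotated.T / P           # diagonal, entries equal to d
cov_sphered = sphered @ sphered.T / P           # identity matrix

print(np.round(cov_rotated, 3))
print(np.round(cov_sphered, 3))
```

After the first step the covariance matrix is diagonal (the features are uncorrelated but have different variances); after the second step it is the identity.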

Both of these actions contribute to the destruction of spatial correlation - but it is the **rotation / reflection** component that is by far the greater culprit.  We can get a visceral sense of this fact with our current dataset by simply applying the standard PCA transform - which will rotate / reflect the space so that the largest directions of variance coincide with the coordinate axes - without normalizing the result (which led to the PCA-sphereing result above).   We do this in the next cell, and plot the resulting transformed images.

In [246]:
# This code cell will not be shown in the HTML version of this notebook
# compute eigendecomposition of data covariance matrix for PCA transformation
def PCA(x,**kwargs):
    # regularization parameter for numerical stability
    lam = 10**(-7)
    if 'lam' in kwargs:
        lam = kwargs['lam']

    # create the (regularized) covariance matrix of the mean-centered input
    P = float(x.shape[1])
    Cov = 1/P*np.dot(x,x.T) + lam*np.eye(x.shape[0])

    # use numpy function to compute eigenvalues / vectors of the covariance matrix
    d,V = np.linalg.eigh(Cov)
    return d,V

# PCA-sphereing - use PCA to normalize input features
def PCA_sphereing(x,**kwargs):
    # Step 1: mean-center the data
    x_means = np.mean(x,axis = 1)[:,np.newaxis]
    x_centered = x - x_means

    # Step 2: compute pca transform on mean-centered data
    d,V = PCA(x_centered,**kwargs)

    # Step 3: divide off standard deviation of each (transformed) input, 
    # which are equal to the square roots of the returned eigenvalues in 'd'.  
    stds = (d[:,np.newaxis])**(0.5)
    normalizer = lambda data: np.dot(V.T,data - x_means)/stds

    # create inverse normalizer
    inverse_normalizer = lambda data: np.dot(V,data*stds) + x_means

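    # return normalizer 
    return normalizer,inverse_normalizer

# apply only the PCA rotation / reflection (no normalization) to the mean-centered data
x_centered = x - np.mean(x,axis = 1)[:,np.newaxis]
d,V = PCA(x_centered)
x_rotated = np.dot(V.T,x_centered)

# plot PCA-transformed data
unsuplib.PCA_functionality.show_images(x_rotated)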


As we can see in the examples plotted above, the rotation / reflection from the PCA transformation - created by multiplying our data by $\mathbf{V}^T$ - utterly destroys the spatial correlation structure of our images, a phenomenon that holds more generally for other image datasets as well.  What can we do if we would still like the added optimization-boosting benefit of PCA-sphereing normalization but do not want to destroy the spatial correlation of our input data by rotating / reflecting it?

Well, if the rotation / reflection is indeed the greatest obstacle to maintaining the spatial correlation of our data, why don't we simply rotate / reflect our dataset back to its original orientation after we finish sphereing it (i.e., normalizing it along its largest orthogonal directions of variance)?  Since we know that it is multiplication by $\mathbf{V}^T$ that produces the original rotation / reflection, multiplying the PCA-sphered data by $\left(\mathbf{V}^T\right)^{-1} = \mathbf{V}^{\,}$ (where the equality follows from the fact that $\mathbf{V}$ is an orthogonal matrix) will return the sphered data to its original orientation in the space.  This is illustrated in the Figure below.

<figure>
  <img src= '../../images/zca_images/zca_sphereing.png' width="110%"  height="auto" alt=""/>
  <figcaption>   
<strong>Figure 2:</strong> <em> ZCA-sphereing illustrated. </em>  </figcaption> 
</figure>

Since our PCA-sphered transformation of the (mean-centered) input data $\mathbf{X}$ took the form $\mathbf{S}^{\,} =  \mathbf{D}^{-^1/_2}\mathbf{V}^T\mathbf{X}^{\,}$, this re-rotation/reflection gives us the related formula

\begin{equation}
\text{(ZCA-sphered data)}\,\,\,\,\,\,\,\,\, \mathbf{Z}^{\,} =  \mathbf{V}\mathbf{S} = \mathbf{V}\mathbf{D}^{-^1/_2}\mathbf{V}^T\mathbf{X}^{\,}.
\end{equation}

For historical reasons this re-rotated version of our PCA-sphered data is often referred to as *Zero-phase Component Analysis (ZCA) sphereing*.  
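One might worry that rotating back undoes the sphereing.  A short derivation suggests it does not.  As a sketch, assume $\mathbf{X}$ is mean-centered with $P$ columns and write the eigendecomposition of its covariance matrix as $\frac{1}{P}\mathbf{X}\mathbf{X}^T = \mathbf{V}\mathbf{D}\mathbf{V}^T$ (notation that the formulas above imply but do not state explicitly).  Then the covariance matrix of the ZCA-sphered data is

\begin{equation}
\frac{1}{P}\mathbf{Z}\mathbf{Z}^T = \mathbf{V}\mathbf{D}^{-^1/_2}\mathbf{V}^T\left(\frac{1}{P}\mathbf{X}\mathbf{X}^T\right)\mathbf{V}\mathbf{D}^{-^1/_2}\mathbf{V}^T = \mathbf{V}\mathbf{D}^{-^1/_2}\,\mathbf{D}\,\mathbf{D}^{-^1/_2}\mathbf{V}^T = \mathbf{V}\mathbf{V}^T = \mathbf{I}
\end{equation}

where we used $\mathbf{V}^T\mathbf{V} = \mathbf{V}\mathbf{V}^T = \mathbf{I}$.  So the ZCA-sphered data is just as "sphered" as the PCA-sphered data: its features are uncorrelated with unit variance, only expressed in the original orientation of the space.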

We implement ZCA-sphereing in the next cell.

In [247]:
# ZCA-sphereing - use ZCA to normalize input features
def ZCA_sphereing(x,**kwargs):
    # Step 1: mean-center the data
    x_means = np.mean(x,axis = 1)[:,np.newaxis]
    x_centered = x - x_means

    # Step 2: compute pca transform on mean-centered data
    d,V = PCA(x_centered,**kwargs)

    # Step 3: divide off standard deviation of each (transformed) input, 
    # which are equal to the square roots of the returned eigenvalues in 'd'.  
    stds = (d[:,np.newaxis])**(0.5)
    normalizer = lambda data: np.dot(V, np.dot(V.T,data - x_means)/stds)

    # create inverse normalizer
    inverse_normalizer = lambda data: np.dot(V,np.dot(V.T,data)*stds) + x_means

    # return normalizer 
    return normalizer,inverse_normalizer
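As a quick sanity check on the structure of this transformation, the sketch below builds the ZCA matrix $\mathbf{V}\mathbf{D}^{-^1/_2}\mathbf{V}^T$ explicitly on toy data and verifies three properties: it is symmetric, it sphereizes the data, and it is undone by $\mathbf{V}\mathbf{D}^{^1/_2}\mathbf{V}^T$.  The toy data (a cumulative sum of noise, used here as a stand-in for spatially correlated pixels) and the names `W_zca` and `W_inv` are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy stand-in for image data: N = 16 "pixels", P = 200 points stored as columns;
# the cumulative sum makes neighboring "pixels" correlated
X = np.cumsum(rng.standard_normal((16, 200)), axis=0)
X = X - X.mean(axis=1, keepdims=True)

# eigendecomposition of the lightly regularized covariance matrix
lam = 1e-7
P = X.shape[1]
d, V = np.linalg.eigh(X @ X.T / P + lam * np.eye(X.shape[0]))

# ZCA matrix V D^{-1/2} V^T and its inverse V D^{1/2} V^T
W_zca = V @ np.diag(d ** -0.5) @ V.T
W_inv = V @ np.diag(d ** 0.5) @ V.T

Z = W_zca @ X                 # ZCA-sphered data
cov_Z = Z @ Z.T / P           # should be (very nearly) the identity
X_recovered = W_inv @ Z       # should match the centered input

print(np.abs(W_zca - W_zca.T).max())
print(np.abs(cov_Z - np.eye(16)).max())
print(np.abs(X_recovered - X).max())
```

Forming the matrices explicitly is convenient for inspection; the implementation above instead applies $\mathbf{V}^T$, the scaling, and $\mathbf{V}$ one after another, which gives the same result.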

With our ZCA-sphereing implementation written, we can now use it to transform the original dataset (the normalizer mean-centers the data internally) and examine the results.

In [32]:
# create ZCA sphereing normalizer and normalize data
zca_sphere,inverse_sphere = ZCA_sphereing(x)
x_sphered = zca_sphere(x)

# plot ZCA-sphered data
unsuplib.PCA_functionality.show_images(x_sphered)


What a difference!  Indeed it was the rotation/reflection of PCA that destroyed most of the spatial correlation, since these ZCA-sphered images retain much of the spatial correlation present in their original versions.  Now we have the best of both worlds: a global normalization scheme (that will help speed up training) that retains the spatial structure of data leveraged e.g., by edge detectors and convolution operations.
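The visual impression that ZCA-sphered data stays close to the original can also be checked numerically.  The sketch below, run on toy data rather than MNIST, PCA-spheres and ZCA-spheres the same centered dataset and compares how far each result lands from the input.  The toy data and all variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy data with correlated neighboring features, stored as columns (N x P)
X = np.cumsum(rng.standard_normal((16, 200)), axis=0)
X = X - X.mean(axis=1, keepdims=True)

# eigendecomposition of the covariance matrix
P = X.shape[1]
d, V = np.linalg.eigh(X @ X.T / P)

S = (V.T @ X) / np.sqrt(d)[:, np.newaxis]   # PCA-sphered data
Z = V @ S                                   # ZCA-sphered data (rotated back)

# both versions are sphered: identity covariance
cov_S = S @ S.T / P
cov_Z = Z @ Z.T / P

# Frobenius-norm distance of each sphered dataset from the centered input
dist_pca = np.linalg.norm(S - X)
dist_zca = np.linalg.norm(Z - X)

print(dist_pca, dist_zca)
```

Both transformed datasets have identity covariance, but the ZCA-sphered version sits closer to the input, which is consistent with the images above retaining their recognizable spatial structure.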