<a href="http://cocl.us/pytorch_link_top">
    <img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0110EN/notebook_images%20/Pytochtop.png" width="750" alt="IBM Product " />
</a> 

<img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0110EN/notebook_images%20/cc-logo-square.png" width="200" alt="cognitiveclass.ai logo" />

<h1>Data Preprocessing and Gradient Descent </h1> 

<h2>Table of Contents</h2>
<p>This lab will show how data normalization, data standardization, data decorrelation (Principal Component Analysis), whitening, and Zero-Phase Component Analysis (ZCA) affect convergence in parameter space. The simulations are based on the paper <em>Efficient BackProp</em> by Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller.</p>

<ul>
    <li><a href="#Auxiliary">Auxiliary Functions and Classes </a></li>
    <li><a href="#PyTorch_Classes"> Define the PyTorch Classes </a></li>
    <li><a href="#No_Transform">Data with No Pre-processing </a></li>
    <li><a href="#Standardize_Data">Standardize Data </a></li>
    <li><a href="#PCA">PCA </a></li>
    <li><a href="#Whitening">Whitening</a></li>
    <li><a href="#ZCA">Zero-Phase Component Analysis</a></li>
    <li><a href="#WHYZCA">Why ZCA?</a></li>
</ul>

<p>Estimated Time Needed: <strong>30 min</strong></p>

<hr>

<h2 id="Auxiliary">DataSet </h2>

We'll need the following libraries for computation and plotting:

In [None]:
# These are the libraries we are going to use in the lab.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import seaborn as sns
import torch



We generate 2D  data that is correlated.

In [None]:
samples=200

u=torch.tensor([[1.0,1.0],[0.10,-0.10]])/(2)**(0.5)

X=torch.mm(4*torch.randn(samples,2),u)+2
plt.scatter(X[:, 0].numpy(), X[:, 1].numpy())
plt.show()
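
As a rough check of why this data is correlated, the sketch below computes the covariance implied by the generating process. It assumes the noise `torch.randn` is standard normal, so that for $\mathbf{X}=4\mathbf{Z}\mathbf{u}+2$ the covariance is $16\,\mathbf{u}^T\mathbf{u}$; the names `theoretical_cov` and `correlation` are introduced here for illustration only.

```python
import torch

# Mixing matrix from the data-generation cell.
u = torch.tensor([[1.0, 1.0], [0.10, -0.10]]) / (2) ** (0.5)

# For X = 4*Z*u + 2 with Z standard normal, Cov(X) = 16 * u^T u.
# The constant shift of 2 moves the mean but does not change the covariance.
theoretical_cov = 16 * torch.mm(torch.t(u), u)

# Correlation coefficient between the two features.
correlation = theoretical_cov[0, 1] / torch.sqrt(
    theoretical_cov[0, 0] * theoretical_cov[1, 1]
)

print(theoretical_cov)
print(correlation)
```

The off-diagonal terms are almost as large as the diagonal ones, giving a correlation of roughly 0.98 between the two features.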

<h2 id="#Standardize_Data "> Standardize Data </h2>

In this section, we standardize the data $\mathbf{x}$. This is equivalent to the following matrix operation:

   $\quad
    \boldsymbol D= \begin{pmatrix} \sigma_1 & 0 \\
                             0  & \sigma_2 \end{pmatrix}  $ 

$\mathbf{\hat{x}}=(\mathbf x-\boldsymbol\mu)\boldsymbol D^{-1}$

where $\boldsymbol\mu$ is the mean and $\sigma_i$ is the standard deviation of the i-th component.

In [None]:
Xhat=torch.mm(X-X.mean(dim=0),torch.eye(2)/X.std(dim=0)) 
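
The matrix form above can be compared with the more common element-wise form. The following is a minimal sketch, assuming data generated the same way as in this lab; the seed and the names `D_inv`, `Xhat_matrix`, and `Xhat_broadcast` are illustrative choices, not part of the original notebook.

```python
import torch

torch.manual_seed(0)  # seed chosen only to make the sketch reproducible

# Correlated 2D data generated the same way as in the lab.
samples = 200
u = torch.tensor([[1.0, 1.0], [0.10, -0.10]]) / (2) ** (0.5)
X = torch.mm(4 * torch.randn(samples, 2), u) + 2

# Matrix form: (x - mu) D^{-1}, with D = diag(sigma_1, sigma_2).
D_inv = torch.diag(1.0 / X.std(dim=0))
Xhat_matrix = torch.mm(X - X.mean(dim=0), D_inv)

# Equivalent element-wise form that relies on broadcasting.
Xhat_broadcast = (X - X.mean(dim=0)) / X.std(dim=0)

# Both forms should agree, with zero mean and unit standard deviation per feature.
print(torch.allclose(Xhat_matrix, Xhat_broadcast, atol=1e-5))
print(Xhat_matrix.mean(dim=0), Xhat_matrix.std(dim=0))
```

Because $\boldsymbol D$ is diagonal, multiplying by $\boldsymbol D^{-1}$ only rescales each feature, so standardization does not remove the correlation between the features.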

We can plot the data.

In [None]:
plt.scatter(X[:, 0].numpy(), X[:, 1].numpy(),label="No Pre-processing")
plt.scatter(Xhat[:, 0].numpy(), Xhat[:, 1].numpy(),label="Standardize Data")
plt.legend()

From now on, we will work with zero-centered data:

$\mathbf x=\mathbf x-\boldsymbol\mu$

In [None]:
X=X-X.mean(dim=0)


In [None]:
# X is now zero-centered; Xhat still holds the standardized data from the previous section
plt.scatter(X[:, 0].numpy(), X[:, 1].numpy(),label="zero-mean")
plt.scatter(Xhat[:, 0].numpy(), Xhat[:, 1].numpy(),label="Standardize Data")
plt.legend()


<h2 id="#PCA "> PCA</h2>
In this section, we apply Principal Component Analysis (PCA). We zero-center the data and project it onto the eigenvectors of the covariance matrix, which form the columns of $\mathbf{Q}$, as shown below.

$\frac{1}{N}   \mathbf{X}^T \mathbf{X} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T$

$\mathbf{\hat{x}}=\mathbf{x} \mathbf{Q} $

We calculate the empirical covariance matrix.

$\frac{1}{N}   \mathbf{X}^T \mathbf{X}$

In [None]:
Cov=torch.mm(torch.t(X),X)/X.shape[0]
Cov
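
The empirical covariance computed by hand can be cross-checked against a library routine. This sketch assumes a PyTorch version that provides `torch.cov`, which expects variables in rows and observations in columns; `correction=0` selects the $\frac{1}{N}$ normalization used here. The seed and variable names are illustrative.

```python
import torch

torch.manual_seed(0)  # seed chosen only to make the sketch reproducible

# Correlated 2D data generated the same way as in the lab, then zero-centered.
samples = 200
u = torch.tensor([[1.0, 1.0], [0.10, -0.10]]) / (2) ** (0.5)
X = torch.mm(4 * torch.randn(samples, 2), u) + 2
X = X - X.mean(dim=0)

# Manual estimate: (1/N) X^T X, valid because X has zero mean.
cov_manual = torch.mm(torch.t(X), X) / X.shape[0]

# Library estimate: torch.cov treats rows as variables, hence the transpose.
cov_library = torch.cov(torch.t(X), correction=0)

print(cov_manual)
print(cov_library)
```

The two estimates only agree because the data was centered first; without centering, $\frac{1}{N}\mathbf{X}^T\mathbf{X}$ is the second-moment matrix rather than the covariance.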

We obtain the eigenvalues $\mathbf{\Lambda}$ and the eigenvectors $\mathbf{Q}$ from the decomposition:
$\frac{1}{N}   \mathbf{X}^T \mathbf{X} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T$

In [None]:
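# Cov is symmetric, so we use torch.linalg.eigh (torch.eig has been removed from recent PyTorch releases)
eigenvalues,eigenvectors=torch.linalg.eigh(Cov)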

We can plot the eigenvectors:

In [None]:
row_vec=torch.t(eigenvectors).numpy()
plt.scatter(X[:, 0].numpy(), X[:, 1].numpy(),label="data")
# One origin per arrow, so the position arrays match the number of eigenvectors
plt.quiver([0,0],[0,0],row_vec[:,0],row_vec[:,1],label="Eigenvectors")
plt.xlabel("$x_{1}$")
plt.ylabel("$x_{2}$")
plt.legend()
plt.show()

We project the data onto the eigenvectors:
$\mathbf{\hat{x}}=\mathbf{x} \mathbf{Q} $

In [None]:
Xhat=torch.mm(X,eigenvectors)

In [None]:
plt.scatter(X[:, 0].numpy(), X[:, 1].numpy(),label="data") 

plt.scatter(Xhat[:, 0].numpy(), Xhat[:, 1].numpy(),label="transformed data")
plt.xlabel("$q_{1}$")
plt.ylabel("$q_{2}$")
# One origin per arrow, so the position arrays match the number of eigenvectors
plt.quiver([0,0],[0,0],row_vec[:,0],row_vec[:,1],label="Eigenvectors")
plt.legend()
plt.show()

We see the data is now uncorrelated, since the off-diagonal elements of the covariance matrix are close to zero:

In [None]:
torch.mm(torch.t(Xhat),Xhat)/Xhat.shape[0]

but the components have different standard deviations:

In [None]:
Xhat.std(dim=0)
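
The relationship between the projected data and the eigenvalues can be checked numerically. The sketch below assumes the same data-generation recipe as the lab and uses `torch.linalg.eigh`, an eigensolver for symmetric matrices that returns the eigenvalues as a 1-D tensor in ascending order; the names `Q` and `cov_hat` are illustrative.

```python
import torch

torch.manual_seed(0)  # seed chosen only to make the sketch reproducible

# Correlated 2D data generated the same way as in the lab, then zero-centered.
samples = 200
u = torch.tensor([[1.0, 1.0], [0.10, -0.10]]) / (2) ** (0.5)
X = torch.mm(4 * torch.randn(samples, 2), u) + 2
X = X - X.mean(dim=0)

# Eigendecomposition of the empirical covariance: Cov = Q diag(eigenvalues) Q^T.
Cov = torch.mm(torch.t(X), X) / X.shape[0]
eigenvalues, Q = torch.linalg.eigh(Cov)

# Project the data onto the eigenvectors and recompute the covariance.
Xhat = torch.mm(X, Q)
cov_hat = torch.mm(torch.t(Xhat), Xhat) / Xhat.shape[0]

# The diagonal should match the eigenvalues; the off-diagonal should be near zero.
print(eigenvalues)
print(cov_hat)
```

In other words, after the projection each eigenvalue is the variance of the data along the corresponding eigenvector, which is why the components still have different scales.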

<h2 id="#Whitening<"> Whitening</h2>

In this section, we apply a whitening matrix, which gives all the features the same variance. The operation can be expressed as:

$\mathbf{\hat{x}}=\mathbf{x} \mathbf{Q} \mathbf{\Lambda}^{-1/2} $

We repeat the same process as PCA:

In [None]:
Cov=torch.mm(torch.t(X),X)/X.shape[0]
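# Cov is symmetric, so torch.linalg.eigh applies; eigenvalues is returned as a 1-D tensor
eigenvalues,eigenvectors=torch.linalg.eigh(Cov)


We calculate the diagonal matrix $\mathbf{\Lambda}^{-1/2}$:

In [None]:
# Diagonal matrix with entries 1/sqrt(eigenvalue)
diag=torch.diag(eigenvalues**(-1/2))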

In [None]:
Xhat=torch.mm(torch.mm(X,eigenvectors),diag)

We can plot the original data and the whitened data:

In [None]:
plt.scatter(X[:, 0].numpy(), X[:, 1].numpy(),label="data") 
plt.scatter(Xhat[:, 0].numpy(), Xhat[:, 1].numpy(),label="transformed data")
plt.xlabel("$q_{1}$")
plt.ylabel("$q_{2}$")

plt.legend()
plt.show()

We see that each component of the whitened data has a standard deviation close to one:

In [None]:
Xhat.std(dim=0)
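
A stronger check than the standard deviation is that the full covariance of the whitened data is approximately the identity matrix. This is a minimal sketch under the same data-generation assumptions as the lab, using `torch.linalg.eigh` for the symmetric covariance; the names `Q`, `inv_sqrt`, and `cov_white` are illustrative.

```python
import torch

torch.manual_seed(0)  # seed chosen only to make the sketch reproducible

# Correlated 2D data generated the same way as in the lab, then zero-centered.
samples = 200
u = torch.tensor([[1.0, 1.0], [0.10, -0.10]]) / (2) ** (0.5)
X = torch.mm(4 * torch.randn(samples, 2), u) + 2
X = X - X.mean(dim=0)

# Eigendecomposition of the empirical covariance.
Cov = torch.mm(torch.t(X), X) / X.shape[0]
eigenvalues, Q = torch.linalg.eigh(Cov)

# Whitening: x Q Lambda^{-1/2}.
inv_sqrt = torch.diag(eigenvalues ** (-0.5))
Xhat = torch.mm(torch.mm(X, Q), inv_sqrt)

# Covariance of the whitened data, using the same 1/N normalization.
cov_white = torch.mm(torch.t(Xhat), Xhat) / Xhat.shape[0]

print(cov_white)
print(Xhat.std(dim=0))
```

The standard deviation reported by `std` is slightly above one because it divides by $N-1$ by default, while the covariance above divides by $N$.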

<h2 id="#ZCA"> Zero-Phase Component Analysis (ZCA) </h2>

We apply ZCA. ZCA-transformed data is decorrelated and whitened, but it has more in common with the original data than PCA-whitened data does. We can transform the data as follows:

$\mathbf{\hat{x}}=\mathbf{x} \mathbf{Q} \mathbf{\Lambda}^{-1/2}\mathbf{Q}^{T} $

We apply Whitening:

In [None]:
Cov=torch.mm(torch.t(X),X)/X.shape[0]
# Cov is symmetric, so torch.linalg.eigh applies; eigenvalues is returned as a 1-D tensor
eigenvalues,eigenvectors=torch.linalg.eigh(Cov)
# Diagonal matrix with entries 1/sqrt(eigenvalue)
diag=torch.diag(eigenvalues**(-1/2))
Xhat=torch.mm(torch.mm(X,eigenvectors),diag)

We then rotate the whitened data back into the original coordinate system by multiplying by $\mathbf{Q}^{T}$:

In [None]:
Xhat=torch.mm(Xhat,torch.t(eigenvectors))

In [None]:
plt.scatter(X[:, 0].numpy(), X[:, 1].numpy(),label="data") 
plt.scatter(Xhat[:, 0].numpy(), Xhat[:, 1].numpy(),label="ZCA")
plt.legend()
plt.show()
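
The claim that ZCA stays closer to the original data can be illustrated numerically. The sketch below, under the same data-generation assumptions as the lab, compares the mean squared distance between the original points and their PCA-whitened and ZCA-whitened versions. The names `W_pca`, `W_zca`, `dist_pca`, and `dist_zca` are hypothetical helpers introduced for this comparison.

```python
import torch

torch.manual_seed(0)  # seed chosen only to make the sketch reproducible

# Correlated 2D data generated the same way as in the lab, then zero-centered.
samples = 200
u = torch.tensor([[1.0, 1.0], [0.10, -0.10]]) / (2) ** (0.5)
X = torch.mm(4 * torch.randn(samples, 2), u) + 2
X = X - X.mean(dim=0)

# Eigendecomposition of the empirical covariance.
Cov = torch.mm(torch.t(X), X) / X.shape[0]
eigenvalues, Q = torch.linalg.eigh(Cov)
inv_sqrt = torch.diag(eigenvalues ** (-0.5))

# PCA whitening matrix Q Lambda^{-1/2} and ZCA matrix Q Lambda^{-1/2} Q^T.
W_pca = torch.mm(Q, inv_sqrt)
W_zca = torch.mm(W_pca, torch.t(Q))

X_pca_white = torch.mm(X, W_pca)
X_zca = torch.mm(X, W_zca)

# Both outputs are white; check the ZCA covariance explicitly.
cov_zca = torch.mm(torch.t(X_zca), X_zca) / X.shape[0]

# Mean squared distance between each original point and its transformed version.
dist_pca = ((X - X_pca_white) ** 2).sum(dim=1).mean()
dist_zca = ((X - X_zca) ** 2).sum(dim=1).mean()

print(cov_zca)
print(dist_pca, dist_zca)
```

Both transforms produce data with identity covariance, but the ZCA matrix is symmetric and its output lies closer to the original points than the PCA-whitened output.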

<h2 id="#loss"> Why ZCA   </h2>

In contrast to PCA, ZCA preserves the orientation of the original data points, which is important in many applications. Let's create some data and color it by dataset so that we can track the orientation of the data.

In [None]:
samples=200

W1=3*torch.tensor([[1.0,1.0],[0.10,-0.10]])/(2)**(0.5)
W2=torch.tensor([[0.1,0.1],[1,-1]])/(2)**(0.5)
data_set1=torch.mm(torch.randn(samples,2),W1)
data_set2=torch.mm(torch.randn(samples,2),W2)
plt.scatter(data_set1[:, 0].numpy(), data_set1[:, 1].numpy(),c='r')
plt.scatter(data_set2[:, 0].numpy(), data_set2[:, 1].numpy(),c='b')
plt.xlabel("x_{1}")
plt.ylabel("x_{2}")
plt.show()

We convert the data into one dataset to calculate PCA and ZCA. 

In [None]:
X= torch.cat((data_set1, data_set2), 0)
X.shape

We compute the PCA basis from the combined dataset and project each dataset onto it:

In [None]:
Cov=torch.mm(torch.t(X),X)/X.shape[0]
# Cov is symmetric, so torch.linalg.eigh applies; eigenvalues is returned as a 1-D tensor
eigenvalues,eigenvectors=torch.linalg.eigh(Cov)
data_one_new=torch.mm(data_set1,eigenvectors)
data_two_new=torch.mm(data_set2,eigenvectors)

We can plot the PCA and the original data.

In [None]:
fig, axs = plt.subplots(2)
axs[0].scatter(data_set1[:, 0].numpy(), data_set1[:, 1].numpy(),c='r')
axs[0].scatter(data_set2[:, 0].numpy(), data_set2[:, 1].numpy(),c='b')
axs[0].title.set_text("DATA")
axs[1].scatter(data_one_new[:, 0].numpy(), data_one_new[:, 1].numpy(),c='r')
axs[1].scatter(data_two_new[:, 0].numpy(), data_two_new[:, 1].numpy(),c='b')
axs[1].title.set_text("PCA")

We see the orientation of the data appears different. Next, we build the ZCA transform $\mathbf{Q} \mathbf{\Lambda}^{-1/2}\mathbf{Q}^{T}$:

In [None]:
# Diagonal matrix with entries 1/sqrt(eigenvalue)
diag=torch.diag(eigenvalues**(-1/2))
transform=torch.mm(torch.mm(eigenvectors,diag),torch.t(eigenvectors))

We can apply the ZCA transform:

In [None]:
data_one_new=torch.mm(data_set1,transform)
data_two_new=torch.mm(data_set2,transform)

We can plot the data and see that ZCA preserves the orientation of the data.

In [None]:
fig, axs = plt.subplots(2)
axs[0].scatter(data_set1[:, 0].numpy(), data_set1[:, 1].numpy(),c='r')
axs[0].scatter(data_set2[:, 0].numpy(), data_set2[:, 1].numpy(),c='b')
axs[0].title.set_text("DATA")
axs[1].scatter(data_one_new[:, 0].numpy(), data_one_new[:, 1].numpy(),c='r')
axs[1].scatter(data_two_new[:, 0].numpy(), data_two_new[:, 1].numpy(),c='b')
axs[1].title.set_text("ZCA")
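
One way to make "preserves the orientation" concrete: the ZCA matrix $\mathbf{Q} \mathbf{\Lambda}^{-1/2}\mathbf{Q}^{T}$ is symmetric positive definite when the covariance has full rank, so every transformed point keeps a positive inner product with its original position, meaning it moves by less than 90 degrees. The sketch below checks this on data generated like the two datasets above; the names `dots_zca`, `dots_pca`, and `flipped_fraction_pca` are illustrative, and the same guarantee does not hold for PCA whitening.

```python
import torch

torch.manual_seed(0)  # seed chosen only to make the sketch reproducible

# Two datasets with different orientations, generated as in the lab.
samples = 200
W1 = 3 * torch.tensor([[1.0, 1.0], [0.10, -0.10]]) / (2) ** (0.5)
W2 = torch.tensor([[0.1, 0.1], [1, -1]]) / (2) ** (0.5)
data_set1 = torch.mm(torch.randn(samples, 2), W1)
data_set2 = torch.mm(torch.randn(samples, 2), W2)
X = torch.cat((data_set1, data_set2), 0)

# Eigendecomposition of (1/N) X^T X, as in the lab.
Cov = torch.mm(torch.t(X), X) / X.shape[0]
eigenvalues, Q = torch.linalg.eigh(Cov)
inv_sqrt = torch.diag(eigenvalues ** (-0.5))

pca_white = torch.mm(Q, inv_sqrt)           # Q Lambda^{-1/2}
zca = torch.mm(pca_white, torch.t(Q))       # Q Lambda^{-1/2} Q^T

# Inner product between each original point and its transformed version.
dots_zca = (X * torch.mm(X, zca)).sum(dim=1)
dots_pca = (X * torch.mm(X, pca_white)).sum(dim=1)

# Fraction of points that PCA whitening moves by more than 90 degrees.
flipped_fraction_pca = (dots_pca < 0).float().mean()

print(bool((dots_zca > 0).all()))
print(flipped_fraction_pca)
```

The fraction reported for PCA whitening depends on the sign and order of the eigenvectors returned by the solver, so it is shown only for comparison.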

<!--Empty Space for separating topics-->

<h2>About the Authors:</h2> 

<a href="https://www.linkedin.com/in/joseph-s-50398b136/">Joseph Santarcangelo</a> has a PhD in Electrical Engineering. His research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.

<hr>

Copyright &copy; 2020 <a href="cognitiveclass.ai?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu">cognitiveclass.ai</a>. This notebook and its source code are released under the terms of the <a href="https://bigdatauniversity.com/mit-license/">MIT License</a>.