# MACHINE LEARNING FOR RESEARCHERS

# Notebook 4. Dimensionality reduction methods



This notebook describes **dimensionality reduction** methods in the **unsupervised learning** domain.

The following contents are covered: 

- Principal Component Analysis (PCA)
- Autoencoders (with ANNs) (*extra material not covered in the Unit 3 slides*)

It is highly recommended that this notebook be read and run after a first reading of the theory and in parallel with the slides available in AV. 
Note also that you are not required to develop any code. All examples are fully implemented, so these notebooks should be regarded as demonstrative material. The goal is to understand how the algorithms operate. The notebook contains several questions that have to be submitted through AV. 

As with previous notebooks, some of the code used for plotting the data sets has been adapted from <a href=https://github.com/ageron/handson-ml2>Geron (Github site)</a>. Please consult that textbook for reference. 

## PCA

**Dimensionality reduction** algorithms aim at *automatically* coding the data from a high-dimensional space into a low-dimensional *feature space*. This processing is necessary because data sets tend to be extremely *sparse* in high-dimensional spaces, so reducing the number of dimensions is *critical* to make predictions more reliable by avoiding large extrapolations from the data.

PCA operates given a set of **unlabeled** observations $\{\pmb{x}_n\}$ for $n$ = $1,\ldots,N$, each of dimensionality $D$. The aim of PCA can be understood in a dual way as either:

- Maximizing the variance of the projected data (that is, the information kept by the data) over a feature space of dimensionality $M$<$D$.
- Minimizing the projection error in the data reconstruction. 

Both lead to exactly the same procedure, which involves finding the eigenvectors of the covariance matrix: 

\begin{equation}
\pmb{S} = \frac{1}{N} \sum_{n=1}^N (\pmb{x}_n - \overline{\pmb{x}})(\pmb{x}_n - \overline{\pmb{x}})^T
\end{equation}

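That is, each direction $\pmb{u}_m$ maximizes the projected variance $\pmb{u}_m^T \pmb{S} \pmb{u}_m$ subject to having unit length, which amounts to finding the stationary points of the Lagrangian:
\begin{equation}
\pmb{u}_m^T \pmb{S} \pmb{u}_m + \lambda_m (1 - \pmb{u}_m^T \pmb{u}_m)
\end{equation}

for $m$=$1,\ldots,M$. Setting its gradient with respect to $\pmb{u}_m$ to zero gives the eigenvalue equation $\pmb{S}\pmb{u}_m = \lambda_m \pmb{u}_m$, so $\lambda_m$ is the variance of the data projected onto $\pmb{u}_m$.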
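As a minimal sketch of this procedure (not the implementation used by sklearn), the eigenvectors can be obtained with NumPy from the covariance matrix $\pmb{S}$ defined above. The toy data and the names `X_toy`, `mixing`, `S` and `U` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (an assumption for illustration): N = 500 points, D = 3
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X_toy = rng.normal(size=(500, 3)) @ mixing

x_mean = X_toy.mean(axis=0)
Xc = X_toy - x_mean                  # centered data
S = Xc.T @ Xc / X_toy.shape[0]       # covariance matrix with the 1/N factor

# eigh is intended for symmetric matrices and returns ascending eigenvalues
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]    # sort from largest to smallest eigenvalue
eigvals, U = eigvals[order], eigvecs[:, order]   # column m of U is u_m

print(np.allclose(S @ U, U * eigvals))               # S u_m = lambda_m u_m
print(np.allclose(np.linalg.norm(U, axis=0), 1.0))   # unit-length eigenvectors
print(np.allclose((Xc @ U).var(axis=0), eigvals))    # projected variance = lambda_m
```

Note that sklearn's `PCA.explained_variance_` uses the $1/(N-1)$ normalization, so its values differ from these eigenvalues by the factor $N/(N-1)$; the directions are the same up to sign.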

These eigenvectors are chosen to have unit length and they determine the **projection directions** in the input space. Thus, the projection of a point $\pmb{x}$ onto direction $\pmb{u}_i$ has (signed) length $\pmb{x}^T\pmb{u}_i$=$\pmb{u}_i^T\pmb{x}$, and, once the data are centered at the mean $\overline{\pmb{x}}$, the projected point is:

\begin{equation}
\tilde{\pmb{x}}_n=\overline{\pmb{x}} + \sum_{i=1}^M [\pmb{u}_i^T (\pmb{x}_n-\overline{\pmb{x}})] \pmb{u}_i
\end{equation}
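The projection formula above can be sketched in NumPy as follows. This is an illustrative example on hypothetical toy data (`X_toy`, with $D$=5 and $M$=2 chosen arbitrarily); it also checks numerically that the mean squared reconstruction error equals the sum of the discarded eigenvalues, which links the two views of PCA listed earlier.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data (assumption): N = 400 points, D = 5, with a different spread per axis
X_toy = rng.normal(size=(400, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

x_mean = X_toy.mean(axis=0)
Xc = X_toy - x_mean
S = Xc.T @ Xc / len(X_toy)                   # covariance matrix (1/N factor)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
eigvals, U = eigvals[order], eigvecs[:, order]

M = 2
U_M = U[:, :M]                     # first M projection directions
Z = Xc @ U_M                       # coordinates u_i^T (x_n - mean), shape (N, M)
X_tilde = x_mean + Z @ U_M.T       # projected points, back in the input space

# Mean squared reconstruction error vs. sum of the discarded eigenvalues
mse = np.mean(np.sum((X_toy - X_tilde) ** 2, axis=1))
print(mse, eigvals[M:].sum())
```

The two printed numbers should agree up to rounding, because the error left after projecting is exactly the variance along the directions that were dropped.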

This projection has a (hopefully small) error compared to the original point. This error depends on how many dimensions ($M$) are chosen in the feature space. The goal of PCA is thus to determine this basis. For that, either $M$ is fixed by some criterion (e.g., $M$=2 for data visualization) or by an educated guess, or, more commonly, the value of **$M$ is determined such that the sum of the $M$ largest eigenvalues reaches some ratio (say 90% or 95%) of the total sum of the eigenvalues**. This last procedure is justified by the observation that the larger the eigenvalue, the better the associated eigenvector is for projecting the data (less projection error).
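The eigenvalue-ratio criterion can be sketched as a small helper. The function name `choose_M` and the example spectrum are hypothetical; the rule implemented is simply "smallest $M$ whose $M$ largest eigenvalues reach the requested ratio of the total".

```python
import numpy as np

def choose_M(eigvals, ratio):
    """Smallest M whose M largest eigenvalues sum to at least `ratio` of the total."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # largest first
    cumulative = np.cumsum(lam) / lam.sum()
    # Index of the first cumulative ratio that reaches the target, turned into a count
    M = int(np.searchsorted(cumulative, ratio)) + 1
    return min(M, len(lam))          # guard against rounding when ratio is 1.0

spectrum = [4.0, 3.0, 2.0, 0.5, 0.5]     # hypothetical eigenvalues, total = 10
for ratio in (0.50, 0.85, 0.99):
    print(ratio, choose_M(spectrum, ratio))
```

With this spectrum the cumulative ratios are 0.4, 0.7, 0.9, 0.95 and 1.0, so the helper returns 2, 3 and 5 components for the three targets.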

Next, we show how to apply PCA to the MNIST data set with the sklearn library.

In [None]:
import numpy as np
from IPython.display import clear_output
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

from sklearn.datasets import fetch_openml
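# as_frame=False returns NumPy arrays, which the reshape calls in the plotting helpers below require
mnist = fetch_openml('mnist_784', version=1, as_frame=False)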
# print(mnist.DESCR) # Uncomment to show description
X, t = mnist["data"], mnist["target"] # N = 70000, D = 784 
t = t.astype(int)
X_train, X_test, t_train, t_test = X[:60000], X[60000:], t[:60000], t[60000:]
X_train = X_train/255
X_test = X_test/255

#from sklearn.utils import shuffle
#X, t = shuffle(X, t, random_state=0)

def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.binary,
               interpolation="nearest")
    plt.axis("off")
    
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = mpl.cm.plasma, **options)
    plt.axis("off")
    

def plot_data(X, R, colors=['r.', 'g.', 'b.', 'k.', 'y.','c.','m.'], markersize=2):
    Raux = R.copy()
    if R.ndim>1:
        K = R.shape[1]
        for n in range(R.shape[0]):
            i = np.argmax(R[n])
            Raux[n] = np.zeros([1,K])
            Raux[n,i] = 1
    else:
        minR = np.amin(R)
        K = np.amax(R)-minR+1
        Raux = np.zeros([X.shape[0],K])
        for n in range(R.shape[0]):
            i = R[n] - minR 
            Raux[n] = np.zeros([1,K])
            Raux[n,i] = 1 
            
    for k in range(Raux.shape[1]):
        pattern = np.zeros([K,])
        pattern[k] = 1
        matches = (Raux==pattern).all(axis=1).nonzero()
        plt.plot(X[matches, 0], X[matches, 1], colors[k], markersize=markersize)


In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
Phi_train = pca.fit_transform(X_train)

plt.figure(figsize=(9,9))
plot_data(Phi_train[:1000], t[:1000], colors=['r+', 'g.', 'k.', 'k+', 'y+','c*','m.','r*','b*','g*','c*'], markersize=6)
plt.show()

Although some *clusters* can be discerned, it is clear that more than 2 dimensions are needed to obtain a separable representation. We can ask PCA to retain a given fraction of the spectral energy (the sum of eigenvalues), here 85% and 50%, show some information, and train a classifier (logistic regression) in the feature space.

In [None]:
from sklearn.decomposition import PCA
pca_85 = PCA(n_components=0.85) # 85% of the energy
Phi_85 = pca_85.fit_transform(X_train)
pca_50 = PCA(n_components=0.5) # 50% of the energy
Phi_50 = pca_50.fit_transform(X_train)
print(f'For 85% of energy -> M:{pca_85.n_components_}, original dimension:{X.shape[1]}')
print(f'For 50% of energy -> M:{pca_50.n_components_}, original dimension:{X.shape[1]}')

In [None]:
# Show the mean "digit" (it is the same for the 50% and the 85% models)
meandigit = pca_85.mean_
plot_digit(meandigit)

In [None]:
# Show eigen"digits" (the projection directions) for 85% of the energy
plt.figure(figsize=(12, 8))
plot_digits(pca_85.components_)

In [None]:
# Show eigen"digits" (the projection directions) for 50% of the energy
plt.figure(figsize=(12, 8))
plot_digits(pca_50.components_)

The first eigenvalues carry more energy; thus, the associated eigendigits resemble a 'digit' more clearly. Note that the eigendigits are the same in both cases; the difference is that with 85% of the spectral energy more components are required than with 50%.

**To compute a coordinate in the feature space, the (mean-centered) digit is multiplied by the corresponding eigendigit through an inner (scalar) product**.
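The following sketch illustrates both remarks on a small synthetic data set (`X_toy`, `pca_small` and `pca_large` are hypothetical names, used so as not to overwrite the MNIST variables). It assumes the default `whiten=False` and fixes `svd_solver="full"` so that both models are computed in the same way.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for the digits: 200 samples, 6 features of different scale
X_toy = rng.normal(size=(200, 6)) * np.array([3.0, 2.5, 2.0, 1.5, 1.0, 0.5])

pca_small = PCA(n_components=2, svd_solver="full").fit(X_toy)
pca_large = PCA(n_components=4, svd_solver="full").fit(X_toy)

# The directions of the smaller model are the first directions of the larger one
print(np.allclose(pca_small.components_, pca_large.components_[:2]))

# Feature-space coordinates: one inner product per direction with the centered sample
Phi_manual = (X_toy - pca_large.mean_) @ pca_large.components_.T
print(np.allclose(Phi_manual, pca_large.transform(X_toy)))

# Going back to the input space, as inverse_transform does
X_back_manual = pca_large.mean_ + Phi_manual @ pca_large.components_
print(np.allclose(X_back_manual, pca_large.inverse_transform(Phi_manual)))
```

The same relations hold for `pca_85` and `pca_50` above, with the rows of `components_` playing the role of the eigendigits.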

In [None]:
# Let's project digits and show the results
# First: original digits
plt.figure(figsize=(12, 8))
plot_digits(X_test[:100])

In [None]:
# Second: digits reconstructed from the 85% model
plt.figure(figsize=(12, 8))
X_back = pca_85.inverse_transform(pca_85.transform(X_test))
plot_digits(X_back[:100])

In [None]:
# Third: digits reconstructed from the 50% model
plt.figure(figsize=(12, 8))
X_back = pca_50.inverse_transform(pca_50.transform(X_test))
plot_digits(X_back[:100])

In [None]:
from sklearn.linear_model import LogisticRegression

# The default solver is lbfgs, with L2 (ridge) regularization by default
clf = LogisticRegression(random_state=0, multi_class='multinomial',max_iter=10000).fit(Phi_85, t_train)

In [None]:
print(f'The score (accuracy) in the training set is {clf.score(Phi_85,t_train)*100:.02f}%')
print(f'The score (accuracy) in the test set is {clf.score(pca_85.transform(X_test),t_test)*100:.02f}%')

plot_digits(X_test[:10], images_per_row=10)
print('Labels predicted for the 10 first pictures in the test set')
print(clf.predict(pca_85.transform(X_test)[:10]))
plt.show()

**The result is similar to that of softmax regression, but the classifier took 5 times less time to train.** 

As an additional example, we show the eigenvectors associated with a 3D toy data set. 

In [None]:
# Given the following set of points,
# draw the first two principal vectors in red,
# centered at the mean point

import numpy as np
from IPython.display import clear_output
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D 
from sklearn.datasets import make_blobs
X,_ = make_blobs(n_samples=[500], centers=[[1,1,1]], cluster_std=0.4, random_state=0)
X = X.dot([[1.5,-1.3,1],[-1.3,1.4,0.4],[0.1,0.8,0.3]])
fig = plt.figure(figsize=(15,10))
ax = plt.axes(projection='3d')
ax.view_init(35, 35)
ax.scatter3D(X[:,0], X[:,1], X[:,2], c=X[:,0], cmap='Blues')

ax.set_xlim(-2, 3); ax.set_ylim(-2, 3); ax.set_zlim(0, 5);
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')

from sklearn.decomposition import PCA
pca = PCA(n_components=2) 
Phi = pca.fit_transform(X)
mean = pca.mean_
components = pca.components_*1.5
ax.quiver(mean[0], mean[1], mean[2], components[0,0], components[0,1], components[0,2], linewidths=1.5, color='red', pivot='tail')
ax.quiver(mean[0], mean[1], mean[2], components[1,0], components[1,1], components[1,2], linewidths=1.5, color='red', pivot='tail')

## Autoencoders with Keras

In this section we cover a structure called an **autoencoder**, which was introduced in Unit 2 along with other unsupervised methods, and which is aimed at dimensionality reduction and other applications such as noise filtering.

An autoencoder is a parametric structure (most typically an ANN) which transforms the data inputs $\pmb{x}_n$ into a **coded form $c(\pmb{x}_n,\pmb{w_c})$**, and then back to the input space, $\widetilde{\pmb{x}}_n$= $d(c(\pmb{x}_n, \pmb{w_c}),\pmb{w_d})$. Since the code is of (much) lower dimensionality than the input space, some reconstruction error $ J_n $ = $\|\pmb {x} _n- \widetilde {\pmb {x}} _ n\|^2 $ appears.

The autoencoder parameters $\pmb{w_c}$, $\pmb{w_d}$ are found by minimizing, e.g., the total squared reconstruction error ($J$):
\begin{equation}
\min \limits _ {\pmb {w_c}, \pmb {w_d}} \frac {1} {2} \sum_ {n = 1} ^ N J_n
\end{equation}

That is, the encoder and decoder parameters are found by **minimizing the joint reconstruction error over a training set**, i.e., the sum of the squared differences between the original data and the reconstructed data. 
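To make the minimization concrete, here is a minimal NumPy sketch of a *linear* autoencoder trained by plain gradient descent on the cost above. It is an illustration under stated assumptions, not the Keras model used in this notebook: the toy data, the learning rate, the number of steps and all variable names are hypothetical, and there are no activation functions or bias terms.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumption): N = 300 points in D = 4 lying close to a 2D subspace
mixing = np.array([[1.0, 0.5, 0.0, -0.5],
                   [0.0, 1.0, 1.0, 0.5]])
X_toy = rng.normal(size=(300, 2)) @ mixing + 0.05 * rng.normal(size=(300, 4))
X_toy = X_toy - X_toy.mean(axis=0)       # centered, so no bias terms are needed

N, D = X_toy.shape
M = 2                                    # code length
W_c = 0.1 * rng.normal(size=(D, M))      # encoder parameters w_c
W_d = 0.1 * rng.normal(size=(M, D))      # decoder parameters w_d

def cost(W_c, W_d):
    """J = 1/2 * sum_n ||x_n - d(c(x_n, w_c), w_d)||^2 for a linear encoder/decoder."""
    X_rec = (X_toy @ W_c) @ W_d
    return 0.5 * np.sum((X_toy - X_rec) ** 2)

J_start = cost(W_c, W_d)
lr, n_steps = 0.05, 1000                 # hypothetical step size and iteration count
for _ in range(n_steps):
    code = X_toy @ W_c                   # coded form c(x, w_c)
    err = code @ W_d - X_toy             # reconstruction minus input
    # Gradients of J divided by N (the scaling only changes the effective step size)
    grad_W_d = code.T @ err / N
    grad_W_c = X_toy.T @ (err @ W_d.T) / N
    W_c -= lr * grad_W_c
    W_d -= lr * grad_W_d
J_end = cost(W_c, W_d)

# Lower bound for any linear code of length M: the PCA reconstruction error,
# i.e. half the sum of the D - M smallest eigenvalues of the scatter matrix
scatter_eigvals = np.linalg.eigvalsh(X_toy.T @ X_toy)   # ascending order
J_pca = 0.5 * scatter_eigvals[:D - M].sum()

print(f"J at start: {J_start:.2f}, J after training: {J_end:.2f}, PCA bound: {J_pca:.2f}")
```

A linear autoencoder with a squared-error cost cannot reconstruct better than PCA with the same number of components; the nonlinear activations and extra layers used below are what allow an autoencoder to go beyond a linear projection.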

An easy way to construct autoencoders is to use the high-level **Keras** library to define ANNs. It relies on a lower-level library, such as **TensorFlow**, to build the layers and the connections, while conveniently hiding the details from the network designer.  

In [None]:
# Train an autoencoder built with Keras
import keras
from sklearn.preprocessing import normalize

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 
import tensorflow as tf

early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, 
                                                  restore_best_weights=True)
checkpoint_cb = keras.callbacks.ModelCheckpoint("autoencoder.h5", 
                                                save_best_only=True)

repeat = True # If True, keep training a previously saved model for more epochs

autoencoder = keras.models.Sequential([
    keras.layers.Dense(50, input_shape=[X_train.shape[1]], activation="tanh"), 
    keras.layers.Dense(10, input_shape=[50], activation="tanh"),
    keras.layers.Dense(50, input_shape=[10], activation="tanh"),
    keras.layers.Dense(X_train.shape[1], input_shape=[50], activation="tanh")
])
    
try: 
    autoencoder = keras.models.load_model("autoencoder.h5") # load the saved model
    print('Model loaded')
    if repeat: 
        raise IOError
except IOError:
    # The code (bottleneck layer) has length 10, close to the PCA with 50% of energy
    autoencoder.compile(loss="mse", optimizer="adam")
    history = autoencoder.fit(X_train, X_train, epochs=30, batch_size=100, 
                              validation_data=(X_test, X_test), 
                              callbacks=[checkpoint_cb, early_stopping_cb])

In [None]:
# Show the digits reconstructed by the autoencoder (top) and the originals (bottom)
plt.figure(figsize=(12, 8))
X_back = autoencoder.predict(X_test[:10])
plt.subplot(211)
plot_digits(X_back[:10])
plt.subplot(212)
plot_digits(X_test[:10])

## Questions

***
### Question 1
> **Compare this last result, and the time needed to learn the coding, with those of PCA. What are the pros and cons of each method?**
***

***
### Question 2
> **Provide an example from your research topic where dimensionality reduction should be used, and explain which of the two previous methods you would use.**
***