In [None]:
%load_ext nb_black

# Dimensionality reduction

The goal of this interactive demo is to show how a machine learning model can perform dimensionality reduction. Keep in mind that the code in this notebook was simplified for the demo and should not be used as a plug-and-play example for real machine learning projects.

In this notebook we will explore three different types of dimensionality reduction models:

- Projection based
- Manifold based
- Autoencoders

## The dataset

For this demo we will use the [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset. Its training set contains 60,000 grayscale images (28 x 28 pixels each) and a separate test set holds another 10,000. Every image belongs to one of 10 distinct target classes:

1. T-shirt/top
2. Trouser
3. Pullover
4. Dress
5. Coat
6. Sandal
7. Shirt
8. Sneaker
9. Bag
10. Ankle boot

So let's go ahead and download and prepare the dataset:

In [None]:
# Import relevant tensorflow packages
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load dataset, already pre-split into train and test set
(X_tr, y_tr), (X_te, y_te) = keras.datasets.fashion_mnist.load_data()

# Reshape data to 2D array
X_tr = X_tr.reshape(len(X_tr), -1)
X_te = X_te.reshape(len(X_te), -1)

# Specify target class labels
labels = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
          "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Report shape of dataset
print("X_tr shape:", X_tr.shape)
print("X_te shape:", X_te.shape)
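
The reshape step above turns each 28 x 28 image into a single row of 784 pixel values. As a minimal sketch, assuming a small random `images` array as a stand-in for the real dataset, the following shows that this flattening is lossless and can be undone:

```python
import numpy as np

# Synthetic stand-in for a batch of five 28 x 28 grayscale images (not the real dataset)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten every image into one row: (5, 28, 28) -> (5, 784)
flat = images.reshape(len(images), -1)

# Undo the flattening to recover the original image grid
restored = flat.reshape(-1, 28, 28)

print("flat shape:", flat.shape)
print("lossless round trip:", np.array_equal(images, restored))
```

The same `reshape(-1, 28, 28)` call is what makes it possible to display a flattened row as an image again.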

Let's have a look at the first 250 images of this dataset.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Create image collage
img_collage = np.concatenate(
    [np.concatenate(
            [X_tr[idx + jdx * 25].reshape(28, 28) for idx in range(25)], axis=1
        ) for jdx in range(10)])

# Plot image collage
plt.figure(figsize=(15, 10))
plt.imshow(img_collage, cmap="binary")
plt.axis("off");
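
The nested `np.concatenate` calls above build a grid of 10 rows with 25 images each. The same grid can probably be written more compactly with a reshape and a transpose. The sketch below uses a random array `X` as a stand-in for the flattened images (an assumption for illustration) and checks that both versions agree:

```python
import numpy as np

# Synthetic stand-in for the first 250 flattened images (not the real dataset)
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(250, 784), dtype=np.uint8)

# Nested-concatenate version, following the cell above
collage_loop = np.concatenate(
    [np.concatenate([X[idx + jdx * 25].reshape(28, 28) for idx in range(25)], axis=1)
     for jdx in range(10)]
)

# Vectorized version: (rows, cols, height, width) -> (rows, height, cols, width)
collage_vec = (
    X[:250].reshape(10, 25, 28, 28).transpose(0, 2, 1, 3).reshape(10 * 28, 25 * 28)
)

print("collage shape:", collage_vec.shape)
print("identical:", np.array_equal(collage_loop, collage_vec))
```

Which version to prefer is a matter of taste: the loop is easier to read, the reshape avoids building intermediate lists.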

## Projection-based dimensionality reduction

Now that the data is ready, let's explore a projection-based dimensionality reduction method: **principal component analysis** (PCA).

In [None]:
from sklearn.decomposition import PCA

# Create PCA model
pca = PCA()

# Train and apply PCA model to the training data
X_tr_pca = pca.fit_transform(X_tr)
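
A fitted scikit-learn `PCA` object reports how much of the total variance each component captures, which is one common way to decide how many components are worth keeping. The following is a minimal sketch on synthetic data; `X_demo`, `pca_demo` and the 90% threshold are assumptions chosen for illustration, not values from this demo:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 200 samples, 10 features with decreasing spread
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10)) * np.arange(10, 0, -1)

pca_demo = PCA().fit(X_demo)

# Share of total variance per component, sorted from largest to smallest
ratios = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components that keeps at least 90% of the variance
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print("components for 90% variance:", n_keep)
```

The same attribute is available on the `pca` object fitted above, so the check can be repeated on the real data.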

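Now that the PCA model is trained and the data is projected onto its principal components, let's take a look at how the 10 target classes are distributed over those dimensions. Note that `PCA()` without an `n_components` argument keeps all 784 components, sorted by explained variance, so the actual reduction comes from looking at only the first few of them.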

In [None]:
from utils import plot_2d_grid
plot_2d_grid(X_tr_pca, y_tr, labels, d1=0, d2=1)

<div class="alert alert-success">
  <h2>Exercise</h2>
    <p></p>
Change the <code>d1</code> and <code>d2</code> parameters to any other values between 0 and 19 to see how the target classes distribute over the first 20 dimensions of this low-dimensional PCA-space.
</div>
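
The `plot_2d_grid` helper comes from the local `utils` module and its code is not shown here. As a rough idea of what such a plot involves, the sketch below scatters two chosen dimensions of an embedding with one color per class. The function `scatter_2d` and the synthetic data are hypothetical stand-ins, not the helper's actual implementation:

```python
import numpy as np
import matplotlib.pyplot as plt


def scatter_2d(X_low, y, class_names, d1=0, d2=1):
    """Hypothetical stand-in: scatter two embedding dimensions, one color per class."""
    fig, ax = plt.subplots(figsize=(6, 6))
    for class_id, name in enumerate(class_names):
        mask = y == class_id
        ax.scatter(X_low[mask, d1], X_low[mask, d2], s=5, alpha=0.5, label=name)
    ax.set_xlabel(f"Dimension {d1}")
    ax.set_ylabel(f"Dimension {d2}")
    ax.legend(markerscale=3)
    return ax


# Synthetic embedding: 300 points, 4 dimensions, 3 classes shifted apart
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 3, size=300)
X_demo = rng.normal(size=(300, 4)) + y_demo[:, None]

ax = scatter_2d(X_demo, y_demo, ["a", "b", "c"], d1=0, d2=2)
```

Changing `d1` and `d2` simply selects different columns of the embedding, which is all the exercise above asks for.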

In [None]:
from utils import plot_pca_decomposition
plot_pca_decomposition(X_tr, y_tr, n_dim=100)

<div class="alert alert-success">
  <h2>Exercise</h2>
    <p></p>
Change the <code>n_dim</code> value to anything between 0 and 783 to see the data compression effect of the dimensionality reduction approach.
</div>
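
The compression effect can also be expressed as a number: the fewer components are kept, the larger the error when mapping the data back to the original space. The sketch below does this from scratch with NumPy's SVD on synthetic data. It is not the code behind `plot_pca_decomposition`, and `X_demo` and `reconstruct` are names introduced only for this illustration:

```python
import numpy as np

# Synthetic stand-in: 100 samples with 20 correlated features plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 20))
X_demo += 0.1 * rng.normal(size=X_demo.shape)

# PCA via SVD of the centered data; rows of Vt are the principal directions
mean = X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_demo - mean, full_matrices=False)


def reconstruct(n_dim):
    """Project onto the first n_dim components and map back to feature space."""
    scores = (X_demo - mean) @ Vt[:n_dim].T
    return scores @ Vt[:n_dim] + mean


for n_dim in (1, 5, 20):
    mse = np.mean((X_demo - reconstruct(n_dim)) ** 2)
    print(f"n_dim={n_dim:2d}  reconstruction MSE={mse:.4f}")
```

With all 20 components the reconstruction is exact up to floating point error; with fewer, the error grows as information is discarded.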

## Manifold-based dimensionality reduction

Next up is **manifold**-based dimensionality reduction. There are many different methods for identifying such a manifold and then mapping the data onto it. Let's take a look at one of them, called Uniform Manifold Approximation and Projection (UMAP).

In [None]:
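from umap import UMAP

# Create UMAP model that embeds the data into 4 dimensions
umap = UMAP(n_components=4, min_dist=0.8, n_jobs=-1)

# Train and apply UMAP model to the training data
X_tr_umap = umap.fit_transform(X_tr)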

As before, let's have a look at how the data is distributed over the low-dimensional space.

In [None]:
plot_2d_grid(X_tr_umap, y_tr, labels, d1=0, d2=1)

<div class="alert alert-success">
  <h2>Exercise</h2>
    <p></p>
Change the <code>d1</code> and <code>d2</code> parameters to any other values between 0 and 3 to see how the target classes distribute over the 4 dimensions of this low-dimensional UMAP space.
</div>
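
Manifold methods such as UMAP start from local neighborhoods: for every point they look up its nearest neighbors and then search for a low-dimensional layout that preserves those neighborhoods. The sketch below covers only that first step, a brute-force nearest-neighbor search in NumPy on synthetic points. It is not UMAP itself, and `points`, `k` and `neighbors` are illustrative names:

```python
import numpy as np

# Synthetic stand-in: 50 points on a noisy circle embedded in 3D
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=50)
points = np.column_stack([np.cos(angles), np.sin(angles), 0.05 * rng.normal(size=50)])

# Pairwise Euclidean distances between all points
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Indices of the k nearest neighbors of each point (column 0 is the point itself)
k = 5
neighbors = np.argsort(dist, axis=1)[:, 1 : k + 1]

print("neighbors of the first three points:")
print(neighbors[:3])
```

Brute force is fine for 50 points but scales quadratically, which is why libraries rely on approximate neighbor search for datasets of this notebook's size.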

## Autoencoder

Last but not least, let's look at a dimensionality reduction approach that uses deep learning: the **autoencoder**. More precisely, we will use a particular type of autoencoder, the **variational autoencoder (VAE)**.
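
What sets a VAE apart from a plain autoencoder is that the encoder outputs a mean and a log-variance per latent dimension, a latent vector is sampled from that distribution, and a KL divergence term pulls the distribution towards a standard normal. The sketch below shows these two ingredients in NumPy with made-up encoder outputs. It follows the standard VAE formulation and is not taken from `get_variational_autoencoder`, whose internals are not shown in this notebook:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up encoder outputs for a batch of 4 images and a 2-dimensional latent space
z_mean = rng.normal(size=(4, 2))
z_log_var = rng.normal(scale=0.1, size=(4, 2))

# Reparameterization: sample z = mean + std * noise, with noise ~ N(0, 1)
eps = rng.normal(size=z_mean.shape)
z = z_mean + np.exp(0.5 * z_log_var) * eps

# KL divergence between N(mean, var) and the standard normal prior, per sample
kl = -0.5 * np.sum(1 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=1)

print("z shape:", z.shape)
print("KL per sample:", kl)
```

In a real model the KL term is added to the reconstruction loss, so the network has to balance faithful reconstructions against a well-behaved latent space.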

In [None]:
# Reshape and rescale data to prepare it for the autoencoder model
mnist_digits = X_tr.reshape(-1, 28, 28, 1).astype("float32") / 255

# Create variational autoencoder (VAE)
from utils import get_variational_autoencoder
vae = get_variational_autoencoder(n_dim=2)

# Train variational autoencoder model
vae.fit(mnist_digits, epochs=10, batch_size=256)

Now that the autoencoder is trained, we can pass data through it and extract the low-dimensional representation.

In [None]:
# Compute low-dimensional data representation
X_tr_vae = vae.encoder.predict(mnist_digits)[0]

# Plot low-dimensional data representation
plot_2d_grid(X_tr_vae, y_tr, labels)