**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install datasets
!{sys.executable} -m pip install umap-learn
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

from datasets import load_dataset
from class_utils import imscatter
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
import umap

## Dimensionality Reduction

The goal of dimensionality reduction methods is to reduce the dimensionality of input data, while preserving as much useful information as possible. Dimensionality reduction can be applied for different purposes, e.g.:

* to decrease the computational cost of processing high-dimensional data;
* to visualize high-dimensional data;
* ...

We will now illustrate how dimensionality reduction can be used for the purpose of data visualization.

### Loading the Data

In this example, we will be using the [Fashion MNIST](https://huggingface.co/datasets/fashion_mnist) dataset, which contains low-resolution ($28 \times 28$ pixel) images of different kinds of footwear, clothes, etc.

The images are from the following classes:

label id | label       |  | label id | label     
-------- | ----------- | - | -------- | ----------
**0**    | T-shirt/top |  | **5**    | Sandal    
**1**    | Trouser     |  | **6**    | Shirt     
**2**    | Pullover    |  | **7**    | Sneaker   
**3**    | Dress       |  | **8**    | Bag       
**4**    | Coat        |  | **9**    | Ankle boot

It is very easy to load the data in Python, because the `datasets` package from Hugging Face includes a built-in function for this purpose. We merely call the `load_dataset` function, specifying `"fashion_mnist"` as the dataset. With default parameters, we would get a dataset split into two parts: the train fold and the test fold. Since we are only going to be doing visualization and not supervised learning, there is no need for a separate test set, so we specify `split='train+test'`, which concatenates both folds into a single dataset.



In [None]:
dataset = load_dataset("fashion_mnist", split='train+test')
dataset

One rather convenient feature of datasets loaded using `load_dataset` is that they come with a `.info` attribute, which holds metadata about the dataset. One can, for instance, display a short description of the dataset, or get the label names:



In [None]:
print(dataset.info.description)

In [None]:
class_names = dataset.info.features['label'].names
print(class_names)

In [None]:
X = np.asarray([np.asarray(img) for img in dataset['image']])
Y = np.asarray(dataset['label'])

To get some idea of what our data looks like, we will now display a few randomly selected samples:



In [None]:
fig, axes = plt.subplots(5, 5)
fig.set_size_inches([7, 7])

for ax_row in axes:
    for ax in ax_row:
        ind = np.random.randint(0, X.shape[0])
        ax.imshow(X[ind], cmap='Greys')
        ax.set_title(class_names[Y[ind]])
        ax.axis('off')
    
plt.subplots_adjust(hspace=0.5)

### Preprocessing

Note that normally we would **standardize** the data (rescale each column so that its mean is zero and its standard deviation is 1) before applying PCA or UMAP. This prevents the methods from treating certain columns as more important merely because their scale is larger. In this case, however, our data consists of images, so each dimension (each pixel) already has the same scale.
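
For data whose columns do have different scales, the standardization described above can be sketched with plain NumPy as follows. The toy array and the variable names are illustrative assumptions and not part of this notebook's data; `StandardScaler` performs the same per-column computation.

```python
import numpy as np

# Hypothetical toy data: two columns with very different scales.
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

# Standardize each column: subtract its mean and divide by its standard deviation.
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)
X_std = (X_toy - col_mean) / col_std

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```

After standardization, both columns contain the same values, even though their original scales differed by a factor of about a thousand.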



In [None]:
# WE DO NOT NEED THIS BECAUSE WE HAVE AN IMAGE DATASET

# input_preproc = make_pipeline(
#     SimpleImputer(),
#     StandardScaler()
# )

# X_preproc = input_preproc.fit_transform(X.reshape(X.shape[0], -1))
# X_preproc = X_preproc.reshape(X.shape)
# X = X_preproc

### Dimensionality Reduction using PCA and Visualization

Given that the images are $28 \times 28$ pixels, we are dealing with a 784-dimensional space. If we want to visualize its structure, we will need to reduce our data to a 2-dimensional space. Naturally, we will lose a lot of information in the process, but hopefully we will still be able to learn something about the structure of the space.
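
As a small illustration of where the 784 dimensions come from, the sketch below flattens a stack of images into one row per image. The array `images` is a hypothetical stand-in for the real data.

```python
import numpy as np

# Hypothetical stand-in for the image array: 5 blank images of 28 x 28 pixels.
images = np.zeros((5, 28, 28), dtype=np.uint8)

# Flatten each image into one row; -1 lets NumPy infer 28 * 28 = 784.
flat = images.reshape(images.shape[0], -1)
print(flat.shape)  # (5, 784)

# Flattening is lossless: reshaping back recovers the original images.
restored = flat.reshape(images.shape)
```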

The first method that we are going to test is called PCA (principal component analysis). It is a very fast method, but it can only make use of linear relationships in the data, not of non-linear ones. However, for some datasets this is sufficient.
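
To make the "linear" part concrete, here is a minimal sketch of PCA built on NumPy's singular value decomposition. The function `pca_project` and the toy data are assumptions made for illustration; this is not how the notebook computes its results, and the signs of the components may differ from those returned by `scikit-learn`.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project the rows of `data` onto the top principal components (illustrative)."""
    # PCA works on centered data: subtract the mean of each column.
    centered = data - data.mean(axis=0)
    # The rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # The projection is a single matrix product, i.e. a linear map.
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 points in 5 dimensions, stretched along the first axis.
toy = rng.normal(size=(200, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])
projected = pca_project(toy)
print(projected.shape)  # (200, 2)
```

Because the whole transformation is one matrix multiplication, PCA cannot "unfold" curved structures in the data, which is the limitation mentioned above.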



In [None]:
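# Flatten each image into a 784-dimensional vector and keep only
# the first two principal components, which are all we plot.
pca = PCA(n_components=2)
points_pca = pca.fit_transform(X.reshape((X.shape[0], -1)))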

We will shuffle the points before we visualize them, since the data is sorted by class in the original dataset. If we want to see whether PCA can separate the classes, shuffling the data first is a good idea: otherwise a later class could completely cover the points of an earlier class, thereby giving us a false impression of good separation.
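
The key detail when shuffling is that the points and their labels must be reordered by the same permutation. A minimal sketch on hypothetical toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data sorted by class: the first coordinate encodes the label.
points = np.array([[0.0, 0.1], [0.0, 0.2], [1.0, 0.3], [1.0, 0.4]])
labels = np.array([0, 0, 1, 1])

# One permutation, applied to both arrays, shuffles the order
# while keeping every point paired with its own label.
perm = rng.permutation(len(points))
points_shuffled = points[perm]
labels_shuffled = labels[perm]
```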



In [None]:
perm_ind = np.random.permutation(points_pca.shape[0])
xx = points_pca[perm_ind]
yy = Y[perm_ind]

Finally, we only need to visualize all the resulting points and colour them by class. As we can see, the classes do not seem to be clearly separated from each other after doing PCA. Some classes actually appear to be separated (such as bag and trouser), but the figure is rather unreadable as a whole. 



In [None]:
plt.figure(figsize=[10, 7])
# plt.get_cmap replaces plt.cm.get_cmap, which was removed in recent matplotlib versions.
plt.scatter(xx[:, 0], xx[:, 1], c=yy,
            cmap=plt.get_cmap('jet', len(class_names)),
            rasterized=True)
cbar = plt.colorbar()
cbar.set_ticks(range(len(class_names)))
cbar.set_ticklabels(class_names)
plt.xlabel("dim 1")
plt.ylabel("dim 2")

The figure would be even less informative if we didn't colour the points by class:



In [None]:
plt.figure(figsize=[10, 7])
plt.scatter(xx[:, 0], xx[:, 1], rasterized=True)
plt.xlabel("dim 1")
plt.ylabel("dim 2")

#### Note: Rasterizing Parts of the Image

Note that we have used the parameter `rasterized=True` when plotting the points. This parameter indicates that the corresponding part of the image should be rasterized (instead of being kept in vector form). This is very useful when plotting a huge number of points: if we saved all of them in vector format, it would be very expensive to display the image.

We could, of course, simply save the entire figure in a raster format (such as jpeg or png), but that would rasterize all parts of the image, including axes and such. On the whole, it is better to avoid rasterizing everything, especially if the figure is to be published.

When part of the figure is too complex to be presented in vector format, rasterizing it and keeping the rest of the figure in vector format is a good compromise.
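
The effect can be checked with a small experiment: the sketch below renders the same scatter plot to SVG twice, once fully in vector form and once with the points rasterized, and compares the sizes. The helper `svg_scatter_bytes` and the random data are assumptions made for illustration.

```python
import io
import numpy as np
from matplotlib.figure import Figure

def svg_scatter_bytes(points, rasterized):
    """Render a scatter plot to SVG in memory and return the raw bytes (illustrative)."""
    fig = Figure(figsize=(4, 3))  # standalone figure, independent of pyplot state
    ax = fig.add_subplot()
    ax.scatter(points[:, 0], points[:, 1], s=4, rasterized=rasterized)
    buffer = io.BytesIO()
    fig.savefig(buffer, format="svg")
    return buffer.getvalue()

rng = np.random.default_rng(0)
points = rng.normal(size=(20000, 2))  # hypothetical stand-in for the projected data

vector_svg = svg_scatter_bytes(points, rasterized=False)
mixed_svg = svg_scatter_bytes(points, rasterized=True)

# With rasterized=True, the points are embedded as a single bitmap,
# while axes, ticks and labels remain vector elements.
print(len(vector_svg), len(mixed_svg))
```

The exact sizes depend on the matplotlib version and the data, but with this many points the mixed file should be considerably smaller than the pure vector one.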



### Dimensionality Reduction using UMAP and Visualization

Dimensionality reduction using UMAP will be much more time-consuming than it was using PCA. On the other hand, we can expect the results to be a lot better, because UMAP can take advantage of non-linear as well as linear relationships in the data.

To apply UMAP instead of PCA, we essentially only need to replace "PCA" with "UMAP", because both methods share the unified interface used in the `scikit-learn` package. If we want to get a bit more information about what UMAP is doing, we can add the argument `verbose=True`.
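
The practical benefit of the shared interface is that code can be written once and reused with any reducer. The sketch below assumes a hypothetical helper `reduce_to_2d` and small random data; it uses two `scikit-learn` reducers so that it runs quickly.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

def reduce_to_2d(reducer, data):
    """Flatten `data` to one row per sample and apply any scikit-learn-style reducer."""
    flat = data.reshape(data.shape[0], -1)
    return reducer.fit_transform(flat)

rng = np.random.default_rng(0)
toy_images = rng.random((50, 8, 8))  # hypothetical stand-in for the image array

# Any object exposing fit_transform can be passed in; umap.UMAP(n_components=2)
# could be used the same way, assuming umap-learn is installed.
points_a = reduce_to_2d(PCA(n_components=2), toy_images)
points_b = reduce_to_2d(TruncatedSVD(n_components=2), toy_images)
print(points_a.shape, points_b.shape)  # (50, 2) (50, 2)
```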



In [None]:
um = umap.UMAP(verbose=True)
points_umap = um.fit_transform(X.reshape((X.shape[0], -1)))

In [None]:
perm_ind = np.random.permutation(points_umap.shape[0])
xx = points_umap[perm_ind]
yy = Y[perm_ind]
xt = X[perm_ind]

In [None]:
plt.figure(figsize=[10, 7])
# plt.get_cmap replaces plt.cm.get_cmap, which was removed in recent matplotlib versions.
cmap = plt.get_cmap('jet', len(class_names))
plt.scatter(xx[:, 0], xx[:, 1], c=yy,
            cmap=cmap,
            rasterized=True)
cbar = plt.colorbar()
cbar.set_ticks(range(len(class_names)))
cbar.set_ticklabels(class_names)
plt.xlabel("dim 1")
plt.ylabel("dim 2")

From this image, we can already learn a lot more about the structure of the dataset. We can see that the samples are divided into four major groups. One of them contains trousers, another one contains handbags, the third one contains a mixture of different kinds of footwear and the last one a mixture of t-shirts, dresses, coats and such.

We can also see that while t-shirts and coats are quite intermixed, shoes have rather more structure within their cluster.

### More Advanced Visualization

Our UMAP-based visualization shows us that, for some reason, there is a contiguous path from the handbag cluster into the t-shirt cluster. It would be interesting to find out what kinds of samples we would find where the two clusters connect. To find out, let's plot a subset of the data, but instead of plotting just the points, let's visualize the actual images at the corresponding positions. That will provide us with a fuller idea of what the structure of the data is in the area of interest.

We will use an auxiliary function with an interface similar to `scatter`, except that it plots images instead of points.
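
To give a rough idea of how such a function can work, here is a minimal sketch built on matplotlib's `OffsetImage` and `AnnotationBbox`. The function `image_scatter` is a hypothetical illustration, not the implementation of `imscatter` from `class_utils`, and it omits the coloured frames.

```python
import numpy as np
from matplotlib.figure import Figure
from matplotlib.offsetbox import AnnotationBbox, OffsetImage

def image_scatter(ax, x, y, images, zoom=1.0, cmap="Greys"):
    """Draw each image centered at its (x, y) position; returns the created artists."""
    artists = []
    for xi, yi, img in zip(x, y, images):
        thumbnail = OffsetImage(img, zoom=zoom, cmap=cmap)
        box = AnnotationBbox(thumbnail, (xi, yi), frameon=False)
        artists.append(ax.add_artist(box))
    # add_artist does not update the data limits, so do it explicitly.
    ax.update_datalim(np.column_stack([x, y]))
    ax.autoscale()
    return artists

rng = np.random.default_rng(0)
fig = Figure(figsize=(6, 4))
ax = fig.add_subplot()
# Hypothetical toy data: six random 28 x 28 "images" at random positions.
boxes = image_scatter(ax, rng.random(6), rng.random(6), rng.random((6, 28, 28)))
```

In a notebook, `plt.gca()` could be passed as `ax` so that the result is displayed inline.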



In [None]:
num2show = 800

plt.figure(figsize=[15, 10])
imscatter(xx[:num2show, 0], xx[:num2show, 1],
          xt[:num2show], cmap='Greys', zoom=1.2,
          frame_c=yy[:num2show], frame_cmap=cmap,
          frame_linewidth=2)
plt.xlabel("dim 1")
plt.ylabel("dim 2")

The figure should show how the images of handbags gradually change shape, so that at this low resolution and in grayscale some of them could plausibly be mistaken for t-shirts.

