# Exercise: MNIST Visualizer (Dimensionality Reduction, PCA, t-SNE, UMAP)


`#dimensionality-reduction` `#pca` `#t-sne` `#umap` `#matplotlib` `#visualization`


> Objectives:
>
> - Explore how data embeddings in a high-dimensional space can be visualized in 2-dimensional mappings using the following methods:
>   - PCA (Principal Component Analysis)
>   - t-SNE (t-Distributed Stochastic Neighbor Embedding)
>   - UMAP (Uniform Manifold Approximation and Projection)


## Standard Deep Atlas Exercise Set Up


- [ ] Ensure you are using the coursework Pipenv environment and kernel ([instructions](../SETUP.md))
- [ ] Apply the standard Deep Atlas environment setup process by running this cell:


In [None]:
import sys, os
sys.path.insert(0, os.path.join('..', 'includes'))

import deep_atlas
from deep_atlas import FILL_THIS_IN
deep_atlas.initialize_environment()
if deep_atlas.environment == 'COLAB':
    %pip install -q python-dotenv==1.0.0

### 🚦 Checkpoint: Start

- [ ] Run this cell to record your start time:


In [None]:
deep_atlas.log_start_time()

---


## Context


In previous exercises, we converted features into high-dimensional embeddings, useful for techniques like vector search and generative AI, but difficult to visualize.

As an ML engineer, you'll need to reduce high-dimensional data to 2D to inspect clusters and patterns. This exercise explores dimensionality reduction techniques using scikit-learn and umap-learn, without delving into the underlying math.


## Dependencies


In [None]:
if deep_atlas.environment == 'VIRTUAL':
    !pipenv install matplotlib==3.8.2 scikit-learn umap-learn
if deep_atlas.environment == 'COLAB':
    %pip install matplotlib==3.8.2 scikit-learn umap-learn


## Imports


In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

## Data Loading


Let's download and process the MNIST dataset, which contains 28×28px grayscale scans of handwritten digits (0–9). Each image is represented by 784 values ranging from 0 (black) to 255 (white).
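
To make the 784-value representation concrete, here is a minimal sketch using a randomly generated stand-in for one image row (not real MNIST data). It assumes the pixels are stored row by row, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for one MNIST row: 784 integer pixel values in [0, 255].
rng = np.random.default_rng(0)
flat_image = rng.integers(0, 256, size=784)

# Rescale to floats in [0, 1], as the loading cell in this exercise does.
scaled = flat_image / 255.0

# Recover the 28x28 pixel grid (row-major order is assumed here).
image = scaled.reshape(28, 28)

print(image.shape)
print(scaled.min(), scaled.max())
```

Passing `image` to `plt.imshow(image, cmap="gray")` would render it; with real MNIST rows this shows the handwritten digit.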

Like TensorFlow, PyTorch, and fast.ai, scikit-learn provides an API to download popular datasets. Keep in mind that for real-world applications, you'll need to source, clean, and engineer your own datasets.

For this exercise, we can download the dataset using the `fetch_openml` function (imported above):


In [None]:
# Load MNIST dataset
mnist = fetch_openml("mnist_784", version=1)

# Convert each pixel value to a float between 0 and 1
X, y = mnist.data / 255.0, mnist.target

# MNIST is a large dataset, containing 70,000 images.
# Reduce the dataset size for quicker execution:
X, y = X[:10000], y[:10000]

## Plotting values

To see the output of each dimensionality reduction, define a function that plots data in two dimensions.

Tip: Note how the `plot_embedding` function is invoked in the cells below, then return to this cell to understand the plots being rendered.


In [None]:
def plot_embedding(data, y, title):
    # `data` is a 2D array of shape (n_samples, 2)
    # `y` is a 1D array of shape (n_samples,), representing the labels

    plt.figure(figsize=(8, 6))
    scatter = plt.scatter(
        # Value of data in the first dimension
        data[:, 0],
        # Value of data in the second dimension
        data[:, 1],
        # Color of each point, representing the label
        c=y.astype(int),
        # Use a categorical color map with 10 distinct colors
        cmap="tab10",
        # size of each point
        s=1,
    )
    # Add a color bar to the right of the plot
    plt.colorbar(scatter)

    plt.title(title)
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")

    plt.show()

## PCA: Principal Component Analysis


PCA's transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is linearly uncorrelated (i.e. orthogonal) to the preceding components.
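
The properties described above (variance-ordered, mutually orthogonal components) can be checked with a small NumPy sketch. This is not how the exercise computes PCA; it is an illustrative SVD-based version on synthetic data, and all names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 500 samples, 5 features with deliberately unequal spread.
X_demo = rng.normal(size=(500, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# PCA operates on mean-centered data.
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured along each principal direction.
explained_variance = S**2 / (len(X_demo) - 1)

# Project onto the first two directions to get a 2D embedding.
X_2d = X_centered @ Vt[:2].T

# The directions are orthonormal, and variance never increases down the list.
print(np.allclose(Vt @ Vt.T, np.eye(5)))
print(np.all(np.diff(explained_variance) <= 0))
```

The two projected columns of `X_2d` are also uncorrelated with each other, which is what "linearly uncorrelated" means for the component scores.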

PCA is particularly effective in scenarios where:

1. There are linear correlations between variables in your data.
2. The dataset has high dimensionality but you suspect that many of the features are redundant or irrelevant.
3. You want to perform data compression while maintaining the structure and complexity of the data.
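
As a rough illustration of the redundancy and compression scenarios, the sketch below builds synthetic data in which 20 observed features are driven by only 3 underlying factors, then compresses and restores it with scikit-learn's `PCA`. The data and variable names are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic, redundant data: 20 observed features driven by 3 latent factors
# plus a small amount of noise.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(300, 20))

pca = PCA(n_components=3)
X_compressed = pca.fit_transform(X_demo)  # shape (300, 3)

# Share of the total variance retained by the 3 kept components.
retained = pca.explained_variance_ratio_.sum()

# Map back to 20 dimensions and measure what was lost.
X_restored = pca.inverse_transform(X_compressed)
mse = np.mean((X_demo - X_restored) ** 2)

print(f"variance retained: {retained:.4f}")
print(f"reconstruction MSE: {mse:.6f}")
```

On real data the retained variance is usually far lower for so few components; inspecting `explained_variance_ratio_` is one way to judge how much a 2D PCA plot leaves out.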

Limitations

1. **Larger variance != more interesting**: PCA assumes that the component (or direction in feature space) with the largest variance is the most "interesting". This may not always be the case, and sometimes the components with smaller variance may also contain important information.
1. **Orthogonality**: PCA constrains its components to be mutually orthogonal. If the underlying factors of variation in your data are not orthogonal, the principal components will not line up with them.
1. **Scaling**: PCA is sensitive to the scaling of your variables. If some variables have much larger values than others, they may dominate the first principal component when they should not. It's often a good idea to standardize your features (for example, to zero mean and unit variance) before applying PCA.
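
The scaling limitation can be seen in a small NumPy sketch. The two features below are synthetic and independent; the second is simply recorded on a scale 1000 times larger. The helper `first_component` is a hypothetical name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent features; the second has values roughly 1000x larger.
small = rng.normal(size=1000)
large = rng.normal(size=1000) * 1000.0
X_demo = np.column_stack([small, large])


def first_component(X):
    """Return the unit vector of the first principal direction via SVD."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return Vt[0]


# Without scaling, the first component points almost entirely along `large`.
raw_direction = first_component(X_demo)

# Standardize each feature to zero mean and unit variance, then repeat.
X_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
scaled_direction = first_component(X_scaled)

print("raw:   ", np.round(np.abs(raw_direction), 3))
print("scaled:", np.round(np.abs(scaled_direction), 3))
```

After standardization, both features carry equal weight in the first component. MNIST pixels already share one scale, so dividing by 255 is enough in this exercise.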


In [None]:
# Select and project principal components
X_pca = PCA(n_components=2).fit_transform(X)

# Plot the PCA projection
plot_embedding(X_pca, y, title="PCA")

## t-SNE: t-Distributed Stochastic Neighbor Embedding


t-SNE models each high-dimensional input vector by a two- or three-dimensional point in such a way that similar instances land close by each other and dissimilar instances are moved apart.

It is particularly effective in scenarios where:

1. You want to visualize clusters in your data, as t-SNE can separate clusters quite well in the low-dimensional space.
1. The structure of the data at various scales is of interest, as t-SNE can capture structure at different scales.

Limitations

1. **Hyperparameters**: t-SNE has a few hyperparameters (like perplexity and learning rate) that can significantly affect the resulting visualization. It might require some trial and error to find the best settings.
1. **Global vs. Local Structure**: t-SNE is particularly good at preserving local structure in the data (meaning it keeps similar instances close together), but it doesn't preserve the global structure as well. This means that the distance between widely separated clusters in the t-SNE plot may not mean anything.
1. **Randomness**: t-SNE uses a random initialization as part of its algorithm, which means that you can get different results every time you run it. This can make it hard to interpret the results. In the code below, this is countered by setting a seed (`random_state`).
1. **No Mapping for New Data**: Unlike PCA, t-SNE does not learn an explicit function that maps new, unseen data into the same space. scikit-learn's `TSNE` offers `fit_transform` but no separate `transform` method.
1. **Computational Complexity**: t-SNE has a high computational complexity, making it less suitable for very large datasets.
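
To get a feel for the hyperparameter sensitivity and the lack of a mapping for new data, here is a small sketch. It uses scikit-learn's bundled 8×8 digits dataset (`load_digits`) purely as a fast stand-in for MNIST, and the two perplexity values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Small bundled dataset (8x8 digits), used only to keep this sketch fast.
digits = load_digits()
X_small, y_small = digits.data[:300], digits.target[:300]

# Perplexity loosely controls how many neighbors each point attends to;
# it must be smaller than the number of samples.
embeddings = {}
for perplexity in (5, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = tsne.fit_transform(X_small)
    print(perplexity, embeddings[perplexity].shape)

# scikit-learn's TSNE has fit_transform but no transform method,
# so new points cannot be placed into an existing embedding.
print(hasattr(TSNE, "transform"))
```

Each entry can be passed to the plotting helper, for example `plot_embedding(embeddings[5], y_small, title="t-SNE, perplexity=5")`, to compare how the clusters change.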


In [None]:
# Model 784-dimensional data as 2-dimensional data,
# placing similar points close together
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)

# Plot the t-SNE projection
plot_embedding(X_tsne, y, title="t-SNE")

## UMAP: Uniform Manifold Approximation and Projection


UMAP is based on manifold learning techniques. Its primary advantage over t-SNE is that it tends to preserve more of the global structure.

> In the context of UMAP and other dimensionality reduction techniques, a manifold refers to a shape or structure in high-dimensional space that can be approximated as a lower-dimensional space locally.
>
> For example, consider a piece of paper: it's a 2D object living in our 3D world. If you crumple that piece of paper into a ball, it's still a 2D surface, but now it's embedded in 3D space in a complex way. That crumpled piece of paper is an example of a 2D manifold in 3D space.
>
> In the case of UMAP, it assumes that the high-dimensional data lies on a manifold, and it tries to learn the structure of this manifold. It then uses this learned structure to project the data into a lower-dimensional space in a way that preserves as much of the original data structure as possible. This is why UMAP is particularly good at preserving both local and global structures in the data.
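
The paper analogy can be made concrete with a rolled-up sheet (often called a "Swiss roll"), a tidier cousin of the crumpled ball. The sketch below is an illustrative NumPy construction, not part of UMAP itself; the names `along` and `across` are assumptions for the sheet's two intrinsic coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

# Two intrinsic coordinates: position along the sheet and position across it.
along = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=n_points)
across = rng.uniform(0.0, 10.0, size=n_points)

# Roll the flat sheet up into three dimensions.
X_roll = np.column_stack([
    along * np.cos(along),
    across,
    along * np.sin(along),
])

print(X_roll.shape)

# The radius in the x-z plane recovers `along`, so each 3D point is fully
# described by just two numbers.
radius = np.hypot(X_roll[:, 0], X_roll[:, 2])
print(np.allclose(radius, along))
```

Points that are far apart along the sheet can sit close together in 3D where the roll wraps around, which is why straight-line distances in the original space can be misleading for manifold-shaped data.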

UMAP is particularly effective in scenarios where:

1. You want to preserve the global structure of the data while reducing dimensions.
2. You are dealing with very large datasets. UMAP is faster than t-SNE, making it more suitable for larger datasets.
3. You want more consistent results. UMAP tends to produce more consistent results across runs than t-SNE, although it is also stochastic; the code below fixes a seed (`random_state`) for reproducibility.

Limitations

1. **Complexity**: UMAP is based on some complex mathematical concepts, which can make it harder to reason about than PCA or t-SNE.
2. **Hyperparameters**: Like t-SNE, UMAP also has a few key hyperparameters (like the number of neighbors and the minimum distance) that can significantly affect the resulting visualization. It might require some trial and error to find the best settings.
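3. **Approximate Mapping for New Data**: Unlike PCA, UMAP does not learn a simple closed-form function that maps new, unseen data into the same space. The umap-learn implementation does provide a `transform` method for embedding new points into a fitted space, but the result is an approximation that depends on the training data.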
4. **Assumptions**: UMAP makes some assumptions about the data, such as it being uniformly distributed on a Riemannian manifold. If these assumptions are not met, the results may not be meaningful.


In [None]:
# Model 784-dimensional data as 2-dimensional data
X_umap = umap.UMAP(random_state=42).fit_transform(X)

# Plot the UMAP projection
plot_embedding(X_umap, y, title="UMAP")

### 🚦 Checkpoint: Stop

- [ ] Uncomment this code
- [ ] Complete the feedback form
- [ ] Run the cell to log your responses and record your stop time:


In [None]:
# deep_atlas.log_feedback(
#     {
#         # How long were you actively focused on this section? (HH:MM)
#         "active_time": FILL_THIS_IN,
#         # Did you feel finished with this section (Yes/No):
#         "finished": FILL_THIS_IN,
#         # How much did you enjoy this section? (1–5)
#         "enjoyment": FILL_THIS_IN,
#         # How useful was this section? (1–5)
#         "usefulness": FILL_THIS_IN,
#         # Did you skip any steps?
#         "skipped_steps": [FILL_THIS_IN],
#         # Any obvious opportunities for improvement?
#         "suggestions": [FILL_THIS_IN],
#     }
# )
# deep_atlas.log_stop_time()

## You did it!


These techniques are essential for data scientists when working with high-dimensional data, as they allow for the visualization and understanding of complex patterns that would be impossible to understand in the high-dimensional space.

They will be useful as you explore new datasets for your projects and beyond.


Further exploration:

1. [PCA (Principal Component Analysis)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
2. [t-SNE (t-Distributed Stochastic Neighbor Embedding)](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)
3. [UMAP (Uniform Manifold Approximation and Projection)](https://umap-learn.readthedocs.io/en/latest/)
