
Why doesn't UMAP map similar data to same region in learned latent space? #968

Closed
cdtennant opened this issue Feb 13, 2023 · 3 comments

@cdtennant

My question concerns the ability of UMAP to transform new data in a way that intuitively makes sense. There is a nice example of this in the documentation: https://umap-learn.readthedocs.io/en/latest/transform.html. Here MNIST data is used to train a model, and new (withheld) MNIST data is passed to the model, with the result that the new data is mapped into the expected regions of learned space (i.e. same as training data). However, when I try this on a synthetic dataset I'm unable to reproduce this behavior.

I first created a set of 500 training examples, X1, each with 32 features. These are generated randomly using np.random.rand and have values between 0 and 1. I then created a small test set of 10 examples, X2, where every example is identical: the first training sample multiplied by 10. The test set is therefore very different from the training data. See the plot below comparing a single example from X1 and X2.

import numpy as np
from sklearn.decomposition import PCA
import umap
from matplotlib import pyplot as plt

# 500 training examples with 32 features each, uniform in [0, 1)
X1 = np.random.rand(500, 32)

# 10 identical test examples: the first training sample scaled by 10
X2 = np.copy(X1[0:10])
X2[0:10] = X1[0] * 10

plt.plot(X1[0], 'go-', label='train')
plt.plot(X2[0], 'bo-', label='test')
plt.legend();

[fig1: a single training example from X1 vs. the corresponding scaled test example from X2]

Next I compare the performance of PCA and UMAP trained on X1 and used to transform X2. The results are shown below.

pca = PCA(n_components=2, random_state=11)
X1_pca = pca.fit_transform(X1)
X2_pca = pca.transform(X2)

# use a separate name for the fitted model so it doesn't shadow the umap module
reducer = umap.UMAP(n_components=2, random_state=11)
X1_umap = reducer.fit_transform(X1)
X2_umap = reducer.transform(X2)

plt.subplot(1, 2, 1)
plt.scatter(X1_pca[:, 0], X1_pca[:, 1], s=40, c='blue',  alpha=0.25, label='train')
plt.scatter(X2_pca[:, 0], X2_pca[:, 1], s=40, c='green', alpha=0.25, label='test')
plt.legend()
plt.title('PCA')

plt.subplot(1, 2, 2)
plt.scatter(X1_umap[:, 0], X1_umap[:, 1], s=40, c='blue',  alpha=0.25, label='train')
plt.scatter(X2_umap[:, 0], X2_umap[:, 1], s=40, c='green', alpha=0.25, label='test')
plt.legend()
plt.title('UMAP')

plt.show()

[fig2: train (blue) and test (green) points embedded by PCA (left) and UMAP (right)]

PCA generates the result one would naively expect: the training data is clustered together, the test data is clearly separated from it, and all of the test points are mapped to the same point in 2D space. UMAP, on the other hand, does none of this. The test data is not clearly differentiated from the training data, and even though the test examples are identical, they are not even mapped to the same region of the 2D space.

Am I doing something wrong, or is my understanding of how UMAP should work incorrect? Any help is greatly appreciated.

@dewball345

I did something similar, except instead of scale I looked at an offset:
#969

@jlmelville
Collaborator

When transforming new data, UMAP only "knows" about the training data distribution. The green test points are effectively invisible to each other: when looking for neighbors to build the k-nearest-neighbors graph, each test point only finds neighbors from the training data, never from the other test points.

UMAP also doesn't retain the original distance information from the training data. It has no way to know that the nearest neighbor distances between the test data and the training data are larger than those within the training data, and there is no mechanism in the rest of the algorithm for using that information. Because PCA works directly on the variance of the dataset, it is much more focused on retaining the ambient distances and densities.
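
You can check this directly. Here is a minimal sketch (re-creating the X1/X2 data from the original post and using scikit-learn's NearestNeighbors, which is not part of umap-learn) showing that every test point's nearest neighbors come from the training set and sit far further away than typical training-to-training neighbor distances:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X1 = np.random.rand(500, 32)        # training data, as in the original post
X2 = np.tile(X1[0] * 10, (10, 1))   # ten identical, scaled copies of X1[0]

# UMAP's default n_neighbors is 15, so look at the 15 nearest training points
nn = NearestNeighbors(n_neighbors=15).fit(X1)
train_d, _ = nn.kneighbors(X1)      # neighbor distances within the training set
test_d, _ = nn.kneighbors(X2)       # distances from each test point to training points

print("median train-to-train neighbor distance:", np.median(train_d[:, 1:]))
print("median test-to-train neighbor distance: ", np.median(test_d))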

Finally, the test points don't overlap in the UMAP plot either because the optimization isn't fully converged, or because the stochastic nature of the optimization and negative sampling means the originally identical points experience slightly different attractive and repulsive interactions during the optimization. I can't say for sure which of those effects dominates.
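
As a small check (assuming the X2_umap array produced by the code in the original post), the spread of the embedded test points makes this visible without a plot:

import numpy as np

# If the ten identical inputs had landed on the same spot, the spread would be ~0.
spread = np.ptp(X2_umap, axis=0)  # peak-to-peak range per embedding dimension
print("spread of the embedded test points:", spread)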

If the test set is drawn from a very different distribution than the training set, then these slightly surprising things will happen. The "Spheres" dataset used in the UMATO paper is a good example if you want to see a similar effect.

Hope this helps.

@cdtennant
Author

@jlmelville, thanks so much for your response. It was very helpful! I was able to verify that by adding a small amount of data similar to the test data to the training set, I get a result much more in line with what I was expecting. Thanks again.

[Picture1: UMAP embedding after augmenting the training set with a few test-like examples]
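
For reference, the augmentation amounts to something like the sketch below (the number of added examples and the noise scale are illustrative choices, not the exact values used for the picture above):

import numpy as np
import umap

X1 = np.random.rand(500, 32)
X2 = np.tile(X1[0] * 10, (10, 1))

# add a few noisy copies of the scaled example to the training set, then refit
extra = X1[0] * 10 + 0.1 * np.random.rand(5, 32)
X1_aug = np.vstack([X1, extra])

reducer = umap.UMAP(n_components=2, random_state=11).fit(X1_aug)
X2_umap = reducer.transform(X2)  # the test points should now land near the augmented examples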
