
Why doesn't UMAP map similar data to same region in learned latent space? #968

Closed
cdtennant opened this issue Feb 13, 2023 · 3 comments

@cdtennant

My question concerns the ability of UMAP to transform new data in a way that intuitively makes sense. There is a nice example of this in the documentation: https://umap-learn.readthedocs.io/en/latest/transform.html. Here MNIST data is used to train a model, and new (withheld) MNIST data is passed to the model, with the result that the new data is mapped into the expected regions of learned space (i.e. same as training data). However, when I try this on a synthetic dataset I'm unable to reproduce this behavior.

I first created a set of 500 training examples, X1, each with 32 features. These are generated randomly using np.random.rand and have values between 0 and 1. I then created a small test set of 10 examples, X2, where every example is identical: the first training sample multiplied by 10. The test set is therefore very different from the training data. See the plot below comparing a single example from X1 and X2.

import numpy as np
from sklearn.decomposition import PCA
import umap
from matplotlib import pyplot as plt

# 500 training examples with 32 features each, uniform in [0, 1)
X1 = np.random.rand(500, 32)

# 10 identical test examples: the first training sample scaled by 10
X2 = np.copy(X1[0:10])
X2[0:10] = X1[0] * 10

plt.plot(X1[0], 'go-', label='train')
plt.plot(X2[0], 'bo-', label='test')
plt.legend();

[fig1: a single training example from X1 vs. the corresponding scaled test example from X2]

Next I compare the performance of PCA and UMAP trained on X1 and used to transform X2. The results are shown below.

pca = PCA(n_components=2, random_state=11)
X1_pca = pca.fit_transform(X1)
X2_pca = pca.transform(X2)

# use a separate name for the fitted model so it doesn't shadow the umap module
reducer = umap.UMAP(n_components=2, random_state=11)
X1_umap = reducer.fit_transform(X1)
X2_umap = reducer.transform(X2)

plt.subplot(1, 2, 1)
plt.scatter(X1_pca[:, 0], X1_pca[:, 1], s=40, c='blue',  alpha=0.25, label='train')
plt.scatter(X2_pca[:, 0], X2_pca[:, 1], s=40, c='green', alpha=0.25, label='test')
plt.legend()
plt.title('PCA')

plt.subplot(1, 2, 2)
plt.scatter(X1_umap[:, 0], X1_umap[:, 1], s=40, c='blue',  alpha=0.25, label='train')
plt.scatter(X2_umap[:, 0], X2_umap[:, 1], s=40, c='green', alpha=0.25, label='test')
plt.legend()
plt.title('UMAP')

plt.show()

[fig2: train (blue) and test (green) points embedded by PCA (left) and UMAP (right)]

PCA generates the result one would naively expect: the training data is clustered together, the test data is clearly separated from it, and all of the test points are mapped to the same point in 2D space. UMAP, on the other hand, does none of this. The test data is not clearly differentiated from the training data, and even though the test examples are identical, they are not even mapped to the same region of the 2D space.

Am I doing something wrong, or is my understanding of how UMAP should work incorrect? Any help is greatly appreciated.

@dewball345

I did something similar, except instead of scale I looked at an offset:
#969

@jlmelville
Collaborator

When transforming new data, UMAP only "knows" about the training data distribution. The green test points are effectively invisible to each other: when looking for neighbors to build the k-nearest-neighbors graph, each test point only finds neighbors from the training data, never from the other test points.

UMAP also doesn't retain the original distance information from the training data. It has no way to know that the nearest neighbor distances between the test data and the training data are larger than those within the training data, and there is no mechanism in the rest of the algorithm for using that information. Because PCA works directly on the variance of the dataset, it is much more focused on retaining the ambient distances and densities.
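
You can check this directly. Here is a minimal sketch (re-creating the X1/X2 data from the original post and using scikit-learn's NearestNeighbors, which is not part of umap-learn) showing that every test point's nearest neighbors come from the training set and sit far further away than typical training-to-training neighbor distances:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X1 = np.random.rand(500, 32)        # training data, as in the original post
X2 = np.tile(X1[0] * 10, (10, 1))   # ten identical, scaled copies of X1[0]

# UMAP's default n_neighbors is 15, so look at the 15 nearest training points
nn = NearestNeighbors(n_neighbors=15).fit(X1)
train_d, _ = nn.kneighbors(X1)      # neighbor distances within the training set
test_d, _ = nn.kneighbors(X2)       # distances from each test point to training points

print("median train-to-train neighbor distance:", np.median(train_d[:, 1:]))
print("median test-to-train neighbor distance: ", np.median(test_d))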

Finally, the test points don't overlap in the UMAP plot either because the optimization isn't fully converged, or because the stochastic nature of the optimization and negative sampling means the originally identical points experience slightly different attractive and repulsive interactions during the optimization. I can't say for sure which of those effects dominates.
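
As a small check (assuming the X2_umap array produced by the code in the original post), the spread of the embedded test points makes this visible without a plot:

import numpy as np

# If the ten identical inputs had landed on the same spot, the spread would be ~0.
spread = np.ptp(X2_umap, axis=0)  # peak-to-peak range per embedding dimension
print("spread of the embedded test points:", spread)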

If the test set is drawn from a very different distribution than the training set, then these slightly surprising things will happen. The "Spheres" dataset used in the UMATO paper is a good example if you want to see a similar effect.

Hope this helps.

@cdtennant
Author

@jlmelville, thanks so much for your response. It was very helpful! I was able to verify that by adding a small amount of data similar to the test data to the training set, I get a result much more in line with what I was expecting. Thanks again.

[Picture1: UMAP embedding after augmenting the training set with a few test-like examples]
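
For reference, the augmentation amounts to something like the sketch below (the number of added examples and the noise scale are illustrative choices, not the exact values used for the picture above):

import numpy as np
import umap

X1 = np.random.rand(500, 32)
X2 = np.tile(X1[0] * 10, (10, 1))

# add a few noisy copies of the scaled example to the training set, then refit
extra = X1[0] * 10 + 0.1 * np.random.rand(5, 32)
X1_aug = np.vstack([X1, extra])

reducer = umap.UMAP(n_components=2, random_state=11).fit(X1_aug)
X2_umap = reducer.transform(X2)  # the test points should now land near the augmented examples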
