Why doesn't UMAP map similar data to same region in learned latent space? #968
Did something similar, except instead of scale I looked at offset:
When transforming new data, UMAP only "knows" about the training data distribution. The green test points are effectively invisible to each other: when looking for neighbors to build the k-nearest neighbors graph, each test point only has neighbors from the training data, never from each other.

UMAP also doesn't retain the original distance information from the training data. It has no way to know that the nearest neighbor distances between the training data and the test data are larger than those within the training data, and there is no mechanism in the rest of the algorithm for using that information. PCA, by contrast, works directly on the variance of the dataset, so it is much more focused on retaining the ambient distances and densities.

Finally, the reason those test points aren't overlapping in the UMAP plot is either that the optimization isn't fully converged, or that the stochastic nature of the optimization and negative sampling has caused those originally identical points to experience slightly different attractive and repulsive interactions during optimization. I can't say for sure which of these has the bigger effect, though.

If the test data is drawn from a very different distribution than the training set, then these slightly surprising things will happen. The "Spheres" dataset used in the UMATO paper is a good example if you want to see a similar effect. Hope this helps.
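The neighbor argument above can be checked directly with numpy: every test point's nearest "neighbor" is a training point, and that distance dwarfs the typical nearest-neighbor distance within the training set. A minimal sketch, assuming the 500x32 uniform training set and 10x-scaled identical test points described in the question (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.random((500, 32))              # uniform training data in [0, 1)
X_test = np.tile(X_train[0] * 10, (10, 1))   # 10 identical, far-away test points

# Typical nearest-neighbor distance *within* the training set
d_train = np.linalg.norm(X_train[:, None] - X_train[None], axis=-1)
np.fill_diagonal(d_train, np.inf)            # exclude self-distances
nn_within_train = d_train.min(axis=1).mean()

# Each test point's nearest neighbor can only be a training point,
# and even that closest training point is much farther away
d_test_to_train = np.linalg.norm(X_test[:, None] - X_train[None], axis=-1)
nn_test = d_test_to_train.min(axis=1)

print(nn_test.min() > nn_within_train)  # True
```

Since UMAP's transform builds its graph only from these test-to-training distances and then normalizes them away, the "far outside the training distribution" information never reaches the layout optimization.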
@jlmelville, thanks so much for your response. It was very helpful! I easily verified it: by adding a small amount of data similar to the test data to the training set, I got a result more in line with what I was expecting. Thanks again.
My question concerns the ability of UMAP to transform new data in a way that intuitively makes sense. There is a nice example of this in the documentation: https://umap-learn.readthedocs.io/en/latest/transform.html. Here MNIST data is used to train a model, and new (withheld) MNIST data is passed to the model, with the result that the new data is mapped into the expected regions of learned space (i.e. same as training data). However, when I try this on a synthetic dataset I'm unable to reproduce this behavior.
I first created a set of 500 training examples, X1, each with 32 features. These are generated randomly using np.random.rand and have values between 0 and 1. I then created a small test set of 10 examples, X2, where every example is identical to one of the training samples, except multiplied by 10. The test set is therefore very different from the training data. See the plot comparing a single example from X1 and X2.
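The setup above can be sketched as follows (the shapes and the scale factor are from the description; the variable names and seed are mine):

```python
import numpy as np

rng = np.random.default_rng(42)

# 500 training examples, 32 features each, uniform in [0, 1)
X1 = rng.random((500, 32))

# 10 test examples: identical copies of one training sample, scaled by 10,
# so every test point lies far outside the training distribution
X2 = np.tile(X1[0] * 10, (10, 1))

print(X1.shape, X2.shape)  # (500, 32) (10, 32)
```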
Next I compare the performance of PCA and UMAP trained on X1 and used to transform X2. The results are shown below.
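A minimal version of the comparison, with the PCA half shown via scikit-learn (the umap-learn call is analogous and left as a comment, since the point at issue is its behavior):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X1 = rng.random((500, 32))            # training data
X2 = np.tile(X1[0] * 10, (10, 1))     # 10 identical, scaled test points

pca = PCA(n_components=2).fit(X1)
Z2_pca = pca.transform(X2)

# PCA is a fixed linear map, so identical inputs land on the same 2D point
print(np.allclose(Z2_pca, Z2_pca[0]))  # True

# The UMAP equivalent (not run here; default parameters assumed):
# import umap
# Z2_umap = umap.UMAP(n_components=2).fit(X1).transform(X2)
# Identical inputs are NOT guaranteed to coincide after transform,
# because the embedding is produced by a stochastic optimization.
```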
PCA generates a result that one would naively expect: the training data is clustered together, the test data is clearly separated, and all the points within the test data are mapped to the same point in 2D space. UMAP, on the other hand, does none of this. The test data is not clearly differentiated from the training data, and even though the test examples are identical to one another, they are not even mapped to the same region of the 2D space.
Am I doing something wrong, or is my understanding of how UMAP should work incorrect? Any help is greatly appreciated.