Best Practices for AlignedUMAP on large datasets #658
Hi Lukas,

Leland has made some really solid progress recently with the pynndescent library that we use in UMAP for computing approximate nearest neighbours. As such, I don't think you will gain much by farming the pre-computation of nearest neighbours out to FAISS. You also shouldn't need to perform a PCA to reduce the dimension before running UMAP.

I would ask what your intention behind using AlignedUMAP is. Are you trying to examine how the data changes over time? Are you trying to find a more memory-efficient method of training a full embedding of your data? Or are you looking for a fast way to transform new data into the same space that you learned on your previous data (or a sub-sample of your previous data)? A good example of the latter: you don't think you need all 5 million data points to form a representative sample, so you learn an embedding on a subset of that data for memory (or computational time) reasons and then embed the rest of your data into that space.

If you are just looking for a nice coherent (and efficient) way to transform new data points into your learned UMAP space, then I highly recommend the new parametric UMAP functionality. It builds a neural-net function that maps into the low-dimensional UMAP space. That has the advantages of both faster transforms and an easily updatable model.

I'll apologize in advance if I misunderstood what you are trying to accomplish.

Cheers,
Hi @jc-healy!

Thanks for your fast reply, and also for your talks at PyData - I really enjoyed them.

My current objective is to find the best possible low-dimensional representation of a large quantity of high-dimensional embeddings. Even though I don't expect any temporal changes in my data, I still want to transform new batches of high-dimensional samples with the trained UMAP model from time to time. My intention behind using

```python
# ...
reduced_embedding = aligned_mapper.mappers_[-1].fit_transform(digits.data)
```

Thanks for pointing me towards ParametricUMAP. I will definitely have a look at that as well.

Best,
In case anyone wonders how to conduct UMAP on larger datasets, I would highly suggest following the approach that John suggested: take a representative sample and train a ParametricUMAP (PUMAP) on it. With the help of a GPU you can extensively tune the underlying neural network and reduce the dimensions in a reasonable amount of time. |
Spoiler Alert: This is a question about best practices, so it's not directly related to the library itself. Sorry if this is the wrong place to ask; feel free to close/delete.
Hi @lmcinnes! Hi @ALL!
First of all, I want to thank you for your outstanding work on this package and everything you have contributed! I am a new user, currently trying to perform dimensionality reduction with UMAP on a larger dataset (5M+ samples) of 512-dimensional textual embeddings. Inspired by this, my current approach is to perform PCA to reduce the data to a lower dimensionality and then - due to memory constraints - compute the respective UMAP transformations with the `update` functionality of the `AlignedUMAP` class on batches.

Since the `AlignedUMAP` approach is quite time-consuming, I first wanted to ask whether I understood correctly that you can take the last UMAP mapper and use it to transform the original data into the lower-dimensional space, or whether I have to combine the UMAP objects somehow (as described here). For reference, I included a toy example of my current approach with the respective plots. Comparing the aligned transformation on the batches against the "full" one above, the distinction looks quite convincing to me at first glance.

Lastly, I would be very interested in how you would generally perform UMAP on larger datasets of reasonably coherent data (textual embeddings). Would it make sense to precompute the distance metric (e.g. with FAISS) in order to save UMAP computation time? What are your experiences with that?
Thanks in advance,
Lukas
Mini Example