Best Practices for AlignedUMAP on large datasets #658

Closed
lukaschoebel opened this issue Apr 28, 2021 · 3 comments
@lukaschoebel
Contributor

Spoiler alert: This is a question about best practices, so it is not directly related to the library itself. Sorry if this is the wrong place to ask; feel free to close/delete.

Hi @lmcinnes! Hi @ALL!

First of all, I want to thank you for your outstanding work on this package and everything you have contributed! I am a new user, currently trying to perform dimensionality reduction with UMAP on a large dataset (5M+ samples) of 512-dimensional textual embeddings. Inspired by this, my current approach is to use PCA to reduce the data to a lower dimensionality and then, due to memory constraints, compute the respective UMAP transformations in batches with the update functionality of the AlignedUMAP class.

Since the AlignedUMAP approach is quite time-consuming, I first wanted to ask whether I understood correctly that you can take the last UMAP mapper and use it to transform the original data to the lower-dimensional space, or whether I have to combine the UMAP objects somehow (as described here). For reference, I have included a toy example of my current approach with the respective plots. Comparing the aligned transformation on the batches against the "full" one above, the separation looks quite convincing to me at first glance.

Lastly, I would be very interested in how you would generally perform UMAP on larger datasets of reasonably coherent data (textual embeddings). Would it make sense to precompute the distances (e.g. with FAISS) in order to save UMAP computation time? What are your experiences with that?
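For concreteness, here is a minimal sketch of what I mean by precomputing distances (using sklearn's pairwise_distances as a stand-in for FAISS; a full pairwise matrix is of course only feasible for small n):

import sklearn.datasets
import sklearn.metrics
import umap

digits = sklearn.datasets.load_digits()
X = digits.data[:500]  # a full pairwise matrix is only feasible for small n

# Precompute all pairwise distances up front (FAISS would be the
# analogous step for approximate neighbours on large data).
dists = sklearn.metrics.pairwise_distances(X, metric="euclidean")

# UMAP accepts a square distance matrix when metric="precomputed".
embedding = umap.UMAP(metric="precomputed").fit_transform(dists)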

Thanks in advance,
Lukas

Mini Example

import numpy as np
import sklearn.datasets
import umap
import umap.aligned_umap

digits = sklearn.datasets.load_digits()
# Order the samples by target so that consecutive slices stay coherent.
ordered_digits = digits.data[np.argsort(digits.target)]

# Overlapping slices of up to 400 samples, each shifted by 150.
slices = [ordered_digits[150 * i:min(ordered_digits.shape[0], 150 * i + 400)] for i in range(10)]
# Sample i + 150 of one slice is sample i of the next slice.
relation_dict = {i + 150: i for i in range(400 - 150)}
relation_dicts = [relation_dict.copy() for i in range(len(slices) - 1)]

aligned_mapper = umap.AlignedUMAP().fit(slices[:2], relations=relation_dicts[:1])

for i, batch in enumerate(slices[2:]):
    # Batch i of slices[2:] is slice i + 2, so its relation to the
    # previous slice is relation_dicts[i + 1] (inverted for update()).
    aligned_mapper.update(batch, relations={v: k for k, v in relation_dicts[i + 1].items()})

full_embedding = umap.UMAP().fit_transform(digits.data)
# Taking only the last UMAP object to transform the data.
aligned_embedding = aligned_mapper.mappers_[-1].fit_transform(digits.data)

[plot: full_umap, the embedding from the single full UMAP fit]

[plot: aligned_umap, the embedding from the batched AlignedUMAP approach]

@jc-healy
Contributor

Hi Lukas,

Leland has made some really solid progress recently with the pynndescent library that we use in UMAP for computing approximate nearest neighbours. As such, I don't think you are going to gain much by farming the pre-computation of nearest neighbours out to FAISS. You also shouldn't need to perform PCA to reduce the dimension before running UMAP.

I would ask what your intention behind using AlignedUMAP is. Are you trying to examine how the data changes over time? Are you trying to find a more memory-efficient method of training a full embedding of your data? Or are you just looking for a fast way to transform new data into the same space that you've learned on your previous data (or a sub-sample of your previous data)? A good example of the last case: you don't think you need all 5 million data points to form a representative sample, so you'd like to learn an embedding on a subset of that data for memory (or computational time) reasons and then embed the rest of your data into that space, as in the sketch below.
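A minimal sketch of that subsample-then-transform pattern (the sizes and random data here are illustrative stand-ins):

import numpy as np
import umap

rng = np.random.default_rng(42)
X = rng.normal(size=(100_000, 512))  # stand-in for the real 512-d embeddings

# Fit UMAP on a representative subsample to bound memory and compute time...
sample_idx = rng.choice(X.shape[0], size=20_000, replace=False)
mapper = umap.UMAP().fit(X[sample_idx])

# ...then embed the remaining points into the learned space.
rest_idx = np.setdiff1d(np.arange(X.shape[0]), sample_idx)
rest_embedding = mapper.transform(X[rest_idx])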

If you are just looking for a nice, coherent (and efficient) way to transform new data points into your learned UMAP space, then I highly recommend the new parametric UMAP functionality. It builds a neural network function that maps data into the low-dimensional UMAP space. That has the advantages of both faster transforms and an easily updatable model.
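A minimal sketch of that workflow (ParametricUMAP lives in umap.parametric_umap and needs TensorFlow installed; the data here is an illustrative stand-in):

import numpy as np
from umap.parametric_umap import ParametricUMAP

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 512))  # stand-in training data
X_new = rng.normal(size=(1_000, 512))     # a later batch to embed

# fit_transform trains a neural network that maps into the
# low-dimensional UMAP space.
embedder = ParametricUMAP(n_components=2)
train_embedding = embedder.fit_transform(X_train)

# New data is embedded by a forward pass through the trained network,
# which is fast and lands in the same learned space.
new_embedding = embedder.transform(X_new)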

I'll apologize in advance if I misunderstood what you are trying to accomplish.

Cheers,
John

@lukaschoebel
Contributor Author

Hi @jc-healy!

Thanks for your fast reply and also for your talks at PyData - I really enjoyed them.

My current objective is to find the best possible low-dimensional representation of a large quantity of high-dimensional embeddings. Even though I don't expect any temporal changes in my data, I still want to transform new batches of high-dimensional samples with the trained UMAP model from time to time. My intention behind using AlignedUMAP was to have a more memory-efficient alternative to "vanilla" UMAP. As you proposed, I would certainly compute the UMAP model only on a subset of the 5M samples and then transform the rest. However, if the subset is still large, I would still need a memory-efficient way to do this initial computation. Is my assumption correct that the last mapper of AlignedUMAP can be understood as the combination of all UMAP models on the smaller subsets? Am I following the right approach if I fit_transform my data on the last of the appended mappers as follows:

# ...
reduced_embedding = aligned_mapper.mappers_[-1].fit_transform(digits.data)

Thanks for pointing me towards ParametricUMAP. I will definitely have a look at it as well.

Best,
Lukas

@lukaschoebel
Contributor Author

In case anyone wonders how to run UMAP on larger datasets, I would highly suggest following the approach John proposed: take a representative sample and fit ParametricUMAP (PUMAP) on it. With a GPU you can adapt the underlying neural network to your needs and reduce the dimensions in a reasonable amount of time; a sketch of the full recipe follows below.
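Putting the pieces together, a minimal sketch of that recipe (paths, sizes, and random data are illustrative; ParametricUMAP.save persists the underlying Keras model so later batches can be embedded on demand):

import numpy as np
from umap.parametric_umap import ParametricUMAP, load_ParametricUMAP

rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 512))  # stand-in for the 5M x 512 embeddings

# Fit PUMAP on a representative subsample only.
sample = X[rng.choice(X.shape[0], size=20_000, replace=False)]
embedder = ParametricUMAP(n_components=2).fit(sample)

# Persist the model, then transform later batches with a reloaded copy.
embedder.save("pumap_model")  # illustrative path
reloaded = load_ParametricUMAP("pumap_model")
batch_embedding = reloaded.transform(X[:50_000])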
