Best Practices for AlignedUMAP on large datasets #658

Closed
lukaschoebel opened this issue Apr 28, 2021 · 3 comments
@lukaschoebel
Contributor

Spoiler alert: This is a question about best practices, so it is not directly related to the library itself. Sorry if this is the wrong place to ask; feel free to close/delete.

Hi @lmcinnes! Hi @ALL!

First of all, I want to thank you for your outstanding work on this package and everything you have contributed! I am a new user, currently trying to perform dimensionality reduction with UMAP on a large dataset (5M+ samples) of 512-dimensional textual embeddings. Inspired by this, my current approach is to use PCA to reduce the data to a lower dimensionality and then, due to memory constraints, compute the respective UMAP transformations in batches with the update functionality of the AlignedUMAP class.

Since the AlignedUMAP approach is quite time-consuming, I first wanted to ask whether I understood correctly that you can take the last UMAP mapper and use it to transform the original data to the lower-dimensional space, or whether I have to combine the UMAP objects somehow (as described here). For reference, I have included a toy example of my current approach with the respective plots. Comparing the aligned transformation on the batches against the "full" one above, the separation looks quite convincing to me at first glance.

Lastly, I would be very interested in how you would generally perform UMAP on larger datasets of reasonably coherent data (textual embeddings). Would it make sense to precompute the distances (e.g. with FAISS) in order to save UMAP computation time? What are your experiences with that?
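For concreteness, here is a minimal sketch of what I mean by precomputing distances (using sklearn's pairwise_distances as a stand-in for FAISS; a full pairwise matrix is of course only feasible for small n):

import sklearn.datasets
import sklearn.metrics
import umap

digits = sklearn.datasets.load_digits()
X = digits.data[:500]  # a full pairwise matrix is only feasible for small n

# Precompute all pairwise distances up front (FAISS would be the
# analogous step for approximate neighbours on large data).
dists = sklearn.metrics.pairwise_distances(X, metric="euclidean")

# UMAP accepts a square distance matrix when metric="precomputed".
embedding = umap.UMAP(metric="precomputed").fit_transform(dists)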

Thanks in advance,
Lukas

Mini Example

import numpy as np
import sklearn.datasets
import umap
import umap.aligned_umap

digits = sklearn.datasets.load_digits()
# Order the samples by target so that consecutive slices stay coherent.
ordered_digits = digits.data[np.argsort(digits.target)]

# Overlapping slices of up to 400 samples, each shifted by 150.
slices = [ordered_digits[150 * i:min(ordered_digits.shape[0], 150 * i + 400)] for i in range(10)]
# Sample i + 150 of one slice is sample i of the next slice.
relation_dict = {i + 150: i for i in range(400 - 150)}
relation_dicts = [relation_dict.copy() for i in range(len(slices) - 1)]

aligned_mapper = umap.AlignedUMAP().fit(slices[:2], relations=relation_dicts[:1])

for i, batch in enumerate(slices[2:]):
    # Batch i of slices[2:] is slice i + 2, so its relation to the
    # previous slice is relation_dicts[i + 1] (inverted for update()).
    aligned_mapper.update(batch, relations={v: k for k, v in relation_dicts[i + 1].items()})

full_embedding = umap.UMAP().fit_transform(digits.data)
# Taking only the last UMAP object to transform the data.
aligned_embedding = aligned_mapper.mappers_[-1].fit_transform(digits.data)

[plot: full_umap, the embedding from the single full UMAP fit]

[plot: aligned_umap, the embedding from the batched AlignedUMAP approach]

@jc-healy
Contributor

Hi Lukas,

Leland has made some really solid progress recently with the pynndescent library that we use in UMAP for computing approximate nearest neighbours. As such, I don't think you are going to gain much by farming the pre-computation of nearest neighbours out to FAISS. You also shouldn't need to perform PCA to reduce the dimension before running UMAP.

I would ask what your intention behind using AlignedUMAP is. Are you trying to examine how the data changes over time? Are you trying to find a more memory-efficient method of training a full embedding of your data? Or are you just looking for a fast way to transform new data into the same space that you've learned on your previous data (or a sub-sample of your previous data)? A good example of the last case: you don't think you need all 5 million data points to form a representative sample, so you'd like to learn an embedding on a subset of that data for memory (or computational time) reasons and then embed the rest of your data into that space, as in the sketch below.
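A minimal sketch of that subsample-then-transform pattern (the sizes and random data here are illustrative stand-ins):

import numpy as np
import umap

rng = np.random.default_rng(42)
X = rng.normal(size=(100_000, 512))  # stand-in for the real 512-d embeddings

# Fit UMAP on a representative subsample to bound memory and compute time...
sample_idx = rng.choice(X.shape[0], size=20_000, replace=False)
mapper = umap.UMAP().fit(X[sample_idx])

# ...then embed the remaining points into the learned space.
rest_idx = np.setdiff1d(np.arange(X.shape[0]), sample_idx)
rest_embedding = mapper.transform(X[rest_idx])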

If you are just looking for a nice, coherent (and efficient) way to transform new data points into your learned UMAP space, then I highly recommend the new parametric UMAP functionality. It builds a neural network function that maps data into the low-dimensional UMAP space. That has the advantages of both faster transforms and an easily updatable model.
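A minimal sketch of that workflow (ParametricUMAP lives in umap.parametric_umap and needs TensorFlow installed; the data here is an illustrative stand-in):

import numpy as np
from umap.parametric_umap import ParametricUMAP

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 512))  # stand-in training data
X_new = rng.normal(size=(1_000, 512))     # a later batch to embed

# fit_transform trains a neural network that maps into the
# low-dimensional UMAP space.
embedder = ParametricUMAP(n_components=2)
train_embedding = embedder.fit_transform(X_train)

# New data is embedded by a forward pass through the trained network,
# which is fast and lands in the same learned space.
new_embedding = embedder.transform(X_new)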

I'll apologize in advance if I misunderstood what you are trying to accomplish.

Cheers,
John

@lukaschoebel
Contributor Author

Hi @jc-healy!

Thanks for your fast reply and also for your talks at PyData - I really enjoyed them.

My current objective is to find the best possible low-dimensional representation of a large quantity of high-dimensional embeddings. Even though I don't expect any temporal changes in my data, I still want to transform new batches of high-dimensional samples with the trained UMAP model from time to time. My intention behind using AlignedUMAP was to have a more memory-efficient alternative to "vanilla" UMAP. As you proposed, I would certainly compute the UMAP model only on a subset of the 5M samples and then transform the rest. However, if the subset is still large, I would still need a memory-efficient way to do this initial computation. Is my assumption correct that the last mapper of AlignedUMAP can be understood as the combination of all UMAP models on the smaller subsets? Am I following the right approach if I fit_transform my data on the last of the appended mappers as follows:

# ...
reduced_embedding = aligned_mapper.mappers_[-1].fit_transform(digits.data)

Thanks for pointing me towards ParametricUMAP. I will definitely have a look at it as well.

Best,
Lukas

@lukaschoebel
Contributor Author

In case anyone wonders how to run UMAP on larger datasets, I would highly suggest following the approach John proposed: take a representative sample and fit ParametricUMAP (PUMAP) on it. With a GPU you can adapt the underlying neural network to your needs and reduce the dimensions in a reasonable amount of time; a sketch of the full recipe follows below.
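Putting the pieces together, a minimal sketch of that recipe (paths, sizes, and random data are illustrative; ParametricUMAP.save persists the underlying Keras model so later batches can be embedded on demand):

import numpy as np
from umap.parametric_umap import ParametricUMAP, load_ParametricUMAP

rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 512))  # stand-in for the 5M x 512 embeddings

# Fit PUMAP on a representative subsample only.
sample = X[rng.choice(X.shape[0], size=20_000, replace=False)]
embedder = ParametricUMAP(n_components=2).fit(sample)

# Persist the model, then transform later batches with a reloaded copy.
embedder.save("pumap_model")  # illustrative path
reloaded = load_ParametricUMAP("pumap_model")
batch_embedding = reloaded.transform(X[:50_000])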
