How can we apply the Gower metric to UMAP? #356

Open
simeng-yang opened this issue Jan 30, 2020 · 10 comments

Comments

@simeng-yang

simeng-yang commented Jan 30, 2020

From my rough work, if we let the custom metric be the Gower metric, the distance matrix for all points in the dataset can be computed for both numerical and categorical data. However, it seems this is simplest when we only use the Gower metric for precomputing the distance matrix for the entire dataset, i.e. with

umap.UMAP(metric="precomputed").fit_transform(precomputed_distances)

While it is possible to compute the distance matrix for a dataset beforehand, metric="precomputed" is unsuitable for inference, since it doesn't allow a .transform on new data once the embedding has been fitted.
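To make the precomputed route concrete, here is a minimal sketch of building a Gower distance matrix with NumPy. The function name `gower_matrix`, the equal feature weights, and the split into a numeric array and an integer-coded categorical array are all assumptions made for illustration, not anything UMAP provides.

```python
import numpy as np

def gower_matrix(num, cat):
    """Pairwise Gower distances for n rows (hypothetical helper).

    num: (n, p) array of numeric features
    cat: (n, q) array of integer category codes
    """
    num = np.asarray(num, dtype=np.float64)
    cat = np.asarray(cat)
    # Range-normalise each numeric column; constant columns contribute 0.
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0
    num_part = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    # Categorical columns contribute 1 on a mismatch and 0 on a match.
    cat_part = (cat[:, None, :] != cat[None, :, :]).astype(np.float64)
    total = num_part.sum(axis=2) + cat_part.sum(axis=2)
    # Assumption: every feature gets equal weight.
    return total / (num.shape[1] + cat.shape[1])

num = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
cat = np.array([[0], [0], [1]])
dm = gower_matrix(num, cat)
```

The resulting `dm` is what would be passed as `precomputed_distances` in the call above. Note that this builds the full n x n matrix in memory, so it only suits modestly sized datasets.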

I think what I would want is to have a metric which can be plugged into umap.UMAP() such that this metric can handle both numerical and categorical features.

From the examples in the docs, it seems the metric is used to compute the distance between each pair of points separately (i.e. such a metric returns distance(point1, point2)).

I'm wondering how one could use the Gower distance metric for both fitting against training data and transforming on test data?

Or is transform for mixed datasets currently still unsupported despite the above?

This is important for me since I'm trying to use UMAP for dimensionality reduction on complex mixed datasets for inference/classification.

@lmcinnes
Owner

To make that work you would need to write a gower_distance function and numba jit compile it. You can pass in a jit compiled function as a metric and it will all "just work". In practice, because of the way Gower distance works, you probably want to define a custom function for your specific datatype (applying suitable dissimilarity measures on different column indexes and then summing it all up).
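As a rough sketch of what such a function could look like: the column layout below (the first `N_NUMERIC` columns numeric and already scaled to [0, 1], the rest integer category codes stored as floats) is a hypothetical choice for illustration, and the features are averaged with equal weight. The `try`/`except` fallback only exists so the sketch still runs where numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch runs without numba (uncompiled, so slow).
    def njit(func):
        return func

# Hypothetical layout: columns [0, N_NUMERIC) are numeric and pre-scaled
# to [0, 1]; the remaining columns hold category codes stored as floats.
N_NUMERIC = 2

@njit
def gower_distance(x, y):
    total = 0.0
    for i in range(x.shape[0]):
        if i < N_NUMERIC:
            # Numeric feature: absolute difference on the scaled values.
            total += abs(x[i] - y[i])
        elif x[i] != y[i]:
            # Categorical feature: 1 on a mismatch, 0 on a match.
            total += 1.0
    # Assumption: equal weight per feature.
    return total / x.shape[0]

a = np.array([0.0, 0.5, 1.0, 2.0], dtype=np.float32)
b = np.array([1.0, 0.5, 1.0, 3.0], dtype=np.float32)
d = gower_distance(a, b)
```

The function takes one pair of 1D arrays and returns a single distance, so it would be passed as `umap.UMAP(metric=gower_distance)`. The input still has to be a purely numeric array for this to work.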

@simeng-yang
Author

simeng-yang commented Jan 30, 2020

Thanks @lmcinnes!

I just have a few questions:

Firstly, does the distance metric compute the distance between all pairwise points in one call and return a matrix of distances or does it compute one distance between a pair of points and is repeatedly called to populate a distance matrix between all points in the dataset? The existing Python implementations I've seen for the Gower distance all output a matrix for the pairwise distances between all points, rather than an actual distance between one pair of points.

As for the jit compilation with numba, I believe your examples used njit (nopython jit), which produces the fastest compiled code. Would simply using jit (or even slower, jit with forceobj=True) also fit the bill? For some function gower_distance(x, y), I had both x and y as 1D series / np.arrays (representing high-dimensional datapoints), neither of which seems to be supported by njit, hence resorting to jit.

Also, I would want to have a relatively simple, but highly generic Gower metric which can be easily used across multiple datasets (for some "out of the box" inference solution which can apply to arbitrary mixed datasets). In that case, do you think glossing over the customizing of the Gower metric would lead to acceptable results?

If not, how would you propose a suitable learning / fitting preprocessing step for tuning the Gower metric on arbitrary mixed datasets?

@simeng-yang
Author

simeng-yang commented Jan 30, 2020

One more roadblock I discovered for myself was that when I tried to fit / transform UMAP on a mixed dataset, it failed due to this line:
X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C")

The issue for me was that check_array tries to convert the input to a float32 numpy array; because certain columns have dtype object, it attempts to convert strings to float, which raises an error.

So, even if the Gower metric were to be suitably defined, it seems that using fit / transform on mixed datasets just isn't supported currently?

Could you please confirm if this is correct?
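One possible way around the check_array failure, sketched here under assumptions, is to encode the mixed data as a single float32 array before handing it to UMAP: categorical columns become integer codes and numeric columns are min-max scaled. `encode_mixed` is a hypothetical helper, not part of UMAP or scikit-learn.

```python
import numpy as np

def encode_mixed(columns):
    """Encode a list of 1D columns (numeric or string) as one float32 array.

    Returns the encoded array and a boolean mask marking categorical columns.
    """
    encoded = []
    is_categorical = []
    for col in columns:
        col = np.asarray(col)
        if col.dtype.kind in "OUS":  # object / unicode / bytes: categorical
            # Replace each distinct label with an integer code.
            _, codes = np.unique(col, return_inverse=True)
            encoded.append(codes.astype(np.float32))
            is_categorical.append(True)
        else:
            # Min-max scale numeric columns to [0, 1].
            col = col.astype(np.float32)
            span = col.max() - col.min()
            scaled = (col - col.min()) / span if span > 0 else np.zeros_like(col)
            encoded.append(scaled)
            is_categorical.append(False)
    return np.column_stack(encoded), np.array(is_categorical)

X, mask = encode_mixed([
    np.array([1.0, 3.0, 5.0]),
    np.array(["fire", "water", "fire"]),
])
```

The category codes are only meaningful to a metric that compares them for equality, as a Gower-style metric does. For a later transform on new data, the same label-to-code mapping and the same numeric ranges from the training data would have to be reused, which this sketch does not store.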

@sleighsoft
Collaborator

sleighsoft commented Feb 4, 2020

With proper checks in place you could also change the line in a PR.

See here for an example of adding a distance metric: lmcinnes/pynndescent#86

@simeng-yang
Author

Thanks for the response, @sleighsoft

So you're suggesting creating a PR to enable passing mixed datasets into the UMAP transformer (with proper checks to validate that such a dataset is usable with mixed data metrics)?

Also, your example links back directly to this issue - did you intend to link to somewhere else instead?

For now, I've actually opted to use FAMD for mixed data analysis instead, due to the above gotchas, although I continue to be interested in UMAP too.

@sleighsoft
Collaborator

My mistake, I copied the wrong link. Updated it now.

Every contribution is definitely welcome :)

@simeng-yang
Author

Thanks for the link!

That seems like a great example of implementing a custom distance metric and I will likely refer to it if I end up adapting the Gower metric for UMAP.

Interestingly enough though, I saw several references to the Gower metric for UMAP, but did not come across any njit implementation of the Gower metric for use with UMAP.

@AdamSpannbauer

Hi @simeng-yang, were you able to successfully implement Gower for UMAP? I'm interested in exploring the same thing, and I'd be very interested to see your implementation before starting from scratch.

@simeng-yang
Author

Hey @AdamSpannbauer, due to the above issues with UMAP not being directly suited to mixed datasets and having non-negligible runtime overhead compared to some simpler methods, I did not make any further progress on this path.

Notably, I ended up investigating FAMD - Factor Analysis of Mixed Data - instead, which is a union of linear techniques that can handle both numerical and categorical data. Perhaps you might be interested in taking a look there.

However, if you do want to further explore the option of creating a custom implementation of the Gower metric for UMAP, you may wish to refer to these existing standalone Gower metric implementations and try to "refit" those implementations to work with UMAP.

You would also have to develop the proper checks to handle mixed datasets with object columns. You can see here for an example of adding a distance metric: lmcinnes/pynndescent#86 (credit to @sleighsoft).

I think this would still be a worthwhile endeavor. Mixed datasets are very prevalent in a wide variety of data analysis situations.

@lorenelovestudy

@simeng-yang @lmcinnes
Hello! I ran into the same problem when running UMAP with the Gower metric, and I eventually found Gower UMAP (gUMAP), an implementation of UMAP that uses the Gower metric to integrate both categorical and numeric data (https://github.com/gibsramen/gUMAP/blob/master/index.md). It solved my problem with the following code:
reducer_gower = umap.UMAP(metric="precomputed", min_dist=0.4, n_neighbors=8, random_state=42)
embedding_gower = reducer_gower.fit_transform(pokemon_dm_gower)
I wonder whether this is feasible and can handle my problem without the numba jit you mentioned above. I would appreciate it if you could reply, thanks!
