Mixed-type datasets #58
Comments
This is actually a problem I am working towards solving in general, but I do not yet have all the required bits and pieces in place in the code, so unfortunately there is no easy pluggable solution at this time. As a teaser of what is to come, if you have just one categorical data column you can use the 0.3dev branch and do a fit with
More interesting things can be done with more mixed data, but it's really built off variations on this sort of approach. |
Thanks Leland! This makes sense. The example you provided is great. I will give it a try and report back. |
Hi, I'm using umap 0.3.2 and trying the approach outlined above but running into problems in that the resulting embedding produces a single, centered globular distribution, whereas the separate embeddings of my two distinct feature types (interval vs categorical) exhibit interesting structure. I've done the following, for instance:
It wasn't clear what I should have used for the data parameter in simplicial_set_embedding(). I tried both fit1._raw_data and fit2._raw_data, but neither alone seems appropriate here -- both produce similar results (a single diffuse blob). Any advice would be greatly appreciated! |
The data you pass in to simplicial set embedding shouldn't matter too much unless you end up with lots of separate connected components. I admit that I can't say immediately what might be causing this -- it looks like you are doing something fairly reasonable. There is some slightly newer code that you could try, but I'm not sure it will help in your case. I'll have to look up what the right code is, because it would be a series of internal function calls, and I don't have time right now. I'll try to get back to you soon. |
Okay, so this is a little less than ideal because these are decidedly not public APIs, so it gets messy, but you can try:
import numpy as np
import umap

# Fit each feature block separately with a metric suited to its type
fit1 = umap.UMAP(metric='braycurtis').fit(df_b.values)
fit2 = umap.UMAP(metric='jaccard').fit(df_dummies.values)
# Combine the two fuzzy graphs; mix_weight sets the balance between them
intersection = umap.umap_.general_simplicial_set_intersection(fit1.graph_, fit2.graph_, mix_weight=0.5)
intersection = umap.umap_.reset_local_connectivity(intersection)
# Lay out the combined graph using fit1's embedding parameters
embedding = umap.umap_.simplicial_set_embedding(fit1._raw_data, intersection, fit1.n_components,
                                                fit1.initial_alpha, fit1._a, fit1._b,
                                                fit1.repulsion_strength, fit1.negative_sample_rate,
                                                200, 'random', np.random, fit1.metric,
                                                fit1._metric_kwds, False)
which may work a little better. Note that we are passing The other thing to note is the |
Yes, this works much better, thank you! |
Glad that it's working better. A proper interface for general dataframe
handling (modelled on the new sklearn ColumnTransformer) built on this newer
code is among my plans for 0.4. It is good to know it works (at least
somewhat) in practice.
On Mon, Sep 10, 2018 at 6:49 PM Jay Reynolds wrote:
Yes, this works much better, thank you!
I'll wedge the range for the mix value and report my findings.
|
To extend this idea to multiple blocks of data, if we have say three graphs,
or does |
It is an interesting question -- theoretically you don't need to reset the local connectivity 'til the end, but implementation-wise I believe it would be more beneficial to do so at each step. I have been getting started on this and am still playing with the right implementation approximations to what theory says should be done. |
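As a rough illustration of the two orderings discussed above (reset after every intersection versus once at the end), here is a toy sketch in plain Python. It does not use umap's internals: graphs are plain dicts from edges to membership strengths, `intersect` uses a simple product on shared edges, and `reset_local` is a hypothetical stand-in that only rescales each vertex's outgoing weights so the strongest is 1.0. All names are assumptions made for illustration.

```python
from functools import reduce

# Toy fuzzy graph: maps a directed edge (i, j) to a membership strength in [0, 1].


def intersect(left, right):
    """Product-style fuzzy intersection restricted to edges present in both graphs."""
    return {edge: left[edge] * right[edge] for edge in left.keys() & right.keys()}


def reset_local(graph):
    """Toy stand-in for a local connectivity reset: rescale each source
    vertex's outgoing weights so that its strongest edge has weight 1.0."""
    strongest = {}
    for (i, _), w in graph.items():
        strongest[i] = max(strongest.get(i, 0.0), w)
    return {(i, j): w / strongest[i] for (i, j), w in graph.items() if strongest[i] > 0}


def intersect_all(graphs, reset_each_step=True):
    """Fold a sequence of graphs into one, resetting per step or only at the end."""
    def step(acc, graph):
        combined = intersect(acc, graph)
        return reset_local(combined) if reset_each_step else combined

    result = reduce(step, graphs)
    return result if reset_each_step else reset_local(result)


g1 = {(0, 1): 1.0, (0, 2): 0.5, (1, 0): 1.0}
g2 = {(0, 1): 0.5, (0, 2): 0.5, (1, 0): 0.8}
g3 = {(0, 1): 1.0, (0, 2): 0.4, (1, 0): 0.5}

per_step = intersect_all([g1, g2, g3], reset_each_step=True)
at_end = intersect_all([g1, g2, g3], reset_each_step=False)
```

In this toy the two orderings agree, because a per-vertex rescale commutes with an edge-wise product. That matches the theoretical remark above but says nothing about the real implementation, where the reset does more than rescale and the approximations may make per-step resets preferable.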
How would new data then be processed?
test_numerical_transform = fit1.transform(X_numerical_test)
test_categorical_transform = fit2.transform(X_categorical_test) |
Transform for mixed-type data is something that is certainly not available at this time. It isn't theoretically infeasible, but implementation-wise it would require a non-negligible amount of code refactoring to make it happen. Sorry @MeTooThanks. |
@MeTooThanks Maybe you could try to train a Neural Network (or an Extreme Learning Machine) to learn the mapping between the numeric and categorical transformed data and the "intersected" transformed data. @lmcinnes Opinions? |
@ekerazha You can certainly try, but I suspect getting good parameters/architecture for the network and successfully training it without overfitting to the original input data will be a potentially large challenge. |
@lmcinnes Hello, thanks for your awesome work in UMAP. |
@ivenzor Yes, that would be the right approach right now. One alternative would be to check out the 0.4dev branch, which has a (very!) experimental class called DataframeUMAP that takes a pandas dataframe and a tuple of (column name, metric) lists (similar to the ColumnTransformer in sklearn) and does all the required manipulations. |
Ok, I will also check the experimental DataFrameUMAP. Many thanks for your reply. |
Hi, all. Recently I worked on a small project looking at this very issue. I used the cluster package in R to calculate the Gower distance matrix on mixed-data and passed that to UMAP with To my knowledge Gower distance isn't implemented in any Python package (though I am working on remedying that right now...) |
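For readers unfamiliar with Gower distance, here is a minimal pure-Python sketch of the idea, not the R `cluster` implementation mentioned above. It assumes the simplest variant: numeric features contribute a range-normalised absolute difference, categorical features contribute 0 on a match and 1 otherwise, and all features are weighted equally. The function names and the `is_categorical` flag list are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower distance between two records.

    numeric_ranges[k] is (max - min) of feature k over the dataset for a
    numeric feature, or None when feature k is categorical.
    """
    total = 0.0
    for x, y, rng in zip(a, b, numeric_ranges):
        if rng is None:
            # Categorical feature: simple mismatch indicator
            total += 0.0 if x == y else 1.0
        elif rng > 0:
            # Numeric feature: absolute difference scaled by the observed range
            total += abs(x - y) / rng
        # A constant numeric feature (range 0) contributes nothing
    return total / len(numeric_ranges)


def gower_matrix(rows, is_categorical):
    """Pairwise Gower distances for a list of equal-length records."""
    ranges = []
    for k, categorical in enumerate(is_categorical):
        if categorical:
            ranges.append(None)
        else:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))
    return [[gower_distance(r, s, ranges) for s in rows] for r in rows]


rows = [
    (1.0, "red"),
    (3.0, "red"),
    (5.0, "blue"),
]
dist = gower_matrix(rows, is_categorical=[False, True])
```

A square distance matrix of this kind is the sort of input UMAP accepts when constructed with `metric='precomputed'`. Note that it scales quadratically with the number of rows.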
@lmcinnes While trying the following piece of code with either 'jaccard' or 'dice' metric for fit2.
|
Fortunately that is just a warning: you won't be able to use the |
ok thank you |
Thank you for sharing your experience with Gower distance. |
@lmcinnes is this no longer relevant following the ability to perform intersections as outlined at ? I tried the two alongside one another and got some pretty different results |
The intersection functionality is more mature, so I would definitely suggest that as the right way to go. If you are getting very different results and you prefer the older method it may be worth having a conversation and digging in to why the differences occur. |
Is there a way to weight the categorical vs binary mix with the intersection functionality? It looked like it was set to 50/50 by default, and I didn't see a way to do it using * as the intersection mapper |
No, that is a definite downside. If you need mixing weights you can take a
look at how the ``__mul__`` operator is implemented and potentially code up
an equivalent, adding in mixing weights yourself. As it is, operator
overloading in Python doesn't really allow for adding mix weights, so I
left it out. Perhaps having a different explicit method that supports mix
weights might not be a bad idea.
Leland.
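To make the design point concrete: a binary operator such as `*` can only receive its two operands, so a weight has to travel through an explicit method. The sketch below uses a hypothetical `ToyModel` class with a dict of edge strengths. The weighted geometric mean it applies is an assumption chosen for illustration and is not the formula umap uses internally.

```python
class ToyModel:
    """Toy stand-in for a fitted model holding a fuzzy graph as {edge: strength}."""

    def __init__(self, graph):
        self.graph = dict(graph)

    def intersect(self, other, mix_weight=0.5):
        """Explicit intersection with a mixing weight.

        mix_weight=0.0 keeps only this model's strengths, 1.0 keeps only the
        other model's, and 0.5 weights both equally (illustrative rule only).
        """
        if not 0.0 <= mix_weight <= 1.0:
            raise ValueError("mix_weight must lie in [0, 1]")
        shared = self.graph.keys() & other.graph.keys()
        combined = {
            edge: self.graph[edge] ** (1.0 - mix_weight) * other.graph[edge] ** mix_weight
            for edge in shared
        }
        return ToyModel(combined)

    def __mul__(self, other):
        # The operator form has no room for an extra argument,
        # so it can only ever use the default weight.
        if not isinstance(other, ToyModel):
            return NotImplemented
        return self.intersect(other)


numeric = ToyModel({(0, 1): 0.25})
categorical = ToyModel({(0, 1): 1.0})

balanced = numeric * categorical
numeric_only = numeric.intersect(categorical, mix_weight=0.0)
categorical_only = numeric.intersect(categorical, mix_weight=1.0)
```

Keeping `__mul__` as a thin wrapper over the explicit method means the operator syntax keeps working while callers who need a weight have somewhere to pass it.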
On Tue, Mar 29, 2022 at 9:24 PM Jesse Cahill wrote:
Is there a way to weight the categorical vs binary mix with the
intersection functionality? it looked like it was set to 50/50 by default
and i didn't see a way to do it using * as the intersection mapper
|
Makes sense - I'd be willing to pick that up and give it a shot if so |
Is there a way to perform the same integrative analysis in the R implementation of UMAP? I am interested in analysing data coming from different sources by combining them via UMAP and then clustering the final embeddings. I am able to do such an analysis in Python but not in R. |
Not sure what you are looking for, but if you use |
Thank you for the reply!
I wish to replicate the following operation in R. Is there a way? |
@ratheraarif unfortunately the ability to carry out intersections of the simplicial set with a user-defined weight is not exposed in the uwot API. There is an internal function you can call, but you can't do anything useful with the output because the ability to call the equivalent of the Python |
Could anyone point me towards the DataframeUMAP class? |
Thanks for sharing the great algorithm and library!
Wondering what would be the recommended way of feeding mixed-type data with some categorical features to UMAP? Binary encoding (possibly with appropriate distance metrics)?
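For context on the binary-encoding idea raised here, this is a small pure-Python sketch of one-hot (dummy) encoding followed by Jaccard distance, the metric used for the dummy-encoded block elsewhere in this thread. The helper names and the sample records are hypothetical, and in practice one would more likely use `pandas.get_dummies` and let UMAP compute the metric.

```python
def one_hot(records, columns):
    """Encode categorical columns as 0/1 indicator vectors.

    Returns the encoded rows plus the (column, value) pair behind each position.
    """
    vocabulary = sorted({(col, rec[col]) for rec in records for col in columns})
    position = {pair: k for k, pair in enumerate(vocabulary)}
    encoded = []
    for rec in records:
        row = [0] * len(vocabulary)
        for col in columns:
            row[position[(col, rec[col])]] = 1
        encoded.append(row)
    return encoded, vocabulary


def jaccard_distance(u, v):
    """1 - |u AND v| / |u OR v| for two binary vectors (0.0 if both are all zero)."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


records = [
    {"color": "red", "shape": "circle"},
    {"color": "red", "shape": "square"},
    {"color": "blue", "shape": "square"},
]
encoded, vocabulary = one_hot(records, columns=["color", "shape"])
d01 = jaccard_distance(encoded[0], encoded[1])  # share one of three active indicators
d02 = jaccard_distance(encoded[0], encoded[2])  # share no active indicators
```

Jaccard ignores positions where both vectors are zero, which is why it tends to suit sparse indicator data better than a metric that counts shared absences as agreement.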