Mixed-type datasets #58

amir-ghasemi · 2018-03-16T15:59:27Z

Thanks for sharing the great algorithm and library!

Wondering what would be the recommended way of feeding mixed-type data with some categorical features to UMAP? Binary encoding (possibly with appropriate distance metrics)?

lmcinnes · 2018-03-16T20:34:20Z

This is actually a problem I am working towards solving in general, but I do not yet have all the required pieces in place in the code, so unfortunately there is no easy pluggable solution at this time. As a teaser of what is to come: if you have just one categorical column, you can use the 0.3dev branch and do a fit with X as the numerical data and y as the single categorical column (cast to 0-up integers, one for each category). This is the upcoming supervised (and semi-supervised) dimension reduction. Going a step further is not really well supported yet, but working off the current master on GitHub you can theoretically do something like the following: first split the data into numeric and categorical, then binarize the categorical data (pd.get_dummies or similar). Then something along the lines of:

fit1 = umap.UMAP().fit(numeric_data)
fit2 = umap.UMAP(metric='dice').fit(categorical_data)
prod_graph = fit1.graph.multiply(fit2.graph)
new_graph = 0.99 * prod_graph + 0.01 * (fit1.graph + fit2.graph - prod_graph)
embedding = umap.umap_.simplicial_set_embedding(new_graph, fit1.n_components, fit1.initial_alpha, fit1.a, fit1.b, fit1.gamma, fit1.negative_sample_rate, 200, fit1.init, np.random, False)

More interesting things can be done with more mixed data, but it's really built off variations on this sort of approach.
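
To make the arithmetic in the snippet above concrete, here is a minimal pure-Python sketch on a toy pair of graphs, each stored as a dictionary mapping an edge (i, j) to its membership strength. The helper name `blend_graphs`, the `intersection_share` parameter, and the toy numbers are illustrative assumptions, not part of UMAP. The formula is the one from the snippet: mostly the elementwise product (fuzzy intersection), plus a small share of a + b - a*b (fuzzy union).

```python
def blend_graphs(graph_a, graph_b, intersection_share=0.99):
    """Blend two fuzzy graphs given as {(i, j): membership strength} dicts.

    Missing edges count as strength 0.0, as in a sparse matrix.
    """
    blended = {}
    for edge in set(graph_a) | set(graph_b):
        a = graph_a.get(edge, 0.0)
        b = graph_b.get(edge, 0.0)
        product = a * b          # fuzzy intersection of the two edge strengths
        union = a + b - product  # fuzzy (probabilistic) union
        blended[edge] = (
            intersection_share * product + (1.0 - intersection_share) * union
        )
    return blended


# Toy graphs: edge (0, 1) is in both, (0, 2) only in the numeric graph,
# and (1, 2) only in the categorical graph.
numeric_graph = {(0, 1): 0.8, (0, 2): 0.5}
categorical_graph = {(0, 1): 0.5, (1, 2): 1.0}

blended = blend_graphs(numeric_graph, categorical_graph)
for edge in sorted(blended):
    print(edge, round(blended[edge], 4))
```

With these weights an edge that appears in only one graph keeps just 1% of its strength, so the blended graph is dominated by edges that both views agree on.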

amir-ghasemi · 2018-03-16T20:38:18Z

Thanks Leland! This makes sense. The example you provided is great. I will give it a try and report back.

jay-reynolds · 2018-09-08T18:25:59Z

Hi, I'm using umap 0.3.2 and trying the approach outlined above but running into problems in that the resulting embedding produces a single, centered globular distribution, whereas the separate embeddings of my two distinct feature types (interval vs categorical) exhibit interesting structure.

I've done the following, for instance:

fit1 = umap.UMAP(metric='braycurtis').fit(df_b.values)
fit2 = umap.UMAP(metric='jaccard').fit(df_dummies.values)
prod_graph = fit1.graph_.multiply(fit2.graph_)
new_graph = 0.99 * prod_graph + 0.01 * (fit1.graph_ + fit2.graph_ - prod_graph)
embedding = umap.umap_.simplicial_set_embedding(fit1._raw_data, new_graph, fit1.n_components, 
                                                fit1.initial_alpha, fit1._a, fit1._b, 
                                                fit1.repulsion_strength, fit1.negative_sample_rate, 
                                                200, fit1.init, np.random, fit1.metric, 
                                                fit1._metric_kwds, False)

It wasn't clear what I should have used for the data parameter in simplicial_set_embedding(). I tried both fit1._raw_data and fit2._raw_data, but neither alone seems appropriate here -- both produce similar results (a single diffuse blob).

Any advice would be greatly appreciated!

lmcinnes · 2018-09-08T19:30:27Z

The data you pass in to simplicial set embedding shouldn't matter too much unless you end up with lots of separate connected components. I admit that I can't say immediately what might be causing this -- it looks like you are doing something fairly reasonable. There is some slightly newer code that you could try, but I'm not sure it will help in your case. I'll have to look up what the right code is, because it would be a series of internal function calls, and I don't have time right now. I'll try to get back to you soon.

lmcinnes · 2018-09-09T00:43:54Z

Okay, so this is a little less than ideal because these are decidedly not public APIs, so it gets messy, but you can try:

fit1 = umap.UMAP(metric='braycurtis').fit(df_b.values)
fit2 = umap.UMAP(metric='jaccard').fit(df_dummies.values)
intersection = umap.umap_.general_simplicial_set_intersection(fit1.graph_, fit2.graph_, weight=0.5)
intersection = umap.umap_.reset_local_connectivity(intersection)
embedding = umap.umap_.simplicial_set_embedding(fit1._raw_data, intersection, fit1.n_components, 
                                                fit1.initial_alpha, fit1._a, fit1._b, 
                                                fit1.repulsion_strength, fit1.negative_sample_rate, 
                                                200, 'random', np.random, fit1.metric, 
                                                fit1._metric_kwds, False)

which may work a little better. Note that we are passing 'random' instead of fit1.init as this will ensure the fit1._raw_data and fit1.metric won't come into it at all. This is not exactly ideal, but it might suffice to see if we can get a better result for you.

The other thing to note is the mixing weight passed to general_simplicial_set_intersection. This is the balance between fit1 and fit2. A value of 0.0 means essentially just use fit1 and a value of 1.0 means essentially just use fit2. You can try playing with values in between to see if that can help move you away from a pure blobby structure.
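
As a rough illustration of what such a mixing weight does, the sketch below interpolates between two edge strengths with a weighted geometric mean. This is an assumed, simplified formula chosen only because it has the same endpoints as described above (0.0 gives the first graph, 1.0 gives the second). It is not UMAP's internal implementation, and the function name `weighted_fuzzy_intersection` is hypothetical.

```python
def weighted_fuzzy_intersection(a, b, weight=0.5):
    """Interpolate between two membership strengths (illustrative formula only).

    weight=0.0 returns a (first graph only); weight=1.0 returns b (second only).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Weighted geometric mean: the exponents always sum to 1.
    return (a ** (1.0 - weight)) * (b ** weight)


a, b = 0.9, 0.4  # strength of the same edge in the two graphs
for weight in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(weight, round(weighted_fuzzy_intersection(a, b, weight), 4))
```

Sweeping the weight over a small grid like this, and looking at the resulting embeddings, is one way to explore the values in between.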

jay-reynolds · 2018-09-10T21:07:44Z

Yes, this works much better, thank you!
I'll wedge the range for the mix value and report my findings.

lmcinnes · 2018-09-11T00:22:04Z

Glad that it's working better. A proper interface for general dataframe handling, modeled on the new sklearn ColumnTransformer and built on this newer code, is among my plans for 0.4. It is good to know it works (at least somewhat) in practice.

jlmelville · 2018-11-11T00:40:37Z

To extend this idea to multiple blocks of data, if we have say three graphs, graphA, graphB, graphC, is it sufficient to do (in pseudo-code):

intersectAB = general_simplicial_set_intersection(graphA, graphB)
intersectABC = general_simplicial_set_intersection(intersectAB, graphC)
intersectABC = reset_local_connectivity(intersectABC)

or does reset_local_connectivity need to be called after each pair of graphs are intersected?

lmcinnes · 2018-11-11T01:10:41Z

It is an interesting question -- theoretically you don't need to reset the local connectivity 'til the end, but implementation-wise I believe it would be more beneficial to do so at each step. I have been getting started on this and am still playing with the right implementation approximations to what theory says should be done.
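
The two options discussed here (reset after every pairwise intersection, or only once at the end) can be sketched as a fold over a list of graphs. The helper `intersect_all` and its `reset_each_step` flag are hypothetical. The `intersect` and `reset` callables are passed in so the sketch runs without UMAP; the toy run uses plain sets and a counting stand-in.

```python
from functools import reduce


def intersect_all(graphs, intersect, reset, reset_each_step=True):
    """Fold a pairwise intersection over a list of graphs.

    `intersect(a, b)` and `reset(g)` are supplied by the caller; with UMAP they
    would wrap the internal functions named in this thread.
    """
    if not graphs:
        raise ValueError("need at least one graph")

    def step(accumulated, graph):
        combined = intersect(accumulated, graph)
        return reset(combined) if reset_each_step else combined

    result = reduce(step, graphs[1:], graphs[0])
    # When resetting only at the end, apply the reset once to the final result.
    return result if reset_each_step else reset(result)


# Stand-ins so the sketch runs without UMAP: sets of edges and a counting reset.
reset_calls = []


def toy_reset(graph):
    reset_calls.append(len(graph))
    return graph


toy_graphs = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4, 5}]
combined = intersect_all(toy_graphs, lambda a, b: a & b, toy_reset)
print(combined, len(reset_calls))
```

With real graphs, `intersect` could be a lambda around `umap.umap_.general_simplicial_set_intersection` and `reset` could be `umap.umap_.reset_local_connectivity`; both are internal functions, so their signatures may change between versions.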

RitterHannah · 2019-03-06T20:45:54Z

How would new data then be processed?
I am clueless what to do after calling

test_numerical_transform = fit1.transform(X_numerical_test)
test_categorical_transform = fit2.transform(X_categorical_test)

lmcinnes · 2019-03-30T15:13:19Z

Transform for mixed-type data is something that is certainly not available at this time. It isn't theoretically infeasible, but implementation-wise it would require a non-negligible amount of code refactoring to make it happen. Sorry @MeTooThanks .

ekerazha · 2019-04-14T08:59:16Z

@MeTooThanks Maybe you could try to train a Neural Network (or an Extreme Learning Machine) to learn the mapping between the numeric and categorical transformed data and the "intersected" transformed data. @lmcinnes Opinions?

lmcinnes · 2019-04-14T14:04:09Z

@ekerazha You can certainly try, but I suspect getting good parameters/architecture for the network and successfully training it without overfitting to the original input data will be a potentially large challenge.

ivenzor · 2019-06-17T17:31:15Z

@lmcinnes Hello, thanks for your awesome work in UMAP.
Quick question: I have read issues #58, #104 and #241, and I just wanted to confirm that, in order to use categorical, ordinal and/or mixed datasets, the best way to handle this data in UMAP at the moment is to split the numeric and categorical variables, one-hot encode the categoricals (pd.get_dummies) and use a dice metric for them, then merge the two splits and continue. Am I correct?

lmcinnes · 2019-06-18T02:54:31Z

@ivenzor Yes, that would be the right approach right now. One alternative would be to check out the 0.4dev branch, which has a (very!) experimental class called DataFrameUMAP that takes a pandas dataframe and a tuple of (column name, metric) lists (similar to the ColumnTransformer in sklearn) and does all the required manipulations.
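
For readers wondering what the dice metric sees after one-hot encoding, here is a small pure-Python sketch. The helpers `one_hot` and `dice_dissimilarity` and the toy rows are illustrative assumptions; in practice pd.get_dummies and metric='dice' play these roles.

```python
def one_hot(rows):
    """One-hot encode rows of categorical values, one block of columns per feature."""
    n_features = len(rows[0])
    categories = [sorted({row[k] for row in rows}) for k in range(n_features)]
    return [
        [1 if row[k] == cat else 0 for k in range(n_features) for cat in categories[k]]
        for row in rows
    ]


def dice_dissimilarity(u, v):
    """Dice dissimilarity between two binary vectors: differ / (2 * both + differ)."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    differ = sum(1 for x, y in zip(u, v) if bool(x) != bool(y))
    if both == 0 and differ == 0:
        return 0.0  # two all-zero vectors are treated as identical
    return differ / (2 * both + differ)


rows = [
    ("red", "small", "yes"),
    ("red", "large", "yes"),
    ("blue", "large", "no"),
]
encoded = one_hot(rows)
print(dice_dissimilarity(encoded[0], encoded[1]))  # rows disagree in 1 of 3 columns
print(dice_dissimilarity(encoded[0], encoded[2]))  # rows disagree in all 3 columns
```

When every categorical column contributes exactly one 1 per row, the Dice dissimilarity reduces to the fraction of columns in which the two rows disagree.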

ivenzor · 2019-06-18T14:10:20Z

Ok, I will also check the experimental DataFrameUMAP. Many thanks for your reply.

gibsramen · 2019-12-13T01:38:12Z

Hi, all.

Recently I worked on a small project looking at this very issue. I used the cluster package in R to calculate the Gower distance matrix on mixed-data and passed that to UMAP with metric="precomputed". Results turned out pretty well (seems better than one-hot encoding), so it's one option for anyone who would like to do some mixed-type analysis while this functionality is still in development.

To my knowledge Gower distance isn't implemented in any Python package (though I am working on remedying that right now...)
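
For reference, a basic Gower distance can be sketched in a few lines of pure Python. This is a simplified version under stated assumptions: no missing values, equal feature weights, numeric features scaled by their observed range, and categorical features compared by simple equality. The function name `gower_matrix` and the toy rows are illustrative, and this is not the R cluster package's implementation.

```python
def gower_matrix(rows, is_categorical):
    """Pairwise Gower distances for rows of mixed numeric/categorical values.

    Numeric features contribute |x - y| / range; categorical features contribute
    0 when equal and 1 otherwise. The distance is the mean over all features.
    """
    n_features = len(is_categorical)
    ranges = []
    for k in range(n_features):
        if is_categorical[k]:
            ranges.append(None)
        else:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))

    def distance(a, b):
        total = 0.0
        for k in range(n_features):
            if is_categorical[k]:
                total += 0.0 if a[k] == b[k] else 1.0
            elif ranges[k] > 0:  # a constant numeric column contributes nothing
                total += abs(a[k] - b[k]) / ranges[k]
        return total / n_features

    return [[distance(a, b) for b in rows] for a in rows]


rows = [
    (1.0, 10.0, "a"),
    (3.0, 10.0, "b"),
    (5.0, 20.0, "a"),
]
D = gower_matrix(rows, is_categorical=[False, False, True])
for line in D:
    print([round(value, 4) for value in line])
```

The resulting square matrix, converted to an array, is the kind of input that metric="precomputed" expects.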

arnaud-nt2i · 2021-04-13T12:36:05Z

@lmcinnes When running the following piece of code with either the 'jaccard' or 'dice' metric for fit2, I get this warning:
gradient function is not yet implemented for {dice or jaccard} distance metric; inverse_transform will be unavailable
Is the resulting embedding wrong?
What should I do to get good results?
(umap-learn 0.5.1, installed with conda on Windows 10)

fit1 = umap.UMAP(metric='braycurtis').fit(df_b.values)
fit2 = umap.UMAP(metric='jaccard').fit(df_dummies.values)
intersection = umap.umap_.general_simplicial_set_intersection(fit1.graph_, fit2.graph_, weight=0.5)
intersection = umap.umap_.reset_local_connectivity(intersection)
embedding = umap.umap_.simplicial_set_embedding(fit1._raw_data, intersection, fit1.n_components, 
                                                fit1.initial_alpha, fit1._a, fit1._b, 
                                                fit1.repulsion_strength, fit1.negative_sample_rate, 
                                                200, 'random', np.random, fit1.metric, 
                                                fit1._metric_kwds, False)

lmcinnes · 2021-04-14T20:04:10Z

Fortunately that is just a warning: you won't be able to use the inverse_transform method -- but then you can't use that in the model here anyway, so there is no loss. Everything is working.
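
If the warning is only noise in this workflow, it can be silenced with the standard warnings module. The sketch below assumes the warning text starts with the phrase quoted above; `fit_quietly` and `noisy_fit` are hypothetical names, and `noisy_fit` is a stand-in so the example runs without UMAP installed.

```python
import warnings

# Assumed to match the start of the warning text quoted in this thread.
MESSAGE_PREFIX = "gradient function is not yet implemented"


def fit_quietly(fit):
    """Run `fit()` while ignoring only the inverse_transform warning."""
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", message=MESSAGE_PREFIX)
        return fit()


def noisy_fit():
    # Stand-in for something like umap.UMAP(metric="jaccard").fit(...).
    warnings.warn(
        "gradient function is not yet implemented for jaccard distance metric; "
        "inverse_transform will be unavailable"
    )
    return "fitted"


# Record whatever warnings still escape, to check that the filter worked.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = fit_quietly(noisy_fit)

print(result, len(caught))
```

Because the filter matches on the message prefix, other warnings raised during the fit are still shown.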

arnaud-nt2i · 2021-04-15T07:57:47Z

ok thank you

mohammad-saber · 2021-06-23T06:57:15Z

@gibsramen Thank you for sharing your experience with Gower distance.
If we have train and test datasets, how can we use this idea to call fit_transform on the train dataset and transform on the test dataset?

j-cahill · 2022-03-28T15:16:25Z

@lmcinnes is this no longer relevant following the ability to perform intersections as outlined at ? I tried the two alongside one another and got some pretty different results

lmcinnes · 2022-03-28T19:11:43Z

The intersection functionality is more mature, so I would definitely suggest that as the right way to go. If you are getting very different results and you prefer the older method it may be worth having a conversation and digging in to why the differences occur.

j-cahill · 2022-03-30T01:24:38Z

Is there a way to weight the categorical vs binary mix with the intersection functionality? It looked like it was set to 50/50 by default, and I didn't see a way to do it using * as the intersection mapper.

lmcinnes · 2022-03-30T20:12:12Z

No, that is a definite downside. If you need mixing weights you can take a look at how the ``__mul__`` operator is implemented and potentially code up an equivalent that adds mixing weights yourself. As it is, operator overloading in Python doesn't really allow for adding mix weights, so I left it out. Perhaps having a different explicit method that supports mix weights might not be a bad idea. Leland.

j-cahill · 2022-03-31T13:37:40Z

Makes sense - I'd be willing to pick that up and give it a shot if so

ratheraarif · 2022-09-21T07:35:24Z

Is there a way to perform the same integrative analysis in the R implementation of UMAP? I am interested in analysing data coming from different sources by combining them via UMAP and then clustering the final embeddings. I am able to do such an analysis in Python but not in R.

jlmelville · 2022-09-21T14:18:08Z

Not sure what you are looking for, but if you use uwot in R there is some support for mixed data types.

ratheraarif · 2022-09-21T14:29:20Z

Thank you for the reply!
I am doing the following analysis in python and want to do the same in R

import numpy as np
import umap
from sklearn.decomposition import PCA

X_reduced= PCA(n_components = 20).fit_transform(X)
Y_reduced = PCA(n_components = 10).fit_transform(Y)

fit1 = umap.UMAP(n_components = 2, min_dist = 1, n_neighbors = 93, n_epochs = 1000, 
                 init = X_reduced[:, 0:2], 
                 verbose = 2).fit(X_reduced)
fit2 = umap.UMAP(n_components = 2, min_dist = 0.8, n_neighbors = 93, n_epochs = 1000, 
                 init = Y_reduced[:, 0:2], 
                 verbose = 2).fit(Y_reduced)
intersection = umap.umap_.general_simplicial_set_intersection(fit1.graph_,
                                                              fit2.graph_,
                                                              weight = 0.45)
intersection = umap.umap_.reset_local_connectivity(intersection)
embedding = umap.umap_.simplicial_set_embedding(fit1._raw_data, intersection, 
                                                fit1.n_components, 
                                                fit1.learning_rate, 
                                                fit1._a, fit1._b, 
                                                fit1.repulsion_strength, 
                                                fit1.negative_sample_rate, 
                                                1000, 'random', np.random, 
                                                fit1.metric, 
                                                fit1._metric_kwds, False)

I wish to replicate the above operation in R. Is there a way?

jlmelville · 2022-09-21T14:52:01Z

@ratheraarif unfortunately the ability to carry out intersections of the simplicial set with a user-defined weight is not exposed in the uwot API. There is an internal function you can call, but you can't do anything useful with the output because the ability to call the equivalent of the Python simplicial_set_embedding with arbitrary data is also not yet supported (but will be: jlmelville/uwot#98 ). Sorry for now. So you can only have a weight of 0.5.

iterakhtaras · 2023-03-11T22:19:21Z

Could anyone point me towards the DataframeUMAP class?

lmcinnes mentioned this issue Aug 5, 2018: How to project numerical and categorical data? #104 (Open)
lmcinnes mentioned this issue Sep 25, 2018: Multiple real valued labels #145 (Open)
lmcinnes mentioned this issue Dec 17, 2018: Multiple target labels #184 (Open)
ghost mentioned this issue Aug 14, 2019: transform() for mixed dataset #276 (Open)
GenevieveBuckley mentioned this issue Oct 2, 2019: UMAP dataframe distance metrics for mixed type categorical data #303 (Open)
candalfigomoro mentioned this issue Jan 15, 2020: Handling of mixed type datasets with categorical features beringresearch/ivis#58 (Closed)
lmcinnes mentioned this issue Jan 18, 2020: Beginners manual #343 (Closed)
candalfigomoro mentioned this issue Feb 24, 2021: Weights when combining multiple UMAP models #601 (Open)
lukaschoebel mentioned this issue Apr 28, 2021: Best Practices for AlignedUMAP on large datasets #658 (Closed)
jlmelville mentioned this issue Sep 22, 2022: Add general_simplicial_set_intersection to the uwot API jlmelville/uwot#101 (Closed)
