Feature importance of UMAP output #505

dangkunal · 2020-10-12T20:17:30Z

Hi,

I am learning about visualizing multidimensional data and came across UMAP and t-SNE. Is there any way to also get the feature importance of the output?

By feature importance I mean which variables contribute most to the UMAP output. My question may be ill-posed, but I was curious and am still learning.

Thanks,
Kunal

dmarx · 2021-01-05T19:39:46Z

I think a way that we could rethink this would be as "sensitivity" rather than "importance". In other words, if we were to define "feature importance" as an answer to a question, that question might be: "How sensitive is the UMAP projection to fluctuations in the respective dimensions of the data space?" Here's one way we might answer this (which would be pretty heavy to compute, but could be interesting if you need it):

1. Fit a UMAP embedding to your full data. We'll call this the "canonical" projection.
2. Pick a column i to calculate feature importance on.
3. Randomly shuffle the values in this column. Call the dataset with column i shuffled D_i.
4. Fit a new UMAP embedding to D_i. For stability, we probably want to use some variation of the AlignedUMAP feature coming in 0.5.0.
5. Calculate some summary statistic to quantify the distance between these two embedding spaces. I'm thinking maybe earth mover's distance?
6. Reset the column to its unshuffled state. Rinse and repeat for all columns.

The distance calculated in step 5 then gives us an approximate measure for how sensitive the topology of the canonical embedding space is to changes in that particular dimension, which I posit is roughly what you're looking for in a "feature importance" measure here.

One potential problem I'm foreseeing here is the application of the AlignedUMAP. On the one hand, we sort of have to use it to make sure we can compare the projections (I think?). On the other hand, the parameters of the alignment estimator will probably impact the distance score. The relative distance scores should still be meaningful though, I'd think.
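
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not an implementation from the UMAP library: the names `permutation_sensitivity`, `mean_displacement`, and `toy_embed` are hypothetical, the embedding function is passed in as a callable, and a simple mean per-point displacement stands in for the earth mover's distance suggested in step 5. The toy embedder is deterministic, so it sidesteps the alignment question entirely.

```python
import math
import random


def mean_displacement(emb_a, emb_b):
    """Average Euclidean distance between corresponding points of two embeddings.

    Assumption: a simple stand-in for earth mover's distance; it is only
    meaningful when the two embeddings are aligned with each other.
    """
    return sum(math.dist(p, q) for p, q in zip(emb_a, emb_b)) / len(emb_a)


def permutation_sensitivity(rows, embed, distance=mean_displacement, seed=0):
    """Score each column by how far the embedding moves when it is shuffled.

    rows:  list of equal-length lists of floats (one list per sample)
    embed: callable mapping such rows to a list of low-dimensional points
    """
    rng = random.Random(seed)
    canonical = embed(rows)                  # step 1: canonical projection
    scores = []
    for col in range(len(rows[0])):          # step 2: pick a column
        values = [row[col] for row in rows]
        rng.shuffle(values)                  # step 3: shuffle that column
        shuffled = [row[:col] + [v] + row[col + 1:]
                    for row, v in zip(rows, values)]
        # steps 4-5: re-embed the shuffled data and measure the change
        scores.append(distance(canonical, embed(shuffled)))
    return scores                            # step 6: `rows` is never mutated


def toy_embed(rows):
    """Hypothetical deterministic embedder: keeps columns 0 and 1, ignores 2."""
    return [(row[0], row[1]) for row in rows]


data = [[float(i), float(i % 3), float(i * 7 % 5)] for i in range(20)]
scores = permutation_sensitivity(data, toy_embed)
print(scores)  # the column ignored by toy_embed scores exactly 0.0
```

To try this with UMAP itself, `embed` would wrap a fitted model's `fit_transform` call. In that case the raw displacement also picks up run-to-run differences between fits, which is exactly the comparability problem raised above, so some form of alignment (or a distance that ignores rotation and reflection) would still be needed.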

jc-healy · 2021-01-12T17:51:14Z

I definitely agree with David's idea of sensitivity instead of importance. I see that you are suggesting using the AlignedUMAP in 0.5.0 to reduce the embedding noise due to the stochastic nature of the embedding (i.e. if you run the algorithm a few times with no feature permutation, you'll get a different embedding each time). Reducing that stochasticity is definitely an important step.

An alternative method for eliminating the effects of this stochasticity might be to measure the disruption of the UMAP complex itself. You could play the same game as above, but instead of measuring the impact on the embedding, simply measure the difference between the graph representations of your data (found in my_model.graph_). Perhaps you could use cross entropy (as we do in UMAP) to measure the difference between the 'canonical' graph and the one induced with the permuted column. This has the added benefit of eliminating any effects that might be caused by selecting an embedding dimension that is too low to easily represent your data.

If you wanted to go a step further, you could make the whole thing far more efficient by popping the hood on the code and pulling out the fuzzy_simplicial_set() function to build the UMAP complex (graph) without ever going through the computationally heavy step of actually embedding the data.
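
As a rough sketch of the graph comparison, the fuzzy-set cross entropy between two weighted graphs could be computed as below. The function name `fuzzy_cross_entropy` and the plain-dict graph representation are assumptions made for illustration; the edge weights would first have to be extracted from the sparse matrices stored in `graph_`, and the clipping constant `eps` is an arbitrary choice to keep the logarithms finite.

```python
import math


def fuzzy_cross_entropy(graph_p, graph_q, eps=1e-6):
    """Fuzzy-set cross entropy of graph_p relative to graph_q.

    Each graph is a dict mapping an edge (i, j) to a membership strength
    in [0, 1]; an edge missing from a graph counts as strength 0.
    Pairs absent from both graphs contribute nothing and are skipped.
    """
    def clip(w):
        # Keep weights strictly inside (0, 1) so every log is defined.
        return min(max(w, eps), 1.0 - eps)

    total = 0.0
    for edge in set(graph_p) | set(graph_q):
        p = clip(graph_p.get(edge, 0.0))
        q = clip(graph_q.get(edge, 0.0))
        # Attractive term (edge present) plus repulsive term (edge absent).
        total += p * math.log(p / q)
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


# Hypothetical edge weights for a canonical graph and a permuted-column graph.
canonical = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.1}
permuted = {(0, 1): 0.2, (1, 2): 0.8, (2, 3): 0.7}

same = fuzzy_cross_entropy(canonical, canonical)
changed = fuzzy_cross_entropy(canonical, permuted)
print(same, changed)
```

Because each edge contributes a non-negative term that is zero only when the two weights agree, a column whose permutation leaves the graph untouched scores zero, and larger scores indicate more disruption of the neighbourhood structure.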

bschilder · 2021-11-09T21:00:44Z

Would love to see something like this implemented in UMAP! For gene expression matrices in scRNA-seq data, it could be extremely useful for identifying which genes most strongly influence the latent representation.
