Feature importance of UMAP output #505

Open
dangkunal opened this issue Oct 12, 2020 · 3 comments

Comments

@dangkunal

Hi,

I am learning about visualizing multidimensional data and found UMAP and t-SNE. Is there any way to also get the feature importance of the output?

By feature importance I mean which variables contribute most to the UMAP output. I know my question might be ill-posed, but I am curious and still learning.

Thanks,
Kunal

@dmarx

dmarx commented Jan 5, 2021

I think a way that we could rethink this would be as "sensitivity" rather than "importance". In other words, if we were to define "feature importance" as an answer to a question, that question might be: "How sensitive is the UMAP projection to fluctuations in the respective dimensions of the data space?" Here's one way we might answer this (which would be pretty heavy to compute, but could be interesting if you need it):

  1. Fit a UMAP embedding to your full data. We'll call this the "canonical" projection
  2. Pick a column i to calculate feature importance for
  3. Randomly shuffle the values in this column. Call the dataset with column i shuffled D_i
  4. Fit a new UMAP embedding to D_i. For stability, we probably want to use some variation of the AlignedUMAP feature coming in 0.5.0
  5. Calculate some summary statistic to quantify the distance between these two embedding spaces. I'm thinking maybe earth mover's distance?
  6. Reset the column to its unshuffled state. Rinse and repeat for all columns.

The distance calculated in step 5 then gives us an approximate measure for how sensitive the topology of the canonical embedding space is to changes in that particular dimension, which I posit is roughly what you're looking for in a "feature importance" measure here.

One potential problem I'm foreseeing here is the application of the AlignedUMAP. On the one hand, we sort of have to use it to make sure we can compare the projections (I think?). On the other hand, the parameters of the alignment estimator will probably impact the distance score. The relative distance scores should still be meaningful though, I'd think.
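The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method proposed in the comment: instead of AlignedUMAP plus earth mover's distance, it compares the pairwise distances within each embedding, which are unaffected by rotation or translation, so no alignment step is needed. The helper names (`pairwise_distances`, `embedding_shift`, `permutation_sensitivity`, `umap_embed`) are hypothetical, and the embedding function is passed in so that any projection can be substituted.

```python
import math
import random


def pairwise_distances(points):
    """Return the upper-triangle Euclidean distances between rows as a flat list."""
    n = len(points)
    return [math.dist(points[a], points[b]) for a in range(n) for b in range(a + 1, n)]


def embedding_shift(emb_a, emb_b):
    """Mean absolute change in pairwise distances between two embeddings.

    Assumption: this is a stand-in summary statistic, not earth mover's
    distance. It ignores rotation and translation of the embedding.
    """
    da, db = pairwise_distances(emb_a), pairwise_distances(emb_b)
    return sum(abs(x - y) for x, y in zip(da, db)) / len(da)


def permutation_sensitivity(rows, embed_fn, seed=0):
    """Score each column by how much shuffling it changes the embedding."""
    rng = random.Random(seed)
    canonical = embed_fn(rows)                  # step 1: canonical projection
    scores = []
    for i in range(len(rows[0])):               # steps 2 and 6: visit every column
        column = [row[i] for row in rows]
        rng.shuffle(column)                     # step 3: shuffle column i only
        shuffled = [list(row[:i]) + [v] + list(row[i + 1:])
                    for row, v in zip(rows, column)]
        # steps 4 and 5: refit on D_i and compare with the canonical embedding
        scores.append(embedding_shift(canonical, embed_fn(shuffled)))
    return scores


def umap_embed(rows):
    """Embed rows with umap-learn (imported lazily; needs umap-learn and numpy)."""
    import numpy as np
    import umap
    # A fixed random_state reduces, but does not remove, run-to-run variation.
    return umap.UMAP(random_state=42).fit_transform(np.asarray(rows)).tolist()
```

With umap-learn installed, `scores = permutation_sensitivity(X.tolist(), umap_embed)` would give one score per column, where a larger score suggests the projection is more sensitive to that column. Two caveats: the scores include UMAP's own stochastic variation, so refitting on the unshuffled data gives a useful noise baseline, and the pure-Python pairwise comparison is quadratic in the number of rows, so subsampling is likely needed for larger datasets.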

@jc-healy
Contributor

jc-healy commented Jan 12, 2021 via email

@bschilder

Would love to see something like this implemented in UMAP! For gene expression matrices in scRNA-seq data, it could be extremely useful for identifying which genes most strongly influence the latent representation.
