Feature importance of UMAP output #505
Comments
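I think a way that we could rethink this would be as "sensitivity" rather than "importance". In other words, if we were to define "feature importance" as an answer to a question, that question might be: "How sensitive is the UMAP projection to fluctuations in the respective dimensions of the data space?" Here's one way we might answer this (which would be pretty heavy to compute, but could be interesting if you need it):
1. Fit a UMAP embedding to your full data. We'll call this the "canonical" projection
2. Pick a column to calculate feature importance on
3. Randomly shuffle the values in this column. Call the dataset with column i shuffled D_i
4. Fit a new UMAP embedding to D_i. For stability, we probably want to use some variation of the AlignedUMAP feature coming in 0.5.0
5. Calculate some summary statistic to quantify distance between these two embedding spaces. I'm thinking maybe earth mover's distance?
6. Reset the column to its unshuffled state. Rinse and repeat for all columns.
The distance calculated in step 5 then gives us an approximate measure for how sensitive the topology of the canonical embedding space is to changes in that particular dimension, which I posit is roughly what you're looking for in a "feature importance" measure here. One potential problem I'm foreseeing here is the application of AlignedUMAP. On the one hand, we sort of have to use it to make sure we can compare the projections (I think?). On the other hand, the parameters of the alignment estimator will probably impact the distance score. The relative distance scores should still be meaningful though, I'd think.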
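For anyone who wants to try the shuffle-one-column idea, here is a rough sketch of what it might look like. It is not an official umap-learn feature, and it makes some simplifying assumptions: it does not use AlignedUMAP, and instead compares the distributions of pairwise distances in the two embeddings (which do not depend on rotation or translation) with a 1-D earth mover's distance. The names `permutation_sensitivity`, `distance_profile_gap` and `umap_embed` are made up for illustration.

```python
import numpy as np


def pairwise_distances(Y):
    """Euclidean distances between all distinct pairs of rows of Y."""
    diff = Y[:, None, :] - Y[None, :, :]          # O(n^2) memory: subsample large data
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist[np.triu_indices(len(Y), k=1)]


def distance_profile_gap(Y_a, Y_b):
    """1-D earth mover's distance between two pairwise-distance distributions.

    Assumption: both embeddings hold the same number of points, so the
    distance reduces to the mean absolute difference of the sorted values.
    """
    a = np.sort(pairwise_distances(Y_a))
    b = np.sort(pairwise_distances(Y_b))
    return float(np.mean(np.abs(a - b)))


def umap_embed(X, seed=0):
    """Embed X with umap-learn (imported lazily so the rest runs without it)."""
    import umap
    return umap.UMAP(random_state=seed).fit_transform(X)


def permutation_sensitivity(X, embed=umap_embed, gap=distance_profile_gap, seed=0):
    """Return one sensitivity score per column of X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    canonical = embed(X)                          # fit the "canonical" projection
    scores = np.empty(X.shape[1])
    for i in range(X.shape[1]):                   # one column at a time
        X_i = X.copy()                            # the original stays unshuffled
        X_i[:, i] = rng.permutation(X_i[:, i])    # shuffle column i -> D_i
        scores[i] = gap(canonical, embed(X_i))    # refit and compare
    return scores
```

Calling `permutation_sensitivity(X)` refits UMAP once per column, so it is slow on wide data. Because UMAP is stochastic, it is probably worth running an unshuffled refit with a different seed as well, to get a noise floor to compare the scores against.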
I definitely agree with David's idea of sensitivity instead of importance.
I see that you are suggesting using the AlignedUMAP in 0.5.0 to reduce the
embedding noise due to the stochastic nature of the embedding (i.e. if you
run the algorithm a few times with no feature permutation you'll get a
different embedding). Reducing that stochasticity is definitely an
important step. An alternate method for eliminating the effects of this
stochasticity might be to measure the disruption of the UMAP complex
itself. You could do the above game but instead of measuring the impact on
the embedding, simply measure the difference between the representations of
your data (found in my_model.graph_). Perhaps you could use cross entropy
(as we do in UMAP) to measure the difference between the 'canonical' graph
and the one induced with the permuted column. This has the added benefit
of eliminating any effects that might be caused by selecting an embedding
dimension that is too low to easily represent your data.
If you wanted to go a step further you could make the whole thing far more
efficient by popping the hood on the code and grabbing out the
fuzzy_simplicial_set() function to build the UMAP complex (graph) without
going through the computationally heavy steps of actually ever embedding
the data.
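
A rough sketch of the graph-based variant, in case it helps. Assumptions: the graphs are small enough to convert to dense arrays, the cross entropy is the fuzzy-set form from the UMAP paper summed over all entries, and `fuzzy_simplicial_set` is called with its positional arguments `(X, n_neighbors, random_state, metric)`. Its return value differs between umap-learn versions, so the helper only relies on the graph being the first item. The helper names `fuzzy_cross_entropy` and `umap_graph` are made up for illustration.

```python
import numpy as np


def fuzzy_cross_entropy(P, Q, eps=1e-12):
    """Fuzzy-set cross entropy of membership strengths P relative to Q.

    P and Q are dense arrays of equal shape with entries in [0, 1].
    Values are clipped away from 0 and 1 so the logarithms stay finite.
    """
    P = np.clip(np.asarray(P, dtype=float), eps, 1.0 - eps)
    Q = np.clip(np.asarray(Q, dtype=float), eps, 1.0 - eps)
    present = P * np.log(P / Q)                              # edges in P
    absent = (1.0 - P) * np.log((1.0 - P) / (1.0 - Q))       # non-edges in P
    return float(np.sum(present + absent))


def umap_graph(X, n_neighbors=15, metric="euclidean", seed=0):
    """Build the UMAP fuzzy graph without embedding (lazy umap-learn import)."""
    from umap.umap_ import fuzzy_simplicial_set
    result = fuzzy_simplicial_set(
        X, n_neighbors, np.random.RandomState(seed), metric
    )
    # Newer versions return a tuple with the graph first; older ones the graph only.
    graph = result[0] if isinstance(result, tuple) else result
    return graph.toarray()      # dense copy: only sensible for small datasets
```

A per-column score would then be `fuzzy_cross_entropy(umap_graph(X), umap_graph(X_i))`, where `X_i` is a copy of `X` with column i shuffled. For larger datasets the same sum would have to be computed on the sparse matrices directly.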
…On Tue, Jan 5, 2021 at 2:40 PM David Marx ***@***.***> wrote:
I think a way that we could rethink this would be as "sensitivity" rather
than "importance". In other words, if we were to define "feature
importance" as an answer to a question, that question might be: "How
sensitive is the UMAP projection to fluctuations in the respective
dimensions of the data space?" Here's one way we might answer this (which
would be pretty heavy to compute, but could be interesting if you need it):
1. Fit a UMAP embedding to your full data. We'll call this the
"canonical" projection
2. Pick a column to calculate feature importance on
3. Randomly shuffle the values in this column. Call the dataset with
column i shuffled D_i
4. Fit a new UMAP embedding to D_i. For stability, we probably want to
use some variation of the AlignedUMAP feature coming in 0.5.0
5. Calculate some summary statistic to quantify distance between these
two embedding spaces. I'm thinking maybe earth mover's distance?
6. Reset the column to its unshuffled state. Rinse and repeat for all
columns.
The distance calculated in step 5 then gives us an approximate measure for
how sensitive the topology of the canonical embedding space is to changes
in that particular dimension, which I posit is roughly what you're looking
for in a "feature importance" measure here.
One potential problem I'm foreseeing here is the application of the
AlignedUMAP. On the one hand, we sort of have to use it to make sure we can
compare the projections (I think?). On the other hand, the parameters of
the alignment estimator will probably impact the distance score. The
relative distance scores should still be meaningful though, I'd think.
Would love to see something like this implemented in UMAP! For gene expression matrices in scRNA-seq data, this could be extremely useful for identifying which genes most strongly influence the latent representation.
Hi,
I am learning about visualizing multidimensional data, so I found UMAP and t-SNE. By any chance, can we also get the feature importance of the output?
By feature importance I mean which variables contribute most to the UMAP output. I know my question might be ill-posed, but I was curious and am still learning.
Thanks,
Kunal