Attention Map Generation #18
It is not very convenient, but it should be possible this way:
It could have been more elegant with forward hooks, but PyTorch does not currently allow modifying the keyword arguments of forward_pre_hooks. Another caveat is that the multihead attention implementation only returns the attention weights summed over all heads, so you might want to delete that reduction if you need per-head maps.
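For reference, a minimal sketch of the workaround described above. The `patch_attention` helper is hypothetical; it assumes the standard CLIP ViT, whose residual blocks call `nn.MultiheadAttention` with `need_weights` passed as a keyword argument:

```python
import inspect

import torch.nn as nn


def patch_attention(attn: nn.MultiheadAttention):
    """Force an attention module to return (and stash) its attention weights.

    Hypothetical helper: we wrap forward() directly because forward_pre_hooks
    cannot modify keyword arguments such as need_weights.
    """
    original_forward = attn.forward
    sig = inspect.signature(nn.MultiheadAttention.forward)

    def forward_with_weights(*args, **kwargs):
        kwargs["need_weights"] = True
        # Newer PyTorch versions reduce the weights over heads by default;
        # ask for per-head maps when the argument is available.
        if "average_attn_weights" in sig.parameters:
            kwargs["average_attn_weights"] = False
        output, weights = original_forward(*args, **kwargs)
        attn.stored_attention = weights  # stash for later retrieval
        return output, weights

    attn.forward = forward_with_weights


# Usage sketch (assuming the openai/CLIP module layout):
#   model, preprocess = clip.load("ViT-B/32")
#   for block in model.visual.transformer.resblocks:
#       patch_attention(block.attn)
#   _ = model.encode_image(image)
#   maps = [b.attn.stored_attention for b in model.visual.transformer.resblocks]
```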
Hi @betterze and @Hugh0120, I implemented the attention map based on the suggestions above. Here is my implementation, which may be helpful to you.
@haofanwang Thank you for sharing this nice implementation with us.
Does anyone have any insight on which layers are best to use as the saliency layer for Grad-CAM (for the visual transformer, ViT, models)? For ResNet, people use the ReLU from the last visual layer, and I'm wondering what the best choice is for the visual transformer. EDIT: realizing that Grad-CAM is not meant for transformers but mainly for CNN/ResNet-based models. Feel free to ignore the above question.
@g-luo I don't think Grad-CAM on the activations is a good idea; the patch setting makes it too coarse to be useful. In general, when dealing with a ViT, people usually use the output of the last norm layer of the last transformer block. So far, though, I have had mixed results with CLIP.
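To make that concrete, here is a rough sketch of a Grad-CAM-style map computed from the output of the last block's final norm layer. The attribute names (`blocks`, `norm2`) and the classifier head are assumptions about a generic timm-style ViT, not CLIP's API, and as noted above the map is only as fine as the patch grid:

```python
import torch


def vit_gradcam(vit, image, target_index):
    feats, grads = {}, {}

    layer = vit.blocks[-1].norm2  # last norm layer of the last block (assumed names)
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(x=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(x=go[0]))

    logits = vit(image)                  # assumes a classifier head: (1, num_classes)
    logits[0, target_index].backward()
    h1.remove(); h2.remove()

    tokens, dz = feats["x"][0, 1:], grads["x"][0, 1:]  # drop the CLS token
    weights = dz.mean(dim=0)                           # per-channel importance
    cam = torch.relu((tokens * weights).sum(dim=-1))   # one score per patch
    side = int(cam.numel() ** 0.5)                     # e.g. 14 for a 14x14 grid
    return cam.reshape(side, side)                     # coarse, patch-level map
```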
Does anyone have an implementation on the visual side using CLIP with a ViT?
Since this is the top result from Google, it may be easier for me to find my fork here than by using the search box on GitHub 😆
Hello @ricardodeazambuja, thank you for your recommendation. However, when I run this file, this error occurs. Do you know how to solve it?
@zhanjiahui, are you using the fork, or the original clip package?
@zhanjiahui, here everything works just fine (I had a problem related to a change in how Pillow and NumPy exchange data). I would bet that when you do `import clip`, you are still importing the original package instead of the fork.
Hi @ricardodeazambuja, I uninstalled the normal clip package and moved this ipynb file into the root dir of your project, but this error still happens. I double-checked that the encode_image function has exactly one return value in your repository. This is so confusing to me.
@zhanjiahui, keep following the flow of the code and you will find it here: try to pull the latest version of the repo and check out the
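As a generic diagnostic for this kind of shadowing problem (not specific to the fork), you can ask Python which clip module it is actually loading:

```python
import clip

# If this prints a path inside site-packages rather than the forked
# repo's directory, the pip-installed clip is shadowing the fork.
print(clip.__file__)
```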
Thanks for the release of the pretrained models!
I was wondering if it is possible to show attention maps for input images using the released ViT-B/32 model?
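Once per-layer attention weights are available (e.g. via the patching approach earlier in the thread), a minimal sketch for turning the CLS token's attention into an image-sized heatmap might look like this. The shape assumptions (one layer's weights of shape `(heads, tokens, tokens)` for a 224x224 input to ViT-B/32, i.e. a 7x7 patch grid plus one CLS token) are mine, not from this repo:

```python
import torch
import torch.nn.functional as F


def cls_attention_heatmap(attn: torch.Tensor, image_size: int = 224) -> torch.Tensor:
    """Upsample the CLS token's attention over patches to image resolution."""
    cls_row = attn.mean(dim=0)[0, 1:]      # CLS attention to patches, heads averaged
    side = int(cls_row.numel() ** 0.5)     # 7 for ViT-B/32 at 224px
    grid = cls_row.reshape(1, 1, side, side)
    heat = F.interpolate(grid, size=(image_size, image_size),
                         mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # scale to [0, 1]
    return heat[0, 0]                      # (image_size, image_size), overlayable
```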