
Why using random sampling during inference and not pick instead the X patches with maximum attention? #14

Open
Irra59 opened this issue Oct 22, 2020 · 1 comment

Comments

@Irra59

Irra59 commented Oct 22, 2020

Hi all,

I was reading the paper to understand the implementation, but there is something strange to me.

If I understand correctly, the goal of using sampling in the training phase is to give each patch an opportunity to have its attention score updated when it is sampled from the distribution. They also prove that it results in the minimum-variance estimator.

But for inference, why don't we just pick the N patches with the highest attention instead of repeating the same sampling process? How can sampling be more accurate than taking the highest-attention patches at inference time, now that the model has already been trained?

What confuses me even more is that the authors compare ATS-10 and ATS-50 for inference, but never say what sampling size they use during training.

TL;DR: Why sample during inference instead of taking the patches with the maximum attention values?

I also wonder about the manual selection of the patch size. Does it mean this algorithm will be inefficient for classification tasks where objects can occupy very different proportions of the image? Couldn't this work be adapted to an object detection task, similar to YOLO?

@Irra59 Irra59 changed the title Why using random sampling during inference and not pick the X patches with maximum attention? Why using random sampling during inference and not pick instead the X patches with maximum attention? Oct 22, 2020
@angeloskath
Collaborator

Hi,

It's been a while but better late than never.

It never occurred to me to actually pick the top-k locations during inference. It might indeed be beneficial in comparison to sampling. Something to keep in mind is that top-k selection is no longer an unbiased approximation of the attention-weighted feature, so it could introduce bias, especially if the attention distribution is very peaky.
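To illustrate the distinction, here is a minimal numpy sketch (not code from this repository, and it simplifies to sampling with replacement): averaging the features of patches drawn from the attention distribution is an unbiased Monte Carlo estimate of the attention-weighted feature, whereas averaging only the top-k patches is generally not.

```python
import numpy as np

def expected_feature(attention, features):
    """Exact attention-weighted feature: sum_i a_i * f_i."""
    return attention @ features

def sampled_feature(attention, features, n_samples, rng):
    """Unbiased estimate: draw patch indices from the attention
    distribution and average their features."""
    idx = rng.choice(len(attention), size=n_samples, p=attention)
    return features[idx].mean(axis=0)

def topk_feature(attention, features, k):
    """Deterministic alternative: keep the k highest-attention patches.
    Averaging their features over-weights those patches, so the estimate
    is biased, more so when the attention is very peaky."""
    idx = np.argsort(attention)[-k:]
    return features[idx].mean(axis=0)

rng = np.random.default_rng(0)
attention = rng.dirichlet(np.ones(100))   # toy attention over 100 patches
features = rng.normal(size=(100, 8))      # toy 8-d patch features

print("exact   :", expected_feature(attention, features)[:3])
print("sampled :", sampled_feature(attention, features, 10, rng)[:3])
print("top-k   :", topk_feature(attention, features, 10)[:3])
```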

Regarding ATS-10 vs ATS-50, we use the same number of patches both during training and inference.

Having the model predict the patch size would be an interesting extension to this paper. For instance, one could predict two Gaussian random variables for patch width and height and then use the same math to approximate the average feature by integrating over the patch-size values.
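Purely as a sketch of that idea (everything here is hypothetical and not part of the paper or the repository), the integral over patch sizes could itself be approximated by Monte Carlo: draw heights and widths from the predicted Gaussians, extract a patch for each draw, and average the resulting features.

```python
import numpy as np

def extract_patch(image, center, height, width):
    """Hypothetical helper: crop a (height, width) patch around `center`,
    clamped to the image bounds."""
    y, x = center
    h, w = int(round(height)), int(round(width))
    y0 = max(0, min(image.shape[0] - h, y - h // 2))
    x0 = max(0, min(image.shape[1] - w, x - w // 2))
    return image[y0:y0 + h, x0:x0 + w]

def average_feature_over_sizes(image, center, mu, sigma, feature_fn,
                               n_samples, rng):
    """Monte Carlo estimate of E[f(patch)] when patch height and width are
    drawn from independent Gaussians with means `mu` and stds `sigma`."""
    feats = []
    for _ in range(n_samples):
        h, w = rng.normal(mu, sigma)
        h, w = max(4.0, h), max(4.0, w)   # keep sampled sizes positive
        feats.append(feature_fn(extract_patch(image, center, h, w)))
    return np.mean(feats, axis=0)

# Toy usage with a feature function that tolerates variable patch shapes.
rng = np.random.default_rng(0)
image = rng.normal(size=(256, 256))
avg = average_feature_over_sizes(
    image, center=(128, 128), mu=np.array([32.0, 32.0]),
    sigma=np.array([4.0, 4.0]),
    feature_fn=lambda p: np.array([p.mean(), p.std()]),
    n_samples=16, rng=rng)
print(avg)
```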

Cheers,
Angelos
