Hi, I am a little bit confused about cyclic shift. Can you help me understand? #52
Comments
Taking the right-bottom window as an example: this window is composed of 4 sub-windows, and there should be no connections between these 4 sub-windows. The mask for all connections between these 4 sub-windows is set to -100.0, which makes these connections contribute nothing to the attention computation.
The operation mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2) compares the sub-window index of every pair of positions, so pairs coming from different sub-windows become non-zero. Concretely, the mask is first partitioned with window_partition, and then attn_mask is computed with mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2); only the right-bottom window is shown, for brevity.
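For reference, here is a minimal sketch of that mask construction (paraphrased from the repository's code; the feature-map size, window_size and shift_size below are illustrative values I picked, not the model's defaults):

```python
import torch

def window_partition(x, window_size):
    # (B, H, W, C) -> (num_windows * B, window_size, window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

H, W = 8, 8                     # illustrative feature-map size
window_size, shift_size = 4, 2  # illustrative values

# Give each of the 9 regions created by the shift a different index.
img_mask = torch.zeros((1, H, W, 1))
h_slices = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
w_slices = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
cnt = 0
for h in h_slices:
    for w in w_slices:
        img_mask[:, h, w, :] = cnt
        cnt += 1

# Partition the index map into windows and compare every pair of positions
# inside each window: a non-zero difference means the two positions belong
# to different sub-windows, so that pair gets -100.0.
mask_windows = window_partition(img_mask, window_size).view(-1, window_size * window_size)
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)  # (nW, N, N)
attn_mask = attn_mask.masked_fill(attn_mask != 0, -100.0).masked_fill(attn_mask == 0, 0.0)

print(attn_mask[-1])  # last window = the right-bottom one, which contains 4 sub-windows
```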
Hi, can you tell me what the right-bottom window refers to?
It's the result of window_partition, as mentioned above (the matrices with n and 0 entries).
@ancientmooner Hi, can you help explain the above question?
Thank you, you made me understand part of it.
When you want to introduce cross-window connections between consecutive self-attention layers, why do you use a masking mechanism to limit self-attention computation to within each sub-window? If you limit the range of attention, won't the cross-window connections be weaker? @eddie94 can you tell me your understanding? Thank you very much.
@meiguoofa
Thank you 😊, thanks for your explanation. The advantage of the Transformer is to build long-range dependency, so maybe this is a bit redundant, haha.
@meiguoofa Transformer has many other advantages: https://www.microsoft.com/en-us/research/lab/microsoft-research-asia/articles/five-reasons-to-embrace-transformer-in-computer-vision/
The masking is not redundant. Here is how I understand masked attention: due to the cyclic shift of the input feature, areas of different semantics are combined into one window. During the calculation of the (N, N) dot-product matrix, the mask confines the weights of area 4 to be computed only against area 4 positions; the other spatial positions (5, 7, 8) are driven to values near 0 by the softmax(-100.0) trick. In sum, for windows containing multiple different semantic areas, the mask confines window attention within each area independently. ^-^
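To make the softmax(-100.0) trick concrete, here is a tiny sketch with made-up logits for one query over 4 keys, where the last two keys belong to a different sub-window:

```python
import torch

logits = torch.tensor([0.3, 0.1, 0.5, 0.2])        # raw attention logits (toy values)
mask = torch.tensor([0.0, 0.0, -100.0, -100.0])     # 0 for same sub-window, -100 otherwise

weights = torch.softmax(logits + mask, dim=-1)
print(weights)  # roughly [0.55, 0.45, 0.00, 0.00]: masked positions get ~zero weight
```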
Can you explain how the cyclic shift changes the feature map, and which token positions are masked during the calculation of the attention? The figure in your paper is too abstract for me. In your code, you use torch.roll() to implement the cyclic shift, and then from Line 209 to Line 227 you calculate the mask. How does the mask help to compute the attention?
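In case it helps, here is a rough sketch of the cyclic shift with torch.roll() and its reverse; the shapes and shift_size are placeholder values, and the per-window attention itself is only indicated in a comment (the attn_mask built above is simply added to the attention logits before the softmax):

```python
import torch

B, H, W, C = 1, 8, 8, 32     # placeholder feature-map shape
shift_size = 2               # placeholder shift amount
x = torch.randn(B, H, W, C)

# Cyclic shift: the first shift_size rows/columns wrap around to the bottom/right,
# so windows in the shifted map mix tokens that were far apart before the shift.
shifted_x = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

# ... window_partition(shifted_x), then per-window attention where the mask is
# added to the logits, e.g. attn = softmax(q @ k.transpose(-2, -1) * scale + attn_mask),
# so pairs of positions from different sub-windows (-100.0 entries) get ~zero weight ...

# Reverse cyclic shift after attention restores the original spatial layout.
x_restored = torch.roll(shifted_x, shifts=(shift_size, shift_size), dims=(1, 2))
assert torch.allclose(x, x_restored)
```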