
Hi, I am a little confused about the cyclic shift. Can you help me understand? #52

Closed
meiguoofa opened this issue Apr 27, 2021 · 11 comments

Comments

@meiguoofa

meiguoofa commented Apr 27, 2021

Can you explain how the cyclic shift changes the feature map, and which token positions are masked during the attention computation? The figure in your paper is too abstract for me. In your code, you use torch.roll() to implement the cyclic shift, and then from Line 209 to Line 227 you compute the mask. How does the mask help compute the attention?
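
For context, the cyclic shift itself is a plain torch.roll along the spatial dimensions; here is a minimal toy sketch (sizes are illustrative, not the model's):

import torch

# Toy (B, H, W, C) feature map; values chosen only to make the shift visible.
x = torch.arange(16, dtype=torch.float32).view(1, 4, 4, 1)

# Cyclic shift: the top-left shift_size rows/columns wrap around to the bottom-right.
shift_size = 2
shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

# The reverse shift after attention restores the original layout.
restored = torch.roll(shifted, shifts=(shift_size, shift_size), dims=(1, 2))
assert torch.equal(x, restored)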

@ancientmooner
Contributor

Take the bottom-right window as an example. This window is composed of 4 sub-windows, and there should be no connections between these 4 sub-windows. The mask for all connections between the 4 sub-windows is set to -100.0, which makes those connections contribute nothing to the attention computation.
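
For readers following the code, here is a self-contained sketch of that mask construction, mirroring the logic in the lines cited above (window_partition is inlined, and H = W = 8, window_size = 4, shift_size = 2 match the paper's illustration):

import torch

H = W = 8
window_size, shift_size = 4, 2

# Label each region of the (already shifted) feature map with a distinct id.
img_mask = torch.zeros((1, H, W, 1))
slices = (slice(0, -window_size),
          slice(-window_size, -shift_size),
          slice(-shift_size, None))
cnt = 0
for h in slices:
    for w in slices:
        img_mask[:, h, w, :] = cnt
        cnt += 1

# window_partition, inlined: (1, H, W, 1) -> (num_windows, window_size, window_size, 1),
# then each window is flattened to a vector of 16 region ids.
mask_windows = img_mask.view(1, H // window_size, window_size,
                             W // window_size, window_size, 1)
mask_windows = mask_windows.permute(0, 1, 3, 2, 4, 5).contiguous()
mask_windows = mask_windows.view(-1, window_size * window_size)

# Pairwise id difference per window: nonzero exactly where two tokens come
# from different regions; those connections are pushed to -100.
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0))
attn_mask = attn_mask.masked_fill(attn_mask == 0, float(0.0))
print(attn_mask.shape)  # torch.Size([4, 16, 16])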

@eddie94

eddie94 commented May 13, 2021

But the operation mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2) can only mask along the row dimension, not the column dimension.
I'm doing some experiments with your code to understand the shifting process.
If I use exactly the same window and shift size as the paper, I get img_mask as

tensor([[0., 0., 0., 0., 1., 1., 2., 2.],
        [0., 0., 0., 0., 1., 1., 2., 2.],
        [0., 0., 0., 0., 1., 1., 2., 2.],
        [0., 0., 0., 0., 1., 1., 2., 2.],
        [3., 3., 3., 3., 4., 4., 5., 5.],
        [3., 3., 3., 3., 4., 4., 5., 5.],
        [6., 6., 6., 6., 7., 7., 8., 8.],
        [6., 6., 6., 6., 7., 7., 8., 8.]])

then the mask is partitioned with window_partition, and the result is

tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[1., 1., 2., 2.],
         [1., 1., 2., 2.],
         [1., 1., 2., 2.],
         [1., 1., 2., 2.]],

        [[3., 3., 3., 3.],
         [3., 3., 3., 3.],
         [6., 6., 6., 6.],
         [6., 6., 6., 6.]],

        [[4., 4., 5., 5.],
         [4., 4., 5., 5.],
         [7., 7., 8., 8.],
         [7., 7., 8., 8.]]])

Finally, if we compute attn_mask with mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2), we get

[[[ 0.,  0.,  0.,  0.],
  [ 0.,  0.,  0.,  0.],
  [ 3.,  3.,  3.,  3.],
  [ 3.,  3.,  3.,  3.]],

 [[ 0.,  0.,  0.,  0.],
  [ 0.,  0.,  0.,  0.],
  [ 3.,  3.,  3.,  3.],
  [ 3.,  3.,  3.,  3.]],

 [[-3., -3., -3., -3.],
  [-3., -3., -3., -3.],
  [ 0.,  0.,  0.,  0.],
  [ 0.,  0.,  0.,  0.]],

 [[-3., -3., -3., -3.],
  [-3., -3., -3., -3.],
  [ 0.,  0.,  0.,  0.],
  [ 0.,  0.,  0.,  0.]]]

This shows only the bottom-right window, for brevity.
As we can see, the mask can mask connections along the row dimension, between (4 4 5 5) and (7 7 8 8), but it seems it cannot mask between 4 and 5.
My understanding of this code was that it masks every pair with a different cnt from when img_mask was created, but it works differently.
Are the above results correct?
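
The numbers above can be reproduced with the following sketch (the reshape chain stands in for window_partition; note that, unlike the repository code, the subtraction here is applied to the unflattened (4, 4, 4) windows):

import torch

m = torch.tensor([[0., 0., 0., 0., 1., 1., 2., 2.],
                  [0., 0., 0., 0., 1., 1., 2., 2.],
                  [0., 0., 0., 0., 1., 1., 2., 2.],
                  [0., 0., 0., 0., 1., 1., 2., 2.],
                  [3., 3., 3., 3., 4., 4., 5., 5.],
                  [3., 3., 3., 3., 4., 4., 5., 5.],
                  [6., 6., 6., 6., 7., 7., 8., 8.],
                  [6., 6., 6., 6., 7., 7., 8., 8.]])

# Partition the 8x8 mask into four 4x4 windows.
wins = m.view(2, 4, 2, 4).permute(0, 2, 1, 3).reshape(-1, 4, 4)

# Subtracting the unflattened windows broadcasts over whole rows, which is
# why only row-wise region changes show up in the printout above.
diff = wins.unsqueeze(1) - wins.unsqueeze(2)
print(diff[3])  # the bottom-right window slice quoted above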

@meiguoofa
Author

(quoting @eddie94's comment above in full)

Hi, can you tell me what the bottom-right window refers to?

@eddie94

eddie94 commented May 14, 2021

(quoting @meiguoofa's question above: what does the bottom-right window refer to?)

It's window_partition's

[[4., 4., 5., 5.],
 [4., 4., 5., 5.],
 [7., 7., 8., 8.],
 [7., 7., 8., 8.]]

which corresponds to the bottom right of mask_windows.
As we can see, the mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2) operation will execute

[[4., 4., 5., 5.] - [4., 4., 5., 5.],
 [4., 4., 5., 5.] - [4., 4., 5., 5.],
 [7., 7., 8., 8.] - [4., 4., 5., 5.],
 [7., 7., 8., 8.] - [4., 4., 5., 5.]]

twice, since we have 2 rows of [4., 4., 5., 5.], and then with the 3rd and 4th rows it will execute

[[4., 4., 5., 5.] - [7., 7., 8., 8.],
 [4., 4., 5., 5.] - [7., 7., 8., 8.],
 [7., 7., 8., 8.] - [7., 7., 8., 8.],
 [7., 7., 8., 8.] - [7., 7., 8., 8.]]

twice,

resulting in

[[[ 0.,  0.,  0.,  0.],
  [ 0.,  0.,  0.,  0.],
  [ 3.,  3.,  3.,  3.],
  [ 3.,  3.,  3.,  3.]],

 [[ 0.,  0.,  0.,  0.],
  [ 0.,  0.,  0.,  0.],
  [ 3.,  3.,  3.,  3.],
  [ 3.,  3.,  3.,  3.]],

 [[-3., -3., -3., -3.],
  [-3., -3., -3., -3.],
  [ 0.,  0.,  0.,  0.],
  [ 0.,  0.,  0.,  0.]],

 [[-3., -3., -3., -3.],
  [-3., -3., -3., -3.],
  [ 0.,  0.,  0.,  0.],
  [ 0.,  0.,  0.,  0.]]]

as mentioned above.
However, by my understanding, the result should end up as something like

[[0 0 n n],
 [0 0 n n],
 [n n n n],
 [n n n n]]

[[n n 0 0],
 [n n 0 0],
 [n n n n],
 [n n n n]]

[[n n n n],
 [n n n n],
 [0 0 n n],
 [0 0 n n]]

[[n n n n],
 [n n n n],
 [n n 0 0],
 [n n 0 0]]

where n stands for any number.
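
One likely source of the discrepancy (a sketch, assuming the experiment above applied the subtraction to the unflattened windows): the repository flattens each window to a vector of window_size * window_size region ids before the unsqueeze subtraction, which yields a full (16, 16) pairwise mask that does separate region 4 from region 5:

import torch

# Bottom-right window from the walkthrough above, flattened to 16 region ids.
win = torch.tensor([[4., 4., 5., 5.],
                    [4., 4., 5., 5.],
                    [7., 7., 8., 8.],
                    [7., 7., 8., 8.]]).view(1, 16)

# Entry (i, j) is nonzero exactly when tokens i and j belong to different
# regions -- including the 4 <-> 5 pairs that seemed unmasked above.
diff = win.unsqueeze(1) - win.unsqueeze(2)  # shape (1, 16, 16)
print((diff != 0).int()[0])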

@meiguoofa
Author

@ancientmooner Hi, can you help explain the question above?

@meiguoofa
Author

(quoting @eddie94's explanation above in full)

Thank you, you helped me understand part of it.

@meiguoofa
Author

When you want to introduce cross-window connections between consecutive self-attention layers, why do you use a masking mechanism to limit the self-attention computation to within each sub-window? If you limit the range of attention, won't the cross-window connections be weaker? @eddie94, can you tell me your understanding? Thank you very much.

@eddie94

eddie94 commented May 16, 2021

@meiguoofa
In fact, I think the masking is redundant, given that attention is already computed both before and after shifting the window partition.
I regard this masked attention as restricting the information exchange between far-apart patches.
For example, without masking, attention would be computed between the four corner patches of the input image, which might cause some performance loss because they are unlikely to be related.
I guess they wanted to remove those kinds of unlikely relations.

@meiguoofa
Author

(quoting @eddie94's reply above)

Thank you 😊, and thanks for your explanation. The advantage of the transformer is building long-range dependency, so maybe this is a bit redundant, haha.


@fedral

fedral commented Mar 10, 2022

The masking is not redundant. Here is how I understand the masked attention:

[image: illustration of the shifted feature map, with numbered regions grouped into windows]

Due to the cyclic shift operation on the input feature, areas with different semantics are combined into one window. As shown for window 3, areas 4, 5, 7, and 8 are in the same window.

During the calculation of the (N, N) dot-product matrix, the mask confines the weights of area 4 so that they are computed only against area-4 positions; the other spatial positions (5, 7, 8) are driven to values near 0 through the softmax(-100.0) trick.

In sum, for windows containing multiple different semantic areas, the mask confines window attention within each area independently. ^-^
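
A quick numeric check of that softmax trick (illustrative logits only):

import torch

scores = torch.tensor([1.0, 1.0, 1.0 - 100.0])  # last connection masked with -100
print(torch.softmax(scores, dim=0))
# the first two weights are ~0.5 each; the masked one is vanishingly small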
