Hi, I am a little bit confused about cyclic shift. Can you help me understand? #52
Comments
Taking the right-bottom window as an example: this window is composed of 4 sub-windows, and there should be no connections between these 4 sub-windows. The mask for all connections between these 4 sub-windows is set to -100.0, which makes these connections contribute nothing to the attention computation.
The operation mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2) compares the sub-window index of every pair of positions, so pairs coming from different sub-windows become non-zero. Concretely, the mask is first partitioned with window_partition, and then attn_mask is computed with mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2); only the right-bottom window is shown, for brevity.
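For reference, here is a minimal sketch of that mask construction (paraphrased from the repository's code; the feature-map size, window_size and shift_size below are illustrative values I picked, not the model's defaults):

```python
import torch

def window_partition(x, window_size):
    # (B, H, W, C) -> (num_windows * B, window_size, window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

H, W = 8, 8                     # illustrative feature-map size
window_size, shift_size = 4, 2  # illustrative values

# Give each of the 9 regions created by the shift a different index.
img_mask = torch.zeros((1, H, W, 1))
h_slices = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
w_slices = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
cnt = 0
for h in h_slices:
    for w in w_slices:
        img_mask[:, h, w, :] = cnt
        cnt += 1

# Partition the index map into windows and compare every pair of positions
# inside each window: a non-zero difference means the two positions belong
# to different sub-windows, so that pair gets -100.0.
mask_windows = window_partition(img_mask, window_size).view(-1, window_size * window_size)
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)  # (nW, N, N)
attn_mask = attn_mask.masked_fill(attn_mask != 0, -100.0).masked_fill(attn_mask == 0, 0.0)

print(attn_mask[-1])  # last window = the right-bottom one, which contains 4 sub-windows
```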
Hi, can you tell me what the right-bottom window refers to?
It's the result of window_partition, as mentioned above (the matrices with n and 0 entries).
@ancientmooner Hi, can you help explain the above question?
Thank you, you made me understand part of it.
When you want to introduce cross-window connections between consecutive self-attention layers, why do you use a masking mechanism to limit self-attention computation to within each sub-window? If you limit the range of attention, won't the cross-window connections be weaker? @eddie94 can you tell me your understanding? Thank you very much.
@meiguoofa
Thank you 😊, thanks for your explanation. The advantage of the Transformer is to build long-range dependency, so maybe this is a bit redundant, haha.
@meiguoofa Transformer has many other advantages: https://www.microsoft.com/en-us/research/lab/microsoft-research-asia/articles/five-reasons-to-embrace-transformer-in-computer-vision/
The masking is not redundant. Here is how I understand masked attention: due to the cyclic shift of the input feature, areas of different semantics are combined into one window. During the calculation of the (N, N) dot-product matrix, the mask confines the weights of area 4 to be computed only against area 4 positions; the other spatial positions (5, 7, 8) are driven to values near 0 by the softmax(-100.0) trick. In sum, for windows containing multiple different semantic areas, the mask confines window attention within each area independently. ^-^
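To make the softmax(-100.0) trick concrete, here is a tiny sketch with made-up logits for one query over 4 keys, where the last two keys belong to a different sub-window:

```python
import torch

logits = torch.tensor([0.3, 0.1, 0.5, 0.2])        # raw attention logits (toy values)
mask = torch.tensor([0.0, 0.0, -100.0, -100.0])     # 0 for same sub-window, -100 otherwise

weights = torch.softmax(logits + mask, dim=-1)
print(weights)  # roughly [0.55, 0.45, 0.00, 0.00]: masked positions get ~zero weight
```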
Can you explain how the cyclic shift changes the feature map, and which token positions are masked during the calculation of the attention? The figure in your paper is too abstract for me. In your code, you use torch.roll() to implement the cyclic shift, and then from Line 209 to Line 227 you calculate the mask. How does the mask help to compute the attention?
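In case it helps, here is a rough sketch of the cyclic shift with torch.roll() and its reverse; the shapes and shift_size are placeholder values, and the per-window attention itself is only indicated in a comment (the attn_mask built above is simply added to the attention logits before the softmax):

```python
import torch

B, H, W, C = 1, 8, 8, 32     # placeholder feature-map shape
shift_size = 2               # placeholder shift amount
x = torch.randn(B, H, W, C)

# Cyclic shift: the first shift_size rows/columns wrap around to the bottom/right,
# so windows in the shifted map mix tokens that were far apart before the shift.
shifted_x = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

# ... window_partition(shifted_x), then per-window attention where the mask is
# added to the logits, e.g. attn = softmax(q @ k.transpose(-2, -1) * scale + attn_mask),
# so pairs of positions from different sub-windows (-100.0 entries) get ~zero weight ...

# Reverse cyclic shift after attention restores the original spatial layout.
x_restored = torch.roll(shifted_x, shifts=(shift_size, shift_size), dims=(1, 2))
assert torch.allclose(x, x_restored)
```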