
Problem with reproducing "strided" attention scheme from the paper #7

Open
krishnadubba opened this issue May 21, 2019 · 3 comments

@krishnadubba

Hi,
I am trying to visualize the attention schemes using this code, basically trying to reproduce Fig. 3 from the paper. I could reproduce the "fixed" attention scheme, as shown below:

fixed_sparse_attn

The problem is that I could not reproduce the "strided" scheme (Fig. 3b from the paper). No matter what parameters I try, all I get is the following:

strided_wrong

If I change some code, I can get the correct "strided" version as shown in the paper. The following is the result after those changes:

strided_correct

Has anyone faced the same issue?

@pengfeiZhao1993

After reading attention.py, I find that this code base only contains separate implementations of the two halves of strided attention, referred to internally as the "first / second step of strided attention". So you probably need to implement an integrated version of strided attention yourself, with each head of a two-head sparse self-attention corresponding to one of those two steps.
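
A minimal sketch of that idea (my own illustration, not code from attention.py; `strided_step_masks` is a hypothetical helper):

```python
import tensorflow as tf

def strided_step_masks(n_tokens, stride=4):
    """Build the two 'steps' of strided attention as separate boolean masks,
    one per head, so a two-head sparse self-attention covers the full pattern."""
    q = tf.range(n_tokens)[:, None]   # query positions, shape [n, 1]
    k = tf.range(n_tokens)[None, :]   # key positions,   shape [1, n]
    causal = q >= k                   # standard causal constraint

    # Step 1: each query attends to the previous `stride` positions (local window).
    step1 = tf.logical_and(causal, (q - k) < stride)
    # Step 2: each query attends to positions that are a multiple of `stride` behind it.
    step2 = tf.logical_and(causal, tf.equal(tf.math.floormod(q - k, stride), 0))

    # Shape [2, n_tokens, n_tokens]: head 0 uses step 1, head 1 uses step 2.
    return tf.stack([step1, step2], axis=0)
```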

@benathi

benathi commented Dec 6, 2022

@krishnadubba Have you successfully implemented the strided version, by the way? Could you share the code change?

@jaindhairyahere

jaindhairyahere commented Jan 13, 2024

> @krishnadubba Have you successfully implemented the strided version, by the way? Could you share the code change?

I was able to reproduce both patterns using the function below:

```python
import tensorflow as tf

def sparse_attention_mask(n_tokens, stride_length=3, c=2):
    # Query / key position grids, shape [n_tokens, n_tokens].
    x = tf.reshape(tf.range(n_tokens), [n_tokens, 1])
    y = tf.transpose(x)
    z = tf.zeros((n_tokens, n_tokens), dtype=tf.int32)
    Q = z + x  # row index    = query position
    K = z + y  # column index = key position

    # Standard causal constraint: a query may only attend to itself and earlier keys.
    causal_attention_mask = Q >= K

    # "Fixed" pattern: attend within the same block of stride_length tokens,
    # plus the last c positions of every block.
    fixed_mask_1 = tf.equal(Q // stride_length, K // stride_length)
    fixed_mask_2 = tf.math.floormod(K, stride_length) >= stride_length - c
    combined_mask_fixed = tf.logical_and(causal_attention_mask,
                                         tf.logical_or(fixed_mask_1, fixed_mask_2))

    # "Strided" pattern: attend to the previous stride_length positions,
    # plus every stride_length-th position before that.
    stride_mask_1 = tf.less_equal(Q - K, stride_length)
    stride_mask_2 = tf.equal(tf.math.floormod(Q - K, stride_length), 0)
    combined_mask_stride = tf.logical_and(causal_attention_mask,
                                          tf.logical_or(stride_mask_1, stride_mask_2))

    return (tf.reshape(combined_mask_fixed, [1, 1, n_tokens, n_tokens]),
            tf.reshape(combined_mask_stride, [1, 1, n_tokens, n_tokens]))
```
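
A quick way to visualize the returned masks (parameter values here are arbitrary):

```python
import matplotlib.pyplot as plt

fixed_mask, strided_mask = sparse_attention_mask(n_tokens=24, stride_length=4, c=1)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(tf.cast(fixed_mask[0, 0], tf.int32).numpy(), cmap="gray")
axes[0].set_title("fixed")
axes[1].imshow(tf.cast(strided_mask[0, 0], tf.int32).numpy(), cmap="gray")
axes[1].set_title("strided")
plt.show()
```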
