
Fix TF LED/Longformer attentions computation #10007

Merged
merged 9 commits into from Feb 10, 2021

Conversation

@jplu (Contributor) commented Feb 4, 2021

What does this PR do?

This PR fixes the test test_saved_model_with_attentions_output for TF Longformer and LED that was failing due to an issue in computing some shapes in the attentions.

All the slow tests are now passing 🎉

attn_probs = tf.where(
tf.broadcast_to(is_index_masked[:, :, None, None], shape_list(attn_probs)),
Contributor

I don't really understand this change here. The correct shape is given by attn_probs, so I don't understand why we can't just use shape_list(attn_probs). IMO, something like:

attn_probs = tf.where(
    masked_index,
    tf.zeros(shape_list(masked_index), dtype=attn_probs.dtype),
    attn_probs,
)

should work, no?

Contributor

Also, I think it's better to make the dtype dependent on the dtype of attn_probs.

@jplu (Contributor Author), Feb 8, 2021

No need here: the default dtype of tf.zeros is always float (float16 if AMP is activated, float32 if not).

@patrickvonplaten (Contributor) left a comment

Thanks for making the test pass! Could you give some more background on why
a) tf.tile should be used instead of tf.broadcast_to &
b) why we cannot simply use the shape of attn_probs since we apply the mask on attn_probs itself? So we know that shape_list(masked_index) == shape_list(attn_probs)

@jplu (Contributor Author) commented Feb 8, 2021

> a) tf.tile should be used instead of tf.broadcast_to &

There are two reasons for this. The first is that broadcast_to does a reshape + tile, and here we don't need the reshape; a tile alone is enough. The second is that broadcast_to is not compliant with ONNXRuntime.
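
A minimal sketch of the difference, with toy shapes instead of the real Longformer tensors (all names and sizes here are illustrative only):

import tensorflow as tf

# Toy stand-ins for the real tensors (hypothetical shapes).
is_index_masked = tf.constant([[True, False, True]])  # [batch_size, seq_len]
attn_probs = tf.random.uniform((1, 3, 2, 5))          # [batch_size, seq_len, num_heads, window]

# Both variants start from the same [batch_size, seq_len, 1, 1] view of the mask.
expanded = is_index_masked[:, :, None, None]

# broadcast_to expands the size-1 axes to the (dynamic) shape of attn_probs;
# internally this amounts to a reshape + tile, and per the comment above it is
# the op that does not export cleanly to ONNXRuntime.
mask_broadcast = tf.broadcast_to(expanded, tf.shape(attn_probs))

# tf.tile repeats the existing size-1 axes by an explicit multiple per axis,
# reaching the same target shape without the implicit reshape.
mask_tile = tf.tile(expanded, multiples=(1, 1, 2, 5))

assert mask_broadcast.shape == mask_tile.shape == attn_probs.shape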

> b) why we cannot simply use the shape of attn_probs since we apply the mask on attn_probs itself? So we know that shape_list(masked_index) == shape_list(attn_probs)

This part is a bit tricky to explain. The issue is that attn_probs does not always have the same shape: if is_global_attn is True, the shape of attn_probs is [batch_size, seq_len, self.num_heads, self.one_sided_attn_window_size * 2 + max_num_global_attn_indices + 1], while if it is False the shape is [batch_size, seq_len, self.num_heads, self.one_sided_attn_window_size * 2 + 1]. Because the shape can differ from one execution to the next when running in graph mode, the shape pre-computed for attn_probs by TF tracing was [batch_size, seq_len, self.num_heads, variable], where variable cannot be determined. As a consequence, attn_probs never had a fully defined shape at the end, which created a conflict in the tf.where. To solve this, we also had to create a mask with a fixed shape that depends on is_global_attn.
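
A hedged sketch of that idea, with toy sizes and simplified names rather than the exact code from this PR (is_global_attn is treated as a plain Python bool here for simplicity): the last dimension of the mask is computed from is_global_attn instead of being read off attn_probs, the mask is tiled to that fully defined shape, and the masking is done with a tf.where against zeros of the same shape.

import tensorflow as tf

# Hypothetical toy sizes, not the real model configuration.
NUM_HEADS = 2
WINDOW = 3        # stands in for one_sided_attn_window_size
MAX_GLOBAL = 1    # stands in for max_num_global_attn_indices

def mask_attn_probs(attn_probs, is_index_masked, is_global_attn):
    # The last dimension is derived from is_global_attn, so it is a concrete
    # value at tracing time instead of an unknown taken from attn_probs.
    last_dim = WINDOW * 2 + (MAX_GLOBAL if is_global_attn else 0) + 1
    masked_index = tf.tile(
        is_index_masked[:, :, None, None],  # [batch_size, seq_len, 1, 1]
        (1, 1, NUM_HEADS, last_dim),        # -> [batch_size, seq_len, num_heads, last_dim]
    )
    return tf.where(
        masked_index,
        tf.zeros(tf.shape(masked_index), dtype=attn_probs.dtype),
        attn_probs,
    )

# Usage with random toy tensors:
probs = tf.random.uniform((1, 4, NUM_HEADS, WINDOW * 2 + MAX_GLOBAL + 1))
masked = tf.constant([[True, False, False, True]])  # [batch_size, seq_len]
print(mask_attn_probs(probs, masked, is_global_attn=True).shape)  # (1, 4, 2, 8)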

I don't know if it is clear enough or not. Don't hesitate to tell me if there is something you don't get.

@patrickvonplaten (Contributor) commented


Thanks for the explanation! Just tried it out, and it's cool to see that your change fixes the test!

@LysandreJik (Member) left a comment

If the slow tests pass, LGTM

@jplu (Contributor Author) commented Feb 10, 2021

The entire list of slow tests is ok!

@jplu (Contributor Author) commented Feb 10, 2021

@sgugger Feel free to merge if it looks ok for you!

@sgugger (Collaborator) left a comment

Thanks for fixing!

@sgugger sgugger merged commit 22a32cf into huggingface:master Feb 10, 2021
@jplu jplu deleted the fix-tf-led-long branch February 10, 2021 17:57