
[XLA:GPU] disable mask in cuDNN attention #11444

Closed
wants to merge 1 commit from Cjkkkk:remove_mask

Conversation

@Cjkkkk (Contributor) commented Apr 11, 2024

  1. The cuDNN attention mask does not mask with -inf; it multiplies instead, which is incorrect (see the sketch below). Hence, disable patterns with a mask.
  2. A follow-up PR will clean up the remaining mask-related logic.
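
For context (this is illustrative only and not part of the PR), a minimal NumPy sketch of the difference described in point 1; the logits and mask values are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.5])   # attention scores for one query row (made up)
mask = np.array([1.0, 1.0, 0.0])     # last key position should be masked out

# Multiplicative masking: the masked logit becomes 0, but exp(0) = 1,
# so the masked position still receives non-zero attention weight.
p_mul = softmax(logits * mask)                          # ~[0.67, 0.24, 0.09]

# Additive -inf masking: the masked logit becomes -inf, exp(-inf) = 0,
# so the masked position receives exactly zero attention weight.
p_add = softmax(np.where(mask > 0, logits, -np.inf))    # ~[0.73, 0.27, 0.00]
```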

@Cjkkkk Cjkkkk requested a review from beckerhe April 11, 2024 20:33
@github-actions github-actions bot added the kokoro:force-run label Apr 11, 2024
@kokoro-team kokoro-team removed the kokoro:force-run label Apr 11, 2024
@Cjkkkk Cjkkkk requested a review from akuegel April 11, 2024 21:09
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Apr 12, 2024
Imported from GitHub PR openxla/xla#11444

1. The cuDNN attention mask does not mask with -inf; it multiplies instead, which is incorrect. Hence, disable patterns with a mask.
2. A follow-up PR will clean up the remaining mask-related logic.
Copybara import of the project:

--
acf95b6cc7e1084026eaf87c0119ba3801ba8f8c by cjkkkk <ske@nvidia.com>:

disable mask

Merging this change closes #11444

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#11444 from Cjkkkk:remove_mask acf95b6cc7e1084026eaf87c0119ba3801ba8f8c
PiperOrigin-RevId: 624057479
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Apr 12, 2024
…ensor into ifrt array.

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#11444 from Cjkkkk:remove_mask acf95b6cc7e1084026eaf87c0119ba3801ba8f8c
PiperOrigin-RevId: 623707308
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Apr 12, 2024
Imported from GitHub PR openxla/xla#11444

1. The cuDNN attention mask does not mask with -inf; it multiplies instead, which is incorrect. Hence, disable patterns with a mask.
2. A follow-up PR will clean up the remaining mask-related logic.
Copybara import of the project:

--
acf95b6cc7e1084026eaf87c0119ba3801ba8f8c by cjkkkk <ske@nvidia.com>:

disable mask

Merging this change closes #11444

PiperOrigin-RevId: 624068883
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Apr 25, 2024
… in cuDNN

Imported from GitHub PR openxla/xla#11717

* Default to flash attention, as it is highly performant and actively maintained by cuDNN; remove the old fused attention.
  * Remove lowering to fused attention in the rewriter
  * Remove cuDNN graph generation
* Remove the mask input, as it only multiplies instead of masking with -inf and therefore gives incorrect results. cuDNN also no longer supports a separate mask; it should be combined with the bias. This is a follow-up to openxla/xla#11444.
  * Remove mask logic in the rewriter
  * Remove the mask buffer/descriptor in the thunk
* Remove the bmm1-bmm2 pattern, as it is not supported by flash attention. Modified the related rewriter tests to use bmm1-softmax-bmm2.

Current pattern:
bmm1 - (scale) - (bias) - softmax - (dropout) - bmm2.
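
As an illustration only (not code from this PR), the pattern above corresponds roughly to the following NumPy sketch; all shapes and variable names are invented, and the optional steps appear as comments:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical shapes: [batch, heads, seq_len, head_dim]
B, H, S, D = 2, 4, 8, 16
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((B, H, S, D)) for _ in range(3))
bias = rng.standard_normal((B, H, S, S))   # optional additive bias
scale = 1.0 / np.sqrt(D)

s = np.einsum("bhqd,bhkd->bhqk", q, k)     # bmm1
s = s * scale                              # (scale)
s = s + bias                               # (bias): a mask is folded in here as very negative entries
p = softmax(s)                             # softmax
# (dropout) would be applied to p here during training
out = np.einsum("bhqk,bhkd->bhqd", p, v)   # bmm2
```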
Copybara import of the project:

--
552b4a3387c6d5b2b5adcf31b6f44cc858387b23 by cjkkkk <ske@nvidia.com>:

remove fused attn

--
13b683bf923e6fe344f879f913ae6ce41334eeb2 by cjkkkk <ske@nvidia.com>:

remove mask and bmm1-bmm2 pattern

--
9e843dd66c8f7d51d239b433e1b9bc329afee90d by cjkkkk <ske@nvidia.com>:

rm unused vari

--
1104df540e9196b34d9e61e679d88260151728d5 by cjkkkk <ske@nvidia.com>:

remove fused attn cudnnv version check and update flash attn cudnn version check

--
b020cb8e91d8f7b834645ff815c38c4798174857 by cjkkkk <ske@nvidia.com>:

remove mask related cudnnfmhakind&descriptor&buffer

--
ff1952faa460eecfca62660a5c34ea6fa3c2dfd4 by cjkkkk <ske@nvidia.com>:

rename hlo_string to shorter name

Merging this change closes #11717

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#11717 from Cjkkkk:remove_fused_attn_and_mask ff1952faa460eecfca62660a5c34ea6fa3c2dfd4
PiperOrigin-RevId: 627269256
copybara-service bot pushed a commit that referenced this pull request Apr 25, 2024
… in cuDNN

Imported from GitHub PR #11717

* Default to flash attention, as it is highly performant and actively maintained by cuDNN; remove the old fused attention.
  * Remove lowering to fused attention in the rewriter
  * Remove cuDNN graph generation
* Remove the mask input, as it only multiplies instead of masking with -inf and therefore gives incorrect results. cuDNN also no longer supports a separate mask; it should be combined with the bias (see the sketch below). This is a follow-up to #11444.
  * Remove mask logic in the rewriter
  * Remove the mask buffer/descriptor in the thunk
* Remove the bmm1-bmm2 pattern, as it is not supported by flash attention. Modified the related rewriter tests to use bmm1-softmax-bmm2.

Current pattern:
bmm1 - (scale) - (bias) - softmax - (dropout) - bmm2.
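
For illustration (not from the PR), a minimal sketch of what "combined with the bias" can look like: a boolean mask is folded into the additive bias so that masked positions receive a very negative score before softmax. The causal-mask shape and names here are assumptions:

```python
import numpy as np

S = 6
bias = np.zeros((S, S))                          # any existing additive bias
keep = np.tril(np.ones((S, S), dtype=bool))      # hypothetical causal mask: True = attend, False = mask out

# Masked-out positions become a large negative additive term, so after
# softmax they receive (numerically) zero attention weight.
combined_bias = np.where(keep, bias, -1e9)
```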
Copybara import of the project:

--
552b4a3 by cjkkkk <ske@nvidia.com>:

remove fused attn

--
13b683b by cjkkkk <ske@nvidia.com>:

remove mask and bmm1-bmm2 pattern

--
9e843dd by cjkkkk <ske@nvidia.com>:

rm unused vari

--
1104df5 by cjkkkk <ske@nvidia.com>:

remove fused attn cudnnv version check and update flash attn cudnn version check

--
b020cb8 by cjkkkk <ske@nvidia.com>:

remove mask related cudnnfmhakind&descriptor&buffer

--
ff1952f by cjkkkk <ske@nvidia.com>:

rename hlo_string to shorter name

Merging this change closes #11717

COPYBARA_INTEGRATE_REVIEW=#11717 from Cjkkkk:remove_fused_attn_and_mask ff1952f
PiperOrigin-RevId: 628146618