

Cherry-pick LLaMA GQA mask to rel-1.16.2 (round 4) #18350

Merged
merged 2 commits into rel-1.16.2 from tlwu/rel-1.16.2-llama-cherry-pick-round4
Nov 8, 2023

Conversation

tianleiwu
Contributor

Description

Cherry-pick the LLaMA GQA attention mask and script changes to the 1.16.2 release branch.

Motivation and Context

aciddelgado and others added 2 commits November 8, 2023 17:40
### Description
GQA now works only with the Flash Attention kernel and takes an attention mask input, allowing for batched input. Note: this PR disables Memory Efficient Attention, so only the Flash Attention kernel can be used.



### Motivation and Context
Allows GQA to work with batched input.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
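
As a rough illustration of what a masked, grouped-query attention path computes, here is a minimal NumPy sketch. The function name, tensor shapes, and 1/0 mask convention are assumptions chosen for clarity; this is not the ONNX Runtime GQA kernel or its signature.

```python
# Illustrative grouped-query attention with a key-padding mask (NumPy sketch).
# Shapes and names are assumptions, not the ONNX Runtime contrib op interface.
import numpy as np

def group_query_attention(q, k, v, mask, num_kv_heads):
    """q: (batch, num_q_heads, seq, dim); k, v: (batch, num_kv_heads, seq, dim);
    mask: (batch, seq) with 1 for real tokens and 0 for padding."""
    batch, num_q_heads, seq, dim = q.shape
    group = num_q_heads // num_kv_heads          # query heads sharing one KV head
    k = np.repeat(k, group, axis=1)              # expand KV heads to match Q heads
    v = np.repeat(v, group, axis=1)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(dim)
    scores = np.where(mask[:, None, None, :] == 1, scores, -1e9)  # hide padded keys
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v                           # (batch, num_q_heads, seq, dim)
```
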
This PR updates the logic for replacing MHA with GQA and updates the LLaMA scripts for
the modified GQA op. It is related to the changes in [this
PR](#18283).

### Motivation and Context
This PR allows us to run LLaMA with the GQA op end-to-end using ragged
batching (i.e. batched inputs of different lengths).
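
For context on ragged batching, the sketch below shows one common way to right-pad prompts of different lengths and build the 1/0 attention mask that a masked-attention path can consume. The helper name, padding convention, and pad id are assumptions for illustration, not the actual LLaMA script API.

```python
# Illustrative preparation of a ragged batch (sequences of different lengths).
# Names and conventions are assumptions, not the onnxruntime LLaMA scripts.
import numpy as np

def pad_ragged_batch(token_id_lists, pad_id=0):
    """Right-pad sequences to the longest length and build an attention mask
    (1 = real token, 0 = padding)."""
    max_len = max(len(ids) for ids in token_id_lists)
    batch = len(token_id_lists)
    input_ids = np.full((batch, max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((batch, max_len), dtype=np.int64)
    for i, ids in enumerate(token_id_lists):
        input_ids[i, : len(ids)] = ids
        attention_mask[i, : len(ids)] = 1
    return input_ids, attention_mask

# Example: three prompts of lengths 3, 5, and 2 in one batch
ids, mask = pad_ragged_batch([[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]])
```
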
@tianleiwu tianleiwu merged commit 0c5b95f into rel-1.16.2 Nov 8, 2023
96 of 99 checks passed
@tianleiwu tianleiwu deleted the tlwu/rel-1.16.2-llama-cherry-pick-round4 branch November 8, 2023 21:53