

Cherry-pick LLaMA GQA mask to rel-1.16.2 (round 4) #18350

Merged
merged 2 commits into rel-1.16.2 from tlwu/rel-1.16.2-llama-cherry-pick-round4
Nov 8, 2023

Conversation

tianleiwu
Contributor

Description

Cherry-pick the LLaMA GQA attention mask and script changes to the 1.16.2 release branch.

Motivation and Context

aciddelgado and others added 2 commits November 8, 2023 17:40
### Description
GQA now works only with the Flash Attention kernel and takes an attention mask input, allowing for batched input. Note: this PR disables Memory Efficient Attention, so only the Flash Attention kernel can be used.



### Motivation and Context
Allows GQA to work with batched input.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
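
As a rough illustration of what a masked, grouped-query attention path computes, here is a minimal NumPy sketch. The function name, tensor shapes, and 1/0 mask convention are assumptions chosen for clarity; this is not the ONNX Runtime GQA kernel or its signature.

```python
# Illustrative grouped-query attention with a key-padding mask (NumPy sketch).
# Shapes and names are assumptions, not the ONNX Runtime contrib op interface.
import numpy as np

def group_query_attention(q, k, v, mask, num_kv_heads):
    """q: (batch, num_q_heads, seq, dim); k, v: (batch, num_kv_heads, seq, dim);
    mask: (batch, seq) with 1 for real tokens and 0 for padding."""
    batch, num_q_heads, seq, dim = q.shape
    group = num_q_heads // num_kv_heads          # query heads sharing one KV head
    k = np.repeat(k, group, axis=1)              # expand KV heads to match Q heads
    v = np.repeat(v, group, axis=1)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(dim)
    scores = np.where(mask[:, None, None, :] == 1, scores, -1e9)  # hide padded keys
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v                           # (batch, num_q_heads, seq, dim)
```
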
This PR updates the logic for replacing MHA with GQA and updates the LLaMA scripts for
the modified GQA op. It is related to the changes in [this
PR](#18283).

### Motivation and Context
This PR allows us to run LLaMA with the GQA op end-to-end using ragged
batching (i.e. batched inputs of different lengths).
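
For context on ragged batching, the sketch below shows one common way to right-pad prompts of different lengths and build the 1/0 attention mask that a masked-attention path can consume. The helper name, padding convention, and pad id are assumptions for illustration, not the actual LLaMA script API.

```python
# Illustrative preparation of a ragged batch (sequences of different lengths).
# Names and conventions are assumptions, not the onnxruntime LLaMA scripts.
import numpy as np

def pad_ragged_batch(token_id_lists, pad_id=0):
    """Right-pad sequences to the longest length and build an attention mask
    (1 = real token, 0 = padding)."""
    max_len = max(len(ids) for ids in token_id_lists)
    batch = len(token_id_lists)
    input_ids = np.full((batch, max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((batch, max_len), dtype=np.int64)
    for i, ids in enumerate(token_id_lists):
        input_ids[i, : len(ids)] = ids
        attention_mask[i, : len(ids)] = 1
    return input_ids, attention_mask

# Example: three prompts of lengths 3, 5, and 2 in one batch
ids, mask = pad_ragged_batch([[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]])
```
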
@tianleiwu tianleiwu merged commit 0c5b95f into rel-1.16.2 Nov 8, 2023
96 of 99 checks passed
@tianleiwu tianleiwu deleted the tlwu/rel-1.16.2-llama-cherry-pick-round4 branch November 8, 2023 21:53