
adds triton flash attention2 kernel #4337

Merged 14 commits into master on Sep 21, 2023
Conversation

@stephen-youn (Contributor) commented Sep 14, 2023

This PR adds a flash attention 2 kernel, implemented in Triton 2.1, for inference.
Benchmarking on bert-base shows a further 4-13% latency reduction compared to the regular attention Triton kernel, with larger gains at longer sequence lengths.

[image: bert-base latency benchmark across sequence lengths]
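
For readers unfamiliar with the algorithm, below is a minimal pure-PyTorch sketch of the flash-attention-2 style online-softmax tiling that such a kernel implements. It is an illustrative reference only, not the PR's Triton code; the block size, shapes, and function name are arbitrary.

import torch

def flash_attention2_reference(q, k, v, block_size=64):
    """q, k, v: (seq_len, head_dim) for one head; returns (seq_len, head_dim)."""
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros_like(q)
    # Running row max and softmax normalizer for the online softmax; the
    # full (seq_len x seq_len) score matrix is never materialized.
    row_max = torch.full((seq_len, 1), float("-inf"))
    row_sum = torch.zeros(seq_len, 1)
    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale  # (seq_len, block)
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        # Rescale previously accumulated results to the new running max.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    # Flash attention 2 defers normalization to a single division at the end.
    return out / row_sum

# Sanity check against the naive attention it replaces.
q, k, v = (torch.randn(128, 64) for _ in range(3))
naive = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(flash_attention2_reference(q, k, v), naive, atol=1e-4)

On the usage side, the Triton inference path is selected through DeepSpeed's inference config (e.g. passing use_triton=True to deepspeed.init_inference on an fp16 kernel-injected model); exact flags may differ by release, so check the inference config docs.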

@lekurile (Contributor) left a comment

Left a comment about the TestModelTask unit test; other than that, LGTM!

Review thread on tests/unit/inference/test_inference.py (outdated, resolved)
@lekurile (Contributor) left a comment

LGTM!

@stephen-youn added this pull request to the merge queue on Sep 20, 2023
Merged via the queue into master with commit 0e0748c on Sep 21, 2023
16 checks passed
CurryRice233 pushed a commit to CurryRice233/DeepSpeed that referenced this pull request on Sep 28, 2023
* origin/master:
  Allow multiple inference engines in single script (microsoft#4384)
  adds triton flash attention2 kernel (microsoft#4337)
  Fix llama meta tensor loading in AutoTP and kernel injected inference (microsoft#3608)
  Fix min torch version (microsoft#4375)
  Fix multinode runner to properly append to PDSH_SSH_ARGS_APPEND (microsoft#4373)
  add the missing method (microsoft#4363)
  Openfold fix (microsoft#4368)
  deepspeed4science japanese blog (microsoft#4369)
  deepspeed4science chinese blog (microsoft#4366)
  Enable workflow dispatch on Torch 1.10 CI tests (microsoft#4361)
  Update conda env to have max pydantic version (microsoft#4362)
  add deepspeed4science blog link (microsoft#4364)
  added check to avoid undefined behavior when the input_id length is greater than max_tokens (microsoft#4349)
  Add the policy to run llama model from the official repo (microsoft#4313)
  fix deepspeed4science links (microsoft#4358)
  DeepSpeed4Science (microsoft#4357)
  Support InternLM (microsoft#4137)
  Pass base_dir to model files can be loaded for auto-tp/meta-tensor. (microsoft#4348)
Labels: none · Projects: none · Linked issues: none · 3 participants