ROCm: make CK FA2 default instead of Triton (#1924)
As per title.

Triton autotune overhead is prohibitive, as it needs to be done for each
different prompt length.
fxmarty committed May 20, 2024
1 parent f871f11 commit 293b812
Showing 2 changed files with 4 additions and 3 deletions.
Dockerfile_amd (3 additions, 2 deletions)
@@ -117,8 +117,9 @@ RUN cd pytorch && python tools/amd_build/build_amd.py && python setup.py install
# Set as recommended: https://github.com/ROCm/triton/wiki/A-script-to-set-program-execution-environment-in-ROCm
ENV HIP_FORCE_DEV_KERNARG=1

-# On MI300, performances for flash with Triton FA is very competitive (actually better than CK)
-ENV ROCM_USE_FLASH_ATTN_V2_TRITON=1
+# On MI250 and MI300, performance of flash attention with Triton is slightly better than with CK.
+# However, Triton requires tuning for each different prompt length, which is prohibitive.
+ENV ROCM_USE_FLASH_ATTN_V2_TRITON=0

FROM base AS kernel-builder
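
Note that this ENV line only changes the image's baked-in default; the Triton path can still be selected per container at launch, since `docker run -e` takes precedence over a Dockerfile `ENV`. A minimal sketch, assuming the standard docker CLI; the image name and model id are placeholders, not values from this commit:

```shell
# Override the image's new CK default and select the Triton FA2 path
# for a single container. "tgi-rocm-image" and the model id are
# illustrative placeholders.
docker run -e ROCM_USE_FLASH_ATTN_V2_TRITON="1" tgi-rocm-image \
    --model-id teknium/OpenHermes-2.5-Mistral-7B
```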
docs/source/installation_amd.md (1 addition, 1 deletion)
@@ -29,7 +29,7 @@ TunableOp is enabled by default, the warmup may take 1-2 minutes. In case you wo

Two implementations of Flash Attention are available for ROCm, the first is [ROCm/flash-attention](https://github.com/ROCm/flash-attention) based on a [Composable Kernel](https://github.com/ROCm/composable_kernel) (CK) implementation, and the second is a [Triton implementation](https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/flash_attn_triton.py).

-By default, as its performances have experimentally been better, Triton implementation is used. It can be disabled (using CK implementation instead) by passing `--env ROCM_USE_FLASH_ATTN_V2_TRITON="0"` when launching TGI's docker container.
+By default, the Composable Kernel implementation is used. The Triton implementation has slightly lower latency on MI250 and MI300, but it requires a warmup that can be prohibitive, as it needs to be redone for each new prompt length. If needed, the Triton implementation of Flash Attention can be enabled by passing `--env ROCM_USE_FLASH_ATTN_V2_TRITON="1"` when launching TGI's Docker container.

## Unsupported features

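
For reference, a full launch command with the Triton flag added, roughly in the shape used in TGI's AMD installation docs; the exact flags, image tag, volume path, and model id below are illustrative placeholders, not values from this commit:

```shell
# All values below are illustrative, not taken from this commit.
model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data  # shared volume so model weights are not re-downloaded

docker run --rm -it \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --shm-size 64g --net host \
    -v $volume:/data \
    --env ROCM_USE_FLASH_ATTN_V2_TRITON="1" \
    ghcr.io/huggingface/text-generation-inference:latest-rocm \
    --model-id $model
```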
