[BACKEND][AMDGPU] Use full-vectorized load instructions for load vectorization #3609

htyu · 2024-04-08T22:53:12Z

The current implementation of load vectorization emits several segmented, short vectorized loads instead of one full 128-bit load. Emitting multiple shorter loads makes full vectorization depend on the LLVM backend (especially the load/store vectorizer) to merge them back together. This is fragile: in some cases I saw the vector combine pass and the jump threading pass interfere with that merging, which resulted in non-ideal vectorization.

This is a backport of ROCm#445
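
To make the difference described above concrete, here is a minimal toy model in Python. It is not Triton or LLVM code. `LoadOp` and `plan_loads` are hypothetical names introduced only for this illustration, and the 32-bit segment width is an assumption, since the description does not state the width of the short loads. The sketch only counts how many load operations a lowering would emit for one contiguous 128-bit per-thread access.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LoadOp:
    """One emitted load: byte offset from the base pointer and width in bits."""
    offset_bytes: int
    width_bits: int


def plan_loads(total_bits: int, max_width_bits: int) -> List[LoadOp]:
    """Split a contiguous access of total_bits into loads of max_width_bits each.

    Hypothetical helper: it models only the number and placement of loads,
    not masking, alignment, or anything the real lowering does.
    """
    if total_bits <= 0 or max_width_bits <= 0:
        raise ValueError("widths must be positive")
    if total_bits % max_width_bits != 0:
        raise ValueError("total_bits must be a multiple of max_width_bits")
    count = total_bits // max_width_bits
    return [LoadOp(i * max_width_bits // 8, max_width_bits) for i in range(count)]


# Segmented lowering (assumed 32-bit pieces): the backend must merge these later.
segmented = plan_loads(128, 32)

# Full-vectorized lowering: a single 128-bit load, nothing left to merge.
full = plan_loads(128, 128)

print(len(segmented), "segmented loads vs", len(full), "full load")
```

In the segmented case, reaching a single wide load depends on a later pass recognizing that the pieces are adjacent. In the full-width case the wide load is already present in the emitted code, so there is nothing for later passes to reassemble.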

…lang#445)

* Stablize load vectorization
* fix test failures
* Shared one mask check when decomposing a load
* Revert "fix test failures"
  This reverts commit 75a461a.
* Emit vectorized loads
* Fix test failures due to using vectorized load

htyu requested a review from ptillet as a code owner April 8, 2024 22:53

htyu requested a review from zhanglx13 April 8, 2024 22:53

htyu force-pushed the hoy/vec branch from 84ad02c to 9199e92 Compare April 8, 2024 23:01

zhanglx13 approved these changes Apr 9, 2024

zhanglx13 requested a review from zahimoud April 9, 2024 02:00

zahimoud approved these changes Apr 9, 2024

zahimoud merged commit 29b2fbe into triton-lang:main Apr 9, 2024
5 checks passed

htyu mentioned this pull request Apr 9, 2024

Use full-vectorized load instructions for load vectorization ROCm/triton#445

Merged
