Minimal viable support for tf32 on the block pointer path #1172
Conversation
Force-pushed from fb22772 to 9896ed2
The failure in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9267818058/job/25513998366?pr=1172 is in the fallback path: it is performing an invalid bitcast, e.g.:
Force-pushed from f8cfe71 to 37de28a
lgtm
@@ -243,8 +244,9 @@ def matmul(a, b, res_dtype):
    # Note: the torch.matmul and Triton implementations use different
    # algorithms so we need to adjust tolerance.
    atol = 4e-2 if dtype == torch.float32 else 1e-4
For this, we can double-check with other teams (kernel library, IGC, etc.) that have used TF32 GEMM previously, to make sure this is as expected.
Yes, that would be good to know. I'd need to raise the bounds again in #1211.
What's the rationale for increasing atol vs. rtol?
Nothing specific, I just played around with the parameters. The maximum relative error is 146 (triton: 0.0146, torch: 0.0001). I think the underlying problem might be that the reference computation is not done with TF32 precision; still investigating how to enable that.
Increasing rtol is better because the torch.allclose comparison is:

|input − other| ≤ atol + rtol × |other|

So increasing atol loosens the comparison regardless of the magnitude of the values being compared, whereas rtol scales with the reference.
If we can force torch to use TF32 precision that would be ideal.
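For illustration, a small sketch of that bound using the numbers quoted above (triton: 0.0146, torch: 0.0001); the one-element tensors are a stand-in, not the actual test data:

```python
import torch

triton_out = torch.tensor([0.0146])
torch_ref = torch.tensor([0.0001])

# torch.allclose checks |input - other| <= atol + rtol * |other| elementwise.
# With a tiny reference value, the rtol term contributes almost nothing,
# so for this particular element only a large atol makes the check pass.
print(torch.allclose(triton_out, torch_ref, atol=4e-2, rtol=1e-5))  # True
print(torch.allclose(triton_out, torch_ref, atol=1e-4, rtol=1e-5))  # False
```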
Found out how to set the TF32 mode and hence was able to drop the changes to the tolerances.
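For reference, one way to put the torch.matmul reference computation into TF32 mode is PyTorch's device-agnostic matmul-precision knob; whether this is the exact mechanism used in the test (and whether the XPU backend honors it) is an assumption here:

```python
import torch

# Allow PyTorch to use a reduced-precision (e.g. TF32) path for float32 matmuls.
# "highest" restores full FP32 precision if you want to compare both modes.
torch.set_float32_matmul_precision("high")

a = torch.randn(512, 512, dtype=torch.float32)
b = torch.randn(512, 512, dtype=torch.float32)
ref = torch.matmul(a, b)  # reference now eligible for TF32 execution
```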
LGTM
Force-pushed from 5030564 to e588e8c
Signed-off-by: Julian Oppermann <julian.oppermann@codeplay.com>
Force-pushed from e588e8c to 3b09e6d
Force-pushed from f131540 to 7e0640b
LGTM
Adds support for TF32 dot products. The main difference compared to 16-bit datatypes is that the A operand load needs to be encoded as an `i32`-based block load, and it must not use the VNNI format.
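
For illustration, a minimal sketch (not the PR's actual test) of a block-pointer GEMM kernel that exercises this path by feeding fp32 operands to tl.dot; the kernel name, block sizes, and the spelling of the precision knob (input_precision="tf32" vs. the older allow_tf32=True) are assumptions and may differ across Triton versions:

```python
import triton
import triton.language as tl


@triton.jit
def tf32_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                       stride_am, stride_ak, stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                       BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Block pointers for the A and B tiles. The elements are fp32, so the
    # A-operand load is lowered to an i32-based block load without VNNI packing.
    a_block_ptr = tl.make_block_ptr(base=a_ptr, shape=(M, K),
                                    strides=(stride_am, stride_ak),
                                    offsets=(pid_m * BLOCK_M, 0),
                                    block_shape=(BLOCK_M, BLOCK_K), order=(1, 0))
    b_block_ptr = tl.make_block_ptr(base=b_ptr, shape=(K, N),
                                    strides=(stride_bk, stride_bn),
                                    offsets=(0, pid_n * BLOCK_N),
                                    block_shape=(BLOCK_K, BLOCK_N), order=(1, 0))
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_block_ptr, boundary_check=(0, 1))
        b = tl.load(b_block_ptr, boundary_check=(0, 1))
        # fp32 inputs with TF32 precision requested for the dot product.
        acc += tl.dot(a, b, input_precision="tf32")
        a_block_ptr = tl.advance(a_block_ptr, (0, BLOCK_K))
        b_block_ptr = tl.advance(b_block_ptr, (BLOCK_K, 0))
    c_block_ptr = tl.make_block_ptr(base=c_ptr, shape=(M, N),
                                    strides=(stride_cm, stride_cn),
                                    offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                                    block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tl.store(c_block_ptr, acc, boundary_check=(0, 1))
```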