
Minimal viable support for f16 dot product with f16 accumulator on the block pointer path #1211

Closed
jopperm wants to merge 2 commits

Conversation

@jopperm (Contributor) commented on May 30, 2024

Adds support for accumulating the dot result with 16-bit precision.
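
As a rough illustration of what this enables, below is a minimal sketch of a block-pointer kernel that requests a 16-bit accumulator from `tl.dot` via its `out_dtype` argument. The kernel name, shapes, and launch details are assumptions made for this example and are not taken from the PR diff.

    import triton
    import triton.language as tl

    @triton.jit
    def dot_f16_acc_kernel(a_ptr, b_ptr, c_ptr,
                           M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
        # Block pointers covering the whole (small) row-major matrices.
        a_bp = tl.make_block_ptr(a_ptr, shape=(M, K), strides=(K, 1),
                                 offsets=(0, 0), block_shape=(M, K), order=(1, 0))
        b_bp = tl.make_block_ptr(b_ptr, shape=(K, N), strides=(N, 1),
                                 offsets=(0, 0), block_shape=(K, N), order=(1, 0))
        c_bp = tl.make_block_ptr(c_ptr, shape=(M, N), strides=(N, 1),
                                 offsets=(0, 0), block_shape=(M, N), order=(1, 0))
        a = tl.load(a_bp)
        b = tl.load(b_bp)
        # Request a float16 accumulator instead of the default float32 one.
        c = tl.dot(a, b, out_dtype=tl.float16)
        tl.store(c_bp, c)

Launching this with, e.g., 64x64 fp16 tensors would exercise the f16-accumulating DPAS variant this PR targets.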

@jopperm self-assigned this on May 30, 2024
@jopperm linked an issue on May 30, 2024 that may be closed by this pull request
@jopperm force-pushed the jopperm/blockptr_f16accu branch 2 times, most recently from 37835e5 to df9b126, on May 31, 2024 at 09:52
@Dewei-Wang-sh (Contributor) commented:

https://github.com/triton-lang/triton/blob/v2.1.0/python/tutorials/08-experimental-block-pointer.py#L161,#L176
This is the original test; that's why I'm asking whether you have time to make the result type the same as the a/b type in #1155.

Comment on lines +227 to +231
if res_dtype in [torch.float16]:
# We observed high relative errors on small numbers when only using 16 bit for accumulation;
# hence, use a more restricted input set here.
a = torch.randint(low=-8, high=8, size=(512, 512), device='xpu', dtype=dtype) / 16
b = torch.randint(low=-8, high=8, size=(512, 512), device='xpu', dtype=dtype) / 16
@jopperm (Contributor, Author) commented:

As an alternative to raising the tolerances, we could use restricted inputs (here: [-0.5, 0.5] in 1/16th increments). WDYT?

Signed-off-by: Julian Oppermann <julian.oppermann@codeplay.com>
@jopperm marked this pull request as ready for review on May 31, 2024 at 15:45
@jopperm requested review from whitneywhtsang, etiotto and a team on May 31, 2024 at 15:45
if res_dtype in [torch.float16]:
# We observed high relative errors on small numbers when only using 16 bit for accumulation;
# hence, use a more restricted input set here.
a = torch.randint(low=-8, high=8, size=(512, 512), device='xpu', dtype=dtype) / 16
A reviewer (Contributor) asked:

Why are we generating random integer values for float16?

@jopperm (Contributor, Author) replied:

While the results overall look reasonable to me, there are outliers which would require the relative tolerance to be >> 10, so I looked for alternatives to restrict the inputs. I'll bring this up later in the call.
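
A toy illustration (numbers invented for this comment, not taken from the test) of why a relative tolerance breaks down for results near zero:

    import torch

    exact = torch.tensor(1e-4)      # reference result, very close to zero
    approx = torch.tensor(2.1e-3)   # f16-accumulated result with a small absolute error
    rel_err = (approx - exact).abs() / exact.abs()
    print(rel_err.item())           # ~20, i.e. the relative tolerance would need to be >> 10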

@jopperm (Contributor, Author) commented on Jun 5, 2024:

Superseded by #1258.

@jopperm closed this on Jun 5, 2024
@jopperm deleted the jopperm/blockptr_f16accu branch on June 5, 2024 at 19:15
whitneywhtsang pushed a commit that referenced this pull request Jun 20, 2024
This PR adds support for using the `bf16 += bf16 x bf16` variant of the
DPAS instruction in the `10-experimental-block-pointer.py` tutorial.

I extended the `triton_gen.dpas` op's verifier to support non-f32
accumulator types, and added corresponding test cases for bf16 (as well
as for f16 accumulation, which I missed in #1211).

I had to raise the absolute tolerance again (compared to f16
accumulation). `1e-2` for `bf16` and `1e-3` for `f16` matches what
`torch.finfo` returns as "resolution" for the data types, though that
might be just a coincidence.

---------

Signed-off-by: Julian Oppermann <julian.oppermann@codeplay.com>
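
For reference, the "resolution" values mentioned in the commit message can be read directly from `torch.finfo`:

    import torch

    print(torch.finfo(torch.float16).resolution)   # 0.001
    print(torch.finfo(torch.bfloat16).resolution)  # 0.01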

Successfully merging this pull request may close these issues.

[GEMM] Fix functional issues with non-float16 dtype