[SYCL][MATRIX][CUDA] impl/tests for bf16, (u)int8, and half. #5009

Merged
merged 22 commits on Jan 20, 2022

Conversation

JackAKirk
Contributor

@JackAKirk JackAKirk commented Nov 22, 2021

Implementation of Nvidia MMAs using bf16, mixed precision int ((u)int8/int32), and mixed precision float (half/float).

Signed-off-by: jack.kirk jack.kirk@codeplay.com

Initial implementation of the bf16 data format using a new interface that makes use of the matrix_type enum.
The matrix_type enum is used as the first template parameter (instead of a typename) in e.g. joint_matrix_load, to allow the user to provide fp32 matrices that the implementation is responsible for converting to the special format types bf16 or tf32 prior to a matrix multiplication operation that returns an fp32 matrix.
The conversion to the more efficient types is performed when the user specifies matrix_type::bf16.
This means that the user does not have to worry about converting from fp32 to bf16 when there is no advantage to doing the conversion themselves.
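For illustration only, a rough sketch of the kind of usage this description has in mind; the matrix_type values, matrix names, and accessor names below are hypothetical and do not reflect a finalized interface:

// Hypothetical usage sketch: the user only ever handles fp32 data in memory.
// Selecting matrix_type::bf16 as the first template parameter asks the
// implementation to convert to bf16 internally before the MMA, which then
// returns an fp32 accumulator matrix.
joint_matrix_load<matrix_type::bf16>(sg, sub_a, accA.get_pointer(), stride);
joint_matrix_load<matrix_type::bf16>(sg, sub_b, accB.get_pointer(), stride);
joint_matrix_load<matrix_type::fp32>(sg, sub_c, accC.get_pointer(), stride);
sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c); // computed in bf16, result in fp32
joint_matrix_store(sg, sub_c, accD.get_pointer(), stride);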

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk requested a review from a team as a code owner November 22, 2021 16:44
@bader bader marked this pull request as draft November 22, 2021 16:49
@JackAKirk
Contributor Author

JackAKirk commented Nov 22, 2021

Hi @dkhaldi

I wanted to ask for your thoughts on this approach before going further. I have seen that there is a recent Intel bf16 implementation where the user can directly create a bf16-format matrix and perform operations on it. Since, at least in the Tensor Core case, the MAD operation that multiplies two bf16 matrices returns an fp32 matrix, I thought it might be useful to offer an additional, complementary option: the user provides two fp32 matrices (perhaps themselves the output of previous MAD operations) to the joint_matrix_load/mad interfaces, which convert the fp32 matrices to bf16, load the bf16 data into registers, and perform the matrix multiplication with the more efficient bf16 type before returning the output fp32 matrix.

In short, the purpose is to not expose the user to the bf16 type when it is of no advantage to them. You can see from the example test, tensorcore-bf16.cpp, that this makes the user experience identical to e.g. the existing double case.
I see that there is also an advantage to providing an option for the user to directly create the bf16 matrix, so that they do not have to create an fp32 matrix before converting it to bf16; but in the case where the user is always working with fp32 matrices anyway, I thought it made sense to explore an option where the user is not exposed to the bf16 type at all.
What do you think of this approach?

Thanks

@bader
Contributor

bader commented Nov 22, 2021

@JackAKirk, I converted this pull request to a draft to avoid an unintentional merge. Please push the "Ready for review" button when it's ready.

@bader bader requested a review from dkhaldi November 22, 2021 16:50
#ifdef __NVPTX__
#ifdef __SYCL_DEVICE_ONLY__
__dmma_m8n8k4_ld_b(res.data, src.get(), stride, get_layout_id<Layout>());
uint32_t val[128];
Contributor Author

A SPIR-V function will be utilized here to allow a more efficient conversion from the fp32 to bf16 formats. Currently the threads repeatedly copy/convert the same data; the current version is meant as a simple proof-of-concept only.

Contributor

I am adding @AlexeySotkin here because I know he has been working on conversion functions for these types.

Contributor Author

Thanks, I've just looked at the bf16 conversion extension proposal.
It is worth mentioning that for the joint_matrix_mad operations Tensor Cores require that two bf16 floats are packed into a single uint32_t. I believe there currently only exists an operation to pack/convert two fp32s into a single uint32_t; it doesn't appear to be possible to directly pack two bf16s stored in uint16_ts into a single uint32_t.
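To make the packing requirement concrete, here is a minimal host-side illustration; the helper names are hypothetical and not part of any extension, and the truncation-based conversion is only a stand-in for a proper rounding conversion:

#include <cstdint>
#include <cstring>

// Hypothetical helper: convert one fp32 value to bf16 storage by truncating
// the low 16 mantissa bits (round-to-zero); a real implementation would use a
// rounding conversion instruction instead.
inline uint16_t fp32_to_bf16_trunc(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

// Hypothetical helper: pack two bf16 values (each held in a uint16_t) into a
// single uint32_t, as the Tensor Core fragment layout expects "two elements
// from the matrix" per .b32 register.
inline uint32_t pack_bf16x2(uint16_t lo, uint16_t hi) {
  return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}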

Contributor

Can you send the spec that explains this packing requirement?
This seems similar to the VNNI packing requirement for AMX/DPAS (packing of low precision to 32 bits)

Contributor Author

Sure. The spec is spread out a little:

Firstly, under the header "Floating point fragments for .bf16 data format" in this section, it states that for the A/B matrices the data is stored as:

"A vector expression of four .b32 registers, with each register containing two elements from the matrix."

The conversion/packing information can be found here.

@JackAKirk
Contributor Author

@JackAKirk, I converted this pull request to a draft to avoid an unintentional merge. Please push the "Ready for review" button when it's ready.

OK Thank you.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@dkhaldi
Contributor

dkhaldi commented Nov 30, 2021

We synced up in a meeting today. We agreed on the following next steps (corrections welcome):

  • Add this "implicit conversion" feature to https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc#future-looking-api to document it as a potential additional feature while investigating the following:
    - Investigate the performance implications of implicit conversion caused by (1) doing the conversion on the sub-matrix rather than on the whole matrix, and (2) not being able to reuse the converted data in subsequent operations, which results in redundant conversions.
    - (This is something I did not mention in the meeting): draw up a list of the exact conversions that we will be able to do efficiently (besides bf16 to fp32 and vice versa).
  • Focus on adding the missing types, element-wise operations, and matrix fill that are necessary for this API usage.
  • Investigate the potential problems around the current element-wise operations proposal for the PTX backend (how to provide the mapping to the user).
  • Add matrix_use from the Intel implementation to reach convergence and portability.

All wmma instructions now have runtime implementations for all matrix shapes supported for the mixed precision float and integer cases, and for the bf16 type.

Removed the cvt (convert) intrinsics/builtins; these will be added in a separate PR.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk changed the title [WIP Do not merge!][SYCL][MATRIX][CUDA] [SYCL][MATRIX][CUDA] impl and tests for bf16, (u)int8, and half. Dec 21, 2021
@JackAKirk JackAKirk changed the title [SYCL][MATRIX][CUDA] impl and tests for bf16, (u)int8, and half. [SYCL][MATRIX][CUDA] impl/tests for bf16, (u)int8, and half. Dec 21, 2021
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk
Contributor Author

Implementation of Nvidia MMAs using bf16, mixed precision int ((u)int8/int32), and mixed precision float (half/float).

Signed-off-by: jack.kirk jack.kirk@codeplay.com

Corresponding integration tests are here: intel/llvm-test-suite#650

@JackAKirk
Contributor Author

UPDATE:

This PR is ready for review now.

Hi @dkhaldi

In answer to your question about conversions: we initially want to support the tf32->fp32 and fp32->tf32 routes, in addition to the bf16->fp32 and fp32->bf16 routes. We plan to extend the SYCL_INTEL_bf16_conversion proposal to also cover tf32. Do you also plan to implement the tf32 matrix type? Later we will also want to support conversions between f16->f32, s8->s32, u8->s32 and vice versa; however, these cases are not a priority. For the tf32/bf16 conversions we will make use of PTX asm instructions that perform these conversions directly.

The tf32 case was not added to this pull request because it requires a change to the joint_matrix_store/joint_matrix_load interface so that loading/storing the accumulator matrix for the tf32 case can be distinguished from the other floating point cases. The tf32 case uses a different instruction to perform these loads/stores, and that instruction currently cannot be distinguished from the other fp32 accumulator load/store cases, since they all use the same matrix dimensions and the fp32 element type. The most straightforward solution from our point of view is to add an additional template parameter to joint_matrix_store/joint_matrix_load corresponding to the "K" dimension that specifies the shapes of the A/B matrices that may be used in an MMA with the accumulator matrix being loaded/stored. In that case joint_matrix_load/joint_matrix_store would take three template parameters instead of two, corresponding to the full set of values that specify the shapes of the A/B/accumulator matrices: M, N, K. How does this change look from your point of view?
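Purely to illustrate the shape of the proposed change (this is not an agreed interface), the accumulator load could gain a K template parameter roughly as follows; the namespace and existing parameters mirror how the extension is used elsewhere in this PR, while the extra K is the hypothetical addition (SYCL headers assumed to be included):

namespace matrix = sycl::ext::oneapi::experimental::matrix;

// Hypothetical declaration sketch: the extra K template parameter would let
// the implementation select the m16n16k8 (tf32) accumulator load/store
// instruction rather than the m16n16k16 one used for bf16/half, even though
// M, N and the fp32 element type are identical in both cases.
template <typename Group, typename T, size_t M, size_t N, size_t K,
          matrix::matrix_layout Layout, sycl::access::address_space Space>
void joint_matrix_load(
    Group sg,
    matrix::joint_matrix<T, matrix::matrix_use::accumulator, M, N, Layout, Group> &res,
    sycl::multi_ptr<T, Space> src, size_t stride);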

The above tf32 implementation issue is our biggest priority at the moment. After this, our other priorities are to implement the tf32/bf16 conversion extension for the Tensor Core backend, which is a WIP, and to plan an implementation of get_slice/get_element to support element-wise operations. We may make use of this proposal https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/LocalMemory/LocalMemory.asciidoc in order to copy memory from a joint_matrix to shared memory within the Nvidia wmma implementation of e.g. get_slice, but this is under investigation.

The only other thing worth mentioning is that we plan to add an implementation for two single-bit MMAs in which the dot product operator is replaced with the bitwise "and" and "xor" operators. In order to have implementations for both operators it would be necessary to overload joint_matrix_mad to take an operator argument (see the sketch below). It is worth noting that the "xor" operator is our initial priority. Do you also plan to support such bitwise MMA operations?
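A hypothetical shape for such an overload, just to make the idea concrete; using SYCL function objects as the operator tag is an assumption on our side, not part of any current proposal, and the declaration is assumed to live inside the extension's matrix namespace:

// Hypothetical overload sketch: a binary-operation tag such as sycl::bit_xor<>
// or sycl::bit_and<> would select whether the single-bit MMA combines elements
// with XOR or AND in place of the usual multiply.
template <typename Group, typename BinaryOperation, typename T1, typename T2,
          size_t M, size_t K, size_t N, matrix_layout LayoutA,
          matrix_layout LayoutB, matrix_layout LayoutC>
joint_matrix<T2, matrix_use::accumulator, M, N, LayoutC, Group>
joint_matrix_mad(Group sg, BinaryOperation op,
                 joint_matrix<T1, matrix_use::a, M, K, LayoutA, Group> &A,
                 joint_matrix<T1, matrix_use::b, K, N, LayoutB, Group> &B,
                 joint_matrix<T2, matrix_use::accumulator, M, N, LayoutC, Group> &C);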

Many thanks and best wishes!

Used fully qualified names.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk marked this pull request as ready for review January 6, 2022 11:24
// m16n16k16
__SYCL_JOINT_MATRIX_OVERLOAD(half, a, 16, 16, int32_t, 8)
__SYCL_JOINT_MATRIX_OVERLOAD(half, b, 16, 16, int32_t, 8)
__SYCL_JOINT_MATRIX_OVERLOAD(uint16_t, a, 16, 16, int32_t, 4)
Contributor

Is bfloat treated here as uint16_t? If yes, add a comment

Contributor Author

Yeah it is, I thought it was safer than unsigned short. I can add a comment. Thanks

Contributor Author

done

__imma_m32n8k16_ld_c(res.data, src.get(), stride,
get_layout_id<Layout>());
}
} else if constexpr (std::is_same<T, float>::value) {
Contributor

what happens if I use matrix A or B with type float? you should also check "use" argument here as well, right?

Contributor Author

@JackAKirk JackAKirk Jan 6, 2022

There isn't a builtin for the case where A or B is of type float.

} else if constexpr (NumRows == 16 && NumCols == 8) {
__mma_bf16_m32n8k16_ld_b(res.data, tileptr, stride,
get_layout_id<Layout>());
}
Contributor

You should add an else statement here: if unsupported sizes are used --> error.
This comment applies to all the types that come after this one.

Contributor Author

Thanks

Contributor Author

In joint_matrix_load_impl, joint_matrix_store_impl, and joint_matrix_mad_impl the user must provide joint_matrix objects constructed with the sycl::sub_group template parameter. As long as I implement cases for all eligible joint_matrix types and ensure there is a 1:1 mapping between eligible joint_matrix types and the constexpr branches/builtin calls within these functions, there is no need for any error cases.

Errors are only possible when constructing a joint_matrix. I looked at the option of providing a default constructor for joint_matrix using the sycl::sub_group template parameter and then using an assert for unsupported instances of joint_matrix (similar to the Intel matrix implementation's use of assert). However, the error message returned by this method was completely ambiguous, so I reverted to the existing behaviour: if a user tries to construct an unsupported joint_matrix, they get an error saying the constructor doesn't exist.
It could be better to use an exception system that provides the user with a simpler and more concise error message such as "invalid joint_matrix"; however, that is not achievable with assert, and I think we should first decide on an error-message system that is consistent across backend implementations.
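A reduced, self-contained sketch of the pattern being described; the stand-in macro below is a simplification of __SYCL_JOINT_MATRIX_OVERLOAD and the fragment layout is invented, but the error behaviour (undefined primary template, specializations only for supported shapes) is the point:

#include <cstddef>
#include <cstdint>

enum class matrix_use { a, b, accumulator };
enum class matrix_layout { row_major, col_major };

// The primary template is declared but never defined, so constructing a
// joint_matrix with an unsupported type/shape fails to compile with an
// "implicit instantiation of undefined template" error at the point of use.
template <typename T, matrix_use Use, std::size_t Rows, std::size_t Cols,
          matrix_layout Layout, typename Group>
struct joint_matrix;

// Simplified stand-in for __SYCL_JOINT_MATRIX_OVERLOAD: each supported
// (type, use, shape) combination gets its own partial specialization.
#define SKETCH_JOINT_MATRIX_OVERLOAD(TYPE, USE, ROWS, COLS, FRAG_TYPE, SIZE)  \
  template <matrix_layout Layout, typename Group>                            \
  struct joint_matrix<TYPE, matrix_use::USE, ROWS, COLS, Layout, Group> {    \
    FRAG_TYPE data[SIZE];                                                     \
  };

SKETCH_JOINT_MATRIX_OVERLOAD(uint16_t, a, 16, 16, int32_t, 4) // bf16 held as uint16_t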

Contributor

What do you mean by ambiguous?
What about something like:

static_assert((NumRows != X && NumCols != Y), "Tensorcores matrix implementation does not support these sizes; refer to ... for supported sizes");

Contributor Author

What do you mean by ambiguous? What about something like: static_assert((NumRows != X && NumCols != Y), "Tensorcores matrix implementation does not support these sizes; refer to ... for supported sizes");

If I use a runtime assert like in matrix-jit.hpp then I get a general CUDA assert exception with many lines of per-thread assert failures that give zero information about the specific joint_matrix failure. An example line is:

:43: block: [0,1,0], thread: [31,0,0] Assertion `` failed.

I tried using the compile-time static_assert that you suggested (thanks for the suggestion, I had not used static_assert before), but a problem with this is that if I use the correct assert boolean the error message will be very long: I'd need to check each of the possible cases enumerated by the "__SYCL_JOINT_MATRIX_OVERLOAD" macro. I could instead have a default constructor for joint_matrix to cover cases not enumerated by the __SYCL_JOINT_MATRIX_OVERLOAD macro, but I would have to use one of the template parameters in the assert to ensure that the correct constructor is picked by the compile-time assert, and then the assert condition would not relate to the actual error, which could be confusing for the user. For example, if I choose to use the Group template parameter I get an error like:

static_assert((std::is_same<Group, sycl::sub_group>::value && false), "Tensor Cores joint_matrix implementation does not support these sizes for the requested data type");

I think that this is worse than the error messages that I get from the existing implementation where the constructor is only implemented for supported cases, which is correct and clear:

joint_matrix_tensorcore.cpp:156:83: error: implicit instantiation of undefined template 'sycl::ext::oneapi::experimental::matrix::joint_matrix<sycl::detail::half_impl::half, sycl::ext::oneapi::experimental::matrix::matrix_use::a, 14, 16, sycl::ext::oneapi::experimental::matrix::matrix_layout::row_major>'
joint_matrix<T1, matrix_use::a, 14, K, matrix_layout::row_major, sycl::sub_group> sub_ze;
                                                                                  ^
joint_matrix_tensorcore.cpp:193:3: note: in instantiation of function template specialization 'test<sycl::detail::half_impl::half, float, 2UL, 4UL, 3UL, 16UL, 16UL, 16UL>' requested here
  test<half, float, SUB_TILES_M, SUB_TILES_K, SUB_TILES_N, 16, 16, 16>();
  ^
/home/jackkirk/jkLLVM/llvm/build/bin/../include/sycl/ext/oneapi/matrix/matrix-tensorcore.hpp:19:8: note: template is declared here
struct joint_matrix;
       ^
1 error generated.

I'm open to suggestions if you think there is a better option. Thanks

Contributor

With the static_assert, you can improve the message to tell the user how to fix it. With the current approach it will be hard to debug, as the user does not know whether they made a typo or the sizes are simply not supported; but this is a problem of C++ metaprogramming, it is hard to debug :)
On the other hand, we state that the query interface is the tool to use to find out what is supported and what is not, so errors are not mandatory here. It is just that, since this is AOT compilation, we can take advantage of compile-time errors in a way we currently cannot with matrix-jit, which uses JIT compilation.

Contributor Author

@JackAKirk JackAKirk Jan 12, 2022

Yes, I was thinking that the query interface is the right place to find out suitable types/shapes. I agree that it would be best to have a short message informing the user that the chosen type/shape is not supported, but with the current implementation I do not know how to do this using static_assert (without making the assert unrelated to the actual error), since it would output a very large assert condition that effectively spits out in one go all of the information that the query interface would provide. I have no strong feelings on this, however, and would be happy to switch to the best possible static_assert for the current implementation if you want me to.

I think that once we are at the point of merging the DPAS/Nvidia backend implementations we should reconsider the error messaging and ensure that both messaging systems (and messages) are consistent and effective.
I personally think that keeping the backends in separate headers is the correct tactic at the moment, provided each keeps the other backend implementation in mind and conforms to the same proposal. So far I think the two implementations could immediately be merged as they are (e.g. #ifdef __NVPTX__ sends joint_matrix_load to joint_matrix_load_impl, while DPAS is sent to the Intel implementation functions). As long as both are constrained to conform closely to the extension proposal, keeping them separate allows a small degree of freedom and experimentation in both backends, in case of unforeseen issues that we encounter from time to time, so that all features of both backends can be supported to the full extent desired before a final universal design is decided upon.

__dmma_m8n8k4_ld_c(res.data, src.get(), stride,
get_layout_id<Layout>());
}
}
Contributor

an else statement is missing here for when an unsupported type is used --> error

Contributor Author

@JackAKirk JackAKirk Jan 6, 2022

See above comment.


#ifdef __NVPTX__
#ifdef __SYCL_DEVICE_ONLY__
__dmma_m8n8k4_st_c_f64(dst.get(), src.data, stride,
get_layout_id<Layout>());
if (NumRows == 16 && NumCols == 16) {
Contributor

same comments as in the load:

  • check use for the float type (when float is used with A or B --> do you error?)
  • else statement for when unsupported sizes are used
  • else statement for when unsupported types are used

Contributor Author

@JackAKirk JackAKirk Jan 6, 2022

See above comment.

sycl::ext::oneapi::experimental::matrix::matrix_layout LayoutB,
sycl::ext::oneapi::experimental::matrix::matrix_layout LayoutC>
struct joint_matrix_mad_impl<
double, double, 8, 4, 8, LayoutA, LayoutB, LayoutC,
T1, T2, M, K, N, LayoutA, LayoutB, LayoutC,
Contributor

You are assuming here that A and B have the same type. Is that always true?
Are there cases where A is unsigned and B is signed, or where they are mixed integer types?

Contributor Author

Yeah, they are always the same type for all possible Nvidia PTX instructions.

__dmma_m8n8k4_mma_f64(D.data, A.data, B.data, C.data,
get_layout_pair_id<LayoutA, LayoutB>(), 0);
if constexpr (M == 16 && N == 16 && K == 16) {
if constexpr (std::is_same<T1, int8_t>::value) {
Contributor

you should check that C is int32 if it matters?

Contributor Author

Yeah, I'll go through and make sure all possible inputs either call the correct builtin or return an appropriate error. Thanks

get_layout_pair_id<LayoutA, LayoutB>(), 0);
} else if constexpr (std::is_same<T1, half>::value) {
__hmma_m16n16k16_mma_f32f32(D.data, A.data, B.data, C.data,
get_layout_pair_id<LayoutA, LayoutB>(), 0);
Contributor

I would have expected this to be called __hmma_m16n16k16_mma_f16f16.
What is the mma function if A and B are f32?

Contributor Author

A and B are always f16: there is no PTX instruction for such a case.

The __hmma_m16n16k16_mma_f16f16 instruction does exist. It corresponds to the case where the "matrix_use::accumulator" matrices C (input) and D (output) are also fp16. That case has not yet been added because it would require a change to the proposal so that joint_matrix_mad takes an additional parameter specifying the precision of the output matrix, which would only be necessary for this mixed precision float case.
There are also the cases __hmma_m16n16k16_mma_f16f32 (C type f16, D type f32) and __hmma_m16n16k16_mma_f32f16.

A more pressing issue regarding the proposal for us is the one that would allow us to make use of the tf32 type:

The tf32 case was not added to this pull request because it requires a change to the joint_matrix_store/joint_matrix_load interface so that loading/storing the accumulator matrix for the tf32 case can be distinguished from the other floating point cases. The tf32 case uses a different instruction to perform these loads/stores, and that instruction currently cannot be distinguished from the other fp32 accumulator load/store cases, since they all use the same matrix dimensions and the fp32 element type. The most straightforward solution from our point of view is to add an additional template parameter to joint_matrix_store/joint_matrix_load corresponding to the "K" dimension that specifies the shapes of the A/B matrices that may be used in an MMA with the accumulator matrix being loaded/stored. In that case joint_matrix_load/joint_matrix_store would take three template parameters instead of two, corresponding to the full set of values that specify the shapes of the A/B/accumulator matrices: M, N, K. How does this change look from your point of view?

Contributor

I think __hmma_m16n16k16_mma_f16f16 can be added with the current interface by checking if C is fp16. If C is fp32, __hmma_m16n16k16_mma_f32f32 is generated.
BTW, you should add a check that T2 is fp32 to line 427.
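A minimal sketch of that dispatch, assuming T1/T2 (the A/B and C/D element types), the fragment .data members, and get_layout_pair_id are in scope as in the surrounding joint_matrix_mad_impl; the branch structure is illustrative only, though the __hmma builtins named are existing NVPTX builtins:

// Illustrative m16n16k16 half case: choose the MMA builtin from the
// accumulator element type T2 instead of assuming fp32 output.
if constexpr (std::is_same<T1, half>::value) {
  if constexpr (std::is_same<T2, float>::value) {
    __hmma_m16n16k16_mma_f32f32(D.data, A.data, B.data, C.data,
                                get_layout_pair_id<LayoutA, LayoutB>(), 0);
  } else if constexpr (std::is_same<T2, half>::value) {
    __hmma_m16n16k16_mma_f16f16(D.data, A.data, B.data, C.data,
                                get_layout_pair_id<LayoutA, LayoutB>(), 0);
  }
}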

Contributor

tf32 is a separate issue. I think the current direction is to introduce a special type like we did for bf16, or something along those lines, so I am not sure whether introducing extra complexity into the matrix interface is the right thing to do at this point.

One immediate workaround could be to treat fp32 as tf32 for storage, the same way we are treating unsigned short as bf16.
We would add a comment about this stating that it is a workaround while the tf32 type is being baked.

What do you think?

Contributor Author

So the accumulator matrix for tf32 is always fp32 already; this isn't the issue. If you look here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-ld
you can see that there is a different instruction for loading the accumulator matrix for the tf32 type compared to loading the accumulator matrix for the bf16 MMA or the mixed precision float MMA. The tf32 MMA uses the following instruction:

wmma.load.c.sync.aligned.layout.m16n16k8{.ss}.f32

whereas the other two cases use:

wmma.load.c.sync.aligned.layout.m16n16k16{.ss}.f32

There is no way to distinguish between these two cases using the current interface of joint_matrix_load, since the values NumRows=16, NumCols=16 and T=f32 are identical in both cases. An obvious resolution is to add a parameter for K=8/16, which is what the CUDA runtime does. Another reason for introducing this extra parameter is that it adds the semantics that joint_matrix_load/joint_matrix_store/joint_matrix_mad should always be used together with matching values of M, N, K; an argument against is that it is more complicated, since there is an additional parameter that isn't used most of the time. I wanted to check with you whether the Intel backend might also have a similar case to implement in the future that would require the additional K parameter, and more generally what you think about adding the K parameter. Otherwise I think we would have to add an additional runtime function for the tf32 type for the CUDA backend, e.g. "joint_matrix_load_tf32_cuda" or similar.

Contributor Author

I think __hmma_m16n16k16_mma_f16f16 can be added with the current interface by checking if C is fp16. If C is fp32, __hmma_m16n16k16_mma_f32f32 is generated. BTW, you should add a check that T2 is fp32 to line 427.

Yeah, I should be able to add __hmma_m16n16k16_mma_f16f16 with the current interface. I wanted to first check whether you also have any cases where the output matrix "D" in joint_matrix_mad differs from the "C" matrix, similar to e.g.:
__hmma_m16n16k16_mma_f16f32 or
__hmma_m16n16k16_mma_f32f16

If you do, then it might make sense to update the interface accordingly at this point. If not, since it is not really a priority for us to support these cases, I can just add the __hmma_m16n16k16_mma_f16f16 case using the current joint_matrix_mad interface, which assumes the types of "C" and "D" are both "T2". As I mentioned earlier, it is always the case for CUDA that matrices "A" and "B" are "T1", so that part is fine for us.

auto accD = bufD.get_access<access::mode::read_write>(cgh);

cgh.parallel_for<class row_row_m16n16k16>(
nd_range<2>({1, 32}, {1, 32}), [=
Contributor

What is the size of the original buffer?
What does the 32 in the second dimension of the global range represent?

Contributor Author

This is just a device-code test that doesn't execute, so I don't worry about the size of the buffer; I only care that it has the correct type. It is for this reason that I tried to make the test as simple as possible: the purpose is to check that the runtime calls the correct builtin for each case. Also check out the integration tests here: intel/llvm-test-suite#650

The 32 in the second dimension of the global range is used because I only use one sub-group: it carries the same meaning as the second dimension of the local range, namely the number of nd_items per sub-group required for the matrix operations on the CUDA backend.

Constrained joint_matrix_XX_impl functions to take a joint_matrix constructed from sycl::sub_group template parameter.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk
Contributor Author

JackAKirk commented Jan 10, 2022

I have added the missing (A/B/Accumulator = half) cases.

Tests have been updated here: intel/llvm-test-suite#650

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@bader
Contributor

bader commented Jan 12, 2022

/verify with intel/llvm-test-suite#650

dkhaldi
dkhaldi previously approved these changes Jan 12, 2022
Contributor

@dkhaldi dkhaldi left a comment

LGTM

@JackAKirk
Contributor Author

ping @bader

The checks failed due to an unrelated sporadic test failure that has been temporarily disabled: intel/llvm-test-suite#725. We will be investigating this sporadic test failure.

Is it possible to rerun the tests and merge this PR? The tests should now pass. It would be useful to merge this because I have a follow-on PR ready.

Thanks

@bader
Contributor

bader commented Jan 18, 2022

Is it possible to rerun the tests

If you click on "Details" link next to failed GitHub Action - https://github.com/intel/llvm/runs/4781102824?check_suite_focus=true, there should be a button "Rerun all jobs" in the top right corner of the page. Do you have it?

@bader
Contributor

bader commented Jan 18, 2022

Is it possible to merge this PR?

We need an approval from @intel/llvm-reviewers-runtime to merge.

@JackAKirk
Contributor Author

Is it possible to rerun the tests

If you click on "Details" link next to failed GitHub Action - https://github.com/intel/llvm/runs/4781102824?check_suite_focus=true, there should be a button "Rerun all jobs" in the top right corner of the page. Do you have it?

I do not have this button on the intel/llvm repo. I imagine that I do not have the rights for it.

@JackAKirk
Contributor Author

Is it possible to merge this PR?

We need an approval from @intel/llvm-reviewers-runtime to merge.

OK

@bader
Contributor

bader commented Jan 18, 2022

Is it possible to rerun the tests

If you click on "Details" link next to failed GitHub Action - https://github.com/intel/llvm/runs/4781102824?check_suite_focus=true, there should be a button "Rerun all jobs" in the top right corner of the page. Do you have it?

I do not have this button on the intel/llvm repo. I imagine that I do not have the rights for it.

Thanks for letting me know. I've started the tests. I'll check with @alexbatashev what rights are needed to access this button.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk
Contributor Author

Is it possible to rerun the tests

If you click on "Details" link next to failed GitHub Action - https://github.com/intel/llvm/runs/4781102824?check_suite_focus=true, there should be a button "Rerun all jobs" in the top right corner of the page. Do you have it?

I do not have this button on the intel/llvm repo. I imagine that I do not have the rights for it.

Thanks for letting me know. I've started the tests. I'll check with @alexbatashev what rights are needed to access this button.

OK thanks.

steffenlarsen
steffenlarsen previously approved these changes Jan 19, 2022
Contributor

@steffenlarsen steffenlarsen left a comment

LGTM!

@JackAKirk
Contributor Author

JackAKirk commented Jan 20, 2022

Hi @maximdimakov

We have had a couple of occasions where https://github.com/intel/llvm-test-suite/blob/intel/SYCL/Regression/event_destruction.cpp has timed out on the CI for PRs (see the failed test in the HIP AMD Test Suite job above; I think it was the HIP backend in both cases, incidentally). This test does take quite a long time to execute, although I am surprised that it times out. Perhaps a solution is simply to reduce ITERS, if this is possible while maintaining the test's integrity?

Thanks

@bader
Contributor

bader commented Jan 20, 2022

+1. llvm-test-suite probably is not the best place to put stress tests like https://github.com/intel/llvm-test-suite/blob/intel/SYCL/Regression/event_destruction.cpp.

@@ -1,8 +1,5 @@
#pragma once
Contributor

Please, add a comment with license reference. Similar to this one:

//==---------------- sycl.hpp - SYCL standard header file ------------------==//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

Contributor Author

Ah yes, thank you for pointing this out. I noticed that the corresponding reference in matrix-jit.hpp is labelled matrix.hpp and matrix-aot-amx.hpp is labelled matrix-amx.hpp. Should I correct these also while I am at it?

Contributor

Yes, please.

Contributor Author

Done

Contributor

Thank you for working on this!

@bader
Contributor

bader commented Jan 20, 2022

@JackAKirk, Maxim is disabling the test in intel/llvm-test-suite#755, so let's ignore it for now.
Please, add a comment with license reference and this should be ready to merge.
Thanks!

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>