[SYCL][MATRIX][CUDA] impl/tests for bf16, (u)int8, and half. #5009

Merged
merged 22 commits on Jan 20, 2022

Conversation

JackAKirk
Contributor

@JackAKirk JackAKirk commented Nov 22, 2021

Implementation of Nvidia MMAs using bf16, mixed precision int ((u)int8/int32), and mixed precision float (half/float).

Signed-off-by: jack.kirk jack.kirk@codeplay.com

Initial implementation of the bf16 data format using a new interface that makes use of the matrix_type enum.
The matrix_type enum is used as the first template parameter (instead of a typename) in e.g. joint_matrix_load, to allow the user to provide fp32 matrices that the implementation is responsible for converting to the special format types bf16 or tf32 prior to a matrix multiplication operation that returns an fp32 matrix.
The conversion to the more efficient types is performed when the user specifies matrix_type::bf16.
This means that the user does not have to worry about converting from fp32 to bf16 when there is no advantage to doing the conversion themselves.
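For illustration only, a rough sketch of the kind of usage this description has in mind; the matrix_type values, matrix names, and accessor names below are hypothetical and do not reflect a finalized interface:

// Hypothetical usage sketch: the user only ever handles fp32 data in memory.
// Selecting matrix_type::bf16 as the first template parameter asks the
// implementation to convert to bf16 internally before the MMA, which then
// returns an fp32 accumulator matrix.
joint_matrix_load<matrix_type::bf16>(sg, sub_a, accA.get_pointer(), stride);
joint_matrix_load<matrix_type::bf16>(sg, sub_b, accB.get_pointer(), stride);
joint_matrix_load<matrix_type::fp32>(sg, sub_c, accC.get_pointer(), stride);
sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c); // computed in bf16, result in fp32
joint_matrix_store(sg, sub_c, accD.get_pointer(), stride);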

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk requested a review from a team as a code owner November 22, 2021 16:44
@bader bader marked this pull request as draft November 22, 2021 16:49
@JackAKirk
Contributor Author

JackAKirk commented Nov 22, 2021

Hi @dkhaldi

I wanted to ask for your thoughts on this approach before going further. I have seen that there is a recent Intel bf16 implementation where the user can directly create a bf16-format matrix and perform operations on it. Since, at least in the Tensor Core case, the MAD operation that multiplies two bf16 matrices returns an fp32 matrix, I thought it might be useful to offer an additional, complementary option: the user provides two fp32 matrices (perhaps themselves the output of previous MAD operations) to the joint_matrix_load/mad interfaces, which convert the fp32 matrices to bf16, load the bf16 data into registers, and perform the matrix multiplication with the more efficient bf16 type before returning the output fp32 matrix.

In short, the purpose is to not expose the user to the bf16 type when it is of no advantage to them. You can see from the example test, tensorcore-bf16.cpp, that this makes the user experience identical to e.g. the existing double case.
I see that there is also an advantage to providing an option for the user to directly create the bf16 matrix, so that they do not have to create an fp32 matrix before converting it to bf16; but in the case where the user is always working with fp32 matrices anyway, I thought it made sense to explore an option where the user is not exposed to the bf16 type at all.
What do you think of this approach?

Thanks

@bader
Contributor

bader commented Nov 22, 2021

@JackAKirk, I converted this pull request to a draft to avoid an unintentional merge. Please push the "Ready for review" button when it's ready.

@bader bader requested a review from dkhaldi November 22, 2021 16:50
#ifdef __NVPTX__
#ifdef __SYCL_DEVICE_ONLY__
__dmma_m8n8k4_ld_b(res.data, src.get(), stride, get_layout_id<Layout>());
uint32_t val[128];
Contributor Author

A SPIR-V function will be utilized here to allow a more efficient conversion from the fp32 to bf16 formats. Currently the threads repeatedly copy/convert the same data; the current version is meant as a simple proof-of-concept only.

Contributor

I am adding @AlexeySotkin here because I know he has been working on conversion functions for these types.

Contributor Author

Thanks, I've just looked at the bf16 conversion extension proposal.
It is worth mentioning that for the joint_matrix_mad operations Tensor Cores require that two bf16 floats are packed into a single uint32_t. I believe there currently only exists an operation to pack/convert two fp32s into a single uint32_t; it doesn't appear to be possible to directly pack two bf16s stored in uint16_ts into a single uint32_t.
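To make the packing requirement concrete, here is a minimal host-side illustration; the helper names are hypothetical and not part of any extension, and the truncation-based conversion is only a stand-in for a proper rounding conversion:

#include <cstdint>
#include <cstring>

// Hypothetical helper: convert one fp32 value to bf16 storage by truncating
// the low 16 mantissa bits (round-to-zero); a real implementation would use a
// rounding conversion instruction instead.
inline uint16_t fp32_to_bf16_trunc(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

// Hypothetical helper: pack two bf16 values (each held in a uint16_t) into a
// single uint32_t, as the Tensor Core fragment layout expects "two elements
// from the matrix" per .b32 register.
inline uint32_t pack_bf16x2(uint16_t lo, uint16_t hi) {
  return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}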

Contributor

Can you send the spec that explains this packing requirement?
This seems similar to the VNNI packing requirement for AMX/DPAS (packing of low precision to 32 bits)

Contributor Author

Sure. The spec is spread out a little:

Firstly, under the header "Floating point fragments for .bf16 data format" in this section, it states that for the A/B matrices the data is stored as:

"A vector expression of four .b32 registers, with each register containing two elements from the matrix."

The conversion/packing information can be found here.

@JackAKirk
Contributor Author

@JackAKirk, I converted this pull request to a draft to avoid an unintentional merge. Please push the "Ready for review" button when it's ready.

OK Thank you.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@dkhaldi
Contributor

dkhaldi commented Nov 30, 2021

We synced up in a meeting today. We agreed on the following next steps (corrections welcome):

  • Add this "implicit conversion" feature to https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc#future-looking-api to document it as a potential additional feature while investigating the following:
    - Investigate the performance implications of implicit conversion caused by (1) doing the conversion on the sub-matrix rather than on the whole matrix, and (2) not being able to reuse the converted data in subsequent operations, which results in redundant conversions.
    - (This is something I did not mention in the meeting): draw up a list of the exact conversions that we will be able to do efficiently (besides bf16 to fp32 and vice versa).
  • Focus on adding the missing types, element-wise operations, and matrix fill that are necessary for this API usage.
  • Investigate the potential problems around the current element-wise operations proposal for the PTX backend (how to provide the mapping to the user).
  • Add matrix_use from the Intel implementation to reach convergence and portability.

All wmma instructions now have runtime implementations for all matrix shapes supported for the mixed precision float and integer cases, and for the bf16 type.

Removed the cvt (convert) intrinsics/builtins; these will be added in a separate PR.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk changed the title [WIP Do not merge!][SYCL][MATRIX][CUDA] [SYCL][MATRIX][CUDA] impl and tests for bf16, (u)int8, and half. Dec 21, 2021
@JackAKirk JackAKirk changed the title [SYCL][MATRIX][CUDA] impl and tests for bf16, (u)int8, and half. [SYCL][MATRIX][CUDA] impl/tests for bf16, (u)int8, and half. Dec 21, 2021
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk
Contributor Author

Implementation of Nvidia MMAs using bf16, mixed precision int ((u)int8/int32), and mixed precision float (half/float).

Signed-off-by: jack.kirk jack.kirk@codeplay.com

Corresponding integration tests are here: intel/llvm-test-suite#650

@JackAKirk
Contributor Author

UPDATE:

This PR is ready for review now.

Hi @dkhaldi

In answer to your question about conversions: we initially want to support the tf32->fp32 and fp32->tf32 routes, in addition to the bf16->fp32 and fp32->bf16 routes. We plan to extend the SYCL_INTEL_bf16_conversion proposal to also cover tf32. Do you also plan to implement the tf32 matrix type? Later we will also want to support conversions between f16->f32, s8->s32, u8->s32 and vice versa; however, these cases are not a priority. For the tf32/bf16 conversions we will make use of PTX asm instructions that perform these conversions directly.

The tf32 case was not added to this pull request because it requires a change to the joint_matrix_store/joint_matrix_load interface so that loading/storing the accumulator matrix for the tf32 case can be distinguished from the other floating point cases. The tf32 case uses a different instruction to perform these loads/stores, and that instruction currently cannot be distinguished from the other fp32 accumulator load/store cases, since they all use the same matrix dimensions and the fp32 element type. The most straightforward solution from our point of view is to add an additional template parameter to joint_matrix_store/joint_matrix_load corresponding to the "K" dimension that specifies the shapes of the A/B matrices that may be used in an MMA with the accumulator matrix being loaded/stored. In that case joint_matrix_load/joint_matrix_store would take three template parameters instead of two, corresponding to the full set of values that specify the shapes of the A/B/accumulator matrices: M, N, K. How does this change look from your point of view?
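Purely to illustrate the shape of the proposed change (this is not an agreed interface), the accumulator load could gain a K template parameter roughly as follows; the namespace and existing parameters mirror how the extension is used elsewhere in this PR, while the extra K is the hypothetical addition (SYCL headers assumed to be included):

namespace matrix = sycl::ext::oneapi::experimental::matrix;

// Hypothetical declaration sketch: the extra K template parameter would let
// the implementation select the m16n16k8 (tf32) accumulator load/store
// instruction rather than the m16n16k16 one used for bf16/half, even though
// M, N and the fp32 element type are identical in both cases.
template <typename Group, typename T, size_t M, size_t N, size_t K,
          matrix::matrix_layout Layout, sycl::access::address_space Space>
void joint_matrix_load(
    Group sg,
    matrix::joint_matrix<T, matrix::matrix_use::accumulator, M, N, Layout, Group> &res,
    sycl::multi_ptr<T, Space> src, size_t stride);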

The above tf32 implementation issue is our biggest priority at the moment. After this, our other priorities are to implement the tf32/bf16 conversion extension for the Tensor Core backend, which is a WIP, and to plan an implementation of get_slice/get_element to support element-wise operations. We may make use of this proposal https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/LocalMemory/LocalMemory.asciidoc in order to copy memory from a joint_matrix to shared memory within the Nvidia wmma implementation of e.g. get_slice, but this is under investigation.

The only other thing worth mentioning is that we plan to add an implementation for two single-bit MMAs in which the dot product operator is replaced with the bitwise "and" and "xor" operators. In order to have implementations for both operators it would be necessary to overload joint_matrix_mad to take an operator argument (see the sketch below). It is worth noting that the "xor" operator is our initial priority. Do you also plan to support such bitwise MMA operations?
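A hypothetical shape for such an overload, just to make the idea concrete; using SYCL function objects as the operator tag is an assumption on our side, not part of any current proposal, and the declaration is assumed to live inside the extension's matrix namespace:

// Hypothetical overload sketch: a binary-operation tag such as sycl::bit_xor<>
// or sycl::bit_and<> would select whether the single-bit MMA combines elements
// with XOR or AND in place of the usual multiply.
template <typename Group, typename BinaryOperation, typename T1, typename T2,
          size_t M, size_t K, size_t N, matrix_layout LayoutA,
          matrix_layout LayoutB, matrix_layout LayoutC>
joint_matrix<T2, matrix_use::accumulator, M, N, LayoutC, Group>
joint_matrix_mad(Group sg, BinaryOperation op,
                 joint_matrix<T1, matrix_use::a, M, K, LayoutA, Group> &A,
                 joint_matrix<T1, matrix_use::b, K, N, LayoutB, Group> &B,
                 joint_matrix<T2, matrix_use::accumulator, M, N, LayoutC, Group> &C);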

Many thanks and best wishes!

Used fully qualified names.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk marked this pull request as ready for review January 6, 2022 11:24
// m16n16k16
__SYCL_JOINT_MATRIX_OVERLOAD(half, a, 16, 16, int32_t, 8)
__SYCL_JOINT_MATRIX_OVERLOAD(half, b, 16, 16, int32_t, 8)
__SYCL_JOINT_MATRIX_OVERLOAD(uint16_t, a, 16, 16, int32_t, 4)
Contributor

Is bfloat treated here as uint16_t? If yes, add a comment

Contributor Author

Yeah it is, I thought it was safer than unsigned short. I can add a comment. Thanks

Contributor Author

done

__imma_m32n8k16_ld_c(res.data, src.get(), stride,
get_layout_id<Layout>());
}
} else if constexpr (std::is_same<T, float>::value) {
Contributor

what happens if I use matrix A or B with type float? you should also check "use" argument here as well, right?

Contributor Author

@JackAKirk JackAKirk Jan 6, 2022

There isn't a builtin for the case where A or B is of type float.

} else if constexpr (NumRows == 16 && NumCols == 8) {
__mma_bf16_m32n8k16_ld_b(res.data, tileptr, stride,
get_layout_id<Layout>());
}
Contributor

You should add an else statement here: if unsupported sizes are used --> error.
This comment applies to all the types that come after this one.

Contributor Author

Thanks

Contributor Author

In joint_matrix_load_impl, joint_matrix_store_impl, and joint_matrix_mad_impl the user must provide joint_matrix objects constructed with the sycl::sub_group template parameter. As long as I implement cases for all eligible joint_matrix types and ensure there is a 1:1 mapping between eligible joint_matrix types and the constexpr branches/builtin calls within these functions, there is no need for any error cases.

Errors are only possible when constructing a joint_matrix. I looked at the option of providing a default constructor for joint_matrix using the sycl::sub_group template parameter and then using an assert for unsupported instances of joint_matrix (similar to the Intel matrix implementation's use of assert). However, the error message returned by this method was completely ambiguous, so I reverted to the existing behaviour: if a user tries to construct an unsupported joint_matrix, they get an error saying the constructor doesn't exist.
It could be better to use an exception system that provides the user with a simpler and more concise error message such as "invalid joint_matrix"; however, that is not achievable with assert, and I think we should first decide on an error-message system that is consistent across backend implementations.
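A reduced, self-contained sketch of the pattern being described; the stand-in macro below is a simplification of __SYCL_JOINT_MATRIX_OVERLOAD and the fragment layout is invented, but the error behaviour (undefined primary template, specializations only for supported shapes) is the point:

#include <cstddef>
#include <cstdint>

enum class matrix_use { a, b, accumulator };
enum class matrix_layout { row_major, col_major };

// The primary template is declared but never defined, so constructing a
// joint_matrix with an unsupported type/shape fails to compile with an
// "implicit instantiation of undefined template" error at the point of use.
template <typename T, matrix_use Use, std::size_t Rows, std::size_t Cols,
          matrix_layout Layout, typename Group>
struct joint_matrix;

// Simplified stand-in for __SYCL_JOINT_MATRIX_OVERLOAD: each supported
// (type, use, shape) combination gets its own partial specialization.
#define SKETCH_JOINT_MATRIX_OVERLOAD(TYPE, USE, ROWS, COLS, FRAG_TYPE, SIZE)  \
  template <matrix_layout Layout, typename Group>                            \
  struct joint_matrix<TYPE, matrix_use::USE, ROWS, COLS, Layout, Group> {    \
    FRAG_TYPE data[SIZE];                                                     \
  };

SKETCH_JOINT_MATRIX_OVERLOAD(uint16_t, a, 16, 16, int32_t, 4) // bf16 held as uint16_t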

Contributor

What do you mean by ambiguous?
What about something like:

static_assert((NumRows != X && NumCols != Y), "Tensorcores matrix implementation does not support these sizes; refer to ... for supported sizes");

Contributor Author

What do you mean by ambiguous? What about something like: static_assert((NumRows != X && NumCols != Y), "Tensorcores matrix implementation does not support these sizes; refer to ... for supported sizes");

If I use a runtime assert like in matrix-jit.hpp then I get a general CUDA assert exception with many lines of per-thread assert failures that give zero information about the specific joint_matrix failure. An example line is:

:43: block: [0,1,0], thread: [31,0,0] Assertion `` failed.

I tried using the compile-time static_assert that you suggested (thanks for the suggestion, I had not used static_assert before), but a problem with this is that if I use the correct assert boolean the error message will be very long: I'd need to check each of the possible cases enumerated by the "__SYCL_JOINT_MATRIX_OVERLOAD" macro. I could instead have a default constructor for joint_matrix to cover cases not enumerated by the __SYCL_JOINT_MATRIX_OVERLOAD macro, but I would have to use one of the template parameters in the assert to ensure that the correct constructor is picked by the compile-time assert, and then the assert condition would not relate to the actual error, which could be confusing for the user. For example, if I choose to use the Group template parameter I get an error like:

static_assert((std::is_same<Group, sycl::sub_group>::value && false), "Tensor Cores joint_matrix implementation does not support these sizes for the requested data type");

I think that this is worse than the error messages that I get from the existing implementation where the constructor is only implemented for supported cases, which is correct and clear:

joint_matrix_tensorcore.cpp:156:83: error: implicit instantiation of undefined template 'sycl::ext::oneapi::experimental::matrix::joint_matrix<sycl::detail::half_impl::half, sycl::ext::oneapi::experimental::matrix::matrix_use::a, 14, 16, sycl::ext::oneapi::experimental::matrix::matrix_layout::row_major>'
joint_matrix<T1, matrix_use::a, 14, K, matrix_layout::row_major, sycl::sub_group> sub_ze;
                                                                                  ^
joint_matrix_tensorcore.cpp:193:3: note: in instantiation of function template specialization 'test<sycl::detail::half_impl::half, float, 2UL, 4UL, 3UL, 16UL, 16UL, 16UL>' requested here
  test<half, float, SUB_TILES_M, SUB_TILES_K, SUB_TILES_N, 16, 16, 16>();
  ^
/home/jackkirk/jkLLVM/llvm/build/bin/../include/sycl/ext/oneapi/matrix/matrix-tensorcore.hpp:19:8: note: template is declared here
struct joint_matrix;
       ^
1 error generated.

I'm open to suggestions if you think there is a better option. Thanks

Contributor

With the static_assert, you can improve the message to tell the user how to fix it. With the current approach it will be hard to debug, as the user does not know whether they made a typo or the sizes are simply not supported; but this is a problem of C++ metaprogramming, it is hard to debug :)
On the other hand, we state that the query interface is the tool to use to find out what is supported and what is not, so errors are not mandatory here. It is just that, since this is AOT compilation, we can take advantage of compile-time errors in a way we currently cannot with matrix-jit, which uses JIT compilation.

Contributor Author

@JackAKirk JackAKirk Jan 12, 2022

Yes, I was thinking that the query interface is the right place to find out suitable types/shapes. I agree that it would be best to have a short message informing the user that the chosen type/shape is not supported, but with the current implementation I do not know how to do this using static_assert (without making the assert unrelated to the actual error), since it would output a very large assert condition that effectively spits out in one go all of the information that the query interface would provide. I have no strong feelings on this, however, and would be happy to switch to the best possible static_assert for the current implementation if you want me to.

I think that once we are at the point of merging the DPAS/Nvidia backend implementations we should reconsider the error messaging and ensure that both messaging systems (and messages) are consistent and effective.
I personally think that keeping the backends in separate headers is the correct tactic at the moment, provided each keeps the other backend implementation in mind and conforms to the same proposal. So far I think the two implementations could immediately be merged as they are (e.g. #ifdef __NVPTX__ sends joint_matrix_load to joint_matrix_load_impl, while DPAS is sent to the Intel implementation functions). As long as both are constrained to conform closely to the extension proposal, keeping them separate allows a small degree of freedom and experimentation in both backends, in case of unforeseen issues that we encounter from time to time, so that all features of both backends can be supported to the full extent desired before a final universal design is decided upon.

__dmma_m8n8k4_ld_c(res.data, src.get(), stride,
get_layout_id<Layout>());
}
}
Contributor

an else statement is missing here for when an unsupported type is used --> error

Contributor Author

@JackAKirk JackAKirk Jan 6, 2022

See above comment.


#ifdef __NVPTX__
#ifdef __SYCL_DEVICE_ONLY__
__dmma_m8n8k4_st_c_f64(dst.get(), src.data, stride,
get_layout_id<Layout>());
if (NumRows == 16 && NumCols == 16) {
Contributor

same comments as in the load:

  • check use for the float type (when float is used with A or B --> do you error?)
  • else statement for when unsupported sizes are used
  • else statement for when unsupported types are used

Contributor Author

@JackAKirk JackAKirk Jan 6, 2022

See above comment.

sycl::ext::oneapi::experimental::matrix::matrix_layout LayoutB,
sycl::ext::oneapi::experimental::matrix::matrix_layout LayoutC>
struct joint_matrix_mad_impl<
double, double, 8, 4, 8, LayoutA, LayoutB, LayoutC,
T1, T2, M, K, N, LayoutA, LayoutB, LayoutC,
Contributor

You are assuming here that A and B have the same type. Is that always true?
Are there cases where A is unsigned and B is signed, or where they are mixed integer types?

Contributor Author

Yeah, they are always the same type for all possible Nvidia PTX instructions.

__dmma_m8n8k4_mma_f64(D.data, A.data, B.data, C.data,
get_layout_pair_id<LayoutA, LayoutB>(), 0);
if constexpr (M == 16 && N == 16 && K == 16) {
if constexpr (std::is_same<T1, int8_t>::value) {
Contributor

you should check that C is int32 if it matters?

Contributor Author

Yeah, I'll go through and make sure all possible inputs either call the correct builtin or return an appropriate error. Thanks

get_layout_pair_id<LayoutA, LayoutB>(), 0);
} else if constexpr (std::is_same<T1, half>::value) {
__hmma_m16n16k16_mma_f32f32(D.data, A.data, B.data, C.data,
get_layout_pair_id<LayoutA, LayoutB>(), 0);
Contributor

I would have expected this to be called __hmma_m16n16k16_mma_f16f16.
What is the mma function if A and B are f32?

Contributor Author

A and B are always f16: there is no PTX instruction for such a case.

The __hmma_m16n16k16_mma_f16f16 instruction does exist. It corresponds to the case where the "matrix_use::accumulator" matrices C (input) and D (output) are also fp16. That case has not yet been added because it would require a change to the proposal so that joint_matrix_mad takes an additional parameter specifying the precision of the output matrix, which would only be necessary for this mixed precision float case.
There are also the cases __hmma_m16n16k16_mma_f16f32 (C type f16, D type f32) and __hmma_m16n16k16_mma_f32f16.

A more pressing issue regarding the proposal for us is the one that would allow us to make use of the tf32 type:

The tf32 case was not added to this pull request because it requires a change to the joint_matrix_store/joint_matrix_load interface so that loading/storing the accumulator matrix for the tf32 case can be distinguished from the other floating point cases. The tf32 case uses a different instruction to perform these loads/stores, and that instruction currently cannot be distinguished from the other fp32 accumulator load/store cases, since they all use the same matrix dimensions and the fp32 element type. The most straightforward solution from our point of view is to add an additional template parameter to joint_matrix_store/joint_matrix_load corresponding to the "K" dimension that specifies the shapes of the A/B matrices that may be used in an MMA with the accumulator matrix being loaded/stored. In that case joint_matrix_load/joint_matrix_store would take three template parameters instead of two, corresponding to the full set of values that specify the shapes of the A/B/accumulator matrices: M, N, K. How does this change look from your point of view?

Contributor

I think __hmma_m16n16k16_mma_f16f16 can be added with the current interface by checking if C is fp16. If C is fp32, __hmma_m16n16k16_mma_f32f32 is generated.
BTW, you should add a check that T2 is fp32 to line 427.
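A minimal sketch of that dispatch, assuming T1/T2 (the A/B and C/D element types), the fragment .data members, and get_layout_pair_id are in scope as in the surrounding joint_matrix_mad_impl; the branch structure is illustrative only, though the __hmma builtins named are existing NVPTX builtins:

// Illustrative m16n16k16 half case: choose the MMA builtin from the
// accumulator element type T2 instead of assuming fp32 output.
if constexpr (std::is_same<T1, half>::value) {
  if constexpr (std::is_same<T2, float>::value) {
    __hmma_m16n16k16_mma_f32f32(D.data, A.data, B.data, C.data,
                                get_layout_pair_id<LayoutA, LayoutB>(), 0);
  } else if constexpr (std::is_same<T2, half>::value) {
    __hmma_m16n16k16_mma_f16f16(D.data, A.data, B.data, C.data,
                                get_layout_pair_id<LayoutA, LayoutB>(), 0);
  }
}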

Contributor

tf32 is a separate issue. I think the current direction is to introduce a special type like we did for bf16, or something along those lines, so I am not sure whether introducing extra complexity into the matrix interface is the right thing to do at this point.

One immediate workaround could be to treat fp32 as tf32 for storage, the same way we are treating unsigned short as bf16.
We would add a comment about this stating that it is a workaround while the tf32 type is being baked.

What do you think?

Contributor Author

So the accumulator matrix for tf32 is always fp32 already; this isn't the issue. If you look here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-ld
you can see that there is a different instruction for loading the accumulator matrix for the tf32 type compared to loading the accumulator matrix for the bf16 MMA or the mixed precision float MMA. The tf32 MMA uses the following instruction:

wmma.load.c.sync.aligned.layout.m16n16k8{.ss}.f32

whereas the other two cases use:

wmma.load.c.sync.aligned.layout.m16n16k16{.ss}.f32

There is no way to distinguish between these two cases using the current interface of joint_matrix_load, since the values NumRows=16, NumCols=16 and T=f32 are identical in both cases. An obvious resolution is to add a parameter for K=8/16, which is what the CUDA runtime does. Another reason for introducing this extra parameter is that it adds the semantics that joint_matrix_load/joint_matrix_store/joint_matrix_mad should always be used together with matching values of M, N, K; an argument against is that it is more complicated, since there is an additional parameter that isn't used most of the time. I wanted to check with you whether the Intel backend might also have a similar case to implement in the future that would require the additional K parameter, and more generally what you think about adding the K parameter. Otherwise I think we would have to add an additional runtime function for the tf32 type for the CUDA backend, e.g. "joint_matrix_load_tf32_cuda" or similar.

Contributor Author

I think __hmma_m16n16k16_mma_f16f16 can be added with the current interface by checking if C is fp16. If C is fp32, __hmma_m16n16k16_mma_f32f32 is generated. BTW, you should add a check that T2 is fp32 to line 427.

Yeah, I should be able to add __hmma_m16n16k16_mma_f16f16 with the current interface. I wanted to first check whether you also have any cases where the output matrix "D" in joint_matrix_mad differs from the "C" matrix, similar to e.g.:
__hmma_m16n16k16_mma_f16f32 or
__hmma_m16n16k16_mma_f32f16

If you do, then it might make sense to update the interface accordingly at this point. If not, since it is not really a priority for us to support these cases, I can just add the __hmma_m16n16k16_mma_f16f16 case using the current joint_matrix_mad interface, which assumes the types of "C" and "D" are both "T2". As I mentioned earlier, it is always the case for CUDA that matrices "A" and "B" are "T1", so that part is fine for us.

auto accD = bufD.get_access<access::mode::read_write>(cgh);

cgh.parallel_for<class row_row_m16n16k16>(
nd_range<2>({1, 32}, {1, 32}), [=
Contributor

What is the size of the original buffer?
What does the 32 in the second dimension of the global range represent?

Contributor Author

This is just a device-code test that doesn't execute, so I don't worry about the size of the buffer; I only care that it has the correct type. It is for this reason that I tried to make the test as simple as possible: the purpose is to check that the runtime calls the correct builtin for each case. Also check out the integration tests here: intel/llvm-test-suite#650

The 32 in the second dimension of the global range is used because I only use one sub-group: it carries the same meaning as the second dimension of the local range, namely the number of nd_items per sub-group required for the matrix operations on the CUDA backend.

Constrained joint_matrix_XX_impl functions to take a joint_matrix constructed from sycl::sub_group template parameter.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk
Contributor Author

JackAKirk commented Jan 10, 2022

I have added the missing (A/B/Accumulator = half) cases.

Tests have been updated here: intel/llvm-test-suite#650

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@bader
Contributor

bader commented Jan 12, 2022

/verify with intel/llvm-test-suite#650

dkhaldi
dkhaldi previously approved these changes Jan 12, 2022
Contributor

@dkhaldi dkhaldi left a comment

LGTM

@JackAKirk
Contributor Author

ping @bader

The checks failed due to an unrelated sporadic test failure that has been temporarily disabled: intel/llvm-test-suite#725. We will be investigating this sporadic test failure.

Is it possible to rerun the tests and merge this PR? The tests should now pass. It would be useful to merge this because I have a follow-on PR ready.

Thanks

@bader
Contributor

bader commented Jan 18, 2022

Is it possible to rerun the tests

If you click on "Details" link next to failed GitHub Action - https://github.com/intel/llvm/runs/4781102824?check_suite_focus=true, there should be a button "Rerun all jobs" in the top right corner of the page. Do you have it?

@bader
Contributor

bader commented Jan 18, 2022

Is it possible to merge this PR?

We need an approval from @intel/llvm-reviewers-runtime to merge.

@JackAKirk
Contributor Author

Is it possible to rerun the tests

If you click on "Details" link next to failed GitHub Action - https://github.com/intel/llvm/runs/4781102824?check_suite_focus=true, there should be a button "Rerun all jobs" in the top right corner of the page. Do you have it?

I do not have this button on the intel/llvm repo. I imagine that I do not have the rights for it.

@JackAKirk
Contributor Author

Is it possible to merge this PR?

We need an approval from @intel/llvm-reviewers-runtime to merge.

OK

@bader
Contributor

bader commented Jan 18, 2022

Is it possible to rerun the tests

If you click on "Details" link next to failed GitHub Action - https://github.com/intel/llvm/runs/4781102824?check_suite_focus=true, there should be a button "Rerun all jobs" in the top right corner of the page. Do you have it?

I do not have this button on the intel/llvm repo. I imagine that I do not have the rights for it.

Thanks for letting me know. I've started the tests. I'll check with @alexbatashev what rights are needed to access this button.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk
Contributor Author

Is it possible to rerun the tests

If you click on "Details" link next to failed GitHub Action - https://github.com/intel/llvm/runs/4781102824?check_suite_focus=true, there should be a button "Rerun all jobs" in the top right corner of the page. Do you have it?

I do not have this button on the intel/llvm repo. I imagine that I do not have the rights for it.

Thanks for letting me know. I've started the tests. I'll check with @alexbatashev what rights are needed to access this button.

OK thanks.

steffenlarsen
steffenlarsen previously approved these changes Jan 19, 2022
Contributor

@steffenlarsen steffenlarsen left a comment

LGTM!

@JackAKirk
Contributor Author

JackAKirk commented Jan 20, 2022

Hi @maximdimakov

We have had a couple of occasions where https://github.com/intel/llvm-test-suite/blob/intel/SYCL/Regression/event_destruction.cpp has timed out on the CI for PRs (see the failed test in the HIP AMD Test Suite job above; I think it was the HIP backend in both cases, incidentally). This test does take quite a long time to execute, although I am surprised that it times out. Perhaps a solution is simply to reduce ITERS, if this is possible while maintaining the test's integrity?

Thanks

@bader
Contributor

bader commented Jan 20, 2022

+1. llvm-test-suite probably is not the best place to put stress tests like https://github.com/intel/llvm-test-suite/blob/intel/SYCL/Regression/event_destruction.cpp.

@@ -1,8 +1,5 @@
#pragma once
Contributor

Please, add a comment with license reference. Similar to this one:

//==---------------- sycl.hpp - SYCL standard header file ------------------==//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

Contributor Author

Ah yes, thank you for pointing this out. I noticed that the corresponding reference in matrix-jit.hpp is labelled matrix.hpp and matrix-aot-amx.hpp is labelled matrix-amx.hpp. Should I correct these also while I am at it?

Contributor

Yes, please.

Contributor Author

Done

Contributor

Thank you for working on this!

@bader
Contributor

bader commented Jan 20, 2022

@JackAKirk, Maxim is disabling the test in intel/llvm-test-suite#755, so let's ignore it for now.
Please, add a comment with license reference and this should be ready to merge.
Thanks!

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>