
[SYCL][CUDA][MATRIX][DOC] Tensorcore Matrix extension proposal #4695

Closed
wants to merge 17 commits

Conversation


@JackAKirk (Contributor) commented Oct 4, 2021

This proposal is based on the existing AMX proposal, making small adaptations to accommodate the Nvidia Tensorcore hardware.
The intention is that this proposal should be compatible with both architectures (AMX and Tensorcore).

The corresponding implementation is #4696.

Integration tests are here: intel/llvm-test-suite#760

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

@dkhaldi (Contributor) left a comment

We should find a way to blend this with the existing matrix doc.
Note that the existing one is out of date.
I will update it with the changes we recently made to enable DPAS on Intel GPUs. This is not really an AMX proposal but a unified matrix proposal: the implementation now supports both AMX and DPAS, plus the one you are adding here for Nvidia tensor cores.

We should have one file that has the interface. Then we can add sections (or new files) that talk about implementation restrictions for each hardware (AMX, DPAS, TensorCores).

@JackAKirk (Contributor, Author) commented Oct 5, 2021

> We should find a way to blend this with the existing matrix doc. Note that the existing one is out of date. I will update it with the changes we recently made to enable DPAS on Intel GPUs. This is not really an AMX proposal but a unified matrix proposal: the implementation now supports both AMX and DPAS, plus the one you are adding here for Nvidia tensor cores.
>
> We should have one file that has the interface. Then we can add sections (or new files) that talk about implementation restrictions for each hardware (AMX, DPAS, TensorCores).

This sounds fine. It could be useful to keep the header files separate for the time being to allow for experimental changes that only affect one backend. However, I don't think that any functional changes to the `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, or `joint_matrix_mad` interfaces will be necessary on the Tensorcore side as more cases are implemented, so a single header should be fine there. We can wait for #4707 to be merged before adding any missing changes that are included in the doc from this PR.

@steffenlarsen (Contributor) left a comment

Great work! Minor suggestion for when it is moved.


A(double, 8x4, row_major/col_major), B(double, 4x8, row_major/col_major), C(double, 8x8, row_major/col_major)

In order to deal with the different cases, we use partial specialization of the various template functions introduced by the extension. LLVM builtins are available for all possible matrix shapes, and runtime implementations covering these cases will be added progressively.
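As a rough illustration of this pattern, here is a minimal sketch of how partial specialization can gate the supported combinations; the `joint_matrix_load_impl` name, the stand-in `matrix_layout` enum, and the dispatch comment are assumptions, not the actual implementation:

```c++
#include <cstddef>

// Minimal stand-in for the extension's layout enum (sketch only).
enum class matrix_layout { row_major, col_major, packed_a, packed_b };

// Primary template: deliberately left undefined, so instantiating an
// unsupported type/shape/layout combination is a compile-time error.
template <typename T, std::size_t Rows, std::size_t Cols, matrix_layout Layout>
struct joint_matrix_load_impl;

// Specialization for one supported case: the 8x4 double A matrix.
template <>
struct joint_matrix_load_impl<double, 8, 4, matrix_layout::row_major> {
  static void load(/* group, fragment, pointer, stride */) {
    // Dispatch to the corresponding LLVM/NVPTX builtin would go here.
  }
};
```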
Reviewer comment:

Since the tensorcore side of the matrix extension is likely to stay somewhat restrictive in matrix type/shape/layout combinations, I think it would make sense to make the overview of currently supported type/shape/layout combinations more extendable. Maybe a table for `joint_matrix_store`, `joint_matrix_load`, and `joint_matrix_mad`?

@JackAKirk (Contributor, Author) commented Oct 19, 2021

Yeah, this is a good idea, thanks! I'll add it when I merge the changes from this proposal doc into the existing proposal.

romanovvlad pushed a commit that referenced this pull request Nov 8, 2021
…4696)

Initial Implementation based on the new matrix extension
supporting Nvidia Tensorcore, #4695, that is adapted from
the AMX matrix extension.
Only double data type matrix elements are initially supported.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Added Nvidia compilation instructions.
Other clarifications added on the distinction between Nvidia and AMX use cases.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS.

This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS or to the matrix "fragments" for Nvidia Tensorcore. Note that `Layout` is not included as an argument since it may be determined from the joint_matrix argument.
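A hedged usage sketch of the Tensorcore form described here, where use and layout are carried by the `joint_matrix` type rather than passed to the load; `sg`, `accA`, and `stride` are illustrative names:

```c++
using namespace sycl::ext::oneapi::experimental::matrix;

// The use (a) and layout (row_major) are part of the fragment's type, so
// joint_matrix_load needs no separate Layout argument.
joint_matrix<double, matrix_use::a, 8, 4, matrix_layout::row_major> sub_a;
joint_matrix_load(sg, sub_a, accA.get_pointer(), stride);
```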
Author comment:

Is it also acceptable for the other backends to determine matrix layout from joint_matrix instead of having an extra argument to joint_matrix_load/joint_matrix_store for matrix_layout Layout?

Reviewer comment:

+This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS/Nvidia tensorcores

Fragment is not the hardware name, right? Can you just refer to them as registers?
If not, you can still merge things as something like this:

+This function loads data from memory to the 2d tiles/registers/fragments of Intel AMX/DPAS/Nvidia tensorcores, respectively.

The comment "Note that `Layout` is not included as an argument since it may be determined from the joint_matrix argument." should be added in the tensorcores subsection that adds the `use` argument.

@JackAKirk (Contributor, Author) commented Nov 9, 2021

@dkhaldi I've added `matrix_use` to the proposal, as well as some minimal Nvidia-specific information where necessary. I've removed the separate Nvidia-based matrix proposal document.

@dkhaldi (Contributor) left a comment

These are a lot of comments. :)
If anything is not clear, we can have a meeting to go over them.

@@ -33,11 +33,11 @@ SYCL specification refer to that revision.

**_NOTE:_** _This document describes the current design and API for the matrix
extension to {dpcpp}. This is an initial experimental version to try out functionality
and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS. We are going to work with the community on incrementally improving
and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX), DPAS, and Nvidia Tensorcore. We are going to work with the community on incrementally improving
Reviewer comment:

NVIDIA®

size_t Rows = sycl::dynamic_extent,
size_t Cols = sycl::dynamic_extent,
matrix_layout Layout = matrix_layout::row_major,
typename Group, typename Cond = void>
Reviewer comment:

While this `Cond` argument was handy for the implementation, it should not be part of the interface exposed to the user.


```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent,
matrix_layout Layout = matrix_layout::row_major, typename Group = sub_group>
template <typename T, matrix_use MT,
```
Reviewer comment:

I suggest you revert to the original definition.

Add a new subsection here called something like "Additional `use` argument for the tensorcore case". Put this new definition there, along with the reason behind the addition of this argument. You can also specify that the long-term plan is to incorporate the `use` argument for other TPUs (AMX and DPAS) as well.
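For reference, a hedged reconstruction of the tensorcore-side declaration assembled from the diff fragments in this review; the `matrix_use` enumerator names and the parameter defaults are assumptions:

```c++
namespace sycl::ext::oneapi::experimental::matrix {

// Assumed enumerators distinguishing the A, B, and accumulator matrices.
enum class matrix_use { a, b, accumulator };

template <typename T, matrix_use MT,
          size_t Rows = sycl::dynamic_extent,
          size_t Cols = sycl::dynamic_extent,
          matrix_layout Layout = matrix_layout::row_major,
          typename Group = sycl::sub_group>
struct joint_matrix;

} // namespace sycl::ext::oneapi::experimental::matrix
```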


The final optional conditional argument can be used to remove cases for template parameter values which are incompatible with a particular backend.
For example, the Nvidia Tensorcore backend does not allow usage of `matrix_layout::packed_a` or `matrix_layout::packed_b`.
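A simplified, hedged sketch of the SFINAE mechanism such a trailing conditional parameter enables (the `Group` parameter is omitted and the `matrix_layout` enum is a stand-in):

```c++
#include <cstddef>
#include <type_traits>

enum class matrix_layout { row_major, col_major, packed_a, packed_b };

// Primary template: Cond defaults to void and is never supplied by the user.
template <typename T, std::size_t Rows, std::size_t Cols,
          matrix_layout Layout = matrix_layout::row_major,
          typename Cond = void>
struct joint_matrix;

// Partial specialization that only matches when the layout is not one of the
// packed layouts, so packed_a/packed_b have no definition on this backend.
template <typename T, std::size_t Rows, std::size_t Cols, matrix_layout Layout>
struct joint_matrix<T, Rows, Cols, Layout,
                    std::enable_if_t<Layout != matrix_layout::packed_a &&
                                     Layout != matrix_layout::packed_b>> {
  // Fragment storage would go here.
};
```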

Reviewer comment:

Remove this, as it should be part of the implementation code, not the spec.


IMPORTANT: In the current implementation, only the subgroup scope is supported
IMPORTANT: In the current implementation, only the subgroup scope is supported. For Nvidia Tensorcore only the subgroup scope is supported.
Reviewer comment:

Since this is the case for the three current use cases, change that to something like:
In the current implementations of Intel AMX, Intel DPAS, and Nvidia tensorcores, only the subgroup scope is supported.

@@ -235,6 +265,16 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
}).wait();
```

## Nvidia Compilation instructions
Reviewer comment:

Please add this section as an IMPORTANT note under the example above.

- Change the names of the default sizes in the query from defaultM, defaultN, defaultK to M, N, K
- Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes, like workgroup, besides subgroup
- Add a more realistic and complete example that shows the value of the general query
- Clarify USM compatibility
Reviewer comment:

Tensorcores:

- Clarify USM compatibility
- Add support for other combinations, the query interface, and consider how the future-looking API can be added here.

@@ -583,13 +621,15 @@ We did not utilize this extension for this matrix API version because sub-group
- Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?"

- In the future looking APIs, `get_wi_slice` (that is currently under design) returns an owned object. Should this return a view object instead, to make sure the original matrix C is changed after its slices are modified?

- multi_ptr can be constructed from T* since https://github.com/intel/llvm/pull/1183. However, currently this cannot be used with USM in all cases.
It is expected that eventually the `joint_matrix_load` and `joint_matrix_store` interfaces will be fully compatible with USM.
## TODO List
Reviewer comment:

Add the matrix use TODO:
AMX and DPAS:

- Add a `matrix_use` parameter to the matrix to distinguish between matrix A, B, and the matrix accumulator. This is necessary for supporting VNNI and the transpose transform.

@@ -599,4 +639,5 @@ We did not utilize this extension for this matrix API version because sub-group
|Rev |Date |Author |Changes
|1 |2021-04-13 |Dounia Khaldi |Initial public working draft.
|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS
|3 |2021-11-08 |Jack Kirk |Added matrix_use
Reviewer comment:

Initial AOT use case on Nvidia tensorcores

@JackAKirk (Contributor, Author) commented Nov 12, 2021

> These are a lot of comments :) If anything is not clear, we can have a meeting to go over the comments

Thanks. Regarding USM, I am surprised that it is fully working for the other backends, since I thought there was a general issue with casting multi_ptr to T*. Here are my notes on testing USM with the Nvidia matrix backend:

multi_ptr can be constructed from T* since #1183. However, currently this cannot be used with USM in all cases. It is expected that eventually the `joint_matrix_load` and `joint_matrix_store` interfaces will be fully compatible with USM. Currently, USM has only been validated to work with this Nvidia backend matrix extension for a single case: using shared USM pointers by casting them to the global address space in the following way:

```c++
joint_matrix_load(sg, sub_c, global_ptr<double>(d_C) + (m * M) * BIG_N + n * N,
                  STRIDE_C);
```

where d_C is a shared USM pointer, e.g.:

```c++
double *d_C = malloc_shared<double>(size, queue);
```

However, even this case is not reliable and requires more testing.
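For concreteness, here is a self-contained sketch of that single validated pattern; it is only a sketch, and the sizes, ranges, and the 8x8 accumulator shape are illustrative assumptions:

```c++
#include <sycl/sycl.hpp>
using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;

// Illustrative sizes: a BIG_M x BIG_N accumulator tiled into MxN fragments.
constexpr size_t M = 8, N = 8, BIG_M = 64, BIG_N = 64, STRIDE_C = BIG_N;

int main() {
  queue q;
  double *d_C = malloc_shared<double>(BIG_M * BIG_N, q); // shared USM

  // One subgroup of 32 work-items per fragment tile.
  q.parallel_for(nd_range<2>({BIG_M / M, (BIG_N / N) * 32}, {1, 32}),
                 [=](nd_item<2> item) {
    sub_group sg = item.get_sub_group();
    const size_t m = item.get_group(0);
    const size_t n = item.get_group(1);

    joint_matrix<double, matrix_use::accumulator, M, N> sub_c;
    // The shared USM pointer is cast to the global address space before use.
    joint_matrix_load(sg, sub_c,
                      global_ptr<double>(d_C) + (m * M) * BIG_N + n * N,
                      STRIDE_C);
  }).wait();

  free(d_C, q);
}
```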

@JackAKirk (Contributor, Author) commented

> These are a lot of comments :) If anything is not clear, we can have a meeting to go over the comments

Thanks for all the comments. Hopefully I have addressed them now.

Unnecessary template parameters are removed.

Although only one "shape" (m8n8k128) is currently supported, the shape template parameters are retained so that future shapes supported by the hardware can be implemented.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@keryell (Contributor) commented Apr 20, 2022

I have recently watched "SIMD in C++20 - EVE of a new Era" (Joël Falcou, CPPP 2021), about https://github.com/jfalcou/eve by @jfalcou, and there are a lot of interesting design ideas there which might be useful in some SYCL SIMD 1D & 2D extensions.
Just adding this here since I do not know where else to mention it. :-)

@github-actions bot added the Stale label Oct 18, 2022
@github-actions bot closed this Nov 18, 2022