
[SYCL][CUDA][MATRIX][DOC] Tensorcore Matrix extension proposal #4695

Closed
wants to merge 17 commits

Conversation


@JackAKirk (Contributor) commented Oct 4, 2021

This proposal is based on the existing AMX proposal, making small adaptations to accommodate the Nvidia Tensorcore hardware.
The intention is that this proposal should be compatible with both architectures (AMX and Tensorcore).

The corresponding implementation is #4696.

Integration tests are here: intel/llvm-test-suite#760

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

@dkhaldi (Contributor) left a comment

We should find a way to blend this with the existing matrix doc.
Note that the existing one is out of date.
I will update it with the changes we recently made to enable DPAS on Intel GPUs. This is not really an AMX proposal but a unified matrix proposal: the implementation now supports both AMX and DPAS, plus the one you are adding here for Nvidia tensor cores.

We should have one file that has the interface. Then we can add sections (or new files) that talk about implementation restrictions for each hardware (AMX, DPAS, TensorCores).

@JackAKirk (Contributor, Author) commented Oct 5, 2021

> We should find a way to blend this with the existing matrix doc. Note that the existing one is out of date. I will update it with the changes we recently made to enable DPAS on Intel GPUs. This is not really an AMX proposal but a unified matrix proposal: the implementation now supports both AMX and DPAS, plus the one you are adding here for Nvidia tensor cores.
>
> We should have one file that has the interface. Then we can add sections (or new files) that talk about implementation restrictions for each hardware (AMX, DPAS, TensorCores).

This sounds fine. It could be useful to keep the header files separate for the time being to allow for experimental changes that only affect one backend. However, I don't think that any functional changes to the `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, or `joint_matrix_mad` interfaces will be necessary on the Tensorcore side as more cases are implemented, so a single header should be fine there. We can wait for #4707 to be merged before adding any missing changes that are included in the doc from this PR.

@steffenlarsen (Contributor) left a comment

Great work! Minor suggestion for when it is moved.


A(double, 8x4, row_major/col_major), B(double, 4x8, row_major/col_major), C(double, 8x8, row_major/col_major)

In order to deal with the different cases, we use partial specialization of the various template functions introduced by the extension. LLVM builtins are available for all possible matrix shapes, and runtime implementations covering these cases will be added progressively.
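As a rough illustration of this pattern, here is a minimal sketch of how partial specialization can gate the supported combinations; the `joint_matrix_load_impl` name, the stand-in `matrix_layout` enum, and the dispatch comment are assumptions, not the actual implementation:

```c++
#include <cstddef>

// Minimal stand-in for the extension's layout enum (sketch only).
enum class matrix_layout { row_major, col_major, packed_a, packed_b };

// Primary template: deliberately left undefined, so instantiating an
// unsupported type/shape/layout combination is a compile-time error.
template <typename T, std::size_t Rows, std::size_t Cols, matrix_layout Layout>
struct joint_matrix_load_impl;

// Specialization for one supported case: the 8x4 double A matrix.
template <>
struct joint_matrix_load_impl<double, 8, 4, matrix_layout::row_major> {
  static void load(/* group, fragment, pointer, stride */) {
    // Dispatch to the corresponding LLVM/NVPTX builtin would go here.
  }
};
```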
Reviewer comment:

Since the tensorcore side of the matrix extension is likely to stay somewhat restrictive in matrix type/shape/layout combinations, I think it would make sense to make the overview of currently supported type/shape/layout combinations more extendable. Maybe a table for `joint_matrix_store`, `joint_matrix_load`, and `joint_matrix_mad`?

@JackAKirk (Contributor, Author) commented Oct 19, 2021

Yeah, this is a good idea, thanks! I'll add it when I merge the changes from this proposal doc into the existing proposal.

romanovvlad pushed a commit that referenced this pull request Nov 8, 2021
…4696)

Initial Implementation based on the new matrix extension
supporting Nvidia Tensorcore, #4695, that is adapted from
the AMX matrix extension.
Only double data type matrix elements are initially supported.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Added Nvidia compilation instructions.
Other clarifications added on the distinction between Nvidia and AMX use cases.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS.

This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS or to the matrix "fragments" for Nvidia Tensorcore. Note that `Layout` is not included as an argument since it may be determined from the joint_matrix argument.
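A hedged usage sketch of the Tensorcore form described here, where use and layout are carried by the `joint_matrix` type rather than passed to the load; `sg`, `accA`, and `stride` are illustrative names:

```c++
using namespace sycl::ext::oneapi::experimental::matrix;

// The use (a) and layout (row_major) are part of the fragment's type, so
// joint_matrix_load needs no separate Layout argument.
joint_matrix<double, matrix_use::a, 8, 4, matrix_layout::row_major> sub_a;
joint_matrix_load(sg, sub_a, accA.get_pointer(), stride);
```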
Author comment:

Is it also acceptable for the other backends to determine matrix layout from joint_matrix instead of having an extra argument to joint_matrix_load/joint_matrix_store for matrix_layout Layout?

Reviewer comment:

+This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS/Nvidia tensorcores

Fragment is not the hardware name, right? Can you just refer to them as registers?
If not, you can still merge things as something like this:

+This function loads data from memory to the 2d tiles/registers/fragments of Intel AMX/DPAS/Nvidia tensorcores, respectively.

The comment "Note that `Layout` is not included as an argument since it may be determined from the joint_matrix argument." should be added in the tensorcores subsection that adds the `use` argument.

@JackAKirk (Contributor, Author) commented Nov 9, 2021

@dkhaldi I've added `matrix_use` to the proposal, as well as some minimal Nvidia-specific information where necessary. I've removed the separate Nvidia-based matrix proposal document.

@dkhaldi (Contributor) left a comment

These are a lot of comments. :)
If anything is not clear, we can have a meeting to go over them.

@@ -33,11 +33,11 @@ SYCL specification refer to that revision.

**_NOTE:_** _This document describes the current design and API for the matrix
extension to {dpcpp}. This is an initial experimental version to try out functionality
and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS. We are going to work with the community on incrementally improving
and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX), DPAS, and Nvidia Tensorcore. We are going to work with the community on incrementally improving
Reviewer comment:

NVIDIA®

size_t Rows = sycl::dynamic_extent,
size_t Cols = sycl::dynamic_extent,
matrix_layout Layout = matrix_layout::row_major,
typename Group, typename Cond = void>
Reviewer comment:

While this `Cond` argument was handy for the implementation, it should not be part of the interface exposed to the user.


```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent,
matrix_layout Layout = matrix_layout::row_major, typename Group = sub_group>
template <typename T, matrix_use MT,
```
Reviewer comment:

I suggest you revert to the original definition.

Add a new subsection here called something like "Additional `use` argument for the tensorcore case". Put this new definition there, along with the reason behind the addition of this argument. You can also specify that the long-term plan is to incorporate the `use` argument for other TPUs (AMX and DPAS) as well.
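For reference, a hedged reconstruction of the tensorcore-side declaration assembled from the diff fragments in this review; the `matrix_use` enumerator names and the parameter defaults are assumptions:

```c++
namespace sycl::ext::oneapi::experimental::matrix {

// Assumed enumerators distinguishing the A, B, and accumulator matrices.
enum class matrix_use { a, b, accumulator };

template <typename T, matrix_use MT,
          size_t Rows = sycl::dynamic_extent,
          size_t Cols = sycl::dynamic_extent,
          matrix_layout Layout = matrix_layout::row_major,
          typename Group = sycl::sub_group>
struct joint_matrix;

} // namespace sycl::ext::oneapi::experimental::matrix
```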


The final optional conditional argument can be used to remove cases for template parameter values which are incompatible with a particular backend.
For example, the Nvidia Tensorcore backend does not allow usage of `matrix_layout::packed_a` or `matrix_layout::packed_b`.
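A simplified, hedged sketch of the SFINAE mechanism such a trailing conditional parameter enables (the `Group` parameter is omitted and the `matrix_layout` enum is a stand-in):

```c++
#include <cstddef>
#include <type_traits>

enum class matrix_layout { row_major, col_major, packed_a, packed_b };

// Primary template: Cond defaults to void and is never supplied by the user.
template <typename T, std::size_t Rows, std::size_t Cols,
          matrix_layout Layout = matrix_layout::row_major,
          typename Cond = void>
struct joint_matrix;

// Partial specialization that only matches when the layout is not one of the
// packed layouts, so packed_a/packed_b have no definition on this backend.
template <typename T, std::size_t Rows, std::size_t Cols, matrix_layout Layout>
struct joint_matrix<T, Rows, Cols, Layout,
                    std::enable_if_t<Layout != matrix_layout::packed_a &&
                                     Layout != matrix_layout::packed_b>> {
  // Fragment storage would go here.
};
```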

Reviewer comment:

Remove this, as it should be part of the implementation code, not the spec.


IMPORTANT: In the current implementation, only the subgroup scope is supported
IMPORTANT: In the current implementation, only the subgroup scope is supported. For Nvidia Tensorcore only the subgroup scope is supported.
Reviewer comment:

Since this is the case for the three current use cases, change that to something like:
In the current implementations of Intel AMX, Intel DPAS, and Nvidia tensorcores, only the subgroup scope is supported.

@@ -235,6 +265,16 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
}).wait();
```

## Nvidia Compilation instructions
Reviewer comment:

Please add this section as an IMPORTANT note under the example above.

- Change the names of the default sizes in the query from defaultM, defaultN, defaultK to M, N, K
- Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes, like workgroup, besides subgroup
- Add a more realistic and complete example that shows the value of the general query
- Clarify USM compatibility
Reviewer comment:

Tensorcores:

- Clarify USM compatibility
- Add support for other combinations, the query interface, and consider how the future-looking API can be added here.

@@ -583,13 +621,15 @@ We did not utilize this extension for this matrix API version because sub-group
- Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?"

- In the future looking APIs, `get_wi_slice` (that is currently under design) returns an owned object. Should this return a view object instead, to make sure the original matrix C is changed after its slices are modified?

- multi_ptr can be constructed from T* since https://github.com/intel/llvm/pull/1183. However, currently this cannot be used with USM in all cases.
It is expected that eventually the `joint_matrix_load` and `joint_matrix_store` interfaces will be fully compatible with USM.
## TODO List
Reviewer comment:

Add the matrix use TODO:
AMX and DPAS:

- Add a `matrix_use` parameter to the matrix to distinguish between matrix A, B, and the matrix accumulator. This is necessary for supporting VNNI and the transpose transform.

@@ -599,4 +639,5 @@ We did not utilize this extension for this matrix API version because sub-group
|Rev |Date |Author |Changes
|1 |2021-04-13 |Dounia Khaldi |Initial public working draft.
|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS
|3 |2021-11-08 |Jack Kirk |Added matrix_use
Reviewer comment:

Initial AOT use case on Nvidia tensorcores

@JackAKirk (Contributor, Author) commented Nov 12, 2021

> These are a lot of comments :) If anything is not clear, we can have a meeting to go over the comments

Thanks. Regarding USM, I am surprised that it is fully working for the other backends, since I thought there was a general issue with casting multi_ptr to T*. Here are my notes on testing USM with the Nvidia matrix backend:

multi_ptr can be constructed from T* since #1183. However, currently this cannot be used with USM in all cases. It is expected that eventually the `joint_matrix_load` and `joint_matrix_store` interfaces will be fully compatible with USM. Currently, USM has only been validated to work with this Nvidia backend matrix extension for a single case: using shared USM pointers by casting them to the global address space in the following way:

```c++
joint_matrix_load(sg, sub_c, global_ptr<double>(d_C) + (m * M) * BIG_N + n * N,
                  STRIDE_C);
```

where d_C is a shared USM pointer, e.g.:

```c++
double *d_C = malloc_shared<double>(size, queue);
```

However, even this case is not reliable and requires more testing.
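For concreteness, here is a self-contained sketch of that single validated pattern; it is only a sketch, and the sizes, ranges, and the 8x8 accumulator shape are illustrative assumptions:

```c++
#include <sycl/sycl.hpp>
using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;

// Illustrative sizes: a BIG_M x BIG_N accumulator tiled into MxN fragments.
constexpr size_t M = 8, N = 8, BIG_M = 64, BIG_N = 64, STRIDE_C = BIG_N;

int main() {
  queue q;
  double *d_C = malloc_shared<double>(BIG_M * BIG_N, q); // shared USM

  // One subgroup of 32 work-items per fragment tile.
  q.parallel_for(nd_range<2>({BIG_M / M, (BIG_N / N) * 32}, {1, 32}),
                 [=](nd_item<2> item) {
    sub_group sg = item.get_sub_group();
    const size_t m = item.get_group(0);
    const size_t n = item.get_group(1);

    joint_matrix<double, matrix_use::accumulator, M, N> sub_c;
    // The shared USM pointer is cast to the global address space before use.
    joint_matrix_load(sg, sub_c,
                      global_ptr<double>(d_C) + (m * M) * BIG_N + n * N,
                      STRIDE_C);
  }).wait();

  free(d_C, q);
}
```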

@JackAKirk (Contributor, Author) commented

> These are a lot of comments :) If anything is not clear, we can have a meeting to go over the comments

Thanks for all the comments. Hopefully I have addressed them now.

Unnecessary template parameters are removed.

Although only one "shape" (m8n8k128) is currently supported, the shape template parameters are retained so that future shapes supported by the hardware can be implemented.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@keryell (Contributor) commented Apr 20, 2022

I have recently watched "SIMD in C++20 - EVE of a new Era" (Joël Falcou, CPPP 2021), about https://github.com/jfalcou/eve by @jfalcou, and there are a lot of interesting design ideas there which might be useful in some SYCL SIMD 1D & 2D extensions.
Just adding this here since I do not know where else to mention it. :-)

@github-actions bot added the Stale label Oct 18, 2022
@github-actions bot closed this Nov 18, 2022