[hat] Initial support for Tensor APIs and tensor operations on GPUs #998
jjfumero wants to merge 63 commits into openjdk:code-reflection from jjfumero:hat/tensors/portable
Conversation
👋 Welcome back jfumero! A progress list of the required criteria for merging this PR into code-reflection will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@jjfumero This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the code-reflection branch. ➡️ To integrate this PR with the above commit message to the code-reflection branch, type /integrate in a new comment.
@jjfumero this pull request can not be integrated into code-reflection due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, run the following commands in your personal fork:
git checkout hat/tensors/portable
git fetch https://git.openjdk.org/babylon.git code-reflection
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge code-reflection"
git push
super(that, cc);
}

public static final class TensorVarOp extends HATTensorOp implements VarLikeOp, StatementLikeOp {
Why do you need this operation and the corresponding load and store operations?
The TensorVarOp is actually a view. Maybe a better name would be TensorView. My reasoning is that views do not follow the same semantics as normal Java varOps.
- A tensor view declaration can be moved outside of the current scope, and this depends on the backend used (e.g., OpenCL). For example, we could store a tensor, but the tensor, depending on the low-level programming model used, would need to be declared in an outer scope. Providing a domain-specific op for this facilitates analysis, code motion, lowering and code generation.
- Tensor views can represent a view from global memory, but also from shared memory (shared across multiple threads). Thus, the semantics are different compared to the Java VarOp.
Other programming frameworks, such as MLIR for GPUs, also introduce a dialect that includes tensor initialisation/views: https://mlir.llvm.org/docs/Dialects/GPU/#gpucreate_dn_tensor-gpucreatedntensorop
Regarding loads/stores: the way we process these per backend is very different. For the CUDA backend it is straightforward, generating code via a set of intrinsics and templates. However, for the OpenCL backend, we have mapped them to explicit tiling, reconstructing loops and new control flow. This would be better handled by a lowering phase for OpenCL devices. In fact, within the OpenCL ecosystem, we might have different lowered code. Having an Op to represent these operations also facilitates code analysis and code generation.
Similar works:
I think whether we call this a View (or not) or an Op is irrelevant.
Can't normal varOps also be moved around in the manner you describe? Why is a TensorVarOp different in this regard? I don't get this point.
If we consider a Tensor as just another type (possibly exotic in nature, which can reside in global, local, shared, or private memory), I think we will realize that values of this 'type' can be assigned to an existing varOp.
We might need to do special things in codegen (by extracting the type), but we should not need a varOp specifically for this.
We don't have specific varOps for primitives, or Objects, or Arrays, or records. If we did, we would have an explosion of varOp types.
We really don't need a special Op for this; the same goes for Vectors, Tiles, Matrices, etc.
The same argument applies, BTW, to arithmetic operations performed on types. We also should not specialize arithmetic Ops based on the types of operands. We may produce the Add op as a result of a transformation on an invoke/method call in the original code, add(vec1, vec2), instead of as a result of a Java operator, someInt + someOtherInt, but the arithmetic add operation conveys all the information we require.
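As a hedged illustration of that last point: a copy transform could rewrite an invoke of add(vec1, vec2) into the generic arithmetic add op. The JavaOp.add factory name and the transform idiom below are assumptions based on how the code-reflection API is used elsewhere in this thread, not a confirmed recipe.
CoreOp.FuncOp transformed = funcOp.transform((block, op) -> {
    if (op instanceof JavaOp.InvokeOp inv && inv.invokeDescriptor().name().equals("add")) {
        var operands = block.context().getValues(inv.operands());
        Op.Result sum = block.op(JavaOp.add(operands.get(0), operands.get(1)));
        block.context().mapValue(inv.result(), sum); // uses of the invoke now see the add
    } else {
        block.op(op); // copy everything else unchanged
    }
    return block;
});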
At this point we don't really know what a HAT dialect might be as the contract between the HAT framework and an accelerator. What you have devised is something partial, unspecified/undocumented, that duplicates existing logic, and therefore appears to complicate the code base. I don't see it providing a clear current advantage in proving out the HAT programming model and generating reasonable C code that is optimizable by the GPU C compiler.
Further, it is important that we also consider executing on the Java backend for debugging. The best way to do that is to implement the tensors and their operations in Java, naturally (albeit slowly); then there is less to do in the Java backend.
I think what you have devised is not really a dialect as I understand it. AFAICT it's more an internal IR whose goal is to make it easier to generate the C code. But, it is not clear to me that it really does that, and instead for the most part creates more work and pushes around the logic of where stuff is done.
IMO you need to try, in another PR, to do the same without these operations, and the same goes for where you have added similar operations for other types (such as HATVectorVarOp and any VarLikeOp thing). Let's see if you can achieve the same goals reusing ops and code types with shared logic in the C backend, and if not why not.
Every specialized variable declaration operation now has to be reasoned about independently, and each has to come with its own load and store operations too. And yet, they are fundamentally all the same thing, modeling a variable holding a value of a specific type, which may have a name, which may be initialized, and which may be accessed. Because they are separate you cannot apply a general SSA transformation, which, if it could be applied, should not change the program meaning. If tensors are values whose contents are immutable, it becomes easier to track tensor values (even views of them) and their sources. And given tensor operations can be embedded in expressions, you need to handle both the variable and non-variable case, which collapses to the latter in pure SSA. That applies in all the other cases too.
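A minimal sketch of the SSA point, assuming the generic pass in jdk.incubator.code.analysis.SSA: because the CoreOp var/load/store ops are shared across all types, one transformation serves every value, tensors included; specialized TensorVarOp/load/store ops would be invisible to it.
static CoreOp.FuncOp toSsa(CoreOp.FuncOp kernel) {
    // Variables collapse to values; the variable and non-variable (expression) cases
    // mentioned above become one and the same in the resulting pure-SSA model.
    return SSA.transform(kernel);
}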
That does not mean we should not consider a complete HAT dialect, or one that focuses, say, on tensor programming (tensors with constant shapes). Devising one is a significant undertaking. (I devised a dialect for Triton to do just this for operating on tensors; it's not fully complete nor specified either, but it was easier because I was copying the MLIR dialect.)
Nor does it mean there cannot be some internal operations to aid in the generation of C code. But we need to evaluate each and every operation to determine whether it carries its weight for that use case, and we have not fully done that evaluation.
}
}

private void transformTensorFillOp(Block.Builder blockBuilder, Op op) {
If you move your switch you can reuse a lot of this code. Something like:
var values = blockBuilder.context().getValues(op.operands());
var resultType = op.resultType();
var newOp = switch (op) {
    case CoreOp.VarAccessOp.VarLoadOp loadOp -> new TensorFillOp(resultType, values);
    case JavaOp.InvokeOp invokeOp -> new TensorFillOp(resultType, values);
    ...
};
newOp.setLocation(op.location());
blockBuilder.context().mapValue(op.result(), blockBuilder.op(newOp));
// 1. Analyse IR calls for Tensor.fill
Set<Op> opsToProcess = new HashSet<>();
OpHelper.Invoke.stream(lookup, funcOp)
        .filter(OpHelper.Invoke::returnsVoid)
Looks like the code patterns for searching for ops are very similar (maybe differing just in method names?). You should be able to use some helper methods to minimize this code.
this.usesBarrier = OpHelper.Invoke.stream(lookup(), inlinedEntryPoint)
        .anyMatch(invoke -> invoke.refIs(KernelContext.class) && invoke.named("barrier"));
this.useTensors = OpHelper.Invoke.stream(lookup(), inlinedEntryPoint)
        .anyMatch(invoke -> invoke.refIs(Tensor.class) && invoke.named("load"));
I don't think you need the invoke here. Maybe any use of Tensor.class in the list of accessed types would suffice.
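For instance, a small helper along these lines (usesInvoke is a hypothetical name; stream/refIs/named are the OpHelper.Invoke calls quoted above) could fold the repeated queries into one place:
static boolean usesInvoke(MethodHandles.Lookup lookup, CoreOp.FuncOp func,
                          Class<?> ref, String methodName) {
    return OpHelper.Invoke.stream(lookup, func)
            .anyMatch(invoke -> invoke.refIs(ref) && invoke.named(methodName));
}
which would reduce the fields above to:
this.usesBarrier = usesInvoke(lookup(), inlinedEntryPoint, KernelContext.class, "barrier");
this.useTensors = usesInvoke(lookup(), inlinedEntryPoint, Tensor.class, "load");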
grfrost left a comment
A few more code queries
@Override
protected CudaHATKernelBuilder recurseValueOrThrough(Value value) {
    if (value instanceof Op.Result r) {
Great minds think alike. ;)
I think this is already implemented in BabylonOpDispatcher:
default T recurseResultOrThrow(Value v) {
    if (v instanceof Op.Result r) {
        return recurse(r.op());
    } else {
        throw new RuntimeException("can't recurse on value v, it is not a result");
    }
}
Note there is a shortcut to this: Value.declaringElement, which returns the declaring operation if the value is an operation result, or the declaring block if the value is a block parameter. Very useful for pattern matching on ops with results using a switch expression, where the default bombs out on an unsupported op or an unsupported block parameter.
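A hedged sketch of that shortcut, dispatching on Value.declaringElement() in a switch expression:
default T recurseResultOrThrow(Value v) {
    return switch (v.declaringElement()) {
        case Op op -> recurse(op); // v was an Op.Result
        default -> throw new RuntimeException( // v was a block parameter
                "can't recurse on value " + v + ", it is not a result");
    };
}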
if (reference instanceof Op.Result r) {
    recurse(r.op());
}
recurseValueOrThrough(reference);
So (from the note above) this becomes:
recurseResultOrThrow(reference);
return MATH_FUNCTIONS.getOrDefault(hatMathIntrinsicName, hatMathIntrinsicName);
}

@Override
Not only did you replicate this from BabylonOpDispatcher, for some reason you copied it into both OpenCL and Cuda kernel builders, where a more logical place would have been to put it in the common base class.
return blockComment("Not supported yet");
}

@Override
@@ -961,4 +961,6 @@ protected boolean isColumnMajor(Value tensorLayout) {
    return false;
}
OK so you did put this in the base class. So why not use the same implementation for all?
I am confused.
Because each implementation throws the correct exception (e.g., CUDACodeGenException for the CUDA backend). Each subclass handles this. But I can make it more generic.
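One way to keep the per-backend exceptions without duplicating the method would be a single implementation in the common base class plus an overridable exception factory; the class name HATKernelBuilder and the generic shape below are illustrative, not the actual HAT hierarchy:
public abstract class HATKernelBuilder<T extends HATKernelBuilder<T>> {

    protected abstract T recurse(Op op);

    // Each backend supplies its own exception type (e.g., CUDACodeGenException).
    protected abstract RuntimeException launchBackendException(String message);

    protected T recurseValueOrThrough(Value value) {
        if (value instanceof Op.Result r) {
            return recurse(r.op());
        }
        throw launchBackendException("OpResult expected, but found: " + value.getClass());
    }
}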
I am not sure that justifies the extra code.
Also, there are other Op and Helper calls that throw unchecked exceptions, so for consistency we would have to address those.
If these exceptions all extend a common CodeGenException (in optkl) then we could write:
default T recurseResultOrThrow(Value v) {
    if (v instanceof Op.Result r) {
        return recurse(r.op());
    } else {
        throw new CodeGenException("can't recurse on value v, it is not a result");
    }
}
To be clear, CodeGenException does not exist yet. We would have to create it in optkl.
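A minimal sketch of what that class could look like, assuming it lives in optkl as suggested:
package optkl;

public class CodeGenException extends RuntimeException {
    public CodeGenException(String message) {
        super(message);
    }

    public CodeGenException(String message, Throwable cause) {
        super(message, cause);
    }
}
The backend-specific exceptions (e.g., CUDACodeGenException) could then extend it, keeping the per-backend behaviour described above while letting shared helpers throw the common type.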
grfrost left a comment
I think we replicated recurseResultOrThrow from BabylonOpDispatcher.
if (value instanceof Op.Result r) {
    return recurse(r.op());
} else {
    throw launchBackendException("OpResult expected, but found: " + value.getClass());
I would not recommend this pattern.
Better, I think, to extend a common CodeGenException.
grfrost left a comment
I think I would prefer a common optkl CodeGenException.
/integrate

Going to push as commit b27d800.
This PR extends the HAT programming model with tensor support.
Tensors are defined as an ND-array tile over the input data set that supports mma and fill operations. For the CUDA backend, tensors are mapped to WMMA operations. For the OpenCL backend, since HAT requires OpenCL 1.2, tensor operations (loads, stores and mma) are generated individually as tiles. Thus, HAT can guarantee code portability across different backends and vendors.
In addition, the ND-Range API has been extended to accommodate tiles and warps.
Let's use this PR as a discussion. We might change/adapt the APIs and evolve in new directions based on this work.
How to test?
For the CUDA backend:
HAT=SHOW_CODE,INFO java -cp hat/job.jar hat.java test ffi-cuda hat.test.TestTensors
For the OpenCL backend:
HAT=SHOW_CODE,INFO java -cp hat/job.jar hat.java test ffi-opencl hat.test.TestTensors
Progress
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/babylon.git pull/998/head:pull/998
$ git checkout pull/998
Update a local copy of the PR:
$ git checkout pull/998
$ git pull https://git.openjdk.org/babylon.git pull/998/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 998
View PR using the GUI difftool:
$ git pr show -t 998
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/babylon/pull/998.diff
Using Webrev
Link to Webrev Comment