[hat] Initial support for Tensor APIs and tensor operations on GPUs #998
jjfumero wants to merge 63 commits into openjdk:code-reflection from jjfumero:hat/tensors/portable
Conversation
👋 Welcome back jfumero! A progress list of the required criteria for merging this PR into code-reflection will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@jjfumero This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the code-reflection branch. ➡️ To integrate this PR with the above commit message to the code-reflection branch, type /integrate in a new comment.
@jjfumero this pull request can not be integrated into code-reflection due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, run the following commands in your personal fork:
git checkout hat/tensors/portable
git fetch https://git.openjdk.org/babylon.git code-reflection
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge code-reflection"
git push
super(that, cc);
}

public static final class TensorVarOp extends HATTensorOp implements VarLikeOp, StatementLikeOp {
Why do you need this operation and the corresponding load and store operations?
The TensorVarOp is actually a view. Maybe a better name would be TensorView. My reasoning is that views do not follow the same semantics as normal Java varOps.
- A tensor view declaration can be moved outside of the current scope, and this depends on the backend used (e.g., OpenCL). For example, we could store a tensor, but the tensor, depending on the low-level programming model used, would need to be declared in an outer scope. Providing a domain-specific op for this facilitates analysis, code motion, lowering and code generation.
- Tensor views can represent a view from global memory, but also from shared memory (shared across multiple threads). Thus, the semantics are different compared to the Java VarOp.
Other programming frameworks, such as MLIR for GPUs, also introduce a dialect that includes tensor initialisation/views: https://mlir.llvm.org/docs/Dialects/GPU/#gpucreate_dn_tensor-gpucreatedntensorop
Regarding loads/stores: the way we process these per backend is very different. For the CUDA backend it is straightforward, generating code via a set of intrinsics and templates. However, for the OpenCL backend, we have mapped them to explicit tiling, reconstructing loops and new control flow. This would be better handled by a lowering phase for OpenCL devices. In fact, within the OpenCL ecosystem, we might have different lowered code. Having an Op to represent these operations also facilitates code analysis and code generation.
Similar works:
I think whether we call this a View (or not) or an Op is irrelevant.
Can't normal varOps also be moved around in the manner you describe? Why is a TensorVarOp different in this regard? I don't get this point.
If we consider a Tensor as just another type (possibly exotic in nature, which can reside in global, local, shared, or private memory), I think we will realize that values of this 'type' can be assigned to an existing varOp.
We might need to do special things in codegen (by extracting the type), but we should not need a varOp specifically for this.
We don't have specific varOps for primitives, or Objects, or Arrays, or records. If we did, we would have an explosion of varOp types.
We really don't need a special Op for this; the same goes for Vectors, Tiles, Matrices, etc.
The same argument applies, BTW, to arithmetic operations performed on types. We also should not specialize arithmetic Ops based on the types of operands. We may produce the Add op as a result of a transformation on an invoke/method call in the original code, add(vec1, vec2), instead of as a result of a Java operator, someInt + someOtherInt, but the arithmetic add operation conveys all the information we require.
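As a hedged illustration of that last point: a copy transform could rewrite an invoke of add(vec1, vec2) into the generic arithmetic add op. The JavaOp.add factory name and the transform idiom below are assumptions based on how the code-reflection API is used elsewhere in this thread, not a confirmed recipe.
CoreOp.FuncOp transformed = funcOp.transform((block, op) -> {
    if (op instanceof JavaOp.InvokeOp inv && inv.invokeDescriptor().name().equals("add")) {
        var operands = block.context().getValues(inv.operands());
        Op.Result sum = block.op(JavaOp.add(operands.get(0), operands.get(1)));
        block.context().mapValue(inv.result(), sum); // uses of the invoke now see the add
    } else {
        block.op(op); // copy everything else unchanged
    }
    return block;
});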
At this point we don't really know what a HAT dialect might be as the contract between the HAT framework and an accelerator. What you have devised is something partial, unspecified/undocumented, that duplicates existing logic, and therefore appears to complicate the code base. I don't see it providing a clear current advantage in proving out the HAT programming model and generating reasonable C code that is optimizable by the GPU C compiler.
Further, it is important that we also consider executing on the Java backend for debugging. The best way to do that is to implement the tensors and their operations in Java, naturally (albeit slowly); then there is less to do in the Java backend.
I think what you have devised is not really a dialect as I understand it. AFAICT it's more an internal IR whose goal is to make it easier to generate the C code. But, it is not clear to me that it really does that, and instead for the most part creates more work and pushes around the logic of where stuff is done.
IMO you need to try, in another PR, to do the same without these operations, and the same goes for where you have added similar operations for other types (such as HATVectorVarOp and any VarLikeOp thing). Let's see if you can achieve the same goals reusing ops and code types with shared logic in the C backend, and if not why not.
Every specialized variable declaration operation now has to be reasoned about independently, and each has to come with its own load and store operations too. And yet, they are fundamentally all the same thing, modeling a variable holding a value of a specific type, which may have a name, which may be initialized, and which may be accessed. Because they are separate you cannot apply a general SSA transformation, which, if it could be applied, should not change the program meaning. If tensors are values whose contents are immutable, it becomes easier to track tensor values (even views of them) and their sources. And given tensor operations can be embedded in expressions, you need to handle both the variable and non-variable case, which collapses to the latter in pure SSA. That applies in all the other cases too.
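A minimal sketch of the SSA point, assuming the generic pass in jdk.incubator.code.analysis.SSA: because the CoreOp var/load/store ops are shared across all types, one transformation serves every value, tensors included; specialized TensorVarOp/load/store ops would be invisible to it.
static CoreOp.FuncOp toSsa(CoreOp.FuncOp kernel) {
    // Variables collapse to values; the variable and non-variable (expression) cases
    // mentioned above become one and the same in the resulting pure-SSA model.
    return SSA.transform(kernel);
}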
That does not mean we should not consider a complete HAT dialect, or one that focuses, say, on tensor programming (tensors with constant shapes). Devising one is a significant undertaking. (I devised a dialect for Triton to do just this for operating on tensors; it's not fully complete nor specified either, but it was easier because I was copying the MLIR dialect.)
Nor does it mean there cannot be some internal operations to aid in the generation of C code. But we need to evaluate each and every operation to determine whether it carries its weight for that use case, and we have not fully done that evaluation.
}
}

private void transformTensorFillOp(Block.Builder blockBuilder, Op op) {
If you move your switch you can reuse a lot of this code. Something like:
var values = blockBuilder.context().getValues(op.operands());
var resultType = op.resultType();
var newOp = switch (op) {
    case CoreOp.VarAccessOp.VarLoadOp loadOp -> new TensorFillOp(resultType, values);
    case JavaOp.InvokeOp invokeOp -> new TensorFillOp(resultType, values);
    ...
};
newOp.setLocation(op.location());
blockBuilder.context().mapValue(op.result(), blockBuilder.op(newOp));
// 1. Analyse IR calls for Tensor.fill
Set<Op> opsToProcess = new HashSet<>();
OpHelper.Invoke.stream(lookup, funcOp)
        .filter(OpHelper.Invoke::returnsVoid)
Looks like the code patterns for searching for ops are very similar (maybe differing just in method names?). You should be able to use some helper methods to minimize this code.
this.usesBarrier = OpHelper.Invoke.stream(lookup(), inlinedEntryPoint)
        .anyMatch(invoke -> invoke.refIs(KernelContext.class) && invoke.named("barrier"));
this.useTensors = OpHelper.Invoke.stream(lookup(), inlinedEntryPoint)
        .anyMatch(invoke -> invoke.refIs(Tensor.class) && invoke.named("load"));
I don't think you need the invoke here. Maybe any use of Tensor.class in the list of accessed types would suffice.
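For instance, a small helper along these lines (usesInvoke is a hypothetical name; stream/refIs/named are the OpHelper.Invoke calls quoted above) could fold the repeated queries into one place:
static boolean usesInvoke(MethodHandles.Lookup lookup, CoreOp.FuncOp func,
                          Class<?> ref, String methodName) {
    return OpHelper.Invoke.stream(lookup, func)
            .anyMatch(invoke -> invoke.refIs(ref) && invoke.named(methodName));
}
which would reduce the fields above to:
this.usesBarrier = usesInvoke(lookup(), inlinedEntryPoint, KernelContext.class, "barrier");
this.useTensors = usesInvoke(lookup(), inlinedEntryPoint, Tensor.class, "load");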
grfrost left a comment
A few more code queries
@Override
protected CudaHATKernelBuilder recurseValueOrThrough(Value value) {
    if (value instanceof Op.Result r) {
Great minds think alike. ;)
I think this is already implemented in BabylonOpDispatcher:
default T recurseResultOrThrow(Value v) {
    if (v instanceof Op.Result r) {
        return recurse(r.op());
    } else {
        throw new RuntimeException("can't recurse on value v, it is not a result");
    }
}
Note there is a shortcut to this: Value.declaringElement, which returns the declaring operation if the value is an operation result, or the declaring block if the value is a block parameter. Very useful for pattern matching on ops with results using a switch expression, where the default bombs out on an unsupported op or an unsupported block parameter.
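A hedged sketch of that shortcut, dispatching on Value.declaringElement() in a switch expression:
default T recurseResultOrThrow(Value v) {
    return switch (v.declaringElement()) {
        case Op op -> recurse(op); // v was an Op.Result
        default -> throw new RuntimeException( // v was a block parameter
                "can't recurse on value " + v + ", it is not a result");
    };
}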
if (reference instanceof Op.Result r) {
    recurse(r.op());
}
recurseValueOrThrough(reference);
So (from the note above) this becomes:
recurseResultOrThrow(reference);
return MATH_FUNCTIONS.getOrDefault(hatMathIntrinsicName, hatMathIntrinsicName);
}

@Override
Not only did you replicate this from BabylonOpDispatcher, for some reason you copied it into both OpenCL and Cuda kernel builders, where a more logical place would have been to put it in the common base class.
return blockComment("Not supported yet");
}

@Override
@@ -961,4 +961,6 @@ protected boolean isColumnMajor(Value tensorLayout) {
    return false;
}
OK so you did put this in the base class. So why not use the same implementation for all?
I am confused.
Because each implementation throws the correct exception (e.g., CUDACodeGenException for the CUDA backend). Each subclass handles this. But I can make it more generic.
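One way to keep the per-backend exceptions without duplicating the method would be a single implementation in the common base class plus an overridable exception factory; the class name HATKernelBuilder and the generic shape below are illustrative, not the actual HAT hierarchy:
public abstract class HATKernelBuilder<T extends HATKernelBuilder<T>> {

    protected abstract T recurse(Op op);

    // Each backend supplies its own exception type (e.g., CUDACodeGenException).
    protected abstract RuntimeException launchBackendException(String message);

    protected T recurseValueOrThrough(Value value) {
        if (value instanceof Op.Result r) {
            return recurse(r.op());
        }
        throw launchBackendException("OpResult expected, but found: " + value.getClass());
    }
}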
I am not sure that justifies the extra code.
Also, there are other Op and Helper calls that throw unchecked exceptions, so for consistency we would have to address those.
If these exceptions all extend a common CodeGenException (in optkl) then we could write:
default T recurseResultOrThrow(Value v) {
    if (v instanceof Op.Result r) {
        return recurse(r.op());
    } else {
        throw new CodeGenException("can't recurse on value v, it is not a result");
    }
}
To be clear, CodeGenException does not exist yet. We would have to create it in optkl.
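A minimal sketch of what that class could look like, assuming it lives in optkl as suggested:
package optkl;

public class CodeGenException extends RuntimeException {
    public CodeGenException(String message) {
        super(message);
    }

    public CodeGenException(String message, Throwable cause) {
        super(message, cause);
    }
}
The backend-specific exceptions (e.g., CUDACodeGenException) could then extend it, keeping the per-backend behaviour described above while letting shared helpers throw the common type.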
grfrost left a comment
I think we replicated recurseResultOrThrow from BabylonOpDispatcher.
if (value instanceof Op.Result r) {
    return recurse(r.op());
} else {
    throw launchBackendException("OpResult expected, but found: " + value.getClass());
I would not recommend this pattern.
Better, I think, to extend a common CodeGenException.
grfrost left a comment
I think I would prefer a common optkl CodeGenException.
/integrate

Going to push as commit b27d800.
This PR extends the HAT programming model with tensor support.
Tensors are defined as an ND-array tile over the input data set that supports mma and fill operations. For the CUDA backend, tensors are mapped to WMMA operations. For the OpenCL backend, since HAT requires OpenCL 1.2, tensor operations (loads, stores and mma) are generated individually as tiles. Thus, HAT can guarantee code portability across different backends and vendors.
In addition, the ND-Range API has been extended to accommodate tiles and warps.
Let's use this PR as a discussion. We might change/adapt the APIs and evolve in new directions based on this work.
How to test?
For the CUDA backend:
HAT=SHOW_CODE,INFO java -cp hat/job.jar hat.java test ffi-cuda hat.test.TestTensors
For the OpenCL backend:
HAT=SHOW_CODE,INFO java -cp hat/job.jar hat.java test ffi-opencl hat.test.TestTensors
Progress
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/babylon.git pull/998/head:pull/998
$ git checkout pull/998
Update a local copy of the PR:
$ git checkout pull/998
$ git pull https://git.openjdk.org/babylon.git pull/998/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 998
View PR using the GUI difftool:
$ git pr show -t 998
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/babylon/pull/998.diff
Using Webrev
Link to Webrev Comment