
Conversation

@minborg
Contributor

@minborg minborg commented Apr 23, 2025

This PR is based on the work of @mernst-github and aims to implement an internal thread-local 'stack' allocator, which works like a dynamically sized arena but can be reset back to a previous allocation level. The underlying memory can stay around between calls, which can improve performance.

Re-allocated segments are not zeroed between allocations.
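
For context, here is a rough sketch of the core reuse pattern using only public FFM API (the actual patch is internal and adds frame bookkeeping, confinement checks and locking for virtual threads; the names below are illustrative only, not the patch itself):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.SegmentAllocator;
    import java.lang.foreign.ValueLayout;

    // Sketch: keep one long-lived segment per thread, hand out slices for the
    // duration of a downcall, and "reset" by creating a fresh slicing allocator
    // over the same memory for the next call. The memory stays around between
    // calls and, as noted above, is not zeroed between allocations.
    public class StackLikeAllocationSketch {
        private static final ThreadLocal<MemorySegment> POOL =
                ThreadLocal.withInitial(() -> Arena.ofAuto().allocate(256, 8));

        public static void main(String[] args) {
            MemorySegment pool = POOL.get();
            // "Push a frame": slice arguments for one downcall out of the pool.
            SegmentAllocator frame = SegmentAllocator.slicingAllocator(pool);
            MemorySegment arg = frame.allocate(ValueLayout.JAVA_LONG);
            arg.set(ValueLayout.JAVA_LONG, 0, 42L);
            // ... perform the downcall using `arg` ...
            // "Pop the frame": drop the allocator; the pool is reused by the next call.
        }
    }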


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8349146: [REDO] Implement a better allocator for downcalls (Enhancement - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24829/head:pull/24829
$ git checkout pull/24829

Update a local copy of the PR:
$ git checkout pull/24829
$ git pull https://git.openjdk.org/jdk.git pull/24829/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24829

View PR using the GUI difftool:
$ git pr show -t 24829

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24829.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Apr 23, 2025

👋 Welcome back pminborg! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@minborg
Contributor Author

minborg commented Apr 23, 2025

/contributor add @mernst-github

@openjdk

openjdk bot commented Apr 23, 2025

@minborg This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8349146: [REDO] Implement a better allocator for downcalls

Reviewed-by: mcimadamore, jvernee, liach

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 51 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title from "8349146 [REDO] Implement a better allocator for downcalls" to "8349146: [REDO] Implement a better allocator for downcalls" Apr 23, 2025
@minborg
Contributor Author

minborg commented Apr 23, 2025

Here are the current benchmark results:

Benchmark                            (ELEM_SIZE)  Mode  Cnt   Score   Error  Units
BufferStackBench.OfVirtual.buffer              8  avgt   15  12.653 ± 0.180  ns/op
BufferStackBench.OfVirtual.buffer             16  avgt   15  12.573 ± 0.121  ns/op
BufferStackBench.OfVirtual.buffer             32  avgt   15  12.712 ± 0.252  ns/op
BufferStackBench.OfVirtual.confined            8  avgt   15  22.714 ± 0.146  ns/op
BufferStackBench.OfVirtual.confined           16  avgt   15  24.269 ± 1.079  ns/op
BufferStackBench.OfVirtual.confined           32  avgt   15  25.519 ± 0.219  ns/op
BufferStackBench.buffer                        8  avgt   15   4.866 ± 0.141  ns/op
BufferStackBench.buffer                       16  avgt   15   4.829 ± 0.106  ns/op
BufferStackBench.buffer                       32  avgt   15   4.809 ± 0.071  ns/op
BufferStackBench.confined                      8  avgt   15  22.768 ± 0.448  ns/op
BufferStackBench.confined                     16  avgt   15  23.380 ± 0.192  ns/op
BufferStackBench.confined                     32  avgt   15  25.874 ± 1.129  ns/op

and

Benchmark                              Mode  Cnt   Score   Error  Units
CallOverheadByValue.OfVirtual.byPtr    avgt   15   4.140 ± 0.054  ns/op
CallOverheadByValue.OfVirtual.byValue  avgt   15  16.698 ± 0.347  ns/op
CallOverheadByValue.byPtr              avgt   15   4.120 ± 0.028  ns/op
CallOverheadByValue.byValue            avgt   15  11.828 ± 0.082  ns/op

@openjdk

openjdk bot commented Apr 23, 2025

@minborg @mernst-github was not found in the census.

Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <duke@openjdk.org>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

@openjdk

openjdk bot commented Apr 23, 2025

@minborg The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the "core-libs" (core-libs-dev@openjdk.org) label Apr 23, 2025
@liach
Member

liach commented Apr 23, 2025

Can we just cleanly revert commit 7764742?
This new version defines a redundant sealed interface + record, which feels unnecessary to me.
Meanwhile, the BufferStackBench is valuable, but its addition should be kept in a separate commit to make reviewing easier.

@minborg
Contributor Author

minborg commented Apr 23, 2025

Can we just cleanly revert commit 7764742?
This new version defines a redundant sealed interface + record, which feels unnecessary to me.
Meanwhile, the BufferStackBench is valuable, but its addition should be kept in a separate commit to make reviewing easier.

The idea was to use records for trusted components and less boilerplate. The idea with the interface was to hide the record accessors a bit.
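
For illustration, the shape being described is roughly the following (hypothetical names, not the actual patch): callers program against the narrow sealed interface, while the record carries the trusted final state with minimal boilerplate and its generated accessors stay out of the surface callers see.

    // Hypothetical sketch of "record for trusted state + interface to hide accessors".
    sealed interface Frame permits FrameImpl {
        void close();                       // the narrow API callers see
    }

    record FrameImpl(long parentOffset, long topOfStack) implements Frame {
        @Override
        public void close() {
            // ... reset the stack back to parentOffset ...
        }
    }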

@liach
Member

liach commented Apr 23, 2025

trust_final_non_static_fields in ciField already trusts jdk.internal.foreign, so I don't think the record performance argument stands here. But making Frame and BufferStack classes final is a good move.

Member

@liach liach left a comment


Looks to be on par with the original patch.

@minborg minborg marked this pull request as ready for review April 23, 2025 17:03
@openjdk openjdk bot added the "ready" (Pull request is ready to be integrated) and "rfr" (Pull request is ready for review) labels Apr 23, 2025
@mlbridge

mlbridge bot commented Apr 23, 2025

@JornVernee
Member

We cannot just re-use the original test, due to https://bugs.openjdk.org/browse/JDK-8350455

See my comment here: #23142 (comment)

@minborg
Contributor Author

minborg commented Apr 29, 2025

We cannot just re-use the original test, due to https://bugs.openjdk.org/browse/JDK-8350455

See my comment here: #23142 (comment)

I ran the test 250 times on my Mac machine (macOS 15.4.1) with no problems, but maybe we should problem-list it or exclude macOS testing?

@JornVernee
Member

maybe we should problem-list it or exclude macOS testing?

Yes, I think we should. The issue is quite intermittent, and last time it only started showing up in CI as well. I think we should move the stress() test to a separate jtreg test file, and then problem-list that test on Mac, with a link to JDK-8350455.
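
For reference, a ProblemList entry along those lines would look roughly like this (the test path is hypothetical; the bug id is JDK-8350455):

    # test/jdk/ProblemList.txt (sketch; actual test path to be decided)
    java/foreign/TestBufferStackStress.java    8350455    macosx-all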

@openjdk openjdk bot removed the "ready" (Pull request is ready to be integrated) label Apr 30, 2025
@Benchmark
public void byValue() throws Throwable {
// point = unit();
MemorySegment unused = (MemorySegment) MH_UNIT_BY_VALUE.invokeExact(
Contributor


This benchmark is a bit misleading, because the allocator object will add some noise into the mix. I suggest having an allocator object ready in a field and just passing that (avoiding the lambda). That would also be more similar to idiomatic FFM code (which assumes some allocator is already available at the call site).
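
Presumably something along these lines is meant (a sketch only; MH_UNIT_BY_VALUE and its native set-up are assumed to come from the existing CallOverheadByValue benchmark, and the field name is made up):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.SegmentAllocator;
    import java.lang.invoke.MethodHandle;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public class CallOverheadByValueSketch {
        private static final MethodHandle MH_UNIT_BY_VALUE =
                null; // linked as in the existing benchmark (elided here)

        // The allocator is prepared once, in a field: a prefix allocator reuses the
        // same 64-byte segment on every call, so no per-invocation allocator or
        // capturing lambda shows up in the measured path.
        private final SegmentAllocator allocator =
                SegmentAllocator.prefixAllocator(Arena.ofAuto().allocate(64));

        @Benchmark
        public MemorySegment byValue() throws Throwable {
            // Pass the ready-made allocator straight to the downcall handle.
            return (MemorySegment) MH_UNIT_BY_VALUE.invokeExact(allocator);
        }
    }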

@minborg
Contributor Author

minborg commented May 2, 2025

Updated benchmarks:

Benchmark                            (ELEM_SIZE)  Mode  Cnt   Score   Error  Units
BufferStackBench.OfVirtual.buffer              8  avgt   15  12.695 ± 0.205  ns/op
BufferStackBench.OfVirtual.buffer             16  avgt   15  12.622 ± 0.078  ns/op
BufferStackBench.OfVirtual.buffer             32  avgt   15  12.523 ± 0.022  ns/op
BufferStackBench.OfVirtual.confined            8  avgt   15  22.902 ± 0.407  ns/op
BufferStackBench.OfVirtual.confined           16  avgt   15  23.858 ± 0.652  ns/op
BufferStackBench.OfVirtual.confined           32  avgt   15  25.544 ± 0.458  ns/op
BufferStackBench.buffer                        8  avgt   15   4.923 ± 0.029  ns/op
BufferStackBench.buffer                       16  avgt   15   4.971 ± 0.095  ns/op
BufferStackBench.buffer                       32  avgt   15   4.980 ± 0.105  ns/op
BufferStackBench.confined                      8  avgt   15  22.713 ± 0.289  ns/op
BufferStackBench.confined                     16  avgt   15  23.576 ± 0.348  ns/op
BufferStackBench.confined                     32  avgt   15  25.272 ± 0.530  ns/op

and

Benchmark                              Mode  Cnt   Score   Error  Units
CallOverheadByValue.OfVirtual.byPtr    avgt   15   4.159 ± 0.097  ns/op
CallOverheadByValue.OfVirtual.byValue  avgt   15  16.535 ± 0.433  ns/op
CallOverheadByValue.byPtr              avgt   15   4.119 ± 0.031  ns/op
CallOverheadByValue.byValue            avgt   15  11.404 ± 0.240  ns/op

Comment on lines 101 to 186
private record PerThread(ReentrantLock lock, Arena arena, SlicingAllocator stack) {

@ForceInline
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
return new Frame(needsLock, size, byteAlignment);
}

static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)));
}

private final class Frame implements Arena {
private final boolean locked;
private final long parentOffset;
private final long topOfStack;
private final Arena confinedArena = Arena.ofConfined();
private final SegmentAllocator frame;

@SuppressWarnings("restricted")
@ForceInline
public Frame(boolean locked, long byteSize, long byteAlignment) {
this.locked = locked;
parentOffset = stack.currentOffset();
MemorySegment frameSegment = stack.allocate(byteSize, byteAlignment);
topOfStack = stack.currentOffset();
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
frame = new SlicingAllocator(frameSegment.reinterpret(confinedArena, new CleanupAction(arena)));
}

record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}

@ForceInline
private void assertOrder() {
if (topOfStack != stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}

@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}

@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}

@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
stack.resetTo(parentOffset);
if (locked) {
lock.unlock();
}
}
}
}
Contributor


Suggested change
private record PerThread(ReentrantLock lock, Arena arena, SlicingAllocator stack) {
@ForceInline
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
return new Frame(needsLock, size, byteAlignment);
}
static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)));
}
private final class Frame implements Arena {
private final boolean locked;
private final long parentOffset;
private final long topOfStack;
private final Arena confinedArena = Arena.ofConfined();
private final SegmentAllocator frame;
@SuppressWarnings("restricted")
@ForceInline
public Frame(boolean locked, long byteSize, long byteAlignment) {
this.locked = locked;
parentOffset = stack.currentOffset();
MemorySegment frameSegment = stack.allocate(byteSize, byteAlignment);
topOfStack = stack.currentOffset();
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
frame = new SlicingAllocator(frameSegment.reinterpret(confinedArena, new CleanupAction(arena)));
}
record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}
@ForceInline
private void assertOrder() {
if (topOfStack != stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}
@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}
@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}
@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
stack.resetTo(parentOffset);
if (locked) {
lock.unlock();
}
}
}
}
private record PerThread(ReentrantLock lock, Arena arena, SlicingAllocator stack) {
@ForceInline
@SuppressWarnings("restricted")
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
Arena confinedArena = Arena.ofConfined();
long parentOffset = stack.currentOffset();
MemorySegment frameSegment = stack.allocate(size, byteAlignment);
long topOfStack = stack.currentOffset();
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
SegmentAllocator frame = new SlicingAllocator(frameSegment.reinterpret(confinedArena, new Frame.CleanupAction(arena)));
return new Frame(this, needsLock, parentOffset, topOfStack, confinedArena, frame);
}
static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)));
}
}
private record Frame(PerThread thread, boolean locked, long parentOffset, long topOfStack, Arena confinedArena, SegmentAllocator frame) implements Arena {
record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}
@ForceInline
private void assertOrder() {
if (topOfStack != thread.stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}
@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}
@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}
@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
thread.stack.resetTo(parentOffset);
if (locked) {
thread.lock.unlock();
}
}
}

I find it a bit strange to nest a class inside a record. What do you think about changing it to two records?

Comment on lines +103 to +194
private record PerThread(ReentrantLock lock,
Arena arena,
SlicingAllocator stack,
CleanupAction cleanupAction) {

@ForceInline
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
return new Frame(needsLock, size, byteAlignment);
}

static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)),
new CleanupAction(arena));
}

private record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}

private final class Frame implements Arena {

private final boolean locked;
private final long parentOffset;
private final long topOfStack;
private final Arena confinedArena;
private final SegmentAllocator frame;

@SuppressWarnings("restricted")
@ForceInline
public Frame(boolean locked, long byteSize, long byteAlignment) {
this.locked = locked;
this.parentOffset = stack.currentOffset();
final MemorySegment frameSegment = stack.allocate(byteSize, byteAlignment);
this.topOfStack = stack.currentOffset();
this.confinedArena = Arena.ofConfined();
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
this.frame = new SlicingAllocator(frameSegment.reinterpret(confinedArena, cleanupAction));
}

@ForceInline
private void assertOrder() {
if (topOfStack != stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}

@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}

@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}

@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
stack.resetTo(parentOffset);
if (locked) {
lock.unlock();
}
}
}
}
Contributor


Suggested change
private record PerThread(ReentrantLock lock,
Arena arena,
SlicingAllocator stack,
CleanupAction cleanupAction) {
@ForceInline
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
return new Frame(needsLock, size, byteAlignment);
}
static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)),
new CleanupAction(arena));
}
private record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}
private final class Frame implements Arena {
private final boolean locked;
private final long parentOffset;
private final long topOfStack;
private final Arena confinedArena;
private final SegmentAllocator frame;
@SuppressWarnings("restricted")
@ForceInline
public Frame(boolean locked, long byteSize, long byteAlignment) {
this.locked = locked;
this.parentOffset = stack.currentOffset();
final MemorySegment frameSegment = stack.allocate(byteSize, byteAlignment);
this.topOfStack = stack.currentOffset();
this.confinedArena = Arena.ofConfined();
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
this.frame = new SlicingAllocator(frameSegment.reinterpret(confinedArena, cleanupAction));
}
@ForceInline
private void assertOrder() {
if (topOfStack != stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}
@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}
@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}
@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
stack.resetTo(parentOffset);
if (locked) {
lock.unlock();
}
}
}
}
private record PerThread(ReentrantLock lock,
Arena arena,
SlicingAllocator stack,
CleanupAction cleanupAction) {
@SuppressWarnings("restricted")
@ForceInline
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
long parentOffset = stack.currentOffset();
final MemorySegment frameSegment = stack.allocate(size, byteAlignment);
long topOfStack = stack.currentOffset();
Arena confinedArena = Arena.ofConfined();
// return new Frame(needsLock, size, byte
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
return new Frame(this, needsLock, parentOffset, topOfStack, confinedArena, new SlicingAllocator(frameSegment.reinterpret(confinedArena, cleanupAction)));
}
static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)),
new CleanupAction(arena));
}
private record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}
}
private record Frame(PerThread thread, boolean locked, long parentOffset, long topOfStack, Arena confinedArena, SegmentAllocator frame) implements Arena {
@ForceInline
private void assertOrder() {
if (topOfStack != thread.stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}
@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}
@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}
@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
thread.stack.resetTo(parentOffset);
if (locked) {
thread.lock.unlock();
}
}
}

Same as with the suggestion above, the code changed in the meantime, so I updated and remade the suggestion, this time using two records. What do you think?

return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)),
new CleanupAction(arena));
Contributor


any reason why you didn't use a lambda here?

Contributor

@mcimadamore mcimadamore May 2, 2025


Also, not a big fan of records here -- it seems that many implementation details such as the cleanup action, lock and slicing allocator are "leaked out" to the caller, which is now responsible for setting things up correctly. I think a PerThread class with a constructor taking arena, size, alignment would make the code less coupled and more readable.

Contributor


And, if you do that, you then don't need the of factory -- clients can just use new
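
In other words, roughly the following hypothetical shape (the collaborating types and pushFrame are the ones from the patch and are not repeated here):

    // Sketch of the suggested constructor-based PerThread: the lock, slicing
    // allocator and cleanup action are set up internally, so callers are no
    // longer responsible for wiring them together.
    private static final class PerThread {
        private final ReentrantLock lock = new ReentrantLock();
        private final Arena arena;
        private final SlicingAllocator stack;
        private final Consumer<MemorySegment> cleanupAction;

        PerThread(Arena arena, long byteSize, long byteAlignment) {
            this.arena = arena;
            this.stack = new SlicingAllocator(arena.allocate(byteSize, byteAlignment));
            this.cleanupAction = new CleanupAction(arena);
        }

        // pushFrame(long, long) and the Frame class remain as in the patch; clients
        // now just call `new PerThread(Arena.ofAuto(), byteSize, byteAlignment)`
        // instead of the of(...) factory.
    }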

Contributor


any reason why you didn't use a lambda here?

I also think that CleanupAction should be changed to a lambda.

Contributor Author


Using an anonymous class for the cleanup action had very adverse effects on performance. I didn't want to use a lambda for startup performance reasons.

Contributor


If using a lambda affects performance, how about using an anonymous class?

            return new PerThread(new ReentrantLock(),
                    arena,
                    new SlicingAllocator(arena.allocate(byteSize, byteAlignment)),
                    new Consumer<MemorySegment>() {
                        @Override
                        public void accept(MemorySegment memorySegment) {
                            Reference.reachabilityFence(arena);
                        }});

Member

@liach liach May 2, 2025


Anonymous classes also capture outer variables and break cleaner/GC. The only safe options are local enums or records, which never capture outer variables (anonymous classes cannot be enums or records).
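
Illustrative contrast (not from the patch) of what each cleanup-action shape can end up holding onto while it is retained by the automatic arena's cleaner:

    // Local record: implicitly static, cannot capture locals or the enclosing
    // instance, so it holds exactly its declared components (here, just the arena).
    record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
        @Override
        public void accept(MemorySegment segment) {
            Reference.reachabilityFence(arena);
        }
    }
    // Lambda: captures only the variables it references (here it would be just
    // `arena`), but was avoided above for startup-performance reasons.
    // Anonymous class: may additionally capture the enclosing instance when
    // declared in an instance context, keeping more state reachable than intended.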

@mcimadamore
Contributor

mcimadamore commented May 2, 2025

Updated benchmarks:

Benchmark                            (ELEM_SIZE)  Mode  Cnt   Score   Error  Units
BufferStackBench.OfVirtual.buffer              8  avgt   15  12.695 ± 0.205  ns/op
BufferStackBench.OfVirtual.buffer             16  avgt   15  12.622 ± 0.078  ns/op
BufferStackBench.OfVirtual.buffer             32  avgt   15  12.523 ± 0.022  ns/op
BufferStackBench.OfVirtual.confined            8  avgt   15  22.902 ± 0.407  ns/op
BufferStackBench.OfVirtual.confined           16  avgt   15  23.858 ± 0.652  ns/op
BufferStackBench.OfVirtual.confined           32  avgt   15  25.544 ± 0.458  ns/op
BufferStackBench.buffer                        8  avgt   15   4.923 ± 0.029  ns/op
BufferStackBench.buffer                       16  avgt   15   4.971 ± 0.095  ns/op
BufferStackBench.buffer                       32  avgt   15   4.980 ± 0.105  ns/op
BufferStackBench.confined                      8  avgt   15  22.713 ± 0.289  ns/op
BufferStackBench.confined                     16  avgt   15  23.576 ± 0.348  ns/op
BufferStackBench.confined                     32  avgt   15  25.272 ± 0.530  ns/op

and

Benchmark                              Mode  Cnt   Score   Error  Units
CallOverheadByValue.OfVirtual.byPtr    avgt   15   4.159 ± 0.097  ns/op
CallOverheadByValue.OfVirtual.byValue  avgt   15  16.535 ± 0.433  ns/op
CallOverheadByValue.byPtr              avgt   15   4.119 ± 0.031  ns/op
CallOverheadByValue.byValue            avgt   15  11.404 ± 0.240  ns/op

What was the result before this PR?

Contributor

@mcimadamore mcimadamore left a comment


Looks good - I left some stylistic comments on the use of records

@openjdk openjdk bot added the "ready" (Pull request is ready to be integrated) label May 2, 2025
Member

@JornVernee JornVernee left a comment


Latest version looks good to me as well.

@wenshao
Contributor

wenshao commented May 2, 2025

@minborg I submitted a PR against your branch (minborg#7); please see if it is useful to you.

@minborg
Contributor Author

minborg commented May 2, 2025

The performance before this PR can be seen in the "confined" benchmarks above. In those benchmarks, a regular Arena.ofConfined is created upon every invocation.
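
In other words, the "confined" baseline measures roughly this per invocation (sketch):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;

    class ConfinedBaselineSketch {
        static void call(long elemSize) {
            // A fresh confined arena per call: one native allocation plus an
            // immediate release when the arena is closed.
            try (Arena arena = Arena.ofConfined()) {
                MemorySegment seg = arena.allocate(elemSize); // elemSize: 8/16/32 as in the tables
                // ... use seg for the call ...
            } // close() frees the memory right away
        }
    }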

@minborg
Contributor Author

minborg commented May 2, 2025

/integrate

I will integrate this PR now, @wenshao. Can you summarize your proposed changes below, as it was a bit unclear to me what you meant? I can create a separate PR with those changes later.

@openjdk

openjdk bot commented May 2, 2025

Going to push as commit 9f9e73d.
Since your change was applied there have been 54 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the "integrated" (Pull request has been integrated) label May 2, 2025
@openjdk openjdk bot closed this May 2, 2025
@openjdk openjdk bot removed the "ready" (Pull request is ready to be integrated) and "rfr" (Pull request is ready for review) labels May 2, 2025
@openjdk

openjdk bot commented May 2, 2025

@minborg Pushed as commit 9f9e73d.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@wenshao
Contributor

wenshao commented May 2, 2025

Can you summarize your proposed changes below as It was a bit unclear to me what you meant. I can create a separate PR with those changes later.

My suggestion is to use two records (both PerThread and Frame) to replace the original inner class Frame. This would also eliminate one use of @ForceInline. This might just be my personal coding style preference – feel free to adopt it if you like. Just a suggestion!

@mcimadamore
Contributor

mcimadamore commented May 2, 2025

The performance before this PR can be seen in the "confined" benchmarks above. In those benchmarks, a regular Arena.ofConfined is created upon every invocation.

Yes - but I was referring to the performance of the native call -- not that of the allocation per se (e.g. CallOverheadByValue). If I'm not mistaken the numbers of that benchmark all refer to new code (well, except when we pass by pointer, in which case there's no allocation -- both before and after this PR).

E.g. it would be useful to also see how the by-value call has improved before/after this PR, in addition to looking at the difference between byValue/byRef in this PR.


Labels

core-libs (core-libs-dev@openjdk.org), integrated (Pull request has been integrated)
