
Conversation

@minborg
Contributor

@minborg minborg commented Apr 23, 2025

This PR is based on the work of @mernst-github and aims to implement an internal thread-local 'stack' allocator, which works like a dynamically sized arena but can be reset back to a previous allocation level. The underlying memory can stay around between calls, which can improve performance.

Re-allocated segments are not zeroed between allocations.
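
For context, here is a rough sketch of the core reuse pattern using only public FFM API (the actual patch is internal and adds frame bookkeeping, confinement checks and locking for virtual threads; the names below are illustrative only, not the patch itself):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.SegmentAllocator;
    import java.lang.foreign.ValueLayout;

    // Sketch: keep one long-lived segment per thread, hand out slices for the
    // duration of a downcall, and "reset" by creating a fresh slicing allocator
    // over the same memory for the next call. The memory stays around between
    // calls and, as noted above, is not zeroed between allocations.
    public class StackLikeAllocationSketch {
        private static final ThreadLocal<MemorySegment> POOL =
                ThreadLocal.withInitial(() -> Arena.ofAuto().allocate(256, 8));

        public static void main(String[] args) {
            MemorySegment pool = POOL.get();
            // "Push a frame": slice arguments for one downcall out of the pool.
            SegmentAllocator frame = SegmentAllocator.slicingAllocator(pool);
            MemorySegment arg = frame.allocate(ValueLayout.JAVA_LONG);
            arg.set(ValueLayout.JAVA_LONG, 0, 42L);
            // ... perform the downcall using `arg` ...
            // "Pop the frame": drop the allocator; the pool is reused by the next call.
        }
    }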


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8349146: [REDO] Implement a better allocator for downcalls (Enhancement - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24829/head:pull/24829
$ git checkout pull/24829

Update a local copy of the PR:
$ git checkout pull/24829
$ git pull https://git.openjdk.org/jdk.git pull/24829/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24829

View PR using the GUI difftool:
$ git pr show -t 24829

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24829.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Apr 23, 2025

👋 Welcome back pminborg! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@minborg
Contributor Author

minborg commented Apr 23, 2025

/contributor add @mernst-github

@openjdk

openjdk bot commented Apr 23, 2025

@minborg This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8349146: [REDO] Implement a better allocator for downcalls

Reviewed-by: mcimadamore, jvernee, liach

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 51 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot changed the title from "8349146 [REDO] Implement a better allocator for downcalls" to "8349146: [REDO] Implement a better allocator for downcalls" Apr 23, 2025
@minborg
Contributor Author

minborg commented Apr 23, 2025

Here are the current benchmark results:

Benchmark                            (ELEM_SIZE)  Mode  Cnt   Score   Error  Units
BufferStackBench.OfVirtual.buffer              8  avgt   15  12.653 ± 0.180  ns/op
BufferStackBench.OfVirtual.buffer             16  avgt   15  12.573 ± 0.121  ns/op
BufferStackBench.OfVirtual.buffer             32  avgt   15  12.712 ± 0.252  ns/op
BufferStackBench.OfVirtual.confined            8  avgt   15  22.714 ± 0.146  ns/op
BufferStackBench.OfVirtual.confined           16  avgt   15  24.269 ± 1.079  ns/op
BufferStackBench.OfVirtual.confined           32  avgt   15  25.519 ± 0.219  ns/op
BufferStackBench.buffer                        8  avgt   15   4.866 ± 0.141  ns/op
BufferStackBench.buffer                       16  avgt   15   4.829 ± 0.106  ns/op
BufferStackBench.buffer                       32  avgt   15   4.809 ± 0.071  ns/op
BufferStackBench.confined                      8  avgt   15  22.768 ± 0.448  ns/op
BufferStackBench.confined                     16  avgt   15  23.380 ± 0.192  ns/op
BufferStackBench.confined                     32  avgt   15  25.874 ± 1.129  ns/op

and

Benchmark                              Mode  Cnt   Score   Error  Units
CallOverheadByValue.OfVirtual.byPtr    avgt   15   4.140 ± 0.054  ns/op
CallOverheadByValue.OfVirtual.byValue  avgt   15  16.698 ± 0.347  ns/op
CallOverheadByValue.byPtr              avgt   15   4.120 ± 0.028  ns/op
CallOverheadByValue.byValue            avgt   15  11.828 ± 0.082  ns/op

@openjdk

openjdk bot commented Apr 23, 2025

@minborg @mernst-github was not found in the census.

Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <duke@openjdk.org>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

@openjdk

openjdk bot commented Apr 23, 2025

@minborg The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the "core-libs" (core-libs-dev@openjdk.org) label Apr 23, 2025
@liach
Member

liach commented Apr 23, 2025

Can we just cleanly revert commit 7764742?
This new version defines a redundant sealed interface + record, which feels unnecessary to me.
Meanwhile, the BufferStackBench is valuable, but its addition should be kept in a separate commit to make reviewing easier.

@minborg
Contributor Author

minborg commented Apr 23, 2025

Can we just cleanly revert commit 7764742?
This new version defines a redundant sealed interface + record, which feels unnecessary to me.
Meanwhile, the BufferStackBench is valuable, but its addition should be kept in a separate commit to make reviewing easier.

The idea was to use records for trusted components and less boilerplate. The idea with the interface was to hide the record accessors a bit.
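
For illustration, the shape being described is roughly the following (hypothetical names, not the actual patch): callers program against the narrow sealed interface, while the record carries the trusted final state with minimal boilerplate and its generated accessors stay out of the surface callers see.

    // Hypothetical sketch of "record for trusted state + interface to hide accessors".
    sealed interface Frame permits FrameImpl {
        void close();                       // the narrow API callers see
    }

    record FrameImpl(long parentOffset, long topOfStack) implements Frame {
        @Override
        public void close() {
            // ... reset the stack back to parentOffset ...
        }
    }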

@liach
Member

liach commented Apr 23, 2025

trust_final_non_static_fields in ciField already trusts jdk.internal.foreign, so I don't think the record performance argument stands here. But making Frame and BufferStack classes final is a good move.

Member

@liach liach left a comment


Looks to be on par with the original patch.

@minborg minborg marked this pull request as ready for review April 23, 2025 17:03
@openjdk openjdk bot added the "ready" (Pull request is ready to be integrated) and "rfr" (Pull request is ready for review) labels Apr 23, 2025
@mlbridge

mlbridge bot commented Apr 23, 2025

@JornVernee
Member

We cannot just re-use the original test, due to https://bugs.openjdk.org/browse/JDK-8350455

See my comment here: #23142 (comment)

@minborg
Contributor Author

minborg commented Apr 29, 2025

We cannot just re-use the original test, due to https://bugs.openjdk.org/browse/JDK-8350455

See my comment here: #23142 (comment)

I ran the test 250 times on my Mac machine (macOS 15.4.1) with no problems, but maybe we should problem-list it or exclude macOS testing?

@JornVernee
Member

maybe we should problem-list it or exclude macOS testing?

Yes, I think we should. The issue is quite intermittent, and last time it only started showing up in CI as well. I think we should move the stress() test to a separate jtreg test file, and then problem-list that test on Mac, with a link to JDK-8350455.
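
For reference, a ProblemList entry along those lines would look roughly like this (the test path is hypothetical; the bug id is JDK-8350455):

    # test/jdk/ProblemList.txt (sketch; actual test path to be decided)
    java/foreign/TestBufferStackStress.java    8350455    macosx-all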

@openjdk openjdk bot removed the "ready" (Pull request is ready to be integrated) label Apr 30, 2025
@Benchmark
public void byValue() throws Throwable {
// point = unit();
MemorySegment unused = (MemorySegment) MH_UNIT_BY_VALUE.invokeExact(
Contributor


This benchmark is a bit misleading, because the allocator object will add some noise into the mix. I suggest having an allocator object ready in a field and just passing that (avoiding the lambda). That would also be more similar to idiomatic FFM code (which assumes some allocator is already available at the call site).
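
Presumably something along these lines is meant (a sketch only; MH_UNIT_BY_VALUE and its native set-up are assumed to come from the existing CallOverheadByValue benchmark, and the field name is made up):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.SegmentAllocator;
    import java.lang.invoke.MethodHandle;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public class CallOverheadByValueSketch {
        private static final MethodHandle MH_UNIT_BY_VALUE =
                null; // linked as in the existing benchmark (elided here)

        // The allocator is prepared once, in a field: a prefix allocator reuses the
        // same 64-byte segment on every call, so no per-invocation allocator or
        // capturing lambda shows up in the measured path.
        private final SegmentAllocator allocator =
                SegmentAllocator.prefixAllocator(Arena.ofAuto().allocate(64));

        @Benchmark
        public MemorySegment byValue() throws Throwable {
            // Pass the ready-made allocator straight to the downcall handle.
            return (MemorySegment) MH_UNIT_BY_VALUE.invokeExact(allocator);
        }
    }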

@minborg
Contributor Author

minborg commented May 2, 2025

Updated benchmarks:

Benchmark                            (ELEM_SIZE)  Mode  Cnt   Score   Error  Units
BufferStackBench.OfVirtual.buffer              8  avgt   15  12.695 ± 0.205  ns/op
BufferStackBench.OfVirtual.buffer             16  avgt   15  12.622 ± 0.078  ns/op
BufferStackBench.OfVirtual.buffer             32  avgt   15  12.523 ± 0.022  ns/op
BufferStackBench.OfVirtual.confined            8  avgt   15  22.902 ± 0.407  ns/op
BufferStackBench.OfVirtual.confined           16  avgt   15  23.858 ± 0.652  ns/op
BufferStackBench.OfVirtual.confined           32  avgt   15  25.544 ± 0.458  ns/op
BufferStackBench.buffer                        8  avgt   15   4.923 ± 0.029  ns/op
BufferStackBench.buffer                       16  avgt   15   4.971 ± 0.095  ns/op
BufferStackBench.buffer                       32  avgt   15   4.980 ± 0.105  ns/op
BufferStackBench.confined                      8  avgt   15  22.713 ± 0.289  ns/op
BufferStackBench.confined                     16  avgt   15  23.576 ± 0.348  ns/op
BufferStackBench.confined                     32  avgt   15  25.272 ± 0.530  ns/op

and

Benchmark                              Mode  Cnt   Score   Error  Units
CallOverheadByValue.OfVirtual.byPtr    avgt   15   4.159 ± 0.097  ns/op
CallOverheadByValue.OfVirtual.byValue  avgt   15  16.535 ± 0.433  ns/op
CallOverheadByValue.byPtr              avgt   15   4.119 ± 0.031  ns/op
CallOverheadByValue.byValue            avgt   15  11.404 ± 0.240  ns/op

Comment on lines 101 to 186
private record PerThread(ReentrantLock lock, Arena arena, SlicingAllocator stack) {

@ForceInline
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
return new Frame(needsLock, size, byteAlignment);
}

static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)));
}

private final class Frame implements Arena {
private final boolean locked;
private final long parentOffset;
private final long topOfStack;
private final Arena confinedArena = Arena.ofConfined();
private final SegmentAllocator frame;

@SuppressWarnings("restricted")
@ForceInline
public Frame(boolean locked, long byteSize, long byteAlignment) {
this.locked = locked;
parentOffset = stack.currentOffset();
MemorySegment frameSegment = stack.allocate(byteSize, byteAlignment);
topOfStack = stack.currentOffset();
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
frame = new SlicingAllocator(frameSegment.reinterpret(confinedArena, new CleanupAction(arena)));
}

record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}

@ForceInline
private void assertOrder() {
if (topOfStack != stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}

@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}

@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}

@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
stack.resetTo(parentOffset);
if (locked) {
lock.unlock();
}
}
}
}
Contributor


Suggested change
private record PerThread(ReentrantLock lock, Arena arena, SlicingAllocator stack) {
@ForceInline
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
return new Frame(needsLock, size, byteAlignment);
}
static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)));
}
private final class Frame implements Arena {
private final boolean locked;
private final long parentOffset;
private final long topOfStack;
private final Arena confinedArena = Arena.ofConfined();
private final SegmentAllocator frame;
@SuppressWarnings("restricted")
@ForceInline
public Frame(boolean locked, long byteSize, long byteAlignment) {
this.locked = locked;
parentOffset = stack.currentOffset();
MemorySegment frameSegment = stack.allocate(byteSize, byteAlignment);
topOfStack = stack.currentOffset();
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
frame = new SlicingAllocator(frameSegment.reinterpret(confinedArena, new CleanupAction(arena)));
}
record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}
@ForceInline
private void assertOrder() {
if (topOfStack != stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}
@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}
@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}
@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
stack.resetTo(parentOffset);
if (locked) {
lock.unlock();
}
}
}
}
private record PerThread(ReentrantLock lock, Arena arena, SlicingAllocator stack) {
@ForceInline
@SuppressWarnings("restricted")
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
Arena confinedArena = Arena.ofConfined();
long parentOffset = stack.currentOffset();
MemorySegment frameSegment = stack.allocate(size, byteAlignment);
long topOfStack = stack.currentOffset();
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
SegmentAllocator frame = new SlicingAllocator(frameSegment.reinterpret(confinedArena, new Frame.CleanupAction(arena)));
return new Frame(this, needsLock, parentOffset, topOfStack, confinedArena, frame);
}
static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)));
}
}
private record Frame(PerThread thread, boolean locked, long parentOffset, long topOfStack, Arena confinedArena, SegmentAllocator frame) implements Arena {
record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}
@ForceInline
private void assertOrder() {
if (topOfStack != thread.stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}
@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}
@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}
@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
thread.stack.resetTo(parentOffset);
if (locked) {
thread.lock.unlock();
}
}
}

I find it a bit strange to nest a class inside a record. What do you think about changing it to two records?

Comment on lines +103 to +194
private record PerThread(ReentrantLock lock,
Arena arena,
SlicingAllocator stack,
CleanupAction cleanupAction) {

@ForceInline
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
return new Frame(needsLock, size, byteAlignment);
}

static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)),
new CleanupAction(arena));
}

private record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}

private final class Frame implements Arena {

private final boolean locked;
private final long parentOffset;
private final long topOfStack;
private final Arena confinedArena;
private final SegmentAllocator frame;

@SuppressWarnings("restricted")
@ForceInline
public Frame(boolean locked, long byteSize, long byteAlignment) {
this.locked = locked;
this.parentOffset = stack.currentOffset();
final MemorySegment frameSegment = stack.allocate(byteSize, byteAlignment);
this.topOfStack = stack.currentOffset();
this.confinedArena = Arena.ofConfined();
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
this.frame = new SlicingAllocator(frameSegment.reinterpret(confinedArena, cleanupAction));
}

@ForceInline
private void assertOrder() {
if (topOfStack != stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}

@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}

@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}

@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
stack.resetTo(parentOffset);
if (locked) {
lock.unlock();
}
}
}
}
Contributor


Suggested change
private record PerThread(ReentrantLock lock,
Arena arena,
SlicingAllocator stack,
CleanupAction cleanupAction) {
@ForceInline
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
return new Frame(needsLock, size, byteAlignment);
}
static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)),
new CleanupAction(arena));
}
private record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}
private final class Frame implements Arena {
private final boolean locked;
private final long parentOffset;
private final long topOfStack;
private final Arena confinedArena;
private final SegmentAllocator frame;
@SuppressWarnings("restricted")
@ForceInline
public Frame(boolean locked, long byteSize, long byteAlignment) {
this.locked = locked;
this.parentOffset = stack.currentOffset();
final MemorySegment frameSegment = stack.allocate(byteSize, byteAlignment);
this.topOfStack = stack.currentOffset();
this.confinedArena = Arena.ofConfined();
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
this.frame = new SlicingAllocator(frameSegment.reinterpret(confinedArena, cleanupAction));
}
@ForceInline
private void assertOrder() {
if (topOfStack != stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}
@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}
@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}
@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
stack.resetTo(parentOffset);
if (locked) {
lock.unlock();
}
}
}
}
private record PerThread(ReentrantLock lock,
Arena arena,
SlicingAllocator stack,
CleanupAction cleanupAction) {
@SuppressWarnings("restricted")
@ForceInline
public Arena pushFrame(long size, long byteAlignment) {
boolean needsLock = Thread.currentThread().isVirtual() && !lock.isHeldByCurrentThread();
if (needsLock && !lock.tryLock()) {
// Rare: another virtual thread on the same carrier competed for acquisition.
return Arena.ofConfined();
}
if (!stack.canAllocate(size, byteAlignment)) {
if (needsLock) lock.unlock();
return Arena.ofConfined();
}
long parentOffset = stack.currentOffset();
final MemorySegment frameSegment = stack.allocate(size, byteAlignment);
long topOfStack = stack.currentOffset();
Arena confinedArena = Arena.ofConfined();
// return new Frame(needsLock, size, byte
// The cleanup action will keep the original automatic `arena` (from which
// the reusable segment is first allocated) alive even if this Frame
// becomes unreachable but there are reachable segments still alive.
return new Frame(this, needsLock, parentOffset, topOfStack, confinedArena, new SlicingAllocator(frameSegment.reinterpret(confinedArena, cleanupAction)));
}
static PerThread of(long byteSize, long byteAlignment) {
final Arena arena = Arena.ofAuto();
return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)),
new CleanupAction(arena));
}
private record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
@Override
public void accept(MemorySegment memorySegment) {
Reference.reachabilityFence(arena);
}
}
}
private record Frame(PerThread thread, boolean locked, long parentOffset, long topOfStack, Arena confinedArena, SegmentAllocator frame) implements Arena {
@ForceInline
private void assertOrder() {
if (topOfStack != thread.stack.currentOffset())
throw new IllegalStateException("Out of order access: frame not top-of-stack");
}
@ForceInline
@Override
@SuppressWarnings("restricted")
public MemorySegment allocate(long byteSize, long byteAlignment) {
// Make sure we are on the right thread and not closed
MemorySessionImpl.toMemorySession(confinedArena).checkValidState();
return frame.allocate(byteSize, byteAlignment);
}
@ForceInline
@Override
public MemorySegment.Scope scope() {
return confinedArena.scope();
}
@ForceInline
@Override
public void close() {
assertOrder();
// the Arena::close method is called "early" as it checks thread
// confinement and crucially before any mutation of the internal
// state takes place.
confinedArena.close();
thread.stack.resetTo(parentOffset);
if (locked) {
thread.lock.unlock();
}
}
}

Same as with the suggestion above, the code changed in the meantime, so I updated and remade the suggestion, this time using two records. What do you think?

return new PerThread(new ReentrantLock(),
arena,
new SlicingAllocator(arena.allocate(byteSize, byteAlignment)),
new CleanupAction(arena));
Contributor


any reason why you didn't use a lambda here?

Contributor

@mcimadamore mcimadamore May 2, 2025


Also, not a big fan of records here -- it seems that many implementation details such as the cleanup action, lock and slicing allocator are "leaked out" to the caller, which is now responsible for setting things up correctly. I think a PerThread class with a constructor taking arena, size, alignment would make the code less coupled and more readable.

Contributor


And, if you do that, you then don't need the of factory -- clients can just use new
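
In other words, roughly the following hypothetical shape (the collaborating types and pushFrame are the ones from the patch and are not repeated here):

    // Sketch of the suggested constructor-based PerThread: the lock, slicing
    // allocator and cleanup action are set up internally, so callers are no
    // longer responsible for wiring them together.
    private static final class PerThread {
        private final ReentrantLock lock = new ReentrantLock();
        private final Arena arena;
        private final SlicingAllocator stack;
        private final Consumer<MemorySegment> cleanupAction;

        PerThread(Arena arena, long byteSize, long byteAlignment) {
            this.arena = arena;
            this.stack = new SlicingAllocator(arena.allocate(byteSize, byteAlignment));
            this.cleanupAction = new CleanupAction(arena);
        }

        // pushFrame(long, long) and the Frame class remain as in the patch; clients
        // now just call `new PerThread(Arena.ofAuto(), byteSize, byteAlignment)`
        // instead of the of(...) factory.
    }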

Contributor


any reason why you didn't use a lambda here?

I also think that CleanupAction should be changed to a lambda.

Contributor Author


Using an anonymous class for the cleanup action had very adverse effects on performance. I didn't want to use a lambda for startup performance reasons.

Contributor


If using a lambda affects performance, how about using an anonymous class?

            return new PerThread(new ReentrantLock(),
                    arena,
                    new SlicingAllocator(arena.allocate(byteSize, byteAlignment)),
                    new Consumer<MemorySegment>() {
                        @Override
                        public void accept(MemorySegment memorySegment) {
                            Reference.reachabilityFence(arena);
                        }});

Member

@liach liach May 2, 2025


Anonymous classes also capture outer variables and break cleaner/GC. The only safe options are local enums or records, which never capture outer variables (anonymous classes cannot be enums or records).
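
Illustrative contrast (not from the patch) of what each cleanup-action shape can end up holding onto while it is retained by the automatic arena's cleaner:

    // Local record: implicitly static, cannot capture locals or the enclosing
    // instance, so it holds exactly its declared components (here, just the arena).
    record CleanupAction(Arena arena) implements Consumer<MemorySegment> {
        @Override
        public void accept(MemorySegment segment) {
            Reference.reachabilityFence(arena);
        }
    }
    // Lambda: captures only the variables it references (here it would be just
    // `arena`), but was avoided above for startup-performance reasons.
    // Anonymous class: may additionally capture the enclosing instance when
    // declared in an instance context, keeping more state reachable than intended.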

@mcimadamore
Contributor

mcimadamore commented May 2, 2025

Updated benchmarks:

Benchmark                            (ELEM_SIZE)  Mode  Cnt   Score   Error  Units
BufferStackBench.OfVirtual.buffer              8  avgt   15  12.695 ± 0.205  ns/op
BufferStackBench.OfVirtual.buffer             16  avgt   15  12.622 ± 0.078  ns/op
BufferStackBench.OfVirtual.buffer             32  avgt   15  12.523 ± 0.022  ns/op
BufferStackBench.OfVirtual.confined            8  avgt   15  22.902 ± 0.407  ns/op
BufferStackBench.OfVirtual.confined           16  avgt   15  23.858 ± 0.652  ns/op
BufferStackBench.OfVirtual.confined           32  avgt   15  25.544 ± 0.458  ns/op
BufferStackBench.buffer                        8  avgt   15   4.923 ± 0.029  ns/op
BufferStackBench.buffer                       16  avgt   15   4.971 ± 0.095  ns/op
BufferStackBench.buffer                       32  avgt   15   4.980 ± 0.105  ns/op
BufferStackBench.confined                      8  avgt   15  22.713 ± 0.289  ns/op
BufferStackBench.confined                     16  avgt   15  23.576 ± 0.348  ns/op
BufferStackBench.confined                     32  avgt   15  25.272 ± 0.530  ns/op

and

Benchmark                              Mode  Cnt   Score   Error  Units
CallOverheadByValue.OfVirtual.byPtr    avgt   15   4.159 ± 0.097  ns/op
CallOverheadByValue.OfVirtual.byValue  avgt   15  16.535 ± 0.433  ns/op
CallOverheadByValue.byPtr              avgt   15   4.119 ± 0.031  ns/op
CallOverheadByValue.byValue            avgt   15  11.404 ± 0.240  ns/op

What was the result before this PR?

Contributor

@mcimadamore mcimadamore left a comment


Looks good - I left some stylistic comments on the use of records

@openjdk openjdk bot added the "ready" (Pull request is ready to be integrated) label May 2, 2025
Member

@JornVernee JornVernee left a comment


Latest version looks good to me as well.

@wenshao
Contributor

wenshao commented May 2, 2025

@minborg I submitted a PR against your branch (minborg#7); please see if it is useful to you.

@minborg
Contributor Author

minborg commented May 2, 2025

The performance before this PR can be seen in the "confined" benchmarks above. In those benchmarks, a regular Arena.ofConfined is created upon every invocation.
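
In other words, the "confined" baseline measures roughly this per invocation (sketch):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;

    class ConfinedBaselineSketch {
        static void call(long elemSize) {
            // A fresh confined arena per call: one native allocation plus an
            // immediate release when the arena is closed.
            try (Arena arena = Arena.ofConfined()) {
                MemorySegment seg = arena.allocate(elemSize); // elemSize: 8/16/32 as in the tables
                // ... use seg for the call ...
            } // close() frees the memory right away
        }
    }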

@minborg
Contributor Author

minborg commented May 2, 2025

/integrate

I will integrate this PR now, @wenshao. Can you summarize your proposed changes below, as it was a bit unclear to me what you meant? I can create a separate PR with those changes later.

@openjdk

openjdk bot commented May 2, 2025

Going to push as commit 9f9e73d.
Since your change was applied there have been 54 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the "integrated" (Pull request has been integrated) label May 2, 2025
@openjdk openjdk bot closed this May 2, 2025
@openjdk openjdk bot removed the "ready" (Pull request is ready to be integrated) and "rfr" (Pull request is ready for review) labels May 2, 2025
@openjdk

openjdk bot commented May 2, 2025

@minborg Pushed as commit 9f9e73d.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@wenshao
Contributor

wenshao commented May 2, 2025

Can you summarize your proposed changes below as It was a bit unclear to me what you meant. I can create a separate PR with those changes later.

My suggestion is to use two records (both PerThread and Frame) to replace the original inner class Frame. This would also eliminate one use of @ForceInline. This might just be my personal coding style preference – feel free to adopt it if you like. Just a suggestion!

@mcimadamore
Contributor

mcimadamore commented May 2, 2025

The performance before this PR can be seen in the "confined" benchmarks above. In those benchmarks, a regular Arena.ofConfined is created upon every invocation.

Yes - but I was referring to the performance of the native call -- not that of the allocation per se (e.g. CallOverheadByValue). If I'm not mistaken the numbers of that benchmark all refer to new code (well, except when we pass by pointer, in which case there's no allocation -- both before and after this PR).

E.g. it would be useful to also see how the by-value call has improved before/after this PR, in addition to looking at the difference between byValue/byRef in this PR.


Labels

core-libs (core-libs-dev@openjdk.org), integrated (Pull request has been integrated)
