Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Unsafe and merykitty's branchless parser #194

Merged
merged 3 commits into from Jan 7, 2024

Conversation

royvanrijn
Copy link
Contributor

Check List:

  • Tests pass (./test.sh <username>)
  • All formatting changes by the build are committed
  • Output matches that of calculate_average_baseline.sh
  • Execution time: 0m1.897s
  • Execution time of reference implementation: 3m47.094s

@thomaswue
Copy link
Contributor

Cool, I think this is getting close to optimal! Now soon what remains is reading assembly and compiler graphs ;-).

Did you check out if building an "-O3 -march" native image still provides only the same performance? I might check out your branch and test with different native image options tomorrow.

TreeMap<String, Measurement> results = IntStream.range(0, chunks.length - 1)
.mapToObj(chunkIndex -> process(chunks[chunkIndex], chunks[chunkIndex + 1])).parallel()
.flatMap(MeasurementRepository::get)
.collect(Collectors.toMap(e -> e.city, MeasurementRepository.Entry::measurement, Measurement::updateWith, TreeMap::new));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lambdas are not super good for startup, but maybe with CDS isn't like that! Anonymous classes tends to have more predictable costs (lower usually)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. And that loading is single-threaded. This is why it should be a bigger benefit in terms of percentage of execution time on a stronger machine to create a native image. Overall, the loaded and compiled code is still relatively minimal compared to what is loaded in your typical Spring / Quarkus / Micronaut application ;-).

index = (index + 1) & tableMask;
}

if (tableEntry != null) {
return tableEntry.measurement;
tableEntry.measurement.updateWith(temperature);
return;
}

// --- This is a brand new entry, insert into the hashtable and do the extra calculations (once!) do slower calculations here.
Measurement measurement = new Measurement();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move slow path code out of the hot path, to be sure the hot will be inlined

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to update the measurement with the new temperature, always. I'm not sure what you mean?

Copy link

@franz1981 franz1981 Jan 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part related the brand new entry, I didn't checked if goes beyond the inlining bytecode threshold, so If you move it in a separate method you are sure the hot path to be inlined

@royvanrijn
Copy link
Contributor Author

@thomaswue I've checked, on my machine your code is also slightly faster running in JVM mode; could be different on the target machine of course.

Copy link
Owner

@gunnarmorling gunnarmorling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once comment on CDS inline. Holding back with updating the result until this has been sorted out, but this looks very good!

@@ -19,9 +19,9 @@ source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk use java 21.0.1-graal 1>&2
# Added for fun, doesn't seem to be making a difference...
if [ -f "target/calculate_average_royvanrijn.jsa" ]; then
JAVA_OPTS="-XX:SharedArchiveFile=target/calculate_average_royvanrijn.jsa -Xshare:on"
JAVA_OPTS="--enable-preview -XX:SharedArchiveFile=target/calculate_average_royvanrijn.jsa -Xshare:on"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the CDS stuff, I'm just seeing this now, and I think it should be moved to a separate script, akin to @thomaswue's script for building the native image. The reasoning being that there should be no artifacts passed between runs (where would we draw the line, is for instance recording statistics fine, just caching the entire result, etc.). I don't think it makes a real difference as only a very small amount of custom classes is added to what's in the distro's static CDS archive, but it should be separated as a matter of principle.


public String toString() {
return round(min) + "/" + round((1.0 * sum) / count) + "/" + round(max);
private static Unsafe initUnsafe() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you can just export jdk.internal.misc.Unsafe to your application in the execution script and do getUnsafe().

@gunnarmorling
Copy link
Owner

Discussed with @royvanrijn, I'll remove the dynamic App CDS archive creation and evaluate this submission after that.

@gunnarmorling gunnarmorling merged commit e665d71 into gunnarmorling:main Jan 7, 2024
1 check passed
@gunnarmorling
Copy link
Owner

@royvanrijn, you're on #1 with 00:06.159! Seeing this kind of improvement at this stage is amazing. There are some discussions still in this PR, so might worth addressing them and see what you can gain. @merykitty, you also might consider running on Graal, it kinda consistently made a nice improvement "for free" for all the top contenders.

@thomaswue
Copy link
Contributor

@royvanrijn Great job! I think looking at compiler graphs is next for you ;-).

@merykitty Graal might indeed help your solution, but only if you are using the plain old unsafe instead of the Foreign Function & Memory API. This API is still in incubator phase and it seems like we are missing some intrinsification optimizations for it. We are investigating this already.

@merykitty
Copy link
Contributor

@thomaswue Indeed, using Graal brings huge regression that I suspect might be related to the intrinsification of Vector API or FFM.

@gunnarmorling Thanks a lot for your suggestions, currently Roy's and Thomas's solutions outpace mine with OpenJDK so I will try to fix that first. The glaring problem is that my hash function is really bad which is definitely worth improving.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants