
Initial version for thomaswue with Oracle GraalVM Native Image #70

Merged
15 commits merged into gunnarmorling:main, Jan 6, 2024

Conversation

@thomaswue (Contributor) commented Jan 4, 2024

The script "additional_build_step_thomaswue.sh" will generate the image as "image_calculateaverage_thomaswue". The script "calculate_average_thomaswue.sh" will either execute the native image if the file exists or otherwise run the program in JVM mode with the Graal JIT compiler.

The program finishes on my system in 1.67s (total CPU user time is 42.6s). The system uses an Intel 13th Gen Core i9-13900K processor.

Update: Thanks to tuning from @mukel using sun.misc.Unsafe to directly access the mapped memory, it is now down to 1.28s (total CPU user time 32.2s) on my machine. Also, instead of PGO, this is now just using a single native image build run with tuning flags "-O3" and "-march=native".
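For illustration, such a build boils down to a single native-image invocation with those tuning flags; this is only a sketch, and the classpath and main class are assumptions rather than the exact contents of "additional_build_step_thomaswue.sh":

# Sketch of a tuned native-image build (jar and class names are illustrative)
native-image -O3 -march=native \
    -cp target/average-1.0.0-SNAPSHOT.jar \
    dev.morling.onebrc.CalculateAverage_thomaswue \
    image_calculateaverage_thomaswue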

@lobaorn commented Jan 4, 2024

I was thinking of going in that direction in the next couple of days and using native-image, taking advantage of the fact that the fastest and slowest of the 5 runs are discarded. But my main interest was to see how everything compares with the "best" of the others. Since you have already done it, @thomaswue, I will think again about how to do it, haha. And thank you for bringing in this approach :)

Guess I will wait a little more, and if by the end of the deadline no one has tried Lilliput or Valhalla builds, I will try to mix and converge some approaches that would work best with those ;)

@thomaswue (Contributor Author)

You are certainly very welcome to copy the GraalVM native image generation scripts and see if they also help with your solution! The key is to use profile-guided optimizations and to have an already optimized solution that does not run for too long. I don't think Lilliput or Valhalla (or Loom) can be beneficial for this benchmark.
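For context, the usual Oracle GraalVM PGO flow is a two-step build; the following is only a sketch, with placeholder jar, main class, and image names:

# 1. Build an instrumented image and run it on representative input;
#    at exit it writes a default.iprof profile into the working directory.
native-image --pgo-instrument -cp target/app.jar Main instrumented_image
./instrumented_image

# 2. Rebuild, feeding the collected profile back into the optimizing compiler.
native-image --pgo=default.iprof -cp target/app.jar Main optimized_image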

@lobaorn commented Jan 4, 2024

I agree that, given the specific constraints, they would not help with reaching the top of the leaderboard. It is more about comparing performance or profiling with those builds against a "default" JDK build; it falls more on the experimentation side of things, which is why I will probably wait and see which approaches emerge. It has been lots of fun. We can just think "How about if..." and within a couple of hours there is a good chance someone will submit that idea, even if no one said or wrote anything out loud. So right now my plan is to think some more "How about if...", pick the top implementations, mix and match some JDK builds and configs, and then mix and match the implementations themselves and see what happens. But more and more I think someone will do this mix-and-match idea before I get to it. @gunnarmorling is probably the one having the most fun of all, seeing so many ideas arise.

@mariusstaicu

I was thinking of trying this too, combined with one of the top implementations.
Also, it would be a lot of fun to compare the same algorithm across different JDKs, with or without native image.

@thomaswue (Contributor Author)

Agreed. And possibly also different input sizes, to show the startup vs. long-term peak performance characteristics in a more isolated way.

if [ -f ./profile_thomaswue.iprof ]; then
    echo 'Picking up profiling information from thomaswue.iprof.'
else
    echo 'Could not find profiling information, therefore it will be now regenerated.'
@gunnarmorling (Owner)

Hey @thomaswue, thanks a lot for this, very interesting to see PGO!

In its current shape, though, I think it goes somewhat against the spirit of this challenge, specifically "Implementations must not rely on specifics of a given data set", which I think we kinda do when leveraging profile data from the first run during the following ones. Admittedly, the wording could be more precise, but the general spirit is that runs shouldn't benefit from the outcomes of previous runs (otherwise, taken to the extreme, one could calculate the results once, store them in a file, and then simply load the content of that file in the subsequent runs).

What would be acceptable to me though is creating the profile data at build time, not relying on the specific data set used during evaluation (similar to how creating a static CDS archive is in bounds). Would that be an option?

@thomaswue (Contributor Author)

Sure, absolutely. The profiling is only used to automate compiler tuning knowledge one would otherwise need to manually specify (e.g. "inline a lot in this method"). I can for example train on the existing test data files or check in a slightly larger test file with maybe 1k lines?

Another thing, maybe in the spirit of the competition to generate interesting insights, could be to run that same algorithm in 3 different ways - i.e. JIT, AOT, AOT+PGO.

@gunnarmorling (Owner)

The profiling is only used to automate compiler tuning knowledge one would otherwise need to manually specify (e.g. "inline a lot in this method")

Yepp, that's what I kinda thought.

check in a slightly larger test file with maybe 1k lines?

Yeah, if you could add a 1K file and the expected output (as obtained via calculate_average_baseline.sh) to src/test/resources that would be perfect. It would extend the tests and you can use it for "training". Can the PGO data be generated via Maven? If so, that'd be best, but it's not a strict requirement. A one-off script to run before the evaluation would be ok, too.
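For example, a one-off preparation step along these lines could produce both files (only a sketch; the exact target paths under src/test/resources are assumptions):

# Generate a small input and capture the baseline's output as the expected result
./create_measurements.sh 1000
./calculate_average_baseline.sh > measurements-1k.out
mv measurements.txt src/test/resources/samples/measurements-1k.txt
mv measurements-1k.out src/test/resources/samples/measurements-1k.out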

Another thing, maybe in the spirit of the competition to generate interesting insights, could be to run that same algorithm in 3 different ways - i.e. JIT, AOT, AOT+PGO.

Agreed. Thing is, I'm really overwhelmed by the number of submissions, and I won't have any capacity for further increasing the scope of this right now. But you're more than welcome to do this experiment and publish any interesting insights on your own end.

@thomaswue (Contributor Author)

OK, sounds good. As a quick fix, I have turned off PGO by default and instead added global tuning flags in a commit added to this PR. Maybe approaches that require running the application beforehand should be kept separate in terms of evaluation anyway.

Agreed that running with different configs probably increases the scope too much. It could be interesting to do this after 31st of Jan for a few of the submissions.

Let me know if we can assist you with some of the work. Having so much attention is obviously a downside of the project's success ;-).

@gunnarmorling (Owner)

Ok, excellent, I will evaluate this one tomorrow (I need to install the native-image tool first). In the meantime, could you make sure that running this one shows no differences from the expected output:

./test.sh thomaswue

We've added this one just earlier today so as to have at least some basic assurance of correctness for implementations. Thanks again!

@gunnarmorling (Owner)

Yeah, I can see that angle. OTOH, it's super hard, if not impossible, to objectively rate what's "most idiomatic". That's probably something better suited for a blog post or a talk, where one can explore these nuances. That said, I think there's an interesting threshold somewhere for how far comparatively "standard" Java solutions go up the leaderboard (quite far, in fact) before the more extreme solutions kick in. I.e. "idiomatic" gets you surprisingly far.

@gunnarmorling (Owner)

On first invocation, the script will generate the image with some extra output, though. The image file would have to be cleaned up when updating the code (e.g., after the evaluation). Let me know if this is appropriate or whether there should be a different way to integrate this into the build process.

@thomaswue, so IIUC, this still does the profiling with the actual dataset in the first run, right? Didn't you mean to provide a separate dataset file and use this one when creating the native image? I.e. the build (with profiling) should clearly be separated from the evaluation run. Can we extract the part which builds the native image (including PGO) into a separate script and invoke this one to build the application (instead of the usual mvn verify)?

So the flow would become this:

Build:

ln -s src/test/data/yourprofilingdataset.txt measurements.txt
./your_native_build_script.sh # builds the native binary and PGO data

Evaluation (i.e. what I am doing):

rm -f measurements.txt
ln -s measurements_1B.txt measurements.txt

for i in {1..5}
do
    ./calculate_average_thomaswue.sh # just the actual launch command, no build steps
done

@thomaswue (Contributor Author) commented Jan 6, 2024

The use of PGO is disabled (there was just an option to enable it via an environment variable). To avoid confusion and make this simpler, I deleted the code for optionally using PGO for now. There is an "additional_build_step_thomaswue.sh" script that builds the image. The "calculate_average_thomaswue.sh" script runs in JVM mode if the image file is not there and otherwise picks up the generated image. Let me know if this works better for you.
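For illustration, the fallback described above amounts to something like the following (a sketch; the JVM-mode launch command is an assumption):

# Run the native binary if it exists, otherwise fall back to JVM mode
if [ -f ./image_calculateaverage_thomaswue ]; then
    ./image_calculateaverage_thomaswue
else
    java -cp target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage_thomaswue
fi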

@thomaswue (Contributor Author)

I have also updated the PR description to clarify the new behavior and that PGO is not in use for this version.

@gunnarmorling (Owner)

Yepp, this LGTM now, thanks!

thomaswue and others added 8 commits January 4, 2024 20:28
variable to be set. Add -O3 -march=native tuning flags for better
performance.
mmap the entire file, use Unsafe directly instead of ByteBuffer, avoid byte[] copies.
These tricks give a ~30% speedup, over an already fast implementation.
Contribution by mukel to tune thomaswue submission.
@thomaswue (Contributor Author) commented Jan 5, 2024

@gunnarmorling This PR is now slightly updated with tuning from @mukel to use sun.misc.Unsafe for accessing the NIO direct byte buffer (which I believe is within the boundaries of the rules). It gains ~30% and is now down to 1.28s (32s CPU user) on my machine from 1.67s (42s CPU user) before. Still passes the test script.

@gunnarmorling (Owner) commented Jan 6, 2024

@thomaswue, very cool. Coming in at 9.625sec on the (8 core) eval machine, i.e. 2nd place! Would love to see the PGO version as a follow-up, created with that additional build script you've provided. Thanks for participating!

Btw. I was very impressed by the low variance of the results:

0:9.626
0:9.612
0:9.626
0:9.627
0:9.622

@gunnarmorling merged commit a53aa2e into gunnarmorling:main on Jan 6, 2024
@gunnarmorling (Owner)

Squashed and merged. Thx!

@gunnarmorling (Owner)

Maybe approaches that require running the application beforehand should be kept separate in terms of evaluation anyway.

I have added a "Note" column to the leaderboard, stating for this entry that this is a GraalVM native binary.

Agreed that running with different configs probably increases the scope too much. It could be interesting to do this after 31st of Jan for a few of the submissions.

Let me know if we can assist you with some of the work. Having so much attention is obviously a downside of the project's success ;-).

Thanks :) I might take you up on this at some point. I think once the dust has settled a bit, we can explore how to make use of this set-up for running all different kinds of comparisons.

@thomaswue (Contributor Author)

Cool, thank you for merging and evaluating!

Yes, making comparisons for a few solutions with different run options (and maybe also input sizes or target hardware) is certainly something I would be interested in doing.
