
8341003: [lworld+fp16] Benchmarks for various Float16 operations #1254

Closed
wants to merge 3 commits into lworld+fp16

Conversation


@jatin-bhateja commented Sep 26, 2024

  • Adding micro-benchmarks for various Float16 operations (a sketch of one such micro follows this list).
  • Adding micro-benchmarks targeting similarity search.
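For illustration, here is a minimal sketch of what one of these micros might look like. This is my sketch, not the actual benchmark source; the Float16 type with static valueOf(double) and add(Float16, Float16) is assumed from the draft lworld+fp16 API.

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class Float16AddSketch {
    @Param({"1024"})
    int vectorDim;

    Float16[] a, b, r;

    @Setup
    public void setup() {
        a = new Float16[vectorDim];
        b = new Float16[vectorDim];
        r = new Float16[vectorDim];
        for (int i = 0; i < vectorDim; i++) {
            a[i] = Float16.valueOf(i * 0.5);    // assumed valueOf(double)
            b[i] = Float16.valueOf(i * 0.25);
        }
    }

    @Benchmark
    public void addBenchmark() {
        // One pass over the whole vector per invocation; storing into the
        // result array keeps the loop from being dead-code eliminated.
        for (int i = 0; i < vectorDim; i++) {
            r[i] = Float16.add(a[i], b[i]);     // assumed static add
        }
    }
}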

Please find below the results of performance testing on an Intel Xeon 6 Granite Rapids system:

Benchmark                                               (vectorDim)   Mode  Cnt      Score   Error   Units
Float16OpsBenchmark.absBenchmark                               1024  thrpt    2  25605.990          ops/ms
Float16OpsBenchmark.addBenchmark                               1024  thrpt    2  19222.468          ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16            1024  thrpt    2    528.738          ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16         1024  thrpt    2    660.018          ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16         1024  thrpt    2    659.799          ops/ms
Float16OpsBenchmark.divBenchmark                               1024  thrpt    2   1974.039          ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16           1024  thrpt    2    743.071          ops/ms
Float16OpsBenchmark.euclideanDistanceFP16                      1024  thrpt    2    682.440          ops/ms
Float16OpsBenchmark.fmaBenchmark                               1024  thrpt    2  14052.422          ops/ms
Float16OpsBenchmark.isFiniteBenchmark                          1024  thrpt    2   3851.234          ops/ms
Float16OpsBenchmark.isInfiniteBenchmark                        1024  thrpt    2   1496.207          ops/ms
Float16OpsBenchmark.isNaNBenchmark                             1024  thrpt    2   2778.822          ops/ms
Float16OpsBenchmark.maxBenchmark                               1024  thrpt    2  19231.326          ops/ms
Float16OpsBenchmark.minBenchmark                               1024  thrpt    2  19257.589          ops/ms
Float16OpsBenchmark.mulBenchmark                               1024  thrpt    2  19236.498          ops/ms
Float16OpsBenchmark.negateBenchmark                            1024  thrpt    2  25938.789          ops/ms
Float16OpsBenchmark.sqrtBenchmark                              1024  thrpt    2   1759.051          ops/ms
Float16OpsBenchmark.subBenchmark                               1024  thrpt    2  19242.967          ops/ms

Best Regards,
Jatin


Progress

  • Change must not contain extraneous whitespace

Issue

  • JDK-8341003: [lworld+fp16] Benchmarks for various Float16 operations (Enhancement - P4)

Reviewers

  • Bhavana-Kilambi (bkilambi)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/valhalla.git pull/1254/head:pull/1254
$ git checkout pull/1254

Update a local copy of the PR:
$ git checkout pull/1254
$ git pull https://git.openjdk.org/valhalla.git pull/1254/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 1254

View PR using the GUI difftool:
$ git pr show -t 1254

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/valhalla/pull/1254.diff

Webrev

Link to Webrev Comment


bridgekeeper bot commented Sep 26, 2024

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into lworld+fp16 will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Sep 26, 2024

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8341003: [lworld+fp16] Benchmarks for various Float16 operations

Reviewed-by: bkilambi

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 1 new commit pushed to the lworld+fp16 branch:

  • fb5b1b1: 8341005: [lworld+fp16] Disable intrinsification of Float16.abs and Float16.negate for x86 target

Please see this link for an up-to-date comparison between the source branch of this pull request and the lworld+fp16 branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the lworld+fp16 branch, type /integrate in a new comment.


mlbridge bot commented Sep 26, 2024

Webrevs


jatin-bhateja commented Sep 26, 2024

Hi @Bhavana-Kilambi, I see vector IR in almost all the micros apart from three, i.e. isNaN, isFinite and isInfinite, with the following command:

numactl --cpunodebind=1 -l java -jar target/benchmarks.jar -jvmArgs "-XX:+TraceNewVectors" -p vectorDim=512 -f 1 -i 2 -wi 1 -w 30 org.openjdk.bench.java.lang.Float16OpsBenchmark.<BM_NAME>

This indicates that the Java implementation in those cases is not getting auto-vectorized. We didn't have benchmarks earlier; after tuning we can verify with these new benchmarks.
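For reference, the kind of scalar bit-pattern test these three predicates boil down to (my illustration over raw FP16 bits, not the actual JDK code) is compare-and-branch shaped, which plausibly explains why the auto-vectorizer leaves them alone:

static boolean isNaNSketch(short bits) {
    // FP16 NaN: 5-bit exponent field (0x7C00) all ones, non-zero 10-bit significand.
    return (bits & 0x7C00) == 0x7C00 && (bits & 0x03FF) != 0;
}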

Kindly let me know if the micros look good, and I can integrate them.


Bhavana-Kilambi commented Sep 26, 2024

Hi @jatin-bhateja, thanks for doing the micros.
Can I please ask why you are benchmarking/testing the cosine similarity tests specifically? Are there any real-world use cases for FP16, similar to these, for which you have written these smaller benchmark kernels?

Also, regarding the performance results you posted for the Intel machine, have you compared them with anything else (like the default FP32 implementation for FP16, the case without the intrinsics, or the scalar FP16 version) so that we can better interpret the scores?

@jatin-bhateja

Hi @jatin-bhateja, thanks for doing the micros. Can I please ask why you are benchmarking/testing the cosine similarity tests specifically? Are there any real-world use cases for FP16, similar to these, for which you have written these smaller benchmark kernels?

Also, regarding the performance results you posted for the Intel machine, have you compared them with anything else (like the default FP32 implementation for FP16, the case without the intrinsics, or the scalar FP16 version) so that we can better interpret the scores?

Hi @Bhavana-Kilambi, this patch adds micro-benchmarks for all Float16 APIs optimized so far.
The macro-benchmarks demonstrate a use case for low-precision semantic search primitives; a sketch of the kernel shape follows.
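To make the kernel shape concrete, here is a hedged sketch of a single-rounding FP16 cosine similarity. It is my illustration, assuming the draft Float16 API (fma(a, b, acc), valueOf(double), floatValue()); the dequantized variants would instead convert to float up front and accumulate in FP32.

static float cosineSimilaritySingleRounding(Float16[] a, Float16[] b) {
    Float16 dot = Float16.valueOf(0.0);
    Float16 na  = Float16.valueOf(0.0);
    Float16 nb  = Float16.valueOf(0.0);
    for (int i = 0; i < a.length; i++) {
        dot = Float16.fma(a[i], b[i], dot);  // accumulate in half precision
        na  = Float16.fma(a[i], a[i], na);   // squared norms, also in FP16
        nb  = Float16.fma(b[i], b[i], nb);
    }
    // Only the final normalization is done in wider precision.
    return (float) (dot.floatValue()
            / Math.sqrt((double) na.floatValue() * nb.floatValue()));
}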

@Bhavana-Kilambi

@jatin-bhateja, thanks! While we are on the topic, can I ask if there are any real-world use cases or workloads that you are targeting the FP16 work at, and maybe plan to do performance testing on in the future?


jatin-bhateja commented Sep 27, 2024

Hi @jatin-bhateja, thanks for doing the micros. Can I please ask why you are benchmarking/testing the cosine similarity tests specifically? Are there any real-world use cases for FP16, similar to these, for which you have written these smaller benchmark kernels?
Also, regarding the performance results you posted for the Intel machine, have you compared them with anything else (like the default FP32 implementation for FP16, the case without the intrinsics, or the scalar FP16 version) so that we can better interpret the scores?

Hi @Bhavana-Kilambi, this patch adds micro-benchmarks for all Float16 APIs optimized so far. The macro-benchmarks demonstrate a use case for low-precision semantic search primitives.

Hey, for the baseline we should not pass --enable-preview, since that would prevent the following:

  • Flat layout of Float16 arrays.
  • Creating the Valhalla-specific IR needed for intrinsification.

Here are the first baseline numbers, without --enable-preview.
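For reference, the baseline runs use an invocation along these lines (my sketch, reusing the command shape from my earlier comment, just without --enable-preview among the forked JVM's arguments):

numactl --cpunodebind=1 -l java -jar target/benchmarks.jar -p vectorDim=1024 -f 1 -i 2 -wi 1 -w 30 org.openjdk.bench.java.lang.Float16OpsBenchmark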


Benchmark                                               (vectorDim)   Mode  Cnt     Score   Error   Units
Float16OpsBenchmark.absBenchmark                               1024  thrpt    2    99.424          ops/ms
Float16OpsBenchmark.addBenchmark                               1024  thrpt    2    97.498          ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16            1024  thrpt    2   525.360          ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16         1024  thrpt    2    51.132          ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16         1024  thrpt    2    46.921          ops/ms
Float16OpsBenchmark.divBenchmark                               1024  thrpt    2    97.186          ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16           1024  thrpt    2   583.051          ops/ms
Float16OpsBenchmark.euclideanDistanceFP16                      1024  thrpt    2    56.133          ops/ms
Float16OpsBenchmark.fmaBenchmark                               1024  thrpt    2    81.386          ops/ms
Float16OpsBenchmark.getExponentBenchmark                       1024  thrpt    2  2257.619          ops/ms
Float16OpsBenchmark.isFiniteBenchmark                          1024  thrpt    2  3086.476          ops/ms
Float16OpsBenchmark.isInfiniteBenchmark                        1024  thrpt    2  1718.411          ops/ms
Float16OpsBenchmark.isNaNBenchmark                             1024  thrpt    2  1685.557          ops/ms
Float16OpsBenchmark.maxBenchmark                               1024  thrpt    2    92.078          ops/ms
Float16OpsBenchmark.minBenchmark                               1024  thrpt    2    63.377          ops/ms
Float16OpsBenchmark.mulBenchmark                               1024  thrpt    2    98.202          ops/ms
Float16OpsBenchmark.negateBenchmark                            1024  thrpt    2    98.158          ops/ms
Float16OpsBenchmark.sqrtBenchmark                              1024  thrpt    2    83.760          ops/ms
Float16OpsBenchmark.subBenchmark                               1024  thrpt    2    98.200          ops/ms

Following are the numbers where we do allow the flat array layout but only disable the intrinsics (-XX:DisableIntrinsic=<INTRIN_ID>); an illustrative invocation is shown below, followed by the results.
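Such a run might look like the following (illustrative only; the intrinsic ID stays a placeholder, and since -XX:DisableIntrinsic is a diagnostic flag it needs unlocking):

numactl --cpunodebind=1 -l java -jar target/benchmarks.jar -jvmArgs "--enable-preview -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=<INTRIN_ID>" -p vectorDim=1024 -f 1 -i 2 -wi 1 -w 30 org.openjdk.bench.java.lang.Float16OpsBenchmark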


Benchmark                                               (vectorDim)   Mode  Cnt      Score   Error   Units
Float16OpsBenchmark.absBenchmark                               1024  thrpt    2  25978.876          ops/ms
Float16OpsBenchmark.addBenchmark                               1024  thrpt    2   6406.685          ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16            1024  thrpt    2    528.877          ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16         1024  thrpt    2     76.680          ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16         1024  thrpt    2     53.692          ops/ms
Float16OpsBenchmark.divBenchmark                               1024  thrpt    2   3227.037          ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16           1024  thrpt    2    740.490          ops/ms
Float16OpsBenchmark.euclideanDistanceFP16                      1024  thrpt    2     83.747          ops/ms
Float16OpsBenchmark.fmaBenchmark                               1024  thrpt    2    256.399          ops/ms
Float16OpsBenchmark.getExponentBenchmark                       1024  thrpt    2   2135.678          ops/ms
Float16OpsBenchmark.isFiniteBenchmark                          1024  thrpt    2   3916.860          ops/ms
Float16OpsBenchmark.isInfiniteBenchmark                        1024  thrpt    2   1497.417          ops/ms
Float16OpsBenchmark.isNaNBenchmark                             1024  thrpt    2   2747.704          ops/ms
Float16OpsBenchmark.maxBenchmark                               1024  thrpt    2   3625.708          ops/ms
Float16OpsBenchmark.minBenchmark                               1024  thrpt    2   3628.261          ops/ms
Float16OpsBenchmark.mulBenchmark                               1024  thrpt    2   6340.403          ops/ms
Float16OpsBenchmark.negateBenchmark                            1024  thrpt    2  25727.870          ops/ms
Float16OpsBenchmark.sqrtBenchmark                              1024  thrpt    2    157.519          ops/ms
Float16OpsBenchmark.subBenchmark                               1024  thrpt    2   6404.047          ops/ms
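(Reading the three sets of scores together: addBenchmark, for example, goes from 97.498 ops/ms in the scalar baseline, to 6406.685 ops/ms with the flat layout but intrinsics disabled, to 19222.468 ops/ms with intrinsics enabled, i.e. roughly 66x and then a further 3x.)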


jatin-bhateja commented Sep 27, 2024

@jatin-bhateja, thanks! While we are on the topic, can I ask if there are any real-world use cases or workloads that you are targeting the FP16 work at, and maybe plan to do performance testing on in the future?

Hey, we have some ideas, but for now my intent is to add micros, plus a few demonstrating macros, for each API we have accelerated.

@Bhavana-Kilambi

Thanks for sharing the numbers. So in the first case, without the --enable-preview flag, it would have generated only scalar FP32 operations, and in the second case, where the flat array layout is allowed, it generates vector instructions, but for FP32, right?


jatin-bhateja commented Sep 27, 2024

Thanks for sharing the numbers. So in the first case, without the --enable-preview flag, it would have generated only scalar FP32 operations, and in the second case, where the flat array layout is allowed, it generates vector instructions, but for FP32, right?

Yes.

Let me know if you have other comments on the micros, or kindly approve if it's good to integrate.

@Bhavana-Kilambi

I am just running the tests on one of our machines. Can I confirm in a while, please? The tests otherwise look fine to me.

@Bhavana-Kilambi left a comment

Looks good to me. Thanks.

@Bhavana-Kilambi

BTW, are you generating the min instruction for max and the max instruction for min in c2_MacroAssembler_x86.cpp?

@jatin-bhateja

BTW, are you generating the min instruction for max and the max instruction for min in c2_MacroAssembler_x86.cpp?

My bad, good catch, thanks!

@jatin-bhateja

/integrate


openjdk bot commented Sep 27, 2024

Going to push as commit 0ce9f0f.
Since your change was applied there has been 1 commit pushed to the lworld+fp16 branch:

  • fb5b1b1: 8341005: [lworld+fp16] Disable intrinsification of Float16.abs and Float16.negate for x86 target

Your commit was automatically rebased without conflicts.

openjdk bot added the integrated label Sep 27, 2024
openjdk bot closed this Sep 27, 2024

openjdk bot commented Sep 27, 2024

@jatin-bhateja Pushed as commit 0ce9f0f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
