8265029: Preserve SIZED characteristics on slice operations (skip, limit) #3427

Closed
wants to merge 13 commits into from

@amaembo amaembo commented Apr 10, 2021

With the introduction of toList(), preserving the SIZED characteristic in more cases becomes more important. This patch preserves SIZED on skip() and limit() operations, so every combination of map/mapToX/boxed/asXyzStream/skip/limit/sorted now preserves the size, and toList(), toArray() and count() may benefit from this. E.g., LongStream.range(0, 10_000_000_000L).skip(1).count() returns its result instantly with this patch.
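
As a rough illustration of the behavior described above, the following sketch counts a sliced range and prints whether the stream's spliterator reports SIZED. The class and method names (`SliceSizeDemo`, `slicedCount`, `reportsSized`) are made up for this example and are not part of the patch. Whether the sliced stream reports SIZED depends on the JDK it runs on, so the sketch prints that flag instead of assuming a value.

```java
import java.util.Spliterator;
import java.util.stream.LongStream;

// Illustrative only: these names do not come from the patch.
final class SliceSizeDemo {

    // Counts the elements that remain after skip/limit on a sized range source.
    static long slicedCount(long size, long skip, long limit) {
        return LongStream.range(0, size).skip(skip).limit(limit).count();
    }

    // Reports whether the stream's spliterator advertises the SIZED characteristic.
    static boolean reportsSized(LongStream stream) {
        return stream.spliterator().hasCharacteristics(Spliterator.SIZED);
    }

    public static void main(String[] args) {
        // The count itself is the same on any JDK; only the cost of computing it differs.
        System.out.println("count:  " + slicedCount(1_000, 50, 800));
        System.out.println("range:  " + reportsSized(LongStream.range(0, 1_000)));
        // JDK-dependent: printed, not asserted.
        System.out.println("sliced: " + reportsSized(LongStream.range(0, 1_000).skip(50).limit(800)));
    }
}
```

A modest range is used here on purpose: on a JDK without the patch, the count over a sliced stream is computed by traversal, so the ten-billion-element example from the description would take a long time.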

Some microbenchmarks added that confirm the reduced memory allocation in toList() and toArray() cases. Before patch:

ref.SliceToList.seq_baseline:·gc.alloc.rate.norm                    10000  thrpt   10   40235,534 ±     0,984    B/op
ref.SliceToList.seq_limit:·gc.alloc.rate.norm                       10000  thrpt   10  106431,101 ±     0,198    B/op
ref.SliceToList.seq_skipLimit:·gc.alloc.rate.norm                   10000  thrpt   10  106544,977 ±     1,983    B/op
value.SliceToArray.seq_baseline:·gc.alloc.rate.norm                 10000  thrpt   10   40121,878 ±     0,247    B/op
value.SliceToArray.seq_limit:·gc.alloc.rate.norm                    10000  thrpt   10  106317,693 ±     1,083    B/op
value.SliceToArray.seq_skipLimit:·gc.alloc.rate.norm                10000  thrpt   10  106430,954 ±     0,136    B/op

After patch:

ref.SliceToList.seq_baseline:·gc.alloc.rate.norm                    10000  thrpt   10  40235,648 ±     1,354    B/op
ref.SliceToList.seq_limit:·gc.alloc.rate.norm                       10000  thrpt   10  40355,784 ±     1,288    B/op
ref.SliceToList.seq_skipLimit:·gc.alloc.rate.norm                   10000  thrpt   10  40476,032 ±     2,855    B/op
value.SliceToArray.seq_baseline:·gc.alloc.rate.norm                 10000  thrpt   10  40121,830 ±     0,308    B/op
value.SliceToArray.seq_limit:·gc.alloc.rate.norm                    10000  thrpt   10  40242,554 ±     0,443    B/op
value.SliceToArray.seq_skipLimit:·gc.alloc.rate.norm                10000  thrpt   10  40363,674 ±     1,576    B/op

Time improvements are less exciting. It's likely that inlining and vectorizing dominate in these tests over array allocations and unnecessary copying. Still, I notice a significant improvement in SliceToArray.seq_limit case (2x) and mild improvement (+12..16%) in other slice tests. No significant change in parallel execution time, though its performance is much less stable and I didn't run enough tests.

Before patch:

Benchmark                         (size)   Mode  Cnt      Score     Error  Units
ref.SliceToList.par_baseline       10000  thrpt   30  14876,723 ±  99,770  ops/s
ref.SliceToList.par_limit          10000  thrpt   30  14856,841 ± 215,089  ops/s
ref.SliceToList.par_skipLimit      10000  thrpt   30   9555,818 ± 991,335  ops/s
ref.SliceToList.seq_baseline       10000  thrpt   30  23732,290 ± 444,162  ops/s
ref.SliceToList.seq_limit          10000  thrpt   30  14894,040 ± 176,496  ops/s
ref.SliceToList.seq_skipLimit      10000  thrpt   30  10646,929 ±  36,469  ops/s
value.SliceToArray.par_baseline    10000  thrpt   30  25093,141 ± 376,402  ops/s
value.SliceToArray.par_limit       10000  thrpt   30  24798,889 ± 760,762  ops/s
value.SliceToArray.par_skipLimit   10000  thrpt   30  16456,310 ± 926,882  ops/s
value.SliceToArray.seq_baseline    10000  thrpt   30  69669,787 ± 494,562  ops/s
value.SliceToArray.seq_limit       10000  thrpt   30  21097,081 ± 117,338  ops/s
value.SliceToArray.seq_skipLimit   10000  thrpt   30  15522,871 ± 112,557  ops/s

After patch:

Benchmark                         (size)   Mode  Cnt      Score      Error  Units
ref.SliceToList.par_baseline       10000  thrpt   30  14793,373 ±   64,905  ops/s
ref.SliceToList.par_limit          10000  thrpt   30  13301,024 ± 1300,431  ops/s
ref.SliceToList.par_skipLimit      10000  thrpt   30  11131,698 ± 1769,932  ops/s
ref.SliceToList.seq_baseline       10000  thrpt   30  24101,048 ±  263,528  ops/s
ref.SliceToList.seq_limit          10000  thrpt   30  16872,168 ±   76,696  ops/s
ref.SliceToList.seq_skipLimit      10000  thrpt   30  11953,253 ±  105,231  ops/s
value.SliceToArray.par_baseline    10000  thrpt   30  25442,442 ±  455,554  ops/s
value.SliceToArray.par_limit       10000  thrpt   30  23111,730 ± 2246,086  ops/s
value.SliceToArray.par_skipLimit   10000  thrpt   30  17980,750 ± 2329,077  ops/s
value.SliceToArray.seq_baseline    10000  thrpt   30  66512,898 ± 1001,042  ops/s
value.SliceToArray.seq_limit       10000  thrpt   30  41792,549 ± 1085,547  ops/s
value.SliceToArray.seq_skipLimit   10000  thrpt   30  18007,613 ±  141,716  ops/s

I also modernized SliceOps a little bit, using switch expression (with no explicit default!) and diamonds on anonymous classes.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8265029: Preserve SIZED characteristics on slice operations (skip, limit)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3427/head:pull/3427
$ git checkout pull/3427

Update a local copy of the PR:
$ git checkout pull/3427
$ git pull https://git.openjdk.java.net/jdk pull/3427/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3427

View PR using the GUI difftool:
$ git pr show -t 3427

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3427.diff

@bridgekeeper bridgekeeper bot commented Apr 10, 2021

👋 Welcome back tvaleev! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot commented Apr 10, 2021

@amaembo The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the core-libs label Apr 10, 2021
@openjdk openjdk bot added the rfr label Apr 10, 2021
@amaembo amaembo changed the title JDK-8265029: Preserve SIZED characteristics on slice operations (skip, limit) 8265029: Preserve SIZED characteristics on slice operations (skip, limit) Apr 10, 2021
// Original spliterator is split to [0..499] and [500..999] parts
// due to skip+limit, we have [50..499] and [500..849]
var prefix = parSpliterator.trySplit();
assertNotNull(prefix);
assertTrue(parSpliterator.hasCharacteristics(Spliterator.SIZED));
assertTrue(parSpliterator.hasCharacteristics(Spliterator.SUBSIZED));
Comment on lines +365 to +370

@vlsi

vlsi Apr 10, 2021

It would be great to integrate the code comment into the assertion message. Then a test failure would print the message, making the failure easier to understand.

It would also be great to extract an assertHasCharacteristics method so that a failure prints the actual characteristics rather than "expected true got false".

@amaembo

amaembo Apr 11, 2021
Author Contributor

The assertTrue/False(spltr.hasCharacteristics(...)) pattern appears in Stream API tests quite often; I can see dozens of occurrences. If such a method were extracted, it would be better to replace all of them. At the same time, this would make the patch bigger, complicating its review and drifting from the original task. You can contribute such a change separately. In fact, the line number in the report points exactly to the problematic spliterator and characteristic, and a dedicated method would not give you more information. Anyway, you'll need to look at the test source code to see how the spliterator was created, so you won't save much time with a better message.
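
For readers following this thread, a helper along the lines of the suggestion might look like the sketch below. It is hypothetical: the name `assertHasCharacteristics`, its signature and the message format are assumptions made for illustration, and nothing like it was added in this PR. It uses plain `AssertionError` instead of a test framework so that it stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;

// Hypothetical helper, sketched only to illustrate the review suggestion.
final class CharacteristicsAssert {

    // Fails with the caller's message and the actual characteristics bit mask
    // when any of the expected characteristic bits is missing.
    static void assertHasCharacteristics(Spliterator<?> spliterator, int expected, String message) {
        int actual = spliterator.characteristics();
        if ((actual & expected) != expected) {
            throw new AssertionError(message
                    + ": expected 0x" + Integer.toHexString(expected)
                    + " but characteristics were 0x" + Integer.toHexString(actual));
        }
    }

    public static void main(String[] args) {
        Spliterator<Integer> s = new ArrayList<>(List.of(1, 2, 3)).spliterator();
        // ArrayList's spliterator is documented to report SIZED and SUBSIZED.
        assertHasCharacteristics(s, Spliterator.SIZED | Spliterator.SUBSIZED, "ArrayList spliterator");
        System.out.println("ok");
    }
}
```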


@Benchmark
public List<String> seq_baseline() {
return IntStream.range(0, size)

@vlsi

vlsi Apr 10, 2021

Typically you want to move all the constants to state fields to avoid constant folding by the compiler.
The compiler might exploit the fact that the range start is always 0 and produce dedicated optimized code for it.

See https://shipilev.net/blog/2014/java-scala-divided-we-fail/

@amaembo

amaembo Apr 11, 2021
Author Contributor

I know this article. Here, the upper bound is a state field, so the whole range cannot be optimized. And even if the compiler optimizes at the loop start, it's pretty common to have ranges starting with the constant 0 in production, so I would not say that having 0 as the iteration starting point makes the benchmark more artificial. The same approach is used in neighboring benchmarks, and in fact, this is not the biggest problem with these benchmarks: the clean type profile makes them much more artificial than starting with zero does. So I'd prefer keeping zero as is.

1. Comments in adjustSize
2. repeating code extracted from testNoEvaluationForSizedStream
Member

@PaulSandoz PaulSandoz left a comment

Even though there are not many changes this cuts deep into how streams work.

I suspect there is some, possibly minor, impact for sized streams without limit/skip because of the increased cost to compute the exact size. Further, that cost is also incurred for AbstractWrappingSpliterator, where we might need to cache the exact size result.

I made some suggestions, mostly around naming, but I admit to being a little uncomfortable with the proposed change and need to think a little more about it. I am wondering if we need another stream flag indicating the size is known and has to be computed?

amaembo and others added 4 commits Apr 17, 2021
Co-authored-by: Paul Sandoz <paul.d.sandoz@googlemail.com>
Co-authored-by: Paul Sandoz <paul.d.sandoz@googlemail.com>
Co-authored-by: Paul Sandoz <paul.d.sandoz@googlemail.com>
Contributor Author

@amaembo amaembo commented Apr 17, 2021

I see your concern. I made some additional benchmarks and added them here. First, CountSized, which just gets the stream size without actual traversal. We can see how the performance changes depending on the number of stream operations. I also added an optional type profile pollution that makes the exactOutputSize virtual method polymorphic. Here are the results:

Baseline (the count10Skip test was added just to ensure that the patch works):

Benchmark  (pollute)  (size)  Mode  Cnt       Score      Error  Units
count0         false   10000  avgt  100      15,648 ±    0,182  ns/op
count2         false   10000  avgt  100      31,252 ±    0,113  ns/op
count4         false   10000  avgt  100      47,683 ±    0,165  ns/op
count6         false   10000  avgt  100      64,417 ±    0,203  ns/op
count8         false   10000  avgt  100      80,813 ±    0,265  ns/op
count10        false   10000  avgt  100     101,057 ±    0,295  ns/op
count10Skip    false   10000  avgt  100  497967,375 ± 5946,108  ns/op
count0          true   10000  avgt  100      18,843 ±    0,103  ns/op
count2          true   10000  avgt  100      33,716 ±    0,152  ns/op
count4          true   10000  avgt  100      49,062 ±    0,208  ns/op
count6          true   10000  avgt  100      66,773 ±    0,237  ns/op
count8          true   10000  avgt  100      82,727 ±    0,354  ns/op
count10         true   10000  avgt  100     104,499 ±    0,299  ns/op
count10Skip     true   10000  avgt  100  501723,220 ± 6361,932  ns/op

Type pollution adds some near-constant ~2ns overhead to the non-patched version as well.

Patched:

Benchmark  (pollute)  (size)  Mode  Cnt    Score   Error  Units
count0         false   10000  avgt  100   15,363 ± 0,086  ns/op
count2         false   10000  avgt  100   33,736 ± 0,138  ns/op
count4         false   10000  avgt  100   51,470 ± 0,205  ns/op
count6         false   10000  avgt  100   70,407 ± 0,262  ns/op
count8         false   10000  avgt  100   89,865 ± 0,262  ns/op
count10        false   10000  avgt  100  114,423 ± 0,363  ns/op
count10Skip    false   10000  avgt  100  139,963 ± 0,550  ns/op
count0          true   10000  avgt  100   26,538 ± 0,084  ns/op
count2          true   10000  avgt  100   46,089 ± 0,191  ns/op
count4          true   10000  avgt  100   66,560 ± 0,315  ns/op
count6          true   10000  avgt  100   87,852 ± 0,288  ns/op
count8          true   10000  avgt  100  109,037 ± 0,391  ns/op
count10         true   10000  avgt  100  139,759 ± 0,382  ns/op
count10Skip     true   10000  avgt  100  156,963 ± 1,862  ns/op

So indeed we have some performance drawback in patched version. Here's a chart:

[image: chart of the CountSized results above]

I've calculated linear regression on (patched-baseline) times, depending on the number of ops. It's y = 1.288x - 0.7078 for clean type profile and y = 2.6174x + 6.9489 for polluted type profile. So, in the worst case, we have circa 2.6ns per operation plus 7ns constant overhead.
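
The two regression lines can be reproduced from the tables above with an ordinary least-squares fit. The sketch below assumes that x is the operation count from the benchmark name (0, 2, ..., 10), that y is the patched score minus the baseline score, and that count10Skip is left out; the class name `OverheadRegression` is made up for this example.

```java
// Illustrative re-computation of the regression quoted in the comment.
final class OverheadRegression {

    static final double[] OPS = {0, 2, 4, 6, 8, 10};
    // Scores in ns/op, copied from the CountSized tables (count0..count10).
    static final double[] BASE_CLEAN = {15.648, 31.252, 47.683, 64.417, 80.813, 101.057};
    static final double[] PATCHED_CLEAN = {15.363, 33.736, 51.470, 70.407, 89.865, 114.423};
    static final double[] BASE_POLLUTED = {18.843, 33.716, 49.062, 66.773, 82.727, 104.499};
    static final double[] PATCHED_POLLUTED = {26.538, 46.089, 66.560, 87.852, 109.037, 139.759};

    // Element-wise difference patched[i] - baseline[i].
    static double[] diff(double[] patched, double[] baseline) {
        double[] d = new double[patched.length];
        for (int i = 0; i < d.length; i++) {
            d[i] = patched[i] - baseline[i];
        }
        return d;
    }

    // Ordinary least squares for y = slope * x + intercept; returns {slope, intercept}.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[] {slope, meanY - slope * meanX};
    }

    public static void main(String[] args) {
        double[] clean = fit(OPS, diff(PATCHED_CLEAN, BASE_CLEAN));
        double[] polluted = fit(OPS, diff(PATCHED_POLLUTED, BASE_POLLUTED));
        System.out.printf("clean:    y = %.4fx + %.4f%n", clean[0], clean[1]);
        System.out.printf("polluted: y = %.4fx + %.4f%n", polluted[0], polluted[1]);
    }
}
```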

However, using the Stream API without actually iterating the stream is a very rare case. And if we iterate, the performance will be dominated by the number of iterations. I tried to model this case with a SumSized benchmark (replacing count with sum, for 5 and 10 stream elements), but got very confusing results.

Baseline:

Benchmark  (pollute)  (size)  Mode  Cnt    Score    Error  Units
sum0            true       5  avgt  100  126,425 ±  0,793  ns/op
sum2            true       5  avgt  100  195,113 ±  1,359  ns/op
sum4            true       5  avgt  100  304,111 ±  8,302  ns/op
sum6            true       5  avgt  100  414,841 ±  3,215  ns/op
sum8            true       5  avgt  100  507,421 ±  4,781  ns/op
sum10           true       5  avgt  100  633,635 ±  7,105  ns/op
sum0           false       5  avgt  100   45,781 ±  0,258  ns/op
sum2           false       5  avgt  100   86,720 ±  0,573  ns/op
sum4           false       5  avgt  100  195,777 ±  1,145  ns/op
sum6           false       5  avgt  100  291,261 ±  2,091  ns/op
sum8           false       5  avgt  100  376,094 ±  3,283  ns/op
sum10          false       5  avgt  100  492,082 ±  7,914  ns/op
sum0            true      10  avgt  100  127,989 ±  0,758  ns/op
sum2            true      10  avgt  100  219,991 ±  3,081  ns/op
sum4            true      10  avgt  100  374,148 ±  7,426  ns/op
sum6            true      10  avgt  100  557,829 ±  3,959  ns/op
sum8            true      10  avgt  100  698,135 ±  4,915  ns/op
sum10           true      10  avgt  100  904,851 ± 14,458  ns/op
sum0           false      10  avgt  100   43,861 ±  0,107  ns/op
sum2           false      10  avgt  100  105,049 ±  0,276  ns/op
sum4           false      10  avgt  100  294,639 ±  1,499  ns/op
sum6           false      10  avgt  100  439,340 ±  4,223  ns/op
sum8           false      10  avgt  100  577,025 ±  5,760  ns/op
sum10          false      10  avgt  100  729,391 ±  6,045  ns/op

Patched:

Benchmark  (pollute)  (size)  Mode  Cnt    Score   Error  Units
sum0            true       5  avgt  100   68,466 ± 0,167  ns/op
sum2            true       5  avgt  100  107,240 ± 0,261  ns/op
sum4            true       5  avgt  100  209,469 ± 1,098  ns/op
sum6            true       5  avgt  100  300,873 ± 2,020  ns/op
sum8            true       5  avgt  100  378,654 ± 2,620  ns/op
sum10           true       5  avgt  100  473,769 ± 3,665  ns/op
sum0           false       5  avgt  100   49,435 ± 2,702  ns/op
sum2           false       5  avgt  100   96,237 ± 2,906  ns/op
sum4           false       5  avgt  100  195,196 ± 0,961  ns/op
sum6           false       5  avgt  100  286,542 ± 1,874  ns/op
sum8           false       5  avgt  100  371,664 ± 3,416  ns/op
sum10          false       5  avgt  100  457,178 ± 3,776  ns/op
sum0            true      10  avgt  100   69,223 ± 0,195  ns/op
sum2            true      10  avgt  100  120,507 ± 0,752  ns/op
sum4            true      10  avgt  100  291,328 ± 5,581  ns/op
sum6            true      10  avgt  100  439,136 ± 3,787  ns/op
sum8            true      10  avgt  100  569,188 ± 6,440  ns/op
sum10           true      10  avgt  100  709,916 ± 5,022  ns/op
sum0           false      10  avgt  100   46,347 ± 0,174  ns/op
sum2           false      10  avgt  100  109,131 ± 2,381  ns/op
sum4           false      10  avgt  100  296,566 ± 2,079  ns/op
sum6           false      10  avgt  100  430,852 ± 2,629  ns/op
sum8           false      10  avgt  100  562,795 ± 4,442  ns/op
sum10          false      10  avgt  100  716,229 ± 5,659  ns/op

Or, in graphical form:
[two images: charts of the SumSized results above]

For some reason, the non-patched polluted version is the slowest, and I cannot see any stable overhead in the patched version. For 4+ intermediate ops, the patched version's numbers are always better than the corresponding non-patched ones. I would be glad if somebody could explain these numbers or point out a flaw in this benchmark.

What do you think, @PaulSandoz? Is this overhead acceptable, or should I experiment with a new Stream flag?

@bridgekeeper bridgekeeper bot commented May 15, 2021

@amaembo This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Member

@PaulSandoz PaulSandoz commented May 17, 2021

@amaembo this dropped off my radar, but the Skara reminder made it visible again!

Thank you for the detailed analysis. I cannot explain the results when profile pollution is triggered over the kinds of stream.
I think we have a good sense that the performance is not unduly perturbed for small streams (there is more work done elsewhere than determining the actual size of the sized stream).

Implementation-wise I still find it a little awkward that the operation is responsible for calling the prior op in the pipeline, and we see the consequences of that in the Slice op implementation that predicates on isParallel.

It might be cleaner if the op is presented with some exact size and adjusts it, something more pure e.g. for SliceOp:

long exactOutputSize(long sourceSize) {
  return calcSize(sourceSize, skip, normalizedLimit);
}

The default impl would be:

long exactOutputSize(long sourceSize) {
  return sourceSize;
}

Then the pipeline helper is responsible for traversing the pipeline (be it sequentially or in parallel, in the latter case the above method would not get called since the slice op becomes the source spliterator for the next stages).

Doing that efficiently does, I think, require a new flag, set by an op and only meaningful when SIZED is set (and cleared when SIZED is cleared, although perhaps we only need to do that when splitting stages for parallel execution, see AbstractPipeline.sourceSpliterator).
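
The shape proposed in this comment can be modeled outside the JDK with a toy pipeline. The sketch below is not the real `AbstractPipeline` code: the `Stage` interface, the `Slice` class and the size arithmetic are simplified assumptions that only show each stage adjusting a size handed to it while a helper walks the stages in order.

```java
import java.util.List;

// Toy model of the proposal; the types here are stand-ins, not JDK internals.
final class SizePipelineSketch {

    interface Stage {
        // Default: the stage keeps the element count (as map or sorted would).
        default long exactOutputSize(long previousSize) {
            return previousSize;
        }
    }

    // A slice stage drops `skip` elements and then keeps at most `limit`.
    static final class Slice implements Stage {
        private final long skip;
        private final long limit;

        Slice(long skip, long limit) {
            this.skip = skip;
            this.limit = limit;
        }

        @Override
        public long exactOutputSize(long previousSize) {
            // Assumed arithmetic for a known, non-negative input size.
            return Math.max(0, Math.min(previousSize - skip, limit));
        }
    }

    // The helper, rather than each stage, traverses the pipeline from the source onwards.
    static long exactOutputSizeIfKnown(long sourceSize, List<Stage> stages) {
        long size = sourceSize;
        for (Stage stage : stages) {
            size = stage.exactOutputSize(size);
        }
        return size;
    }

    public static void main(String[] args) {
        Stage map = new Stage() { };
        System.out.println(exactOutputSizeIfKnown(1_000, List.of(map, new Slice(50, 800))));
    }
}
```

Because no stage calls its predecessor, each `exactOutputSize` stays a pure function of its input, which is the property the comment asks for.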

Contributor Author

@amaembo amaembo commented May 23, 2021

@PaulSandoz I added a new flag and updated exactOutputSize per your suggestion, implementing the explicit iteration inside AbstractPipeline#exactOutputSizeIfKnown, along with the isParallel() check. I'm not sure how to clear the new flag automatically, but this is not really necessary, as the flag is not used if SIZED is not set. Also, I skipped the sourceStage (starting from sourceStage.nextStage) to save one virtual call, as the current implementation never has a source stage that adjusts the size.

// 13, 0x04000000
SIZE_ADJUSTING(13,
set(Type.OP));

// The following 2 flags are currently undefined and a free for any further

@amaembo

amaembo May 23, 2021
Author Contributor

For some reason, the comment says 'The following 2 flags', while we had three, so it was incorrect before but correct now.
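
The `// 13, 0x04000000` comment in the snippet above is consistent with a layout that gives every flag two bits, so that position 13 lands on bit 26. A minimal sketch of that arithmetic follows; the class and method names are hypothetical and the layout is stated here as an assumption rather than quoted from the source.

```java
// Sketch of a two-bits-per-flag encoding; names are illustrative.
final class FlagBitsSketch {

    static final int SET_BITS = 0b01;      // flag is set
    static final int CLEAR_BITS = 0b10;    // flag is cleared
    static final int PRESERVE_BITS = 0b11; // flag is preserved

    // Mask that marks the flag at `position` as set.
    static int setMask(int position) {
        return SET_BITS << (position * 2);
    }

    // Mask that marks the flag at `position` as cleared.
    static int clearMask(int position) {
        return CLEAR_BITS << (position * 2);
    }

    // True when the two bits at `position` in `flags` hold exactly the "set" pattern.
    static boolean isSet(int flags, int position) {
        return (flags & (PRESERVE_BITS << (position * 2))) == setMask(position);
    }

    public static void main(String[] args) {
        System.out.printf("position 13 set mask: 0x%08x%n", setMask(13));
    }
}
```

Under the same assumption, position 3 yields 0x40, which is the value of `Spliterator.SIZED`.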

@openjdk openjdk bot removed the rfr label May 23, 2021
@openjdk openjdk bot added the rfr label May 24, 2021
Member

@PaulSandoz PaulSandoz left a comment

Very good. Thanks for making the adjustments. Architecturally, I think we are in a better place. I just have some comments, mostly around code comments.

Member

@PaulSandoz PaulSandoz left a comment

I agree the likelihood of some stateless size-adjusting op is small, but I think it's worthwhile capturing the thinking, since this area of streams is quite complex. Thanks for adding the comment.

@openjdk openjdk bot commented May 27, 2021

@amaembo This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8265029: Preserve SIZED characteristics on slice operations (skip, limit)

Reviewed-by: psandoz

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 754 new commits pushed to the master branch:

  • 95b1fa7: 8267529: StringJoiner can create a String that breaks String::equals
  • 7f52c50: 8182043: Access to Windows Large Icons
  • 8a31c07: 8267886: ProblemList javax/management/remote/mandatory/connection/RMIConnector_NPETest.java
  • ae258f1: 8265418: Clean-up redundant null-checks of Class.getPackageName()
  • 41185d3: 8229517: Support for optional asynchronous/buffered logging
  • 7c85f35: 8267123: Remove RMI Activation
  • 0754266: 8267709: Investigate differences between HtmlStyle and stylesheet.css
  • 23189a1: 8191786: Thread-SMR hash table size should be dynamic
  • ef368b3: 8265836: OperatingSystemImpl.getCpuLoad() returns incorrect CPU load inside a container
  • 10a6f5d: 8230623: Extract command-line help for -Xlint sub-options to new --help-lint
  • ... and 744 more: https://git.openjdk.java.net/jdk/compare/c15680e7428335651232a28c9cb5a9b9b42a7d56...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label May 27, 2021
…java

Co-authored-by: Paul Sandoz <paul.d.sandoz@googlemail.com>
Contributor Author

@amaembo amaembo commented May 28, 2021

/integrate

@openjdk openjdk bot closed this May 28, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels May 28, 2021
@openjdk openjdk bot commented May 28, 2021

@amaembo Since your change was applied there have been 754 commits pushed to the master branch:

  • 95b1fa7: 8267529: StringJoiner can create a String that breaks String::equals
  • 7f52c50: 8182043: Access to Windows Large Icons
  • 8a31c07: 8267886: ProblemList javax/management/remote/mandatory/connection/RMIConnector_NPETest.java
  • ae258f1: 8265418: Clean-up redundant null-checks of Class.getPackageName()
  • 41185d3: 8229517: Support for optional asynchronous/buffered logging
  • 7c85f35: 8267123: Remove RMI Activation
  • 0754266: 8267709: Investigate differences between HtmlStyle and stylesheet.css
  • 23189a1: 8191786: Thread-SMR hash table size should be dynamic
  • ef368b3: 8265836: OperatingSystemImpl.getCpuLoad() returns incorrect CPU load inside a container
  • 10a6f5d: 8230623: Extract command-line help for -Xlint sub-options to new --help-lint
  • ... and 744 more: https://git.openjdk.java.net/jdk/compare/c15680e7428335651232a28c9cb5a9b9b42a7d56...master

Your commit was automatically rebased without conflicts.

Pushed as commit 0c9daa7.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

3 participants