
Alternative scalable MemoryScope #142

Closed
wants to merge 3 commits into from

Conversation

plevart commented May 3, 2020 • edited by openjdk bot

This is an alternative MemoryScope which is more scalable when used in a scenario where child scope is frequently acquired and closed concurrently from multiple threads (for example in parallel Stream.findAny())

Progress

• Change must not contain extraneous whitespace
• Change must be properly reviewed

Reviewers

$ git fetch https://git.openjdk.java.net/panama-foreign pull/142/head:pull/142
$ git checkout pull/142

 Alternative scalable MemoryScope 
 95ee767 

bridgekeeper bot commented May 3, 2020

 Hi @plevart, welcome to this OpenJDK project and thanks for contributing! We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user plevart" as summary for the issue. If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

openjdk bot commented May 3, 2020

 ⚠️ @plevart a branch with the same name as the source branch for this pull request (foreign-memaccess) is present in the target repository. If you eventually integrate this pull request then the branch foreign-memaccess in your personal fork will diverge once you sync your personal fork with the upstream repository. To avoid this situation, create a new branch for your changes and reset the foreign-memaccess branch. You can do this by running the following commands in a local repository for your personal fork. Note: you do not have to name the new branch NEW-BRANCH-NAME.

 $ git checkout -b NEW-BRANCH-NAME
 $ git branch -f foreign-memaccess 4397c2d060856538f52cc8165dc94fd0cc9fa0fc
 $ git push -f origin foreign-memaccess

 Then proceed to create a new pull request with NEW-BRANCH-NAME as the source branch and close this one.

Webrevs

reviewed
 acquires.increment();
 // check state 2nd
 int state = (int) STATE.getVolatile(this);
 while (state > STATE_OPEN) {

PaulSandoz May 4, 2020 Member

You could do the following to limit to one VH call:

int state;  // perhaps rename to not shadow the field
while ((state = (int) STATE.getVolatile(this)) > STATE_OPEN) {
...
}
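Paul's suggestion can be expanded into a minimal runnable sketch (the class name, the STATE_OPEN value, and the loop body here are invented for illustration; only the "single VarHandle call in the loop condition" pattern is from the comment above):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch of folding the volatile read into the loop condition, so the
// VarHandle access appears at exactly one call site.
class StatePoll {
    static final int STATE_OPEN = 0;   // invented value for this sketch
    private volatile int state;

    private static final VarHandle STATE;
    static {
        try {
            STATE = MethodHandles.lookup()
                    .findVarHandle(StatePoll.class, "state", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // spins while the state is above STATE_OPEN, then returns the value seen
    int awaitNotAboveOpen() {
        int s;
        while ((s = (int) STATE.getVolatile(this)) > STATE_OPEN) {
            Thread.onSpinWait();       // placeholder for the elided loop body
        }
        return s;
    }

    void setState(int v) { state = v; }
}
```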


plevart May 5, 2020 Author Contributor

Sure, that's possible. Thanks.

     return new MemoryScope(ref, this::release);
 }

 private static final class Root extends MemoryScope {
     private final LongAdder acquires;

PaulSandoz May 4, 2020 Member

Did you consider collapsing into just one LongAdder, using increment (for acquiring), decrement (for releasing), and sum (checking for a zero sum when closing)? Perhaps it was discussed already; I lost track of all the discussion.

plevart May 5, 2020 • edited Author Contributor

No, it was not discussed yet, and I'm happy you brought it up. It would not work, though: LongAdder.sum() is not an atomic operation. Consider the following scenario:

• thread A1 has earlier acquired a child segment and still holds it (net adder count +1)
• thread C calls close() on the root segment
• thread C sets state = CLOSING
• thread A2 tries to acquire a child segment, so it first does adder.increment()
• thread A2 reads state == CLOSING, therefore undoes the previous increment with adder.decrement() and fails, but...
• thread C, which just missed A2's adder.increment(), does see A2's adder.decrement(), because it is performed on a different LongAdder cell which C reads after it is incremented. The result C gets from adder.sum() is therefore 0 (A1's +1 cancelled by A2's unmatched -1), so it completes closing the root segment by setting state = CLOSED and executing cleanup, but thread A1 still has access to the child segment and BOOM!!!

Having two LongAdders (acquires, releases) and doing the sums sequentially, first releases.sum(), then acquires.sum(), guarantees that we can never miss an acquiring thread's acquires.increment() while seeing that thread's releases.increment(), so the above scenario is not possible.

mcimadamore May 5, 2020 Collaborator

I agree with Peter that using a single LongAdder could miss an update: you could see 'zero' when in reality something has happened but sum() didn't get the memo. The entire approach, as far as I understand, relies on subtle happens-before relationships which arise precisely because of the use of the two counters (and the fact that one counter is always updated before the other).

plevart May 5, 2020 Author Contributor

...right, and the fact that the sum() of both counters is performed in the opposite order to that in which they are updated.
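The two-counter scheme discussed above can be condensed into a runnable sketch. All names here are invented, and the real MemoryScope uses VarHandle-based state transitions and a two-phase close; this model only shows the counter ordering:

```java
import java.util.concurrent.atomic.LongAdder;

// Model of the two-counter protocol: acquire() bumps 'acquires' first and then
// re-checks state; close() reads the sums in the opposite order (releases
// first, then acquires), so a concurrent acquirer's increment can never be
// missed while its matching release is seen.
class TwoCounterScope {
    private static final int OPEN = 0, CLOSING = 1, CLOSED = 2;

    private final LongAdder acquires = new LongAdder();
    private final LongAdder releases = new LongAdder();
    private volatile int state = OPEN;

    boolean acquire() {
        acquires.increment();        // announce intent first...
        if (state != OPEN) {         // ...then re-check the state
            releases.increment();    // undo: count the failed acquire as released
            return false;
        }
        return true;
    }

    void release() {
        releases.increment();
    }

    boolean tryClose() {
        state = CLOSING;             // block new acquires
        long r = releases.sum();     // sum releases FIRST...
        long a = acquires.sum();     // ...then acquires (opposite of update order)
        if (a != r) {
            state = OPEN;            // someone still holds access; back off
            return false;
        }
        state = CLOSED;
        return true;
    }
}
```

The essential point is in tryClose(): because acquirers update acquires before releases, summing releases before acquires can over-count outstanding acquires but never under-count them, so a close can fail spuriously but can never succeed while a child is still held.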

PaulSandoz commented May 4, 2020

 I think this should work under the well-defined Spliterator use case but might be problematic if made public sometime later on. For example:

 MemorySegment s = ...
 var a = s.acquire();
 a.close();
 var d = s.dup();
 a.acquire(); // transiently affecting state of d since the adders are shared

 There also might be ways to poke a Spliterator to induce such behaviour, e.g. the Consumer passed to tryAdvance creating a new Spliterator from the given segment, and operating on that after the tryAdvance completes. Clearly that's not what someone should be doing, but it's getting complex enough that it's hard to reason about the possible effects and whether they are harmless or not, at least for me :-)

plevart commented May 5, 2020

 So, in your scenario:

 MemorySegment s = ...
 var a = s.acquire();
 a.close();
 var d = s.dup();
 a.acquire(); // transiently affecting state of d since the adders are shared

 ...the a.acquire() will fail, because s.state == CLOSED, but the shared acquires.sum() will briefly be greater than releases.sum() and could therefore cause a spurious failure of d.close(). This is of course undesirable (like it was undesirable to have spurious failures of acquire() in previous versions of this MemoryScope, until I added a two-phase close), so it would be best not to share the acquires/releases LongAdders with a duped MemoryScope. I'll fix this by creating new LongAdder instances for the duped scope. Thanks.
 Don't re-use acquires/releases LongAdder(s) for duped scope 
 04cb4f8 
reviewed
     //flush read/writes to memory before returning the new segment
     VarHandle.fullFence();
 }
 return dup(0L, length, mask, newOwner, scope.dup());
 }

plevart May 5, 2020 Author Contributor

Maybe you're asking why I removed this fullFence() call? I don't think it is needed. I checked all implementations of MemorySegment and they all contain only final fields. It is therefore safe to publish such instances via data races, and no explicit fences are needed.

mcimadamore May 5, 2020 • edited Collaborator

I think we had a discussion on this; we knew publishing segments was safe (all fields final), but there were some doubts as to whether the fences added at the end of the constructor were enough to warrant flushing of writes to the off-heap memory backing the segment (on all platforms). This effect is not covered (afaik) by the JMM.

plevart May 5, 2020 • edited Author Contributor

Ah, I see. Will revert that back then. But why is AbstractMemorySegmentImpl.withOwnerThread() doing this differently than AbstractMemorySegmentImpl.acquire()? acquire() too is handing access to a different thread, and in acquire() there is no explicit fence.

mcimadamore May 5, 2020 Collaborator

> But why is AbstractMemorySegmentImpl.withOwnerThread() doing this differently than AbstractMemorySegmentImpl.acquire()? acquire() too is handing access to a different thread and in acquire() there is no explicit fence.

This is a valid point. I guess, from the past, acquire() meant "racy", whereas for handoff patterns (e.g. withOwnerThread) we wanted to be extra careful that only one thread at a time has access.

Note that, currently, while all acquired segment slices can be operated on concurrently in a race-free fashion, unfortunately there is still the main segment, which can be written/read concurrently by the original thread. So adding fences in this situation just kicks the can down the road.

If we had a way for the original segment to become inaccessible while split by a spliterator, then I'd agree with you that a fence would be ideal (since this is another case of handoff).

But it is very hard to invalidate the original segment; a few options I've considered:

• make the 'valid state' check more complex - e.g. in your implementation we could check whether the two counters yield the same value before allowing access to the original thread - but I think this would be slow, and I think it could still allow races (this is a case of check & act, where we have no guarantee that, by the time memory is read/written, the condition we checked still holds true)

• as we do in other places of the API - just kill the original segment after you split (e.g. make it not alive). This will make it inaccessible, which is consistent with what we want. But now there's a problem: how do you get back the original segment after you're done with the spliterator? Joining slices is messy, and adding synchronization to allow spliterators to be 'closed' atomically will just add more contention.

So, it seems like the cost of fixing this particular issue is higher than the actual problem; there's an access mode called ACQUIRE, and if a segment owner is not OK with having races with different threads, that mode can be disabled.

plevart May 6, 2020 Author Contributor

Yeah, I was just thinking of the situation where a segment is passed to a parallel stream and the threads of the parallel stream operate on slices (acquired children) in a read-only way, since this is the source of information and should not be mutated. Now, if the segment was created and written to in one thread before being handed to the parallel stream, workers from the FJPool may not see the content of their slices (acquired children) fully initialized, since normal synchronization, like dispatch of control to another thread, presumably doesn't have the same effect on the off-heap memory (on all platforms)?

mcimadamore May 6, 2020 Collaborator

> Yeah, I was just thinking of the situation where a segment is passed to parallel stream [...] presumably doesn't have the same effect on the off-heap memory (on all platforms)?

I'm fine with adding the fence in principle. If it adds too much overhead (which I didn't try), I think we should leave it out though, since the benefits are limited.

reviewed

mcimadamore left a comment • edited

 My comment here is mostly non-code-related; I've already seen the code and attempted to prove its correctness in this email: https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-May/066190.html

 My concerns are:

 • Some of the HB relationships in the code are quite non-obvious, subtle and hard to reason about (well, at least they were for me). This means that, compared to the current approach, there will be a high maintenance cost associated with this.
 • In some of the benchmarks provided by Peter, the stream version of findAny is going several orders of magnitude slower than a serial for/each loop anyway. So, are we sure we're solving the right problem here? I.e. is there a use case where, by reducing contention, we get back performance that would be considered acceptable for a performance-savvy user? Or is this something that just makes something 100x worse as opposed to 1000x worse?

 Data is reported below:

 w/o patch
 ParallelSum.find_any_stream_parallel  avgt  10  1332.687 ± 733.535  ms/op
 ParallelSum.find_any_stream_serial    avgt  10   440.260 ±   3.110  ms/op
 ParallelSum.find_first_loop_serial    avgt  10     5.809 ±   0.044  ms/op

 w/ patch
 ParallelSum.find_any_stream_parallel  avgt  10    80.280 ±  13.183  ms/op
 ParallelSum.find_any_stream_serial    avgt  10   317.388 ±   2.787  ms/op
 ParallelSum.find_first_loop_serial    avgt  10     5.790 ±   0.038  ms/op

 A few questions come up naturally here: do these numbers reflect some intrinsic problem with the memory segment spliterator per se, or do they reflect a more general problem with using short-circuiting operations (such as findAny) which consume almost zero CPU time in a stream (and, worse, a parallel stream)?

 With this I'm not of course saying that the patch (and associated improvement) is not important, but I'm wondering whether the added benefit would be worth the added maintenance cost given that, from the numbers above, it still doesn't look like findAny is the way to go for processing a segment efficiently anyway.

plevart commented May 5, 2020

 I agree that the main problem here is the constant acquiring/closing of the memory segment in Spliterator.tryAdvance(). It is interesting to observe a slight performance improvement even in the serial stream (according to the above results), while the concurrent contention in the current MemoryScope is obvious in the parallel stream.

 The above results are of course "distorted" by the fact that there are lots of small elements in the testing stream (int == 4 bytes), so performance figures are dominated by the overhead. But there may be valid compositions of .findAny() or such that use bigger elements, so that the overhead plays a minor role. Still, the current MemoryScope could present a problem even in such compositions where the alternative MemoryScope would be perfectly acceptable.

 Anyway, I was also thinking of "improving" the Spliterator API with two empty default methods that would present a real improvement in such scenarios:

 public interface Spliterator {

     /**
      * Before a sequence of {@link #tryAdvance} calls from a thread, this method may
      * be called in the same thread to establish a context that may help in improving the
      * efficiency of subsequent {@code tryAdvance} calls. If this method is called and it returns
      * normally, then it is guaranteed that the sequence of {@code tryAdvance} calls will be
      * followed by a call to {@link #endTryAdvanceSequence} method to end such sequence.
      * The end of sequence of {@code tryAdvance} calls may be announced before all elements of
      * the Spliterator have been exhausted.
      *
      * @implNote the default implementation of this method does nothing
      * @since 15
      */
     default void beginTryAdvanceSequence() { }

     /**
      * If {@link #beginTryAdvanceSequence} method is called and the call returns normally before
      * a sequence of {@link #tryAdvance} calls, then this method is guaranteed to be called to end such
      * sequence and give this Spliterator a chance to clean-up what was established in
      * {@code beginTryAdvanceSequence}.
      *
      * @implNote the default implementation of this method does nothing
      * @since 15
      */
     default void endTryAdvanceSequence() { }

 Would this be an acceptable "addition" to the Spliterator API?

mcimadamore May 5, 2020 Collaborator

> Would this be an acceptable "addition" to the Spliterator API?

I'll leave this to Paul. IMHO doing this for an incubating API seems a bit too much (although I can buy that there can be other benefits in doing this). Another option would be to have this as a JDK-internal spliterator API, so that our own spliterators could implement it and get better stream performance (after all, I don't expect a lot of users to come up with their own spliterators). If the stream sees that one of those 'special' spliterators is being used, it can use the more efficient implementation.

plevart commented May 5, 2020

 This would be universally useful. I once already missed such methods in a Spliterator I was trying to design, backed by a database.
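To make the motivation concrete, here is a hypothetical cursor-backed spliterator (all names and the resource-counting are invented for this sketch) of the kind the proposed hooks would help: without a surrounding sequence, every tryAdvance call has to open its underlying resource anew, while with begin/end hooks the whole run of calls shares one open cursor:

```java
import java.util.Spliterator;
import java.util.function.Consumer;

// Hypothetical spliterator over a "cursor" (modeled by an int range). The
// cursorOpens counter stands in for the cost of opening a real resource such
// as a database cursor or connection.
class CursorSpliterator implements Spliterator<Integer> {
    private final int end;
    private int next;
    private boolean inSequence;
    int cursorOpens;                       // how many times the "cursor" was opened

    CursorSpliterator(int from, int to) { this.next = from; this.end = to; }

    // the two proposed hooks, implemented to pin the cursor for a whole sequence
    void beginTryAdvanceSequence() { inSequence = true; cursorOpens++; }
    void endTryAdvanceSequence()   { inSequence = false; }

    @Override public boolean tryAdvance(Consumer<? super Integer> action) {
        if (next >= end) return false;
        if (!inSequence) cursorOpens++;    // no sequence context: open per call
        action.accept(next++);
        return true;
    }

    @Override public Spliterator<Integer> trySplit() { return null; }
    @Override public long estimateSize() { return end - next; }
    @Override public int characteristics() { return ORDERED | SIZED; }
}
```

Traversing five elements without the hooks opens the cursor five times; bracketing the same traversal with begin/end opens it once.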
 Re-add fullFence() in MemorySegment.withOwnerThread() due to effect on the segment backing memory which is not guaranteed by JMM otherwise 
 3c68c95 
approved these changes

 The code looks good - leaning towards approving and seeing how it goes. One thing I noted is that there are many assumptions that some methods can only be called from the owner thread; while that's true for the 'safe' API, that's not the case for the unsafe API. In that case I have a vague feeling that the old atomic approach would behave a bit more predictably.
 //reference to keep hold onto
 final Object ref;

 /**
  * Creates a root MemoryScope with given ref and cleanupAction.

While this is true, note that it is possible to unsafely create unconfined segments, in which case the ownership restriction would not apply. Perhaps the doc should be tweaked a bit to reflect this.

 final static MemoryScope GLOBAL = new MemoryScope(null, null);

 /**
  * Closes this scope, executing any cleanup action if this is the root scope.
  * This method may only be called in "owner" thread.

Again - for unsafe segments this is not true. Not that we need to provide guarantees in that case - but I think it's better to make things clear.

openjdk bot commented May 6, 2020 • edited

 @plevart This change now passes all automated pre-integration checks. Type /integrate in a new comment to proceed. After integration, the commit message will be:

 Alternative scalable MemoryScope

 Reviewed-by: mcimadamore, psandoz

 If you would like to add a summary, use the /summary command. To credit additional contributors, use the /contributor command. To add additional solved issues, use the /solves command.

 Since the source branch of this PR was last updated there have been 115 commits pushed to the foreign-memaccess branch:

 • 46b7d05: Automatic merge of master into foreign-memaccess
 • b09f806: Automatic merge of jdk:master into master
 • 3f50575: 8237834: com/sun/jndi/ldap/LdapDnsProviderTest.java failing with LDAP response read timeout
 • 3beee2c: 8242919: Paste locks up jshell
 • 13d6b49: 8244625: Zero VM is broken after JDK-8244550 (java_lang_Class::as_Klass(oopDesc*) undefined)
 • 601891a: 8244618: WinUpgradeUUIDTest.java fails after JDK-8236518
 • 88722a8: 8244243: Shenandoah: Cleanup Shenandoah phase timing tracking and JFR event supporting
 • 61864c2: 8062947: Fix exception message to correctly represent LDAP connection failure
 • e05227a: 8244248: boot-jdk.m4 captures the version line using regex
 • ed4bc1b: 8244245: localizedBy() should override localized values with default values
 • ... and 105 more: https://git.openjdk.java.net/panama-foreign/compare/4397c2d060856538f52cc8165dc94fd0cc9fa0fc...foreign-memaccess

 As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid automatic rebasing, please merge foreign-memaccess into your branch, and then specify the current head hash when integrating, like this: /integrate 46b7d054ec1ff9cb59f50a2bd358a998ee1a398a.

 As you do not have Committer status in this project, an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@mcimadamore, @PaulSandoz) but any other Committer may sponsor as well.
➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

PaulSandoz commented May 6, 2020

 Thanks Peter for the explanation. I now see why two monotonically increasing counters work (with careful ordering of the non-atomic sums). My sense is we should let this soak in the panama repo.

 With regards to the proposed changes to Spliterator, I fear this will complicate the already complicated set of interactions that are possible, and require clients to take into account what is in effect a new characteristic. Instead I am wondering if we can add a new traversal method that can short-circuit, for example:

 // traverses until no elements or the action returns false, whichever is first
 // if the action returns false and there are elements remaining then traversal may
 // continue with further calls to this method, or tryAdvance, or forEachRemaining
 default void forEachSomeRemaining(Predicate<? super T> action) {
     boolean[] taken = new boolean[1]; // ugly side channel for the predicate's result
     boolean hasNext;
     do {
         hasNext = tryAdvance(e -> taken[0] = action.test(e));
     } while (hasNext && taken[0]);
 }
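Paul's sketch can be written as a standalone static helper (the name forEachSomeRemaining is taken from the comment above; hoisting it out of the interface is my adaptation so it runs against any existing Spliterator) and exercised against an ordinary List spliterator:

```java
import java.util.Spliterator;
import java.util.function.Predicate;

// Standalone version of the proposed short-circuiting traversal: advance while
// elements remain AND the predicate keeps returning true. The spliterator's
// position is preserved, so traversal can later resume where it stopped.
class ForEachSome {
    static <T> void forEachSomeRemaining(Spliterator<T> sp, Predicate<? super T> action) {
        boolean[] taken = new boolean[1];  // side channel for the predicate's verdict
        boolean hasNext;
        do {
            hasNext = sp.tryAdvance(e -> taken[0] = action.test(e));
        } while (hasNext && taken[0]);
    }
}
```

A first call with a predicate that returns false at some element consumes elements up to and including that one; a later call on the same spliterator picks up from the next element.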
approved these changes

openjdk bot commented May 8, 2020

 @mcimadamore The change author (@plevart) must issue an integrate command before the integration can be sponsored.

plevart commented May 8, 2020

 /integrate

openjdk bot commented May 8, 2020


bot closed this May 8, 2020