Optimistically update ref counts #7248

Closed
carl-mastrangelo wants to merge 1 commit into netty:4.1 from carl-mastrangelo:refcnt3

Conversation

carl-mastrangelo (Member) commented Sep 25, 2017

Motivation:
Highly retained and released objects have contention on their ref
count. Currently, the ref count is updated using compareAndSet
with care to make sure the count doesn't overflow, double free, or
revive the object.

Profiling has shown that a non-trivial fraction (~1%) of CPU time in gRPC
latency benchmarks comes from updating the ref count.

Modification:
Rather than pessimistically assuming the ref count will be invalid,
optimistically update it assuming it will be valid. If the update was
wrong, then use the slow path to revert the change and throw an
exception. Most of the time, the ref counts are correct.

This changes from using compareAndSet to getAndAdd, which emits a
different CPU instruction on x86 (CMPXCHG to XADD). Because the
CPU knows it will modify the memory, it can avoid contention.

On a highly contended machine, this can be about 2x faster.

There is a downside to the new approach. The ref counts can
temporarily enter invalid states if the object is over-retained or
over-released. The code does handle these overflow and underflow
scenarios, but it is possible that another concurrent access may push
the failure to a different location. For example:

Time 1 Thread 1: obj.retain(INT_MAX - 1)
Time 2 Thread 1: obj.retain(2)
Time 2 Thread 2: obj.retain(1)

Previously, Thread 2 would always succeed and Thread 1 would always
fail on the second access. Now, Thread 2 could fail while Thread 1
is rolling back its change.

====

There are a few reasons why I think this is okay:

  1. Buggy code is going to have bugs. An exception is going to be
     thrown either way; this just causes the other threads to notice
     the state is messed up and stop early.
  2. If high retention counts are a use case, then the ref count should
     be a long rather than an int.
  3. The critical section is greatly reduced compared to the previous
     version, so the likelihood of this happening is lower.
  4. On error, the code always rolls back the change atomically, so
     there is no possibility of corruption.

Result:
Faster refcounting
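
To make the fast-path/slow-path split concrete, here is a minimal, self-contained sketch of an optimistic retain along these lines. It is illustrative only, not the exact Netty change: the class name, exception type, and message format are invented for the example.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Illustrative sketch of optimistic ref counting; not the exact Netty code.
final class OptimisticRefCnt {
    private static final AtomicIntegerFieldUpdater<OptimisticRefCnt> refCntUpdater =
            AtomicIntegerFieldUpdater.newUpdater(OptimisticRefCnt.class, "refCnt");

    private volatile int refCnt = 1;

    OptimisticRefCnt retain(int increment) {
        // Fast path: one unconditional XADD, no CAS retry loop.
        int oldRef = refCntUpdater.getAndAdd(this, increment);
        if (oldRef <= 0 || oldRef + increment < oldRef) {
            // Slow path: the object was already freed (oldRef <= 0) or the
            // count overflowed. Roll back by subtracting exactly what was
            // added, then fail.
            refCntUpdater.getAndAdd(this, -increment);
            throw new IllegalStateException("refCnt: " + oldRef + ", increment: " + increment);
        }
        return this;
    }
}
```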

@carl-mastrangelo carl-mastrangelo requested a review from normanmaurer Sep 25, 2017

int oldRef = refCntUpdater.getAndAdd(this, increment);
if (oldRef <= 0 || oldRef + increment < oldRef) {
    // Roll back: either the object was already freed (oldRef <= 0) or the addition overflowed.
    refCntUpdater.getAndAdd(this, -increment);

fenik17 (Contributor) commented Sep 26, 2017

Why not refCntUpdater.getAndSet(this, oldRef);?

carl-mastrangelo (Member) commented Sep 26, 2017

Because it could have changed due to another thread updating it. Specifically, it could change between lines 63 and 66.
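
A deterministic, single-threaded rendering of that race may help. Here the interleaving is forced sequentially, with an AtomicInteger standing in for the refCnt field; the thread labels and values are illustrative.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RollbackRace {
    public static void main(String[] args) {
        AtomicInteger refCnt = new AtomicInteger(1);

        // Thread 1: an optimistic retain that turns out to overflow.
        int oldRef = refCnt.getAndAdd(Integer.MAX_VALUE); // oldRef == 1, count is now invalid

        // Thread 2: a perfectly legal retain lands before Thread 1 rolls back.
        refCnt.getAndAdd(1);

        // Wrong rollback: refCnt.set(oldRef) would leave the count at 1,
        // silently discarding Thread 2's retain.

        // Correct rollback: subtract exactly what Thread 1 added.
        refCnt.getAndAdd(-Integer.MAX_VALUE);

        System.out.println(refCnt.get()); // 2 -- Thread 2's retain survives
    }
}
```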

fenik17 (Contributor) commented Sep 26, 2017

@carl-mastrangelo Oh, of course. Agree. 👍

normanmaurer (Member) commented Sep 26, 2017

@carl-mastrangelo mind adding a JMH benchmark?

carl-mastrangelo (Member) commented Sep 26, 2017

Added a benchmark. The results show that under higher contention, the new version is much faster. As the amount of work done between retain and release goes up, contention goes down and they even out.

Numbers:

BEFORE:
Benchmark                                                                                             (delay)    Mode      Cnt         Score    Error  Units
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                            1  sample  2901361       804.579 ±  1.835  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                           10  sample  3038729       785.376 ± 16.471  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                          100  sample  2899401       817.392 ±  6.668  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                         1000  sample  3650566      2077.700 ±  0.600  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                        10000  sample  3005467     19949.334 ±  4.243  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                          1  sample   456091        48.610 ±  1.162  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                         10  sample   732051        62.599 ±  0.815  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                        100  sample   778925       228.629 ±  1.205  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                       1000  sample   633682      2002.987 ±  2.856  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                      10000  sample   506442     19735.345 ± 12.312  ns/op

AFTER:
Benchmark                                                                                             (delay)    Mode      Cnt         Score    Error  Units
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                            1  sample  3761980       383.436 ±  1.315  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                           10  sample  3667304       474.429 ±  1.101  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                          100  sample  3039374       479.267 ±  0.435  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                         1000  sample  3709210      2044.603 ±  0.989  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_contended                                        10000  sample  3011591     19904.227 ± 18.025  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                          1  sample   494975        52.269 ±  8.345  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                         10  sample   771094        62.290 ±  0.795  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                        100  sample   763230       235.044 ±  1.552  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                       1000  sample   634037      2006.578 ±  3.574  ns/op
AbstractReferenceCountedByteBufBenchmark.retainRelease_uncontended                                      10000  sample   506284     19742.605 ± 13.729  ns/op
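
For reference, a retain/release JMH benchmark along these lines might look roughly like the sketch below. It is an approximation rather than the exact benchmark added in this PR: the class name, thread count, and use of Blackhole.consumeCPU for the delay are assumptions.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

@State(Scope.Benchmark) // one shared buffer, so concurrent threads contend on its ref count
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class RetainReleaseBenchmark {

    @Param({ "1", "10", "100", "1000", "10000" })
    private int delay;

    private ByteBuf buf;

    @Setup
    public void setUp() {
        buf = Unpooled.buffer(1);
    }

    @TearDown
    public void tearDown() {
        buf.release();
    }

    @Benchmark
    @Threads(4) // several threads hammer the same buffer
    public void retainRelease() {
        buf.retain();
        Blackhole.consumeCPU(delay); // simulated work between retain and release
        buf.release();               // refCnt stays >= 1, so the buffer is never freed here
    }
}
```

An uncontended variant would use @State(Scope.Thread) so that each thread retains and releases its own buffer.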
normanmaurer (Member) commented Sep 26, 2017

@carl-mastrangelo please fix the check style :)

@Setup
public void setUp() {
    buf = (AbstractReferenceCountedByteBuf) Unpooled.buffer(1);
}

normanmaurer (Member) commented Sep 26, 2017

Why do you even need to do any casting at all?

carl-mastrangelo (Member) commented Sep 26, 2017

Unpooled.buffer returns a ByteBuf. The API doesn't guarantee that it will always be an AbstractReferenceCountedByteBuf, so I cast here so that the benchmark fails if this assumption ever changes.

normanmaurer (Member) commented Sep 26, 2017

@carl-mastrangelo in general I am cool with the change and I think it's fine (just as you said). Can you please also make the change to AbstractReferenceCounted?

@Scottmitch any concerns?

@normanmaurer normanmaurer self-assigned this Sep 26, 2017

carl-mastrangelo (Member) commented Oct 4, 2017

@Scottmitch @normanmaurer friendly ping.

normanmaurer (Member) commented Oct 4, 2017

@carl-mastrangelo thanks for the ping... I think it's fine because of the reasons you stated. Let me pull it in, and if @Scottmitch has cycles to review and sees issues with it we can do a follow-up PR.

normanmaurer (Member) commented Oct 4, 2017

Cherry-picked into 4.1 (83a19d5) and 4.0 (45a6bd3)

Scottmitch (Member) commented Nov 11, 2017

In general this is playing "fast and loose" with the atomics. I'm guessing it is unlikely to be an issue: most folks will just use the retain() calls, and for "unreleasable" cases folks can use unreleasable buffers. If we do get close to overflow and there is concurrency, behavior becomes less well defined (or undefined), but again it seems unlikely to occur. So we do sacrifice correctness, but the risk of this causing an issue seems low based upon typical usage.
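
For context, the "unreleasable buffers" escape hatch mentioned here wraps a buffer so that retain() and release() become no-ops. A small usage sketch (the class name is invented for the example):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

public class UnreleasableExample {
    public static void main(String[] args) {
        // The wrapper swallows retain()/release(), so hot retain/release
        // traffic never touches the underlying buffer's ref count.
        ByteBuf shared = Unpooled.unreleasableBuffer(Unpooled.buffer(16));
        shared.retain();                      // no-op
        System.out.println(shared.release()); // prints false; the buffer stays alive
    }
}
```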

normanmaurer (Member) commented Nov 11, 2017

@Scottmitch I agree that we lose a bit of correctness, but I think it's still good enough and the win justifies it.

@carl-mastrangelo carl-mastrangelo deleted the carl-mastrangelo:refcnt3 branch Nov 13, 2017
