
xds: implement a global map for holding circuit breaker request counters #7588

Merged

Conversation

voidzcy
Contributor

@voidzcy voidzcy commented Nov 3, 2020

Use WeakReference to hold the atomics. Circuit breaking will always be enabled, with the per-cluster concurrent-request limit defaulting to 1024. Each EDS LB policy gets or creates the counter for aggregating outstanding requests sent through it at the time it is instantiated. The policy holds a strong reference to the counter (as does the picker instance it spawns). When the EDS LB policy and the picker instance are destroyed, the strong references go away. Atomics are GCed once no strong references remain, and the global map is cleaned up on each call to getOrCreate().
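Below is a minimal sketch of that getOrCreate() behavior, assuming a simple cluster:edsServiceName key and pruning cleared entries on each call. The names are illustrative and this is not the PR's actual implementation (which uses a ReferenceQueue and a CounterReference wrapper, as seen in the review snippet further down).

import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

final class CallCounterMap {
  private final Map<String, WeakReference<AtomicLong>> counters = new HashMap<>();

  // Returns the shared counter for the cluster, creating one if it is absent
  // or if its previous counter has already been garbage collected.
  synchronized AtomicLong getOrCreate(String cluster, String edsServiceName) {
    counters.values().removeIf(ref -> ref.get() == null);  // prune cleared entries
    String key = cluster + ":" + edsServiceName;
    WeakReference<AtomicLong> ref = counters.get(key);
    AtomicLong counter = (ref == null) ? null : ref.get();
    if (counter == null) {
      counter = new AtomicLong();
      counters.put(key, new WeakReference<>(counter));
    }
    return counter;  // caller (EDS LB policy / picker) must hold a strong reference
  }
}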

@voidzcy voidzcy closed this Nov 4, 2020
@voidzcy voidzcy reopened this Nov 4, 2020
@voidzcy voidzcy force-pushed the impl/shared_circuit_breaking_call_counters branch from b0aa8cd to bd222c5 on November 4, 2020 22:24
@dapengzhang0
Member

Why not just define a RequestCounter class like the one below instead of using WeakReference, and use it in ChildLbState? This would be easier to test.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical registry that hands out and takes back shared per-cluster counters.
interface CounterRegistry {
  AtomicLong getOrCreate(String cluster, String edsServiceName);
  void returnCounter(String cluster, String edsServiceName, AtomicLong counter);
}

class RequestCounter {
  String cluster;
  String edsServiceName;
  long localCount;     // requests outstanding through this holder
  AtomicLong counter;  // shared per-cluster counter obtained from the registry
  CounterRegistry registry;

  void increment() {
    if (localCount == 0) {
      // Obtain (or create) the shared counter on first use.
      counter = registry.getOrCreate(cluster, edsServiceName);
    }
    counter.incrementAndGet();
    localCount++;
  }

  void decrement() {
    localCount--;
    counter.decrementAndGet();
    if (localCount == 0) {
      // Return the shared counter once nothing is outstanding here.
      // ("return" is a Java keyword, so use a method name like returnCounter.)
      registry.returnCounter(cluster, edsServiceName, counter);
    }
  }
}

@voidzcy
Contributor Author

voidzcy commented Nov 9, 2020

Why not just define a RequestCounter class like the one below instead of using WeakReference, and use it in ChildLbState? This would be easier to test.

That's a lot of reference counting to manage the lifecycle of the counters. In particular, the operation of returning a counter is hard to handle: when a ChildLbState is destroyed, the counter may still be referenced/used by the picker, so there really isn't a clean point to return the counter. WeakReference is ideal for this.

@ejona86
Member

ejona86 commented Nov 10, 2020

Avoiding weak references is a huge benefit. If releasing the reference is racy, then the worst case should be that some counts may be lost. That seems quite acceptable, especially since the xdsClient wouldn't bother sending those counts anyway.

@voidzcy
Contributor Author

voidzcy commented Nov 10, 2020

Avoiding weak references is a huge benefit. If releasing the reference is racy, then the worst case should be that some counts may be lost. That seems quite acceptable, especially since the xdsClient wouldn't bother sending those counts anyway.

It's not a problem of reporting stats, but of limiting the number of requests. The lifecycle of the atomic for each cluster needs to be accurate. If it is thrown away too early, a new one will be created and will cause the max_requests threshold to be exceeded incorrectly. That should be unacceptable. The atomic should be thrown away only after the last of the following is unreferenced: the EDS LB policy's helper that obtains it, a Picker created by that EDS LB policy, or a stream tracer created by that Picker. There is no guarantee which of them will be unreferenced last (at least for the latter two: the Channel's Picker can be updated while streams created by the old Picker are still running).

In the worst case, never cleaning up atomics in the global map is acceptable in terms of correctness, as the total number of clusters (plus their edsServiceName aliases) should be relatively small. But we would still prefer not to let memory usage grow unbounded by keeping unused atomics around. I don't see problems with using WeakReference here; it should work perfectly for such a case, shouldn't it?

@ejona86
Member

ejona86 commented Nov 11, 2020

Since this is pure memory (not an actual resource), WeakReference could work fine. I think the main problem here is that weak references, by their nature, are hard to test and debug.

If it is thrown away too early, a new one will be created and will cause the max_requests threshold to be exceeded incorrectly. That should be unacceptable.

Please describe when that would happen and how severe the results would be. You have to put that in context to show that it is unacceptable. I don't believe that any exceeding of the limits is unacceptable. We've already agreed elsewhere that we can race a little bit for new RPCs and that is okay.

It seems to me that eds policies are created rarely, so there's limited risk of this happening frequently. As long as we are doing graceful switches, it seems highly unlikely to be triggered; if we release the stats "too early", that is because we aren't using them any more. But if we do hard switches, then that does expose a window in normal operation and would be a problem. I'd prefer not to depend on the graceful switch behavior if we can avoid it.

For ongoing RPCs, it sort of already has a reference count: the number of ongoing RPCs. When an RPC finishes, it could check whether the RPC count is zero and if so check the reference count. If both rpc count and reference count are zero, the stats is safe to throw away (note that things are racy, so the code should be a noop if the stats were already thrown away). That would use a reference count for the LB policy and the counters themselves for the stream tracer. It does still leave open a problem for new RPCs starting.
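A rough sketch of that combined check, with hypothetical names; removal is intentionally racy and must be a no-op if the entry was already dropped:

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

final class ClusterStatsHolder {
  final AtomicLong rpcCount = new AtomicLong();         // ongoing RPCs, maintained by stream tracers
  final AtomicInteger policyRefs = new AtomicInteger(); // LB policies currently holding this entry

  // Called when an RPC finishes; drops the entry only if nothing references it anymore.
  void onRpcFinished(Runnable removeFromGlobalMap) {
    if (rpcCount.decrementAndGet() == 0 && policyRefs.get() == 0) {
      removeFromGlobalMap.run();  // must be a no-op if the entry was already removed
    }
  }

  // Called when an LB policy releases the entry.
  void onPolicyReleased(Runnable removeFromGlobalMap) {
    if (policyRefs.decrementAndGet() == 0 && rpcCount.get() == 0) {
      removeFromGlobalMap.run();
    }
  }
}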

There are some ways to detect things when new RPCs are starting (like setting the RPC count to MIN_INT), but I think the problem is actually not much of a problem. I think we should just let it happen. The "problem" is only a problem if there are no ongoing RPCs. That means the RPC rate is low so accidentally accounting incorrectly means 1-2 RPCs are forgotten (or there's a bug and the wrong picker is used). That doesn't seem like a big deal.

So then the question is, if we implemented that, would it be "better" than weak references? I think it would at least be a contender and worth consideration.

@voidzcy
Contributor Author

voidzcy commented Nov 11, 2020

This is indeed a pure memory issue. It would be easy if we did not care about unused atomics being kept in the global map and let the map grow unboundedly (although the total number of clusters should be relatively small); things would still work correctly. So if you are saying weak references make this hard to test and debug, that shouldn't be right.

The biggest advantage of using weak references here is that you do not need to worry about the size of the global map for atomics, without any manual cleanup that involves elaborate reference counting, which could introduce bugs, complicate the logic or require locking on the data path. That's the main concern with the approaches you described above.

Please describe when that would happen and how severe the results would be. You have to put that in context to show that it is unacceptable. I don't believe that any exceeding of the limits is unacceptable. We've already agreed elsewhere that we can race a little bit for new RPCs and that is okay.

Those are cases where the ref count of the atomic is managed prematurely, such as incrementing the ref count when the EDS LB policy obtains it and decrementing it when the EDS LB policy is shut down. That would result in the atomic still being used by the channel picker and the streams it creates after it has been removed from the global map (i.e., a dangling atomic). The next time we switch back to this cluster, the global map will create a new atomic for it.

@ejona86
Member

ejona86 commented Nov 11, 2020

Having slept on this, I think I'm fine with the WeakReference approach, with one caveat: we still reference count in the Helper. We would keep a reference count but not use it except to ensure it is zero when the element is removed because the weak reference was cleared.

This WeakReference approach is simple, and with that one caveat we could easily change to the other approach I mentioned in the future if we needed to. It does seem this is another instance of the picker's lack of a lifetime causing us trouble.
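As a sketch of that caveat (hypothetical names; the PR's actual CounterReference also carries the cluster and edsServiceName), the per-entry reference count would only be verified when a cleared reference is drained from the queue:

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class CounterReference extends WeakReference<AtomicLong> {
  final AtomicInteger helperRefs = new AtomicInteger();  // incremented/decremented by each Helper

  CounterReference(AtomicLong counter, ReferenceQueue<AtomicLong> refQueue) {
    super(counter, refQueue);
  }
}

final class CounterMapCleaner {
  private final ReferenceQueue<AtomicLong> refQueue = new ReferenceQueue<>();

  // Drain cleared references: by the time the GC has cleared a counter, no Helper
  // should still claim to hold it, so the ref count serves only as a sanity check.
  void cleanQueue() {
    Reference<? extends AtomicLong> ref;
    while ((ref = refQueue.poll()) != null) {
      CounterReference counterRef = (CounterReference) ref;
      assert counterRef.helperRefs.get() == 0 : "counter GCed while a Helper still holds it";
      // ...remove the corresponding entry from the global map here...
    }
  }
}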

It would be easy if we did not care about unused atomics being kept in the global map and let the map grow unboundedly (although the total number of clusters should be relatively small)

But we do have to prevent unbounded growth.

without any manual cleanup that involves elaborate reference counting, which could introduce bugs, complicate the logic or require locking on the data path

Many people would consider ReferenceQueue to be elaborate. And rare, briefly held locking on the data path is not a problem in the least. Reference counting has the benefit that the presence of bugs is more obvious, and it does not require future allocations to clean up resources.

That would result in the atomic still being used by the channel picker and the streams it creates after it has been removed from the global map (i.e., a dangling atomic). The next time we switch back to this cluster, the global map will create a new atomic for it.

Okay, so the only problem is if the xds server stops having us use a cluster and then has us start using that cluster again in rapid succession. That seems fairly unlikely on the surface. Priority flapping could maybe produce something similar, but there are other protections we put in place there. I think we could easily live with that deficiency without much problem. (Although I also provided an approach that wouldn't have that problem.)

Comment on lines +61 to +66
AtomicLong counter;
if (ref == null || (counter = ref.get()) == null) {
counter = new AtomicLong();
ref = new CounterReference(counter, refQueue, cluster, edsServiceName);
clusterCounters.put(edsServiceName, ref);
}
Member

What about avoiding assignment inside if-condition?

AtomicLong counter = null;
if (ref != null) {
  counter = ref.get();
}
if (counter == null) {
  counter = new AtomicLong();
  ref = new CounterReference(counter, refQueue, cluster, edsServiceName);
  clusterCounters.put(edsServiceName, ref);
}

Contributor Author

Why choose something that is more verbose? 😄

Member

Because assignment inside if-condition is less readable.

Contributor Author

Well, that shouldn't be considered a readability issue; that's just how the language feature works, like all the one-liners in Python.

@voidzcy voidzcy merged commit a43ae54 into grpc:master Nov 13, 2020
voidzcy added a commit to voidzcy/grpc-java that referenced this pull request Nov 17, 2020
dfawley pushed a commit to dfawley/grpc-java that referenced this pull request Jan 15, 2021
…ers (grpc#7588)

Circuit breakers should be applied to clusters in the global scope. However, the LB hierarchy might cause the LB policy (currently EDS, but cluster_impl in the future) that applies circuit breaking to be duplicated. Also, for multi-channel cases, the circuit breaking threshold should still be shared across channels in the process.

This change creates a global map for accessing the circuit breaking atomics used to count the number of outstanding requests on a per-cluster basis. Atomics in the global map are held via WeakReferences, so LB policies/Pickers/StreamTracers do not need to worry about the counters' lifecycle and refcount.