New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xds: manage load stats for all clusters in XdsClient #7299
xds: manage load stats for all clusters in XdsClient #7299
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some notes to help understand the change.
/** | ||
* Manages all stats for client side load. | ||
*/ | ||
final class LoadStatsManager { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Borrowed the idea from Envoy's implementation: https://github.com/envoyproxy/envoy/blob/50ef0945fa2c5da4bff7627c3abf41fdd3b7cffd/source/common/upstream/load_stats_reporter.cc#L17
@@ -386,27 +337,6 @@ private void cleanUp() { | |||
} | |||
} | |||
|
|||
private final class LoadStatsEntity { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eliminated this wrapper. Move the interval recording into each LoadStatsStore
's implementation. Since each LoadStatsStore
represents a "living report", when taking a snapshot for the report, we could "stamp an interval" for the report. Since the report is queried every time for sending an LRS request, it totally makes sense to have the report already stamped with the interval.
…ecording and reporting flow.
…ter:eds_service loads
@@ -303,16 +302,17 @@ public boolean canHandleEmptyAddressListFromNameResolution() { | |||
|
|||
@Override | |||
public void shutdown() { | |||
localityStore.reset(); | |||
if (isReportingLoad) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is isReportingLoad
set true
? I know is hard to test load report in EdsLoadBalancer
because EdsLoadBalancerTest
uses a real XdsClientImpl
with fake channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, forgot to put that back. Basically, nothing has changed in terms of the control flow. Only added XdsClient APIs for creating stats objects.
loadStatsStores.put(cluster, new HashMap<String, RefCounted>()); | ||
} | ||
Map<String, RefCounted> clusterLoadStatsStores = loadStatsStores.get(cluster); | ||
if (!clusterLoadStatsStores.containsKey(clusterService)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is necessary. Stats objects are always created, and it always succeeds to create. In terms of functionality, there is no difference between creating an stats object by the LB policy or by the XdsClient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary for creating/removing stats object. But you have removed some logs, are they supposed to be moved somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deleted logs for "Start/Stop load reporting for cluster: ..., cluster_service: ..." in EDS policy. They are misleading and look awkward:
- There is a single load reporting client for all clusters, it is either turned on or off, not for a specific cluster/cluster_service.
- Load reporting is completely driven by the management server: which clusters to report load for are determined by the management server, not the gRPC client.
Each EDS policy's log for receiving it LB config includes the information for if this policy enables load reporting. The global load reporting is enabled as long as there is at least one EDS policy enabling it.
Instead, I put logs in XdsClient to indicate when the global load reporting is turned on or off.
04a0fb4
to
4d9d25e
Compare
Update: not only @dapengzhang0 PTAL. |
@@ -54,27 +54,10 @@ | |||
private final AtomicLong callsIssued = new AtomicLong(); | |||
private final MetricRecorder[] metricRecorders = new MetricRecorder[THREAD_BALANCING_FACTOR]; | |||
|
|||
// True if this counter continues to record stats after next snapshot. Otherwise, it will be | |||
// discarded. | |||
private boolean active; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a reminder, active
is true if this counter is currently in-use by some LB policy to track loads sent to a specific locality.
We do not mark internally if a counter is still in use by some LB policy, instead ref-count the usage of this counter externally by wrapping a counter with ReferenceCounted
.
* Must only be used for testing. | ||
*/ | ||
@VisibleForTesting | ||
ClientLoadCounter(long callsSucceeded, long callsInProgress, long callsFailed, long callsIssued) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced by individual setters as the way of testing has changed.
@Override | ||
public ClientLoadCounter getLocalityCounter(final Locality locality) { | ||
return localityLoadCounters.get(locality); | ||
ReferenceCounted<ClientLoadCounter> counter = localityLoadCounters.get(locality); | ||
return counter == null || counter.getReferenceCount() == 0 ? null : counter.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why changed this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following the same specification of the API as before, if the locality is is not under track, return null
. We don't want to break in the middle as that makes this API fragile.
ClientLoadCounter counter = localityLoadCounters.get(locality); | ||
checkState(counter == null || !counter.isActive(), | ||
"An active counter for locality %s already exists", locality); | ||
ReferenceCounted<ClientLoadCounter> counter = localityLoadCounters.get(locality); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about ClientLoadCounter addLocality(Locality locality)
and remove ClientLoadCounter getLocalityCounter(Locality locality)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that may make it better. Done.
Move the creation of LoadStatsStore (aka, the stats object) into XdsClient. The XdsClient is responsible for managing the lifetime of stats objects. Creations of LoadStatsStores are reference counted so that multiple EDS policies can retrieve the same stats object for load recording. Counters for recording loads per locality also need to be reference counted, as each EDS policy for the same cluster will receive endpoints for the same group of localities, they will use the same load counters for recording each locality's loads.
Move the creation of LoadStatsStore (aka, the stats object) into XdsClient. The XdsClient is responsible for managing the lifetime of stats objects. Creations of LoadStatsStores are reference counted so that multiple EDS policies can retrieve the same stats object for load recording. Counters for recording loads per locality also need to be reference counted, as each EDS policy for the same cluster will receive endpoints for the same group of localities, they will use the same load counters for recording each locality's loads.
…t) (#7299) (#7317) Move the creation of LoadStatsStore (aka, the stats object) into XdsClient. The XdsClient is responsible for managing the lifetime of stats objects. Creations of LoadStatsStores are reference counted so that multiple EDS policies can retrieve the same stats object for load recording. Counters for recording loads per locality also need to be reference counted, as each EDS policy for the same cluster will receive endpoints for the same group of localities, they will use the same load counters for recording each locality's loads.
Move the creation of LoadStatsStore (aka, the stats object) into XdsClient. The XdsClient is responsible for managing the lifetime of stats objects. Creations of LoadStatsStores are reference counted so that multiple EDS policies can retrieve the same stats object for load recording. Counters for recording loads per locality also need to be reference counted, as each EDS policy for the same cluster will receive endpoints for the same group of localities, they will use the same load counters for recording each locality's loads.
Mostly an extensive refactoring work, might be awful to review (sorry about that 😢 ).
Currently we are having an issue for xDS routing:
For example, if we have two routes with each of them routing some RPCs to the same cluster, say clusterA. Our current architecture will create two identical CDS policy instances for clusterA (as well as the whole LB subtree). Then each of these two CDS policies' child EDS policy will create its own stats object and pass it to the XdsClient for load reporting. However, this is not what our current implementation expects and causes problem (fails the precondition check, it is completely not the usage of the load report client API).
This really is an issue caused by the current design&implementation contradicts what we previously had. After the RouteAction design is implemented, no more than one CDS policy will be created for the same cluster.
Move the creation of
LoadStatsStore
(aka, the stats object) into XdsClient (as described in the design for refactoring xDS LB policies). The XdsClient is responsible for managing the lifetime of stats objects.Counters for recording loads per locality also need to be reference counted, as each EDS policy for the same cluster will receive endpoints for the same group of localities, they will use the same load counters for recording each locality's loads.
Tests for the consumer side of the stats object is added (see LoadReportClientTest). Tests for the producer side is very poorly written given that
LocalityStoreTest
andEdsLoadBalancerTest
are totally a mess. Hopefully we can clean them up soon.