Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: manage load stats for all clusters in XdsClient #7299

Merged
merged 18 commits into from Aug 11, 2020

Conversation

voidzcy
Copy link
Contributor

@voidzcy voidzcy commented Aug 7, 2020

Mostly an extensive refactoring work, might be awful to review (sorry about that 😢 ).

Currently we are having an issue for xDS routing:

For example, if we have two routes with each of them routing some RPCs to the same cluster, say clusterA. Our current architecture will create two identical CDS policy instances for clusterA (as well as the whole LB subtree). Then each of these two CDS policies' child EDS policy will create its own stats object and pass it to the XdsClient for load reporting. However, this is not what our current implementation expects and causes problem (fails the precondition check, it is completely not the usage of the load report client API).

This really is an issue caused by the current design&implementation contradicts what we previously had. After the RouteAction design is implemented, no more than one CDS policy will be created for the same cluster.

  • Move the creation of LoadStatsStore (aka, the stats object) into XdsClient (as described in the design for refactoring xDS LB policies). The XdsClient is responsible for managing the lifetime of stats objects.

    • This helps mitigate the issue of current LB hierarchy for routing and traffic splitting enabled: there could be multiple CDS/EDS policy instances created for the same cluster. They should use the same stats object for recording loads.
    • Also this would be one step towards migrating to use XdsClient for stats object creation as mentioned in the xDS LB policy refactoring design.
  • Counters for recording loads per locality also need to be reference counted, as each EDS policy for the same cluster will receive endpoints for the same group of localities, they will use the same load counters for recording each locality's loads.

Tests for the consumer side of the stats object is added (see LoadReportClientTest). Tests for the producer side is very poorly written given that LocalityStoreTest and EdsLoadBalancerTest are totally a mess. Hopefully we can clean them up soon.

Copy link
Contributor Author

@voidzcy voidzcy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some notes to help understand the change.

xds/src/main/java/io/grpc/xds/LocalityStore.java Outdated Show resolved Hide resolved
xds/src/main/java/io/grpc/xds/LocalityStore.java Outdated Show resolved Hide resolved
/**
* Manages all stats for client side load.
*/
final class LoadStatsManager {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -386,27 +337,6 @@ private void cleanUp() {
}
}

private final class LoadStatsEntity {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eliminated this wrapper. Move the interval recording into each LoadStatsStore's implementation. Since each LoadStatsStore represents a "living report", when taking a snapshot for the report, we could "stamp an interval" for the report. Since the report is queried every time for sending an LRS request, it totally makes sense to have the report already stamped with the interval.

xds/src/main/java/io/grpc/xds/EdsLoadBalancer.java Outdated Show resolved Hide resolved
@@ -303,16 +302,17 @@ public boolean canHandleEmptyAddressListFromNameResolution() {

@Override
public void shutdown() {
localityStore.reset();
if (isReportingLoad) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is isReportingLoad set true? I know is hard to test load report in EdsLoadBalancer because EdsLoadBalancerTest uses a real XdsClientImpl with fake channel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, forgot to put that back. Basically, nothing has changed in terms of the control flow. Only added XdsClient APIs for creating stats objects.

loadStatsStores.put(cluster, new HashMap<String, RefCounted>());
}
Map<String, RefCounted> clusterLoadStatsStores = loadStatsStores.get(cluster);
if (!clusterLoadStatsStores.containsKey(clusterService)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is necessary. Stats objects are always created, and it always succeeds to create. In terms of functionality, there is no difference between creating an stats object by the LB policy or by the XdsClient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary for creating/removing stats object. But you have removed some logs, are they supposed to be moved somewhere else?

Copy link
Contributor Author

@voidzcy voidzcy Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted logs for "Start/Stop load reporting for cluster: ..., cluster_service: ..." in EDS policy. They are misleading and look awkward:

  • There is a single load reporting client for all clusters, it is either turned on or off, not for a specific cluster/cluster_service.
  • Load reporting is completely driven by the management server: which clusters to report load for are determined by the management server, not the gRPC client.

Each EDS policy's log for receiving it LB config includes the information for if this policy enables load reporting. The global load reporting is enabled as long as there is at least one EDS policy enabling it.

Instead, I put logs in XdsClient to indicate when the global load reporting is turned on or off.

@voidzcy voidzcy force-pushed the refactor/manage_global_client_stats branch from 04a0fb4 to 4d9d25e Compare August 10, 2020 19:27
@voidzcy
Copy link
Contributor Author

voidzcy commented Aug 11, 2020

Update: not only LoadStatsStore objects need to be reference counted in order for multiple CDS/EDS policies for the same cluster to create them. The counters used to actually record RPC states also need to be reference counted as multiple EDS policies may want to create the same counter for tracking loads sent to the same locality.

@dapengzhang0 PTAL.

@@ -54,27 +54,10 @@
private final AtomicLong callsIssued = new AtomicLong();
private final MetricRecorder[] metricRecorders = new MetricRecorder[THREAD_BALANCING_FACTOR];

// True if this counter continues to record stats after next snapshot. Otherwise, it will be
// discarded.
private boolean active;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a reminder, active is true if this counter is currently in-use by some LB policy to track loads sent to a specific locality.

We do not mark internally if a counter is still in use by some LB policy, instead ref-count the usage of this counter externally by wrapping a counter with ReferenceCounted.

* Must only be used for testing.
*/
@VisibleForTesting
ClientLoadCounter(long callsSucceeded, long callsInProgress, long callsFailed, long callsIssued) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced by individual setters as the way of testing has changed.

@Override
public ClientLoadCounter getLocalityCounter(final Locality locality) {
return localityLoadCounters.get(locality);
ReferenceCounted<ClientLoadCounter> counter = localityLoadCounters.get(locality);
return counter == null || counter.getReferenceCount() == 0 ? null : counter.get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why changed this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the same specification of the API as before, if the locality is is not under track, return null. We don't want to break in the middle as that makes this API fragile.

ClientLoadCounter counter = localityLoadCounters.get(locality);
checkState(counter == null || !counter.isActive(),
"An active counter for locality %s already exists", locality);
ReferenceCounted<ClientLoadCounter> counter = localityLoadCounters.get(locality);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about ClientLoadCounter addLocality(Locality locality)
and remove ClientLoadCounter getLocalityCounter(Locality locality) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that may make it better. Done.

@voidzcy voidzcy merged commit f0a0e67 into grpc:master Aug 11, 2020
voidzcy added a commit to voidzcy/grpc-java that referenced this pull request Aug 11, 2020
Move the creation of LoadStatsStore (aka, the stats object) into XdsClient. The XdsClient is responsible for managing the lifetime of stats objects. Creations of LoadStatsStores are reference counted so that multiple EDS policies can retrieve the same stats object for load recording. Counters for recording loads per locality also need to be reference counted, as each EDS policy for the same cluster will receive endpoints for the same group of localities, they will use the same load counters for recording each locality's loads.
voidzcy added a commit to voidzcy/grpc-java that referenced this pull request Aug 12, 2020
Move the creation of LoadStatsStore (aka, the stats object) into XdsClient. The XdsClient is responsible for managing the lifetime of stats objects. Creations of LoadStatsStores are reference counted so that multiple EDS policies can retrieve the same stats object for load recording. Counters for recording loads per locality also need to be reference counted, as each EDS policy for the same cluster will receive endpoints for the same group of localities, they will use the same load counters for recording each locality's loads.
voidzcy added a commit that referenced this pull request Aug 12, 2020
…t) (#7299) (#7317)

Move the creation of LoadStatsStore (aka, the stats object) into XdsClient. The XdsClient is responsible for managing the lifetime of stats objects. Creations of LoadStatsStores are reference counted so that multiple EDS policies can retrieve the same stats object for load recording. Counters for recording loads per locality also need to be reference counted, as each EDS policy for the same cluster will receive endpoints for the same group of localities, they will use the same load counters for recording each locality's loads.
dfawley pushed a commit to dfawley/grpc-java that referenced this pull request Jan 15, 2021
Move the creation of LoadStatsStore (aka, the stats object) into XdsClient. The XdsClient is responsible for managing the lifetime of stats objects. Creations of LoadStatsStores are reference counted so that multiple EDS policies can retrieve the same stats object for load recording. Counters for recording loads per locality also need to be reference counted, as each EDS policy for the same cluster will receive endpoints for the same group of localities, they will use the same load counters for recording each locality's loads.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants