Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CDS to xds client #20638

Merged
merged 1 commit into from
Jan 16, 2020
Merged

Add CDS to xds client #20638

merged 1 commit into from
Jan 16, 2020

Conversation

AspirinSJL
Copy link
Member

@AspirinSJL AspirinSJL commented Oct 17, 2019

This change is Reviewable

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good start! Please let me know if you have any questions about any of this.

Reviewed 5 of 5 files at r1.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @AspirinSJL)


src/core/ext/filters/client_channel/lb_policy/xds/xds.cc, line 592 at r1 (raw file):

//

class XdsLb::ClusterWatcher : public XdsClient::ClusterWatcherInterface {

I don't think we want this in the current xds policy. I am adding this in the new cds policy in #20544 (not yet ready for review).


src/core/ext/filters/client_channel/xds/xds_api.h, line 138 at r1 (raw file):

};

struct CdsUpdate {

I think we're actually going to need two fields here, eds_service_name and lrs_load_reporting_server_name. See the fields I've added in #20544 (not yet ready for review).


src/core/ext/filters/client_channel/xds/xds_api.h, line 151 at r1 (raw file):

grpc_slice XdsEdsRequestCreateAndEncode(const char* server_name);

// Parses the EDS response and returns the args to update locality map. If there

s/EDS/ADS/


src/core/ext/filters/client_channel/xds/xds_api.h, line 154 at r1 (raw file):

// is any error, the output update is invalid.
grpc_error* XdsAdsResponseDecodeAndParse(const grpc_slice& encoded_response,
                                         CdsUpdate cds_update,

Looks like this should be a pointer.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 26 at r1 (raw file):

#include <grpc/support/alloc.h>
#include <grpc/support/string_util.h>
#include <src/core/ext/upb-generated/envoy/api/v2/core/config_source.upb.h>

This include should use quotes instead of angle brackets.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 107 at r1 (raw file):

}

grpc_slice XdsCdsRequestCreateAndEncode(const char* cluster_name) {

When you merge in the changes from #20606, this will need to be changed to populate the node message from the bootstrap data.

Note that we need to populate the node message only for the first message on the ADS stream.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 182 at r1 (raw file):

    // Decode the cluster.
    const upb_strview encoded_cluster = google_protobuf_Any_value(resources[i]);
    const envoy_api_v2_Cluster* candidate_cluster = envoy_api_v2_Cluster_parse(

We should probably check that this parsing actually succeeds, in case a bad server sends us a message of a type different than what the type URL indicates.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 213 at r1 (raw file):

  upb_strview service_name =
      envoy_api_v2_Cluster_EdsClusterConfig_service_name(eds_cluster_config);
  if (service_name.size != 0) {

If the EDS service name is not set, we should return the cluster name by default.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 222 at r1 (raw file):

        "LB policy is not ROUND_ROBIN.");
  }
  return GRPC_ERROR_NONE;

We also need to check the lrs_server field here.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 347 at r1 (raw file):

  upb_strview encoded_cluster_load_assignment =
      google_protobuf_Any_value(resources[0]);
  envoy_api_v2_ClusterLoadAssignment* cluster_load_assignment =

We should probably check that this parsing actually succeeds, in case a bad server sends us a message of a type different than what the type URL indicates. (This is a pre-existing problem, but we should probably fix it while we're in here.)


src/core/ext/filters/client_channel/xds/xds_client.cc, line 676 at r1 (raw file):

    EdsUpdate update;
    grpc_error* parse_error =
        XdsAdsResponseDecodeAndParse(response_slice, CdsUpdate(), &update);

I think we need to pass a pointer to the CdsUpdate struct here.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1212 at r1 (raw file):

    w->OnClusterChanged(cluster_state_.cds_update);
  }
  chand_->MaybeStartAdsCall();

Currently, starting the ADS call automatically sends an EDS request, but what we want here is to send a CDS request.

I think what we need is another layer of abstraction here: ChannelState should provide methods for starting and stopping a watch for a particular CDS or EDS key rather than methods for starting and stopping the ADS call, and it should decide when to start and stop the ADS call automatically. Also, whenever a CDS or EDS watch is started or stopped, ChannelState should send a new CDS or EDS request updating the list of resource_names that it is interested in.

This approach has the nice benefit that we could easily switch from ADS to separate CDS/EDS/etc streams later simply by changing the implementation inside of ChannelState.

Note that we probably need to implement the ack/nack functionality before we can do the above, since without that, there is no way to safely change the set of resource_names that we are subscribing to.

@AspirinSJL
Copy link
Member Author

Sorry, this is actually still a draft PR.. I shouldn't have chosen a reviewer when I opened it. But still, thank you for the review!

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for jumping the gun! I assumed that you wanted me to do an early look, since you added me as a reviewer. If you'd prefer, I can ignore this until you say it's ready.

I did think of one more quick comment here, so adding it now.

Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @AspirinSJL)


src/core/ext/filters/client_channel/xds/xds_api.cc, line 385 at r1 (raw file):

grpc_error* XdsAdsResponseDecodeAndParse(const grpc_slice& encoded_response,
                                         const char* expected_cluster_name,

I'm not sure whether the expected_cluster_name parameter makes sense here, because we could conceivably have multiple cluster watches happening at the same time during a transition where an RDS update switches us from one Cluster name to another. Presumably, we would not want to cancel the watch on the original cluster name until we've obtained the first response for the new cluster name, so the two watches will overlap for a short while.

This was referenced Oct 25, 2019
@AspirinSJL
Copy link
Member Author

I'm having build issue on my mac, so I'm using kokoro to debug.. Sorry for any inconvenience!

@AspirinSJL AspirinSJL added the release notes: no Indicates if PR should not be in release notes label Nov 8, 2019
Copy link
Member Author

@AspirinSJL AspirinSJL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay! Actually I still have some confusions, but I think asking them in the PR is more straightforward.

My general feeling while adding CDS:

  1. Our usage of the fields are not aligned with the proto (like the names, the proto comments) properly.
  2. Temporarily changing to ADS causes some implementation churn. The APIs should be cleaner when separate calls are used.
  3. It would be very helpful if we can somehow depict the relationship/constraints of the various names.

Reviewable status: 0 of 10 files reviewed, 14 unresolved discussions (waiting on @AspirinSJL and @markdroth)


build.yaml, line 6041 at r10 (raw file):

  src:
  - src/proto/grpc/testing/xds/ads_for_test.proto
  - src/proto/grpc/testing/xds/cds_for_test.proto

I have difficulty running the script to regenerate project on mac. Will fix later.


src/core/ext/filters/client_channel/lb_policy/xds/xds.cc, line 592 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think we want this in the current xds policy. I am adding this in the new cds policy in #20544 (not yet ready for review).

Removed.


src/core/ext/filters/client_channel/xds/xds_api.h, line 138 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think we're actually going to need two fields here, eds_service_name and lrs_load_reporting_server_name. See the fields I've added in #20544 (not yet ready for review).

Merged in.


src/core/ext/filters/client_channel/xds/xds_api.h, line 151 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/EDS/ADS/

Done.


src/core/ext/filters/client_channel/xds/xds_api.h, line 154 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Looks like this should be a pointer.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 26 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This include should use quotes instead of angle brackets.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 107 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

When you merge in the changes from #20606, this will need to be changed to populate the node message from the bootstrap data.

Note that we need to populate the node message only for the first message on the ADS stream.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 182 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We should probably check that this parsing actually succeeds, in case a bad server sends us a message of a type different than what the type URL indicates.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 213 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If the EDS service name is not set, we should return the cluster name by default.

Looks like you are handling this in https://github.com/grpc/grpc/pull/20544/files#diff-ca58750ead8b58438f44535919776ff7R148.

I prefer leave it as is, because CdsUpdate should contain the update from CDS response.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 222 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We also need to check the lrs_server field here.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 347 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We should probably check that this parsing actually succeeds, in case a bad server sends us a message of a type different than what the type URL indicates. (This is a pre-existing problem, but we should probably fix it while we're in here.)

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 385 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I'm not sure whether the expected_cluster_name parameter makes sense here, because we could conceivably have multiple cluster watches happening at the same time during a transition where an RDS update switches us from one Cluster name to another. Presumably, we would not want to cancel the watch on the original cluster name until we've obtained the first response for the new cluster name, so the two watches will overlap for a short while.

This is no longer an issue if we expand the expected_cluster_name to all the keys in the cluster_map_?


src/core/ext/filters/client_channel/xds/xds_client.cc, line 676 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think we need to pass a pointer to the CdsUpdate struct here.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1212 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Currently, starting the ADS call automatically sends an EDS request, but what we want here is to send a CDS request.

I think what we need is another layer of abstraction here: ChannelState should provide methods for starting and stopping a watch for a particular CDS or EDS key rather than methods for starting and stopping the ADS call, and it should decide when to start and stop the ADS call automatically. Also, whenever a CDS or EDS watch is started or stopped, ChannelState should send a new CDS or EDS request updating the list of resource_names that it is interested in.

This approach has the nice benefit that we could easily switch from ADS to separate CDS/EDS/etc streams later simply by changing the implementation inside of ChannelState.

Note that we probably need to implement the ack/nack functionality before we can do the above, since without that, there is no way to safely change the set of resource_names that we are subscribing to.

Done.

@AspirinSJL AspirinSJL marked this pull request as ready for review November 8, 2019 14:43
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely moving in the right direction! Please let me know if you have any questions about any of this.

Responding to your concerns above:

Our usage of the fields are not aligned with the proto (like the names, the proto comments) properly.

Can you give an example of the type of misalignment you're referring to?

Temporarily changing to ADS causes some implementation churn. The APIs should be cleaner when separate calls are used.

I don't think the usage of ADS is temporary -- the xds ecosystem in general seems to be moving in the direction of ADS, so I suspect that we'll never wind up supporting separate streams for each API.

It would be very helpful if we can somehow depict the relationship/constraints of the various names.

Can you give an example of which relationships you'd like to see explained and how you'd like to see them documented?

Reviewed 1 of 6 files at r8, 4 of 9 files at r10, 6 of 6 files at r11.
Reviewable status: all files reviewed, 34 unresolved discussions (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/xds/xds_api.h, line 146 at r11 (raw file):

  CdsUpdate() = default;
  CdsUpdate(const CdsUpdate& other)

In general, to support copying, you should define both a copy ctor and a copy assignment operator, and to support moving, you should define both a move ctor and a move assignment operator. Right now, you have a copy ctor but no copy assignment operator, and you have a move assignment operator but no move ctor.


src/core/ext/filters/client_channel/xds/xds_api.h, line 201 at r11 (raw file):

    const std::set<StringView>& expected_eds_service_names,
    CdsUpdateMap* cds_update_map, EdsUpdateMap* eds_update_map,
    VersionState* new_version, bool* is_cds);

Instead of is_cds, we should probably make this a pointer to an enum type, since we will ultimately need to be able to handle LDS, RDS, CDS, and EDS resources.


src/core/ext/filters/client_channel/xds/xds_api.h, line 211 at r11 (raw file):

// are zero, returns empty slice.
grpc_slice XdsLrsRequestCreateAndEncode(
    std::map<StringView, std::set<XdsClientStats*>> client_stats_map);

I assume the key string is the cluster name? Please document.


src/core/ext/filters/client_channel/xds/xds_api.h, line 218 at r11 (raw file):

grpc_error* XdsLrsResponseDecodeAndParse(
    const grpc_slice& encoded_response,
    const std::set<StringView>& expected_eds_service_names,

Instead of passing in the list of expected names, we should have this return the list of cluster names that the LRS server asked for. That way, if it asks for one that we don't initially know about, but then we start using that cluster later, we can immediately know to start sending load reports for it.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 385 at r1 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

This is no longer an issue if we expand the expected_cluster_name to all the keys in the cluster_map_?

As per recent discussion, we actually need to validate and cache any resource sent to us by the server, even if it's for clusters we didn't expect, so I think the expected_cluster_names or expected_eds_service_names arguments should be removed here.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 215 at r11 (raw file):

      envoy_api_v2_DiscoveryRequest_new(arena.ptr());
  // Set version_info.
  envoy_api_v2_DiscoveryRequest_set_version_info(

On the very first request for a given resource type, there is no version, so we should not set this field.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 233 at r11 (raw file):

                                             upb_strview_makez(kCdsTypeUrl));
  // Set nonce.
  envoy_api_v2_DiscoveryRequest_set_response_nonce(

On the first request for a given resource type, this will not be set.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 251 at r11 (raw file):

      envoy_api_v2_DiscoveryRequest_new(arena.ptr());
  // Set version_info.
  envoy_api_v2_DiscoveryRequest_set_version_info(

Same as above: this will not be set on the first request.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 270 at r11 (raw file):

                                             upb_strview_makez(kEdsTypeUrl));
  // Set nonce.
  envoy_api_v2_DiscoveryRequest_set_response_nonce(

This will not be set on the first request.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 290 at r11 (raw file):

}  // namespace

grpc_error* CdsResponsedParse(

s/Responsed/Response/


src/core/ext/filters/client_channel/xds/xds_api.cc, line 359 at r11 (raw file):

            "ConfigSource is not self.");
      }
      // FIXME: If we only use this field to enable/disable load reporting, we

Why does this violate the proto? The intent of this field is to do exactly this: if the lrs_server field is not present in the proto, then LRS should not be used.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 685 at r11 (raw file):

      // Prune unused locality stats.
      client_stats->PruneLocalityStats();
      if (!snapshot.IsAllZero()) all_zero = false;

I think that if an individual cluster's snapshot is all-zero, then we should not add that snapshot to the map.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 704 at r11 (raw file):

              request, arena.ptr());
      // Set the cluster name.
      // FIXME: The cluster_name is set to the eds_service_name according to the

I think this should be the cluster name. In the EDS-only mode (our old design, which is what the LRS design doc was written for), the cluster name and EDS service name would always be the same. But if they're different, then this needs to be the cluster name.

Hmmm... This may mean that we need to plumb the cluster name down into the EDS LB policy. Let me think about this for a bit.

For now, let's change this to say "cluster" instead of "eds_service_name" in all the variables and comments, and please add a TODO assigned to me in the xds LB policy indicating that we should be passing the cluster name instead of the EDS service name.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 762 at r11 (raw file):

  }
  // Check the cluster names in the response.
  // FIXME: Do we need to check and return error if unmatched? Or just ignore

As I mentioned elsewhere, I think this function should just return the list of cluster names returned by the LRS server. That way, the XdsClient can automatically know to start reporting load for a new cluster for which it gets data after it has gotten the initial LRS response.


src/core/ext/filters/client_channel/xds/xds_client.h, line 104 at r11 (raw file):

  // Adds and removes client stats for cluster.
  // FIXME: Looks like eds_service_name key is necessary so that the management

As I mentioned elsewhere, the parameter here should definitely be the cluster name, not the EDS service name. Please add a TODO assigned to me in the xds LB policy about this.


src/core/ext/filters/client_channel/xds/xds_client.h, line 115 at r11 (raw file):

  void ResetBackoff();

  std::set<StringView> ClusterNames() const;

Why are these three methods needed? The LB policies should only interact with the XdsClient via the watcher.


src/core/ext/filters/client_channel/xds/xds_client.h, line 156 at r11 (raw file):

    LrsCallState* lrs_calld() const;

    void MaybeStartAdsCall();

These two methods should probably be private now, since they should not need to be called by the XdsClient object.


src/core/ext/filters/client_channel/xds/xds_client.h, line 243 at r11 (raw file):

  std::map<StringView /*eds_service_name*/, EndpointState, StringLess>
      endpoint_map_;
  VersionState cds_version_state_;

These fields should probably go inside of the AdsCallState object, because versions are only relevant within a single call. If the call fails and is restarted, we start again with no version.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 130 at r11 (raw file):

  bool HasWatcher() const;

  void SendMessageLocked(bool is_cds, bool is_first_message = false);

We'll probably need to replace is_cds with an enum, so that we can support LDS, RDS, CDS, and EDS.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 426 at r11 (raw file):

void XdsClient::ChannelState::WatchClusterData(
    StringView cluster_name,
    std::unique_ptr<XdsClient::ClusterWatcherInterface> watcher) {

It seems strange to deal with the watcher inside of the ChannelState object, since that's not specific to the channel. I think the only thing we need to track inside of ChannelState is the set of resource names that we want to subscribe to for each of LDS, RDS, CDS, and EDS.

When the set of resource names for one of these APIs goes from empty to non-empty, we start an ADS call if there is not already one going and then send a request on the call for the relevant API.

When the set of resource names for one of these APIs goes from non-empty to empty, then if the set of resource names is also empty for all of the other APIs, we stop the ADS call.

Note that it's important to track the resource names that we're subscribing to independently of the cluster/endpoint maps in the XdsClient, because the server can send us additional resources beyond the ones that we're actually subscribing to. In that case, we need to cache the addition resources the server sent in the maps, but the set of resource names that we populate in our requests does not change.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 653 at r11 (raw file):

  }
  if (!xds_client()->endpoint_map_.empty())
    SendMessageLocked(false, initial_message);

Please put braces around the body of this if statement.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 754 at r11 (raw file):

  op.data.send_message.send_message = send_message_payload_;
  Ref(DEBUG_LOCATION, "ADS+OnRequestSentLocked").release();
  GRPC_CLOSURE_INIT(&on_request_sent_, OnRequestSentLocked, this,

This can probably be moved to the ctor.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 804 at r11 (raw file):

    }
    // Update the cluster state.
    cluster_state.update = std::move(cds_update);

We need to validate all of the resources sent by the server before we apply any of them. If any one resource is invalid, then we need to reject the entire response, even if it contained some valid resources.

Also, once we are done processing the response, we need to send the server either an ACK or a NACK, depending on whether all of the resources were valid.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 887 at r11 (raw file):

    }
    // Update the cluster state.
    endpoint_state.update = std::move(eds_update);

Same here.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 908 at r11 (raw file):

  self->send_message_payload_ = nullptr;
  // Continue to send another pending message if any.
  if (self->cds_request_buffered_) {

It looks like we're always prioritizing CDS over EDS if both of them have pending requests to send. If there is a lot of update churn, could this cause us to be constantly sending CDS requests and never get around to sending an EDS request? Maybe it would be better to model this as a queue. Although it would also be good to preserve the property (which this code currently has) of sending only the most recent list of resource names for each resource type. In other words, if the set of resource names for (e.g.) EDS changes twice while we're waiting for a CDS request to be sent, we should ideally be smart enough to send only the most recent list.

For now, let's just leave this as-is, but please add a TODO assigned to me indicating that we might need to fix this if we see EDS requests being starved when there are frequent updates.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 955 at r11 (raw file):

  EdsUpdateMap eds_update_map;
  VersionState new_version;
  bool is_cds;

As mentioned elsewhere, this should be an enum, since we have 4 different types of resources to handle.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 956 at r11 (raw file):

  VersionState new_version;
  bool is_cds;
  // FIXME: 1. If decoding fails, we don't know whether it's CDS or not, then

These are good questions. I think perhaps we need the decoding API to differentiate between invalid data and an unparseable response. For example, if we can't decode the proto at all, then there's probably nothing we can do but cancel the call and start a new one. But if we get a request that just has an unknown resource type, then we should just generate a request to NACK the data using the resource type and nonce from the response.

I don't think using separate streams would help with this, and I suspect that we will never wind up implementing separate streams, because the xDS ecosystem in general seems to be moving more toward the ADS model.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 969 at r11 (raw file):

    GRPC_ERROR_UNREF(parse_error);
    // NACK.
    ads_calld->SendMessageLocked(is_cds);

We need to populate the error_detail field in the DiscoveryRequest when sending a NACK.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1158 at r11 (raw file):

    // Ignore identical update.
    if (lrs_calld->load_reporting_interval_ == new_load_reporting_interval &&
        strcmp(lrs_calld->cluster_name_.get(), new_cluster_name.get()) == 0) {

We should store the list of clusters that we get from the LRS server, since that tells us which clusters we actually need to send updates for.


src/core/lib/gprpp/string_view.h, line 50 at r11 (raw file):

class StringView;
inline int StringViewCmp(const StringView lhs, const StringView rhs);

Using inline doesn't help unless the method is defined in the .h file.

Actually, I don't see this function defined anywhere.


src/core/lib/gprpp/string_view.h, line 127 at r11 (raw file):

  }

  bool operator<(const StringView& other) const {

Is this part of the absl::string_view API? If not, we should not add it here, since that will make it harder to replace this with absl::string_view.


src/proto/grpc/testing/xds/cds_for_test.proto, line 51 at r11 (raw file):

message Cluster {
  ConfigSource lrs_server = 42;

I think we need more than just this field here. See the xDS API flow doc for what fields we check in the Cluster proto.


test/cpp/end2end/xds_end2end_test.cc, line 391 at r11 (raw file):

      if (!stream->Read(&request)) return;
      // FIXME: Record CDS request number.
      if (request.type_url() == kCdsTypeUrl) {

Instead of hard-coding this here, maybe we should just represent the CDS responses in responses_and_delays_, just like we do with the EDS responses? Or maybe we need to change the abstraction here so that this is a true fake, where we populate the server with xds resources and let it generate the responses normally from that database?

Copy link
Member Author

@AspirinSJL AspirinSJL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! Most of my confusion about the misalignment and the names are already put into the FIXMEs inline.

Reviewable status: 5 of 10 files reviewed, 34 unresolved discussions (waiting on @apolcyn, @AspirinSJL, and @markdroth)


src/core/ext/filters/client_channel/xds/xds_api.h, line 146 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

In general, to support copying, you should define both a copy ctor and a copy assignment operator, and to support moving, you should define both a move ctor and a move assignment operator. Right now, you have a copy ctor but no copy assignment operator, and you have a move assignment operator but no move ctor.

Completed the {copy, move} * {ctor, move assignment operator}.

FWIW, the previous subset of those 4 methods seems sufficient to fix the build issues I saw.


src/core/ext/filters/client_channel/xds/xds_api.h, line 201 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of is_cds, we should probably make this a pointer to an enum type, since we will ultimately need to be able to handle LDS, RDS, CDS, and EDS resources.

Done.


src/core/ext/filters/client_channel/xds/xds_api.h, line 211 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I assume the key string is the cluster name? Please document.

Done.


src/core/ext/filters/client_channel/xds/xds_api.h, line 218 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of passing in the list of expected names, we should have this return the list of cluster names that the LRS server asked for. That way, if it asks for one that we don't initially know about, but then we start using that cluster later, we can immediately know to start sending load reports for it.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 385 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As per recent discussion, we actually need to validate and cache any resource sent to us by the server, even if it's for clusters we didn't expect, so I think the expected_cluster_names or expected_eds_service_names arguments should be removed here.

According to the current appendix to the design doc, we need to cache CDS resources but not the EDS ones due to the different characteristics of resource types. So I'm only removing expected_cluster_names.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 215 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

On the very first request for a given resource type, there is no version, so we should not set this field.

The proto (https://github.com/envoyproxy/envoy/blob/38adf1f02e95cf7a7078cdaa39032b62ca1e2ebf/api/envoy/api/v2/discovery.proto#L21) says it should be "empty on the first request". I assume setting it to an empty string is fine, so I made the default value of version_info a unique pointer to an empty string. Then we don't have to handle the special case for the first request.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 233 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

On the first request for a given resource type, this will not be set.

Same as above.

https://github.com/envoyproxy/envoy/blob/38adf1f02e95cf7a7078cdaa39032b62ca1e2ebf/api/envoy/api/v2/discovery.proto#L48


src/core/ext/filters/client_channel/xds/xds_api.cc, line 251 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same as above: this will not be set on the first request.

Same as above.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 270 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This will not be set on the first request.

Same as above.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 290 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/Responsed/Response/

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 359 at r11 (raw file):
According to the proto (https://github.com/envoyproxy/envoy/blob/4787e0a9b6810ff4cc0dd3f8d78cb8a1d7c2d73a/api/envoy/api/v2/cds.proto#L789), we should send the report to the lrs_server if it's present in the CDS response. But we are using it as a bit.

If present, tells the client where to send load reports via LRS. If not present, the client will fall back to a client-side default


src/core/ext/filters/client_channel/xds/xds_api.cc, line 685 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think that if an individual cluster's snapshot is all-zero, then we should not add that snapshot to the map.

I think the current code is fine. If we don't want to see clusters with all-zero snapshot in this function, we need to preprocess the snapshots before this function is called. I think it might be better to put such trivial code in xds_api.cc.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 704 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this should be the cluster name. In the EDS-only mode (our old design, which is what the LRS design doc was written for), the cluster name and EDS service name would always be the same. But if they're different, then this needs to be the cluster name.

Hmmm... This may mean that we need to plumb the cluster name down into the EDS LB policy. Let me think about this for a bit.

For now, let's change this to say "cluster" instead of "eds_service_name" in all the variables and comments, and please add a TODO assigned to me in the xds LB policy indicating that we should be passing the cluster name instead of the EDS service name.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 762 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As I mentioned elsewhere, I think this function should just return the list of cluster names returned by the LRS server. That way, the XdsClient can automatically know to start reporting load for a new cluster for which it gets data after it has gotten the initial LRS response.

Removed.

So if my understanding is correct, LRS is also a state-of-the-world API. The LRS server asks for some clusters to report load. The XdsClient will report load for clusters that are asked by the LRS server.


src/core/ext/filters/client_channel/xds/xds_client.h, line 104 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As I mentioned elsewhere, the parameter here should definitely be the cluster name, not the EDS service name. Please add a TODO assigned to me in the xds LB policy about this.

Do we also need to add eds_service_name? The proto comment says the client stats with the same eds_service_name will be aggregated, which is why I felt the client stats should be keyed by the eds_service_name.


src/core/ext/filters/client_channel/xds/xds_client.h, line 115 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why are these three methods needed? The LB policies should only interact with the XdsClient via the watcher.

Sorry, they should be private methods of XdsClient.

Moved.


src/core/ext/filters/client_channel/xds/xds_client.h, line 156 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

These two methods should probably be private now, since they should not need to be called by the XdsClient object.

Actually they are no longer used any more.

Deleted.


src/core/ext/filters/client_channel/xds/xds_client.h, line 243 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

These fields should probably go inside of the AdsCallState object, because versions are only relevant within a single call. If the call fails and is restarted, we start again with no version.

Done.

But shouldn't the cluster_map_ and the cds_version_state_ be stored together?

If we NACK the first response from a new ADS call, should we NACK with empty version (because we haven't used any response from this new call) or with the version of the currently used resource (so that the server knows which version we are using)?

The doc says we should send empty version, but I wanted to double check.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 130 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We'll probably need to replace is_cds with an enum, so that we can support LDS, RDS, CDS, and EDS.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 426 at r11 (raw file):
Done.

the only thing we need to track inside of ChannelState is the set of resource names that we want to subscribe to for each of LDS, RDS, CDS, and EDS.

Looks like we can tell if a cluster is pre-cached by whether it has watchers, so we don't have to store the subscribed names separately.

When the set of resource names for one of these APIs goes from empty to non-empty, we start an ADS call if there is not already one going and then send a request on the call for the relevant API.

We need to send a new request whenever a new resource name is watched, not just when the number of resource names for a specific type changes from 0 to 1, right?


src/core/ext/filters/client_channel/xds/xds_client.cc, line 653 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please put braces around the body of this if statement.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 754 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This can probably be moved to the ctor.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 804 at r11 (raw file):

We need to validate all of the resources sent by the server before we apply any of them. If any one resource is invalid, then we need to reject the entire response, even if it contained some valid resources.

I can understand this should be true for CDS.

But does this also apply to EDS? If the resource names are A and B, but we are only interested in B. Should we accept B and ignore A if A contains invalid data?

Also, once we are done processing the response, we need to send the server either an ACK or a NACK, depending on whether all of the resources were valid.

Yes, the validation is done during parsing. If the parsing (and validation) fails, an error is returned and we will NACK. Otherwise, we accept the update and ACK.

I renamed those methods AcceptXdsUpdate() to make it clearer.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 887 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same here.

Same as above.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 908 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It looks like we're always prioritizing CDS over EDS if both of them have pending requests to send. If there is a lot of update churn, could this cause us to be constantly sending CDS requests and never get around to sending an EDS request? Maybe it would be better to model this as a queue. Although it would also be good to preserve the property (which this code currently has) of sending only the most recent list of resource names for each resource type. In other words, if the set of resource names for (e.g.) EDS changes twice while we're waiting for a CDS request to be sent, we should ideally be smart enough to send only the most recent list.

For now, let's just leave this as-is, but please add a TODO assigned to me indicating that we might need to fix this if we see EDS requests being starved when there are frequent updates.

True.. Ideally it should include some kind of time info, like attaching a timestamp of the latest buffered message for each resource type, or using a queue and moving the latest buffered type to the end of the queue. But it feels like an overkill for now.

Added a TODO.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 955 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As mentioned elsewhere, this should be an enum, since we have 4 different types of resources to handle.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 956 at r11 (raw file):

For example, if we can't decode the proto at all, then there's probably nothing we can do but cancel the call and start a new one. But if we get a request that just has an unknown resource type, then we should just generate a request to NACK the data using the resource type and nonce from the response.

If the resource type is unknown, we would also have to restart the call because we don't know which resource type to NACK, right?

In general, is it safe to assume that we will always restart the call if we are in an undefined state for ACK/NACK (like, the fields necessary for ACK/NACK are not all found)?


src/core/ext/filters/client_channel/xds/xds_client.cc, line 969 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We need to populate the error_detail field in the DiscoveryRequest when sending a NACK.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1158 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We should store the list of clusters that we get from the LRS server, since that tells us which clusters we actually need to send updates for.

Done.


src/core/lib/gprpp/string_view.h, line 50 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Using inline doesn't help unless the method is defined in the .h file.

Actually, I don't see this function defined anywhere.

It's pre-existing in this file at line 161. I forward declared it so that the < operator can use it.


src/core/lib/gprpp/string_view.h, line 127 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Is this part of the absl::string_view API? If not, we should not add it here, since that will make it harder to replace this with absl::string_view.

I think so.

https://github.com/abseil/abseil-cpp/blob/3df7b52a6ada51a72a23796b844549a7b282f1b8/absl/strings/string_view.h#L560


src/proto/grpc/testing/xds/cds_for_test.proto, line 51 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think we need more than just this field here. See the xDS API flow doc for what fields we check in the Cluster proto.

Done.

Sorry, I should have copied all the necessary fields.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting closer! Please let me know if you have any questions about any of this.

Reviewed 2 of 5 files at r12, 4 of 4 files at r13.
Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @apolcyn, @AspirinSJL, and @markdroth)


src/core/ext/filters/client_channel/lb_policy/xds/xds.cc, line 747 at r13 (raw file):

                                        endpoint_watcher_);
  if (config_->lrs_load_reporting_server_name() != nullptr) {
    // TODO(roth): We should pass the cluster_name (instead if the

s/instead of/in addition to/


src/core/ext/filters/client_channel/lb_policy/xds/xds.cc, line 850 at r13 (raw file):

    }
    if (config_->lrs_load_reporting_server_name() != nullptr) {
      // TODO(roth): We should pass the cluster_name (instead if the

s/instead of/in addition to/


src/core/ext/filters/client_channel/xds/xds_api.h, line 44 at r13 (raw file):

  // The name to use in the EDS request.
  // If null, the cluster name will be used.
  std::unique_ptr<char> eds_service_name;

Please use std::string for these fields, since we're allowed to do that now.


src/core/ext/filters/client_channel/xds/xds_api.h, line 53 at r13 (raw file):

  CdsUpdate() = default;

  // Copyable.

Once you switch to std::string for the data fields, there's probably no need to define any ctors or assignment operators, since the defaults generated by the compiler should work fine.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 385 at r1 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

According to the current appendix to the design doc, we need to cache CDS resources but not the EDS ones due to the different characteristics of resource types. So I'm only removing expected_cluster_names.

Yup, that's right.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 215 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

The proto (https://github.com/envoyproxy/envoy/blob/38adf1f02e95cf7a7078cdaa39032b62ca1e2ebf/api/envoy/api/v2/discovery.proto#L21) says it should be "empty on the first request". I assume setting it to an empty string is fine, so I made the default value of version_info a unique pointer to an empty string. Then we don't have to handle the special case for the first request.

I think it should be unset, not set to an empty string.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 233 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Same as above.

https://github.com/envoyproxy/envoy/blob/38adf1f02e95cf7a7078cdaa39032b62ca1e2ebf/api/envoy/api/v2/discovery.proto#L48

Same as above: it should be unset, not the empty string.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 359 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

According to the proto (https://github.com/envoyproxy/envoy/blob/4787e0a9b6810ff4cc0dd3f8d78cb8a1d7c2d73a/api/envoy/api/v2/cds.proto#L789), we should send the report to the lrs_server if it's present in the CDS response. But we are using it as a bit.

If present, tells the client where to send load reports via LRS. If not present, the client will fall back to a client-side default

In the case of gRPC, the client-side default is to disable load reporting. There is no violation of the proto here.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 685 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

I think the current code is fine. If we don't want to see clusters with all-zero snapshot in this function, we need to preprocess the snapshots before this function is called. I think it might be better to put such trivial code in xds_api.cc.

What's the benefit of doing it in the caller? We're already checking whether it's all-zero here, so it seems simpler to do this at the same time.

In fact, if we do this, then this code gets a bit simpler, because we no longer need the all_zero variable. Instead, we can just check whether snapshot_map is empty, because we will only add entries to the map that are non-zero.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 762 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Removed.

So if my understanding is correct, LRS is also a state-of-the-world API. The LRS server asks for some clusters to report load. The XdsClient will report load for clusters that are asked by the LRS server.

It's not really state-of-the-world, because the client does not need to send load reports for clusters for which it has no data. The client only needs to report load for clusters for which the LRS server has requested and the client has data for. If the client is not sending any requests to a given cluster, then it has no load data and therefore does not need to send a load report.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 239 at r13 (raw file):

  // Set error_detail if it's a NACK.
  if (error != GRPC_ERROR_NONE) {
    const char* error_string = grpc_error_string(error);

I don't think we want to send the full error string, since that includes details like filenames and line numbers that the server won't care about. I think we just want to send the error description itself.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 286 at r13 (raw file):

  // Set error_detail if it's a NACK.
  if (error != GRPC_ERROR_NONE) {
    const char* error_string = grpc_error_string(error);

Same as above.


src/core/ext/filters/client_channel/xds/xds_client.h, line 104 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Do we also need to add eds_service_name? The proto comment says the client stats with the same eds_service_name will be aggregated, which is why I felt the client stats should be keyed by the eds_service_name.

Oh, good point -- I didn't realize that the proto required both the cluster name and the EDS service name. Yes, we probably need both parameters.

Don't worry about this for now. I'll take care of this along with the TODO in the xds LB policy.


src/core/ext/filters/client_channel/xds/xds_client.h, line 243 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Done.

But shouldn't the cluster_map_ and the cds_version_state_ be stored together?

If we NACK the first response from a new ADS call, should we NACK with empty version (because we haven't used any response from this new call) or with the version of the currently used resource (so that the server knows which version we are using)?

The doc says we should send empty version, but I wanted to double check.

The doc is correct. In non-incremental xDS, the version is only relevant within a single call. If the call fails and we start a new one, the version from the old call is irrelevant. If we NACK the first response on the new call, we send back an empty version. The protocol does not provide a way for the client to tell the server that it will continue using data from a previous call.


src/core/ext/filters/client_channel/xds/xds_client.h, line 180 at r13 (raw file):

    std::map<ClusterWatcherInterface*, std::unique_ptr<ClusterWatcherInterface>>
        watchers;
    std::set<XdsClientStats*> client_stats;

I think this should go back into the EndpointState struct. The data contained in an XdsClientStats object is inherently about a particular EDS service name more than it is about a cluster.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 426 at r11 (raw file):

We need to send a new request whenever a new resource name is watched, not just when the number of resource names for a specific type changes from 0 to 1, right?

For a given resource type, we need to send a new request whenever the set of resource names we want to subscribe to changes. That could mean an addition to the set or a removal from the set (although we need special handling for the case where we are unsubscribing from the last resource in LDS or CDS, since we can't send an empty resource_names field in those cases).


src/core/ext/filters/client_channel/xds/xds_client.cc, line 804 at r11 (raw file):

But does this also apply to EDS? If the resource names are A and B, but we are only interested in B. Should we accept B and ignore A if A contains invalid data?

For RDS and EDS, we can ignore any resources that we did not explicitly request, which means that we will not validate them nor will we cache them -- in other words, we treat the response as if it did not contain those resources to begin with.

However, if there is more than one resource in the response that we are not ignoring and one of those resources is invalid, then we need to NACK and the whole update, and then we should not accept any of the resources in the update. In other words, if we are using both resources A and B, and we get an update in which A is valid but B is invalid, we NACK the update, and we do not use the new data for A, even though it was valid, because it came in the same update with B, which was invalid.

Yes, the validation is done during parsing.

Ah, right -- I think I missed that when I was last looking at this code. How about adding a comment above the call to XdsAdsResponseDecodeAndParse() indicating that it also validates the data?


src/core/ext/filters/client_channel/xds/xds_client.cc, line 956 at r11 (raw file):

If the resource type is unknown, we would also have to restart the call because we don't know which resource type to NACK, right?

No. We should just copy the unknown resource type from the response when we send the NACK.

This probably means that we need to have XdsAdsResponseDecodeAndParse() return the resource type from the response directly instead of returning an enum value.

In general, is it safe to assume that we will always restart the call if we are in an undefined state for ACK/NACK (like, the fields necessary for ACK/NACK are not all found)?

This case should be exceedingly rare. I think that if it does happen, it's probably due to a bug in the management server, in which case restarting the call would probably not help. So we should probably just ignore the response in that case.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 128 at r13 (raw file):

  bool seen_response() const { return seen_response_; }

  bool HasWatcher() const;

Suggest calling this something like AdsCallNeeded().


src/core/ext/filters/client_channel/xds/xds_client.cc, line 274 at r13 (raw file):

  // Load reporting state.
  std::set<std::unique_ptr<char>>

Please use std::string instead of std::unique_ptr<char>.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 666 at r13 (raw file):

bool XdsClient::ChannelState::AdsCallState::HasWatcher() const {
  for (const auto p : xds_client()->cluster_map_) {

const auto& p


src/core/ext/filters/client_channel/xds/xds_client.cc, line 666 at r13 (raw file):

bool XdsClient::ChannelState::AdsCallState::HasWatcher() const {
  for (const auto p : xds_client()->cluster_map_) {

Since this method is being used to determine if the ADS call is still needed, it needs to check whether there are watchers for any resource type, not just clusters.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 937 at r13 (raw file):

    // NACK.
    ads_calld->SendMessageLocked(ads_type, parse_error, false);
    GRPC_ERROR_UNREF(parse_error);

SendMessageLocked() should take ownership of parse_error so that we don't need to unref it here.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1313 at r13 (raw file):

  [&]() {
    // Parse the response.
    std::set<std::unique_ptr<char>> new_cluster_names;

Please use std::string instead of std::unique_ptr<char>.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1331 at r13 (raw file):

              xds_client, new_cluster_names.size(),
              new_load_reporting_interval);
      size_t i = 0;

Suggest writing this as:

for (size_t i = 0; i < new_cluster_names.size(); ++i) {
  gpr_log(GPR_INFO, "[xds_client %p] cluster_name %" PRIuPTR ": %s",
          xds_client, i, new_cluster_names[i].get());
}

src/core/ext/filters/client_channel/xds/xds_client.cc, line 1350 at r13 (raw file):

    // Ignore identical update.
    bool same_cluster_names =
        new_cluster_names.size() == lrs_calld->cluster_names_.size() &&

Once you switch to using std::string above, I think you will be able to simply say new_cluster_names == lrs_calld->cluster_names_.


src/core/lib/gprpp/string_view.h, line 50 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

It's pre-existing in this file at line 161. I forward declared it so that the < operator can use it.

I think this forward declaration will no longer be needed if you take my suggestion below about making the < operator a standalone function instead of a member.


src/core/lib/gprpp/string_view.h, line 127 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

I think so.

https://github.com/abseil/abseil-cpp/blob/3df7b52a6ada51a72a23796b844549a7b282f1b8/absl/strings/string_view.h#L560

That's a standalone operator function, not a member function. Let's do the same thing here, so that we keep things consistent.

Copy link
Member Author

@AspirinSJL AspirinSJL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 11 files reviewed, 28 unresolved discussions (waiting on @apolcyn, @AspirinSJL, and @markdroth)


src/core/ext/filters/client_channel/lb_policy/xds/xds.cc, line 747 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/instead of/in addition to/

Done.


src/core/ext/filters/client_channel/lb_policy/xds/xds.cc, line 850 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/instead of/in addition to/

Done.


src/core/ext/filters/client_channel/xds/xds_api.h, line 44 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use std::string for these fields, since we're allowed to do that now.

Done.


src/core/ext/filters/client_channel/xds/xds_api.h, line 53 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Once you switch to std::string for the data fields, there's probably no need to define any ctors or assignment operators, since the defaults generated by the compiler should work fine.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 251 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Same as above.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 270 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Same as above.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 359 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

In the case of gRPC, the client-side default is to disable load reporting. There is no violation of the proto here.

Sorry I'm still confused here. I can understand that if lrs_server is not present, the client can fall back to whatever reasonable action it wants to take. But when it is present, shouldn't we send the report to lrs_server indicated here? What we are doing now is to degrade it to a bit to enable/disable load reporting and then "use the same server we obtained the CDS data from".


src/core/ext/filters/client_channel/xds/xds_api.cc, line 685 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

What's the benefit of doing it in the caller? We're already checking whether it's all-zero here, so it seems simpler to do this at the same time.

In fact, if we do this, then this code gets a bit simpler, because we no longer need the all_zero variable. Instead, we can just check whether snapshot_map is empty, because we will only add entries to the map that are non-zero.

Done.

Sorry, I misread your previous comment and thought you suggested we shouldn't add the client_stats to the map if its snapshot is empty.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 762 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It's not really state-of-the-world, because the client does not need to send load reports for clusters for which it has no data. The client only needs to report load for clusters for which the LRS server has requested and the client has data for. If the client is not sending any requests to a given cluster, then it has no load data and therefore does not need to send a load report.

I see.

Added a check to only send load reports for the clusters asked for by the LRS server.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 239 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think we want to send the full error string, since that includes details like filenames and line numbers that the server won't care about. I think we just want to send the error description itself.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 286 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same as above.

Done.


src/core/ext/filters/client_channel/xds/xds_client.h, line 243 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The doc is correct. In non-incremental xDS, the version is only relevant within a single call. If the call fails and we start a new one, the version from the old call is irrelevant. If we NACK the first response on the new call, we send back an empty version. The protocol does not provide a way for the client to tell the server that it will continue using data from a previous call.

Sounds good. Thanks!


src/core/ext/filters/client_channel/xds/xds_client.h, line 180 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this should go back into the EndpointState struct. The data contained in an XdsClientStats object is inherently about a particular EDS service name more than it is about a cluster.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 426 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We need to send a new request whenever a new resource name is watched, not just when the number of resource names for a specific type changes from 0 to 1, right?

For a given resource type, we need to send a new request whenever the set of resource names we want to subscribe to changes. That could mean an addition to the set or a removal from the set (although we need special handling for the case where we are unsubscribing from the last resource in LDS or CDS, since we can't send an empty resource_names field in those cases).

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 804 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

But does this also apply to EDS? If the resource names are A and B, but we are only interested in B. Should we accept B and ignore A if A contains invalid data?

For RDS and EDS, we can ignore any resources that we did not explicitly request, which means that we will not validate them nor will we cache them -- in other words, we treat the response as if it did not contain those resources to begin with.

However, if there is more than one resource in the response that we are not ignoring and one of those resources is invalid, then we need to NACK and the whole update, and then we should not accept any of the resources in the update. In other words, if we are using both resources A and B, and we get an update in which A is valid but B is invalid, we NACK the update, and we do not use the new data for A, even though it was valid, because it came in the same update with B, which was invalid.

Yes, the validation is done during parsing.

Ah, right -- I think I missed that when I was last looking at this code. How about adding a comment above the call to XdsAdsResponseDecodeAndParse() indicating that it also validates the data?

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 956 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If the resource type is unknown, we would also have to restart the call because we don't know which resource type to NACK, right?

No. We should just copy the unknown resource type from the response when we send the NACK.

This probably means that we need to have XdsAdsResponseDecodeAndParse() return the resource type from the response directly instead of returning an enum value.

In general, is it safe to assume that we will always restart the call if we are in an undefined state for ACK/NACK (like, the fields necessary for ACK/NACK are not all found)?

This case should be exceedingly rare. I think that if it does happen, it's probably due to a bug in the management server, in which case restarting the call would probably not help. So we should probably just ignore the response in that case.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 128 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest calling this something like AdsCallNeeded().

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 274 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use std::string instead of std::unique_ptr<char>.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 666 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

const auto& p

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 666 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Since this method is being used to determine if the ADS call is still needed, it needs to check whether there are watchers for any resource type, not just clusters.

Yes. For now, the other resource type is EDS, which is already checked below.

Because we don't cache unwatched resource names for EDS, it's enough to check whether endpoint_map_ is empty.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 937 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

SendMessageLocked() should take ownership of parse_error so that we don't need to unref it here.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1313 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use std::string instead of std::unique_ptr<char>.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1331 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest writing this as:

for (size_t i = 0; i < new_cluster_names.size(); ++i) {
  gpr_log(GPR_INFO, "[xds_client %p] cluster_name %" PRIuPTR ": %s",
          xds_client, i, new_cluster_names[i].get());
}

I don't think std::set supports random access.

(It's better to use std::set than vector here because it's sorted and easier to compare.)


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1350 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Once you switch to using std::string above, I think you will be able to simply say new_cluster_names == lrs_calld->cluster_names_.

Done. That's great!


src/core/lib/gprpp/string_view.h, line 50 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this forward declaration will no longer be needed if you take my suggestion below about making the < operator a standalone function instead of a member.

This is still needed because the StringViewCmp is the last function definition in this file? And it should stay the at the bottom in this file because it uses StringView.


src/core/lib/gprpp/string_view.h, line 127 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

That's a standalone operator function, not a member function. Let's do the same thing here, so that we keep things consistent.

Done.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very good! I think most of my remaining comments are relatively minor.

Please let me know if you have any questions. Thanks!

Reviewed 4 of 9 files at r15, 1 of 4 files at r16, 4 of 4 files at r17.
Reviewable status: all files reviewed, 22 unresolved discussions (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/xds/xds_api.h, line 170 at r17 (raw file):

};

// Creates a CDS request querying \a cluster_names.

Please update this comment.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 359 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Sorry I'm still confused here. I can understand that if lrs_server is not present, the client can fall back to whatever reasonable action it wants to take. But when it is present, shouldn't we send the report to lrs_server indicated here? What we are doing now is to degrade it to a bit to enable/disable load reporting and then "use the same server we obtained the CDS data from".

There are 3 possible values for this field in the API:

  1. The field is not set, in which case we fall back to our local default, which is to disable load reporting.
  2. The field is set to "self", in which case we enable load reporting.
  3. The field is set to point to some other server. We do not yet support this case, so we treat the proto as being invalid. (The check immediately above this line returns an error in that case.)

Note that in case (3), we do not ignore the value of the field and use the CDS server; instead, we report the config as being invalid and reject the update.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 345 at r17 (raw file):

namespace {

std::unique_ptr<char> StringCopy(const upb_strview& strview) {

We should not be using std::unique_ptr<char> anywhere. This causes a problem, because these strings are allocated via malloc(), but std::unique_ptr<> will attempt to free them with delete instead of free().

The two valid choices here are UniquePtr<char> (which will use free() instead of delete) or std::string. And I think we should strongly prefer the latter, except in cases where (a) we are currently using something like gpr_asprintf() to construct the string, in which case converting it to std::string would cause an extra allocation and copy, or (b) we need to return the value outside of the C-core API, which should never occur in the xds code.

Note that if we use std::string, this function is probably not necessary at all.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 635 at r17 (raw file):

  upb_strview type_url_strview =
      envoy_api_v2_DiscoveryResponse_type_url(response);
  *type_url = std::string(type_url_strview.data, type_url_strview.size);

Suggestion, feel free to ignore:

For efficiency, it might make sense to return the type URL as a const char* pointing to the constant that matches the string in the proto (e.g., kRdsTypeUrl, kCdsTypeUrl, etc). That way, whenever we need to switch on the type URL or use it as a map key, it can be done based on the address of the constant instead of doing string comparisons.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 649 at r17 (raw file):

        GRPC_ERROR_CREATE_FROM_STATIC_STRING("Unsupported ADS resource type");
  }
  //

Missing comment.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 651 at r17 (raw file):

  //
  if (error != GRPC_ERROR_NONE) {
    *version = "";

I think we should return the version unconditionally. The logic outside of this function can decide if it doesn't need it for anything, but it makes the API of this function easier to understand if the only thing the caller needs to check is whether it returns an error.


src/core/ext/filters/client_channel/xds/xds_client.h, line 179 at r17 (raw file):

    // The latest data seen from CDS.
    CdsUpdate update;
    bool seen_update = false;

Instead of using this bool, consider changing the update field to be of type Optional<CdsUpdate>.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 666 at r13 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Yes. For now, the other resource type is EDS, which is already checked below.

Because we don't cache unwatched resource names for EDS, it's enough to check whether endpoint_map_ is empty.

Ah! Good point.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1331 at r13 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

I don't think std::set supports random access.

(It's better to use std::set than vector here because it's sorted and easier to compare.)

Oh, good point -- somehow I had thought this was a vector.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 132 at r17 (raw file):

  // Takes ownership of \a error.
  void SendMessageLocked(const std::string& type_url,
                         const std::string& nonce_for_unknown_type,

Why is this just for unknown types? Don't we need to send a nonce even when sending an ACK or NACK for a response of a known type?


src/core/ext/filters/client_channel/xds/xds_client.cc, line 132 at r17 (raw file):

  // Takes ownership of \a error.
  void SendMessageLocked(const std::string& type_url,
                         const std::string& nonce_for_unknown_type,

This should probably be passed by value instead of as a const reference, so that the caller can use std::move() to pass it in. This would avoid the need for a string copy.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 440 at r17 (raw file):

        Ref(DEBUG_LOCATION, "ChannelState+ads")));
  } else {
    // FIXME: What's nonce for? It can be empty only if "the client has not yet

Whenever we send a request as a reaction to a previous response, we need to include the nonce from the response that we're reacting to, so that the server knows which response we're referring to (see https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#resource-updates for details). This is true regardless of whether we're sending an ACK or a NACK.

The only times that we can send a request without a nonce are when (a) we are sending the first request for a given resource type or (b) we are sending a spontaneous request to change the set of resource names we're subscribing to, not in reaction to a response we've received.

This particular piece of code looks like it's case (b), so it's fine to not send a nonce here.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 672 at r17 (raw file):

}

bool XdsClient::ChannelState::AdsCallState::AdsCallNeeded() const {

It looks like the only place this is called from is XdsClient::ChannelState::OnWatcherRemoved(). Maybe just move this code directly into that function, so that this function is not needed?


src/core/ext/filters/client_channel/xds/xds_client.cc, line 707 at r17 (raw file):

        type_url, nonce_for_unknown_type, error);
  }
  GRPC_ERROR_UNREF(error);

Whichever one of the *RequestCreateAndEncode() functions we called above should take ownership of error, so we shouldn't need to unref it here.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 866 at r17 (raw file):

  auto& buffered_request_map = self->buffered_request_map_;
  // TODO(juanlishen): Simplify the code when adding LDS/RDS.
  auto iter = buffered_request_map.find(kCdsTypeUrl);

Could just say:

auto& buffered_request = self->buffered_request_map_[kCdsTypeUrl];
if (buffered_request != nullptr) {
  self->SendMessageLocked(kCdsTypeUrl, "", buffered_request->error, false);
  buffered_request.reset();
  return;
}

This approach will leave the map entry present even if it's set to null, but given how few resource types we support, that seems like a fine trade-off for the increased readability.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 935 at r17 (raw file):

      &eds_update_map, &version, &nonce, &type_url);
  grpc_slice_unref_internal(response_slice);
  // This anonymous lambda is a hack to avoid the usage of goto.

In this particular case, I think if/else blocks would be a bit clearer, since there are actually just three simple cases here:

if (type_url.empty()) {
  // ADS response can't be parsed, not even enough info to NACK.
} else if (parse_error != GRPC_ERROR_NONE) {
  // Ignore update and send NACK.
} else {
  // Accept update and send ACK.
}

src/core/ext/filters/client_channel/xds/xds_client.cc, line 938 at r17 (raw file):

  [&]() {
    // If the type_url is unparsable, ignore it.
    if (type_url.empty()) return;

Let's make sure to log an error in this case.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 941 at r17 (raw file):

    // Update nonce.
    if (type_url == kCdsTypeUrl) {
      ads_calld->cds_version_state_.nonce = nonce;

I don't think we need to store the nonce for future use. The only thing we should ever use it for is sending the ACK or NACK below, so we shouldn't need to store it outside of passing it to SendMessageLocked() below (which can be done with std::move()).


src/core/ext/filters/client_channel/xds/xds_client.cc, line 947 at r17 (raw file):

    // If the update can't be accepted, NACK it with the received type_url and
    // nonce.
    if (version.empty()) {

Seems clearer to detect this based on parse_error being set rather than on version being empty.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1278 at r17 (raw file):

  // Only send load reports for the clusters that are asked for by the LRS
  // server.
  std::string cluster_name_string(cluster_name.data(), cluster_name.size());

I don't think you need to do this explicitly. StringView should auto-convert to std::string.

explicit operator std::basic_string<char, std::char_traits<char>, Allocator>()


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1522 at r17 (raw file):

    cluster_state.watchers.erase(it);
    if (cluster_state.watchers.empty()) {
      // Sending CDS requests with no cluster names will cause Envoy-style

If envoyproxy/envoy#9029 gets merged, then we will no longer need this special case.


src/core/lib/gprpp/string_view.h, line 50 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

This is still needed because the StringViewCmp is the last function definition in this file? And it should stay the at the bottom in this file because it uses StringView.

Hmm, wait, I don't actually see where StringViewCmp() is defined. Where is that?

Also, it looks like this functionality should actually be exposed as StringView::compare(), to be compatible with absl.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I forgot to mention: The one significant thing that's still missing here is updating the tests to really exercize the new CDS code. We should find a way to address that before we merge this.

Reviewable status: all files reviewed, 22 unresolved discussions (waiting on @apolcyn and @AspirinSJL)

@AspirinSJL AspirinSJL force-pushed the cds_in_xds_client branch 3 times, most recently from abd26c4 to 2e41075 Compare November 27, 2019 21:39
Copy link
Member Author

@AspirinSJL AspirinSJL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update the test on Monday.

Reviewable status: 2 of 16 files reviewed, 20 unresolved discussions (waiting on @apolcyn, @AspirinSJL, and @markdroth)


src/core/ext/filters/client_channel/xds/xds_api.h, line 170 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please update this comment.

Done.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 359 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

There are 3 possible values for this field in the API:

  1. The field is not set, in which case we fall back to our local default, which is to disable load reporting.
  2. The field is set to "self", in which case we enable load reporting.
  3. The field is set to point to some other server. We do not yet support this case, so we treat the proto as being invalid. (The check immediately above this line returns an error in that case.)

Note that in case (3), we do not ignore the value of the field and use the CDS server; instead, we report the config as being invalid and reject the update.

Understood, thanks!


src/core/ext/filters/client_channel/xds/xds_api.cc, line 345 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We should not be using std::unique_ptr<char> anywhere. This causes a problem, because these strings are allocated via malloc(), but std::unique_ptr<> will attempt to free them with delete instead of free().

The two valid choices here are UniquePtr<char> (which will use free() instead of delete) or std::string. And I think we should strongly prefer the latter, except in cases where (a) we are currently using something like gpr_asprintf() to construct the string, in which case converting it to std::string would cause an extra allocation and copy, or (b) we need to return the value outside of the C-core API, which should never occur in the xds code.

Note that if we use std::string, this function is probably not necessary at all.

Done.

Audited xds/ folder.

CdsUpdate::lrs_load_reporting_server_name is reverted to UniquePtr<char> because we are using its null value to indicate whether load reporting is enabled or not.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 420 at r17 (raw file):

    }
    upb_strview cluster_name = envoy_api_v2_Cluster_name(cluster);
    cds_update_map->emplace(StringCopy(cluster_name), std::move(cds_update));

The auto conversion from StringView to std::string doesn't work here somehow.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 635 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggestion, feel free to ignore:

For efficiency, it might make sense to return the type URL as a const char* pointing to the constant that matches the string in the proto (e.g., kRdsTypeUrl, kCdsTypeUrl, etc). That way, whenever we need to switch on the type URL or use it as a map key, it can be done based on the address of the constant instead of doing string comparisons.

Yes, but the problem is how to deal with the unsupported resource types. Since we need to NACK them with the exact type_url received, we have to store them or list all the possible types as static strings (which is not easy to maintain if there are new types Envoy add in the future).


src/core/ext/filters/client_channel/xds/xds_api.cc, line 649 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Missing comment.

Stale.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 651 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think we should return the version unconditionally. The logic outside of this function can decide if it doesn't need it for anything, but it makes the API of this function easier to understand if the only thing the caller needs to check is whether it returns an error.

Done.


src/core/ext/filters/client_channel/xds/xds_client.h, line 179 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of using this bool, consider changing the update field to be of type Optional<CdsUpdate>.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 132 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why is this just for unknown types? Don't we need to send a nonce even when sending an ACK or NACK for a response of a known type?

Done.

I thought that we should store nonce state for spontaneous requests in the future. So I put nonce together with version.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 132 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should probably be passed by value instead of as a const reference, so that the caller can use std::move() to pass it in. This would avoid the need for a string copy.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 440 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Whenever we send a request as a reaction to a previous response, we need to include the nonce from the response that we're reacting to, so that the server knows which response we're referring to (see https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#resource-updates for details). This is true regardless of whether we're sending an ACK or a NACK.

The only times that we can send a request without a nonce are when (a) we are sending the first request for a given resource type or (b) we are sending a spontaneous request to change the set of resource names we're subscribing to, not in reaction to a response we've received.

This particular piece of code looks like it's case (b), so it's fine to not send a nonce here.

Then the proto comment (https://github.com/envoyproxy/envoy/blob/b9f798dd1dec86d3ed067c449e3c3a49c57984b1/api/envoy/api/v2/discovery.proto#L46) needs updating to reflect (b).


src/core/ext/filters/client_channel/xds/xds_client.cc, line 672 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It looks like the only place this is called from is XdsClient::ChannelState::OnWatcherRemoved(). Maybe just move this code directly into that function, so that this function is not needed?

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 707 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Whichever one of the *RequestCreateAndEncode() functions we called above should take ownership of error, so we shouldn't need to unref it here.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 866 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Could just say:

auto& buffered_request = self->buffered_request_map_[kCdsTypeUrl];
if (buffered_request != nullptr) {
  self->SendMessageLocked(kCdsTypeUrl, "", buffered_request->error, false);
  buffered_request.reset();
  return;
}

This approach will leave the map entry present even if it's set to null, but given how few resource types we support, that seems like a fine trade-off for the increased readability.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 935 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

In this particular case, I think if/else blocks would be a bit clearer, since there are actually just three simple cases here:

if (type_url.empty()) {
  // ADS response can't be parsed, not even enough info to NACK.
} else if (parse_error != GRPC_ERROR_NONE) {
  // Ignore update and send NACK.
} else {
  // Accept update and send ACK.
}

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 938 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Let's make sure to log an error in this case.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 941 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think we need to store the nonce for future use. The only thing we should ever use it for is sending the ACK or NACK below, so we shouldn't need to store it outside of passing it to SendMessageLocked() below (which can be done with std::move()).

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 947 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Seems clearer to detect this based on parse_error being set rather than on version being empty.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1278 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think you need to do this explicitly. StringView should auto-convert to std::string.

explicit operator std::basic_string<char, std::char_traits<char>, Allocator>()

Done.


src/core/lib/gprpp/string_view.h, line 50 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Hmm, wait, I don't actually see where StringViewCmp() is defined. Where is that?

Also, it looks like this functionality should actually be exposed as StringView::compare(), to be compatible with absl.

It's here:

inline int StringViewCmp(const StringView lhs, const StringView rhs) {

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! The only substantive remaining issues are the nonce handling (sorry for steering you wrong there!) and the test changes.

Please let me know if you have any questions. Thanks!

Reviewed 14 of 14 files at r18.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/lb_policy/xds/xds.cc, line 82 at r18 (raw file):

                  RefCountedPtr<LoadBalancingPolicy::Config> fallback_policy,
                  std::string eds_service_name,
                  grpc_core::UniquePtr<char> lrs_load_reporting_server_name)

This can be of type Optional<std::string>.


src/core/ext/filters/client_channel/xds/xds_api.cc, line 635 at r17 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Yes, but the problem is how to deal with the unsupported resource types. Since we need to NACK them with the exact type_url received, we have to store them or list all the possible types as static strings (which is not easy to maintain if there are new types Envoy add in the future).

In the case of an unknown type, we could detect that simply by noticing that the returned value does not point to any of the predefined constant strings.

But in any case, this is fine as-is; we can optimize this later if needed.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 132 at r17 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Done.

As per discussion below, I was wrong about how to handle the nonce, so I think this should go back to being a const reference.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 440 at r17 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Then the proto comment (https://github.com/envoyproxy/envoy/blob/b9f798dd1dec86d3ed067c449e3c3a49c57984b1/api/envoy/api/v2/discovery.proto#L46) needs updating to reflect (b).

Sorry, I think I misread the docs previously. In case (b), we do need to send the nonce. So I guess we do need to store it along with the version string...

Sorry for steering you wrong here. :(


src/core/ext/filters/client_channel/xds/xds_client.cc, line 741 at r18 (raw file):

    }
    // Update the cluster state.
    cluster_state.update.set(cds_update);

Don't we still want to use std::move() here?


src/core/ext/filters/client_channel/xds/xds_client.cc, line 906 at r18 (raw file):

      response_slice, xds_client->EdsServiceNames(), &cds_update_map,
      &eds_update_map, &version, &nonce, &type_url);
  gpr_log(GPR_ERROR, "=== afrer parse");

I assume this was just added for debugging. Please remove before merging.


src/core/ext/filters/client_channel/xds/xds_client_stats.h, line 59 at r18 (raw file):

  }

  const char* region() const { return region_.c_str(); }

Would it make sense for these accessors to return const std::string&?


src/core/lib/gprpp/string_view.h, line 50 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

It's here:

inline int StringViewCmp(const StringView lhs, const StringView rhs) {

Ah, okay, I see what's going on. I saw the #endif // GRPC_USE_ABSL and assumed that was the end of the file, so I didn't notice the stuff underneath.

As mentioned above, I think we should expose this as StringView::compare(), to be compatible with absl. I think this means that StringViewCmp() is no longer needed; its functionality can be moved directly into StringView::compare(), and we can update the callers (looks like there are only 2 of them).

It might be a good idea to split this change into a separate PR.

Copy link
Member Author

@AspirinSJL AspirinSJL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm stuck in a bug.. When there are multiple localities in the EDS response, the parsing code will fail because the type_url of the resource is not EDS. The type_url of Anyshould be auto filled using the class of the object encoded, which should be type.googleapis.com/envoy.api.v2.ClusterLoadAssignment. It's strange because the parsing can pass if there is only one locality. There is no change in the EDS response building part in the test. And there is only refactoring code in the EDS response parsing code.

Any idea?

Reviewable status: 9 of 17 files reviewed, 9 unresolved discussions (waiting on @apolcyn, @AspirinSJL, and @markdroth)


src/core/ext/filters/client_channel/lb_policy/xds/xds.cc, line 82 at r18 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This can be of type Optional<std::string>.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 440 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Sorry, I think I misread the docs previously. In case (b), we do need to send the nonce. So I guess we do need to store it along with the version string...

Sorry for steering you wrong here. :(

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 741 at r18 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Don't we still want to use std::move() here?

std::move() doesn't have any effect here due to how Optional<> is implemented.

  void set(const T& val) {
    value_ = val;
    set_ = true;
  }

Adding move assignment method in #21342.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 906 at r18 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I assume this was just added for debugging. Please remove before merging.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 435 at r19 (raw file):

    const std::string& type_url) {
  if (ads_calld_ == nullptr) {
    // FIXME: Honor backoff.

This is actually not an issue. The backoff control is inside of ads_calld_. If ads_calld_ is null, it simply means we don't have any ADS request before.


src/core/ext/filters/client_channel/xds/xds_client_stats.h, line 59 at r18 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Would it make sense for these accessors to return const std::string&?

Done.


src/core/lib/gprpp/string_view.h, line 50 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Ah, okay, I see what's going on. I saw the #endif // GRPC_USE_ABSL and assumed that was the end of the file, so I didn't notice the stuff underneath.

As mentioned above, I think we should expose this as StringView::compare(), to be compatible with absl. I think this means that StringViewCmp() is no longer needed; its functionality can be moved directly into StringView::compare(), and we can update the callers (looks like there are only 2 of them).

It might be a good idea to split this change into a separate PR.

Adding via #21344.

Will sync after that is merged.


test/cpp/end2end/xds_end2end_test.cc, line 391 at r11 (raw file):

Instead of hard-coding this here, maybe we should just represent the CDS responses in responses_and_delays_, just like we do with the EDS responses?

The CDS response is much simpler and fixed than EDS response. So I think it's fine to build the response here.

Or maybe we need to change the abstraction here so that this is a true fake, where we populate the server with xds resources and let it generate the responses normally from that database?

So far, the response content is quite deterministic. The only dynamic one is EDS response, which is set by each test. So I think the current test code is OK.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what's causing the test problem, sorry! I suggest adding some debug logging in the test right before the proto is sent and in the code where the proto is received to display the contents. That will at least tell you which side the problem is on.

Please let me know if you have any questions. Thanks!

Reviewed 6 of 8 files at r19, 2 of 2 files at r20.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/lb_policy/xds/xds.cc, line 103 at r20 (raw file):

  };

  const char* lrs_load_reporting_server_name() const {

This could return const Optional<std::string>&.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 130 at r20 (raw file):

  // Takes ownership of \a error.
  void SendMessageLocked(const std::string& type_url,
                         const std::string& nonce_for_unsupported_type,

This is not just for unsupported types -- it will be used for any type.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 751 at r20 (raw file):

    }
  }
  cds_version_.version_info = std::move(new_version);

Need to save the nonce as well.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 832 at r20 (raw file):

    }
  }
  eds_version_.version_info = std::move(new_version);

Need to save the nonce as well.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 917 at r20 (raw file):

    GRPC_ERROR_UNREF(parse_error);
  } else {
    // Update nonce.

Instead of saving this here, let's just pass the nonce to either AcceptCdsUpdate() or AcceptEdsUpdate(), which can save the nonce at the same time it saves the version.


test/cpp/end2end/xds_end2end_test.cc, line 391 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Instead of hard-coding this here, maybe we should just represent the CDS responses in responses_and_delays_, just like we do with the EDS responses?

The CDS response is much simpler and fixed than EDS response. So I think it's fine to build the response here.

Or maybe we need to change the abstraction here so that this is a true fake, where we populate the server with xds resources and let it generate the responses normally from that database?

So far, the response content is quite deterministic. The only dynamic one is EDS response, which is set by each test. So I think the current test code is OK.

In order to really exercize the CDS client code, we need to pass different CDS responses in different cases, so that we can verify that it handles various error cases properly.

That same thing will be true for every resource type. I think that as we add more resource types, having a fake here will make an increasing amount of sense. But if you don't have time to make this change right now, it's fine to leave this as a TODO assigned to me, as long as we have some other way to return various non-expected CDS responses.


test/cpp/end2end/xds_end2end_test.cc, line 376 at r20 (raw file):

      : enable_load_reporting_(enable_load_reporting) {}

  bool WaitForResourceType(const char* expected_type, DiscoveryRequest* request,

This seems wrong. How can the server know the type of the next request? And if we get a request of a different type, why do we just ignore it?

Why is this necessary? It seems like we're using this after the CDS request to wait for an EDS request, but there shouldn't ever be any other type of request between those two. So can't we just do a single read after sending the CDS response and assert that it's an EDS request?

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 12 of 17 files reviewed, 9 unresolved discussions (waiting on @apolcyn, @AspirinSJL, and @markdroth)


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1522 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If envoyproxy/envoy#9029 gets merged, then we will no longer need this special case.

That PR has been merged, so we can now remove this special case.

Copy link
Member Author

@AspirinSJL AspirinSJL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parsing bug is due to unexpected change to size within the for loop..

Reviewable status: 8 of 16 files reviewed, 9 unresolved discussions (waiting on @apolcyn and @markdroth)


src/core/ext/filters/client_channel/lb_policy/xds/xds.cc, line 103 at r20 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This could return const Optional<std::string>&.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 1522 at r17 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

That PR has been merged, so we can now remove this special case.

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 130 at r20 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This is not just for unsupported types -- it will be used for any type.

This is just for unsupported types because (1) the supported types have nonce stored (2) when the supported types need to send spontaneous requests due to resource name change, it's easier to choose the nonce according to type_url inside of SendMessageLocked() than adding another if-elseif-else block in OnResourceNamesChanged() before calling SendMessageLocked().


src/core/ext/filters/client_channel/xds/xds_client.cc, line 751 at r20 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Need to save the nonce as well.

Already saved outside of this method.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 832 at r20 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Need to save the nonce as well.

Already saved outside of this method.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 917 at r20 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of saving this here, let's just pass the nonce to either AcceptCdsUpdate() or AcceptEdsUpdate(), which can save the nonce at the same time it saves the version.

But the accepting conditions are different for nonce and version. We should store nonce for each response, but we should only store version when we decide to accept and ACK the response. Due to this difference, it's ok to separate the storing of them.


src/core/lib/gprpp/string_view.h, line 50 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Adding via #21344.

Will sync after that is merged.

Done.


test/cpp/end2end/xds_end2end_test.cc, line 391 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

In order to really exercize the CDS client code, we need to pass different CDS responses in different cases, so that we can verify that it handles various error cases properly.

That same thing will be true for every resource type. I think that as we add more resource types, having a fake here will make an increasing amount of sense. But if you don't have time to make this change right now, it's fine to leave this as a TODO assigned to me, as long as we have some other way to return various non-expected CDS responses.

Done. Changed to a more dynamic handler.


test/cpp/end2end/xds_end2end_test.cc, line 376 at r20 (raw file):
It's hard coding the basic workflow. I changed it to a while loop now.

Why is this necessary? It seems like we're using this after the CDS request to wait for an EDS request, but there shouldn't ever be any other type of request between those two. So can't we just do a single read after sending the CDS response and assert that it's an EDS request?

Due to ACK/NACK, there might be some extra CDS requests coming in before the EDS requests.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on fixing that bug!

I think the test still needs work. It's really not exercizing the new code very well.

Please let me know if you have any questions about any of this. Thanks!

Reviewed 3 of 6 files at r21, 3 of 5 files at r23, 4 of 4 files at r24.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/xds/xds_client.h, line 184 at r24 (raw file):

    // Buffered requests.
    std::map<std::string /*type_url*/, std::unique_ptr<BufferedRequest>>

I think this belongs in AdsCallState, not in ChannelState, because the nonce is not valid across different ADS calls.

Also, even if the ADS call fails while there is a buffered request, I think that when a new call starts, we should just send requests for all resource types for which we have watchers. We should not need to look at buffered requests to determine this -- or perhaps we should automatically populate the buffered requests when we start the new call. But either way, we should not need to look at the previously buffered requests to do this.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 130 at r20 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

This is just for unsupported types because (1) the supported types have nonce stored (2) when the supported types need to send spontaneous requests due to resource name change, it's easier to choose the nonce according to type_url inside of SendMessageLocked() than adding another if-elseif-else block in OnResourceNamesChanged() before calling SendMessageLocked().

Okay, thanks, I think I understand what's going on here now.

Can you please document the fact that this parameter will be used only when type_url is an unknown type, and that in other cases the nonce will come from the ADS call state?


src/core/ext/filters/client_channel/xds/xds_client.cc, line 428 at r24 (raw file):

  } else if (ads_calld() == nullptr) {
    // Buffer the request if the ADS call is in backoff.
    // FIXME: should we override the ACK/NACK if we have a spontaneous request

The two should be combined into one request. The spontaneous request would be to change the set of resource names, but it looks like the code currently sets the set of resource names based on which names have watchers at the moment that the request is being sent, so that's probably good enough. All we need to do here is to avoid overwriting the nonce, which we need for the ACK/NACK.

But see my comment elsewhere: the code in this function should do nothing if ads_calld() is null, because when we create a new ADS call, we should automatically send a message for all resource types for which we have watchers.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 946 at r24 (raw file):

      }
      // ACK the update.
      ads_calld->SendMessageLocked(type_url, nonce, nullptr, false);

No need to pass nonce in here, since it won't be used.


test/cpp/end2end/xds_end2end_test.cc, line 391 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Done. Changed to a more dynamic handler.

This is a little closer to what I had in mind, but it's still not really a true general-purpose fake. In order to make it truly general-purpose, I think we need to eliminate the current approach where the test passes in a vector of EDS responses and delays. Instead, the fake should just have a map containing the current resources, and it should not know anything about the future. The map will be a two-level map: the top-level map will be keyed by resource type, and the bottom-level map will be keyed by resource name. Whenever a request comes in, the fake should look in the map for the resource type and names specified in the request and return whatever matching resources it has. The fake should also have a method for the test to use to modify the data in the map; updating the data in the map should automatically cause the new information to be returned to existing clients. ACKs and NACKs should be processed and recorded, so that the test can verify that the client sent them appropriately.

As I said earlier, I realize that this would be a significant change, so I'm fine with leaving this a TODO for now. But in that case, I think we need some other approach that allows us to really test how the client behaves in the face of various unexpected results from the xds server. For example, we need a test that shows that if the server sends a response with an invalid resource, the client must NACK and not update its cache (i.e., it should continue using the previous versions of resources, even if there was a good resource in the same update as an invalid one).


test/cpp/end2end/xds_end2end_test.cc, line 376 at r20 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

It's hard coding the basic workflow. I changed it to a while loop now.

Why is this necessary? It seems like we're using this after the CDS request to wait for an EDS request, but there shouldn't ever be any other type of request between those two. So can't we just do a single read after sending the CDS response and assert that it's an EDS request?

Due to ACK/NACK, there might be some extra CDS requests coming in before the EDS requests.

Good point.


test/cpp/end2end/xds_end2end_test.cc, line 423 at r24 (raw file):

    }
    // Send response.
    for (const auto& p : responses_and_delays) {

In this loop, we are only sending responses, but never reading new requests. That means that any ACK/NACK requests that get sent will never be read, and any requests for new resources (of any type) will be ignored. This might not be causing a problem right now, but it may do so in the future, so it's yet another reason to make this a true fake.


test/cpp/end2end/xds_end2end_test.cc, line 452 at r24 (raw file):

      while (true) {
        if (!stream->Read(&request)) return;
        if (request.type_url() == kCdsTypeUrl) {

If we get an ACK or NACK for a CDS request, this will call HandleCdsRequest() again, which will cause the server to re-send the same resources that it already sent to the client. Won't this cause an infinite loop?


test/cpp/end2end/xds_end2end_test.cc, line 454 at r24 (raw file):

        if (request.type_url() == kCdsTypeUrl) {
          HandleCdsRequest(&request, stream);
        } else if (request.type_url() == kEdsTypeUrl) {

Same issue here for EDS (although, as mentioned above, we have a bigger problem for EDS, which is that we will never read any more requests once we get the first EDS request).


test/cpp/end2end/xds_end2end_test.cc, line 548 at r24 (raw file):

  bool ads_done_ = false;
  // CDS response data.
  grpc_core::CdsUpdateMap cds_update_map_;

I think we need to store the CDS responses as Cluster protos, not as CdsUpdate structs. Otherwise, there is no way for a test to explicitly construct an invalid proto and check that the client reacts appropriately.

We need to test that the client sends a NACK in cases like the following (may not be an exhaustive list):

  • the cluster type is not EDS
  • the eds_config field is not configured to use ADS
  • the LB policy is not ROUND_ROBIN
  • the LRS server field is set to a non-self ConfigSource

Copy link
Member Author

@AspirinSJL AspirinSJL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 10 of 15 files reviewed, 9 unresolved discussions (waiting on @apolcyn and @markdroth)


src/core/ext/filters/client_channel/xds/xds_client.h, line 184 at r24 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think this belongs in AdsCallState, not in ChannelState, because the nonce is not valid across different ADS calls.

Also, even if the ADS call fails while there is a buffered request, I think that when a new call starts, we should just send requests for all resource types for which we have watchers. We should not need to look at buffered requests to determine this -- or perhaps we should automatically populate the buffered requests when we start the new call. But either way, we should not need to look at the previously buffered requests to do this.

Reverted.

Yes, actually I already did that in AdsCallState's ctor. Somehow I forgot that and added the wrong change.. Sorry for the churn!


src/core/ext/filters/client_channel/xds/xds_client.cc, line 130 at r20 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Okay, thanks, I think I understand what's going on here now.

Can you please document the fact that this parameter will be used only when type_url is an unknown type, and that in other cases the nonce will come from the ADS call state?

Done.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 428 at r24 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The two should be combined into one request. The spontaneous request would be to change the set of resource names, but it looks like the code currently sets the set of resource names based on which names have watchers at the moment that the request is being sent, so that's probably good enough. All we need to do here is to avoid overwriting the nonce, which we need for the ACK/NACK.

But see my comment elsewhere: the code in this function should do nothing if ads_calld() is null, because when we create a new ADS call, we should automatically send a message for all resource types for which we have watchers.

Done.

Made error also a state for each ADS resource type so that it's not overwritten by message buffering when a spontaneous request is initiated before the NACK is sent.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 946 at r24 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need to pass nonce in here, since it won't be used.

Done.


test/cpp/end2end/xds_end2end_test.cc, line 391 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This is a little closer to what I had in mind, but it's still not really a true general-purpose fake. In order to make it truly general-purpose, I think we need to eliminate the current approach where the test passes in a vector of EDS responses and delays. Instead, the fake should just have a map containing the current resources, and it should not know anything about the future. The map will be a two-level map: the top-level map will be keyed by resource type, and the bottom-level map will be keyed by resource name. Whenever a request comes in, the fake should look in the map for the resource type and names specified in the request and return whatever matching resources it has. The fake should also have a method for the test to use to modify the data in the map; updating the data in the map should automatically cause the new information to be returned to existing clients. ACKs and NACKs should be processed and recorded, so that the test can verify that the client sent them appropriately.

As I said earlier, I realize that this would be a significant change, so I'm fine with leaving this a TODO for now. But in that case, I think we need some other approach that allows us to really test how the client behaves in the face of various unexpected results from the xds server. For example, we need a test that shows that if the server sends a response with an invalid resource, the client must NACK and not update its cache (i.e., it should continue using the previous versions of resources, even if there was a good resource in the same update as an invalid one).

This seems much more complex than I thought. Sorry that I need to assign this back to you! Below are some reasons that I think this change might be hard.

  1. I think our current tests are fragile and changing the current "response and delay" testing approach might expose some tricky issues.

  2. Putting update for different types into a single map seems hard because we need to unify the inside map type (likely to std::string -> DiscoveryResponse).

  3. Making the method to update the data trigger auto response to the client needs some new mechanism.


test/cpp/end2end/xds_end2end_test.cc, line 423 at r24 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

In this loop, we are only sending responses, but never reading new requests. That means that any ACK/NACK requests that get sent will never be read, and any requests for new resources (of any type) will be ignored. This might not be causing a problem right now, but it may do so in the future, so it's yet another reason to make this a true fake.

Currently, sending the EDS response back almost signals the completion the the server's work (to simplify the test). Added a TODO.


test/cpp/end2end/xds_end2end_test.cc, line 452 at r24 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If we get an ACK or NACK for a CDS request, this will call HandleCdsRequest() again, which will cause the server to re-send the same resources that it already sent to the client. Won't this cause an infinite loop?

In theory yes, but it's not causing an infinite loop now because all the ACK/NACK requests are sent after the EDS response is sent (because accepting CDS request will start the EDS request before CDS ACK is sent, and EDS request will result a EDS response which breaks out this while loop).

To work around the potential infinite request issue, I added a bool to only send the update once.


test/cpp/end2end/xds_end2end_test.cc, line 454 at r24 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same issue here for EDS (although, as mentioned above, we have a bigger problem for EDS, which is that we will never read any more requests once we get the first EDS request).

I agree there are some problems here, especially the testability of ACK/NACK. But it's intended to keep the changes less invasive so that I can submit this PR more easily.


test/cpp/end2end/xds_end2end_test.cc, line 548 at r24 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think we need to store the CDS responses as Cluster protos, not as CdsUpdate structs. Otherwise, there is no way for a test to explicitly construct an invalid proto and check that the client reacts appropriately.

We need to test that the client sends a NACK in cases like the following (may not be an exhaustive list):

  • the cluster type is not EDS
  • the eds_config field is not configured to use ADS
  • the LB policy is not ROUND_ROBIN
  • the LRS server field is set to a non-self ConfigSource

Done.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! I'm not thrilled with the test structure, but we can improve that separately. Otherwise, all remaining comments are minor.

Thanks again for finishing this up! Please let me know if you have any questions.

Reviewed 1 of 5 files at r25, 4 of 4 files at r26.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/xds/xds_client.cc, line 448 at r26 (raw file):

  // If the ADS call is in backoff state, we don't need to do anything now
  // because when the call is restarted it will resend all necessary requests.
  if (ads_calld() == nullptr) return;

This is the same expression as in the if statement on line 441 above, and this expression will never be true, because the body of that earlier if statement resets ads_calld_.

I think this should just go back to being else { ads_calld()->SendMessageLocked(...); }.


test/cpp/end2end/xds_end2end_test.cc, line 391 at r11 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

This seems much more complex than I thought. Sorry that I need to assign this back to you! Below are some reasons that I think this change might be hard.

  1. I think our current tests are fragile and changing the current "response and delay" testing approach might expose some tricky issues.

  2. Putting update for different types into a single map seems hard because we need to unify the inside map type (likely to std::string -> DiscoveryResponse).

  3. Making the method to update the data trigger auto response to the client needs some new mechanism.

Okay. This isn't really ideal, but I understand the need to get this PR in, so we'll work on improving the tests separately. CC @muxi as an FYI.

For future reference, here are my responses to your individual points:

  1. I'm not too worried about this, because it should be possible for the proposed fake semantics to generate exactly the same set of results as the current mechanism. In principle, this change should not affect the behavior of any existing tests.

  2. We could just use google::protobuf::Any as the inner map value type.

  3. This one is the hard one. I think it will probably require changing the tests to use an async client API instead of a sync client API, because I think there are cases where the client call can't complete until a new EDS response is received.


test/cpp/end2end/xds_end2end_test.cc, line 414 at r26 (raw file):

      cds_response_state_ = SENT;
    } else if (cds_response_state_ == SENT &&
               !request->response_nonce().empty()) {

How about making this an assertion, so that the test fails if the client sends the wrong thing in the ACK?

Copy link
Member Author

@AspirinSJL AspirinSJL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @apolcyn and @markdroth)


src/core/ext/filters/client_channel/xds/xds_client.cc, line 448 at r26 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This is the same expression as in the if statement on line 441 above, and this expression will never be true, because the body of that earlier if statement resets ads_calld_.

I think this should just go back to being else { ads_calld()->SendMessageLocked(...); }.

Sorry about the confusing naming, but line 441 and the block only ensures we will have a RetryableCall<AdsCallState>. If that retriable call is pre-existing and in backoff state, ads_call() (the inner AdsCallState) will be null, so we need to return here instead of dereferencing it.


test/cpp/end2end/xds_end2end_test.cc, line 391 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Okay. This isn't really ideal, but I understand the need to get this PR in, so we'll work on improving the tests separately. CC @muxi as an FYI.

For future reference, here are my responses to your individual points:

  1. I'm not too worried about this, because it should be possible for the proposed fake semantics to generate exactly the same set of results as the current mechanism. In principle, this change should not affect the behavior of any existing tests.

  2. We could just use google::protobuf::Any as the inner map value type.

  3. This one is the hard one. I think it will probably require changing the tests to use an async client API instead of a sync client API, because I think there are cases where the client call can't complete until a new EDS response is received.

Ack.


test/cpp/end2end/xds_end2end_test.cc, line 414 at r26 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

How about making this an assertion, so that the test fails if the client sends the wrong thing in the ACK?

Done. Actually this uncovers a bug.

Copy link
Member Author

@AspirinSJL AspirinSJL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a ubsan failure caused by misalignment, which is strange because there is no cast. I'm trying to move the maps to the beginning of the class as Nico suggested.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @apolcyn and @markdroth)

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick search on the ubsan error message, and I found the following libstdc++ bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66017. @nicolasnoble, do you know if we're using a version of libstdc++ affected by that bug?

Reviewed 3 of 3 files at r27.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/xds/xds_client.h, line 215 at r27 (raw file):

      endpoint_map_;

  static const grpc_arg_pointer_vtable kXdsClientVtable;

This should be defined ahead of instance data members.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 448 at r26 (raw file):

Previously, AspirinSJL (Juanli Shen) wrote…

Sorry about the confusing naming, but line 441 and the block only ensures we will have a RetryableCall<AdsCallState>. If that retriable call is pre-existing and in backoff state, ads_call() (the inner AdsCallState) will be null, so we need to return here instead of dereferencing it.

Ah, good point. Yeah, the naming is confusing here. But we can clean that up later.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the ubsan suppression is working, we should be able to merge this. But there are two small nits that need to be addressed first.

Thanks for finishing this up!

Reviewed 8 of 8 files at r28.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/xds/xds_client.h, line 215 at r27 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should be defined ahead of instance data members.

This still needs to be fixed.


third_party/abseil-cpp, line 1 at r28 (raw file):

Subproject commit 0514227d2547793b23e209809276375e41c76617

Looks like you accidentally changed the submodule commit.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r29.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @apolcyn and @AspirinSJL)


src/core/ext/filters/client_channel/xds/xds_client.h, line 215 at r27 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This still needs to be fixed.

I pushed a change to your branch to fix this.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work on this, Juanli!

@markdroth
Copy link
Member

Known issues: #20444 #20928 #20468 #20609

@markdroth markdroth merged commit 35ca266 into grpc:master Jan 16, 2020
@muxi
Copy link
Member

muxi commented Jan 16, 2020

🎉🎉

@lock lock bot locked as resolved and limited conversation to collaborators Apr 18, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
release notes: no Indicates if PR should not be in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants