
Associate a mutex with ALTS TSI handshaker objects #20596

Merged
merged 1 commit on Nov 5, 2019

Conversation

apolcyn
Contributor

@apolcyn apolcyn commented Oct 14, 2019

This PR adds a few new ALTS tests that use the pre-existing fake ALTS handshake server to exercise more concurrent ALTS handshake successes and failures and to try to catch races, etc.

Before the ALTS handshake code changes in this PR, the new tests expose a data race and a potential lock cycle when run under TSAN. The potential lock cycle is described in the description of #20572, and an example data race TSAN error that the test would show is in https://paste.googleplex.com/5526592207978496.

To address the race, this PR refactors the ALTS handshaker to associate a mutex with each alts_tsi_handshaker object. The alts_tsi_handshaker object acquires this mutex at (a minimal sketch follows the list):

  1. TSI handshaker API entry points
  2. ALTS RPC callbacks
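
The sketch below is a self-contained illustration of that locking discipline, using std::mutex as a stand-in for the C-core mutex type; the type and method names are hypothetical and only mirror the roles described above, not the actual gRPC code.

#include <mutex>

// Hypothetical stand-in for alts_tsi_handshaker; illustrates the locking
// discipline only.
struct AltsTsiHandshakerSketch {
  std::mutex mu;
  bool shutdown = false;

  // 1. TSI handshaker API entry point: take the lock before touching state.
  bool Next() {
    std::lock_guard<std::mutex> lock(mu);
    if (shutdown) return false;
    // ... start or continue the handshake RPC while holding the lock ...
    return true;
  }

  // 2. ALTS RPC callback: re-acquires the same lock, so it cannot race with
  // Next() or Shutdown().
  void OnHandshakeResponse(bool ok) {
    std::lock_guard<std::mutex> lock(mu);
    if (shutdown || !ok) return;
    // ... handle the handshaker service response ...
  }

  void Shutdown() {
    std::lock_guard<std::mutex> lock(mu);
    shutdown = true;
  }
};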

Due to the refactoring, this PR needed to make some changes to the alts_tsi_handshaker_test.cc and alts_handshaker_client_test.cc files to adjust to the new internal APIs. One thing to note is that this gets rid of the schedule_request_grpc_call_failure_test test - I changed the code to assert that call_start_batch is successful (I believe an error in call_start_batch could only be due to an internal logic error), so that failure is no longer unit-testable.



@apolcyn apolcyn assigned apolcyn and unassigned apolcyn Oct 14, 2019
@apolcyn apolcyn added the lang/core, release notes: yes, and release notes: no labels and removed the release notes: yes label Oct 14, 2019
@apolcyn apolcyn force-pushed the fix_alts_thread_safety_issues branch from 003a96e to 5de8487 Compare October 15, 2019 01:04
@apolcyn apolcyn marked this pull request as ready for review October 15, 2019 07:52
@apolcyn
Contributor Author

apolcyn commented Oct 17, 2019

A heads up that there's a low rate of timeout flakiness in the new test right now, and it may need some tweaking of parameters.

@apolcyn
Contributor Author

apolcyn commented Oct 22, 2019

A heads up that there's a low rate of timeout flakiness in the new test right now, and it may need some tweaking of parameters.

The flake rate has dropped to the point where I see only about 1 timeout flake in 2000 runs when running the alts_concurrent_connectivity_test under dbg. However, this 1/2000 flake rate appears to be fixed by #20687.

I'm still trying to do some cleanup for the changes to alts_handshaker_client_test and alts_tsi_handshaker_test, but otherwise I think this can be reviewed.

@apolcyn
Contributor Author

apolcyn commented Oct 23, 2019

The flake rate has dropped to the point where I see only about 1 timeout flake in 2000 runs when running the alts_concurrent_connectivity_test under dbg. However, this 1/2000 flake rate appears to be fixed by #20687.

For the record, this turned out not to be true: #20687 still had this low (~1/2000) flakiness. I'm only seeing these flakes in dbg, though (for which tests don't adjust test deadlines), and when these flakes occur I'm not able to see any deadlocked threads or other kinds of stuckness. Also, increasing the test deadlines seems to get rid of these timeouts. So I've reduced the number of inner loops in the test_handshake_fails_fast_when_peer_endpoint_closes_connection_after_accepting test from 5 to 2, which gets rid of this flakiness. The test is still busy enough to repro the original data races, though.

This is still ready for review.

@apolcyn
Contributor Author

apolcyn commented Oct 24, 2019

Unassigned @yihuazhang, who is out of the office.

@apolcyn
Contributor Author

apolcyn commented Oct 29, 2019

To clarify a couple of points brought up offline, the changes in this PR mainly try to do three things:

  1. Fix existing/potential data races
  2. Fix the potential lock cycle that TSAN complains about, which is described in more detail in #20572 (Add grpc_insecure_channel_create_internal, for channels created from within core). The approach taken is to schedule channel creation on the ExecCtx rather than calling it inline, as suggested in #20572.
  3. Avoid other sources of deadlock, also mainly by avoiding inline calls into C-core surface APIs from the ALTS handshaker code and instead deferring calls into e.g. grpc_channel_destroy and grpc_call_unref to the ExecCtx, so they run at the bottom of the call stack. This is intended to avoid situations where we e.g. 1) currently hold lock X, 2) flush an inner ExecCtx, and in so doing run a callback that attempts to reacquire X. (A sketch of this deferral pattern follows the list.)
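
The sketch: GRPC_CLOSURE_CREATE/GRPC_CLOSURE_SCHED and grpc_channel_destroy are real C-core APIs of this era (they also appear later in this review); the wrapper function and its name are hypothetical, and the usual iomgr headers are assumed.

// Runs from the ExecCtx, at the bottom of the call stack.
static void channel_destroy_cb(void* arg, grpc_error* /*error*/) {
  grpc_channel_destroy(static_cast<grpc_channel*>(arg));
}

// Hypothetical helper: instead of calling grpc_channel_destroy() inline
// (which would flush an inner ExecCtx while we may still hold a lock),
// schedule it so it runs after the current frame has released its locks.
static void defer_channel_destroy(grpc_channel* channel) {
  GRPC_CLOSURE_SCHED(
      GRPC_CLOSURE_CREATE(channel_destroy_cb, channel,
                          grpc_schedule_on_exec_ctx),
      GRPC_ERROR_NONE);
}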

@apolcyn apolcyn force-pushed the fix_alts_thread_safety_issues branch from 0921db6 to 1401f8e Compare October 29, 2019 06:31
Member

@markdroth markdroth left a comment


My main concern here is about the abstraction violation between the handshaker client and the handshaker itself. I've suggested what I think will be a much cleaner and simpler way to do this.

Please let me know if you have any questions.

Reviewed 7 of 15 files at r1, 13 of 13 files at r2.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @apolcyn and @yihuazhang)


src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 156 at r2 (raw file):

 * and its owning alts_tsi_handshaker object is currently necessary in order
 * to drop the alts_tsi_handshaker's lock and then grab it again. */
void alts_handshaker_client_continue_make_grpc_call_locked(

See my comments elsewhere about finding a cleaner way to do this that does not require breaking the abstractions this way.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 136 at r2 (raw file):

  t->bytes_to_send_size = bytes_to_send_size;
  t->result = result;
  GRPC_CLOSURE_SCHED(

If we handled the channel creation inside of the handshaker code as I suggest elsewhere, then I think this level of indirection would not be necessary.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.h, line 100 at r2 (raw file):

 * - is_ok: a boolean value indicating if the handshaker response is ok to read.
 */
void alts_tsi_handshaker_handle_response_dedicated(

These APIs seem to really break the abstractions between this and the handshaker client code. I'd really prefer to find a cleaner way to do this.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 303 at r2 (raw file):

      reinterpret_cast<alts_tsi_handshaker*>(self);
  grpc_core::MutexLock lock(&handshaker->mu);
  if (self->handshake_shutdown) {

Shouldn't this be checking handshaker->shutdown instead?

Might also want to double-check if there are other places where we're checking the wrong variable, too.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 395 at r2 (raw file):

  alts_tsi_handshaker* handshaker =
      reinterpret_cast<alts_tsi_handshaker*>(self);
  gpr_mu_lock(&handshaker->mu);

It should not be necessary to acquire the lock here. If we're calling destroy while some other method is still trying to use the handshaker, then there's a bug elsewhere, and acquiring the lock here is not going to help, because the other code could still try to acquire the lock after we release it.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 433 at r2 (raw file):

      use_dedicated_cq
          ? nullptr
          : grpc_insecure_channel_create(handshaker->handshaker_service_url,

Instead of moving this into alts_tsi_handshaker_re_enter_lock_then_continue_make_grpc_call() and requiring that to intercept all interactions between this code and the handshaker client, how about just immediately scheduling a closure on the ExecCtx to create the channel? If the handshaker's next() method is called while that closure is still pending, the handshaker can save the data and then invoke it when the channel creation callback finishes.

I think this approach would make this PR much less invasive and would result in much less abstraction breakage between the various layers here (i.e., much less shared state and API surface between the ALTS client and this code).


test/core/tsi/alts/handshaker/alts_handshaker_client_test.cc, line 305 at r2 (raw file):

  // Create "handshaker client" objects
  {
    grpc_core::MutexLock lock(alts_tsi_handshaker_get_lock_for_testing(

It should not be necessary to grab the lock to create the handshaker client. I think this is another symptom of the abstraction breakage happening here.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 85 at r2 (raw file):

  GPR_ASSERT(gpr_asprintf(&log_trace_id, "%p", debug_id));
  grpc_arg extra_arg;
  extra_arg.type = GRPC_ARG_STRING;

Please use grpc_channel_arg_string_create():

grpc_arg grpc_channel_arg_string_create(char* name, char* value);


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 86 at r2 (raw file):

  grpc_arg extra_arg;
  extra_arg.type = GRPC_ARG_STRING;
  extra_arg.key = const_cast<char*>("ensure_subchannel_sharing_disabled");

Please use GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL for this purpose.

#define GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL "grpc.use_local_subchannel_pool"


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 89 at r2 (raw file):

  extra_arg.value.string = log_trace_id;
  grpc_channel_args* channel_args =
      grpc_channel_args_copy_and_add(nullptr, &extra_arg, 1);

Can just say {1, &extra_arg} here, so there's no need to call grpc_channel_args_destroy() below.
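
Putting the three suggestions above together, a hypothetical sketch of what the resulting test code might look like, assuming the local-subchannel-pool arg is integer-valued; the trace-id arg key is an invented placeholder and whether the test keeps that arg at all is up to the author:

grpc_arg args[] = {
    // Hypothetical key for the per-channel trace id carried in log_trace_id.
    grpc_channel_arg_string_create(const_cast<char*>("grpc.test.trace_id"),
                                   log_trace_id),
    grpc_channel_arg_integer_create(
        const_cast<char*>(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL), 1),
};
grpc_channel_args channel_args = {2, args};
// Stack-allocated, so no grpc_channel_args_destroy() call is needed.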


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 115 at r2 (raw file):

  }

  grpc_core::UniquePtr<char> Address() {

This should probably just return a char*. If the caller wants to create their own copy, they can do so.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 127 at r2 (raw file):

class TestServer {
 public:
  TestServer(grpc_core::UniquePtr<char> fake_handshake_server_address) {

explicit


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 161 at r2 (raw file):

  }

  grpc_core::UniquePtr<char> Address() {

This should probably just return a char*. If the caller wants to create their own copy, they can do so.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 181 at r2 (raw file):

};

struct connect_args {

How about making this a class with connect_loop() as a method?
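
A rough sketch of what that refactor might look like (the name ConnectLoopRunner matches what appears later in this review; the body is illustrative only):

class ConnectLoopRunner {
 public:
  explicit ConnectLoopRunner(const char* fake_handshake_server_address)
      : fake_handshake_server_address_(fake_handshake_server_address) {}

  // Repeatedly create channels against the fake handshake server and verify
  // that they reach the expected connectivity state.
  void ConnectLoop() { /* ... */ }

 private:
  const char* fake_handshake_server_address_;
};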


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 527 at r2 (raw file):

int main(int argc, char** argv) {
  grpc_init();
  {

These braces aren't necessary.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 528 at r2 (raw file):

  grpc_init();
  {
    test_basic_client_server_handshake();

Since this is a new test, can we use gtest instead of calling each test manually?

Contributor Author

@apolcyn apolcyn left a comment


Reviewable status: 7 of 15 files reviewed, 16 unresolved discussions (waiting on @markdroth and @yihuazhang)


src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 156 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

See my comments elsewhere about finding a cleaner way to do this that does not require breaking the abstractions this way.

Thanks for the suggestions! I've removed this and agree this is cleaner. Related, I was able to revert a number of changes to the alts_handshaker_client_test from this PR.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 136 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If we handled the channel creation inside of the handshaker code as I suggest elsewhere, then I think this level of indirection would not be necessary.

Unfortunately I don't see how this can be avoided. My motivation for deferring invoke_tsi_next_cb to the ExecCtx is to avoid calling back into the driver of the TSI handshaker while still holding the alts_tsi_handshaker mutex, since the driver could then call back into the alts_tsi_handshaker while we still hold it. Running the cb at the bottom of the ExecCtx seemed safest to me; please let me know if you have any thoughts on this, though.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.h, line 100 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

These APIs seem to really break the abstractions between this and the handshaker client code. I'd really prefer to find a cleaner way to do this.

I was able to remove the re_enter_lock... APIs.

Unfortunately, I don't see a way to avoid alts_tsi_handshaker_handle_response_dedicated though - because the caller of this is polling a cq on its own thread and needs to be able to jump back into the alts_tsi_handshaker mutex. So long as alts_tsi_handshaker and alts_handshaker_grpc_client share that same mutex (which I feel is less error prone), I can't see a good way around it.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 303 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Shouldn't this be checking handshaker->shutdown instead?

Might also want to double-check if there are other places where we're checking the wrong variable, too.

Good catch! Checked that we weren't doing this elsewhere


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 395 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It should not be necessary to acquire the lock here. If we're calling destroy while some other method is still trying to use the handshaker, then there's a bug elsewhere, and acquiring the lock here is not going to help, because the other code could still try to acquire the lock after we release it.

Good catch! removed locking from the dtors


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 433 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of moving this into alts_tsi_handshaker_re_enter_lock_then_continue_make_grpc_call() and requiring that to intercept all interactions between this code and the handshaker client, how about just immediately scheduling a closure on the ExecCtx to create the channel? If the handshaker's next() method is called while that closure is still pending, the handshaker can save the data and then invoke it when the channel creation callback finishes.

I think this approach would make this PR much less invasive and would result in much less abstraction breakage between the various layers here (i.e., much less shared state and API surface between the ALTS client and this code).

Thanks for the suggestion.

I've changed the code to do what I believe is roughly this suggestion: the code in the most recent update always schedules creation of a channel at the top of the alts_tsi_handshaker_next, and then basically continues the rest of alts_tsi_handshaker_next after the channel is created - but the channel creation is moved fully to within the TSI handshaker object. Leveraging the "next" call seemed to make lifetime issues simpler.


test/core/tsi/alts/handshaker/alts_handshaker_client_test.cc, line 305 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It should not be necessary to grab the lock to create the handshaker client. I think this is another symptom of the abstraction breakage happening here.

I agree creation doesn't need the lock, removed


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 85 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use grpc_channel_arg_string_create():

grpc_arg grpc_channel_arg_string_create(char* name, char* value);

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 86 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL for this purpose.

#define GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL "grpc.use_local_subchannel_pool"

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 89 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Can just say {1, &extra_arg} here, so there's no need to call grpc_channel_args_destroy() below.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 115 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should probably just return a char*. If the caller wants to create their own copy, they can do so.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 127 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

explicit

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 161 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should probably just return a char*. If the caller wants to create their own copy, they can do so.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 181 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

How about making this a class with connect_loop() as a method?

Done, good point


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 527 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

These braces aren't necessary.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 528 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Since this is a new test, can we use gtest instead of calling each test manually?

Good point, done

Member

@markdroth markdroth left a comment


This is looking much better! I'd like to see a bit more structural cleanup, but it's definitely moving in the right direction.

Reviewed 8 of 8 files at r3.
Reviewable status: all files reviewed, 25 unresolved discussions (waiting on @apolcyn and @yihuazhang)


src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 148 at r3 (raw file):

/**
 * This method handles handshaker response returned from ALTS handshaker
 * service. Note that the only reason the API is exposed is that it is used in

Maybe restore this comment now?


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 136 at r2 (raw file):

Previously, apolcyn wrote…

Unfortunately I don't see how this can be avoided. My motivation for deferring invoke_tsi_next_cb to the ExecCtx is to avoid calling back into the driver of the TSI handshaker while still holding the alts_tsi_handshaker mutex, so that we don't call back into the alts_tsi_handshaker while still holding it. Running cb at the bottom of the ExecCtx seemed safest to me, please let me know if you have any thoughts on this though.

How about just having the callback into the alts_tsi_handshaker assume that the lock is already held instead of trying to reacquire it? It looks like it will always be invoked from a point where the lock is held.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):

  grpc_iomgr_cb_func grpc_cb;
  /* Argument to pass to grpc_cb when invoked. */
  void* grpc_cb_arg;

Doesn't look like this is needed anymore.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 139 at r3 (raw file):

  t->result = result;
  GRPC_CLOSURE_SCHED(
      GRPC_CLOSURE_CREATE(invoke_tsi_next_cb, t, grpc_schedule_on_exec_ctx),

Suggest adding a grpc_closure member to invoke_tsi_next_cb_args, so that we don't need to do an additional allocation here.
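
A hedged sketch of that change; the field layout is illustrative, and t is the heap-allocated invoke_tsi_next_cb_args instance from the excerpt above:

struct invoke_tsi_next_cb_args {
  grpc_closure closure;  // embedded, so no separate closure allocation
  // ... existing fields (cb, user_data, result, bytes_to_send, ...) ...
};

// At the scheduling site:
GRPC_CLOSURE_INIT(&t->closure, invoke_tsi_next_cb, t,
                  grpc_schedule_on_exec_ctx);
GRPC_CLOSURE_SCHED(&t->closure, GRPC_ERROR_NONE);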


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 283 at r3 (raw file):

      client->grpc_caller(client->call, ops, static_cast<size_t>(op - ops),
                          &client->on_handshaker_service_resp_recv);
  GPR_ASSERT(call_error == GRPC_CALL_OK);

Why change this from returning an error to be an assertion? Handling the error without crashing seems preferable.

(Note that if you don't do this, then this function might as well be changed to return void, because it will never return any error.)


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 357 at r3 (raw file):

  handshaker_client_send_buffer_destroy_locked(client);
  client->send_buffer = buffer;
  make_grpc_call_locked(client, true /* is_start */);

Unless this function is changed to return void, we should continue to check its result here.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 415 at r3 (raw file):

  handshaker_client_send_buffer_destroy_locked(client);
  client->send_buffer = buffer;
  make_grpc_call_locked(client, true /* is_start */);

Same here.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 450 at r3 (raw file):

  handshaker_client_send_buffer_destroy_locked(client);
  client->send_buffer = buffer;
  make_grpc_call_locked(client, false /* is_start */);

Same here.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):

      reinterpret_cast<alts_grpc_handshaker_client*>(c);
  if (client->call != nullptr) {
    // do this at the bottom of the callstack to avoid lock recursion

What lock recursion are we avoiding here? The use-case is not clear to me.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.h, line 100 at r2 (raw file):

Previously, apolcyn wrote…

I was able to remove the re_enter_lock... APIs.

Unfortunately, I don't see a way to avoid alts_tsi_handshaker_handle_response_dedicated though - because the caller of this is polling a cq on its own thread and needs to be able to jump back into the alts_tsi_handshaker mutex. So long as alts_tsi_handshaker and alts_handshaker_grpc_client share that same mutex (which I feel is less error prone), I can't see a good way around it.

I understand the need to reacquire the lock in that case, but is there a reasonable way to avoid exposing this publicly?

As I mentioned above, ideally, we should re-acquire the lock only in the case where we're using a CQ in its own thread; in the case where we're directly handling our own polling, we should not need to do so. So maybe the right thing here is to have alts_tsi_handshaker pass a callback to the client to use when handling a response. In the case where we're using a CQ, we can use a callback that acquires the mutex; in the case where we're handling our own polling, we can use a callback that just directly calls alts_handshaker_client_handle_response_locked().


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 433 at r2 (raw file):

Previously, apolcyn wrote…

Thanks for the suggestion.

I've changed the code to do what I believe is roughly this suggestion: the code in the most recent update always schedules creation of a channel at the top of the alts_tsi_handshaker_next, and then basically continues the rest of alts_tsi_handshaker_next after the channel is created - but the channel creation is moved fully to within the TSI handshaker object. Leveraging the "next" call seemed to make lifetime issues simpler.

Good idea! That does make the state machine easier to understand.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 288 at r3 (raw file):

}

static void alts_tsi_handshaker_maybe_create_channel(

I think we should structure things such that there's a lot less acquiring and releasing of the same mutex in the same thread. Specifically, I suggest the following:

  • In handshaker_next(), acquire the lock, then check whether we need to create a channel. If we do, schedule a closure on the ExecCtx for alts_tsi_handshaker_maybe_create_channel() and return. Otherwise, immediately call alts_tsi_handshaker_continue_handshaker_next_locked().

  • alts_tsi_handshaker_maybe_create_channel() should be a closure function that takes the alts_tsi_handshaker_continue_handshaker_next_args instance as its arg. It should acquire the lock after creating the channel and then call alts_tsi_handshaker_continue_handshaker_next_locked().

  • alts_tsi_handshaker_continue_handshaker_next() should be renamed to alts_tsi_handshaker_continue_handshaker_next_locked(). It no longer needs to be a closure callback; instead of taking the alts_tsi_handshaker_continue_handshaker_next_args instance as its argument, it can directly take arguments similar to those of handshaker_next(). It should assume that the lock is already acquired by the caller instead of trying to acquire it itself. It can return some state indicating whether the callback should be invoked and with what parameters, which the caller can do after releasing the lock.
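
A hedged sketch of the control flow described in the three bullets above; all argument lists are elided, and create_continue_next_args is a hypothetical stand-in for allocating and filling the args struct:

static tsi_result handshaker_next(tsi_handshaker* self /* , ... */) {
  alts_tsi_handshaker* handshaker =
      reinterpret_cast<alts_tsi_handshaker*>(self);
  grpc_core::MutexLock lock(&handshaker->mu);
  if (handshaker->channel == nullptr /* and a channel is needed */) {
    // Save the next() arguments and bounce into the ExecCtx; the closure
    // creates the channel, re-acquires the lock, and then calls the
    // _locked continuation below.
    auto* args = create_continue_next_args(handshaker /* , ... */);
    GRPC_CLOSURE_SCHED(&args->closure, GRPC_ERROR_NONE);
    return TSI_ASYNC;
  }
  // Channel already exists (or is not needed): continue under the lock.
  return alts_tsi_handshaker_continue_handshaker_next_locked(
      handshaker /* , ... */);
}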


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 299 at r3 (raw file):

      create_channel = true;
    }
    handshaker_service_url = grpc_core::UniquePtr<char>(

Is it the case that the handshaker service URL is set when the alts_tsi_handshaker is created and is not changed for the lifetime of the object? If so, we can probably just use it directly without acquiring the lock.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 423 at r3 (raw file):

  args->user_data = user_data;
  GRPC_CLOSURE_SCHED(
      GRPC_CLOSURE_CREATE(alts_tsi_handshaker_continue_handshaker_next, args,

Please add a grpc_closure member to alts_tsi_handshaker_continue_handshaker_next_args, so that we can avoid a second allocation here.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 470 at r3 (raw file):

  if (handshaker->client != nullptr) {
    // this is defensive in order to avoid leaving a stray/unpolled call
    alts_handshaker_client_cancel_call_locked(handshaker->client);

Why is this function necessary? Immediately after this, we are calling alts_handshaker_client_destroy(). Can't that method cancel the call if necessary?


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 108 at r3 (raw file):

  }

  const char* Address() { return address_.get(); }

This should be lower-case, since it's a simple accessor.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 152 at r3 (raw file):

  }

  const char* Address() { return server_addr_.get(); }

This should be lower-case as well.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 202 at r3 (raw file):

      grpc_connectivity_state state =
          grpc_channel_check_connectivity_state(channel, 1);
      GPR_ASSERT(state == GRPC_CHANNEL_IDLE);

All of the GPR_ASSERT() calls in this function can be changed to use the gtest EXPECT or ASSERT macros now.
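
For example, the excerpt above could become:

grpc_connectivity_state state =
    grpc_channel_check_connectivity_state(channel, 1);
EXPECT_EQ(state, GRPC_CHANNEL_IDLE);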


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 242 at r3 (raw file):

// handshake server).
TEST(AltsConcurrentConnectivityTest, TestBasicClientServerHandshakes) {
  gpr_log(GPR_DEBUG, "Running test: test_basic_client_server_handshake");

Not needed anymore; gtest will log the test start for you.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 256 at r3 (raw file):

 * (using the fake, in-process handshake server). */
TEST(AltsConcurrentConnectivityTest, TestConcurrentClientServerHandshakes) {
  gpr_log(GPR_DEBUG, "Running test: test_concurrent_client_server_handshakes");

Not needed anymore.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 263 at r3 (raw file):

    gpr_timespec test_deadline = grpc_timeout_seconds_to_deadline(20);
    int num_concurrent_connects = 50;
    std::vector<grpc_core::UniquePtr<ConnectLoopRunner>> connect_loop_runners;

Can just use std::unique_ptr<> here.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 272 at r3 (raw file):

          GRPC_CHANNEL_READY /* expected connectivity states */));
    }
    for (size_t i = 0; i < num_concurrent_connects; i++) {

This can just be connect_loop_runners.clear();.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 349 at r3 (raw file):

  }

  const char* Address() { return address_.get(); }

This can be lower-case.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 452 at r3 (raw file):

TEST(AltsConcurrentConnectivityTest,
     TestHandshakeFailsFastWhenPeerEndpointClosesConnectionAfterAccepting) {
  gpr_log(GPR_DEBUG,

Not needed.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 461 at r3 (raw file):

  {
    gpr_timespec test_deadline = grpc_timeout_seconds_to_deadline(20);
    std::vector<grpc_core::UniquePtr<ConnectLoopRunner>> connect_loop_runners;

Can just use std::unique_ptr<> here.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 470 at r3 (raw file):

          GRPC_CHANNEL_TRANSIENT_FAILURE /* expected connectivity states */));
    }
    for (size_t i = 0; i < num_concurrent_connects; i++) {

connect_loop_runners.clear();

Contributor Author

@apolcyn apolcyn left a comment


Reviewable status: all files reviewed, 25 unresolved discussions (waiting on @markdroth and @yihuazhang)


src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 148 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Maybe restore this comment now?

Restored parameter comments.

I think the comment about the reason for the API being exposed was previously accurate and only becomes inaccurate with this PR: before this PR, alts_handshaker_client_handle_response is used by both the dedicated-CQ-thread ALTS driver and the grpc-polling-framework-based ALTS driver.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 136 at r2 (raw file):

How about just having the callback into the alts_tsi_handshaker assume that the lock is already held instead of trying to reacquire it?

Sorry I think we might be talking about different things here.... note that invoke_tsi_next_cb_args->cb field is the TSI handshaker API callback that was passed into tsi_handshaker_next by security_handshaker.cc.

I wanted to rule out the possibility of calling back into security_handshaker.cc while holding the alts_tsi_handshaker lock, and then from there potentially calling back into alts_tsi_handshaker inline, and deadlocking.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Doesn't look like this is needed anymore.

I think this still has some usefulness in that the alts_handshaker_client_test can pass a dummy object to it, which is used in the grpc_cb callbacks provided by the test.

Without that, the alts_handshaker_client_test needs to go back to creating a dummy alts_tsi_handshaker object to pass to the alts_handshaker_client objects it creates, just to have a valid argument in its ALTS RPC callbacks. I'm not a huge fan of this, but it seems to complement the injection of the callbacks from the tests.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 139 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest adding a grpc_closure member to invoke_tsi_next_cb_args, so that we don't need to do an additional allocation here.

Done.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 283 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why change this from returning an error to be an assertion? Handling the error without crashing seems preferable.

(Note that if you don't do this, then this function might as well be changed to return void, because it will never return any error.)

Changed to return void.

Outside of some unit tests that supply mocked versions of grpc_caller, grpc_caller is always call_start_batch_and_execute.

As far as I understand, call_start_batch_and_execute will always return OK unless we are misusing the API due to an internal programming error. So I figure it's better to crash than to continue with potential memory corruption.

I checked elsewhere and it looks like all other call_start_batch_and_execute call sites are asserting call_error == OK, e.g. https://github.com/grpc/grpc/blob/master/src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc#L857


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 357 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Unless this function is changed to return void, we should continue to check its result here.

Changed to void


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 415 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same here.

changed return val to void


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 450 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same here.

changed return val to void


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

What lock recursion are we avoiding here? The use-case is not clear to me.

Sorry, I really just meant general risks of ExecCtx stacking (since grpc_call_unref will create an inner ExecCtx).

In short, I think that if ExecCtx stacking behavior was changed to schedule at the "root", e.g. if your comment in #20572 (comment) was implemented, then it would be safe to call this inline. Otherwise, the ExecCtx stacking risks re-entrant locking.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.h, line 100 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I understand the need to reacquire the lock in that case, but is there a reasonable way to avoid exposing this publicly?

As I mentioned above, ideally, we should re-acquire the lock only in the case where we're using a CQ in its own thread; in the case where we're directly handling our own polling, we should not need to do so. So maybe the right thing here is to have alts_tsi_handshaker pass a callback to the client to use when handling a response. In the case where we're using a CQ, we can use a callback that acquires the mutex; in the case where we're handling our own polling, we can use a callback that just directly calls alts_handshaker_client_handle_response_locked().

Changed to do this; however, note that we do need to re-acquire the lock in both the dedicated-CQ and non-dedicated handshake cases. In the dedicated-CQ case, we're jumping back into the handshake client from the dedicated thread. In the internal-polling-framework case, we are running from a call_start_batch_and_execute callback scheduled on the ExecCtx.

I've restructured the code so that the alts_handshake_client takes a callback that is meant to do whatever initial setup is needed in order to safely run handle_response_locked.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 288 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think we should structure things such that there's a lot less acquiring and releasing of the same mutex in the same thread. Specifically, I suggest the following:

  • In handshaker_next(), acquire the lock, then check whether we need to create a channel. If we do, schedule a closure on the ExecCtx for alts_tsi_handshaker_maybe_create_channel() and return. Otherwise, immediately call alts_tsi_handshaker_continue_handshaker_next_locked().

  • alts_tsi_handshaker_maybe_create_channel() should be a closure function that takes the alts_tsi_handshaker_continue_handshaker_next_args instance as its arg. It should acquire the lock after creating the channel and then call alts_tsi_handshaker_continue_handshaker_next_locked().

  • alts_tsi_handshaker_continue_handshaker_next() should be renamed to alts_tsi_handshaker_continue_handshaker_next_locked(). It no longer needs to be a closure callback; instead of taking the alts_tsi_handshaker_continue_handshaker_next_args instance as its argument, it can directly take arguments similar to those of handshaker_next(). It should assume that the lock is already acquired by the caller instead of trying to acquire it itself. It can return some state indicating whether the callback should be invoked and with what parameters, which the caller can do after releasing the lock.

Thanks for the suggestion! I've changed the code to take this approach, I agree this is cleaner.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 299 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Is it the case that the handshaker service URL is set when the alts_tsi_handshaker is created and is not changed for the lifetime of the object? If so, we can probably just use it directly without acquiring the lock.

I agree that accessing this outside of the lock doesn't seem racy, since concurrent writes to that field can't logically happen at this point.

I've changed the code, though, to have alts_tsi_handshaker_continue_handshaker_next_args store a copy of the fields of alts_tsi_handshaker that it needs to access outside of the lock, which still lets us get rid of the repeated locking and unlocking in the same thread.

I have a slight preference for this, as I personally think it helps readability to be strict about reading the handshaker object's fields only under the lock, even if it's overdone. I can go either way on this though; please let me know.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 423 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add a grpc_closure member to alts_tsi_handshaker_continue_handshaker_next_args, so that we can avoid a second allocation here.

Done.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 470 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why is this function necessary? Immediately after this, we are calling alts_handshaker_client_destroy(). Can't that method cancel the call if necessary?

This is relevant in follow-up PR #20687, but I agree it isn't relevant in this PR. I've reverted this change and will save it for #20687, where it's needed.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 108 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should be lower-case, since it's a simple accessor.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 152 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should be lower-case as well.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 202 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

All of the GPR_ASSERT() calls in this function can be changed to use the gtest EXPECT or ASSERT macros now.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 242 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Not needed anymore; gtest will log the test start for you.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 256 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Not needed anymore.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 263 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Can just use std::unique_ptr<> here.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 272 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This can just be connect_loop_runners.clear();.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 349 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This can be lower-case.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 452 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Not needed.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 461 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Can just use std::unique_ptr<> here.

Done.


test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 470 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

connect_loop_runners.clear();

Done.

Member

@markdroth markdroth left a comment


General comment: I don't actually understand what the benefit is of adding the mutex to this code. The problem we're trying to solve here is just to do channel creation in the callback scheduled on the ExecCtx, and it seems like we've solved that problem by triggering that callback via handshaker_next(). Can't we just do that without adding all the complexity of the mutex? It's not clear to me what the benefit is of the additional complexity.

Reviewed 7 of 7 files at r4.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)


src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 163 at r4 (raw file):

 * - is_ok: a boolean value indicating if the handshaker response is ok to read.
 */
void alts_handshaker_client_handle_response_ensure_locked(

The _ensure_locked() suffix isn't needed here. The fact that this acquires the lock is an implementation detail and shouldn't be leaked into the API.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):

Previously, apolcyn wrote…

I think this still has some usefulness in that the alts_handshaker_client_test can pass a dummy object to it, which is used in the grpc_cb callbacks provided by the test.

Without that, the alts_handshaker_client_test needs to go back to creating a dummy alts_tsi_handshaker object to the alts_handshaker_client objects it creates, just to have a valid argument in it's ALTS RPC callbacks. I'm not a huge fan of this but it seems to complement the injection of the callbacks from the tests

I am not suggesting that alts_grpc_handshaker_client_create() should not continue to allow the caller to pass in the callback argument. What I am saying is that I don't see a reason to store that argument in this struct.

It doesn't look like this field is actually used anywhere except in alts_grpc_handshaker_client_create(). We set this field from the parameter on line 523 and then use it on line 540 in the same function, and I don't see anywhere else that it gets used. So why bother storing it here?

Actually, same comment about the pre-existing grpc_cb field -- it doesn't actually seem to be used anywhere outside of alts_grpc_handshaker_client_create(). Do we actually need that field either?

Furthermore, since we're always just using these two arguments to create a closure, why not just have the caller pass in the closure in the first place? That should probably be done in a separate PR, though.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 283 at r3 (raw file):

Previously, apolcyn wrote…

Changed to return void.

Outside of some unit tests that supply with mocked versions of grpc_caller, grpc_caller is always call_start_batch_and_execute.

As far as I understand, call_start_batch_and_execute will always return OK unless we are misusing the API due to an internal programming error. So I figure it's better to crash than to continue with potential memory corruption.

I checked elsewhere and it looks like all other call_start_batch_and_execute call sites are asserting call_error == OK, e.g. https://github.com/grpc/grpc/blob/master/src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc#L857

Historically, we've not been very good about avoiding assertions in grpc code, but we should be trying to avoid them whenever possible. I think that we should handle this by returning an error, just like the code was doing before this PR.

I'm not sure what you mean about memory corruption. Just because call_start_batch_and_execute() fails does not mean that there is any memory corruption. For example, we could somehow accidentally start a send_message op while there's already a send_message op pending. This is certainly not something the operation can recover from, but returning an error from the operation seems better than crashing the entire process.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):

Previously, apolcyn wrote…

Sorry, I really just meant general risks of ExecCtx stacking (since grpc_call_unref will create an inner ExecCtx).

In short, I think that if ExecCtx stacking behavior was changed to schedule at the "root", e.g. if your comment in #20572 (comment) was implemented, then it would be safe to call this inline. Otherwise, the ExecCtx stacking risks re-entrant locking.

I think a better solution to this would be to have an internal version of grpc_call_unref() that does not create its own ExecCtx.

(Note that my objection to #20572 was not about having an internal version of a C-core API function; it was about the fact that that particular internal version removed a safety check that I didn't think should be removed. I don't mind having internal versions of functions in general, and the most common reason that we do that today is to avoid creating nested ExecCtx instances. In the long run, I still hope to eliminate the ExecCtx completely, in which case most of these internal functions can go away as well. But for now, we can stick with this pattern for consistency.)


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 372 at r4 (raw file):

  void* user_data;
  grpc_core::UniquePtr<char> handshaker_service_url;
  bool use_dedicated_cq;

This field should not be needed, because we should not need to create the channel if this is true.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 376 at r4 (raw file):

};

static void alts_tsi_handshaker_maybe_create_channel(void* arg,

We should not call this unless we need to create the channel, so let's drop maybe from the name.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 382 at r4 (raw file):

  alts_tsi_handshaker* handshaker = next_args->handshaker;
  grpc_channel* channel = nullptr;
  if (!next_args->use_dedicated_cq) {

This check should not be needed, because we should not call this in the first place if we're using a dedicated CQ.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 416 at r4 (raw file):

      reinterpret_cast<alts_tsi_handshaker*>(self);
  grpc_core::MutexLock lock(&handshaker->mu);
  if (handshaker->channel == nullptr) {

We should check use_dedicated_cq here, so that we don't need to bounce into the closure if we don't need to create the channel.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 417 at r4 (raw file):

  grpc_core::MutexLock lock(&handshaker->mu);
  if (handshaker->channel == nullptr) {
    alts_tsi_handshaker_continue_handshaker_next_args* args =

Please add a comment here explaining why we need to bounce into a callback to create the channel, since this will not be obvious to readers in the future.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 438 at r4 (raw file):

    tsi_result ok = alts_tsi_handshaker_continue_handshaker_next_locked(
        handshaker, received_bytes, received_bytes_size, cb, user_data);
    if (ok != TSI_OK) {

Don't we need to invoke the callback in this case?

Contributor Author

@apolcyn apolcyn left a comment


Still addressing other comments.

My primary goal for this PR is really to get the new alts_concurrent_connectivity_test to pass. That test reveals not only the potential lock cycle issue but also a data race. The current code is racy around handling of shutdown. Before the synchronization added here, running the alts_concurrent_connectivity_test under TSAN will specifically reveal a data race around setting and checking of the "has shutdown" state, since tsi_handshaker_shutdown is free to run concurrently with an ALTS handshake RPC callback.

While putting a lock around the entire ALTS handshaker code might be overly generous, it doesn't cause any performance problems and I feel it simplifies things by just making it easy to prove that there aren't any data races.

Note too that a secondary goal of this PR is to prepare a base on which to reasonably add #20687 and #20722. Without this PR, the changes made in those PRs would otherwise make room for more races.

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)

Member

@markdroth markdroth left a comment


Adding the lock around the entire ALTS handshaker code is adding a lot of complexity. I would prefer to avoid this if we don't really need it. And it's not obvious to me why we need it for either of those follow-on PRs (although I have not reviewed them in detail).

If shutdown handling causes race conditions, then let's solve that problem more narrowly. For example, could we address it by making the shutdown variable in the parent handshaker object an atomic instead of a simple bool?
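
A minimal sketch of what that narrower fix might look like, assuming the gpr_atm primitives from grpc/support/atm.h; the surrounding structure is illustrative, not the actual handshaker code:

// In the alts_tsi_handshaker struct: an atomic flag instead of a plain bool.
gpr_atm shutdown;  // 0 = running, 1 = shut down

// In tsi_handshaker_shutdown():
gpr_atm_rel_store(&handshaker->shutdown, 1);

// In the ALTS RPC callback / next() paths:
if (gpr_atm_acq_load(&handshaker->shutdown) != 0) {
  return;  // handshake has been shut down; don't touch further state
}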

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)

Contributor Author

@apolcyn apolcyn left a comment


We can possibly solve this more narrowly with atomics, but I'm worried that this gets much more complicated with those new PRs (which add a deadline to ALTS handshake RPCs, make use of the RECV_STATUS_ON_CLIENT op, and add a separate callback for it).

What do you think about refactoring this to use a combiner? That would require some new code for bouncing into the combiner, but could get rid of locking/unlocking complexity in this PR.

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)

Contributor Author

@apolcyn apolcyn left a comment


Reviewable status: 7 of 17 files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 136 at r2 (raw file):

Previously, apolcyn wrote…

How about just having the callback into the alts_tsi_handshaker assume that the lock is already held instead of trying to reacquire it?

Sorry I think we might be talking about different things here.... note that invoke_tsi_next_cb_args->cb field is the TSI handshaker API callback that was passed into tsi_handshaker_next by security_handshaker.cc.

I wanted to rule out the possibility of calling back into security_handshaker.cc while holding the alts_tsi_handshaker lock, and then from there potentially calling back into alts_tsi_handshaker inline, and deadlocking.

Obsolete due to the refactoring to drop the mutex.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):

Furthermore, since we're always just using these two arguments to create a closure, why not just have the caller pass in the closure in the first place? That should probably be done in a separate PR, though.

Good point, and yes I don't think grpc_cb needs to be stored in the struct either.

This suggestion seems best, but I've left it out of this PR for a future cleanup.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 283 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Historically, we've not been very good about avoiding assertions in grpc code, but we should be trying to avoid them whenver possible. I think that we should handle this by returning an error, just like the code was doing before this PR.

I'm not sure what you mean about memory corruption. Just because call_start_batch_and_execute() fails does not mean that there is any memory corruption. For example, we could somehow accidentally start a send_message op while there's already a send_message op pending. This is certainly not something the operation can recover from, but returning an error from the operation seems better than crashing the entire process.

Due to recent refactoring, this change to alts_handshaker_client_test.cc was reverted, so it now returns an error again.

That said, my motivation for this in general is that such a case, of e.g. accidentally sending an op while one is already in flight, would indicate that we have a bug and that our state machine is broken - in which case there are likely assumptions about object lifetime semantics being broken, leading to a chance of memory corruption in the process, which could have worse consequences than process exit.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think a better solution to this would be to have an internal version of grpc_call_unref() that does not create its own ExecCtx.

(Note that my objection to #20572 was not about having an internal version of a C-core API function; it was about the fact that that particular internal version removed a safety check that I didn't think should be removed. I don't mind having internal versions of functions in general, and the most common reason that we do that today is to avoid creating nested ExecCtx instances. In the long run, I still hope to eliminate the ExecCtx completely, in which case most of these internal functions can go away as well. But for now, we can stick with this pattern for consistency.)

Makes sense. I've made this change for grpc_channel_destroy, and added grpc_channel_destroy_internal and stopped throwing it to the ExecCtx.

I looked at doing this for grpc_call_unref but it actually seems tricky - I'm having a hard time seeing how to get rid of this ExecCtx::Flush() call https://github.com/grpc/grpc/blob/master/src/core/lib/surface/call.cc#L604. So I've kept the scheduling of this on ExecCtx for now.
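
For context, a hedged sketch of the internal-variant pattern being referred to; the real grpc_channel_destroy_internal added in this PR lives in the channel surface code, and this only shows the shape of the split:

// Internal variant: assumes an ExecCtx already exists on this thread, so it
// can be called from core code without creating a nested ExecCtx.
void grpc_channel_destroy_internal(grpc_channel* channel) {
  // ... actual channel teardown ...
}

// Public surface entry point: owns its own ExecCtx and delegates.
void grpc_channel_destroy(grpc_channel* channel) {
  grpc_core::ExecCtx exec_ctx;
  grpc_channel_destroy_internal(channel);
}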


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 372 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This field should not be needed, because we should not need to create the channel if this is true.

Done.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 376 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We should not call this unless we need to create the channel, so let's drop maybe from the name.

Done.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 382 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This check should not be needed, because we should not call this in the first place if we're using a dedicated CQ.

Done.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 416 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We should check use_dedicated_cq here, so that we don't need to bounce into the closure if we don't need to create the channel.

Done.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 417 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add a comment here explaining why we need to bounce into a callback to create the channel, since this will not be obvious to readers in the future.

Done.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 438 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Don't we need to invoke the callback in this case?

We're returning synchronously here, so IIUC we don't want to invoke the callback.

Contributor Author

@apolcyn apolcyn left a comment


As discussed offline, I've changed the code to fix the data races found by alts_concurrent_connectivity_test more narrowly.

In doing so, a few more data races beyond the original ones popped up in the test, and recent changes address those two.

Specifically, the test caught a data race in the ALTS security connector code, because each ALTS security connector constructor was mutating the rpc_versions fields that belonged to the shared ALTS credentials object. See TSAN warning caught by test: https://paste.googleplex.com/6179302213156864

I also realized that the two fields of alts_tsi_handshaker that are accessed by tsi_handshaker_shutdown (which can run concurrently with the rest of the code) are shutdown and client, so I've made sure to just wrap access to those in the mutex.

With that, the test revealed one more data race that was more subtle: see https://paste.googleplex.com/6529646453587968 for the TSAN warning. This occurred when we were still running in continue_handshaker_next, after we had called alts_handshaker_client_start (which in turn invoked the call op batch) - tsi_handshaker_destroy raced with our continued access to the handshaker object. This points out that it's generally unsafe to access either the alts_tsi_handshaker or the alts_handshaker_client objects after call_start_batch_and_execute has been invoked in the handshake call, because as soon as we've done so, the TSI next API callback can be invoked from a different thread, which can drop all ownership on the TSI handshaker object and allow for tsi_handshaker_destroy to run.

This last subtlety especially leads me back to being in favor of a larger refactoring to serialize all of the ALTS handshake code somehow. However, since the current PR should now be correct, that change can be postponed to a later cleanup.

Reviewable status: 7 of 17 files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)

Member

@markdroth markdroth left a comment

I do agree that we should consider some additional restructuring of this code in the future. But this looks pretty good for now.

The only remaining significant issue is the one about grpc_call_unref_internal(). Please let me know if you have any questions.

Reviewed 7 of 13 files at r5, 4 of 5 files at r6.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @apolcyn and @yihuazhang)


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):

Previously, apolcyn wrote…

Furthermore, since we're always just using these two arguments to create a closure, why not just have the caller pass in the closure in the first place? That should probably be done in a separate PR, though.

Good point, and yes I don't think grpc_cb needs to be stored in the struct either.

This suggestion seems best, but I've left it out of this PR for a future cleanup.

Let's go ahead and remove the grpc_cb member in this PR, since that's a trivial change. We can leave the API change for alts_grpc_handshaker_client_create() for a future PR.
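For reference, the deferred API change would look roughly like this (a sketch; the parameter lists are heavily elided and the closure parameter name is made up for illustration):

// Current shape: the caller passes the raw callback function and the client
// builds the grpc_closure itself.
alts_handshaker_client* alts_grpc_handshaker_client_create(
    ..., grpc_iomgr_cb_func grpc_cb, ...);

// Possible future shape: the caller builds the closure once and passes it in,
// so the client never needs to see grpc_cb at all.
alts_handshaker_client* alts_grpc_handshaker_client_create(
    ..., grpc_closure* on_handshaker_response, ...);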


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 283 at r3 (raw file):

Previously, apolcyn wrote…

Due to recent refactoring, this change to alts_handshaker_client_test.cc was reverted, so it now returns an error again.

That said, my motivation for this in general is that such a case (e.g. accidentally sending an op while one is already in flight) would indicate that we have a bug and that our state machine is broken, in which case there are likely assumptions about object lifetime semantics being violated, leading to a chance of memory corruption in the process, which could have worse consequences than a process exit.

I don't necessarily think that this kind of error automatically means memory corruption. It could potentially mean a memory leak, but it's not clear to me that it would be smashing the stack.

In any case, this looks good now.
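For context, the assert-on-failure shape described in the PR description is roughly the following sketch (names are approximate and the surrounding details are elided):

grpc_call_error call_error = grpc_call_start_batch_and_execute(
    client->call, ops, nops, &on_handshaker_service_resp_recv /* approximate name */);
// A failure here can only mean we constructed the batch incorrectly, i.e. an
// internal logic bug, so assert instead of propagating a TSI error.
GPR_ASSERT(call_error == GRPC_CALL_OK);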


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):

Previously, apolcyn wrote…

Makes sense. I've made this change for grpc_channel_destroy, and added grpc_channel_destroy_internal and stopped throwing it to the ExecCtx.

I looked at doing this for grpc_call_unref but it actually seems tricky - I'm having a hard time seeing how to get rid of this ExecCtx::Flush() call https://github.com/grpc/grpc/blob/master/src/core/lib/surface/call.cc#L604. So I've kept the scheduling of this on ExecCtx for now.

I don't think it's necessary to get rid of that ExecCtx::Flush() call. Why not just leave it there? It will flush whatever ExecCtx instance is the current one on the stack.


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 342 at r6 (raw file):

    // at this point, since alts_handshaker_client_start_client/server
    // have potentially just started an op batch on the handshake call.
    // The completion callback of for that batch is unsynchronized and so

s/of for/for/

Contributor Author

@apolcyn apolcyn left a comment

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @markdroth and @yihuazhang)


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Let's go ahead and remove the grpc_cb member in this PR, since that's a trivial change. We can leave the API change for alts_grpc_handshaker_client_create() for a future PR.

Done.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think it's necessary to get rid of that ExecCtx::Flush() call. Why not just leave it there? It will flush whatever ExecCtx instance is the current one on the stack.

Note that the potential lock inversion is mainly due to the nested ExecCtx flushing, rather than the ExecCtx stacking itself. As an example, I created an internal variant of grpc_call_unref that is an exact copy except that it doesn't create a new ExecCtx. This TSAN warning about potential lock inversion is then triggered by the test: https://paste.googleplex.com/4741553991974912. Note that TSAN complains because it sees us acquire a subchannel mutex in a callback run in the nested ExecCtx flush of grpc_call_unref_internal. Down the call stack, that same thread was holding a chttp2 connector mutex. In a different thread, TSAN saw that same chttp2 connector mutex get acquired while first holding the subchannel mutex.

I'm not sure how real a danger this specific TSAN warning is, but I think TSAN is at least warning about something potentially error-prone (it seems hard to be sure that we wouldn't accidentally be holding some lock while invoking grpc_call_unref_internal, which a callback run in the nested ExecCtx flush would then try to reacquire).


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 342 at r6 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/of for/for/

Done.

Contributor

@yihuazhang yihuazhang left a comment

Thanks much, Alex, for fixing the issue in ALTS and for the cleanups! I only have some minor comments (mostly clarifications). Also, just to confirm: the current PR did not use a combiner-based solution, right?

Reviewed 1 of 13 files at r5, 3 of 5 files at r7.
Reviewable status: 13 of 15 files reviewed, 5 unresolved discussions (waiting on @apolcyn and @markdroth)


src/core/lib/security/credentials/alts/alts_credentials.cc, line 44 at r7 (raw file):

                                  ? gpr_strdup(GRPC_ALTS_HANDSHAKER_SERVICE_URL)
                                  : gpr_strdup(handshaker_service_url)) {
  grpc_alts_set_rpc_protocol_versions(&options_->rpc_versions);

Thanks for catching the data race by using the concurrency test.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):

Previously, apolcyn wrote…

Note that the potential lock inversion is mainly due to the nested ExecCtx flushing, rather than the ExecCtx stacking itself. As an example, I created an internal variant of grpc_call_unref that is an exact copy except that it doesn't create a new ExecCtx. This TSAN warning about potential lock inversion is then triggered by the test: https://paste.googleplex.com/4741553991974912. Note that TSAN complains because it sees us acquire a subchannel mutex in a callback run in the nested ExecCtx flush of grpc_call_unref_internal. Down the call stack, that same thread was holding a chttp2 connector mutex. In a different thread, TSAN saw that same chttp2 connector mutex get acquired while first holding the subchannel mutex.

I'm not sure how real a danger this specific TSAN warning is, but I think TSAN is at least warning about something potentially error-prone (it seems hard to be sure that we wouldn't accidentally be holding some lock while invoking grpc_call_unref_internal, which a callback run in the nested ExecCtx flush would then try to reacquire).

Just want to understand the terminology of ExecCtx stacking: does it mean stacking the closure on the queue belonging to the ExecCtx, or does it mean something else?


src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 58 at r7 (raw file):

  bool use_dedicated_cq;
  // mu synchronizes all fields below
  gpr_mu mu;

Could you please add some comments explaining why only these fields are protected by mutex, not the rest?

Contributor Author

@apolcyn apolcyn left a comment

Reviewable status: 13 of 15 files reviewed, 5 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, and @yihuazhang)


src/core/lib/security/credentials/alts/alts_credentials.cc, line 44 at r7 (raw file):

Previously, yihuazhang (yihuaz) wrote…

Thanks for catching the data race by using the concurrency test.

thanks


src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 163 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The _ensure_locked() suffix isn't needed here. The fact that this acquires the lock is an implementation detail and shouldn't be leaked into the API.

Done; this is no longer here after the recent larger changes.


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):

Previously, yihuazhang (yihuaz) wrote…

Just want to understand the terminology of ExecCtx stacking: does it mean stacking the closure on the queue belonging to the ExecCtx, or does it mean something else?

I just mean instantiating more than one ExecCtx object in the same call stack, which makes degenerate scenarios like the following possible:

gpr_mu g_lock;

void bar() {
  // Deadlocks: g_lock is still held further down this thread's call stack.
  grpc_core::MutexLock lock(&g_lock);
}

void foo() {
  grpc_core::ExecCtx exec_ctx;  // nested ExecCtx
  ...
  GRPC_CLOSURE_SCHED(bar_closure, ...);
  // bar_closure runs on this same thread when the nested exec_ctx flushes.
}

int main() {
  grpc_core::ExecCtx exec_ctx;
  {
    grpc_core::MutexLock lock(&g_lock);  // hold g_lock across the call to foo()
    foo();
  }
}

src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 58 at r7 (raw file):

Previously, yihuazhang (yihuaz) wrote…

Could you please add some comments explaining why only these fields are protected by mutex, not the rest?

Done.

Member

@markdroth markdroth left a comment

Thanks for cleaning this up, Alex. Please feel free to merge after addressing my remaining comment.

Reviewed 4 of 5 files at r7, 1 of 1 files at r8.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @jiangtaoli2016, @yashykt, and @yihuazhang)


src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):

Previously, apolcyn wrote…

I just mean instantiating more than one ExecCtx object in the same call stack, which makes degenerate scenarios like the following possible:

gpr_mu g_lock;

void bar() {
  // Deadlocks: g_lock is still held further down this thread's call stack.
  grpc_core::MutexLock lock(&g_lock);
}

void foo() {
  grpc_core::ExecCtx exec_ctx;  // nested ExecCtx
  ...
  GRPC_CLOSURE_SCHED(bar_closure, ...);
  // bar_closure runs on this same thread when the nested exec_ctx flushes.
}

int main() {
  grpc_core::ExecCtx exec_ctx;
  {
    grpc_core::MutexLock lock(&g_lock);  // hold g_lock across the call to foo()
    foo();
  }
}

Ugh, what a mess.

@yashykt, when we're done with your work to decouple closures from schedulers, we really need to find a way to either eliminate ExecCtx in favor of EventManager or do something to prevent ExecCtx stacking from ever occurring.

Alex, I guess this is fine for now, but please make this comment a TODO, so that it's clear that this is something that should be fixed.

@apolcyn
Contributor Author

apolcyn commented Nov 4, 2019

Thanks for the detailed reviewing!

Squashing down commits before merging.

@apolcyn
Contributor Author

apolcyn commented Nov 5, 2019

Bazel RBE ASAN C/C++: #20711
Bazel RBE opt C/C++: #20198
Bazel RBE TSAN C/C++: #20928
RBE Windows Debug C/C++: #20677
RBE Windows Opt C/C++: #20436

@apolcyn apolcyn merged commit 002282e into grpc:master Nov 5, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Feb 3, 2020