Associate a mutex with ALTS TSI handshaker objects #20596
Conversation
Force-pushed from 003a96e to 5de8487.
A heads up that there's a low rate of timeout flakiness in the new test right now, and it may need some tweaking of parameters.
Force-pushed from d1855dd to 20ac533.
The flake rate has dropped to the point where I see only about 1 timeout flake in 2000 runs. I'm still trying to do some cleanup of the changes.
FTR this turned out to not be fully resolved, as #20687 still had this low (~1/2000) flake rate. This is still ready for review.
Unassigned @yihuazhang, who is out of the office.
To clarify a couple of things brought up offline, the changes in this PR mainly try to do two things:
Force-pushed from 0921db6 to 1401f8e.
My main concern here is about the abstraction violation between the handshaker client and the handshaker itself. I've suggested what I think will be a much cleaner and simpler way to do this.
Please let me know if you have any questions.
Reviewed 7 of 15 files at r1, 13 of 13 files at r2.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @apolcyn and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 156 at r2 (raw file):
 * and its owning alts_tsi_handshaker object is currently necessary in order
 * to drop the alts_tsi_handshaker's lock and then grab it again. */
void alts_handshaker_client_continue_make_grpc_call_locked(
See my comments elsewhere about finding a cleaner way to do this that does not require breaking the abstractions this way.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 136 at r2 (raw file):
t->bytes_to_send_size = bytes_to_send_size;
t->result = result;
GRPC_CLOSURE_SCHED(
If we handled the channel creation inside of the handshaker code as I suggest elsewhere, then I think this level of indirection would not be necessary.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.h, line 100 at r2 (raw file):
 * - is_ok: a boolean value indicating if the handshaker response is ok to read. */
void alts_tsi_handshaker_handle_response_dedicated(
These APIs seem to really break the abstractions between this and the handshaker client code. I'd really prefer to find a cleaner way to do this.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 303 at r2 (raw file):
    reinterpret_cast<alts_tsi_handshaker*>(self);
grpc_core::MutexLock lock(&handshaker->mu);
if (self->handshake_shutdown) {
Shouldn't this be checking handshaker->shutdown instead?
Might also want to double-check if there are other places where we're checking the wrong variable, too.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 395 at r2 (raw file):
alts_tsi_handshaker* handshaker =
    reinterpret_cast<alts_tsi_handshaker*>(self);
gpr_mu_lock(&handshaker->mu);
It should not be necessary to acquire the lock here. If we're calling destroy while some other method is still trying to use the handshaker, then there's a bug elsewhere, and acquiring the lock here is not going to help, because the other code could still try to acquire the lock after we release it.
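A minimal, self-contained sketch of the point above (all names hypothetical, illustrating the general pattern rather than the gRPC code): a lock taken inside a destructor cannot protect against concurrent users, because a racing thread could acquire the mutex right after the destructor releases it, on an already-destroyed object. The only safe fix is to guarantee externally that all users are done before destruction begins.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the handshaker object.
class Handshaker {
 public:
  void DoWork() {
    std::lock_guard<std::mutex> lock(mu_);
    ++work_count_;
  }
  int work_count() {
    std::lock_guard<std::mutex> lock(mu_);
    return work_count_;
  }
  // The destructor intentionally takes no lock: if another thread could
  // still call DoWork() here, locking would not save us, because that
  // thread might acquire mu_ right after we release it, on a dead object.

 private:
  std::mutex mu_;
  int work_count_ = 0;
};

int SafeTeardownDemo() {
  auto* h = new Handshaker();
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i) {
    workers.emplace_back([h] { h->DoWork(); });
  }
  // Join every user *before* destroying: this ordering, not a lock in
  // the destructor, is what makes teardown safe.
  for (auto& t : workers) t.join();
  int total = h->work_count();
  delete h;
  return total;
}
```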
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 433 at r2 (raw file):
use_dedicated_cq ? nullptr : grpc_insecure_channel_create(handshaker->handshaker_service_url,
Instead of moving this into alts_tsi_handshaker_re_enter_lock_then_continue_make_grpc_call() and requiring that to intercept all interactions between this code and the handshaker client, how about just immediately scheduling a closure on the ExecCtx to create the channel? If the handshaker's next() method is called while that closure is still pending, the handshaker can save the data and then invoke it when the channel creation callback finishes.
I think this approach would make this PR much less invasive and would result in much less abstraction breakage between the various layers here (i.e., much less shared state and API surface between the ALTS client and this code).
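The suggested state machine can be sketched in miniature (all names here are hypothetical, and the "ExecCtx" is a toy closure queue, not the real gRPC one): channel creation is deferred as a scheduled closure, and any next() calls that arrive while it is pending are stashed and replayed once creation completes.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Toy closure queue standing in for the ExecCtx.
struct FakeExecCtx {
  std::queue<std::function<void()>> closures;
  void Sched(std::function<void()> fn) { closures.push(std::move(fn)); }
  void Flush() {
    while (!closures.empty()) {
      auto fn = std::move(closures.front());
      closures.pop();
      fn();
    }
  }
};

class DeferredChannelHandshaker {
 public:
  explicit DeferredChannelHandshaker(FakeExecCtx* ctx) : ctx_(ctx) {}

  void Next(std::string bytes) {
    if (!channel_ready_) {
      // Channel not created yet: save the data for later.
      pending_.push_back(std::move(bytes));
      if (!creation_scheduled_) {
        creation_scheduled_ = true;
        ctx_->Sched([this] { OnChannelCreated(); });
      }
      return;
    }
    ContinueNext(std::move(bytes));
  }
  const std::vector<std::string>& processed() const { return processed_; }

 private:
  void OnChannelCreated() {
    channel_ready_ = true;  // channel creation finished
    for (auto& b : pending_) ContinueNext(std::move(b));
    pending_.clear();
  }
  void ContinueNext(std::string bytes) {
    processed_.push_back(std::move(bytes));
  }

  FakeExecCtx* ctx_;
  bool channel_ready_ = false;
  bool creation_scheduled_ = false;
  std::vector<std::string> pending_;
  std::vector<std::string> processed_;
};
```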
test/core/tsi/alts/handshaker/alts_handshaker_client_test.cc, line 305 at r2 (raw file):
// Create "handshaker client" objects
{
  grpc_core::MutexLock lock(alts_tsi_handshaker_get_lock_for_testing(
It should not be necessary to grab the lock to create the handshaker client. I think this is another symptom of the abstraction breakage happening here.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 85 at r2 (raw file):
GPR_ASSERT(gpr_asprintf(&log_trace_id, "%p", debug_id));
grpc_arg extra_arg;
extra_arg.type = GRPC_ARG_STRING;
Please use grpc_channel_arg_string_create() (see src/core/lib/channel/channel_args.h, line 112 at 892d5d7): grpc_arg grpc_channel_arg_string_create(char* name, char* value);
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 86 at r2 (raw file):
grpc_arg extra_arg;
extra_arg.type = GRPC_ARG_STRING;
extra_arg.key = const_cast<char*>("ensure_subchannel_sharing_disabled");
Please use GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL for this purpose (see include/grpc/impl/codegen/grpc_types.h, line 396 at 892d5d7): #define GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL "grpc.use_local_subchannel_pool"
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 89 at r2 (raw file):
extra_arg.value.string = log_trace_id;
grpc_channel_args* channel_args =
    grpc_channel_args_copy_and_add(nullptr, &extra_arg, 1);
Can just say {1, &extra_arg} here, so there's no need to call grpc_channel_args_destroy() below.
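The point of the {1, &extra_arg} suggestion can be shown with simplified stand-in types (hypothetical, for illustration only): building the args struct as a stack aggregate avoids the heap copy made by the copy-and-add helper, so there is no matching destroy call to make later.

```cpp
#include <cstddef>

// Simplified stand-ins for grpc_arg / grpc_channel_args.
struct Arg {
  const char* key;
  const char* value;
};
struct ChannelArgs {
  size_t num_args;
  Arg* args;
};

size_t NumArgsOnStack() {
  Arg extra_arg = {"grpc.use_local_subchannel_pool", "true"};
  // Stack aggregate: valid for the scope of the call, nothing to free.
  ChannelArgs channel_args = {1, &extra_arg};
  return channel_args.num_args;
}
```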
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 115 at r2 (raw file):
}
grpc_core::UniquePtr<char> Address() {
This should probably just return a char*. If the caller wants to create their own copy, they can do so.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 127 at r2 (raw file):
class TestServer {
 public:
  TestServer(grpc_core::UniquePtr<char> fake_handshake_server_address) {
explicit
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 161 at r2 (raw file):
}
grpc_core::UniquePtr<char> Address() {
This should probably just return a char*. If the caller wants to create their own copy, they can do so.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 181 at r2 (raw file):
};
struct connect_args {
How about making this a class with connect_loop() as a method?
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 527 at r2 (raw file):
int main(int argc, char** argv) {
  grpc_init();
  {
These braces aren't necessary.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 528 at r2 (raw file):
grpc_init();
{
  test_basic_client_server_handshake();
Since this is a new test, can we use gtest instead of calling each test manually?
Reviewable status: 7 of 15 files reviewed, 16 unresolved discussions (waiting on @markdroth and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 156 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
See my comments elsewhere about finding a cleaner way to do this that does not require breaking the abstractions this way.
Thanks for the suggestions! I've removed this and agree this is cleaner. Related, I was able to revert a number of changes to the alts_handshaker_client_test
from this PR.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 136 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
If we handled the channel creation inside of the handshaker code as I suggest elsewhere, then I think this level of indirection would not be necessary.
Unfortunately I don't see how this can be avoided. My motivation for deferring invoke_tsi_next_cb to the ExecCtx is to avoid calling back into the driver of the TSI handshaker (and, from there, potentially back into the alts_tsi_handshaker) while still holding the alts_tsi_handshaker mutex. Running the cb at the bottom of the ExecCtx seemed safest to me; please let me know if you have any thoughts on this, though.
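The deadlock the deferral avoids can be shown in a minimal sketch (hypothetical names; a plain std::mutex and queue stand in for the handshaker mutex and the ExecCtx): if the TSI next-callback were invoked while the mutex is still held, and the callee re-entered the handshaker inline, a non-recursive mutex would self-deadlock. Draining deferred callbacks only after the lock is released removes that possibility.

```cpp
#include <functional>
#include <mutex>
#include <queue>

class CallbackDeferrer {
 public:
  void HandleResponse(std::function<void()> tsi_next_cb) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      // ... mutate handshaker state under the lock ...
      deferred_.push(std::move(tsi_next_cb));
    }
    // Drain outside the critical section, analogous to running the
    // closure at the bottom of the ExecCtx.
    Drain();
  }
  // Simulates the driver calling back into the handshaker inline from
  // the callback; safe only because mu_ is not held at that point.
  void ReenterInline() {
    std::lock_guard<std::mutex> lock(mu_);
    ++reentries_;
  }
  int reentries() const { return reentries_; }

 private:
  void Drain() {
    while (true) {
      std::function<void()> cb;
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (deferred_.empty()) return;
        cb = std::move(deferred_.front());
        deferred_.pop();
      }
      cb();  // invoked with mu_ released
    }
  }
  std::mutex mu_;
  std::queue<std::function<void()>> deferred_;
  int reentries_ = 0;
};
```

Had HandleResponse invoked the callback inside its critical section instead, the ReenterInline call below would block forever on mu_.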
src/core/tsi/alts/handshaker/alts_tsi_handshaker.h, line 100 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
These APIs seem to really break the abstractions between this and the handshaker client code. I'd really prefer to find a cleaner way to do this.
I was able to remove the re_enter_lock... APIs.
Unfortunately, I don't see a way to avoid alts_tsi_handshaker_handle_response_dedicated, though, because the caller of this is polling a cq on its own thread and needs to be able to jump back into the alts_tsi_handshaker mutex. So long as alts_tsi_handshaker and alts_handshaker_grpc_client share that same mutex (which I feel is less error-prone), I can't see a good way around it.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 303 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Shouldn't this be checking handshaker->shutdown instead? Might also want to double-check if there are other places where we're checking the wrong variable, too.
Good catch! Checked that we weren't doing this elsewhere.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 395 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
It should not be necessary to acquire the lock here. If we're calling destroy while some other method is still trying to use the handshaker, then there's a bug elsewhere, and acquiring the lock here is not going to help, because the other code could still try to acquire the lock after we release it.
Good catch! Removed locking from the dtors.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 433 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Instead of moving this into alts_tsi_handshaker_re_enter_lock_then_continue_make_grpc_call() and requiring that to intercept all interactions between this code and the handshaker client, how about just immediately scheduling a closure on the ExecCtx to create the channel? If the handshaker's next() method is called while that closure is still pending, the handshaker can save the data and then invoke it when the channel creation callback finishes. I think this approach would make this PR much less invasive and would result in much less abstraction breakage between the various layers here (i.e., much less shared state and API surface between the ALTS client and this code).
Thanks for the suggestion.
I've changed the code to do what I believe is roughly this suggestion: the code in the most recent update always schedules creation of a channel at the top of alts_tsi_handshaker_next, and then basically continues the rest of alts_tsi_handshaker_next after the channel is created, but the channel creation is moved fully to within the TSI handshaker object. Leveraging the "next" call seemed to make lifetime issues simpler.
test/core/tsi/alts/handshaker/alts_handshaker_client_test.cc, line 305 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
It should not be necessary to grab the lock to create the handshaker client. I think this is another symptom of the abstraction breakage happening here.
I agree creation doesn't need the lock; removed.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 85 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Please use grpc_channel_arg_string_create() (see src/core/lib/channel/channel_args.h, line 112 at 892d5d7): grpc_arg grpc_channel_arg_string_create(char* name, char* value);
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 86 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Please use GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL for this purpose (see include/grpc/impl/codegen/grpc_types.h, line 396 at 892d5d7): #define GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL "grpc.use_local_subchannel_pool"
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 89 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Can just say {1, &extra_arg} here, so there's no need to call grpc_channel_args_destroy() below.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 115 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This should probably just return a char*. If the caller wants to create their own copy, they can do so.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 127 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
explicit
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 161 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This should probably just return a char*. If the caller wants to create their own copy, they can do so.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 181 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
How about making this a class with connect_loop() as a method?
Done, good point.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 527 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
These braces aren't necessary.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 528 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Since this is a new test, can we use gtest instead of calling each test manually?
Good point, done
This is looking much better! I'd like to see a bit more structural cleanup, but it's definitely moving in the right direction.
Reviewed 8 of 8 files at r3.
Reviewable status: all files reviewed, 25 unresolved discussions (waiting on @apolcyn and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 148 at r3 (raw file):
/**
 * This method handles handshaker response returned from ALTS handshaker
 * service. Note that the only reason the API is exposed is that it is used in
Maybe restore this comment now?
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 136 at r2 (raw file):
Previously, apolcyn wrote…
Unfortunately I don't see how this can be avoided. My motivation for deferring invoke_tsi_next_cb to the ExecCtx is to avoid calling back into the driver of the TSI handshaker while still holding the alts_tsi_handshaker mutex, so that we don't call back into the alts_tsi_handshaker while still holding it. Running cb at the bottom of the ExecCtx seemed safest to me, please let me know if you have any thoughts on this though.
How about just having the callback into the alts_tsi_handshaker assume that the lock is already held instead of trying to reacquire it? It looks like it will always be invoked from a point where the lock is held.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):
grpc_iomgr_cb_func grpc_cb;
/* Argument to pass to grpc_cb when invoked. */
void* grpc_cb_arg;
Doesn't look like this is needed anymore.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 139 at r3 (raw file):
t->result = result;
GRPC_CLOSURE_SCHED(
    GRPC_CLOSURE_CREATE(invoke_tsi_next_cb, t, grpc_schedule_on_exec_ctx),
Suggest adding a grpc_closure member to invoke_tsi_next_cb_args, so that we don't need to do an additional allocation here.
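The "embed the closure in the args struct" idea can be sketched with simplified stand-in types (hypothetical, loosely modeled on grpc_closure / GRPC_CLOSURE_INIT): since the argument struct is already allocated, the closure storage can live inside it, and scheduling needs no second allocation.

```cpp
// Simplified stand-in for grpc_closure.
struct Closure {
  void (*cb)(void* arg);
  void* cb_arg;
};

struct InvokeTsiNextCbArgs {
  int bytes_to_send_size;
  bool invoked;
  Closure closure;  // embedded: no separate allocation when scheduling
};

static void InvokeTsiNextCb(void* arg) {
  auto* args = static_cast<InvokeTsiNextCbArgs*>(arg);
  args->invoked = true;
}

bool ScheduleWithEmbeddedClosure(InvokeTsiNextCbArgs* args) {
  // Analogous to GRPC_CLOSURE_INIT on the embedded member.
  args->closure.cb = InvokeTsiNextCb;
  args->closure.cb_arg = args;
  // A real ExecCtx would queue this closure; here we run it directly.
  args->closure.cb(args->closure.cb_arg);
  return args->invoked;
}
```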
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 283 at r3 (raw file):
client->grpc_caller(client->call, ops, static_cast<size_t>(op - ops),
                    &client->on_handshaker_service_resp_recv);
GPR_ASSERT(call_error == GRPC_CALL_OK);
Why change this from returning an error to be an assertion? Handling the error without crashing seems preferable.
(Note that if you don't do this, then this function might as well be changed to return void, because it will never return any error.)
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 357 at r3 (raw file):
handshaker_client_send_buffer_destroy_locked(client);
client->send_buffer = buffer;
make_grpc_call_locked(client, true /* is_start */);
Unless this function is changed to return void, we should continue to check its result here.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 415 at r3 (raw file):
handshaker_client_send_buffer_destroy_locked(client);
client->send_buffer = buffer;
make_grpc_call_locked(client, true /* is_start */);
Same here.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 450 at r3 (raw file):
handshaker_client_send_buffer_destroy_locked(client);
client->send_buffer = buffer;
make_grpc_call_locked(client, false /* is_start */);
Same here.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):
    reinterpret_cast<alts_grpc_handshaker_client*>(c);
if (client->call != nullptr) {
  // do this at the bottom of the callstack to avoid lock recursion
What lock recursion are we avoiding here? The use-case is not clear to me.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.h, line 100 at r2 (raw file):
Previously, apolcyn wrote…
I was able to remove the re_enter_lock... APIs.
Unfortunately, I don't see a way to avoid alts_tsi_handshaker_handle_response_dedicated, though, because the caller of this is polling a cq on its own thread and needs to be able to jump back into the alts_tsi_handshaker mutex. So long as alts_tsi_handshaker and alts_handshaker_grpc_client share that same mutex (which I feel is less error prone), I can't see a good way around it.
I understand the need to reacquire the lock in that case, but is there a reasonable way to avoid exposing this publicly?
As I mentioned above, ideally, we should re-acquire the lock only in the case where we're using a CQ in its own thread; in the case where we're directly handling our own polling, we should not need to do so. So maybe the right thing here is to have alts_tsi_handshaker pass a callback to the client to use when handling a response. In the case where we're using a CQ, we can use a callback that acquires the mutex; in the case where we're handling our own polling, we can use a callback that just directly calls alts_handshaker_client_handle_response_locked().
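The injected-callback idea above can be sketched as follows (names hypothetical; std::mutex stands in for the shared handshaker mutex): the client is constructed with a response-dispatch callback, where the dedicated-CQ variant wraps the locked handler in a mutex acquisition, and the direct-polling variant, which by convention is invoked with the lock already held, just forwards to the locked handler.

```cpp
#include <functional>
#include <mutex>

struct HandshakerState {
  std::mutex mu;
  int responses_handled = 0;
};

// The "_locked" suffix documents the precondition: s->mu is held.
void HandleResponseLocked(HandshakerState* s) { ++s->responses_handled; }

using ResponseDispatch = std::function<void(HandshakerState*)>;

// Dedicated-CQ thread: must take the lock before handling the response.
ResponseDispatch MakeDedicatedCqDispatch() {
  return [](HandshakerState* s) {
    std::lock_guard<std::mutex> lock(s->mu);
    HandleResponseLocked(s);
  };
}

// Direct-polling path: caller already holds s->mu, so just forward.
ResponseDispatch MakeDirectDispatch() {
  return [](HandshakerState* s) { HandleResponseLocked(s); };
}
```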
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 433 at r2 (raw file):
Previously, apolcyn wrote…
Thanks for the suggestion.
I've changed the code to do what I believe is roughly this suggestion: the code in the most recent update always schedules creation of a channel at the top of the
alts_tsi_handshaker_next
, and then basically continues the rest ofalts_tsi_handshaker_next
after the channel is created - but the channel creation is moved fully to within the TSI handshaker object. Leveraging the "next" call seemed to make lifetime issues simpler.
Good idea! That does make the state machine easier to understand.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 288 at r3 (raw file):
}
static void alts_tsi_handshaker_maybe_create_channel(
I think we should structure things such that there's a lot less acquiring and releasing of the same mutex in the same thread. Specifically, I suggest the following:
- In handshaker_next(), acquire the lock, then check whether we need to create a channel. If we do, schedule a closure on the ExecCtx for alts_tsi_handshaker_maybe_create_channel() and return. Otherwise, immediately call alts_tsi_handshaker_continue_handshaker_next_locked().
- alts_tsi_handshaker_maybe_create_channel() should be a closure function that takes the alts_tsi_handshaker_continue_handshaker_next_args instance as its arg. It should acquire the lock after creating the channel and then call alts_tsi_handshaker_continue_handshaker_next_locked().
- alts_tsi_handshaker_continue_handshaker_next() should be renamed to alts_tsi_handshaker_continue_handshaker_next_locked(). It no longer needs to be a closure callback; instead of taking the alts_tsi_handshaker_continue_handshaker_next_args instance as its argument, it can directly take arguments similar to those of handshaker_next(). It should assume that the lock is already acquired by the caller instead of trying to acquire it itself. It can return some state indicating whether the callback should be invoked and with what parameters, which the caller can do after releasing the lock.
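The last bullet's "return state from the _locked helper, invoke the callback after releasing the lock" shape can be sketched generically (hypothetical names; std::mutex stands in for the handshaker mutex):

```cpp
#include <mutex>

// Description of the callback to run, computed under the lock but
// invoked only after the lock is released.
struct CallbackToInvoke {
  bool should_invoke = false;
  int result = 0;
};

class NextStateMachine {
 public:
  int InvokedWith() const { return invoked_with_; }

  void HandshakerNext() {
    CallbackToInvoke cb;
    {
      std::lock_guard<std::mutex> lock(mu_);
      cb = ContinueHandshakerNextLocked();
    }
    // Lock released: now safe to call back into the driver.
    if (cb.should_invoke) invoked_with_ = cb.result;
  }

 private:
  // Precondition (by the "_locked" convention): mu_ is held.
  CallbackToInvoke ContinueHandshakerNextLocked() {
    // ... advance the handshake state under mu_ ...
    return CallbackToInvoke{true, 42};
  }

  std::mutex mu_;
  int invoked_with_ = -1;
};
```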
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 299 at r3 (raw file):
  create_channel = true;
}
handshaker_service_url = grpc_core::UniquePtr<char>(
Is it the case that the handshaker service URL is set when the alts_tsi_handshaker is created and is not changed for the lifetime of the object? If so, we can probably just use it directly without acquiring the lock.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 423 at r3 (raw file):
args->user_data = user_data;
GRPC_CLOSURE_SCHED(
    GRPC_CLOSURE_CREATE(alts_tsi_handshaker_continue_handshaker_next, args,
Please add a grpc_closure member to alts_tsi_handshaker_continue_handshaker_next_args, so that we can avoid a second allocation here.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 470 at r3 (raw file):
if (handshaker->client != nullptr) {
  // this is defensive in order to avoid leaving a stray/unpolled call
  alts_handshaker_client_cancel_call_locked(handshaker->client);
Why is this function necessary? Immediately after this, we are calling alts_handshaker_client_destroy(). Can't that method cancel the call if necessary?
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 108 at r3 (raw file):
}
const char* Address() { return address_.get(); }
This should be lower-case, since it's a simple accessor.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 152 at r3 (raw file):
}
const char* Address() { return server_addr_.get(); }
This should be lower-case as well.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 202 at r3 (raw file):
grpc_connectivity_state state =
    grpc_channel_check_connectivity_state(channel, 1);
GPR_ASSERT(state == GRPC_CHANNEL_IDLE);
All of the GPR_ASSERT() calls in this function can be changed to use the gtest EXPECT or ASSERT macros now.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 242 at r3 (raw file):
// handshake server).
TEST(AltsConcurrentConnectivityTest, TestBasicClientServerHandshakes) {
  gpr_log(GPR_DEBUG, "Running test: test_basic_client_server_handshake");
Not needed anymore; gtest will log the test start for you.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 256 at r3 (raw file):
 * (using the fake, in-process handshake server). */
TEST(AltsConcurrentConnectivityTest, TestConcurrentClientServerHandshakes) {
  gpr_log(GPR_DEBUG, "Running test: test_concurrent_client_server_handshakes");
Not needed anymore.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 263 at r3 (raw file):
gpr_timespec test_deadline = grpc_timeout_seconds_to_deadline(20);
int num_concurrent_connects = 50;
std::vector<grpc_core::UniquePtr<ConnectLoopRunner>> connect_loop_runners;
Can just use std::unique_ptr<> here.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 272 at r3 (raw file):
      GRPC_CHANNEL_READY /* expected connectivity states */));
}
for (size_t i = 0; i < num_concurrent_connects; i++) {
This can just be connect_loop_runners.clear().
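The equivalence behind this comment is that clear() on a vector of unique_ptr runs each element's deleter, so the manual index loop adds nothing. A small self-contained sketch (Runner is a hypothetical stand-in for ConnectLoopRunner):

```cpp
#include <memory>
#include <vector>

// Counts its own destructions so we can observe the teardown.
struct Runner {
  explicit Runner(int* counter) : counter_(counter) {}
  ~Runner() { ++*counter_; }
  int* counter_;
};

int DestroyViaClear(size_t n) {
  int destroyed = 0;
  std::vector<std::unique_ptr<Runner>> runners;
  for (size_t i = 0; i < n; ++i) {
    runners.push_back(std::make_unique<Runner>(&destroyed));
  }
  runners.clear();  // replaces the manual for-loop teardown
  return destroyed;
}
```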
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 349 at r3 (raw file):
}
const char* Address() { return address_.get(); }
This can be lower-case.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 452 at r3 (raw file):
TEST(AltsConcurrentConnectivityTest,
     TestHandshakeFailsFastWhenPeerEndpointClosesConnectionAfterAccepting) {
  gpr_log(GPR_DEBUG,
Not needed.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 461 at r3 (raw file):
{
  gpr_timespec test_deadline = grpc_timeout_seconds_to_deadline(20);
  std::vector<grpc_core::UniquePtr<ConnectLoopRunner>> connect_loop_runners;
Can just use std::unique_ptr<> here.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 470 at r3 (raw file):
      GRPC_CHANNEL_TRANSIENT_FAILURE /* expected connectivity states */));
}
for (size_t i = 0; i < num_concurrent_connects; i++) {
connect_loop_runners.clear();
Reviewable status: all files reviewed, 25 unresolved discussions (waiting on @markdroth and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 148 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Maybe restore this comment now?
Restored parameter comments.
I think this comment about the reason for the API being exposed was previously accurate, and only becomes inaccurate with this PR: before this PR, alts_handshaker_client_handle_response is used by both the dedicated-CQ-thread ALTS driver and the grpc-polling-framework-based ALTS driver.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 136 at r2 (raw file):
How about just having the callback into the alts_tsi_handshaker assume that the lock is already held instead of trying to reacquire it?
Sorry, I think we might be talking about different things here... note that the invoke_tsi_next_cb_args->cb field is the TSI handshaker API callback that was passed into tsi_handshaker_next by security_handshaker.cc.
I wanted to rule out the possibility of calling back into security_handshaker.cc while holding the alts_tsi_handshaker lock, and from there potentially calling back into alts_tsi_handshaker inline and deadlocking.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Doesn't look like this is needed anymore.
I think this still has some usefulness in that the alts_handshaker_client_test can pass a dummy object to it, which is used in the grpc_cb callbacks provided by the test.
Without that, the alts_handshaker_client_test needs to go back to passing a dummy alts_tsi_handshaker object to the alts_handshaker_client objects it creates, just to have a valid argument in its ALTS RPC callbacks. I'm not a huge fan of this, but it seems to complement the injection of the callbacks from the tests.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 139 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Suggest adding a grpc_closure member to invoke_tsi_next_cb_args, so that we don't need to do an additional allocation here.
Done.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 283 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Why change this from returning an error to be an assertion? Handling the error without crashing seems preferable.
(Note that if you don't do this, then this function might as well be changed to return void, because it will never return any error.)
Changed to return void.
Outside of some unit tests that supply mocked versions of grpc_caller, grpc_caller is always call_start_batch_and_execute.
As far as I understand, call_start_batch_and_execute will always return OK unless we are misusing the API due to an internal programming error, so I figure it's better to crash than to continue with potential memory corruption.
I checked elsewhere and it looks like all other call_start_batch_and_execute call sites are asserting call_error == OK, e.g. https://github.com/grpc/grpc/blob/master/src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc#L857
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 357 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Unless this function is changed to return void, we should continue to check its result here.
Changed to void
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 415 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Same here.
Changed the return value to void.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 450 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Same here.
Changed the return value to void.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
What lock recursion are we avoiding here? The use-case is not clear to me.
Sorry, I really just meant general risks of ExecCtx stacking (since grpc_call_unref will create an inner ExecCtx).
In short, I think that if the ExecCtx stacking behavior was changed to schedule at the "root", e.g. if your comment in #20572 (comment) was implemented, then it would be safe to call this inline. Otherwise, the ExecCtx stacking risks re-entrant locking.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.h, line 100 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
I understand the need to reacquire the lock in that case, but is there a reasonable way to avoid exposing this publicly?
As I mentioned above, ideally, we should re-acquire the lock only in the case where we're using a CQ in its own thread; in the case where we're directly handling our own polling, we should not need to do so. So maybe the right thing here is to have alts_tsi_handshaker pass a callback to the client to use when handling a response. In the case where we're using a CQ, we can use a callback that acquires the mutex; in the case where we're handling our own polling, we can use a callback that just directly calls alts_handshaker_client_handle_response_locked().
Changed to do this. Note, however, that we do need to re-acquire the lock in both the dedicated-CQ and non-dedicated handshake cases: in the dedicated-CQ case, we're jumping back into the handshake client from the dedicated thread, and in the internal-polling-framework case, we are running from a call_start_batch_and_execute callback scheduled on the ExecCtx.
I've restructured the code so that the alts_handshake_client takes a callback that is meant to do whatever initial setup is needed in order to safely run handle_response_locked.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 288 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
I think we should structure things such that there's a lot less acquiring and releasing of the same mutex in the same thread. Specifically, I suggest the following:
- In handshaker_next(), acquire the lock, then check whether we need to create a channel. If we do, schedule a closure on the ExecCtx for alts_tsi_handshaker_maybe_create_channel() and return. Otherwise, immediately call alts_tsi_handshaker_continue_handshaker_next_locked().
- alts_tsi_handshaker_maybe_create_channel() should be a closure function that takes the alts_tsi_handshaker_continue_handshaker_next_args instance as its arg. It should acquire the lock after creating the channel and then call alts_tsi_handshaker_continue_handshaker_next_locked().
- alts_tsi_handshaker_continue_handshaker_next() should be renamed to alts_tsi_handshaker_continue_handshaker_next_locked(). It no longer needs to be a closure callback; instead of taking the alts_tsi_handshaker_continue_handshaker_next_args instance as its argument, it can directly take arguments similar to those of handshaker_next(). It should assume that the lock is already acquired by the caller instead of trying to acquire it itself. It can return some state indicating whether the callback should be invoked and with what parameters, which the caller can do after releasing the lock.
Thanks for the suggestion! I've changed the code to take this approach, I agree this is cleaner.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 299 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Is it the case that the handshaker service URL is set when the alts_tsi_handshaker is created and is not changed for the lifetime of the object? If so, we can probably just use it directly without acquiring the lock.
I would agree that this doesn't seem racy to access it outside of the lock, since concurrent writes to that field can't logically happen at this point.
I've changed the code though to have the alts_tsi_handshaker_continue_handshaker_next_args store a copy of the fields in alts_tsi_handshaker that it needs to access outside of the lock, which still allows us to get rid of the repeated locking and unlocking in the same thread.
I have a slight preference for this, as personally I think it helps readability to be strict about reading the handshaker object's fields only under the lock, even if overdone. I can go either way on this though, please let me know.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 423 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Please add a grpc_closure member to alts_tsi_handshaker_continue_handshaker_next_args, so that we can avoid a second allocation here.
Done.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 470 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Why is this function necessary? Immediately after this, we are calling alts_handshaker_client_destroy(). Can't that method cancel the call if necessary?
This is relevant in followup PR #20687, but I agree isn't relevant in this PR. I've reverted this change and will save for #20687 where it's needed
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 108 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This should be lower-case, since it's a simple accessor.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 152 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This should be lower-case as well.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 202 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
All of the GPR_ASSERT() calls in this function can be changed to use the gtest EXPECT or ASSERT macros now.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 242 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Not needed anymore; gtest will log the test start for you.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 256 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Not needed anymore.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 263 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Can just use std::unique_ptr<> here.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 272 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This can just be connect_loop_runners.clear();
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 349 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This can be lower-case.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 452 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Not needed.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 461 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Can just use std::unique_ptr<> here.
Done.
test/core/tsi/alts/handshaker/alts_concurrent_connectivity_test.cc, line 470 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
connect_loop_runners.clear();
Done.
General comment: I don't actually understand what the benefit is of adding the mutex to this code. The problem we're trying to solve here is just to do channel creation in the callback scheduled on the ExecCtx, and it seems like we've solved that problem by triggering that callback via handshaker_next(). Can't we just do that without adding all the complexity of the mutex? It's not clear to me what the benefit is of the additional complexity.
Reviewed 7 of 7 files at r4.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 163 at r4 (raw file):
* - is_ok: a boolean value indicating if the handshaker response is ok to read. */ void alts_handshaker_client_handle_response_ensure_locked(
The _ensure_locked() suffix isn't needed here. The fact that this acquires the lock is an implementation detail and shouldn't be leaked into the API.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):
Previously, apolcyn wrote…
I think this still has some usefulness in that the alts_handshaker_client_test can pass a dummy object to it, which is used in the grpc_cb callbacks provided by the test. Without that, the alts_handshaker_client_test needs to go back to creating a dummy alts_tsi_handshaker object for the alts_handshaker_client objects it creates, just to have a valid argument in its ALTS RPC callbacks. I'm not a huge fan of this but it seems to complement the injection of the callbacks from the tests.
I am not suggesting that alts_grpc_handshaker_client_create()
should not continue to allow the caller to pass in the callback argument. What I am saying is that I don't see a reason to store that argument in this struct.
It doesn't look like this field is actually used anywhere except in alts_grpc_handshaker_client_create(). We set this field from the parameter on line 523 and then use it on line 540 in the same function, and I don't see anywhere else that it gets used. So why bother storing it here?
Actually, same comment about the pre-existing grpc_cb field -- it doesn't actually seem to be used anywhere outside of alts_grpc_handshaker_client_create(). Do we actually need that field either?
Furthermore, since we're always just using these two arguments to create a closure, why not just have the caller pass in the closure in the first place? That should probably be done in a separate PR, though.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 283 at r3 (raw file):
Previously, apolcyn wrote…
Changed to return void.
Outside of some unit tests that supply mocked versions of grpc_caller, grpc_caller is always call_start_batch_and_execute. As far as I understand, call_start_batch_and_execute will always return OK unless we are misusing the API due to an internal programming error. So I figure it's better to crash than to continue with potential memory corruption. I checked elsewhere and it looks like all other call_start_batch_and_execute call sites are asserting call_error == OK, e.g. https://github.com/grpc/grpc/blob/master/src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc#L857
Historically, we've not been very good about avoiding assertions in grpc code, but we should be trying to avoid them whenever possible. I think that we should handle this by returning an error, just like the code was doing before this PR.
I'm not sure what you mean about memory corruption. Just because call_start_batch_and_execute() fails does not mean that there is any memory corruption. For example, we could somehow accidentally start a send_message op while there's already a send_message op pending. This is certainly not something the operation can recover from, but returning an error from the operation seems better than crashing the entire process.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):
Previously, apolcyn wrote…
Sorry, I really just meant general risks of
ExecCtx
stacking (sincegrpc_call_unref
will create an innerExecCtx
).In short, I think that if
ExecCtx
stacking behavior was changed to schedule at the "root", e.g. if your comment in #20572 (comment) was implemented, then it would be safe to call this inline. Otherwise, theExecCtx
stacking risks re-entrant locking.
I think a better solution to this would be to have an internal version of grpc_call_unref()
that does not create its own ExecCtx
.
(Note that my objection to #20572 was not about having an internal version of a C-core API function; it was about the fact that that particular internal version removed a safety check that I didn't think should be removed. I don't mind having internal versions of functions in general, and the most common reason that we do that today is to avoid creating nested ExecCtx
instances. In the long run, I still hope to eliminate the ExecCtx
completely, in which case most of these internal functions can go away as well. But for now, we can stick with this pattern for consistency.)
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 372 at r4 (raw file):
void* user_data; grpc_core::UniquePtr<char> handshaker_service_url; bool use_dedicated_cq;
This field should not be needed, because we should not need to create the channel if this is true.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 376 at r4 (raw file):
}; static void alts_tsi_handshaker_maybe_create_channel(void* arg,
We should not call this unless we need to create the channel, so let's drop maybe from the name.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 382 at r4 (raw file):
alts_tsi_handshaker* handshaker = next_args->handshaker; grpc_channel* channel = nullptr; if (!next_args->use_dedicated_cq) {
This check should not be needed, because we should not call this in the first place if we're using a dedicated CQ.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 416 at r4 (raw file):
reinterpret_cast<alts_tsi_handshaker*>(self); grpc_core::MutexLock lock(&handshaker->mu); if (handshaker->channel == nullptr) {
We should check use_dedicated_cq here, so that we don't need to bounce into the closure if we don't need to create the channel.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 417 at r4 (raw file):
grpc_core::MutexLock lock(&handshaker->mu); if (handshaker->channel == nullptr) { alts_tsi_handshaker_continue_handshaker_next_args* args =
Please add a comment here explaining why we need to bounce into a callback to create the channel, since this will not be obvious to readers in the future.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 438 at r4 (raw file):
tsi_result ok = alts_tsi_handshaker_continue_handshaker_next_locked( handshaker, received_bytes, received_bytes_size, cb, user_data); if (ok != TSI_OK) {
Don't we need to invoke the callback in this case?
Still addressing other comments.
My primary goal for this PR is really to get the new alts_concurrent_connectivity_test to pass. That test originally revealed not only the potential lock cycle issue but also a data race. The current code is racy around handling of shutdown. Before the synchronization added here, running the alts_concurrent_connectivity_test under TSAN will specifically reveal a data race around setting and checking of the "has shutdown" state, since tsi_handshaker_shutdown is free to run concurrently with an ALTS handshake RPC callback.
While putting a lock around the entire ALTS handshaker code might be overly generous, it doesn't cause any performance problems and I feel it simplifies things by just making it easy to prove that there aren't any data races.
Note too that a secondary goal of this PR is to prepare a base to reasonably add #20687 and #20722. Without this PR, the changes made in those PRs would otherwise make room for more races.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)
Adding the lock around the entire ALTS handshaker code is adding a lot of complexity. I would prefer to avoid this if we don't really need it. And it's not obvious to me why we need it for either of those follow-on PRs (although I have not reviewed them in detail).
If shutdown handling causes race conditions, then let's solve that problem more narrowly. For example, could we address it by making the shutdown variable in the parent handshaker object an atomic instead of a simple bool?
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)
We can possibly solve this more narrowly with atomics, but I'm worried that this is getting much more complicated with those new PRs (which add a deadline to ALTS handshake RPCs, and also make use of the RECV_STATUS_ON_CLIENT op and add a separate callback for it).
What do you think about refactoring this to use a combiner? That would require some new code for bouncing into the combiner, but could get rid of locking/unlocking complexity in this PR.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)
Reviewable status: 7 of 17 files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 136 at r2 (raw file):
Previously, apolcyn wrote…
How about just having the callback into the alts_tsi_handshaker assume that the lock is already held instead of trying to reacquire it?
Sorry I think we might be talking about different things here.... note that
invoke_tsi_next_cb_args->cb
field is the TSI handshaker API callback that was passed intotsi_handshaker_next
bysecurity_handshaker.cc
.I wanted to rule out the possibility of calling back into
security_handshaker.cc
while holding thealts_tsi_handshaker
lock, and then from there potentially calling back intoalts_tsi_handshaker
inline, and deadlocking.
Obsolete due to refactoring to drop the mutex.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):
Furthermore, since we're always just using these two arguments to create a closure, why not just have the caller pass in the closure in the first place? That should probably be done in a separate PR, though.
Good point, and yes I don't think grpc_cb needs to be stored in the struct either.
This suggestion seems best, but I've left it out of this PR for a future cleanup.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 283 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Historically, we've not been very good about avoiding assertions in grpc code, but we should be trying to avoid them whenever possible. I think that we should handle this by returning an error, just like the code was doing before this PR.
I'm not sure what you mean about memory corruption. Just because call_start_batch_and_execute() fails does not mean that there is any memory corruption. For example, we could somehow accidentally start a send_message op while there's already a send_message op pending. This is certainly not something the operation can recover from, but returning an error from the operation seems better than crashing the entire process.
Due to recent refactoring, this change to alts_handshaker_client_test.cc was reverted, so it now returns an error again.
That said, my motivation for this in general is that it seems like such a case, of e.g. accidentally sending an op while one is already in flight, would indicate that we have a bug and that our state machine is broken - in which case there's likely some assumptions about some object lifetime semantics that are being broken, leading to a chance of memory corruption in the process, which could have worse consequences than process exit.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
I think a better solution to this would be to have an internal version of
grpc_call_unref()
that does not create its ownExecCtx
.(Note that my objection to #20572 was not about having an internal version of a C-core API function; it was about the fact that that particular internal version removed a safety check that I didn't think should be removed. I don't mind having internal versions of functions in general, and the most common reason that we do that today is to avoid creating nested
ExecCtx
instances. In the long run, I still hope to eliminate theExecCtx
completely, in which case most of these internal functions can go away as well. But for now, we can stick with this pattern for consistency.)
Makes sense. I've made this change for grpc_channel_destroy, and added grpc_channel_destroy_internal and stopped throwing it to the ExecCtx.
I looked at doing this for grpc_call_unref but it actually seems tricky - I'm having a hard time seeing how to get rid of this ExecCtx::Flush() call: https://github.com/grpc/grpc/blob/master/src/core/lib/surface/call.cc#L604. So I've kept the scheduling of this on the ExecCtx for now.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 372 at r4 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This field should not be needed, because we should not need to create the channel if this is true.
Done.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 376 at r4 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
We should not call this unless we need to create the channel, so let's drop
maybe
from the name.
Done.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 382 at r4 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This check should not be needed, because we should not call this in the first place if we're using a dedicated CQ.
Done.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 416 at r4 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
We should check
use_dedicated_cq
here, so that we don't need to bounce into the closure if we don't need to create the channel.
Done.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 417 at r4 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Please add a comment here explaining why we need to bounce into a callback to create the channel, since this will not be obvious to readers in the future.
Done.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 438 at r4 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Don't we need to invoke the callback in this case?
We're returning synchronously here, so IIUC we don't want to invoke the callback.
As discussed offline, I've changed the code to fix the data races found by alts_concurrent_connectivity_test
more narrowly.
In doing so, a few more data races beyond the original ones popped up in the test, and recent changes address those too.
Specifically, the test caught a data race in the ALTS security connector code, because each ALTS security connector constructor was mutating the rpc_versions fields that belonged to the shared ALTS credentials object. See TSAN warning caught by test: https://paste.googleplex.com/6179302213156864
I also realized that the two fields of alts_tsi_handshaker that are accessed by tsi_handshaker_shutdown (which can run concurrently with the rest of the code) are shutdown and client, so I've made sure to just wrap access to those in the mutex.
With that, the test revealed one more data race that was more subtle: see https://paste.googleplex.com/6529646453587968 for the TSAN warning. This occurred when we were still running in continue_handshaker_next, after we had called alts_handshaker_client_start (which in turn invoked the call op batch) - tsi_handshaker_destroy raced with our continued access to the handshaker object. This points out that it's generally unsafe to access either the alts_tsi_handshaker or the alts_handshaker_client objects after call_start_batch_and_execute has been invoked in the handshake call, because as soon as we've done so, the TSI next API callback can be invoked from a different thread, which can drop all ownership on the TSI handshaker object and allow for tsi_handshaker_destroy to run.
This last subtlety especially leads me back to being in favor of a larger refactoring to serialize all of the ALTS handshake code somehow. However, since the current PR should now be correct, that change can be postponed to a later cleanup.
Reviewable status: 7 of 17 files reviewed, 15 unresolved discussions (waiting on @apolcyn, @markdroth, and @yihuazhang)
I do agree that we should consider some additional restructuring of this code in the future. But this looks pretty good for now.
The only remaining significant issue is the one about grpc_call_unref_internal(). Please let me know if you have any questions.
Reviewed 7 of 13 files at r5, 4 of 5 files at r6.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @apolcyn and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):
Previously, apolcyn wrote…
Furthermore, since we're always just using these two arguments to create a closure, why not just have the caller pass in the closure in the first place? That should probably be done in a separate PR, though.
Good point, and yes I don't think grpc_cb needs to be stored in the struct either. This suggestion seems best, but I've left it out of this PR for a future cleanup.
Let's go ahead and remove the grpc_cb member in this PR, since that's a trivial change. We can leave the API change for alts_grpc_handshaker_client_create() for a future PR.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 283 at r3 (raw file):
Previously, apolcyn wrote…
Due to recent refactoring, this change to
alts_handshaker_client_test.cc
was reverted, so it now returns an error again.That said, my motivation for this in general is that it seems like such a case, of e.g. accidentally sending an op while one is already in flight, would indicate that we have a bug and that our state machine is broken - in which case there's likely some assumptions about some object lifetime semantics that are being broken, leading to a chance of memory corruption in the process, which could have worse consequences than process exit.
I don't necessarily think that this kind of error automatically means memory corruption. It could potentially mean a memory leak, but it's not clear to me that it would be smashing the stack.
In any case, this looks good now.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):
Previously, apolcyn wrote…
Makes sense. I've made this change for
grpc_channel_destroy
, and addedgrpc_channel_destroy_internal
and stopped throwing it to the ExecCtx.I looked at doing this for
grpc_call_unref
but it actually seems tricky - I'm having a hard time seeing how to get rid of thisExecCtx::Flush()
call https://github.com/grpc/grpc/blob/master/src/core/lib/surface/call.cc#L604. So I've kept the scheduling of this on ExecCtx for now.
I don't think it's necessary to get rid of that ExecCtx::Flush() call. Why not just leave it there? It will flush whatever ExecCtx instance is the current one on the stack.
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 342 at r6 (raw file):
// at this point, since alts_handshaker_client_start_client/server // have potentially just started an op batch on the handshake call. // The completion callback of for that batch is unsynchronized and so
s/of for/for/
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @markdroth and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 58 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Let's go ahead and remove the
grpc_cb
member in this PR, since that's a trivial change. We can leave the API change foralts_grpc_handshaker_client_create()
for a future PR.
Done.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
I don't think it's necessary to get rid of that
ExecCtx::Flush()
call. Why not just leave it there? It will flush whateverExecCtx
instance is the current one on the stack.
Note that the potential lock inversion is mainly due to the nested ExecCtx flushing, rather than the ExecCtx stacking. As an example, I created an internal variant of grpc_call_unref that is an exact copy except that it doesn't create a new ExecCtx. This TSAN warning about potential lock inversion is then triggered by the test: https://paste.googleplex.com/4741553991974912. Note that TSAN complains because it sees us acquire a subchannel mutex in a callback run in the nested ExecCtx flush of grpc_call_unref_internal. Down the call stack, that same thread was holding a chttp2 connector mutex. In a different thread, TSAN saw that same chttp2 connector mutex get acquired while first holding the subchannel mutex.
I'm not sure how real a danger this specific TSAN warning is, but I think TSAN is at least warning about something potentially error-prone (it seems hard to be sure that we wouldn't accidentally be holding some lock while invoking grpc_call_unref_internal that will be attempted to be reacquired by a callback run in the nested ExecCtx flush).
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 342 at r6 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
s/of for/for/
Done.
Thanks very much, Alex, for fixing the issue in ALTS and for the cleanups! I only have some minor comments (mostly clarification). Also, just to confirm: the current PR did not use the combiner-based solution, right?
Reviewed 1 of 13 files at r5, 3 of 5 files at r7.
Reviewable status: 13 of 15 files reviewed, 5 unresolved discussions (waiting on @apolcyn and @markdroth)
src/core/lib/security/credentials/alts/alts_credentials.cc, line 44 at r7 (raw file):
? gpr_strdup(GRPC_ALTS_HANDSHAKER_SERVICE_URL) : gpr_strdup(handshaker_service_url)) { grpc_alts_set_rpc_protocol_versions(&options_->rpc_versions);
Thanks for catching the data race by using the concurrency test.
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):
Previously, apolcyn wrote…
Note that the potential lock inversion is mainly due to the nested ExecCtx flushing, rather than the ExecCtx stacking. As example, I created an internal variant of
grpc_call_unref
that is an exact copy except that it doesn't create a new ExecCtx. This TSAN warning about potential lock inversion is then triggered by the test: https://paste.googleplex.com/4741553991974912. Note that TSAN complains because it sees us acquire a subchannel mutex in a callback ran in the nested ExecCtx flush ofgrpc_call_unref_internal
. Down the call stack, that same thread was holding a chttp2 connector mutex. In a different thread, TSAN saw that same chttp2 connector mutex get acquired while first holding the subchannel mutex.I'm not sure how real of a danger this specific TSAN warning really is, but I think TSAN is at least warning about something potentially error prone (it seems hard to be sure that wouldn't accidentially be holding some lock while invoking
grpc_call_unref_internal
, that will be attempted to be reacquired by a callback ran in the nested ExecCtx flush).
Just want to understand the terminology of ExecCtx stacking: does it mean to stack the closure on the queue belonging to the ExecCtx? Or does it mean something else?
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 58 at r7 (raw file):
bool use_dedicated_cq; // mu synchronizes all fields below gpr_mu mu;
Could you please add some comments explaining why only these fields are protected by mutex, not the rest?
Reviewable status: 13 of 15 files reviewed, 5 unresolved discussions (waiting on @jiangtaoli2016, @markdroth, and @yihuazhang)
src/core/lib/security/credentials/alts/alts_credentials.cc, line 44 at r7 (raw file):
Previously, yihuazhang (yihuaz) wrote…
Thanks for catching the data race by using the concurrency test.
thanks
src/core/tsi/alts/handshaker/alts_handshaker_client.h, line 163 at r4 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
The
_ensure_locked()
suffix isn't needed here. The fact that this acquires the lock is an implementation detail and shouldn't be leaked into the API.
Done, this is no longer here after recent larger changes
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):
Previously, yihuazhang (yihuaz) wrote…
Just want to understand the terminology of
ExecCtx
stacking, does it mean to stack the closure on the queue belonging to theExecCtx
? or does it mean something else?
I just mean instantiating more than one ExecCtx object in the same call stack, e.g. the following possible degenerate scenario:
gpr_mu g_lock;
bar() {
// deadlock
grpc_core::MutexLock(&g_lock);
}
foo() {
ExecCtx exec_ctx;
...
GRPC_CLOSURE_SCHED(bar_closure, ...);
}
main() {
ExecCtx exec_ctx;
{
grpc_core::MutexLock(&g_lock);
foo();
}
}
src/core/tsi/alts/handshaker/alts_tsi_handshaker.cc, line 58 at r7 (raw file):
Previously, yihuazhang (yihuaz) wrote…
Could you please add some comments explaining why only these fields are protected by mutex, not the rest?
Done.
Thanks for cleaning this up, Alex. Please feel free to merge after addressing my remaining comment.
Reviewed 4 of 5 files at r7, 1 of 1 files at r8.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @jiangtaoli2016, @yashykt, and @yihuazhang)
src/core/tsi/alts/handshaker/alts_handshaker_client.cc, line 475 at r3 (raw file):
Previously, apolcyn wrote…
I just mean instantiating more than one
ExecCtx
object in the same call stack, and e.g. the following possible degenerate scenario.gpr_mu g_lock; bar() { // deadlock grpc_core::MutexLock(&g_lock); } foo() { ExecCtx exec_ctx; ... GRPC_CLOSURE_SCHED(bar_closure, ...); } main() { ExecCtx exec_ctx; { grpc_core::MutexLock(&g_lock); foo(); } }
Ugh, what a mess.
@yashykt, when we're done with your work to decouple closures from schedulers, we really need to find a way to either eliminate ExecCtx in favor of EventManager or do something to prevent ExecCtx stacking from ever occurring.
Alex, I guess this is fine for now, but please make this comment a TODO, so that it's clear that this is something that should be fixed.
Thanks for the detailed review! Squashing down commits before merging.
440ce44
to
6bb8629
Compare
This PR adds a few new ALTS tests that make use of the pre-existing fake ALTS handshake server which exercise more concurrent ALTS handshake successes and failures, and try to catch races, etc.
Before the ALTS handshake code changes in this PR, the new tests expose a data race and a potential lock cycle when ran under TSAN. The potential lock cycle is described in the description of #20572, and an example data race TSAN error that the test would show is in https://paste.googleplex.com/5526592207978496.
To address the race, this PR refactors the ALTS handshaker to associate a mutex with each alts_tsi_handshaker object. The alts_tsi_handshaker object acquires this mutex at:
Due to the refactoring, this PR needed to make some changes to the alts_tsi_handshaker_test.cc and alts_handshaker_client_test.cc files to adjust to the new internal APIs. One thing to note is that this gets rid of the schedule_request_grpc_call_failure_test test - I changed the code to assert that call_start_batch is successful (I believe an error in call_start_batch could only be due to an internal logic error), so that failure is no longer unit-testable.