storage_service: wait for NORMAL state handler before setup_group0()

`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 5363616, then in
79ee381.

Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly, we must
first ensure that the gossiper sees at least one live node, so we precede it
with `wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` already came after this wait).

Also ensure that `wait_for_normal_state_handled_on_boot()` doesn't hang
in raft-topology mode by adding
`_normal_state_handled_on_boot.insert(endpoint);` in the raft-topology
branch of `handle_state_normal`.

Fixes: scylladb#12972
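For context, here is a minimal standalone model of the handler tracking this relies on. It is only a sketch, not Scylla code: the real implementation is built on Seastar coroutines and gossiper notifications, and the `normal_state_tracker` class, the `mark_handled` helper, the string endpoints and the std-library synchronization below are hypothetical simplifications; only the `_normal_state_handled_on_boot` and `wait_for_normal_state_handled_on_boot` names come from the patch.

    #include <condition_variable>
    #include <mutex>
    #include <string>
    #include <unordered_set>

    class normal_state_tracker {
        std::mutex _mtx;
        std::condition_variable _cv;
        std::unordered_set<std::string> _normal_state_handled_on_boot;
    public:
        // Called once NORMAL state for `endpoint` has been fully handled
        // (after this commit, also in the raft-topology early-return branch,
        // so the boot-time wait below cannot hang in that mode).
        void mark_handled(const std::string& endpoint) {
            std::lock_guard<std::mutex> lk(_mtx);
            _normal_state_handled_on_boot.insert(endpoint);
            _cv.notify_all();
        }

        // Blocks until every node in `sync_nodes` has had its NORMAL state handled.
        // With an empty `sync_nodes` (e.g. gossip has not fetched any peers yet)
        // it returns immediately -- which is why the caller must first wait for
        // a live node to show up.
        void wait_for_normal_state_handled_on_boot(const std::unordered_set<std::string>& sync_nodes) {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [&] {
                for (const auto& n : sync_nodes) {
                    if (_normal_state_handled_on_boot.count(n) == 0) {
                        return false;
                    }
                }
                return true;
            });
        }
    };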
kbr-scylla committed Jun 27, 2023
1 parent f2696ec commit 69c1125
Showing 1 changed file with 22 additions and 13 deletions.
service/storage_service.cc
@@ -1969,6 +1969,27 @@ future<> storage_service::join_token_ring(cdc::generation_service& cdc_gen_servi
     auto advertise = gms::advertise_myself(!replacing_a_node_with_same_ip);
     co_await _gossiper.start_gossiping(generation_number, app_states, advertise);
 
+    if (should_bootstrap()) {
+        // Wait for NORMAL state handlers to finish for existing nodes now, so that connection dropping
+        // (happening at the end of `handle_state_normal`: `notify_joined`) doesn't interrupt
+        // group 0 joining or repair. (See #12764, #12956, #12972, #13302)
+        //
+        // But before we can do that, we must make sure that gossip sees at least one other node
+        // and fetches the list of peers from it; otherwise `wait_for_normal_state_handled_on_boot`
+        // may trivially finish without waiting for anyone.
+        co_await wait_for_live_node_to_show_up();
+
+        auto ignore_nodes = ri
+            ? parse_node_list(_db.local().get_config().ignore_dead_nodes_for_replace(), get_token_metadata())
+            // TODO: specify ignore_nodes for bootstrap
+            : std::unordered_set<gms::inet_address>{};
+        auto sync_nodes = co_await get_nodes_to_sync_with(ignore_nodes);
+        if (ri) {
+            sync_nodes.erase(ri->address);
+        }
+        co_await wait_for_normal_state_handled_on_boot(sync_nodes);
+    }
+
     assert(_group0);
     // if the node is bootstrapped the functin will do nothing since we already created group0 in main.cc
     co_await _group0->setup_group0(_sys_ks.local(), initial_contact_nodes, raft_replace_info, *this, qp, _migration_manager.local(), cdc_gen_service);
@@ -2062,7 +2083,6 @@ future<> storage_service::join_token_ring(cdc::generation_service& cdc_gen_servi
 
     // if our schema hasn't matched yet, keep sleeping until it does
     // (post CASSANDRA-1391 we don't expect this to be necessary very often, but it doesn't hurt to be careful)
-    co_await wait_for_live_node_to_show_up();
     co_await wait_for_ring_to_settle();
 
     if (!replace_address) {
@@ -2282,18 +2302,6 @@ future<> storage_service::bootstrap(cdc::generation_service& cdc_gen_service, st
         }
     }
 
-    {
-        // Wait for normal state handler to finish for existing nodes in the cluster.
-        auto ignore_nodes = replacement_info ? parse_node_list(_db.local().get_config().ignore_dead_nodes_for_replace(), get_token_metadata())
-                // TODO: specify ignore_nodes for bootstrap
-                : std::unordered_set<gms::inet_address>{};
-        auto sync_nodes = get_nodes_to_sync_with(ignore_nodes).get();
-        if (replacement_info) {
-            sync_nodes.erase(replacement_info->address);
-        }
-        wait_for_normal_state_handled_on_boot(sync_nodes).get();
-    }
-
     if (!replacement_info) {
         // Even if we reached this point before but crashed, we will make a new CDC generation.
         // It doesn't hurt: other nodes will (potentially) just do more generation switches.
@@ -2450,6 +2458,7 @@ future<> storage_service::handle_state_normal(inet_address endpoint) {
 
     if (_raft_topology_change_enabled) {
         slogger.debug("ignore handle_state_normal since topology change are using raft");
+        _normal_state_handled_on_boot.insert(endpoint);
        co_return;
    }
 
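As a quick recap of the ordering the patch establishes in `join_token_ring`, here is a simplified, compilable sketch (not Scylla code): the stub functions and synchronous style are stand-ins for the real Seastar coroutines, and only the function names are taken from the diff.

    #include <string>
    #include <unordered_set>

    // Stubs standing in for the real Scylla routines (names taken from the diff).
    static void start_gossiping() {}
    static void wait_for_live_node_to_show_up() {}        // gossip must see a peer first
    static bool should_bootstrap() { return true; }
    static std::unordered_set<std::string> get_nodes_to_sync_with() { return {"peer1", "peer2"}; }
    static void wait_for_normal_state_handled_on_boot(const std::unordered_set<std::string>&) {}
    static void setup_group0() {}                          // group 0 join + snapshot transfer

    // New ordering after this commit: the wait precedes setup_group0(), so
    // handle_state_normal cannot drop connections mid snapshot transfer.
    void join_token_ring_sketch() {
        start_gossiping();
        if (should_bootstrap()) {
            wait_for_live_node_to_show_up();
            auto sync_nodes = get_nodes_to_sync_with();
            wait_for_normal_state_handled_on_boot(sync_nodes);
        }
        setup_group0();
    }

    int main() { join_token_ring_sketch(); }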
