storage_service: wait for NORMAL state handler before setup_group0()

`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 5363616, then in
79ee381.

Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly, we must
first ensure that the gossiper sees at least one live node, so we precede it
with `wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` already came after this wait).

Also ensure that `wait_for_normal_state_handled_on_boot()` doesn't hang
in raft-topology mode by adding
`_normal_state_handled_on_boot.insert(endpoint);` in the raft-topology
branch of `handle_state_normal`.

Fixes: scylladb#12972
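For context, here is a minimal standalone model of the handler tracking this relies on. It is only a sketch, not Scylla code: the real implementation is built on Seastar coroutines and gossiper notifications, and the `normal_state_tracker` class, the `mark_handled` helper, the string endpoints and the std-library synchronization below are hypothetical simplifications; only the `_normal_state_handled_on_boot` and `wait_for_normal_state_handled_on_boot` names come from the patch.

    #include <condition_variable>
    #include <mutex>
    #include <string>
    #include <unordered_set>

    class normal_state_tracker {
        std::mutex _mtx;
        std::condition_variable _cv;
        std::unordered_set<std::string> _normal_state_handled_on_boot;
    public:
        // Called once NORMAL state for `endpoint` has been fully handled
        // (after this commit, also in the raft-topology early-return branch,
        // so the boot-time wait below cannot hang in that mode).
        void mark_handled(const std::string& endpoint) {
            std::lock_guard<std::mutex> lk(_mtx);
            _normal_state_handled_on_boot.insert(endpoint);
            _cv.notify_all();
        }

        // Blocks until every node in `sync_nodes` has had its NORMAL state handled.
        // With an empty `sync_nodes` (e.g. gossip has not fetched any peers yet)
        // it returns immediately -- which is why the caller must first wait for
        // a live node to show up.
        void wait_for_normal_state_handled_on_boot(const std::unordered_set<std::string>& sync_nodes) {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [&] {
                for (const auto& n : sync_nodes) {
                    if (_normal_state_handled_on_boot.count(n) == 0) {
                        return false;
                    }
                }
                return true;
            });
        }
    };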
kbr-scylla committed Jun 27, 2023
1 parent f2696ec commit 69c1125
Showing 1 changed file with 22 additions and 13 deletions.
service/storage_service.cc
@@ -1969,6 +1969,27 @@ future<> storage_service::join_token_ring(cdc::generation_service& cdc_gen_servi
     auto advertise = gms::advertise_myself(!replacing_a_node_with_same_ip);
     co_await _gossiper.start_gossiping(generation_number, app_states, advertise);
 
+    if (should_bootstrap()) {
+        // Wait for NORMAL state handlers to finish for existing nodes now, so that connection dropping
+        // (happening at the end of `handle_state_normal`: `notify_joined`) doesn't interrupt
+        // group 0 joining or repair. (See #12764, #12956, #12972, #13302)
+        //
+        // But before we can do that, we must make sure that gossip sees at least one other node
+        // and fetches the list of peers from it; otherwise `wait_for_normal_state_handled_on_boot`
+        // may trivially finish without waiting for anyone.
+        co_await wait_for_live_node_to_show_up();
+
+        auto ignore_nodes = ri
+            ? parse_node_list(_db.local().get_config().ignore_dead_nodes_for_replace(), get_token_metadata())
+            // TODO: specify ignore_nodes for bootstrap
+            : std::unordered_set<gms::inet_address>{};
+        auto sync_nodes = co_await get_nodes_to_sync_with(ignore_nodes);
+        if (ri) {
+            sync_nodes.erase(ri->address);
+        }
+        co_await wait_for_normal_state_handled_on_boot(sync_nodes);
+    }
+
     assert(_group0);
     // if the node is bootstrapped the functin will do nothing since we already created group0 in main.cc
     co_await _group0->setup_group0(_sys_ks.local(), initial_contact_nodes, raft_replace_info, *this, qp, _migration_manager.local(), cdc_gen_service);
@@ -2062,7 +2083,6 @@ future<> storage_service::join_token_ring(cdc::generation_service& cdc_gen_servi
 
     // if our schema hasn't matched yet, keep sleeping until it does
     // (post CASSANDRA-1391 we don't expect this to be necessary very often, but it doesn't hurt to be careful)
-    co_await wait_for_live_node_to_show_up();
     co_await wait_for_ring_to_settle();
 
     if (!replace_address) {
@@ -2282,18 +2302,6 @@ future<> storage_service::bootstrap(cdc::generation_service& cdc_gen_service, st
         }
     }
 
-    {
-        // Wait for normal state handler to finish for existing nodes in the cluster.
-        auto ignore_nodes = replacement_info ? parse_node_list(_db.local().get_config().ignore_dead_nodes_for_replace(), get_token_metadata())
-                // TODO: specify ignore_nodes for bootstrap
-                : std::unordered_set<gms::inet_address>{};
-        auto sync_nodes = get_nodes_to_sync_with(ignore_nodes).get();
-        if (replacement_info) {
-            sync_nodes.erase(replacement_info->address);
-        }
-        wait_for_normal_state_handled_on_boot(sync_nodes).get();
-    }
-
     if (!replacement_info) {
         // Even if we reached this point before but crashed, we will make a new CDC generation.
         // It doesn't hurt: other nodes will (potentially) just do more generation switches.
@@ -2450,6 +2458,7 @@ future<> storage_service::handle_state_normal(inet_address endpoint) {
 
     if (_raft_topology_change_enabled) {
         slogger.debug("ignore handle_state_normal since topology change are using raft");
+        _normal_state_handled_on_boot.insert(endpoint);
        co_return;
    }
 
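As a quick recap of the ordering the patch establishes in `join_token_ring`, here is a simplified, compilable sketch (not Scylla code): the stub functions and synchronous style are stand-ins for the real Seastar coroutines, and only the function names are taken from the diff.

    #include <string>
    #include <unordered_set>

    // Stubs standing in for the real Scylla routines (names taken from the diff).
    static void start_gossiping() {}
    static void wait_for_live_node_to_show_up() {}        // gossip must see a peer first
    static bool should_bootstrap() { return true; }
    static std::unordered_set<std::string> get_nodes_to_sync_with() { return {"peer1", "peer2"}; }
    static void wait_for_normal_state_handled_on_boot(const std::unordered_set<std::string>&) {}
    static void setup_group0() {}                          // group 0 join + snapshot transfer

    // New ordering after this commit: the wait precedes setup_group0(), so
    // handle_state_normal cannot drop connections mid snapshot transfer.
    void join_token_ring_sketch() {
        start_gossiping();
        if (should_bootstrap()) {
            wait_for_live_node_to_show_up();
            auto sync_nodes = get_nodes_to_sync_with();
            wait_for_normal_state_handled_on_boot(sync_nodes);
        }
        setup_group0();
    }

    int main() { join_token_ring_sketch(); }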
