Skip to content

Commit

Permalink
storage_service: use token_metadata to calculate nodes waited for t…
Browse files Browse the repository at this point in the history
…o be UP

At bootstrap, after we start gossiping, we calculate a set of nodes
(`sync_nodes`) which we need to "synchronize" with, waiting for them to
be UP before proceeding; these nodes are required for streaming/repair
and CDC generation data write, and generally are supposed to constitute
the current set of cluster members.

In scylladb#14468 and scylladb#14487 we observed that this set may calculate entries
corresponding to nodes that were just replaced or changed their IPs
(but the old-IP entry is still there). We pass them to
`_gossiper.wait_alive` and the call eventually times out.

We need a better way to calculate `sync_nodes` which detects ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

Fixes scylladb#14468
Fixes scylladb#14487
  • Loading branch information
kbr-scylla committed Jul 4, 2023
1 parent b675b15 commit 680fb1f
Showing 1 changed file with 24 additions and 13 deletions.
37 changes: 24 additions & 13 deletions service/storage_service.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1958,26 +1958,37 @@ future<> storage_service::join_token_ring(cdc::generation_service& cdc_gen_servi
// Node state changes are propagated to the cluster through explicit global barriers.
co_await wait_for_normal_state_handled_on_boot();

auto ignore_nodes = ri
? parse_node_list(_db.local().get_config().ignore_dead_nodes_for_replace(), get_token_metadata())
// TODO: specify ignore_nodes for bootstrap
: std::unordered_set<gms::inet_address>{};
auto sync_nodes = co_await get_nodes_to_sync_with(ignore_nodes);
if (ri) {
sync_nodes.erase(ri->address);
}

// NORMAL doesn't necessarily mean UP (#14042). Wait for these nodes to be UP as well
// to reduce flakiness (we need them to be UP to perform CDC generation write and for repair/streaming).
//
// This could be done in Raft topology mode as well, but the calculation of nodes to sync with
// has to be done based on topology state machine instead of gossiper as it is here;
// furthermore, the place in the code where we do this has to be different (it has to be coordinated
// by the topology coordinator after it joins the node to the cluster).
std::vector<gms::inet_address> sync_nodes_vec{sync_nodes.begin(), sync_nodes.end()};
slogger.info("Waiting for nodes {} to be alive", sync_nodes_vec);
co_await _gossiper.wait_alive(sync_nodes_vec, std::chrono::seconds{30});
slogger.info("Nodes {} are alive", sync_nodes_vec);
//
// We calculate nodes to wait for based on token_metadata. Previously we would use gossiper
// directly for this, but gossiper may still contain obsolete entries from 1. replaced nodes
// and 2. nodes that have changed their IPs; these entries are eventually garbage-collected,
// but here they may still be present if we're performing topology changes in quick succession.
// `token_metadata` has all host ID / token collisions resolved so in particular it doesn't contain
// these obsolete IPs. Refs: #14487, #14468
auto& tm = get_token_metadata();
auto ignore_nodes = ri
? parse_node_list(_db.local().get_config().ignore_dead_nodes_for_replace(), tm)
// TODO: specify ignore_nodes for bootstrap
: std::unordered_set<gms::inet_address>{};

std::vector<gms::inet_address> sync_nodes;
tm.get_topology().for_each_node([&] (const locator::node* np) {
auto ep = np->endpoint();
if (!ignore_nodes.contains(ep) && (!ri || ep != ri->address)) {
sync_nodes.push_back(ep);
}
});

slogger.info("Waiting for nodes {} to be alive", sync_nodes);
co_await _gossiper.wait_alive(sync_nodes, std::chrono::seconds{30});
slogger.info("Nodes {} are alive", sync_nodes);
}

assert(_group0);
Expand Down

0 comments on commit 680fb1f

Please sign in to comment.