Skip to content

Commit

Permalink
storage_service: node_ops_cmd_handler: decommission rollback, ignore …
Browse files Browse the repository at this point in the history
…the node if's already removed

This is a regression after scylladb#15903. Before these changes
del_leaving_endpoint took IP as a parameter and did nothing
if it was called with a non-existent IP.

The problem was revealed by the dtest test_remove_garbage_members_from_group0_after_abort_decommission[Announcing_that_I_have_left_the_ring-]. The test was
flaky as in most cases the node died before the
gossiper notification reached all the other nodes. To make
it fail consistently and reproduce the problem one
can move the info log 'Announcing that I have' after
the sleep and add additional sleep after it in
storage_service::leave_ring function.
  • Loading branch information
Petr Gusev committed Dec 21, 2023
1 parent ee203f8 commit 6531bb7
Showing 1 changed file with 10 additions and 1 deletion.
11 changes: 10 additions & 1 deletion service/storage_service.cc
Original file line number Diff line number Diff line change
Expand Up @@ -5544,8 +5544,17 @@ future<node_ops_cmd_response> storage_service::node_ops_cmd_handler(gms::inet_ad
node_ops_insert(ops_uuid, coordinator, std::move(req.ignore_nodes), [this, coordinator, req = std::move(req)] () mutable {
return mutate_token_metadata([this, coordinator, req = std::move(req)] (mutable_token_metadata_ptr tmptr) mutable {
for (auto& node : req.leaving_nodes) {
// Decommission calls leave_ring() as one of its last steps.
// The leave_ring() function removes the endpoint from the local token_metadata,
// sends a notification about this through gossiper (node status becomes 'left')
// and waits for ring_delay. It's possible the node being decommissioned might
// die after it has sent this notification. If this happens, the node would
// have already been removed from this token_metadata, so we wouldn't find it here.
const auto node_id = tmptr->get_host_id_if_known(node);
slogger.info("decommission[{}]: Removed node={} as leaving node, coordinator={}", req.ops_uuid, node, coordinator);
tmptr->del_leaving_endpoint(tmptr->get_host_id(node));
if (node_id) {
tmptr->del_leaving_endpoint(*node_id);
}
}
return update_topology_change_info(tmptr, ::format("decommission {}", req.leaving_nodes));
});
Expand Down

0 comments on commit 6531bb7

Please sign in to comment.