From f2a59c772c91932d481040de6bd7efda36e7f2ff Mon Sep 17 00:00:00 2001
From: Kamil Braun
Date: Thu, 4 Apr 2024 13:53:48 +0200
Subject: [PATCH] gossiper: lock local endpoint when updating heart_beat

In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper. For
example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and
  enter NORMAL state, propagating this information through gossiper.
  However, other nodes would never observe that the node entered NORMAL
  state, still thinking that it is in joining state. This would lead to
  further bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting
  for schema versions to converge. Convergence would never be achieved
  and the test eventually timed out. The node was observing an outdated
  schema state of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait with a timeout
until they all observe each other as NORMAL. Unfortunately, thousands
of runs of this test on different machines failed to reproduce the
problem. After banging my head against the wall failing to reproduce, I
decided to sprinkle randomized sleeps across multiple places in
gossiper code, and finally the test started catching the problem in
about 1 in 1000 runs. With additional logging and additional
head-banging, I determined the root cause.

The following scenario can happen; 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
    auto local_state = *ep_state_before;
    for (auto& p : states) {
        auto& state = p.first;
        auto& value = p.second;
        value = versioned_value::clone_with_higher_version(value);
        local_state.add_application_state(state, value);
    }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`.
- `replicate` works in 2 phases to achieve exception safety: in the
  first phase it copies the updated `local_state` to all shards into a
  separate map. In the second phase the values from the separate map
  are used to overwrite the endpoint_state map used for gossiping. Due
  to the cross-shard calls of the first phase, there is a yield before
  the second phase. *During this yield* the following happens:
- the `gossiper::run()` loop on B executes and bumps node B's
  `heart_beat`. This uses the monotonic version_generator, so it uses a
  higher version than the ones we used for the states added above.
  Let's call this new version X. Note that X is larger than the
  versions used by the application_states added above.
- now node B handles a SYN or ACK message from node A, creating an ACK
  or ACK2 message in response. This message contains:
  - the old application states (NOT including the update described
    above, because `replicate` is still sleeping before phase 2),
  - but the bumped heart_beat == X from the `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max version across
  all states (including heart_beat) of node B is X. This means that it
  will no longer request or apply states from node B with versions
  smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites the
  endpoint_state with the copy it saved in phase 1. In particular it
  reverts heart_beat back to a smaller value, but the larger problem is
  that it saves the updated application_states that use versions
  smaller than X.
- now when node B sends the updated application_states in an ACK or
  ACK2 message to node A, node A will ignore them, because their
  versions are smaller than X. Or node B will never send them, because
  whenever node A requests states from node B, it only requests states
  with versions > X. Either way, node A will fail to observe the new
  states of node B.
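For illustration only (not part of the patch), here is a minimal,
self-contained C++ sketch of the version inversion described above. It
is not Scylla code: `bump_version` stands in for the monotonic
gms/version_generator, and the maps stand in for node B's gossiped
endpoint state and node A's view of it; all names here are invented for
the sketch.
```
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>

static int version_counter = 0;
static int bump_version() { return ++version_counter; }  // monotonic, like version_generator

int main() {
    std::map<std::string, int> published;  // node B's endpoint state: name -> version
    int heart_beat = bump_version();       // B's heart_beat before the scenario

    // Phase 1 of replicate(): B copies its state and adds STATUS with a
    // higher version, but yields before publishing the copy.
    auto pending = published;
    pending["STATUS"] = bump_version();

    // During the yield, gossiper::run() bumps the heart_beat to version X.
    int X = bump_version();
    heart_beat = X;

    // B answers A with its still-unpublished (old) states plus heart_beat == X,
    // so A records X as the max version it has seen from B.
    int max_seen_by_A = heart_beat;
    for (const auto& [name, ver] : published) {
        max_seen_by_A = std::max(max_seen_by_A, ver);
    }

    // Phase 2 of replicate(): B finally publishes the pending states,
    // whose versions are all smaller than X.
    published = std::move(pending);

    // A ignores (or never requests) anything with a version <= X,
    // so it never learns about B's new STATUS.
    for (const auto& [name, ver] : published) {
        if (ver <= max_seen_by_A) {
            std::printf("A never applies %s (version %d <= X = %d)\n", name.c_str(), ver, X);
        }
    }
    return 0;
}
```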
If I understand correctly, this is a regression introduced in
38c2347a3ce298945d47d258c93ec3858a43c4fc, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically
on shard 0; there could be no `heart_beat` bump in between making a
copy of the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent reproducer
for the problem -- introduce a longer sleep in
`add_local_application_state` before the second phase of `replicate`,
to increase the chance that the gossiper loop will execute and bump the
heart_beat version during the yield. A further commit adds a test based
on that.

The fix is to bump the heart_beat under the local endpoint lock, which
is also taken by `replicate`.

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Fixes: scylladb/scylla-enterprise#3720
---
 gms/gossiper.cc | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/gms/gossiper.cc b/gms/gossiper.cc
index a17a4164011b..7933266e857c 100644
--- a/gms/gossiper.cc
+++ b/gms/gossiper.cc
@@ -1059,11 +1059,15 @@ void gossiper::run() {
         //wait on messaging service to start listening
         // MessagingService.instance().waitUntilListening();
 
-        /* Update the local heartbeat counter. */
-        heart_beat_state& hbs = my_endpoint_state().get_heart_beat_state();
-        hbs.update_heart_beat();
+        {
+            auto permit = lock_endpoint(get_broadcast_address(), null_permit_id).get();
+            /* Update the local heartbeat counter. */
+            heart_beat_state& hbs = my_endpoint_state().get_heart_beat_state();
+            hbs.update_heart_beat();
+
+            logger.trace("My heartbeat is now {}", hbs.get_heart_beat_version());
+        }
 
-        logger.trace("My heartbeat is now {}", hbs.get_heart_beat_version());
         utils::chunked_vector<gossip_digest> g_digests;
         this->make_random_gossip_digest(g_digests);
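For context, here is a hedged sketch (again not Scylla code) of the
locking discipline the patch establishes: the heartbeat bump and the
copy-then-publish replication both take the same per-endpoint lock, so
the bump can no longer interleave with the window between the two
phases. A plain std::mutex stands in for the gossiper's permit-based
`lock_endpoint`; all other names are invented for the sketch.
```
#include <map>
#include <mutex>
#include <string>

static int version_counter = 0;
static int bump_version() { return ++version_counter; }

struct local_endpoint {
    std::mutex lock;                        // stands in for the per-endpoint lock
    int heart_beat_version = 0;
    std::map<std::string, int> app_states;  // state name -> version
};

// gossiper::run() path: after the patch, the heartbeat bump happens under the lock.
void update_heart_beat(local_endpoint& ep) {
    std::lock_guard<std::mutex> guard(ep.lock);
    ep.heart_beat_version = bump_version();
}

// replicate() path: the copy, the version bump and the publish all happen
// while the same lock is held, so no heartbeat bump can land in between.
void replicate(local_endpoint& ep, const std::string& state) {
    std::lock_guard<std::mutex> guard(ep.lock);
    auto pending = ep.app_states;           // "phase 1": copy and update
    pending[state] = bump_version();
    ep.app_states = std::move(pending);     // "phase 2": publish
}

int main() {
    local_endpoint ep;
    replicate(ep, "STATUS");
    update_heart_beat(ep);
    return 0;
}
```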