Skip to content

Commit

Permalink
raft: server: add workaround for scylladb#12972
Browse files Browse the repository at this point in the history
When a node joins the cluster, it closes connections after learning
topology information from other nodes, in order to reopen them with
correct encryption, compression etc.

In ScyllaDB 5.2, this mechanism may interrupt an ongoing Raft snapshot
transfer. This was fixed in later versions by putting some order into
the bootstrap process with 50e8ec7 but
the fix was not backported due to many prerequisites and complexity.

Raft automatically recovers from interrupted snapshot transfer by
retrying it eventually, and everything works. However an ERROR is
reported due to that one failed snapshot transfer, and dtests dont like
ERRORs -- they report the test case as failed if an ERROR happened in
any node's logs even if the test passed otherwise.

Here we apply a simple workaround to please dtests -- in this particular
scenario, turn the ERROR into a WARN.
  • Loading branch information
kbr-scylla committed Feb 1, 2024
1 parent 84004ab commit cbe8e05
Showing 1 changed file with 17 additions and 1 deletion.
18 changes: 17 additions & 1 deletion raft/server.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1064,6 +1064,18 @@ future<> server_impl::io_fiber(index_t last_stable) {
co_return;
}

static bool is_closed_error(std::exception_ptr ep) {
try {
std::rethrow_exception(ep);
} catch (const seastar::rpc::remote_verb_error& e) {
return std::string_view{e.what()} == "connection is closed";
} catch (const seastar::rpc::closed_error&) {
return true;
} catch (...) {
return false;
}
}

void server_impl::send_snapshot(server_id dst, install_snapshot&& snp) {
seastar::abort_source as;
uint64_t id = _next_snapshot_transfer_id++;
Expand All @@ -1079,7 +1091,11 @@ void server_impl::send_snapshot(server_id dst, install_snapshot&& snp) {
_snapshot_transfers.erase(dst);
auto reply = raft::snapshot_reply{.current_term = _fsm->get_current_term(), .success = false};
if (f.failed()) {
logger.error("[{}] Transferring snapshot to {} failed with: {}", _id, dst, f.get_exception());
auto ep = f.get_exception();
// Report our or remote's closed_error as WARNs instead of ERRORs.
// Workaround for scylladb/scylladb#12972 for ScyllaDB 5.2.
auto level = is_closed_error(ep) ? log_level::warn : log_level::error;
logger.log(level, "[{}] Transferring snapshot to {} failed with: {}", _id, dst, ep);
} else {
logger.trace("[{}] Transferred snapshot to {}", _id, dst);
reply = f.get();
Expand Down

0 comments on commit cbe8e05

Please sign in to comment.