Clickhouse Keeper aborted shortly after its sled was restarted during an update #9190

@jgallagher

After today's Reconfigurator-based dogfood update, we had two clickhouse corefiles, both from Clickhouse Keeper zones, one from sled 8 and one from sled 17. Both have identical stacks and statuses. From mdb:

> $C
fffff5ffb9c045c0 libc.so.1`_lwp_kill+0xa()
fffff5ffb9c045f0 libc.so.1`raise+0x22(6)
fffff5ffb9c04640 libc.so.1`abort+0x94()
fffff5ffb9c04650 ~DB::KeeperStateManager::system_exit+8()
fffff5ffb9c046f0 nuraft::raft_server::send_reconnect_request+0x257()
fffff5ffb9c047a0 nuraft::raft_server::handle_prevote_resp+0x62d()
fffff5ffb9c04860 nuraft::raft_server::handle_peer_resp+0x537()
fffff5ffb9c04890 nuraft::cmd_result<std::shared_ptr<nuraft::resp_msg>, std::shared_ptr<nuraft::rpc_exception> >::set_result+0xd8()
fffff5ffb9c049a0 nuraft::peer::handle_rpc_result+0xb07()
fffff5ffb9c04a10 void std::__invoke_impl<void, void +0x147()
fffff5ffb9c04a30 std::_Function_handler<void +0x29()
fffff5ffb9c04bf0 nuraft::asio_rpc_client::response_read+0x5ae()
fffff5ffb9c04c40 boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::mutable_buffers_1, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, std::_Bind<void +0x149()
fffff5ffb9c04d80 boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::any_io_executor>, boost::asio::mutable_buffers_1, boost::asio::mutable_buffer const*, boost::asio::detail::transfer_all_t, std::_Bind<void +0x217()
fffff5ffb9c04e00 boost::asio::detail::scheduler::do_run_one+0x35b()
fffff5ffb9c04ef0 boost::asio::detail::scheduler::run+0x141()
fffff5ffb9c04f90 nuraft::asio_service_impl::worker_entry+0x27f()
fffff5ffb9c04fb0 libstdc++.so.6.0.32`execute_native_thread_routine+0x10()
fffff5ffb9c04fe0 libc.so.1`_thrp_setup+0x77(fffff5ffba83cd40)
fffff5ffb9c04ff0 libc.so.1`_lwp_start()
> ::status
debugging core file of clickhouse (64-bit) from oxz_clickhouse_keeper_73404218-b57f-48d7-87c7-824fea56a988
file: /pool/ext/027a82e8-daa3-4fa6-8205-ed03445e1086/crypt/zone/oxz_clickhouse_keeper_73404218-b57f-48d7-87c7-824fea56a988/root/opt/oxide/clickhouse_keeper/clickhouse
initial argv: /opt/oxide/clickhouse_keeper/clickhouse keeper --config /opt/oxide/clickhouse_k
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=18983 uid=0 code=-1

The timestamps on the corefiles are shortly after the sleds hosting these services were restarted for OS updates. The error logs from Clickhouse itself are also identical; these are the contents starting from the first line after the sled reboot on sled 17:

2025.10.09 17:15:08.672100 [ 139 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:15:08.677275 [ 138 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 1
2025.10.09 17:15:08.677359 [ 137 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 2
2025.10.09 17:15:10.609299 [ 12 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:15:10.609845 [ 138 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 3
2025.10.09 17:15:10.609917 [ 136 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 4
2025.10.09 17:15:12.458163 [ 12 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:15:12.458726 [ 138 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 5
2025.10.09 17:15:12.458805 [ 136 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 6
2025.10.09 17:15:13.783672 [ 134 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:15:13.784270 [ 138 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 7
2025.10.09 17:15:13.784357 [ 12 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 8
2025.10.09 17:15:14.909304 [ 134 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:15:14.909934 [ 134 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 9
2025.10.09 17:15:14.910004 [ 133 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 10
2025.10.09 17:15:16.600039 [ 138 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:15:16.600611 [ 138 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 11
2025.10.09 17:15:16.600665 [ 12 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 12
2025.10.09 17:15:18.406721 [ 133 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:15:18.407225 [ 138 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 13
2025.10.09 17:15:18.407309 [ 131 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 14
2025.10.09 17:15:20.054449 [ 132 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:15:20.057641 [ 132 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 15
2025.10.09 17:15:20.057732 [ 129 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 16
2025.10.09 17:15:21.075497 [ 131 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:15:21.076131 [ 129 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 17
2025.10.09 17:15:21.076239 [ 131 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 18
2025.10.09 17:15:22.520049 [ 133 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:15:22.520675 [ 12 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 19
2025.10.09 17:15:22.520736 [ 133 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 20
2025.10.09 17:15:22.520848 [ 133 ] {} <Fatal> RaftInstance: too many pre-vote rejections, probably this node is not receiving heartbeat from leader. we should re-establish the network connection
2025.10.09 17:15:22.520862 [ 133 ] {} <Fatal> RaftInstance: cannot find leader!
2025.10.09 17:15:22.523513 [ 144 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2025.10.09 17:15:22.523548 [ 144 ] {} <Fatal> BaseDaemon: (version 23.8.7.1, build id: , git hash: 3042d295d963012962f5c683bd2776fa331a38c3) (from thread 133) Received signal 6
2025.10.09 17:15:22.525017 [ 144 ] {} <Fatal> BaseDaemon: Signal description: Abort
2025.10.09 17:15:22.525038 [ 144 ] {} <Fatal> BaseDaemon:
2025.10.09 17:15:22.525115 [ 144 ] {} <Fatal> BaseDaemon: Stack trace: 0x000000000f9acd4d 0x000000000fb4f90d
2025.10.09 17:15:22.525130 [ 144 ] {} <Fatal> BaseDaemon: ########################################
2025.10.09 17:15:22.525143 [ 144 ] {} <Fatal> BaseDaemon: (version 23.8.7.1, build id: , git hash: 3042d295d963012962f5c683bd2776fa331a38c3) (from thread 133) (no query) Received signal Abort (6)
2025.10.09 17:15:22.525153 [ 144 ] {} <Fatal> BaseDaemon:
2025.10.09 17:15:22.525164 [ 144 ] {} <Fatal> BaseDaemon: Stack trace: 0x000000000f9acd4d 0x000000000fb4f90d
2025.10.09 17:15:22.548129 [ 144 ] {} <Fatal> BaseDaemon: 0. StackTrace::StackTrace(ucontext const&) @ 0x000000000f9acd4d in /opt/oxide/clickhouse_keeper/clickhouse
2025.10.09 17:15:22.548806 [ 144 ] {} <Fatal> BaseDaemon: 1. signalHandler(int, siginfo*, void*) @ 0x000000000fb4f90d in /opt/oxide/clickhouse_keeper/clickhouse
2025.10.09 17:15:22.548825 [ 144 ] {} <Fatal> BaseDaemon: This ClickHouse version is not official and should be upgraded to the official build.

And sled 8:

2025.10.09 17:55:41.114229 [ 13 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:55:41.120498 [ 135 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 1
2025.10.09 17:55:41.120543 [ 137 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 2
2025.10.09 17:55:42.423071 [ 139 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:55:42.423707 [ 132 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 3
2025.10.09 17:55:42.423777 [ 134 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 4
2025.10.09 17:55:43.560007 [ 136 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:55:43.560656 [ 137 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 5
2025.10.09 17:55:43.560703 [ 136 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 6
2025.10.09 17:55:45.202114 [ 135 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:55:45.202669 [ 134 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 7
2025.10.09 17:55:45.202730 [ 135 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 8
2025.10.09 17:55:46.762794 [ 132 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:55:46.763466 [ 135 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 9
2025.10.09 17:55:46.763545 [ 132 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 10
2025.10.09 17:55:47.939525 [ 137 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:55:47.940077 [ 132 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 11
2025.10.09 17:55:47.940151 [ 133 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 12
2025.10.09 17:55:49.736075 [ 131 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:55:49.737473 [ 132 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 13
2025.10.09 17:55:49.737536 [ 133 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 14
2025.10.09 17:55:51.545490 [ 134 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:55:51.545953 [ 131 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 15
2025.10.09 17:55:51.550266 [ 130 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 16
2025.10.09 17:55:53.237739 [ 134 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:55:53.238295 [ 137 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 17
2025.10.09 17:55:53.238394 [ 132 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 18
2025.10.09 17:55:54.481719 [ 130 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:55:54.482474 [ 131 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 19
2025.10.09 17:55:54.482537 [ 130 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 20
2025.10.09 17:55:54.482607 [ 130 ] {} <Fatal> RaftInstance: too many pre-vote rejections, probably this node is not receiving heartbeat from leader. we should re-establish the network connection
2025.10.09 17:55:54.482620 [ 130 ] {} <Fatal> RaftInstance: cannot find leader!
2025.10.09 17:55:54.485306 [ 143 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2025.10.09 17:55:54.485338 [ 143 ] {} <Fatal> BaseDaemon: (version 23.8.7.1, build id: , git hash: 3042d295d963012962f5c683bd2776fa331a38c3) (from thread 130) Received signal 6
2025.10.09 17:55:54.486502 [ 143 ] {} <Fatal> BaseDaemon: Signal description: Abort
2025.10.09 17:55:54.486520 [ 143 ] {} <Fatal> BaseDaemon:
2025.10.09 17:55:54.486600 [ 143 ] {} <Fatal> BaseDaemon: Stack trace: 0x000000000f9acd4d 0x000000000fb4f90d
2025.10.09 17:55:54.486612 [ 143 ] {} <Fatal> BaseDaemon: ########################################
2025.10.09 17:55:54.486623 [ 143 ] {} <Fatal> BaseDaemon: (version 23.8.7.1, build id: , git hash: 3042d295d963012962f5c683bd2776fa331a38c3) (from thread 130) (no query) Received signal Abort (6)
2025.10.09 17:55:54.486633 [ 143 ] {} <Fatal> BaseDaemon:
2025.10.09 17:55:54.486644 [ 143 ] {} <Fatal> BaseDaemon: Stack trace: 0x000000000f9acd4d 0x000000000fb4f90d
2025.10.09 17:55:54.513127 [ 143 ] {} <Fatal> BaseDaemon: 0. StackTrace::StackTrace(ucontext const&) @ 0x000000000f9acd4d in /opt/oxide/clickhouse_keeper/clickhouse
2025.10.09 17:55:54.513224 [ 143 ] {} <Fatal> BaseDaemon: 1. signalHandler(int, siginfo*, void*) @ 0x000000000fb4f90d in /opt/oxide/clickhouse_keeper/clickhouse
2025.10.09 17:55:54.513240 [ 143 ] {} <Fatal> BaseDaemon: This ClickHouse version is not official and should be upgraded to the official build.

Our very cursory interpretation of this: on startup, these nodes weren't able to join the Raft cluster (each logged 20 pre-vote rejections within about 14 seconds of the first election timeout), so they decided to exit, which is implemented as a call to abort(). Exiting in an attempt to recover from network isolation seems pretty dubious, but there isn't much we can do about that. What we don't know is what caused these nodes to believe they were isolated, or why we didn't see similar problems on the other 3 keeper nodes.
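For comparing these two nodes against the other 3 keepers, a small script can pull the same numbers out of each keeper log. This is only a sketch: the function name `summarize_prevotes` and the `SAMPLE` excerpt are made up for illustration, and the regexes assume nothing beyond the line format visible in the logs above (timestamp, `[ thread ]`, `{}`, `<Level>`, message).

```python
import re
from datetime import datetime

# Log prefix as it appears in the excerpts above:
# "2025.10.09 17:15:08.672100 [ 139 ] {} <Warning> RaftInstance: ..."
LINE_RE = re.compile(
    r"^(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) \[ (\d+) \] \{[^}]*\} <(\w+)> (.*)$"
)
REJECT_RE = re.compile(r"\[PRE-VOTE\] rejected by quorum, count (\d+)")

# Four lines copied from the sled 17 log, used as a tiny sample input.
SAMPLE = """\
2025.10.09 17:15:08.672100 [ 139 ] {} <Warning> RaftInstance: Election timeout, initiate leader election
2025.10.09 17:15:08.677275 [ 138 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 1
2025.10.09 17:15:22.520736 [ 133 ] {} <Warning> RaftInstance: [PRE-VOTE] rejected by quorum, count 20
2025.10.09 17:15:22.520848 [ 133 ] {} <Fatal> RaftInstance: too many pre-vote rejections, probably this node is not receiving heartbeat from leader. we should re-establish the network connection
"""


def summarize_prevotes(lines):
    """Count election timeouts and pre-vote rejections, and measure the time
    from the first parsed line to the first Fatal line (None if no Fatal)."""
    first_ts = None
    fatal_ts = None
    elections = 0
    max_rejections = 0
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a standard log line
        ts = datetime.strptime(m.group(1), "%Y.%m.%d %H:%M:%S.%f")
        level, msg = m.group(3), m.group(4)
        if first_ts is None:
            first_ts = ts
        if "Election timeout" in msg:
            elections += 1
        rejected = REJECT_RE.search(msg)
        if rejected:
            max_rejections = max(max_rejections, int(rejected.group(1)))
        if level == "Fatal" and fatal_ts is None:
            fatal_ts = ts
    seconds = None
    if first_ts is not None and fatal_ts is not None:
        seconds = (fatal_ts - first_ts).total_seconds()
    return {
        "elections": elections,
        "max_rejections": max_rejections,
        "seconds_to_fatal": seconds,
    }


if __name__ == "__main__":
    print(summarize_prevotes(SAMPLE.splitlines()))
```

Feeding it the log from each keeper zone, starting at the first line after the sled reboot, would show whether the other 3 keepers also saw pre-vote rejections after their reboots and simply stayed under the limit, or saw none at all.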
