Skip to content

Local disk creation may trigger cockroach DB service restarts #10478

@wfchandler

Description

@wfchandler

On dogfood, we observed that two CRDB nodes restarted on the morning of 2026-05-20:

root@oxz_switch1:~# pilot host exec -c 'zlogin $(zoneadm list | grep cockroach) ps  -ef -o comm,stime | grep "cockroach "' 9 10 14 16 17
 9  BRM44220005        ok: /opt/oxide/cockroachdb/bin/cockroach                                             04:37:28
10  BRM42220009        ok: /opt/oxide/cockroachdb/bin/cockroach                                             04:41:33
14  BRM42220051        ok: /opt/oxide/cockroachdb/bin/cockroach                                               May_16
16  BRM42220014        ok: /opt/oxide/cockroachdb/bin/cockroach                                               May_16
17  BRM42220017        ok: /opt/oxide/cockroachdb/bin/cockroach                                               May_16

The last few seconds of the logs prior to the restart are unfortunately not present in the archived cockroach.log files, but in the time leading up it we see warnings that the node may be overloaded:

I260520 04:37:14.565540 735 kv/kvserver/store_raft.go:522 ⋮ [n2,s2,r1105/1:‹/System/tsd/cr.node.{kv…-rp…}›,raft] 235892  raft ready handling: 3.69s [append=0.00s, apply=2.47s, commit-batch-sync=1.22s, other=0.00s], wrote 43 K
iB sync [append-ent=42 KiB (7), apply=42 KiB (7 in 2 batches)]; node might be overloaded
W260520 04:37:14.704457 77340944 kv/kvserver/liveness/liveness.go:883 <E2><8B><AE> [n2,s2,r118/6:<E2><80><B9>/Table/34{8-9}<E2><80><BA>] 235893  slow heartbeat took 6.000507049s; err=context canceled
E260520 04:37:14.704524 77340944 kv/kvserver/replica_range_lease.go:427 <E2><8B><AE> [n2,s2,r118/6:<E2><80><B9>/Table/34{8-9}<E2><80><BA>] 235894  failed to heartbeat own liveness record: context canceled
W260520 04:37:15.139560 899 kv/kvserver/liveness/liveness.go:883 <E2><8B><AE> [n2,liveness-hb] 235895  slow heartbeat took 4.500268177s; err=context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896  failed node liveness heartbeat: <E2><80><B9>operation "node liveness heartbeat" timed out after 4.5s (given timeout 4.5s)
<E2><80><BA>: context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +(1) <E2><80><B9>operation "node liveness heartbeat" timed out after 4.5s (given timeout 4.5s)<E2><80><BA>
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +Wraps: (2) context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +An inability to maintain liveness will prevent a node from participating in a
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +cluster. If this problem persists, it may be a sign of resource starvation or
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +of network connectivity problems. For help troubleshooting, visit:
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues

The svcadm logs indicate the new process started at 04:37:26, 11 seconds after the last message logged.

BRM44220005 # head -3 /pool/ext/4eb2e4eb-41d8-496c-9a5a-687d7e004aa4/crypt/debug/oxz_cockroachdb_4c3ef132-ec83-4b1b-9574-7c7d3035f9e9/oxide-cockroachdb:default.log.1779251862                                                     
[ May 20 04:37:26 Stopping because all processes in service exited. ]                                                                                                                                                          
[ May 20 04:37:26 Executing stop method (:kill). ]                                                                                                                                                                             
[ May 20 04:37:26 Executing start method ("/opt/oxide/lib/svc/manifest/cockroachdb.sh"). ]                            

There is no indication that unusually heavy queries were being executed by CRDB, but both of these nodes are hosted a U.2 drive that was initializing a local disk at the time of the crash.

On sled 9 CRDB is held on oxp_b358fb1e-f52a-4a63-9aab-170225509b37.

During this interval, sled-agent is logging that is it initializing a new local storage volume:

2026-05-20 04:37:01.176Z INFO SledAgent/658 (dropshot (SledAgent)) on BRM44220005: request completed
    error_message_external = Service Unavailable
    error_message_internal = Failed to ensure volume 'oxp_b358fb1e-f52a-4a63-9aab-170225509b37/local_storage_unencrypted/3ae6c3f2-f450-4924-9d3b-b9c96bab514a/vol': created but not ready yet: still zero initializing
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.17.0/src/server.rs:862
    latency_us = 6098673
    local_addr = [fd00:1122:3344:105::1]:12345
    method = POST
    remote_addr = [fd00:1122:3344:101::89]:57137
    req_id = 828a2954-e4dc-4eae-9f11-fbbcd14a24ff
    response_code = 503
    uri = /local-storage

The period these messages are logged matches the "overloaded" period in CRDB closely:

will@castle /staff/core/dogfood-crdb-crash-2026-05-20 $ grep 'T04:3' oxide-sled-agent:default.log.1779333004 | egrep 'oxp_b358fb1e-f52a-4a63-9aab-170225509b37.*still zero initializing' | jq -r .time 
2026-05-20T04:35:43.632979477Z
2026-05-20T04:35:44.532078728Z
2026-05-20T04:35:45.901170989Z
2026-05-20T04:35:46.187545552Z
2026-05-20T04:35:47.160109624Z
2026-05-20T04:35:47.62813723Z
2026-05-20T04:35:48.431222544Z
2026-05-20T04:35:49.98001739Z
2026-05-20T04:35:50.786933625Z
2026-05-20T04:35:53.318926925Z
2026-05-20T04:35:55.137188354Z
2026-05-20T04:35:57.664169831Z
2026-05-20T04:36:05.842494794Z
2026-05-20T04:36:08.355253085Z
2026-05-20T04:36:26.300306902Z
2026-05-20T04:36:28.505755413Z
2026-05-20T04:37:01.17697614Z
2026-05-20T04:37:03.275917293Z
2026-05-20T04:38:00.457628507Z
2026-05-20T04:38:03.686246866Z

A similar pattern is seen on sled 10, hosted on oxp_0e485ad3-04e6-404b-b619-87d4fea9f5ae, we see a similar pattern.

New process starts at 04:41:32.

BRM42220009 # head -3 /pool/ext/0e485ad3-04e6-404b-b619-87d4fea9f5ae/crypt/debug/oxz_cockroachdb_a3628a56-6f85-43b5-be50-71d8f0e04877/oxide-cockroachdb:default.log.1779252106
[ May 20 04:41:32 Stopping because all processes in service exited. ]
[ May 20 04:41:32 Executing stop method (:kill). ]
[ May 20 04:41:32 Executing start method ("/opt/oxide/lib/svc/manifest/cockroachdb.sh"). ]

In the period leading up to that time a local disk is being initialized on it:

BRM42220009 # grep 'T04:[3-4]' /pool/ext/b252b176-3974-436a-915b-60382b21eb76/crypt/debug/global/oxide-sled-agent:default.log.1779333005 | egrep 'oxp_0e485ad3-04e6-404b-b619-87d4fea9f5ae.*still zero initializing' | jq -r .time 
2026-05-20T04:37:57.815494599Z
2026-05-20T04:37:58.773125217Z
2026-05-20T04:37:59.961247628Z
2026-05-20T04:38:01.468213757Z
2026-05-20T04:38:04.542124621Z
2026-05-20T04:38:08.710330616Z
2026-05-20T04:39:59.623875444Z
2026-05-20T04:40:00.726780259Z
2026-05-20T04:40:02.042064876Z
2026-05-20T04:40:03.847478895Z
2026-05-20T04:40:06.998500211Z
2026-05-20T04:40:14.603304114Z
2026-05-20T04:40:28.356296048Z
2026-05-20T04:40:46.729091263Z
2026-05-20T04:41:18.850718116Z

With some horrifying shell, we can see that the "node may be overloaded" messages correlate with local disk creation on that U.2:

BRM42220009 # cat /pool/ext/b252b176-3974-436a-915b-60382b21eb76/crypt/debug/global/oxide-sled-agent:default.log.1779333005 | egrep 'oxp_0e485ad3-04e6-404b-b619-87d4fea9f5ae.*still zero initializing' | jq -r .time | cut -d ':' -f -2 | uniq -c
   6 2026-05-20T03:12
   3 2026-05-20T03:14
   5 2026-05-20T03:15
   6 2026-05-20T03:51
   7 2026-05-20T03:52
   2 2026-05-20T03:53
   3 2026-05-20T04:37
   3 2026-05-20T04:38
   1 2026-05-20T04:39
   7 2026-05-20T04:40
   1 2026-05-20T04:41

BRM42220009 # grep 'node might be overloaded' /pool/ext/0e485ad3-04e6-404b-b619-87d4fea9f5ae/crypt/zone/oxz_cockroachdb_a3628a56-6f85-43b5-be50-71d8f0e04877/root/data/logs/cockroach.oxzcockroachdba3628a56-6f85-43b5-be50-71d8f0e04877.root.2026-05-19T11_26_10Z.021989.log | awk '{print $2}' | cut -d ':' -f -2 | uniq -c
 186 00:38
  58 00:39
 228 00:40
 206 00:41
 457 03:15
  35 03:51
  12 03:52
 128 03:53
 202 04:40

Metadata

Metadata

Assignees

No one assigned

    Labels

    databaseRelated to database access

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions