Local disk creation may trigger cockroach DB service restarts

On dogfood, we observed that two CRDB nodes restarted on the morning of 2026-05-20:

```
root@oxz_switch1:~# pilot host exec -c 'zlogin $(zoneadm list | grep cockroach) ps  -ef -o comm,stime | grep "cockroach "' 9 10 14 16 17
 9  BRM44220005        ok: /opt/oxide/cockroachdb/bin/cockroach                                             04:37:28
10  BRM42220009        ok: /opt/oxide/cockroachdb/bin/cockroach                                             04:41:33
14  BRM42220051        ok: /opt/oxide/cockroachdb/bin/cockroach                                               May_16
16  BRM42220014        ok: /opt/oxide/cockroachdb/bin/cockroach                                               May_16
17  BRM42220017        ok: /opt/oxide/cockroachdb/bin/cockroach                                               May_16
```

The last few seconds of the logs prior to the restart are unfortunately not present in the archived `cockroach.log` files, but in the time leading up it we see warnings that the node may be overloaded:

```
I260520 04:37:14.565540 735 kv/kvserver/store_raft.go:522 ⋮ [n2,s2,r1105/1:‹/System/tsd/cr.node.{kv…-rp…}›,raft] 235892  raft ready handling: 3.69s [append=0.00s, apply=2.47s, commit-batch-sync=1.22s, other=0.00s], wrote 43 K
iB sync [append-ent=42 KiB (7), apply=42 KiB (7 in 2 batches)]; node might be overloaded
W260520 04:37:14.704457 77340944 kv/kvserver/liveness/liveness.go:883 <E2><8B><AE> [n2,s2,r118/6:<E2><80><B9>/Table/34{8-9}<E2><80><BA>] 235893  slow heartbeat took 6.000507049s; err=context canceled
E260520 04:37:14.704524 77340944 kv/kvserver/replica_range_lease.go:427 <E2><8B><AE> [n2,s2,r118/6:<E2><80><B9>/Table/34{8-9}<E2><80><BA>] 235894  failed to heartbeat own liveness record: context canceled
W260520 04:37:15.139560 899 kv/kvserver/liveness/liveness.go:883 <E2><8B><AE> [n2,liveness-hb] 235895  slow heartbeat took 4.500268177s; err=context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896  failed node liveness heartbeat: <E2><80><B9>operation "node liveness heartbeat" timed out after 4.5s (given timeout 4.5s)
<E2><80><BA>: context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +(1) <E2><80><B9>operation "node liveness heartbeat" timed out after 4.5s (given timeout 4.5s)<E2><80><BA>
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +Wraps: (2) context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +An inability to maintain liveness will prevent a node from participating in a
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +cluster. If this problem persists, it may be a sign of resource starvation or
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +of network connectivity problems. For help troubleshooting, visit:
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
```

The `svcadm` logs indicate the new process started at 04:37:26, 11 seconds after the last message logged.

```
BRM44220005 # head -3 /pool/ext/4eb2e4eb-41d8-496c-9a5a-687d7e004aa4/crypt/debug/oxz_cockroachdb_4c3ef132-ec83-4b1b-9574-7c7d3035f9e9/oxide-cockroachdb:default.log.1779251862                                                     
[ May 20 04:37:26 Stopping because all processes in service exited. ]                                                                                                                                                          
[ May 20 04:37:26 Executing stop method (:kill). ]                                                                                                                                                                             
[ May 20 04:37:26 Executing start method ("/opt/oxide/lib/svc/manifest/cockroachdb.sh"). ]                            
```

There is no indication that unusually heavy queries were being executed by CRDB, but both of these nodes are hosted a U.2 drive that was initializing a local disk at the time of the crash.

On sled 9 CRDB is held on `oxp_b358fb1e-f52a-4a63-9aab-170225509b37`.

During this interval, `sled-agent` is logging that is it initializing a new local storage volume:

```
2026-05-20 04:37:01.176Z INFO SledAgent/658 (dropshot (SledAgent)) on BRM44220005: request completed
    error_message_external = Service Unavailable
    error_message_internal = Failed to ensure volume 'oxp_b358fb1e-f52a-4a63-9aab-170225509b37/local_storage_unencrypted/3ae6c3f2-f450-4924-9d3b-b9c96bab514a/vol': created but not ready yet: still zero initializing
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.17.0/src/server.rs:862
    latency_us = 6098673
    local_addr = [fd00:1122:3344:105::1]:12345
    method = POST
    remote_addr = [fd00:1122:3344:101::89]:57137
    req_id = 828a2954-e4dc-4eae-9f11-fbbcd14a24ff
    response_code = 503
    uri = /local-storage
```

The period these messages are logged matches the "overloaded" period in CRDB closely:

```
will@castle /staff/core/dogfood-crdb-crash-2026-05-20 $ grep 'T04:3' oxide-sled-agent:default.log.1779333004 | egrep 'oxp_b358fb1e-f52a-4a63-9aab-170225509b37.*still zero initializing' | jq -r .time 
2026-05-20T04:35:43.632979477Z
2026-05-20T04:35:44.532078728Z
2026-05-20T04:35:45.901170989Z
2026-05-20T04:35:46.187545552Z
2026-05-20T04:35:47.160109624Z
2026-05-20T04:35:47.62813723Z
2026-05-20T04:35:48.431222544Z
2026-05-20T04:35:49.98001739Z
2026-05-20T04:35:50.786933625Z
2026-05-20T04:35:53.318926925Z
2026-05-20T04:35:55.137188354Z
2026-05-20T04:35:57.664169831Z
2026-05-20T04:36:05.842494794Z
2026-05-20T04:36:08.355253085Z
2026-05-20T04:36:26.300306902Z
2026-05-20T04:36:28.505755413Z
2026-05-20T04:37:01.17697614Z
2026-05-20T04:37:03.275917293Z
2026-05-20T04:38:00.457628507Z
2026-05-20T04:38:03.686246866Z
```

A similar pattern is seen on sled 10, hosted on `oxp_0e485ad3-04e6-404b-b619-87d4fea9f5ae`, we see a similar pattern.

New process starts at 04:41:32.

```
BRM42220009 # head -3 /pool/ext/0e485ad3-04e6-404b-b619-87d4fea9f5ae/crypt/debug/oxz_cockroachdb_a3628a56-6f85-43b5-be50-71d8f0e04877/oxide-cockroachdb:default.log.1779252106
[ May 20 04:41:32 Stopping because all processes in service exited. ]
[ May 20 04:41:32 Executing stop method (:kill). ]
[ May 20 04:41:32 Executing start method ("/opt/oxide/lib/svc/manifest/cockroachdb.sh"). ]
```

In the period leading up to that time a local disk is being initialized on it:

```
BRM42220009 # grep 'T04:[3-4]' /pool/ext/b252b176-3974-436a-915b-60382b21eb76/crypt/debug/global/oxide-sled-agent:default.log.1779333005 | egrep 'oxp_0e485ad3-04e6-404b-b619-87d4fea9f5ae.*still zero initializing' | jq -r .time 
2026-05-20T04:37:57.815494599Z
2026-05-20T04:37:58.773125217Z
2026-05-20T04:37:59.961247628Z
2026-05-20T04:38:01.468213757Z
2026-05-20T04:38:04.542124621Z
2026-05-20T04:38:08.710330616Z
2026-05-20T04:39:59.623875444Z
2026-05-20T04:40:00.726780259Z
2026-05-20T04:40:02.042064876Z
2026-05-20T04:40:03.847478895Z
2026-05-20T04:40:06.998500211Z
2026-05-20T04:40:14.603304114Z
2026-05-20T04:40:28.356296048Z
2026-05-20T04:40:46.729091263Z
2026-05-20T04:41:18.850718116Z
```

With some horrifying shell, we can see that the "node may be overloaded" messages correlate with local disk creation on that U.2:

```
BRM42220009 # cat /pool/ext/b252b176-3974-436a-915b-60382b21eb76/crypt/debug/global/oxide-sled-agent:default.log.1779333005 | egrep 'oxp_0e485ad3-04e6-404b-b619-87d4fea9f5ae.*still zero initializing' | jq -r .time | cut -d ':' -f -2 | uniq -c
   6 2026-05-20T03:12
   3 2026-05-20T03:14
   5 2026-05-20T03:15
   6 2026-05-20T03:51
   7 2026-05-20T03:52
   2 2026-05-20T03:53
   3 2026-05-20T04:37
   3 2026-05-20T04:38
   1 2026-05-20T04:39
   7 2026-05-20T04:40
   1 2026-05-20T04:41

BRM42220009 # grep 'node might be overloaded' /pool/ext/0e485ad3-04e6-404b-b619-87d4fea9f5ae/crypt/zone/oxz_cockroachdb_a3628a56-6f85-43b5-be50-71d8f0e04877/root/data/logs/cockroach.oxzcockroachdba3628a56-6f85-43b5-be50-71d8f0e04877.root.2026-05-19T11_26_10Z.021989.log | awk '{print $2}' | cut -d ':' -f -2 | uniq -c
 186 00:38
  58 00:39
 228 00:40
 206 00:41
 457 03:15
  35 03:51
  12 03:52
 128 03:53
 202 04:40
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Local disk creation may trigger cockroach DB service restarts #10478

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Local disk creation may trigger cockroach DB service restarts #10478

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions