While running and wrapping up the test_local_storage_scale test on Dogfood, one of the local disks did not get deleted successfully.
DISK_STATE
- deleted-237846bd-ec9b-4ce2-a8e9-48e2c56487f9 - - faulted
DISK_NAME TIME_CREATED TIME_DELETED ALLOCATION_TYPE DATASET_ID POOL_ID SLED_ID DATASET_SIZE
deleted-237846bd-ec9b-4ce2-a8e9-48e2c56487f9 2026-05-20 04:37:25.607011 UTC unencrypted 10bb5fb2-8830-4151-858a-c9509aa8fdbb 4c157f35-865d-4310-9d81-c6259cb69293 db183874-65b5-4263-a1c1-ddb2737ae0e9 573571072000
I did start deleting some of those disks manually in the middle of the terraform destroy just for funsies, while another instance of this tests was starting in parallel. So maybe there was weird combo there.
The backtrace carried by @hawkw and @wfchandler yesterday delivered the following:
root@oxz_switch1:~# omdb db saga show 59aed31b-b6b1-41f1-811a-cbe253b2975b 2>/dev/null
id | current_sec | time_created | name | state
--------------------------------------+--------------------------------------+--------------------------+-------------+-------
59aed31b-b6b1-41f1-811a-cbe253b2975b | 6ae939ca-e613-49a9-9efd-dbb60c94b1f3 | 2026-05-21T02:19:56.652Z | disk-delete | Done
DAG: {"end_node":5,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[2,3,null],[4,0,null],[3,5,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"disk_delete.delete_disk_record","label":"DeleteDiskRecord","name":"deleted_disk"}},{"Action":{"action_name":"disk_delete.space_account","label":"SpaceAccount","name":"no_result1"}},{"Action":{"action_name":"disk_delete.delete_local_storage","label":"DeleteLocalStorage","name":"delete_local_storage"}},{"Action":{"action_name":"disk_delete.deallocate_local_storage","label":"DeallocateLocalStorage","name":"deallocate_local_storage"}},{"Start":{"params":{"disk":{"LocalStorage":{"disk":{"attach_instance_id":null,"block_size":"AdvancedFormat","disk_state":"faulted","disk_type":"LocalStorage","identity":{"description":"local data disk 0 for locscale-47","id":"237846bd-ec9b-4ce2-a8e9-48e2c56487f9","name":"deleted-237846bd-ec9b-4ce2-a8e9-48e2c56487f9","time_created":"2026-05-20T04:34:40.165795Z","time_deleted":null,"time_modified":"2026-05-20T04:34:40.165795Z"},"project_id":"7adf3d37-e230-47fd-b0cc-8ced0c51e4d5","rcgen":1,"size":536870912000,"slot":null,"state_generation":2,"time_state_updated":"2026-05-20T04:34:40.289004Z"},"disk_type_local_storage":{"disk_id":"237846bd-ec9b-4ce2-a8e9-48e2c56487f9","local_storage_dataset_allocation_id":null,"local_storage_unencrypted_dataset_allocation_id":"86021725-cfe0-4903-90e7-cf3bd92da3bd","required_dataset_overhead":36700160000},"local_storage_dataset_allocation":{"Unencrypted":{"dataset_size":573571072000,"id":"86021725-cfe0-4903-90e7-cf3bd92da3bd","local_storage_unencrypted_dataset_id":"10bb5fb2-8830-4151-858a-c9509aa8fdbb","pool_id":"4c157f35-865d-4310-9d81-c6259cb69293","sled_id":"db183874-65b5-4263-a1c1-ddb2737ae0e9","time_created":"2026-05-20T04:37:25.607011Z","time_deleted":null}}}},"project_id":"7adf3d37-e230-47fd-b0cc-8ced0c51e4d5","serialized_authn":{"kind":{"Authenticated":[{"actor":{"SiloUser":{"silo_id":"7bd7623a-68ed-4636-8ecb-b59e3b068787","silo_user_id":"9ceeb971-95af-4cbe-8b25-0c9ad57c925b"}},"credential_id":"a3e5d682-6c98-44b2-affa-7a3f6b7b5282","device_token_expiration":null},{"mapped_fleet_roles":{"admin":["admin"]}}]}}}}},"End"]},"saga_name":"disk-delete","start_node":4}
event time | sub saga | node id | event type | data
------------------------ | -------- | ------------------------------------- | ------------- | ---
2026-05-21T02:19:56.703Z | | 4: start | started |
2026-05-21T02:19:56.726Z | | 4: start | succeeded |
2026-05-21T02:19:56.731Z | | 0: disk_delete.delete_disk_record | started |
2026-05-21T02:19:56.741Z | | 0: disk_delete.delete_disk_record | succeeded | "deleted_disk" => {"attach_instance_id":null,"block_size":"AdvancedFormat","disk_state":"faulted","disk_type":"LocalStorage","identity":{"description":"local data disk 0 for locscale-47","id":"237846bd-ec9b-4ce2-a8e9-48e2c56487f9","name":"deleted-237846bd-ec9b-4ce2-a8e9-48e2c56487f9","time_created":"2026-05-20T04:34:40.165795Z","time_deleted":null,"time_modified":"2026-05-20T04:34:40.165795Z"},"project_id":"7adf3d37-e230-47fd-b0cc-8ced0c51e4d5","rcgen":1,"size":536870912000,"slot":null,"state_generation":2,"time_state_updated":"2026-05-20T04:34:40.289004Z"}
2026-05-21T02:19:56.744Z | | 1: disk_delete.space_account | started |
2026-05-21T02:19:56.760Z | | 1: disk_delete.space_account | succeeded |
2026-05-21T02:19:56.764Z | | 2: disk_delete.delete_local_storage | started |
2026-05-21T02:19:56.799Z | | 2: disk_delete.delete_local_storage | failed | "delete_local_storage" => {"ActionFailed":{"source_error":{"InternalError":{"internal_message":"failed to delete local storage: failed at attempt 1: progenitor API operation failed: Error Response: status: 500 Internal Server Error; headers: {\"content-type\": \"application/json\", \"x-request-id\": \"b6330f85-046d-43e3-bb49-b8955fd7409a\", \"content-length\": \"124\", \"date\": \"Thu, 21 May 2026 02:19:56 GMT\"}; value: Error { error_code: Some(\"Internal\"), message: \"Internal Server Error\", request_id: \"b6330f85-046d-43e3-bb49-b8955fd7409a\" }"}}}}
2026-05-21T02:19:56.810Z | | 1: disk_delete.space_account | undo_started |
2026-05-21T02:19:56.829Z | | 1: disk_delete.space_account | undo_finished |
2026-05-21T02:19:56.831Z | | 0: disk_delete.delete_disk_record | undo_started |
2026-05-21T02:19:56.843Z | | 0: disk_delete.delete_disk_record | undo_finished |
2026-05-21T02:19:56.847Z | | 4: start | undo_started |
2026-05-21T02:19:56.875Z | | 4: start | undo_finished |
1
2
3
4
5
6
7
8
9
10
11
12
root@oxz_switch1:~# pilot host exec -c 'grep -h b6330f85-046d-43e3-bb49-b8955fd7409a $(/opt/oxide/oxlog/oxlog logs global sled-agent --current --archived -A 2026-05-20) | looker' BRM42220009
10 BRM42220009 ok: 02:19:56.799Z INFO SledAgent (dropshot (SledAgent)): request completed
error_message_external = Internal Server Error
error_message_internal = Failed to remove reservation from 'oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted/86021725-cfe0-4903-90e7-cf3bd92da3bd': Failed to get value '["reservation"]' from filesystem oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted/86021725-cfe0-4903-90e7-cf3bd92da3bd
file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.17.0/src/server.rs:862
latency_us = 13288
local_addr = [fd00:1122:3344:107::1]:12345
method = DELETE
remote_addr = [fd00:1122:3344:102::a6]:64202
req_id = b6330f85-046d-43e3-bb49-b8955fd7409a
response_code = 500
uri = /local-storage
@wfchandler noted the dataset didn't exist anymore:
BRM42220009 # zfs list -r oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted
NAME USED AVAIL REFER MOUNTPOINT
oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted 85.5G 1.56T 24K /pool/ext/4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted
oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted/8c5d8744-8712-4631-98d1-382d0d54ab33 80.0G 5.46G 24K /8c5d8744-8712-4631-98d1-382d0d54ab33
oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted/8c5d8744-8712-4631-98d1-382d0d54ab33/vol 80.0G 5.46G 80.0G -
BRM42220009 # zpool history oxp_4c157f35-865d-4310-9d81-c6259cb69293 | grep 86021725-cfe0-4903-90e7-cf3bd92da3bd
BRM42220009 # echo $?
And noted that instance might never have started:
root@oxz_switch1:~# omdb db instance info 8139118e-76a9-415d-a327-a20eb0a8c1c7 --all 2>/dev/null
== INSTANCE ====================================================================
ID: 8139118e-76a9-415d-a327-a20eb0a8c1c7
project ID: 7adf3d37-e230-47fd-b0cc-8ced0c51e4d5
name: locscale-47
description: local storage scale test instance 47
created at: 2026-05-20 04:37:21.564874 UTC
last modified at: 2026-05-20 05:16:01.719800 UTC
/!\ deleted at: 2026-05-20 05:16:02.053285 UTC
== CONFIGURATION ===============================================================
vCPUs: 2
memory: 8 GiB
hostname: locscale-47
boot disk: Some(3a8db7fb-1987-413a-b5fb-1b5f7be71e21)
auto-restart:
InstanceAutoRestart {
policy: None,
cooldown: None,
}
== RUNTIME STATE ===============================================================
nexus state: Destroyed
(i) external API state: Destroyed
intended state: destroyed
last updated at: 2026-05-20T04:37:21.564874Z (generation 2)
needs reincarnation: false
karmic status: saṃsāra (reincarnation enabled)
last reincarnated at: None
active VMM ID: None
target VMM ID: None
migration ID: None
updater lock: UNLOCKED at generation: 1
== VIRTUAL RESOURCES PROVISIONED ===============================================
(i) no virtual resources provisioned for this instance
== VMM HISTORY =================================================================
ID STATE GEN SLED_ID TIME_CREATED TIME_DELETED
47a3d43b-a975-4735-826e-66ecd9443490 saga_unwound 2 db183874-65b5-4263-a1c1-ddb2737ae0e9 2026-05-20T04:37:25.762Z 2026-05-20T04:37:37.221Z
However, a CRDB process appeared to have restarted around that time:
root@oxz_switch1:~# pilot host exec -c 'zlogin $(zoneadm list | grep cockroach) ps -ef -o comm,stime | grep "cockroach "' 9 10 14 16 17
9 BRM44220005 ok: /opt/oxide/cockroachdb/bin/cockroach 04:37:28
10 BRM42220009 ok: /opt/oxide/cockroachdb/bin/cockroach 04:41:33
14 BRM42220051 ok: /opt/oxide/cockroachdb/bin/cockroach May_16
16 BRM42220014 ok: /opt/oxide/cockroachdb/bin/cockroach May_16
17 BRM42220017 ok: /opt/oxide/cockroachdb/bin/cockroach May_16
and added the logs the CRDB node on sled 9:
.47s, commit-batch-sync=1.22s, other=0.00s], wrote 43 KiB sync [append-ent=42 KiB (7), apply=42 KiB (7 in 2 batches)]; node might be overloaded
W260520 04:37:14.704457 77340944 kv/kvserver/liveness/liveness.go:883 <E2><8B><AE> [n2,s2,r118/6:<E2><80><B9>/Table/34{8-9}<E2><80><BA>] 235893 slow heartbeat took 6.000507049s; err=context canceled
E260520 04:37:14.704524 77340944 kv/kvserver/replica_range_lease.go:427 <E2><8B><AE> [n2,s2,r118/6:<E2><80><B9>/Table/34{8-9}<E2><80><BA>] 235894 failed to heartbeat own liveness record: context canceled
W260520 04:37:15.139560 899 kv/kvserver/liveness/liveness.go:883 <E2><8B><AE> [n2,liveness-hb] 235895 slow heartbeat took 4.500268177s; err=context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 failed node liveness heartbeat: <E2><80><B9>operation "node liveness heartbeat" timed out after 4.5s (given timeout 4.5s)
<E2><80><BA>: context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +(1) <E2><80><B9>operation "node liveness heartbeat" timed out after 4.5s (given timeout 4.5s)<E2><80><BA>
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +Wraps: (2) context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +An inability to maintain liveness will prevent a node from participating in a
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +cluster. If this problem persists, it may be a sign of resource starvation or
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +of network connectivity problems. For help troubleshooting, visit:
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 + https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
Last observation from @wfchandler was a spike in fsync times:
I260520 04:37:08.376712 743 kv/kvserver/store_raft.go:522 ⋮ [n2,s2,r1108/7:‹/System{/tsd/cr.s…-tse}›,raft] 235779 raft ready handling: 9.47s [append=0.00s, apply=1.16s, commit-batch-sync=8.31s, other=0.00s], wrote 15 KiB sync [append-ent=15 KiB (4), apply=15 KiB (4)]; node might be overloaded
While running and wrapping up the test_local_storage_scale test on Dogfood, one of the local disks did not get deleted successfully.
I did start deleting some of those disks manually in the middle of the terraform destroy just for funsies, while another instance of this tests was starting in parallel. So maybe there was weird combo there.
The backtrace carried by @hawkw and @wfchandler yesterday delivered the following:
@wfchandler noted the dataset didn't exist anymore:
And noted that instance might never have started:
However, a CRDB process appeared to have restarted around that time:
and added the logs the CRDB node on sled 9:
Last observation from @wfchandler was a spike in fsync times: