Skip to content

Unable to delete faulted disk from console during test_local_storage_scale #10477

@paudmir

Description

@paudmir

While running and wrapping up the test_local_storage_scale test on Dogfood, one of the local disks did not get deleted successfully.

DISK_STATE
-           deleted-237846bd-ec9b-4ce2-a8e9-48e2c56487f9 -             -             faulted
DISK_NAME                                    TIME_CREATED                   TIME_DELETED ALLOCATION_TYPE DATASET_ID                           POOL_ID                              SLED_ID                              DATASET_SIZE
deleted-237846bd-ec9b-4ce2-a8e9-48e2c56487f9 2026-05-20 04:37:25.607011 UTC              unencrypted     10bb5fb2-8830-4151-858a-c9509aa8fdbb 4c157f35-865d-4310-9d81-c6259cb69293 db183874-65b5-4263-a1c1-ddb2737ae0e9 573571072000

I did start deleting some of those disks manually in the middle of the terraform destroy just for funsies, while another instance of this tests was starting in parallel. So maybe there was weird combo there.

The backtrace carried by @hawkw and @wfchandler yesterday delivered the following:

root@oxz_switch1:~# omdb db saga show 59aed31b-b6b1-41f1-811a-cbe253b2975b 2>/dev/null
 id                                   | current_sec                          | time_created             | name        | state 
--------------------------------------+--------------------------------------+--------------------------+-------------+-------
 59aed31b-b6b1-41f1-811a-cbe253b2975b | 6ae939ca-e613-49a9-9efd-dbb60c94b1f3 | 2026-05-21T02:19:56.652Z | disk-delete | Done  

DAG: {"end_node":5,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[2,3,null],[4,0,null],[3,5,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"disk_delete.delete_disk_record","label":"DeleteDiskRecord","name":"deleted_disk"}},{"Action":{"action_name":"disk_delete.space_account","label":"SpaceAccount","name":"no_result1"}},{"Action":{"action_name":"disk_delete.delete_local_storage","label":"DeleteLocalStorage","name":"delete_local_storage"}},{"Action":{"action_name":"disk_delete.deallocate_local_storage","label":"DeallocateLocalStorage","name":"deallocate_local_storage"}},{"Start":{"params":{"disk":{"LocalStorage":{"disk":{"attach_instance_id":null,"block_size":"AdvancedFormat","disk_state":"faulted","disk_type":"LocalStorage","identity":{"description":"local data disk 0 for locscale-47","id":"237846bd-ec9b-4ce2-a8e9-48e2c56487f9","name":"deleted-237846bd-ec9b-4ce2-a8e9-48e2c56487f9","time_created":"2026-05-20T04:34:40.165795Z","time_deleted":null,"time_modified":"2026-05-20T04:34:40.165795Z"},"project_id":"7adf3d37-e230-47fd-b0cc-8ced0c51e4d5","rcgen":1,"size":536870912000,"slot":null,"state_generation":2,"time_state_updated":"2026-05-20T04:34:40.289004Z"},"disk_type_local_storage":{"disk_id":"237846bd-ec9b-4ce2-a8e9-48e2c56487f9","local_storage_dataset_allocation_id":null,"local_storage_unencrypted_dataset_allocation_id":"86021725-cfe0-4903-90e7-cf3bd92da3bd","required_dataset_overhead":36700160000},"local_storage_dataset_allocation":{"Unencrypted":{"dataset_size":573571072000,"id":"86021725-cfe0-4903-90e7-cf3bd92da3bd","local_storage_unencrypted_dataset_id":"10bb5fb2-8830-4151-858a-c9509aa8fdbb","pool_id":"4c157f35-865d-4310-9d81-c6259cb69293","sled_id":"db183874-65b5-4263-a1c1-ddb2737ae0e9","time_created":"2026-05-20T04:37:25.607011Z","time_deleted":null}}}},"project_id":"7adf3d37-e230-47fd-b0cc-8ced0c51e4d5","serialized_authn":{"kind":{"Authenticated":[{"actor":{"SiloUser":{"silo_id":"7bd7623a-68ed-4636-8ecb-b59e3b068787","silo_user_id":"9ceeb971-95af-4cbe-8b25-0c9ad57c925b"}},"credential_id":"a3e5d682-6c98-44b2-affa-7a3f6b7b5282","device_token_expiration":null},{"mapped_fleet_roles":{"admin":["admin"]}}]}}}}},"End"]},"saga_name":"disk-delete","start_node":4}

              event time | sub saga | node id                               | event type    | data
------------------------ | -------- | ------------------------------------- | ------------- | ---
2026-05-21T02:19:56.703Z |          |                              4: start | started       | 
2026-05-21T02:19:56.726Z |          |                              4: start | succeeded     | 
2026-05-21T02:19:56.731Z |          |     0: disk_delete.delete_disk_record | started       | 
2026-05-21T02:19:56.741Z |          |     0: disk_delete.delete_disk_record | succeeded     | "deleted_disk" => {"attach_instance_id":null,"block_size":"AdvancedFormat","disk_state":"faulted","disk_type":"LocalStorage","identity":{"description":"local data disk 0 for locscale-47","id":"237846bd-ec9b-4ce2-a8e9-48e2c56487f9","name":"deleted-237846bd-ec9b-4ce2-a8e9-48e2c56487f9","time_created":"2026-05-20T04:34:40.165795Z","time_deleted":null,"time_modified":"2026-05-20T04:34:40.165795Z"},"project_id":"7adf3d37-e230-47fd-b0cc-8ced0c51e4d5","rcgen":1,"size":536870912000,"slot":null,"state_generation":2,"time_state_updated":"2026-05-20T04:34:40.289004Z"}
2026-05-21T02:19:56.744Z |          |          1: disk_delete.space_account | started       | 
2026-05-21T02:19:56.760Z |          |          1: disk_delete.space_account | succeeded     | 
2026-05-21T02:19:56.764Z |          |   2: disk_delete.delete_local_storage | started       | 
2026-05-21T02:19:56.799Z |          |   2: disk_delete.delete_local_storage | failed        | "delete_local_storage" => {"ActionFailed":{"source_error":{"InternalError":{"internal_message":"failed to delete local storage: failed at attempt 1: progenitor API operation failed: Error Response: status: 500 Internal Server Error; headers: {\"content-type\": \"application/json\", \"x-request-id\": \"b6330f85-046d-43e3-bb49-b8955fd7409a\", \"content-length\": \"124\", \"date\": \"Thu, 21 May 2026 02:19:56 GMT\"}; value: Error { error_code: Some(\"Internal\"), message: \"Internal Server Error\", request_id: \"b6330f85-046d-43e3-bb49-b8955fd7409a\" }"}}}}
2026-05-21T02:19:56.810Z |          |          1: disk_delete.space_account | undo_started  | 
2026-05-21T02:19:56.829Z |          |          1: disk_delete.space_account | undo_finished | 
2026-05-21T02:19:56.831Z |          |     0: disk_delete.delete_disk_record | undo_started  | 
2026-05-21T02:19:56.843Z |          |     0: disk_delete.delete_disk_record | undo_finished | 
2026-05-21T02:19:56.847Z |          |                              4: start | undo_started  | 
2026-05-21T02:19:56.875Z |          |                              4: start | undo_finished |
1
2
3
4
5
6
7
8
9
10
11
12
root@oxz_switch1:~# pilot host exec -c 'grep -h b6330f85-046d-43e3-bb49-b8955fd7409a $(/opt/oxide/oxlog/oxlog logs global sled-agent --current --archived -A 2026-05-20) | looker' BRM42220009
10  BRM42220009        ok: 02:19:56.799Z INFO SledAgent (dropshot (SledAgent)): request completed
    error_message_external = Internal Server Error
    error_message_internal = Failed to remove reservation from 'oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted/86021725-cfe0-4903-90e7-cf3bd92da3bd': Failed to get value '["reservation"]' from filesystem oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted/86021725-cfe0-4903-90e7-cf3bd92da3bd
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.17.0/src/server.rs:862
    latency_us = 13288
    local_addr = [fd00:1122:3344:107::1]:12345
    method = DELETE
    remote_addr = [fd00:1122:3344:102::a6]:64202
    req_id = b6330f85-046d-43e3-bb49-b8955fd7409a
    response_code = 500
    uri = /local-storage

@wfchandler noted the dataset didn't exist anymore:

BRM42220009 # zfs list -r oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted 
NAME                                                                                                          USED  AVAIL     REFER  MOUNTPOINT
oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted                                           85.5G  1.56T       24K  /pool/ext/4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted
oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted/8c5d8744-8712-4631-98d1-382d0d54ab33      80.0G  5.46G       24K  /8c5d8744-8712-4631-98d1-382d0d54ab33
oxp_4c157f35-865d-4310-9d81-c6259cb69293/local_storage_unencrypted/8c5d8744-8712-4631-98d1-382d0d54ab33/vol  80.0G  5.46G     80.0G  -
BRM42220009 # zpool history oxp_4c157f35-865d-4310-9d81-c6259cb69293 | grep 86021725-cfe0-4903-90e7-cf3bd92da3bd
BRM42220009 # echo $?

And noted that instance might never have started:

root@oxz_switch1:~# omdb db instance info  8139118e-76a9-415d-a327-a20eb0a8c1c7 --all 2>/dev/null

== INSTANCE ====================================================================
                        ID: 8139118e-76a9-415d-a327-a20eb0a8c1c7
                project ID: 7adf3d37-e230-47fd-b0cc-8ced0c51e4d5
                      name: locscale-47
               description: local storage scale test instance 47
                created at: 2026-05-20 04:37:21.564874 UTC
          last modified at: 2026-05-20 05:16:01.719800 UTC
/!\             deleted at: 2026-05-20 05:16:02.053285 UTC

== CONFIGURATION ===============================================================
                     vCPUs: 2
                    memory: 8 GiB
                  hostname: locscale-47
                 boot disk: Some(3a8db7fb-1987-413a-b5fb-1b5f7be71e21)
              auto-restart:
                  InstanceAutoRestart {
                      policy: None,
                      cooldown: None,
                  }

== RUNTIME STATE ===============================================================
               nexus state: Destroyed
(i)     external API state: Destroyed
            intended state: destroyed
           last updated at: 2026-05-20T04:37:21.564874Z (generation 2)
       needs reincarnation: false
             karmic status: saṃsāra (reincarnation enabled)
      last reincarnated at: None
             active VMM ID: None
             target VMM ID: None
              migration ID: None
              updater lock: UNLOCKED at generation: 1

== VIRTUAL RESOURCES PROVISIONED ===============================================
(i) no virtual resources provisioned for this instance

== VMM HISTORY =================================================================
ID                                   STATE        GEN SLED_ID                              TIME_CREATED             TIME_DELETED             
47a3d43b-a975-4735-826e-66ecd9443490 saga_unwound 2   db183874-65b5-4263-a1c1-ddb2737ae0e9 2026-05-20T04:37:25.762Z 2026-05-20T04:37:37.221Z 

However, a CRDB process appeared to have restarted around that time:

root@oxz_switch1:~# pilot host exec -c 'zlogin $(zoneadm list | grep cockroach) ps  -ef -o comm,stime | grep "cockroach "' 9 10 14 16 17
 9  BRM44220005        ok: /opt/oxide/cockroachdb/bin/cockroach                                             04:37:28
10  BRM42220009        ok: /opt/oxide/cockroachdb/bin/cockroach                                             04:41:33
14  BRM42220051        ok: /opt/oxide/cockroachdb/bin/cockroach                                               May_16
16  BRM42220014        ok: /opt/oxide/cockroachdb/bin/cockroach                                               May_16
17  BRM42220017        ok: /opt/oxide/cockroachdb/bin/cockroach                                               May_16

and added the logs the CRDB node on sled 9:

.47s, commit-batch-sync=1.22s, other=0.00s], wrote 43 KiB sync [append-ent=42 KiB (7), apply=42 KiB (7 in 2 batches)]; node might be overloaded
W260520 04:37:14.704457 77340944 kv/kvserver/liveness/liveness.go:883 <E2><8B><AE> [n2,s2,r118/6:<E2><80><B9>/Table/34{8-9}<E2><80><BA>] 235893  slow heartbeat took 6.000507049s; err=context canceled
E260520 04:37:14.704524 77340944 kv/kvserver/replica_range_lease.go:427 <E2><8B><AE> [n2,s2,r118/6:<E2><80><B9>/Table/34{8-9}<E2><80><BA>] 235894  failed to heartbeat own liveness record: context canceled
W260520 04:37:15.139560 899 kv/kvserver/liveness/liveness.go:883 <E2><8B><AE> [n2,liveness-hb] 235895  slow heartbeat took 4.500268177s; err=context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896  failed node liveness heartbeat: <E2><80><B9>operation "node liveness heartbeat" timed out after 4.5s (given timeout 4.5s)
<E2><80><BA>: context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +(1) <E2><80><B9>operation "node liveness heartbeat" timed out after 4.5s (given timeout 4.5s)<E2><80><BA>
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +Wraps: (2) context deadline exceeded
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +An inability to maintain liveness will prevent a node from participating in a
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +cluster. If this problem persists, it may be a sign of resource starvation or
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +of network connectivity problems. For help troubleshooting, visit:
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +
W260520 04:37:15.139662 899 kv/kvserver/liveness/liveness.go:785 <E2><8B><AE> [n2,liveness-hb] 235896 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues

Last observation from @wfchandler was a spike in fsync times:

I260520 04:37:08.376712 743 kv/kvserver/store_raft.go:522 ⋮ [n2,s2,r1108/7:‹/System{/tsd/cr.s…-tse}›,raft] 235779  raft ready handling: 9.47s [append=0.00s, apply=1.16s, commit-batch-sync=8.31s, other=0.00s], wrote 15 KiB sync [append-ent=15 KiB (4), apply=15 KiB (4)]; node might be overloaded

Metadata

Metadata

Assignees

Labels

local storagerelating to the local storage feature

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions