
crucible upstairs stuck in live repair loop resulted in continuous I/O errors in guest #837

@jordanhendricks

Description

On the dogfood cluster in rack2, some Ubuntu 22.04 instances started complaining about I/O errors immediately after booting and never stopped. They were generally so sad that one could not log in on the serial console, either.

Some sample errors from such a guest:

[42083.744118] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1663: inode #80932: comm cron: reading directory lblock 0
[42143.743759] EXT4-fs error (device nvme0n1p1): __ext4_get_inode_loc_noinmem:4433: inode #33445: block 2379: comm cron: unable to read itable block
[42143.746472] EXT4-fs error (device nvme0n1p1): __ext4_find_entry:1663: inode #80932: comm cron: reading directory lblock 0
[42203.746123] EXT4-fs error (device nvme0n1p1): __ext4_get_inode_loc_noinmem:4433: inode #33445: block 2379: comm cron: unable to read itable block

I drilled into one such instance and saw that one of the downstairs came and went a couple of times:

[2023-07-16T22:49:57.858213302Z]  INFO: propolis-server/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Creating storage device sysbench-mysql-1 of kind Nvme
[2023-07-16T22:49:57.858314115Z]  INFO: propolis-server/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Creating Crucible disk from request Volume { id: bdb29e87-bff9-4dd7-bb55-1acdb0e29519, block_size: 512, sub_volumes: [Region { block_size: 512, blocks_per_extent: 131072, extent_count: 1600, opts: CrucibleOpts { id: bdb29e87-bff9-4dd7-bb55-1acdb0e29519, target: [[fd00:1122:3344:107::6]:19000, [fd00:1122:3344:106::8]:19000, [fd00:1122:3344:102::a]:19000], lossy: false, flush_timeout: None, key: Some("gh6Yuzy3lH2jJgYjkaSd0Mnt1GP1l1uudDAVkAVuFOw="), cert_pem: None, key_pem: None, root_cert_pem: None, control: None, read_only: false }, gen: 1 }], read_only_parent: Some(Volume { id: a9a0f459-31a6-441c-80df-ab88d971d5ab, block_size: 512, sub_volumes: [Region { block_size: 512, blocks_per_extent: 131072, extent_count: 48, opts: CrucibleOpts { id: 09e3bab3-f79c-4d58-89fa-effb976f6a87, target: [[fd00:1122:3344:101::a]:19001, [fd00:1122:3344:105::d]:19001, [fd00:1122:3344:106::c]:19001], lossy: false, flush_timeout: None, key: Some("cM1bVwfdfxeygBfUgotFLRXk4WZUJaPV4ob4sQteHis="), cert_pem: None, key_pem: None, root_cert_pem: None, control: None, read_only: true }, gen: 2 }], read_only_parent: None }) }
[2023-07-16T22:49:57.858522642Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Upstairs starts

...

[2023-07-16T22:51:11.99106978Z] ERROR: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: process_ds_operation error: No longer active
[2023-07-16T22:51:11.991077575Z]  WARN: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: [0] will exit pm_task, this downstairs Faulted
[2023-07-16T22:51:11.99108555Z] ERROR: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: [fd00:1122:3344:107::6]:19000: proc: [0] client work task ended, Ok(Err([0] This downstairs now in Faulted)), so we end too (lo

We determined that the downstairs not being able to keep up was likely due to a connectivity issue between gc8 (downstairs) and gc16 (instance). (See: https://github.com/oxidecomputer/meta/issues/231).

A single faulted downstairs doesn't explain why the guest was seeing I/O errors. We also see from the propolis logs that the crucible upstairs is seemingly stuck in some sort of loop related to live repair:

[2023-07-16T23:54:45.950369378Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Live Repair in progress, try again
[2023-07-16T23:55:45.953343455Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Checking if live repair is needed
[2023-07-16T23:55:45.953416544Z]  WARN: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Upstairs repair task running, trying again later
[2023-07-16T23:55:45.953432715Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Live Repair in progress, try again
[2023-07-16T23:56:45.95684997Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Checking if live repair is needed
[2023-07-16T23:56:45.956919332Z]  WARN: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Upstairs repair task running, trying again later
[2023-07-16T23:56:45.956937888Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Live Repair in progress, try again
[2023-07-16T23:57:45.959171286Z]  INFO: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Checking if live repair is needed
[2023-07-16T23:57:45.959222754Z]  WARN: propolis-server/crucible-bdb29e87-bff9-4dd7-bb55-1acdb0e29519/25180 on oxz_propolis-server_da88ee76-b2a5-4c9c-840c-97130ae73896: Upstairs repair task running, trying again

I then wondered what I/Os the guest was sending to the device, so I traced all nvme probes for both the propolis-server USDT provider and the pid provider (glomming on pid$N::*nvme*:). Surprisingly, I did not see any probes firing at all, indicating nothing was making it to our nvme emulation. I also traced I/Os to the downstairs and saw nothing in flight to any of the 3. This is all strange, because the guest was clearly sending I/Os, as it was continuously complaining.

Given that the path an I/O takes from the guest is: propolis nvme emulation -> crucible block device -> crucible upstairs -> etc., and given that nothing was firing in the emulation layer or reaching the downstairs, this suggests to me that the nvme emulation itself was wedged. One theory is that crucible was stuck, the I/O queues for the device filled up and never drained, and the guest could then not submit new I/Os at all (hence the complaining). I would like to prove that by inspecting the nvme queues in propolis, but short of better support for analyzing propolis-server core dumps, I'm not sure how to.

I then tried rebooting a different guest with the same pathology and observed that it was stuck rebooting, because crucible was stuck trying to shut down. This further makes me think that crucible is stuck in an irrecoverable way. I don't have enough context in crucible to quickly understand why.

Labels: known issue (to include in customer documentation and training)