nvme_driver: don't lazily restore interrupts #3220
Conversation
Pull request overview
This PR updates NVMe servicing restore behavior to avoid “lazy” interrupt restoration by eagerly restoring I/O queues in a deterministic interrupt-vector order, working around device/VPci interrupt allocation ordering issues.
Changes:
- Change I/O queue restore strategy to (by default) restore all queues sorted by MSI-X interrupt vector.
- Add a saved-state flag (`allow_lazy_restore`) to optionally fall back to the prior eager + proto-queue (lazy) restore split.
- Persist `allow_lazy_restore: Some(false)` on save to force eager restore for now.
```rust
.filter(|q| {
    q.queue_data.qid == 1
        || !q.queue_data.handler_data.pending_cmds.commands.is_empty()
})
.flat_map(|q| -> Result<IoQueue<D>, anyhow::Error> {
    let qid = q.queue_data.qid;
    let cpu = q.cpu;
```
Iterator::flat_map with a closure returning Result<_, _> will silently drop Err values (because Result implements IntoIterator). That means queue restore failures (e.g. interrupt mapping / IoQueue::restore) can be ignored and the restore may succeed with missing queues. Use a fallible collection pattern (e.g., map + collect::<Result<Vec<_>, _>>()?, or try_fold) so errors propagate.
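A minimal standalone reproduction of the hazard (hypothetical helper names, not the driver's types): because `Result` implements `IntoIterator`, `flat_map` treats `Ok` as one item and `Err` as zero items, so failures vanish; collecting into `Result<Vec<_>, _>` propagates them instead.

```rust
// `flat_map` over a Result-returning closure: Err yields no items, so the
// failure for input 2 is silently dropped.
fn restore_lossy(inputs: &[i32]) -> Vec<i32> {
    inputs
        .iter()
        .flat_map(|&n| -> Result<i32, String> {
            if n == 2 { Err("restore failed".into()) } else { Ok(n * 10) }
        })
        .collect()
}

// Collecting into Result<Vec<_>, _> aborts on the first Err instead.
fn restore_fallible(inputs: &[i32]) -> Result<Vec<i32>, String> {
    inputs
        .iter()
        .map(|&n| {
            if n == 2 { Err("restore failed".to_string()) } else { Ok(n * 10) }
        })
        .collect()
}

fn main() {
    // The lossy version "succeeds" with an element missing.
    assert_eq!(restore_lossy(&[1, 2, 3]), vec![10, 30]);
    // The fallible version surfaces the error to the caller.
    assert!(restore_fallible(&[1, 2, 3]).is_err());
    assert_eq!(restore_fallible(&[1, 3]), Ok(vec![10, 30]));
}
```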
Pre-existing pattern — this PR didn't change the collection strategy. Worth fixing in a follow-up.
Filed #3229 to track this and the other two pre-existing error handling items together.
```rust
worker.io = sorted_io
    .into_iter()
    .flat_map(|q| -> Result<IoQueue<D>, anyhow::Error> {
        let qid = q.queue_data.qid;
        let cpu = q.cpu;
```
Same issue as above: this flat_map over Result will drop restore errors instead of propagating them, potentially leaving the driver partially restored without failing. Switch to a Result-propagating collection (collect::<Result<...>>()? / try_fold) so any queue restore error aborts restore.
Same — pre-existing in the eager path.
```rust
.expect("unable to find restored mem block")
.to_owned();
```
Restore uses expect("unable to find restored mem block"), which will panic if the saved state doesn't match the restored DMA allocations (corrupt/partial state, version skew, etc.). Prefer returning a structured restore error with context instead of panicking.
Suggested change:

```diff
-.expect("unable to find restored mem block")
-.to_owned();
+.cloned()
+.ok_or_else(|| {
+    anyhow::anyhow!(
+        "unable to find restored mem block for {}, qid {}, cpu {}, mem_len {}, base_pfn {}",
+        pci_id,
+        qid,
+        cpu,
+        q.queue_data.mem_len,
+        q.queue_data.base_pfn
+    )
+})?;
```
Pre-existing — the original code had the same expect. Agree the suggestion is better, but out of scope for this PR.
```rust
    issuer: q.queue.issuer().clone(),
    cpu: q.cpu,
};
this.io_issuers.per_cpu[q.cpu as usize].set(issuer).unwrap();
```
q.cpu from saved state is used as an index (per_cpu[q.cpu as usize]) and .set(...).unwrap() assumes it is both in-range and not already set. With the new eager-restore-all-queues path, any mismatch between saved CPUs and cpu_count (or duplicate CPUs) will panic during restore. Validate q.cpu < cpu_count and handle duplicate/previously-set issuers by returning an error instead of panicking.
Suggested change:

```diff
-this.io_issuers.per_cpu[q.cpu as usize].set(issuer).unwrap();
+let issuer_slot = this
+    .io_issuers
+    .per_cpu
+    .get(q.cpu as usize)
+    .ok_or_else(|| {
+        anyhow::anyhow!(
+            "saved queue {} has invalid cpu {} (cpu_count {})",
+            qid,
+            q.cpu,
+            this.io_issuers.per_cpu.len()
+        )
+    })?;
+issuer_slot.set(issuer).map_err(|_| {
+    anyhow::anyhow!(
+        "duplicate io issuer restore for cpu {} on queue {}",
+        q.cpu,
+        qid
+    )
+})?;
```
Pre-existing — identical .unwrap() existed in the original eager restore path. Saved state CPU values always come from our own prior save. Agree defensive validation would be better, but out of scope.
```rust
io,
qsize: worker_state.qsize,
max_io_queues: worker_state.max_io_queues,
allow_lazy_restore: Some(false), // For now, we always restore eagerly to work around device bugs.
```
allow_lazy_restore is always saved as Some(false), and there is no code path that can persist Some(true). That makes the lazy-restore branch effectively dead and likely to rot. Consider either (a) removing the flag/branch for now, or (b) plumbing a real knob (config/feature/env) and adding coverage so the lazy path stays tested.
Suggested change:

```diff
-allow_lazy_restore: Some(false), // For now, we always restore eagerly to work around device bugs.
+// For now, we always restore eagerly to work around device bugs.
+// Don't persist an explicit `false` value for a mode that has no way
+// to be enabled by current save paths; `None` preserves the default.
+allow_lazy_restore: None,
```
Intentional. The lazy path is preserved so the restore side can opt back into it once the host-side IV ordering bug is fixed. Some(false) is explicit — it signals "we chose eager" rather than "this field wasn't set."
```rust
// Eager restore path: restore ALL queues sorted by interrupt
// vector for ordered VPci allocation (MSI-X ordering workaround).
//
// Devnote: Safety of inline new_self_drained(): This loop is fully
// synchronous (no .await). Although IoQueue::restore() spawns
// queue handler tasks, they don't poll until the async runtime
// yields — which happens only after .collect() completes. So all
// new_self_drained() and new_draining() calls finish before any
// handler can fire the drain-complete signal. If this loop is ever
// refactored to be async, the waiters for empty queues must be
// pre-created (as done in the lazy path above).
let mut sorted_io: Vec<_> = saved_state.worker_data.io.iter().collect();
sorted_io.sort_by_key(|q| q.iv);
```
The new restore behavior (restoring all IO queues sorted by interrupt vector to enforce VPci CreateInterruptMessage ordering) is not covered by tests in this crate. Add a unit/integration test that exercises servicing save/restore and asserts the interrupt mapping/allocation order or at least that all queues are restored eagerly (no proto queues) under the default setting.
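A sketch of the core property such a test could assert, using hypothetical stand-in types rather than the crate's saved-state structs: under the default eager setting, every saved queue is restored, in ascending MSI-X interrupt-vector order.

```rust
// Hypothetical stand-in for the saved queue state; only the fields the
// ordering property needs.
#[derive(Clone)]
struct SavedQueue {
    qid: u16,
    iv: u32, // MSI-X interrupt vector
}

// Mirrors the eager path's `sort_by_key(|q| q.iv)` before queue creation:
// returns the qids in the order the queues would be restored.
fn eager_restore_order(mut io: Vec<SavedQueue>) -> Vec<u16> {
    io.sort_by_key(|q| q.iv);
    io.into_iter().map(|q| q.qid).collect()
}

fn main() {
    let saved = vec![
        SavedQueue { qid: 3, iv: 7 },
        SavedQueue { qid: 1, iv: 2 },
        SavedQueue { qid: 2, iv: 5 },
    ];
    let order = eager_restore_order(saved.clone());
    // All queues restored (no proto queues), lowest interrupt vector first.
    assert_eq!(order, vec![1, 2, 3]);
    assert_eq!(order.len(), saved.len());
}
```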
```rust
#[mesh(4)]
pub max_io_queues: u16,
/// Whether to allow lazy restore of IO queues that had no pending commands at the time of save.
#[mesh(5)]
```
nit: Do we need to save this if it is always false? We can just initialize the variable to false every time
You're right, since a missing bool defaults to false, saving a false is overkill. I just wanted to be very explicit in the ternary: we know it's unsafe, we know it's safe, or we don't know (assume unsafe).
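The ternary the author describes can be sketched like this (illustrative read-side helper, not the driver's code; `allow_lazy_restore` is the real field name):

```rust
// Tri-state semantics: Some(true) = save-time check said lazy restore is
// safe; Some(false) = explicitly chose eager; None = field absent (older
// saved state), so assume unsafe.
fn lazy_restore_allowed(allow_lazy_restore: Option<bool>) -> bool {
    // Only an explicit Some(true) opts into the lazy path; None and
    // Some(false) both restore eagerly.
    allow_lazy_restore.unwrap_or(false)
}

fn main() {
    assert!(lazy_restore_allowed(Some(true)));
    assert!(!lazy_restore_allowed(Some(false)));
    // Older saved states with no field behave like an explicit false.
    assert!(!lazy_restore_allowed(None));
}
```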
Works around a device-side issue: some devices cannot handle the case where interrupts are reprogrammed out of IV order. (cherry picked from commit 958b8a4)