@pfmooney
Contributor

The Identify and GetLogPage admin commands in NVMe should not assume that the output buffers provided to them in the PRPs consist of a single page-sized, page-aligned entry. Guests (such as Linux) can and will issue those commands with a page offset in PRP1, so the output spills over into a second page.

Fixes #427
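
To illustrate the split described above, here is a minimal sketch (not the code from this PR) of how a transfer can be divided when PRP1 carries a page offset. The 4 KiB page size, the `prp_chunks` name, and the restriction to transfers touching at most two pages (so PRP2 is a plain page pointer rather than a PRP list pointer) are all assumptions made for the example.

```rust
/// Page size assumed for this sketch.
const PAGE_SIZE: u64 = 4096;

/// Split a transfer of `len` bytes into `(guest address, length)` chunks.
///
/// Hypothetical helper: it only handles transfers that touch at most two
/// pages, and it rejects a PRP2 that is not page-aligned.
fn prp_chunks(prp1: u64, prp2: u64, len: u64) -> Option<Vec<(u64, u64)>> {
    // PRP1 may point into the middle of a page.
    let offset = prp1 % PAGE_SIZE;
    // Only the bytes up to the end of that first page belong to PRP1.
    let first = len.min(PAGE_SIZE - offset);
    let mut chunks = vec![(prp1, first)];

    let remaining = len - first;
    if remaining > 0 {
        // Whatever is left must fit in the single page named by PRP2.
        if prp2 % PAGE_SIZE != 0 || remaining > PAGE_SIZE {
            return None;
        }
        chunks.push((prp2, remaining));
    }
    Some(chunks)
}
```

With a page-aligned PRP1, a 4096-byte Identify result lands in a single chunk; with an offset of 0x200 in PRP1, the last 512 bytes end up in the page named by PRP2.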

prp: cmds::PrpIter,
data: &T,
mem: &MemCtx,
) -> Option<()> {
Contributor Author

I'm not especially fond of using Option<()> here, but it allowed the logic to be more terse for now. I'm working on some improvements to PrpIter and will revisit some of the error handling there.
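
For readers unfamiliar with the pattern, the terseness mentioned above comes from `?` short-circuiting on `None`. The following is a rough sketch with made-up types (`FakeMem` and `write_regions` are hypothetical stand-ins, not `MemCtx` or `PrpIter`), assuming guest memory can be modeled as a flat byte vector.

```rust
/// Hypothetical stand-in for guest memory: a flat byte array.
struct FakeMem {
    bytes: Vec<u8>,
}

impl FakeMem {
    /// Copy `data` to `addr`; returns `None` if the range is out of bounds.
    fn write(&mut self, addr: usize, data: &[u8]) -> Option<()> {
        let end = addr.checked_add(data.len())?;
        self.bytes.get_mut(addr..end)?.copy_from_slice(data);
        Some(())
    }
}

/// Write `data` across a sequence of `(addr, len)` regions.
/// Each failure point collapses to `None` via `?`, with no error type needed.
fn write_regions(
    mem: &mut FakeMem,
    regions: &[(usize, usize)],
    mut data: &[u8],
) -> Option<()> {
    for &(addr, len) in regions {
        let n = len.min(data.len());
        let (head, tail) = data.split_at(n);
        mem.write(addr, head)?;
        data = tail;
    }
    // Fail if the regions could not hold all of the data.
    data.is_empty().then_some(())
}
```

The trade-off is the one noted in the comment: the caller learns that the write failed, but not why.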

Contributor

@gjcolombo gjcolombo left a comment

LGTM. I've tested this out with a PHD boot loop test and have gotten through about 425 iterations with no apparent boot timeouts or guest service segfaults. To put that in perspective, usually a 50-iteration test run would produce 2-3 guest boot timeouts; now we're at the point where the test appears to hit bugs in the PHD framework before it observes any guest failures.

/// Write result data from an admin command into host memory
///
/// Returns `Some(())` if successful, else None
fn write_admin_result<T: Copy>(
Contributor

It would be neat to enforce the packed-ness of T here, but the most promising looking crate I found to do so (repr-trait) doesn't quite look to me like it would drop into this code neatly (reading through its derives I'm not convinced it will handle the repr(C, packed(1)) repr that NVMe's structs use). Oh well--maybe another time.

Contributor Author

Alternatively, a write_from_ptr() interface would keep us clear of the potential UB from the slice creation. This should work for now.
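
As background on the concern in this thread: viewing a `T` as `&[u8]` is only sound when every byte of `T` is initialized, which is why the packed `repr` matters (padding bytes are uninitialized). A minimal sketch of the slice-based approach follows; `ExamplePage` and `as_bytes` are hypothetical names for illustration, and this is not necessarily how the PR implements it.

```rust
use std::mem::size_of;
use std::slice;

/// Hypothetical struct laid out the way the thread describes NVMe's
/// structs: `repr(C, packed(1))`, so there is no padding between fields.
#[derive(Clone, Copy)]
#[repr(C, packed(1))]
struct ExamplePage {
    id: u8,
    count: u32,
    flags: u16,
}

/// View `data` as raw bytes.
///
/// # Safety
/// `T` must contain no padding bytes (for example, a `packed(1)` struct
/// made only of integer fields). The compiler does not check this, which
/// is the gap a trait-based enforcement would close.
unsafe fn as_bytes<T: Copy>(data: &T) -> &[u8] {
    // SAFETY: `data` is valid for `size_of::<T>()` bytes of reads, and the
    // caller guarantees that all of those bytes are initialized.
    unsafe { slice::from_raw_parts(data as *const T as *const u8, size_of::<T>()) }
}
```

Without `packed(1)`, the same fields under `repr(C)` would occupy 12 bytes, 5 of them padding, and the byte view would expose uninitialized memory.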

Contributor Author

Added to the comment regarding the repr() expectations, so it's at least a bit more visible in the meantime.

@jordanhendricks
Contributor

I also tested this with a reboot overnight and saw > 2100 successful test results before I turned it off. (Before, I would see about 5-10 segfaults every 300 boots.)

@leftwo
Contributor

leftwo commented Jul 15, 2023

I have a PR out for updating Crucible/Propolis in Omicron.
oxidecomputer/omicron#3646

If you expect this to go back soon (and I believe you do), then I can hold that
PR and update it to point to the git rev of this PR after it merges.

That way we won't have two PRs in Omicron to update Propolis

@pfmooney pfmooney merged commit fbd701c into oxidecomputer:master Jul 15, 2023
@pfmooney pfmooney deleted the nvme-427 branch July 15, 2023 17:47
@pfmooney
Contributor Author

I have a PR out for updating Crucible/Propolis in Omicron. oxidecomputer/omicron#3646

If you expect this to go back soon (and I believe you do), then I can hold that PR and update it to point to the git rev of this PR after it merges.

That way we won't have two PRs in Omicron to update Propolis

Thanks for waiting.

Development

Successfully merging this pull request may close these issues.

Ubuntu 22.04 guest: "segfault at 10 ip 00007f68a0fd5b41 sp 00007ffc956aa800 error 6 in libc.so.6" during first boot