wait for viona Poller to run before declaring device running #1118

Merged
iximeow merged 2 commits into master from ixi/virtio-poller-startup on Apr 21, 2026

Conversation

@iximeow (Member) commented Apr 17, 2026

When Lifecycle::run() completes, the device should be fully running. PciVirtioViona did not uphold this, as run() would return after having asked the poller to start, but before confirming it was started. If the device was paused shortly after, the poller may have never actually started. running would not be set, and "pause" would immediately succeed only for the Poller to begin running shortly after!

Worse, the device may be halted and the underlying link connected to the viona fd may have been destroyed, all while the poller was starting. At this point the AsyncFd::with_interest() will fail with ENXIO and panic the poller task.

This is another "never happens in practice" bug. But in unit tests that rapidly set up a viona device, configure it, and tear it down, this is quite consistently hit.

Working backwards from the ENXIO, I eventually ended up at the following D script to pin down what specifically was triggering it:

```
> cat debug.d
fbt::viona_chpoll:entry {
  self->interested = 1;
  printf("chpoll");
}

fbt::ddi_get_soft_state:return / self->interested / {
  printf("ss->link: %p", ((viona_soft_state_t*)arg1)->ss_link);
}
```

(note this currently merges into #1117 but is kinda independent except that I noticed it as part of writing up #1117)

Comment thread: lib/propolis/src/hw/virtio/viona.rs (Outdated)

```diff
     // or resumption from a "paused" state.
     fn run(&self) {
-        self.poller_start();
+        tokio::task::block_in_place(|| self.poller_start());
```
Member Author:
sorry for more tokio crimes @hawkw

poller_start now sleeps on a condition variable, so we'll block for however long on whatever runtime thread we were setting up this device on. seems right to block_in_place() waiting for this to happen instead of doing that.
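For illustration, a minimal sketch of the condition-variable handshake described here (all names are hypothetical, not the actual propolis code): the poller task marks itself started, and the device's run path blocks until that flag is actually set, so "running" can no longer race ahead of the poller.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical startup indicator: the poller flips `started` and
/// notifies any thread blocked in `wait_started`.
#[derive(Default)]
struct StartGate {
    started: Mutex<bool>,
    cv: Condvar,
}

impl StartGate {
    fn mark_started(&self) {
        *self.started.lock().unwrap() = true;
        self.cv.notify_all();
    }

    fn wait_started(&self) {
        let mut guard = self.started.lock().unwrap();
        // Predicate loop guards against spurious wakeups and a notify
        // that fires before we start waiting.
        while !*guard {
            guard = self.cv.wait(guard).unwrap();
        }
    }
}

fn main() {
    let gate = Arc::new(StartGate::default());
    let g2 = Arc::clone(&gate);
    // Stands in for the spawned poller task.
    std::thread::spawn(move || g2.mark_started());
    // Stands in for run(): return only once the poller has started.
    gate.wait_started();
}
```

Because `wait_started` checks the flag under the same lock the poller takes to set it, there is no window where "pause" can observe a device that claims to be running but whose poller never got going.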

Contributor:

At first I was like, wait, this is not an async fn, but then I realized start_vm is. I think this is probably okay until we potentially fully utilize the Indicator stuff. This made me think about the vsock stuff, where we use a barrier for start and a oneshot channel for pause, which technically also block the runtime.

In fact the viona Lifecycle::pause is going to end up in wait_state.wait_stopped(), and we don't have a block_in_place there.

```diff
 #[cfg_attr(not(target_os = "illumos"), ignore)]
 fn run_viona_tests() {
-    let rt = tokio::runtime::Builder::new_current_thread()
+    let rt = tokio::runtime::Builder::new_multi_thread()
```
Member Author:

the other side of it: you can't block_in_place on a single-threaded runtime, so make this multi-threaded like server and standalone.

the real problem is that (as is kinda implicit in create_test_ctx()/start()) we have to be able to create the device, import state, and then set it running, so we can't just wait for the poller to be running out of the box. this is the least annoying way I see of stringing it up.

@iximeow force-pushed the ixi/virtio-queue-size-migrate branch from 3ea068e to e2ff18e on April 20, 2026 22:29
Base automatically changed from ixi/virtio-queue-size-migrate to master April 20, 2026 22:55
@iximeow force-pushed the ixi/virtio-poller-startup branch from b53a58b to cc70b34 on April 20, 2026 22:57
@iximeow iximeow merged commit c29c9bb into master Apr 21, 2026
14 checks passed
@iximeow iximeow deleted the ixi/virtio-poller-startup branch April 21, 2026 19:42