wait for viona Poller to run before declaring device running #1118

Merged
iximeow merged 2 commits into master from ixi/virtio-poller-startup on Apr 21, 2026

Conversation

@iximeow (Member) commented Apr 17, 2026

When Lifecycle::run() completes, the device should be fully running. PciVirtioViona did not uphold this, as run() would return after having asked the poller to start, but before confirming it was started. If the device was paused shortly after, the poller may have never actually started. running would not be set, and "pause" would immediately succeed only for the Poller to begin running shortly after!

Worse, the device may be halted and the underlying link connected to the viona fd may have been destroyed, all while the poller was starting. At this point the AsyncFd::with_interest() will fail with ENXIO and panic the poller task.

This is another "never happens in practice" bug. But in unit tests that rapidly set up a viona device, configure it, and tear it down, this is quite consistently hit.

Working backwards from the ENXIO, I eventually ended up at the following D script to pin down what specifically was triggering it:

```
> cat debug.d
fbt::viona_chpoll:entry {
  self->interested = 1;
  printf("chpoll");
}

fbt::ddi_get_soft_state:return / self->interested / {
  printf("ss->link: %p", ((viona_soft_state_t*)arg1)->ss_link);
}
```

(note this currently merges into #1117 but is kinda independent except that I noticed it as part of writing up #1117)

Comment thread: lib/propolis/src/hw/virtio/viona.rs (Outdated)

```diff
     // or resumption from a "paused" state.
     fn run(&self) {
-        self.poller_start();
+        tokio::task::block_in_place(|| self.poller_start());
```
Member Author:
sorry for more tokio crimes @hawkw

poller_start now sleeps on a condition variable, so we'll block for however long on whatever runtime thread we were setting up this device on. seems right to block_in_place() waiting for this to happen instead of doing that.
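For illustration, a minimal sketch of the condition-variable handshake described here (all names are hypothetical, not the actual propolis code): the poller task marks itself started, and the device's run path blocks until that flag is actually set, so "running" can no longer race ahead of the poller.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical startup indicator: the poller flips `started` and
/// notifies any thread blocked in `wait_started`.
#[derive(Default)]
struct StartGate {
    started: Mutex<bool>,
    cv: Condvar,
}

impl StartGate {
    fn mark_started(&self) {
        *self.started.lock().unwrap() = true;
        self.cv.notify_all();
    }

    fn wait_started(&self) {
        let mut guard = self.started.lock().unwrap();
        // Predicate loop guards against spurious wakeups and a notify
        // that fires before we start waiting.
        while !*guard {
            guard = self.cv.wait(guard).unwrap();
        }
    }
}

fn main() {
    let gate = Arc::new(StartGate::default());
    let g2 = Arc::clone(&gate);
    // Stands in for the spawned poller task.
    std::thread::spawn(move || g2.mark_started());
    // Stands in for run(): return only once the poller has started.
    gate.wait_started();
}
```

Because `wait_started` checks the flag under the same lock the poller takes to set it, there is no window where "pause" can observe a device that claims to be running but whose poller never got going.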

Contributor:

At first I was like, wait, this is not an async fn, but then I realized start_vm is. I think this is probably okay until we potentially fully utilize the Indicator stuff. This made me think about the vsock stuff, where we use a barrier for start and a oneshot channel for pause, which technically also block the runtime.

In fact the viona Lifecycle::pause is going to end up in wait_state.wait_stopped(), and we don't have a block_in_place there.

```diff
 #[cfg_attr(not(target_os = "illumos"), ignore)]
 fn run_viona_tests() {
-    let rt = tokio::runtime::Builder::new_current_thread()
+    let rt = tokio::runtime::Builder::new_multi_thread()
```
Member Author:

the other side of it: you can't block_in_place on a single-threaded runtime, so make this multi-threaded like server and standalone.

the real problem is that (as is kinda implicit in create_test_ctx()/start()) we have to be able to create the device, import state, and then set it running, so we can't just wait for the poller to be running out of the box. this is the least annoying way I see of stringing it up.

@iximeow force-pushed the ixi/virtio-queue-size-migrate branch from 3ea068e to e2ff18e on April 20, 2026 22:29
Base automatically changed from ixi/virtio-queue-size-migrate to master April 20, 2026 22:55
@iximeow force-pushed the ixi/virtio-poller-startup branch from b53a58b to cc70b34 on April 20, 2026 22:57
@iximeow iximeow merged commit c29c9bb into master Apr 21, 2026
14 checks passed
@iximeow iximeow deleted the ixi/virtio-poller-startup branch April 21, 2026 19:42