wait for viona Poller to run before declaring device running #1118
Conversation
```
  // or resumption from a "paused" state.
  fn run(&self) {
-     self.poller_start();
+     tokio::task::block_in_place(|| self.poller_start());
```
sorry for more tokio crimes @hawkw
poller_start now sleeps on a condition variable, so we'd block for however long on whatever runtime thread we happened to be setting up this device on. It seems right to wrap that wait in block_in_place() rather than block the runtime thread outright.
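Roughly the shape this takes, as a sketch with made-up names (not the actual propolis types): the poller task flips a flag and notifies a condition variable once it is really running, and run() blocks on that condvar inside block_in_place so the worker thread's other tasks can be moved elsewhere while it waits.

```rust
use std::sync::{Arc, Condvar, Mutex};

// Illustrative stand-in for the poller bookkeeping; the real types and
// names in PciVirtioViona differ.
struct PollerState {
    running: Mutex<bool>,
    cv: Condvar,
}

impl PollerState {
    // Called from the poller task once it has actually started polling.
    fn mark_running(&self) {
        *self.running.lock().unwrap() = true;
        self.cv.notify_all();
    }

    // Sleep on the condition variable until the poller reports it is running.
    fn wait_running(&self) {
        let mut running = self.running.lock().unwrap();
        while !*running {
            running = self.cv.wait(running).unwrap();
        }
    }
}

fn run(state: Arc<PollerState>) {
    // Kick off the poller task...
    let poller = Arc::clone(&state);
    tokio::spawn(async move {
        // ... set up the AsyncFd, queues, etc., then:
        poller.mark_running();
        // ... actual poll loop ...
    });

    // ...and don't return from run() until the poller has confirmed it is
    // running. block_in_place lets the runtime hand this worker's other
    // tasks to another thread while we block on the condvar.
    tokio::task::block_in_place(|| state.wait_running());
}
```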
At first I was like, wait, this is not an async fn, but then I realized start_vm is. I think this is probably okay until we potentially fully utilize the Indicator stuff. This made me think about the vsock stuff, where we use a barrier for start and a oneshot channel for pause, which technically also block the runtime.
In fact, the viona Lifecycle::pause is going to end up in wait_state.wait_stopped(), and we don't have a block_in_place there.
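For illustration (hypothetical names, not the real vsock or viona code), the pattern being described is a synchronous Lifecycle method doing a blocking wait on whatever runtime worker called it; without block_in_place that worker is pinned for the duration.

```rust
use std::sync::{mpsc, Mutex};

// Hypothetical stand-in for the device's pause bookkeeping.
struct WaitState {
    stopped_rx: Mutex<Option<mpsc::Receiver<()>>>,
}

impl WaitState {
    // Blocks the calling thread until the worker signals that it stopped.
    fn wait_stopped(&self) {
        if let Some(rx) = self.stopped_rx.lock().unwrap().take() {
            let _ = rx.recv();
        }
    }
}

fn pause(wait_state: &WaitState) {
    // Without this wrapper, the blocking recv() above would tie up a tokio
    // worker thread until the poller actually stops.
    tokio::task::block_in_place(|| wait_state.wait_stopped());
}
```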
```
  #[cfg_attr(not(target_os = "illumos"), ignore)]
  fn run_viona_tests() {
-     let rt = tokio::runtime::Builder::new_current_thread()
+     let rt = tokio::runtime::Builder::new_multi_thread()
```
the other side of it: you can't block_in_place on a single-threaded runtime, so make this multi-threaded like server and standalone.
the real problem is that (as is kinda implicit in create_test_ctx()/start()) we have to be able to create the device, import state, and then set it running, so we can't just wait for the poller to be running out of the box. This is the least annoying way I see of stringing it up. A sketch of the runtime constraint follows.
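For context (a sketch of the constraint rather than the exact test code): block_in_place panics on a current-thread runtime because there is no other worker to take over the blocked thread's tasks, so the test runtime has to be built multi-threaded.

```rust
fn main() {
    // block_in_place requires the multi-threaded runtime flavor; calling it
    // from a current_thread runtime panics, since there is no other worker
    // to hand the blocked thread's tasks to.
    let rt = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .expect("failed to build runtime");

    rt.block_on(async {
        // create the device, import state, run() it, pause/halt it, ...
    });
}
```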
When `Lifecycle::run()` completes, the device should be fully running.
`PciVirtioViona` did not uphold this, as `run()` would return after
having asked the poller to start, but before confirming it *was*
started. If the device was paused shortly after, the poller may have
never actually started. `running` would not be set, and "pause" would
immediately succeed only for the Poller to begin running shortly after!
Worse, the device may be halted and the underlying link connected to the
viona fd may have been destroyed, all while the poller was starting. At
this point the `AsyncFd::with_interest()` will fail with ENXIO and panic
the poller task.
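For context on where the panic comes from (a sketch with a hypothetical function name, not the actual poller code): the poller registers the viona fd with the runtime via `AsyncFd::with_interest()`, which returns an `io::Result`, so if the underlying link has already been destroyed the registration comes back `Err(ENXIO)` and an `expect()` turns that into a task panic.

```rust
use std::os::fd::OwnedFd;
use tokio::io::{unix::AsyncFd, Interest};

// Hypothetical sketch of the poller's fd registration. If the viona link
// backing `fd` has already been torn down, this fails with ENXIO and the
// expect() panics the poller task.
fn register_viona_fd(fd: OwnedFd) -> AsyncFd<OwnedFd> {
    AsyncFd::with_interest(fd, Interest::READABLE)
        .expect("failed to register viona fd for readiness events")
}
```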
This is another "never happens in practice" bug. But in unit tests that
rapidly set up a viona device, configure it, and tear it down, this is
quite consistently hit.
Working backwards from the ENXIO, I eventually ended up at the following DTrace script to diagnose what specifically was causing it:
```
> cat debug.d
fbt::viona_chpoll:entry {
self->interested = 1;
printf("chpoll");
}
fbt::ddi_get_soft_state:return / self->interested / {
printf("ss->link: %p", ((viona_soft_state_t*)arg1)->ss_link);
}
```
(note this currently merges into #1117 but is kinda independent except that I noticed it as part of writing up #1117)