
server: failed migrations are reported as having completed before the source VM is fully resumed #811

Description

@gjcolombo

Originally seen in #809; see excerpt below.


The PHD failure here is weird. The failure is in the running_process::import_failure test, which is what I'd expect this change to break if it broke anything. But the failure mode is odd: the first migration fails as expected, but then the source VM immediately gets a stop request, which it handles right away, rendering it unable to serve as the source for the second migration:

{"msg":"migration out failed, resuming","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434800658Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","error":"RemoteError(Destination, \"failed to migrate device state: failed to apply deserialized device state: you have no chance to survive, make your time\")"}
{"msg":"publishing new instance state","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434818339Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","migration":"InstanceMigrateStatusResponse { migration_in: None, migration_out: Some(InstanceMigrationStatus { id: d5a22278-1907-4188-beb1-dbaab29b33a3, state: Error }) }","state":"Running","gen":12}
{"msg":"state driver handled event","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434839439Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","outcome":"Continue"}
{"msg":"request completed","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.459488929Z","hostname":"buskin","pid":1013,"uri":"/instance","method":"GET","req_id":"2b877526-cd31-4935-81e0-ab86c6740d17","remote_addr":"127.0.0.1:37727","local_addr":"127.0.0.1:9000","latency_us":173,"response_code":"200"}
{"msg":"requested state via API","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460691072Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","state":"Stop"}
{"msg":"Queuing external request","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460710913Z","hostname":"buskin","pid":1013,"component":"request_queue","component":"vm_state_driver","disposition":"Enqueue","request":"Stop"}
{"msg":"state driver handling event","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460729383Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","event":"ExternalRequest(Stop)"}
{"msg":"stopping instance","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460745983Z","hostname":"buskin","pid":1013,"component":"vm_state_driver"}
{"msg":"request completed","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460760814Z","hostname":"buskin","pid":1013,"uri":"/instance/state","method":"PUT","req_id":"a85e15e1-5993-40b9-8fed-5f7ac47337fd","remote_addr":"127.0.0.1:37727","local_addr":"127.0.0.1:9000","latency_us":142,"response_code":"204"}

We log state changes requested through the framework's VM state change functions, as well as attempts to stop a VM when it's being torn down, and I don't see anything especially suspicious preceding the VM stopping. My best guess is that there's a synchronization bug in phd_framework::Framework::wait_for_cleanup_tasks that's allowing a VM teardown task from a previous test to run and affect this one, but I haven't yet found any evidence of this.
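
For reference, here is one hypothetical shape such a cleanup barrier could take, using tokio's JoinSet; this is not the actual phd_framework code. The failure mode guessed at above would correspond to a teardown task that is registered outside the set, or after the wait returns, and therefore escapes the barrier and runs into the next test.

```rust
use tokio::task::JoinSet;

// Hypothetical cleanup-task barrier (not the actual phd_framework
// implementation). A teardown task spawned concurrently with, or
// after, the wait escapes the barrier and may still be running when
// the next test starts its VMs.
#[derive(Default)]
struct CleanupTasks {
    tasks: JoinSet<()>,
}

impl CleanupTasks {
    fn spawn_teardown<F>(&mut self, fut: F)
    where
        F: std::future::Future<Output = ()> + Send + 'static,
    {
        self.tasks.spawn(fut);
    }

    // Waits for every teardown task spawned so far; anything spawned
    // after this returns is unaccounted for.
    async fn wait_for_cleanup_tasks(&mut self) {
        while let Some(result) = self.tasks.join_next().await {
            result.expect("teardown task panicked");
        }
    }
}
```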

I also can't reproduce this locally, so I've queued a rerun to see whether it happens again. In any case, the PHD VM cleanup logic could probably use some more instrumentation.
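
As a starting point for that instrumentation, something like the following would at least make an overlapping teardown visible in the PHD log. This is a sketch assuming an available slog::Logger and a hypothetical wrapper helper, not existing PHD code.

```rust
use slog::{info, Logger};

// Hypothetical instrumentation: wrap each teardown future so its
// start and completion are logged along with the VM it tears down.
// A teardown task leaking out of a previous test would then show up
// in the log alongside the current test's events.
fn instrumented_teardown<F>(
    log: Logger,
    vm_name: String,
    teardown: F,
) -> impl std::future::Future<Output = ()>
where
    F: std::future::Future<Output = ()>,
{
    async move {
        info!(log, "starting VM teardown task"; "vm" => vm_name.clone());
        teardown.await;
        info!(log, "finished VM teardown task"; "vm" => vm_name);
    }
}
```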

Originally posted by @gjcolombo in #809 (comment)

Metadata

Labels

server: Related specifically to the Propolis server API and its VM management functions.
testing: Related to testing and/or the PHD test framework.
