
server: failed migrations are reported as having completed before the source VM is fully resumed #811

Description

@gjcolombo

Originally seen in #809; see excerpt below.


The PHD failure here is weird. The failure is in the running_process::import_failure test, which is what I'd expect this change to break if it broke anything. But the failure mode is odd: the first migration fails as expected, but then the source VM immediately gets a stop request, which it handles right away, rendering it unable to serve as the source for the second migration:

{"msg":"migration out failed, resuming","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434800658Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","error":"RemoteError(Destination, \"failed to migrate device state: failed to apply deserialized device state: you have no chance to survive, make your time\")"}
{"msg":"publishing new instance state","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434818339Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","migration":"InstanceMigrateStatusResponse { migration_in: None, migration_out: Some(InstanceMigrationStatus { id: d5a22278-1907-4188-beb1-dbaab29b33a3, state: Error }) }","state":"Running","gen":12}
{"msg":"state driver handled event","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.434839439Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","outcome":"Continue"}
{"msg":"request completed","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.459488929Z","hostname":"buskin","pid":1013,"uri":"/instance","method":"GET","req_id":"2b877526-cd31-4935-81e0-ab86c6740d17","remote_addr":"127.0.0.1:37727","local_addr":"127.0.0.1:9000","latency_us":173,"response_code":"200"}
{"msg":"requested state via API","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460691072Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","state":"Stop"}
{"msg":"Queuing external request","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460710913Z","hostname":"buskin","pid":1013,"component":"request_queue","component":"vm_state_driver","disposition":"Enqueue","request":"Stop"}
{"msg":"state driver handling event","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460729383Z","hostname":"buskin","pid":1013,"component":"vm_state_driver","event":"ExternalRequest(Stop)"}
{"msg":"stopping instance","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460745983Z","hostname":"buskin","pid":1013,"component":"vm_state_driver"}
{"msg":"request completed","v":0,"name":"propolis-server","level":30,"time":"2024-11-13T23:28:13.460760814Z","hostname":"buskin","pid":1013,"uri":"/instance/state","method":"PUT","req_id":"a85e15e1-5993-40b9-8fed-5f7ac47337fd","remote_addr":"127.0.0.1:37727","local_addr":"127.0.0.1:9000","latency_us":142,"response_code":"204"}

We log state changes requested through the framework's VM state change functions, as well as attempts to stop a VM when it's being torn down, and I don't see anything especially suspicious preceding the VM stopping. My best guess is that there's a synchronization bug in phd_framework::Framework::wait_for_cleanup_tasks that's allowing a VM teardown task from a previous test to run and affect this one, but I haven't yet found any evidence of this.
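
For reference, here is one hypothetical shape such a cleanup barrier could take, using tokio's JoinSet; this is not the actual phd_framework code. The failure mode guessed at above would correspond to a teardown task that is registered outside the set, or after the wait returns, and therefore escapes the barrier and runs into the next test.

```rust
use tokio::task::JoinSet;

// Hypothetical cleanup-task barrier (not the actual phd_framework
// implementation). A teardown task spawned concurrently with, or
// after, the wait escapes the barrier and may still be running when
// the next test starts its VMs.
#[derive(Default)]
struct CleanupTasks {
    tasks: JoinSet<()>,
}

impl CleanupTasks {
    fn spawn_teardown<F>(&mut self, fut: F)
    where
        F: std::future::Future<Output = ()> + Send + 'static,
    {
        self.tasks.spawn(fut);
    }

    // Waits for every teardown task spawned so far; anything spawned
    // after this returns is unaccounted for.
    async fn wait_for_cleanup_tasks(&mut self) {
        while let Some(result) = self.tasks.join_next().await {
            result.expect("teardown task panicked");
        }
    }
}
```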

I also can't reproduce this locally, so I've queued a rerun to see whether it happens again. In any case, the PHD VM cleanup logic could probably use some more instrumentation.
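
As a starting point for that instrumentation, something like the following would at least make an overlapping teardown visible in the PHD log. This is a sketch assuming an available slog::Logger and a hypothetical wrapper helper, not existing PHD code.

```rust
use slog::{info, Logger};

// Hypothetical instrumentation: wrap each teardown future so its
// start and completion are logged along with the VM it tears down.
// A teardown task leaking out of a previous test would then show up
// in the log alongside the current test's events.
fn instrumented_teardown<F>(
    log: Logger,
    vm_name: String,
    teardown: F,
) -> impl std::future::Future<Output = ()>
where
    F: std::future::Future<Output = ()>,
{
    async move {
        info!(log, "starting VM teardown task"; "vm" => vm_name.clone());
        teardown.await;
        info!(log, "finished VM teardown task"; "vm" => vm_name);
    }
}
```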

Originally posted by @gjcolombo in #809 (comment)

Metadata

Labels

server: Related specifically to the Propolis server API and its VM management functions.
testing: Related to testing and/or the PHD test framework.
