Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UWB/bootconfig configuration mismatches #2581
Merged
SuperStrongDinosaur merged 3 commits on May 28, 2026
Databean approved these changes on May 19, 2026
jemoreira reviewed on May 19, 2026
2448508 to d246800
Databean approved these changes on May 23, 2026
jemoreira approved these changes on May 26, 2026
d246800 to 7102151
Description
This change addresses a set of critical deadlocks, resource-leak hangs, and configuration-mapping bugs encountered in multi-instance CVD deployments, specifically during a cvd restart.
By resolving these architectural synchronization issues, single instances within a multi-instance group can now be safely restarted independently without hanging or crashing other running instances, and without deadlocking host-side shared daemons.
Fixes
Issue: Previously, SharedFD::Fifo always deleted the path before calling mkfifo(). In a multi-instance setup, global host-side daemons are started once and hold open connections to the VM instances. Unlinking these paths on a single-instance restart destroys the inode mapping, meaning the restarted instance's crosvm would construct new FIFOs that the active netsimd daemon has no knowledge of. This caused crosvm to hang indefinitely on startup, waiting for a reader/writer connection that would never come.
Solution: Modified SharedFD::Fifo to perform a stat() check on the target path first. If the file already exists and is verified to be a FIFO, it is opened directly instead of unlinking and recreating it. Removed bt_fifo_vm, nfc_fifo_vm, and uwb_fifo_vm from the unlinking sequence in ServerLoopImpl::DeleteFifos() to ensure their persistent paths are preserved across instance-specific lifecycles.
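The stat-then-reuse behavior described above can be sketched with plain POSIX calls. This is only an illustration, not the actual SharedFD::Fifo code: the helper name OpenOrCreateFifo is hypothetical, the fallback of unlinking a non-FIFO at the path is an assumption, and opening with O_RDWR (which does not block on Linux) is likewise assumed.

```cpp
#include <cerrno>
#include <string>

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper (not the real SharedFD::Fifo signature).
// Returns an open file descriptor for the FIFO at `path`, or -1 on error.
int OpenOrCreateFifo(const std::string& path, mode_t mode) {
  struct stat st;
  if (stat(path.c_str(), &st) == 0) {
    if (S_ISFIFO(st.st_mode)) {
      // An existing FIFO is reused as-is, so its inode is preserved and
      // long-lived daemons that already hold it open stay connected.
      return open(path.c_str(), O_RDWR | O_CLOEXEC);
    }
    // Assumption: a non-FIFO at this path is stale and may be replaced.
    if (unlink(path.c_str()) != 0) return -1;
  } else if (errno != ENOENT) {
    return -1;  // stat failed for a reason other than "path is missing"
  }
  if (mkfifo(path.c_str(), mode) != 0) return -1;
  return open(path.c_str(), O_RDWR | O_CLOEXEC);
}
```

Calling the helper twice on the same path should yield two descriptors that refer to the same inode, which is the property the restarted instance relies on.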
Issue: The UWB host-connector logic only created the uwb_fifo_vm FIFOs if instance.enable_host_uwb_connector() evaluated to true. If config.enable_host_uwb() was true but the specific instance did not run the connector, the missing FIFOs would crash host-side daemons or break crosvm initialization.
Solution: Decoupled FIFO creation from the connector launch check. The FIFOs are now always created when global UWB is enabled (config.enable_host_uwb()), while the per-instance host-connector service is spawned only if instance.enable_host_uwb_connector() is true.
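The decoupling can be illustrated with a small planning function. Everything here is a hypothetical sketch rather than the launcher's real code: UwbSetup, PlanUwbSetup, and the ".in"/".out" file suffixes are assumptions made for illustration, as is gating the connector on both flags.

```cpp
#include <string>
#include <vector>

// Hypothetical result type: which FIFOs to create and whether to launch
// the per-instance host connector.
struct UwbSetup {
  std::vector<std::string> fifos_to_create;
  bool spawn_connector = false;
};

// Hypothetical planner. `global_uwb` stands in for
// config.enable_host_uwb(); `instance_connector` stands in for
// instance.enable_host_uwb_connector().
UwbSetup PlanUwbSetup(bool global_uwb, bool instance_connector,
                      const std::string& instance_dir) {
  UwbSetup setup;
  if (global_uwb) {
    // FIFOs depend only on the global flag, so every instance has them
    // even when it does not run its own connector.
    // The ".in"/".out" suffixes are assumed names.
    setup.fifos_to_create.push_back(instance_dir + "/uwb_fifo_vm.in");
    setup.fifos_to_create.push_back(instance_dir + "/uwb_fifo_vm.out");
  }
  // Assumption: the connector needs both the global and instance flags.
  setup.spawn_connector = global_uwb && instance_connector;
  return setup;
}
```

With global UWB on and the instance connector off, the plan still lists both FIFOs but does not spawn the connector, which is the case that previously left the FIFOs missing.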
Issue: During a shutdown or restart event, ProcessMonitor could hang waiting on socket reads in ReadMonitorSocketLoop because the control channel remained blocked. If the socket connection returned an error or was half-closed, the loop could fail or crash rather than exiting cleanly.
Solution: Retained the child socket's raw file descriptor (child_sock_) within the ProcessMonitor class. Added an explicit child_sock_->Shutdown(SHUT_RDWR) call at the end of MonitorRoutine to force-unblock any outstanding reads on the socket during shutdown. Hardened ReadMonitorSocketLoop to treat read errors as a clean exit when the monitor has already been marked as shutting down (!running.load()).
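The shutdown pattern described above can be sketched with a plain AF_UNIX socketpair. This is a minimal illustration under stated assumptions, not the ProcessMonitor implementation: ReadLoop, StopMonitor, and RunDemo are hypothetical names, raw file descriptors stand in for SharedFD, and the wake-up of a blocked read by shutdown() reflects Linux behavior.

```cpp
#include <atomic>
#include <cerrno>
#include <chrono>
#include <thread>

#include <sys/socket.h>
#include <unistd.h>

// Hypothetical read loop. Returns true if it exited cleanly: on EOF, or
// on a read error that occurred after shutdown had been requested.
bool ReadLoop(int fd, const std::atomic<bool>& running) {
  char buf[64];
  while (running.load()) {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) continue;      // a real loop would handle the message here
    if (n == 0) return true;  // EOF: peer closed or socket was shut down
    if (errno == EINTR) continue;
    return !running.load();   // errors are tolerated only while stopping
  }
  return true;
}

// Marks the monitor as stopping, then shuts the socket down so a reader
// blocked in read() returns instead of waiting forever.
void StopMonitor(int fd, std::atomic<bool>& running) {
  running.store(false);
  shutdown(fd, SHUT_RDWR);
}

// Starts a reader blocked on an idle socket, stops it, and reports
// whether the reader exited cleanly.
bool RunDemo() {
  int sv[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return false;
  std::atomic<bool> running{true};
  bool clean = false;
  std::thread reader([&] { clean = ReadLoop(sv[0], running); });
  // Give the reader time to block in read(); nothing is ever written.
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  StopMonitor(sv[0], running);
  reader.join();
  close(sv[0]);
  close(sv[1]);
  return clean;
}
```

Setting the flag before calling shutdown() matters in this sketch: the reader then sees the stopping state regardless of whether the read returns EOF or an error.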
Testing and Verification
The stability of these fixes was successfully validated under multi-instance conditions:
Multi-Instance Booting
Booted two parallel VM instances using the locally compiled binaries:
cvd create --config=sdv_core_instance1 --num_instances=2
Both instances initialized, completed guest execution, and reached the VIRTUAL_DEVICE_BOOT_COMPLETED signal cleanly.
Single-Instance Restarting (No Deadlocks)
Executed a single-instance restart command targeting instance 2:
cvd --group_name=cvd_1 --instance_name=2 restart
Instance 2 successfully halted its virtual processes, safely terminated crosvm and its children, preserved the shared Bluetooth/UWB/NFC FIFOs, and booted back to VIRTUAL_DEVICE_BOOT_COMPLETED.
Instance 1 remained completely undisturbed and functional during the entire restart cycle.
Verified using cvd status --print that both instances returned to a healthy Running state.
b/510634395