Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UWB/bootconfig configuration mismatches#2581

Merged
SuperStrongDinosaur merged 3 commits into google:main from SuperStrongDinosaur:restartHangFix
May 28, 2026
SuperStrongDinosaur commented May 19, 2026

Description
This change addresses a set of critical deadlocks, resource-leak hangs, and configuration-mapping bugs encountered in multi-instance CVD deployments, specifically during a cvd restart.

By resolving these architectural synchronization issues, single instances within a multi-instance group can now be safely restarted independently without hanging or crashing other running instances, and without deadlocking host-side shared daemons.

Fixes

  1. Prevent FIFO Unlinking Deadlocks (SharedFD::Fifo & DeleteFifos)

Issue: Previously, SharedFD::Fifo always deleted the path before calling mkfifo(). In a multi-instance setup, global host-side daemons are started once and hold open connections to the VM instances. Unlinking these paths on a single-instance restart destroys the inode mapping, meaning the restarted instance's crosvm would construct new FIFOs that the active netsimd daemon has no knowledge of. This caused crosvm to hang indefinitely on startup, waiting for a reader/writer connection that would never come.

Solution: Modified SharedFD::Fifo to perform a stat() check on the target path first. If the file already exists and is verified to be a FIFO, it is opened directly instead of unlinking and recreating it. Removed bt_fifo_vm, nfc_fifo_vm, and uwb_fifo_vm from the unlinking sequence in ServerLoopImpl::DeleteFifos() to ensure their persistent paths are preserved across instance-specific lifecycles.
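The stat-then-reuse idea can be illustrated with a minimal POSIX sketch. This is not the actual SharedFD::Fifo code: the helper name OpenOrCreateFifo is hypothetical, and replacing a non-FIFO file found at the path is an assumption, since the description only covers the case where the path is already a FIFO.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <cerrno>
#include <string>

// Hypothetical helper (not the real SharedFD::Fifo). Returns an open file
// descriptor on success, or -1 with errno set on failure.
int OpenOrCreateFifo(const std::string& path, mode_t mode) {
  struct stat st;
  if (stat(path.c_str(), &st) == 0) {
    if (!S_ISFIFO(st.st_mode)) {
      // Assumption: a non-FIFO file at this path is stale and is replaced.
      if (unlink(path.c_str()) != 0) return -1;
      if (mkfifo(path.c_str(), mode) != 0) return -1;
    }
    // Otherwise the path is already a FIFO: keep it, so that peers holding
    // it open continue to share the same inode.
  } else if (errno == ENOENT) {
    // Nothing at the path yet: create the FIFO for the first time.
    if (mkfifo(path.c_str(), mode) != 0) return -1;
  } else {
    return -1;  // stat() failed for another reason (e.g. EACCES).
  }
  // O_RDWR avoids blocking in open() on Linux; POSIX leaves this unspecified.
  return open(path.c_str(), O_RDWR | O_CLOEXEC);
}
```

Calling the helper twice on the same path leaves the inode unchanged, which is the property a long-running daemon relies on when another process reopens the FIFO.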

  2. Ensure Proper FIFO Provisioning for Shared UWB Services (Files affected: uwb_connector.cpp)

Issue: The UWB host-connector logic only created the uwb_fifo_vm FIFOs if instance.enable_host_uwb_connector() evaluated to true. If config.enable_host_uwb() was true but the specific instance did not run the connector, the missing FIFOs would crash host-side daemons or break crosvm initialization.

Solution: Decoupled FIFO creation from the launcher command check. The FIFOs are now always created when global UWB is enabled (config.enable_host_uwb()), while the local host-connector service is only spawned when instance.enable_host_uwb_connector() is true.
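The decoupling can be pictured as two independent decisions. The sketch below is illustrative only: UwbSetupPlan, PlanUwbSetup, and the ".in"/".out" suffixes are hypothetical names, not taken from uwb_connector.cpp, and it assumes nothing is spawned when global UWB is off.

```cpp
#include <string>
#include <vector>

// Hypothetical plan type: which FIFOs to create and whether to spawn the
// per-instance connector process.
struct UwbSetupPlan {
  std::vector<std::string> fifos_to_create;
  bool spawn_connector = false;
};

// global_uwb_enabled stands in for config.enable_host_uwb();
// instance_connector_enabled stands in for
// instance.enable_host_uwb_connector().
UwbSetupPlan PlanUwbSetup(bool global_uwb_enabled,
                          bool instance_connector_enabled,
                          const std::string& fifo_base) {
  UwbSetupPlan plan;
  if (!global_uwb_enabled) return plan;  // Assumption: nothing to do.
  // FIFO creation depends only on the global flag. The suffixes here are
  // made up for illustration.
  plan.fifos_to_create = {fifo_base + ".in", fifo_base + ".out"};
  // Spawning the connector additionally depends on the instance flag.
  plan.spawn_connector = instance_connector_enabled;
  return plan;
}
```

With global UWB on and the instance connector off, the plan still lists both FIFOs but spawns no connector, which is the case that previously left the FIFOs missing.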

  3. Fix ProcessMonitor Shutdown Socket Hangs (Files affected: process_monitor.cc, process_monitor.h)

Issue: During a shutdown or restart event, ProcessMonitor could hang waiting on socket reads in ReadMonitorSocketLoop because the control channel remained blocked. If the socket connection returned an error or was half-closed, the loop could fail or crash rather than exiting cleanly.

Solution: Retained the child socket (child_sock_) as a member of the ProcessMonitor class. Added an explicit child_sock_->Shutdown(SHUT_RDWR) call at the end of MonitorRoutine to force-unblock any outstanding read on the socket during shutdown. Hardened ReadMonitorSocketLoop to treat read errors as a clean exit when the monitor has already been marked as shutting down (!running.load()).
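The shutdown-to-unblock pattern can be shown in isolation with plain POSIX sockets. This sketch does not use ProcessMonitor or SharedFD; DemoShutdownUnblocksReader and its local variables are hypothetical, and the behavior described is that of AF_UNIX stream sockets on Linux.

```cpp
#include <sys/socket.h>
#include <unistd.h>

#include <atomic>
#include <thread>

// Returns true if the reader thread left its loop cleanly after the owner
// flagged shutdown and called shutdown() on the socket.
bool DemoShutdownUnblocksReader() {
  int socks[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, socks) != 0) return false;

  std::atomic<bool> running{true};
  std::atomic<bool> clean_exit{false};

  std::thread reader([&] {
    char buf[64];
    while (true) {
      ssize_t n = read(socks[0], buf, sizeof(buf));
      if (n > 0) continue;  // Real code would handle the message here.
      // EOF or error: treat it as a clean exit only if shutdown was
      // already requested, mirroring the !running.load() check.
      clean_exit = !running.load();
      return;
    }
  });

  // Mark shutdown first, then wake the reader: shutdown() makes the
  // pending (or next) read() return instead of blocking forever.
  running = false;
  shutdown(socks[0], SHUT_RDWR);
  reader.join();

  close(socks[0]);
  close(socks[1]);
  return clean_exit.load();
}
```

Setting the flag before calling shutdown() matters: the reader can only observe the wake-up after the flag is already false, so it never misreads an intentional shutdown as a failure.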

Testing and Verification
The stability of these fixes was successfully validated under multi-instance conditions:

Multi-Instance Booting

Booted two parallel VM instances using the locally compiled binaries:
cvd create --config=sdv_core_instance1 --num_instances=2

Both instances initialized, completed guest execution, and reached the VIRTUAL_DEVICE_BOOT_COMPLETED signal cleanly.

Single-Instance Restarting (No Deadlocks)

Executed a single-instance restart command targeting instance 2:
cvd --group_name=cvd_1 --instance_name=2 restart

Instance 2 successfully halted its virtual processes, safely terminated crosvm and its children, preserved the shared Bluetooth/UWB/NFC FIFOs, and booted back to VIRTUAL_DEVICE_BOOT_COMPLETED.

Instance 1 remained completely undisturbed and functional during the entire restart cycle.

Verified using cvd status --print that both instances returned to a healthy Running state.

b/510634395

@SuperStrongDinosaur SuperStrongDinosaur marked this pull request as ready for review May 19, 2026 09:56
@GoogleCuttlefishTesterBot GoogleCuttlefishTesterBot removed the kokoro:run Run e2e tests. label May 19, 2026
@SuperStrongDinosaur SuperStrongDinosaur changed the title Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UW… Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UWB/bootconfig configuration mismatches May 19, 2026
Comment thread base/cvd/cuttlefish/common/libs/fs/shared_fd.cpp Outdated
Comment thread base/cvd/cuttlefish/common/libs/fs/shared_fd.cpp Outdated
@SuperStrongDinosaur SuperStrongDinosaur added the kokoro:run Run e2e tests. label May 27, 2026
@GoogleCuttlefishTesterBot GoogleCuttlefishTesterBot removed the kokoro:run Run e2e tests. label May 27, 2026
@3405691582 3405691582 added the kokoro:force-run Trigger a presubmit build unconditionally. label May 27, 2026
@GoogleCuttlefishTesterBot GoogleCuttlefishTesterBot removed the kokoro:force-run Trigger a presubmit build unconditionally. label May 27, 2026
@SuperStrongDinosaur SuperStrongDinosaur added the kokoro:force-run Trigger a presubmit build unconditionally. label May 28, 2026
@GoogleCuttlefishTesterBot GoogleCuttlefishTesterBot removed the kokoro:force-run Trigger a presubmit build unconditionally. label May 28, 2026
@SuperStrongDinosaur SuperStrongDinosaur added the kokoro:force-run Trigger a presubmit build unconditionally. label May 28, 2026
@GoogleCuttlefishTesterBot GoogleCuttlefishTesterBot removed the kokoro:force-run Trigger a presubmit build unconditionally. label May 28, 2026
@SuperStrongDinosaur SuperStrongDinosaur added the kokoro:force-run Trigger a presubmit build unconditionally. label May 28, 2026
@GoogleCuttlefishTesterBot GoogleCuttlefishTesterBot removed the kokoro:force-run Trigger a presubmit build unconditionally. label May 28, 2026
@SuperStrongDinosaur SuperStrongDinosaur added this pull request to the merge queue May 28, 2026
Merged via the queue into google:main with commit b8d9225 May 28, 2026
32 checks passed
@SuperStrongDinosaur SuperStrongDinosaur deleted the restartHangFix branch May 28, 2026 12:32
5 participants