Skip to content

Resolve room handle synchronously in on_disconnect#1066

Draft
MaxHeimbrock wants to merge 1 commit intomainfrom
max/fix-on-disconnect-handle-race
Draft

Resolve room handle synchronously in on_disconnect#1066
MaxHeimbrock wants to merge 1 commit intomainfrom
max/fix-on-disconnect-handle-race

Conversation

@MaxHeimbrock
Copy link
Copy Markdown
Contributor

Summary

on_disconnect spawned an async task that retrieved the FfiRoom via the DashMap inside the spawn body and unwrap()'d the result. The synchronous DisconnectResponse returned to the foreign-language client before the spawn was scheduled, so a client that disposed its room handle as soon as the request returned would race the tokio scheduler. When the spawn finally polled, retrieve_handle saw the entry already removed by drop_handle and the unwrap aborted the process via SIGABRT inside the runtime worker. On iOS the race lost roughly half the time.

Move the retrieve_handle + clone before the spawn. The spawn now owns its Arc<FfiRoom> clone, so the close runs even if the foreign side has already removed the DashMap entry. Also replace the unwrap with ? so a genuinely missing handle surfaces as a proper FfiResult error rather than an abort.

Surfaced in the Unity SDK after a separate change started disposing the room handle synchronously on disconnect (rather than waiting for the Disconnected event + GC of the SafeHandle). Either change is reasonable on the foreign side; the FFI server should be robust to it.

Test plan

  • Reproduced the SIGABRT on iOS by disposing the room handle immediately after the request returns; ~50% crash rate before fix, 0 crashes after fix across multiple connect/disconnect cycles.
  • macOS still works.
  • CI tests pass.

🤖 Generated with Claude Code

The handler spawned an async task that retrieved the FfiRoom via the
DashMap inside the spawn body and unwrap()'d the result. The synchronous
DisconnectResponse returned to the foreign-language client before the
spawn was scheduled, so a client that disposed its room handle as soon
as the request returned would race the tokio scheduler. When the spawn
finally polled, retrieve_handle saw the entry already removed by
drop_handle and the unwrap aborted the process via SIGABRT inside the
runtime worker. On iOS the race lost roughly half the time.

Move the retrieve_handle + clone before the spawn. The spawn now owns
its Arc<FfiRoom> clone, so the close runs even if the foreign side has
already removed the DashMap entry. Also replace the unwrap with `?` so
a genuinely missing handle surfaces as a proper FfiResult error rather
than an abort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

No changeset found

This PR modifies the following packages but doesn't include a changeset:

Directly changed:

  • livekit-ffi

Click here to create a changeset

The link pre-populates a changeset file with patch bumps for all affected packages.
Edit the description and bump types as needed before committing.

If this change doesn't require a version bump, add the internal label to this PR.

@MaxHeimbrock
Copy link
Copy Markdown
Contributor Author

Would the fix in max/fix-on-disconnect-handle-race not be needed if we dispose the room only after the disconnect event arrives back in Unity?
⎿ ⧉ Selected 1 lines from ../../client-sdk-rust~/livekit-ffi/src/server/requests.rs in Visual Studio Code

⏺ Yes, you're right — that's an alternative fix that would also avoid the race. If C# waited for the Disconnected room event (or DisconnectCallback) before    
  disposing the handle, the Rust spawn would have already retrieved its Arc<FfiRoom> clone before the C# drop happened, so the panic wouldn't trigger.         
                                                                                                                                                               
  But I'd argue the Rust fix is still worth landing, for a few reasons:                                                                                        
                                                                                                                                                               
  1. Defensive vs. polite                                                                                                                                      
                                                                                                                                                               
  The Rust function panics on user-controllable input (room_handle). Other handlers in the same file use ? and return proper errors. The current behavior is   
  "if anyone passes me a stale handle, I abort the whole process" — which is too sharp. Fixing it makes the FFI honest about what it does.                     
                                                                                                                                                               
  2. Other foreign clients                                                                                                                                     
                                                                                                                                                               
  Python, Node, the official Swift SDK, etc. each have their own threading models and lifecycle assumptions. We can't expect every binding to know "wait for   
  the Disconnected event before drop_handle, or you abort the host process." The fix lets the FFI server be safe regardless of how the binding chooses to      
  handle the disconnect signal.                                                                                                                                
                                                                                                                                                               
  3. Other races we haven't enumerated                                                                                                                         
  
  - Network drop → Rust may emit a Disconnected event independently of any C# Disconnect call. If C# decides to drop the handle in response, there could still 
  be in-flight requests against that room. Same race shape, different trigger.
  - Concurrent calls (Disconnect + something else) — some other request retrieves the handle, races with drop.                                                 
                                                                                                                                                               
  4. Cost
                                                                                                                                                               
  The Rust fix is 6 lines that strictly tightens behavior. Same compile, no perf change, no semantic change for well-behaved clients.                          
   
  Trade-off summary                                                                                                                                            
                  
  ┌───────────────────────────────────────┬───────────────────┬───────────────────────────────────────────────────────────────────────┐                        
  │               Approach                │       Scope       │                                 Risk                                  │
  ├───────────────────────────────────────┼───────────────────┼───────────────────────────────────────────────────────────────────────┤
  │ C# waits for Disconnected before drop │ Unity-only        │ Need a timeout fallback if event is lost; more complex Disconnect API │
  ├───────────────────────────────────────┼───────────────────┼───────────────────────────────────────────────────────────────────────┤
  │ Rust resolves handle synchronously    │ All FFI clients   │ None I can see                                                        │                        
  ├───────────────────────────────────────┼───────────────────┼───────────────────────────────────────────────────────────────────────┤                        
  │ Both                                  │ Belt + suspenders │ Slightly more code                                                    │                        
  └───────────────────────────────────────┴───────────────────┴───────────────────────────────────────────────────────────────────────┘                        
                  
  I'd recommend keeping the Rust fix regardless of whether you also change the C# side. They're not mutually exclusive — the Rust fix is just hardening that   
  benefits everyone using the FFI.
                                                                                                                                                               
  But if you'd rather have the C# fix instead and close the Rust PR, that's a defensible call too. Up to you. Want to discuss either path further?  

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant