Skip to content

fix(runtime): harden connmgr teardown drop contract#949

Merged
slepp merged 2 commits intomainfrom
fix/connmgr-teardown-drop-contract
Apr 11, 2026
Merged

fix(runtime): harden connmgr teardown drop contract#949
slepp merged 2 commits intomainfrom
fix/connmgr-teardown-drop-contract

Conversation

@slepp
Copy link
Copy Markdown
Contributor

@slepp slepp commented Apr 11, 2026

Summary

  • fence hew_connmgr_add during teardown so late reconnect installs are rejected fail-closed
  • close transport in ConnectionActor::drop before join as defense in depth
  • make hew_connmgr_add the single owner of failed-install cleanup and add focused teardown regressions

Validation

  • cargo test -p hew-runtime

slepp added 2 commits April 10, 2026 23:18
Defense-in-depth so drop does not rely on callers having already
closed the transport before dropping a ConnectionActor.

Previously ConnectionActor::drop only set reader_stop and called
join, which hangs if the reader thread is blocked inside recv() and
the transport connection was never explicitly closed (e.g., a late
add that was rejected after teardown drain but before close reached
the actor, or a future code path that forgets the close-then-drop
ordering).

Changes:
- Add transport (*mut HewTransport) and transport_closed (AtomicBool)
  fields to ConnectionActor.  transport is set in hew_connmgr_add;
  null for test-only actors created directly (null-safe via
  close_transport_conn).
- Add ConnectionActor::close_transport() which swaps transport_closed
  true before calling close_transport_conn, guarding against double-
  close.
- Drop now calls reader_stop then close_transport() then join, so a
  reader blocked in recv() is always unblocked before the join.
- hew_connmgr_free and hew_connmgr_remove: set transport_closed=true
  before the existing explicit close_transport_conn call so drop
  skips the redundant close.
- install_connection_actor error path: set transport_closed=true on
  the rejected actor before the explicit close so drop is a no-op
  on the transport.
- hew_connmgr_add thread-spawn failure path: use actor.close_transport()
  to mark and close in one step.

New regression tests:
- conn_actor_drop_closes_transport_before_join: verifies that dropping
  an actor with a running reader (blocked waiting for a channel close
  signal) unblocks and joins via the drop's close_transport() call.
- connmgr_free_rejects_concurrent_add_and_closes_transport: verifies
  that hew_connmgr_add returns -1 and closes the transport when the
  shutdown flag is set, and that hew_connmgr_free completes cleanly.
…nmgr_add

Before this commit, hew_connmgr_add had three failure paths that did NOT close
conn_id (mutex-poisoned, duplicate, and two callers post-failure), and two that
DID (shutdown fence in install_connection_actor and early-shutdown check), with
reconnect_worker_loop and hew_node::hew_node_connect also closing on -1 return.
HewTransportOps::close_conn has no idempotence contract, so the overlapping
closes were a double-close hazard.

New invariant:
  hew_connmgr_add owns conn_id from entry.
  On success: ownership transfers to the manager (closed by remove/free).
  On failure: hew_connmgr_add closes conn_id before returning -1.
  Exception: if mgr is null (API misuse), no transport pointer is available;
  the caller retains ownership (documented in the null-check comment).

Changes:
- connection.rs: add close_transport_conn in the mutex-poisoned and duplicate
  early-return paths so all non-null-mgr failure paths close conn_id.
- connection.rs: document ownership contract on hew_connmgr_add entry.
- connection.rs: remove the post-failure close from reconnect_worker_loop
  (hew_connmgr_add now owns it).
- hew_node.rs: remove the post-failure close from hew_node_connect
  (hew_connmgr_add now owns it).

The actor-drop defense-in-depth (transport_closed AtomicBool) is unchanged.
install_connection_actor still marks transport_closed=true before closing
on its rejection path, preventing a second close through actor Drop.

Validation: cargo test -p hew-runtime — 1067 passed, 0 failed.
@slepp slepp force-pushed the fix/connmgr-teardown-drop-contract branch from c08c823 to 05dff1c Compare April 11, 2026 05:29
@slepp slepp merged commit 9e6b5b2 into main Apr 11, 2026
13 checks passed
@slepp slepp deleted the fix/connmgr-teardown-drop-contract branch April 11, 2026 05:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant