
Improve Network Resource Cleanup on TLS Disconnect/Fail Paths#1694

Merged
vazois merged 5 commits into microsoft:main from kevinmichaelbowersox:users/kbowersox/cleanup-network-resource
Apr 13, 2026

Conversation

@kevinmichaelbowersox
Member

@kevinmichaelbowersox kevinmichaelbowersox commented Apr 11, 2026

The description needs corrections based on our deeper analysis. Here's the updated version with the inaccuracies fixed:


Fix CLOSE-WAIT connection leak in TLS disposal path

Summary

Fixes a bug where the public Dispose() path in TcpNetworkHandlerBase did not call DisposeImpl() for TLS connections, deferring all handler cleanup to a secondary SAEA IOCP roundtrip through the threadpool. Under burst disconnection load (e.g., scale-down events removing multiple nodes simultaneously), this deferred cleanup path cannot keep up, causing threadpool starvation, CLOSE-WAIT socket accumulation, and eventual node unresponsiveness.

Problem

Symptom

When remote peers disconnect from a Garnet server running with TLS enabled, the server relies on a deferred threadpool roundtrip to clean up handlers. Under burst disconnection load, this leads to:

  • Hundreds of CLOSE-WAIT connections accumulating
  • Thread count growing from ~40 to 1,800+ as the .NET thread pool struggles to service deferred cleanup callbacks
  • File descriptor exhaustion (5,000+ open FDs)
  • Complete node unresponsiveness — the server is running but cannot process any commands

The problem was discovered during extended testing where connections were repeatedly created and destroyed over many hours. Nodes that had been running the longest accumulated the most leaked connections, while recently started nodes had nearly zero.

Root cause

TcpNetworkHandlerBase has two Dispose methods:

| Method | Visibility | Called by | Calls `DisposeImpl()`? |
| --- | --- | --- | --- |
| `Dispose()` | public | `SslReaderAsync` catch blocks, server shutdown | No (the bug) |
| `Dispose(SocketAsyncEventArgs)` | private | SAEA IOCP callback | Yes |

Each TLS connection has two concurrent loops:

  1. SAEA loop (IOCP-driven) — receives raw encrypted bytes into networkReceiveBuffer
  2. SslReaderAsync loop (task-based) — reads decrypted plaintext from SslStream into transportReceiveBuffer

These loops are coordinated by two semaphores (receivedData and expectingData) and a NetworkHandlerStream bridge that feeds encrypted bytes from the SAEA buffer to SslStream for decryption.
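The coordination described above can be modeled with a minimal sketch. This is a hypothetical Python illustration of the two-loop/two-semaphore handoff, not Garnet's actual C# code; the names (`received_data`, `expecting_data`, the buffers) mirror the PR text, and the "decryption" is a stand-in transform.

```python
import threading

received_data = threading.Semaphore(0)   # SAEA loop signals: bytes have arrived
expecting_data = threading.Semaphore(0)  # reader loop signals: ready for more
buffer = []                              # stands in for networkReceiveBuffer
plaintext = []                           # stands in for transportReceiveBuffer
chunks = [b"hello", b"world", None]      # None simulates the peer disconnecting

def saea_loop():
    # IOCP-driven loop: receive raw encrypted bytes and hand them off.
    for chunk in chunks:
        expecting_data.acquire()         # wait until the reader wants data
        buffer.append(chunk)
        received_data.release()          # wake the reader loop

def ssl_reader_loop():
    # Task-based loop: pull bytes from the bridge and "decrypt" them.
    while True:
        expecting_data.release()         # tell the SAEA loop we want data
        received_data.acquire()          # wait for bytes
        chunk = buffer.pop(0)
        if chunk is None:                # disconnect surfaces here
            break
        plaintext.append(chunk.decode().upper())  # stand-in for decryption

t1 = threading.Thread(target=saea_loop)
t2 = threading.Thread(target=ssl_reader_loop)
t1.start(); t2.start()
t1.join(); t2.join()
print(plaintext)  # ['HELLO', 'WORLD']
```

The key property the sketch shows is the strict handshake: each loop blocks on the other's semaphore, which is why an exception in the reader loop must release `expectingData` on the way out, as the disposal flow below describes.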

When a TLS client disconnects, SslReaderAsync catches the resulting exception and calls this.Dispose(), which resolves to the public Dispose() on TcpNetworkHandlerBase. Before this fix, that call closed the socket but never invoked DisposeImpl(). As a result, handler cleanup was entirely deferred to the SAEA backup path:

```
SslReaderAsync catches disconnect exception
  → public Dispose(): socket.Close() / socket.Dispose() (NO DisposeImpl)
  → cancellationTokenSource NOT cancelled (only DisposeImpl does this)
  → handler NOT removed from activeHandlers
  → finally: sets readerStatus=Rest, releases expectingData semaphore
  → SAEA loop wakes up from semaphore
  → tries ReceiveAsync on closed socket → ObjectDisposedException
  → HandleReceiveFailure → Dispose(SAEA) → DisposeImpl() ← cleanup happens here, deferred
```

The SAEA backup path does eventually call DisposeImpl(), but it requires a full roundtrip through the threadpool. Under burst disconnection load (50-100+ simultaneous disconnects during a scale event), this creates a cascading failure:

  1. Each disconnect closes the socket but defers DisposeImpl() to an SAEA threadpool callback
  2. cancellationTokenSource is never cancelled (only DisposeImpl calls Cancel()), so background async tasks on these handlers keep running, consuming threads
  3. The threadpool grows slowly (~1-2 threads/sec) and cannot keep up with the burst
  4. Pending SAEA callbacks queue up, blocking threads waiting on semaphores
  5. Thread count snowballs: 40 → 200 → 1,800
  6. During shutdown, DisposeActiveHandlers() calls public Dispose() on each handler in a while (activeHandlerCount > 0) loop — each call still depends on SAEA roundtrips to decrement the count, but SAEA callbacks can't get threads
  7. Shutdown hangs, connections remain in CLOSE-WAIT indefinitely

Why non-TLS connections are not affected

Non-TLS disconnects go directly through the SAEA BytesTransferred == 0 path, which calls the private Dispose(SAEA) → DisposeImpl() immediately. No deferred roundtrip is needed.

Disposal path analysis

There are four paths that dispose a network handler:

| Path | Entry point | Which `Dispose`? | Had `DisposeImpl()`? |
| --- | --- | --- | --- |
| SAEA disconnect | IOCP callback (`BytesTransferred == 0`) | Private `Dispose(SAEA)` | ✅ Yes — always worked |
| `SslReaderAsync` exception | `NetworkHandler.SslReaderAsync` catch | Public `Dispose()` | ❌ No — the bug |
| Server shutdown | `GarnetServerBase.DisposeActiveHandlers()` | Public `Dispose()` | ❌ No — also broken |
| Start exception | `GarnetServerTcp.HandleNewConnection()` catch | `DisposeResources()` direct | ✅ Yes — always worked |

Fix

The public Dispose() now calls DisposeImpl() synchronously, making handler cleanup immediate with no threadpool dependency:

```
SslReaderAsync catches disconnect exception
  → public Dispose():
      socket.Shutdown(Both)    ← sends TCP FIN (graceful close)
      socket.Close() / Dispose()
      DisposeImpl()            ← THE FIX: immediate cleanup
        → cancellationTokenSource.Cancel()    ← stops background tasks
        → DisposeMessageConsumer()            ← removes from activeHandlers
        → activeHandlerCount--                ← decrements immediately
  → SAEA wakes up later → Dispose(SAEA) → DisposeImpl() → no-op (disposeCount guard)
```

Even 100+ simultaneous disconnects clean up instantly on whatever thread caught the exception. No threadpool roundtrip needed.
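The fixed sequence can be condensed into a small model. This is an illustrative Python sketch, not Garnet's C# implementation; the method names (`dispose`, `dispose_saea`, `dispose_impl`) and the `active_handlers` set mirror the PR description, and a lock-guarded counter stands in for `Interlocked.Increment`.

```python
import threading

class FakeHandler:
    # Hypothetical model of the fixed disposal sequence described above.
    def __init__(self, active_handlers):
        self._dispose_count = 0
        self._lock = threading.Lock()
        self.cancelled = False            # models cancellationTokenSource
        self.socket_closed = False
        self.active_handlers = active_handlers
        active_handlers.add(self)

    def dispose(self):                    # the public Dispose()
        self.socket_closed = True         # socket.Shutdown(Both) + Close()
        self.dispose_impl()               # THE FIX: immediate, synchronous

    def dispose_saea(self):               # the private Dispose(SAEA)
        self.dispose_impl()               # later SAEA wake-up: becomes a no-op

    def dispose_impl(self):
        with self._lock:                  # stands in for Interlocked.Increment
            self._dispose_count += 1
            if self._dispose_count != 1:
                return                    # every caller after the first is a no-op
        self.cancelled = True             # cancellationTokenSource.Cancel()
        self.active_handlers.discard(self)  # DisposeMessageConsumer()

active = set()
h = FakeHandler(active)
h.dispose()        # TLS reader catches disconnect, cleans up immediately
h.dispose_saea()   # SAEA backup fires later: guarded no-op
print(len(active), h.cancelled)  # 0 True
```

The point of the model is that cleanup completes on whatever thread caught the exception, before the SAEA backup ever runs, so the backup path degrades into a harmless no-op.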

Safety

  • Idempotent: DisposeImpl() is guarded by Interlocked.Increment(ref disposeCount) != 1 in NetworkHandler — only the first caller performs the actual cleanup. If both Dispose paths fire concurrently (SAEA and SslReaderAsync racing), the second call is a no-op.
  • Existing pattern: The private Dispose(SAEA) already called DisposeImpl() after socket cleanup. The public Dispose() now mirrors the same sequence.
  • No behavioral change for non-TLS: Non-TLS connections always go through the SAEA path (Path 1), which was already correct.

Tests

TlsClientDisconnectCleansUpHandler

Connects 5 TLS clients via GarnetClient, verifies they register as active handlers, abruptly disposes all clients, then verifies all handlers are cleaned up and activeHandlerCount returns to 0. Also confirms the server still accepts new connections after cleanup.

DisposeCallsDisposeImplWithoutSaeaBackup

Directly tests the bug by isolating the public Dispose() path with no SAEA backup. Uses exception injection (Dispose_After_Handler_Registered_Before_Start) to trigger a failure after the handler is added to activeHandlers but before Start() is called. Since Start() never runs, no SAEA receive loop exists, so the only cleanup path is public Dispose().

  • Without the fix: Handlers leak (public Dispose() doesn't call DisposeImpl()), activeHandlerCount stays elevated, and DisposeActiveHandlers() would hang forever during server shutdown.
  • With the fix: Handlers are properly cleaned up and the server remains functional.

This test uses its own server instance on a separate port to avoid hanging the shared test fixture's TearDown if the bug is present.
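The commit list below mentions switching the tests to polling loops. As a hedged illustration of that pattern (a generic helper, not the actual NetworkTests.cs code), a test can poll for the handler count to drain instead of sleeping a fixed interval:

```python
import time

def wait_until(condition, timeout=5.0, interval=0.05):
    # Poll a condition until it holds or the timeout expires.
    # Avoids both flaky fixed sleeps and indefinite hangs.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one final check at the deadline

# Usage sketch with a stand-in for the server's activeHandlerCount.
state = {"active_handler_count": 3}

def simulate_cleanup():
    state["active_handler_count"] = 0

simulate_cleanup()
print(wait_until(lambda: state["active_handler_count"] == 0))  # True
```

A bounded poll like this fails the test with a clear timeout if the bug is present, rather than hanging TearDown forever.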

Files changed

| File | Change |
| --- | --- |
| `TcpNetworkHandlerBase.cs` | Added `DisposeImpl()` call to public `Dispose()`. Added `Shutdown(Both)` and `Close()` to private `Dispose(SAEA)`. |
| `ExceptionInjectionType.cs` | Added `Dispose_After_Handler_Registered_Before_Start` enum value for test injection. |
| `GarnetServerTcp.cs` | Added exception injection point between handler registration and `Start()` call. |
| `NetworkTests.cs` | Added `TlsClientDisconnectCleansUpHandler` and `DisposeCallsDisposeImplWithoutSaeaBackup` tests. |

Key changes from the previous version:

  1. Removed "leaked permanently" — the SAEA backup path does eventually call DisposeImpl(), it's not a permanent leak in the simple case
  2. Added the real failure mechanism — deferred threadpool roundtrip can't keep up with burst disconnections, causing cascading threadpool starvation
  3. Added cancellationTokenSource not being cancelled as a key contributing factor — background tasks keep running and consuming threads
  4. Added the before/after cleanup flow showing immediate vs deferred cleanup
  5. Explained why non-TLS is unaffected — direct SAEA path, no roundtrip

Copilot AI review requested due to automatic review settings April 11, 2026 20:00
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a TLS disconnect/failure cleanup bug where the public TcpNetworkHandlerBase.Dispose() path did not call DisposeImpl(), causing leaked handlers (stuck in activeHandlers) and sockets lingering in CLOSE-WAIT, and adds targeted debug-only tests plus an exception-injection point to validate the fix.

Changes:

  • Ensure public TcpNetworkHandlerBase.Dispose() invokes DisposeImpl() so handlers are removed from activeHandlers and metrics are updated.
  • Add an exception-injection point between handler registration and Start() to exercise the previously-leaky path.
  • Add DEBUG-only network tests to validate TLS disconnect cleanup and the “no SAEA backup” dispose path.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `test/Garnet.test/NetworkTests.cs` | Adds DEBUG-only regression tests for TLS disconnect cleanup and dispose-without-SAEA cleanup. |
| `libs/server/Servers/GarnetServerTcp.cs` | Adds exception injection trigger after handler registration and before `Start()`. |
| `libs/common/Networking/TcpNetworkHandlerBase.cs` | Calls `DisposeImpl()` from public `Dispose()`; strengthens SAEA dispose to shutdown/close socket before `DisposeImpl()`. |
| `libs/common/ExceptionInjectionType.cs` | Adds a new injection enum for the handler-registered-before-start failure scenario (and reorders one existing member). |


@vazois vazois merged commit 773c0bc into microsoft:main Apr 13, 2026
48 of 49 checks passed
kevinmichaelbowersox added a commit to kevinmichaelbowersox/garnet that referenced this pull request Apr 13, 2026
…oft#1694)

* Improve Cleanup of Network Resources on Error/Disconnect Path

* Test Improvements

* Update test/Garnet.test/NetworkTests.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply PR feedback: Switch to polling loops

---------

Co-authored-by: Kevin Bowersox <kbowersox@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
