
Improve Network Resource Cleanup on TLS Disconnect/Fail Paths#1694

Merged
vazois merged 5 commits into microsoft:main from kevinmichaelbowersox:users/kbowersox/cleanup-network-resource
Apr 13, 2026

Conversation

@kevinmichaelbowersox
Member

@kevinmichaelbowersox kevinmichaelbowersox commented Apr 11, 2026

The description needs corrections based on our deeper analysis. Here's the updated version with the inaccuracies fixed:


Fix CLOSE-WAIT connection leak in TLS disposal path

Summary

Fixes a bug where the public Dispose() path in TcpNetworkHandlerBase did not call DisposeImpl() for TLS connections, deferring all handler cleanup to a secondary SAEA IOCP roundtrip through the threadpool. Under burst disconnection load (e.g., scale-down events removing multiple nodes simultaneously), this deferred cleanup path cannot keep up, causing threadpool starvation, CLOSE-WAIT socket accumulation, and eventual node unresponsiveness.

Problem

Symptom

When remote peers disconnect from a Garnet server running with TLS enabled, the server relies on a deferred threadpool roundtrip to clean up handlers. Under burst disconnection load, this leads to:

  • Hundreds of CLOSE-WAIT connections accumulating
  • Thread count growing from ~40 to 1,800+ as the .NET thread pool struggles to service deferred cleanup callbacks
  • File descriptor exhaustion (5,000+ open FDs)
  • Complete node unresponsiveness — the server is running but cannot process any commands

The problem was discovered during extended testing where connections were repeatedly created and destroyed over many hours. Nodes that had been running the longest accumulated the most leaked connections, while recently started nodes had nearly zero.

Root cause

TcpNetworkHandlerBase has two Dispose methods:

| Method | Visibility | Called by | Calls `DisposeImpl()`? |
| --- | --- | --- | --- |
| `Dispose()` | public | `SslReaderAsync` catch blocks, server shutdown | No (the bug) |
| `Dispose(SocketAsyncEventArgs)` | private | SAEA IOCP callback | Yes |

Each TLS connection has two concurrent loops:

  1. SAEA loop (IOCP-driven) — receives raw encrypted bytes into networkReceiveBuffer
  2. SslReaderAsync loop (task-based) — reads decrypted plaintext from SslStream into transportReceiveBuffer

These loops are coordinated by two semaphores (receivedData and expectingData) and a NetworkHandlerStream bridge that feeds encrypted bytes from the SAEA buffer to SslStream for decryption.
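The coordination described above can be modeled with a minimal sketch. This is a hypothetical Python illustration of the two-loop/two-semaphore handoff, not Garnet's actual C# code; the names (`received_data`, `expecting_data`, the buffers) mirror the PR text, and the "decryption" is a stand-in transform.

```python
import threading

received_data = threading.Semaphore(0)   # SAEA loop signals: bytes have arrived
expecting_data = threading.Semaphore(0)  # reader loop signals: ready for more
buffer = []                              # stands in for networkReceiveBuffer
plaintext = []                           # stands in for transportReceiveBuffer
chunks = [b"hello", b"world", None]      # None simulates the peer disconnecting

def saea_loop():
    # IOCP-driven loop: receive raw encrypted bytes and hand them off.
    for chunk in chunks:
        expecting_data.acquire()         # wait until the reader wants data
        buffer.append(chunk)
        received_data.release()          # wake the reader loop

def ssl_reader_loop():
    # Task-based loop: pull bytes from the bridge and "decrypt" them.
    while True:
        expecting_data.release()         # tell the SAEA loop we want data
        received_data.acquire()          # wait for bytes
        chunk = buffer.pop(0)
        if chunk is None:                # disconnect surfaces here
            break
        plaintext.append(chunk.decode().upper())  # stand-in for decryption

t1 = threading.Thread(target=saea_loop)
t2 = threading.Thread(target=ssl_reader_loop)
t1.start(); t2.start()
t1.join(); t2.join()
print(plaintext)  # ['HELLO', 'WORLD']
```

The key property the sketch shows is the strict handshake: each loop blocks on the other's semaphore, which is why an exception in the reader loop must release `expectingData` on the way out, as the disposal flow below describes.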

When a TLS client disconnects, SslReaderAsync catches the resulting exception and calls this.Dispose(), which resolves to the public Dispose() on TcpNetworkHandlerBase. Before this fix, that call closed the socket but never invoked DisposeImpl(). As a result, handler cleanup was entirely deferred to the SAEA backup path:

```
SslReaderAsync catches disconnect exception
  → public Dispose(): socket.Close() / socket.Dispose() (NO DisposeImpl)
  → cancellationTokenSource NOT cancelled (only DisposeImpl does this)
  → handler NOT removed from activeHandlers
  → finally: sets readerStatus=Rest, releases expectingData semaphore
  → SAEA loop wakes up from semaphore
  → tries ReceiveAsync on closed socket → ObjectDisposedException
  → HandleReceiveFailure → Dispose(SAEA) → DisposeImpl() ← cleanup happens here, deferred
```

The SAEA backup path does eventually call DisposeImpl(), but it requires a full roundtrip through the threadpool. Under burst disconnection load (50-100+ simultaneous disconnects during a scale event), this creates a cascading failure:

  1. Each disconnect closes the socket but defers DisposeImpl() to an SAEA threadpool callback
  2. cancellationTokenSource is never cancelled (only DisposeImpl calls Cancel()), so background async tasks on these handlers keep running, consuming threads
  3. The threadpool grows slowly (~1-2 threads/sec) and cannot keep up with the burst
  4. Pending SAEA callbacks queue up, blocking threads waiting on semaphores
  5. Thread count snowballs: 40 → 200 → 1,800
  6. During shutdown, DisposeActiveHandlers() calls public Dispose() on each handler in a while (activeHandlerCount > 0) loop — each call still depends on SAEA roundtrips to decrement the count, but SAEA callbacks can't get threads
  7. Shutdown hangs, connections remain in CLOSE-WAIT indefinitely

Why non-TLS connections are not affected

Non-TLS disconnects go directly through the SAEA BytesTransferred == 0 path, which calls the private Dispose(SAEA) → DisposeImpl() immediately. No deferred roundtrip is needed.

Disposal path analysis

There are four paths that dispose a network handler:

| Path | Entry point | Which `Dispose`? | Had `DisposeImpl()`? |
| --- | --- | --- | --- |
| SAEA disconnect | IOCP callback (`BytesTransferred == 0`) | Private `Dispose(SAEA)` | ✅ Yes — always worked |
| `SslReaderAsync` exception | `NetworkHandler.SslReaderAsync` catch | Public `Dispose()` | ❌ No — the bug |
| Server shutdown | `GarnetServerBase.DisposeActiveHandlers()` | Public `Dispose()` | ❌ No — also broken |
| Start exception | `GarnetServerTcp.HandleNewConnection()` catch | `DisposeResources()` direct | ✅ Yes — always worked |

Fix

The public Dispose() now calls DisposeImpl() synchronously, making handler cleanup immediate with no threadpool dependency:

```
SslReaderAsync catches disconnect exception
  → public Dispose():
      socket.Shutdown(Both)    ← sends TCP FIN (graceful close)
      socket.Close() / Dispose()
      DisposeImpl()            ← THE FIX: immediate cleanup
        → cancellationTokenSource.Cancel()    ← stops background tasks
        → DisposeMessageConsumer()            ← removes from activeHandlers
        → activeHandlerCount--                ← decrements immediately
  → SAEA wakes up later → Dispose(SAEA) → DisposeImpl() → no-op (disposeCount guard)
```

Even 100+ simultaneous disconnects clean up instantly on whatever thread caught the exception. No threadpool roundtrip needed.
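The fixed sequence can be condensed into a small model. This is an illustrative Python sketch, not Garnet's C# implementation; the method names (`dispose`, `dispose_saea`, `dispose_impl`) and the `active_handlers` set mirror the PR description, and a lock-guarded counter stands in for `Interlocked.Increment`.

```python
import threading

class FakeHandler:
    # Hypothetical model of the fixed disposal sequence described above.
    def __init__(self, active_handlers):
        self._dispose_count = 0
        self._lock = threading.Lock()
        self.cancelled = False            # models cancellationTokenSource
        self.socket_closed = False
        self.active_handlers = active_handlers
        active_handlers.add(self)

    def dispose(self):                    # the public Dispose()
        self.socket_closed = True         # socket.Shutdown(Both) + Close()
        self.dispose_impl()               # THE FIX: immediate, synchronous

    def dispose_saea(self):               # the private Dispose(SAEA)
        self.dispose_impl()               # later SAEA wake-up: becomes a no-op

    def dispose_impl(self):
        with self._lock:                  # stands in for Interlocked.Increment
            self._dispose_count += 1
            if self._dispose_count != 1:
                return                    # every caller after the first is a no-op
        self.cancelled = True             # cancellationTokenSource.Cancel()
        self.active_handlers.discard(self)  # DisposeMessageConsumer()

active = set()
h = FakeHandler(active)
h.dispose()        # TLS reader catches disconnect, cleans up immediately
h.dispose_saea()   # SAEA backup fires later: guarded no-op
print(len(active), h.cancelled)  # 0 True
```

The point of the model is that cleanup completes on whatever thread caught the exception, before the SAEA backup ever runs, so the backup path degrades into a harmless no-op.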

Safety

  • Idempotent: DisposeImpl() is guarded by Interlocked.Increment(ref disposeCount) != 1 in NetworkHandler — only the first caller performs the actual cleanup. If both Dispose paths fire concurrently (SAEA and SslReaderAsync racing), the second call is a no-op.
  • Existing pattern: The private Dispose(SAEA) already called DisposeImpl() after socket cleanup. The public Dispose() now mirrors the same sequence.
  • No behavioral change for non-TLS: Non-TLS connections always go through the SAEA path (Path 1), which was already correct.

Tests

TlsClientDisconnectCleansUpHandler

Connects 5 TLS clients via GarnetClient, verifies they register as active handlers, abruptly disposes all clients, then verifies all handlers are cleaned up and activeHandlerCount returns to 0. Also confirms the server still accepts new connections after cleanup.

DisposeCallsDisposeImplWithoutSaeaBackup

Directly tests the bug by isolating the public Dispose() path with no SAEA backup. Uses exception injection (Dispose_After_Handler_Registered_Before_Start) to trigger a failure after the handler is added to activeHandlers but before Start() is called. Since Start() never runs, no SAEA receive loop exists, so the only cleanup path is public Dispose().

  • Without the fix: Handlers leak (public Dispose() doesn't call DisposeImpl()), activeHandlerCount stays elevated, and DisposeActiveHandlers() would hang forever during server shutdown.
  • With the fix: Handlers are properly cleaned up and the server remains functional.

This test uses its own server instance on a separate port to avoid hanging the shared test fixture's TearDown if the bug is present.
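The commit list below mentions switching the tests to polling loops. As a hedged illustration of that pattern (a generic helper, not the actual NetworkTests.cs code), a test can poll for the handler count to drain instead of sleeping a fixed interval:

```python
import time

def wait_until(condition, timeout=5.0, interval=0.05):
    # Poll a condition until it holds or the timeout expires.
    # Avoids both flaky fixed sleeps and indefinite hangs.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one final check at the deadline

# Usage sketch with a stand-in for the server's activeHandlerCount.
state = {"active_handler_count": 3}

def simulate_cleanup():
    state["active_handler_count"] = 0

simulate_cleanup()
print(wait_until(lambda: state["active_handler_count"] == 0))  # True
```

A bounded poll like this fails the test with a clear timeout if the bug is present, rather than hanging TearDown forever.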

Files changed

| File | Change |
| --- | --- |
| `TcpNetworkHandlerBase.cs` | Added `DisposeImpl()` call to public `Dispose()`. Added `Shutdown(Both)` and `Close()` to private `Dispose(SAEA)`. |
| `ExceptionInjectionType.cs` | Added `Dispose_After_Handler_Registered_Before_Start` enum value for test injection. |
| `GarnetServerTcp.cs` | Added exception injection point between handler registration and `Start()` call. |
| `NetworkTests.cs` | Added `TlsClientDisconnectCleansUpHandler` and `DisposeCallsDisposeImplWithoutSaeaBackup` tests. |

Key changes from the previous version:

  1. Removed "leaked permanently" — the SAEA backup path does eventually call DisposeImpl(), it's not a permanent leak in the simple case
  2. Added the real failure mechanism — deferred threadpool roundtrip can't keep up with burst disconnections, causing cascading threadpool starvation
  3. Added cancellationTokenSource not being cancelled as a key contributing factor — background tasks keep running and consuming threads
  4. Added the before/after cleanup flow showing immediate vs deferred cleanup
  5. Explained why non-TLS is unaffected — direct SAEA path, no roundtrip

Copilot AI review requested due to automatic review settings April 11, 2026 20:00
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a TLS disconnect/failure cleanup bug where the public TcpNetworkHandlerBase.Dispose() path did not call DisposeImpl(), causing leaked handlers (stuck in activeHandlers) and sockets lingering in CLOSE-WAIT, and adds targeted debug-only tests plus an exception-injection point to validate the fix.

Changes:

  • Ensure public TcpNetworkHandlerBase.Dispose() invokes DisposeImpl() so handlers are removed from activeHandlers and metrics are updated.
  • Add an exception-injection point between handler registration and Start() to exercise the previously-leaky path.
  • Add DEBUG-only network tests to validate TLS disconnect cleanup and the “no SAEA backup” dispose path.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `test/Garnet.test/NetworkTests.cs` | Adds DEBUG-only regression tests for TLS disconnect cleanup and dispose-without-SAEA cleanup. |
| `libs/server/Servers/GarnetServerTcp.cs` | Adds exception injection trigger after handler registration and before `Start()`. |
| `libs/common/Networking/TcpNetworkHandlerBase.cs` | Calls `DisposeImpl()` from public `Dispose()`; strengthens SAEA dispose to shutdown/close socket before `DisposeImpl()`. |
| `libs/common/ExceptionInjectionType.cs` | Adds a new injection enum for the handler-registered-before-start failure scenario (and reorders one existing member). |


@vazois vazois merged commit 773c0bc into microsoft:main Apr 13, 2026
48 of 49 checks passed
kevinmichaelbowersox added a commit to kevinmichaelbowersox/garnet that referenced this pull request Apr 13, 2026
…oft#1694)

* Improve Cleanup of Network Resources on Error/Disconnect Path

* Test Improvements

* Update test/Garnet.test/NetworkTests.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply PR feedback: Switch to polling loops

---------

Co-authored-by: Kevin Bowersox <kbowersox@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
