Skip to content

fix: connection-lifecycle#601

Merged
javi11 merged 1 commit into
javi11:mainfrom
iPromKnight:feat/connection_handling
May 22, 2026
Merged

fix: connection-lifecycle#601
javi11 merged 1 commit into
javi11:mainfrom
iPromKnight:feat/connection_handling

Conversation

@iPromKnight
Copy link
Copy Markdown
Contributor

Connection-lifecycle storms (S1/S3/S5/S6/S8/S9)
Re-opening of #589 (branch was accidentally deleted after merge to dev).

Scope is connection-lifecycle and observability only. Cross-process budgeting between streams and imports is out of scope — that lives in #595.

Fixes

S1 / S3 — retry.Attempts(5) → Attempts(2) with [50, 150]ms jitter, so a permanently failing segment caps at 2 wire calls and concurrent retriers desync against a recovering provider instead of thundering-herd.
S5 — per-file segment LRU on the ephemeral ReadAt path coalesces Plex/Jellyfin scrub bursts (200 small reads in an 8-segment window → 8 wire calls). Plus an EOF-straddle slice-bound fix that was panicking on partially-filled final segments.
S6 — bounded closer-worker pool absorbs seek-spam (50-seek burst → ~4 goroutines instead of ~50, none pinning pool slots).
S8 — speed-test routes through pool.Manager + a singleflight-dedupe coordinator with a per-provider client cache and janitor sweep. Endpoint can no longer be rate-abused into exhausting a provider's connection budget.
S9 — MetadataVirtualFile.Close fires UsenetReader.Interrupt before contending for mvf.mu, so client disconnects mid-read return in microseconds instead of pinning the goroutine + connection slot for the full segment latency. Big one for Plex/Jellyfin scrubbing.
Test infrastructure

Deterministic fakepool seam (per-message latency / error injection, in-flight counters, gate primitive), segments payload generator, goroutines snapshot helper. Every fix is pinned by a regression test asserting MaxInFlight / BodyCalls / goroutine count directly.

Running result

Combined with #595 in my dev branch — solid through 24h+ playback + import workload on the gluetun/k8s setup that originally surfaced the conntrack storms. Go heap steady at ~26MB, no kernel WireGuard instability.

Clean merge against current main.

@javi11
Copy link
Copy Markdown
Owner

javi11 commented May 22, 2026

After merge the first PR this becomed corrupted, can you fix it @iPromKnight ?

…ness

Reworks the streaming and metadata-virtual-file paths to bound retry,
goroutine, and connection-slot fan-out under realistic Plex/Jellyfin
scrub patterns and provider-recovery storms. Adds a deterministic
fakepool test seam so each invariant is pinned by a regression test
that asserts MaxInFlight / goroutine / call-count bounds directly.

Cross-process budgeting between streams and imports (admission
control) is intentionally NOT in this PR — that work belongs in the
separate admission-controller PR. This change touches only the
connection-lifecycle layer.

S1 + S3 (usenet/usenet_reader.go)
  retry.Attempts(5) -> retry.Attempts(2) so a permanently failing
  segment produces at most 2 wire calls instead of 5. Replace the
  20ms FixedDelay with retry.CombineDelay(BackOffDelay, RandomDelay)
  + retry.MaxJitter(100ms) so concurrent retriers desync into
  [50, 150]ms instead of thundering-herd against a recovering
  provider. ErrArticleNotFound still never retries; DeadlineExceeded
  still retries immediately on a fresh round-robin connection.

S5 + EOF straddle (nzbfilesystem/metadata_remote_file.go)
  Add randomReadCache: a per-file LRU (size 8) of full-segment bytes
  for the ephemeral ReadAt path. 200 small reads within an 8-segment
  hot window now produce 8 wire calls instead of 200. Clamp the slice
  bound to min(end-off+1, len(p)) so reads that straddle EOF in a
  partially-filled final segment cannot panic with "slice bounds out
  of range".

S6 (nzbfilesystem/metadata_remote_file.go)
  Replace closeWg.Go's unbounded goroutine fan-out in
  closeCurrentReader with a per-file bounded closer pool
  (closerWorkerCount = 4). A 50-seek scrub burst produces a goroutine
  delta of ~4 instead of ~50, each holding any in-flight pool slot
  it had open. When the channel fills, the next Seek runs Close
  inline as backpressure.

S8 (api/speedtest_coordinator.go + provider_speedtest_handler.go)
  handleTestProviderSpeed previously called nntppool.NewClient inline
  per HTTP request, opening Provider.MaxConnections fresh slots per
  invocation. Route through pool.Manager.GetPool when the provider is
  active; otherwise through a process-wide speedtestCoordinator with
  singleflight dedupe by providerID and a 5-minute per-provider
  client cache. A janitor goroutine ticks every clientTTL/5 and
  evicts expired entries; shutdown is idempotent.

S9 (nzbfilesystem/metadata_remote_file.go + usenet/usenet_reader.go)
  When a streaming HTTP client disconnects mid-read, the request
  goroutine calls mvf.Close synchronously; a concurrent mvf.Read
  holds mvf.mu for the full segment-download latency (15s × retries).
  Add UsenetReader.Interrupt (non-blocking ctx-cancel + cond
  broadcast + segment closer) and an atomic interruptHandle on
  MetadataVirtualFile so Close fires the interrupt BEFORE contending
  for mvf.mu. Close now returns in microseconds regardless of segment
  latency. Also move streamTracker.Remove + streamID write inside
  mvf.mu's critical section to satisfy the race detector.

Test infrastructure
  Narrow pool.NntpClient interface so tests can substitute a
  deterministic fake without standing up real NNTP connections.
  internal/testsupport/fakepool exposes per-message-ID latency / error
  injection and an in-flight high-water counter (MaxInFlight) used as
  the primary observability primitive for the invariants. Companion
  helpers: testsupport/segments (deterministic payload / message-ID
  generation) and testsupport/goroutines (snapshot +
  AssertReturnedToBaseline for closer-accumulation scenarios).

Pinned by the new storm tests:
  - TestStorm_RetryAmplifiesPerMessageCallCount (S1)
  - TestStorm_RetryUsesFixedDelayInsteadOfExponentialBackoff (S3)
  - TestStorm_RandomReadAtCreatesEphemeralReaderPerCall (S5)
  - TestRandomReadCache_EOFReadDoesNotPanic
  - TestStorm_SeekSpamAccumulatesCloserGoroutines (S6)
  - TestStorm_SpeedTestBypassesPoolManager (S8)
  - TestStorm_ClientDisconnectHoldsPoolSlotForUpTo30s (S9)
  - TestStorm_ConcurrentDisconnectsPinManyGoroutines (S9)

go test -race -count=1 clean across the touched packages.
@iPromKnight iPromKnight force-pushed the feat/connection_handling branch from 0b9b64f to c8899e4 Compare May 22, 2026 20:14
@iPromKnight
Copy link
Copy Markdown
Contributor Author

After merge the first PR this becomed corrupted, can you fix it @iPromKnight ?

yep - rebased on main :)

@javi11 javi11 changed the title fix: connection-lifecycle storms (S1/S3/S5/S6/S8/S9) fix: connection-lifecycle May 22, 2026
@javi11 javi11 merged commit 4d05686 into javi11:main May 22, 2026
2 checks passed
@iPromKnight iPromKnight deleted the feat/connection_handling branch May 22, 2026 20:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants