Skip to content

fix: stop FDv2DataSource.Conditions from leaking on healthy primary#163

Open
kinyoklion wants to merge 4 commits into
mainfrom
rlamb/fdv2-conditions-leak-fix
Open

fix: stop FDv2DataSource.Conditions from leaking on healthy primary#163
kinyoklion wants to merge 4 commits into
mainfrom
rlamb/fdv2-conditions-leak-fix

Conversation

@kinyoklion
Copy link
Copy Markdown
Member

@kinyoklion kinyoklion commented May 15, 2026

Summary

FDv2DataSource.Conditions.getFuture() returned the same shared CompletableFuture<Object> instance to every caller. The run loop does CompletableFuture.anyOf(getFuture(), synchronizer.next()).get() on every iteration, which attaches a new OrRelay Completion to the shared future's stack each time. CompletableFuture has no deregister path for the loser of an anyOf race, so those Completion nodes stay on the stack until the shared future itself completes.

On a healthy primary streaming ChangeSets without ever firing fallback/recovery, the shared future never completes — the stack grows monotonically for the synchronizer's entire tenure (effectively the SDK's uptime on a stable server).

Per-iteration cost: ~200 B (OrRelay + anyOf result CF + chain references).
At 10 ChangeSets/sec sustained: ~150 MB/day per active synchronizer.

The fix

A single permanent whenComplete listener on the underlying aggregate fans out completion to every fresh future handed out by getFuture(). Pending fresh futures are tracked via WeakReference, so a fresh future whose only strong references were the caller's local variables (typical lifetime: one loop iteration) becomes garbage-collectable once that iteration ends. Pending entries whose referent has been collected are pruned opportunistically on each getFuture() call and on close().

Conditions is now package-private (was private) so direct unit tests can reach it. A test-only pendingSize() helper is added.

Test plan

Adds FDv2DataSourceConditionsAggregateTest with five tests:

  • getFutureReturnsDistinctInstancesPerCall — bug-prover. Fails on the pre-fix shared-instance behavior, passes after the fix.
  • getFutureReturnsDistinctInstancesEvenWithNoConditions — bug-prover. Covers the empty-conditions case (single-synchronizer configuration), which is exactly where per-iteration accumulation would be most damaging.
  • allFreshFuturesCompleteWhenAggregateFires — verifies fan-out via the single permanent listener actually delivers to multiple fresh futures handed out before the aggregate fires.
  • getFutureAfterAggregateFiresReturnsCompletedFuture — verifies the fast path: callers arriving after completion get an already-completed future synchronously.
  • pendingListDoesNotGrowUnboundedlyWhenFreshFuturesAreDropped — 10k-iteration soak test that simulates the run-loop pattern (race a fresh future against a fast-resolving sibling, drop the result) and asserts the pending list stays bounded via GC + opportunistic pruning. Caveat in the test docstring about System.gc() not being guaranteed — if it ever flakes on CI we can migrate to -XX:+UseSerialGC or relax the ceiling.

Verified bug-proving discipline: the two distinctness tests fail on the pre-fix shared-instance behavior and pass after the fix. The full server-sdk test suite (1857 tests across 109 classes) is clean.

Context

This was identified during a multi-agent review of the analogous cpp-sdks PR (launchdarkly/cpp-sdks#531), which mirrors this Java implementation's Conditions design. The cpp version has the same structural leak; this Java fix shape is what was prototyped there. Filing here first since the runtime impact on a long-running JVM-based server SDK is more pronounced.


Note

Medium Risk
Touches the FDv2 synchronizer condition-aggregation logic used in the main run loop; mistakes could cause missed fallback/recovery signals or incorrect exceptional completion behavior, though changes are localized and covered by new unit tests.

Overview
Prevents a long-lived memory leak in FDv2DataSource.Conditions by changing getFuture() to return a fresh CompletableFuture per call until the underlying condition aggregate completes, rather than returning the same shared pending future each iteration.

Adds a single whenComplete fan-out from the aggregate to complete all outstanding per-call futures (and to propagate exceptional completion), tracks pending futures in a WeakHashMap-backed set for GC cleanup, and makes Conditions package-private to allow direct testing.

Introduces FDv2DataSourceConditionsAggregateTest to assert per-call distinctness, correct completion fan-out, and correct behavior on exceptional and post-completion paths.

Reviewed by Cursor Bugbot for commit a6c7c36. Bugbot is set up for automated code reviews on this repo. Configure here.

FDv2DataSource's run loop calls
CompletableFuture.anyOf(conditions.getFuture(), synchronizer.next()).get()
on every iteration. Before this change, getFuture() returned the same
shared CompletableFuture<Object> instance to every caller. Each anyOf
call attaches an OrRelay Completion node to the shared instance's
stack; CompletableFuture has no deregister path for the loser of a
race, so the OrRelay stays on the stack until the shared future
completes.

The shared future only completes when fallback or recovery fires. On a
healthy primary streaming ChangeSets, fallback is never armed and
recovery is suppressed (only-available-synchronizer / single-prime
configurations). The future never completes; the stack grows
monotonically for the synchronizer's full tenure -- effectively the
SDK's uptime on a healthy server.

Per-iteration cost ~200 B: an OrRelay Completion plus the anyOf result
CompletableFuture plus the chain references back to the inputs. At 10
ChangeSets/sec sustained that is ~150 MB/day per active synchronizer.

The fix: a single permanent whenComplete listener on the underlying
aggregate fans out completion to every fresh future handed out by
getFuture(). Pending fresh futures are tracked via WeakReference, so a
fresh future whose only strong references were the caller's local
variables (typical lifetime: one loop iteration) becomes
garbage-collectable once that iteration ends. Pending entries whose
referent has been collected are pruned opportunistically on each
getFuture() call and on close().

Conditions is now package-private rather than private so the new
direct unit tests can reach it. Adds a test-only pendingSize() helper.

Verified bug-proving discipline: two of the new tests
(getFutureReturnsDistinctInstancesPerCall,
getFutureReturnsDistinctInstancesEvenWithNoConditions) fail on the
pre-fix shared-instance behavior and pass after the fix. Full
server-sdk test suite (1857 tests) is clean.
@kinyoklion kinyoklion requested a review from a team as a code owner May 15, 2026 21:07
The pendingListDoesNotGrowUnboundedlyWhenFreshFuturesAreDropped test
needed System.gc() + Thread.sleep to encourage reclamation, which is
brittle. The two distinctness tests are sufficient bug-provers for the
shared-instance behavior they fail on, so drop the soak test and the
test-only pendingSize() helper that supported it.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 536176e. Configure here.

Comment thread lib/sdk/server/src/main/java/com/launchdarkly/sdk/server/FDv2DataSource.java Outdated
@kinyoklion kinyoklion marked this pull request as draft May 15, 2026 21:23
Bugbot pointed out that completedValue (volatile Object) used null as a
sentinel for "not yet fired", and the whenComplete listener also stored
null on exceptional completion. After exceptional completion a
subsequent getFuture() entered the lock, saw pending == null (drained
by the listener), and returned CompletableFuture.completedFuture(null)
-- silently converting the exception into a null result. The run loop
would then NPE on res.getClass().

Replace the null-sentinel pattern with an explicit volatile boolean
isFired plus separate firedResult/firedThrowable fields. A new
makeCompletedFuture() helper builds a fresh completed future mirroring
whichever terminal state the aggregate reached.

Adds a bug-proving test (getFutureFailsExceptionallyWhenAggregateFails-
Exceptionally) that drives a manually-controlled condition to
completeExceptionally and asserts both pre- and post-firing getFuture()
results throw ExecutionException with the original cause. Verified the
test fails on the pre-fix null-sentinel behavior ("expected
ExecutionException, got normal completion").
@kinyoklion
Copy link
Copy Markdown
Member Author

@kinyoklion kinyoklion marked this pull request as ready for review May 15, 2026 21:43
Copy link
Copy Markdown

@beekld beekld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this solution to the problem, but have a few comments about how it might be simplified / improved.

(Unfortunately, this will be more complicated in C++, since weak references aren't really built into the language.)

* true. Written under {@code lock}; readable without the lock once
* {@code isFired} has been observed true (volatile happens-before).
*/
private Object firedResult;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need firedResult and firedThrowable? It seems like if isFired = true, then we can read aggregate.getNow(null) to get the result/throwable. It's safe to call that without a lock if the future is complete, and it means we don't have to worry about getting the order of setting them correct.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, wouldn't it even be safe for makeCompletedFuture to just return aggregate if isFired == true? It won't grow unbounded, because whatever continuations get added to it will be fired and removed immediately.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a6c7c36. Returning aggregate directly when aggregate.isDone() == true -- continuations registered on an already-completed CompletableFuture fire synchronously at registration and cleanStack removes them immediately, so the original per-iteration accumulation can't re-occur in the post-completion path. Dropped isFired, firedResult, firedThrowable, and makeCompletedFuture().

* (and all pending entries have been drained). Mutated only under
* {@code lock}.
*/
private List<WeakReference<CompletableFuture<Object>>> pending = new ArrayList<>();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's an improvement to store an empty WeakReference<CompletableFuture<Object>> instead of an unresolved CompletableFuture<Object>, but won't this list still grow unbounded? Maybe we should use a WeakHashMap instead, so the collection itself gets automatically pruned?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see if that would work better.

Right now it doesn't grow unbounded because it prunes the old entries when its adding the new entry.

             while (it.hasNext()) {
                    if (it.next().get() == null) {
                        it.remove();
                    }
                }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, missed that. Either way then.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to a WeakHashMap-backed Set in a6c7c36 -- entries auto-prune when GC reclaims their keys, no more manual loop. Listener snapshots new ArrayList<>(pending) under the lock so entries surviving GC to that point get strong refs and are completed.

Two simplifications suggested by @beekld:

1. Return aggregate directly when it has already completed. Continuations
   registered on an already-completed CompletableFuture fire synchronously
   at registration time and are removed from the stack immediately by
   cleanStack, so the original leak (per-iteration anyOf accumulation on a
   never-completing aggregate) cannot re-occur in the post-completion
   path. Drops the isFired flag, firedResult/firedThrowable fields, and
   the makeCompletedFuture helper.

2. Replace the WeakReference list with a WeakHashMap-backed Set. Fresh
   pending futures get GC'd automatically when the caller's loop iteration
   drops its strong reference, with no manual prune loop in getFuture().

All five aggregate tests still pass; full server-sdk suite (1857 tests)
still passes. Verified bug-proving discipline: temporarily reverting
getFuture() to the pre-fix shared-instance behavior makes the two
distinctness tests + the exceptional-path test go red, exactly as before.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants