fix: shutdown hook deadlock under leader election and deprecate Operator#installShutdownHook(Duration)#3383
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Improves operator shutdown behavior in leader-election scenarios by always installing a JVM shutdown hook and preventing System.exit(1) from being invoked during graceful shutdown, with a regression test covering the deadlock scenario.
Changes:
- Always register a JVM shutdown hook in
Operator, and deprecate the legacy overload that accepted a (now ignored) timeout. - Add a graceful-shutdown guard in
LeaderElectionManager.stopLeading()to skipSystem.exit(1)when shutdown is already in progress. - Add a regression test ensuring
stopLeading()doesn’t terminate the JVM whenstop()was called first.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| operator-framework-core/src/test/java/io/javaoperatorsdk/operator/LeaderElectionManagerTest.java | Adds a regression test for shutdown-hook/leader-election interaction to avoid JVM termination. |
| operator-framework-core/src/main/java/io/javaoperatorsdk/operator/Operator.java | Changes shutdown hook installation behavior and improves shutdown documentation; deprecates old overload. |
| operator-framework-core/src/main/java/io/javaoperatorsdk/operator/LeaderElectionManager.java | Adds graceful-shutdown coordination to avoid System.exit during JVM shutdown; minor permission-check change. |
| @Test | ||
| void stopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirst() { | ||
| // When stop() is called before the onStopLeading callback fires (which is what happens when | ||
| // stop()'s future cancellation triggers the callback), stopLeading() must skip | ||
| // System.exit(1). Otherwise calling stop() from inside a JVM shutdown hook deadlocks against | ||
| // the java.lang.Shutdown class lock. If this regression is ever reintroduced, this test | ||
| // method would terminate the JUnit JVM via System.exit(1) instead of failing cleanly. | ||
| final var leaderElectionManager = leaderElectionManager(null); | ||
| leaderElectionManager.stop(); | ||
| leaderElectionManager.stopLeading(); | ||
| } |
| * <p>Internally this class wraps a Fabric8 {@link LeaderElector} that coordinates via a Kubernetes | ||
| * {@code Lease} resource (group {@value #COORDINATION_GROUP}, resource {@value #LEASES_RESOURCE}). | ||
| * When this pod acquires the lease, {@link #startLeading()} starts event processing on the | ||
| * controller manager. When the lease is lost or the leader-election future is cancelled, {@link | ||
| * #stopLeading()} is invoked. |
There was a problem hiding this comment.
I don't think we need that level of information, especially because this touches the internal implementation, which might change over time, and addresses concepts that have not been defined in this scope.
There was a problem hiding this comment.
As part of the last commit I removed the JavaDoc for this class and also the ones made as part of the installShutdownHook methods. Please let me know if you want to stick to the current installShutdownHook method JavaDocs.
There was a problem hiding this comment.
I didn't mean to remove all the javadoc 😉
I just thought that the internal details were just a bit much but the overall description of the purpose and behavior was OK.
There was a problem hiding this comment.
Got it, thanks for the clarification! I've restored the class-level JavaDoc on LeaderElectionManager along with the JavaDocs on Operator#stop() and Operator#installShutdownHook(). The paragraph that I read as the "internal details" bit (the one naming the Fabric8 LeaderElector and the Lease group/resource constants) is the only piece I kept out. Let me know if that's the right trim or if there's more to pull out.
c2c90e4 to
2ac010f
Compare
|
|
In the lost-lead scenario we will have this flow:
So, If you to increase the test coverage for this I can add a second regression test that simulates the lost-lead-then-shutdown sequence on top of the existing one. |
…tor#installShutdownHook(Duration) Signed-off-by: Dennis-Mircea Ciupitu <dennis.mircea.ciupitu@gmail.com>
Signed-off-by: Dennis-Mircea Ciupitu <dennis.mircea.ciupitu@gmail.com>
2ac010f to
b135032
Compare
Summary
Resolves all three concerns raised in #3376:
Operator#installShutdownHook(...)was previously skipped whenever leader election was enabled (introduced in fix: leader election stop deadlock #1618 to work around the deadlock reported in LeaderElectionManager#stopLeading maybe deadlock with Operator#installShutdownHook #1614). As a result, a leader pod receivingSIGTERMdid not release its lease, forcing standby replicas to wait for lease expiry before they could take over.gracefulShutdownTimeoutargument onOperator#installShutdownHook(Duration)has been ignored since PR feat: support for graceful shutdown based on configuration #2479. The reconciliation termination timeout is now read fromConfigurationService#reconciliationTerminationTimeout().Durationparameter was not documented anywhere; neither were theOperator#stop()shutdown sequence and theLeaderElectionManagerlifecycle.What changed
Fix the underlying deadlock (
LeaderElectionManager)The deadlock from #1614 was very specific:
Operator#stop()called from inside a JVM shutdown hook would cancel the leader-election future, which fired theonStopLeadingcallback, which calledSystem.exit(1). ThatSystem.exitthen blocked indefinitely on thejava.lang.Shutdownclass lock that the shutdown hook thread itself was already holding.This PR breaks the recursion.
LeaderElectionManagernow carries astoppingGracefullyAtomicBoolean, set instop()before the future is cancelled and checked at the top ofstopLeading(). When the flag is set,stopLeading()returns immediately instead of invokingSystem.exit. The "restart on lost lead" behavior (when the leader-election library detects a real lost lease without a priorstop()) is preserved because that path runs without the flag ever being set.Re-enable the shutdown hook unconditionally (
Operator)With the deadlock fixed at its source, the conditional in
Operator#installShutdownHook()is no longer required. The hook is now registered regardless of leader-election state, and the"Leader election is on, shutdown hook will not be installed."warn log line is removed. A leader pod receivingSIGTERMwill run the hook, which callsOperator#stop(), which now cleanly cancels the leader-election future and releases the lease.Deprecate
installShutdownHook(Duration)(Operator)The
Durationargument has been dead since #2479. This PR adds a new no-arginstallShutdownHook()overload (the recommended replacement) whose JavaDoc points users atConfigurationServiceOverrider#withReconciliationTerminationTimeout(Duration)as the real configuration knob. The existinginstallShutdownHook(Duration)is marked@Deprecated(forRemoval = true)and delegates to the no-arg overload. TheDurationparameter is kept only for source and binary compatibility.Documentation additions
LeaderElectionManagerexplaining its role, configuration entry points, theLease-based coordination, the three-way behavior ofstopLeading(), and the lifecycle ownership byOperator.Operator#stop()documenting the four-step shutdown sequence (controller manager, executor service manager with timeout, leader-election manager, optional client close), an explicit "safe to call from a JVM shutdown hook" guarantee, the not-started edge-case behavior, and thecloseClientOnStop()opt-out (defaulttrue).Operator#installShutdownHook()explaining the timeout configuration knob and adding aterminationGracePeriodSecondsnote that recommends sizing the pod's grace period to fit the configuredreconciliationTerminationTimeoutplus a small buffer.Regression test
LeaderElectionManagerTest#stopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirstcallsstop()and thenstopLeading()directly. If the graceful-shutdown short-circuit is ever reintroduced asSystem.exit(1), this test method would terminate the JUnit JVM rather than failing cleanly, making the regression impossible to miss in CI.LeaderElectionManager#stopLeadingwas lowered fromprivatetoprotectedto enable the test (and to allow subclasses to extend the behavior).Behavior summary by scenario
SIGTERMstop()runs, graceful shutdownSIGTERMstop()runs, lease released cleanlySIGTERMstop()is effectively a no-op for leader-election state (no lease to release)onStopLeadingfires without priorstop())System.exit(1)triggered, JVM restartsinstallShutdownHook(Duration)called by userDurationsilently ignoredDurationsilently ignored, plus deprecation warning at compile timeTest plan
mvn -pl operator-framework-core testpasses, including the newstopLeadingDoesNotInvokeSystemExitWhenStopWasCalledFirstregression.installShutdownHook(Duration).