[FLINK-30068][runtime] Add configurable commit failure strategy for Sink V2 #4

Closed
nateab wants to merge 1 commit into master from
fix/FLINK-30068-configurable-commit-failure-strategy

@nateab nateab commented Feb 11, 2026

Summary

  • Adds CommitFailureStrategy enum (FAIL/WARN) and config option sink.committer.failure-strategy
  • When set to WARN, signalFailedWithUnknownReason() logs a warning and skips the committable instead of throwing, allowing recovery from expired transactions
  • Default is FAIL, preserving existing behavior exactly
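
The behavior summarized above can be sketched in plain Java. This is a simplified illustration, not the PR's actual code: `SketchCommitRequest` is a hypothetical stand-in for `CommitRequestImpl`, the `isSkipped()` accessor and the log message wording are assumptions, and only the enum values FAIL/WARN and the method name `signalFailedWithUnknownReason` come from the description.

```java
import java.util.logging.Logger;

/** Mirrors the two values named in the PR summary. */
enum CommitFailureStrategy {
    FAIL, // throw on unknown commit failures (the default)
    WARN  // log a warning and skip the committable
}

/** Hypothetical, simplified stand-in for CommitRequestImpl; not Flink's class. */
final class SketchCommitRequest<T> {
    private static final Logger LOG =
            Logger.getLogger(SketchCommitRequest.class.getName());

    private final T committable;
    private final CommitFailureStrategy strategy;
    private boolean skipped; // assumed bookkeeping, for illustration only

    /** Without an explicit strategy, fall back to FAIL. */
    SketchCommitRequest(T committable) {
        this(committable, CommitFailureStrategy.FAIL);
    }

    SketchCommitRequest(T committable, CommitFailureStrategy strategy) {
        this.committable = committable;
        this.strategy = strategy;
    }

    /** Under WARN, log and skip; under FAIL, throw. */
    void signalFailedWithUnknownReason(Throwable cause) {
        if (strategy == CommitFailureStrategy.WARN) {
            LOG.warning("Skipping committable " + committable
                    + " after unknown commit failure: " + cause);
            skipped = true;
            return;
        }
        throw new IllegalStateException("Failed to commit " + committable, cause);
    }

    boolean isSkipped() {
        return skipped;
    }
}

final class CommitFailureStrategySketchDemo {
    public static void main(String[] args) {
        SketchCommitRequest<String> request =
                new SketchCommitRequest<>("txn-1", CommitFailureStrategy.WARN);
        request.signalFailedWithUnknownReason(new RuntimeException("transaction expired"));
        System.out.println("skipped = " + request.isSkipped());
    }
}
```

The trade-off is that WARN drops the committable: data belonging to the skipped transaction is not committed, which is presumably why FAIL stays the default.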

Motivation

When a Flink job recovers from a checkpoint/savepoint, the CommitterOperator replays uncommitted transactions. If a Kafka transaction has expired or the producer ID mapping is lost (e.g., InvalidPidMappingException), the commit fails and the job enters an infinite restart loop with no way to recover -- even from earlier savepoints.

The CommitRequestImpl code already had TODO comments noting: "let the user configure a strategy for failing and apply it here". This PR implements that configurability.

Changes

| File | Change |
| --- | --- |
| CommitFailureStrategy.java (new) | @PublicEvolving enum with FAIL and WARN |
| SinkOptions.java | New COMMITTER_FAILURE_STRATEGY config option |
| CommitRequestImpl.java | Strategy field; signalFailedWithUnknownReason respects WARN |
| CheckpointCommittableManager.java | Overloaded commit() with strategy param (default delegates with FAIL) |
| CheckpointCommittableManagerImpl.java | Sets strategy on requests before committing |
| CommitterOperator.java | Reads config, passes strategy to commit calls |
| GlobalCommitterOperator.java | Same pattern as CommitterOperator |
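
As a rough sketch of how the pieces listed above might fit together, the following plain-Java example reads the option from a map and passes the result to an overloaded `commit()`. Everything except the key `sink.committer.failure-strategy` and the FAIL/WARN values is an assumption: `StrategyConfigSketch`, `SketchCommittableManager`, and `RecordingManager` are hypothetical names, the real `commit()` takes additional parameters, and the real operator reads the option through Flink's configuration classes rather than a `Map`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

enum CommitFailureStrategy { FAIL, WARN }

/** Hypothetical helper: resolves the strategy from raw key/value config. */
final class StrategyConfigSketch {
    static final String KEY = "sink.committer.failure-strategy";

    static CommitFailureStrategy fromConfig(Map<String, String> conf) {
        String raw = conf.get(KEY);
        if (raw == null) {
            return CommitFailureStrategy.FAIL; // default preserves existing behavior
        }
        return CommitFailureStrategy.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }
}

/** Hypothetical, simplified manager interface; the real commit() has more parameters. */
interface SketchCommittableManager {
    void commit(CommitFailureStrategy strategy);

    /** One reading of "default delegates with FAIL": the old entry point keeps FAIL. */
    default void commit() {
        commit(CommitFailureStrategy.FAIL);
    }
}

/** Records which strategy each commit call received. */
final class RecordingManager implements SketchCommittableManager {
    final List<CommitFailureStrategy> seen = new ArrayList<>();

    @Override
    public void commit(CommitFailureStrategy strategy) {
        seen.add(strategy);
    }
}

final class StrategyConfigSketchDemo {
    public static void main(String[] args) {
        RecordingManager manager = new RecordingManager();
        // The operator would resolve the strategy once and pass it on each commit call.
        manager.commit(StrategyConfigSketch.fromConfig(Map.of(StrategyConfigSketch.KEY, "warn")));
        manager.commit(); // legacy call path
        System.out.println(manager.seen);
    }
}
```

Keeping the no-argument overload delegating with FAIL means existing callers compile and behave as before, which matches the PR's stated goal of leaving the default behavior unchanged.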

Test plan

  • CommitRequestImplTest - unit tests for FAIL/WARN/default strategy behavior
  • CheckpointCommittableManagerImplTest - WARN skips failures, FAIL throws
  • SinkV2CommitterOperatorTest - operator-level WARN/FAIL tests + recovery scenario
  • All 27 tests (new and existing) pass

@nateab nateab force-pushed the fix/FLINK-30068-configurable-commit-failure-strategy branch 2 times, most recently from 869721d to c405937 Compare February 11, 2026 11:16
…ink V2

When recovering from a checkpoint/savepoint, the CommitterOperator replays
uncommitted transactions. If a transaction has expired or the producer ID
mapping is lost, the commit fails with signalFailedWithUnknownReason() which
throws unconditionally, causing an infinite restart loop with no recovery path.

This adds a CommitFailureStrategy enum (FAIL/WARN) and a new config option
sink.committer.failure-strategy that controls whether unknown commit failures
throw (default, preserving current behavior) or log a warning and skip the
committable, allowing recovery to proceed.
@nateab nateab force-pushed the fix/FLINK-30068-configurable-commit-failure-strategy branch from c405937 to f582a30 Compare February 11, 2026 11:19
@nateab nateab closed this Feb 11, 2026
nateab added a commit that referenced this pull request Feb 14, 2026
Fixes applied:
- #2: Nexus API uses profile-specific endpoints, correct XML payloads
- #3: JIRA auth added to all API calls including reads
- #4: Dropped FLIP framing, reframed as dev@ discussion / normal PR
- #5: URL encoding uses stdin pipe (fixes single-quote breakage in JQL)
- apache#6: Uses perl -pi -e instead of sed -i (macOS portability)
- apache#7: Added set -o pipefail to all scripts
- apache#8: Glob expansion handled directly, not through run_cmd wrapper
- apache#9: require_var moved inside subcommands, not at top level
- apache#11: Removed shared releasing_utils.sh, each script is self-contained
- apache#13: Removed check-pypi-space (PyPI API doesn't expose quotas)
- apache#15: Backtick command substitution matching existing conventions
- apache#17: Added set -o xtrace for auditability (with noted exceptions)