@m-nagarajan
Contributor

Problem Statement

rollForwardToFutureVersion in the parent controller adds a message to the admin channel and, once the child controllers have consumed it, truncates the parent Kafka topic and succeeds. The child controllers consume this message and roll forward to the future version, if present, without checking whether the future version's partitions have enough ready-to-serve replicas. This can cause read failures if a rebalance is in progress or a host goes down around that time.

Solution

The issue can arise in two ways:

  1. Rebalances during cluster expansion or host swap: Helix handles the rebalancing (3-4-3), and this PR makes the state transition of a completed future version follow a path similar to that of the current version, waiting until the replica is ready to serve before marking it as STANDBY.
  2. One or more hosts suddenly going down, leaving one or more partitions with fewer than the minimum active replicas: this PR makes the child controllers check the readiness of all partitions of the future version before rolling forward. If one or more regions fail the check, roll forward fails in those regions; the parent controller throws an exception with the details and still truncates the parent topic so that new pushes are not blocked. The message in the admin channel remains valid and is retried until it succeeds, once enough ready-to-serve replicas are available.
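
For illustration only, the child-side readiness check in item 2 could look roughly like the sketch below. It is not the code in this PR: the class name, method names, and the map of ready-replica counts are hypothetical, and it assumes a partition with no entry in the map has zero ready-to-serve replicas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Venice codebase.
final class FutureVersionReadiness {
  private FutureVersionReadiness() {}

  /**
   * Returns the ids of partitions whose ready-to-serve replica count is below
   * minActiveReplicas. Partitions missing from the map count as zero replicas.
   */
  static List<Integer> underReplicatedPartitions(
      Map<Integer, Integer> readyReplicasByPartition, int partitionCount, int minActiveReplicas) {
    List<Integer> failing = new ArrayList<>();
    for (int partition = 0; partition < partitionCount; partition++) {
      int ready = readyReplicasByPartition.getOrDefault(partition, 0);
      if (ready < minActiveReplicas) {
        failing.add(partition);
      }
    }
    return failing;
  }

  /**
   * Runs the roll-forward action only if every partition passes the check;
   * otherwise throws so the caller can report the failure and retry later.
   */
  static void rollForwardIfReady(
      Map<Integer, Integer> readyReplicasByPartition,
      int partitionCount,
      int minActiveReplicas,
      Runnable rollForward) {
    List<Integer> failing =
        underReplicatedPartitions(readyReplicasByPartition, partitionCount, minActiveReplicas);
    if (!failing.isEmpty()) {
      throw new IllegalStateException(
          "Future version is not ready to serve; under-replicated partitions: " + failing);
    }
    rollForward.run();
  }
}
```

Throwing instead of returning a boolean mirrors the behavior described above, where a failed check surfaces as an exception and the admin message is retried.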

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.
    Roll forward to future version can fail by throwing an exception and will be retried automatically.

Contributor Author

@m-nagarajan m-nagarajan left a comment

Thanks @misyel and @majisourav99 for the review. Addressed some comments and left some replies.

@m-nagarajan m-nagarajan requested a review from majisourav99 May 1, 2025 01:24
Contributor

@majisourav99 majisourav99 left a comment

LGTM! Thanks Manoj for fixing the issues in both server and controller.

@m-nagarajan m-nagarajan merged commit ca9880c into linkedin:main May 1, 2025
58 checks passed
@m-nagarajan
Contributor Author

Thanks @majisourav99 and @misyel for the reviews.

@minhmo1620 minhmo1620 mentioned this pull request May 1, 2025
14 tasks
WhitneyDeng pushed a commit to WhitneyDeng/venice that referenced this pull request May 16, 2025
…y to serve replicas (linkedin#1741)

Child controllers currently process the rollForwardToFutureVersion message from the admin channel and, if a future version exists, promote it without confirming that each partition has the required ready-to-serve replicas. If a rebalance is in progress or a host goes down, this can cause read failures.

This PR adds a readiness check: before rolling forward, child controllers now verify that all partitions of the future version meet the minActiveReplicas threshold. If any region fails, the roll-forward is aborted in that region, the parent controller throws a detailed exception, and still truncates the parent topic so new pushes are not blocked. The admin message stays in the channel and will automatically retry until every partition is healthy and roll forward succeeds.
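
As a rough sketch of the parent-side flow described above (not the actual implementation), the parent could collect per-region failures, truncate the parent topic regardless of the outcome, and only then throw. The interface, class, and method names are hypothetical, and truncating before throwing is an assumption about ordering.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Venice codebase.
final class ParentRollForwardSketch {
  private ParentRollForwardSketch() {}

  /** Hypothetical per-region call; throws if that region refuses to roll forward. */
  interface RegionRollForward {
    void rollForward(String region) throws Exception;
  }

  /**
   * Attempts roll forward in every region and records failures per region.
   * The parent topic is truncated whether or not any region failed, so that
   * new pushes are not blocked; failures are then reported in one exception.
   */
  static void rollForwardAllRegions(
      Iterable<String> regions, RegionRollForward action, Runnable truncateParentTopic) {
    Map<String, String> failures = new LinkedHashMap<>();
    try {
      for (String region : regions) {
        try {
          action.rollForward(region);
        } catch (Exception e) {
          failures.put(region, String.valueOf(e.getMessage()));
        }
      }
    } finally {
      // Assumption: truncation happens before the failure is surfaced.
      truncateParentTopic.run();
    }
    if (!failures.isEmpty()) {
      throw new IllegalStateException("Roll forward failed in regions: " + failures);
    }
  }
}
```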