[controller] Fail roll forward if not all partitions have enough ready to serve replicas #1741

m-nagarajan · 2025-04-29T05:44:53Z

Problem Statement

rollForwardToFutureVersion in the parent controller adds a message to the admin channel; once the child controllers have consumed it, the parent truncates the parent Kafka topic and reports success. The child controllers consume this message and roll forward to the future version, if one is present, without checking whether the future version's partitions have enough ready-to-serve replicas. This can result in read failures if a rebalance is in progress or a host goes down around that time.

Solution

The issue can happen in two ways:

Rebalances during cluster expansion or host swap: Helix handles the rebalancing (3-4-3), and this PR makes the state transition of a completed future version follow a path similar to the current version's, waiting until the replica is ready to serve before marking it as STANDBY.
One or more hosts suddenly going down, leaving one or more partitions with fewer than min active replicas: this PR makes the child controllers check the readiness of all partitions of the future version before rolling forward. If one or more regions fail the check, roll forward fails in those regions, and the parent controller throws an exception with the details and still truncates the parent topic so that new pushes are not blocked. The message in the admin channel remains valid and is retried until it succeeds, once enough ready-to-serve replicas are available.
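
For illustration only, the child-side readiness check described above could be sketched roughly as follows. This is not the code in VeniceHelixAdmin: the class name, method names, and the map of ready-to-serve replica counts per partition are all hypothetical, and the sketch assumes a partition with no entry in the map has zero ready replicas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and signatures are assumptions, not the Venice API.
class FutureVersionReadinessSketch {
  /**
   * Returns the ids of partitions whose ready-to-serve replica count is below
   * minActiveReplicas. A partition missing from the map is treated as having 0.
   */
  static List<Integer> findUnderReplicatedPartitions(
      Map<Integer, Integer> readyReplicasByPartition,
      int partitionCount,
      int minActiveReplicas) {
    List<Integer> underReplicated = new ArrayList<>();
    for (int partition = 0; partition < partitionCount; partition++) {
      int ready = readyReplicasByPartition.getOrDefault(partition, 0);
      if (ready < minActiveReplicas) {
        underReplicated.add(partition);
      }
    }
    return underReplicated;
  }

  /** Throws if any partition of the future version is not ready, so roll forward fails. */
  static void checkReadyForRollForward(
      String storeName,
      int futureVersion,
      Map<Integer, Integer> readyReplicasByPartition,
      int partitionCount,
      int minActiveReplicas) {
    List<Integer> underReplicated =
        findUnderReplicatedPartitions(readyReplicasByPartition, partitionCount, minActiveReplicas);
    if (!underReplicated.isEmpty()) {
      throw new IllegalStateException(
          "Cannot roll forward store " + storeName + " to version " + futureVersion
              + ": partitions " + underReplicated + " have fewer than "
              + minActiveReplicas + " ready-to-serve replicas");
    }
  }

  public static void main(String[] args) {
    // Partition 1 has one ready replica and partition 2 has none reported.
    Map<Integer, Integer> ready = Map.of(0, 3, 1, 1);
    System.out.println(findUnderReplicatedPartitions(ready, 3, 2));
  }
}
```

Failing the whole roll forward when any single partition is short keeps the check conservative: the admin message is retried later rather than serving reads from a version that cannot answer for every partition.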

Code changes

Added new code behind a config. If so list the config names and their default values in the PR description.
Introduced new log lines.
- Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues.
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

New unit tests added.
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.
Yes. Clearly explain the behavior change and its impact.
Roll forward to future version can fail by throwing an exception and will be retried automatically.
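
As a rough sketch of the behavior change seen from the parent side: the parent could collect per-region failures, truncate the parent topic regardless, and then throw with the details. Everything here is hypothetical (the RegionAction interface, the method names, the region names in the test, and the truncate-then-throw ordering are assumptions); it does not reproduce VeniceParentHelixAdmin.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names, signatures, and ordering are assumptions.
class ParentRollForwardSketch {
  /** Stand-in for the per-region roll forward call; may throw if the region is not ready. */
  interface RegionAction {
    void rollForward(String region) throws Exception;
  }

  static void rollForward(List<String> regions, RegionAction action, Runnable truncateParentTopic) {
    // Remember which regions failed and why, preserving region order for the message.
    Map<String, String> failures = new LinkedHashMap<>();
    for (String region : regions) {
      try {
        action.rollForward(region);
      } catch (Exception e) {
        failures.put(region, e.getMessage());
      }
    }
    // Truncate even when some regions failed, so that a new push is not blocked.
    truncateParentTopic.run();
    if (!failures.isEmpty()) {
      throw new IllegalStateException("Roll forward failed in regions: " + failures);
    }
  }
}
```

A caller would see the exception for the failed regions while the regions that passed the check have already rolled forward; per the description above, the admin channel message is what drives the automatic retry.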

m-nagarajan

Thanks @misyel and @majisourav99 for the review. Addressed some comments and left some replies.

majisourav99

LGTM! Thanks Manoj for fixing the issues in both server and controller.

m-nagarajan · 2025-05-01T18:34:18Z

Thanks @majisourav99 and @misyel for the reviews.

…y to serve replicas (linkedin#1741)

Child controllers currently process the rollForwardToFutureVersion message from admin channel and, if a future version exists, promote it without confirming that each partition has the required ready-to-serve replicas. If a rebalance is in progress or a host goes down, this can cause read failures.

This PR adds a readiness check: before rolling forward, child controllers now verify that all partitions of the future version meet the minActiveReplicas threshold. If any region fails, the roll-forward is aborted in that region, the parent controller throws a detailed exception, and still truncates the parent topic so new pushes are not blocked. The admin message stays in the channel and will automatically retry until every partition is healthy and roll forward succeeds.

m-nagarajan added 2 commits April 28, 2025 21:35

Fail rollForwardToFutureVersion if not all child regions have enough ready to serve instances

598e332

add an extra check on the region filter before failing

fea5079

misyel reviewed Apr 29, 2025

majisourav99 reviewed Apr 29, 2025

internal/venice-common/src/main/java/com/linkedin/venice/controllerapi/ControllerClient.java

majisourav99 reviewed Apr 29, 2025

services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java

Address review comments and add some retry in rollForwardToFutureVersion

75401e4

m-nagarajan commented Apr 29, 2025

m-nagarajan added 2 commits April 29, 2025 15:50

fix the size of the log

ff8d5b8

address review comments

defda5a

m-nagarajan requested review from majisourav99 and misyel April 30, 2025 04:47

majisourav99 reviewed Apr 30, 2025

...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java

majisourav99 reviewed Apr 30, 2025

...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java

address review comments

1ca8680

m-nagarajan requested a review from majisourav99 May 1, 2025 01:24

majisourav99 approved these changes May 1, 2025

m-nagarajan merged commit ca9880c into linkedin:main May 1, 2025
58 checks passed

minhmo1620 mentioned this pull request May 1, 2025

[controller] Fix symbol reference #1755

Merged

14 tasks
