-
Notifications
You must be signed in to change notification settings - Fork 113
[controller] Fail roll forward if not all partitions have enough ready to serve replicas #1741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[controller] Fail roll forward if not all partitions have enough ready to serve replicas #1741
Conversation
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java
Outdated
Show resolved
Hide resolved
...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java
Outdated
Show resolved
Hide resolved
...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java
Outdated
Show resolved
Hide resolved
...Test/java/com/linkedin/venice/endToEnd/TestDeferredVersionSwapWithoutTargetedRegionPush.java
Show resolved
Hide resolved
internal/venice-common/src/main/java/com/linkedin/venice/controllerapi/ControllerClient.java
Outdated
Show resolved
Hide resolved
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java
Show resolved
Hide resolved
m-nagarajan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @misyel and @majisourav99 for the review. Addressed some comments and left some replies.
internal/venice-common/src/main/java/com/linkedin/venice/controllerapi/ControllerClient.java
Outdated
Show resolved
Hide resolved
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java
Show resolved
Hide resolved
...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java
Outdated
Show resolved
Hide resolved
...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java
Outdated
Show resolved
Hide resolved
...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java
Outdated
Show resolved
Hide resolved
...s/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java
Outdated
Show resolved
Hide resolved
majisourav99
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks Manoj for fixing the issues in both server and controller.
|
Thanks @majisourav99 and @misyel for the reviews. |
…y to serve replicas (linkedin#1741) Child controllers currently process the rollForwardToFutureVersion message from admin channel and, if a future version exists, promote it without confirming that each partition has the required ready-to-serve replicas. If a rebalance is in progress or a host goes down, this can cause read failures. This PR adds a readiness check: before rolling forward, child controllers now verify that all partitions of the future version meet the minActiveReplicas threshold. If any region fails, the roll-forward is aborted in that region, the parent controller throws a detailed exception, and still truncates the parent topic so new pushes are not blocked. The admin message stays in the channel and will automatically retry until every partition is healthy and roll forward succeeds.
Problem Statement
rollForwardToFutureVersionin parent controller adds a message in admin channel and once consumed by the child controllers, it truncates the parent kafka topic and succeeds. The child controllers consumes this message and roll forward to the future version if present with out checking whether the future version partitions have enough ready to serve replicas. This might result in read failures if there is some rebalancing or some host goes down around that time.Solution
The issue can happen via
Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
Does this PR introduce any user-facing or breaking changes?
Roll forward to future version can fail by throwing an exception and will be retried automatically.