
Guard against a concurrent downscaling and rollout #79

Closed
jhalterman opened this issue Sep 13, 2023 · 4 comments · Fixed by #82

@jhalterman
Member

jhalterman commented Sep 13, 2023

At present the rollout operator prevents concurrent rollout operations, but doesn’t prevent concurrent downscale + rollout operations. This can happen, for example, if the controller is currently rolling zone B or C, and then something downscales zone A.

In order to guard against this, I propose using an annotation on the resource being scaled to represent a “lock” across the scaling group. Ex:

grafana.com/scaling-lock: rollout

Creating a scaling-lock

When the rollout operator reconciles, it will first choose a resource, if any, that already has a scaling lock, allowing reentrancy. If none exists, it will add a scaling-lock annotation to the sts when it begins rolling its pods. A scaling lock should only be needed when rolling pods, not when the controller is just adjusting replicas up/down (since last-downscaled already provides protection there).
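
As a rough illustration (not part of the proposal itself), acquiring the lock could look something like the sketch below, assuming the Python Kubernetes client and a rollout-group label selector for finding the other members of the scaling group; the helper name is illustrative:

from kubernetes import client

SCALING_LOCK = "grafana.com/scaling-lock"

def acquire_scaling_lock(apps: client.AppsV1Api, namespace: str, group: str, sts_name: str) -> bool:
    # Find the other members of the scaling group (label name is an assumption).
    group_stss = apps.list_namespaced_stateful_set(
        namespace, label_selector=f"rollout-group={group}"
    ).items
    # Reentrancy: if another statefulset already holds the lock, do not start rolling this one.
    for sts in group_stss:
        annotations = sts.metadata.annotations or {}
        if SCALING_LOCK in annotations and sts.metadata.name != sts_name:
            return False
    # No other holder: record the lock on this statefulset before rolling its pods.
    apps.patch_namespaced_stateful_set(
        sts_name, namespace,
        {"metadata": {"annotations": {SCALING_LOCK: "rollout"}}},
    )
    return True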

Validating the scaling-lock

When the scaling-lock is present on any resource in the rollout-group, the downscale webhook will reject downscale attempts for resources that don't have the scaling lock. For example: if the scaling lock is present on the zone B statefulset, zone A's statefulset cannot be downscaled.
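
A minimal sketch of that webhook check, under the same assumptions (Python Kubernetes client, rollout-group label; function name is illustrative):

from kubernetes import client

SCALING_LOCK = "grafana.com/scaling-lock"

def downscale_allowed(apps: client.AppsV1Api, namespace: str, group: str, target_sts: str) -> bool:
    # Reject the downscale if any other statefulset in the rollout-group holds the lock.
    group_stss = apps.list_namespaced_stateful_set(
        namespace, label_selector=f"rollout-group={group}"
    ).items
    for sts in group_stss:
        annotations = sts.metadata.annotations or {}
        if SCALING_LOCK in annotations and sts.metadata.name != target_sts:
            return False  # e.g. zone B holds the lock, so zone A may not be downscaled
    return True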

Removing the scaling-lock

When the rollout-operator reconciles a resource and sees no non-ready pods and nothing left to do, it will clear the scaling-lock on that resource, if one is present.
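
A sketch of the release step, treating "no non-ready pods and nothing to do" as all replicas being updated and ready (again just an illustration with the Python Kubernetes client):

from kubernetes import client

SCALING_LOCK = "grafana.com/scaling-lock"

def release_scaling_lock_if_idle(apps: client.AppsV1Api, namespace: str, sts: client.V1StatefulSet) -> None:
    if SCALING_LOCK not in (sts.metadata.annotations or {}):
        return
    status = sts.status
    # Nothing left to do: every replica is updated and ready.
    if status.replicas == status.updated_replicas == status.ready_replicas:
        # Setting the annotation value to None removes it via a merge patch.
        apps.patch_namespaced_stateful_set(
            sts.metadata.name, namespace,
            {"metadata": {"annotations": {SCALING_LOCK: None}}},
        )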

Other alternatives

While this could also be accomplished by extending the last-updated annotation to be updated by the rollout-operator’s controller, that would block downscales from taking place for some time after rollouts are complete, which is not necessary. We just want to prevent concurrent scaling from occurring.

@jhalterman jhalterman changed the title Add a resource scaling lock Guard against concurrent a downscaling and rollout Sep 13, 2023
@jhalterman jhalterman changed the title Guard against concurrent a downscaling and rollout Guard against a concurrent downscaling and rollout Sep 13, 2023
@56quarters
Contributor

56quarters commented Sep 13, 2023

I'd like to propose an alternative that I think would be a little simpler and would not involve modifying the statefulset annotations.

Instead of adding a lock annotation that the prepare-downscale webhook checks for, we put roughly the same logic into the prepare-downscale webhook that the controller would use in your plan.

In pseudocode, the prepare-downscale hook would check:

def is_being_rolled_out(statefulsets) -> bool:
    for sts in statefulsets:
        # Any statefulset whose pods are not all updated and ready means a rollout is in progress.
        if not (sts.state.replicas == sts.state.updated == sts.state.ready):
            return True
    return False

The prepare-downscale webhook would run this whenever it's called and reject the request if there is a rollout happening. This has the advantage of disallowing downscaling as soon as the updates to statefulsets (via PR + flux) are merged: all pods will no longer be considered "updated".
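
For illustration, wiring that check into the prepare-downscale webhook might look like the sketch below with the Python Kubernetes client, using the client's actual status fields (replicas, updated_replicas, ready_replicas) in place of the pseudocode's sts.state fields; the rollout-group label selector and function names are assumptions:

from kubernetes import client

def is_being_rolled_out(statefulsets) -> bool:
    # Mirrors the pseudocode above with the client's real status fields.
    for sts in statefulsets:
        status = sts.status
        if not (status.replicas == status.updated_replicas == status.ready_replicas):
            return True
    return False

def reject_downscale_during_rollout(namespace: str, rollout_group: str) -> dict:
    apps = client.AppsV1Api()
    stss = apps.list_namespaced_stateful_set(
        namespace, label_selector=f"rollout-group={rollout_group}"
    ).items
    if is_being_rolled_out(stss):
        # AdmissionReview-style rejection returned by the prepare-downscale webhook.
        return {"allowed": False,
                "status": {"message": "downscale rejected: a rollout is in progress"}}
    return {"allowed": True}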

I also think this would be a better fit for the way the rollout operator works: there's not really any "state" to a rollout, the rollout operator reads the state of the world and adjusts statefulsets accordingly.

@jhalterman
Member Author

jhalterman commented Sep 13, 2023

I like that proposal. There is a race though, where if a rollout is in progress and gets back to a steady state where sts.state.replicas == sts.state.updated == sts.state.ready, then a downscale could start at the same time. This would cause the next rollout iteration to fail, and we'd be stuck mid-rollout.

One way we could mitigate this is by making sure each rollout iteration doesn't delete all of the pods - it could delete max-unavailable - 1. Only after the last iteration would all replicas be updated and ready again, at which point a downscale could be allowed.

@56quarters
Contributor

> I like that proposal. There is a race though, where if a rollout is in progress and gets back to a steady state where sts.state.replicas == sts.state.updated == sts.state.ready, then a downscale could start at the same time. This would cause the next rollout iteration to fail, and we'd be stuck mid-rollout.

I'm not clear about how you think it would get back to steady state in the middle of a rollout. Once a statefulset is updated as part of a rollout, all the pods in that statefulset would be considered not "updated".

@jhalterman
Member Author

You're right. I was thinking that pods were marked as updated in batches for some reason, but of course that's not the case.
