Make zone spread only apply within a given revision #1724
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alvaroaleman. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed 27f608f to d279a8c.
Looks like you need to run unit tests with UPDATE=1; other than that, it looks good to me.
/test capi-provider-agent-sanity
Prow issue; the test never started.
Force-pushed 06ebba5 to 3722ccf.
support/config/deployment.go (Outdated)

```diff
-func (c *DeploymentConfig) setMultizoneSpread(labels map[string]string) {
-	if labels == nil {
+func (c *DeploymentConfig) setMultizoneSpread(pod *corev1.PodTemplateSpec) {
+	if !c.setDefaults || c.Replicas <= 1 {
 		return
```
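For context, zone spread via pod anti-affinity generally renders on the resulting pod roughly like the sketch below. The label key `pod-template-hash` and the hash value are illustrative assumptions, not taken from this PR, and whether the rule is required or preferred here is likewise an assumption:

```yaml
metadata:
  labels:
    app: kube-apiserver            # existing pod labels
    pod-template-hash: 6b8f9d4c    # hypothetical per-revision hash label
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: topology.kubernetes.io/zone
        labelSelector:
          matchLabels:
            app: kube-apiserver
            pod-template-hash: 6b8f9d4c   # selector now scoped to one revision
```

Because the selector includes the hash label, pods of a new revision no longer repel pods of the old revision, which is what unblocks the rollout.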
Do we really need any check? Since we are using a unique hash per Deployment, I think we could simplify and apply this unconditionally now.
This just keeps the previous behavior
I consider the current behaviour and implementation suboptimal. Using the hash enables us to always apply a consistent set of labels and affinity unconditionally. That simplifies the code, but also reduces the number of different config combinations and the divergence between the clusters we create.
Updated
Btw, this change shaves a good 25% off the runtime of our e2e tests. Without it, we always need more than 2h; with it, we need around 1h30m. We likely need openshift/release#31897 to make it stable, though.
@enxebre removed the
Force-pushed 97b2e84 to 806f7ca.
lgtm; unit and e2e fail, though.
We currently apply zone spread to all revisions of a given workload. This means that a new revision can only be rolled out after a replica of the old revision has been removed. This PR fixes that by:

* Moving the calculation of the zone-spread affinity into the ApplyTo funcs
* Calculating a hash of the podTemplate there
* Applying a label with that hash to the pod
* Using all pod labels, including the one with the hash, in the AntiAffinity rule

As a side effect, this removes the requirement to know the pod's labels by the time DeploymentConfig.SetDefaults() is called. In many cases they weren't known by that time, so it was called with a nil label map, which caused the zone-spread code to be short-circuited. With this change, everything that calls SetDefaults and has more than one replica gets the zone spread, as it makes sense for all components.

I manually verified that with this change and a sufficiently sized management cluster, an HA control-plane upgrade results in zero failed-scheduling events.
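The per-revision hash idea described above can be sketched as follows. `podTemplate` and `templateHash` are simplified stand-ins invented for illustration — the real change hashes a `corev1.PodTemplateSpec` — so this shows the mechanism, not the PR's actual implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// podTemplate is a simplified stand-in for corev1.PodTemplateSpec,
// with just enough structure to illustrate the idea.
type podTemplate struct {
	Labels map[string]string `json:"labels"`
	Image  string            `json:"image"`
}

// templateHash returns a short, deterministic hash of the template. Any
// change to the template (e.g. a new image) produces a new hash, so pods
// from different revisions carry different label values and the
// anti-affinity rule only matches pods of the same revision.
func templateHash(t podTemplate) string {
	b, err := json.Marshal(t) // map keys are marshaled in sorted order, so this is deterministic
	if err != nil {
		panic(err)
	}
	sum := sha256.Sum256(b)
	return fmt.Sprintf("%x", sum[:4]) // 8 hex chars, short enough for a label value
}

func main() {
	oldTpl := podTemplate{Labels: map[string]string{"app": "kas"}, Image: "kas:v1"}
	newTpl := podTemplate{Labels: map[string]string{"app": "kas"}, Image: "kas:v2"}

	// Different revisions get different hashes, so a new revision can be
	// scheduled before the old one is torn down.
	fmt.Println(templateHash(oldTpl) != templateHash(newTpl)) // prints "true"
	// The hash is stable for an unchanged template.
	fmt.Println(templateHash(oldTpl) == templateHash(oldTpl)) // prints "true"
}
```

Labeling the pod with this hash is also why SetDefaults no longer needs the labels up front: the selector can be derived from the final template inside ApplyTo.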
/retest-required
@enxebre the presubmit finally passed without any failed scheduling event. The upgrade failure is due to OCPBUGS-990; until we have a fix for that promoted into an N-1 release, it won't pass, but that is unrelated to this change.
/lgtm
The upgrade job is known to be broken, and everything else passed.
@alvaroaleman: Overrode contexts on behalf of alvaroaleman: ci/prow/e2e-aws. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@alvaroaleman: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story): ref https://issues.redhat.com/browse/HOSTEDCP-518
Checklist