bug 1804717: Switch to managing openshift-apiserver pods with a deployment #313
Conversation
@deads2k Some questions:
pkg/operator/workloadcontroller/workload_controller_openshiftapiserver_v311_00.go (resolved review thread)
No, just rollout. We can tolerate having some extras for a bit.
@sttts Updated, PTAL
Force-pushed from 17e9b5a to 41b66fa
Force-pushed from ce422f9 to e67a91f
pkg/operator/starter.go (outdated diff)
@@ -246,6 +248,10 @@ func RunOperator(ctx context.Context, controllerConfig *controllercmd.Controller
		"Progressing",
		// in 4.1.0-4.3.0 this was used for indicating the apiserver daemonset was available
		"Available",
		// in 4.4.0 this was used for indicating the conditions of the apiserver daemonset.
4.1.0-4.3.z
Done
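(For reference, pruning stale condition types like these is mechanically simple. The sketch below is illustrative only, not this repo's code: a hypothetical helper that drops deprecated condition types from an operator status, assuming the operatorv1 types from github.com/openshift/api.)

    package sketch

    import (
        operatorv1 "github.com/openshift/api/operator/v1"
    )

    // removeStaleConditions is a hypothetical helper: it drops condition types
    // that the daemonset-based controllers set in 4.1.0-4.3.z and that the
    // deployment-based workload controller no longer manages.
    func removeStaleConditions(status *operatorv1.OperatorStatus, staleTypes ...string) {
        stale := make(map[string]bool, len(staleTypes))
        for _, t := range staleTypes {
            stale[t] = true
        }
        kept := status.Conditions[:0]
        for _, c := range status.Conditions {
            if !stale[c.Type] {
                kept = append(kept, c)
            }
        }
        status.Conditions = kept
    }

    // usage (illustrative): removeStaleConditions(&operatorStatus, "Progressing", "Available")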
pkg/operator/starter.go (outdated diff)
	// It's safe not to check the error returned by the poll function so long
	// as conditionFunc never returns an error. The poll function will only
	// exit on success or if the stop channel halts execution.
	go wait.PollImmediateUntil(time.Minute, func() (bool, error) {
UntilWithContext please. This should run forever. I'm ok doing it on two cadences if you prefer (fast until we delete, slow after that), but this will run every time the pod restarts, which is just a controller on an unpredictable resync cadence.
Done. I didn't bother with 2 cadences, can address that if you feel strongly.
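(As a reference for the pattern requested above, here is a minimal hypothetical sketch, not the PR's actual code, of running the daemonset cleanup forever on a single cadence with wait.UntilWithContext. It assumes a context-aware client-go (v0.18 or later) and that the legacy daemonset is openshift-apiserver/apiserver; it would be started with go ensureLegacyDaemonSetRemoved(ctx, kubeClient).)

    package sketch

    import (
        "context"
        "time"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
        "k8s.io/klog/v2"
    )

    // ensureLegacyDaemonSetRemoved runs until ctx is cancelled and keeps trying
    // to delete the pre-4.4 apiserver daemonset. Unlike PollImmediateUntil, the
    // loop never exits on "success", so it also covers the case where an old
    // operator pod briefly recreates the daemonset during upgrade.
    func ensureLegacyDaemonSetRemoved(ctx context.Context, kubeClient kubernetes.Interface) {
        wait.UntilWithContext(ctx, func(ctx context.Context) {
            err := kubeClient.AppsV1().DaemonSets("openshift-apiserver").Delete(ctx, "apiserver", metav1.DeleteOptions{})
            if err != nil && !apierrors.IsNotFound(err) {
                klog.Warningf("failed to delete legacy openshift-apiserver daemonset: %v", err)
            }
        }, time.Minute)
    }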
Force-pushed from 53d1f90 to d1c4d90
@deads2k Updated, PTAL
/lgtm
This could be improved by emitting some events, but we need this badly enough not to quibble.
@deads2k What events would you like to see? Successful deletion of the daemonset? Deployment doesn't have replicas yet?
/retest |
/retest |
Updated to use explicit tolerations which should result in e2e passing, PTAL.
/retest |
/test all |
/retest |
/test all |
metadata:
  namespace: openshift-apiserver
  name: apiserver
  labels:
    app: openshift-apiserver
    apiserver: "true"
spec:
  updateStrategy:
    type: RollingUpdate
  replicas: 3
What does this mean for single-node clusters? Will we always be progressing?
Shouldn't that be observed from the number of masters in the cluster? What if someone scales them to 5?
3 instances in a 5-master-node cluster at least sounds sane and doesn't break.
> What does this mean for single-node clusters? Will we always be progressing?
Good catch. The logic should determine health by checking whether the replica count is equal to or greater than the number of masters, maintaining the invariant previously guaranteed by the daemonset.
> 3 instances in a 5-master-node cluster at least sounds sane and doesn't break.
Is that preferable to adjusting the replica count according to the number of masters that are present?
Single-node clusters are not a supported platform. Refinements could be made, but I don't see them as blocking us from stabilizing upgrades for supported platforms.
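(For illustration only: deriving the replica count from the master pool instead of hardcoding 3 could look roughly like the sketch below. This is a hypothetical helper, not code from this PR; it assumes the standard node-role.kubernetes.io/master label and a context-aware client-go.)

    package sketch

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // desiredAPIServerReplicas counts master nodes and uses that as the
    // deployment's replica count, so single-node and 5-master clusters behave
    // consistently with the old one-pod-per-master daemonset.
    func desiredAPIServerReplicas(ctx context.Context, client kubernetes.Interface) (int32, error) {
        masters, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
            LabelSelector: "node-role.kubernetes.io/master",
        })
        if err != nil {
            return 0, err
        }
        replicas := int32(len(masters.Items))
        if replicas < 1 {
            // never scale to zero, even if the label query comes back empty
            replicas = 1
        }
        return replicas, nil
    }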
@@ -363,6 +369,7 @@ func manageOpenShiftAPIServerDaemonSet_v311_00_to_latest(

	required.Labels["revision"] = strconv.Itoa(int(operatorConfig.Status.LatestAvailableRevision))
	required.Spec.Template.Labels["revision"] = strconv.Itoa(int(operatorConfig.Status.LatestAvailableRevision))
	required.Spec.Template.Labels["previousGeneration"] = strconv.Itoa(int(operatorConfig.Status.LatestAvailableRevision))
why this?
Removed. I have no recollection of where that came from.
		desiredReplicas = *(actualDeployment.Spec.Replicas)
	}

	deploymentHasAllPodsAvailable := actualDeployment.Status.AvailableReplicas == desiredReplicas
Do we enforce that two replicasets for the deployment don't overlap and temporarily increase this number even though we are still progressing?
No, we don't prevent overlap. As per your previous comment, health should be determined by comparing the replica count to the master count.
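(A sketch of the kind of check being discussed, illustrative only: masterNodeCount would come from something like a node lister, and the fields are standard appsv1 Deployment status fields.)

    package sketch

    import (
        appsv1 "k8s.io/api/apps/v1"
    )

    // deploymentAvailableForMasters treats the master count as an availability
    // floor. Using >= instead of == means a temporary surge of pods from an
    // overlapping old ReplicaSet doesn't flip the result during a rollout, and
    // a single-node cluster isn't judged against a hardcoded 3.
    func deploymentAvailableForMasters(deployment *appsv1.Deployment, masterNodeCount int32) bool {
        if deployment == nil {
            return false
        }
        // Make sure the reported status reflects the spec we most recently applied.
        if deployment.Status.ObservedGeneration < deployment.Generation {
            return false
        }
        return deployment.Status.AvailableReplicas >= masterNodeCount
    }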
It would be nice if the description said why.
I'll leave it to @deads2k to explain. I have no idea why, and the bz is short on details.
There is an issue draining nodes. Daemonsets are not evicted early. When the termination finally happens, there is an issue where graceful deletion doesn't take place. In addition, it opens us up to SDN restart issues which result in API outages.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deads2k, marun.
/cherrypick release-4.4
@deads2k: once the present PR merges, I will cherry-pick it on top of release-4.4 in a new PR and assign it to you.
@marun: All pull requests linked via external trackers have merged. Bugzilla bug 1804717 has been moved to the MODIFIED state.
@deads2k: #313 failed to apply on top of branch "release-4.4".
#324 addresses outstanding comments on this PR. If the cherry-pick bot can't apply cleanly, does that imply that a manual PR is required?
The switch from using a daemonset to using a deployment is intended to improve reliability during upgrade.
bz: https://bugzilla.redhat.com/show_bug.cgi?id=1804717
/cc @sttts @deads2k