
bug 1804717: Switch to managing openshift-apiserver pods with a deployment #313

Merged
2 commits merged into openshift:master from ds-to-deploy on Feb 27, 2020

Conversation

@marun (Contributor) commented Feb 20, 2020

The switch from using a daemonset to using a deployment is intended to improve reliability during upgrade.

bz: https://bugzilla.redhat.com/show_bug.cgi?id=1804717

/cc @sttts @deads2k

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 20, 2020
@marun (Contributor, Author) commented Feb 20, 2020

@deads2k Some questions:

  • should removal of the daemonset block creation of the deployment (and vice versa for cleanup in previous releases on downgrade)?
  • the bz mentioned wiring HPA to enable scaling - should that be tackled in this PR or separately?

pkg/operator/starter.go — outdated review thread (resolved)
@deads2k (Contributor) commented Feb 20, 2020

  • should removal of the daemonset block creation of the deployment (and vice versa for cleanup in previous releases on downgrade)?

no, just rollout. We can tolerate having some extras for a bit.

@marun (Contributor, Author) commented Feb 20, 2020

@sttts Updated, PTAL

@marun force-pushed the ds-to-deploy branch 3 times, most recently from 17e9b5a to 41b66fa on February 20, 2020 at 16:43
@marun changed the title from "WIP Bug 1804717 Switch to managing openshift-apiserver pods with a deployment" to "Bug 1804717 Switch to managing openshift-apiserver pods with a deployment" on Feb 20, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 20, 2020
@marun force-pushed the ds-to-deploy branch 2 times, most recently from ce422f9 to e67a91f on February 20, 2020 at 16:57
@@ -246,6 +248,10 @@ func RunOperator(ctx context.Context, controllerConfig *controllercmd.Controller
"Progressing",
// in 4.1.0-4.3.0 this was used for indicating the apiserver daemonset was available
"Available",
// in 4.4.0 this was used for indicating the conditions of the apiserver daemonset.
Contributor:
4.1.0-4.3.z

Contributor (Author):
Done

// It's safe not to check the error returned by the poll function so long
// as conditionFunc never returns an error. The poll function will only
// exit on success or if the stop channel halts execution.
go wait.PollImmediateUntil(time.Minute, func() (bool, error) {
Contributor:
UntilWithContext please. This should run forever. I'm ok doing it on two cadences if you prefer (fast until we delete, slow after that), but this will run every time the pod restarts, which is just a controller on an unpredictable resync cadence.

Contributor (Author):
Done. I didn't bother with 2 cadences, can address that if you feel strongly.

@marun force-pushed the ds-to-deploy branch 2 times, most recently from 53d1f90 to d1c4d90 on February 20, 2020 at 18:51
@marun (Contributor, Author) commented Feb 20, 2020

@deads2k Updated, PTAL

@deads2k (Contributor) commented Feb 20, 2020

/lgtm

this could be improved by emitting some events, but we need this badly enough not to quibble.

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 20, 2020
@marun (Contributor, Author) commented Feb 20, 2020

@deads2k What events would you like to see? Successful deletion of daemonset? Deployment doesn't have replicas yet?

@marun (Contributor, Author) commented Feb 24, 2020

/retest

1 similar comment
@marun (Contributor, Author) commented Feb 25, 2020

/retest

The switch from using a daemonset to using a deployment is intended to
improve reliability during upgrade.
@marun (Contributor, Author) commented Feb 25, 2020

Updated to use explicit tolerations which should result in e2e passing, PTAL.

@marun (Contributor, Author) commented Feb 25, 2020

/retest

@marun (Contributor, Author) commented Feb 25, 2020

/test all

@marun (Contributor, Author) commented Feb 26, 2020

/retest

@marun (Contributor, Author) commented Feb 26, 2020

/test all

metadata:
  namespace: openshift-apiserver
  name: apiserver
  labels:
    app: openshift-apiserver
    apiserver: "true"
spec:
  updateStrategy:
    type: RollingUpdate
  replicas: 3
Contributor:
what does this mean for single-node clusters? Will we be always progressing?

Contributor:
shouldn't that be derived from the number of masters in the cluster? what if someone scales them to 5?

Contributor:
3 instances in a 5 master node cluster at least sounds sane and doesn't break.

Contributor (Author):
what does this mean for single-node clusters? Will we be always progressing?

Good catch. The logic should determine health by checking whether the replica count is equal to or greater than the number of masters, to maintain the invariant previously guaranteed by the daemonset.

Contributor (Author):
3 instances in a 5 master node cluster at least sounds sane and doesn't break.

Is that preferable to adjusting the replica count according to the number of masters that are present?

Contributor:
single node clusters are not a supported platform. Refinements could be made, but I don't see them as blocking us from stabilizing upgrades for supported platforms.

@@ -363,6 +369,7 @@ func manageOpenShiftAPIServerDaemonSet_v311_00_to_latest(

required.Labels["revision"] = strconv.Itoa(int(operatorConfig.Status.LatestAvailableRevision))
required.Spec.Template.Labels["revision"] = strconv.Itoa(int(operatorConfig.Status.LatestAvailableRevision))
required.Spec.Template.Labels["previousGeneration"] = strconv.Itoa(int(operatorConfig.Status.LatestAvailableRevision))
Contributor:
why this?

Contributor (Author):
Removed. I have no recollection of where that came from.

desiredReplicas = *(actualDeployment.Spec.Replicas)
}

deploymentHasAllPodsAvailable := actualDeployment.Status.AvailableReplicas == desiredReplicas
Contributor:
do we enforce that two replicasets for the deployment don't overlap and increase this number temporarily although we are still progressing?

Contributor (Author):
No, we do not enforce no overlap. As per your previous comment, health should be determined by comparing replica count to master count.

@tnozicka (Contributor) commented:

The switch from using a daemonset to using a deployment is intended to improve reliability during upgrade.

it would be nice if the description said why

@marun (Contributor, Author) commented Feb 26, 2020

The switch from using a daemonset to using a deployment is intended to improve reliability during upgrade.

it would be nice if the description said why

I'll leave it to @deads2k to explain. I have no idea why and the bz is short on details.

@deads2k (Contributor) commented Feb 27, 2020

The switch from using a daemonset to using a deployment is intended to improve reliability during upgrade.

it would be nice if the description said why

There is an issue draining nodes. Daemonsets are not evicted early. When the termination finally happens, there is an issue where graceful deletion doesn't take place. In addition, it is opening us up to SDN restart issues which result in API outages.

@deads2k (Contributor) commented Feb 27, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2020
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, marun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 27, 2020
@deads2k (Contributor) commented Feb 27, 2020

/cherrypick release-4.4

@openshift-cherrypick-robot

@deads2k: once the present PR merges, I will cherry-pick it on top of release-4.4 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 9eda24a into openshift:master Feb 27, 2020
@openshift-ci-robot:

@marun: All pull requests linked via external trackers have merged. Bugzilla bug 1804717 has been moved to the MODIFIED state.

In response to this:

bug 1804717: Switch to managing openshift-apiserver pods with a deployment

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@deads2k: #313 failed to apply on top of branch "release-4.4":

Applying: Update bindata
error: Failed to merge in the changes.
Using index info to reconstruct a base tree...
M	pkg/operator/v311_00_assets/bindata.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/operator/v311_00_assets/bindata.go
CONFLICT (content): Merge conflict in pkg/operator/v311_00_assets/bindata.go
Patch failed at 0002 Update bindata

In response to this:

/cherrypick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@marun (Contributor, Author) commented Feb 27, 2020

#324 addresses outstanding comments on this PR.

If the cherry-pick bot can't apply cleanly, does that imply that a manual PR is required?

Labels: approved, bugzilla/valid-bug, lgtm, size/L
8 participants