Bug 1867608: ds/machine-config-daemon: Set maxUnavailable 10% #1992

sdodson · 2020-08-08T01:19:37Z

This daemonset is not critical to availability the way the SDN or DNS pods are,
so allow maxUnavailable to scale with cluster size. This roughly increases rollout
speed by 10x on a 250 node cluster.
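
The arithmetic behind a percentage-based maxUnavailable can be sketched as follows. This is a rough illustration, not code from this PR. It assumes the previous setting was the Kubernetes default of `maxUnavailable: 1`, and that a percentage is converted to a pod count by rounding up, as the Kubernetes documentation describes for DaemonSet rolling updates. The function names are hypothetical, and the wave count is only a crude proxy for wall-clock rollout time.

```python
def max_unavailable_pods(node_count: int, percent: int = 10) -> int:
    """Pods allowed to be unavailable at once for a percentage setting.

    Assumption: the percentage is rounded up to a whole pod, so small
    clusters still get a batch size of at least 1.
    """
    if node_count <= 0:
        return 0
    # Integer ceiling of node_count * percent / 100, avoiding float error.
    return (node_count * percent + 99) // 100


def rollout_waves(node_count: int, batch_size: int) -> int:
    """Number of sequential batches needed to touch every node once."""
    if node_count <= 0:
        return 0
    return (node_count + batch_size - 1) // batch_size


if __name__ == "__main__":
    for nodes in (5, 20, 250):
        batch = max_unavailable_pods(nodes)
        print(
            f"{nodes} nodes: batch of {batch}, "
            f"{rollout_waves(nodes, batch)} waves "
            f"(vs {rollout_waves(nodes, 1)} waves with maxUnavailable=1)"
        )
```

On a 20-node cluster this gives a batch of 2, which is why the verification step asks for 20 or more hosts: below that, 10% still rounds to a single pod and the behavior is indistinguishable from the assumed default.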

To verify this, provision a cluster with 20 or more hosts. Confirm that an update
to the machine-config-daemon daemonset allows up to 10% of its pods to update at once
but otherwise operates normally.

sdodson · 2020-08-08T01:21:08Z

This is similar to others I'm updating as well. For some reason this one feels riskier than those, given the criticality of the MCO, but I don't think it really is. Feedback welcome though.

openshift/cluster-monitoring-operator#902
openshift/cluster-node-tuning-operator#149

sdodson · 2020-08-08T16:54:55Z

/retest

sdodson · 2020-08-10T13:14:16Z

/retitle Bug 1867608: ds/machine-config-daemon: Set maxUnavailable 10%

sdodson · 2020-08-10T13:14:23Z

/retest

openshift-ci-robot · 2020-08-10T13:14:38Z

@sdodson: This pull request references Bugzilla bug 1867608, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.6.0) matches configured target release for branch (4.6.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1867608: ds/machine-config-daemon: Set maxUnavailable 10%

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kikisdeliveryservice · 2020-08-14T00:45:00Z

/test e2e-aws

runcom · 2020-09-01T11:29:04Z

@sdodson given how critical the MCO is, what do you think about deferring this to the next release to give it more time to soak? The issue is known (around timing), but deferring will make us more comfortable and leave us ready to debug if anything goes south. Would that work?

cgwalters · 2020-09-01T12:43:00Z

/approve
Agree with the rationale here - in fact most of the time the MCD is effectively idle I believe. We watch for ssh events from journald but I'm not thinking offhand of anything else we react to other than when the MCC changes the desiredConfig annotation.

sdodson · 2020-09-01T13:30:11Z

Yeah, I'm fine with deferring this as long as we're open to backporting to 4.6.z before it transitions to maintenance phase. This is just an optimization.

sdodson · 2020-09-01T13:31:11Z

/bugzilla refresh

openshift-ci-robot · 2020-09-01T13:31:15Z

@sdodson: This pull request references Bugzilla bug 1867608, which is invalid:

expected the bug to target the "4.6.0" release, but it targets "4.7.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

openshift-ci-robot · 2020-09-14T18:04:37Z

@sdodson: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi | ef0f01a5630edbbadfbec17690be49f94237d6bc | link | `/test e2e-metal-ipi` |
| ci/prow/okd-e2e-aws | ef0f01a5630edbbadfbec17690be49f94237d6bc | link | `/test okd-e2e-aws` |
| ci/prow/e2e-ovn-step-registry | ef0f01a5630edbbadfbec17690be49f94237d6bc | link | `/test e2e-ovn-step-registry` |
| ci/prow/e2e-aws | ef0f01a5630edbbadfbec17690be49f94237d6bc | link | `/test e2e-aws` |
| ci/prow/e2e-upgrade | ef0f01a5630edbbadfbec17690be49f94237d6bc | link | `/test e2e-upgrade` |

sdodson · 2020-10-20T12:22:15Z

/bugzilla refresh

openshift-ci-robot · 2020-10-20T12:22:20Z

@sdodson: This pull request references Bugzilla bug 1867608, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.7.0) matches configured target release for branch (4.7.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

cgwalters

/lgtm

openshift-ci-robot · 2020-10-20T13:09:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, sdodson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [cgwalters]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2020-10-20T14:00:16Z

/retest