
[WIP] Bug 1850057: update etcd followers first, use bfq on control plane #1946

Closed
wants to merge 6 commits from the controlplane-upgrades branch

Conversation

cgwalters
Member

@cgwalters cgwalters commented Jul 24, 2020

This PR rolls up a few others, including notably:

Then adds: WIP: Update etcd followers first

The way we're talking to etcd is a bit hacky, I ended up cargo
culting some code. This would be much cleaner if the etcd operator
did it.

But it's critical that we update the etcd followers first, because
leader elections are disruptive events and we can easily minimize
that.
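As a sketch of that ordering decision (not the actual PR code, which lives in the MCO's Go controllers): with `etcdctl endpoint status -w json`, a member is the leader when its own member ID equals the cluster-wide `leader` field, so every other endpoint is a follower and safe to update first. The endpoint names and member IDs below are made up for illustration.

```shell
# Synthetic stand-in for `etcdctl endpoint status -w json` output;
# on a real cluster you would capture this from a live etcd member.
cat > /tmp/endpoint-status.json <<'EOF'
[
  {"Endpoint": "https://master-0:2379", "Status": {"header": {"member_id": 101}, "leader": 103}},
  {"Endpoint": "https://master-1:2379", "Status": {"header": {"member_id": 102}, "leader": 103}},
  {"Endpoint": "https://master-2:2379", "Status": {"header": {"member_id": 103}, "leader": 103}}
]
EOF

# Followers (member_id != leader): update these first.
jq -r '.[] | select(.Status.header.member_id != .Status.leader) | .Endpoint' \
  /tmp/endpoint-status.json | tee /tmp/followers.txt

# The leader (member_id == leader): defer until last.
jq -r '.[] | select(.Status.header.member_id == .Status.leader) | .Endpoint' \
  /tmp/endpoint-status.json | tee /tmp/leader.txt
```

Deferring the leader means at most one election is forced per rollout (when the leader itself finally reboots), instead of one per member that happens to be leading at the time.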

Closes: #1897


Updated release images for testing (v2)

I think the previous results were misleading/broken due to using the same release image as base, new images:

test upgrade registry.svc.ci.openshift.org/coreos/walters-mco-upgrade-release@sha256:25deaaf3074cf984f352ab92fca5f61c4663c9b506ce05c50db8f723a5b386b7 registry.svc.ci.openshift.org/coreos/walters-mco-upgrade-release-target@sha256:1ad5e2fb2df0e4af4abd3c1e59461a7977924201eb0eada524ce7af34d7ac1c4

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 24, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 24, 2020
@cgwalters cgwalters force-pushed the controlplane-upgrades branch 5 times, most recently from 692c5ee to 3dd8b0a Compare July 30, 2020 00:13
@cgwalters
Member Author

OK, a while ago I'd invested some time in tweaking the node controller to have useful logs around what it's doing; my first "point of contact" when looking at upgrades was its logs. But we lose most of those on upgrade since the pod gets killed, so trying to determine success for this PR was annoying.

Ended up reworking things here so the node controller emits events - currently the MCD emits useful events which can be queried afterwards (in our CI runs we dump `events.json`). I added a single event emitted when the node controller changes a node's target config, but also added special events just for the control plane. Going forward I think we should (strategically) do more of this "special-case the control plane", at least for extra logging and care.

@cgwalters cgwalters force-pushed the controlplane-upgrades branch 2 times, most recently from a1a3f0d to 3a0a35e Compare July 30, 2020 03:17
@kikisdeliveryservice
Contributor

Wondering how this squares with @ericavonb comment:
#1947 (comment)

> Catching up on all the issues behind this PR. I don't like adding new etcd-specific logic to the MCO. It feels like this moves us backwards from separating out etcd concerns into its own operator. Can we do this in a more component-agnostic way?
>
> Something like, a node label or annotation that specifies upgrade priority classes maybe?

Since there have been huge efforts to decouple MCO/etcd in 4.5 and move to the etcd operator?

@ericavonb
Contributor

> Since there have been huge efforts to decouple MCO/etcd in 4.5 and move to the etcd operator?

Right. Like looking at this PR, could the etcd-specific stuff in the daemon be done in the etcd-operator? Why are we putting it in the MCO?

@cgwalters
Member Author

cgwalters commented Jul 30, 2020

OK yeah, I think the events are useful. For example, with this PR I can run

`jq '.items | map(select(.source.component == "machineconfigcontroller-nodecontroller" or .source.component == "machineconfigdaemon")) | sort_by(.firstTimestamp | fromdate) | map(.firstTimestamp + " " + .involvedObject.kind + " " + .involvedObject.name + ": " + .message)' < events.json`

and see:

[
  "2020-07-30T03:33:00Z Node ci-op-5tr7l488-28de9-769ls-master-0: Setting node ci-op-5tr7l488-28de9-769ls-master-0, currentConfig rendered-master-af0d95246327961eb25456f7c7b92abd to Done",
  "2020-07-30T03:33:31Z Node ci-op-5tr7l488-28de9-769ls-master-1: Setting node ci-op-5tr7l488-28de9-769ls-master-1, currentConfig rendered-master-af0d95246327961eb25456f7c7b92abd to Done",
  "2020-07-30T03:33:31Z Node ci-op-5tr7l488-28de9-769ls-master-2: Setting node ci-op-5tr7l488-28de9-769ls-master-2, currentConfig rendered-master-af0d95246327961eb25456f7c7b92abd to Done",
  "2020-07-30T03:57:19Z Node ci-op-5tr7l488-28de9-769ls-worker-c-7fp2l: Setting node ci-op-5tr7l488-28de9-769ls-worker-c-7fp2l, currentConfig rendered-worker-4e1c003655b50eb3b2b76659e17fff14 to Done",
  "2020-07-30T03:57:20Z Node ci-op-5tr7l488-28de9-769ls-worker-b-95ssc: Setting node ci-op-5tr7l488-28de9-769ls-worker-b-95ssc, currentConfig rendered-worker-4e1c003655b50eb3b2b76659e17fff14 to Done",
  "2020-07-30T03:57:37Z Node ci-op-5tr7l488-28de9-769ls-worker-d-gtzb4: Setting node ci-op-5tr7l488-28de9-769ls-worker-d-gtzb4, currentConfig rendered-worker-4e1c003655b50eb3b2b76659e17fff14 to Done",
  "2020-07-30T04:28:07Z MachineConfigPool master: Deferring update of etcd leader ci-op-5tr7l488-28de9-769ls-master-2",
  "2020-07-30T04:28:07Z MachineConfigPool master: Targeted node ci-op-5tr7l488-28de9-769ls-master-0 to config rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:28:07Z MachineConfigPool master: Node ci-op-5tr7l488-28de9-769ls-master-0 now has machineconfiguration.openshift.io/desiredConfig=rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:28:07Z MachineConfigPool worker: Targeted node ci-op-5tr7l488-28de9-769ls-worker-b-95ssc to config rendered-worker-49464673d8d72c9fe228d47ddc99d968",
  "2020-07-30T04:28:08Z Node ci-op-5tr7l488-28de9-769ls-master-0: Draining node to update config.",
  "2020-07-30T04:28:08Z Node ci-op-5tr7l488-28de9-769ls-worker-b-95ssc: Draining node to update config.",
  "2020-07-30T04:28:08Z MachineConfigPool master: Node ci-op-5tr7l488-28de9-769ls-master-0 now has machineconfiguration.openshift.io/state=Working",
  "2020-07-30T04:28:24Z Node ci-op-5tr7l488-28de9-769ls-master-0: Written pending config rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:28:24Z Node ci-op-5tr7l488-28de9-769ls-master-0: Node will reboot into config rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:29:27Z Node ci-op-5tr7l488-28de9-769ls-worker-b-95ssc: Written pending config rendered-worker-49464673d8d72c9fe228d47ddc99d968",
  "2020-07-30T04:29:27Z Node ci-op-5tr7l488-28de9-769ls-worker-b-95ssc: Node will reboot into config rendered-worker-49464673d8d72c9fe228d47ddc99d968",
  "2020-07-30T04:29:54Z Node ci-op-5tr7l488-28de9-769ls-master-0: Setting node ci-op-5tr7l488-28de9-769ls-master-0, currentConfig rendered-master-a22c80e6e6f0f775632695e7b355590f to Done",
  "2020-07-30T04:30:22Z MachineConfigPool master: Deferring update of etcd leader ci-op-5tr7l488-28de9-769ls-master-2",
  "2020-07-30T04:30:22Z MachineConfigPool master: Targeted node ci-op-5tr7l488-28de9-769ls-master-1 to config rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:30:22Z MachineConfigPool master: Node ci-op-5tr7l488-28de9-769ls-master-1 now has machineconfiguration.openshift.io/desiredConfig=rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:30:23Z Node ci-op-5tr7l488-28de9-769ls-master-1: Draining node to update config.",
  "2020-07-30T04:30:23Z MachineConfigPool master: Node ci-op-5tr7l488-28de9-769ls-master-1 now has machineconfiguration.openshift.io/state=Working",
  "2020-07-30T04:30:37Z Node ci-op-5tr7l488-28de9-769ls-worker-b-95ssc: Setting node ci-op-5tr7l488-28de9-769ls-worker-b-95ssc, currentConfig rendered-worker-49464673d8d72c9fe228d47ddc99d968 to Done",
  "2020-07-30T04:31:01Z Node ci-op-5tr7l488-28de9-769ls-master-1: Written pending config rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:31:01Z Node ci-op-5tr7l488-28de9-769ls-master-1: Node will reboot into config rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:32:33Z Node ci-op-5tr7l488-28de9-769ls-master-1: Setting node ci-op-5tr7l488-28de9-769ls-master-1, currentConfig rendered-master-a22c80e6e6f0f775632695e7b355590f to Done",
  "2020-07-30T04:32:53Z MachineConfigPool master: Targeted node ci-op-5tr7l488-28de9-769ls-master-2 to config rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:32:53Z MachineConfigPool master: Node ci-op-5tr7l488-28de9-769ls-master-2 now has machineconfiguration.openshift.io/desiredConfig=rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:32:53Z MachineConfigPool worker: Targeted node ci-op-5tr7l488-28de9-769ls-worker-c-7fp2l to config rendered-worker-49464673d8d72c9fe228d47ddc99d968",
  "2020-07-30T04:32:54Z Node ci-op-5tr7l488-28de9-769ls-master-2: Draining node to update config.",
  "2020-07-30T04:32:54Z Node ci-op-5tr7l488-28de9-769ls-worker-c-7fp2l: Draining node to update config.",
  "2020-07-30T04:32:54Z MachineConfigPool master: Node ci-op-5tr7l488-28de9-769ls-master-2 now has machineconfiguration.openshift.io/state=Working",
  "2020-07-30T04:33:30Z Node ci-op-5tr7l488-28de9-769ls-master-2: Written pending config rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:33:30Z Node ci-op-5tr7l488-28de9-769ls-master-2: Node will reboot into config rendered-master-a22c80e6e6f0f775632695e7b355590f",
  "2020-07-30T04:33:34Z MachineConfigPool master: Node ci-op-5tr7l488-28de9-769ls-master-0 now has machineconfiguration.openshift.io/uds=1",
  "2020-07-30T04:33:49Z Node ci-op-5tr7l488-28de9-769ls-worker-c-7fp2l: Written pending config rendered-worker-49464673d8d72c9fe228d47ddc99d968",
  "2020-07-30T04:33:49Z Node ci-op-5tr7l488-28de9-769ls-worker-c-7fp2l: Node will reboot into config rendered-worker-49464673d8d72c9fe228d47ddc99d968",
  "2020-07-30T04:34:58Z Node ci-op-5tr7l488-28de9-769ls-master-2: Setting node ci-op-5tr7l488-28de9-769ls-master-2, currentConfig rendered-master-a22c80e6e6f0f775632695e7b355590f to Done",
  "2020-07-30T04:34:58Z MachineConfigPool master: Node ci-op-5tr7l488-28de9-769ls-master-2 now has machineconfiguration.openshift.io/uds=0",
  "2020-07-30T04:35:04Z Node ci-op-5tr7l488-28de9-769ls-worker-c-7fp2l: Setting node ci-op-5tr7l488-28de9-769ls-worker-c-7fp2l, currentConfig rendered-worker-49464673d8d72c9fe228d47ddc99d968 to Done",
  "2020-07-30T04:35:09Z MachineConfigPool worker: Targeted node ci-op-5tr7l488-28de9-769ls-worker-d-gtzb4 to config rendered-worker-49464673d8d72c9fe228d47ddc99d968",
  "2020-07-30T04:35:10Z Node ci-op-5tr7l488-28de9-769ls-worker-d-gtzb4: Draining node to update config.",
  "2020-07-30T04:36:29Z Node ci-op-5tr7l488-28de9-769ls-worker-d-gtzb4: Written pending config rendered-worker-49464673d8d72c9fe228d47ddc99d968",
  "2020-07-30T04:36:29Z Node ci-op-5tr7l488-28de9-769ls-worker-d-gtzb4: Node will reboot into config rendered-worker-49464673d8d72c9fe228d47ddc99d968",
  "2020-07-30T04:37:40Z Node ci-op-5tr7l488-28de9-769ls-worker-d-gtzb4: Setting node ci-op-5tr7l488-28de9-769ls-worker-d-gtzb4, currentConfig rendered-worker-49464673d8d72c9fe228d47ddc99d968 to Done"
]

EDIT: watch me learn jq live!
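For reference, the same filter can be exercised locally against a stripped-down `events.json`; the handful of events below are synthetic stand-ins for real cluster output (the field names match what `oc get events -o json` produces).

```shell
# Tiny synthetic events.json: two MCO events plus one kubelet event
# that the filter should drop. Timestamps are deliberately out of order
# so sort_by has something to do.
cat > /tmp/events.json <<'EOF'
{
  "items": [
    {
      "source": {"component": "machineconfigdaemon"},
      "firstTimestamp": "2020-07-30T04:28:08Z",
      "involvedObject": {"kind": "Node", "name": "master-0"},
      "message": "Draining node to update config."
    },
    {
      "source": {"component": "kubelet"},
      "firstTimestamp": "2020-07-30T04:28:00Z",
      "involvedObject": {"kind": "Node", "name": "master-0"},
      "message": "Starting kubelet."
    },
    {
      "source": {"component": "machineconfigcontroller-nodecontroller"},
      "firstTimestamp": "2020-07-30T04:28:07Z",
      "involvedObject": {"kind": "MachineConfigPool", "name": "master"},
      "message": "Deferring update of etcd leader master-2"
    }
  ]
}
EOF

# Same program as in the comment above: keep only MCO components,
# sort chronologically, and flatten each event to one line.
jq '.items
    | map(select(.source.component == "machineconfigcontroller-nodecontroller"
                 or .source.component == "machineconfigdaemon"))
    | sort_by(.firstTimestamp | fromdate)
    | map(.firstTimestamp + " " + .involvedObject.kind + " "
          + .involvedObject.name + ": " + .message)' \
  < /tmp/events.json | tee /tmp/timeline.json
```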

@cgwalters
Member Author

> Since there have been huge efforts to decouple MCO/etcd in 4.5 and move to the etcd operator?
>
> Right. Like looking at this PR, could the etcd-specific stuff in the daemon be done in the etcd-operator? Why are we putting it in the MCO?

It sounds like you're suggesting a different tactic, not a different strategy, correct? In other words, the strategy is ensuring the MCO upgrades etcd followers first, the tactic is how that's implemented.

I totally agree that the way we're talking to etcd directly here is hacky, but the debate for that is over here:
openshift/api#694

@cgwalters
Member Author

Using Promecieus and looking at etcd_server_leader_changes_seen_total from this PR's upgrade job I see just one leader change, which looks good.
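For context, `etcd_server_leader_changes_seen_total` is a monotonically increasing Prometheus counter, so "just one leader change" means the counter rose by one between the start and end of the upgrade job. The sample values below are illustrative, not taken from the actual run.

```shell
# Counter sampled before and after the upgrade window
# (illustrative values, not real data from this PR's CI job):
before=2
after=3

# A counter only ever increases, so the difference over the window is
# the number of leader elections that happened during the upgrade:
echo "leader changes during upgrade: $((after - before))"
```

In PromQL the equivalent over a live cluster would be an `increase()` over the upgrade window rather than manual subtraction.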

@cgwalters cgwalters changed the title WIP: update etcd followers first, use bfq on control plane BZ 1850057: WIP: update etcd followers first, use bfq on control plane Jul 30, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 30, 2020
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Bugzilla bug 1850057, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

[WIP] Bug 1850057: update etcd followers first, use bfq on control plane

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

We had an event when we were starting an OS update, but nothing
when it was completed - one could implicitly get that by looking
at the next event, but that's a bit fragile.

And since then we started doing a lot more stuff with the OS,
so let's add an event emitted before and after all OS changes
so we can consistently get e.g. timing information about it.

Relates to openshift#1962
around getting better data about timing during upgrades.
A while ago I'd invested some time in tweaking the
node controller to have useful logs around what it's
doing; my first "point of contact" when looking at
upgrades was its pod logs. But we lose most of
those on upgrade since the pod gets killed.

Add events to the node controller too.
Currently the MCD emits useful events which
can be queried afterwards (in our CI runs we
dump `events.json`).

With this we can create a "journal/history"
for upgrade/update events just by querying the
event stream.
Part of solving openshift#1897
A lot more details in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA

The TL;DR is that the `bfq` I/O scheduler better respects I/O priorities,
and also does a better job of handling latency-sensitive processes
like `etcd` versus bulk/background I/O.
We switched rpm-ostree to do this when applying updates, but
it also makes sense to do it when extracting the oscontainer.

Part of: openshift#1897
Which is about staging OS updates more nicely when etcd is running.
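Mechanically, switching the scheduler is a sysfs write: `/sys/block/<dev>/queue/scheduler` lists the available schedulers with the active one in brackets, and writing a name selects it. The sketch below uses a scratch file standing in for the real sysfs node so it runs anywhere without root or real block devices.

```shell
# Stand-in for /sys/block/sda/queue/scheduler; a real kernel file
# would read back something like:
#   mq-deadline kyber [none] bfq
sched=/tmp/fake-scheduler
echo "mq-deadline kyber [none] bfq" > "$sched"

# Selecting bfq. On a real node this is:
#   echo bfq > /sys/block/sda/queue/scheduler
# and a subsequent read shows bfq bracketed as the active scheduler.
echo bfq > "$sched"
cat "$sched"
```

Note the scratch file just ends up containing `bfq`; only real sysfs performs the list-with-brackets read-back behavior.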
The way we're talking to etcd is a bit hacky, I ended up cargo
culting some code.  This would be much cleaner if the etcd operator
did it.

But it's critical that we update the etcd followers first, because
leader elections are disruptive events and we can easily minimize
that.

Closes: openshift#1897
@openshift-ci-robot
Contributor

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/okd-e2e-aws | 9c6b844 | link | `/test okd-e2e-aws` |
| ci/prow/e2e-metal-ipi | 9c6b844 | link | `/test e2e-metal-ipi` |
| ci/prow/e2e-gcp-op | 9c6b844 | link | `/test e2e-gcp-op` |
| ci/prow/e2e-aws | 9c6b844 | link | `/test e2e-aws` |
| ci/prow/e2e-aws-scaleup-rhel7 | 9c6b844 | link | `/test e2e-aws-scaleup-rhel7` |
| ci/prow/e2e-ovn-step-registry | 9c6b844 | link | `/test e2e-ovn-step-registry` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


@cgwalters
Member Author

I just noticed that one of the azure runs above claimed success, but actually didn't upgrade the OS - it seems the MCO was reporting success but clearly the MCC and MCD were still working.

It's possible this bug is somehow introduced by this PR...going to go look at some existing e2e-upgrade jobs.

@cgwalters
Member Author

cgwalters commented Aug 6, 2020

EDIT: See #1991



@ashcrow ashcrow requested a review from runcom August 7, 2020 13:12

@cgwalters
Member Author

Updated some commentary in #1957

The argument for merging this one (etcd leader awareness) is basically: a useful baseline metric for disruption is the etcd leader election count, and if OS upgrades might randomly force that anywhere between 1 and 3 times, success becomes harder to measure.

But it probably does make sense to do #1957 first, (also get openshift/cluster-etcd-operator#418 in), measure a bit, then experiment with this on top.


@ironcladlou
Contributor

Re: testing, I commented on the issue (#1897 (comment)) but maybe should have commented here. Cross-referencing...

@cgwalters
Member Author

Per #1897 (comment) we will focus on etcd latency metrics and not leader elections - so it's not worth carrying the code today to have "etcd leader awareness" for node upgrades.

@cgwalters cgwalters closed this Aug 20, 2020
@openshift-ci-robot
Contributor

@cgwalters: This pull request references Bugzilla bug 1850057. The bug has been updated to no longer refer to the pull request using the external bug tracker.


@cgwalters
Member Author

This PR is now replaced by #1957

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Successfully merging this pull request may close these issues.

Bug 1850057: stage OS updates (nicely) while etcd is still running