Bug 1907333: daemon: Move rollback removal to update loop #2297
Conversation
See https://bugzilla.redhat.com/1907333

This needs to be a more proper part of the control loop so that we retry, rather than having the MCD die.

Also by doing it here it will happen on firstboot when there's no I/O contention with other processes.
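Roughly, the shape of the change looks like the sketch below. The names `applyOSChanges`, `updateOS`, and `removeRollback` follow the diff context later in this thread; the surrounding structure is an assumption, not the verbatim MCO code.

```go
// Sketch only: rollback removal moves into the MCD's OS update path, so a
// failure surfaces as a returned error that the daemon's sync machinery
// retries, instead of killing the MCD in a one-shot code path.
func (dn *Daemon) applyOSChanges(oldConfig, newConfig *mcfgv1.MachineConfig) error {
	if err := dn.updateOS(newConfig); err != nil {
		MCDPivotErr.WithLabelValues(dn.node.Name, newConfig.Spec.OSImageURL, err.Error()).SetToCurrentTime()
		return err
	}
	// New: drop the rollback deployment as part of the normal update flow;
	// an error here simply fails this sync and is retried later.
	if err := dn.removeRollback(); err != nil {
		return fmt.Errorf("removing rollback deployment: %w", err)
	}
	// ... kernel arguments, extensions, reboot handling, etc. ...
	return nil
}
```

The key point is that `removeRollback` failures now propagate as ordinary sync errors, so the daemon's existing retry machinery handles them.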
@cgwalters: This pull request references Bugzilla bug 1907333, which is invalid.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters.
/bugzilla refresh
@mandre: This pull request references Bugzilla bug 1907333, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
/test e2e-openstack
Once again,
@mandre: This pull request references Bugzilla bug 1907333, which is valid. 3 validation(s) were run on this bug
```go
		MCDPivotErr.WithLabelValues(dn.node.Name, newConfig.Spec.OSImageURL, err.Error()).SetToCurrentTime()
		return err
	}
	if err := dn.removeRollback(); err != nil {
```
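The `removeRollback` helper itself isn't shown in this excerpt. A plausible minimal shape, assuming it shells out to `rpm-ostree cleanup -r` (which deletes only the rollback deployment, never the booted one), would be something like:

```go
// Hypothetical sketch of removeRollback; the real implementation is not
// part of this excerpt. "rpm-ostree cleanup -r" removes only the rollback
// deployment and leaves the pending and booted deployments untouched.
func (dn *Daemon) removeRollback() error {
	output, err := exec.Command("rpm-ostree", "cleanup", "-r").CombinedOutput()
	if err != nil {
		return fmt.Errorf("rpm-ostree cleanup -r failed: %s: %w", string(output), err)
	}
	return nil
}
```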
This could be problematic when the updated OS doesn't boot correctly: the admin will no longer be able to manually roll back, which I think can be the case in a disconnected install.
I was thinking maybe we can keep the previous code and just ignore the error when deleting the rollback fails? Keeping the rollback deployment isn't really a concern anyway; the only benefit of removing it is some additional free space.
Yeah, that is a concern. That said, OpenShift doesn't really support rollbacks in general, and even specifically for the MCO, doing one will trigger issues related to #1190: basically the MCO will go degraded.
We don't really have a strong story for e.g. "The new kernel in 4.6.X broke booting on this type of hardware" other than "pause the update and wait for a fixed kernel to arrive and then update".
> I was thinking maybe we can keep the previous code and just ignore the error when deleting the rollback fails?

The problem with that is that if we e.g. just log the errors, we have no real visibility into failures (though I guess we could add an alert). And:

> Keeping the rollback deployment isn't really a concern anyway; the only benefit of removing it is some additional free space.
The original motivation for doing this is (very briefly) mentioned in the PR:
#2220
It's basically: compliance tooling that e.g. adjusts kernel parameters for security wants to avoid a system (accidentally) booting into a non-compliant configuration. If the rollback deployment is there, it's an easy down-arrow away for an admin at the grub prompt.
So to clarify, this will be removing the rollback (current) deployment when there is an OS update, during the update cycle of the MCD. If something fails below (e.g. updating kargs), we also call `removePendingDeployment`. Would that remove both deployments from rpm-ostree? i.e. if we somehow fail there, would we be able to recover at all?

We also run an explicit defer of `applyOSChanges` to roll back to the previous deployment; maybe we should modify that code just in case something was hitting that before.
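For reference, the defer being discussed has roughly this shape (a sketch under assumptions, not the verbatim MCO code): if a later step fails, the deferred function tries to put the node back on the old configuration before the error propagates.

```go
// Sketch of the defer-based recovery pattern referenced above (assumed
// shape, not the actual MCO code): on failure of a later step such as
// applying kernel arguments, re-deploy the old OS content so the node is
// not left halfway through an update.
func (dn *Daemon) applyOSChanges(oldConfig, newConfig *mcfgv1.MachineConfig) (retErr error) {
	defer func() {
		if retErr != nil {
			// Best effort: try to return to the old configuration.
			if err := dn.updateOS(oldConfig); err != nil {
				retErr = fmt.Errorf("rolling back to old config failed: %v (original error: %v)", err, retErr)
			}
		}
	}()
	// ... pivot to the new image, remove rollback, update kargs, etc. ...
	return nil
}
```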
> Would that remove both deployments from rpm-ostree? i.e. if we somehow fail there, would we be able to recover at all?

There are actually up to three deployments: "pending", "booted", and "rollback". rpm-ostree will never let you remove the booted deployment. There's no API to do it in rpm-ostree, and multiple layers of safeguards against it (e.g. this code), in addition to the entire codebase being designed to avoid mutating the running system.
Does that address the concern?
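As an illustration of those three slots, a small standalone program can list what's present; the JSON field names here are assumptions about `rpm-ostree status --json` output rather than a documented contract, and whichever entry reports itself as booted is the one rpm-ostree refuses to delete.

```go
// Illustration only: enumerate the (up to) three deployments rpm-ostree
// can hold. Field names are assumptions about `rpm-ostree status --json`.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

type rpmOstreeStatus struct {
	Deployments []struct {
		Checksum string `json:"checksum"`
		Booted   bool   `json:"booted"`
	} `json:"deployments"`
}

func main() {
	out, err := exec.Command("rpm-ostree", "status", "--json").Output()
	if err != nil {
		panic(err)
	}
	var status rpmOstreeStatus
	if err := json.Unmarshal(out, &status); err != nil {
		panic(err)
	}
	// Typically ordered pending, booted, rollback; only the non-booted
	// entries can be cleaned up.
	for i, d := range status.Deployments {
		fmt.Printf("deployment %d: checksum=%s booted=%v\n", i, d.Checksum, d.Booted)
	}
}
```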
Ah I see, sorry I got the terms confused.
In this case, wouldn't the rollback deployment being removed here get pushed out anyway once we update and reboot? Assuming nothing goes wrong, the `booted` deployment becomes `rollback`, the `pending` one becomes `booted`, and the `rollback` being removed here would be gone anyway?

In a firstboot scenario, this code also runs, but fresh nodes would only have a `booted` deployment, and the firstboot systemd unit updates to a `pending` deployment. There wouldn't be a `rollback` to remove there, so if there are security concerns, the firstboot deployment would linger as a rollback and wouldn't get removed until a further update.
Oh yes you're totally right 😄 This won't have the intended effect indeed.
Blah... the architectural problem we have here is that there's no "sync loop" in the MCD outside of trying to perform an update; the closest we have is a few ad-hoc goroutines like the ssh monitor.
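For illustration only, such an ad-hoc background loop, in the spirit of the ssh monitor goroutine, might look roughly like the fragment below; `dn.stopCh`, the ticker interval, and the use of `glog` are all assumptions, and this is not code from this PR or from #2302.

```go
// Hypothetical sketch of a periodic retry loop that lives outside the
// update path: failures are logged and retried on the next tick instead
// of killing the daemon.
go func() {
	ticker := time.NewTicker(10 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-dn.stopCh:
			return
		case <-ticker.C:
			if err := dn.removeRollback(); err != nil {
				glog.Warningf("failed to remove rollback deployment (will retry): %v", err)
			}
		}
	}
}()
```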
@cgwalters: The following tests failed, say /retest to rerun all failed tests.
Closing in favor of #2302
@cgwalters: This pull request references Bugzilla bug 1907333. The bug has been updated to no longer refer to the pull request using the external bug tracker.