Bug 1907333: daemon: Move rollback removal to update loop #2297
Conversation
See https://bugzilla.redhat.com/1907333

This needs to be a more proper part of the control loop so that we retry, rather than having the MCD die.

Also by doing it here it will happen on firstboot when there's no I/O contention with other processes.
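Roughly, the shape of the change looks like the sketch below. The names `applyOSChanges`, `updateOS`, and `removeRollback` follow the diff context later in this thread; the surrounding structure is an assumption, not the verbatim MCO code.

```go
// Sketch only: rollback removal moves into the MCD's OS update path, so a
// failure surfaces as a returned error that the daemon's sync machinery
// retries, instead of killing the MCD in a one-shot code path.
func (dn *Daemon) applyOSChanges(oldConfig, newConfig *mcfgv1.MachineConfig) error {
	if err := dn.updateOS(newConfig); err != nil {
		MCDPivotErr.WithLabelValues(dn.node.Name, newConfig.Spec.OSImageURL, err.Error()).SetToCurrentTime()
		return err
	}
	// New: drop the rollback deployment as part of the normal update flow;
	// an error here simply fails this sync and is retried later.
	if err := dn.removeRollback(); err != nil {
		return fmt.Errorf("removing rollback deployment: %w", err)
	}
	// ... kernel arguments, extensions, reboot handling, etc. ...
	return nil
}
```

The key point is that `removeRollback` failures now propagate as ordinary sync errors, so the daemon's existing retry machinery handles them.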
@cgwalters: This pull request references Bugzilla bug 1907333, which is invalid.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters.
/bugzilla refresh
@mandre: This pull request references Bugzilla bug 1907333, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
/test e2e-openstack
Once again,
@mandre: This pull request references Bugzilla bug 1907333, which is valid. 3 validation(s) were run on this bug
```go
		MCDPivotErr.WithLabelValues(dn.node.Name, newConfig.Spec.OSImageURL, err.Error()).SetToCurrentTime()
		return err
	}
	if err := dn.removeRollback(); err != nil {
```
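The `removeRollback` helper itself isn't shown in this excerpt. A plausible minimal shape, assuming it shells out to `rpm-ostree cleanup -r` (which deletes only the rollback deployment, never the booted one), would be something like:

```go
// Hypothetical sketch of removeRollback; the real implementation is not
// part of this excerpt. "rpm-ostree cleanup -r" removes only the rollback
// deployment and leaves the pending and booted deployments untouched.
func (dn *Daemon) removeRollback() error {
	output, err := exec.Command("rpm-ostree", "cleanup", "-r").CombinedOutput()
	if err != nil {
		return fmt.Errorf("rpm-ostree cleanup -r failed: %s: %w", string(output), err)
	}
	return nil
}
```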
This could be problematic when the updated OS doesn't boot correctly: the admin will no longer be able to manually roll back, which I think can be the case in a disconnected install.
I was thinking maybe we can keep the previous code and just ignore the error when deleting the rollback fails? Keeping the rollback deployment isn't really a concern anyway; the only benefit of removing it is some additional free space.
Yeah, that is a concern. That said, OpenShift doesn't really support rollbacks in general, and even specifically for the MCO, doing one will trigger issues related to #1190: basically the MCO will go degraded.
We don't really have a strong story for e.g. "The new kernel in 4.6.X broke booting on this type of hardware" other than "pause the update and wait for a fixed kernel to arrive and then update".
> I was thinking maybe we can keep the previous code and just ignore the error when deleting the rollback fails?

The problem with that is that if we e.g. just log the errors, we have no real visibility into failures (though I guess we could add an alert). And:

> Keeping the rollback deployment isn't really a concern anyway; the only benefit of removing it is some additional free space.
The original motivation for doing this is (very briefly) mentioned in the PR:
#2220
It's basically: compliance tooling that e.g. adjusts kernel parameters for security wants to avoid a system (accidentally) booting into a non-compliant configuration. If the rollback deployment is there, it's an easy down-arrow away for an admin at the grub prompt.
So to clarify, this will be removing the rollback (current) deployment when there is an OS update, during the update cycle of the MCD. If something fails below (e.g. updating kargs), we also call `removePendingDeployment`. Would that remove both deployments from rpm-ostree? i.e. if we somehow fail there, would we be able to recover at all?

We also run an explicit defer of `applyOSChanges` to roll back to the previous deployment; maybe we should modify that code just in case something was hitting that before.
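For reference, the defer being discussed has roughly this shape (a sketch under assumptions, not the verbatim MCO code): if a later step fails, the deferred function tries to put the node back on the old configuration before the error propagates.

```go
// Sketch of the defer-based recovery pattern referenced above (assumed
// shape, not the actual MCO code): on failure of a later step such as
// applying kernel arguments, re-deploy the old OS content so the node is
// not left halfway through an update.
func (dn *Daemon) applyOSChanges(oldConfig, newConfig *mcfgv1.MachineConfig) (retErr error) {
	defer func() {
		if retErr != nil {
			// Best effort: try to return to the old configuration.
			if err := dn.updateOS(oldConfig); err != nil {
				retErr = fmt.Errorf("rolling back to old config failed: %v (original error: %v)", err, retErr)
			}
		}
	}()
	// ... pivot to the new image, remove rollback, update kargs, etc. ...
	return nil
}
```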
> Would that remove both deployments from rpm-ostree? i.e. if we somehow fail there, would we be able to recover at all?

There are actually up to three deployments: "pending", "booted", and "rollback". rpm-ostree will never let you remove the booted deployment. There's no API to do it in rpm-ostree, and multiple layers of safeguards against it (e.g. this code), in addition to the entire codebase being designed to avoid mutating the running system.
Does that address the concern?
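As an illustration of those three slots, a small standalone program can list what's present; the JSON field names here are assumptions about `rpm-ostree status --json` output rather than a documented contract, and whichever entry reports itself as booted is the one rpm-ostree refuses to delete.

```go
// Illustration only: enumerate the (up to) three deployments rpm-ostree
// can hold. Field names are assumptions about `rpm-ostree status --json`.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

type rpmOstreeStatus struct {
	Deployments []struct {
		Checksum string `json:"checksum"`
		Booted   bool   `json:"booted"`
	} `json:"deployments"`
}

func main() {
	out, err := exec.Command("rpm-ostree", "status", "--json").Output()
	if err != nil {
		panic(err)
	}
	var status rpmOstreeStatus
	if err := json.Unmarshal(out, &status); err != nil {
		panic(err)
	}
	// Typically ordered pending, booted, rollback; only the non-booted
	// entries can be cleaned up.
	for i, d := range status.Deployments {
		fmt.Printf("deployment %d: checksum=%s booted=%v\n", i, d.Checksum, d.Booted)
	}
}
```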
Ah I see, sorry I got the terms confused.
In this case, wouldn't the rollback deployment being removed here get pushed out anyway once we update and reboot? Assuming nothing goes wrong, the `booted` deployment becomes `rollback`, the `pending` one becomes `booted`, and the `rollback` being removed here would be gone anyway?

In a firstboot scenario, this code also runs, but fresh nodes would only have a `booted` deployment, and the firstboot systemd unit updates to a `pending` deployment. There wouldn't be a `rollback` to remove there, so if there are security concerns, the firstboot deployment would linger as a rollback and wouldn't get removed until a further update.
Oh yes you're totally right 😄 This won't have the intended effect indeed.
Blah... the architectural problem we have here is that there's no "sync loop" in the MCD outside of trying to perform an update; the closest we have is a few ad-hoc goroutines like the ssh monitor.
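For illustration only, such an ad-hoc background loop, in the spirit of the ssh monitor goroutine, might look roughly like the fragment below; `dn.stopCh`, the ticker interval, and the use of `glog` are all assumptions, and this is not code from this PR or from #2302.

```go
// Hypothetical sketch of a periodic retry loop that lives outside the
// update path: failures are logged and retried on the next tick instead
// of killing the daemon.
go func() {
	ticker := time.NewTicker(10 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-dn.stopCh:
			return
		case <-ticker.C:
			if err := dn.removeRollback(); err != nil {
				glog.Warningf("failed to remove rollback deployment (will retry): %v", err)
			}
		}
	}
}()
```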
@cgwalters: The following tests failed, say /retest to rerun all failed tests.
Closing in favor of #2302
@cgwalters: This pull request references Bugzilla bug 1907333. The bug has been updated to no longer refer to the pull request using the external bug tracker.