Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1907333: daemon: Move rollback removal to update loop #2297

Closed
wants to merge 1 commit into from

Conversation

cgwalters
Copy link
Member

See https://bugzilla.redhat.com/1907333

This needs to be a more proper part of the control loop so that
we retry, rather than having the MCD die.

Also by doing it here it will happen on firstboot when there's
no I/O contention with other processes.

See https://bugzilla.redhat.com/1907333

This needs to be a more proper part of the control loop so that
we retry, rather than having the MCD die.

Also by doing it here it will happen on firstboot when there's
no I/O contention with other processes.
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. label Dec 14, 2020
@openshift-ci-robot
Copy link
Contributor

@cgwalters: This pull request references Bugzilla bug 1907333, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1907333: daemon: Move rollback removal to update loop

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Dec 14, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 14, 2020
@mandre
Copy link
Member

mandre commented Dec 14, 2020

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Dec 14, 2020
@openshift-ci-robot
Copy link
Contributor

@mandre: This pull request references Bugzilla bug 1907333, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mandre
Copy link
Member

mandre commented Dec 14, 2020

/test e2e-openstack

@mandre
Copy link
Member

mandre commented Dec 14, 2020

Once again,
/bugzilla refresh

@openshift-ci-robot openshift-ci-robot removed the bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. label Dec 14, 2020
@openshift-ci-robot
Copy link
Contributor

@mandre: This pull request references Bugzilla bug 1907333, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Once again,
/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label Dec 14, 2020
@ashcrow ashcrow requested review from sinnykumari and removed request for ashcrow December 14, 2020 15:55
MCDPivotErr.WithLabelValues(dn.node.Name, newConfig.Spec.OSImageURL, err.Error()).SetToCurrentTime()
return err
}
if err := dn.removeRollback(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be problematic where updated OS doesn't boot fine. Admin will no longer be able to manually rollback which I think can be the case in disconnected install.

I was thinking maybe we can keep the previous code and just ignore the error when deleting rollback failed? Keeping rollback deployment is anyway not a concern other than it provides additional free space.

Copy link
Member Author

@cgwalters cgwalters Dec 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is a concern. That said, OpenShift doesn't really support rollbacks in general, and even specifically for the MCO doing it will trigger issues related to #1190 - basically the MCO will go degraded.

We don't really have a strong story for e.g. "The new kernel in 4.6.X broke booting on this type of hardware" other than "pause the update and wait for a fixed kernel to arrive and then update".

I was thinking maybe we can keep the previous code and just ignore the error when deleting rollback failed?

The problem with that is that if we e.g. just log errors we have no real visibility into failures (though I guess we could add an alert) and:

Keeping rollback deployment is anyway not a concern other than it provides additional free space.

The original motivation for doing this is (very briefly) mentioned in the PR:
#2220

It's basically: compliance tooling that e.g. adjusts kernel parameters for security wants to avoid a system (accidentally) booting into a non-compliant configuration. If the rollback deployment is there it's an easy down-arrow away for an admin at the grub prompt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to clarify, this will be removing the rollback (current) deployment when there is an osupdate, during the update cycle of the MCD. If something fails below (e.g. updating kargs), we also call a removePendingDeployment. Would that remove both deployments from rpm-ostree? i.e. if we somehow fail there, would we be able to recover at all?

We also run an explicit defer of applyOSChanges to roll back to the previous deployment, maybe we should modify that code just in case something was hitting that before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that remove both deployments from rpm-ostree? i.e. if we somehow fail there, would we be able to recover at all?

There's actually three deployments potentially; "pending", "booted", "rollback". rpm-ostree will never let you remove the booted deployment. There's no API to do it in rpm-ostree, and multiple layers of safeguards against it (e.g. this code), in addition to the entire codebase being designed to avoid mutating the running system.

Does that address the concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, sorry I got the terms confused.

In this case wouldn't the rollback deployment being removed here get pushed out anyways once we update and reboot? Assuming nothing went wrong the booted becomes rollback, the pending becomes booted, and the rollback being removed here would be gone anyways?

In a firstboot scenario, this code also runs but fresh nodes would only have a booted deployment, and updates to a pending deployment via the firstboot systemd unit. There wouldn't be a rollback to remove there, so if there are security concerns the firstboot would linger as a rollback and doesn't get removed until a further update.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes you're totally right 😄 This won't have the intended effect indeed.

Blah...the architectural problem we have here is there's no "sync loop" in the MCD outside of trying to perform an update; the closest we have is a few ad-hoc goroutines like the ssh monitor.

@openshift-merge-robot
Copy link
Contributor

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-ovn-step-registry 04f28d5 link /test e2e-ovn-step-registry
ci/prow/okd-e2e-aws 04f28d5 link /test okd-e2e-aws
ci/prow/e2e-openstack 04f28d5 link /test e2e-openstack
ci/prow/e2e-aws 04f28d5 link /test e2e-aws
ci/prow/e2e-agnostic-upgrade 04f28d5 link /test e2e-agnostic-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Copy link
Member Author

Closing in favor of #2302

@cgwalters cgwalters closed this Dec 14, 2020
@openshift-ci-robot
Copy link
Contributor

@cgwalters: This pull request references Bugzilla bug 1907333. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Bug 1907333: daemon: Move rollback removal to update loop

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants