enhancements/update/automatic-updates: Propose a new enhancement #124

Open: wants to merge 1 commit into master from wking:automatic-updates

Conversation

openshift-ci-robot commented Nov 19, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
To complete the pull request process, please assign ashcrow
You can assign the PR to them by writing /assign @ashcrow in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vrutkovs commented Nov 20, 2019

LGTM, is it worth noting that autoupdates would be limited to the selected channel?

Administrators can configure the cluster to push those alerts out to the on-call administrator to recover the cluster.
* Stability testing.
We are continually refining our CI suite and processing Telemetry from live clusters in order to assess the stability of each upgrade.
We will not place upgrades in production channels unless they have proven themselves stable in earlier testing, and we will remove upgrades from production channels if a live cluster trips over a corner case that we do not yet cover in pretesting.

lucab (Member) commented Nov 20, 2019

s/remove/stop pushing/

wking (Author, Member) commented Nov 21, 2019

s/remove/stop pushing/

Expanded "we will remove upgrades from production channels" to "we will remove update recommendations from production channels". We're not really pushing anything, we're just serving graphs to clients like the CVO. Does the new wording look good to you, or do you have an alternative idea?

If that update proves unstable, many of those upgrades would already be in progress by the time the first Telemetry comes back with failure messages.
A phased rollout would limit the number of simultaneously updating clusters to give Telemetry time to come back so we could stop recommending upgrades that proved unstable on live clusters not yet covered in pretesting.

There is also a security risk where a compromised upstream Cincinnati could recommend cluster updates that were not in the cluster's best interest (e.g. 4.2.4 -> 4.1.0).

lucab (Member) commented Nov 20, 2019

Not sure if relevant for OpenShift mitigations, but Zincati has client-side checks and knobs to prevent auto-downgrades: https://github.com/coreos/zincati/blob/dbb0b0a8884435f2b1186b2228199cd4adb6f705/docs/usage/auto-updates.md#updates-ordering-and-downgrades

wking (Author, Member) commented Nov 20, 2019

Not sure if relevant for OpenShift mitigations...

We could grow this. But sometimes you might want to recommend a downgrade. E.g., 4.y.7 turns out to explode after 24h because of broken cert rotation, and a fixed 4.y.8 is 48h out, so you recommend 4.y.7 -> 4.y.6 until then to get folks back to a safe place.

ericavonb (Contributor) commented Dec 11, 2019

I think it makes sense to version updates separately from the cluster versions, e.g.

```
update-1.0 => (ocp-4.y.6 -> ocp-4.y.7)
update-1.1 => (ocp-4.y.7 -> ocp-4.y.6)
update-1.2 => (ocp-4.y.6 -> ocp-4.y.8, ocp-4.y.7 -> ocp-4.y.8)
```

You always want the latest update version, which may or may not upgrade you to the highest cluster version.

Another way to address rollback attacks, as well as freeze attacks: TUF uses timestamps to certify updates for short durations, requiring constant refreshing to ensure the latest update metadata. Timestamp ordering is simple enough but does require secure and accurate time sources (for TUF, see the timestamp.json section: tuf-spec.md#4-document-formats).
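
For illustration only (not part of the proposal), a client-side freshness and rollback guard in that spirit might look like the following sketch; the `GraphMetadata` type and its fields are hypothetical.

```go
// Sketch: a client-side freshness and rollback guard on update-graph metadata,
// in the spirit of TUF's short-lived timestamp role. GraphMetadata and its
// fields are hypothetical, for illustration only.
package updateguard

import (
	"errors"
	"time"
)

type GraphMetadata struct {
	Version int       // monotonically increasing metadata version
	Expires time.Time // short-lived certification; requires periodic refresh
}

// Validate rejects expired (freeze-attack) or rolled-back metadata.
func Validate(current, previous GraphMetadata, now time.Time) error {
	if now.After(current.Expires) {
		return errors.New("update metadata has expired; refusing to act on it")
	}
	if current.Version < previous.Version {
		return errors.New("metadata version went backwards; possible rollback attack")
	}
	return nil
}
```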


There are also potential future mitigations:

* The cluster-version operator could be taught to limit automatic upgrades to those which appear in the [signed metadata][cluster-version-operator-release-metadata] of either the source or target release.

lucab (Member) commented Nov 20, 2019

I don't think this is a mitigation that actually fits in the rest of design/release-engineering (or more likely, I didn't understand your suggestion).
In particular, it seems to conflict with the "rollouts based on Telemetry" point above.

wking (Author, Member) commented Nov 20, 2019

I don't think this is a mitigation that actually fits in the rest of design/release-engineering...

This is a guard against malicious Cincinnati, whose risk exists independent of this enhancement. This enhancement just increases exposure to the preexisting risk.

```
- Tues..Thurs *-11..1-* 11:00
```

The schedule structure was designed to support maintenance windows, allowing for updates on certain days or during certain times.

lucab (Member) commented Nov 20, 2019

The entries shown above are point-in-time datetimes, not maintenance windows.
That is, they specify when a maintenance event can start (edge-trigger) but not how long it can remain active (level-trigger). I fear this is a relevant semantic detail if there is any polling logic involved.

wking (Author, Member) commented Nov 20, 2019

That is, they specify when a maintenance event can start (edge-trigger) but not how long it can be active (level-trigger).

There's no cap on upgrade duration. If you're concerned about it, you'd have small windows early in your day for initiating upgrades, to leave lots of time for monitoring/recovering before you went home. But this is an alternative proposal; I dunno how deeply we want to dig into it here.

lucab (Member) commented Nov 20, 2019

Oh sorry, I realize my comment was not worded clearly enough. I'm not trying to cap upgrade duration.

I'm stating that the schedule format would need to define a timespan, even just for the starting event.
That is, "allow auto-update events between 02:00 and 03:00 on Sat" instead of "allow auto-update events exactly at 02:00:00.00 on Sat".

The current syntax is taken from systemd timestamps, which defines a single specific point in time instead.

But I guess all of these are implementations details, so feel free to defer this discussion.

For example, a `schedule` property directly on [the ClusterVersion `spec`][api-spec] would protect administrators from accidentally using the web console or `oc adm upgrade ...` to trigger an upgrade outside of the configured window.
Instead, administrators would have to adjust the `schedule` configuration or set an override option to trigger out-of-window updates.
Customization like this can also be addressed by intermediate [policy-engine][cincinnati-policy-engine], without involving the ClusterVersion configuration at all.
Teaching the cluster-version operator about a local `schedule` filter is effectively like a local policy engine.
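
As a rough illustration of how such a local `schedule` filter might reduce to a simple window check (the `Window` type and its fields are hypothetical, not part of any proposed API):

```go
// Sketch of a local schedule gate: updates are only initiated while the current
// time falls inside a configured maintenance window. Window is a hypothetical
// type, not the systemd-timestamp syntax from the example above.
package schedulegate

import "time"

// Window is a weekly maintenance window in UTC.
type Window struct {
	Day      time.Weekday
	Start    time.Duration // offset from midnight, e.g. 11 * time.Hour
	Duration time.Duration // how long update initiation stays allowed
}

// InWindow reports whether now falls inside any configured window.
// With no windows configured, updates are allowed at any time.
func InWindow(windows []Window, now time.Time) bool {
	if len(windows) == 0 {
		return true
	}
	now = now.UTC()
	elapsed := time.Duration(now.Hour())*time.Hour +
		time.Duration(now.Minute())*time.Minute
	for _, w := range windows {
		if now.Weekday() == w.Day && elapsed >= w.Start && elapsed < w.Start+w.Duration {
			return true
		}
	}
	return false
}
```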

lucab (Member) commented Nov 20, 2019

There is a difference though:

  • if a schedule is enforced locally by the CVO, the operator can grow logic for an escape hatch for manual forced updates (or apply the schedule only to auto-updates)
  • if a schedule is enforced globally by the policy-engine, it cannot be force-bypassed locally

wking (Author, Member) commented Nov 20, 2019

if a schedule is enforced globally by policy-engine, it cannot be force-bypassed locally

Sure it can: you just set the update by pullspec and force the "don't worry if this is not in your graph" exception.

cgwalters (Member) commented Nov 20, 2019

Only skimmed this, looks reasonable so far.

One thing I do like to say though is that Red Hat hasn't truly completed the acquisition of CoreOS until we enable automatic updates by default.

@wking wking force-pushed the wking:automatic-updates branch from dae7d51 to 9b61a96 Nov 21, 2019
@wking wking force-pushed the wking:automatic-updates branch from 9b61a96 to 568a1bd Nov 21, 2019

wking (Author, Member) commented Nov 21, 2019

Ok, I think I've addressed most of @lucab's excellent feedback. Outstanding threads are:

  • Wording around edge removal. I've made some changes, but would like feedback on the new wording.
  • Malicious downgrade protection (also here). Do we consider compromised Cincinnati a high enough risk to need guards here before we add an auto-update knob?
  • Details of the calendar idea (also here). I think we both agree we can punt that as peripheral, but I haven't marked the thread resolved because I want the information to be discoverable by anyone who does pick up a calendar-guard enhancement.
```go
// automaticUpdates enables automatic updates.
// +optional
AutomaticUpdates bool `json:"automaticUpdates,omitempty"`
```

abhinavdahiya (Member) commented Nov 21, 2019

I don't think booleans make a great API, as they are not extensible.

Having a boolean means we are saying that the CVO either always updates or never does, and will never do anything else; any condition, if required, will be done somewhere else.

Personally, I think something like a union discriminator (kubernetes/enhancements#926) is much better suited to express intents like no-updates, auto-updates-always, auto-updates-schedule, etc.

https://github.com/openshift/enhancements/pull/124/files#diff-8354eb4f233832a1e5afa3ffc9bea8e8R154-R160

There is an overhead trade-off that needs to be taken into consideration before saying the CVO doesn't do local policy:
Will a user with a single cluster or a handful of clusters want to run a local policy agent, and is it sane for them to do so?
Will users who want different policies for their clusters (testing/staging/prod) want to run three different policy agents?

Does the fact that we have no policy agent at this point (and are not sure how far out that is, or even whether a self-signed upstream would be supported) deter anybody from using this, diminishing the value proposition of the boolean?

Personally, I would like clusters to support at least one use case of condition-based automatic updates.
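
For illustration, a discriminated-union shape along the lines @abhinavdahiya suggests might look roughly like the following sketch; the type and field names are hypothetical, not a proposed API.

```go
// Sketch of a union-discriminator style policy as an alternative to a plain
// boolean. Type and field names are hypothetical, not a proposed API.
package updatepolicy

// UpdatePolicyType is the union discriminator.
type UpdatePolicyType string

const (
	NoAutomaticUpdates        UpdatePolicyType = "None"
	AlwaysAutomaticUpdates    UpdatePolicyType = "Always"
	ScheduledAutomaticUpdates UpdatePolicyType = "Scheduled"
)

// UpdatePolicy selects exactly one of the behaviors above.
type UpdatePolicy struct {
	// type discriminates which of the optional members below is honored.
	Type UpdatePolicyType `json:"type"`

	// schedule is only consulted when type is "Scheduled".
	// +optional
	Schedule *UpdateSchedule `json:"schedule,omitempty"`
}

// UpdateSchedule would carry whatever maintenance-window syntax the
// enhancement eventually settles on.
type UpdateSchedule struct {
	Windows []string `json:"windows,omitempty"`
}
```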

wking (Author, Member) commented Nov 26, 2019

I don't think booleans make a great API, as they are not extensible.

I'm still not clear on what sort of extensibility you'd want. I think there's a set of policy engines which can be distributed in external chains or run inside the CVO itself, that eventually spit out a set of recommended update targets. Then you need a boolean knob to decide if the CVO automatically initiates an update when a target becomes available. For example:

  1. Upstream Cincinnati spits out graph A.
  2. On-site policy engine ingests graph A and applies local stability filtering, removing Red-Hat-recommended update edges until they have also passed local update testing. This engine spits out graph B.
  3. CVO ingests graph B via the upstream URI, and applies schedule filtering, removing all update edges if and only if the current time lies outside the configured update windows. This engine spits out graph C.
  4. CVO ingests its own graph C and looks for available updates from the current version, storing those in AvailableUpdates.
  5. CVO looks at AvailableUpdates and the AutomaticUpdates boolean, and decides whether to automatically set DesiredUpdate.

So there's still lots of room for extension. The boolean option is just an in-cluster knob for the existing CLI flag.

...any condition, if required, will be done somewhere else...

No, you could still apply policy filtering on the CVO side, independent of whether the CVO automatically updates based on the results of a local policy engine. This is what I was trying to convey here.
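
As a very rough sketch of step 5 in the pipeline described above, using simplified stand-in types rather than the real ClusterVersion API:

```go
// Sketch of the final gate (step 5): after all policy filtering has produced
// AvailableUpdates, the boolean only decides whether the CVO acts on them.
// These are simplified stand-in types, not the real ClusterVersion API.
package cvogate

type Release struct {
	Version string
	Image   string
}

type ClusterVersionStatus struct {
	AvailableUpdates []Release // graph C, already filtered by any policy engines
}

type ClusterVersionSpec struct {
	AutomaticUpdates bool
	DesiredUpdate    *Release
}

// maybeAutoUpdate sets DesiredUpdate to the first recommended target when
// automatic updates are enabled and no update is already requested.
func maybeAutoUpdate(spec *ClusterVersionSpec, status ClusterVersionStatus) {
	if !spec.AutomaticUpdates || spec.DesiredUpdate != nil || len(status.AvailableUpdates) == 0 {
		return
	}
	target := status.AvailableUpdates[0]
	spec.DesiredUpdate = &target
}
```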

Administrators can configure the cluster to push those alerts out to the on-call administrator to recover the cluster.
* Stability testing.
We are continually refining our CI suite and processing Telemetry from live clusters in order to assess the stability of each update.
We will not place updates in production channels unless they have proven themselves stable in earlier testing, and we will remove update recommendations from production channels if a live cluster trips over a corner case that we do not yet cover in pretesting.

smarterclayton (Member) commented Nov 23, 2019

Without the ability to control rollout, this seems like a risky feature. What is the minimal phased rollout feature that mitigates risk?

wking (Author, Member) commented Nov 24, 2019

What is the minimal phased rollout feature that mitigates risk?

The ability to spread the update out across a release-admin-specified time window, with clusters in the given channel randomly spread across that window. Things like automatically pulling edges on failure would be nice, but their absence can be mitigated with manual monitoring and long windows (i.e. slow rollouts). Things like intelligently sorting sensitive clusters toward the back of the queue would also be nice, but their absence can be mitigated by ensuring sufficient populations in less-stable channels.
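
For illustration only (not part of the enhancement), spreading clusters deterministically across such a rollout window might look something like this; the hashing scheme and helper names are hypothetical.

```go
// Sketch: spread clusters in a channel across a release-admin-specified rollout
// window by hashing a stable cluster identifier. Hypothetical helpers, for
// illustration of the phased-rollout idea only.
package rollout

import (
	"hash/fnv"
	"time"
)

// startTime maps a cluster ID to a deterministic offset inside
// [windowStart, windowStart+windowLength).
func startTime(clusterID string, windowStart time.Time, windowLength time.Duration) time.Time {
	if windowLength <= 0 {
		return windowStart
	}
	h := fnv.New64a()
	h.Write([]byte(clusterID))
	offset := time.Duration(h.Sum64() % uint64(windowLength))
	return windowStart.Add(offset)
}

// eligible reports whether this cluster's slot in the rollout window has arrived.
func eligible(clusterID string, windowStart time.Time, windowLength time.Duration, now time.Time) bool {
	return !now.Before(startTime(clusterID, windowStart, windowLength))
}
```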

There are already canary clusters with polling automatic updates, so we can use those for testing and will not need to provision more long-running clusters to exercise this enhancement.
We can provision additional short-lived clusters in existing CI accounts if we want to provide additional end-to-end testing.

[api-pull-request]: https://github.com/openshift/api/pull/326

LalatenduMohanty (Member) commented Nov 25, 2019

This pull request is now closed, so maybe we can remove it from here. Not sure what the right action is here.

wking (Author, Member) commented Nov 25, 2019

This pull request is now closed.

It was closed pending this enhancement discussion. If the enhancement lands, I'll re-open and reroll the PR.

Allowing automatic updates would make it more likely that update failures happen when there is no administrator actively watching to notice and recover from the failure.
This is mitigated by:

* Alerting.

brancz commented Dec 12, 2019

We actually know that certain (non-critical) alerts will fire during upgrades. The observability group (of which the in-cluster monitoring team is a part) has been trialing silencing everything but critical alerts during upgrades. While this is not perfect yet, I think this is what we ultimately want to automate for automatic upgrades.

That said, any severity: critical alerts firing should prevent an automatic upgrade from happening. This level is used to page people, so in a production cluster this should already mean that someone has been notified. Our goal must be that no critical alerts are firing during upgrades. Essentially this means we need to set certain objectives per component; if they are violated, we cannot proceed with an automatic upgrade and/or people are notified. I know the in-cluster team is planning an initiative to introduce these objectives for components (at this point, in the 4.5 time frame at the earliest, I would say).
