[release-4.4] Bug 1826033: Ignore ImagePruningDisabled alert #24901

adambkaplan · 2020-04-21T06:39:19Z

In 4.4 the automatic image pruner is initially disabled. Cluster admins are responsible for enabling it day 2. An alert is fired if the image pruner is disabled.

Note that in 4.5 the automatic image pruner is enabled on new clusters.

In 4.4 the automatic image pruner is initially disabled. Cluster admins are responsible for enabling it day 2. An alert is fired if the image pruner is disabled.

openshift-ci-robot · 2020-04-21T06:39:23Z

@adambkaplan: This pull request references Bugzilla bug 1826033, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.4.0) matches configured target release for branch (4.4.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
dependent bug Bugzilla bug 1825284 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA))
dependent Bugzilla bug 1825284 targets the "4.5.0" release, which is one of the valid target releases: 4.5.0, 4.5.z
bug has dependents

In response to this:

[release-4.4] Bug 1826033: Ignore ImagePruningDisabled alert

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

adambkaplan · 2020-04-21T06:40:10Z

/assign @coreydaley
/cc @dmage

brancz · 2020-04-21T11:22:50Z

Can we have a discussion about this please? I really think we should not be abusing alerting as a general purpose notification mechanism to users. It seems that we are lacking such a thing, but alerting should not be used for this.

/hold

lilic · 2020-04-21T11:32:36Z

It seems that we are lacking such a thing, but alerting should not be used for this.

Agreed! I understand that is important to notify the admin, but we had a discussion with Clayton and Adi from Console team and all agreed we should not use alerts for anything but alerting, rest can be notifications in the console. The discussion was on slack, let me know if you want a link to it. :)

paulfantom · 2020-04-21T11:41:42Z

This exclusion list shouldn't be treated as a green light for converting alerts into a "to do list" for a cluster administrator.

Additionally, I looked more closely into the alert itself and I have a question about its validity. In short: in what unattended situation this alert can fire? As far as I can see the alert is based on image_registry_operator_image_pruner_install_status metric, which changes value only when a cluster administrator does some action on purpose (correct me if I am wrong). For me, this is clearly a notification and not alert and this is something that admin can check via operator status after installing the component.

brancz · 2020-04-21T12:06:34Z

Agreed with @paulfantom and @lilic. I think this shows a general need for additional configuration for/after an upgrade (which I believe is actually a feature that was already present back in Tectonic, I wonder why we don't have this .. it would block an auto upgrade until configuration was supplied and we alerted on upgrades not progressing: symptom based, not cause based 🙂 ).

adambkaplan · 2020-04-21T13:00:54Z

@brancz @lilic totally agree that normally we should not be using the alerting system as a post-install. However, this is to address a late-breaking 4.4 GA blocker bug for the new auto image pruning feature.

We spec-ed that the image pruner should not be installed on 4.4 clusters by default because of the risk that a) image pruning carries a low risk of removing images that are in use by customers, and b) customers may have deployed their own solutions for pruning images. The latter is especially true for OpenShift Dedicated clusters. Unfortunately, we did not do this in our implementation and did not catch it in our code reviews or CI.

In discussion with @dmage and @bparees we agreed that the best course of action, given our risk tolerance and time constraints, was as follows:

Image pruning should be installed for all 4.4, but should be suspended. This fires the ImagePrunerDisabled alert. The result is that the image pruner resources are in place on the cluster, ready to be enabled with a simple oc patch/oc apply command.
In 4.5 the pruner is enabled by default on new clusters, on upgrade the admin setting is preserved. For a new install no alerts should fire.

As far as I can see the alert is based on image_registry_operator_image_pruner_install_status metric, which changes value only when a cluster administrator does some action on purpose (correct me if I am wrong)

Not entirely true. Even in our original enhancement proposal, we stated we would fire an alert fire on upgrade from 4.3 to 4.4 clusters [1]. The notable change as a result of this BZ is that this alert would fire for all clusters, not just upgrades.

This exclusion list shouldn't be treated as a green light for converting alerts into a "to do list" for a cluster administrator.

Agree - and for clean installs (such as the e2e-aws suites) we should only have very limited number of alerts firing. We're asking for ImagePrunerDisabled to be considered a special case because image pruning is critical to cluster health. Otherwise customers risk running out of space in etcd and cluster death.

[1] https://github.com/openshift/enhancements/blob/master/enhancements/image-registry/automate-pruning.md#upgrade--downgrade-strategy

adambkaplan · 2020-04-21T13:01:48Z

/cc @bparees @smarterclayton

adambkaplan · 2020-04-21T13:50:50Z

Correction per further discussion with @smarterclayton - etcd death is not solely determined by storage. etcd has a performance threshold on the number of keys, above which it is not capable of doing its job.

adambkaplan · 2020-04-21T14:57:10Z

/retest

bparees · 2020-04-21T16:44:11Z

It seems that we are lacking such a thing, but alerting should not be used for this.

Agreed! I understand that is important to notify the admin, but we had a discussion with Clayton and Adi from Console team and all agreed we should not use alerts for anything but alerting, rest can be notifications in the console. The discussion was on slack, let me know if you want a link to it. :)

this is largely news to me, I've been on numerous architecture discussion calls about problems like this and "set an alert" was always the prescribed solution. e.g. registry needing storage configuration, or registry is using empty dir storage.

I'm not sure where we draw the line between an "alert" and a "notification". But something that, if the admin doesn't take action, can kill their cluster, seems worthy of an alert.

Also the notification api sounds like it is substantially lacking. An admin can dismiss/silence an alert(right?). A notification, if created by the operator, can't be ignored/dismissed by the admin, they'd have to fix the config so the operator stopped sending the notification, which, while the right thing to do in most cases, may not be what they want to do in every case. (And making every operator that sends notifications introduce a "silence my notifications" api doesn't seem reasonable).

In addition notifications are presented as banners at the top of the console. If we start having lots of components setting them, that's going to make the console very hard to use. So i think notifications need some more thinking before we start suggesting they are the solution here.

and lastly, notifications are seen by all users, not just admins. Presenting information that's irrelevant to a normal user, and which they can do nothing about, doesn't make sense. So again, notifications need more work before they could be used here.

bparees · 2020-04-21T16:47:08Z

We spec-ed that the image pruner should not be installed on 4.4 clusters by default because of the risk that a) image pruning carries a low risk of removing images that are in use by customers

and just for completeness, i'd categorize the risk as more than low. Especially if someone is using their OCP cluster to provide a registry to other clusters/consumers (in which case the registry will contain images that are not referenced by anything on the cluster, thus pruning will be free to remove any/all of those images except those associated explicitly with a tag, or whatever N history we preserve, regardless of what external components might be trying to pull those images).

there are also numerous ways to consume/reference images on cluster that the pruner will be unaware of(e.g. any CRD), and thus be allowed to prune away.

adambkaplan · 2020-04-21T20:28:30Z

/retest

bparees · 2020-04-21T20:35:49Z

/hold cancel

this is a 4.4 GA blocker, @adambkaplan has captured the future work to revert this and create a more preferred alert based on "too many images, check your pruning settings" in the future, but for now this gets us where we need to be. (It should not be cited as precedent for other teams looking to add similar alerts).

https://issues.redhat.com/browse/DEVEXP-596

coreydaley · 2020-04-21T21:11:30Z

/retest

lilic · 2020-04-22T07:52:22Z

but for now this gets us where we need to be.

Agreed! We will get better at improving our docs to make it more clear that we should alert on symptoms not on causes, the main point was symptom vs cause based alerts from the various slack discussions. And I think Adam captured this well in the new ticket! So happy to move forward.

(It should not be cited as precedent for other teams looking to add similar alerts).

Thanks!

dmage · 2020-04-22T10:26:29Z

/retest

cuppett · 2020-04-22T13:35:59Z

/test e2e-gcp

coreydaley · 2020-04-22T14:08:38Z

/retest

bparees · 2020-04-22T14:19:52Z

/lgtm

openshift-ci-robot · 2020-04-22T14:23:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan, bparees

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/extended/prometheus/OWNERS~~ [adambkaplan,bparees]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2020-04-22T16:35:49Z

/retest