Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release-4.4] Bug 1826033: Ignore ImagePruningDisabled alert #24901

Merged

Conversation

adambkaplan
Copy link
Contributor

In 4.4 the automatic image pruner is initially disabled. Cluster admins are responsible for enabling it day 2. An alert is fired if the image pruner is disabled.

Note that in 4.5 the automatic image pruner is enabled on new clusters.

In 4.4 the automatic image pruner is initially disabled.
Cluster admins are responsible for enabling it day 2.
An alert is fired if the image pruner is disabled.
@openshift-ci-robot
Copy link

@adambkaplan: This pull request references Bugzilla bug 1826033, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.4.0) matches configured target release for branch (4.4.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1825284 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA))
  • dependent Bugzilla bug 1825284 targets the "4.5.0" release, which is one of the valid target releases: 4.5.0, 4.5.z
  • bug has dependents

In response to this:

[release-4.4] Bug 1826033: Ignore ImagePruningDisabled alert

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/urgent bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 21, 2020
@adambkaplan
Copy link
Contributor Author

/assign @coreydaley
/cc @dmage

@brancz
Copy link
Contributor

brancz commented Apr 21, 2020

Can we have a discussion about this please? I really think we should not be abusing alerting as a general purpose notification mechanism to users. It seems that we are lacking such a thing, but alerting should not be used for this.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 21, 2020
@lilic
Copy link
Contributor

lilic commented Apr 21, 2020

It seems that we are lacking such a thing, but alerting should not be used for this.

Agreed! I understand that is important to notify the admin, but we had a discussion with Clayton and Adi from Console team and all agreed we should not use alerts for anything but alerting, rest can be notifications in the console. The discussion was on slack, let me know if you want a link to it. :)

@paulfantom
Copy link
Contributor

paulfantom commented Apr 21, 2020

This exclusion list shouldn't be treated as a green light for converting alerts into a "to do list" for a cluster administrator.

Additionally, I looked more closely into the alert itself and I have a question about its validity. In short: in what unattended situation this alert can fire? As far as I can see the alert is based on image_registry_operator_image_pruner_install_status metric, which changes value only when a cluster administrator does some action on purpose (correct me if I am wrong). For me, this is clearly a notification and not alert and this is something that admin can check via operator status after installing the component.

@brancz
Copy link
Contributor

brancz commented Apr 21, 2020

Agreed with @paulfantom and @lilic. I think this shows a general need for additional configuration for/after an upgrade (which I believe is actually a feature that was already present back in Tectonic, I wonder why we don't have this .. it would block an auto upgrade until configuration was supplied and we alerted on upgrades not progressing: symptom based, not cause based 🙂 ).

@adambkaplan
Copy link
Contributor Author

@brancz @lilic totally agree that normally we should not be using the alerting system as a post-install. However, this is to address a late-breaking 4.4 GA blocker bug for the new auto image pruning feature.

We spec-ed that the image pruner should not be installed on 4.4 clusters by default because of the risk that a) image pruning carries a low risk of removing images that are in use by customers, and b) customers may have deployed their own solutions for pruning images. The latter is especially true for OpenShift Dedicated clusters. Unfortunately, we did not do this in our implementation and did not catch it in our code reviews or CI.

In discussion with @dmage and @bparees we agreed that the best course of action, given our risk tolerance and time constraints, was as follows:

  1. Image pruning should be installed for all 4.4, but should be suspended. This fires the ImagePrunerDisabled alert. The result is that the image pruner resources are in place on the cluster, ready to be enabled with a simple oc patch/oc apply command.
  2. In 4.5 the pruner is enabled by default on new clusters, on upgrade the admin setting is preserved. For a new install no alerts should fire.

As far as I can see the alert is based on image_registry_operator_image_pruner_install_status metric, which changes value only when a cluster administrator does some action on purpose (correct me if I am wrong)

Not entirely true. Even in our original enhancement proposal, we stated we would fire an alert fire on upgrade from 4.3 to 4.4 clusters [1]. The notable change as a result of this BZ is that this alert would fire for all clusters, not just upgrades.

This exclusion list shouldn't be treated as a green light for converting alerts into a "to do list" for a cluster administrator.

Agree - and for clean installs (such as the e2e-aws suites) we should only have very limited number of alerts firing. We're asking for ImagePrunerDisabled to be considered a special case because image pruning is critical to cluster health. Otherwise customers risk running out of space in etcd and cluster death.

[1] https://github.com/openshift/enhancements/blob/master/enhancements/image-registry/automate-pruning.md#upgrade--downgrade-strategy

@adambkaplan
Copy link
Contributor Author

/cc @bparees @smarterclayton

@adambkaplan
Copy link
Contributor Author

Correction per further discussion with @smarterclayton - etcd death is not solely determined by storage. etcd has a performance threshold on the number of keys, above which it is not capable of doing its job.

@adambkaplan
Copy link
Contributor Author

/retest

@bparees
Copy link
Contributor

bparees commented Apr 21, 2020

It seems that we are lacking such a thing, but alerting should not be used for this.

Agreed! I understand that is important to notify the admin, but we had a discussion with Clayton and Adi from Console team and all agreed we should not use alerts for anything but alerting, rest can be notifications in the console. The discussion was on slack, let me know if you want a link to it. :)

this is largely news to me, I've been on numerous architecture discussion calls about problems like this and "set an alert" was always the prescribed solution. e.g. registry needing storage configuration, or registry is using empty dir storage.

I'm not sure where we draw the line between an "alert" and a "notification". But something that, if the admin doesn't take action, can kill their cluster, seems worthy of an alert.

Also the notification api sounds like it is substantially lacking. An admin can dismiss/silence an alert(right?). A notification, if created by the operator, can't be ignored/dismissed by the admin, they'd have to fix the config so the operator stopped sending the notification, which, while the right thing to do in most cases, may not be what they want to do in every case. (And making every operator that sends notifications introduce a "silence my notifications" api doesn't seem reasonable).

In addition notifications are presented as banners at the top of the console. If we start having lots of components setting them, that's going to make the console very hard to use. So i think notifications need some more thinking before we start suggesting they are the solution here.

and lastly, notifications are seen by all users, not just admins. Presenting information that's irrelevant to a normal user, and which they can do nothing about, doesn't make sense. So again, notifications need more work before they could be used here.

@bparees
Copy link
Contributor

bparees commented Apr 21, 2020

We spec-ed that the image pruner should not be installed on 4.4 clusters by default because of the risk that a) image pruning carries a low risk of removing images that are in use by customers

and just for completeness, i'd categorize the risk as more than low. Especially if someone is using their OCP cluster to provide a registry to other clusters/consumers (in which case the registry will contain images that are not referenced by anything on the cluster, thus pruning will be free to remove any/all of those images except those associated explicitly with a tag, or whatever N history we preserve, regardless of what external components might be trying to pull those images).

there are also numerous ways to consume/reference images on cluster that the pruner will be unaware of(e.g. any CRD), and thus be allowed to prune away.

@adambkaplan
Copy link
Contributor Author

/retest

@bparees
Copy link
Contributor

bparees commented Apr 21, 2020

/hold cancel

this is a 4.4 GA blocker, @adambkaplan has captured the future work to revert this and create a more preferred alert based on "too many images, check your pruning settings" in the future, but for now this gets us where we need to be. (It should not be cited as precedent for other teams looking to add similar alerts).

https://issues.redhat.com/browse/DEVEXP-596

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 21, 2020
@coreydaley
Copy link
Member

/retest

@lilic
Copy link
Contributor

lilic commented Apr 22, 2020

but for now this gets us where we need to be.

Agreed! We will get better at improving our docs to make it more clear that we should alert on symptoms not on causes, the main point was symptom vs cause based alerts from the various slack discussions. And I think Adam captured this well in the new ticket! So happy to move forward.

(It should not be cited as precedent for other teams looking to add similar alerts).

Thanks!

@dmage
Copy link
Member

dmage commented Apr 22, 2020

/retest

@cuppett
Copy link
Member

cuppett commented Apr 22, 2020

/test e2e-gcp

@coreydaley
Copy link
Member

/retest

@bparees
Copy link
Contributor

bparees commented Apr 22, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2020
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan, bparees

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@bparees bparees added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Apr 23, 2020
@bparees
Copy link
Contributor

bparees commented Apr 23, 2020

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

4 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@adambkaplan
Copy link
Contributor Author

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Copy link

@adambkaplan: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-gcp-upgrade 688c9ca link /test e2e-gcp-upgrade

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sttts
Copy link
Contributor

sttts commented Apr 24, 2020

/retest

@bparees
Copy link
Contributor

bparees commented Apr 24, 2020

the only thing is this stuck on is gcp ugprade and it can't be breaking that job. manually merging.

@bparees bparees merged commit b9b84e0 into openshift:release-4.4 Apr 24, 2020
@openshift-ci-robot
Copy link

@adambkaplan: Some pull requests linked via external trackers have merged: . The following pull requests linked via external trackers have not merged:

In response to this:

[release-4.4] Bug 1826033: Ignore ImagePruningDisabled alert

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet