[release-4.4] Bug 1826033: Ignore ImagePruningDisabled alert #24901
Conversation
In 4.4 the automatic image pruner is initially disabled. Cluster admins are responsible for enabling it day 2. An alert is fired if the image pruner is disabled.
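For readers landing here: the day-2 enablement mentioned above goes through the image registry operator's ImagePruner resource. The sketch below is illustrative only; the spec fields and defaults shown are assumptions and should be verified against the 4.4 image registry operator documentation.

```yaml
# Illustrative sketch: enabling the automatic image pruner day 2.
# The kind/group come from the cluster-image-registry-operator;
# exact spec fields should be checked against the 4.4 docs.
apiVersion: imageregistry.operator.openshift.io/v1
kind: ImagePruner
metadata:
  name: cluster
spec:
  suspend: false   # false (re)enables the automatic pruning job
  schedule: ""     # empty string falls back to the operator's default schedule
```

Applied with `oc apply -f`, the operator would reconcile this into a scheduled job that runs the pruner, which is what the alert in this PR nags the admin to set up.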
@adambkaplan: This pull request references Bugzilla bug 1826033, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 6 validations were run on this bug.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @coreydaley
Can we have a discussion about this, please? I really think we should not be abusing alerting as a general-purpose notification mechanism for users. It seems that we are lacking such a thing, but alerting should not be used for this. /hold
Agreed! I understand that it is important to notify the admin, but we had a discussion with Clayton and Adi from the Console team and all agreed we should not use alerts for anything but alerting; the rest can be notifications in the console. The discussion was on Slack; let me know if you want a link to it. :)
This exclusion list shouldn't be treated as a green light for converting alerts into a "to do" list for a cluster administrator. Additionally, I looked more closely into the alert itself and I have a question about its validity. In short: in what unattended situation can this alert fire? As far as I can see, the alert is based on
Agreed with @paulfantom and @lilic. I think this shows a general need for additional configuration for/after an upgrade (which I believe is actually a feature that was already present back in Tectonic; I wonder why we don't have this... it would block an auto upgrade until configuration was supplied, and we alerted on upgrades not progressing: symptom based, not cause based 🙂).
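For reference, the alerting rule under discussion looks roughly like the sketch below. The metric name and threshold here are assumptions modeled on an operator-reported install-status gauge, not taken from this PR's diff.

```yaml
# Hypothetical sketch of the ImagePruningDisabled rule (names assumed)
groups:
- name: imageregistry.rules
  rules:
  - alert: ImagePruningDisabled
    # Assumed gauge: the operator reports the pruner's install status;
    # any value below "installed and enabled" means pruning is not running.
    expr: image_registry_operator_image_pruner_install_status < 2
    for: 1h
    labels:
      severity: warning
    annotations:
      message: Automatic image pruning is not enabled. Registry storage may grow without bound.
```

Because the expression keys off a configuration state rather than an observed symptom, it fires whenever the pruner is disabled, which is the crux of the symptom- vs. cause-based objection raised in this thread.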
@brancz @lilic totally agree that normally we should not be using the alerting system as a post-install checklist. However, this is to address a late-breaking 4.4 GA blocker bug for the new automatic image pruning feature. We specified that the image pruner should not be installed on 4.4 clusters by default because a) image pruning carries a low risk of removing images that are in use by customers, and b) customers may have deployed their own solutions for pruning images. The latter is especially true for OpenShift Dedicated clusters. Unfortunately, we did not do this in our implementation and did not catch it in our code reviews or CI. In discussion with @dmage and @bparees, we agreed that the best course of action, given our risk tolerance and time constraints, was as follows:
Not entirely true. Even in our original enhancement proposal, we stated that an alert would fire on upgrade from 4.3 to 4.4 clusters [1]. The notable change as a result of this BZ is that this alert now fires for all clusters, not just upgrades.
Agree, and for clean installs (such as the
Correction per further discussion with @smarterclayton -
/retest
This is largely news to me; I've been on numerous architecture discussion calls about problems like this, and "set an alert" was always the prescribed solution, e.g. the registry needing storage configuration, or the registry using emptyDir storage. I'm not sure where we draw the line between an "alert" and a "notification", but something that, if the admin doesn't take action, can kill their cluster seems worthy of an alert.

Also, the notification API sounds substantially lacking. An admin can dismiss/silence an alert (right?). A notification, if created by the operator, can't be ignored/dismissed by the admin; they'd have to fix the config so the operator stopped sending the notification, which, while the right thing to do in most cases, may not be what they want to do in every case. (And making every operator that sends notifications introduce a "silence my notifications" API doesn't seem reasonable.)

In addition, notifications are presented as banners at the top of the console. If we start having lots of components setting them, that's going to make the console very hard to use. So I think notifications need some more thought before we start suggesting they are the solution here.

And lastly, notifications are seen by all users, not just admins. Presenting information that's irrelevant to a normal user, and which they can do nothing about, doesn't make sense. So again, notifications need more work before they could be used here.
And just for completeness, I'd categorize the risk as more than low, especially if someone is using their OCP cluster to provide a registry to other clusters/consumers (in which case the registry will contain images that are not referenced by anything on the cluster, so pruning will be free to remove any/all of those images except those explicitly associated with a tag, or whatever N history we preserve, regardless of what external components might be trying to pull those images). There are also numerous ways to consume/reference images on the cluster that the pruner will be unaware of (e.g. any CRD), and thus be allowed to prune away.
/retest
/hold cancel

This is a 4.4 GA blocker. @adambkaplan has captured the future work to revert this and create a more preferred alert based on "too many images, check your pruning settings", but for now this gets us where we need to be. (It should not be cited as precedent for other teams looking to add similar alerts.)
/retest
Agreed! We will improve our docs to make it clearer that we should alert on symptoms, not on causes; the main point from the various Slack discussions was symptom- vs. cause-based alerts. And I think Adam captured this well in the new ticket! So, happy to move forward.
Thanks!
/retest
/test e2e-gcp
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: adambkaplan, bparees. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest Please review the full test history for this PR and help us cut down flakes.
3 similar comments
/retest
/retest Please review the full test history for this PR and help us cut down flakes.
4 similar comments
/retest
/retest Please review the full test history for this PR and help us cut down flakes.
1 similar comment
@adambkaplan: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
The only thing this is stuck on is the GCP upgrade job, and it can't be breaking that job. Manually merging.
@adambkaplan: Some pull requests linked via external trackers have merged. The following pull requests linked via external trackers have not merged:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
In 4.4 the automatic image pruner is initially disabled. Cluster admins are responsible for enabling it day 2. An alert is fired if the image pruner is disabled.
Note that in 4.5 the automatic image pruner is enabled on new clusters.