Bug 1793850: Update UsingDeprecatedAPIExtensionsV1Beta1 alert #730
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: cblecker. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
This and #705 should be cherry-picked back to release-4.3.
@cblecker: This pull request references Bugzilla bug 1793850, which is invalid:
Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 1e72fc6 to f9ac58c.
/retest
      expr: |
-       apiserver_request_count{group="extensions",version="v1beta1",resource!~"ingresses|",client!~"hyperkube/.*|cluster-policy-controller/.*"}
+       increase(
+         apiserver_request_count{
Please note that apiserver_request_count is a deprecated metric, and we are dropping those metrics from OpenShift as well. We should be using apiserver_request_total instead.
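For illustration, the swap would leave the selector unchanged and only rename the metric — a sketch, assuming the label set carries over to apiserver_request_total as suggested above:

    apiserver_request_total{
      group="extensions",
      version="v1beta1",
      resource!~"ingresses|",
      client!~"hyperkube/.*|cluster-policy-controller/.*"
    }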
@lilic drop when? This PR is meant for 4.3, most importantly. If you're dropping this in 4.4, that's a fine change; if in the release after that, we'll fix it then.
Yes, we are dropping this metric starting with 4.4. It has been deprecated since Kubernetes 1.14 and has a replacement, apiserver_request_total, so just changing it to that should just work ™️.
Made this change, and I verified the alert still fires as expected in a test cluster.
👍 @lilic will the new metric work fine in 4.3 as well? Just checking before we switch it in the backport too, or does the backport have to be different?
My test cluster is running 4.3.0 GA, so I can confirm that this works.
      labels:
-       severity: warning
+       severity: none
why severity none? 🤔
I remember that the severity should've stayed; from the discussion, we were only going to change the metrics to be rate-based.
In the OSD context, critical and warning alerts are fed into monitoring for cluster health. This alert isn't particularly actionable and doesn't actually impact the cluster's current health -- it's an "info"-type alert telling you to do something before the next upgrade, but the current cluster health isn't degraded. None, in this case, I believe is appropriate.
Note that with a severity of none, it still surfaces in the on-cluster UI.
alert isn't particularly actionable and doesn't actually impact the cluster's current health
Precisely why this alert shouldn't exist in the first place.
The reasoning behind it is that we need to yell loudly at people using this API, b/c otherwise it won't work after the next upgrade. I can be convinced, and I was originally going for an info level, but no such level exists atm. That's why the warning, and it's an important one to have.
But in 4.4 it's not firing; otherwise we would have had failed jobs, which confirms @tnozicka:
note that for 4.4 the serving is disabled so we should drop this check entirely there
Do we want our customers to react to this alert? And if yes, how?
Do we want our customers to react to this alert? And if yes, how?
If this fires in 4.3 it means you have clients accessing endpoints that will be removed in the next release and they should be updating that code.
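As a concrete example of how an administrator might react, a query along these lines (hypothetical; it just reuses the labels from the alert expression) would show which clients are still hitting the deprecated endpoints:

    sum by (client, resource, verb) (
      apiserver_request_count{group="extensions", version="v1beta1"}
    )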
I can be convinced, and I was originally going for an info level, but no such level exists atm.
They do exist. OLM uses info-level alerts, and we also (will) have special rules for handling this severity.
If this fires in 4.3 it means you have clients accessing endpoints that will be removed in the next release and they should be updating that code.
This is not immediately actionable from the operator's point of view.
note that for 4.4 the serving is disabled so we should drop this check entirely there
group="extensions", | ||
version="v1beta1", | ||
resource!~"ingresses|", | ||
client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*" |
Why velero-server? The only reason for the others was using discovery to spin up informers.
Velero is a piece of backup software, and it scrapes every API type available on the system in order to back them up. This causes the alert to fire in OSD and never resolve. @sttts mentioned that ignoring this client may be a good idea.
yeah, that's a minor problem I can live with.
I'd also note that there is no action that we or velero need to take on this. When the API endpoint goes away in 4.4, it won't be there to scrape anymore.
version="v1beta1", | ||
resource!~"ingresses|", | ||
client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*" | ||
}[24h] |
Didn't you want to have 5m here? Why go with 24h?
So my understanding of this alert is that we want to surface to the administrators of the cluster that some client is making a call it will no longer be able to make in 4.4+. Right now, this alert fires and never resolves -- which on the one hand accomplishes this goal, but on the other hand doesn't allow the alert to resolve if the condition stops.
I'm okay with 5m, but I think that lessens the usefulness of this alert, as the administrator would need to catch it within that 5-minute period. For clients that aren't calling the API that frequently (cron jobs, deploy pipelines, etc.) it might be harder to catch this alert.
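For reference, the 24h form under discussion amounts to something like the following (a sketch; the `> 0` comparison is an assumption about the threshold, not copied from the diff):

    increase(
      apiserver_request_count{
        group="extensions",
        version="v1beta1",
        resource!~"ingresses|",
        client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
      }[24h]
    ) > 0

With a 24h window, a single deprecated call keeps the alert visible for a day, which addresses the catchability concern for infrequent clients while still letting the alert resolve once the calls stop.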
I don't have any strong preferences either way, I was just asking 😉
version="v1beta1", | ||
resource!~"ingresses|", | ||
client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*" | ||
}[24h] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't you want to have 5m here? Why going with 24h?
Force-pushed from f9ac58c to 6bbd781.
My one remaining concern with this is still the level: it has to stay warning, or if there's info, that's the lowest we can go. It's quite important feedback for users to get before the next upgrade, at which point it'll be too late for them; they will be blocked from upgrading, or face even worse problems after the upgrade.
- Updated message that describes the issue more clearly
- Exclude velero-server client
- Change alert to use an increase in the last 24h
- Change severity to none
Force-pushed from 6bbd781 to 4599af8.
@soltysh Updated to info. It's important that this not be a "warning" because the cluster's health is not actually degraded.
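Putting the discussed changes together, the rule would now look roughly like this (a sketch only: the metric name follows the swap to apiserver_request_total suggested above, and the `> 0` threshold and omitted annotations are assumptions, not copied from the diff):

    - alert: UsingDeprecatedAPIExtensionsV1Beta1
      expr: |
        increase(
          apiserver_request_total{
            group="extensions",
            version="v1beta1",
            resource!~"ingresses|",
            client!~"hyperkube/.*|cluster-policy-controller/.*|velero-server/.*"
          }[24h]
        ) > 0
      labels:
        severity: info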
/retest
/retest
@cblecker: The following tests failed, say /retest to rerun them all:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Superseded by #741.
/assign @soltysh
/cc @sttts @brancz @tnozicka