add DeprecatedAPIInUse alert #1018
f0f5b90 to 974ddc0
/retest |
/test k8s-e2e-gcp |
message: |
  Deprecated API that will be removed in the next version is being used: {{"{{$labels.group}}"}}.{{"{{$labels.version}}"}}/{{"{{$labels.resource}}"}}
expr: |
  group(apiserver_requested_deprecated_apis{removed_release="1.20"}) by (group,version,resource,subresource) * on(group,version,resource,subresource) group_right() apiserver_request_total
master is 1.20, right, so shouldn't this be `removed_release="1.21"` here? And then changed to `1.20` if/when this gets backported to 4.6?
@wking, you are correct. We weren't on 1.20 yet when I originally started this.
The problem with such an alert is that it will never clear by itself. That has already been discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1793850 and #742 with the `UsingDeprecatedAPIExtensionsV1Beta1` alert. Eventually that alert was removed in #764. Are we sure that we want this again? A test in origin e2e might be better suited?
Agree with Simon, this was a huge pain for both our SREs and customers (details are in the Bugzilla linked by Simon). We need all alerts to be resolvable by our customers, but our customers cannot control OpenShift operators, only their own workloads. Adding an e2e test based on these metrics makes perfect sense. There was also talk about adding notifications in the console to warn users of deprecated APIs.
@lilic @simonpasquier How should a customer be warned that a deprecated API that is going to be removed is being used in their cluster, and that this will disrupt their particular workload? Not all workloads come in OCP payloads, but any workload using the deprecated APIs will be impacted when they are removed in a following release. I'm open to other solutions, but I'd like to know what that solution is and what we should do to manage that in 1.22, when the first round of APIs like CRD/v1beta1 is removed.
An e2e test in origin does not help customers who are running code using these old APIs that is not in an OCP payload. |
I think the console team either has notifications in place already or is planning to add them. Those are more appropriate for this, so I would suggest following up with them. Anything that does not page admins, but just shows notifications to users in the console, is a better user experience in my opinion.
@lilic technically both are feasible, obviously. It sounds like we are moving into product manager or architect territory: how best to present that information to users. Are we sure that with console notifications we won't get "nobody warned me about the broken workloads after upgrade; I never look at the UI" sorts of complaints? To repeat what @deads2k said between the lines: we need this before clusters are upgraded to 1.22, so in 4.8 at the latest. If the console team (@jhadvig) is not finished with notifications like that (metrics-based notifications in the console), this is going to be too late. Instead of being too late and breaking customers, I would prefer the slightly less ideal user experience (if it actually is less ideal) of metric-based alerts in 4.8.
Some of our users also do not use Alertmanager and never configure it, so the same can be said for alerts, really. In any case, I believe we discussed this offline and more or less agreed that an info-level alert is fine if there are no other alternatives. @simonpasquier also provided a better alerting rule, so that users can actually resolve this alert. And it should apply only to non-OpenShift workloads, IIRC? cc @s-urbaniak
  group(apiserver_requested_deprecated_apis{removed_release="1.20"}) by (group,version,resource,subresource) * on(group,version,resource,subresource) group_right() apiserver_request_total
for: 1h
labels:
  severity: warning
This is barely actionable and not urgent, so it shouldn't have a higher severity than "info".
I agree that we are leveraging alerting for use cases it has not been envisioned for and am +1 on using info level only for this kind of workaround as discussed OOB. |
974ddc0 to 7729380
@s-urbaniak can probably clarify but iirc all alerts are presented in the console. That should include info level. |
Couple of comments.
- alert: DeprecratedAPIUsed
  annotations:
    message: |
      Deprecated API that will be removed in the next version is being used: {{"{{$labels.group}}"}}.{{"{{$labels.version}}"}}/{{"{{$labels.resource}}"}}
We discussed also saying how to fix this, e.g. "Remove the workload that is using this API, otherwise ...", or anything else that is helpful to the end users who receive this page.
Also is it not possible to get a namespace for this to display?
Updated to:
Deprecated API that will be removed in the next version [1] is being used. Removing the workload that is using the group.version/resource API might [2] be necessary for a successful upgrade to the next cluster version.
[1] By "next version" here we mean the next y-stream version. Is there a more exact, customer-facing terminology to use here to make it clear that a z-stream version would not remove an API? For example, "next minor version", but the term "minor version" is not well defined in the documentation.
[2] I say "might" here because the deprecated API usage might be in the kube control plane itself, which of course would be updated in the next version to not use the deprecated APIs, but I can't easily filter that use in the metric/alert.
> Also is it not possible to get a namespace for this to display?
@lilic The namespace is not available.
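As a side note on the `{{"{{$labels.group}}"}}` form used in the rule's message: it reads like the usual escape for a file that passes through Go's `text/template` before Prometheus sees it. That the rule file is rendered this way is an assumption here, not something stated in the thread. A minimal sketch, with a hypothetical `renderOuter` helper, of how the outer pass turns the escaped form into the `{{$labels.group}}` placeholder that Prometheus later expands:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderOuter is a hypothetical helper: it runs src through Go's text/template
// once, the way an outer templating step might, and returns the result.
func renderOuter(src string) (string, error) {
	tmpl, err := template.New("rule").Parse(src)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	// No data is needed: each action is just a quoted string literal.
	if err := tmpl.Execute(&buf, nil); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	escaped := `{{"{{$labels.group}}"}}.{{"{{$labels.version}}"}}/{{"{{$labels.resource}}"}}`
	out, err := renderOuter(escaped)
	if err != nil {
		panic(err)
	}
	// The output still contains template placeholders, now in the form
	// Prometheus expects for alert annotations.
	fmt.Println(out)
}
```

Each `{{"..."}}` action only emits its string literal, so the inner braces survive the first pass untouched.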
  Deprecated API that will be removed in the next version is being used: {{"{{$labels.group}}"}}.{{"{{$labels.version}}"}}/{{"{{$labels.resource}}"}}
expr: |
  group(apiserver_requested_deprecated_apis{removed_release="1.21"}) by (group,version,resource,subresource) * on(group,version,resource,subresource) group_right() rate(apiserver_request_total[10m]) > 0
for: 1m
Any reason why it's 1 minute?
Fixed typo.
7729380 to 3c30e1b
/retest |
lgtm, thanks so much for addressing our comments!
@simonpasquier leaving up to Simon in case he has any more comments.
  the {{"{{$labels.group}}"}}.{{"{{$labels.version}}"}}/{{"{{$labels.resource}}"}} API might be necessary for
  a successful upgrade to the next cluster version.
expr: |
  group(apiserver_requested_deprecated_apis{removed_release="1.21"}) by (group,version,resource,subresource) * on(group,version,resource,subresource) group_right() rate(apiserver_request_total[10m]) > 0
how do we make sure this version is updated each release?
If you want, you can either edit it for each release (so in 4.8 you would replace the value with 1.22, or add an `or` clause for it), or use a regex matcher here that matches multiple versions; see https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors
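To make the regex-matcher option concrete, here is a small Go sketch that builds such a selector for several releases. The helper name `deprecatedAPISelector` is hypothetical and not part of this PR; only the metric name and the `removed_release` label come from the rule under review.

```go
package main

import (
	"fmt"
	"strings"
)

// deprecatedAPISelector (hypothetical helper) builds a PromQL selector that
// matches any of the given removed_release values with a =~ regex matcher.
func deprecatedAPISelector(releases ...string) string {
	escaped := make([]string, 0, len(releases))
	for _, r := range releases {
		// Escape "." so that "1.21" cannot match e.g. "1x21". The backslash is
		// doubled because PromQL double-quoted strings treat "\" as an escape.
		escaped = append(escaped, strings.ReplaceAll(r, ".", `\\.`))
	}
	return fmt.Sprintf(`apiserver_requested_deprecated_apis{removed_release=~"%s"}`,
		strings.Join(escaped, "|"))
}

func main() {
	fmt.Println(deprecatedAPISelector("1.21", "1.22"))
}
```

Prometheus regex matchers are fully anchored, so the alternation matches exactly the listed releases and nothing longer.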
My question was more about how we remember this in practice. We could add a CI test that compares the value with the running server version, in order to be reminded during the rebase.
Yes that is a good idea, or something along the lines of creating one of the deprecated APIs and evaluating if the alert is firing.
@sttts Added an e2e test.
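The idea discussed above, comparing the rule's `removed_release` value with the running server version, could be sketched roughly as follows. This is not the e2e test added in this PR; `nextMinor` and `checkRule` are hypothetical helpers, and the major/minor strings are assumed to come from something like the discovery client's server version.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nextMinor returns the release that follows the given server version,
// e.g. ("1", "20") -> "1.21". Hypothetical helper for illustration.
func nextMinor(major, minor string) (string, error) {
	// Some builds report the minor version with a "+" suffix, e.g. "20+".
	m, err := strconv.Atoi(strings.TrimSuffix(minor, "+"))
	if err != nil {
		return "", fmt.Errorf("unparsable minor version %q: %w", minor, err)
	}
	return fmt.Sprintf("%s.%d", major, m+1), nil
}

// checkRule fails when the alert expression does not select the release
// that comes right after the server's own version.
func checkRule(expr, major, minor string) error {
	want, err := nextMinor(major, minor)
	if err != nil {
		return err
	}
	needle := fmt.Sprintf(`removed_release="%s"`, want)
	if !strings.Contains(expr, needle) {
		return fmt.Errorf("alert expression does not contain %s; update it during the rebase", needle)
	}
	return nil
}

func main() {
	expr := `group(apiserver_requested_deprecated_apis{removed_release="1.21"}) by (group,version,resource)`
	fmt.Println(checkRule(expr, "1", "20")) // rule matches a 1.20 server
	fmt.Println(checkRule(expr, "1", "21")) // rule is stale after a bump to 1.21
}
```

A check like this fails on every Kubernetes bump by design, which is the reminder being asked for.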
Suggestion to get rid of the `on(...) group_right()` part:
Current: group(apiserver_requested_deprecated_apis{removed_release="1.21"}) by (group,version,resource,subresource) * on(group,version,resource,subresource) group_right() rate(apiserver_request_total[10m]) > 0
Suggested: group(apiserver_requested_deprecated_apis{removed_release="1.21"}) by (group,version,resource,subresource) and (sum by(group,version,resource,subresource) (rate(apiserver_request_total[10m]))) > 0
You can even remove the `subresource` label from the by() clauses since it's not used in the annotations.
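The logic of the suggested expression can be modelled in plain Go to show why this form of the alert can resolve: an API is reported only when it is flagged as deprecated and its summed request rate is above zero. This is a toy model of the PromQL semantics, not code from the PR; the types, the `firing` function and the sample data are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// apiKey mirrors the labels kept by the by(group,version,resource) clauses.
type apiKey struct{ Group, Version, Resource string }

// rateSample stands in for one rate(apiserver_request_total[10m]) series.
type rateSample struct {
	Key  apiKey
	Verb string
	Rate float64
}

// firing models: group(deprecated) by (...) and (sum by (...) (rate(...))) > 0.
// In PromQL the "> 0" comparison binds tighter than "and", so the sum is
// filtered first and then intersected with the deprecated set.
func firing(deprecated map[apiKey]bool, samples []rateSample) []apiKey {
	sums := make(map[apiKey]float64)
	for _, s := range samples {
		sums[s.Key] += s.Rate // sum by (group,version,resource)
	}
	var out []apiKey
	for k, flagged := range deprecated {
		if flagged && sums[k] > 0 {
			out = append(out, k)
		}
	}
	// Sort for a stable result, since map iteration order is random.
	sort.Slice(out, func(i, j int) bool {
		return fmt.Sprint(out[i]) < fmt.Sprint(out[j])
	})
	return out
}

func main() {
	crd := apiKey{"apiextensions.k8s.io", "v1beta1", "customresourcedefinitions"}
	deprecated := map[apiKey]bool{crd: true}
	fmt.Println(firing(deprecated, []rateSample{{crd, "LIST", 0.5}})) // still in use
	fmt.Println(firing(deprecated, []rateSample{{crd, "LIST", 0}}))   // requests stopped
}
```

Once requests to the deprecated API stop, the summed rate drops to zero and the API falls out of the result, which is what lets a user clear the alert by removing the offending workload.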
@simonpasquier Thank you. Updated.
message: >-
  Deprecated API that will be removed in the next version is being used. Removing the workload that is using
  the {{"{{$labels.group}}"}}.{{"{{$labels.version}}"}}/{{"{{$labels.resource}}"}} API might be necessary for
  a successful upgrade to the next cluster version.
Please refer to audit logs to find out the actor.
Updated.
3c30e1b to ab00e11
ab00e11 to d4df862
lgtm |
"k8s.io/client-go/discovery" | ||
) | ||
|
||
func TestDeprecatedAPIInUse(t *testing.T) { |
good test. It will break every kube bump. I find this acceptable since it won't block the kube bump.
👍 @sanchezl ignore my ask for a test that breaks elsewhere. This is it I guess.
  a successful upgrade to the next cluster version. Refer to the audit logs to identify the workload.
expr: |
  group(apiserver_requested_deprecated_apis{removed_release="1.21"}) by (group,version,resource) and (sum by(group,version,resource) (rate(apiserver_request_total[10m]))) > 0
for: 1h
If we use 4h, will that cover the entire timeframe of an e2e test? Is this the place to change, or is the rate window the place? Probably here, I guess.
Followed up at #1044.
/lgtm holding so we can decide about the exact timeframe. |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: deads2k, sanchezl
/retest |
/hold cancel |
/test e2e-aws-serial |
/retest Please review the full test history for this PR and help us cut down flakes. |
Warn via an `info` alert that an API that will be removed in the next OpenShift version is currently in use.