Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add DeprecatedAPIInUse alert #1018

Merged
merged 3 commits into from Feb 16, 2021

Conversation

sanchezl
Copy link
Contributor

@sanchezl sanchezl commented Dec 8, 2020

Warn via an info alert that an API that will be removed in the next OpenShift version is currently in use.

@sanchezl sanchezl force-pushed the deprecated-alert branch 3 times, most recently from f0f5b90 to 974ddc0 Compare December 8, 2020 16:34
@sanchezl
Copy link
Contributor Author

/retest

@sanchezl
Copy link
Contributor Author

/test k8s-e2e-gcp
/test e2e-aws-operator

@openshift-merge-robot
Copy link
Contributor

openshift-merge-robot commented Dec 14, 2020

@sanchezl: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/k8s-e2e-gcp 974ddc0 link /test k8s-e2e-gcp

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

message: |
Deprecated API that will be removed in the next version is being used: {{"{{$labels.group}}"}}.{{"{{$labels.version}}"}}/{{"{{$labels.resource}}"}}
expr: |
group(apiserver_requested_deprecated_apis{removed_release="1.20"}) by (group,version,resource,subresource) * on(group,version,resource,subresource) group_right() apiserver_request_total
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

master is 1.20, right, so shouldn't this be removed_release="1.21" here? And then changed to 1.20 if/when this gets backported to 4.6?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wking, you are correct. We weren't on 1.20 yet when I originally started this.

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem with such alert is that it will never clear by itself. That's already been discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1793850 and #742 with the UsingDeprecatedAPIExtensionsV1Beta1 alert. Eventually this alert was removed in #764. Are we sure that we want this again? A test in origin e2e might be better suited?

cc @lilic @s-urbaniak

@lilic
Copy link
Contributor

lilic commented Dec 17, 2020

Agree with Simon, this was a huge pain for both our SREs and customers (info is in the Bugzilla linked by Simon). We need all alerts to be able to be resolvable by our customers, but our customers cannot control openshift operators only their own workload.

Adding an e2e test based on these metrics makes perfect sense. There was also talk about adding notifications in the console to warn users of deprecated APIs.

@deads2k
Copy link
Contributor

deads2k commented Jan 4, 2021

@lilic @simonpasquier How should a customer be warned that a deprecated API, which is going to be removed is being used in their cluster and will result in a disruption for their particular workload?

Not all workloads come in OCP payloads, but any workload using the deprecated APIs will be impacted when they are removed in a following release. I'm open to other solutions, but I'd like to know what that solution is and what we should do to manage that in 1.22 as the first round of APIs like CRD/v1beta1 are removed.

A test in origin e2e might be better suited?

An e2e test in origin does not help customers who are running code using these old APIs that is not in an OCP payload.

@lilic
Copy link
Contributor

lilic commented Jan 5, 2021

I think the console team has notifications either in place already or are planning on adding, those are more appropriate for this, I would suggest following up with them. Anything that does not page admins, but just shows notifications to users in the console is a better user experience in my opinion.

@sttts
Copy link
Contributor

sttts commented Jan 5, 2021

those are more appropriate for this

@lilic technically both are feasible obviously. Sounds like we are moving into product manager or architect territory, how to best present that information to users. Are we sure that with console notifications we won't get "nobody warned me about the broken workloads after upgrade. I never look into the UI." sort of complains?

To repeat what @deads2k said between the lines: we need this before 1.22 is upgraded to. So in 4.8 the latest. If console team (@jhadvig) is not finished with notifications like that (metrics based notification in console), this is going to be too late.

Instead of being too late and breaking customers, I would prefer a little less ideal user experience (to be decided that it actually is) of metric based alerts in 4.8.

cc @mcurry-rh @derekwaynecarr @jhadvig

@lilic
Copy link
Contributor

lilic commented Jan 6, 2021

Are we sure that with console notifications we won't get "nobody warned me about the broken workloads after upgrade. I never look into the UI." sort of complains?

Some of our users also do not use alertmanager and never configure it, so same can be said for alerts really.

In any case, I believe we discussed offline and sort of agreed on info level alert for this is fine if there are no other alternatives. And @simonpasquier provided with a better alerti2ng rule, so that users can actually resolve this alert. And that is should be just for non openshift workloads IIRC? cc @s-urbaniak

group(apiserver_requested_deprecated_apis{removed_release="1.20"}) by (group,version,resource,subresource) * on(group,version,resource,subresource) group_right() apiserver_request_total
for: 1h
labels:
severity: warning

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is barely actionable and neither urgent and shouldn't be higher severity than "info"

@s-urbaniak
Copy link
Contributor

I agree that we are leveraging alerting for use cases it has not been envisioned for and am +1 on using info level only for this kind of workaround as discussed OOB.

@RiRa12621
Copy link

@s-urbaniak can probably clarify but iirc all alerts are presented in the console. That should include info level.
So we'd get the same effect but no pages that are giving admins the noisy feelings.

Copy link
Contributor

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments.

- alert: DeprecratedAPIUsed
annotations:
message: |
Deprecated API that will be removed in the next version is being used: {{"{{$labels.group}}"}}.{{"{{$labels.version}}"}}/{{"{{$labels.resource}}"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We mentioned to also say how to fix this, e.g. "Remove the workload that is using this API, otherwise " anything that is helpful to end users, who receive this page.

Also is it not possible to get a namespace for this to display?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to:

Deprecated API that will be removed in the next version1 is being used. Removing the workload that is using the group.version/resource API might2 be necessary for a successful upgrade to the next cluster version.


1 by next version here we mean next y-stream version. Is there a more exact, customer-facing terminology to use here to make it clear that a z-stream version would not remove an API? For example, next minor version, but the term minor version is not well defined in the documentation.
2 I say might here because the deprecated API usage might be in the kube control plane itself, which of course would be updated in the next version to not use the deprecated APIs, but I can't easily filter that use in the metric/alert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is it not possible to get a namespace for this to display?
@lilic The namespace is not available.

Deprecated API that will be removed in the next version is being used: {{"{{$labels.group}}"}}.{{"{{$labels.version}}"}}/{{"{{$labels.resource}}"}}
expr: |
group(apiserver_requested_deprecated_apis{removed_release="1.21"}) by (group,version,resource,subresource) * on(group,version,resource,subresource) group_right() rate(apiserver_request_total[10m]) > 0
for: 1m
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why its 1minute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed typo.

@sanchezl
Copy link
Contributor Author

/retest

Copy link
Contributor

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks so much for addressing our comments!

@simonpasquier leaving up to Simon in case he has any more comments.

the {{"{{$labels.group}}"}}.{{"{{$labels.version}}"}}/{{"{{$labels.resource}}"}} API might be necessary for
a successful upgrade to the next cluster version.
expr: |
group(apiserver_requested_deprecated_apis{removed_release="1.21"}) by (group,version,resource,subresource) * on(group,version,resource,subresource) group_right() rate(apiserver_request_total[10m]) > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we make sure this version is updated each release?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to you can either edit for the next release so in 4.8 you would replace or add an or for 1.22, or use a regex matcher here that would match against multiple versions, see -> https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors

Copy link
Contributor

@sttts sttts Feb 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my question was more about how we remember this in practice. We could add CI test that compares the value with the running server version, in order to be reminded during rebase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is a good idea, or something along the lines of creating one of the deprecated APIs and evaluating if the alert is firing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sttts Added an e2e test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion to get rid of the on(...) group_right() part

Suggested change
group(apiserver_requested_deprecated_apis{removed_release="1.21"}) by (group,version,resource,subresource) * on(group,version,resource,subresource) group_right() rate(apiserver_request_total[10m]) > 0
group(apiserver_requested_deprecated_apis{removed_release="1.21"}) by (group,version,resource,subresource) and (sum by(group,version,resource,subresource) (rate(apiserver_request_total[10m]))) > 0

You can even remove the subresource label from the by() clauses since it's not used in the annotations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonpasquier Thank you. Updated.

message: >-
Deprecated API that will be removed in the next version is being used. Removing the workload that is using
the {{"{{$labels.group}}"}}.{{"{{$labels.version}}"}}/{{"{{$labels.resource}}"}} API might be necessary for
a successful upgrade to the next cluster version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to audit logs to find out the actor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@sanchezl sanchezl changed the title add DeprecatedAPIUsed alert add DeprecatedAPIInUse alert Feb 3, 2021
@simonpasquier
Copy link
Contributor

lgtm

"k8s.io/client-go/discovery"
)

func TestDeprecatedAPIInUse(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good test. It will break every kube bump. I find this acceptable since it won't block the kube bump.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 @sanchezl ignore my ask for a test that breaks elsewhere. This is it I guess.

a successful upgrade to the next cluster version. Refer to the audit logs to identify the workload.
expr: |
group(apiserver_requested_deprecated_apis{removed_release="1.21"}) by (group,version,resource) and (sum by(group,version,resource) (rate(apiserver_request_total[10m]))) > 0
for: 1h
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we use 4h, will that cover the entire timeframe of an e2e test? Is this the spot or the rate the spot? Probably here I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followed up at #1044.

@deads2k
Copy link
Contributor

deads2k commented Feb 11, 2021

/lgtm
/hold

holding so we can decide about the exact timeframe.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 11, 2021
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 11, 2021
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sanchezl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 11, 2021
@sanchezl
Copy link
Contributor Author

/retest

1 similar comment
@sanchezl
Copy link
Contributor Author

/retest

@deads2k
Copy link
Contributor

deads2k commented Feb 15, 2021

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 15, 2021
@sanchezl
Copy link
Contributor Author

/test e2e-aws-serial

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 249c250 into openshift:master Feb 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet