MON-2727: Adds telemeter alert TelemeterClientFailures #1803

Merged
openshift-merge-robot merged 1 commit into openshift:master on Oct 28, 2022

JoaoBraveCoding
Contributor

@JoaoBraveCoding JoaoBraveCoding commented Oct 20, 2022

Issue: https://issues.redhat.com/browse/MON-2727

Problem: in-cluster admins and folks monitoring submitted Insights should have a way to figure out that the cluster is trying and failing to submit Telemetry.

Solution: add an alert that fires when the rate of failed requests reaches 20% of the total request rate over a 15-minute window.
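
As a rough illustration of the condition described above, the check comes down to comparing a failure ratio against a 20% threshold. The sketch below uses Python purely for illustration. The function names, the choice of `>=` for "reaches", and the handling of zero traffic are assumptions, not the rule added by this PR.

```python
from typing import Optional

# 20% threshold taken from the PR description.
FAILURE_RATIO_THRESHOLD = 0.20


def failure_ratio(failed_rate: float, total_rate: float) -> Optional[float]:
    """Return failed/total, or None when no requests were made at all."""
    if total_rate <= 0:
        # Assumption: with no traffic there is nothing to alert on.
        return None
    return failed_rate / total_rate


def should_alert(failed_rate: float, total_rate: float,
                 threshold: float = FAILURE_RATIO_THRESHOLD) -> bool:
    """Hypothetical check: True when the failure ratio reaches the threshold."""
    ratio = failure_ratio(failed_rate, total_rate)
    return ratio is not None and ratio >= threshold
```

Because failed requests are a subset of all requests, the ratio can never exceed 1, which is the value seen when every request fails.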

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 20, 2022
@JoaoBraveCoding
Contributor Author

/skip

@JoaoBraveCoding JoaoBraveCoding changed the title from "Adds telemeter alert TelemeterClientFailures" to "MON-2727: Adds telemeter alert TelemeterClientFailures" Oct 26, 2022
Contributor

@simonpasquier simonpasquier left a comment

/lgtm

@simonpasquier
Contributor

/label px-approved
/label docs-approved

For QE approval
/assign @juzhao

@openshift-ci openshift-ci bot added px-approved Signifies that Product Support has signed off on this PR docs-approved Signifies that Docs has signed off on this PR labels Oct 26, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 26, 2022
@openshift-ci
Contributor

openshift-ci bot commented Oct 26, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoaoBraveCoding, simonpasquier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [JoaoBraveCoding,simonpasquier]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@juzhao

juzhao commented Oct 28, 2022

Checked with the PR using an unauthorized token. Since all metrics are pushed in one request, the maximum value of the TelemeterClientFailures expression is 1.

# oc logs -n openshift-monitoring deployment.apps/telemeter-client -c telemeter-client
...
level=error caller=forwarder.go:276 ts=2022-10-28T04:03:07.684944851Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 404:\nnot found\n"
level=error caller=forwarder.go:276 ts=2022-10-28T04:04:07.752467672Z component=forwarder/worker msg="unable to forward results" err="unable to authorize to server: unable to exchange initial token for a long lived token: 404:\nnot found\n"

and

# token=`oc create token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=sum by (namespace) (rate(federate_requests_failed_total{job="telemeter-client"}[15m])) / sum by (namespace) (rate(federate_requests_total{job="telemeter-client"}[15m]))' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "namespace": "openshift-monitoring"
        },
        "value": [
          1666929885.827,
          "1"
        ]
      }
    ]
  }
}
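
For anyone repeating this check, the JSON above has the shape of a Prometheus instant-query response, so the ratio can be extracted programmatically. Below is a minimal Python sketch that assumes exactly that response shape; `ratios_by_namespace` is a hypothetical helper name, and the embedded response is copied from the output above.

```python
import json
from typing import Dict

# Response copied from the query output above.
RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {"namespace": "openshift-monitoring"},
        "value": [1666929885.827, "1"]
      }
    ]
  }
}
"""


def ratios_by_namespace(response_text: str) -> Dict[str, float]:
    """Map each namespace label to its failure ratio as a float."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    ratios = {}
    for sample in payload["data"]["result"]:
        namespace = sample["metric"].get("namespace", "")
        # Each instant-vector sample is a [timestamp, "value-as-string"] pair.
        _timestamp, value = sample["value"]
        ratios[namespace] = float(value)
    return ratios


if __name__ == "__main__":
    print(ratios_by_namespace(RESPONSE))
```

With the response above, the helper maps `openshift-monitoring` to `1.0`, meaning every request in the window failed.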

@juzhao

juzhao commented Oct 28, 2022

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Oct 28, 2022
@openshift-ci
Contributor

openshift-ci bot commented Oct 28, 2022

@JoaoBraveCoding: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/versions | f7a2ab1 | link | false | /test versions |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jan--f
Contributor

jan--f commented Oct 28, 2022

/retest

@openshift-merge-robot openshift-merge-robot merged commit 372eaa3 into openshift:master Oct 28, 2022