Gather rest client latency and error rate to be able to detect performance regressions #25375

Merged on Jan 21, 2022 (1 commit)

dgrisonnet
Member

Collect the latencies and error rates of the proxies in front of the apiserver so that we can detect regressions caused by a particular network proxy. The proxy can be a load balancer, a service, or direct access via the pod.

The actual queries are based on openshift/cluster-kube-apiserver-operator#1272 and look like the following if they were not one-lined:

rest client 99th percentile of request latency:

sum by(type) (
  histogram_quantile(0.99, sum(rate(
    label_replace(rest_client_request_duration_seconds_bucket{verb="GET",url=~"https?://api-int.*"},"type","load_balancer","","")[1h:30s]
  )) by (le,type))
  or
  histogram_quantile(0.99, sum(rate(
    label_replace(rest_client_request_duration_seconds_bucket{verb="GET",url!~"https?://(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"},"type","service","","")[1h:30s]
  )) by (le,type))
  or
  histogram_quantile(0.99, sum(rate(
    label_replace(rest_client_request_duration_seconds_bucket{verb="GET",url=~"https?://(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"},"type","pod","","")[1h:30s]
  )) by (le,type))
)
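
For readers less familiar with how `histogram_quantile` turns cumulative buckets into a percentile, here is a minimal Python sketch of the linear interpolation it performs. It is a simplified illustration rather than Prometheus's implementation: the function name, bucket bounds, and counts are made up for the example, and edge cases such as non-monotonic buckets are ignored.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket, like the `le` label of a
    *_bucket series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the
                # highest finite upper bound instead.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical buckets: 90 requests under 0.1s, 98 under 0.5s, 100 under 1s.
example = [(0.1, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.99, example))
```

With these made-up counts the 99th request lands halfway through the 0.5s to 1s bucket, so the estimate is about 0.75s. The result is only as precise as the bucket boundaries allow.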

rest client average request latency:

sum by(type) (
  label_replace(
    sum(rate(rest_client_request_duration_seconds_sum{verb="GET",url=~"https?://api-int.*"}[1h:30s]))
    /
    sum(rate(rest_client_request_duration_seconds_count{verb="GET",url=~"https?://api-int.*"}[1h:30s]))
  ,"type","load_balancer","","")
  or
  label_replace(
    sum(rate(rest_client_request_duration_seconds_sum{verb="GET",url!~"https?://(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[1h:30s]))
    /
    sum(rate(rest_client_request_duration_seconds_count{verb="GET",url!~"https?://(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[1h:30s]))
  ,"type","service","","")
  or
  label_replace(
    sum(rate(rest_client_request_duration_seconds_sum{verb="GET",url=~"https?://(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[1h:30s]))
    /
    sum(rate(rest_client_request_duration_seconds_count{verb="GET",url=~"https?://(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[1h:30s]))
  ,"type","pod","","")
)
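
The average is the rate of the `_sum` series divided by the rate of the `_count` series. Since both rates cover the same window, the window length cancels out and the result is the mean duration of the requests observed in that window. A small Python sketch of that arithmetic, assuming no counter reset happened in between; the function name and sample values are hypothetical:

```python
def average_latency(sum_start, sum_end, count_start, count_end):
    """Mean request duration between two samples of the cumulative
    *_sum and *_count series (assumes no counter reset in between)."""
    requests = count_end - count_start
    if requests <= 0:
        # No requests in the window: the average is undefined.
        return float("nan")
    return (sum_end - sum_start) / requests


# Hypothetical samples: 200 requests took 30 seconds in total.
print(average_latency(120.0, 150.0, 1000, 1200))
```

In this made-up window the average comes out at 0.15s per request.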

rest client error rate:

sum by(type) (
  label_replace(
    sum(rate(rest_client_requests_total{code="<error>",host=~"api-int.*"}[5m]))
    /
    sum(rate(rest_client_requests_total{host=~"api-int.*"}[5m]))
  ,"type","load_balancer","","")
  or
  label_replace(
    sum(rate(rest_client_requests_total{code="<error>",host!~"(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
    /
    sum(rate(rest_client_requests_total{host!~"(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
  ,"type","service","","")
  or
  label_replace(
    sum(rate(rest_client_requests_total{code="<error>",host=~"(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
    /
    sum(rate(rest_client_requests_total{host=~"(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
  ,"type","pod","","")
)

Note that all the graphs resulting from these queries will have 3 entries, one for each type of proxy.
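
To make the three categories concrete, here is a hedged Python sketch of how the URL matchers in the latency queries partition requests. Prometheus anchors label regexes at both ends, which `re.fullmatch` mimics here. The function name and the sample URLs are hypothetical and only serve the illustration.

```python
import re

# Same patterns as the url matchers above, with the PromQL string
# escaping ("\\[") reduced to plain regex escaping ("\[").
LOCAL = r"(\[::1\]|127\.0\.0\.1|localhost)"
LOAD_BALANCER_RE = re.compile(r"https?://api-int.*")
POD_RE = re.compile(r"https?://" + LOCAL + r".*")


def proxy_type(url):
    """Return the value the `type` label would take for this URL."""
    if LOAD_BALANCER_RE.fullmatch(url):
        return "load_balancer"
    if POD_RE.fullmatch(url):
        return "pod"
    # Everything else is matched by the negated (!~) selector.
    return "service"


# Hypothetical URLs, one per category.
for url in (
    "https://api-int.example.com:6443/api/v1/pods",
    "https://[::1]:6443/healthz",
    "https://172.30.0.1:443/api/v1/nodes",
):
    print(url, "->", proxy_type(url))
```

Because the "service" selector is the negation of the other two, every series falls into exactly one category.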

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
@dgrisonnet
Member Author

/cc @deads2k @tkashem @aojea

@openshift-ci
Contributor

openshift-ci bot commented Jan 18, 2022

@dgrisonnet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/rehearse/stackrox/rox-openshift-ci-mirror/master/e2e-test | 9fe8b51 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/okd-machine-os/release-4.7/e2e-ovirt | 9fe8b51 | link | unknown | /test pj-rehearse |
| ci/prow/pj-rehearse | 9fe8b51 | link | false | /test pj-rehearse |
| ci/rehearse/tnozicka/openshift-acme/master/e2e-cluster-wide | 9fe8b51 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/cluster-logging-operator/tech-preview/e2e-operator | 9fe8b51 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/okd-machine-os/release-4.9/e2e-ovirt | 9fe8b51 | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/cluster-capi-operator/release-4.11/e2e-aws-capi-techpreview | 9fe8b51 | link | unknown | /test pj-rehearse |
| ci/rehearse/operator-framework/operator-marketplace/release-4.9/e2e-aws-serial | 9fe8b51 | link | unknown | /test pj-rehearse |

@deads2k
Contributor

deads2k commented Jan 21, 2022

/approve

@openshift-ci openshift-ci bot added the approved label Jan 21, 2022
@tkashem
Contributor

tkashem commented Jan 21, 2022

LGTM

giving an opportunity for @aojea to review

@aojea
Contributor

aojea commented Jan 21, 2022

@dgrisonnet I ended up changing the regex so that it captures only the host instead of the host plus path:

              sum(rate(
                label_replace(rest_client_request_duration_seconds_bucket{verb="GET",url=~"https?://api-int.*"},
                  "host","$1","url","https?://([^/\\s]+).*")[5m:30s]
                )) by (le,host,service,namespace,node)
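
As a rough illustration of what that `label_replace` capture does, here is a small Python sketch applying the same pattern to a URL. The function name and the sample URL are hypothetical; `re.fullmatch` stands in for the fully anchored matching that Prometheus applies.

```python
import re

# Same pattern as in the label_replace call, with the PromQL string
# escaping ("\\s") reduced to plain regex escaping ("\s").
HOST_RE = re.compile(r"https?://([^/\s]+).*")


def extract_host(url):
    """Return the host[:port] part of the URL, or None when the pattern
    does not match (label_replace leaves the series unchanged then)."""
    match = HOST_RE.fullmatch(url)
    return match.group(1) if match else None


# Hypothetical URL: the path is dropped, host and port are kept.
print(extract_host("https://api-int.example.com:6443/api/v1/nodes"))
```

The capture group stops at the first `/`, so the port stays part of the extracted host while the request path is discarded.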

@dgrisonnet
Member Author

It doesn't affect this PR since I am not relabeling the URL label to extract the host. Here, I am only exposing higher-level type information that takes three values (load_balancer, service, and pod) to avoid consuming too many resources.

@aojea
Contributor

aojea commented Jan 21, 2022

> It doesn't affect this PR since I am not relabeling the URL label to extract the host. Here, I am only exposing higher-level type information that takes three values (load_balancer, service, and pod) to avoid consuming too many resources.

you are the expert here
/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 21, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jan 21, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, deads2k, dgrisonnet

@openshift-merge-robot openshift-merge-robot merged commit 940113c into openshift:master Jan 21, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jan 21, 2022

@dgrisonnet: Updated the step-registry configmap in namespace ci at cluster app.ci using the following files:

  • key gather-extra-commands.sh using file ci-operator/step-registry/gather/extra/gather-extra-commands.sh
