Gather rest client latency and error rate to be able to detect performance regressions #25375

dgrisonnet · 2022-01-18T11:19:48Z

Collect the latencies and error rates of the proxies in front of the apiserver so that a regression can be traced to a particular network proxy. The proxy can be a load balancer, a service, or direct access via the pod.

The actual queries are based on openshift/cluster-kube-apiserver-operator#1272 and would look like the following if they were not written on a single line:

rest client 99th percentile of request latency:

sum by(type) (
  histogram_quantile(0.99, sum(rate(
    label_replace(rest_client_request_duration_seconds_bucket{verb="GET",url=~"https?://api-int.*"},"type","load_balancer","","")[1h:30s]
  )) by (le,type))
  or
  histogram_quantile(0.99, sum(rate(
    label_replace(rest_client_request_duration_seconds_bucket{verb="GET",url!~"https?://(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"},"type","service","","")[1h:30s]
  )) by (le,type))
  or
  histogram_quantile(0.99, sum(rate(
    label_replace(rest_client_request_duration_seconds_bucket{verb="GET",url=~"https?://(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"},"type","pod","","")[1h:30s]
  )) by (le,type))
)
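For readers unfamiliar with how a 99th percentile is derived from `_bucket` series, here is a minimal Python sketch of the interpolation idea behind `histogram_quantile`. It is a simplification, not Prometheus's implementation: it assumes `0 < q <= 1`, at least one finite bucket, buckets sorted by upper bound and ending with `+Inf`, and non-decreasing cumulative counts. The function name and the sample buckets are illustrative only.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes 0 < q <= 1, sorted buckets ending with +Inf,
    and non-decreasing cumulative counts.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the requested rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[idx - 1][0]
    # The lowest bucket is assumed to start at 0.
    lower, prev = (0.0, 0) if idx == 0 else buckets[idx - 1]
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Hypothetical cumulative buckets: (upper bound in seconds, request count).
example = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, example)
```

In the query above the bucket values are per-second rates rather than raw counts, but the interpolation works the same way because only the ratios between buckets matter.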

rest client average request latency:

sum by(type) (
  label_replace(
    sum(rate(rest_client_request_duration_seconds_sum{verb="GET",url=~"https?://api-int.*"}[1h:30s]))
    /
    sum(rate(rest_client_request_duration_seconds_count{verb="GET",url=~"https?://api-int.*"}[1h:30s]))
  ,"type","load_balancer","","")
  or
  label_replace(
    sum(rate(rest_client_request_duration_seconds_sum{verb="GET",url!~"https?://(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[1h:30s]))
    /
    sum(rate(rest_client_request_duration_seconds_count{verb="GET",url!~"https?://(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[1h:30s]))
  ,"type","service","","")
  or
  label_replace(
    sum(rate(rest_client_request_duration_seconds_sum{verb="GET",url=~"https?://(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[1h:30s]))
    /
    sum(rate(rest_client_request_duration_seconds_count{verb="GET",url=~"https?://(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[1h:30s]))
  ,"type","pod","","")
)
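The average is the rate of `_sum` divided by the rate of `_count`. Because both rates cover the same window, the window length cancels out and the result is simply the added duration divided by the added number of requests. A small Python sketch of that arithmetic, with hypothetical sample values and ignoring counter resets:

```python
def average_latency(sum_start, sum_end, count_start, count_end):
    """Mean request duration over a window, from two samples of each counter.

    sum_* are samples of the cumulative duration counter (seconds),
    count_* are samples of the cumulative request counter.
    Counter resets are not handled in this sketch.
    """
    requests = count_end - count_start
    if requests <= 0:
        return None  # no traffic observed in the window
    return (sum_end - sum_start) / requests


# Hypothetical samples: 30 requests took 6 seconds in total.
avg = average_latency(12.0, 18.0, 100, 130)
```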

rest client error rate:

sum by(type) (
  label_replace(
    sum(rate(rest_client_requests_total{code="<error>",host=~"api-int.*"}[5m]))
    /
    sum(rate(rest_client_requests_total{host=~"api-int.*"}[5m]))
  ,"type","load_balancer","","")
  or
  label_replace(
    sum(rate(rest_client_requests_total{code="<error>",host!~"(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
    /
    sum(rate(rest_client_requests_total{host!~"(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
  ,"type","service","","")
  or
  label_replace(
    sum(rate(rest_client_requests_total{code="<error>",host=~"(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
    /
    sum(rate(rest_client_requests_total{host=~"(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
  ,"type","pod","","")
)

Note that all the graphs resulting from these queries will have 3 entries, one for each type of proxy.
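To make the three types concrete, here is a rough Python sketch of how a request URL maps to a type under the `url` matchers used in the 99th percentile query. PromQL regex matchers are fully anchored, so `re.fullmatch` is the closest analogue. The function name and the example hostnames are made up for illustration.

```python
import re

# Same patterns as the url matchers, without PromQL's extra string escaping.
LOAD_BALANCER = re.compile(r"https?://api-int.*")
POD = re.compile(r"https?://(\[::1\]|127\.0\.0\.1|localhost).*")


def proxy_type(url):
    """Return the type label a request URL would fall under."""
    # PromQL matchers must match the whole label value, hence fullmatch.
    if LOAD_BALANCER.fullmatch(url):
        return "load_balancer"
    if POD.fullmatch(url):
        return "pod"
    # Everything that is neither api-int nor loopback counts as a service,
    # which mirrors the negative (!~) matcher.
    return "service"


# Hypothetical URLs, one per type.
examples = {
    "https://api-int.example.test:6443/api/v1/pods": proxy_type(
        "https://api-int.example.test:6443/api/v1/pods"),
    "https://[::1]:6443/healthz": proxy_type("https://[::1]:6443/healthz"),
    "https://172.30.0.1:443/api": proxy_type("https://172.30.0.1:443/api"),
}
```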

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

dgrisonnet · 2022-01-18T11:20:23Z

/cc @deads2k @tkashem @aojea

openshift-ci · 2022-01-18T13:39:11Z

@dgrisonnet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/rehearse/stackrox/rox-openshift-ci-mirror/master/e2e-test | `9fe8b51` | link | unknown | `/test pj-rehearse` |
| ci/rehearse/openshift/okd-machine-os/release-4.7/e2e-ovirt | `9fe8b51` | link | unknown | `/test pj-rehearse` |
| ci/prow/pj-rehearse | `9fe8b51` | link | false | `/test pj-rehearse` |
| ci/rehearse/tnozicka/openshift-acme/master/e2e-cluster-wide | `9fe8b51` | link | unknown | `/test pj-rehearse` |
| ci/rehearse/openshift/cluster-logging-operator/tech-preview/e2e-operator | `9fe8b51` | link | unknown | `/test pj-rehearse` |
| ci/rehearse/openshift/okd-machine-os/release-4.9/e2e-ovirt | `9fe8b51` | link | unknown | `/test pj-rehearse` |
| ci/rehearse/openshift/cluster-capi-operator/release-4.11/e2e-aws-capi-techpreview | `9fe8b51` | link | unknown | `/test pj-rehearse` |
| ci/rehearse/operator-framework/operator-marketplace/release-4.9/e2e-aws-serial | `9fe8b51` | link | unknown | `/test pj-rehearse` |

deads2k · 2022-01-21T16:37:34Z

/approve

tkashem · 2022-01-21T17:29:44Z

LGTM

giving an opportunity for @aojea to review

aojea · 2022-01-21T17:44:54Z

@dgrisonnet I ended up changing the regex so that it extracts only the host instead of the host and path:

              sum(rate(
                label_replace(rest_client_request_duration_seconds_bucket{verb="GET",url=~"https?://api-int.*"},
                  "host","$1","url","https?://([^/\\s]+).*")[5m:30s]
                )) by (le,host,service,namespace,node)
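As a rough illustration of what that `label_replace` does, the following Python sketch applies the same pattern to a URL and returns the first capture group, which is the host (including any port) without the path. The helper name and the example URL are hypothetical. When the pattern does not match, `label_replace` leaves the series unchanged, which the sketch represents by returning `None`.

```python
import re

# Same pattern as in the label_replace call, without PromQL string escaping.
HOST_PATTERN = re.compile(r"https?://([^/\s]+).*")


def extract_host(url):
    """Return the host part of a URL, or None if the pattern does not match."""
    # label_replace anchors the regex to the full label value.
    match = HOST_PATTERN.fullmatch(url)
    return match.group(1) if match else None


# Hypothetical URL: the path is dropped, the port is kept.
host = extract_host("https://api-int.example.test:6443/api/v1/namespaces")
```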

dgrisonnet · 2022-01-21T18:05:33Z

It doesn't affect this PR since I am not relabeling the URL label to extract the host. Here, I am only exposing higher-level type information that takes three values (load_balancer, service, and pod) to avoid consuming too many resources.

aojea · 2022-01-21T18:09:23Z

> It doesn't affect this PR since I am not relabeling the URL label to extract the host. Here, I am only exposing a higher-level type information which takes three values: load_balancer, service, and pod to avoid consuming too much resource

you are the expert here
/lgtm

openshift-ci · 2022-01-21T18:10:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, deads2k, dgrisonnet

Needs approval from an approver in each of these files:

~~ci-operator/step-registry/gather/OWNERS~~ [deads2k]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2022-01-21T18:21:33Z

@dgrisonnet: Updated the step-registry configmap in namespace ci at cluster app.ci using the following files:

key gather-extra-commands.sh using file ci-operator/step-registry/gather/extra/gather-extra-commands.sh

gather/extra: expose client latency and error rate

9fe8b51

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>

openshift-ci bot requested review from aojea, deads2k, tkashem, jmguzik and smg247 January 18, 2022 11:20

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 21, 2022

openshift-ci bot assigned aojea Jan 21, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 21, 2022

openshift-merge-robot merged commit 940113c into openshift:master Jan 21, 2022
