feat(kuma-cp) golden signals #1739

lobkovilya · 2021-03-31T14:59:06Z

Summary

This PR adds support for the four Golden Signals: latency, traffic, errors, and saturation.

New dashboard 'Kuma Service'

Latency

max(histogram_quantile(0.99, rate(envoy_cluster_upstream_rq_time_bucket{kuma_io_services=~".*$service.*",mesh="$mesh",envoy_cluster_name=~"localhost_.*"}[1m])))

Traffic

Incoming

sum(rate(envoy_cluster_upstream_rq_total{mesh="$mesh",kuma_io_services=~".*$service.*", envoy_cluster_name=~"localhost_.*"}[1m]))

Outgoing

sum(rate(envoy_cluster_upstream_rq_total{mesh="$mesh",kuma_io_services=~".*$service.*", envoy_cluster_name!~"localhost_.*", envoy_cluster_name!="kuma_envoy_admin"}[1m]))

Status Codes

Incoming

sum(rate(envoy_cluster_external_upstream_rq_xx{mesh="$mesh",kuma_io_services=~".*$service.*", envoy_cluster_name=~"localhost_.*"}[1m])) by (envoy_response_code_class)

Outgoing

sum(rate(envoy_cluster_external_upstream_rq_xx{mesh="$mesh",kuma_io_services=~".*$service.*", envoy_cluster_name!~"localhost_.*", envoy_cluster_name!="kuma_envoy_admin"}[1m])) by (envoy_response_code_class)

CPU

max(sum(rate(container_cpu_usage_seconds_total[1m])) by (namespace, pod) * on (namespace, pod) group_right(kuma_io_service) envoy_server_live{kuma_io_services=~".*$service.*"}) by (dataplane) /
max(sum(kube_pod_container_resource_limits_cpu_cores) by (namespace, pod) * on (namespace, pod) group_right(kuma_io_service) envoy_server_live{kuma_io_services=~".*$service.*"}) by (dataplane)

Memory Utilization

max(sum(container_memory_working_set_bytes{image!=""}) by (namespace, pod) * on (namespace, pod) group_right(kuma_io_service) envoy_server_live{kuma_io_services=~".*$service.*"}) by (dataplane)

Memory Saturation

max(sum(container_memory_working_set_bytes) by (namespace, pod) * on (namespace, pod) group_right(kuma_io_service) envoy_server_live{kuma_io_services=~".*$service.*"}) by (dataplane) / max(sum(kube_pod_container_resource_limits_memory_bytes) by (namespace, pod) * on (namespace, pod) group_right(kuma_io_service) envoy_server_live{kuma_io_services=~".*$service.*"}) by (dataplane)

New row on 'Kuma Mesh' dashboard

Latency

sum(histogram_quantile(0.99, rate(envoy_cluster_upstream_rq_time_bucket{mesh="$mesh",envoy_cluster_name=~"localhost_.*"}[1m]))) by (kuma_io_service)
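
For comparison, a common PromQL idiom for an aggregated percentile is to sum the bucket rates by `le` before calling `histogram_quantile`, rather than aggregating the quantiles afterwards. The following is only a sketch of that alternative, not the query the dashboard ships with; it assumes the same metric and labels as the query above.

```promql
# Sketch: per-service p99, aggregating the buckets before computing the quantile.
histogram_quantile(
  0.99,
  sum(rate(envoy_cluster_upstream_rq_time_bucket{mesh="$mesh",envoy_cluster_name=~"localhost_.*"}[1m])) by (le, kuma_io_service)
)
```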

Traffic

sum(rate(envoy_cluster_upstream_rq_total{mesh="$mesh",envoy_cluster_name=~"localhost_.*"}[1m])) by (kuma_io_service)

Status codes

sum(rate(envoy_cluster_external_upstream_rq_xx{mesh="$mesh", envoy_cluster_name=~"localhost_.*"}[1m])) by (kuma_io_service,envoy_response_code_class)

Full changelog

New charts
Change how the dashboards are packed into the ConfigMap, because of the annotation size limit

Issues resolved

Fix #XXX

Documentation

Link to the website documentation PR

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

jakubdyszkiewicz · 2021-03-31T16:22:14Z

Thanks for the screenshots, they make this easier to review.

Kuma Mesh dashboard

Can we have a separate chart for 5xx (maybe combined with 4xx)? As a Mesh operator, what I want to see when I open the mesh dashboard is whether the whole system is OK. If 2xx and 5xx from all the services are mixed together, it will be hard to tell whether there is a problem.
"Overview" -> "HTTP"?

Kuma Service

Fix the legend in the memory and CPU utilization charts
nit: incoming 2xx -> Incoming 2xx
I'd say drop the stacks in the Kubernetes row. I want to see which instance is taking the most CPU, and the same goes for memory. For memory saturation in particular, a stacked chart going over 100% is not great.
Second pane on the top, the heading of the dataplanes table: can it be named "Dataplanes" rather than "dataplane"?
General -> HTTP?
What if we look at an L4 service? Will we also have L4 metrics here?

lobkovilya · 2021-04-01T09:49:59Z

@jakubdyszkiewicz thank you for the review.

Kuma Mesh dashboard

Can we have a separate chart for 5xx (maybe combined with 4xx)? As a Mesh operator, what I want to see when I open the mesh dashboard is whether the whole system is OK. If 2xx and 5xx from all the services are mixed together, it will be hard to tell whether there is a problem.

I agree, but then I'd keep only the 5xx (and 4xx) chart, because successful status codes are already covered by the Traffic chart.
"Overview" -> "HTTP"?

Yeah, "HTTP" probably makes more sense, because "Latency" and "Status codes" work only for HTTP traffic. But if it's not "Overview", then I don't think it should be the first row of this dashboard. What do you think?

Kuma Service

What if we look at an L4 service? Will we also have L4 metrics here?

We will see everything except "Latency" and "Status codes". Is that okay?

jakubdyszkiewicz · 2021-04-01T15:55:59Z

I agree, but then I'd keep only the 5xx (and 4xx) chart, because successful status codes are already covered by the Traffic chart.

ok

Yeah, "HTTP" probably makes more sense, because "Latency" and "Status codes" work only for HTTP traffic. But if it's not "Overview", then I don't think it should be the first row of this dashboard. What do you think?

I think it's fine for this to be the first row.

We will see everything except "Latency" and "Status codes". Is that okay?

What is request/sec in this context? What will this show for Redis?

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

lobkovilya · 2021-04-05T08:54:34Z

Kuma Mesh dashboard

Create a separate chart, "Error Status Codes", to aggregate only 5xx and 4xx errors:

sum(rate(envoy_cluster_external_upstream_rq_xx{mesh="$mesh", envoy_cluster_name=~"localhost_.*", envoy_response_code_class=~"4|5"}[1m])) by (kuma_io_service,envoy_response_code_class)

Rename Overview to HTTP

Kuma Service

Fix the legend
Rename incoming to Incoming
Get rid of stacks
Rename dataplane to Dataplanes
Rename General to HTTP
If HTTP is treated as TCP, then every request/response is a new connection; I guess it works the same way with Redis.
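
For L4 services such as Redis, a connection-oriented panel could sit next to the request-rate one. A minimal sketch, assuming the sidecars also export Envoy's `envoy_cluster_upstream_cx_total` counter with the same `mesh` and `kuma_io_services` labels as the queries above (this query is not part of the PR):

```promql
# Sketch: new connections per second to the inbound (localhost_*) clusters of the selected service.
sum(rate(envoy_cluster_upstream_cx_total{mesh="$mesh",kuma_io_services=~".*$service.*",envoy_cluster_name=~"localhost_.*"}[1m]))
```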

lobkovilya added 2 commits March 31, 2021 21:38

feat(kuma-cp) add Golden Signals

9ea6edc

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) make check

398d74a

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

lobkovilya requested a review from a team as a code owner March 31, 2021 14:59

lobkovilya added 2 commits April 5, 2021 14:04

feat(kuma-cp) review

397384e

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) fix 'Error Status Codes'

55d2d10

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

nickolaev approved these changes Apr 5, 2021

feat(kuma-cp) fix 'Dataplanes' table

01f436c

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

jakubdyszkiewicz approved these changes Apr 6, 2021

lobkovilya merged commit c2a24fc into master Apr 6, 2021

lobkovilya deleted the feat/golden-signals branch April 6, 2021 10:36

lobkovilya added the backport-to-stable label Apr 7, 2021

mergify bot pushed a commit that referenced this pull request Apr 7, 2021

feat(kuma-cp) golden signals (#1739)

10f42c4

(cherry picked from commit c2a24fc)

mergify bot mentioned this pull request Apr 7, 2021

feat(kuma-cp) golden signals (bp #1739) #1775

Merged

nickolaev pushed a commit that referenced this pull request Apr 8, 2021

feat(kuma-cp) golden signals (#1739) (#1775)

ccf6996

(cherry picked from commit c2a24fc) Co-authored-by: Ilya Lobkov <ilya.lobkov@konghq.com>
