feat(kuma-cp) golden signals #1739

Merged
merged 5 commits into from
Apr 6, 2021

lobkovilya
Contributor

@lobkovilya lobkovilya commented Mar 31, 2021

Summary

This PR adds support for the golden signals: latency, traffic, errors, and saturation.

New dashboard 'Kuma Service'

image

  • Latency
max(histogram_quantile(0.99, rate(envoy_cluster_upstream_rq_time_bucket{kuma_io_services=~".*$service.*",mesh="$mesh",envoy_cluster_name=~"localhost_.*"}[1m])))
  • Traffic

    • Incoming
    sum(rate(envoy_cluster_upstream_rq_total{mesh="$mesh",kuma_io_services=~".*$service.*", envoy_cluster_name=~"localhost_.*"}[1m]))
    
    • Outgoing
    sum(rate(envoy_cluster_upstream_rq_total{mesh="$mesh",kuma_io_services=~".*$service.*", envoy_cluster_name!~"localhost_.*", envoy_cluster_name!="kuma_envoy_admin"}[1m]))
    
  • Status Codes

    • Incoming
    sum(rate(envoy_cluster_external_upstream_rq_xx{mesh="$mesh",kuma_io_services=~".*$service.*", envoy_cluster_name=~"localhost_.*"}[1m])) by (envoy_response_code_class)
    
    • Outgoing
    sum(rate(envoy_cluster_external_upstream_rq_xx{mesh="$mesh",kuma_io_services=~".*$service.*", envoy_cluster_name!~"localhost_.*", envoy_cluster_name!="kuma_envoy_admin"}[1m])) by (envoy_response_code_class)
    
  • CPU

max(sum(rate(container_cpu_usage_seconds_total[1m])) by (namespace, pod) * on (namespace, pod) group_right(kuma_io_service) envoy_server_live{kuma_io_services=~".*$service.*"}) by (dataplane) /
max(sum(kube_pod_container_resource_limits_cpu_cores) by (namespace, pod) * on (namespace, pod) group_right(kuma_io_service) envoy_server_live{kuma_io_services=~".*$service.*"}) by (dataplane)
  • Memory Utilization
max(sum(container_memory_working_set_bytes{image!=""}) by (namespace, pod) * on (namespace, pod) group_right(kuma_io_service) envoy_server_live{kuma_io_services=~".*$service.*"}) by (dataplane)
  • Memory Saturation
max(sum(container_memory_working_set_bytes) by (namespace, pod) * on (namespace, pod) group_right(kuma_io_service) envoy_server_live{kuma_io_services=~".*$service.*"}) by (dataplane) / max(sum(kube_pod_container_resource_limits_memory_bytes) by (namespace, pod) * on (namespace, pod) group_right(kuma_io_service) envoy_server_live{kuma_io_services=~".*$service.*"}) by (dataplane)
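For readers unfamiliar with how the Latency panel arrives at a p99: `histogram_quantile` works on cumulative `le` buckets and interpolates linearly inside the bucket that contains the target rank. The Python sketch below mimics that estimate for illustration only. It is not Prometheus' implementation, and the bucket bounds and counts are made-up values (assumed to be milliseconds), not data from this PR.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified illustration: assumes 0 < q <= 1 and a lower bound of 0
    for the first bucket.
    """
    assert 0 < q <= 1
    buckets = sorted(buckets)          # ascending by upper bound, +Inf last
    total = buckets[-1][1]             # the +Inf bucket holds every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls past the last finite bucket: report its bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative request-time buckets: (le, count).
sample = [(5, 10), (10, 60), (25, 90), (50, 99), (100, 100), (math.inf, 100)]
print("p50:", histogram_quantile(0.50, sample))
print("p99:", histogram_quantile(0.99, sample))
```

With these sample buckets the p99 estimate lands at the upper edge of the 25 to 50 bucket, which shows how strongly the result depends on where the bucket boundaries sit.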

New row on 'Kuma Mesh' dashboard

image

  • Latency
sum(histogram_quantile(0.99, rate(envoy_cluster_upstream_rq_time_bucket{mesh="$mesh",envoy_cluster_name=~"localhost_.*"}[1m]))) by (kuma_io_service)
  • Traffic
sum(rate(envoy_cluster_upstream_rq_total{mesh="$mesh",envoy_cluster_name=~"localhost_.*"}[1m])) by (kuma_io_service)
  • Status codes
sum(rate(envoy_cluster_external_upstream_rq_xx{mesh="$mesh", envoy_cluster_name=~"localhost_.*"}[1m])) by (kuma_io_service,envoy_response_code_class)

Full changelog

  • New charts
  • Change how the dashboards are packed into the ConfigMap, because of the annotation size limit
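The changelog does not say which annotation is involved. A common cause, assumed here, is that `kubectl apply` stores the whole manifest in the `kubectl.kubernetes.io/last-applied-configuration` annotation, while Kubernetes caps the combined size of an object's annotations at 256 KiB. A minimal Python sketch of a pre-flight check under that assumption (the function names and sample manifests are hypothetical and not part of this PR):

```python
import json

# Kubernetes limits the combined size of all annotations on one object to 256 KiB.
TOTAL_ANNOTATION_LIMIT_BYTES = 256 * 1024


def fits_in_annotation(manifest: dict) -> bool:
    """Return True if the serialized manifest alone stays within the limit."""
    encoded = json.dumps(manifest, separators=(",", ":")).encode("utf-8")
    return len(encoded) <= TOTAL_ANNOTATION_LIMIT_BYTES


def configmap(name: str, files: dict) -> dict:
    """Build a bare ConfigMap manifest holding dashboard files (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": files,
    }


small = configmap("dashboards-small", {"kuma-mesh.json": "{}"})
# Stand-in for a large dashboard JSON: 300 KiB of filler text.
large = configmap("dashboards-large", {"kuma-service.json": "x" * (300 * 1024)})

print(fits_in_annotation(small), fits_in_annotation(large))
```

The check is approximate, since annotation keys and any other annotations on the object count toward the same limit.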

Issues resolved

Fix #XXX

Documentation

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
@lobkovilya lobkovilya requested a review from a team as a code owner March 31, 2021 14:59
@jakubdyszkiewicz
Contributor

Thanks for the screenshots, they make this much easier to review.

Kuma Mesh dashboard

  • Can we have a separate chart for 5xx (maybe combined with 4xx)? As a mesh operator, what I want to see when I open the mesh dashboard is whether the whole system is OK. If 2xx and 5xx from all the services are mixed together, it will be hard to tell whether there is a problem.
  • "Overview" -> "HTTP"?

Kuma Service

  • Fix the legend in memory and CPU utilization
  • nit: incoming 2xx -> Incoming 2xx
  • I'd say drop the stacking in the Kubernetes row. I want to see which instance is taking the most CPU, and the same for memory. Especially for memory saturation, a stacked chart going over 100% is not great.
  • Second pane at the top, the dataplanes heading: can it be named "Dataplanes" instead of "dataplane"?
  • General -> HTTP?
  • What if we look at an L4 service? Will we also have L4 metrics here?

@lobkovilya
Contributor Author

@jakubdyszkiewicz thank you for the review.

Kuma Mesh dashboard

  • Can we have a separate chart for 5xx (maybe combined with 4xx)? As a mesh operator, what I want to see when I open the mesh dashboard is whether the whole system is OK. If 2xx and 5xx from all the services are mixed together, it will be hard to tell whether there is a problem.

    I agree, but then I'd keep only the 5xx (and 4xx) chart, because successful status codes are covered by the Traffic chart.

  • "Overview" -> "HTTP"?

    Yeah, probably "HTTP" makes more sense because "Latency" and "Status codes" work only for HTTP traffic. But if it's not "Overview" then I don't think it should be the first row of this dashboard. What do you think?

    image

Kuma Service

  • What if we look at an L4 service? Will we also have L4 metrics here?

    We will see everything except "Latency" and "Status codes". Is that okay?

    image

@jakubdyszkiewicz
Contributor

I agree, but then I'd keep only the 5xx (and 4xx) chart, because successful status codes are covered by the Traffic chart.

ok

Yeah, probably "HTTP" makes more sense because "Latency" and "Status codes" work only for HTTP traffic. But if it's not "Overview" then I don't think it should be the first row of this dashboard. What do you think?

I think it's fine for this to be the first row

We will see everything except "Latency" and "Status codes". Is that okay?

What is request/sec in this context? What will this show for Redis?

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
@lobkovilya
Contributor Author

Kuma Mesh dashboard

  • Created a separate "Errors Status Codes" chart that aggregates only 5xx and 4xx errors:

    sum(rate(envoy_cluster_external_upstream_rq_xx{mesh="$mesh", envoy_cluster_name=~"localhost_.*", envoy_response_code_class=~"4|5"}[1m])) by (kuma_io_service,envoy_response_code_class)
    

    image

  • Renamed Overview to HTTP
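A note on the `envoy_response_code_class=~"4|5"` matcher in the error chart query: Prometheus regex matchers are fully anchored, so the pattern has to match the entire label value. That is also why the service filters in this PR wrap the variable as `.*$service.*`. The Python sketch below imitates the anchoring with `re.fullmatch`. Python's `re` is not the RE2 engine Prometheus uses, but the two agree on patterns this simple. The label values are hypothetical examples, not taken from a real mesh.

```python
import re


def label_matches(pattern: str, value: str) -> bool:
    """Imitate a Prometheus `=~` matcher: the pattern must match the whole value."""
    return re.fullmatch(pattern, value) is not None


# Response code classes: only exactly "4" or "5" pass the error filter.
for code_class in ["2", "4", "5", "45"]:
    print(code_class, label_matches("4|5", code_class))

# Substring-style service filter, as in kuma_io_services=~".*$service.*".
# Hypothetical label values; the real format may differ.
for services in ["backend", "frontend,backend", "backend-v2", "redis"]:
    print(services, label_matches(".*backend.*", services))
```

One consequence of the substring form is that selecting a service whose name is contained in another service's name (here `backend` and `backend-v2`) matches both.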

Kuma Service

  • Fixed legend:

    image

  • Renamed incoming to Incoming

    image

  • Removed the stacking

  • Renamed dataplane to Dataplanes:

    image

  • Renamed General to HTTP

  • If HTTP is treated as TCP, then every request/response is a new connection. I guess it works the same way with Redis.

@lobkovilya lobkovilya merged commit c2a24fc into master Apr 6, 2021
@lobkovilya lobkovilya deleted the feat/golden-signals branch April 6, 2021 10:36
mergify bot pushed a commit that referenced this pull request Apr 7, 2021
(cherry picked from commit c2a24fc)
nickolaev pushed a commit that referenced this pull request Apr 8, 2021
(cherry picked from commit c2a24fc)

Co-authored-by: Ilya Lobkov <ilya.lobkov@konghq.com>
3 participants