feat(kuma-cp): config delivery metrics #3932

Merged
merged 1 commit into from
Mar 2, 2022

Conversation

jakubdyszkiewicz

Summary

Add config delivery metrics.

So far we only had a metric called xds_generation, which measures how long it took us to generate the config, including:

  • building MeshContext (it may be cached)
  • building Proxy object
  • going through ProxyGenerators
  • applying ProxyTemplates modifications, hooks, etc.
  • versioning, validation, and setting the snapshot in the SnapshotCache

Past that point, we don't know how long it took the CP to actually send the config to Envoy, or how long it took Envoy to apply it.

Ideally, we would have a metric that measures the time from applying a policy (via kubectl or API) to the moment Envoy starts to respect the changed config. This is hard in a distributed environment with two different systems (Kuma CP / Envoy). Because of clock skew, the measurement would need to happen on one machine. That would require an application that periodically:

  • Applies some real policy on the CP
  • Polls the Envoy stats for changes
  • Reports the metric

Because it would need to be a separate deployment that operates on real policies, I think it would be very hard to embed such a thing in Kuma by default for users and customers.

This PR introduces an alternative approach with a new metric called xds_delivery. It measures the time from the moment the config is set in the SnapshotCache (scheduled for delivery) to the moment we receive and process the ACK/NACK from Envoy.

This metric can help us see whether the CP is struggling with:

  • The network between the CP and proxies
  • Proto serialization of configuration

This gives us almost the whole flow that we want. The only missing part is the time between applying a policy and the time that DataplaneWatchdog picks it up and starts building MeshContext. We know that it should be at most KUMA_XDS_SERVER_DATAPLANE_CONFIGURATION_REFRESH_INTERVAL or xds_generation if watchdogs are struggling.

Full changelog

  • Add a new delivery metric to XDS and add it to the Kuma CP dashboard.
  • Add a new delivery metric to KDS for consistency (it's not added to HDS) and add it to the Kuma CP dashboard.
  • Remove DNS Server row from Kuma CP dashboard since it's not recommended to use DNS Server embedded in CP. Metrics are still exposed via Prometheus if someone really wants to use them.
  • Remove SDS row from Kuma CP dashboard. Secrets are served over ADS.
  • Fix HDS metrics in Kuma CP dashboard. They were broken because of copy-paste from XDS.
  • Fix XDS/KDS config confirmation metrics. They were using incorrect metric names.
  • Fix CP LIVE pane on the left. It's now text, not a Gauge, because I could not find a way to make it always fill the Gauge with live instances.
  • Add unit tests for stats callbacks.

Issues resolved

Fix #3827

Documentation

No docs really?

Testing

  • Unit tests
  • E2E tests
  • Manual testing on Universal
  • Manual testing on Kubernetes

Backwards compatibility

- [ ] Update UPGRADE.md with any steps users will need to take when upgrading.
- [ ] Add backport-to-stable label if the code follows our backporting policy

Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>
@jakubdyszkiewicz jakubdyszkiewicz requested a review from a team as a code owner February 25, 2022 17:19

@lahabana lahabana left a comment


That's a cool feature!

@jakubdyszkiewicz jakubdyszkiewicz merged commit 5c5ef72 into master Mar 2, 2022
@jakubdyszkiewicz jakubdyszkiewicz deleted the feat/config-delivery-metrics branch March 2, 2022 11:07
SallyBlichWalkMe pushed a commit to SallyBlichWalkMe/kuma that referenced this pull request Apr 14, 2022
Signed-off-by: Jakub Dyszkiewicz <jakub.dyszkiewicz@gmail.com>
Signed-off-by: Sally Blich <sally.blich@walkme.com>
Development

Successfully merging this pull request may close these issues.

Add metric to see configuration update propagation time
2 participants