feat(kuma-cp): config delivery metrics #3932
Merged
Summary
Add config delivery metrics.
So far we have had a metric called `xds_generation`, which indicates how long it took us to generate the config, including:
From that point on, we don't know how long it took the CP to actually send the config to Envoy, or how long it took Envoy to apply it.
Ideally, we would have a metric that measures the time from when we apply a policy (via kubectl or the API) to when Envoy starts to respect the changed config. This is hard in a distributed environment with two different systems (Kuma CP and Envoy). Because of clock skew, both timestamps would need to be taken on one machine. That would need to be an application that periodically
Because it would need to be a separate deployment that operates on real policies, I think it would be very hard to embed such a thing in Kuma by default for users and customers.
This PR introduces an alternative approach with a new metric called `xds_delivery`. It measures the time from the moment a config is set in SnapshotCache (scheduled for delivery) to the moment we receive and process the ACK/NACK from Envoy.
This metric can help us see if CP is struggling with
This gives us almost the whole flow that we want. The only missing part is the time between applying a policy and the moment `DataplaneWatchdog` picks it up and starts building MeshContext. We know that it should be at most `KUMA_XDS_SERVER_DATAPLANE_CONFIGURATION_REFRESH_INTERVAL`, or `xds_generation` if watchdogs are struggling.
Full changelog
Issues resolved
Fix #3827
Documentation
No docs really?
Testing
Backwards compatibility
- [ ] Update `UPGRADE.md` with any steps users will need to take when upgrading.
- [ ] Add `backport-to-stable` label if the code follows our backporting policy