Determine what ambient telemetry will look like #42320
For ztunnel, there's a gap with the Telemetry API because there's no CEL in Rust, so customizations are not available (cc @zirain).
I think the limitations go beyond Rust vs. Envoy if we include the full scope of Telemetry customization. We probably don't want per-workload overrides on the node proxy.
Yes, I'm not suggesting that, but I think it's fairly straightforward to maintain full waypoint support for the Telemetry API, as server-side for the producer and client-side for the consumer. For node proxies, I am not convinced we need to present L4 Istio metrics. The main audience would be the node operators, since node proxies are multi-tenant and perhaps not even visible to the application developers. The focus there should be on infrastructure metrics, like the Envoy default metric set (e.g. load distribution, time spent, global p99, etc.).
Lin and Ethan will find the right person to tackle this work item.
There are the technical issues and then there is the question of what helps users best observe and troubleshoot their mesh. In my experience on Kiali these things are helpful:
Ambient has a more distinct separation of L4 and L7. It seems the push is that there should be complete L4 reporting such that telemetry consumers could get a full picture of their L4 traffic. This makes perfect sense for users only using the mesh for L4. But I'm not sure it makes perfect sense when there is L7 reporting as well, because I don't think we want to see duplicate TCP and HTTP edges in a topology view for the same request. Although, it would be nice to show users a "merged" view, like they have today with sidecar reporting, combining HTTP and TCP (think of the bookinfo graph with requests to mongodb). The TCP shows up only when it is the declared protocol; in other words, TCP metrics are "skipped" for HTTP traffic.

Perhaps the L4 telemetry could add a label indicating whether it is strictly L4 or carries L7 info. That would make it easy to query for a complete L4 view, while also making it easy to eliminate L4 metrics that would presumably be reported as L7 requests as well.

I think to avoid disruption it would be good to continue to use the same Istio metrics that consumers are currently using. Although, as mentioned above regarding histogram bucketing, I think some changes/optimizations are fine, like removing some deprecated fields, etc.

Sorry for the large dump, but wanted to get down some thoughts...
For me it's more that, for the same cluster or the same traffic, someone might want a TCP view and also want an L7 protocol view. They're different visualizations of the same data. It is a data collection/data query problem at the end of the day and not so much something an emitter should get involved with. That does mean that ztunnel + the Istio data plane need to properly populate enough metadata for the collector/view components to do what they need to do - I just don't want the data plane to be responsible for filtering network metrics. That's why things like OTEL have standalone cluster-local collectors that can be configured to filter/drop metrics based on what people actually want to collect/store for a given cluster.
👍
👍 this is where I think having a configurable collector that is separate from Istio makes a lot of sense. Istio emits what it knows, there's a collector that ingests the emitted metrics, and you can drop or exclude what you like there before the metrics are forwarded to your store, based on your needs. I think most of the deduplication before storage (as well as any filtering a specific end user might want for a specific cluster) can be solved handily with a cluster-level collector, e.g. the otel one if it meets our needs.
Is this something that becomes less important with correlation IDs and baggage? That's sort of hard to enforce/track/make assumptions about on the Istio data plane side, globally.
The difficulty is that (IIRC) a waypoint is not always there for all L7 traffic - if it's L7 but we aren't doing anything special with it in the control plane at the L7 layer it may not go thru a waypoint at all and it is "just" TCP packets. ztunnel can infer if it's L7 to some degree (but will not do packet sniffing to find out)
Not familiar enough with the differences here but open to it.
Custom bucketing is not (yet) easy, but somewhat available with the new native histogram metric type. This is available since Prometheus v2.40 as an experimental feature. I will check the design docs regarding compatibility questions.
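As a rough illustration, here is a minimal Go sketch of opting a histogram into the native (sparse, exponential-bucket) representation using client_golang's experimental options; the option names are from memory and should be verified against the library version in use:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A request-duration histogram that opts into the experimental native
	// histogram representation; Prometheus >= 2.40 with the native-histograms
	// feature flag can ingest it, which sidesteps the fixed-bucket question.
	reqDuration := prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "request_duration_milliseconds",
		Help: "Request duration in milliseconds.",
		// Experimental native-histogram options in client_golang:
		NativeHistogramBucketFactor:     1.1,       // ~10% relative bucket width
		NativeHistogramMaxBucketNumber:  100,       // cap per-series bucket count
		NativeHistogramMinResetDuration: time.Hour, // allow resets when capped
	}, []string{"reporter"})
	prometheus.MustRegister(reqDuration)

	reqDuration.WithLabelValues("waypoint").Observe(12.7)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9090", nil)
}
```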
There's already a
Here we are just talking about metrics, and that should be handled separately, as metrics are one of the easiest, cheapest observability signals to get, and most users stop at this point (even if only temporarily), so they should be useful by themselves too.
Only a reasonable set of metrics should be enabled by default. Regular Istio did not get this totally right, as the defaults can scare people off (or they ignore them and suffer the consequences). This is another example of something I feel should be the default, since only a very small subset of users would want to keep these labels: https://docs.solo.io/gloo-mesh-enterprise/latest/observability/tools/prometheus/production-setup/#remove-cardinality-labels
OTel is just an implementation detail, and while it's getting more and more popular, most users are not using it yet but are relying on Prometheus, as that's still the de facto standard. I think in terms of UX, we should focus on getting this right with Prometheus first, then OTel. It's quite easy, as the Prometheus Receiver is quite close to Prometheus.
@howardjohn, as a follow-up to the working group meeting, if you and Louis have information to share please add it here in some way, thanks!
Not stale
/no stale |
@justinpettit can you identify some resource for this? Would love to have a high-level design for telemetry before ambient announces some sort of beta.
Thinking more on Ambient metrics, I want to follow up on my initial comments. I do think the existing, core Istio metrics should be maintained for the foreseeable future. Doing this should make things more seamless as people migrate to ambient, and also let the OTel work evolve. At the same time, for ambient namespaces maybe we can treat telemetry a little differently, and mostly leave sidecar proxy telemetry alone. It seems to me that ambient users will move quickly to ambient, and mainly leave sidecars behind. And from a telemetry perspective I think we can probably ignore telemetry from sidecars within an ambient namespace (will there even be sidecars in an ambient namespace?).

Eliminate Source/Dest Reporters

I believe today ambient is always using

If there is a sidecar proxy in an ambient namespace, the telemetry could continue to report as it does today, independently of the ambient telemetry. Or maybe it could be disabled completely, as some telemetry consumers, like Kiali, may elect to ignore it. Removal of source and dest reporting will result in significant savings in telemetry capture, storage, and processing. The main concern is whether ambient telemetry is able to capture the union of what the source and dest reporting did with sidecar proxies. For example, sidecar client reporting captured requests that never made it to a destination proxy, and destination reporting provided security policy.

New bucketing for Histograms
Prometheus defaults cut the number almost in half:
OTel HTTP defaults are similar to Prom, but with 14 buckets:
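For reference, here is a minimal Go sketch of the two default bucket sets as I recall them (both are defined in seconds, so they would need scaling for a millisecond-based metric; worth verifying against the current Prometheus client defaults and the OTel HTTP semantic conventions):

```go
package main

import "fmt"

func main() {
	// Prometheus client-library defaults (prometheus.DefBuckets), in seconds.
	promDefaults := []float64{
		0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10,
	}

	// OTel HTTP semantic-convention defaults for request duration, in seconds.
	otelHTTPDefaults := []float64{
		0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10,
	}

	// Hypothetical: the same OTel boundaries scaled to milliseconds, which is
	// what a millisecond-based Istio duration metric would need.
	otelHTTPDefaultsMs := make([]float64, len(otelHTTPDefaults))
	for i, b := range otelHTTPDefaults {
		otelHTTPDefaultsMs[i] = b * 1000
	}

	fmt.Println(len(promDefaults), len(otelHTTPDefaults)) // 11 14 (each plus +Inf)
	fmt.Println(otelHTTPDefaultsMs)
}
```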
Each unique time-series gets one more dimension for each bucket (+1 more for "inf"), so in a simple case, say requests A->B, there are currently 20 bucket time-series generated, with each updated for a value <= the bucket value. For example, using request_duration_milliseconds, for an 1100ms entry, the first 10 buckets are updated (+1 more). And note, that is for each unique set of attributes. In my opinion Envoy has too many buckets, bloating the storage and slowing down updates, queries, and calculations. Prom and OTel are better but are currently set for seconds, whereas Istio is currently reporting in milliseconds. It's not totally clear what to do here.

Also, Prometheus has just introduced experimental support for "Native Histograms", which is different from what I described above and may help out substantially. It is something that should be tracked for possible use with ambient GA. But for now, I'd suggest that when installing Istio with the ambient profile, we adhere to the OTel bucketing, adjusted for milliseconds. This will still reduce the overall number of time-series while adding some standardization. I'd ignore support for custom bucketing and hope that native histogram support eventually solves the histogram problem.

Removals
Minimally, I think the "canonical" attributes remove the need for

Note that the nature of waypoints should already reduce cardinality significantly, as the

There are also a lot of metrics that could likely be turned off by default. In Kiali the following metrics are currently unused:
To summarize, the above doesn't recommend any fundamental shift in telemetry for Ambient, but I think it would let Ambient report in a more native way, address a legacy issue in histogram reporting, reduce the telemetry footprint, and require only modest changes for consumers.
Most of these are not set at all by Istio, but by users' Prometheus scraping configuration, which is set to add all the pod labels to the metric labels. Waypoints happen to solve this, though, since the waypoint's labels would be used. {destination,source}_{app,version} are an exception -- I generally agree those are not useful now that we have canonical service/revision.
Slight pref for the
Adding a new label "reporter_component" to OSS ambient adds to its complexity and is not consistent with OSS Istio, which does not have this new label. I prefer, for ztunnel, reporter="ztunnel"; for the waypoint proxy, reporter="waypoint", because it is simple and clear.
That's not necessarily a problem; OSS ambient looks nothing like OSS sidecar in practical terms anyway, and the reality is both will be supported in parallel in probable perpetuity. My concern is if we (eventually) end up with something like
and ztunnel A is proxying L4 traffic to all ambient pods on the node as well as terminating HBONE traffic bound to those N waypoints - it gets potentially a bit trickier to collate and represent the info, as the simplistic categories start to break down a bit. Thinking about it, it's probably moot either way though, as we will eventually have to append more specific context than

My only concern is that when we are forced to do that, we will make it hard for ourselves to support parallel telemetry views for sidecar and ambient traffic in the same cluster - I do think that is a requirement we should keep at the forefront.
+1
Today, with sidecar injection, for a single request from A to B there are 2 time-series generated for each relevant metric (requests_total, tcp_sent_bytes, tcp_received_bytes, etc.), sidecar-A with
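A hypothetical, heavily trimmed sketch of that dual reporting, using a stand-in counter rather than the real Istio metric definition:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Stand-in for the Istio request counter: one A->B request is counted
	// twice, once by each sidecar, and the two series differ only in the
	// reporter label.
	requests := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "requests_total",
		Help: "Sketch of dual source/destination reporting.",
	}, []string{"reporter", "source_workload", "destination_workload"})
	prometheus.MustRegister(requests)

	requests.WithLabelValues("source", "A", "B").Inc()      // emitted by sidecar-A
	requests.WithLabelValues("destination", "A", "B").Inc() // emitted by sidecar-B

	fmt.Println("two time-series now exist for the same logical request")
}
```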
From a cursory glance (caveat that I haven't fully read all the comments), I'd expect that I'd be supportive of adding something like a

L7 Ambient
I'd expect the same-node waypoint <-> ztunnel hops to be negligible, but the biggest gap here is the processing time within the waypoint itself. For this reason, I think there should ideally be a "parent" tracing span capturing the entire client zt -> destination zt timing, although I think that would only be possible to emit from the destination zt, and only if the waypoint forwards timing/header information from the first leg? I'm not sure where/how/if to encode that the middle legs are associated with a waypoint, but it feels inaccurate to use

L4 Ambient
Sidecars
I don't think we need

The main value of the two
Yes, +1 on this, this is my main concern. Component != direction or origin; as long as we aren't conflating those things, we should be good.
Chatted with @louiscryan a bit; here is where we arrived. The primary focus is retaining metrics value while keeping cost reasonable. I set up a small spreadsheet to put in theoretical numbers and analyze different approaches.

Here are the primary approaches I considered. Here the boxes are the actual proxy reporting the metric, and the source. Note: the ones with an empty spot are derived from others. For some reason I cannot upload to the team drive due to a permission issue, so it is attached here.

Summarized (waypoint double == top option, waypoint single == bottom in the picture):
Broad notes:
There is discussion about consolidating {workload, canonical service, app} and {version, canonical revision}; since these are generally the same values, they do not meaningfully impact cardinality, which is the primary driver for costs in metrics. Given this, it's probably best to keep all the labels for compatibility.
"Waypoint-double" reporting is qualitatively different since you cannot correlate the two metrics easily. Intuitively, for full fidelity with sidecars, you need a series per every network hop so it's always O(hops). You can sacrifice some fidelity via an intermediary LB service (so instead of M*N tuples you have M+N tuples), but it is not equivalent. |
Agreed, this is why Waypoint single is the likely recommendation. The cost difference between 'single' (bottom path) and 'double' (top path) is substantial, so it's instructive for folks to understand this w.r.t. the loss of fidelity vs. sidecars, as you describe. Ideally we would not need 30 'reporters' and could treat all waypoint Deployments as a single reporter, or even all waypoints as simply one reporter, since it's unlikely that the waypoint instance detail as the reporter is actually useful. More directly, if the waypoint is having issues we are likely to use a more operational metric (CPU, queue lengths, etc.) to diagnose issues with an instance. If we can find a way to collapse reporters this would make a lot of sense and bring the costs down significantly.
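A back-of-the-envelope sketch of that cost trade-off, with purely made-up workload counts, just to show the shape of the numbers:

```go
package main

import "fmt"

func main() {
	// Hypothetical counts: m client workloads talk to n backend workloads
	// through a waypoint, and we count distinct label tuples per metric.
	m, n := 30, 20

	// Keeping the full source->destination pair on every reported leg stays
	// proportional to the number of hops times the m*n pair count.
	perHopPairs := 2 * m * n // e.g. a client-side leg plus a server-side leg

	// Reporting against the waypoint as an intermediary (like an LB service)
	// collapses this to m inbound tuples plus n outbound tuples, at the cost
	// of no longer being able to correlate the two legs.
	intermediaryTuples := m + n

	fmt.Println(perHopPairs, intermediaryTuples) // 1200 50
}
```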
I prefer the "waypoint single" approach over the "waypoint double" approach because:
Take the following data points in the "waypoint double" approach as an example: source 1 -> waypoint, waypoint -> destination 2, source 2 -> waypoint, waypoint -> destination 1. Such data points do not provide the source -> destination information: e.g., from these data points, a user cannot tell whether the actual source and destination pair is source 1 -> destination 1 or source 1 -> destination 2.
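A tiny sketch of why those legs can't be re-joined after the fact (the edge lists here are hypothetical):

```go
package main

import "fmt"

func main() {
	// "Waypoint double" records each leg independently, with the waypoint as
	// one endpoint of every edge.
	inbound := []string{"source1->waypoint", "source2->waypoint"}
	outbound := []string{"waypoint->destination1", "waypoint->destination2"}

	// Joining the legs on the shared waypoint endpoint yields every
	// combination, including pairs that never actually happened, so the real
	// source->destination mapping cannot be recovered from these metrics.
	for _, in := range inbound {
		for _, out := range outbound {
			fmt.Println(in, "+", out)
		}
	}
}
```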
Would it be worthwhile to consider some labeling scheme and configuration such that it could be possible to enable "waypoint double" telemetry even if "waypoint single" is the default and recommended for cost control in production?
@mikemorris is there a material difference between the "double" telemetry and a "single" telemetry with cardinality reduction techniques applied (e.g. erasing destination labels, grouping of response codes, etc)? |
I think I don't have a deep enough understanding to answer that well tbh.
My concern is mostly around whether ^ is sufficient (and an expected/reasonable triage workflow) for Istio operators in this situation.
I prefer waypoint-single as I don't think that in general users want to see the infrastructure in the telemetry. They want to see that traffic between service A and service B was successful, or not. As @lei-tang notes, I'm not sure how we would correlate the two legs in waypoint-double to show the request and its response time etc. I do see in both diagrams that src and dst workloads are noted as pods, is that correct? I don't think we'd want pods.
This was a bit of a misnomer; it would be the standard app/canonicalservice/workload labels, not a new 'pod' one. It seems like there is some consensus building around the "waypoint single" option, with the ztunnel metrics also shown there (which also happen to be what ztunnel implements today).
Is it correct to think waypoint double (top diagram) provides time spent in a given waypoint while waypoint single (bottom diagram) doesn't provide that info? Personally I find this very hard to compare with sidecars because neither waypoint single nor waypoint double reports the time spent between source and destination.
I don't think it provides much meaningful information. The waypoint itself is reporting it during its own processing; it's hard to accurately observe your own latency. For example, consider a request that takes 10ms of processing
In reality it was 10ms, but we observe 1ms
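A hypothetical sketch of that self-measurement gap, with invented timings:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// The proxy can only start its timer once it is actually handling the
	// request; time the request spends queued (kernel buffers, accept queue,
	// waiting for the worker/event loop) before that point is invisible to it.
	queued := 9 * time.Millisecond  // not observable from inside the proxy
	handled := 1 * time.Millisecond // what the proxy's own timer covers

	selfReported := handled
	actual := queued + handled

	fmt.Printf("self-reported: %v, actual: %v\n", selfReported, actual) // 1ms vs 10ms
}
```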
If the goal is to report the time spent in the waypoint, that's possible but it's a different metric. Envoy does keep track of the overhead (e.g. time spent processing requests and responses minus the network time to the backend). To get accurate measurements, you have to measure at the source, using a client sidecar or out-of-cluster (user-facing browsers or mobile apps), which I think is far beyond the scope of the ambient telemetry.
Thank you @howardjohn and @kyessenov - I agree waypoint double doesn't provide meaningful latency within the waypoint itself, and latency within the waypoint is out of scope for this WI.
We drafted a "Plan for OSS ambient mesh beta - telemetry" (https://docs.google.com/document/d/15bW22HxuB9F-TIDy4h_YCBsUb8usbaurx8-DrXcL3_Q/edit). Please take a look. |
I've temporarily lost write access to docs, so will comment here:
Looks good though, thanks for pulling that together.
John, thank you for your comments! I agree with your points and have updated the doc accordingly. PTAL.
Closing this as we have determined what it will look like. Implementation is tracked in #50225
In sidecars, we have reporter=source|destination. In ambient, things are a bit less clear. We could have up to 4 proxies along the path, and 2 of them (the ztunnels) may not be doing L7 processing at all.
However, the ztunnels can provide L4 telemetry and could even provide L4 traces to some extent (since we can propagate tracing headers via HBONE).
This issue tracks designing what telemetry should be emitted in ambient
size: L, type: design
related WI: #45676