
Determine what ambient telemetry will look like #42320

Closed
howardjohn opened this issue Dec 6, 2022 · 47 comments
Labels
Ambient Beta Must have for Beta of Ambient Mesh area/ambient Issues related to ambient mesh lifecycle/staleproof Indicates a PR or issue has been deemed to be immune from becoming stale and/or automatically closed

Comments

@howardjohn
Member

howardjohn commented Dec 6, 2022

In sidecars, we have reporter=source|destination. In ambient, things are a bit less clear.
We could have up to 4 proxies along the path, and 2 of them (the ztunnels) may not be doing L7 processing at all.
However, the ztunnels can provide L4 telemetry and could even provide L4 traces to some extent (since we can propagate tracing headers via HBONE).

This issue tracks designing what telemetry should be emitted in ambient

size: L, type: design

related WI: #45676

@howardjohn howardjohn added the area/ambient Issues related to ambient mesh label Dec 6, 2022
@kyessenov
Contributor

For ztunnel, there's a gap with the Telemetry API: because there's no CEL implementation in Rust, customizations are not available (cc @zirain).
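For context, this is roughly the kind of Telemetry API customization that depends on CEL today; a sketch based on the documented tagOverrides support (the specific tag and metric below are illustrative only):

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-tags
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
          tagOverrides:
            # The value is a CEL expression evaluated in the proxy; without a
            # CEL evaluator in ztunnel (Rust), an override like this cannot be
            # applied there.
            request_host:
              value: "request.host"
```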

@howardjohn
Member Author

I think the limitations may be beyond Rust vs. Envoy if we include the full scope of Telemetry API customization. We probably don't want per-workload overrides on the node proxy.

@kyessenov
Contributor

kyessenov commented Dec 9, 2022

Yes, I'm not suggesting that, but I think it's fairly straightforward to maintain full waypoint support for the Telemetry API: server-side for the producer and client-side for the consumer.

For node proxies, I am not convinced we need to present L4 Istio metrics. The main audience would be the node operators, since node proxies are multi-tenant and perhaps not even visible to the application developers. The focus there should be on infrastructure metrics, like the Envoy default metric set (e.g. load distribution, time spent, global p99, etc.).

@SkyfireFrancisZ

Lin and Ethan will find the right person to tackle this work item.

@jshaughn

There are the technical issues and then there is the question of what helps users best observe and troubleshoot their mesh. In my experience on Kiali these things are helpful:

  • avoid too many metrics
    • collection and storage do impact users, and many end up dropping metrics or turning off collection entirely
    • ambient should not over-report
      • I think source and/or destination reporting should be the goal, no more than two reporters reflecting the source and destination "view" of a request.
      • there was a discussion about sidecar proxies adding header information to indicate to waypoint proxies that source reporting was already done downstream. Something analogous would probably make sense for dest reporting. The idea here would be that a waypoint would stand in as the source and/or dest reporter when there is no sidecar in the mix. I believe users will want to see telem as if they had sidecars; even if only a waypoint is doing the handling, it is the logical view of a request.
      • one question here is whether a waypoint should generate source and dest reporting when the time-series would be identical other than the reporter. The disadvantage is near-duplicate reporting means more collection and storage. And the vast majority of traffic, which is typically healthy, would be double-reported (like today). The advantage is that it is easier to query from the source or dest POV. But, although it would impact Kiali, I think it could be best if source-reporting was done only for requests that don't reach a dest workload (certain failures), otherwise dest reporting is all that is needed.
    • ambient should take this opportunity to change the bootstrap histogram bucketing.
      • Best would be to not use Envoy defaults (Prom defaults would likely be better) and allow custom bucketing, if possible.

Ambient has a more distinct separation of L4 and L7. It seems the push is that there should be complete L4 reporting, such that telem consumers could get a full picture of their L4 traffic. This makes perfect sense for users only using the mesh for L4. But I'm not sure it makes perfect sense when there is L7 reporting as well, because I don't think we want to see duplicate TCP and HTTP edges in a topology view for the same request. Although, it would be nice to show users a "merged" view, like they have today with sidecar reporting, combining HTTP and TCP (think of the bookinfo graph with requests to mongodb). The TCP shows up only when it is the declared protocol. In other words, TCP metrics are "skipped" for HTTP traffic. Perhaps the L4 telemetry could add a label indicating whether it is strictly L4 or carries L7 info. That would make it easy to query for a complete L4 view, while also making it easy to eliminate L4 metrics that would presumably be reported as L7 requests as well.

I think to avoid disruption it would be good to continue to use the same Istio metrics that consumers are currently using. Although, as mentioned above regarding histo bucketing, I think some changes/optimizations are fine, like removing some deprecated fields, etc.

Sorry for the large dump, but wanted to get down some thoughts...

@bleggett
Contributor

bleggett commented Mar 23, 2023

Ambient has a more distinct separation of L4 and L7. It seems the push is that there should be complete L4 reporting such that telem consumers could get a full picture of their L4 traffic. This makes perfect sense for users only using the mesh for L4. But I'm not sure it makes perfect sense when there is L7 reporting as well, because I don't think we want to see duplicate TCP and HTTP edges in a topology view, for the same request.

For me it's more than that: for the same cluster or the same traffic, someone might want a TCP view and also an L7 protocol view. They're different visualizations of the same data. It is a data collection/data query problem at the end of the day, and not so much something an emitter should get involved with.

That does mean that ztunnel + the Istio data plane need to properly populate enough metadata for the collector/view components to do what they need to do - I just don't want the data plane to be responsible for filtering network metrics. That's why things like OTel have standalone cluster-local collectors that can be configured to filter/drop metrics based on what people actually want to collect/store for a given cluster.

avoid too many metrics

👍

collection and storage do impact users and many end up dropping metrics or turning off collection entirely

👍 this is where I think having a configurable collector that is separate from Istio makes a lot of sense. Istio emits what it knows, there's a collector that ingests the emitted metrics, and you can drop or exclude what you like there before the metrics are forwarded to your store, based on your needs. I think most of the deduplication before storage (as well as any filtering a specific end user might want for a specific cluster) can be solved handily with a cluster-level collector, e.g. the otel one if it meets our needs.
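As a rough sketch of that shape, assuming the OpenTelemetry Collector's filter processor and a Prometheus receiver/remote-write pipeline (the metric names below are just examples, not a proposal of what to drop):

```yaml
# OpenTelemetry Collector fragment (sketch): drop selected Istio metrics before
# they are forwarded to storage. Metric names here are illustrative only.
processors:
  filter/istio:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - istio_request_duration_milliseconds.*
          - istio_request_bytes.*
          - istio_response_bytes.*
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [filter/istio]
      exporters: [prometheusremotewrite]
```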

ambient should not over-report
I think source and/or destination reporting should be the goal, no more than two reporters reflecting the source and destination "view" of a request.

Is this something that becomes less important with correlation IDs and baggage? That's sort of hard to enforce/track/make assumptions about on the Istio data plane side, globally.

there was a discussion about sidecar proxies adding header information to indicate to waypoint proxies that source reporting was already done downstream. Something analogous would probably make sense for dest reporting. The idea here would be that a waypoint would stand in as the source and/or dest reporter when there is no sidecar in the mix. I believe users will want to see telem as if they had sidecars; even if only a waypoint is doing the handling, it is the logical view of a request.

The difficulty is that (IIRC) a waypoint is not always there for all L7 traffic - if it's L7 but we aren't doing anything special with it in the control plane at the L7 layer, it may not go through a waypoint at all and it is "just" TCP packets.

ztunnel can infer if it's L7 to some degree (but will not do packet sniffing to find out)

ambient should take this opportunity to change the bootstrap histogram bucketing.
Best would be to not use Envoy defaults (Prom defaults would likely be better) and allow custom bucketing, if possible.

Not familiar enough with the differences here but open to it.

@krisztianfekete
Contributor

krisztianfekete commented May 12, 2023

ambient should take this opportunity to change the bootstrap histogram bucketing.
> Best would be to not use Envoy defaults (Prom defaults would likely be better) and allow custom bucketing, if possible.

Custom bucketing is not (yet) easy, but is somewhat available with the new native histogram metric type, available since Prometheus v2.40 as an experimental feature. I will check the design docs regarding compatibility questions.
Until then, fixed buckets are fine, with reasonable defaults (based on extensive testing).
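For reference, a minimal sketch of opting in on the Prometheus side (assuming Prometheus >= v2.40; native histograms sit behind an experimental feature flag):

```yaml
# Sketch: Prometheus container args enabling the experimental native histograms
# feature (Prometheus v2.40+).
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --enable-feature=native-histograms
```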

Perhaps the L4 telemetry could add a label indicating whether it is strictly L4 or carries L7 info. That would make it easy to query for a complete L4 view, while also making it easy to eliminate L4 metrics that would presumably be reported as L7 requests as well.

There's already a request_protocol label, but a new one dedicated to distinguishing between L4 and L7 could make sense, and would not be a cardinality issue. This way, users would be able to run the same queries regardless of the layer and could switch between the two "modes".
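A sketch of how that could be consumed, assuming a hypothetical label (called reported_layer below; neither that label nor these recording rules exist today):

```yaml
# Hypothetical recording rules assuming a new "reported_layer" label on the
# L4 metrics; neither the label nor these rule names exist today.
groups:
  - name: ambient-layer-views
    rules:
      # Complete L4 view, including connections that also carry L7 traffic.
      - record: mesh:tcp_sent_bytes:rate5m
        expr: sum by (source_workload, destination_workload) (rate(istio_tcp_sent_bytes_total[5m]))
      # Strictly-L4 view, excluding flows that are also reported as L7 requests.
      - record: mesh:tcp_sent_bytes_l4_only:rate5m
        expr: sum by (source_workload, destination_workload) (rate(istio_tcp_sent_bytes_total{reported_layer="l4"}[5m]))
```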

I think source and/or destination reporting should be the goal, no more than two reporters reflecting the source and destination "view" of a request.
> Is this something that becomes less important with correlation IDs and baggage? That's sort of hard to enforce/track/make assumptions about on the Istio data plane side, globally.

Here we are just talking about metrics, and those should be handled separately: they are one of the easiest, cheapest observability signals to get, and most users stop at this point (even if only temporarily), so metrics should be useful by themselves too.

collection and storage does impact users and many end up dropping metrics or turning off collection entirely

Only a reasonable set of metrics should be enabled by default. Regular Istio did not get this totally right, as the defaults can scare people off (or they ignore them and suffer the consequences).

This is another example of something that I feel should be the default, as only a very small subset of users would want to keep these labels: https://docs.solo.io/gloo-mesh-enterprise/latest/observability/tools/prometheus/production-setup/#remove-cardinality-labels

this is where I think having a configurable collector that is separate from Istio makes a lot of sense. Istio emits what it knows, there's a collector that ingests the emitted metrics, and you can drop or exclude what you like there before the metrics are forwarded to your store, based on your needs.

OTel is just an implementation detail, and while it's getting more and more popular, most users are not using it yet and are still relying on Prometheus, as that's still the de facto standard. In terms of UX, I think we should focus on getting this right with Prometheus first, then OTel. That should be quite easy, as the Prometheus Receiver is quite close to Prometheus.

@jshaughn

@howardjohn, As a follow-up to the working group meeting, if you and Louis have information to share please add it here in some way, thanks!

@istio-policy-bot istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Aug 11, 2023
@istio-policy-bot istio-policy-bot added the lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. label Aug 26, 2023
@keithmattix keithmattix reopened this Aug 26, 2023
@istio-policy-bot istio-policy-bot removed the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Aug 26, 2023
@keithmattix
Contributor

Not stale

@istio-policy-bot istio-policy-bot added the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Nov 25, 2023
@hzxuzhonghu
Member

/no stale

@istio-policy-bot istio-policy-bot removed the lifecycle/stale Indicates a PR or issue hasn't been manipulated by an Istio team member for a while label Nov 25, 2023
@linsun linsun added lifecycle/staleproof Indicates a PR or issue has been deemed to be immune from becoming stale and/or automatically closed Ambient Beta Must have for Beta of Ambient Mesh and removed lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. labels Dec 15, 2023
@linsun
Member

linsun commented Dec 15, 2023

@justinpettit can you identify some resource for this? Would love to have a high level design for telemetry before ambient announces some sort of beta.

@jshaughn

jshaughn commented Dec 19, 2023

Thinking more on ambient metrics, I want to follow up on my initial comments. I do think the existing core Istio metrics should be maintained for the foreseeable future. Doing this should make migration to ambient more seamless, and also let the OTel work evolve. At the same time, for ambient namespaces maybe we can treat telemetry a little differently, and mostly leave sidecar proxy telemetry alone. It seems to me that ambient users will move quickly to ambient and mainly leave sidecars behind. And from a telemetry perspective I think we can probably ignore telemetry from sidecars within an ambient namespace (will there even be sidecars in an ambient namespace?).

Eliminate Source/Dest Reporters
The source and destination reporting of sidecar proxies is rich, but wasteful. For successful requests the source and dest reporting is mostly duplicated. That results in double the processing, double the scraping, double the storage, and sometimes the need to de-dupe after querying. The ambient architecture is different: there are no source and dest proxies, and reporting should not try to emulate them. A new "reporter" has been suggested in the past, and I now think that reporter="ambient" makes sense, because it's accurate. It also makes it easy to query ambient telemetry via a single label-value pair. The reporter could also be set to "ztunnel" or "waypoint", but that may be better as something like reporter_app. (Note: I think TCP metrics currently have app="ztunnel"; I'd recommend changing that to what's proposed above.)

I believe today ambient is always using reporter="destination". This may seem like a good idea for some amount of back-compat, but I think it's misleading. For example, it's quite possible to have reporter="destination" yet no destination_workload, because in reality the request never made it to a dest workload.

If there is a sidecar proxy in an ambient namespace the telemetry could continue to report as it does today, independently of the ambient telemetry. Or, maybe it could be disabled completely, as some telem consumers, like Kiali, may elect to ignore it.

Removal of source and dest reporting will result in significant savings in telemetry capture, storage and processing. The main concern is whether Ambient telem is able to capture the union of what the source and dest reporting did with sidecar proxies. For example, sidecar client reporting captured requests that never made it to a destination proxy, and destination reporting provided security policy.

New bucketing for Histograms
A significant number of users disable Istio's histograms because they are too expensive. Currently, Istio bootstraps using Envoy's default histogram bucketing, with 19 buckets:

[ 0.5, 1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000, 300000, 600000, 1800000, 3600000 ]

Prometheus defaults cut the number almost in half:

[ .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10 ]

OTel HTTP defaults are similar to Prom, but with 14 buckets:

[ 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10 ].

Each histogram generates one bucket time-series per bucket (+ 1 more for "+Inf"), so in a simple case, say requests A->B, there are currently 20 bucket time-series generated. Each observation increments every bucket whose upper bound is >= the observed value; for example, using request_duration_milliseconds, a 1100ms observation increments the 9 buckets from 2500 upward, plus "+Inf". And note, that is for each unique set of attributes.

In my opinion Envoy has too many buckets, bloating the storage and slowing down updates, queries and calculations. Prom and OTel are better but currently set for seconds, whereas Istio is currently reporting in milliseconds.

It's not totally clear what to do here. Also, Prometheus has just introduced experimental support for "Native Histograms", which is different from what I described above and may help out substantially. It is something that should be tracked for possible use with ambient GA. But for now, I'd suggest that when installing Istio with the ambient profile, we adhere to the OTel bucketing, adjusted for milliseconds. This will still reduce the overall number of time-series while adding some standardization. I'd ignore support for custom bucketing and hope that native histogram support eventually solves the histogram problem.
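As one possible shape for this, a sketch using the OTel boundaries converted to milliseconds. It uses the per-proxy histogram-bucket override annotation as I understand it (the annotation name, its "istiocustom" key, and the idea that an ambient profile would set an equivalent bootstrap-level default are assumptions to verify):

```yaml
# Sketch: OTel default HTTP boundaries converted from seconds to milliseconds,
# applied via the per-proxy histogram-bucket override annotation. The
# annotation name and the "istiocustom" key are from memory and should be
# verified; ambient would need an equivalent bootstrap-level default.
metadata:
  annotations:
    sidecar.istio.io/statsHistogramBuckets: >-
      {"istiocustom": [5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000]}
```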

Removals
This is biased towards what we currently need in Kiali, but I think some attributes could be removed from the standard metrics. This may not help much with cardinality, but it would help with storage, and would simplify the metrics. For Kiali users looking to save metric storage space, we recommend the following labeldrops:

chart|destination_app|destination_version|heritage|.*operator.*|istio.*|release|security_istio_io_.*|service_istio_io_.*|sidecar_istio_io_inject|source_app|source_version
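In Prometheus terms that is a labeldrop relabel rule on the Istio scrape job; a sketch using the regex above:

```yaml
# Prometheus scrape-config fragment: drop the listed labels from Istio metrics
# at scrape time, using the labeldrop list recommended above.
metric_relabel_configs:
  - action: labeldrop
    regex: "chart|destination_app|destination_version|heritage|.*operator.*|istio.*|release|security_istio_io_.*|service_istio_io_.*|sidecar_istio_io_inject|source_app|source_version"
```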

Minimally, I think the "canonical" attributes remove the need for *_app and *_version.

Note that the nature of waypoints should already reduce cardinality significantly, as the pod attribute should have far fewer values, especially for instances where app pods saw a lot of churn.

There are also a lot of metrics that could likely be turned off by default. In Kiali, the following metrics are currently unused:

istio_agent_.*|istiod_.*|istio_build|citadel_.*|galley_.*|pilot_[^p].*|envoy_cluster_[^u].*|envoy_cluster_update.*|envoy_listener_[^dh].*|envoy_server_[^mu].*|envoy_wasm_.*
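Dropping those is a similar relabel rule keyed on the metric name, again just a sketch reusing the regex above:

```yaml
# Prometheus scrape-config fragment: drop whole metric families that are
# currently unused, keyed on the metric name.
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "istio_agent_.*|istiod_.*|istio_build|citadel_.*|galley_.*|pilot_[^p].*|envoy_cluster_[^u].*|envoy_cluster_update.*|envoy_listener_[^dh].*|envoy_server_[^mu].*|envoy_wasm_.*"
    action: drop
```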

To summarize, the above doesn't recommend any sort of fundamental shift in telemetry for Ambient, but I think would let Ambient report in a more native way, address a legacy issue in histogram reporting, reduce the telemetry footprint, and require only modest changes for consumers.

@howardjohn
Member Author

For Kiali users looking to save metric storage space, we recommend the following labeldrops:
chart|destination_app|destination_version|heritage|.*operator.*|istio.*|release|security_istio_io_.*|service_istio_io_.*|sidecar_istio_io_inject|source_app|source_version

Most of these are not set by Istio at all, but by users' Prometheus scrape configuration, which is set up to add all the pod labels to the metric labels. Waypoints happen to solve this, though, since the metrics would use the waypoint's labels. {destination,source}_{app,version} are an exception -- I generally agree those are not useful now that we have canonical service/revision.

@bleggett
Contributor

bleggett commented Feb 29, 2024

  • This should be sufficient if the different components report to distinct metrics. For example, ztunnel writes just to TCP metrics and waypoints write only to request metrics. Otherwise it may be best to have both use reporter=”ambient” and add a new label like reporter_component=”ztunnel | waypoint”.

This allows consumers to easily identify ambient telem in a query using a single label-value pair, and distinguish it from sidecar-proxy telem.

Slight pref for the reporter=”ambient” reporter_component=”ztunnel | waypoint” semantics, since I think with e.g. sandwiching this might get a little weird otherwise (and I think it simplifies query syntax for mixed clusters, as you point out).

@lei-tang
Contributor

Adding a new label "reporter_component" to OSS ambient adds to its complexity and is not consistent with OSS Istio, which does not have this new label.

I prefer for ztunnel, reporter=”ztunnel”; for waypoint proxy, reporter=”waypoint”, because it is simple and clear.

@bleggett
Contributor

bleggett commented Feb 29, 2024

Adding a new label "reporter_component" to OSS ambient adds to its complexity and is not consistent with OSS Istio, which does not have this new label.

That's not necessarily a problem: OSS ambient looks nothing like OSS sidecar in practical terms anyway, and the reality is that both will be supported in parallel, probably in perpetuity.

My concern is if we (eventually) end up with something like

  • 1 ztunnel (ztunnel A) per node
  • N waypoints per node

and ztunnel A is proxying L4 traffic to all ambient pods on the node as well as terminating HBONE traffic bound to those N waypoints - it gets potentially a bit trickier to collate and represent the info, as the simplistic categories start to break down a bit.

Thinking about it, it's probably moot either way though, as we will eventually have to append more specific context than reporter=waypoint|ztunnel for ambient, no matter which option we pick.

My only concern is that when we are forced to do that, we will make it hard for ourselves to support parallel telemetry views for sidecar and ambient traffic in the same cluster - I do think that is a requirement we should keep at the forefront.

@krisztianfekete
Contributor

krisztianfekete commented Mar 1, 2024

I prefer for ztunnel, reporter=”ztunnel”; for waypoint proxy, reporter=”waypoint”, because it is simple and clear.

+1

@jshaughn

jshaughn commented Mar 1, 2024

Today, with sidecar injection, for a single request from A to B there are 2 time-series generated for each relevant metric (requests_total, tcp_sent_bytes, tcp_received_bytes, etc.): sidecar-A with reporter="source" and sidecar-B with reporter="destination". Is there consensus that with ambient, for a single request from A to B, we would expect 1 time-series generated for each relevant metric? Potentially with reporter="ztunnel" (for L4 metrics, like tcp_sent_bytes) or reporter="waypoint" (for L7 metrics, like requests_total).

@mikemorris
Member

mikemorris commented Mar 4, 2024

From a cursory glance (caveat that I haven't fully read all the comments) I'd expect that reporter="source" and reporter="destination" should generally retain similar usage and not be overloaded with a less descriptive new term for Ambient - reporter="ztunnel" fails to describe whether the ztunnel is reporting from the source or destination, and this actually becomes more important for ambient data flows due to the possibility of waypoint hops and therefore multiple source and destination points within a single flow.

I'd be supportive of adding something like a reporter_component label and using reporter_component="sidecar" for consistency with the traditional model to address @lei-tang's concern. With this, a typical flow (using a "sandwich" deployment) may look like:

L7 Ambient

| client zt | waypoint zt -> waypoint | waypoint -> waypoint zt | destination zt |
|---|---|---|---|
| reporter="source" | reporter="destination" | reporter="source" | reporter="destination" |
| reporter_component="ztunnel" | reporter_component="ztunnel" | reporter_component="ztunnel" | reporter_component="ztunnel" |

I'd expect the same-node waypoint <-> ztunnel hops to be negligible, but the biggest gap here is the processing time within the waypoint itself. For this reason, I think there should ideally be a "parent" tracing span capturing the entire client zt -> destination zt timing, although I think that would only be possible to emit from the destination zt, and only if the waypoint forwards timing/header information from the first leg?

I'm not sure where/how/if to encode that the middle legs are associated with a waypoint, but it feels inaccurate to use reporter_component="waypoint" if using a ztunnel sandwich deployment where the ztunnel is the actual destination and then outbound source.

L4 Ambient

| client zt | destination zt |
|---|---|
| reporter="source" | reporter="destination" |
| reporter_component="ztunnel" | reporter_component="ztunnel" |

Sidecars

| client sidecar | destination sidecar |
|---|---|
| reporter="source" | reporter="destination" |
| reporter_component="sidecar" | reporter_component="sidecar" |

@kyessenov
Contributor

kyessenov commented Mar 4, 2024

I don't think we need reporter_component - all monitoring systems allow tagging metrics with the component the metrics come from. This is usually referred to as a "monitored resource".

The main value of the two reporters is that they provide an easy way to compare, and the difference is meaningful since it reflects the network impact. There's not much value in comparing ambient telemetry to sidecar telemetry, so I think you can pick a completely different set of labels.

@bleggett
Contributor

bleggett commented Mar 4, 2024

From a cursory glance (caveat that I haven't fully read all the comments) I'd expect that reporter="source" and reporter="destination" should generally retain similar usage and not be overloaded with a less descriptive new term for Ambient - reporter="ztunnel" fails to describe whether the ztunnel is reporting from the source or destination, and this actually becomes more important for ambient data flows due to the possibility of waypoint hops and therefore multiple source and destination points within a single flow.

Yes +1 on this, this is my main concern. Component != direction or origin, as long as we aren't conflating those things, we should be good.

@howardjohn
Member Author

Chatted with @louiscryan a bit, here is where we arrived. Primary focus is retaining metrics value while keeping cost reasonable.

I set up a small spreadsheet to put in theoretical numbers and analyze different approaches. Here are the primary approaches I considered:

[image: diagram of the candidate reporting approaches - "waypoint double" (top) and "waypoint single" (bottom)]

Here the boxes are the actual proxy reporting the metric, and the source.

Note: the rows with empty cells are derived from the others.

For some reason I cannot upload to the team drive due to a permission issue, so it is attached here:
Ambient telemetry analysis.xlsx

Summarized (waypoint double == top option in the picture, waypoint single == bottom):

| Metric types | Reporting Pods | Reporters | Metric count | Destinations | Sources | Series | Cost |
|---|---|---|---|---|---|---|---|
| Sidecar HTTP | 1000 | 2 | 64 | 7 | 1 | 896,000 | $7,168 |
| Sidecar TCP | 1000 | 2 | 4 | 7 | 1 | 56,000 | $448 |
| Ztunnel | 25 | 2 | 4 | 100 | 25 | 500,000 | $4,000 |
| Waypoint double (client part) | 30 | 1 | 64 | 1 | 70 | 134,400 | $1,075 |
| Waypoint double (server part) | 30 | 1 | 64 | 10 | 1 | 19,200 | $154 |
| Waypoint double | | | | | | 153,600 | $1,229 |
| Waypoint single | 30 | 1 | 64 | 10 | 70 | 1,344,000 | $10,752 |
| Waypoint single + ztunnel | | | | | | 1,844,000 | $14,752 |
| Waypoint double + ztunnel | | | | | | 653,600 | $5,229 |

Broad notes:

  • Histograms are killer. Native histogram support, or even dropping buckets, may yield huge wins -- but they scale linearly so it's not foundational
  • Ztunnel cost is pretty small relative to waypoints
  • An 'off the shelf' waypoint implementation outside of Istio would likely report only the right side of the 'waypoint double'
  • Much of the cost comes from reporting pod_name as a label. This cannot be dropped out of the box in a simple Prometheus setup, but it is possible with Prometheus (https://karlstoney.com/federated-prometheus-to-reduce-metric-cardinality/) and generally with any metrics backend; see the sketch below.
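As an illustration of the approach in that link, a sketch of a recording rule that aggregates away the per-pod labels before federation/long-term storage (exact label names depend on the scrape configuration):

```yaml
# Sketch: recording rule that aggregates away per-pod labels so only the
# workload-level aggregate is federated / kept long term. Label names
# (pod, pod_name, instance) depend on the scrape configuration.
groups:
  - name: istio-cardinality-reduction
    rules:
      - record: workload:istio_requests_total:rate5m
        expr: |
          sum without (pod, pod_name, instance) (
            rate(istio_requests_total[5m])
          )
```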

There is discussion about consolidating {workload, canonical service, app} and {version, canonical revision}; since these are generally the same values, they do not meaningfully impact cardinality, which is the primary driver of cost for metrics. Given this, it's probably best to keep all the labels for compatibility.

@kyessenov
Contributor

"Waypoint-double" reporting is qualitatively different since you cannot correlate the two metrics easily. Intuitively, for full fidelity with sidecars, you need a series per every network hop so it's always O(hops). You can sacrifice some fidelity via an intermediary LB service (so instead of M*N tuples you have M+N tuples), but it is not equivalent.

@louiscryan
Contributor

"Waypoint-double" reporting is qualitatively different since you cannot correlate the two metrics easily. Intuitively, for full fidelity with sidecars, you need a series per every network hop so it's always O(hops). You can sacrifice some fidelity via an intermediary LB service (so instead of M*N tuples you have M+N tuples), but it is not equivalent.

Agreed, this is why waypoint single is the likely recommendation. The cost difference between 'single' (bottom path) and 'double' (top path) is substantial, so it's instructive for folks to understand this w.r.t. the loss of fidelity vs. sidecars, as you describe.

Ideally we would not need 30 'reporters' and could treat all waypoint Deployments as a single reporter, or even all waypoints as simply one reporter, since it's unlikely that the waypoint instance detail as the reporter is actually useful. More directly, if the waypoint is having issues, we are likely to use a more operational metric (CPU, queue lengths, etc.) to diagnose issues with an instance. If we can find a way to collapse reporters, this would make a lot of sense and bring the costs down significantly.

@lei-tang
Contributor

lei-tang commented Mar 5, 2024

I prefer the "waypoint single" approach over the "waypoint double" approach because:

  • The "waypoint single" approach provides the source -> destination information. The "waypoint double" approach does not provide the source -> destination information.

Take the following data points in the "waypoint double" approach as an example: source 1 -> waypoint, waypoint -> destination 2, source 2 -> waypoint, waypoint -> destination 1. Such data points do not provide the source -> destination information: e.g., from these data points, a user cannot tell whether the actual source and destination pair is source 1 -> destination 1 or source 1 -> destination 2.

  • The "waypoint single" approach is simpler than the "waypoint double" approach.

@mikemorris
Member

Would it be worthwhile to consider some labeling scheme and configuration such that it could be possible to enable "waypoint double" telemetry even if "waypoint single" is default and recommended for cost control in production?

@kyessenov
Contributor

@mikemorris is there a material difference between the "double" telemetry and a "single" telemetry with cardinality reduction techniques applied (e.g. erasing destination labels, grouping of response codes, etc)?

@mikemorris
Member

mikemorris commented Mar 5, 2024

@mikemorris is there a material difference between the "double" telemetry and a "single" telemetry with cardinality reduction techniques applied (e.g. erasing destination labels, grouping of response codes, etc)?

I think I don't have a deep enough understanding to answer that well tbh.

if the waypoint is having issues we are likely to use a more operational metric (CPU, queue lengths, etc.) to diagnose issues

My concern is mostly around whether ^ is sufficient (and an expected/reasonable triage workflow) for Istio operators in this situation.

@jshaughn

jshaughn commented Mar 6, 2024

I prefer waypoint-single, as I don't think that in general users want to see the infrastructure in the telemetry. They want to see that traffic between service A and service B was successful, or not. As @lei-tang notes, I'm not sure how we would correlate the two legs in waypoint-double to show the request and its response time, etc. I do see in both diagrams that src and dst workloads are noted as pods; is that correct? I don't think we'd want pods.

@howardjohn
Member Author

src and dst workloads are noted as pods; is that correct? I don't think we'd want pods.

This was a bit of a misnomer; it would be the standard app/canonical service/workload labels, not a new 'pod' one.


It seems like there is some consensus building around the "waypoint single" option, with the ztunnel metrics also shown there (which also happen to be what ztunnel implements today).

@linsun
Member

linsun commented Mar 6, 2024

Is it correct to think waypoint double (top diagram) provides time spent in a given waypoint, while waypoint single (bottom diagram) doesn't provide that info?

Personally I find this very hard to compare with sidecars, because neither waypoint single nor waypoint double reports the time spent between source and destination.

@howardjohn
Member Author

Is it correct to think waypoint double (top diagram) provides time spent in a given waypoint, while waypoint single (bottom diagram) doesn't provide that info?

I don't think it provides much meaningful information. The waypoint itself is reporting it during its own processing; it's hard to accurately observe your own latency.

For example, consider a request that takes 10ms of processing:

  • 0ms: request packet hits waypoint
  • 5ms: waypoint done processing it, reports it got it
  • 6ms: waypoint about to send the upstream request, reports it's sending
  • 10ms: request finally leaves the waypoint fully

In reality it was 10ms, but we observe 1ms

@kyessenov
Contributor

I don't think it provides much meaningful information. The waypoint itself is reporting it during its own processing; it's hard to accurately observe your own latency.

If the goal is to report the time spent in the waypoint, that's possible, but it's a different metric. Envoy does keep track of the overhead (e.g. time spent processing requests and responses, minus the network time to the backend). To get accurate measurements, you have to measure at the source, using a client sidecar or out-of-cluster (user-facing browsers or mobile apps), which I think is far beyond the scope of ambient telemetry.

@linsun
Member

linsun commented Mar 7, 2024

Thank you @howardjohn and @kyessenov - I agree waypoint double doesn't provide meaningful latency within the waypoint itself. And latency within the waypoint is out of scope for this WI.

@lei-tang
Contributor

We drafted a "Plan for OSS ambient mesh beta - telemetry" (https://docs.google.com/document/d/15bW22HxuB9F-TIDy4h_YCBsUb8usbaurx8-DrXcL3_Q/edit). Please take a look.

@howardjohn
Member Author

I've temporarily lost write access to docs, so will comment here:

  • Ztunnel should write some logs, but not in the Envoy log format, which was designed for HTTP. We can provide a far better log format.
  • Waypoints should support tracing just like the Istio Gateway does.
  • -1 to the decision to not do the Telemetry API. It's probably more work to not support it.
  • Should remove "Additional stats for waypoint", or make it clear it's just providing more context, not proposing we do that.

Looks good though, thanks for pulling that together.

@lei-tang
Contributor

John, thank you for your comments! I agree with your points and have updated the doc accordingly. PTAL.

@linsun
Member

linsun commented Apr 4, 2024

To recap what we discussed recently in the ambient meeting, I think we have a plan for ambient beta, which is to go with waypoint single for reporting (bottom of this diagram) and to not support the Telemetry API. We need to identify an owner for #50225.

@howardjohn
Member Author

Closing this as we have determined what it will look like. Implementation is tracked in #50225
