Unexplained telemetry involving passthrough and unknown #24379
Comments
That link is incomplete; it wrapped and got cut off. The full link to the demo script used to reproduce this problem is: https://github.com/jmazzitelli/test/blob/master/deploy-travel-agency/deploy-travel-agency-demo.sh
That is incorrect - don't set CLIENT_EXE to minikube, set it to kubectl. So to install on minikube, run this:
$ CLIENT_EXE=kubectl bash <(curl -L https://raw.githubusercontent.com/jmazzitelli/test/master/deploy-travel-agency/deploy-travel-agency-demo.sh)
I can confirm the same behaviour and can add more data: basically, the travels-v1 workload invokes the hotels, insurances, flights, and cars services, and those invoke the discount ones. Debugging the istio-proxies, I can collect info on where PassthroughCluster requests are reported:
There is no strange traffic; it looks like some requests are missing from the telemetry report.
Adding more proxy-config logs of these scenarios, just in case it helps to diagnose:
@bianpengyuan @nrjpoddar can you take a look at this issue?
Will take a look.
Likely a dup of #23901
BTW, I have also now seen this using bookinfo. On that one I did not get the edges from "unknown" but did get unwanted edges to Passthrough from some (not all) of the workloads. It seems there is some sort of internal telemetry "leak".
@jshaughn Could you please dump [...]? I can only reproduce what you mentioned in #24379 (comment), that there are some idle connections to the passthrough cluster.
I still need to dig into why the idle connections to the passthrough cluster are happening; likely it is because of the HTTP inspector timeout.
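For reference, one way to see those idle passthrough connections from a sidecar's own stats; the deployment name and namespace below are placeholders for one of the demo workloads:

```bash
# Dump Envoy stats from a workload's sidecar and filter for the PassthroughCluster;
# active connections with no recent traffic suggest idle passthrough connections.
kubectl exec -n travel-agency deploy/travels-v1 -c istio-proxy -- \
  pilot-agent request GET stats | grep PassthroughCluster
```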
I can reproduce it in a cluster with CNI enabled.
@bianpengyuan I don't think the main issue is CNI-related, as we have seen this issue on non-CNI/minikube envs as well, but in this particular bookinfo case, where I am seeing only the passthrough destination edges, it is CNI. So perhaps this is not the same issue, but here is the bookinfo time-series dump:
Is the non-CNI environment the place where you run bookinfo? Do you only see ghost passthrough edges which do not have any data sent/received, or the unknown source as well? I think what we see with CNI and the ghost passthrough edges in the non-CNI env are separate issues... Anyway, I can reproduce both now; will dig more.
The travel example has been reproduced with and without CNI and has both the unknown source edges as well as the passthrough destination edges. The bookinfo data from today is with CNI and showed only the unwanted passthrough destination edges. Sorry if I confused things, but I'm really glad you have been able to reproduce.
The same issue with [...]. I don't think so, because the metrics are not correct from the beginning; they always go to the wrong series, and a completed TCP connection is not collected as one metric. I can see the metrics for a normal TCP connection split into two parts that do not know about each other, as below.

One metric is from ratings-v2-mongodb; it does not know the destination:

istio_tcp_received_bytes_total{connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_principal="unknown", destination_service="unknown", destination_service_name="PassthroughCluster", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", instance="10.2.2.162:15090", job="envoy-stats", namespace="istio-samples", pod_name="ratings-v2-mongodb-7cdd894b4b-82wxg", reporter="source", request_protocol="tcp", response_flags="-", source_app="ratings", source_canonical_revision="v2-mongodb", source_canonical_service="ratings", source_principal="unknown", source_version="v2-mongodb", source_workload="ratings-v2-mongodb", source_workload_namespace="istio-samples"}

The other is from mongodb-v1; it does not know the source:

istio_tcp_received_bytes_total{connection_security_policy="none", destination_app="mongodb", destination_canonical_revision="v1", destination_canonical_service="mongodb", destination_principal="unknown", destination_service="mongodb.istio-samples.svc.cluster.local", destination_service_name="mongodb", destination_service_namespace="istio-samples", destination_version="v1", destination_workload="mongodb-v1", destination_workload_namespace="istio-samples", instance="10.2.0.25:15090", job="envoy-stats", namespace="istio-samples", pod_name="mongodb-v1-5d68dcf7f4-wrwnv", reporter="destination", request_protocol="tcp", response_flags="-", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown"}
@jshaughn @lucasponce Does [...]?
@bianpengyuan The relevant code is here: https://github.com/lucasponce/travel-comparison-demo/blob/master/travel_agency/travel_agency.go#L245
It's just a simple GET request to the other services using the Go http library, so all telemetry should be similar.
Does the total number of requests (bad edges + good edges) match what you expect the client to send?
After increasing the outbound proto sniffing timeout, the unknown edge disappeared. Could you try to set [...]?
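The exact setting referenced above got cut off. As a sketch only, assuming the mesh-wide meshConfig.protocolDetectionTimeout field is what was meant (the 30s value is just an example; the default differs between releases):

```bash
# Illustrative only: raise the mesh-wide protocol detection (proto sniffing) timeout
# via an IstioOperator overlay, then apply it with istioctl.
cat <<EOF > protocol-timeout.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    protocolDetectionTimeout: 30s
EOF
istioctl install -f protocol-timeout.yaml
```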
That is great sleuthing. @lambdai @PiotrSikora
Not sure if it is feasible to disable outbound proto sniffing considering we are shifting to a single outbound listener, but at least we could increase the outbound proto sniffing timeout. @rshriram Do you have any concerns about increasing the timeout? It is really confusing to have this kind of discrepancy between the server-side sidecar and the client-side sidecar because of different proto sniffing timeouts, even with just some sample apps.
But it will impact all other traffic on that port, won't it? Why is sniffing taking such a long time? Is it an artifact of a slow client?
Yes, it is because of a slow client. IIUC, increasing the timeout will only impact plaintext server-first apps, instead of all traffic on that port, right?
@bianpengyuan It is better to declare the port protocol and disable sniffing here.
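A sketch of what declaring the port can look like, using a hypothetical Service for one of the demo backends (the names and port below are placeholders, not the demo's real manifests); the http- prefix on the port name tells the sidecar the protocol so it does not need to sniff it:

```bash
# Hypothetical Service for a demo backend with the protocol declared via the port name.
kubectl apply -n travel-agency -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: hotels
spec:
  selector:
    app: hotels
  ports:
  - name: http-hotels   # "http-" prefix declares the port as plain HTTP
    port: 8000
    targetPort: 8000
EOF
```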
@jshaughn is there any update on this issue?
@JimmyCYJ @bianpengyuan Sorry for the delay; it was my first time navigating the install changes with 1.7. The new approach to addons means Kiali and Prometheus are now installed separately. But I do now have 1.7-alpha.2 running and I can report that the issue is not fixed. It may be that the 5s default was not honored in 1.6, but I'm not surprised the problem remains because, as mentioned above, the issue was not fixed when manually using [...].

This is still a serious issue, as the telemetry is simply not being reported correctly; users will not be able to easily understand their traffic. It would still seem like something unexpected is going on because, as discussed above, 5s seems like a large timeout. Hopefully the units are not confused and what we think is 5s is not being treated like 5ms or something like that. Otherwise it seems like the proto-sniffing is just behaving differently than expected.
@jshaughn Can I get a config dump from your proxy?
@jshaughn Actually never mind. I wanted to check whether the timeout is really applied, but I think it should have been, and I'd be really surprised if Envoy were not handling the timeout correctly, because it is so fundamental and lots of security features depend on it as well, such as TLS sniffing. I cannot think of a workaround here with the current configurability. TBH, 5s should be long enough for 99% of setups. We cannot just set the timeout longer, or even to infinity, since that would cause trouble for server-first protocols like MySQL, and we don't have a way to opt a port out of proto sniffing right now. Also, proto sniffing needs to be on by default now because people rely on it to distinguish the type of traffic when a port is used for both HTTP and TCP. Sorry about that, but I'd suggest either debugging with tcpdump why 5s is not sufficient for this app, or turning off sniffing for this demo app.
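For anyone who wants to verify that the timeout actually made it into their sidecar, one possible check (pod name and namespace are placeholders; in JSON output the field may appear as listenerFiltersTimeout):

```bash
# Dump a sidecar's listener config and look for the listener-filters timeout
# that protocol sniffing uses.
istioctl proxy-config listeners travels-v1-<pod-id> -n travel-agency -o json \
  | grep -iE 'listener_?filters_?timeout'
```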
If I recall, it's not just this one app. We had heard from at least one other person (with a completely different app) who sees the same behavior.
@jmazzitelli I'm not sure if the other app has tried out a new build with the default 5s timeout? It is understandable that the original timeout was too short (0.1s) and would cause this issue, but 5s is really long enough that a normal app should just work fine with it.
I believe @naphta was someone who saw this - would need to see if he tried with the latest 1.7 build.
Yeah, I think @naphta mentioned that it is because of Prometheus scraping instead of proto sniffing, although I have not got confirmation on that.
I don't think this is a 1.7-blocking issue, as the default timeout should satisfy most cases. I am going to close this. #24998 tracks the long-term fix to remove the dependency on proto sniffing.
It does appear that raising the proto sniffing timeout doesn't resolve my problem, and the ports in question are the ones scraped by Prometheus. My problem might be related to using prometheus-direct scraping, although it seems odd to me that the sidecar wouldn't understand where it's coming from.
Looks like no fix is coming in 1.7, so if you are affected, part of your telemetry will be reported incorrectly, coming from [...].
[1] Disable proto-sniffing by setting values.pilot.enableProtocolSniffingForInbound=false and values.pilot.enableProtocolSniffingForOutbound=false.
I'm not sure if @howardjohn has any other recommendation; I suggest pushing for #24998 to be fixed ASAP.
@FL3SH, your graph in particular is pretty wild. I'm not sure I've seen 2 PassthroughCluster nodes before, and I'm not sure how that happens.
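For completeness, a sketch of option [1] as install-time overrides (the flags are the ones named above; profile and revision handling are omitted). Note that with sniffing disabled, service ports should declare their protocol (e.g. http- port names), otherwise undeclared ports are treated as plain TCP:

```bash
# Turn off protocol sniffing for both inbound and outbound listeners.
istioctl install \
  --set values.pilot.enableProtocolSniffingForInbound=false \
  --set values.pilot.enableProtocolSniffingForOutbound=false
```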
I stand with @jshaughn on this suggestion.
I was able to clean up my graph quite a bit.
Can this issue have an impact on [...]?
Your graph looks normal to me. From your point of view, which part of your graph is wrong?
I thought my traffic would go from [...]
Oh I see. Yes, mTLS has to be enabled in order for both the client and server sidecars to get the peer's workload metadata, which is needed to connect the Kiali graph.
In conclusion, this is how the graph should look without TLS?
@FL3SH yes, for TCP connections the graph will be disconnected if mTLS is not enabled. For HTTP requests, the graph would still be connected even without mTLS, since we use headers to exchange workload metadata between source and destination.
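A minimal sketch of enabling that, assuming the standard PeerAuthentication API and that the demo runs in a namespace named travel-agency (adjust for your install):

```bash
# Enforce mTLS for the demo namespace so both sidecars learn the peer's workload
# metadata and the TCP edges in the Kiali graph connect source and destination.
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: travel-agency
spec:
  mtls:
    mode: STRICT
EOF
```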
Thank you for your explanation.
This is closed, but I can see it in a different version of the demo app: https://github.com/kiali/demos/tree/master/travels
I'm testing Istio 1.7.1 with minikube.
Yeah, there is no change between 1.6.x and 1.7.1. This issue still relies on the workaround of increasing the proto sniffing timeout, and this comment still applies to your demo app. FWIW, the long-term effort to make the timeout infinite is in progress and hopefully it can make 1.8.
We have a demo app called "travel agency" that, when run against Istio 1.6, generates the expected telemetry but also unexpected telemetry. Initial telemetry looks good and generates an expected Kiali graph. But quickly we see an unexpected TCP edge leading to PassthroughCluster, and then again from unknown to a destination service. After a few minutes we eventually see these additional TCP edges leading to Passthrough and then from unknown. It seems sort of like an intermittent leak of internal traffic. Here is a short video (using Kiali replay) that shows the issue. At the very beginning you see the expected, all-green, all-HTTP traffic. Quickly we see some of the unexpected (blue) TCP telemetry, and as I skip forward and advance the frames, the remaining edges show up:
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[x] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
The TCP edges to PassthroughCluster, and from unknown, should not show up, which means that Istio should not generate the underlying Prometheus time-series.
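One way to check for the offending series directly, assuming the addon Prometheus runs in istio-system (the service name and namespace may differ in your install; metric and label names are the ones quoted in the comments):

```bash
# Query Prometheus for TCP telemetry that points at the PassthroughCluster or
# comes from an unknown source workload.
kubectl -n istio-system port-forward svc/prometheus 9090:9090 &
sleep 2
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=istio_tcp_received_bytes_total{destination_service_name="PassthroughCluster"}'
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=istio_tcp_received_bytes_total{source_workload="unknown",reporter="destination"}'
```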
Steps to reproduce the bug
The travel-agency app is found here: https://github.com/lucasponce/travel-comparison-demo
There is a script to install the app here: https://github.com/jmazzitelli/test/tree/master/deploy-travel-agency
This will install travel agency on minikube:
$ CLIENT_EXE=minikube bash <(curl -L https://raw.githubusercontent.com/jmazzitelli/test/master/deploy-travel-agency/deploy-travel-agency-demo.sh)
Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
This has been recreated on both 1.6.0 and the 1.6.1 pre-release, using the default (V2) telemetry.
How was Istio installed?
istioctl
Environment where bug was observed (cloud vendor, OS, etc)
This has been recreated on Minikube and OpenShift, both on bare metal and AWS.
cc @jmazzitelli @lucasponce