
Unexplained telemetry involving passthrough and unknown #24379

Closed
jshaughn opened this issue Jun 3, 2020 · 74 comments · Fixed by #24763

Comments

@jshaughn

jshaughn commented Jun 3, 2020

We have a demo app called "travel agency" that, when run against Istio 1.6, generates the expected telemetry but also unexpected telemetry. The initial telemetry looks good and produces the expected Kiali graph, but we quickly see an unexpected TCP edge leading to PassthroughCluster, and then another from unknown to a destination service. After a few minutes we eventually see these additional TCP edges, leading to PassthroughCluster and coming from unknown, on more workloads. It seems like an intermittent leak of internal traffic. Here is a short video (using Kiali replay) that shows the issue. At the very beginning you see the expected all-green, all-HTTP traffic; quickly some of the unexpected (blue) TCP telemetry appears, and as I skip forward and advance the frames the remaining edges show up:

travel-bad-telemetry

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[x] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
The TCP edges to PassthroughCluster, and from unknown, should not show up, which means that Istio should not generate the underlying Prometheus time-series.
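For reference, a sketch of how we spot the unexpected series in Prometheus (this assumes the standard istio_tcp_* metric names and labels shown later in this thread, and Prometheus port-forwarded to localhost:9090):

# TCP series reported toward the PassthroughCluster pseudo-destination:
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=istio_tcp_sent_bytes_total{destination_service_name="PassthroughCluster"}'

# TCP series whose source workload could not be identified:
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=istio_tcp_received_bytes_total{reporter="destination",source_workload="unknown"}'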

Steps to reproduce the bug
The travel-agency app is found here: https://github.com/lucasponce/travel-comparison-demo

There is a script to install the app here: https://github.com/jmazzitelli/test/tree/master/deploy-
travel-agency

This will install travel agency on minikube:

$ CLIENT_EXE=minikube bash <(curl -L https://raw.githubusercontent.com/jmazzitelli/test/master/deploy-travel-agency/deploy-travel-agency-demo.sh)

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
This has been recreated on both 1.6.0 and the 1.6.1 pre-release, using the default (v2) telemetry.

How was Istio installed?
istioctl

Environment where bug was observed (cloud vendor, OS, etc)
This has been recreated on Minikube and OpenShift, both on bare metal and AWS.

cc @jmazzitelli @lucasponce

@jmazzitelli
Member

jmazzitelli commented Jun 3, 2020

There is a script to install the app here: https://github.com/jmazzitelli/test/tree/master/deploy-
travel-agency

That link is incomplete; it wrapped and got cut off. The full link to the script used to replicate this problem is here:

https://github.com/jmazzitelli/test/blob/master/deploy-travel-agency/deploy-travel-agency-demo.sh

This will install travel agency on minikube:

$ CLIENT_EXE=minikube bash <(curl -L https://raw.githubusercontent.com/jmazzitelli/test/master/deploy-travel-agency/deploy-travel-agency-demo.sh)

That is incorrect: don't set CLIENT_EXE to minikube, set it to kubectl. So to install on minikube, run this:

CLIENT_EXE=kubectl bash <(curl -L https://raw.githubusercontent.com/jmazzitelli/test/master/deploy-travel-agency/deploy-travel-agency-demo.sh)

@lucasponce
Contributor

I can confirm the same behaviour and add more data:

image

Basically, the travels-v1 workload invokes the hotels, insurances, flights, and cars services, and those in turn invoke the discounts services.

Debugging the istio-proxy sidecars, I can collect info on where PassthroughCluster requests are reported:

2020-06-03T10:44:13.406466597Z [2020-06-03T10:43:57.181Z] "- - -" 0 - "-" "-" 6351 15841 10273 - "-" "-" "-" "-" "10.111.31.203:8000" PassthroughCluster 172.17.0.24:41840 10.111.31.203:8000 172.17.0.24:41830 - -
2020-06-03T10:44:13.406477201Z [2020-06-03T10:44:07.421Z] "GET /travels/rome HTTP/1.1" 200 - "-" "-" 0 444 33 32 "-" "Go-http-client/1.1" "f9c92a5f-d036-9252-8ca7-ab987126f72d" "travels.travel-agency:8000" "127.0.0.1:8000" inbound|8000|http|travels.travel-agency.svc.cluster.local 127.0.0.1:42166 172.17.0.24:8000 172.17.0.26:52808 outbound_.8000_._.travels.travel-agency.svc.cluster.local default
2020-06-03T10:44:13.406488197Z [2020-06-03T10:44:07.689Z] "GET /travels/paris HTTP/1.1" 200 - "-" "-" 0 446 15 14 "-" "Go-http-client/1.1" "aab65bc6-47bc-96e5-939d-b77ba056f13c" "travels.travel-agency:8000" "127.0.0.1:8000" inbound|8000|http|travels.travel-agency.svc.cluster.local 127.0.0.1:42166 172.17.0.24:8000 172.17.0.27:43774 outbound_.8000_._.travels.travel-agency.svc.cluster.local default
2020-06-03T10:44:13.406499833Z [2020-06-03T10:44:07.703Z] "GET /travels/rome HTTP/1.1" 200 - "-" "-" 0 444 44 43 "-" "Go-http-client/1.1" "73452650-d18e-9482-b5d7-db26729ec3fc" "travels.travel-agency:8000" "127.0.0.1:8000" inbound|8000|http|travels.travel-agency.svc.cluster.local 127.0.0.1:42960 172.17.0.24:8000 172.17.0.28:59326 outbound_.8000_._.travels.travel-agency.svc.cluster.local default
2020-06-03T10:44:13.406511054Z [2020-06-03T10:44:07.800Z] "GET /travels/paris HTTP/1.1" 200 - "-" "-" 0 448 15 14 "-" "Go-http-client/1.1" "41965fab-3cec-9edd-b139-2749fbdaa661" "travels.travel-agency:8000" "127.0.0.1:8000" inbound|8000|http|travels.travel-agency.svc.cluster.local 127.0.0.1:38028 172.17.0.24:8000 172.17.0.29:55414 outbound_.8000_._.travels.travel-agency.svc.cluster.local default
2020-06-03T10:44:13.406522084Z [2020-06-03T10:43:44.869Z] "- - -" 0 - "-" "-" 19380 49781 23117 - "-" "-" "-" "-" "10.102.42.131:8000" PassthroughCluster 172.17.0.24:60356 10.102.42.131:8000 172.17.0.24:60352 - -
2020-06-03T10:44:13.40653199Z [2020-06-03T10:44:07.965Z] "GET /travels/london HTTP/1.1" 200 - "-" "-" 0 451 42 40 "-" "Go-http-client/1.1" "83f94115-49ad-9187-ace1-ad124991995e" "travels.travel-agency:8000" "127.0.0.1:8000" inbound|8000|http|travels.travel-agency.svc.cluster.local 127.0.0.1:38644 172.17.0.24:8000 172.17.0.30:40244 outbound_.8000_._.travels.travel-agency.svc.cluster.local default
2020-06-03T10:44:13.40654304Z [2020-06-03T10:44:07.948Z] "GET /travels/london HTTP/1.1" 200 - "-" "-" 0 452 60 57 "-" "Go-http-client/1.1" "40e46a83-436e-9876-be8d-051c3f4e81be" "travels.travel-agency:8000" "127.0.0.1:8000" inbound|8000|http|travels.travel-agency.svc.cluster.local 127.0.0.1:38028 172.17.0.24:8000 172.17.0.31:60750 outbound_.8000_._.travels.travel-agency.svc.cluster.local default
2020-06-03T10:44:13.406584607Z [2020-06-03T10:44:07.544Z] "- - -" 0 - "-" "-" 648 1618 476 - "-" "-" "-" "-" "10.111.31.203:8000" PassthroughCluster 172.17.0.24:42160 10.111.31.203:8000 172.17.0.24:42156 - -
2020-06-03T10:44:13.406466597Z [2020-06-03T10:43:57.181Z] "- - -" 0 - "-" "-" 6351 15841 10273 - "-" "-" "-" "-" "10.111.31.203:8000" PassthroughCluster 172.17.0.24:41840 10.111.31.203:8000 172.17.0.24:41830 - -

10.111.31.203:8000 -> flights service
172.17.0.24:41840 -> travels endpoint
172.17.0.24:41830 -> travels endpoint

2020-06-03T10:44:13.406522084Z [2020-06-03T10:43:44.869Z] "- - -" 0 - "-" "-" 19380 49781 23117 - "-" "-" "-" "-" "10.102.42.131:8000" PassthroughCluster 172.17.0.24:60356 10.102.42.131:8000 172.17.0.24:60352 - -

10.102.42.131:8000 -> cars service
172.17.0.24:60356 -> travels endpoint
172.17.0.24:60352 -> travels endpoint

2020-06-03T10:44:13.406584607Z [2020-06-03T10:44:07.544Z] "- - -" 0 - "-" "-" 648 1618 476 - "-" "-" "-" "-" "10.111.31.203:8000" PassthroughCluster 172.17.0.24:42160 10.111.31.203:8000 172.17.0.24:42156 - -

10.111.31.203:8000 -> flights service
172.17.0.24:42160 -> travels endpoint
172.17.0.24:42156 -> travels endpoint

There is no strange traffic; it looks like some requests are simply missed in the telemetry report.
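For anyone reproducing this, the IP-to-name mapping above can be re-derived with the usual kubectl views (namespace assumed to be travel-agency):

# Service ClusterIPs such as 10.111.31.203 (flights) and 10.102.42.131 (cars):
$ kubectl get svc -n travel-agency -o wide

# Pod IPs such as 172.17.0.24 (the travels endpoint):
$ kubectl get pods -n travel-agency -o wide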

@lucasponce
Contributor

Adding more proxy-config dumps from this scenario, just in case they help with diagnosis:

log-proxy-config-listeners.txt
log-proxy-config-routes.txt
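(For reference, dumps like these can be regenerated with istioctl; the pod name below is a placeholder.)

$ istioctl proxy-config listeners <travels-pod> -n travel-agency -o json > log-proxy-config-listeners.txt
$ istioctl proxy-config routes <travels-pod> -n travel-agency -o json > log-proxy-config-routes.txt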

@douglas-reid
Contributor

@bianpengyuan @nrjpoddar can you take a look at this issue?

@bianpengyuan bianpengyuan self-assigned this Jun 4, 2020
@bianpengyuan
Contributor

Will take a look.

@bianpengyuan
Contributor

Likely a dup of #23901

@jshaughn
Author

jshaughn commented Jun 8, 2020

BTW, I have also now seen this using bookinfo. In that case I did not get the edges from "unknown", but I did get unwanted edges to PassthroughCluster from some (not all) of the workloads. It seems there is some sort of internal telemetry "leak".

@bianpengyuan
Contributor

@jshaughn Could you please dump istio_tcp_received_bytes_total and istio_tcp_sent_bytes_total from Prometheus? I cannot reproduce what you saw with the travel app, where every workload has edges with unknown source and PassthroughCluster as destination, with consistent TCP bytes sent/received. I am not using CNI, though, so I am wondering whether it is because of CNI again this time.

I can only reproduce what you mentioned in #24379 (comment): some idle connections to the passthrough cluster.

istio_tcp_received_bytes_total{app="insurances",connection_security_policy="unknown",destination_app="unknown",destination_canonical_revision="latest",destination_canonical_service="unknown",destination_principal="unknown",destination_service="unknown",destination_service_name="PassthroughCluster",destination_service_namespace="unknown",destination_version="unknown",destination_workload="unknown",destination_workload_namespace="unknown",instance="10.56.2.178:15020",istio_io_rev="default",job="kubernetes-pods",namespace="travel-agency",pod_name="insurances-v1-554dc5f9cb-ssk5p",pod_template_hash="554dc5f9cb",reporter="source",request_protocol="tcp",response_flags="UF,URX",security_istio_io_tlsMode="istio",service_istio_io_canonical_name="insurances",service_istio_io_canonical_revision="v1",source_app="insurances",source_canonical_revision="v1",source_canonical_service="insurances",source_principal="unknown",source_version="v1",source_workload="insurances-v1",source_workload_namespace="travel-agency",version="v1"}

I still need to dig into why the idle connections to the passthrough cluster are happening; likely it is because of the HTTP inspector timeout.
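(If it helps, a sketch of one way to produce that dump, assuming the addon Prometheus is running in istio-system:)

$ kubectl -n istio-system port-forward svc/prometheus 9090:9090 &
$ curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=istio_tcp_received_bytes_total' > tcp_received.json
$ curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=istio_tcp_sent_bytes_total' > tcp_sent.json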

@bianpengyuan
Contributor

I can reproduce it in a cluster with cni enabled.

@jshaughn
Author

jshaughn commented Jun 8, 2020

@bianpengyuan I don't think the main issue is CNI-related, as we have seen it on non-CNI/minikube environments as well, but in this particular bookinfo case, where I am seeing only the passthrough destination edges, it is CNI. So perhaps this is not the same issue, but here is the bookinfo time-series dump:

istio24379.txt

@bianpengyuan
Contributor

Is the non-CNI environment the one where you run bookinfo? Do you see only the ghost passthrough edges that have no data sent/received, or the unknown-source edges as well? I think what we see with CNI and the ghost passthrough edges in the non-CNI env are separate issues... Anyway, I can reproduce both now; will dig more.

@jshaughn
Author

jshaughn commented Jun 8, 2020

The travel example has been reproduced with and without CNI and has both the unknown source edges as well as the passthrough destination edges.

The bookinfo data from today is with CNI and showed only the unwanted passthrough destination edges.

Sorry if I confused things but really glad you have been able to reproduce.

@Mengkzhaoyun

#24501

Is this the same issue as "Istio v1.6.1 TCP telemetry unknown & passthroughcluster" #24501?

I don't think so, because I never see the metrics reported correctly, even at the beginning; the traffic always goes to the wrong metrics, and the complete TCP metrics for a connection are never collected as one.

I can see a normal TCP connection's metrics split into two parts that do not know about each other, as below.

This metric is from ratings-v2-mongodb; it does not know the destination:

istio_tcp_received_bytes_total{connection_security_policy="unknown", destination_app="unknown", destination_canonical_revision="latest", destination_canonical_service="unknown", destination_principal="unknown", destination_service="unknown", destination_service_name="PassthroughCluster", destination_service_namespace="unknown", destination_version="unknown", destination_workload="unknown", destination_workload_namespace="unknown", instance="10.2.2.162:15090", job="envoy-stats", namespace="istio-samples", pod_name="ratings-v2-mongodb-7cdd894b4b-82wxg", reporter="source", request_protocol="tcp", response_flags="-", source_app="ratings", source_canonical_revision="v2-mongodb", source_canonical_service="ratings", source_principal="unknown", source_version="v2-mongodb", source_workload="ratings-v2-mongodb", source_workload_namespace="istio-samples"}

This one is from mongodb-v1; it does not know the source:

istio_tcp_received_bytes_total{connection_security_policy="none", destination_app="mongodb", destination_canonical_revision="v1", destination_canonical_service="mongodb", destination_principal="unknown", destination_service="mongodb.istio-samples.svc.cluster.local", destination_service_name="mongodb", destination_service_namespace="istio-samples", destination_version="v1", destination_workload="mongodb-v1", destination_workload_namespace="istio-samples", instance="10.2.0.25:15090", job="envoy-stats", namespace="istio-samples", pod_name="mongodb-v1-5d68dcf7f4-wrwnv", reporter="destination", request_protocol="tcp", response_flags="-", source_app="unknown", source_canonical_revision="latest", source_canonical_service="unknown", source_principal="unknown", source_version="unknown", source_workload="unknown", source_workload_namespace="unknown"}

@bianpengyuan
Contributor

@jshaughn @lucasponce Does the travels workload always send requests to a destination service in the same way? I am observing some weird behavior: for traffic from travels to other workloads (and from other workloads to the discounts app too), some of the requests do not have source metadata exchanged. So I am wondering whether the travels app always calls the other apps in a consistent way, or whether it somehow uses two different ways to call a destination service, which would introduce this weird behavior.

@lucasponce
Contributor

@bianpengyuan the travels workload is pretty "hello world"; it always works the same way:

https://github.com/lucasponce/travel-comparison-demo/blob/master/travel_agency/travel_agency.go#L245

So, it's just a simple GET request to the other services using Go's standard http library.

So, all the telemetry should be similar.
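(If it helps to rule out the client, the same kind of plain HTTP GET can be made by hand from a throwaway pod in the mesh; the /travels/rome path is the one visible in the access logs above, while the pod name and image are illustrative.)

$ kubectl run curl-test -n travel-agency --image=curlimages/curl --rm -it --restart=Never --command -- \
    curl -sv http://travels.travel-agency:8000/travels/rome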

@mandarjog mandarjog added this to the 1.7 milestone Jun 9, 2020
@mandarjog
Contributor

Does the total number of requests (bad edges + good edges) match what you expect the client to send?

@bianpengyuan
Contributor

After increasing the outbound proto sniffing timeout, the unknown edge disappeared. Could you try setting --set meshConfig.protocolDetectionTimeout=1s at installation and see if it fixes your problem too? We might need to consider increasing the timeout of outbound listener sniffing.
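For anyone else trying this, applying the setting looks roughly like the following (the 1s value is the one suggested above; the original default mentioned later in this thread is 0.1s):

$ istioctl install --set meshConfig.protocolDetectionTimeout=1s

# Verify it landed in the mesh config:
$ kubectl -n istio-system get configmap istio -o yaml | grep protocolDetectionTimeout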

@mandarjog
Contributor

That is great sleuthing.

@lambdai @PiotrSikora
Can this be disabled when we know the other side is HTTP? Or is it now on by default and unchangeable?

@bianpengyuan
Contributor

bianpengyuan commented Jun 10, 2020

Not sure if it is feasible to disable outbound proto sniffing, considering we are shifting to a single outbound listener, but at least we could increase the outbound proto sniffing timeout. @rshriram Do you have any concern about increasing the timeout? It is really confusing to have this kind of discrepancy between the server-side sidecar and the client sidecar because of different proto sniffing timeouts, even with just some sample apps.

@rshriram
Member

But it will impact all other traffic on that port, won't it? Why is sniffing taking such a long time? Is it an artifact of a slow client?

@bianpengyuan
Contributor

Yes, it is because of a slow client. IIUC, increasing the timeout will only impact plaintext server-first apps, not all traffic on that port, right?

@mandarjog
Contributor

@bianpengyuan better to declare the port protocol and disable sniffing here.
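A sketch of what declaring the port looks like in practice: give the Service port a protocol-prefixed name so the proxy never needs to sniff it. The Service below is illustrative, not taken from the demo manifests.

$ kubectl apply -n travel-agency -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: flights
spec:
  selector:
    app: flights
  ports:
  - name: http-flights    # the "http-" prefix declares the protocol explicitly
    port: 8000
    targetPort: 8000
EOF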

@JimmyCYJ
Member

@jshaughn is there any update on this issue?

@jshaughn
Author

jshaughn commented Jul 30, 2020

@JimmyCYJ @bianpengyuan Sorry for the delay; it was my first time navigating the install changes in 1.7. The new approach to addons means Kiali and Prometheus are now installed separately. But I now have 1.7-alpha.2 running and I can report that the issue is not fixed. It may be that the 5s default was not honored in 1.6, but I'm not surprised the problem remains because, as mentioned above, the issue was not fixed when manually using --set meshConfig.protocolDetectionTimeout=5s.
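(For anyone else navigating the 1.7 install changes: the addons now live under samples/addons in the release bundle, so installing them separately is roughly the following, run from the unpacked release directory.)

$ istioctl install
$ kubectl apply -f samples/addons/prometheus.yaml
$ kubectl apply -f samples/addons/kiali.yaml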

This is still a serious issue: the telemetry is simply not being reported correctly, and users will not be able to easily understand their traffic. It still seems like something unexpected is going on because, as discussed above, 5s seems like a large timeout. Hopefully the units are not being confused somewhere, so that what we think is 5s is not actually treated as 5ms or something like that. Otherwise it seems like the proto sniffing is just behaving differently than expected.

@bianpengyuan
Contributor

@jshaughn Can I get a config dump from your proxy?

@bianpengyuan
Contributor

@jshaughn Actually, never mind. I wanted to check whether the timeout is really applied, but I think it should have been, and I'd be really surprised if Envoy were not handling the timeout correctly, because it is so fundamental and lots of security features depend on it as well, such as TLS sniffing.

I cannot think of a workaround here with the current configurability. TBH, 5s should be long enough for 99% of setups. We cannot just set the timeout longer, or even to infinity, since that is going to cause trouble for server-first protocols like MySQL, and we don't have a way to opt a port out of proto sniffing right now. Also, proto sniffing needs to be on by default now because people rely on it to distinguish the type of traffic when a port is used by both HTTP and TCP.

Sorry about that... but I'd suggest either debugging with tcpdump why 5s is not sufficient for this app, or turning off sniffing for this demo app.

@jmazzitelli
Member

Sorry about that... but I'd suggest either debugging with tcpdump why 5s is not sufficient for this app, or turning off sniffing for this demo app.

If I recall, it's not just this one app. We have heard from at least one other person (with a completely different app) who sees the same behavior.

@bianpengyuan
Contributor

@jmazzitelli I'm not sure whether the other app has tried a new build with the default 5s timeout. It is understandable that the original timeout was too short (0.1s) and would cause this issue, but 5s is really long enough that a normal app should work fine with it.

@jmazzitelli
Member

@jmazzitelli I'm not sure whether the other app has tried a new build with the default 5s timeout. It is understandable that the original timeout was too short (0.1s) and would cause this issue, but 5s is really long enough that a normal app should work fine with it.

I believe @naphta was someone who saw this; we would need to see if he tried with the latest 1.7 build.

@bianpengyuan
Contributor

Yeah, I think @naphta mentioned that it is because of Prometheus scraping rather than proto sniffing, although I have not gotten confirmation on that.

@bianpengyuan
Contributor

I don't think this is a 1.7-blocking issue, as the default timeout should satisfy most cases. I am going to close this. #24998 tracks the long-term fix to remove the dependency on proto sniffing.

@FL3SH

FL3SH commented Aug 6, 2020

I am wondering whether this issue may be causing a problem with mTLS.
Screenshot 2020-08-06 at 11 35 03

If I set spec.mtls.mode: STRICT for my namespace, my app is no longer able to connect to MongoDB. The graph above was taken with the mode set to PERMISSIVE.
Did I misconfigure something, or shouldn't it just work?
The same issue occurs on 1.6.7 and 1.7.0-beta.1.
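(For reference, the policy I'm applying is essentially the following; the namespace name is a placeholder for mine.)

$ kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-namespace    # placeholder
spec:
  mtls:
    mode: STRICT
EOF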

@naphta

naphta commented Aug 6, 2020

Yeah, I think @naphta mentioned that it is because of Prometheus scraping rather than proto sniffing, although I have not gotten confirmation on that.

It does appear that raising the proto sniffing timeout doesn't resolve my problem, and the ports in question are the ones scraped by Prometheus. My problem might be related to using direct Prometheus scraping, although it seems odd to me that the sidecar wouldn't understand where the traffic is coming from.

@jshaughn
Author

jshaughn commented Aug 6, 2020

Looks like no fix is coming in 1.7, so if you are affected, part of your telemetry will be reported incorrectly, coming from unknown or going to PassthroughCluster. To improve the Kiali graph I can only recommend disabling proto sniffing completely if your app doesn't need it [1], or hiding the unwanted traffic by entering node=unknown OR service^=Pass in the Kiali graph hide.

[1] Disable proto-sniffing by setting values.pilot.enableProtocolSniffingForInbound=false and values.pilot.enableProtocolSniffingForOutbound=false.
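With istioctl, [1] amounts to roughly:

$ istioctl install \
    --set values.pilot.enableProtocolSniffingForInbound=false \
    --set values.pilot.enableProtocolSniffingForOutbound=false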

I'm not sure whether @howardjohn has any other recommendation; I suggest pushing for #24998 to be fixed ASAP.

@FL3SH, your graph in particular is pretty wild. I don't think I've seen two PassthroughCluster nodes before; I'm not sure how that happens.

@lucasponce
Contributor

I'm not sure if @howardjohn has any other recommendation, I suggest pushing on #24998 to be fixed ASAP.

I stand with @jshaughn on this suggestion.

@FL3SH

FL3SH commented Aug 6, 2020

I was able to clean up my graph quite a bit.
Screenshot 2020-08-06 at 16 32 14

  • removed VirtualServices for redis and mongo
  • fixed all port name prefixes (I was missing tcp- or http- for protocol selection)
  • added version labels to pods, services, and sts to remove Kiali warnings
  • added missing app labels to pods, services, and sts so Kiali groups them together (see the sketch after this list)
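The label changes amount to something like this on each Deployment (the names here are illustrative, and the patch rolls the pods so the new labels are picked up):

$ kubectl patch deployment myapp --type merge \
    -p '{"spec":{"template":{"metadata":{"labels":{"app":"myapp","version":"v1"}}}}}'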

@FL3SH

FL3SH commented Aug 7, 2020

After setting values.pilot.enableProtocolSniffingForInbound=false and values.pilot.enableProtocolSniffingForOutbound=false, nothing has really changed.

Is that the expected result?

Screenshot 2020-08-07 at 10 18 33

@FL3SH

FL3SH commented Aug 7, 2020

Can this issue have an impact on AuthorizationPolicy?

@bianpengyuan
Contributor

Your graph looks normal to me. From your point of view, which part of it is wrong?

@FL3SH

FL3SH commented Aug 7, 2020

unknown -> mongo-mongod

I thought my traffic would go from the myapp pod directly to the mongo-mongod service, without this strange mongo-mongod.defal.svc.cluster.local service.

@bianpengyuan
Contributor

Oh I see. Yes, mTLS has to be enabled in order for both the client and server sidecars to get the peer's workload metadata, which is needed to connect the Kiali graph.

@FL3SH

FL3SH commented Aug 7, 2020

So, in conclusion, this is how the graph should look without TLS?

@bianpengyuan
Contributor

bianpengyuan commented Aug 7, 2020

@FL3SH Yes, for TCP connections the graph will be disconnected if mTLS is not enabled. For HTTP requests the graph would still be connected even without mTLS, since we use headers to exchange workload metadata between source and destination.

@FL3SH

FL3SH commented Aug 7, 2020

Thank you for your explanation.

@lucasponce
Contributor

lucasponce commented Sep 14, 2020

This is closed, but I can still see it with a different version of the demo app:
image

https://github.com/kiali/demos/tree/master/travels

I'm testing Istio 1.7.1 with minikube.

@bianpengyuan
Contributor

bianpengyuan commented Sep 14, 2020

Yeah, there is no change between 1.6.x and 1.7.1. This issue still relies on the workaround of increasing the proto sniffing timeout, and this comment still applies to your demo app. FWIW, the long-term effort to make the timeout infinite is in progress and hopefully it will make 1.8.
