Non-deterministic config generation causing frequent inbound listener reload #18088

Closed
Stono opened this issue Oct 19, 2019 · 14 comments · Fixed by #18091


Stono commented Oct 19, 2019

Bug description
Following on from #18086 (I'm raising separate issues as I continue to debug #18043), I periodically (roughly every 5 minutes) see pilot-agent seemingly disconnect from pilot with:

[2019-10-19 18:59:49.171][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,

e.g.:

❯ k -n consumer-gateway logs --follow consumer-gateway-6ffb799bf5-kkf5q --tail=-1 -c istio-proxy | grep -i bazel-out
[2019-10-19 18:39:48.076][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 18:39:48.100][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 18:44:48.130][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 18:44:48.617][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 18:49:48.639][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 18:49:49.059][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 18:54:49.089][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 18:54:49.098][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 18:59:49.171][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 18:59:49.430][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 19:04:49.455][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 19:04:49.790][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 19:09:49.815][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 19:09:50.191][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 19:14:50.217][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 19:14:50.335][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 19:19:50.358][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 19:19:50.524][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 19:24:50.550][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 19:24:50.842][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(s

These disconnects correlate with big spikes in XDS pushes:

[Screenshot 2019-10-19 at 20:16:43: spike in XDS pushes]

I believe these pushes are causing listeners to reload, which in turn causes connections to be reset; this correlates with the metrics:

[Screenshot 2019-10-19 at 20:08:11: connection reset metrics]

I also believe this is the cause of my 503UCs on long-running requests, because drainDuration defaults to 45s.
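For reference, drainDuration lives under defaultConfig in the mesh config; one way to confirm the live value, assuming a default Helm install where the mesh config is stored in a ConfigMap named istio:

❯ kubectl -n istio-system get configmap istio -o jsonpath='{.data.mesh}' | grep drainDuration
# prints the proxy's configured drain window, e.g. "drainDuration: 45s"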

There is nothing in the discovery logs to indicate why these pushes are happening, and no config is changing.

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
If these XDS pushes are normal, they should not reset listeners.

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version)
1.3.3

How was Istio installed?
Helm

Environment where bug was observed (cloud vendor, OS, etc)
GKE 1.14



Stono commented Oct 19, 2019

I increased the log level on pilot and see this for every connected pod every 5 minutes:

2019-10-19T19:39:59.271754Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.12.81~vehicle-metric-service-5f94989698-2fx7d.vehicle-metric-service~vehicle-metric-service.svc.cluster.local-14 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.271962Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.32.76~mtd-listings-8cfd9c7cb-j7smq.mtd-listings~mtd-listings.svc.cluster.local-84 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.272042Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.28.13~forecourt-service-7c89b5f45b-r9vsn.forecourt-service~forecourt-service.svc.cluster.local-13 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.272111Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.5.124~vehicle-check-report-generator-84fd5d4968-hgq92.vehicle-check-report-generator~vehicle-check-report-generator.svc.cluster.local-79 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.272216Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.11.197~sauron-web-bb5c85c49-fq4fp.sauron-web~sauron-web.svc.cluster.local-60 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.272374Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.18.253~vehicle-valuations-service-5c457dbc94-hrnj6.vehicle-valuations-service~vehicle-valuations-service.svc.cluster.local-159 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.272434Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.30.165~location-service-fcc598765-gsxtk.location-service~location-service.svc.cluster.local-62 terminated rpc error: code = Canceled desc = context canceled
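For reference, one way to raise pilot's log level for the ads scope: a sketch assuming a standard Helm install where the discovery container is first in the pod spec (pilot-discovery accepts per-scope --log_output_level settings):

❯ kubectl -n istio-system patch deployment istio-pilot --type=json -p='[
    {"op": "add",
     "path": "/spec/template/spec/containers/0/args/-",
     "value": "--log_output_level=ads:debug"}
  ]'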


Stono commented Oct 19, 2019

On a side note, could anyone tell me the implications of setting a very large drainDuration?


Stono commented Oct 19, 2019

Here you can see XDS pushes vs 503UCs:

[Screenshot 2019-10-19 at 21:03:33: XDS pushes vs 503UCs]

Stono changed the title from "Proxy -> Pilot disconnection config reload" to "Proxy -> Pilot disconnection config reload causing 503UC" · Oct 19, 2019

Stono commented Oct 19, 2019

Here's an example of a failed request as a result of the above: note the 503UC from ingress-nginx and the 503DC from forecourt-service. This was actually a very short request (1.2 seconds), showing that this problem is not limited to long-running requests.

[Screenshot 2019-10-19 at 21:15:47: trace of the failed request]

nrjpoddar (Member) commented:

@Stono there was a similar issue, #17383, which we fixed (or thought we did) by making the config serialization more stable. I'm not sure, but maybe there's still some lingering serialization issue causing listeners to be reloaded without any user-provided configuration changes.


Stono commented Oct 19, 2019

@nrjpoddar I can confirm we're changing nothing whatsoever, but every 5 minutes we see this. I will give PILOT_DISABLE_XDS_MARSHALING_TO_ANY=true a go.
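For anyone wanting to try the same, a sketch of one way to set it, assuming the default Helm deployment and container names:

❯ kubectl -n istio-system set env deployment/istio-pilot -c discovery \
    PILOT_DISABLE_XDS_MARSHALING_TO_ANY=true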

nrjpoddar (Member) commented:

Can you provide a diff of the Envoy config dump, especially the listeners, before and after this happens?
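A sketch of one way to capture those dumps, via the Envoy admin endpoint on port 15000 inside the sidecar (the pod name is illustrative, taken from the logs above):

❯ POD=consumer-gateway-6ffb799bf5-kkf5q
❯ kubectl -n consumer-gateway exec "$POD" -c istio-proxy -- curl -s localhost:15000/config_dump > before.json
# ...wait for the next "gRPC config stream closed" / reconnect cycle...
❯ kubectl -n consumer-gateway exec "$POD" -c istio-proxy -- curl -s localhost:15000/config_dump > after.json
❯ diff before.json after.json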

nrjpoddar (Member) commented:

Yeah, if it does then it's the same underlying issue. Remember PILOT_DISABLE_XDS_MARSHALING_TO_ANY will cause a spike in CPU.

prune998 (Contributor) commented:

As pointed out in #18043, it's almost the same issue, but it seems to be limited to the virtualInbound listener and does not happen at each Pilot keepaliveMaxServerConnectionAge trigger.
I'll keep investigating on Monday, but you're free to find the solution until then :)
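To check whether the 5-minute cycle lines up with that trigger, one can inspect pilot's args for the flag (a sketch; deployment and container names assume a 1.3 Helm install):

❯ kubectl -n istio-system get deployment istio-pilot \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="discovery")].args}'
# look for --keepaliveMaxServerConnectionAge in the output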


Stono commented Oct 19, 2019

Nope, PILOT_DISABLE_XDS_MARSHALING_TO_ANY did not resolve the problem. I'll get you the dumps from before and after as requested.


Stono commented Oct 19, 2019

Caught a diff!

before:
abtest-allocator-668956679-hvbsl-before.log

after:
abtest-allocator-668956679-hvbsl-after.log

It seems to be non-deterministic ordering on the virtualInbound listener when two services point at the same pod on different ports, as @prune998 said.

-                                "name": "inbound|80|http-web|app.abtest-allocator.svc.cluster.local",
+                                "name": "inbound|9080|http-web-admin|admin.abtest-allocator.svc.cluster.local",

The service is:

❯ k get svc
NAME    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
admin   ClusterIP   10.192.3.34     <none>        9080/TCP   228d
app     ClusterIP   10.192.14.203   <none>        80/TCP     228d

The ep's are:

❯ k get ep
NAME    ENDPOINTS                                               AGE
admin   10.194.1.107:9080,10.194.2.92:9080,10.194.31.126:9080   228d
app     10.194.1.107:8080,10.194.2.92:8080,10.194.31.126:8080   228d
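A quick way to see the reordering, assuming the attached dumps are the JSON config_dump output: pull the inbound|<port>|<name>|<host> identifiers out of each file in document order and diff the orderings:

❯ grep -o '"inbound|[^"]*"' abtest-allocator-668956679-hvbsl-before.log > before.order
❯ grep -o '"inbound|[^"]*"' abtest-allocator-668956679-hvbsl-after.log > after.order
❯ diff before.order after.order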

Stono changed the title from "Proxy -> Pilot disconnection config reload causing 503UC" to "Non-deterministic config generation causing frequent inbound listener reload" · Oct 19, 2019

Stono commented Oct 19, 2019

I've made @howardjohn and @duderino aware of this, and raised a separate issue (#18089) to cover the fact listener reloads can result in 503UC (they're just exacerbated by this issue), and another issue (#18090) to question why pilot is pushing config every 5 minutes.


howardjohn commented Oct 19, 2019 via email


Stono commented Oct 30, 2019

This is fixed in 1.3.4

Stono closed this as completed · Oct 30, 2019