Non-deterministic config generation causing frequent inbound listener reload #18088

Closed
Stono opened this issue Oct 19, 2019 · 14 comments · Fixed by #18091


Stono commented Oct 19, 2019

Bug description
Following on from #18086 (I'm raising separate issues as I continue to debug #18043), I periodically (roughly every 5 minutes) see pilot-agent seemingly disconnect from pilot with:

[2019-10-19 18:59:49.171][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,

e.g.:

❯ k -n consumer-gateway logs --follow consumer-gateway-6ffb799bf5-kkf5q --tail=-1 -c istio-proxy | grep -i bazel-out
[2019-10-19 18:39:48.076][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 18:39:48.100][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 18:44:48.130][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 18:44:48.617][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 18:49:48.639][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 18:49:49.059][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 18:54:49.089][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 18:54:49.098][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 18:59:49.171][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 18:59:49.430][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 19:04:49.455][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 19:04:49.790][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 19:09:49.815][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 19:09:50.191][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 19:14:50.217][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 19:14:50.335][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 19:19:50.358][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 19:19:50.524][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);
[2019-10-19 19:24:50.550][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 13,
[2019-10-19 19:24:50.842][22][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:43] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(s

These disconnects correlate with big spikes in XDS pushes:

[Screenshot 2019-10-19 at 20:16:43: spike in XDS pushes]

I believe these pushes are causing listeners to reload, which in turn causes connections to be reset; this correlates with the metrics:

[Screenshot 2019-10-19 at 20:08:11: connection reset metrics]

I also believe this is the cause of my 503UCs on long-running requests, because drainDuration defaults to 45s.
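For reference, drainDuration lives under defaultConfig in the mesh config; one way to confirm the live value, assuming a default Helm install where the mesh config is stored in a ConfigMap named istio:

❯ kubectl -n istio-system get configmap istio -o jsonpath='{.data.mesh}' | grep drainDuration
# prints the proxy's configured drain window, e.g. "drainDuration: 45s"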

There is nothing in the discovery logs to indicate why these pushes are happening, and no config is changing.

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
If these XDS pushes are normal, they should not reset listeners.

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version)
1.3.3

How was Istio installed?
Helm

Environment where bug was observed (cloud vendor, OS, etc)
GKE 1.14



Stono commented Oct 19, 2019

I increased the log level on pilot and see this for every connected pod every 5 minutes:

2019-10-19T19:39:59.271754Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.12.81~vehicle-metric-service-5f94989698-2fx7d.vehicle-metric-service~vehicle-metric-service.svc.cluster.local-14 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.271962Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.32.76~mtd-listings-8cfd9c7cb-j7smq.mtd-listings~mtd-listings.svc.cluster.local-84 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.272042Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.28.13~forecourt-service-7c89b5f45b-r9vsn.forecourt-service~forecourt-service.svc.cluster.local-13 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.272111Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.5.124~vehicle-check-report-generator-84fd5d4968-hgq92.vehicle-check-report-generator~vehicle-check-report-generator.svc.cluster.local-79 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.272216Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.11.197~sauron-web-bb5c85c49-fq4fp.sauron-web~sauron-web.svc.cluster.local-60 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.272374Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.18.253~vehicle-valuations-service-5c457dbc94-hrnj6.vehicle-valuations-service~vehicle-valuations-service.svc.cluster.local-159 terminated rpc error: code = Canceled desc = context canceled
2019-10-19T19:39:59.272434Z     info    ads     ADS: "127.0.0.1:60686" sidecar~10.194.30.165~location-service-fcc598765-gsxtk.location-service~location-service.svc.cluster.local-62 terminated rpc error: code = Canceled desc = context canceled
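For reference, one way to raise pilot's log level for the ads scope: a sketch assuming a standard Helm install where the discovery container is first in the pod spec (pilot-discovery accepts per-scope --log_output_level settings):

❯ kubectl -n istio-system patch deployment istio-pilot --type=json -p='[
    {"op": "add",
     "path": "/spec/template/spec/containers/0/args/-",
     "value": "--log_output_level=ads:debug"}
  ]'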


Stono commented Oct 19, 2019

On a side note, could anyone tell me the implications of setting a very large drainDuration?


Stono commented Oct 19, 2019

Here you can see XDS pushes vs 503UCs:

[Screenshot 2019-10-19 at 21:03:33: XDS pushes vs 503UCs]

Stono changed the title from "Proxy -> Pilot disconnection config reload" to "Proxy -> Pilot disconnection config reload causing 503UC" · Oct 19, 2019

Stono commented Oct 19, 2019

Here's an example of a failed request as a result of the above: note the 503UC from ingress-nginx and the 503DC from forecourt-service. This was actually a very short request (1.2 seconds), showing that this problem is not limited to long-running requests.

[Screenshot 2019-10-19 at 21:15:47: trace of the failed request]

nrjpoddar (Member) commented:

@Stono there was a similar issue, #17383, which we fixed (or thought we did) by making the config serialization more stable. I'm not sure, but maybe there's still some lingering serialization issue causing listeners to be reloaded without any user-provided configuration changes.


Stono commented Oct 19, 2019

@nrjpoddar I can confirm we're changing nothing whatsoever, but every 5 minutes we see this. I will give PILOT_DISABLE_XDS_MARSHALING_TO_ANY=true a go.
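For anyone wanting to try the same, a sketch of one way to set it, assuming the default Helm deployment and container names:

❯ kubectl -n istio-system set env deployment/istio-pilot -c discovery \
    PILOT_DISABLE_XDS_MARSHALING_TO_ANY=true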

nrjpoddar (Member) commented:

Can you provide a diff of the Envoy config dump, especially the listeners, before and after this happens?
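A sketch of one way to capture those dumps, via the Envoy admin endpoint on port 15000 inside the sidecar (the pod name is illustrative, taken from the logs above):

❯ POD=consumer-gateway-6ffb799bf5-kkf5q
❯ kubectl -n consumer-gateway exec "$POD" -c istio-proxy -- curl -s localhost:15000/config_dump > before.json
# ...wait for the next "gRPC config stream closed" / reconnect cycle...
❯ kubectl -n consumer-gateway exec "$POD" -c istio-proxy -- curl -s localhost:15000/config_dump > after.json
❯ diff before.json after.json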

nrjpoddar (Member) commented:

Yeah, if it does then it's the same underlying issue. Remember PILOT_DISABLE_XDS_MARSHALING_TO_ANY will cause a spike in CPU.

prune998 (Contributor) commented:

As pointed out in #18043, it's almost the same issue, but it seems to be limited to the virtualInbound listener and does not happen at each Pilot keepaliveMaxServerConnectionAge trigger.
I'll keep investigating on Monday, but you're free to find the solution until then :)
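To check whether the 5-minute cycle lines up with that trigger, one can inspect pilot's args for the flag (a sketch; deployment and container names assume a 1.3 Helm install):

❯ kubectl -n istio-system get deployment istio-pilot \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="discovery")].args}'
# look for --keepaliveMaxServerConnectionAge in the output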


Stono commented Oct 19, 2019

Nope, PILOT_DISABLE_XDS_MARSHALING_TO_ANY did not resolve the problem. I'll get you the dumps from before and after as requested.


Stono commented Oct 19, 2019

Caught a diff!

before:
abtest-allocator-668956679-hvbsl-before.log

after:
abtest-allocator-668956679-hvbsl-after.log

It seems to be non-deterministic ordering on the virtualInbound listener when two services point at the same pod on different ports, as @prune998 said.

-                                "name": "inbound|80|http-web|app.abtest-allocator.svc.cluster.local",
+                                "name": "inbound|9080|http-web-admin|admin.abtest-allocator.svc.cluster.local",

The service is:

❯ k get svc
NAME    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
admin   ClusterIP   10.192.3.34     <none>        9080/TCP   228d
app     ClusterIP   10.192.14.203   <none>        80/TCP     228d

The ep's are:

❯ k get ep
NAME    ENDPOINTS                                               AGE
admin   10.194.1.107:9080,10.194.2.92:9080,10.194.31.126:9080   228d
app     10.194.1.107:8080,10.194.2.92:8080,10.194.31.126:8080   228d
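A quick way to see the reordering, assuming the attached dumps are the JSON config_dump output: pull the inbound|<port>|<name>|<host> identifiers out of each file in document order and diff the orderings:

❯ grep -o '"inbound|[^"]*"' abtest-allocator-668956679-hvbsl-before.log > before.order
❯ grep -o '"inbound|[^"]*"' abtest-allocator-668956679-hvbsl-after.log > after.order
❯ diff before.order after.order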

Stono changed the title from "Proxy -> Pilot disconnection config reload causing 503UC" to "Non-deterministic config generation causing frequent inbound listener reload" · Oct 19, 2019

Stono commented Oct 19, 2019

I've made @howardjohn and @duderino aware of this, and raised a separate issue (#18089) to cover the fact listener reloads can result in 503UC (they're just exacerbated by this issue), and another issue (#18090) to question why pilot is pushing config every 5 minutes.


howardjohn commented Oct 19, 2019 via email


Stono commented Oct 30, 2019

This is fixed in 1.3.4

Stono closed this as completed · Oct 30, 2019