503NR/BlackHole during istiod rollouts #28120
Comments
It's worth noting that our process removes the old certificate after the workload has rolled out:
But perhaps there's a bit of a race condition, because the old workload could still theoretically be terminating. My understanding of Kubernetes was that deleting a mounted certificate after the pod has started doesn't remove it - it would only stop a new pod from scheduling if the secret wasn't available to mount.
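To make that ordering concrete, here is a rough sketch of the kind of process being described (the deployment, namespace and secret names are placeholders, not the actual resources from this thread): wait for the rollout to finish before removing the old cert, otherwise terminating pods that still reference it can hit the race above.

```bash
# Hypothetical sketch of the rollout process described above.
# my-app, my-namespace and old-istio-cert are placeholders.

# Roll the workload so new pods pick up the new certificate/sidecar.
kubectl -n my-namespace rollout restart deployment/my-app

# Wait until the rollout has actually finished; if this step is skipped
# (or returns early), old pods may still be terminating when the cert
# is deleted, which is the suspected race.
kubectl -n my-namespace rollout status deployment/my-app --timeout=10m

# Only then remove the old certificate secret.
kubectl -n my-namespace delete secret old-istio-cert
```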
ahhhh!
Looking at the timestamps, it looks like a bug in the script caused this deployment to not actually wait, thus deleting certs for an in-use workload.
@howardjohn could you please reopen this? We just saw it again going between versions. Struggling to consistently reproduce at the moment, however.
@howardjohn I've narrowed this down: the two events above were caused by the removal of a flag. Whilst I'm not hugely concerned (as we won't be bringing that flag back), it'd be great if someone could confirm the problem we're seeing is only caused by that flag, and can't manifest as a result of other code paths. Potentially #26861?
I (not too surprisingly) was not able to reproduce this. Do you see this on any environment that we can play around in? Either direct access, or we can set up a call perhaps.
Bringing the Slack conversation here: I suspect there are possibly multiple issues; focusing on UF,URX first. Observed on 1.6 <-> 1.7 rollback/forward. Diff on config:
Logs:
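Aside: for anyone trying to spot the same UF/URX symptom on a suspect pod, one way is to look at the Istio standard metrics exposed by the sidecar's Prometheus endpoint. A sketch, with the pod name and namespace as placeholders:

```bash
# POD is a placeholder for an affected workload pod.
POD=my-app-7d4b9c-xyz

# In one terminal: forward the sidecar's Prometheus stats port locally.
kubectl -n my-namespace port-forward "$POD" 15090:15090

# In another terminal: filter istio_requests_total for requests recorded
# with the UF (upstream connection failure) or URX (retry limit exceeded)
# response flags.
curl -s localhost:15090/stats/prometheus \
  | grep istio_requests_total \
  | grep -E 'response_flags="(UF|URX)"'
```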
I built a custom image with the v2 TLS context and it gives zero downtime. So it seems extremely likely to be due to ^. Looking into what the "proper" fix is.
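For reference, one way to check which TLS context API version a given proxy has been pushed is to grep its cluster config dump for the transport-socket type names. A sketch; the pod name is a placeholder:

```bash
# POD is a placeholder for a workload pod to inspect.
POD=my-app-7d4b9c-xyz

# Count the TLS context types referenced by the sidecar's clusters.
# Types under envoy.api.v2.auth.* are the v2 API; types under
# envoy.extensions.transport_sockets.tls.v3.* are the v3 API.
istioctl proxy-config cluster "$POD" -o json \
  | grep -o 'envoy\.[A-Za-z0-9_.]*TlsContext' \
  | sort | uniq -c
```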
* Do not switch TLS version on 1.6 -> 1.7 upgrade

See #28120. See envoyproxy/envoy#13864.

This resolves a downtime event on in-place upgrade from 1.6 to 1.7 (a couple of seconds of 503s). This is intentionally sent only to 1.7, as it is only relevant for this branch. Please note this feature flag is shipped on by default. We have two choices:

* Off by default: anyone upgrading from 1.6 to 1.7 will continue to get downtime unless they read the release notes and add the flag.
* On by default: anyone with 1.7 already deployed, but that still has 1.6 proxies, will incur downtime unless they read the release notes and remove the flag.

I have chosen on by default, as the set of people with 1.6 proxies and a 1.7.x istiod upgrading to 1.7.5 seems far smaller than the set impacted by "off by default", and the mitigation is the same. Additionally, for those that are impacted, the impact will be exclusively the proxies on 1.6, which is presumably not 100% of proxies, whereas in the other case ALL proxies are on 1.6 and thus impacted.

* fix nil
* Fix initial fetch
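For operators who do need to flip the behaviour described above, istiod feature flags of this kind are typically set as environment variables on the istiod deployment. A sketch only - the variable name below is a placeholder, since the exact flag name comes from the 1.7.5 release notes rather than from this thread:

```bash
# FLAG_NAME_FROM_RELEASE_NOTES is a placeholder, not the real flag name.
kubectl -n istio-system set env deployment/istiod \
  FLAG_NAME_FROM_RELEASE_NOTES=false

# istiod rolls with the new value; wait for it to settle.
kubectl -n istio-system rollout status deployment/istiod
```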
I can confirm @howardjohn's fix for #28498 has fixed the majority of the rollout issues we experienced going from 1.6 to 1.7; however, we are still seeing a very small number of requests get "BlackHole". In this particular example, we have three instances of the source service:
The service they are talking to that failed:
I started looking into this further. The main thing I noticed was that, on the pods where we saw issues, I also see:
Unfortunately, as the failures are so few and I'm unable to consistently reproduce them, it's quite hard to get any more debug info.
For the logs about the failed CSR, that should be unknown to Envoy. From its perspective, it just sees that there is a 2s window where it requested a secret and didn't get one. I could see that making Envoy unhappy (it shouldn't, but it seems a plausible bug), but if it was the cause I would expect roughly a 2s outage.
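One way to sanity-check whether a proxy eventually received a valid workload certificate after such a window is to inspect its SDS state and stats. A sketch; the pod name is a placeholder:

```bash
# POD is a placeholder for an affected pod.
POD=my-app-7d4b9c-xyz

# List the SDS secrets the sidecar currently holds (status, validity).
istioctl proxy-config secret "$POD"

# The sidecar's Envoy stats also count SDS update successes/failures.
kubectl exec "$POD" -c istio-proxy -- pilot-agent request GET stats \
  | grep sds
```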
I've not been able to recreate the BlackHole/Zipkin one, and have had 1.7 on a cluster for 24 hours with 400 different apps and about 300 deployments, and haven't seen any other errors. I'm going to close this issue as sorted. For anyone coming to this, the fix will be released in 1.7.5.
Bug description
We are in the process of restarting all workloads one by one to move them to a 1.6 sidecar, following a control plane update. All applications are identical in terms of their Istio, deployment and pod config, just running a different `image`.

We noticed that when one app was patched, the 1.5 pods shutting down started getting `503NR` (no route configured) to all services they talk to. These metrics were recorded from `reporter=source`, so the source proxy (the 1.5 proxy on the terminating pods).

We have 3 `istiod` instances; here are the logs from each:

There are two services pointing to the same pods:
Unfortunately I don't have the proxy logs persisted.
I have been unable to reproduce this by going back and forth again, so it feels like a non-deterministic ordering or race condition of some sort.
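For context, the per-workload restart and the metric that surfaced the symptom look roughly like this. A sketch only: the deployment name and namespace are placeholders, and the commented Prometheus query just shows the shape of the istio_requests_total series described above.

```bash
# Restart workloads one at a time so they pick up the new sidecar,
# waiting for each rollout to settle before moving on.
# my-app / my-namespace are placeholders.
kubectl -n my-namespace rollout restart deployment/my-app
kubectl -n my-namespace rollout status deployment/my-app

# The symptom was visible as source-reported 503s with the NR flag,
# i.e. Prometheus series of the shape:
#   istio_requests_total{reporter="source", response_code="503", response_flags="NR"}
```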
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
No 503NRs during upgrade of the data plane.
Steps to reproduce the bug
Version (include the output of `istioctl version --remote`, `kubectl version --short`, and `helm version` if you used Helm)
1.6.12
How was Istio installed?
Helm
Environment where bug was observed (cloud vendor, OS, etc)