Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

503NR/BlackHole during istiod rollouts #28120

Closed
Stono opened this issue Oct 20, 2020 · 12 comments
Closed

503NR/BlackHole during istiod rollouts #28120

Stono opened this issue Oct 20, 2020 · 12 comments

Comments

@Stono
Copy link
Contributor

Stono commented Oct 20, 2020

Bug description
We are in the process of restarting all workloads one by one to move them to a 1.6 sidecar, following a control plane update. All applications are identical in terms of their istio, deployment and pod config, just running different image.

We noticed that when one app was patched, the 1.5 pods shutting down started getting 503NR (No route to host) to all services it talks to. These metrics were recorded from reporter=source, so the source proxy (the 1.5 proxy on the terminating pods).

Screenshot 2020-10-20 at 20 20 48

We have 3 istiod instances, here are the logs from each:

{"level":"info","time":"2020-10-20T18:38:58.573853Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:133"}
{"level":"info","time":"2020-10-20T18:38:58.703232Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{} admin.private-online-ads-admin.svc.cluster.local:{}] ConnectedEndpoints:133"}
{"level":"info","time":"2020-10-20T18:39:07.261877Z","scope":"ads","msg":"ADS:CDS: REQ sidecar~10.198.40.238~private-advert-service-6cdd8b76cc-rcvr5.private-advert-service~private-advert-service.svc.cluster.local-458 version:"}
{"level":"info","time":"2020-10-20T18:39:07.262842Z","scope":"ads","msg":"CDS: PUSH for node:private-advert-service-6cdd8b76cc-rcvr5.private-advert-service clusters:30 services:825 version:2020-10-20T18:11:26Z/248"}
{"level":"info","time":"2020-10-20T18:39:07.449983Z","scope":"ads","msg":"ADS:CDS: REQ sidecar~10.198.36.23~private-advert-service-679b6f7c74-qcqcl.private-advert-service~private-advert-service.svc.cluster.local-459 version:"}
{"level":"info","time":"2020-10-20T18:39:07.451000Z","scope":"ads","msg":"CDS: PUSH for node:private-advert-service-679b6f7c74-qcqcl.private-advert-service clusters:30 services:825 version:2020-10-20T18:11:26Z/248"}
{"level":"info","time":"2020-10-20T18:39:07.592162Z","scope":"ads","msg":"EDS: PUSH for node:private-advert-service-6cdd8b76cc-rcvr5.private-advert-service clusters:6 endpoints:19 empty:0"}
{"level":"warn","time":"2020-10-20T18:39:07.592193Z","scope":"ads","msg":"ADS:CDS: ACK ERROR sidecar~10.198.40.238~private-advert-service-6cdd8b76cc-rcvr5.private-advert-service~private-advert-service.svc.cluster.local-458 Internal:Error adding/updating cluster(s) outbound|53||kube-dns.kube-system.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.vehicle-metric-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.private-advert-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.stripe-payment-proxy.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.private-advert-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.abtest-allocator.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.abtest-allocator.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.stripe-payment-proxy.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.search-one-read.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.search-one-read.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.stock-management-api.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.stock-management-api.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.security-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.customer-performance-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.customer-performance-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.unified-registration-system.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.unified-registration-system.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.vehicle-data-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.vehicle-data-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem"}
{"level":"info","time":"2020-10-20T18:39:07.594882Z","scope":"ads","msg":"LDS: PUSH for node:private-advert-service-6cdd8b76cc-rcvr5.private-advert-service listeners:13"}
{"level":"info","time":"2020-10-20T18:39:07.620491Z","scope":"ads","msg":"RDS: PUSH for node:private-advert-service-6cdd8b76cc-rcvr5.private-advert-service routes:5"}
{"level":"warn","time":"2020-10-20T18:39:07.620526Z","scope":"ads","msg":"ADS:LDS: ACK ERROR sidecar~10.198.40.238~private-advert-service-6cdd8b76cc-rcvr5.private-advert-service~private-advert-service.svc.cluster.local-458 Internal:Error adding/updating listener(s) 10.198.40.238_8088: Invalid path: /etc/certs/root-cert.pem\n10.198.40.238_9080: Invalid path: /etc/certs/root-cert.pem\nvirtualInbound: Invalid path: /etc/certs/root-cert.pem\n"}
{"level":"info","time":"2020-10-20T18:39:08.601774Z","scope":"ads","msg":"EDS: PUSH for node:private-advert-service-679b6f7c74-qcqcl.private-advert-service clusters:25 endpoints:59 empty:0"}
{"level":"info","time":"2020-10-20T18:39:08.811433Z","scope":"ads","msg":"LDS: PUSH for node:private-advert-service-679b6f7c74-qcqcl.private-advert-service listeners:13"}
{"level":"info","time":"2020-10-20T18:39:08.866546Z","scope":"ads","msg":"RDS: PUSH for node:private-advert-service-679b6f7c74-qcqcl.private-advert-service routes:5"}
{"level":"info","time":"2020-10-20T18:39:53.961439Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:136"}
{"level":"info","time":"2020-10-20T18:40:03.044181Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:136"}
{"level":"info","time":"2020-10-20T18:40:04.616757Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:136"}
{"level":"info","time":"2020-10-20T18:40:14.989001Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:137"}
{"level":"info","time":"2020-10-20T18:40:27.046541Z","scope":"ads","msg":"ADS: \"10.198.40.238:46170\" sidecar~10.198.40.238~private-advert-service-6cdd8b76cc-rcvr5.private-advert-service~private-advert-service.svc.cluster.local-458 terminated rpc error: code = Canceled desc = context canceled"}
{"level":"info","time":"2020-10-20T18:40:27.711068Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:137"}
{"level":"info","time":"2020-10-20T18:40:28.759121Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:137"}
{"level":"info","time":"2020-10-20T18:40:31.504353Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:137"}
{"level":"info","time":"2020-10-20T18:40:34.994675Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:137"}
{"level":"info","time":"2020-10-20T18:40:56.833502Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:137"}
{"level":"info","time":"2020-10-20T18:40:57.879189Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:137"}
{"level":"info","time":"2020-10-20T18:38:58.574420Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:146"}
{"level":"info","time":"2020-10-20T18:38:58.703023Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{} admin.private-online-ads-admin.svc.cluster.local:{}] ConnectedEndpoints:146"}
{"level":"info","time":"2020-10-20T18:39:04.983477Z","scope":"ads","msg":"ADS:CDS: REQ sidecar~10.198.40.90~private-advert-service-679b6f7c74-cc57n.private-advert-service~private-advert-service.svc.cluster.local-460 version:"}
{"level":"info","time":"2020-10-20T18:39:04.984612Z","scope":"ads","msg":"CDS: PUSH for node:private-advert-service-679b6f7c74-cc57n.private-advert-service clusters:30 services:825 version:2020-10-20T18:11:26Z/249"}
{"level":"info","time":"2020-10-20T18:39:06.151788Z","scope":"ads","msg":"EDS: PUSH for node:private-advert-service-679b6f7c74-cc57n.private-advert-service clusters:25 endpoints:59 empty:0"}
{"level":"info","time":"2020-10-20T18:39:06.354340Z","scope":"ads","msg":"LDS: PUSH for node:private-advert-service-679b6f7c74-cc57n.private-advert-service listeners:13"}
{"level":"info","time":"2020-10-20T18:39:06.412639Z","scope":"ads","msg":"RDS: PUSH for node:private-advert-service-679b6f7c74-cc57n.private-advert-service routes:5"}
{"level":"info","time":"2020-10-20T18:39:24.148017Z","scope":"ads","msg":"ADS:CDS: REQ sidecar~10.198.18.98~private-advert-service-6cdd8b76cc-6424j.private-advert-service~private-advert-service.svc.cluster.local-461 version:"}
{"level":"info","time":"2020-10-20T18:39:24.149259Z","scope":"ads","msg":"CDS: PUSH for node:private-advert-service-6cdd8b76cc-6424j.private-advert-service clusters:30 services:825 version:2020-10-20T18:11:26Z/249"}
{"level":"info","time":"2020-10-20T18:39:24.465521Z","scope":"ads","msg":"EDS: PUSH for node:private-advert-service-6cdd8b76cc-6424j.private-advert-service clusters:6 endpoints:19 empty:0"}
{"level":"warn","time":"2020-10-20T18:39:24.465567Z","scope":"ads","msg":"ADS:CDS: ACK ERROR sidecar~10.198.18.98~private-advert-service-6cdd8b76cc-6424j.private-advert-service~private-advert-service.svc.cluster.local-461 Internal:Error adding/updating cluster(s) outbound|53||kube-dns.kube-system.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.vehicle-metric-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.private-advert-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.stripe-payment-proxy.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.private-advert-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.abtest-allocator.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.abtest-allocator.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.stripe-payment-proxy.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.search-one-read.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.search-one-read.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.stock-management-api.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.stock-management-api.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.security-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.customer-performance-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.customer-performance-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.unified-registration-system.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.unified-registration-system.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|80||app.vehicle-data-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem, outbound|9080||admin.vehicle-data-service.svc.cluster.local: Invalid path: /etc/certs/root-cert.pem"}
{"level":"info","time":"2020-10-20T18:39:24.468403Z","scope":"ads","msg":"LDS: PUSH for node:private-advert-service-6cdd8b76cc-6424j.private-advert-service listeners:13"}
{"level":"info","time":"2020-10-20T18:39:24.493365Z","scope":"ads","msg":"RDS: PUSH for node:private-advert-service-6cdd8b76cc-6424j.private-advert-service routes:5"}
{"level":"warn","time":"2020-10-20T18:39:24.493415Z","scope":"ads","msg":"ADS:LDS: ACK ERROR sidecar~10.198.18.98~private-advert-service-6cdd8b76cc-6424j.private-advert-service~private-advert-service.svc.cluster.local-461 Internal:Error adding/updating listener(s) 10.198.18.98_8088: Invalid path: /etc/certs/root-cert.pem\n10.198.18.98_9080: Invalid path: /etc/certs/root-cert.pem\nvirtualInbound: Invalid path: /etc/certs/root-cert.pem\n"}
{"level":"info","time":"2020-10-20T18:39:53.961199Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:147"}
{"level":"info","time":"2020-10-20T18:40:03.044364Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:147"}
{"level":"info","time":"2020-10-20T18:40:04.617457Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:147"}
{"level":"info","time":"2020-10-20T18:40:14.989028Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:147"}
{"level":"info","time":"2020-10-20T18:40:27.710887Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:147"}
{"level":"info","time":"2020-10-20T18:40:28.759122Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:147"}
{"level":"info","time":"2020-10-20T18:40:31.504097Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:147"}
{"level":"info","time":"2020-10-20T18:40:34.996649Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:147"}
{"level":"info","time":"2020-10-20T18:40:56.079555Z","scope":"ads","msg":"ADS: \"10.198.18.98:49238\" sidecar~10.198.18.98~private-advert-service-6cdd8b76cc-6424j.private-advert-service~private-advert-service.svc.cluster.local-461 terminated rpc error: code = Canceled desc = context canceled"}
{"level":"info","time":"2020-10-20T18:40:56.834310Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:147"}
{"level":"info","time":"2020-10-20T18:40:57.878854Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/249 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:147"}
{"level":"info","time":"2020-10-20T18:38:58.575231Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:182"}
{"level":"info","time":"2020-10-20T18:38:58.703383Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{} admin.private-online-ads-admin.svc.cluster.local:{}] ConnectedEndpoints:182"}
{"level":"info","time":"2020-10-20T18:39:53.962827Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:182"}
{"level":"info","time":"2020-10-20T18:40:03.045637Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:182"}
{"level":"info","time":"2020-10-20T18:40:04.618209Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:182"}
{"level":"info","time":"2020-10-20T18:40:07.635733Z","scope":"ads","msg":"ADS: \"10.198.40.238:49616\" sidecar~10.198.40.238~private-advert-service-6cdd8b76cc-rcvr5.private-advert-service~private-advert-service.svc.cluster.local-82 terminated with stream closed"}
{"level":"warn","time":"2020-10-20T18:40:07.636222Z","scope":"ads","msg":"EDS: Send failure sidecar~10.198.40.238~private-advert-service-6cdd8b76cc-rcvr5.private-advert-service~private-advert-service.svc.cluster.local-82: rpc error: code = Unavailable desc = transport is closing"}
{"level":"info","time":"2020-10-20T18:40:14.990589Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:181"}
{"level":"info","time":"2020-10-20T18:40:24.507327Z","scope":"ads","msg":"ADS: \"10.198.18.98:45782\" sidecar~10.198.18.98~private-advert-service-6cdd8b76cc-6424j.private-advert-service~private-advert-service.svc.cluster.local-151 terminated with stream closed"}
{"level":"warn","time":"2020-10-20T18:40:24.507756Z","scope":"ads","msg":"EDS: Send failure sidecar~10.198.18.98~private-advert-service-6cdd8b76cc-6424j.private-advert-service~private-advert-service.svc.cluster.local-151: rpc error: code = Unavailable desc = transport is closing"}
{"level":"info","time":"2020-10-20T18:40:27.711878Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:179"}
{"level":"info","time":"2020-10-20T18:40:28.759746Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:179"}
{"level":"info","time":"2020-10-20T18:40:31.503729Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{} app.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:179"}
{"level":"info","time":"2020-10-20T18:40:34.995838Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:179"}
{"level":"info","time":"2020-10-20T18:40:56.834051Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:179"}
{"level":"info","time":"2020-10-20T18:40:57.880219Z","scope":"ads","msg":"XDS:EDSInc Pushing:2020-10-20T18:11:26Z/248 Services:map[admin.private-advert-service.svc.cluster.local:{}] ConnectedEndpoints:179"}

There are two services pointing to the same pods:

❯ k -n private-advert-service get svc
NAME    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
admin   ClusterIP   10.192.43.156   <none>        9080/TCP   616d
app     ClusterIP   10.192.44.2     <none>        80/TCP     2y53d

Unfortunately I don't have the proxy logs persisted.

I have been unable to reproduce this by going back and forth again, so it feels like a non-deterministic ordering or race condition of some sort.

[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
No 503NR's during upgrade of data plane

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version --short and helm version if you used Helm)
1.6.12

How was Istio installed?
Helm

Environment where bug was observed (cloud vendor, OS, etc)

@Stono
Copy link
Contributor Author

Stono commented Oct 20, 2020

It's worth noting that our process removes the old certificate after the workload has rolled out:

{"timestamp":"2020-10-20T18:38:56.074Z","level":"info","module":"istio-rollout-sidecars","message":"handling namespace private-advert-service"}
{"timestamp":"2020-10-20T18:38:56.139Z","level":"warn","module":"istio-rollout-sidecars","message":"workload requires updating","istioVersion":"eu.gcr.io/at-artefacts/istio/proxyv2:b0ac5af0d118960afc2be74e2fb5391a5de68de7","targetIstioVersion":"eu.gcr.io/at-artefacts/istio/proxyv2:a67042ea4018b0a0c417c0b28aaa9b62","namespace":"private-advert-service","name":"private-advert-service"}
{"timestamp":"2020-10-20T18:38:56.214Z","level":"info","module":"client","message":"rollout triggered","namespace":"private-advert-service","name":"private-advert-service","currentGeneration":339,"targetGeneration":340}
{"timestamp":"2020-10-20T18:38:56.226Z","level":"info","module":"client","message":"rollout complete","namespace":"private-advert-service","name":"private-advert-service","kind":"Deployment"}
{"timestamp":"2020-10-20T18:38:56.279Z","level":"info","module":"istio-rollout-sidecars","message":"deleting 1.5 secret","namespace":"private-advert-service","name":"istio.default"}

But perhaps there's a bit of a race condition because the old workload could still theoretically be terminating. My understanding of kubernetes was that deleting a mounted certificate after the pod has started doesn't remove it - it'd only stop a new pod scheduling if the secret wasn't available to mount.

@Stono
Copy link
Contributor Author

Stono commented Oct 20, 2020

ahhhh!

{"timestamp":"2020-10-20T18:38:56.214Z","level":"info","module":"client","message":"rollout triggered","namespace":"private-advert-service","name":"private-advert-service","currentGeneration":339,"targetGeneration":340}
{"timestamp":"2020-10-20T18:38:56.226Z","level":"info","module":"client","message":"rollout complete","namespace":"private-advert-service","name":"private-advert-service","kind":"Deployment"}

Look at the time stamps, it looks like a bug in the script caused this deployment to not actually wait, thus deleting certs for an in-use workload.

@Stono Stono closed this as completed Oct 20, 2020
@Stono
Copy link
Contributor Author

Stono commented Oct 29, 2020

@howardjohn could you please reopen this because we just saw it again going from 1.6.12 to 1.6.13, this time it wasn't caused by me so there's definitely something odd going on during istiod deployments.

Struggling to consistently reproduce at the moment however.

Screenshot 2020-10-29 at 14 59 54

@howardjohn howardjohn reopened this Oct 29, 2020
@Stono
Copy link
Contributor Author

Stono commented Oct 30, 2020

This just happened again much more significantly when we rolled istiod:

Screenshot 2020-10-30 at 08 31 40

@Stono Stono changed the title 503NR during rolling restart of application going from 1.5 to 1.6 503NR/BlackHole during istiod rollouts Oct 30, 2020
@Stono
Copy link
Contributor Author

Stono commented Oct 30, 2020

@howardjohn I've narrowed this down to PILOT_ENABLE_LEGACY_INBOUND_LISTENERS. When adding or removing that flag, there's a moment where all the sidecars are broken.

The two events above were caused by the removal of PILOT_ENABLE_LEGACY_INBOUND_LISTENERS flowing through the environments now we're rolled out to 1.6 and it is no longer required.

Whilst i'm not hugely concerned (as we won't be bringing that flag back) - It'd be great if someone could confirm the problem we're seeing is only caused by that flag, and can't manifest as a result of other code paths? Potentially #26861?

@Stono
Copy link
Contributor Author

Stono commented Oct 30, 2020

Definitely other situations causing this too, this was going from 1.6.13 -> 1.7.4:

Screenshot 2020-10-30 at 12 46 52

@howardjohn
Copy link
Member

I (not too surprisingly) was not able to reproduce this. Do you see this on any environment that we can play around in? Either direct access or we can set up a call perhaps

@howardjohn
Copy link
Member

bringing slack conversation here:

Suspect there are possible multiple issues. focusing on UF,URX first. Observed on 1.6 <-> 1.7 rollback/forward.

Diff on config:

1719c1682
<           "@type": "type.googleapis.com/envoy.api.v2.auth.UpstreamTlsContext",
---
>           "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
1738,1739c1701
<               },
<               "initial_fetch_timeout": "0s"
---
>               }
1757,1758c1719
<               },
<               "initial_fetch_timeout": "0s"
---
>               }

Logs:
client disconnected, failure reason: TLS error: Secret is not supplied by SDS

@howardjohn
Copy link
Member

I build a custom image with v2 TLS context and its zero downtime. So seems extremely likely due to ^. Looking into what the "proper" fix is

howardjohn added a commit to howardjohn/istio that referenced this issue Nov 2, 2020
See istio#28120
See envoyproxy/envoy#13864

This resolves a downtime event on in place upgrade from 1.6 to 1.7. This
is a couple seconds of 503s.

This is intentionally sent only to 1.7 as it is only relevant for this
branch.

Please note this feature flag is shipped by on by default. We have two
choices:

* Off by default. Anyone upgrading from 1.6 to 1.7 will continue to get
downtime unless they read the release notes and add the flag.
* On by default. Anyone with 1.7 already deployed, but that still has
1.6 proxies will encur a downtime unless they read the release notes and
remove the flag.

I have chosen on by default, as the set of people with 1.6 proxies with
1.7.x Istiod upgrading to 1.7.5 seems far smaller than the impacted set
of "off by default", and the mitigation is the same. Additionally, for
those that are impacted, the impact will be exclusively the proxies on
1.6, which is presumably not 100% of proxies, whereas in the other case
ALL proxies are 1.6 and thus impacted.
istio-testing pushed a commit that referenced this issue Nov 2, 2020
* Do not switch TLS version on 1.6 -> 1.7 upgrade

See #28120
See envoyproxy/envoy#13864

This resolves a downtime event on in place upgrade from 1.6 to 1.7. This
is a couple seconds of 503s.

This is intentionally sent only to 1.7 as it is only relevant for this
branch.

Please note this feature flag is shipped by on by default. We have two
choices:

* Off by default. Anyone upgrading from 1.6 to 1.7 will continue to get
downtime unless they read the release notes and add the flag.
* On by default. Anyone with 1.7 already deployed, but that still has
1.6 proxies will encur a downtime unless they read the release notes and
remove the flag.

I have chosen on by default, as the set of people with 1.6 proxies with
1.7.x Istiod upgrading to 1.7.5 seems far smaller than the impacted set
of "off by default", and the mitigation is the same. Additionally, for
those that are impacted, the impact will be exclusively the proxies on
1.6, which is presumably not 100% of proxies, whereas in the other case
ALL proxies are 1.6 and thus impacted.

* fix nil

* Fix initial fetch
@Stono
Copy link
Contributor Author

Stono commented Nov 4, 2020

I can confirm @howardjohn fix for #28498 has fixed the majority of the rollout issues we experienced going from 1.6 to 1.7, however we are still seeing some (very small amount) of requests get "BlackHole" (we run REGISTRY_ONLY) as the control plane rolls out, as you can see here:

Screenshot 2020-11-04 at 14 31 47

In this particular example, we have three instances of the source service:

  • ingress-nginx-external-controller-6b79d9f487-5fk6l (0 failed requests)
  • ingress-nginx-external-controller-6b79d9f487-mkmcs (2 failed requests)
  • ingress-nginx-external-controller-6b79d9f487-rn9xk (2 failed requests)

The service they are talking to that failed (zipkin) is being called from the pod itself (nginx) as nginx creates its own traces, and is configured with this DestiniationRule:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: disable-mtls-to-zipkin
  namespace: istio-system
spec:
  host: zipkin.istio-system.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE

I started looking at the istio-proxy logs to look for differences on ingress-nginx-external-controller-6b79d9f487-5fk6l (which had no errors). The logs for the time frame are here:
istio.tar.gz

The main thing I noticed was that on the pods where we saw issues, I also see:

{"level":"error","time":"2020-11-04T14:28:34.620008Z","scope":"citadelclient","msg":"Failed to create certificate: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.192.37.141:15012: connect: connection refused\""}
{"level":"warn","time":"2020-11-04T14:28:34.671159Z","scope":"cache","msg":"resource:default request:2ae1c56d-527a-43f5-b503-da1b47e149b7 CSR failed with error: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.192.37.141:15012: connect: connection refused\", retry in 50 millisec"}
{"level":"error","time":"2020-11-04T14:28:34.671394Z","scope":"citadelclient","msg":"Failed to create certificate: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.192.37.141:15012: connect: connection refused\""}
{"level":"warn","time":"2020-11-04T14:28:34.771605Z","scope":"cache","msg":"resource:default request:2ae1c56d-527a-43f5-b503-da1b47e149b7 CSR failed with error: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.192.37.141:15012: connect: connection refused\", retry in 100 millisec"}
{"level":"error","time":"2020-11-04T14:28:34.771717Z","scope":"citadelclient","msg":"Failed to create certificate: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.192.37.141:15012: connect: connection refused\""}
{"level":"warn","time":"2020-11-04T14:28:34.971865Z","scope":"cache","msg":"resource:default request:2ae1c56d-527a-43f5-b503-da1b47e149b7 CSR failed with error: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.192.37.141:15012: connect: connection refused\", retry in 200 millisec"}
{"level":"error","time":"2020-11-04T14:28:34.971983Z","scope":"citadelclient","msg":"Failed to create certificate: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.192.37.141:15012: connect: connection refused\""}

Unfortunately as the failures are so few, and i'm unable to inconsistently reproduce them, it's quite hard to get any more debug info.

@howardjohn
Copy link
Member

For the logs about failed CSR, that should be unknown to envoy. From its perspective it just seeings that there is a 2s window where it request a secret and didn't get one. I could see that making envoy unhappy (it shouldn't, but seems a plausible bug), but if it was the cause I would expect 2s outage likely.

@Stono
Copy link
Contributor Author

Stono commented Nov 5, 2020

I've not been able to recreate the BlackHole/Zipkin one and have had 1.7 on a cluster for 24 hours with 400 different apps and about 300 deployments, and haven't seen any other errors.

I'm going to close this issue as sorted, for anyone coming to this the fix will be released in 1.7.5.

@Stono Stono closed this as completed Nov 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants