
Istio is picking up new virtualservice slowly #25685

Closed
xliuxu opened this issue Jul 21, 2020 · 82 comments · Fixed by #28261
xliuxu commented Jul 21, 2020

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[X] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

client version: 1.5.4
cluster-local-gateway version:
cluster-local-gateway version:
cluster-local-gateway version:
ingressgateway version: 1.5.4
ingressgateway version: 1.5.4
ingressgateway version: 1.5.4
pilot version: 1.5.4
pilot version: 1.5.4
pilot version: 1.5.4
data plane version: 1.5.4 (6 proxies)

How was Istio installed?

cat << EOF > ./istio-minimal-operator.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        autoInject: disabled
      useMCP: false
      # The third-party-jwt is not enabled on all k8s.
      # See: https://istio.io/docs/ops/best-practices/security/#configure-third-party-service-account-tokens
      jwtPolicy: first-party-jwt

  addonComponents:
    pilot:
      enabled: true
    prometheus:
      enabled: false

  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true
      - name: cluster-local-gateway
        enabled: true
        label:
          istio: cluster-local-gateway
          app: cluster-local-gateway
        k8s:
          service:
            type: ClusterIP
            ports:
            - port: 15020
              name: status-port
            - port: 80
              name: http2
            - port: 443
              name: https
EOF

./istioctl manifest generate -f istio-minimal-operator.yaml \
--set values.gateways.istio-egressgateway.enabled=false \
--set values.gateways.istio-ingressgateway.sds.enabled=true \
--set values.gateways.istio-ingressgateway.autoscaleMin=3 \
--set values.gateways.istio-ingressgateway.autoscaleMax=6 \
--set values.pilot.autoscaleMin=3 \
--set values.pilot.autoscaleMax=6 \
--set hub=icr.io/ext/istio  > istio.yaml

kubectl apply -f istio.yaml    # more visibility than istioctl manifest apply

Environment where bug was observed (cloud vendor, OS, etc)
IKS

When we create ~1k VirtualServices in a single cluster, the ingress gateway picks up new VirtualServices slowly.

[chart: probe ready time vs. number of existing virtual services]
The blue line in the chart shows the overall time until probes to the gateway pod return success (a 200 response code and the expected K-Network-Hash header). The stepped increase is caused by the exponential retry backoff used for probing, but the overall trend appears to grow linearly: with 800 virtual services present, it takes ~50s for a new virtual service to be picked up.
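
For context, that stepped curve is what a capped exponential backoff produces. A minimal sketch of such a probe loop (the 1s base and 30s cap are hypothetical parameters, not Knative's actual prober settings):

```shell
# Sketch of a prober with capped exponential backoff (hypothetical
# parameters; for illustrating the stepped curve only).
backoff_delay() {
  # backoff_delay ATTEMPT CAP -> seconds to wait before that attempt:
  # 1, 2, 4, 8, ... doubling each time, never exceeding CAP.
  attempt=$1; cap=$2
  delay=$(( 1 << (attempt - 1) ))
  [ "$delay" -gt "$cap" ] && delay=$cap
  echo "$delay"
}

probe_with_backoff() {
  # Run the given command until it succeeds, backing off between tries.
  n=1
  until "$@"; do
    sleep "$(backoff_delay "$n" 30)"
    n=$((n + 1))
  done
}

# Example: probe_with_backoff curl -sf -o /dev/null \
#   -H "K-Network-Probe: probe" http://GATEWAY_IP/
```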

I also dumped and grepped the configs in the istio-ingressgateway pod after the virtual service was created.
Initially the output was empty, and it took about 1 minute for the result below to show up.

curl localhost:15000/config_dump |grep testabc
      "name": "outbound_.8022_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.8022_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound_.80_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.80_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound_.80_._.testabc-hhmjj-1.default.svc.cluster.local",
       "service_name": "outbound_.80_._.testabc-hhmjj-1.default.svc.cluster.local"
      "name": "outbound_.9090_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.9090_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound_.9091_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.9091_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|8022||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|8022||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.8022_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|80||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|80||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.80_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local",
       "service_name": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local"
          "sni": "outbound_.80_._.testabc-hhmjj-1.default.svc.cluster.local"
      "name": "outbound|9090||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|9090||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.9090_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|9091||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|9091||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.9091_._.testabc-hhmjj-1-private.default.svc.cluster.local"
        "name": "testabc.default.dev-serving.codeengine.dev.appdomain.cloud:80",
         "testabc.default.dev-serving.codeengine.dev.appdomain.cloud",
         "testabc.default.dev-serving.codeengine.dev.appdomain.cloud:80"
             "prefix_match": "testabc.default.dev-serving.codeengine.dev.appdomain.cloud"
           "cluster": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local",
             "config": "/apis/networking.istio.io/v1alpha3/namespaces/default/virtual-service/testabc-ingress"
           "operation": "testabc-hhmjj-1.default.svc.cluster.local:80/*"
             "value": "testabc-hhmjj-1"
        "name": "testabc.default.svc.cluster.local:80",
         "testabc.default.svc.cluster.local",
         "testabc.default.svc.cluster.local:80"
             "prefix_match": "testabc.default.dev-serving.codeengine.dev.appdomain.cloud"
           "cluster": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local",
             "config": "/apis/networking.istio.io/v1alpha3/namespaces/default/virtual-service/testabc-ingress"
           "operation": "testabc-hhmjj-1.default.svc.cluster.local:80/*"
             "value": "testabc-hhmjj-1"
        "name": "testabc.default.svc:80",
         "testabc.default.svc",
         "testabc.default.svc:80"
             "prefix_match": "testabc.default.dev-serving.codeengine.dev.appdomain.cloud"
           "cluster": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local",
             "config": "/apis/networking.istio.io/v1alpha3/namespaces/default/virtual-service/testabc-ingress"
           "operation": "testabc-hhmjj-1.default.svc.cluster.local:80/*"
             "value": "testabc-hhmjj-1"
        "name": "testabc.default:80",
         "testabc.default",
         "testabc.default:80"
             "prefix_match": "testabc.default.dev-serving.codeengine.dev.appdomain.cloud"
           "cluster": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local",
             "config": "/apis/networking.istio.io/v1alpha3/namespaces/default/virtual-service/testabc-ingress"
           "operation": "testabc-hhmjj-1.default.svc.cluster.local:80/*"
             "value": "testabc-hhmjj-1"

There is no memory or CPU pressure on the Istio components.

kubectl -n istio-system top pods
NAME                                     CPU(cores)   MEMORY(bytes)
cluster-local-gateway-644fd5f945-f4d6d   29m          953Mi
cluster-local-gateway-644fd5f945-mlhkc   34m          952Mi
cluster-local-gateway-644fd5f945-nt4qk   30m          958Mi
istio-ingressgateway-7759f4649d-b5whx    37m          1254Mi
istio-ingressgateway-7759f4649d-g2ppv    43m          1262Mi
istio-ingressgateway-7759f4649d-pv6qs    48m          1431Mi
istiod-6fb9877647-7h7wk                  8m           875Mi
istiod-6fb9877647-k6n9m                  9m           914Mi
istiod-6fb9877647-mncpn                  26m          925Mi

Below is a typical virtual service created by knative.

---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  annotations:
    networking.knative.dev/ingress.class: istio.ingress.networking.knative.dev
  creationTimestamp: "2020-07-20T08:23:54Z"
  generation: 1
  labels:
    networking.internal.knative.dev/ingress: hello29
    serving.knative.dev/route: hello29
    serving.knative.dev/routeNamespace: default
  name: hello29-ingress
  namespace: default
  ownerReferences:
  - apiVersion: networking.internal.knative.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Ingress
    name: hello29
    uid: 433c415d-901e-4154-bfd9-43178d0db192
  resourceVersion: "47989694"
  selfLink: /apis/networking.istio.io/v1beta1/namespaces/default/virtualservices/hello29-ingress
  uid: 20162863-b721-4d67-aa19-b84adc7dffe0
spec:
  gateways:
  - knative-serving/cluster-local-gateway
  - knative-serving/knative-ingress-gateway
  hosts:
  - hello29.default
  - hello29.default.dev-serving.codeengine.dev.appdomain.cloud
  - hello29.default.svc
  - hello29.default.svc.cluster.local
  http:
  - headers:
      request:
        set:
          K-Network-Hash: 12a72f65db15ba3a00ad16b328c40b5398a86cc84ba3239ad37f4d5ef811b0fa
    match:
    - authority:
        prefix: hello29.default
      gateways:
      - knative-serving/cluster-local-gateway
    retries: {}
    route:
    - destination:
        host: hello29-cpwpf-1.default.svc.cluster.local
        port:
          number: 80
      headers:
        request:
          set:
            Knative-Serving-Namespace: default
            Knative-Serving-Revision: hello29-cpwpf-1
      weight: 100
    timeout: 600s
  - headers:
      request:
        set:
          K-Network-Hash: 12a72f65db15ba3a00ad16b328c40b5398a86cc84ba3239ad37f4d5ef811b0fa
    match:
    - authority:
        prefix: hello29.default.dev-serving.codeengine.dev.appdomain.cloud
      gateways:
      - knative-serving/knative-ingress-gateway
    retries: {}
    route:
    - destination:
        host: hello29-cpwpf-1.default.svc.cluster.local
        port:
          number: 80
      headers:
        request:
          set:
            Knative-Serving-Namespace: default
            Knative-Serving-Revision: hello29-cpwpf-1
      weight: 100
    timeout: 600s
---
linsun (Member) commented Jul 21, 2020

How is pilot doing? Is the workload properly distributed across the pilot instances?

Also, can you describe how quickly you are applying these VirtualServices in your scenario? And can you disable telemetry v2 if you are not using it?

ZhuangYuZY commented:
@linsun how do we check whether pilot is doing well? In our case, the diagram (blue line) shows the delay increasing continually as we create Knative services; each Knative service creates one VirtualService to configure a route in the Istio gateway.

The testabc output above is the route configuration in the Istio gateway after we created 800 Knative services; with 800 existing services there is a 1-minute delay between VS creation and the route being configured in the gateway.

As for telemetry, I do not think we enabled it; you can check the install script we are using. Or how can we check whether telemetry v2 is enabled?
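
For what it's worth, one way to check is to list the EnvoyFilters in istio-system. This assumes telemetry v2 is implemented as EnvoyFilters with names like stats-filter-1.6 and metadata-exchange-1.6, as in the default 1.5/1.6 profiles:

```shell
# If telemetry v2 is installed, EnvoyFilters like "stats-filter-1.6"
# and "metadata-exchange-1.6" exist in istio-system (assumption based
# on default Istio 1.5/1.6 profiles). Cluster-side usage:
#   kubectl get envoyfilters.networking.istio.io -n istio-system | v2_filters
v2_filters() {
  grep -E 'stats-filter|metadata-exchange'
}
```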

ZhuangYuZY commented:
@linsun we tested this case on Istio 1.6.5; the behavior is the same. Thank you.

sdake (Member) commented Jul 30, 2020

@ZhuangYuZY I have not personally run a system-wide profile (yet) with your use case; however, kubeapi is usually the culprit in this scenario. I have run system-wide profiles of creating a ServiceEntry CR and seen very poor performance in past versions.

The diagnosed problem in the SE scenario is that kubeapi rate-limits incoming CR creation, which makes it appear as if "Istio" is slow. Istio's discovery mechanism works by reading the K8s CR list; K8s CR creation is slow. Therefore, Istio is often blamed for K8s API scalability problems.

Kubernetes can only handle so many creations of any type of object per second, and the count is low.
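
One way to isolate raw kube-apiserver throughput from Istio's processing is to time the creation of many trivial VirtualServices by themselves. A hypothetical generator (the perf-vs-N names and example.com hosts are made up for illustration):

```shell
# Emit N minimal VirtualService documents so that only kubeapi CR
# creation is being timed (hypothetical names/hosts).
make_vs_manifest() {
  n=$1; i=1
  while [ "$i" -le "$n" ]; do
    cat <<EOF
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: perf-vs-$i
spec:
  hosts: ["perf-$i.example.com"]
  http:
  - route:
    - destination:
        host: perf-$i.default.svc.cluster.local
EOF
    i=$((i + 1))
  done
}

# Cluster-side usage:
#   make_vs_manifest 1000 > many-vs.yaml
#   time kubectl apply -f many-vs.yaml
```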

I'll take a look early next week at this particular problem for you and verify this is not just how Kubernetes "works"...

Cheers,
-steve

sdake (Member) commented Jul 30, 2020

cc / @mandarjog

sdake self-assigned this Jul 30, 2020
sdake (Member) commented Aug 1, 2020

Basic tooling to test with and without Istio in the path: https://github.com/sdake/knative-perf

sdake (Member) commented Aug 1, 2020

Running the manifest created by the vs tool in #25685 (comment), but with the hub changed to a valid one, the following results are observed:

real    0m13.657s
user    0m2.886s
sys     0m0.451s

System: VM on vSphere 7 with 128 GB RAM and 8 cores, istio-1.6.5

sdake (Member) commented Aug 1, 2020

Running with just vs-crd.yaml applied (not istio.yaml), the timings look much better:

real    0m8.271s
user    0m3.036s
sys     0m0.426s

sdake (Member) commented Aug 1, 2020

(edit - first run of this had an error, re-ran - slightly slower results)

With istio.yaml applied, the validating webhook was manually deleted:

sdake@sdake-dev:~/knative-perf$ kubectl delete ValidatingWebhookConfiguration -n istio-system istiod-istio-system

The results were close to baseline (vs-crd.yaml):

real    0m9.997s
user    0m3.043s
sys     0m0.468s

sdake (Member) commented Aug 1, 2020

@lanceliuu is the use case here that knative creates a large number of virtual services, all at about the same time? Attempting to replicate your benchmark, it feels synthetic, and I am curious whether it represents a real-world use case.

Cheers,
-steve

sdake (Member) commented Aug 1, 2020

ON IKS:

Running with just vs-crd.yaml applied (not istio.yaml), it took nearly two extra minutes to create the VSs in the remote cluster:

real    2m7.649s
user    0m3.756s
sys     0m0.656s

sdake (Member) commented Aug 1, 2020

ON IKS:

Running with istio.yaml applied and then applying vs.yaml from the test repo, the time to create 1000 VSs in the remote cluster:

real    1m33.499s
user    0m3.016s
sys     0m1.062s

ZhuangYuZY commented:
Yes, we create 800 kn services (triggering 800 VS creations). We saw a scalability problem with configuration setup in the gateway: the time continued to increase, from 1 second to more than 1 minute.

sdake (Member) commented Aug 2, 2020

@ZhuangYuZY I am attempting to reproduce your problem, although I have had to write my own tools to do so, since you haven't published yours.

Here is the question:

Are you creating 800 VSs, pausing for 5-10 minutes, and then measuring the creation of one last VS at 1 minute?

Or are you measuring the registration of each VS in the ingress gateway after its creation in the kubeapi?
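
If it is the latter, a hypothetical way to measure it is to apply a VS and poll the gateway's route table until the route appears. Only the polling helper below is concrete; the istioctl invocation in the comment uses placeholder pod and host names:

```shell
# Poll a command once per second until it succeeds; print the number
# of seconds waited, or fail after MAX_TRIES.
wait_for() {
  max=$1; shift
  i=0
  while ! "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$max" ] && return 1
    sleep 1
  done
  echo "$i"
}

# Cluster-side usage (pod name and route host are placeholders):
#   kubectl apply -f new-vs.yaml
#   wait_for 300 sh -c \
#     "istioctl proxy-config route <gateway-pod> -n istio-system | grep -q testabc"
```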

Cheers,
-steve

xliuxu (Author) commented Aug 3, 2020

@sdake
Actually, the overall ready time for a new VS after 800 VSs have been created remains stable at about 50s. We did pause for hours and then created new services. In Knative, a VS being Ready means that an HTTP probe request to the ingress returns a success response code and the expected K-Network-Hash header.
e.g.

curl -v http://hello.default.dev-serving.codeengine.dev.appdomain.cloud  -H "User-Agent: Knative-Ingress-Probe" -H "K-Network-Probe: probe"
*   Trying 130.198.104.44...
* TCP_NODELAY set
* Connected to hello.default.dev-serving.codeengine.dev.appdomain.cloud (130.198.104.44) port 80 (#0)
> GET / HTTP/1.1
> Host: hello.default.dev-serving.codeengine.dev.appdomain.cloud
> Accept: */*
> User-Agent: Knative-Ingress-Probe
> K-Network-Probe: probe
>
< HTTP/1.1 200 OK
< k-network-hash: 12a72f65db15ba3a00ad16b328c40b5398a86cc84ba3239ad37f4d5ef811b0fa
< date: Mon, 03 Aug 2020 02:54:53 GMT
< content-length: 0
< x-envoy-upstream-service-time: 1
< server: istio-envoy
<
* Connection #0 to host hello.default.dev-serving.codeengine.dev.appdomain.cloud left intact
* Closing connection 0

From the Knative logs we can confirm that the failed probes are always caused by a 404 response code. This is not DNS-related, as the probe forces the domain to resolve to the pod IP of the Istio ingress gateway.
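
The same pinned-DNS probe can be reproduced by hand with curl --resolve, which forces a host:port to a given IP. A sketch (not Knative's actual prober; the host and pod IP in the usage line are placeholders):

```shell
# Probe a Knative host through a specific gateway pod IP, bypassing
# DNS, and print the HTTP status code (illustrative sketch only).
probe_url() {
  host=$1; ip=$2; port=${3:-80}
  curl -s -o /dev/null -w '%{http_code}' \
    --resolve "$host:$port:$ip" \
    -H "User-Agent: Knative-Ingress-Probe" \
    -H "K-Network-Probe: probe" \
    "http://$host:$port/"
}

# Usage: probe_url hello.default.dev-serving.example.com <gateway-pod-ip>
```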

IIRC, with 800 ksvcs present, when we create a new VS we quickly see istiod pilot logs like the following:

2020-08-03T02:59:59.849636Z	info	Handle EDS: 0 endpoints for fddfefllfo-hwblr-1-private in namespace default
2020-08-03T02:59:59.877423Z	info	Handle EDS: 3 endpoints for fddfefllfo-hwblr-1 in namespace default
2020-08-03T02:59:59.877473Z	info	ads	Full push, new service fddfefllfo-hwblr-1.default.svc.cluster.local
2020-08-03T02:59:59.877485Z	info	ads	Full push, service accounts changed, fddfefllfo-hwblr-1.default.svc.cluster.local
2020-08-03T02:59:59.977789Z	info	ads	Push debounce stable[547] 3: 100.259888ms since last change, 190.482336ms since last push, full=true

So I think it is not related to the kube API; the informers cache and reflect the changes quickly.
But currently I have no further clues for diagnosing what happens between pilot pushing configs to the ingress and Envoy picking them up.
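
One more place to look is istiod's own push-latency metric. Assuming the pilot_proxy_convergence_time histogram (present in these Istio releases) and the standard monitoring port 15014, a sketch:

```shell
# Filter a Prometheus metrics dump down to pilot's convergence
# histogram, which measures config-change-to-proxy-ack latency.
convergence_summary() {
  grep '^pilot_proxy_convergence_time'
}

# Cluster-side usage (pod name is a placeholder):
#   kubectl -n istio-system port-forward <istiod-pod> 15014:15014 &
#   curl -s localhost:15014/metrics | convergence_summary
```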

There is another issue that might also affect performance. For each service created by Knative, corresponding Kubernetes Services and Endpoints are also created, and Istio creates rules like the following even when the VirtualService does not exist yet. I wonder whether EDS might also be stressing the ingress gateway.

      "name": "outbound_.8022_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.8022_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound_.80_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.80_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound_.80_._.testabc-hhmjj-1.default.svc.cluster.local",
       "service_name": "outbound_.80_._.testabc-hhmjj-1.default.svc.cluster.local"
      "name": "outbound_.9090_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.9090_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound_.9091_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.9091_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|8022||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|8022||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.8022_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|80||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|80||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.80_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local",
       "service_name": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local"
          "sni": "outbound_.80_._.testabc-hhmjj-1.default.svc.cluster.local"
      "name": "outbound|9090||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|9090||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.9090_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|9091||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|9091||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.9091_._.testabc-hhmjj-1-private.default.svc.cluster.local"

sdake (Member) commented Aug 3, 2020

Thank you for the detail. The first problem I spotted is in this VS: gateways may not contain a slash, and validation of / in a gateways: field fails. Are you certain this slash is present in the VS created by knative?

Cheers,
-steve

Below is a typical virtual service created by knative.

[quoted VirtualService manifest omitted; identical to the example in the issue description above]

sdake (Member) commented Aug 3, 2020

Feels environmental. My system is very fast: the 8000 virtual-service routes representing the 1000 services (and their endpoints) are created more quickly than I can run the debug commands.

Please check my performance testing repo and see if you can reproduce in your environment. You will need istioctl-1.7.0a2: the route command was added to istioctl proxy-config in 1.7 and doesn't do much useful work in 1.6.

istio.yaml was created using istioctl-1.6.5 and the IOP you provided above.

Here are my results on a single-node virtual machine with 128 GB of RAM and 8 Xeon cores:

sdake@sdake-dev:~/knative-perf$  kubectl create ns istio-system
sdake@sdake-dev:~/knative-perf$  kubectl apply -f istio.yaml
sdake@sdake-dev:~/knative-perf$  kubectl apply -f vs.yaml
sdake@sdake-dev:~/knative-perf$ ./istioctl proxy-config route istio-ingressgateway-6f9df9b8-6fwdc -n istio-system | wc -l
8004
sdake@sdake-dev:~/knative-perf$ ./istioctl proxy-config endpoints istio-ingressgateway-6f9df9b8-6fwdc -n istio-system | wc -l
2087

The last two commands approximate the curl command above to enumerate the endpoints and virtual-service routes. Note that I noticed one of the ingresses is not used. Also, please double-check that my methodology represents what is happening in knative.

Cheers,
-steve

sdake (Member) commented Aug 4, 2020

I tried this on IKS. IKS performance is not as good as my dedicated gear. I did notice the ALB was in a broken state:

[screenshot, 2020-08-03: IKS ALB pods in a broken state]

sdake (Member) commented Aug 4, 2020

Events:
  Type     Reason     Age                  From                    Message
  ----     ------     ----                 ----                    -------
  Normal   Scheduled  30m                  default-scheduler       Successfully assigned istio-system/istio-ingressgateway-6f9df9b8-bwn4p to 10.221.80.115
  Normal   Pulling    30m                  kubelet, 10.221.80.115  Pulling image "docker.io/istio/proxyv2:1.6.5"
  Normal   Pulled     30m                  kubelet, 10.221.80.115  Successfully pulled image "docker.io/istio/proxyv2:1.6.5"
  Normal   Created    30m                  kubelet, 10.221.80.115  Created container istio-proxy
  Normal   Started    30m                  kubelet, 10.221.80.115  Started container istio-proxy
  Warning  Unhealthy  20m (x172 over 27m)  kubelet, 10.221.80.115  Readiness probe failed: Get http://172.30.33.68:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

xliuxu (Author) commented Aug 4, 2020

I tested the istioctl-1.7.0a2 release and got a different result than before.
After 800 Knative services (which create 800 VSs) are created, when I try to create new Knative services the ready time stays around 10-20 seconds, which is much lower than what we measured on 1.6.5 and prior releases.

[chart: ready time after upgrading to 1.7.0a2]

From the graph we can tell that, despite the peaks, the ready time comes back to a more reasonable value than before.
I also noticed from the probing logs that the failed probes almost always fail with a 503 response. I tried to curl manually and got a response like the one below.

curl http://coligotest-676.coligotest-6.serving-perf-temporary-cd5bb76c946a0f930dd9607ba0ddd3a1-0001.au-syd.containers.appdomain.cloud -v
* Uses proxy env variable http_proxy == 'http://127.0.0.1:8001'
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8001 (#0)
> GET http://coligotest-676.coligotest-6.serving-perf-temporary-cd5bb76c946a0f930dd9607ba0ddd3a1-0001.au-syd.containers.appdomain.cloud/ HTTP/1.1
> Host: coligotest-676.coligotest-6.serving-perf-temporary-cd5bb76c946a0f930dd9607ba0ddd3a1-0001.au-syd.containers.appdomain.cloud
> User-Agent: curl/7.64.1
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 503 Service Unavailable
< Content-Length: 19
< Connection: keep-alive
< Content-Type: text/plain
< Date: Tue, 04 Aug 2020 09:01:10 GMT
< Keep-Alive: timeout=4
< Proxy-Connection: keep-alive
< Server: istio-envoy
<
* Connection #0 to host 127.0.0.1 left intact
no healthy upstream* Closing connection 0

If I retry the curl command, some requests return 200. It seems some gateway pods are ready while others are not. Knative needs all gateway pods to be ready to serve the domain before it marks the service as ready. Currently we have 3 replicas of Istio's public gateway.
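
Since readiness requires every gateway pod to answer, probing each pod IP individually (rather than the Service VIP) shows which replica lags. The kubectl loop in the comment is the cluster-side part (host is a placeholder); the helper mirrors Knative's all-pods-ready rule:

```shell
# Succeed only if every "ip status" line on stdin reports 200,
# mirroring Knative's requirement that all gateway pods be ready.
all_ready() {
  ! grep -qv ' 200$'
}

# Cluster-side usage (host is a placeholder):
#   for ip in $(kubectl -n istio-system get pods -l istio=ingressgateway \
#       -o jsonpath='{.items[*].status.podIP}'); do
#     echo "$ip $(curl -s -o /dev/null -w '%{http_code}' \
#       -H "Host: hello.default.example.com" "http://$ip")"
#   done | all_ready && echo "all gateway pods ready"
```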

zhanggbj commented Aug 4, 2020

Per an offline discussion with @sdake: we are not using the IKS ALB mentioned above, so no traffic goes through the IKS ingress; it goes to the Istio ingress gateway.

During Knative service creation, the ingress_lb_ready duration basically means Knative has applied an Istio VirtualService and then sends an HTTP request like the one below to make sure the Istio config works.

curl -k -v -H "Host: knative-service.domain.com" -H "User-Agent: Knative-Ingress-Probe" -H "K-Network-Probe: probe" http://172.30.49.209:80

# 172.30.49.209 is the Istio ingressgateway IP

sdake (Member) commented Aug 4, 2020

As I understand it, NLB uses keepalived on IKS. As a result, there are still processes running to keep packets moving in a keepalived scenario within the cluster.

My theory is that when you run your 1k processes, it overloads the cluster. You start to see this overload at about 330 processes, and it increases further at around 340 processes.

Kubernetes is designed to run about 30 pods per node. At maximum, with configuration, Kubernetes on a very fast machine such as bare metal can run 100 pods per node. If you have 10-15 nodes, you will see the above results because you have overloaded the K8s platform and the scheduler will begin to struggle. It would be interesting to look at dmesg output during this ramp-up to see whether processes are killed by the OOM killer or other kernel failures are occurring.
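
A quick way to scan for that during the ramp-up (a sketch; run on each node or in a privileged debug pod):

```shell
# Keep only kernel log lines that look like OOM-killer activity.
oom_events() {
  grep -iE 'out of memory|oom[-_ ]?kill'
}

# Cluster-side usage: dmesg -T | oom_events
```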

Cheers,
-steve

xliuxu (Author) commented Aug 5, 2020

We are testing on a cluster with 12 worker nodes, each with the following configuration:

Capacity:
  cpu:                16
  ephemeral-storage:  102821812Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32913324Ki
  pods:               160
Allocatable:
  cpu:                15880m
  ephemeral-storage:  100025058636
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             29117356Ki
  pods:               160

So it should be able to handle 800 Knative services (and 800 pods, if we set minscale=1). The test application is just a basic hello-world Go example. I noticed that even if we set minscale=0, which means the deployment scales to zero pods once it is ready and no traffic goes through, we get a similar result.
It is interesting that Envoy gives 503 errors with the response "no healthy upstream"; when we tested 1.6.5 or prior releases the error was always 404.
I also dumped dmesg for the node hosting the ingress gateway pods and found the following errors on one node. I have no idea whether they are critical, but I cannot find any OOM errors.

[83473.038700] wrk:worker_1[194281]: segfault at 20 ip 000056211e436207 sp 00007effcd7b70b0 error 4 in envoy[56211cdd9000+232a000]
[83475.185571] wrk:worker_7[194453]: segfault at 21 ip 000055fdbb3e2207 sp 00007ff7e32ed0b0 error 4 in envoy[55fdb9d85000+232a000]
[83492.175520] wrk:worker_3[194705]: segfault at 20 ip 000055733073f207 sp 00007f8bfd3e70b0 error 4 in envoy[55732f0e2000+232a000]
[83518.877194] wrk:worker_7[195086]: segfault at 20 ip 000055a729bcd207 sp 00007fce2e67e0b0 error 4 in envoy[55a728570000+232a000]
[83577.918847] wrk:worker_0[195759]: segfault at 30 ip 00005582aec16207 sp 00007f12bd7c20b0 error 4 in envoy[5582ad5b9000+232a000]
[83663.561086] wrk:worker_4[196638]: segfault at 21 ip 0000561e2efe1207 sp 00007f897422d0b0 error 4 in envoy[561e2d984000+232a000]
[83835.123111] wrk:worker_3[198375]: segfault at 30 ip 000055935dcbb207 sp 00007fd87e7b50b0 error 4 in envoy[55935c65e000+232a000]
[84138.453503] wrk:worker_0[201249]: segfault at 21 ip 00005576d5ada207 sp 00007f45668820b0 error 4 in envoy[5576d447d000+232a000]
[84442.424616] wrk:worker_5[204183]: segfault at 20 ip 000056050f968207 sp 00007ff9ab7b10b0 error 4 in envoy[56050e30b000+232a000]

zhanggbj commented Aug 5, 2020

@sdake FYI, here is the benchmark tool we're using; it generates Knative Services at different intervals, as @lanceliuu mentioned above, and reports the ingress_lb_ready duration with a dashboard:
https://github.com/zhanggbj/kperf

xliuxu (Author) commented Aug 7, 2020

@sdake
From recent tests we find that the increasing ready time for Knative services is not caused by the delay in pilot pushing configs to Envoy. After enabling Envoy debug logs, we find that the Knative probes fail with the following error:

2020-08-07T03:59:33.442763Z	debug	envoy http	[C66564] new stream
2020-08-07T03:59:33.442841Z	debug	envoy http	[C66564][S14276412531493771392] request headers complete (end_stream=true):
':authority', 'coligotdestg-50.coligotest-19.serving-perf-temporary-cd5bb76c946a0f930dd9607ba0ddd3a1-0001.au-syd.containers.appdomain.cloud:80'
':path', '/healthz'
':method', 'GET'
'user-agent', 'Knative-Ingress-Probe'
'k-network-probe', 'probe'
'accept-encoding', 'gzip'

2020-08-07T03:59:33.442855Z	debug	envoy http	[C66564][S14276412531493771392] request end stream
2020-08-07T03:59:33.443011Z	debug	envoy router	[C66564][S14276412531493771392] cluster 'outbound|80||coligotdestg-50-w4hcq.coligotest-19.svc.cluster.local' match for URL '/healthz'
2020-08-07T03:59:33.443043Z	debug	envoy upstream	no healthy host for HTTP connection pool
2020-08-07T03:59:33.443100Z	debug	envoy http	[C66564][S14276412531493771392] Sending local reply with details no_healthy_upstream
2020-08-07T03:59:33.443139Z	debug	envoy http	[C66564][S14276412531493771392] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '19'
'content-type', 'text/plain'
'date', 'Fri, 07 Aug 2020 03:59:33 GMT'
'server', 'istio-envoy'

It seems that the host in the VirtualService is recognized and the upstream cluster config is also correct, but Envoy treats the upstream as unhealthy. I can confirm that DNS for the upstream host resolves in the ingress-gateway pods when the error occurs, and no health checkers are specified in the upstream cluster config. The Envoy documentation says:

If no configuration is specified no health checking will be done and all cluster members will be considered healthy at all times.

I wonder why Envoy reports the upstream as unhealthy and why it takes several minutes to recover.
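For what it's worth, "no healthy host" with no active health checking configured usually means Envoy simply has no endpoints for the cluster yet (EDS has not delivered them), rather than a failed health check. A quick way to see what Envoy itself thinks of the endpoints is the sketch below; the pod label and cluster name are taken from the logs above and may need substituting:

```shell
# Grab one ingress gateway pod (label assumed from a default install).
GW_POD=$(kubectl -n istio-system get pod -l app=istio-ingressgateway \
  -o jsonpath='{.items[0].metadata.name}')

# Envoy's view of the endpoints for the failing cluster, including their status.
istioctl proxy-config endpoint "$GW_POD" -n istio-system \
  --cluster "outbound|80||coligotdestg-50-w4hcq.coligotest-19.svc.cluster.local"

# Raw admin view: health_flags (e.g. /failed_eds_health) for the same cluster.
kubectl -n istio-system exec "$GW_POD" -c istio-proxy -- \
  curl -s localhost:15000/clusters | grep coligotdestg-50-w4hcq
```

If the cluster shows no endpoints at all during the failure window, the 503s are an EDS propagation problem rather than a health-checking one.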

@linsun
Member

linsun commented Aug 7, 2020

Can you check out https://karlstoney.com/2019/05/31/istio-503s-ucs-and-tcp-fun-times/index.html to see if there is more you can find from these 503s?

Also, you can turn on access logs (https://istio.io/docs/tasks/telemetry/logs/access-log/); there is a "Response Flags" field that gives more detail about the 503 (like DC, UC, etc.).
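Concretely, on recent Istio versions access logs can be enabled mesh-wide with a single install flag (the `meshConfig.accessLogFile` key assumes 1.6+; earlier releases use a different values path):

```shell
# Emit Envoy access logs to stdout for all proxies; the %RESPONSE_FLAGS% field
# in the default log format explains the 503 (UH = no healthy upstream,
# UF = upstream connection failure, DC = downstream connection terminated, ...).
istioctl install --set meshConfig.accessLogFile=/dev/stdout

# Then tail the gateway to watch the flags on the failing probe requests.
kubectl -n istio-system logs -f deploy/istio-ingressgateway -c istio-proxy
```

A `UH` flag on the probe requests would match the `no_healthy_upstream` local reply seen in the debug logs above.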

@sdake
Member

sdake commented Oct 16, 2020

Hi gang. With PR #27687, Istio performs much better. Running the master branch with the 800-services test case on K8s 1.19:
Screen Shot 2020-10-15 at 7 30 04 PM

I found this PR improves the default case and explains the massive jump at 237+ without flow control. I think flow control is still necessary, although without modifying this value, flow control fails in various ways.

Here is how I set PILOT_XDS_SEND_TIMEOUT:

sdake@scale-cp:~$ cat kn.sh
#!/bin/bash
# Install Istio and knative
./istioctl install -f istio-minimal-operator.yaml --set values.pilot.env.PILOT_XDS_SEND_TIMEOUT=100s --set values.pilot.env.PILOT_ENABLE_FLOW_CONTROL=true --set values.gateways.istio-egressgateway.enabled=false --set hub=docker.io/sdake --set tag=kn0157
kubectl apply --filename https://github.com/knative/serving/releases/download/v0.17.0/serving-crds.yaml
kubectl apply --filename https://github.com/knative/serving/releases/download/v0.17.0/serving-core.yaml
kubectl apply --filename https://github.com/knative/net-istio/releases/download/v0.17.0/release.yaml

I displayed some of the various send metric times. Note that even though the FLOW_CONTROL flag is enabled in the command, it is unimplemented in master, so no flow control is running at present.

Envoy was SIGTERMed (restarted) during the big spike here:

sdake@scale-cp:~$ kubectl logs --previous -n istio-system cluster-local-gateway-7784d9555d-n2xsm
2020-10-16T01:59:10.399686Z     info    FLAG: --concurrency="0"
2020-10-16T01:59:10.399726Z     info    FLAG: --domain="istio-system.svc.cluster.local"
2020-10-16T01:59:10.399733Z     info    FLAG: --help="false"
2020-10-16T01:59:10.399736Z     info    FLAG: --log_as_json="false"
2020-10-16T01:59:10.399770Z     info    FLAG: --log_caller=""
2020-10-16T01:59:10.399774Z     info    FLAG: --log_output_level="default:info"
2020-10-16T01:59:10.399777Z     info    FLAG: --log_rotate=""
2020-10-16T01:59:10.399781Z     info    FLAG: --log_rotate_max_age="30"
2020-10-16T01:59:10.399791Z     info    FLAG: --log_rotate_max_backups="1000"
2020-10-16T01:59:10.399794Z     info    FLAG: --log_rotate_max_size="104857600"
2020-10-16T01:59:10.399798Z     info    FLAG: --log_stacktrace_level="default:none"
2020-10-16T01:59:10.399811Z     info    FLAG: --log_target="[stdout]"
2020-10-16T01:59:10.399816Z     info    FLAG: --meshConfig="./etc/istio/config/mesh"
2020-10-16T01:59:10.399819Z     info    FLAG: --outlierLogPath=""
2020-10-16T01:59:10.399823Z     info    FLAG: --proxyComponentLogLevel="misc:error"
2020-10-16T01:59:10.399826Z     info    FLAG: --proxyLogLevel="warning"
2020-10-16T01:59:10.399830Z     info    FLAG: --serviceCluster="cluster-local-gateway"
2020-10-16T01:59:10.399834Z     info    FLAG: --stsPort="0"
2020-10-16T01:59:10.399837Z     info    FLAG: --templateFile=""
2020-10-16T01:59:10.399841Z     info    FLAG: --tokenManagerPlugin="GoogleTokenExchange"
2020-10-16T01:59:10.399845Z     info    FLAG: --trust-domain="cluster.local"
2020-10-16T01:59:10.399856Z     info    Version 1.8-dev-63463ec22531aa3e23234394407c113ac279ba4a-dirty-Modified
2020-10-16T01:59:10.400027Z     info    Obtained private IP [172.16.25.193 fe80::80b5:a5ff:fe5b:f0b3]
2020-10-16T01:59:10.400128Z     info    Apply mesh config from file defaultConfig:
  discoveryAddress: istiod.istio-system.svc:15012
  proxyMetadata:
    DNS_AGENT: ""
  tracing:
    zipkin:
      address: zipkin.istio-system:9411
disableMixerHttpReports: true
enablePrometheusMerge: true
rootNamespace: istio-system
trustDomain: cluster.local
2020-10-16T01:59:10.401458Z     info    Effective config: binaryPath: /usr/local/bin/envoy
concurrency: 0
configPath: ./etc/istio/proxy
controlPlaneAuthPolicy: MUTUAL_TLS
discoveryAddress: istiod.istio-system.svc:15012
drainDuration: 45s
envoyAccessLogService: {}
envoyMetricsService: {}
parentShutdownDuration: 60s
proxyAdminPort: 15000
proxyMetadata:
  DNS_AGENT: ""
serviceCluster: cluster-local-gateway
statNameLength: 189
statusPort: 15020
terminationDrainDuration: 5s
tracing:
  zipkin:
    address: zipkin.istio-system:9411

2020-10-16T01:59:10.401512Z     info    Proxy role: &model.Proxy{RWMutex:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, Type:"router", IPAddresses:[]string{"172.16.25.193", "fe80::80b5:a5ff:fe5b:f0b3"}, ID:"cluster-local-gateway-7784d9555d-n2xsm.istio-system", Locality:(*envoy_config_core_v3.Locality)(nil), DNSDomain:"istio-system.svc.cluster.local", ConfigNamespace:"", Metadata:(*model.NodeMetadata)(nil), SidecarScope:(*model.SidecarScope)(nil), PrevSidecarScope:(*model.SidecarScope)(nil), MergedGateway:(*model.MergedGateway)(nil), ServiceInstances:[]*model.ServiceInstance(nil), IstioVersion:(*model.IstioVersion)(nil), VerifiedIdentity:(*spiffe.Identity)(nil), ipv6Support:false, ipv4Support:false, GlobalUnicastIP:"", XdsResourceGenerator:model.XdsResourceGenerator(nil), WatchedResources:map[string]*model.WatchedResource(nil)}
2020-10-16T01:59:10.401518Z     info    JWT policy is third-party-jwt
2020-10-16T01:59:10.401595Z     info    PilotSAN []string{"istiod.istio-system.svc"}
2020-10-16T01:59:10.401686Z     info    sa.serverOptions.CAEndpoint == istiod.istio-system.svc:15012 Citadel
2020-10-16T01:59:10.401805Z     info    Using CA istiod.istio-system.svc:15012 cert with certs: var/run/secrets/istio/root-cert.pem
2020-10-16T01:59:10.401933Z     info    citadelclient   Citadel client using custom root: istiod.istio-system.svc:15012 -----BEGIN CERTIFICATE-----
MIIC/DCCAeSgAwIBAgIQHdfSaLljKjP7xHiVYTUmyTANBgkqhkiG9w0BAQsFADAY
MRYwFAYDVQQKEw1jbHVzdGVyLmxvY2FsMB4XDTIwMTAxNjAxNTg1N1oXDTMwMTAx
NDAxNTg1N1owGDEWMBQGA1UEChMNY2x1c3Rlci5sb2NhbDCCASIwDQYJKoZIhvcN
AQEBBQADggEPADCCAQoCggEBAKF+rryWpgQabVS8vc6roKsegGwDt2fInsMmy4u/
tgkZw2IsQGfgi4R/7hy+8rSRu8n32j2gTYU9cSYFcU3mrtqx+cZylcgxCaa63Kxh
k77moW8qVwXa/R7CO7VFegOLguX4m8e5B7b0mHw0pPqDNI158ChcoEjpOvZxqAxT
hHtaDFq9B+DPY38u0zr3jEjFTEMw8HASd9vxdEKrRDJjj2aEiMK9vaQ4t7xw6pk0
+TiWqzr22TIR90L383OCvTSAxquW8EkCmBV5g3E/Onxgx4nyj1WWnFDUEzm35f25
oe65uYjBobTf2qThMoxz6Z1e1UnSYsF9DvETCyD+RAcg59cCAwEAAaNCMEAwDgYD
VR0PAQH/BAQDAgIEMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYEFMKwGsGtjksA
uZey1x+qkE1uF6C1MA0GCSqGSIb3DQEBCwUAA4IBAQA5pcZBDK5R6cWbN4YgVi0I
/mcbtdxibsvDe4jXE2LR5/2r39KSb8tf11xLRuA0fmbHS88L+OJt0h8kR8qIqHtC
08O0ZpjFb/UGIjgnRtTak7sM0Ar85webroP+GtUOYysythgje1jqX8xp4zIRrRvX
KdbwXeaVTgtB2lhHY2s3D/X+369scJCR0daMRbGoYhKcThN05VhoOf9BuL4lUiTl
qMl2ip27uTzn5FuT20F2/syJNu6MVHuhWfr/ucahCrK5R4anUzYLCNcgziY4lgAR
uzgzwOQij/UkEv8SjMPVUFE/6LKftb2QWuAVswSGEJtIqocEoIjBVolovgJSMeW8
-----END CERTIFICATE-----

2020-10-16T01:59:10.437110Z     info    Starting gateway SDS
2020-10-16T01:59:10.540289Z     info    sds     SDS gRPC server for workload UDS starts, listening on "./etc/istio/proxy/SDS" 

2020-10-16T01:59:10.540428Z     info    sds     SDS gRPC server for gateway controller starts, listening on "./var/run/ingress_gateway/sds" 

2020-10-16T01:59:10.540455Z     info    xdsproxy        Initializing with upstream address istiod.istio-system.svc:15012 and cluster Kubernetes
2020-10-16T01:59:10.540327Z     info    sds     Start SDS grpc server
2020-10-16T01:59:10.540494Z     info    sds     Start SDS grpc server for ingress gateway proxy
2020-10-16T01:59:10.540737Z     info    xdsproxy        adding watcher for certificate var/run/secrets/istio/root-cert.pem
2020-10-16T01:59:10.540896Z     info    Starting proxy agent
2020-10-16T01:59:10.540959Z     info    Opening status port 15020

2020-10-16T01:59:10.540982Z     info    Received new config, creating new Envoy epoch 0
2020-10-16T01:59:10.541006Z     info    Epoch 0 starting
2020-10-16T01:59:10.612165Z     info    Envoy command: [-c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster cluster-local-gateway --service-node router~172.16.25.193~cluster-local-gateway-7784d9555d-n2xsm.istio-system~istio-system.svc.cluster.local --local-address-ip-version v4 --bootstrap-version 3 --log-format-prefix-with-location 0 --log-format %Y-%m-%dT%T.%fZ      %l      envoy %n        %v -l warning --component-log-level misc:error]
2020-10-16T01:59:10.704247Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T01:59:10.762180Z     warning envoy main      there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
2020-10-16T01:59:11.930606Z     info    Envoy proxy is ready
2020-10-16T02:01:32.662217Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 0, 
2020-10-16T02:01:32.752377Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:01:40.453017Z     info    sds     resource:ROOTCA new connection
2020-10-16T02:01:40.453208Z     info    sds     Skipping waiting for gateway secret
2020-10-16T02:01:40.453539Z     info    sds     resource:default new connection
2020-10-16T02:01:40.453710Z     info    sds     Skipping waiting for gateway secret
2020-10-16T02:01:40.761638Z     info    cache   Root cert has changed, start rotating root cert for SDS clients
2020-10-16T02:01:40.761678Z     info    cache   GenerateSecret default
2020-10-16T02:01:40.762136Z     info    sds     resource:default pushed key/cert pair to proxy
2020-10-16T02:01:41.053775Z     info    cache   Loaded root cert from certificate ROOTCA
2020-10-16T02:01:41.054160Z     info    sds     resource:ROOTCA pushed root cert to proxy
2020-10-16T02:05:50.337840Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:05:50.338244Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:05:50.537126Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:05:50.551713Z     warn    xdsproxy        upstream terminated with unexpected error rpc error: code = Unknown desc = missing node ID
2020-10-16T02:05:50.552076Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 0, 
2020-10-16T02:05:51.339141Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:05:51.354765Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 0, 
2020-10-16T02:05:51.354302Z     warn    xdsproxy        upstream terminated with unexpected error rpc error: code = Unknown desc = missing node ID
2020-10-16T02:05:51.755568Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:05:51.771284Z     warn    xdsproxy        upstream terminated with unexpected error rpc error: code = Unknown desc = missing node ID
2020-10-16T02:05:51.773622Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 0, 
2020-10-16T02:05:54.637499Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:06:13.117863Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:06:13.118289Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:06:13.142508Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:06:23.907216Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:06:23.907650Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:06:24.309467Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:06:33.504522Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:06:33.505244Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:06:33.805966Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:06:33.821603Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:06:33.822146Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:06:33.960974Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:06:48.876198Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:06:48.876862Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:06:49.261012Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:07:01.977933Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:07:01.978379Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:07:02.264945Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:07:02.284377Z     warn    xdsproxy        upstream terminated with unexpected error rpc error: code = Unknown desc = missing node ID
2020-10-16T02:07:02.284887Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 0, 
2020-10-16T02:07:02.557491Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:07:12.701594Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:07:12.702022Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:07:12.781184Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:07:12.793431Z     warn    xdsproxy        upstream terminated with unexpected error rpc error: code = Unknown desc = missing node ID
2020-10-16T02:07:12.793853Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 0, 
2020-10-16T02:07:13.633206Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:07:21.360882Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:07:21.361257Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:07:21.463948Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:07:21.481547Z     warn    xdsproxy        upstream terminated with unexpected error rpc error: code = Unknown desc = missing node ID
2020-10-16T02:07:21.481944Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 0, 
2020-10-16T02:07:22.130888Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:07:31.700642Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:07:31.700961Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:07:32.195510Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:07:32.206918Z     warn    xdsproxy        upstream terminated with unexpected error rpc error: code = Unknown desc = missing node ID
2020-10-16T02:07:32.207349Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 0, 
2020-10-16T02:07:32.752441Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:07:42.104987Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:07:42.105887Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:07:42.450481Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:07:53.096206Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:07:53.096929Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:07:53.142537Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:02.628500Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:08:02.628996Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:08:02.796967Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:02.808092Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:08:02.808398Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:08:03.585190Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:10.468927Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:08:10.469354Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:08:10.737844Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:23.498743Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:08:23.499086Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:08:23.634624Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:23.645854Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:08:23.646124Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:08:24.457364Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:32.011835Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:08:32.012211Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:08:32.328237Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:32.347338Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.route.v3.RouteConfiguration: EOF
2020-10-16T02:08:32.347873Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:08:32.532117Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:43.348588Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:08:43.349840Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:08:43.767588Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:43.783923Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:08:43.793767Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:08:44.646502Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:54.518489Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:08:54.518956Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:08:54.861155Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:54.876616Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.route.v3.RouteConfiguration: EOF
2020-10-16T02:08:54.876913Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:08:55.426009Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:08:55.438543Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.route.v3.RouteConfiguration: EOF
2020-10-16T02:08:55.438969Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:08:55.749592Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:09:07.077497Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:09:07.077845Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:09:07.452948Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:09:16.737625Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:09:16.738134Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:09:17.226769Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:09:28.013763Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:09:28.014242Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:09:28.518022Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:09:28.539090Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:09:28.538501Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:09:28.665306Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:09:38.342405Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:09:38.343334Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:09:38.452084Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:09:47.903154Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:09:47.903629Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:09:48.270553Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:09:55.191934Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:09:55.192325Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:09:55.417107Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:10:06.580845Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:10:06.588265Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:10:06.721196Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:10:06.735185Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:10:06.743370Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:10:06.753241Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:10:06.765139Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:10:06.765976Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:10:07.825729Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:10:07.838217Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:10:07.838659Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:10:08.093584Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:10:20.725233Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:10:20.726134Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:10:21.210759Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:10:32.535086Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:10:32.535685Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:10:32.702306Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:10:42.743780Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:10:42.747799Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:10:42.871380Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:10:44.883015Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:10:44.883439Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:10:44.918183Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:10:55.504787Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:10:55.505319Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:10:55.980198Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:10:56.003653Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:10:56.004313Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:10:56.705362Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:10:56.718653Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:10:56.718952Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:10:58.509699Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:11:08.302624Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:11:08.309887Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:11:08.685218Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:11:22.053007Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:11:22.053573Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:11:22.065630Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:11:22.085065Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:11:22.085492Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:11:22.888449Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:11:22.903391Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:11:22.903723Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:11:23.922501Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:11:23.944882Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:11:23.945110Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:11:25.938535Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:11:34.856782Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment: context deadline exceeded
2020-10-16T02:11:34.857508Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:11:35.268599Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:11:35.284864Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:11:35.285411Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:11:36.167629Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:11:36.184562Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:11:36.185092Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:11:37.444697Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:11:37.460650Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:11:37.461317Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:11:38.378750Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:11:49.352330Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:11:49.352691Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:11:49.693715Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:11:49.716645Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:11:49.717241Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:11:50.134543Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:12:01.908816Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:12:01.909203Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:12:01.947033Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:12:11.975901Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: context deadline exceeded
2020-10-16T02:12:11.976231Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2020-10-16T02:12:12.348237Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:12:12.384905Z     error   xdsproxy        upstream send error for type url type.googleapis.com/envoy.config.listener.v3.Listener: EOF
2020-10-16T02:12:12.385237Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, EOF
2020-10-16T02:12:13.231978Z     info    xdsproxy        connecting to istiod.istio-system.svc:15012
2020-10-16T02:12:13.256439Z     warning envoy main      caught SIGTERM

The SIGTERM was caused by the OOMkiller:

    State:          Running
      Started:      Fri, 16 Oct 2020 02:12:13 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 16 Oct 2020 01:59:10 +0000
      Finished:     Fri, 16 Oct 2020 02:12:13 +0000

Hilarious exit code - actually my license plate...

During the test I displayed the push time in milliseconds:

sdake@scale-cp:~$ kubectl logs -n istio-system istiod-66b44bd5cb-m8zxz | grep duration | cut -d: -f 4 | sort -bg | tail -20
 5307
 5310
 5344
 5365
 5370
 5386
 5390
 5456
 5481
 5492
 5544
 5569
 5576
 5627
 5769
 5855
 5857
 5870
 6065
 6227

@sdake sdake closed this as completed Oct 16, 2020
Prioritization automation moved this from P0 to Done Oct 16, 2020
@sdake
Copy link
Member

sdake commented Oct 16, 2020

clicked wrong button ^^

@sdake sdake reopened this Oct 16, 2020
@ramaraochavali
Copy link
Contributor

I am not sure how the send timeout helps here: are we saying the config is sometimes lost or dropped because of low timeouts?

@sdake
Copy link
Member

sdake commented Oct 16, 2020

When the connection is dropped because of the 5-second send timeout, the system falls over. Under heavy service churn, the proxy enters a severely overloaded state where it can take 200-800 seconds to recover - see: #25685 (comment).

Yes, that is what I am saying.

sdake pushed a commit to sdake/istio that referenced this issue Oct 22, 2020
Fixes: istio#25685

Istio suffers from a problem at large scale (800+ services sequentially
created) with significant churn: Envoy becomes overloaded and
produces a zigzag pattern in ACKing results. This change slows Istio's
pushes down by using a semaphore to signal when a receive has occurred,
waiting for the semaphore prior to new pushes.

Co-Authored-By: John Howard <howardjohn@google.com>
@sdake
Copy link
Member

sdake commented Oct 22, 2020

These results are better, although ADS continues to disconnect and Envoy OOMs: #28192. I do feel this is the first attempt at a PR that manages the numerous constraints of the protocol implementation.
[Screenshot: Screen Shot 2020-10-22 at 8 38 11 AM]

@howardjohn
Copy link
Member

Got a profile of Envoy during high XDS pushes:
envoy.prof.gz

* 20% of time in MessageUtil::validate
* 20% of time in MessageUtil::hash
* Another 15% in RdsRouteConfigProviderImpl::validateConfig. It looks like this is doing 2x the work as part of RdsRouteConfigSubscription::onConfigUpdate.

howardjohn added a commit to howardjohn/istio that referenced this issue Oct 27, 2020
Fixes: istio#25685

At large scale, Envoy suffers from overload of XDS pushes, and there is
no backpressure in the system. Other control planes, such as any based
on go-control-plane, outperform Istio in config update propagation
under load as a result.

This change adds a backpressure mechanism to ensure we do not push more
configs than Envoy can handle. Despite slowing down the pushes, the
propagation time of new configurations actually decreases. We do this by
noting, but not sending, any push request whose TypeUrl has
an un-ACKed request in flight. When we get an ACK, if there is a pending
push request we immediately trigger it. This effectively means that
in a high-churn environment, each proxy will always have exactly one
outstanding push per type, and when the ACK is received we will
immediately send a new update.

This PR is co-authored by Steve, who did a huge amount of work in
developing this into the state it is today, as well as finding and
testing the problem. See istio#27563 for
much of this work.

Co-Authored-By: Steven Dake sdake@ibm.com
istio-testing pushed a commit that referenced this issue Oct 27, 2020
* Wait until ACK before sending additional pushes

Fixes: #25685

At large scale, Envoy suffers from overload of XDS pushes, and there is
no backpressure in the system. Other control planes, such as any based
on go-control-plane, outperform Istio in config update propagation
under load as a result.

This change adds a backpressure mechanism to ensure we do not push more
configs than Envoy can handle. Despite slowing down the pushes, the
propagation time of new configurations actually decreases. We do this by
noting, but not sending, any push request whose TypeUrl has
an un-ACKed request in flight. When we get an ACK, if there is a pending
push request we immediately trigger it. This effectively means that
in a high-churn environment, each proxy will always have exactly one
outstanding push per type, and when the ACK is received we will
immediately send a new update.

This PR is co-authored by Steve, who did a huge amount of work in
developing this into the state it is today, as well as finding and
testing the problem. See #27563 for
much of this work.

Co-Authored-By: Steven Dake sdake@ibm.com

* Refactor and cleanup tests

* Add test
@ZhuangYuZY
Copy link

@howardjohn It is great to have a PR to fix the issue. What is the target release for this PR? Thank you.

gargnupur added a commit to gargnupur/istio that referenced this issue Nov 13, 2020
Signed-off-by: gargnupur <gargnupur@google.com>

fix

Signed-off-by: gargnupur <gargnupur@google.com>

add yaml

Signed-off-by: gargnupur <gargnupur@google.com>

fix comments

Signed-off-by: gargnupur <gargnupur@google.com>

fix comments

Signed-off-by: gargnupur <gargnupur@google.com>

fix build

Signed-off-by: gargnupur <gargnupur@google.com>

fix build

Signed-off-by: gargnupur <gargnupur@google.com>

Delete reference to old ISTIO_META_PROXY_XDS_VIA_AGENT (istio#28203)

* update expose istiod

* add https sample

* fix tab

* update host + domain

* fix lint

* fix lint

* tweak host

* fix lint

* use tls port

* name port correctly

* change default to tls

* Update samples/multicluster/expose-istiod.yaml

Co-authored-by: Iris <irisding@apache.org>

* Update samples/multicluster/expose-istiod.yaml

Co-authored-by: Iris <irisding@apache.org>

* Revert "Update samples/multicluster/expose-istiod.yaml"

This reverts commit 7feb468.

* Revert "Update samples/multicluster/expose-istiod.yaml"

This reverts commit 98209a0.

* use istiod-remote since pilot is still enabled on remote cluster

* loose up on host name

* adding notes

* clean up this in preview profile

Co-authored-by: Iris <irisding@apache.org>

Avoid telemetry cluster metadata override (istio#28171)

* fix cluster metadata override

* test

* fix

* fix

* fix again

* clean

add telemetry test for customize metrics (istio#27844)

* add test for customize metrics

* address comments

* add remove tag check

* fix test

Delete istiod pods on cleanup (istio#28205)

Otherwise they stay around and can cause other tests to fail.

In a concrete example, deployment "istiod-canary" stays live
and interferes in pilot's TestMultiRevision test, which also
deploys a "istiod-canary", but, since a deployment with that
name already exists, operator doesn't redeploy it, because it's
already there.

Fix HTTPs on HTTP port passthrough (istio#28166)

* Fix HTTPs on HTTP port passthrough

* Add note

remove 1.7 telemetry filters from charts (istio#28195)

use correct env var name (istio#28217)

Align Ingress resource status updates with Ingresses targeted in controller (istio#28225)

make istiod-remote depend on base when installation (istio#28219)

Add remoteIpBlocks functionality to AuthorizationPolicy (istio#27906)

* create remoteIpBlocks and update ipBlocks for AuthorizationPolicy

 By adding remoteIpBlocks and notRemoteIpBlocks in Source,
 an AuthorizationPolicy can trigger actions based on the original
 client IP address gleaned from the X-Forwarded-For header or the
 proxy protocol. The ipBlocks and notIpBlocks fields have also been
 updated to use direct_remote_ip in Envoy instead of source_ip

* use correct attribute for RemoteIpBlocks

* fix unit tests and add integration tests for remote.ip attribute

* fix notRemoteIp integration test

* initialize headers if it is nil

* Combine remoteIp tests into IngressGateway test and add release note

* add titles to links

* remove unneeded tests

* fix quotes in releasenote, run make gen

* make upgradeNotes a list

Remove deprecated istio-coredns plugin (istio#28179)

make stackdriver test platform agnostic (istio#28237)

* make stackdriver test platform agnostic

* fix

* clean up

Add Wasm Extension Dashboard (istio#28209)

* Add WASM Extension Dashboard

* update dashboard

* update dashboard and add cpu/mem

* address review comment

* add excluded

* remove extension dashboard from test allowlist.txt

* update readme

Clean up metadata exchange keys (istio#28249)

* clean up

* cleanup exchange key

Remove unnecessary warning log from ingress status watcher (istio#28254)

vm health checking (istio#28142)

* impl with pilot

* Remove redundant import

* Remove redundant return

* address some concerns

* address more concerns

* Add tests

* fix ci?

* fix ci?

Automator: update proxy@master in istio/istio@master (istio#27786)

pilot: GlobalUnicastIP of a model.Proxy should be set to the 1st applicable IP address in the list (istio#28260)

* pilot: GlobalUnicastIP of a model.Proxy should be set to the 1st applicable IP address in the list

Signed-off-by: Yaroslav Skopets <yaroslav@tetrate.io>

* docs: add release notes

Signed-off-by: Yaroslav Skopets <yaroslav@tetrate.io>

Adjust Wasm VMs charts order and Add release note (istio#28251)

* Adjust Wasm VMs charts order

* add release note

* replace wasm extension dashboard with real ID

Issue istio#27606: Minor Bug fixes, mostly renaming (istio#28156)

Cleanup ADS tests (istio#28275)

* Cleanup ADS tests

* fix lint

* fix lint

Temporarily skip ratelimit tests (istio#28286)

To help master/1.8 get to a merge-able state

Add warning for legacy FQDN gateway reference (istio#27948)

* Add warning for legacy FQDN gateway reference

* fix lint

* Add more warnings

Fixes for trust domain configuration (istio#28127)

* Fixes for trust domain configuration

We want to ensure we take values.global.trustDomain as default, fallback
to meshConfig.trustDomain, and ensure this is passed through to all
parts of the code.

This fixes the breakage in istio/istio.io#8301

* fix lint

Status improvements (istio#28136)

* Consolidate ledger and status implementation

* Add ownerReference for garbage collection

* write observedGeneration for status

* cleanup rebase errors

* remove garbage from pr

* fix test failures

* Fix receiver linting

* fix broken unit tests

* fix init for route test

* Fix test failures

* add missing ledger to test

* Add release notes

* Reorganize status controller start

* fix race

* separate init and start funcs

* add newline

* remove test sprawl

* reset retention

Add size to ADS push log (istio#28262)

Add README.md for vendor optimized profiles (istio#28155)

* Add README.profiles for vendor optimized profiles

* Another attempt at the table

Fix operator revision handling (istio#28044)

* Fix operator revision handling

* Add revision to installation CR

* Add revision to each resource label

* Update label handling

* Add deployment spec template labels, clean up logging

* Fix test

* Update integration test

* Make gen

* Fix test

* Testing

* Fix tests

Futureproof telemetry envoyfilters a bit (istio#28176)

remove the install comment (istio#28243)

* remove the install comment

* Revert "remove the install comment"

This reverts commit 60bc649.

* Update gen-eastwest-gateway.sh

pilot: skip privileged ports when building listeners for non-root gateways (istio#28268)

* pilot: skip privileged ports when building listeners for non-root gateways

* Add release note

* Use ISTIO_META_UNPRIVILEGED_POD env var instead of a Pod label

Automator: update proxy@master in istio/istio@master (istio#28281)

istioctl bug-report: do not override system namespaces if --exclude flag is provided (istio#27989)

Add ingress status integration test (istio#28263)

clean up: extension configs (istio#28277)

* clean up extension configs

Signed-off-by: Kuat Yessenov <kuat@google.com>

* make gen

Signed-off-by: Kuat Yessenov <kuat@google.com>

Show empty routes in pc routes (istio#28170)

```
NOTE: This output only contains routes loaded via RDS.
NAME                                                                                                     DOMAINS                      MATCH                  VIRTUAL SERVICE
https.443.https-443-ingress-service1-default-0.service1-istio-autogenerated-k8s-ingress.istio-system     *                            /*                     404
https.443.https-443-ingress-service2-default-0.service2-istio-autogenerated-k8s-ingress.istio-system     *                            /*                     404
http.80                                                                                                  service1.demo.........io     /*                     service1-demo-......-io-service1-istio-autogenerated-k8s-ingress.default
http.80                                                                                                  service2.demo.........io     /*                     service2-demo-.....i-io-service2-istio-autogenerated-k8s-ingress.default
                                                                                                         *                            /stats/prometheus*
                                                                                                         *                            /healthz/ready*
```

The first 2 lines would not show up without this PR

Add warnings for unknown fields in EnvoyFilter (istio#28227)

Fixes istio#26390

Update rather than patch webhook configuration (istio#28228)

* Update rather than patch webhook configuration

This is a far more flexible pattern, allowing us to have multiple
webhooks and patch them successful. This pattern follows what the
cert-manager does in their webhook patcher (see
pkg/controller/cainjector), which I consider to be top quality code.

* update rbac

Improve error when users use removed addon (istio#28241)

* Improve error when users use removed addon

After and before:
```
$ grun ./istioctl/cmd/istioctl manifest generate --set addonComponents.foo.enabled=true -d manifests
Error: component "foo" does not exist
$ ik manifest generate --set addonComponents.foo.enabled=true -d manifests
Error: stat manifests/charts/addons/foo: no such file or directory
```

* Fix test

When installing istio-cni remove existing istio-cni plugin before inserting a new one (istio#28258)

* Remove istio-cni plugin before inserting a new one

* docs: add release notes

Automator: update common-files@master in istio/istio@master (istio#28278)

Make ingress gateway selector in status watcher match the one used to generate gateway (istio#28279)

* Check for empty ingress service value when converting ingress to gateway

* Pull ingress gateway selector logic into own func

* Use same ingress gateway selector logic for status watcher as when generating gateways

* Fix status watcher test

Remove time.Sleep hacks for fast tests/non-flaky  (istio#27741)

* Remove time.Sleep hacks for fast tests

* fix flake

Add grafana templating query for DS_PROMETHEUS and add missing datasource (istio#28320)

* Add grafana templating query for DS_PROMETHEUS and add missing datasource

* make extension dashboard viewonly

skip ingress test in multicluster (istio#28321)

E2E test for trust domain alias client side secure naming. (istio#28206)

* trust domain alias secure naming e2e test

* add dynamic certs and test options

* move under ca_custom_root test folder

* trust domain alias secure naming e2e test

* add dynamic certs and test options

* move under ca_custom_root test folder

* fix host to address

* update script

* refactor based on comments

* updated comments

* add build constraints

* lint fix

* fixes based on comments

Samples: use more common images and delete useless samples (istio#28215)

Signed-off-by: Xiang Dai <long0dai@foxmail.com>

Wait until ACK before sending additional pushes (istio#28261)

* Wait until ACK before sending additional pushes

Fixes: istio#25685

At large scale, Envoy suffers from overload of XDS pushes, and there is
no backpressure in the system. Other control planes, such as any based
on go-control-plane, outperform Istio in config update propagation
under load as a result.

This change adds a backpressure mechanism to ensure we do not push more
configs than Envoy can handle. Despite slowing down the pushes, the
propagation time of new configurations actually decreases. We do this by
noting, but not sending, any push request whose TypeUrl has
an un-ACKed request in flight. When we get an ACK, if there is a pending
push request we immediately trigger it. This effectively means that
in a high-churn environment, each proxy will always have exactly one
outstanding push per type, and when the ACK is received we will
immediately send a new update.

This PR is co-authored by Steve, who did a huge amount of work in
developing this into the state it is today, as well as finding and
testing the problem. See istio#27563 for
much of this work.

Co-Authored-By: Steven Dake sdake@ibm.com

* Refactor and cleanup tests

* Add test

istioctl: fix failure when passing flags to `go test` (istio#28332)

add xds proxy metrics (istio#28267)

* add xds proxy metrics

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* lint

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* fix description

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

remove-from-mesh: skip system namespace to remove sidecar (istio#28187)

* remove-from-mesh: skip system namespace to remove sidecar

* check for -i

Remove accidentally merged debug logs (istio#28331)

update warning message for upgrading istio version (istio#28303)

* update warning message for upgrading istio version

* add use before istioctl analyze

Update README.md (istio#28272)

* Update README.md

* Update README.md

Xds proxy improve (istio#28307)

* Prevent goroutine leak

* Accelerate by splitting upstream request and response handling

* fix lint

fix manifestpath for verify install. (istio#28345)

Fix ADSC race (istio#28342)

* Fix ADSC race

* fix

* fix ut

* Update pkg/istio-agent/local_xds_generator.go

Co-authored-by: Shamsher Ansari <shaansar@redhat.com>

* Update pkg/adsc/adsc_test.go

Co-authored-by: Shamsher Ansari <shaansar@redhat.com>

* Update pilot/pkg/xds/lds_test.go

Co-authored-by: Shamsher Ansari <shaansar@redhat.com>

* Apply shamsher's suggestions from code review

Co-authored-by: Shamsher Ansari <shaansar@redhat.com>

Co-authored-by: Shamsher Ansari <shaansar@redhat.com>

List Istio injectors (istio#27849)

* Refactor and rename command

* Code nits

* Fix typo

* Print message if no namespaces have injection

* Code nits

* Case where an injected namespace does not yet have pods

* Code cleanup

kube-inject: hide namespace flag in favour of istioNamespace (istio#28067)

Automator: update proxy@master in istio/istio@master (istio#28355)

Fix test race in FilterGatewayClusterConf (istio#28330)

Example failure
https://prow.istio.io/view/gs/istio-prow/logs/unit-tests_istio_postsubmit/4087

In generally, we need a better way to mutate feature flags in tests.
Maybe a conditionally compiled mutex. Will open an issue to track

Networking: Add scaffold of tunnel EDS builder  (istio#28244)

* add endpoint tunnel supportablity field

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* edsbtswip

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* fix import

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* make gen

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* EndpointsByNetworkFilter refactor

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* fix pkg/pilot/ tests

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* ep builder decide build out tunnel type

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* add basic tunnel eds test

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* make gen

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* fix proxy metadata access

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* make gen

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* fix endpoint COW, h2support bitfield bug

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* make gen without fmt

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* add errgo

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* address comment

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* fmt

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* make gen

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

cni:fix order dependent test failures (istio#28349)

The `interceptRuleMgrType` declared in main.go as "iptables" would
be changed to "mock" in func resetGlobalTestVariables().

When running a single test, it would be "iptables" and make the test fail.

Signed-off-by: Xiang Dai <long0dai@foxmail.com>

fix uninstall test (istio#28335)

* fix uninstall test

* revert prow change

* address comment

* add logic to deduplicate

Adding Route Specific RateLimiting Test

Signed-off-by: gargnupur <gargnupur@google.com>

remove debug info

Signed-off-by: gargnupur <gargnupur@google.com>

fix failure

Signed-off-by: gargnupur <gargnupur@google.com>
istio-testing pushed a commit that referenced this issue Nov 16, 2020
Signed-off-by: gargnupur <gargnupur@google.com>

fix

Signed-off-by: gargnupur <gargnupur@google.com>

add yaml

Signed-off-by: gargnupur <gargnupur@google.com>

fix comments

Signed-off-by: gargnupur <gargnupur@google.com>

fix comments

Signed-off-by: gargnupur <gargnupur@google.com>

fix build

Signed-off-by: gargnupur <gargnupur@google.com>

fix build

Signed-off-by: gargnupur <gargnupur@google.com>

Delete reference to old ISTIO_META_PROXY_XDS_VIA_AGENT (#28203)

* update expose istiod

* add https sample

* fix tab

* update host + domain

* fix lint

* fix lint

* tweak host

* fix lint

* use tls port

* name port correctly

* change default to tls

* Update samples/multicluster/expose-istiod.yaml

Co-authored-by: Iris <irisding@apache.org>

* Update samples/multicluster/expose-istiod.yaml

Co-authored-by: Iris <irisding@apache.org>

* Revert "Update samples/multicluster/expose-istiod.yaml"

This reverts commit 7feb468.

* Revert "Update samples/multicluster/expose-istiod.yaml"

This reverts commit 98209a0.

* use istiod-remote since pilot is still enabled on remote cluster

* loose up on host name

* adding notes

* clean up this in preview profile

Co-authored-by: Iris <irisding@apache.org>

Avoid telemetry cluster metadata override (#28171)

* fix cluster metadata override

* test

* fix

* fix

* fix again

* clean

add telemetry test for customize metrics (#27844)

* add test for customize metrics

* address comments

* add remove tag check

* fix test

Delete istiod pods on cleanup (#28205)

Otherwise they stay around and can cause other tests to fail.

In a concrete example, deployment "istiod-canary" stays live
and interferes in pilot's TestMultiRevision test, which also
deploys a "istiod-canary", but, since a deployment with that
name already exists, operator doesn't redeploy it, because it's
already there.

Fix HTTPs on HTTP port passthrough (#28166)

* Fix HTTPs on HTTP port passthrough

* Add note

remove 1.7 telemetry filters from charts (#28195)

use correct env var name (#28217)

Align Ingress resource status updates with Ingresses targeted in controller (#28225)

make istiod-remote depend on base when installation (#28219)

Add remoteIpBlocks functionality to AuthorizationPolicy (#27906)

* create remoteIpBlocks and update ipBlocks for AuthorizationPolicy

 By adding remoteIpBlocks and notRemoteIpBlocks in Source,
 an AuthorizationPolicy can trigger actions based on the original
 client IP address gleaned from the X-Forwarded-For header or the
 proxy protocol. The ipBlocks and notIpBlocks fields have also been
 updated to use direct_remote_ip in Envoy instead of source_ip

* use correct attribute for RemoteIpBlocks

* fix unit tests and add integration tests for remote.ip attribute

* fix notRemoteIp integration test

* initialize headers if it is nil

* Combine remoteIp tests into IngressGateway test and add release note

* add titles to links

* remove unneeded tests

* fix quotes in releasenote, run make gen

* make upgradeNotes a list

Remove deprecated istio-coredns plugin (#28179)

make stackdriver test platform agnostic (#28237)

* make stackdriver test platform agnostic

* fix

* clean up

Add Wasm Extension Dashboard (#28209)

* Add WASM Extension Dashboard

* update dashboard

* update dashboard and add cpu/mem

* address review comment

* add excluded

* remove extension dashboard from test allowlist.txt

* update readme

Clean up metadata exchange keys (#28249)

* clean up

* cleanup exchange key

Remove unnecessary warning log from ingress status watcher (#28254)

vm health checking (#28142)

* impl with pilot

* Remove redundant import

* Remove redundant return

* address some concerns

* address more concerns

* Add tests

* fix ci?

* fix ci?

Automator: update proxy@master in istio/istio@master (#27786)

pilot: GlobalUnicastIP of a model.Proxy should be set to the 1st applicable IP address in the list (#28260)

* pilot: GlobalUnicastIP of a model.Proxy should be set to the 1st applicable IP address in the list

Signed-off-by: Yaroslav Skopets <yaroslav@tetrate.io>

* docs: add release notes

Signed-off-by: Yaroslav Skopets <yaroslav@tetrate.io>

Adjust Wasm VMs charts order and Add release note (#28251)

* Adjust Wasm VMs charts order

* add release note

* replace wasm extension dashboard with real ID

Issue #27606: Minor Bug fixes, mostly renaming (#28156)

Cleanup ADS tests (#28275)

* Cleanup ADS tests

* fix lint

* fix lint

Temporarily skip ratelimit tests (#28286)

To help master/1.8 get to a merge-able state

Add warning for legacy FQDN gateway reference (#27948)

* Add warning for legacy FQDN gateway reference

* fix lint

* Add more warnings

Fixes for trust domain configuration (#28127)

* Fixes for trust domain configuration

We want to ensure we take values.global.trustDomain as default, fallback
to meshConfig.trustDomain, and ensure this is passed through to all
parts of the code.

This fixes the breakage in istio/istio.io#8301

* fix lint

Status improvements (#28136)

* Consolidate ledger and status implementation

* Add ownerReference for garbage collection

* write observedGeneration for status

* cleanup rebase errors

* remove garbage from pr

* fix test failures

* Fix receiver linting

* fix broken unit tests

* fix init for route test

* Fix test failures

* add missing ledger to test

* Add release notes

* Reorganize status controller start

* fix race

* separate init and start funcs

* add newline

* remove test sprawl

* reset retention

Add size to ADS push log (#28262)

Add README.md for vendor optimized profiles (#28155)

* Add README.profiles for vendor optimized profiles

* Another attempt at the table

Fix operator revision handling (#28044)

* Fix operator revision handling

* Add revision to installation CR

* Add revision to each resource label

* Update label handling

* Add deployment spec template labels, clean up logging

* Fix test

* Update integration test

* Make gen

* Fix test

* Testing

* Fix tests

Futureproof telemetry envoyfilters a bit (#28176)

remove the install comment (#28243)

* remove the install comment

* Revert "remove the install comment"

This reverts commit 60bc649.

* Update gen-eastwest-gateway.sh

pilot: skip privileged ports when building listeners for non-root gateways (#28268)

* pilot: skip privileged ports when building listeners for non-root gateways

* Add release note

* Use ISTIO_META_UNPRIVILEGED_POD env var instead of a Pod label

Automator: update proxy@master in istio/istio@master (#28281)

istioctl bug-report: do not override system namespaces if --exclude flag is provided (#27989)

Add ingress status integration test (#28263)

clean up: extension configs (#28277)

* clean up extension configs

Signed-off-by: Kuat Yessenov <kuat@google.com>

* make gen

Signed-off-by: Kuat Yessenov <kuat@google.com>

Show empty routes in pc routes (#28170)

```
NOTE: This output only contains routes loaded via RDS.
NAME                                                                                                     DOMAINS                      MATCH                  VIRTUAL SERVICE
https.443.https-443-ingress-service1-default-0.service1-istio-autogenerated-k8s-ingress.istio-system     *                            /*                     404
https.443.https-443-ingress-service2-default-0.service2-istio-autogenerated-k8s-ingress.istio-system     *                            /*                     404
http.80                                                                                                  service1.demo.........io     /*                     service1-demo-......-io-service1-istio-autogenerated-k8s-ingress.default
http.80                                                                                                  service2.demo.........io     /*                     service2-demo-.....i-io-service2-istio-autogenerated-k8s-ingress.default
                                                                                                         *                            /stats/prometheus*
                                                                                                         *                            /healthz/ready*
```

The first 2 lines would not show up without this PR

Add warnings for unknown fields in EnvoyFilter (#28227)

Fixes #26390

Update rather than patch webhook configuration (#28228)

* Update rather than patch webhook configuration

This is a far more flexible pattern, allowing us to have multiple
webhooks and patch them successfully. This pattern follows what
cert-manager does in its webhook patcher (see
pkg/controller/cainjector), which I consider to be top quality code.

* update rbac

Improve error when users use removed addon (#28241)

* Improve error when users use removed addon

After and before:
```
$ grun ./istioctl/cmd/istioctl manifest generate --set addonComponents.foo.enabled=true -d manifests
Error: component "foo" does not exist
$ ik manifest generate --set addonComponents.foo.enabled=true -d manifests
Error: stat manifests/charts/addons/foo: no such file or directory
```

* Fix test

When installing istio-cni remove existing istio-cni plugin before inserting a new one (#28258)

* Remove istio-cni plugin before inserting a new one

* docs: add release notes

Automator: update common-files@master in istio/istio@master (#28278)

Make ingress gateway selector in status watcher match the one used to generate gateway (#28279)

* Check for empty ingress service value when converting ingress to gateway

* Pull ingress gateway selector logic into own func

* Use same ingress gateway selector logic for status watcher as when generating gateways

* Fix status watcher test

Remove time.Sleep hacks for fast tests/non-flaky  (#27741)

* Remove time.Sleep hacks for fast tests

* fix flake

Add grafana templating query for DS_PROMETHEUS and add missing datasource (#28320)

* Add grafana templating query for DS_PROMETHEUS and add missing datasource

* make extension dashboard viewonly

skip ingress test in multicluster (#28321)

E2E test for trust domain alias client side secure naming. (#28206)

* trust domain alias secure naming e2e test

* add dynamic certs and test options

* move under ca_custom_root test folder

* trust domain alias secure naming e2e test

* add dynamic certs and test options

* move under ca_custom_root test folder

* fix host to address

* update script

* refactor based on comments

* updated comments

* add build constraints

* lint fix

* fixes based on comments

Samples: use more common images and delete useless samples (#28215)

Signed-off-by: Xiang Dai <long0dai@foxmail.com>

Wait until ACK before sending additional pushes (#28261)

* Wait until ACK before sending additional pushes

Fixes: #25685

At large scale, Envoy suffers from overload of XDS pushes, and there is
no backpressure in the system. Other control planes, such as any based
on go-control-plane, outperform Istio in config update propagations
under load as a result.

This change adds a backpressure mechanism to ensure we do not push more
configs than Envoy can handle. By slowing down the pushes, the
propagation time of new configurations actually decreases. We do this by
noting, but not sending, any push request whose TypeUrl has
an un-ACKed request in flight. When we get an ACK, if there is a pending
push request we immediately trigger it. This effectively means that
in a high churn environment, each proxy will always have exactly one
outstanding push per type, and when the ACK is received we will
immediately send a new update.

This PR is co-authored by Steve, who did a huge amount of work in
developing this into the state it is today, as well as finding and
testing the problem. See #27563 for
much of this work.

Co-Authored-By: Steven Dake <sdake@ibm.com>

* Refactor and cleanup tests

* Add test
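
The ACK-gated push described above can be sketched as follows. This is a minimal illustration with hypothetical names (`conn`, `RequestPush`, `OnAck` are not Istio's actual API): at most one un-ACKed push per xDS TypeUrl is in flight, later push requests for that type collapse into a single pending flag, and the pending push is flushed as soon as the ACK arrives.

```go
package main

import (
	"fmt"
	"sync"
)

type conn struct {
	mu       sync.Mutex
	inFlight map[string]bool // TypeUrl -> un-ACKed push outstanding
	pending  map[string]bool // TypeUrl -> push requested while one was in flight
	send     func(typeURL string)
}

func newConn(send func(string)) *conn {
	return &conn{inFlight: map[string]bool{}, pending: map[string]bool{}, send: send}
}

// RequestPush is called whenever config relevant to this proxy changes.
func (c *conn) RequestPush(typeURL string) {
	c.mu.Lock()
	if c.inFlight[typeURL] {
		c.pending[typeURL] = true // note it, but do not send yet
		c.mu.Unlock()
		return
	}
	c.inFlight[typeURL] = true
	c.mu.Unlock()
	c.send(typeURL)
}

// OnAck is called when the proxy ACKs a push of the given type.
func (c *conn) OnAck(typeURL string) {
	c.mu.Lock()
	if c.pending[typeURL] {
		delete(c.pending, typeURL)
		c.mu.Unlock()
		c.send(typeURL) // immediately flush the pending push; still in flight
		return
	}
	delete(c.inFlight, typeURL)
	c.mu.Unlock()
}

func main() {
	var sent []string
	c := newConn(func(t string) { sent = append(sent, t) })
	c.RequestPush("cds") // sent immediately
	c.RequestPush("cds") // collapsed into one pending push
	c.RequestPush("cds") // still one pending push
	c.OnAck("cds")       // ACK flushes the single pending push
	fmt.Println(len(sent)) // prints 2
}
```

Note how three config changes result in only two pushes: the churn that arrives while a push is outstanding is merged into one follow-up, which is exactly the "one outstanding push per type" behavior described in the commit message.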

istioctl: fix failure when passing flags to `go test` (#28332)

add xds proxy metrics (#28267)

* add xds proxy metrics

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* lint

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* fix description

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

remove-from-mesh: skip system namespace to remove sidecar (#28187)

* remove-from-mesh: skip system namespace to remove sidecar

* check for -i

Remove accidentally merged debug logs (#28331)

update warning message for upgrading istio version (#28303)

* update warning message for upgrading istio version

* add use before istioctl analyze

Update README.md (#28272)

* Update README.md

* Update README.md

Xds proxy improve (#28307)

* Prevent goroutine leak

* Accelerate by splitting upstream request and response handling

* fix lint

fix manifestpath for verify install. (#28345)

Fix ADSC race (#28342)

* Fix ADSC race

* fix

* fix ut

* Update pkg/istio-agent/local_xds_generator.go

Co-authored-by: Shamsher Ansari <shaansar@redhat.com>

* Update pkg/adsc/adsc_test.go

Co-authored-by: Shamsher Ansari <shaansar@redhat.com>

* Update pilot/pkg/xds/lds_test.go

Co-authored-by: Shamsher Ansari <shaansar@redhat.com>

* Apply shamsher's suggestions from code review

Co-authored-by: Shamsher Ansari <shaansar@redhat.com>

Co-authored-by: Shamsher Ansari <shaansar@redhat.com>

List Istio injectors (#27849)

* Refactor and rename command

* Code nits

* Fix typo

* Print message if no namespaces have injection

* Code nits

* Case where an injected namespace does not yet have pods

* Code cleanup

kube-inject: hide namespace flag in favour of istioNamespace (#28067)

Automator: update proxy@master in istio/istio@master (#28355)

Fix test race in FilterGatewayClusterConf (#28330)

Example failure
https://prow.istio.io/view/gs/istio-prow/logs/unit-tests_istio_postsubmit/4087

In general, we need a better way to mutate feature flags in tests.
Maybe a conditionally compiled mutex. Will open an issue to track
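
One common Go pattern for this problem, sketched here as an assumption (the `enableFeature` flag and `setFlag` helper are hypothetical, not the approach the tracked issue will settle on): override the flag through a helper that hands back a restore function, so each test resets state regardless of run order. Note this still does not make parallel tests safe, which is why the commit suggests a mutex.

```go
package main

import "fmt"

var enableFeature = false // hypothetical package-level feature flag

// setFlag overrides the flag and returns a restore function; in a real
// test you would register the restore function with t.Cleanup.
func setFlag(flag *bool, v bool) (restore func()) {
	old := *flag
	*flag = v
	return func() { *flag = old }
}

func main() {
	restore := setFlag(&enableFeature, true)
	fmt.Println(enableFeature) // prints true while the "test" runs
	restore()
	fmt.Println(enableFeature) // prints false again afterwards
}
```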

Networking: Add scaffold of tunnel EDS builder  (#28244)

* add endpoint tunnel supportability field

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* edsbtswip

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* fix import

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* make gen

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* EndpointsByNetworkFilter refactor

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* fix pkg/pilot/ tests

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* ep builder decide build out tunnel type

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* add basic tunnel eds test

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* make gen

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* fix proxy metadata access

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* make gen

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* fix endpoint COW, h2support bitfield bug

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* make gen without fmt

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* add errgo

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* address comment

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* fmt

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* make gen

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

cni:fix order dependent test failures (#28349)

The `interceptRuleMgrType` is declared in main.go as "iptables"; it is
changed to "mock" in func resetGlobalTestVariables().

When a single test is run on its own, it remains "iptables" and the test fails.

Signed-off-by: Xiang Dai <long0dai@foxmail.com>

fix uninstall test (#28335)

* fix uninstall test

* revert prow change

* address comment

* add logic to deduplicate

Adding Route Specific RateLimiting Test

Signed-off-by: gargnupur <gargnupur@google.com>

remove debug info

Signed-off-by: gargnupur <gargnupur@google.com>

fix failure

Signed-off-by: gargnupur <gargnupur@google.com>
Signed-off-by: gargnupur <gargnupur@google.com>

fix

Signed-off-by: gargnupur <gargnupur@google.com>

add yaml

Signed-off-by: gargnupur <gargnupur@google.com>

fix comments

Signed-off-by: gargnupur <gargnupur@google.com>

fix comments

Signed-off-by: gargnupur <gargnupur@google.com>

fix build

Signed-off-by: gargnupur <gargnupur@google.com>

fix build

Signed-off-by: gargnupur <gargnupur@google.com>

daixiang0 pushed a commit to daixiang0/istio that referenced this issue Nov 19, 2020

Delete reference to old ISTIO_META_PROXY_XDS_VIA_AGENT (istio#28203)

* update expose istiod

* add https sample

* fix tab

* update host + domain

* fix lint

* fix lint

* tweak host

* fix lint

* use tls port

* name port correctly

* change default to tls

* Update samples/multicluster/expose-istiod.yaml

Co-authored-by: Iris <irisding@apache.org>

* Update samples/multicluster/expose-istiod.yaml

Co-authored-by: Iris <irisding@apache.org>

* Revert "Update samples/multicluster/expose-istiod.yaml"

This reverts commit 7feb468.

* Revert "Update samples/multicluster/expose-istiod.yaml"

This reverts commit 98209a0.

* use istiod-remote since pilot is still enabled on remote cluster

* loose up on host name

* adding notes

* clean up this in preview profile

Co-authored-by: Iris <irisding@apache.org>

Avoid telemetry cluster metadata override (istio#28171)

* fix cluster metadata override

* test

* fix

* fix

* fix again

* clean

add telemetry test for customize metrics (istio#27844)

* add test for customize metrics

* address comments

* add remove tag check

* fix test

Delete istiod pods on cleanup (istio#28205)

Otherwise they stay around and can cause other tests to fail.

In a concrete example, deployment "istiod-canary" stays live
and interferes in pilot's TestMultiRevision test, which also
deploys a "istiod-canary", but, since a deployment with that
name already exists, operator doesn't redeploy it, because it's
already there.

Fix HTTPs on HTTP port passthrough (istio#28166)

* Fix HTTPs on HTTP port passthrough

* Add note

remove 1.7 telemetry filters from charts (istio#28195)

use correct env var name (istio#28217)

Align Ingress resource status updates with Ingresses targeted in controller (istio#28225)

make istiod-remote depend on base when installation (istio#28219)

Add remoteIpBlocks functionality to AuthorizationPolicy (istio#27906)

* create remoteIpBlocks and update ipBlocks for AuthorizationPolicy

 By adding remoteIpBlocks and notRemoteIpBlocks in Source,
 an AuthorizationPolicy can trigger actions based on the original
 client IP address gleaned from the X-Forwarded-For header or the
 proxy protocol. The ipBlocks and notIpBlocks fields have also been
 updated to use direct_remote_ip in Envoy instead of source_ip

* use correct attribute for RemoteIpBlocks

* fix unit tests and add integration tests for remote.ip attribute

* fix notRemoteIp integration test

* initialize headers if it is nil

* Combine remoteIp tests into IngressGateway test and add release note

* add titles to links

* remove unneeded tests

* fix quotes in releasenote, run make gen

* make upgradeNotes a list
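
Based on the description above, a policy using the new field might look like this (a sketch; the DENY action, namespace, selector, and CIDR are illustrative, not from the PR):

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-external-range
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
  action: DENY
  rules:
  - from:
    - source:
        # Matches the original client IP taken from X-Forwarded-For or
        # the proxy protocol, rather than the peer connection address.
        remoteIpBlocks: ["192.168.0.0/16"]
```

The contrast with `ipBlocks` is the point of the change: `ipBlocks` now maps to Envoy's `direct_remote_ip` (the immediate peer), while `remoteIpBlocks` inspects the original client address forwarded by intermediaries.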

Remove deprecated istio-coredns plugin (istio#28179)

make stackdriver test platform agnostic (istio#28237)

* make stackdriver test platform agnostic

* fix

* clean up

Add Wasm Extension Dashboard (istio#28209)

* Add WASM Extension Dashboard

* update dashboard

* update dashboard and add cpu/mem

* address review comment

* add excluded

* remove extension dashboard from test allowlist.txt

* update readme

Clean up metadata exchange keys (istio#28249)

* clean up

* cleanup exchange key

Remove unnecessary warning log from ingress status watcher (istio#28254)

vm health checking (istio#28142)

* impl with pilot

* Remove redundant import

* Remove redundant return

* address some concerns

* address more concerns

* Add tests

* fix ci?

* fix ci?

Automator: update proxy@master in istio/istio@master (istio#27786)

pilot: GlobalUnicastIP of a model.Proxy should be set to the 1st applicable IP address in the list (istio#28260)

* pilot: GlobalUnicastIP of a model.Proxy should be set to the 1st applicable IP address in the list

Signed-off-by: Yaroslav Skopets <yaroslav@tetrate.io>

* docs: add release notes

Signed-off-by: Yaroslav Skopets <yaroslav@tetrate.io>
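
The selection rule described in this commit can be illustrated with the standard library's `net.IP.IsGlobalUnicast` (a sketch with a hypothetical helper name, not Istio's actual `model.Proxy` code):

```go
package main

import (
	"fmt"
	"net"
)

// firstGlobalUnicast returns the first address in the list that is a
// global unicast IP, mirroring the "1st applicable IP" rule above.
// Loopback, link-local, multicast, and unparseable entries are skipped.
func firstGlobalUnicast(addrs []string) string {
	for _, a := range addrs {
		if ip := net.ParseIP(a); ip != nil && ip.IsGlobalUnicast() {
			return a
		}
	}
	return "" // no applicable address found
}

func main() {
	// loopback and link-local are skipped; the private 10.x address qualifies
	fmt.Println(firstGlobalUnicast([]string{"127.0.0.1", "fe80::1", "10.1.2.3"}))
}
```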

Adjust Wasm VMs charts order and Add release note (istio#28251)

* Adjust Wasm VMs charts order

* add release note

* replace wasm extension dashboard with real ID

Issue istio#27606: Minor Bug fixes, mostly renaming (istio#28156)

Cleanup ADS tests (istio#28275)

* Cleanup ADS tests

* fix lint

* fix lint

Temporarily skip ratelimit tests (istio#28286)

To help master/1.8 get to a merge-able state
