
Upgrading control plane from 1.2.2 to 1.2.5 causing down time #16873

Closed
harpratap opened this issue Sep 6, 2019 · 19 comments · Fixed by #17122

Comments

@harpratap

harpratap commented Sep 6, 2019

Bug description
When I try to upgrade/downgrade between versions 1.2.2 and 1.2.5, my applications that use the sidecar go into an unready state and I see downtime in my services.
My requests follow this path:
Load generator (outside cluster) -> Load Balancer (outside cluster) -> Istio Ingressgateway (inside cluster) -> Application (just simple nginx docker image)
I have about 20 instances of istio-ingressgateway and 60 instances of nginx, and I generate a load of about 15k rps, which this setup normally handles without breaking a sweat.
What I observe when I run netstat -ltpn inside the sidecar proxy is that a new envoy process comes up and the old one goes away. This probably causes the application to become unhealthy, because the new envoy process isn't listening on port 15090. After a while it does start listening on 15090 and 15001, and the errors go away once all instances are back.
(screenshot: netstat output inside the sidecar)
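For reference, this is roughly how I check the listeners inside the sidecar (a minimal sketch; the namespace and app label are placeholders for our nginx deployment):

# Pick one of the affected application pods (namespace and label are placeholders).
POD=$(kubectl -n default get pods -l app=nginx -o jsonpath='{.items[0].metadata.name}')

# List listening sockets inside the istio-proxy sidecar. During the upgrade the
# new envoy process shows up without the 15090/15001 listeners for a while.
kubectl -n default exec "$POD" -c istio-proxy -- netstat -ltpn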

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[X] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[X] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
To not see any effect on my traffic when doing a control plane upgrade of Istio.

Steps to reproduce the bug
We consider istio-ingressgateway to also be part of the data plane and don't want to make any changes to it, so we upgrade everything else. CNI is running version 1.2.5.
I try to change versions using these commands:
helm template install/kubernetes/helm/istio-init --name istio-init --namespace istio-system | kubectl apply -f -

mkdir tmp
mv install/kubernetes/helm/istio/charts/gateways/templates/* tmp/
helm template install/kubernetes/helm/istio/ --namespace istio-system --name istio --values custom.yaml | kubectl -n istio-system apply -f -
mv tmp/* install/kubernetes/helm/istio/charts/gateways/templates/
rm -r tmp/

This will temporarily remove all gateway related changes and upgrade everything else.

Version (include the output of istioctl version --remote and kubectl version)
Istio - 1.2.2 to 1.2.5
Kubernetes - 1.15.0

How was Istio installed?
Using helm template and this custom.yaml for values -

gateways:
  istio-ingressgateway:
    type: NodePort
    autoscaleMin: 20
    autoscaleMax: 20
    ports:
    - port: 80
      targetPort: 80
      name: http2
      nodePort: 60000
    - port: 443
      name: https
      nodePort: 60001
    - port: 31400
      name: tcp
      nodePort: 61400
    resources:
      requests:
        cpu: 2
        memory: 512Mi
      limits:
        cpu: 2
        memory: 512Mi
kiali:
  enabled: true
  dashboard:
    grafanaURL: http://grafana:3000
    jaegerURL: http://tracing:80
  resources:
    requests:
      cpu: 4
      memory: 4096Mi
    limits:
      cpu: 4
      memory: 4096Mi
  createDemoSecret: true
  prometheusAddr: prometheus.internal.com

mixer:
  policy:
    enabled: false
  telemetry:
    autoscaleMin: 30
    autoscaleMax: 100

grafana:
  enabled: true

pilot:
  traceSampling: 100.0
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1
      memory: 2Gi

tracing:
  enabled: false

istio_cni:
  enabled: true

global:
  policyCheckFailOpen: true
  proxy:
    logLevel: "error"
    resources:
      requests:
        cpu: 50m
        memory: 180Mi
      limits:
        cpu: 2
        memory: 512Mi
  defaultResources:
    requests:
      cpu: 1
      memory: 2048Mi
    limits:
      cpu: 2
      memory: 2048Mi

Environment where bug was observed (cloud vendor, OS, etc)
On prem k8s cluster running on bare metal


harpratap changed the title from "Upgrading control plane from 1.2.2 to 1.25 causing down time" to "Upgrading control plane from 1.2.2 to 1.2.5 causing down time" on Sep 6, 2019
@howardjohn
Member

Do you have the Envoy logs during this time? I would not expect Envoy to be restarting as a result of a control plane upgrade.
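If it helps, something along these lines should capture them (a rough sketch; pod name and namespace are placeholders):

# Sidecar proxy logs for one of the affected pods.
kubectl -n default logs <nginx-pod-name> -c istio-proxy --timestamps

# If the proxy container was restarted, also grab the previous instance.
kubectl -n default logs <nginx-pod-name> -c istio-proxy --previous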

@harpratap
Author

@howardjohn logs of one of the sidecar envoys that became unavailable - https://gist.github.com/harpratap/d5a7c762e8b1e5e1808698f04b47739d

@howardjohn
Member

howardjohn commented Sep 6, 2019 via email

@harpratap
Author

harpratap commented Sep 6, 2019

@howardjohn You were right, it is going OOM and getting killed.
We have set 180Mi as the memory request right now; I will try increasing this to see how much it maxes out at.

[17629535.989028] Call Trace:
[17629535.992554]  dump_stack+0x5c/0x85
[17629535.997049]  dump_header+0x94/0x229
[17629536.001755]  oom_kill_process+0x213/0x410
[17629536.007031]  out_of_memory+0x2ab/0x4c0
[17629536.011991]  mem_cgroup_out_of_memory+0x49/0x80
[17629536.017839]  mem_cgroup_oom_synchronize+0x2ed/0x330
[17629536.024091]  ? mem_cgroup_css_online+0x30/0x30
[17629536.029819]  pagefault_out_of_memory+0x32/0x77
[17629536.035559]  __do_page_fault+0x4b3/0x4c0
[17629536.040720]  ? page_fault+0x2f/0x50
[17629536.045392]  page_fault+0x45/0x50
[17629536.049882] RIP: 1b23a000:0x15db160
[17629536.054575] RSP: 0003:00007fa6b7849a00 EFLAGS: 1b23c000
[17629536.054698] Task in /kubepods/burstable/podcd6d0d3b-67fc-4cf2-b9a9-d31608f628c5/6f1d07e040ed704c35e41c62e5dffebac034a8609e520ab81d7d23117cf924e6 killed as a result of limit of /kubepods/burstable/podcd6d0d3b-67fc-4cf2-b9a9-d31608f628c5/6f1d07e040ed704c35e41c62e5dffebac034a8609e520ab81d7d23117cf924e6
[17629536.093061] memory: usage 524288kB, limit 524288kB, failcnt 0
[17629536.100287] memory+swap: usage 524288kB, limit 524288kB, failcnt 67
[17629536.108211] kmem: usage 25320kB, limit 9007199254740988kB, failcnt 0
[17629536.116130] Memory cgroup stats for /kubepods/burstable/podcd6d0d3b-67fc-4cf2-b9a9-d31608f628c5/6f1d07e040ed704c35e41c62e5dffebac034a8609e520ab81d7d23117cf924e6: cache:20KB rss:498948KB rss_huge:102400KB shmem:20KB mapped_file:4KB dirty:0KB writeback:0KB swap:0KB inactive_anon:4KB active_anon:498852KB inactive_file:0KB active_file:0KB unevictable:0KB
[17629536.153365] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[17629536.164341] [16464]  1337 16464     9811     5663      24       5        0           995 pilot-agent
[17629536.175818] [16626]  1337 16626    53752    29113      75       3        0           995 envoy
[17629536.186684] [21082]  1337 21082   124407   101624     217       3        0           995 envoy
[17629536.197529] Memory cgroup out of memory: Kill process 21082 (envoy) score 1771 or sacrifice child
[17629536.208705] Killed process 21082 (envoy) total-vm:497628kB, anon-rss:389816kB, file-rss:16676kB, shmem-rss:4kB
[17629536.241470] oom_reaper: reaped process 21082 (envoy), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB

(screenshot: sidecar memory usage graph around the upgrade)

We can mitigate this by increasing memory requests, but shouldn't the sidecar avoid such a high jump in memory usage when we upgrade? Under normal load it stays at about 115Mi and jumps to 180Mi when upgrades happen.
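For the record, this is roughly how we bump the sidecar memory request without editing the rest of custom.yaml (a sketch; 256Mi is just an example value):

# Re-render with a higher sidecar memory request; equivalent to changing
# global.proxy.resources.requests.memory in custom.yaml.
helm template install/kubernetes/helm/istio/ --namespace istio-system --name istio \
  --values custom.yaml \
  --set global.proxy.resources.requests.memory=256Mi \
  | kubectl -n istio-system apply -f -

As far as I understand, existing pods only pick this up after they are re-injected, since the proxy resources are set at sidecar injection time.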

@howardjohn
Member

Thanks @harpratap. Yeah, I would not expect a jump in memory during upgrade. Maybe a little bit, because Pilot may be sending different config when it is updated, but it should be minimal, I would think.

In the graph it looks like it spikes, then goes back to old memory usage. Did it go back to old usage because you rolled back, or is 13:24 on 1.2.2 and 13:36 on 1.2.5?

@harpratap
Author

@howardjohn The latter. A few minutes after the upgrade completes, the sidecar container gets killed and comes up again at 13:32. The sidecar itself always stays at 1.2.2 though; no changes are made to it.

@howardjohn
Member

One possibility, I think, is that when you upgrade Pilot, it sends Envoy new config. Envoy then holds a copy of both the old and the new config for some time period; maybe that period is long enough to cause it to get OOM killed?

I have definitely seen this before during cert rotation, but I think that scenario is different and there are 2 separate envoy processes running in that case.

@lambdai does the above sound like a possible cause? Anything else we can debug?
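One way to watch for that (a rough sketch, assuming the default Envoy admin port 15000 and that curl is available in the proxy image):

# Envoy heap stats from the sidecar's local admin endpoint. Watching
# server.memory_allocated / server.memory_heap_size across the upgrade should
# show whether holding old + new config is what pushes it over the limit.
kubectl -n default exec <nginx-pod-name> -c istio-proxy -- \
  curl -s localhost:15000/stats | grep server.memory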

@lambdai
Contributor

lambdai commented Sep 9, 2019

Going through the log in
https://gist.github.com/harpratap/d5a7c762e8b1e5e1808698f04b47739d#file-istio-proxy-faling-log-L352-L355

The sidecar proxy is restarted due to a file certs update. This Envoy restart is a hot restart, and it would double the memory usage.

I am not sure if it is expected to be using file-based certs.

@JimmyCYJ How can we determine whether this Pilot should use SDS or file-based certs?
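One way to confirm the hot restart (a sketch; the pod name is a placeholder, and it assumes ps is available in the proxy image): pilot-agent launches each new envoy with an incremented --restart-epoch, so two envoy processes with different epochs running side by side is exactly the window where memory roughly doubles.

# Look for two envoy processes with different --restart-epoch values.
kubectl -n default exec <nginx-pod-name> -c istio-proxy -- sh -c 'ps -ef | grep [e]nvoy'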

@harpratap
Author

harpratap commented Sep 13, 2019

After further investigation it doesn't seem like memory was the issue. We were still seeing 503s even after giving it enough memory, and usage never seemed to peak past 200MB anyway. It turns out the job "istio-cleanup-secrets" is the one causing this. When this job runs we see a new envoy process coming up, and the application is no longer considered healthy while the switch is happening. Over that period we see all of the pods becoming unhealthy together. We also see the same thing happening in our istio-ingressgateway. My questions are:

  1. Do we really need this job to run on every upgrade, even if we don't use mTLS?
  2. Is it possible to improve this to delete the secrets one by one? It would still cause issues when you have a single pod running or if you aren't over-provisioning your deployments, but I think the risk in those cases is acceptable.
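For anyone trying to reproduce this, a rough way to watch it happen (a sketch; I believe the cert secrets use type istio.io/key-and-cert, but verify in your cluster):

# The cleanup job shows up in istio-system while the "upgrade" is applied.
kubectl -n istio-system get jobs | grep cleanup-secrets

# Watch the cert secrets being deleted and then re-created by Citadel; that is
# what forces every sidecar (and the ingressgateway) to reload certs at once.
kubectl get secrets --all-namespaces --field-selector type=istio.io/key-and-cert -w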

@duderino
Contributor

@ymesika created that job, but it has existed since the 1.0 days, so it's a very old change. It would be good to understand why it's suddenly causing an issue.

@howardjohn
Member

howardjohn commented Sep 13, 2019 via email

@howardjohn
Member

Wow... this seems like a huge oversight. The job is set to be run only on deletion:

But since you use helm template (which is absolutely supported; most of us do not like Tiller), this actually runs on install!

Then, coupled with the fact that we change the job name every time, it runs on upgrades as well. Thank you for finding this.
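You can see it directly in the rendered output (a rough sketch; it assumes the Job carries the helm.sh/hook annotation mentioned above): helm template prints hook resources like any other manifest, and kubectl apply knows nothing about Helm hooks, so the Job is created immediately.

# The cleanup Job and its hook annotation show up in the plain rendered manifest.
helm template install/kubernetes/helm/istio --name istio --namespace istio-system \
  --values custom.yaml | grep -B 8 'helm.sh/hook'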

@duderino
Contributor

@howardjohn which branches are affected?

duderino added this to the 1.3 milestone on Sep 13, 2019
@howardjohn
Member

howardjohn commented Sep 13, 2019 via email

@duderino
Contributor

duderino commented Sep 13, 2019

So if it affects all branches (meaning we've had this for a while and the world hasn't ended), I think that means we should fix it, but not hold back the 1.1.15 and 1.2.6 patches, which are almost done, and instead get this out in the next iteration. Thoughts?

@elfinhe
Member

elfinhe commented Sep 13, 2019

If we fix it in the next iteration, does that mean upgrades/downgrades from unfixed versions will always cause downtime? Maybe we need to provide a workaround, e.g. commands, instructions, or a script to run against existing installations.
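Something like the following might work as an interim workaround, mirroring the gateway-template trick above (a sketch only; the template path is my guess for the 1.2 charts and should be verified first):

# Set the cleanup job template aside so the rendered manifest no longer contains it.
# The path below is a guess; confirm it with: grep -rl cleanup-secrets install/kubernetes/helm/istio/
mv install/kubernetes/helm/istio/charts/security/templates/cleanup-secrets.yaml /tmp/

helm template install/kubernetes/helm/istio/ --namespace istio-system --name istio \
  --values custom.yaml | kubectl -n istio-system apply -f -

mv /tmp/cleanup-secrets.yaml install/kubernetes/helm/istio/charts/security/templates/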

@nrjpoddar
Member

@howardjohn this seems to be similar to https://github.com/istio/istio/pull/17033/files

howardjohn added a commit to howardjohn/istio that referenced this issue Sep 14, 2019
This was intended to delete Istio secrets after you did `helm remove`.
Instead, it deletes secrets during every upgrade, causing outages.

Fixes istio#16873
@hzxuzhonghu
Member

cc

istio-testing pushed a commit that referenced this issue Sep 17, 2019
This was intended to delete Istio secrets after you did `helm remove`.
Instead, it deletes secrets during every upgrade, causing outages.

Fixes #16873
istio-testing pushed a commit to istio-testing/istio that referenced this issue Sep 17, 2019
This was intended to delete Istio secrets after you did `helm remove`.
Instead, it deletes secrets during every upgrade, causing outages.

Fixes istio#16873
istio-testing added a commit that referenced this issue Sep 17, 2019
This was intended to delete Istio secrets after you did `helm remove`.
Instead, it deletes secrets during every upgrade, causing outages.

Fixes #16873
@howardjohn
Member

Confirmed this also fixed the ACK errors about certs not being found during upgrades.

istio-testing pushed a commit to istio-testing/istio that referenced this issue Sep 18, 2019
This was intended to delete Istio secrets after you did `helm remove`.
Instead, it deletes secrets during every upgrade, causing outages.

Fixes istio#16873
fpesce pushed a commit that referenced this issue Oct 15, 2019
This was intended to delete Istio secrets after you did `helm remove`.
Instead, it deletes secrets during every upgrade, causing outages.

Fixes #16873