Upgrading control plane from 1.2.2 to 1.2.5 causing down time #16873

harpratap · 2019-09-06T03:12:30Z

Bug description
When I try to upgrade-downgrade between versions 1.2.2 and 1.2.5 my applications which are using sidecar goes into unready state and I see a downtime in my services.
My requests follow this path:
Load generator (outside cluster) -> Load Balancer (outside cluster) -> Istio Ingressgateway (inside cluster) -> Application (just simple nginx docker image)
I have about 20 instances of istio-ingressgateway and 60 instances of nginx and I generate a load of about 15k rps which normally this setup handles without a sweat.
What I observe when I do netstat -ltpn inside sidecar proxy is that a new envoy process comes up and old one goes away, this probably causes the application to become unhealthy because this new envoy process isn't listening on port 15090. After a while it does start listening to 15090 and 15001 and the errors go away once all instances are back.

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[X] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[X] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
To not see any affect on my traffic when doing control plane upgrade of Istio

Steps to reproduce the bug
We consider istio-ingressgateway to also be a part of data plane and don't want to make any changes to it. We upgrade everything else other than this. CNI is running on version 1.2.5
I try to change versions using these commands -
helm template install/kubernetes/helm/istio-init --name istio-init --namespace istio-system | kubectl apply -f -

mkdir tmp
mv install/kubernetes/helm/istio/charts/gateways/templates/* tmp/
helm template install/kubernetes/helm/istio/ --namespace istio-system --name istio --values custom.yaml | kubectl -n istio-system apply -f -
mv tmp/* install/kubernetes/helm/istio/charts/gateways/templates/
rm -r tmp/

This will temporarily remove all gateway related changes and upgrade everything else.

Version (include the output of istioctl version --remote and kubectl version)
Istio - 1.2.2 to 1.2.5
Kubernetes - 1.15.0

How was Istio installed?
Using helm template and this custom.yaml for values -

gateways:
  istio-ingressgateway:
    type: NodePort
    autoscaleMin: 20
    autoscaleMax: 20
    ports:
    - port: 80
      targetPort: 80
      name: http2
      nodePort: 60000
    - port: 443
      name: https
      nodePort: 60001
    - port: 31400
      name: tcp
      nodePort: 61400
    resources:
      requests:
        cpu: 2
        memory: 512Mi
      limits:
        cpu: 2
        memory: 512Mi
kiali:
  enabled: true
  dashboard:
    grafanaURL: http://grafana:3000
    jaegerURL: http://tracing:80
  resources:
    requests:
      cpu: 4
      memory: 4096Mi
    limits:
      cpu: 4
      memory: 4096Mi
  createDemoSecret: true
  prometheusAddr: prometheus.internal.com

mixer:
  policy:
    enabled: false
  telemetry:
    autoscaleMin: 30
    autoscaleMax: 100

grafana:
  enabled: true

pilot:
  traceSampling: 100.0
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1
      memory: 2Gi

tracing:
  enabled: false

istio_cni:
  enabled: true

global:
  policyCheckFailOpen: true
  proxy:
    logLevel: "error"
    resources:
      requests:
        cpu: 50m
        memory: 180Mi
      limits:
        cpu: 2
        memory: 512Mi
  defaultResources:
    requests:
      cpu: 1
      memory: 2048Mi
    limits:
      cpu: 2
      memory: 2048Mi

Environment where bug was observed (cloud vendor, OS, etc)
On prem k8s cluster running on bare metal

Additionally, please consider attaching a cluster state archive by attaching
the dump file to this issue.

The text was updated successfully, but these errors were encountered:

howardjohn · 2019-09-06T03:23:14Z

do you have the Envoy logs during this time? I would not expect Envoy to be restarting as a result of a control plane upgrade

harpratap · 2019-09-06T04:34:47Z

@howardjohn logs of one of the sidecar envoys that became unavailable - https://gist.github.com/harpratap/d5a7c762e8b1e5e1808698f04b47739d

howardjohn · 2019-09-06T04:42:46Z

`signal: killed` can happen when Envoy is oom killed. I am not sure if it can happen in other cases though... Do you still have ability to debug this? it could be useful to look at envoy memory usage and dmesg on the pod may show if it was oom killed? There may be other causes too

…

On Thu, Sep 5, 2019, 9:35 PM Harpratap Singh Layal ***@***.***> wrote: @howardjohn <https://github.com/howardjohn> logs of one of the sidecar envoys that became unavailable - https://gist.github.com/harpratap/d5a7c762e8b1e5e1808698f04b47739d — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16873?email_source=notifications&email_token=AAEYGXOM4CYFH3NXDNBQNU3QIHMZLA5CNFSM4IUEPLJ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6BWETA#issuecomment-528704076>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEYGXMBONM2R3C2ZHJ5LWLQIHMZLANCNFSM4IUEPLJQ> .

harpratap · 2019-09-06T05:29:05Z

@howardjohn You were right, it is going OOM and getting killed.
We have set 180Mi as memory request right now, will try increasing this to see how much it maxes out at

�[32m[17629535.989028] �[0m�[1mCall Trace:�[0m
�[32m[17629535.992554] �[0m�[1m dump_stack+0x5c/0x85�[0m
�[32m[17629535.997049] �[0m�[1m dump_header+0x94/0x229�[0m
�[32m[17629536.001755] �[0m�[1m oom_kill_process+0x213/0x410�[0m
�[32m[17629536.007031] �[0m�[1m out_of_memory+0x2ab/0x4c0�[0m
�[32m[17629536.011991] �[0m�[1m mem_cgroup_out_of_memory+0x49/0x80�[0m
�[32m[17629536.017839] �[0m�[1m mem_cgroup_oom_synchronize+0x2ed/0x330�[0m
�[32m[17629536.024091] �[0m�[1m ? mem_cgroup_css_online+0x30/0x30�[0m
�[32m[17629536.029819] �[0m�[1m pagefault_out_of_memory+0x32/0x77�[0m
�[32m[17629536.035559] �[0m�[1m __do_page_fault+0x4b3/0x4c0�[0m
�[32m[17629536.040720] �[0m�[1m ? page_fault+0x2f/0x50�[0m
�[32m[17629536.045392] �[0m�[1m page_fault+0x45/0x50�[0m
�[32m[17629536.049882] �[0m�[33mRIP�[0m�[1m: 1b23a000:0x15db160�[0m
�[32m[17629536.054575] �[0m�[33mRSP�[0m�[1m: 0003:00007fa6b7849a00 EFLAGS: 1b23c000�[0m
�[32m[17629536.054698] �[0mTask in /kubepods/burstable/podcd6d0d3b-67fc-4cf2-b9a9-d31608f628c5/6f1d07e040ed704c35e41c62e5dffebac034a8609e520ab81d7d23117cf924e6 killed as a result of limit of /kubepods/burstable/podcd6d0d3b-67fc-4cf2-b9a9-d31608f628c5/6f1d07e040ed704c35e41c62e5dffebac034a8609e520ab81d7d23117cf924e6
�[32m[17629536.093061] �[0m�[33mmemory�[0m: usage 524288kB, limit 524288kB, failcnt 0
�[32m[17629536.100287] �[0m�[33mmemory+swap�[0m: usage 524288kB, limit 524288kB, failcnt 67
�[32m[17629536.108211] �[0m�[33mkmem�[0m: usage 25320kB, limit 9007199254740988kB, failcnt 0
�[32m[17629536.116130] �[0m�[33mMemory cgroup stats for /kubepods/burstable/podcd6d0d3b-67fc-4cf2-b9a9-d31608f628c5/6f1d07e040ed704c35e41c62e5dffebac034a8609e520ab81d7d23117cf924e6�[0m: cache:20KB rss:498948KB rss_huge:102400KB shmem:20KB mapped_file:4KB dirty:0KB writeback:0KB swap:0KB inactive_anon:4KB active_anon:498852KB inactive_file:0KB active_file:0KB unevictable:0KB
�[32m[17629536.153365] �[0m[ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
�[32m[17629536.164341] �[0m[16464]  1337 16464     9811     5663      24       5        0           995 pilot-agent
�[32m[17629536.175818] �[0m[16626]  1337 16626    53752    29113      75       3        0           995 envoy
�[32m[17629536.186684] �[0m[21082]  1337 21082   124407   101624     217       3        0           995 envoy
�[32m[17629536.197529] �[0m�[33mMemory cgroup out of memory�[0m�[31m: Kill process 21082 (envoy) score 1771 or sacrifice child�[0m
�[32m[17629536.208705] �[0m�[31mKilled process 21082 (envoy) total-vm:497628kB, anon-rss:389816kB, file-rss:16676kB, shmem-rss:4kB�[0m
�[32m[17629536.241470] �[0m�[33moom_reaper�[0m: reaped process 21082 (envoy), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB

We can mitigate this by increasing memory requests, but shouldn't the sidecar not have such a high jump in memory usage when we upgrade? When running normal load it stays at about 115Mi and jumps to 180Mi when upgrades happen.

howardjohn · 2019-09-06T15:20:50Z

Thanks @harpratap . Yeah I would not expect a jump in memory during upgrade. Maybe a little bit because Pilot may be sending different config when it is updated, but should be minimal I would think.

In the graph it looks like it spikes, then goes back to old memory usage. Did it go back to old usage because you rolled back, or is 13:24 on 1.2.2 and 13:36 on 1.2.5?

harpratap · 2019-09-09T03:12:47Z

@howardjohn The latter, Few minutes after the upgrade is completed the sidecar container gets killed and comes up again at 13:32. The sidecar itself always stays at 1.2.2 though, no changes made to that.

howardjohn · 2019-09-09T03:18:48Z

One possibility I think is that when you upgrade the pilot, it sends envoy s new config. Envoy then has a copy of the old and new config for some time period, but it is a long enough time to cause it to get oom killed?

I have definitely seen this before during cert rotation, but I think that scenario is different and there are 2 separate envoy processes running in that case.

@lambdai does the above some like a possible cause? anything else we can debug?

lambdai · 2019-09-09T18:47:19Z

Going through the log in
https://gist.github.com/harpratap/d5a7c762e8b1e5e1808698f04b47739d#file-istio-proxy-faling-log-L352-L355

Sidecar proxy is restarted due to file certs update. This envoy restart is hotrestart and it would double the memory usage.

I am not sure if it is expected to use file based certs

@JimmyCYJ How can we determine if this pilot should use sds or file based?

harpratap · 2019-09-13T03:11:02Z

After further investigation it doesn't seem like memory was the issue. We were still seeing 503s even after giving it enough memory and the usage never seem to peak past 200MB anyway. Turns out the job "istio-cleanup-secrets" is the one causing this. When running this job we see a new envoy process coming up and the application is no longer considered healthy while the switch is happening. Over the period we see all of the pods becoming unhealthy together. We do also see the same thing happening in our istio-ingressgateway too. My questions are -

Do we really need this job to be run for every upgrade? Even if we don't use mTLS?
Is it possible to improve this to delete the secrets one by one? It would still cause issues when you have a single pod running or if you aren't over-provisioning your deployments but I think the risk in these cases is acceptable

duderino · 2019-09-13T03:37:54Z

@ymesika created that job, but it has existed back since the 1.0 days so it's a very old change. Would be good to understand why it's suddenly causing an issue

howardjohn · 2019-09-13T03:50:34Z

I think there was a change to append the version to the job name. Maybe prior to this the job didn't run on update because they name was not unique but now it does. Just a guess though - will look into it tomorrow

…

On Thu, Sep 12, 2019, 8:38 PM Joshua Blatt ***@***.***> wrote: @ymesika <https://github.com/ymesika> created that job, but it has existed back since the 1.0 days so it's a very old change. Would be good to understand why it's suddenly causing an issue — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16873?email_source=notifications&email_token=AAEYGXPBPQJ6FWDOFY5YIDLQJMDMTA5CNFSM4IUEPLJ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6T3UPA#issuecomment-531085884>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEYGXKEUCFPVLVXWEINC33QJMDMTANCNFSM4IUEPLJQ> .

howardjohn · 2019-09-13T15:10:11Z

Wow... this seems like a huge oversight. The job is set to be only run on deletion:

istio/install/kubernetes/helm/istio/charts/security/templates/cleanup-secrets.yaml

Line 83 in f91f99e

"helm.sh/hook": post-delete

But since you use helm template (which is absolutely supported - most of us do not like Tiller) this actually runs on install!

Then, coupled with the fact that we change the job named every time, it runs on upgrades as well. Thank you for finding this.

duderino · 2019-09-13T15:28:06Z

@howardjohn which branches are affected?

howardjohn · 2019-09-13T15:31:38Z

1.1 - master. The thing about appending the version to the job name has apparently always been there, I was thinking of a different job I guess.

…

On Fri, Sep 13, 2019 at 8:28 AM Joshua Blatt ***@***.***> wrote: @howardjohn <https://github.com/howardjohn> which branches are affected? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#16873?email_source=notifications&email_token=AAEYGXMD2SP2K5QUUXV33NTQJOWTTA5CNFSM4IUEPLJ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6VLZSY#issuecomment-531283147>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEYGXOF7IIHRAFQP7DGN5LQJOWTTANCNFSM4IUEPLJQ> .

duderino · 2019-09-13T15:33:03Z

So if it affects all branches (meaning we've had this for a while and the world hasn't ended), I think that means we should fix it, but not hold back the 1.1.15 and 1.2.6 patches which are almost done and instead get this out in the next iteration. Thoughts?

elfinhe · 2019-09-13T16:56:33Z

fixing it in next iteration, does this mean upgrade/downgrade from unfixed versions will always cause downtime? maybe we need to provide a workaround, e.g, commands+instructions/script on running instances.

nrjpoddar · 2019-09-13T17:01:19Z

@howardjohn this seems to be similar to https://github.com/istio/istio/pull/17033/files

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes istio#16873

hzxuzhonghu · 2019-09-17T03:29:22Z

cc

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes #16873

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes istio#16873

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes #16873

howardjohn · 2019-09-18T18:55:51Z

Confirmed this fixed the ACK ERRORS about certs not found on upgrades as well

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes istio#16873

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes #16873

istio-policy-bot added area/environments area/networking area/test and release labels Sep 6, 2019

harpratap changed the title ~~Upgrading control plane from 1.2.2 to 1.25 causing down time~~ Upgrading control plane from 1.2.2 to 1.2.5 causing down time Sep 6, 2019

howardjohn removed the area/test and release label Sep 11, 2019

duderino assigned howardjohn Sep 13, 2019

duderino added this to the 1.3 milestone Sep 13, 2019

howardjohn added a commit to howardjohn/istio that referenced this issue Sep 14, 2019

Remove cleanup-secrets job

e3d7405

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes istio#16873

howardjohn mentioned this issue Sep 16, 2019

Remove cleanup-secrets job #17122

Merged

istio-testing closed this as completed in #17122 Sep 17, 2019

istio-testing pushed a commit that referenced this issue Sep 17, 2019

Remove cleanup-secrets job (#17122)

2ebb216

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes #16873

istio-testing pushed a commit to istio-testing/istio that referenced this issue Sep 17, 2019

Remove cleanup-secrets job

e6b200f

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes istio#16873

istio-testing added a commit that referenced this issue Sep 17, 2019

Remove cleanup-secrets job (#17150)

c8dd421

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes #16873

istio-testing pushed a commit to istio-testing/istio that referenced this issue Sep 18, 2019

Remove cleanup-secrets job

ad36b35

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes istio#16873

fpesce pushed a commit that referenced this issue Oct 15, 2019

Remove cleanup-secrets job (#17189)

07f7933

This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes #16873

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrading control plane from 1.2.2 to 1.2.5 causing down time #16873

Upgrading control plane from 1.2.2 to 1.2.5 causing down time #16873

harpratap commented Sep 6, 2019 •

edited

Loading

howardjohn commented Sep 6, 2019

harpratap commented Sep 6, 2019

howardjohn commented Sep 6, 2019 via email

harpratap commented Sep 6, 2019 •

edited

Loading

howardjohn commented Sep 6, 2019

harpratap commented Sep 9, 2019

howardjohn commented Sep 9, 2019

lambdai commented Sep 9, 2019

harpratap commented Sep 13, 2019 •

edited

Loading

duderino commented Sep 13, 2019

howardjohn commented Sep 13, 2019 via email

howardjohn commented Sep 13, 2019

duderino commented Sep 13, 2019

howardjohn commented Sep 13, 2019 via email

duderino commented Sep 13, 2019 •

edited

Loading

elfinhe commented Sep 13, 2019 •

edited

Loading

nrjpoddar commented Sep 13, 2019

hzxuzhonghu commented Sep 17, 2019

howardjohn commented Sep 18, 2019

Upgrading control plane from 1.2.2 to 1.2.5 causing down time #16873

Upgrading control plane from 1.2.2 to 1.2.5 causing down time #16873

Comments

harpratap commented Sep 6, 2019 • edited Loading

howardjohn commented Sep 6, 2019

harpratap commented Sep 6, 2019

howardjohn commented Sep 6, 2019 via email

harpratap commented Sep 6, 2019 • edited Loading

howardjohn commented Sep 6, 2019

harpratap commented Sep 9, 2019

howardjohn commented Sep 9, 2019

lambdai commented Sep 9, 2019

harpratap commented Sep 13, 2019 • edited Loading

duderino commented Sep 13, 2019

howardjohn commented Sep 13, 2019 via email

howardjohn commented Sep 13, 2019

duderino commented Sep 13, 2019

howardjohn commented Sep 13, 2019 via email

duderino commented Sep 13, 2019 • edited Loading

elfinhe commented Sep 13, 2019 • edited Loading

nrjpoddar commented Sep 13, 2019

hzxuzhonghu commented Sep 17, 2019

howardjohn commented Sep 18, 2019

harpratap commented Sep 6, 2019 •

edited

Loading

harpratap commented Sep 6, 2019 •

edited

Loading

harpratap commented Sep 13, 2019 •

edited

Loading

duderino commented Sep 13, 2019 •

edited

Loading

elfinhe commented Sep 13, 2019 •

edited

Loading