Upgrading control plane from 1.2.2 to 1.2.5 causing down time #16873
Comments
do you have the Envoy logs during this time? I would not expect Envoy to be restarting as a result of a control plane upgrade |
@howardjohn logs of one of the sidecar envoys that became unavailable - https://gist.github.com/harpratap/d5a7c762e8b1e5e1808698f04b47739d |
`signal: killed` can happen when Envoy is OOM killed. I am not sure if it can happen in other cases, though... Do you still have the ability to debug this? It could be useful to look at Envoy memory usage, and dmesg on the pod may show if it was OOM killed. There may be other causes too.
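A minimal sketch of those checks, assuming the sidecar container is named `istio-proxy` and substituting your own pod name:

```sh
# Did the kernel OOM-kill anything recently? (run on the node, or inside the pod if dmesg is permitted)
dmesg | grep -i -E "oom|killed process"

# Was the istio-proxy container terminated with reason OOMKilled?
kubectl describe pod <pod-name> | grep -A 5 "Last State"

# Current memory usage per container in the pod
kubectl top pod <pod-name> --containers
```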
|
@howardjohn You were right, it is going OOM and getting killed.
We can mitigate this by increasing memory requests, but should the sidecar really have such a high jump in memory usage when we upgrade? Under normal load it stays at about 115Mi and jumps to 180Mi when upgrades happen. |
Thanks @harpratap. Yeah, I would not expect a jump in memory during an upgrade. Maybe a little, because Pilot may send different config when it is updated, but it should be minimal, I would think. In the graph it looks like it spikes, then goes back to the old memory usage. Did it go back to the old usage because you rolled back, or is 13:24 on 1.2.2 and 13:36 on 1.2.5? |
@howardjohn The latter. A few minutes after the upgrade completes, the sidecar container gets killed and comes up again at 13:32. The sidecar itself always stays at 1.2.2, though; no changes were made to that. |
One possibility, I think, is that when you upgrade Pilot, it sends Envoy a new config. Envoy then holds a copy of both the old and new config for some time period, but is that long enough to cause it to get OOM killed? I have definitely seen this before during cert rotation, but I think that scenario is different, and there are 2 separate Envoy processes running in that case. @lambdai does the above sound like a possible cause? Anything else we can debug? |
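If a pod is still reachable, one way to watch for that (a sketch, assuming the sidecar exposes the default Envoy admin endpoint on localhost:15000) is to compare Envoy's heap stats and config-dump size before and after the Pilot upgrade:

```sh
# Envoy heap statistics from the admin endpoint inside the sidecar
kubectl exec <pod-name> -c istio-proxy -- curl -s localhost:15000/memory

# Rough size of the config Envoy is currently holding
kubectl exec <pod-name> -c istio-proxy -- curl -s localhost:15000/config_dump | wc -c
```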
Going through the log: the sidecar proxy is restarted due to a file certs update. This Envoy restart is a hot restart, and it would double the memory usage. I am not sure if it is expected to use file-based certs. @JimmyCYJ how can we determine whether this Pilot should use SDS or file-based certs?
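A rough way to tell which mode a given sidecar is in (a sketch, assuming the 1.2-era defaults where file-based certs are mounted at /etc/certs, a hot restart shows up as multiple envoy processes with different --restart-epoch values, and pgrep is available in the proxy image):

```sh
# File-based certs: Citadel-issued cert files mounted into the sidecar
kubectl exec <pod-name> -c istio-proxy -- ls -l /etc/certs

# Hot restart in progress: more than one envoy process, each with its own --restart-epoch
kubectl exec <pod-name> -c istio-proxy -- pgrep -a envoy
```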
After further investigation it doesn't seem like memory was the issue. We were still seeing 503s even after giving it enough memory, and the usage never seemed to peak past 200MB anyway. It turns out the job "istio-cleanup-secrets" is the one causing this. When this job runs we see a new Envoy process coming up, and the application is no longer considered healthy while the switch is happening. Over that period we see all of the pods becoming unhealthy together. We also see the same thing happening in our istio-ingressgateway. My questions are -
|
@ymesika created that job, but it has existed back since the 1.0 days so it's a very old change. Would be good to understand why it's suddenly causing an issue |
I think there was a change to append the version to the job name. Maybe prior to this the job didn't run on update because the name was not unique, but now it does. Just a guess though - will look into it tomorrow.
|
Wow... this seems like a huge oversight. The job is set to be only run on deletion: istio/install/kubernetes/helm/istio/charts/security/templates/cleanup-secrets.yaml Line 83 in f91f99e
But since you install with `helm template` piped to `kubectl apply`, that hook is never honored. Then, coupled with the fact that we change the job name every time, it runs on upgrades as well. Thank you for finding this. |
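This is easy to confirm from the rendered output: `helm template` emits hook resources like any other manifest, and `kubectl apply` ignores the `helm.sh/hook` annotations, so the job is created on every apply. A quick check (a sketch against the chart layout used in this issue):

```sh
# The cleanup job shows up in the rendered manifests even though it is marked as a deletion hook
helm template install/kubernetes/helm/istio --name istio --namespace istio-system \
  | grep -n -B 2 -A 2 "istio-cleanup-secrets"
```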
@howardjohn which branches are affected? |
1.1 through master. The thing about appending the version to the job name has apparently always been there; I was thinking of a different job, I guess.
|
So if it affects all branches (meaning we've had this for a while and the world hasn't ended), I think we should fix it, but not hold back the 1.1.15 and 1.2.6 patches, which are almost done, and instead get this out in the next iteration. Thoughts? |
If we fix it in the next iteration, does that mean upgrades/downgrades from unfixed versions will always cause downtime? Maybe we need to provide a workaround, e.g. commands/instructions or a script to run against existing installations. |
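One possible stop-gap on unfixed versions (a sketch, mirroring the template-moving trick from the reproduction steps below) is to drop the cleanup job from the chart before rendering, so an upgrade never applies it:

```sh
# Remove (or move aside) the cleanup job template before rendering the chart,
# so the upgrade never re-creates the istio-cleanup-secrets job.
rm install/kubernetes/helm/istio/charts/security/templates/cleanup-secrets.yaml
helm template install/kubernetes/helm/istio/ --namespace istio-system --name istio \
  --values custom.yaml | kubectl -n istio-system apply -f -
```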
@howardjohn this seems to be similar to https://github.com/istio/istio/pull/17033/files |
This was intended to delete Istio secrets after you did `helm remove`. Instead, it deletes secrets during every upgrade, causing outages. Fixes istio#16873
cc |
Confirmed this fixed the ACK ERRORS about certs not found on upgrades as well |
Bug description
![image](https://user-images.githubusercontent.com/5058823/64398445-d432fe80-d09f-11e9-9cf9-440e4927372e.png)
When I upgrade or downgrade between versions 1.2.2 and 1.2.5, my applications that use the sidecar go into an unready state and I see downtime in my services.
My requests follow this path:
Load generator (outside cluster) -> Load Balancer (outside cluster) -> Istio Ingressgateway (inside cluster) -> Application (just simple nginx docker image)
I have about 20 instances of istio-ingressgateway and 60 instances of nginx, and I generate a load of about 15k rps, which this setup normally handles without breaking a sweat.
What I observe when I run `netstat -ltpn` inside the sidecar proxy is that a new Envoy process comes up and the old one goes away. This probably causes the application to become unhealthy, because the new Envoy process isn't listening on port 15090 yet. After a while it does start listening on 15090 and 15001, and the errors go away once all instances are back.
Affected product area (please put an X in all that apply)
[ ] Configuration Infrastructure
[ ] Docs
[X] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[X] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
To not see any effect on my traffic when doing a control plane upgrade of Istio
Steps to reproduce the bug
We consider istio-ingressgateway to also be a part of the data plane and don't want to make any changes to it, so we upgrade everything else. The CNI is running on version 1.2.5.
I try to change versions using these commands -
helm template install/kubernetes/helm/istio-init --name istio-init --namespace istio-system | kubectl apply -f -
mkdir tmp
mv install/kubernetes/helm/istio/charts/gateways/templates/* tmp/
helm template install/kubernetes/helm/istio/ --namespace istio-system --name istio --values custom.yaml | kubectl -n istio-system apply -f -
mv tmp/* install/kubernetes/helm/istio/charts/gateways/templates/
rm -r tmp/
This will temporarily remove all gateway related changes and upgrade everything else.
Version (include the output of `istioctl version --remote` and `kubectl version`)
Istio - 1.2.2 to 1.2.5
Kubernetes - 1.15.0
How was Istio installed?
Using helm template and this custom.yaml for values -
Environment where bug was observed (cloud vendor, OS, etc)
On prem k8s cluster running on bare metal
Additionally, please consider attaching a cluster state archive by attaching
the dump file to this issue.