Helm release stuck with status "pending-upgrade" #7476

Closed
slashr opened this issue Jan 28, 2020 · 15 comments

Comments

@slashr

slashr commented Jan 28, 2020

Output of helm version:
version.BuildInfo{Version:"v3.0.2", GitCommit:"19e47ee3283ae98139d98460de796c1be1e3975f", GitTreeState:"clean", GoVersion:"go1.13.5"}

Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-14T04:24:29Z", GoVersion:"go1.12.13", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.8", GitCommit:"211047e9a1922595eaa3a1127ed365e9299a6c23", GitTreeState:"clean", BuildDate:"2019-10-15T12:02:12Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

Cloud Provider/Platform (AKS, GKE, Minikube etc.):
AWS

I had installed Prometheus Operator around a month ago.

  • helm list --all-namespaces -a

NAME      NAMESPACE   REVISION  UPDATED                               STATUS           CHART                      APP VERSION
spw-test  monitoring  27        2019-12-20 07:24:19.770034 +0100 CET  pending-upgrade  prometheus-operator-8.1.8  0.34.0

The status is stuck on `pending-upgrade` and I can't seem to do anything about it.


  • helm upgrade test . --namespace monitoring

Error: UPGRADE FAILED: "spw-test" has no deployed releases


  • helm list

NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION


  • helm delete spw-test

Error: uninstall: Release not loaded: spw-test: release: not found


  • helm install spw-test . -n monitoring

Error: cannot re-use a name that is still in use

I'd like to either get rid of this release and create a new one, or somehow fix it so that I can run helm upgrade on the release.

@bacongobbler
Member

bacongobbler commented Jan 28, 2020

Have you tried helm delete -n monitoring spw-test? It looks like you forgot to add the namespace parameter to the helm delete example. By default, Helm looks under the namespace set by the current kube config's context, or "default" if none is set.
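For anyone landing here, a minimal sketch of the namespace-scoped commands, using the `spw-test` release and `monitoring` namespace from the report. The helper name `inspect_release` is made up for illustration; `helm status`, `helm history` and `-n` are standard Helm 3.

```sh
# Hypothetical helper: every command gets an explicit namespace, so the
# result does not depend on the namespace of the current kube context.
inspect_release() {
  release="$1"
  namespace="$2"
  # Current state of the release, including pending-* states.
  helm status "$release" -n "$namespace"
  # Revision ledger; useful for picking a rollback target later.
  helm history "$release" -n "$namespace"
}

# Example: inspect_release spw-test monitoring
# To remove the release entirely: helm uninstall spw-test -n monitoring
```

`helm delete` is an alias for `helm uninstall` in Helm 3, so either spelling should work with `-n`.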

@bacongobbler
Member

bacongobbler commented Jan 28, 2020

As for the release being stuck in "pending upgrade", we've seen that occur when the connection times out mid-upgrade or mid-rollback. I believe others have worked around this by manually marking the release as DEPLOYED, usually by editing the secret Helm creates to track the release ledger. This allows them to upgrade, assuming the initial upgrade completed prior to the connection timeout. It isn't recommended unless you know your release is in a good state.
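A read-only sketch of how the release record mentioned above can be inspected, assuming Helm 3's default Secret storage driver, `kubectl` access to the namespace, and GNU `base64` and `gzip`. The helper names are hypothetical. As far as I understand, the status lives both in the decoded JSON (`info.status`) and in the Secret's `status` label, so a manual edit would have to change both; writing the record back is deliberately left out here.

```sh
# Helm 3 stores each revision in a Secret named
# sh.helm.release.v1.<release>.v<revision>, labelled with owner=helm,
# name=<release> and status=<status>.
list_pending_revisions() {
  release="$1"
  namespace="$2"
  kubectl get secret -n "$namespace" \
    -l "owner=helm,name=$release,status=pending-upgrade"
}

# Reads the Secret's .data.release value on stdin and prints the release
# JSON. Helm gzips and base64-encodes the payload, and Kubernetes
# base64-encodes it once more, hence the two decode steps.
decode_release_payload() {
  base64 -d | base64 -d | gzip -dc
}

# Example (read-only), using revision 27 from the report:
#   kubectl get secret sh.helm.release.v1.spw-test.v27 -n monitoring \
#     -o jsonpath='{.data.release}' | decode_release_payload
```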

@slashr
Author

slashr commented Feb 5, 2020

@bacongobbler

Thank you. Still finding it hard to believe that I did not enter the namespace.

@gitumarkk

This also happened to me because my helm upgrade process was interrupted. After some trial and error, a helm rollback restored my Helm state, i.e. helm rollback <name> <revision>. Hope it helps anyone in the same situation.
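A small sketch of that rollback, with an explicit namespace. The helper name `rollback_release` and the revision number in the example are hypothetical; per the Helm docs, a revision of 0 (or omitting it) rolls back to the previous release.

```sh
# Hypothetical helper wrapping the rollback described above.
rollback_release() {
  release="$1"
  namespace="$2"
  revision="${3:-0}"   # 0 means "the previous revision" to helm rollback
  helm rollback "$release" "$revision" -n "$namespace"
}

# Example: pick a revision marked deployed or superseded from
#   helm history spw-test -n monitoring
# then, assuming (hypothetically) that revision 26 was the last good one:
#   rollback_release spw-test monitoring 26
```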

@tamasbege

tamasbege commented Dec 10, 2020

I stumbled upon an older thread with the same issue... Apparently it is more common...
#4558

The manual rollback workaround works; however, it is no solution for an automated CI environment such as Jenkins.

@kitos9112

Yet another one who stumbles upon this issue! During a helm upgrade from a CD system there could be a multitude of events that might end up interrupting the upgrade process. A manual rollback should be a no-go in a zero-touch environment

@fouadsemaan

> This also happened to me because my helm upgrade process was interrupted. After some trial and error, a helm rollback restored my Helm state, i.e. helm rollback <name> <revision>. Hope it helps anyone in the same situation.

Manual intervention across hundreds of automated deployments is risky.

#7476 (comment)

@fouadsemaan

> Yet another one who stumbles upon this issue! During a helm upgrade from a CD system there could be a multitude of events that might end up interrupting the upgrade process. A manual rollback should be a no-go in a zero-touch environment

Hear, hear!

@vasilevp

> Yet another one who stumbles upon this issue! During a helm upgrade from a CD system there could be a multitude of events that might end up interrupting the upgrade process. A manual rollback should be a no-go in a zero-touch environment

One way to avoid manual rollbacks in a CI/CD setting is to run upgrade with the --atomic flag, if you can afford the wait. That way you can automatically roll back after a timeout which, at least in our case, works just fine.
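A sketch of what such an invocation might look like. The helper name `atomic_upgrade` and the 10m timeout are assumptions for illustration (Helm's default timeout is 5m0s); `--atomic` implies `--wait` and rolls back if the upgrade fails or times out.

```sh
# Hypothetical wrapper around an atomic upgrade.
atomic_upgrade() {
  release="$1"
  chart="$2"
  namespace="$3"
  timeout="${4:-10m}"   # assumed value; tune to how long the rollout takes
  helm upgrade "$release" "$chart" \
    --install \
    --namespace "$namespace" \
    --atomic \
    --timeout "$timeout"
}

# Example: atomic_upgrade spw-test . monitoring
```

Note that the rollback is performed by the same Helm client process, so it only happens if that process survives long enough to see the failure.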

@kitos9112

> > Yet another one who stumbles upon this issue! During a helm upgrade from a CD system there could be a multitude of events that might end up interrupting the upgrade process. A manual rollback should be a no-go in a zero-touch environment

> One way to avoid manual rollbacks in a CI/CD setting is to run upgrade with the --atomic flag, if you can afford the wait. That way you can automatically roll back after a timeout which, at least in our case, works just fine.

That's fair enough if you are lucky and your CD job does not fail for some other reason (e.g., an underlying compute issue or a network failure). We've not gotten around this issue yet, and I still see it occurring in 1% of our deployments.
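One possible mitigation for the case where the job dies mid-upgrade is a pre-flight step that checks for a pending release before the next upgrade and rolls it back. This is only a sketch: the helper names are hypothetical, it assumes the previous revision is a good rollback target, and it would also roll back an upgrade that another pipeline is genuinely still running, so it is only reasonable where deployments to a release are serialized.

```sh
# Returns success if the release is currently in a pending-* state.
# `helm list --pending --short` prints one pending release name per line.
is_pending() {
  release="$1"
  namespace="$2"
  helm list -n "$namespace" --pending --short | grep -Fxq "$release"
}

# Pre-flight step for a pipeline: roll back a stuck release before upgrading.
recover_if_pending() {
  release="$1"
  namespace="$2"
  if is_pending "$release" "$namespace"; then
    echo "release $release is pending; rolling back to the previous revision" >&2
    # No revision argument: helm rolls back to the previous release.
    helm rollback "$release" -n "$namespace" --wait
  fi
}

# Example: recover_if_pending spw-test monitoring
```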

@johnib

johnib commented Sep 7, 2023

How come the timeout setting doesn't help in this case? My helm upgrade was stuck in this state for almost 2 weeks.
Eventually we manually rolled it back using helm rollback; however, that is not a good solution for automated systems.

What I'm trying to figure out is what the timeout setting does, if not terminate long-running pending-upgrade operations. Isn't that the sole purpose of this setting?

@zchenyu

zchenyu commented Dec 20, 2023

Bump. Can this issue be re-opened? It seems to be still occurring.

@0054

0054 commented Jan 15, 2024

This problem is relevant for us too.
Sometimes when helm upgrade ... fails with an error, the release status does not change to failed and stays in pending-upgrade.

@sourcehawk

This happens far too often for helm to be a reliable automation tool. Sometimes not even a rollback will work, and the only other option is to completely delete the helm release or the namespace, which is just plain unacceptable for zero-downtime production workloads.

@meredrica

meredrica commented Jun 18, 2024

We just observed this again with fully automated builds.
We run this command: helm upgrade --atomic --cleanup-on-fail --install --set some.value=foo --set other.value=bar releasename path/to/chart followed by a helm test releasename
I'm aware that there are some workarounds, like editing the secret. We opted for deleting the charts since it was our dev instance, but this situation is not ideal.
I think this bug should be reopened and investigated again.

My best guess is that one deployment interrupted another during helm upgrade and that this caused the problem. The deployment job right before the first problematic one was cancelled mid-flight. I can't see any relevant logs of helm being run, but the job might have been cancelled right during the helm upgrade without providing logs (I don't know the GitLab specifics for log collection).
