
helm upgrade > timeout on pre-upgrade hook > revision stuck in PENDING_UPGRADE and multiple DEPLOYED revisions arise soon #4558

Closed
consideRatio opened this issue Aug 29, 2018 · 62 comments
Labels
bug Categorizes issue or PR as related to a bug. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. Stale

Comments

@consideRatio
Contributor

Reproduction and symptom

  1. helm upgrade with a helm pre-upgrade hook that times out.
  2. Error: UPGRADE FAILED: timed out waiting for the condition.
  3. helm history my-release-name
    # the last line...
    22      	Wed Aug 29 17:59:48 2018	PENDING_UPGRADE	jupyterhub-0.7-04ccf1a 	Preparing upgrade

Expected outcome

The revision should end up as FAILED rather than PENDING_UPGRADE, right?

@consideRatio
Contributor Author

consideRatio commented Aug 29, 2018

This may also have led directly to multiple revisions being considered DEPLOYED, but I have failed to reproduce the exact steps. It has happened to me very recently while working with hooks, some of which timed out and some of which failed. I will try to make the multiple DEPLOYED revisions reproducible as well, but this is perhaps another issue; it is at least another symptom.

@consideRatio consideRatio changed the title helm upgrade > timeout on pre-upgrade hook > revision stuck in PENDING_UPGRADE helm upgrade > timeout on pre-upgrade hook > revision stuck in PENDING_UPGRADE and multiple DEPLOYED revisions arise soon Aug 29, 2018
@consideRatio
Contributor Author

By running the same helm upgrade again, I ended up with multiple revisions marked DEPLOYED.

My Helm hooks are mostly a DaemonSet that pulls images via init containers, with a pause image used for the main container. I also have a Job that waits for the DaemonSet to reach the desired number of ready pods. All hooks have the following annotations.

  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
    "helm.sh/hook-weight": "0"

I think that Tiller is fooled by the before-hook-creation delete policy. Something like this...

  1. An upgrade is made and gets stuck in the pre-upgrade hook for some reason. This upgrade remains in PENDING_UPGRADE whether or not the helm client times out.
  2. A new upgrade is made; the hook resources are removed, and Tiller believes the new hooks actually finished when they were deleted, so it progresses towards the actual upgrade.

An indication that this is at least somewhat correct is that the upgrade was considered to have succeeded in the end, when it really should not and could not have: I had asked the pre-upgrade hooks to pull images that would never be found, yet the actual upgrade went ahead even though those images did not exist. Somehow, Tiller was fooled into believing the hooks completed successfully!

@consideRatio
Contributor Author

consideRatio commented Aug 29, 2018

How to get an undeployable Deployment deployed

I tried to create a minimal reproduction and ended up with something slightly different, but I bet it is related.

The following chart's Deployment should never be deployed, right? It has a pre-upgrade hook that should keep running for eternity. Yet it will be deployed if you run two upgrades in succession while a hook resource with the same name already exists and is about to terminate.

Chart.yaml:

apiVersion: v1
appVersion: "1.0"
description: A Helm chart for Kubernetes
name: issue-4558
version: 0.1.0

templates/deployment.yaml:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: never-to-be-installed-deployment
spec:
  selector:
    matchLabels:
      dummy: dummy
  template:
    metadata:
      labels:
        dummy: dummy
    spec:
      containers:
        - name: never-to-be-installed-deployment
          image: "gcr.io/google_containers/pause:3.1"

templates/job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: never-finishing-job
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: never-finishing-job
          image: "gcr.io/google_containers/pause:3.1"

Reproduction commands:

helm upgrade issue . --install --namespace issue
# abort
helm upgrade issue . --install --namespace issue

(screenshot: messy-helm-upgrading)

@andytom
Contributor

andytom commented Nov 27, 2018

I have noticed the same issue, but we are not using before-hook-creation. Our releases still get stuck, but it doesn't seem to create duplicate deployed releases.

@rodriguez-facundo

Same:

➜ helm history jupyterhub
REVISION	UPDATED                 	STATUS         	CHART                 	DESCRIPTION
1       	Mon Aug 12 21:29:28 2019	DEPLOYED       	jupyterhub-0.8-ff69a77	Install complete
2       	Mon Aug 12 22:04:16 2019	PENDING_UPGRADE	jupyterhub-0.8-ff69a77	Preparing upgrade

@Adhira-Deogade

Is there a workaround for this? Is upgrading to helm3 a solution?

@bacongobbler bacongobbler added bug Categorizes issue or PR as related to a bug. and removed question/support labels Jun 23, 2020
@naseemkullah
Contributor

Is there a workaround for this? Is upgrading to helm3 a solution?

I've just run into this issue and worked around it by performing a helm rollback to a previous release as follows:

problem:

26      	Mon Jun 15 14:13:24 2020	superseded     	elasticsearch-7.5.1	7.5.1      	Upgrade complete
27      	Mon Jun 15 17:52:09 2020	pending-upgrade	elasticsearch-7.5.1	7.5.1      	Preparing upgrade

fix:

$ helm rollback elasticsearch-release 26
Rollback was a success! Happy Helming!
$ helm history elasticsearch-release
26      	Mon Jun 15 14:13:24 2020	superseded     	elasticsearch-7.5.1	7.5.1      	Upgrade complete
27      	Mon Jun 15 17:52:09 2020	pending-upgrade	elasticsearch-7.5.1	7.5.1      	Preparing upgrade
28      	Tue Jun 23 14:51:11 2020	deployed       	elasticsearch-7.5.1	7.5.1      	Rollback to 26

@mitchellmaler

We are running into this same issue with Helm 3. The pipeline gets canceled and the Helm operation is stuck in pending-upgrade. The rollback workaround does work, but it isn't great for an automated pipeline unless we add a check beforehand that rolls back before deploying (a sketch of such a check follows below).

Is there any way to just bypass the pending-upgrade status on a new deploy without running a rollback?
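For illustration, a minimal sketch of such a "roll back if stuck, then deploy" guard for a pipeline step, assuming bash with jq available and that the release already exists; RELEASE, NAMESPACE, and ./chart are placeholders rather than values from this thread:

#!/usr/bin/env bash
# Hypothetical pre-deploy guard: if the latest revision of the release is stuck
# in a pending state, roll back to the last completed revision before upgrading.
set -euo pipefail

RELEASE=my-release        # placeholder
NAMESPACE=my-namespace    # placeholder

# helm history lists revisions oldest to newest; take the status of the newest one.
LATEST_STATUS=$(helm history "$RELEASE" -n "$NAMESPACE" -o json | jq -r '.[-1].status')

if [[ "$LATEST_STATUS" == pending-* ]]; then
  # Last revision that actually completed (currently deployed or later superseded).
  # Assumes at least one such revision exists.
  LAST_GOOD=$(helm history "$RELEASE" -n "$NAMESPACE" -o json \
    | jq -r '[.[] | select(.status == "deployed" or .status == "superseded")] | .[-1].revision')
  echo "Latest revision is $LATEST_STATUS; rolling back to revision $LAST_GOOD first"
  helm rollback "$RELEASE" "$LAST_GOOD" -n "$NAMESPACE" --wait
fi

helm upgrade --install "$RELEASE" ./chart -n "$NAMESPACE"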

@tamasbege

We are running Helm 3.4.1 and run into the same issue from time to time. Worth mentioning that the previous version, 3.3.x, had no such trouble with the deployments...
Can someone from the Helm team take a look at this and give an update?

@tamasbege

tamasbege commented Dec 10, 2020

See also #8987 and #7476

@bcouetil

Same problem, coming here searching for a reason/fix 👍

@Gumi22

Gumi22 commented Dec 31, 2020

We also have the same problem.
Our automated pipeline terminates the upgrade if it takes more than X minutes.
Using a short --timeout value is only a partial workaround, because the Helm chart deploys multiple k8s resources and any one of them could potentially fail (even though it's mainly a Deployment).
#9180 would be a perfect fix in my opinion.

@brainbug89

brainbug89 commented Mar 18, 2021

Same problem here. Our pipeline gets canceled when a newer version is running, and afterwards we can't deploy anymore because of Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress.

@klose4711

We have the same problem in our GitLab pipelines. The workaround (running rollback) is not a good solution for prod CI/CD pipelines.

@Constantin07

Constantin07 commented Apr 1, 2021

Also ran into the same issue:

Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
make: *** [Makefile:131: deploy] Error 1

$ helm version
version.BuildInfo{Version:"v3.5.3", GitCommit:"041ce5a2c17a58be0fcd5f5e16fb3e7e95fea622", GitTreeState:"dirty", GoVersion:"go1.16"}

@vharitonsky

The issue is reproducible on both Helm 2 and Helm 3.

@jpiper

jpiper commented Apr 17, 2021

This happened to me when I sent SIGTERM to an upgrade. I solved it by deleting the Helm secret associated with the release, e.g.

$ k get secrets
NAME                                 TYPE                                  DATA   AGE
sh.helm.release.v1.app.v1            helm.sh/release.v1                    1      366d
sh.helm.release.v1.app.v2            helm.sh/release.v1                    1      331d
sh.helm.release.v1.app.v3            helm.sh/release.v1                    1      247d
sh.helm.release.v1.app.v4            helm.sh/release.v1                    1      77d
sh.helm.release.v1.app.v5            helm.sh/release.v1                    1      77d
sh.helm.release.v1.app.v6            helm.sh/release.v1                    1      15m
sh.helm.release.v1.app.v7            helm.sh/release.v1                    1      66s

$ k delete secret sh.helm.release.v1.app.v7
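# note: this removes the record of the stuck v7 revision, so Helm should again
# see v6 (still marked deployed) as the latest revision of the release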

@klose4711

Any updates on this issue?

@klose4711

Can someone else confirm it's "fixed" in v3.8.0? (I can't test it myself atm)

@goenning

It didn't fix it for me; I cancelled a deployment using v3.8.2 and it still got stuck in pending-upgrade.

@mr-yaky

mr-yaky commented May 30, 2022

Yes, this issue still exists with the new version. We got the same with v3.8.2:
version.BuildInfo{Version:"v3.8.2", GitCommit:"6e3701edea09e5d55a8ca2aae03a68917630e91b", GitTreeState:"clean", GoVersion:"go1.17.5"}

@Moser-ss
Contributor

Moser-ss commented Aug 1, 2022

@goenning

It didn't fix it for me; I cancelled a deployment using v3.8.2 and it still got stuck in pending-upgrade.

Could you tell me how you cancelled your deployment? With Ctrl+C?

@Shahard2

Any solution for this?

@matti

matti commented Aug 18, 2022

@der-ali

der-ali commented Sep 27, 2022

+1

@joejulian joejulian added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Sep 28, 2022
@joejulian
Contributor

Happy to have a contribution to address this. It should probably start with a HIP; also take a look at the contributing doc.

starlingx-github pushed a commit to starlingx/config that referenced this issue Nov 24, 2022
It is observed that when a helm release is in pending state, another
helm release can't be started by FluxCD. FluxCD will not try to
do steps to apply the newer helm release, but will just error.

This prevents us from applying a new helm release over a release with
pods stuck in Pending state (just an example).

When the specific message for helm operation in progress is detected,
attempt to recover by moving the older releases to failed state.
Move inspired by [1].
To do so, patch the helm secret for the specific release.
As an optimization, trigger the FluxCD HelmRelease reconciliation right
after.
One future optimization we can do is run an audit to delete the helm
releases for which metadata status is a pending operation, but release
data is failed (resource that we patched in this commit).

Refactor HelmRelease resource reconciliation trigger, smaller size.

There are upstream references related to this bug, see [2] and [3].

Tests on Debian AIO-SX:
PASS: unlocked enabled available
PASS: platform-integ-apps applied
after reproducing error:
PASS: inspect sysinv logs, see recovery is attempted
PASS: inspect fluxcd logs, see that HelmRelease reconciliation is
triggered part of recovery

[1]: https://github.com/porter-dev/porter/pull/1685/files
[2]: helm/helm#8987
[3]: helm/helm#4558
Closes-Bug: 1997368
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I36116ce8d298cc97194062b75db64541661ce84d
@ksemele

ksemele commented Dec 8, 2022

It's similar to #8987, right?
If anyone is working on this issue right now, feel free to ask for help here in this thread.

@Dentrax

Dentrax commented Dec 23, 2022

Hit the same issue, but stuck in pending-install. All pods are up and running:

helm history metrics-server
REVISION  UPDATED                   STATUS           CHART                 APP VERSION  DESCRIPTION
1         Fri Dec 16 08:46:31 2022  pending-install  metrics-server-3.8.2  0.6.1        Initial install underway

metrics-server-575f98446c-b7lpq                             1/1     Running   0          6d22h
metrics-server-575f98446c-blmqm                             1/1     Running   0          6d22h

Helm: v3.8.0

Any ideas on this?

@willzhang

willzhang commented Jan 8, 2023

Same with Helm v3.10.3 when installing ingress-nginx.

@kladiv

kladiv commented Mar 20, 2023

1+ really required

@Kalin-the-Builder

2+ really required

@troyanskiy

troyanskiy commented Apr 20, 2023

I'm stuck here. I can't do any helm install.

Update: a cluster restart solved my issue.

@bcouetil

Use this before any upgrade/install (maybe already posted in this issue) :

# delete failed previous deployment if any (that would else require a helm delete)
kubectl -n $NAMESPACE delete secret -l "name=$APP_NAME,status in (pending-install, pending-upgrade)" || true

I initially found it here : #5595 (comment)

@TonyMcTony

Use this before any upgrade/install (maybe already posted in this issue) :

# delete failed previous deployment if any (that would else require a helm delete)
kubectl -n $NAMESPACE delete secret -l "name=$APP_NAME,status in (pending-install, pending-upgrade)" || true

I initially found it here : #5595 (comment)

Brilliant, this is exactly what I was looking for.

@peteroneilljr

I'm hitting this bug on helm version 3.12

helm history --namespace ingress-nginx ingress-nginx
REVISION	UPDATED                 	STATUS         	CHART              	APP VERSION	DESCRIPTION
1       	Wed Jun 28 20:27:41 2023	failed         	ingress-nginx-4.6.0	1.7.0      	Release "ingress-nginx" failed: context deadline exceeded
2       	Wed Jun 28 21:15:51 2023	pending-upgrade	ingress-nginx-4.6.0	1.7.0      	Preparing upgrade


helm version
version.BuildInfo{Version:"v3.12.0", GitCommit:"c9f554d75773799f72ceef38c51210f1842a1dea", GitTreeState:"clean", GoVersion:"go1.20.4"}

@amannijhawan
Contributor

Use this before any upgrade/install (maybe already posted in this issue) :

# delete failed previous deployment if any (that would else require a helm delete)
kubectl -n $NAMESPACE delete secret -l "name=$APP_NAME,status in (pending-install, pending-upgrade)" || true

I initially found it here : #5595 (comment)

Brilliant, this is exactly what I was looking for.

The problem is that if there is ever a race condition in the code that causes multiple Helm deployments to go through at the same time, this technique can cause corruption.

@bcouetil

If you deploy the same module to production multiple times at the exact same time, you have bigger problems than this one, my friend. For other environments, just deploy again.

Before avoiding problems that occur once in a million, there are other everyday problems to solve, generally speaking 😉

@vijaySamanuri

If you know the previous timeout, it is better to delete only the secrets that are pending and have exceeded that timeout, to avoid an unnecessary race condition.

Something like:

kubectl -n $NAMESPACE get secrets \
    -o=custom-columns=NAME:.metadata.name,AGE:.metadata.creationTimestamp \
    --no-headers=true --sort-by=.metadata.creationTimestamp \
    -l "name=${RELEASE_NAME},status in (pending-install, pending-upgrade)" |
  awk '$2 < "'`date -d "${TIMEOUT_IN_MINS} minutes ago" -Ins --utc | sed 's/+0000/Z/'`'" { print $1 }' |
  xargs --no-run-if-empty kubectl delete secret -n $NAMESPACE

provided the clocks are in sync


This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

@ksemele

ksemele commented Jan 14, 2024

Please reopen this issue. It's not resolved.

@andrewe123

andrewe123 commented Jan 29, 2024

I agree that there's probably not much Helm can do to handle this in all occurrences, but there should at least be some acknowledgement of the issue and some clear guidance on how to resolve it when it occurs.

The only workarounds I've seen from reading through the issues pages are:

  1. Rollback
    This isn't suitable in production environments since the deployment may have been successful (which seems to be the case most of the time this occurs); if so, this will cause a disruptive rollback of the application.

  2. Delete the pending status by deleting the k8s secret.
    This appears to be a hacky workaround that could cause issues, since Helm is then out of sync with the actual deployment state, which could lead to side effects on the next three-way-merge deployment.

  3. Update the status to failed or successful manually
    This appears to be the only suitable workaround. It keeps Helm in sync with the real deployment status. However, it requires knowing what to look for to confirm which status to set (see the sketch after this comment).

2 or 3 is where I think some official guidance could be provided.

Also, while Helm might not be able to prevent the issue, since it is common and very disruptive, could Helm provide some tool to correctly resync a stuck pending status?
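For what it's worth, a rough, hypothetical sketch of what option 3 can look like today, assuming Helm 3's current storage format (the release record is JSON, gzipped and base64-encoded inside the release Secret, and the Kubernetes API base64-encodes the Secret data once more). This pokes at an internal format rather than a supported interface, so treat it as a last resort; the namespace and secret name below are placeholders:

NAMESPACE=my-namespace                      # placeholder
SECRET=sh.helm.release.v1.my-release.v27    # placeholder: the stuck revision's secret

# Decode the release record, flip its status to "failed", and re-encode it.
PATCHED=$(kubectl -n "$NAMESPACE" get secret "$SECRET" -o jsonpath='{.data.release}' \
  | base64 -d | base64 -d | gunzip \
  | jq -c '.info.status = "failed"' \
  | gzip -c | base64 -w0 | base64 -w0)

kubectl -n "$NAMESPACE" patch secret "$SECRET" \
  --type merge -p "{\"data\":{\"release\":\"$PATCHED\"}}"

# The Secret also carries a status label (used by the kubectl -l selectors shown
# earlier in this thread); keep it consistent with the patched release data.
kubectl -n "$NAMESPACE" label secret "$SECRET" status=failed --overwrite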
