
helm upgrade > timeout on pre-upgrade hook > revision stuck in PENDING_UPGRADE and multiple DEPLOYED revisions arise soon #4558

Closed
consideRatio opened this issue Aug 29, 2018 · 62 comments
Labels
bug Categorizes issue or PR as related to a bug. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. Stale

Comments

@consideRatio
Contributor

Reproduction and symptom

  1. helm upgrade with a helm pre-upgrade hook that times out.
  2. Error: UPGRADE FAILED: timed out waiting for the condition.
  3. helm history my-release-name
    # the last line...
    22      	Wed Aug 29 17:59:48 2018	PENDING_UPGRADE	jupyterhub-0.7-04ccf1a 	Preparing upgrade

Expected outcome

The revision should end up as FAILED rather than PENDING_UPGRADE, right?

@consideRatio
Contributor Author

consideRatio commented Aug 29, 2018

This may also have led directly to multiple revisions being considered DEPLOYED, but I have failed to reproduce the exact steps. It has happened to me very recently while working with hooks, some of which timed out and some of which failed. I will try to make the multiple DEPLOYED revisions reproducible as well, but this is perhaps another issue; it is at least another symptom.

@consideRatio consideRatio changed the title helm upgrade > timeout on pre-upgrade hook > revision stuck in PENDING_UPGRADE helm upgrade > timeout on pre-upgrade hook > revision stuck in PENDING_UPGRADE and multiple DEPLOYED revisions arise soon Aug 29, 2018
@consideRatio
Contributor Author

By running the same helm upgrade again, I ended up with multiple revisions marked DEPLOYED.

My Helm hooks are mostly a DaemonSet that pulls images via init containers, with a pause image used for the main container. I also have a Job that waits for the DaemonSet to reach the desired number of ready pods. All hooks have the following annotations.

  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
    "helm.sh/hook-weight": "0"

I think that Tiller is fooled by the before-hook-creation delete policy. Something like this...

  1. An upgrade is made and gets stuck in the pre-upgrade hook for some reason. This upgrade remains in PENDING_UPGRADE whether or not the helm client times out.
  2. A new upgrade is made; the hook resources are removed, and Tiller believes the new hooks actually finished when they were deleted, so it progresses towards the actual upgrade.

An indication that this is at least somewhat correct is that the upgrade was considered to have succeeded in the end, when it really should not and could not have: I had asked the pre-upgrade hooks to pull images that would never be found, yet the actual upgrade went ahead even though those images did not exist. Somehow, Tiller was fooled into believing the hooks completed successfully!

@consideRatio
Contributor Author

consideRatio commented Aug 29, 2018

How to get an undeployable Deployment deployed

I tried to create a minimal reproduction and ended up with something slightly different, but I bet it is related.

The following chart's Deployment should never be deployed, right? It has a pre-upgrade hook that should keep running for eternity. Yet it will be deployed if you run two upgrades in succession while a hook resource with the same name already exists and is about to terminate.

Chart.yaml:

apiVersion: v1
appVersion: "1.0"
description: A Helm chart for Kubernetes
name: issue-4558
version: 0.1.0

templates/deployment.yaml:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: never-to-be-installed-deployment
spec:
  selector:
    matchLabels:
      dummy: dummy
  template:
    metadata:
      labels:
        dummy: dummy
    spec:
      containers:
        - name: never-to-be-installed-deployment
          image: "gcr.io/google_containers/pause:3.1"

templates/job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: never-finishing-job
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: never-finishing-job
          image: "gcr.io/google_containers/pause:3.1"

Reproduction commands:

helm upgrade issue . --install --namespace issue
# abort
helm upgrade issue . --install --namespace issue

(screenshot: messy-helm-upgrading)

@andytom
Contributor

andytom commented Nov 27, 2018

I have noticed the same issue, but we are not using before-hook-creation. Our releases still get stuck, but it doesn't seem to create duplicate deployed releases.

@rodriguez-facundo

Same:

➜ helm history jupyterhub
REVISION	UPDATED                 	STATUS         	CHART                 	DESCRIPTION
1       	Mon Aug 12 21:29:28 2019	DEPLOYED       	jupyterhub-0.8-ff69a77	Install complete
2       	Mon Aug 12 22:04:16 2019	PENDING_UPGRADE	jupyterhub-0.8-ff69a77	Preparing upgrade

@Adhira-Deogade

Is there a workaround for this? Is upgrading to helm3 a solution?

@bacongobbler bacongobbler added bug Categorizes issue or PR as related to a bug. and removed question/support labels Jun 23, 2020
@naseemkullah
Contributor

Is there a workaround for this? Is upgrading to helm3 a solution?

I've just run into this issue and worked around it by performing a helm rollback to a previous release as follows:

problem:

26      	Mon Jun 15 14:13:24 2020	superseded     	elasticsearch-7.5.1	7.5.1      	Upgrade complete
27      	Mon Jun 15 17:52:09 2020	pending-upgrade	elasticsearch-7.5.1	7.5.1      	Preparing upgrade

fix:

$ helm rollback elasticsearch-release 26
Rollback was a success! Happy Helming!
$ helm history elasticsearch-release
26      	Mon Jun 15 14:13:24 2020	superseded     	elasticsearch-7.5.1	7.5.1      	Upgrade complete
27      	Mon Jun 15 17:52:09 2020	pending-upgrade	elasticsearch-7.5.1	7.5.1      	Preparing upgrade
28      	Tue Jun 23 14:51:11 2020	deployed       	elasticsearch-7.5.1	7.5.1      	Rollback to 26

@mitchellmaler

We are running into this same issue with Helm 3. The pipeline gets canceled and the Helm operation is stuck in pending-upgrade. The rollback workaround does work, but it isn't great for an automated pipeline unless we add a check beforehand that rolls back before deploying (a sketch of such a check follows below).

Is there any way to just bypass the pending-upgrade status on a new deploy without running a rollback?
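For illustration, a minimal sketch of such a "roll back if stuck, then deploy" guard for a pipeline step, assuming bash with jq available and that the release already exists; RELEASE, NAMESPACE, and ./chart are placeholders rather than values from this thread:

#!/usr/bin/env bash
# Hypothetical pre-deploy guard: if the latest revision of the release is stuck
# in a pending state, roll back to the last completed revision before upgrading.
set -euo pipefail

RELEASE=my-release        # placeholder
NAMESPACE=my-namespace    # placeholder

# helm history lists revisions oldest to newest; take the status of the newest one.
LATEST_STATUS=$(helm history "$RELEASE" -n "$NAMESPACE" -o json | jq -r '.[-1].status')

if [[ "$LATEST_STATUS" == pending-* ]]; then
  # Last revision that actually completed (currently deployed or later superseded).
  # Assumes at least one such revision exists.
  LAST_GOOD=$(helm history "$RELEASE" -n "$NAMESPACE" -o json \
    | jq -r '[.[] | select(.status == "deployed" or .status == "superseded")] | .[-1].revision')
  echo "Latest revision is $LATEST_STATUS; rolling back to revision $LAST_GOOD first"
  helm rollback "$RELEASE" "$LAST_GOOD" -n "$NAMESPACE" --wait
fi

helm upgrade --install "$RELEASE" ./chart -n "$NAMESPACE"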

@tamasbege

We are running Helm 3.4.1 and run into the same issue from time to time. Worth mentioning that the previous version, 3.3.x, had no such trouble with the deployments...
Can someone from the Helm team take a look at this and give an update?

@tamasbege

tamasbege commented Dec 10, 2020

See also #8987 and #7476

@bcouetil

Same problem, coming here searching for a reason/fix 👍

@Gumi22

Gumi22 commented Dec 31, 2020

We also have the same problem.
Our automated pipeline terminates the upgrade if it takes more than X minutes.
Using a short --timeout value is only a partial workaround, because the Helm chart deploys multiple k8s resources and any one of them could potentially fail (even though it's mainly a Deployment).
#9180 would be a perfect fix in my opinion.

@brainbug89

brainbug89 commented Mar 18, 2021

Same problem here. Our pipeline gets canceled when a newer version is running, and afterwards we can't deploy anymore because of Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress.

@klose4711

We have the same problem in our GitLab pipelines. The workaround (running rollback) is not a good solution for prod CI/CD pipelines.

@Constantin07

Constantin07 commented Apr 1, 2021

Also ran into the same issue:

Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
make: *** [Makefile:131: deploy] Error 1

$ helm version
version.BuildInfo{Version:"v3.5.3", GitCommit:"041ce5a2c17a58be0fcd5f5e16fb3e7e95fea622", GitTreeState:"dirty", GoVersion:"go1.16"}

@vharitonsky

The issue is reproducible on both Helm 2 and Helm 3.

@jpiper

jpiper commented Apr 17, 2021

This happened to me when I sent SIGTERM to an upgrade. I solved it by deleting the Helm secret associated with the release, e.g.

$ k get secrets
NAME                                 TYPE                                  DATA   AGE
sh.helm.release.v1.app.v1            helm.sh/release.v1                    1      366d
sh.helm.release.v1.app.v2            helm.sh/release.v1                    1      331d
sh.helm.release.v1.app.v3            helm.sh/release.v1                    1      247d
sh.helm.release.v1.app.v4            helm.sh/release.v1                    1      77d
sh.helm.release.v1.app.v5            helm.sh/release.v1                    1      77d
sh.helm.release.v1.app.v6            helm.sh/release.v1                    1      15m
sh.helm.release.v1.app.v7            helm.sh/release.v1                    1      66s

$ k delete secret sh.helm.release.v1.app.v7
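# note: this removes the record of the stuck v7 revision, so Helm should again
# see v6 (still marked deployed) as the latest revision of the release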

@klose4711

Any updates on this issue?

@klose4711

Can someone else confirm it's "fixed" in v3.8.0? (I can't test it myself atm)

@goenning

It didn't fix it for me; I cancelled a deployment using v3.8.2 and it still got stuck in pending-upgrade.

@mr-yaky

mr-yaky commented May 30, 2022

Yes, this issue still exists with the new version. We got the same with v3.8.2:
version.BuildInfo{Version:"v3.8.2", GitCommit:"6e3701edea09e5d55a8ca2aae03a68917630e91b", GitTreeState:"clean", GoVersion:"go1.17.5"}

@Moser-ss
Contributor

Moser-ss commented Aug 1, 2022

@goenning

It didn't fix it for me; I cancelled a deployment using v3.8.2 and it still got stuck in pending-upgrade.

Could you tell me how you cancelled your deployment? With Ctrl+C?

@Shahard2

Any solution for this?

@matti

matti commented Aug 18, 2022

@der-ali

der-ali commented Sep 27, 2022

+1

@joejulian joejulian added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Sep 28, 2022
@joejulian
Contributor

Happy to have a contribution to address this. It should probably start with a HIP; also take a look at the contributing doc.

starlingx-github pushed a commit to starlingx/config that referenced this issue Nov 24, 2022
It is observed that when a helm release is in pending state, another
helm release can't be started by FluxCD. FluxCD will not try to
do steps to apply the newer helm release, but will just error.

This prevents us from applying a new helm release over a release with
pods stuck in Pending state (just an example).

When the specific message for helm operation in progress is detected,
attempt to recover by moving the older releases to failed state.
Move inspired by [1].
To do so, patch the helm secret for the specific release.
As an optimization, trigger the FluxCD HelmRelease reconciliation right
after.
One future optimization we can do is run an audit to delete the helm
releases for which metadata status is a pending operation, but release
data is failed (resource that we patched in this commit).

Refactor HelmRelease resource reconciliation trigger, smaller size.

There are upstream references related to this bug, see [2] and [3].

Tests on Debian AIO-SX:
PASS: unlocked enabled available
PASS: platform-integ-apps applied
after reproducing error:
PASS: inspect sysinv logs, see recovery is attempted
PASS: inspect fluxcd logs, see that HelmRelease reconciliation is
triggered part of recovery

[1]: https://github.com/porter-dev/porter/pull/1685/files
[2]: helm/helm#8987
[3]: helm/helm#4558
Closes-Bug: 1997368
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I36116ce8d298cc97194062b75db64541661ce84d
@ksemele

ksemele commented Dec 8, 2022

It's similar to #8987, right?
If anyone is working on this issue right now, feel free to ask for help here in this thread.

@Dentrax

Dentrax commented Dec 23, 2022

Hit the same issue, but stuck in pending-install. All pods are up and running:

helm history metrics-server
REVISION  UPDATED                   STATUS           CHART                 APP VERSION  DESCRIPTION
1         Fri Dec 16 08:46:31 2022  pending-install  metrics-server-3.8.2  0.6.1        Initial install underway

metrics-server-575f98446c-b7lpq                             1/1     Running   0          6d22h
metrics-server-575f98446c-blmqm                             1/1     Running   0          6d22h

Helm: v3.8.0

Any ideas on this?

@willzhang

willzhang commented Jan 8, 2023

Same with Helm v3.10.3 when installing ingress-nginx.

@kladiv

kladiv commented Mar 20, 2023

1+ really required

@Kalin-the-Builder

2+ really required

@troyanskiy

troyanskiy commented Apr 20, 2023

I'm stuck here. I can't do any helm install.

Update: a cluster restart solved my issue.

@bcouetil

Use this before any upgrade/install (maybe already posted in this issue) :

# delete failed previous deployment if any (that would else require a helm delete)
kubectl -n $NAMESPACE delete secret -l "name=$APP_NAME,status in (pending-install, pending-upgrade)" || true

I initially found it here : #5595 (comment)

@TonyMcTony

Use this before any upgrade/install (maybe already posted in this issue) :

# delete failed previous deployment if any (that would else require a helm delete)
kubectl -n $NAMESPACE delete secret -l "name=$APP_NAME,status in (pending-install, pending-upgrade)" || true

I initially found it here : #5595 (comment)

Brilliant, this is exactly what I was looking for.

@peteroneilljr

I'm hitting this bug on helm version 3.12

helm history --namespace ingress-nginx ingress-nginx
REVISION	UPDATED                 	STATUS         	CHART              	APP VERSION	DESCRIPTION
1       	Wed Jun 28 20:27:41 2023	failed         	ingress-nginx-4.6.0	1.7.0      	Release "ingress-nginx" failed: context deadline exceeded
2       	Wed Jun 28 21:15:51 2023	pending-upgrade	ingress-nginx-4.6.0	1.7.0      	Preparing upgrade


helm version
version.BuildInfo{Version:"v3.12.0", GitCommit:"c9f554d75773799f72ceef38c51210f1842a1dea", GitTreeState:"clean", GoVersion:"go1.20.4"}

@amannijhawan
Contributor

Use this before any upgrade/install (maybe already posted in this issue) :

# delete failed previous deployment if any (that would else require a helm delete)
kubectl -n $NAMESPACE delete secret -l "name=$APP_NAME,status in (pending-install, pending-upgrade)" || true

I initially found it here : #5595 (comment)

Brilliant, this is exactly what I was looking for.

The problem is that if there is ever a race condition in the code that causes multiple Helm deployments to go through at the same time, this technique can cause corruption.

@bcouetil

If you deploy the same module to production multiple times at the exact same time, you have bigger problems than this one, my friend. For other environments, just deploy again.

Before avoiding problems that occur once in a million, there are other everyday problems to solve, generally speaking 😉

@vijaySamanuri

If you know the previous timeout, it is better to delete only the secrets that are pending and have exceeded that timeout, to avoid an unnecessary race condition.

Something like:

kubectl -n $NAMESPACE get secrets \
    -o=custom-columns=NAME:.metadata.name,AGE:.metadata.creationTimestamp \
    --no-headers=true --sort-by=.metadata.creationTimestamp \
    -l "name=${RELEASE_NAME},status in (pending-install, pending-upgrade)" |
  awk '$2 < "'`date -d "${TIMEOUT_IN_MINS} minutes ago" -Ins --utc | sed 's/+0000/Z/'`'" { print $1 }' |
  xargs --no-run-if-empty kubectl delete secret -n $NAMESPACE

provided the clocks are in sync


This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

@ksemele

ksemele commented Jan 14, 2024

Please reopen this issue. It's not resolved.

@andrewe123

andrewe123 commented Jan 29, 2024

I agree that there's probably not much Helm can do to handle this in all occurrences, but there should at least be some acknowledgement of the issue and some clear guidance on how to resolve it when it occurs.

The only workarounds I've seen from reading through the issues pages are:

  1. Rollback
    This isn't suitable in production environments since the deployment may have been successful (which seems to be the case most of the time this occurs); if so, this will cause a disruptive rollback of the application.

  2. Delete the pending status by deleting the k8s secret.
    This appears to be a hacky workaround that could cause issues, since Helm is then out of sync with the actual deployment state, which could lead to side effects on the next three-way-merge deployment.

  3. Update the status to failed or successful manually
    This appears to be the only suitable workaround. It keeps Helm in sync with the real deployment status. However, it requires knowing what to look for to confirm which status to set (see the sketch after this comment).

2 or 3 is where I think some official guidance could be provided.

Also, while Helm might not be able to prevent the issue, since it is common and very disruptive, could Helm provide some tool to correctly resync a stuck pending status?
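For what it's worth, a rough, hypothetical sketch of what option 3 can look like today, assuming Helm 3's current storage format (the release record is JSON, gzipped and base64-encoded inside the release Secret, and the Kubernetes API base64-encodes the Secret data once more). This pokes at an internal format rather than a supported interface, so treat it as a last resort; the namespace and secret name below are placeholders:

NAMESPACE=my-namespace                      # placeholder
SECRET=sh.helm.release.v1.my-release.v27    # placeholder: the stuck revision's secret

# Decode the release record, flip its status to "failed", and re-encode it.
PATCHED=$(kubectl -n "$NAMESPACE" get secret "$SECRET" -o jsonpath='{.data.release}' \
  | base64 -d | base64 -d | gunzip \
  | jq -c '.info.status = "failed"' \
  | gzip -c | base64 -w0 | base64 -w0)

kubectl -n "$NAMESPACE" patch secret "$SECRET" \
  --type merge -p "{\"data\":{\"release\":\"$PATCHED\"}}"

# The Secret also carries a status label (used by the kubectl -l selectors shown
# earlier in this thread); keep it consistent with the patched release data.
kubectl -n "$NAMESPACE" label secret "$SECRET" status=failed --overwrite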
