
Install --wait should actually wait for extension/v1beta1 Deployments/PODs to be ready #4210

Closed
bgehman opened this issue Jun 12, 2018 · 11 comments
Labels
bug Categorizes issue or PR as related to a bug.

Comments

@bgehman

bgehman commented Jun 12, 2018

The helm CLI has --wait argument for install:

      --wait                     if set, will wait until all Pods, PVCs, Services, and minimum number 
of Pods of a Deployment are in a ready state before marking the release as successful. 
It will wait for as long as --timeout

As can be seen in testing, it only "waits" on Jobs. Pods are not waited on until they are "in a ready state" as the CLI help describes. Helm acts fire-and-forget: it does not wait for the Deployment-created Pods to become ready (per their readiness probes).

The published documentation contradicts the helm CLI usage help, and accurately documents the observed behavior: https://docs.helm.sh/developing_charts/

What does it mean to wait until a hook is ready? This depends on the resource declared in the hook. If the resource is a Job kind, Tiller will wait until the job successfully runs to completion. And if the job fails, the release will fail. This is a blocking operation, so the Helm client will pause while the Job is run.
For all other kinds, as soon as Kubernetes marks the resource as loaded (added or updated), the resource is considered “Ready”.
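The two behaviors the docs describe can be paraphrased as follows (a Python sketch of the documented rules, not Tiller's actual code; the function name is illustrative):

```python
def hook_wait_strategy(kind: str) -> str:
    # Per the Helm docs quoted above: only Job hooks block until the
    # Job runs to completion (and fail the release if the Job fails);
    # every other kind counts as "Ready" as soon as Kubernetes marks
    # the resource as loaded (added or updated).
    return "block_until_complete" if kind == "Job" else "ready_on_load"

assert hook_wait_strategy("Job") == "block_until_complete"
assert hook_wait_strategy("Deployment") == "ready_on_load"
```

A Deployment is therefore considered "Ready" immediately, which matches the fire-and-forget behavior reported above.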

Output from my install --wait:

RESOURCES:
==> v1/Pod(related)
NAME                                            READY  STATUS     RESTARTS  AGE
cb-audit-service-6786f97c9c-m5nl9               1/1    Running    0         1m
cb-budget-adapter-5b858f6f4c-82drj              1/1    Running    0         1m
cb-cam-diagnostics-5f95875d77-f6cg5             1/1    Running    0         1m
cb-core-auth-764d54bd8c-tpxhk                   0/1    Running    0         1m
cb-core-authorization-service-6c685886d8-rh7lw  1/1    Running    0         1m
cb-core-configuration-service-7878c4d5d5-v8pv5  1/1    Running    0         1m
cb-cred-svc-648d98b98b-szrpm                    1/1    Running    0         1m
cb-help-service-5b8f486765-gmpdt                1/1    Running    0         1m
cb-notification-api-786975c6f9-tg554            1/1    Running    0         1m
cb-notification-api-786975c6f9-xslqt            1/1    Running    0         1m
cb-vault-api-775f76bb86-vz4m5                   1/1    Running    0         1m
cb-vault-hvault-54d694d859-ch7r4                1/1    Running    0         1m
couchdb-58b4b9fb98-7jt2b                        1/1    Running    0         1m
mongodb-787bfb84-jqcvr                          1/1    Running    0         1m
cb-mongodb-migrator-xgvwv                       0/1    Completed  0         1m
core-cred-svc-v2-migrator-8b62r                 0/1    Completed  0         1m

Note: The output clearly shows my cb-core-auth Deployment pod as not ready (0/1). The behavior appears to only wait for the Jobs to complete.

Output of helm version:

Client: v2.9.1+g20adb27
Server: v2.9.1+g20adb27

Output of kubectl version:

Client Version: v1.9.6
Server Version: v1.9.6

Cloud Provider/Platform (AKS, GKE, Minikube etc.):
docker-edge (on MacOS)

@helgi
Contributor

helgi commented Jun 20, 2018

@bacongobbler I remember back in Deis (v2 anyway) we did put in a lot more structure and "intelligence" (I'm kinda cringing thinking back at some of the code) to define proper ready (healthcheck, etc) and also pull out errors (pull secrets missing, etc).

Can we lift that code and move it into helm? It was working directly against the API, and now, with more advances in how Kubernetes ties things together, a lot of that logic consolidates anyway.

Thoughts?

@bacongobbler
Member

bacongobbler commented Jun 20, 2018

Yes, that could be a good solution, assuming the assumptions we made back when deployments were in extensions/v1beta1 still hold true. Before we go rip out existing code though, let's try to reproduce the error 😄

I found I could reproduce this bug IFF the Deployment apiVersion is extensions/v1beta1. When I switch the apiVersion to apps/v1 the watcher waits for the pods to come up successfully.

How to reproduce this:

helm fetch stable/traefik --untar
helm install ./traefik --set serviceType=NodePort --wait

We set the service type to NodePort so the watcher isn't waiting for the load balancer to become available. If you install the chart as-is (with its extensions/v1beta1 Deployment), you hit this regression.

Now, change the apiVersion in traefik/templates/deployment.yaml to apps/v1:

><> cat traefik/templates/deployment.yaml | grep apiVersion
apiVersion: apps/v1

If you run helm install ./traefik --set serviceType=NodePort --wait again, it'll properly wait for the pods to become ready.

This was tested using v2.9.1.

The latest changes to --wait in the last few releases would have been with #3373 or #3921. Either that or it's a regression in the Kubernetes client APIs, which has bitten us in the past.

FYI, upgrading to apps/v1 won't work for existing production deployments since Tiller will stick to extensions/v1beta1, so it's best to roll out a new helm install rather than a helm upgrade in this case. To demonstrate:

><> helm install ./traefik/ --set serviceType=NodePort --wait
NAME:   smelly-sabertooth
LAST DEPLOYED: Wed Jun 20 09:16:43 2018
NAMESPACE: default
STATUS: DEPLOYED
...
><> vim traefik/templates/deployment.yaml
><> helm upgrade $(helm last) ./traefik --set serviceType=NodePort --wait
Release "smelly-sabertooth" has been upgraded. Happy Helming!
><> k get deploy smelly-sabertooth-traefik -o yaml | grep apiVersion
apiVersion: extensions/v1beta1

For now the best path forward is to upgrade to the apps/v1 API, if possible.
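One caveat when moving a chart from extensions/v1beta1 to apps/v1: apps/v1 makes spec.selector mandatory (and immutable), and it must match the pod template labels, so swapping the apiVersion string alone is not always enough. A minimal illustration (the manifest below is a made-up example, not from the traefik chart):

```python
# Bump the apiVersion and check the now-required selector is present.
manifest = """\
apiVersion: extensions/v1beta1
kind: Deployment
spec:
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
"""
updated = manifest.replace("extensions/v1beta1", "apps/v1", 1)
assert updated.startswith("apiVersion: apps/v1")
# Charts that omitted spec.selector under extensions/v1beta1 (where it
# defaulted from the template labels) must add it before switching.
assert "matchLabels" in updated
```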

@bacongobbler bacongobbler added bug Categorizes issue or PR as related to a bug. and removed question/support labels Jun 20, 2018
@bacongobbler
Member

Labelling as a bug/regression. Not sure what our deprecation policy is for older API versions; I'll bring this up in tomorrow's dev call.

@bacongobbler bacongobbler changed the title Install --wait should actually wait for Deployments/PODs to be ready Install --wait should actually wait for extension/v1beta1 Deployments/PODs to be ready Jun 20, 2018
@bacongobbler
Member

I've also updated the title to reflect that this bug only affects Deployments using apiVersions prior to apps/v1.

@helgi
Contributor

helgi commented Jun 20, 2018

Oh, so the stable version may have finally removed that need! Great. Then it might be more work to fix it up than for people to migrate to stable?

@unguiculus
Member

So, is this only an issue with extensions/v1beta1, but not with apps/v1beta1 onwards? Charts CI switched to --wait with the refactored scripts, and I haven't seen any issues so far. However, another problem with --wait is that it's inconsistent: it generally waits until all pods are ready, but not for Deployments, where it takes maxUnavailable into account.

https://github.com/kubernetes/helm/blob/master/pkg/kube/wait.go#L252

That's why Charts CI additionally uses kubectl rollout status for deployments.

https://github.com/kubernetes-helm/chart-testing/blob/master/lib/chartlib.sh#L292-L295
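The inconsistency described above can be sketched as follows (Python pseudocode of the readiness predicate in pkg/kube/wait.go, simplified; names are illustrative, not Helm's actual identifiers):

```python
def deployment_ready(ready_replicas: int, desired: int,
                     max_unavailable: int) -> bool:
    # wait.go treats a Deployment as ready once the ready replica
    # count reaches desired - maxUnavailable, not the full desired
    # count, so --wait can return with pods still unready.
    return ready_replicas >= desired - max_unavailable

# With replicas=3 and maxUnavailable=1, the release is marked
# successful while one pod is still unready:
assert deployment_ready(2, 3, 1) is True
assert deployment_ready(1, 3, 1) is False
```

That gap is why Charts CI follows up with `kubectl rollout status`, which waits for the full rollout to complete.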

@bacongobbler
Member

bacongobbler commented Jun 22, 2018

I did not test apps/v1beta1, but if someone's willing to test that, it'd be great! Some of the community charts (like traefik in my example) rely on extensions/v1beta1, so they're going to hit this bug.

@jgleonard
Contributor

Seems to do the RightThing and wait using apps/v1beta1.

✔ ~/temp (⎈ |minikube:default)
09:56 $ helm init
$HELM_HOME has been configured at /Users/XXXX/.helm.

Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.

Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation
Happy Helming!
✔-INT ~/temp (⎈ |minikube:default)
09:57 $ helm version
Client: &version.Version{SemVer:"v2.9.1", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.9.1", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}
✔ ~/temp (⎈ |minikube:default)
09:57 $ helm fetch stable/traefik --untar
✔ ~/temp (⎈ |minikube:default)
09:58 $ sed -i '' -e 's#\(apiVersion: \)extensions#\1apps#' traefik/templates/deployment.yaml
✔ ~/temp (⎈ |minikube:default)
09:58 $ grep apiVersion traefik/templates/deployment.yaml
apiVersion: apps/v1beta1
✔ ~/temp (⎈ |minikube:default)
09:58 $ helm install ./traefik --set serviceType=NodePort --wait
NAME:   angry-armadillo
LAST DEPLOYED: Fri Jun 22 09:58:29 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1beta1/Deployment
NAME                     DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
angry-armadillo-traefik  1        1        1           1          20s

==> v1/Pod(related)
NAME                                      READY  STATUS   RESTARTS  AGE
angry-armadillo-traefik-6869cf7459-6pxdv  1/1    Running  0         20s

==> v1/ConfigMap
NAME                     DATA  AGE
angry-armadillo-traefik  1     20s

==> v1/Service
NAME                     TYPE      CLUSTER-IP     EXTERNAL-IP  PORT(S)                     AGE
angry-armadillo-traefik  NodePort  10.111.67.114  <none>       80:31562/TCP,443:30507/TCP  20s


NOTES:

1. Traefik has been started. You can find out the port numbers being used by traefik by running:

          $ kubectl describe svc angry-armadillo-traefik --namespace default

2. Configure DNS records corresponding to Kubernetes ingress resources to point to the NODE_IP/NODE_HOST

@thomastaylor312
Contributor

@bacongobbler In terms of supported API versions, the last discussion we had about the --wait flag was that we want to support all of the versions that k8s supports. When I fix it, I will be handling all of those workload versions.

@staticdev

It is not waiting for an apps/v1 Deployment on Helm 2.14 (client and server).

To reproduce, just run this script:

helm repo add storageos https://charts.storageos.com
helm install storageos/storageos-operator --namespace storageos-operator --wait
kubectl patch storageclass fast -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

It installs an apiVersion: apps/v1 Deployment.
https://github.com/storageos/charts/blob/master/stable/storageos-operator/templates/operator.yaml

If you wait a few seconds, you can then patch the created StorageClass.
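Until --wait behaves as documented, a generic polling fallback can bridge the gap. A sketch (the stub below stands in for a real check, e.g. one wrapping `kubectl get deployment -o jsonpath='{.status.readyReplicas}'`; all names are illustrative):

```python
import time

def wait_until(check, timeout=300.0, interval=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `check` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

# Demo with a stub that reports ready on the third poll:
polls = {"n": 0}
def fake_ready():
    polls["n"] += 1
    return polls["n"] >= 3

assert wait_until(fake_ready, timeout=10, interval=0, sleep=lambda s: None)
assert polls["n"] == 3
```

Injecting `clock` and `sleep` keeps the helper testable without real waiting.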

MarioCarrilloA pushed a commit to MarioCarrilloA/config that referenced this issue Jun 10, 2019
A new command "system application-update" is introduced in this
commit to support updating an applied application to a new version
with a new versioned app tarfile.

The application update leverages the existing application upload
workflow to first validate/upload the new app tarfile, then
invokes Armada apply or rollback to deploy the charts for the new
application version. If the version has ever been applied before,
Armada rollback is performed; otherwise, Armada apply is
performed.

After apply/rollback to the new version is done, the files for the
old application version will be cleaned up as well as the releases
which are not in the new application version. Once the update is
completed successfully, the status will be set to "applied" so that
user can continue applying app with user overrides.

If any failure occurs during the update, application recovery is
triggered to restore the app to the old version. If recovery
fails, the application status is set to "apply-failed" so that
the user can re-apply the app.

In order to use Armada rollback, a new sysinv table "kube_app_releases"
is created to record deployed helm release versions. After each app
apply, if any helm release version changed, the corresponding release
needs to be updated in the sysinv db as well.

The application overrides have been changed to tie to a specific
application in commit https://review.opendev.org/#/c/660498/. Therefore,
user overrides are preserved when updating.

Note: On AIO-SX, always use Armada apply even if the version was
      applied before. An issue was observed on AIO-SX (replicas is 1)
      when leveraging rollback: Armada/helm rollback --wait does not
      wait for pods to be ready before it returns.
      Related helm issues:
      helm/helm#4210
      helm/helm#2006

Tests conducted(AIO-SX, DX, Standard):
  - functional tests (both stx-openstack and simple custom app)
    - upload stx-openstack-1.0-13-centos-stable-latest tarfile
      which uses latest docker images
    - apply stx-openstack
    - update to stx-openstack-1.0-13-centos-stable-versioned
      which uses versioned docker images
    - update back to stx-openstack-1.0-13-centos-stable-latest
    - update to a version that has less/more charts compared to
      the old version
    - remove stx-openstack
    - delete stx-openstack
  - failure tests
    - application-update rejected
      (app not found, update to a same version,
       operation not permitted etc...)
    - application-update fails that trigger recover
      - upload failure
        ie. invalid tarfile, manifest file validation failed ...
      - apply/rollback failure
        ie. download images failure, Armada apply/rollback fails

Change-Id: I4e094427e673639e2bdafd8c476b897b7b4327a3
Story: 2005350
Task: 33568
Signed-off-by: Angie Wang <angie.wang@windriver.com>
slittle1 pushed a commit to starlingx-staging/openstack-armada-app-test that referenced this issue Sep 3, 2019
slittle1 pushed a commit to starlingx-staging/utilities2 that referenced this issue Sep 9, 2019
@bacongobbler
Member

Closing as inactive.
