
Install --wait should actually wait for extension/v1beta1 Deployments/PODs to be ready #4210

Closed
bgehman opened this issue Jun 12, 2018 · 11 comments
Labels
bug Categorizes issue or PR as related to a bug.

Comments

@bgehman

bgehman commented Jun 12, 2018

The helm CLI has --wait argument for install:

      --wait                     if set, will wait until all Pods, PVCs, Services, and minimum number 
of Pods of a Deployment are in a ready state before marking the release as successful. 
It will wait for as long as --timeout

As can be seen in testing, it only "waits" on Jobs. Pods are not waited on until they are "in a ready state" as the CLI help describes. Helm acts fire-and-forget: it does not wait for the Deployment-created Pods to become ready (per their readiness probes).

The published documentation contradicts the helm CLI usage help, and accurately documents the observed behavior: https://docs.helm.sh/developing_charts/

What does it mean to wait until a hook is ready? This depends on the resource declared in the hook. If the resource is a Job kind, Tiller will wait until the job successfully runs to completion. And if the job fails, the release will fail. This is a blocking operation, so the Helm client will pause while the Job is run.
For all other kinds, as soon as Kubernetes marks the resource as loaded (added or updated), the resource is considered “Ready”.
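The two behaviors the docs describe can be paraphrased as follows (a Python sketch of the documented rules, not Tiller's actual code; the function name is illustrative):

```python
def hook_wait_strategy(kind: str) -> str:
    # Per the Helm docs quoted above: only Job hooks block until the
    # Job runs to completion (and fail the release if the Job fails);
    # every other kind counts as "Ready" as soon as Kubernetes marks
    # the resource as loaded (added or updated).
    return "block_until_complete" if kind == "Job" else "ready_on_load"

assert hook_wait_strategy("Job") == "block_until_complete"
assert hook_wait_strategy("Deployment") == "ready_on_load"
```

A Deployment is therefore considered "Ready" immediately, which matches the fire-and-forget behavior reported above.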

Output from my install --wait:

RESOURCES:
==> v1/Pod(related)
NAME                                            READY  STATUS     RESTARTS  AGE
cb-audit-service-6786f97c9c-m5nl9               1/1    Running    0         1m
cb-budget-adapter-5b858f6f4c-82drj              1/1    Running    0         1m
cb-cam-diagnostics-5f95875d77-f6cg5             1/1    Running    0         1m
cb-core-auth-764d54bd8c-tpxhk                   0/1    Running    0         1m
cb-core-authorization-service-6c685886d8-rh7lw  1/1    Running    0         1m
cb-core-configuration-service-7878c4d5d5-v8pv5  1/1    Running    0         1m
cb-cred-svc-648d98b98b-szrpm                    1/1    Running    0         1m
cb-help-service-5b8f486765-gmpdt                1/1    Running    0         1m
cb-notification-api-786975c6f9-tg554            1/1    Running    0         1m
cb-notification-api-786975c6f9-xslqt            1/1    Running    0         1m
cb-vault-api-775f76bb86-vz4m5                   1/1    Running    0         1m
cb-vault-hvault-54d694d859-ch7r4                1/1    Running    0         1m
couchdb-58b4b9fb98-7jt2b                        1/1    Running    0         1m
mongodb-787bfb84-jqcvr                          1/1    Running    0         1m
cb-mongodb-migrator-xgvwv                       0/1    Completed  0         1m
core-cred-svc-v2-migrator-8b62r                 0/1    Completed  0         1m

Note: The output clearly shows my cb-core-auth Deployment pod as not ready (0/1). The behavior appears to only wait for the Jobs to complete.

Output of helm version:

Client: v2.9.1+g20adb27
Server: v2.9.1+g20adb27

Output of kubectl version:

Client Version: v1.9.6
Server Version: v1.9.6

Cloud Provider/Platform (AKS, GKE, Minikube etc.):
docker-edge (on MacOS)

@helgi
Contributor

helgi commented Jun 20, 2018

@bacongobbler I remember back in Deis (v2 anyway) we did put in a lot more structure and "intelligence" (I'm kinda cringing thinking back at some of the code) to define proper ready (healthcheck, etc) and also pull out errors (pull secrets missing, etc).

Can we lift that code and move it into helm? It was working directly against the API, and now, with more advances in how Kubernetes ties things together, a lot of that logic consolidates anyway.

Thoughts?

@bacongobbler
Member

bacongobbler commented Jun 20, 2018

Yes, that could be a good solution, assuming the assumptions we made back when deployments were in extensions/v1beta1 still hold true. Before we go rip out existing code though, let's try to reproduce the error 😄

I found I could reproduce this bug IFF the Deployment apiVersion is extensions/v1beta1. When I switch the apiVersion to apps/v1 the watcher waits for the pods to come up successfully.

How to reproduce this:

helm fetch stable/traefik --untar
helm install ./traefik --set serviceType=NodePort --wait

We set the service type to NodePort so the watcher isn't waiting for the load balancer to become available. If you install the chart as-is (with its extensions/v1beta1 Deployment), you hit this regression.

Now, change the apiVersion in traefik/templates/deployment.yaml to apps/v1:

><> cat traefik/templates/deployment.yaml | grep apiVersion
apiVersion: apps/v1

If you run helm install ./traefik --set serviceType=NodePort --wait again, it'll properly wait for the pods to become ready.

This was tested using v2.9.1.

The latest changes to --wait in the last few releases would have been with #3373 or #3921. Either that or it's a regression in the Kubernetes client APIs, which has bitten us in the past.

FYI, upgrading to apps/v1 won't work for existing production deployments since Tiller will stick to extensions/v1beta1, so it's best to roll out a new helm install rather than a helm upgrade in this case. To demonstrate:

><> helm install ./traefik/ --set serviceType=NodePort --wait
NAME:   smelly-sabertooth
LAST DEPLOYED: Wed Jun 20 09:16:43 2018
NAMESPACE: default
STATUS: DEPLOYED
...
><> vim traefik/templates/deployment.yaml
><> helm upgrade $(helm last) ./traefik --set serviceType=NodePort --wait
Release "smelly-sabertooth" has been upgraded. Happy Helming!
><> k get deploy smelly-sabertooth-traefik -o yaml | grep apiVersion
apiVersion: extensions/v1beta1

For now the best path forward is to upgrade to the apps/v1 API, if possible.
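One caveat when moving a chart from extensions/v1beta1 to apps/v1: apps/v1 makes spec.selector mandatory (and immutable), and it must match the pod template labels, so swapping the apiVersion string alone is not always enough. A minimal illustration (the manifest below is a made-up example, not from the traefik chart):

```python
# Bump the apiVersion and check the now-required selector is present.
manifest = """\
apiVersion: extensions/v1beta1
kind: Deployment
spec:
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
"""
updated = manifest.replace("extensions/v1beta1", "apps/v1", 1)
assert updated.startswith("apiVersion: apps/v1")
# Charts that omitted spec.selector under extensions/v1beta1 (where it
# defaulted from the template labels) must add it before switching.
assert "matchLabels" in updated
```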

@bacongobbler bacongobbler added bug Categorizes issue or PR as related to a bug. and removed question/support labels Jun 20, 2018
@bacongobbler
Member

Labelling as a bug/regression. Not sure what our deprecation policy is for older API versions; I'll bring this up in tomorrow's dev call.

@bacongobbler bacongobbler changed the title Install --wait should actually wait for Deployments/PODs to be ready Install --wait should actually wait for extension/v1beta1 Deployments/PODs to be ready Jun 20, 2018
@bacongobbler
Member

I've also updated the title to reflect that this bug only affects Deployments using apiVersions prior to apps/v1.

@helgi
Contributor

helgi commented Jun 20, 2018

Oh, so the stable version may have finally removed that need! Great. Then it might be more work to fix it up than for people to migrate to stable?

@unguiculus
Member

So, is this only an issue with extensions/v1beta1, but not with apps/v1beta1 onwards? Charts CI switched to --wait with the refactored scripts, and I haven't seen any issues so far. However, another problem with --wait is that it's inconsistent: it generally waits until all pods are ready, but not for Deployments, where it takes maxUnavailable into account.

https://github.com/kubernetes/helm/blob/master/pkg/kube/wait.go#L252

That's why Charts CI additionally uses kubectl rollout status for deployments.

https://github.com/kubernetes-helm/chart-testing/blob/master/lib/chartlib.sh#L292-L295
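The inconsistency described above can be sketched as follows (Python pseudocode of the readiness predicate in pkg/kube/wait.go, simplified; names are illustrative, not Helm's actual identifiers):

```python
def deployment_ready(ready_replicas: int, desired: int,
                     max_unavailable: int) -> bool:
    # wait.go treats a Deployment as ready once the ready replica
    # count reaches desired - maxUnavailable, not the full desired
    # count, so --wait can return with pods still unready.
    return ready_replicas >= desired - max_unavailable

# With replicas=3 and maxUnavailable=1, the release is marked
# successful while one pod is still unready:
assert deployment_ready(2, 3, 1) is True
assert deployment_ready(1, 3, 1) is False
```

That gap is why Charts CI follows up with `kubectl rollout status`, which waits for the full rollout to complete.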

@bacongobbler
Member

bacongobbler commented Jun 22, 2018

I did not test apps/v1beta1, but if someone's willing to test that, it'd be great! Some of the community charts (like traefik in my example) rely on extensions/v1beta1, so they're going to hit this bug.

@jgleonard
Contributor

Seems to do the RightThing and wait using apps/v1beta1.

✔ ~/temp (⎈ |minikube:default)
09:56 $ helm init
$HELM_HOME has been configured at /Users/XXXX/.helm.

Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.

Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation
Happy Helming!
✔-INT ~/temp (⎈ |minikube:default)
09:57 $ helm version
Client: &version.Version{SemVer:"v2.9.1", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.9.1", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}
✔ ~/temp (⎈ |minikube:default)
09:57 $ helm fetch stable/traefik --untar
✔ ~/temp (⎈ |minikube:default)
09:58 $ sed -i '' -e 's#\(apiVersion: \)extensions#\1apps#' traefik/templates/deployment.yaml
✔ ~/temp (⎈ |minikube:default)
09:58 $ grep apiVersion traefik/templates/deployment.yaml
apiVersion: apps/v1beta1
✔ ~/temp (⎈ |minikube:default)
09:58 $ helm install ./traefik --set serviceType=NodePort --wait
NAME:   angry-armadillo
LAST DEPLOYED: Fri Jun 22 09:58:29 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1beta1/Deployment
NAME                     DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
angry-armadillo-traefik  1        1        1           1          20s

==> v1/Pod(related)
NAME                                      READY  STATUS   RESTARTS  AGE
angry-armadillo-traefik-6869cf7459-6pxdv  1/1    Running  0         20s

==> v1/ConfigMap
NAME                     DATA  AGE
angry-armadillo-traefik  1     20s

==> v1/Service
NAME                     TYPE      CLUSTER-IP     EXTERNAL-IP  PORT(S)                     AGE
angry-armadillo-traefik  NodePort  10.111.67.114  <none>       80:31562/TCP,443:30507/TCP  20s


NOTES:

1. Traefik has been started. You can find out the port numbers being used by traefik by running:

          $ kubectl describe svc angry-armadillo-traefik --namespace default

2. Configure DNS records corresponding to Kubernetes ingress resources to point to the NODE_IP/NODE_HOST

@thomastaylor312
Contributor

@bacongobbler In terms of supported API versions, the last discussion we had about the --wait flag was that we want to support all of the versions that k8s supports. When I fix it, I will be handling all of those workload versions.

@staticdev

It is not waiting for an apps/v1 Deployment on Helm 2.14 (client and server).

To reproduce, just run this script:

helm repo add storageos https://charts.storageos.com
helm install storageos/storageos-operator --namespace storageos-operator --wait
kubectl patch storageclass fast -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

It installs an apiVersion: apps/v1 Deployment.
https://github.com/storageos/charts/blob/master/stable/storageos-operator/templates/operator.yaml

If you wait a few seconds, you can then patch the created StorageClass.
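Until --wait behaves as documented, a generic polling fallback can bridge the gap. A sketch (the stub below stands in for a real check, e.g. one wrapping `kubectl get deployment -o jsonpath='{.status.readyReplicas}'`; all names are illustrative):

```python
import time

def wait_until(check, timeout=300.0, interval=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `check` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

# Demo with a stub that reports ready on the third poll:
polls = {"n": 0}
def fake_ready():
    polls["n"] += 1
    return polls["n"] >= 3

assert wait_until(fake_ready, timeout=10, interval=0, sleep=lambda s: None)
assert polls["n"] == 3
```

Injecting `clock` and `sleep` keeps the helper testable without real waiting.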

MarioCarrilloA pushed a commit to MarioCarrilloA/config that referenced this issue Jun 10, 2019
A new command "system application-update" is introduced in this
commit to support updating an applied application to a new version
with a new versioned app tarfile.

The application update leverages the existing application upload
workflow to first validate/upload the new app tarfile, then
invokes Armada apply or rollback to deploy the charts for the new
application version. If the version has ever been applied before,
Armada rollback is performed; otherwise, Armada apply is
performed.

After apply/rollback to the new version is done, the files for the
old application version will be cleaned up as well as the releases
which are not in the new application version. Once the update is
completed successfully, the status will be set to "applied" so that
user can continue applying app with user overrides.

If any failure occurs during the update, application recovery is
triggered to restore the app to the old version. If recovery
fails, the application status is set to "apply-failed" so that
the user can re-apply the app.

In order to use Armada rollback, a new sysinv table "kube_app_releases"
is created to record deployed helm release versions. After each app
apply, if any helm release version changed, the corresponding release
needs to be updated in the sysinv db as well.

The application overrides have been changed to tie to a specific
application in commit https://review.opendev.org/#/c/660498/. Therefore,
user overrides are preserved when updating.

Note: On AIO-SX, always use Armada apply even if the version was
      applied before. An issue was observed on AIO-SX (replicas is 1)
      when leveraging rollback: Armada/helm rollback --wait does not
      wait for pods to be ready before it returns.
      Related helm issues:
      helm/helm#4210
      helm/helm#2006

Tests conducted(AIO-SX, DX, Standard):
  - functional tests (both stx-openstack and simple custom app)
    - upload stx-openstack-1.0-13-centos-stable-latest tarfile
      which uses latest docker images
    - apply stx-openstack
    - update to stx-openstack-1.0-13-centos-stable-versioned
      which uses versioned docker images
    - update back to stx-openstack-1.0-13-centos-stable-latest
    - update to a version that has less/more charts compared to
      the old version
    - remove stx-openstack
    - delete stx-openstack
  - failure tests
    - application-update rejected
      (app not found, update to a same version,
       operation not permitted etc...)
    - application-update fails that trigger recover
      - upload failure
        ie. invalid tarfile, manifest file validation failed ...
      - apply/rollback failure
        ie. download images failure, Armada apply/rollback fails

Change-Id: I4e094427e673639e2bdafd8c476b897b7b4327a3
Story: 2005350
Task: 33568
Signed-off-by: Angie Wang <angie.wang@windriver.com>
slittle1 pushed a commit to starlingx-staging/openstack-armada-app-test that referenced this issue Sep 3, 2019
slittle1 pushed a commit to starlingx-staging/utilities2 that referenced this issue Sep 9, 2019
@bacongobbler
Member

Closing as inactive.
