replace FAILED deployments with helm upgrade --install --force #3597

Merged
merged 1 commit into from Mar 9, 2018

bacongobbler
Member

@bacongobbler bacongobbler commented Mar 2, 2018

When using helm upgrade --install, if the first release fails, Helm will respond with an error saying that it cannot upgrade from an unknown state.

With this feature, helm upgrade --install --force automates the same process as helm delete && helm install --replace. It will mark the previously FAILED release as DELETED, delete any existing resources inside Kubernetes, then replace it as if it was a fresh install. I did not want to make this the default behaviour of helm upgrade --install because this is a destructive operation that deletes resources in Kubernetes, and the operator should opt into and accept this behaviour.

closes #3353
refs discussion in #3437
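
As a rough illustration of the flow described above, here is a toy in-memory model, not Tiller's actual code. The `Status` type, `UpgradeInstall`, `everDeployed`, and the error value are hypothetical names introduced for this sketch. It assumes the new install always succeeds, and it treats any history with an earlier successful release as a plain upgrade, which is a simplification.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for the release states named in this PR.
type Status string

const (
	Deployed Status = "DEPLOYED"
	Failed   Status = "FAILED"
	Deleted  Status = "DELETED"
)

// ErrNoDeployedRelease is a hypothetical error for "nothing to upgrade from".
var ErrNoDeployedRelease = errors.New("latest release FAILED and nothing was ever deployed")

// everDeployed reports whether any release in the history reached DEPLOYED.
func everDeployed(history []Status) bool {
	for _, s := range history {
		if s == Deployed {
			return true
		}
	}
	return false
}

// UpgradeInstall takes release statuses (oldest first) and returns the
// history after one more `upgrade --install` attempt in this toy model.
func UpgradeInstall(history []Status, force bool) ([]Status, error) {
	out := append([]Status(nil), history...) // copy so the caller's slice is untouched
	n := len(out)
	if n > 0 && out[n-1] == Failed && !everDeployed(out) {
		if !force {
			return out, ErrNoDeployedRelease
		}
		// Opt-in destructive path: the failed release ends up DELETED
		// and a fresh install takes its place.
		out[n-1] = Deleted
	}
	// Assumption: the new install or upgrade succeeds.
	return append(out, Deployed), nil
}

func main() {
	_, err := UpgradeInstall([]Status{Failed}, false)
	fmt.Println("without force:", err)

	history, _ := UpgradeInstall([]Status{Failed}, true)
	fmt.Println("with force:", history)
}
```

The destructive branch is only reachable when `force` is true, which mirrors the opt-in intent of the flag.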

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes and size/L labels Mar 2, 2018
@bacongobbler
Member Author

Forward note: I'm a little iffy on marking the previous release as SUPERSEDED.

@helgi
Contributor

helgi commented Mar 2, 2018

Can this be the default behaviour if there is only 1 release prior and it failed? Seems to be the most common case so far.

Beyond that, I agree with it not being the default behaviour.

Contributor

@thomastaylor312 thomastaylor312 left a comment

One small comment for discussion, otherwise I tested this and it is good to go

@@ -37,6 +38,10 @@ func (s *ReleaseServer) UpdateRelease(c ctx.Context, req *services.UpdateRelease
s.Log("preparing update for %s", req.Name)
currentRelease, updatedRelease, err := s.prepareUpdate(req)
if err != nil {
if req.Force {
// Use the --force, Luke.
Contributor

💯

}
}

oldRelease.Info.Status.Code = release.Status_SUPERSEDED
Contributor

I think this may deserve a new status. If I was troubleshooting and saw "superseded" I wouldn't know it was force updated. Maybe REPLACED?

Member Author

Hmm good point. I also think it might be good to just keep this in the FAILED state so others cannot roll back to this release, which SUPERSEDED or REPLACED would allow. The more I think about this, the more I would prefer to retain the existing FAILED state.

Related comment: #3597 (comment)

@adamreese any opinions on this?

Contributor

I think FAILED would also be a good idea

Chart: &chart.Chart{
Metadata: &chart.Metadata{Name: "hello"},
Templates: []*chart.Template{
{Name: "templates/something", Data: []byte("hello: world")},
Contributor

I'm kind of sad this text isn't from Star Wars 😢

Member Author

That can be arranged...


compareStoredAndReturnedRelease(t, *rs, *res)

edesc := "Upgrade complete"
Contributor

Nit: I think expectedDescription would be clearer here. Not a necessary change though by any means

Member Author

clearer variable names are always great.

@bacongobbler
Member Author

bacongobbler commented Mar 8, 2018

> Can this be the default behaviour if there is only 1 release prior and it failed? Seems to be the most common case so far.

That is good feedback, thanks @helgi. I'd be a little concerned about the behaviour being inconsistent for users though. In this case, a user can expect helm upgrade --install to "fix" the first failed release, but it will continue to fail on subsequent releases with no feedback on how to fix that. I'd kinda prefer to make it explicitly opt-in as a feature flag, but I'd love to know whether that's important!

Perhaps we can decide at a later date if we should do that based on others' feedback in 2.8.2. How does that sound?
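
To make the inconsistency concern concrete, the proposed default can be written as a small predicate. This is only a sketch of the proposal as worded in this thread; `autoReplace` and the `Status` type are hypothetical names, not anything in Helm.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for a release state.
type Status string

const (
	Deployed Status = "DEPLOYED"
	Failed   Status = "FAILED"
)

// autoReplace reports whether the proposed default would replace a release
// without --force: exactly one prior release, and that release failed.
func autoReplace(history []Status) bool {
	return len(history) == 1 && history[0] == Failed
}

func main() {
	// A failed first release would be replaced automatically.
	fmt.Println(autoReplace([]Status{Failed}))
	// A failure after an earlier success would not, so the user
	// sees different behaviour for what looks like the same problem.
	fmt.Println(autoReplace([]Status{Deployed, Failed}))
}
```

The two calls return different answers for two "latest release failed" situations, which is the user-facing inconsistency raised above.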

@helgi
Contributor

helgi commented Mar 8, 2018

> Perhaps we can decide at a later date if we should do that based on others' feedback in 2.8.2. How does that sound?

Yeah, I considered the inconsistency and cringed as I wrote that message. The scenario I run into a little too often (for comfort) is engineers throwing together a new helm chart and not running it in minikube but rather putting it directly into the dev CI system (copy pasta basically), which leads to failed first deployment a lot of the time.

The user generally has no idea why it failed and why their future pushes do not fix the issue, and even when they have done the fix before (helm delete --purge) it is forgotten a lot of the time. Basically, first time deploy clumsiness UX issues.

I'm happy with deferring the decision but I did want to bring up the use case

return res, err
}

// pre-upgrade hooks
Member

Should this be pre-install?

Member Author

yes

return res, err
}

// post-upgrade hooks
Member

Should this be post-install?

@bacongobbler
Member Author

k, addressed all comments. New stuff since the last round of reviews:

  • the old release finishes in state DELETED instead of state SUPERSEDED, modelling what helm install --replace would do
  • due to the wonky semantics of this feature flag, pre/post delete and install hooks are run, not pre/post upgrade hooks
  • m0ar Star Wars references in the tests for @thomastaylor312's enjoyment

Should be good for another round of reviews
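
The hook change in the second bullet can be sketched as below. The hook event names are the standard Helm hook names; `hooksFor` is a hypothetical helper, and the exact ordering of delete before install hooks is an assumption based on the delete-then-install description, not taken from the diff.

```go
package main

import "fmt"

// hooksFor returns the hook events expected to fire for an upgrade.
// forcedReplace is true when --force replaces a FAILED release.
func hooksFor(forcedReplace bool) []string {
	if forcedReplace {
		// Assumed order: the old release is deleted, then the new one installed.
		return []string{"pre-delete", "post-delete", "pre-install", "post-install"}
	}
	// A regular upgrade only runs the upgrade hooks.
	return []string{"pre-upgrade", "post-upgrade"}
}

func main() {
	fmt.Println("forced replace:", hooksFor(true))
	fmt.Println("normal upgrade:", hooksFor(false))
}
```

Charts that rely on pre/post upgrade hooks would therefore not see them fire on the forced path.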

@bacongobbler bacongobbler added this to the 2.8.2 - Bugfix milestone Mar 8, 2018
Contributor

@thomastaylor312 thomastaylor312 left a comment

This looks good, and your Star Wars reference is golden

When using `helm upgrade --install`, if the first release fails, Helm will respond with an error saying that it cannot upgrade from an unknown state.

With this feature, `helm upgrade --install --force` automates the same process as `helm delete && helm install --replace`. It will mark the previous release as DELETED, delete any existing resources inside Kubernetes, then replace it as if it was a fresh install. It will then mark the FAILED release as SUPERSEDED.
@bacongobbler bacongobbler merged commit abe958e into helm:master Mar 9, 2018
@bacongobbler bacongobbler deleted the upgrade-force-replace branch March 9, 2018 19:38
@bacongobbler bacongobbler added the needs-pick and picked labels and removed the needs-pick label Mar 9, 2018
@mcfedr

mcfedr commented Mar 27, 2018

Am I right in reading this as: --force will only ever cause a delete and redeploy when the first deploy has failed?

So if I have a chart deployed, break it, and then upgrade it, it will be fixed, not recreated?

@stealthybox
Contributor

stealthybox commented Apr 19, 2018

@bacongobbler I'm also confused regarding this change.

helm upgrade --install
  1. if release v1 failed, release v2 will fail because the operation is considered unsafe
  2. if release v4 succeeded and release v5 failed, release v6 will succeed based off of release v4 ?
helm upgrade --install --force
  1. if release v1 failed, release v2 will succeed because release v1 is deleted first
  2. if release v4 succeeded and release v5 failed, release v6 will succeed based off of release ... ?
    does it delete the pre-existing release?
    does it cause downtime?

Is the --force flag safe to use all the time, expecting that it only destroys kubernetes resources when the very first release fails?

@bacongobbler
Member Author

bacongobbler commented Apr 19, 2018

helm upgrade --install with the --force flag automates what one would do to "fix" a failed upgrade. --force is just a helm delete && helm install --replace, and it only kicks in when the release failed to deploy. It only causes downtime if your application would go into a failed state. There's nothing we can do to fix that. --force just attempts to fix it. If your application would normally upgrade gracefully, there's no downtime.

@bacongobbler
Member Author

bacongobbler commented Apr 19, 2018

and no, it kicks in any time a release fails to upgrade.

@stealthybox
Contributor

is case 2 accurate?
I'm already lost as to whether helm rolls back release versions on failure

@stealthybox
Contributor

stealthybox commented Apr 19, 2018

We're still on Helm v2.7.0 because the current upgrade-over-failure behavior appears to be safer for our use case than deleting a deployment.

Our releases usually fail due to hooks, but our hooks are idempotent Jobs, so it's usually safe and desirable to upgrade right over them with the pre-existing Kubernetes resources.

If I'm understanding the new behavior correctly, I believe it would be possible for a Deployment to be deleted and for no Pods to be available to serve traffic mid-release if a hook failed on the previous release.

@bacongobbler
Member Author

To answer case 2, helm upgrade --install --force will upgrade as normal, so it'll use v4 to upgrade. However, should v6 fail, that's when the --force flag kicks in and I'm not 100% sure I can recall exactly what happens. It's been a little while so I'll have to look through the code again to answer your question. Feel like looking at it together at KubeCon? :)
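
For case 2, "use v4 to upgrade" can be read as picking the most recent DEPLOYED release as the base. The sketch below shows only that selection step under that reading; `Release`, `Status`, and `upgradeBase` are hypothetical names, and it says nothing about what happens if the new release (v6) then fails.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for a release state.
type Status string

const (
	Deployed Status = "DEPLOYED"
	Failed   Status = "FAILED"
)

// Release is a simplified, hypothetical release record.
type Release struct {
	Version int
	Status  Status
}

// upgradeBase scans the history from newest to oldest and returns the most
// recent DEPLOYED release, or false if no release was ever deployed.
func upgradeBase(history []Release) (Release, bool) {
	for i := len(history) - 1; i >= 0; i-- {
		if history[i].Status == Deployed {
			return history[i], true
		}
	}
	return Release{}, false
}

func main() {
	history := []Release{{Version: 4, Status: Deployed}, {Version: 5, Status: Failed}}
	if base, ok := upgradeBase(history); ok {
		fmt.Printf("upgrade v6 would be based on v%d\n", base.Version)
	}
}
```

With a failed first release there is no DEPLOYED entry to return, which corresponds to the situation where plain `helm upgrade --install` has nothing to upgrade from.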

@stealthybox
Contributor

👍 yep, I'm curious
looking forward to it

@stealthybox
Contributor

> If I'm understanding the new behavior correctly, I believe it would be possible for a Deployment to be deleted and for no Pods to be available to serve traffic mid-release if a hook failed on the previous release.

It seems from #3208 (comment) that this might be the case when using upgrade --install --force.
This flag is pretty dangerous.

splisson pushed a commit to splisson/helm that referenced this pull request Dec 6, 2018
replace FAILED deployments with `helm upgrade --install --force`
RaphaelVogel added a commit to gardener/cc-utils that referenced this pull request Jan 23, 2019
Failed helm deployments cannot be upgraded without
the --force flag.
See: helm/helm#3597
Labels
cncf-cla: yes, picked, size/L
Successfully merging this pull request may close these issues.

helm upgrade --install doesn't perform an install/upgrade if the first ever install fails
7 participants