
Perform three way merge patch during `helm upgrade` call as `kubectl apply` does #3805

Closed
zaa opened this issue Mar 31, 2018 · 28 comments

Comments

@zaa

commented Mar 31, 2018

I've checked https://github.com/kubernetes/helm/blob/master/pkg/kube/client.go#L432 and found that Helm v2.8.x uses a two-way strategic merge patch between the old and the new configuration of a resource to modify it during a helm upgrade call.
However, kubectl apply uses a three-way merge patch between the old configuration, the new configuration, and the current state of the resource: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubectl/cmd/apply.go#L670-L679 (https://kubernetes.io/docs/concepts/overview/object-management-kubectl/declarative-config/#merge-patch-calculation explains the algorithm in a bit more detail)
Is there any specific reason why Helm does not use strategicpatch.CreateThreeWayMergePatch? Would the project be ok with using the function if I were to prepare a patch?
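
For illustration, here is a minimal sketch of the difference using the strategicpatch helpers from k8s.io/apimachinery (this is not Helm's actual code; the manifests, replica counts, and the drift scenario are made up):

package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/util/strategicpatch"
)

func main() {
    // Hypothetical inputs: the chart is unchanged between revisions (replicas: 2),
    // but someone scaled the live Deployment to 5 with kubectl.
    lastApplied := []byte(`{"spec":{"replicas":2}}`) // old config Helm recorded
    rendered := []byte(`{"spec":{"replicas":2}}`)    // new config rendered from the chart
    live := []byte(`{"spec":{"replicas":5}}`)        // current state in the cluster

    // Two-way patch (Helm v2.8.x): only old vs. new are compared, so the result
    // is an empty patch and the manual scale-up is never reverted.
    twoWay, err := strategicpatch.CreateTwoWayMergePatch(lastApplied, rendered, appsv1.Deployment{})
    if err != nil {
        panic(err)
    }
    fmt.Println("two-way:  ", string(twoWay)) // {}

    // Three-way patch (what kubectl apply computes): the live object is part of the
    // calculation, so the patch sets replicas back to 2. overwrite=true matches
    // kubectl apply's default behaviour.
    threeWay, err := strategicpatch.CreateThreeWayMergePatch(lastApplied, rendered, live, appsv1.Deployment{}, true)
    if err != nil {
        panic(err)
    }
    fmt.Println("three-way:", string(threeWay)) // {"spec":{"replicas":2}}
}

The empty two-way patch is exactly the case described in this thread: if the chart didn't change, Helm v2 has nothing to send, no matter what the live object looks like.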

Thank you.

@michelleN

Member

commented Apr 2, 2018

hey @zaa! happy to provide some context.

  • The general pattern we use to update objects in Kubernetes is: 1. modify the chart, 2. perform helm upgrade. This allows Helm to keep track of the manifests and values at each revision. Helm expects that you're not modifying manifests outside of the Helm system, so we made the call to keep things simple and use the two-way merge strategy.

  • Also, kubectl apply adds annotations to your Kubernetes objects to calculate what needs to be revised. We didn't want Helm to be in the business of modifying or injecting things into a user's objects and manifests, to prevent confusion and allow for maximum transparency and control.

Things have changed quite a bit since we made this decision, so I'm open to revisiting. Could you list some reasons for moving to the three-way merge patch strategy?

Also something to consider: If this changes current expected behavior in Helm (which I expect it would since we'd be comparing current state in the cluster), then this would be a breaking change.
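
For reference, the annotation mentioned in the second point looks roughly like this on a live object after kubectl apply (a trimmed, illustrative example, not taken from this thread); it is this stored "last applied" copy that makes the three-way calculation possible:

apiVersion: v1
kind: Service
metadata:
  name: foo
  annotations:
    # written by kubectl apply; used as the "original" input of the three-way merge
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Service","metadata":{"name":"foo"},"spec":{"ports":[{"port":80}]}}
spec:
  ports:
  - port: 80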

@fejta-bot


commented Jul 1, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@bergerx


commented Jul 2, 2018

Helm expects that you're not modifying manifests outside of the Helm system.

This could be a dangerous assumption for systems in production. I haven't hit an issue with this yet, since most of the company hasn't started using Helm in production. The key question I see here is "can I expect to find a chart in an un-upgradable state if resources are modified outside of Helm?" (sorry, but I'm not sure how to create such a scenario; I can look into this later)

If that's the case, I expect that sooner or later some operator will have to make in-place modifications (either via kubectl or the dashboard) on some resources to prevent or shorten a critical outage. Such changes would likely break subsequent upgrades and leave our deployment pipelines in a broken state that is not easy to recover from without reverting the change.

The "only-helm-can-modify-resources" expectation forces us to plan ahead accordingly. We need to educate everyone involved in operations (in our case, everyone on every team) not to modify any Helm-deployed resources in place, and they have to know how Helm is being operated on that instance. As far as I can see, a three-way merge addresses most of the issues in such scenarios.

Also, this decision forces us to have a strategy to identify, alert on, and recover from any out-of-Helm change. Maybe we can go the RBAC route to prevent non-Helm entities from modifying any resources deployed by Helm, but that will have its own ups and downs to be discovered.
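
To sketch that RBAC idea (all names and the namespace below are made up; whether this is practical depends on how Tiller is deployed): grant human users read-only access in Helm-managed namespaces, so only Tiller's service account can mutate those resources.

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: helm-managed-read-only
  namespace: helm-managed        # hypothetical namespace managed only via Helm
rules:
- apiGroups: ["", "apps", "extensions"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: helm-managed-read-only
  namespace: helm-managed
subjects:
- kind: Group
  name: developers               # hypothetical group of human users
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: helm-managed-read-only
  apiGroup: rbac.authorization.k8s.io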

I think this one deserves a discussion before being closed, so I'm disabling the issue lifecycle; feel free to close it if a final decision has been taken on this.

/remove-lifecycle stale

@steven-sheehy

Contributor

commented Jul 23, 2018

We've definitely run into this in production. The Helm advice to only make changes via Helm is not practical during rapid prototyping in a development environment (via kubectl edit) or in production when there's an urgent issue that can be solved by increasing some resource limit.

@michelleN Is there any consideration for changing this behavior for Helm 3? Or maybe in Helm 2 with an optional flag on helm upgrade?

@bergerx


commented Jul 25, 2018

We were also hit by this issue today: some values were somehow rendered as the string "null" in some manifests. The way the "null" strings are spread around doesn't look like a manual change, but Helm has no record of what was actually applied. We only noticed the faultily updated manifests after some time, and since Helm doesn't take the current state into account, our solution was to manually compare the manifests and apply the drift.

@luisdavim

Contributor

commented Jul 25, 2018

Also had the same issues and even got into a state where I couldn't deploy my chart anymore because of leftover resources (replicaSets, Pods and PVCs).

@luisdavim

Contributor

commented Jul 26, 2018

@michelleN any plans for addressing this issue?

@michelleN michelleN added the v3 label Jul 26, 2018

@michelleN

Member

commented Jul 26, 2018

@luisdavim @zaa @bergerx @steven-sheehy
Thanks so much for all of your input and for taking the time to explain these scenarios. I've labeled this as a discussion point for v3. I really like the idea of being able to toggle the upgrade strategy via a flag for situations when you need to force an upgrade.

I put this down as a discussion item for the Helm developer call on August 2. You're all welcome to join. In the meantime, we'll also want to figure out if there are any short term or long term plans for that strategic merge patch library in Kubernetes.

@bergerx


commented Aug 2, 2018

I just missed the developer call :( here are the notes from there:

Update strategy in helm v3. See issue #3805 (michelle)

  • Adam’s been working a bit on a prototype to evaluate this. Pretty close to spiking it out, when he should have a better idea about this.
@michelleN

Member

commented Aug 7, 2018

  • @adamreese to report back to this thread when he's ready with an update
@Nopik


commented Aug 13, 2018

we made the call to keep things simple and use the two way merge strategy.

Well, the call was made, and the code follows. But the execution is flawed, sadly. The current code ends up in a non-upgradeable state, with very confusing messages, sometimes not even related to the real problem. A lot of users just say: 'delete the release and install it again'. If I were a core developer of Helm, I wouldn't find that acceptable.

One solution would be to do a proper, full rollback on failures. That would be the worse solution, I'm afraid (though it would help to have it). The better solution would be to a) allow taking over manually created resources, if the user wants it, and b) allow users to modify resources outside of Helm anyway.

IMHO the Helm community is growing a lot, and new use cases are being created very often, which might lead to a lot of surprises (hopefully good ones) for the Helm authors. As a result, I would recommend replacing "you can't do that" with an "ok, it is not our default, but if you want to modify resources, here is a flag for it" approach. With all due respect, that is the difference between opinionated and stubborn ;)

For example, I have an umbrella chart containing ~30 other charts for our microservices plus 5-10 charts for supporting tech like psql or redis. In total there are ~150 resources (at least that's what tiller says in its logs). I would prefer not to delete and recreate the whole release just because of Helm bugs. But a few days ago I was hit by this 'service X does not exist' error. Until I solved it, I had to delete 46 resources manually. And you know how? You delete service X and redeploy. Helm says: 'service Y does not exist'. So you delete service X (it was created in the meantime), then delete service Y. Then redeploy. 'Service Z does not exist'. Delete services X, Y and Z. Redeploy. 'Config map C does not exist', and so on. 46 iterations. Let that sink in. Of course I had a script for it, which was pretty long by the end of the process. Insane user experience, really. So, "we made this call to keep things simple" really doesn't resonate with me, you know ;)

Anyway, that's why I've set out to fix it. Mostly just for myself, as I find myself forced to delete 5-20 resources daily just to work around this single Helm annoyance. (That's because we're currently at the stage of rebuilding our app and have been modifying our charts a lot recently. Hopefully we'll soon move on to other areas and our charts will be relatively unchanged.)

@ssalaues


commented Aug 14, 2018

I'm in the same situation as @Nopik: we're currently in the development cycle of our charts, which means a lot of the code in them keeps changing. This is something we keep running into, and it's annoying to have to helm delete and reinstall, which can take some time with a dozen deployments/statefulsets/etc., versus a simple helm upgrade (or to chase down and delete the X, Y, and Z that "does not exist").

@sdake


commented Aug 19, 2018

@michelleN in upstream Istio, I'm pretty sure we are suffering from problem #1193. Would this solution resolve that issue? Upgrade is obviously critical for service mesh operators. At present, we recommend two paths for installation:

  1. https://istio.io/docs/setup/kubernetes/helm-install/#option-1-install-with-helm-via-helm-template
  2. https://istio.io/docs/setup/kubernetes/helm-install/#option-2-install-with-helm-and-tiller-via-helm-install

Option two may have to go if we can't find a resolution to at least handle the addition of new objects.

See istio/istio#8054 and istio/istio#8034.

Cheers
-steve

@bergerx


commented Aug 19, 2018

We are also forced to use helm template <chart> | kubectl apply -f - rather than helm install in multiple places, but you lose the "helm tracks all the created resources and removes them once they are removed from the chart/manifests" feature. You have to maintain the deletions outside of Helm :(

Also, your first method doesn't allow updating Istio once it's deployed, because it also uses kubectl create rather than kubectl apply. Even though kubectl apply can't auto-remove the removed resources, it handles further updates to the manifests properly. (I'm not sure whether you chose kubectl create explicitly so that users get more obvious errors when they later try to update with this method, since the kubectl apply route will silently skip removing the resources.)
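
For concreteness, the pattern being described is roughly the following (the chart path and namespace are placeholders):

# Render the chart locally and hand the result to kubectl
helm template ./istio --namespace istio-system > istio.yaml
kubectl apply -f istio.yaml

# Caveat: nothing tracks resources that are later removed from the chart,
# so those deletions have to be handled outside of Helm.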

@sdake


commented Aug 19, 2018

@bergerx thanks - I am pretty sure most devs have been testing the apply mechanism and the docs are an oversight. Thanks for the bug report - it is fixed here: istio/istio.io#2307

Cheers
-steve

@bacongobbler

Member

commented Aug 23, 2018

in upstream Istio, I'm pretty sure we are suffering the problem #1193. Would this solution resolve that issue?

Unfortunately, switching to a different merge patch strategy on an existing install will not resolve that particular issue. #1193 occurs when an upgrade fails during a two-way merge, leaving the release in an inconsistent state. Switching to a different merge strategy will not fix that, as one cannot simply switch from a two-way strategic merge patch to a three-way strategic merge patch. The behaviour of each strategy and how they maintain state are entirely different.

For #1193 in particular, I wrote up a summary of the issue and why switching to a different upgrade strategy will not solve the problem. Hopefully it may help shed some light on the reasoning why it won't resolve the issue. See #4223 (comment) for the write-up. :)

To hopefully summarize what's been mentioned in this thread, in #1193, and in the public dev call this morning:

  • as @michelleN mentioned earlier, kubectl apply's three-way strategic merge patch works by adding annotations to your Kubernetes objects to calculate what needs to be revised. Since a two-way strategic merge patch does not add the annotations that a three-way strategic merge patch needs in order to operate, users would not be able to switch their existing deployments over to the new strategy, only new deployments.

  • the main reason people are requesting this feature seems to stem from #1193, and switching to a three-way merge patch will not resolve that for existing deployments (as mentioned in my write-up above) because of the first point: since Helm does not add annotations to objects, users would not be able to switch their existing deployments over to the new strategy, only new deployments.

  • from a codebase maintenance point of view, maintaining multiple merge strategies in Helm would lead to a completely different set of edge cases and a significant increase in maintenance cost for Helm 2, especially now that Helm 3 development is well underway. We are, however, looking into and exploring different merge strategies for Helm 3, such as the server-side apply logic being proposed upstream via the Apply Working Group. @adamreese has been spiking a proof-of-concept to evaluate the different merge patch strategies for Helm 3 (including the three-way merge patch strategy proposed here), but it isn't ready yet.

tl;dr

  1. We do not plan on supporting alternative merge patch strategies for Helm 2.
  2. We are exploring alternatives for Helm 3, but haven't made any commitments to switch at this point.

Hope that clears things up!

@sdake


commented Aug 24, 2018

@bacongobbler thanks for taking the time to explain that. If a three way merge is not viable for new deployments in Helm 2.y.z, when will #1193 be fixed? The bug has been open for nearly two years with no clear resolution planned for Helm 2.0.

Doing something as simple as adding a single configmap to the Helm chart between Istio 0.8.0 and Istio 1.0.0 resulted in an inability to upgrade Istio. Hopefully the Helm team can come to a resolution for this critical flaw in Helm.

Cheers
-steve

@agolomoodysaada


commented Aug 29, 2018

@bacongobbler, if no support for the three-way merge feature is planned, we need an alternative or a workaround. Otherwise, #1193 is a seriously painful blocker.
This is a core feature of Helm.

As a user, I need to be able to update my chart.

What can the community do to resolve this basic requirement?

@bacongobbler

Member

commented Sep 7, 2018

If a three way merge is not viable for new deployments in Helm 2.y.z, when will #1193 be fixed? The bug has been open for nearly two years with no clear resolution planned for Helm 2.0.

At this point, we're stumped on how to proceed. We've discussed the bug for weeks, and none of the proposed solutions works in all cases; each either introduces new bugs or significantly changes Tiller's upgrade behaviour.

For example, @michelleN and I brainstormed earlier this week and thought of two possible solutions, neither of which are particularly fantastic:

  1. When an upgrade fails, we automatically roll back and delete resources that were created during this release.

This is very risky as the cluster may be in an unknown state after a failed upgrade, so Helm may be unable to proceed in a clean fashion, potentially causing application downtime.

  2. During an upgrade, if we are creating a new resource and we see that it already exists, we instead apply those changes to the existing resource, or delete/re-create it.

This is extremely risky as Helm may delete objects that were installed via other packages or through kubectl create, neither of which users may want.

The safest option so far has been to ask users to manually intervene in the case of this conflict, which I'll demonstrate below.

If anyone has suggestions/feedback/alternative proposals, we'd love to hear your thoughts.

@bacongobbler, if no support for the three-way merge feature is planned, we need an alternative or a workaround. Otherwise, #1193 is a seriously painful blocker.

@agolomoodysaada I've already mentioned the workaround in previous threads. To reiterate both the issue and the workaround:

When an upgrade that installs new resources fails, the release goes into a FAILED state and stops the upgrade process. The next time you call helm upgrade, Helm does a diff against the last DEPLOYED release. In the last DEPLOYED release, this object did not exist, so it tries to create the new resource, but fails because it already exists.

This can easily be worked around by manually intervening and deleting the resource that the error reports as "not found". Following the example I demonstrated in #4223 (comment):

><> helm fetch --untar https://github.com/helm/helm/files/2103643/foo-0.1.0.tar.gz
><> helm install ./foo/
...
><> vim foo/templates/service.yaml                    # change the service name from "foo" to "foo-bar"
><> kubectl create -f foo/templates/service.yaml      # create the service
service "foo-bar" created
><> helm upgrade $(helm last) ./foo/
Error: UPGRADE FAILED: no Service with the name "foo-bar" found
><> kubectl delete svc foo-bar
service "foo-bar" deleted
><> helm upgrade $(helm last) ./foo/
Release "riotous-echidna" has been upgraded. Happy Helming!
...
><> kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
foo-bar      ClusterIP   10.104.143.52   <none>        80/TCP    3s
kubernetes   ClusterIP   10.96.0.1       <none>        443/TCP   1h

In other words, deleting resources created during the FAILED release works around the issue.

@steven-sheehy

Contributor

commented Sep 7, 2018

@bacongobbler I think this issue and issue #1193 are getting a little too intertwined. This issue is more of a feature request to add a three-way merge patch to solve scenarios like manual edits to Helm-managed resources. If it also addresses #1193, that's good, but I think the benefits of the three-way merge should also be considered standalone from that issue.

@bacongobbler

Member

commented Sep 7, 2018

That's good feedback; thanks @steven-sheehy. Let's carry on that discussion in #1193 instead so we can continue to focus on the proposal at hand here.

@sboardwell


commented Mar 26, 2019

Hi @bacongobbler,

this is more of a question, as I see the topic is now planned for Helm 3.

We had a situation where a secret in one of our releases was changed via kubectl apply in the wrong kube context. The resource was identical to the released version except for the incorrect secret value. Noticing this, we performed an upgrade to reset/override the values. However, since Helm doesn't check the cluster state, the helm upgrade command succeeded, and (to our unsuspecting eyes) even the manifest looked fine. The secret's value, however, was obviously left unchanged and incorrect.

So my question would be whether there is a way to get a diff between the cluster state and what Helm has in its manifest. I wouldn't need to merge the differences, just be aware that differences exist. Something similar to the terraform plan command perhaps, although it doesn't have to be that sophisticated - just a heads-up on which resources have 'unknown' changes would be enough.

I could most likely script something but thought I would ask here first.
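
For example, something along these lines might already be enough (only a sketch; it assumes kubectl diff is available against the cluster in question, and server-side defaulted fields will also show up in the output):

# Dump what Helm believes is deployed for the release...
helm get manifest my-release > /tmp/my-release.yaml

# ...and let the API server diff it against the live objects;
# kubectl diff exits non-zero when differences are found.
kubectl diff -f /tmp/my-release.yaml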

Cheers,
Steve

@agolomoodysaada


commented Mar 27, 2019

@sboardwell


commented Mar 27, 2019

Hi @agolomoodysaada,

thanks, I think I've used this plugin (or something similar) before, and while it's nice and simple, it still only compares the manifests within Helm and not the actual current state of the deployed cluster resources.

You can test this by deploying, say, mysql and then manually changing the CPU limit in the deployment. This change will not show up, and Helm will also not override the manually changed value. So:

# install the chart
helm install --name my-release --namespace default stable/mysql

# change the cpu limit
kubectl -n default edit deployments.apps my-release-mysql

# compare
helm diff upgrade my-release stable/mysql
@bacongobbler

Member

commented Jul 30, 2019

Whew! @thomastaylor312 and I think we've landed upon a solution. #6124 implements a three-way merge patch strategy for Helm 3. We've tested it at this point and have found that upgrades and rollbacks now take the live state into consideration. For example, if an environment variable is changed via kubectl edit after the initial release and you call helm rollback or helm upgrade to reset the values (as in @sboardwell's case), that variable is reverted back to its previous state.
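
A quick way to see the new behaviour (a rough sketch; the release, chart, and deployment names are placeholders):

# Install a chart, then make an out-of-band change to the live object
helm install my-release ./foo
kubectl set env deployment/my-release-foo LOG_LEVEL=debug

# With the three-way merge, the live state is part of the patch calculation,
# so upgrading (or rolling back) with the unchanged chart reverts the manual edit
helm upgrade my-release ./foo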

Please feel free to give it a spin and provide feedback. We will share more information about the technicalities behind this change in the following days. For now, I need a brain break. 😂

@sboardwell


commented Jul 31, 2019

@bacongobbler - fantastic news! Great work, and I can understand the need for a brain break 😄.

Re: rollback / upgrade to reset.
Would you need to roll back to a previous version, or upgrade to a new version? Or would a reset-type action be possible by executing a "pseudo-rollback/upgrade" using the current version?

@bacongobbler

Member

commented Jul 31, 2019

Either/or. The value just needs to be something different than what's in the live state.

We've tested the following. All three revert to the intended state:

  • rolling back to an older release
  • rolling back to the current release (i.e. helm upgrade, kubectl edit, then helm rollback)
  • calling helm upgrade
@bacongobbler

Member

commented Aug 6, 2019

closed via #6124.

Helm 3 automation moved this from Backlog to Done Aug 6, 2019
