
Fix ApplyTask patch behavior #1332

Merged: 16 commits, Feb 28, 2020

Conversation

@ANeumann82 (Member) commented Feb 4, 2020

What this PR does / why we need it:
The ApplyTask patch method currently uses a StrategicMerge patch, which by default merges some lists, for example, containers in a Deployment-PodTemplate. This makes it impossible to have a conditional container in an operator.

Added a test harness test.

Fixes #1286

Issue
The Big Problem is that applying new versions of rendered resources from KUDO to the cluster is not as easy as it seems: We generally render full resources from the OperatorVersion templates, but there is no easy way to update existing resources in the cluster:

Patch with StrategicMerge
This is the solution we use at the moment. The issue is that a strategic merge does not replace certain lists - for example, the list of containers in a PodTemplate is merged, not replaced, by default. Whether a list (or map) is merged or replaced depends on the annotations in the resource types:

	// +patchMergeKey=name
	// +patchStrategy=merge
	Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,2,rep,name=containers"`

It is possible to override this behavior for strategic merges: https://www.disasterproject.com/kubernetes-kubectl-patching/
We can add a $patch: replace entry to a list or a map to replace it instead of merging it. But we can't add these entries to all maps and lists, as that would trigger the replacement of immutable fields, which leads to an error from the k8s API. We would have to maintain a list of all the places where these patch entries should be added.
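For illustration, here is a minimal sketch of how the $patch: replace directive changes the merge result. This is not part of the PR; it only assumes the k8s.io/api and k8s.io/apimachinery modules:

	package main

	import (
		"fmt"

		corev1 "k8s.io/api/core/v1"
		"k8s.io/apimachinery/pkg/util/strategicpatch"
	)

	func main() {
		// The resource currently in the cluster: a pod spec with two containers.
		original := []byte(`{"spec":{"containers":[{"name":"main","image":"app:1"},{"name":"sidecar","image":"side:1"}]}}`)

		// Plain strategic merge patch: because Containers is annotated with
		// patchStrategy=merge / patchMergeKey=name, the omitted "sidecar"
		// container survives the patch.
		merged, err := strategicpatch.StrategicMergePatch(original,
			[]byte(`{"spec":{"containers":[{"name":"main","image":"app:2"}]}}`),
			corev1.Pod{})
		if err != nil {
			panic(err)
		}
		fmt.Println(string(merged)) // still contains "sidecar"

		// With the $patch: replace directive in the list, the whole container
		// list is replaced and "sidecar" is removed.
		replaced, err := strategicpatch.StrategicMergePatch(original,
			[]byte(`{"spec":{"containers":[{"$patch":"replace"},{"name":"main","image":"app:2"}]}}`),
			corev1.Pod{})
		if err != nil {
			panic(err)
		}
		fmt.Println(string(replaced)) // "sidecar" is gone
	}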

Replace
We could replace resources instead of patching them. This approach does not work, as a lot of resources have immutable fields that can't be updated - for example ServiceSpec.ClusterIP. As this field is usually filled in by k8s, it is not set when we generate the resource from the KUDO template. A replace would then fail because ClusterIP would be updated to "", which is not allowed.

It might be possible to Force-Replace. This allows us to update immutable fields by doing a delete/create instead of a replace; the downside is that it would, for example, assign a new ClusterIP and be a disruptive update (https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#disruptive-updates)

ServerSideApply
This should theoretically work best. All creations and updates should be done via:

	c.Patch(context.TODO(), newObj, client.Apply, client.FieldOwner("KUDO"))

and k8s should figure out what actually changed and apply it. There are some problems with the integration tests (no darwin binaries for kube-apiserver > 15.5), and bugs in the server-side apply feature itself. Might be an option for the future.

Client Side Threeway-Diff
This is the only real way to do this at the moment.
The first approach was to reuse code from kubectl apply, but that fails for various reasons: kubectl isn't really meant to be used as a library.

I rewrote the code along the lines of kubectl apply; it still needs some polishing, but it should work for now.
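For context, a client-side three-way strategic merge in the spirit of kubectl apply can be sketched roughly as below. This is not the PR's actual implementation - the helper name is made up - but the annotation key matches the kudo.dev/last-applied-configuration annotation this PR introduces, and the strategicpatch calls come from k8s.io/apimachinery:

	package task

	import (
		"k8s.io/apimachinery/pkg/util/strategicpatch"
	)

	// The previously applied configuration is stored in this annotation on the resource.
	const lastAppliedConfigAnnotation = "kudo.dev/last-applied-configuration"

	// threeWayStrategicPatch is a hypothetical helper: "original" is the last
	// applied configuration read from the annotation, "modified" is the newly
	// rendered resource, and "current" is the live object from the cluster.
	func threeWayStrategicPatch(original, modified, current []byte, dataStruct interface{}) ([]byte, error) {
		lookup, err := strategicpatch.NewPatchMetaFromStruct(dataStruct)
		if err != nil {
			return nil, err
		}
		// overwrite=true resolves conflicting fields in favor of the newly
		// rendered resource, mirroring the `kubectl apply` default.
		return strategicpatch.CreateThreeWayMergePatch(original, modified, current, lookup, true)
	}

The resulting patch would then be sent with a strategic-merge patch type (or a plain JSON merge patch for types without strategic-merge metadata), e.g. via controller-runtime's client.RawPatch(types.StrategicMergePatchType, patch).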

Changed task_apply to use update instead of patch

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
More testing with strategic merge patches

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Implemented correct three way merge for apply task

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Use kudo.dev annotation for lastAppliedConfig

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>

# Conflicts:
#	pkg/engine/task/task_apply.go
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
@zen-dog zen-dog changed the title from "Fix Task_Apply patch behavior" to "Fix ApplyTask patch behaviour" Feb 11, 2020
@zen-dog zen-dog changed the title from "Fix ApplyTask patch behaviour" to "Fix ApplyTask patch behavior" Feb 11, 2020
@zen-dog (Contributor) left a comment

Nice work. I left a bunch of nits (mostly around naming) and one question about what happens with existing operators after people upgrade. PTAL

Review threads (resolved) on pkg/engine/task/task_apply.go, pkg/engine/task/task_delete.go, and pkg/util/kudo/labels.go
A Contributor commented on this line:

	return nil, nil
	}

	original, ok := annots[kudo.LastAppliedConfigAnnotation]

I got stuck at this line. Are we backwards compatible after this is merged? In a scenario where someone re-runs a deploy plan for an existing operator (after updating KUDO), all plan resources will be patched, correct? However, there is no LastAppliedConfigAnnotation, so it will return nil, nil and the subsequent merges will... fail?

@ANeumann82 (Member Author) replied:

Yeah, this is a problem. I'll try to figure something out about this...

@ANeumann82 (Member Author) commented:

This PR introduces a backwards incompatible change:

When used on an already installed instance of an operator, the deployed resources are missing the new "kudo.dev/last-applied-configuration" annotation, which makes the patch process fail, as no originalConfiguration can be found for the ThreeWayMerge.

There are multiple options to work around this issue, all of which are a bit problematic:
(1) Bail out on old deployed Instances - This is the current state after this PR (see the sketch below)
(2) Automatically Delete/Recreate the resource - This is problematic, as we would make a disruptive update without the user explicitly acknowledging or allowing it.
(2a) Allow the user to specify a --force flag when using k kudo update and then use Delete/Recreate. This is detailed in #1335 and is probably the most usable approach.
(3) Patch the resources as we did before this PR - Not a good solution, as we would leave the deployed instances in weird, unknown states, which doesn't solve anything.
(4) Calculate the originalConfiguration from the old parameters and metadata of the Instance - This would theoretically work, but it is a lot of effort and probably error-prone, as there are a lot of edge cases and things to consider.
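To make option (1) concrete, a purely hypothetical sketch - the helper name and error text are made up, while the annotation lookup mirrors the annots[kudo.LastAppliedConfigAnnotation] line quoted above:

	package task

	import (
		"fmt"

		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	)

	const lastAppliedConfigAnnotation = "kudo.dev/last-applied-configuration"

	// lastAppliedConfig returns the stored original configuration, or a
	// non-transient error when the resource was deployed by an older KUDO
	// version and the annotation is missing (option 1: bail out).
	func lastAppliedConfig(obj metav1.Object) ([]byte, error) {
		annots := obj.GetAnnotations()
		original, ok := annots[lastAppliedConfigAnnotation]
		if !ok {
			return nil, fmt.Errorf("resource %s has no %s annotation; it must be recreated before it can be patched",
				obj.GetName(), lastAppliedConfigAnnotation)
		}
		return []byte(original), nil
	}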

@zen-dog (Contributor) left a comment:

We need to announce/changelog the breaking change. Otherwise, LGTM

@gerred (Member) commented Feb 12, 2020:

We need to think about this one. It's one thing to make breaking changes without some form of automated migration on the operator development side; it's another thing entirely to do something that can cause data loss. Can we introduce some sort of migration to handle this in the future? Or, while noting this, can we at least come up with a clear way for any existing users of our operators to preserve their data? Even if we don't have any users now, this is a good practice to establish early.

I understand there's probably no great way to actually perform a migration of existing instances though, and overall the PR itself LGTM.

@ANeumann82 (Member Author) commented:

@gerred I have a proposal to use --force on a parameter update, see #1335. That would allow users to continue using their existing installation, at the downside of having one disruptive update of the operator.

As long as the operator does not manually create a PV, that should not lead to any data loss.

@zen-dog (Contributor) commented Feb 12, 2020:

it's another thing entirely to do something that can cause data loss.

In the proposed implementation, the KUDO manager won't find the needed annotation, the three-way merge will fail, and so will the step/plan (a transient ERROR, if I'm correct). So no data loss should occur, @gerred. A human operator, however, will have to recreate the Instance to move on.

@ANeumann82 (Member Author) commented:

@zen-dog Not a transient error anymore. In my last commit I changed a couple of the errors to fatalExecutionError, as they won't recover.
Apart from that, you're correct: no data loss (but no change to the deployed resource is possible either).

@gerred (Member) commented Feb 12, 2020:

Right, but we're telling them the only way to move forward is to re-create the instance. Deleting the instance will destroy the existing instance. It might not get rid of PVs, but we're now in a manual situation where data loss can occur if I don't manually ensure that my PV is used or backed up/restored, right?

@zen-dog (Contributor) commented Feb 12, 2020:

True, this is exactly like any prior breaking-change situation in which instances had to be recreated for the new KUDO version.

@ANeumann82 (Member Author) commented:

@gerred As you already mentioned, this would be a prime example for a migration process. I'll take that chance and prioritise working on KUDO upgrades and try to get a migration for this in.

@ANeumann82 ANeumann82 added the release/breaking-change label ("This PR contains breaking changes and is marked in the release notes") Feb 28, 2020
@ANeumann82 ANeumann82 merged commit 8afbb97 into master Feb 28, 2020
@ANeumann82 ANeumann82 deleted the an/update-instead-of-patch branch February 28, 2020 08:35
runyontr pushed a commit that referenced this pull request Mar 11, 2020
ApplyTask now uses correct ThreeWayMerges (either plain JSON or Strategic K8s merges) to apply the task, the same way `kubectl apply` does.

It stores the last applied version of the resource in an annotation of the resource so that the merge can be done correctly on the next apply.

As already-applied resources do not have this annotation, the ApplyTask cannot calculate the correct patch and fails; this is the breaking change.

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Thomas Runyon <runyontr@gmail.com>
Labels
needs review; release/breaking-change (This PR contains breaking changes and is marked in the release notes)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Removal of field in update is not applied
4 participants