Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stargate templates should be coalesced field by field #526

Closed
wants to merge 8 commits into from

Conversation

adutra
Copy link
Contributor

@adutra adutra commented Apr 18, 2022

What this PR does:

This PR uses Mergo to merge Stargate and Image templates.

Note that it wasn't possible to do the same for Cassandra templates because the cluster and DC level structs are disjoint. See #508 for a fix.

Which issue(s) this PR fixes:
Fixes #525

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@adutra adutra requested a review from a team as a code owner April 18, 2022 15:07
@Miles-Garnsey
Copy link
Member

Miles-Garnsey commented Apr 19, 2022

This is just a general comment to surface some of my own findings from when I looked at using Mergo. Don't take it as criticism, I think this PR is a good idea and I'm happy to review it if you'd like @adutra.

Mergo will probably work fine here, but may present some issues elsewhere in the k8ssandra ecosystem. The problem is that k8s thinks of objects in terms of json, so the patch strategies are either json6902 or json7396.

There are relatively easy ways to make either of those work (even if they might involve marshalling/unmarshalling to/from json or yaml). The hitch is that k8s modifies the json7396 standard to include a novel "merge key" for fields declared with patch strategy 'merge' (see the container API fields envVar or containerPort for examples). When this strategy is in use, arrays will be merged using the merge key instead of by index.

K8s calls these types "associative lists" and they throw a huge spanner in the works in several ways. Firstly, it is hard to find libraries that handle this type of merge. But more importantly, libraries like Mergo don't support these types well, because the merge keys and strategies are defined via struct tags in the parent object, while a Mergo custom transformer has access only to the object being merged.

To overcome these problems, I tried several approaches, including:

  • Using Mergo.
  • Trying to re-use elements of k8s.io/apimachinery/pkg/util/strategicpatch.
  • Trying to use elements of structured-merge-diff (which is part of SSA).
  • Looking at the implementations in kustomize to see if something there might be reusable.
  • Using reflect to write a custom implementation.

I basically got to the last one, looked at the code in structured-merge-diff, and realised that this is a hard problem which might take weeks to complete. I do plan to return to it as priorities permit, but Mergo is a decent solution in the meantime.

@adutra
Copy link
Contributor Author

adutra commented Apr 19, 2022

This is just a general comment to surface some of my own findings from when I looked at using Mergo. Don't take it as criticism, I think this PR is a good idea and I'm happy to review it if you'd like @adutra.

Thank you for your insights @Miles-Garnsey, your feedback is most welcome and not seen as a criticism at all :-)

Mergo will probably work fine here, but may present some issues elsewhere in the k8ssandra ecosystem. The problem is that k8s thinks of objects in terms of json, so the patch strategies are either json6902 or json7396.

I wrote a small comment here about the issue and how Mergo fits within the problem space.

Indeed Mergo cannot emulate the strategic merge patch behavior of Kubernetes.

But taking a step back, let's define the scope of what we want to do here. We're not trying to merge or patch different versions of the same resource; we are rather trying to "coalesce" together similar portions of the K8ssandraCluster spec: namely, all those "templates" (as we call them) that can be defined "at cluster level" and "at DC level" (and sometimes, even "at rack level").

An immediate corollary of that that is worth mentioning: we don't need to convert these objects to or from json nor yaml. In general, everything happens inside the reconciler code, programmatically. Most of the Kubernetes machinery created around the merge strategy for associative arrays is done to allow for those types to be converted to and from json; but this doesn't apply here, and that kind of plays in favor of a simpler, reflection-based solution.

k8s modifies the json7396 standard to include a novel "merge key" for fields declared with patch strategy 'merge' [...] When this strategy is in use, arrays will be merged using the merge key instead of by index [...]

Indeed, Kubernetes associative lists are appealing, however I'm not sure we have an actual use case for them in our templates. We do have fields of type []Toleration in the Stargate templates: these would maybe be good candidates for associative lists, but quite frankly, we can live without them as the workaround is simple: all the user has to do is repeat the cluster-level tolerations in the dc-level tolerations slice.

Firstly, it is hard to find libraries that handle this type of merge. But more importantly, libraries like Mergo don't support these types well, because the merge keys and strategies are defined via struct tags in the parent object.

Well I've got news for you :-)

I came to the exact same conclusion last week and decided to spend my weekend (yeah, no comments) on it.

The result is a new library, Goalesce:

https://github.com/adutra/goalesce

My goal was precisely to overcome all the shortcomings in Mergo:

  • Ability to merge maps of structs
  • No in-place modifications, but rather allocate a new object to hold the merged result
  • Support for struct tags to modify the coalescing behavior of a specific field
  • And above all: many slice coalescing strategies:
    • set-union: the elements are coalesced in a set-like slice
    • list-append: all elements appended (possibly with duplicates)
    • merge-by-index: elements coalesced based on their indices
    • merge-by-key: elements are coalesced based on a user-provided merge key.

The last strategy is exactly the same as Kubernetes strategic patch merge.

I would love to get your opinion on Goalesce. The test coverage is 100%. If you think we should switch to Goalesce, I'd be more than happy to do so. Also if you have your own coalescing solution ready, I'd love to have a look.

But, as you said, until we decide on a new library, I think Mergo is the best, or let's say, the least inconvenient choice.

@Miles-Garnsey
Copy link
Member

I wrote a small comment #525 (comment) about the issue and how Mergo fits within the problem space.

I think your commentary is pretty much spot on. I'd only add:

  • Using Mergo means we can't use associative lists in future for Stargate.
  • We can't currently use Mergo in cass-operator because the podTemplateSpec already uses associative lists.

I'm excited to see that you might have an alternative library that overcomes these constraints. I wasn't able to get my own library into a usable state at all, so we should go for yours if you think it is ready. (And I will add it to my list of things to review).

Until we move to use Goalesce, using Mergo as an intermediate step makes sense to me and I think that - after review -we should merge this PR (given the work is done) and look to adopt goalesce as a second step to overcome the two issues I raised above.

If that all makes sense to you, I'm happy to review this PR so that we can start down that path.

@adutra
Copy link
Contributor Author

adutra commented Apr 20, 2022

Awesome I think we agree on the roadmap here. Please review this PR if you have time! In the meanwhile I will get a v0.0.1 release out for Goalesce and do a manual test to check how it would merge two PodTemplateSpec objects together with associative lists.

Copy link
Member

@Miles-Garnsey Miles-Garnsey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be on the right track, from my understanding of this part of the codebase.

I need to examine whether there are enough remaining unit tests to provide full coverage (that'll take me a little bit).

I do have a bunch of questions (e.g. why are my attempts to merge a simple struct not working with Mergo?) and a few suggestions as well.

apis/stargate/v1alpha1/stargate_types.go Outdated Show resolved Hide resolved
apis/stargate/v1alpha1/stargate_types.go Outdated Show resolved Hide resolved
apis/stargate/v1alpha1/stargate_types.go Outdated Show resolved Hide resolved
func (in *StargateRackTemplate) Coalesce(dcTemplate *StargateDatacenterTemplate) *StargateTemplate {
// Merge merges the given StargateDatacenterTemplate with this StargateRackTemplate and returns the
// result.
func (in *StargateRackTemplate) Merge(dcTemplate *StargateDatacenterTemplate) (*StargateTemplate, error) {
Copy link
Member

@Miles-Garnsey Miles-Garnsey Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naive question perhaps, but wouldn't this always go the other way? My thinking is that you would always have
Cluster -> Datacenter -> Rack, with fields defined in types to the right always taking precedence over those defined in types to the left.

So you might have a method signature: func (in *StargateDatacenterTemplate) Merge(rackTemplate *StargateRackTemplate) (*StargateTemplate, error) instead of this one to make that structure clearer?

Another question: do we want to define Merge methods on these types? It would be nice if we could avoid defining any methods at all and just call Merge(stargateDatacenterTemplate.rackTemplate, rackTemplate) (but I realise this depends on the behaviour when one side or the other is nil).

On a related note, IMO a merge function should always have function signature func (a Type1) Merge (b Type1) (Type1, error) whereas here we have func (a Type1) Merge (b Type2) (Type3, error), which isn't great for consistency with the existing codebase and might get us confused in the long term.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you might have a method signature: func (in *StargateDatacenterTemplate) Merge(rackTemplate *StargateRackTemplate) (*StargateTemplate, error) instead of this one to make that structure clearer?

Currently we have two merge methods where the semantics are: the receiver receives a template and apply non-zero values from that template to zero values in itself, and returns the result.

What you suggest would be: the receiver receives a template and applies its non-zero values to all of the template's zero values, and returns the result.

I don't have a strong opinion here, but what we have seems to me slightly more straightforward that the other way around.

do we want to define Merge methods on these types? It would be nice if we could avoid defining any methods at all

Yes, I am wondering myself. For Cassandra types, we don't have methods currently, but a simple utility function. I can do the change if we agree that a function is better than methods on API types. (For the nilness, we'd still need to check that before passing the objects to Mergo).

On a related note, IMO a merge function should always have function signature func (a Type1) Merge (b Type1) (Type1, error) whereas here we have func (a Type1) Merge (b Type2) (Type3, error), which isn't great for consistency with the existing codebase and might get us confused in the long term.

I agree on the principle but Stargate types have a complex hierarchy in fact. I will see if the hierarchy itself could be simplified somehow, but that might lead us out of the scope of this PR.

Also, tbh, I think the most idiomatic behavior in Go for a Merge method would be to mutate the receiver in-place, not return the merge result. That also advocates for a utility function rather than a method, in fact.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, tbh, I think the most idiomatic behavior in Go for a Merge method would be to mutate the receiver in-place, not return the merge result. That also advocates for a utility function rather than a method, in fact.

Yes, we should follow the structure that Kubernetes uses with the DeepCopyInto instead of creating a merged copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do the change if we agree that a function is better than methods on API types.

Done in d56b14a, let me know what you think @Miles-Garnsey.

Copy link
Member

@Miles-Garnsey Miles-Garnsey Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things here:

Most recent commit

The most recent commit is looking a fair bit better to me and is definitely a clearer read. I do have some comments that I've put inline.

Pointers, purity etc.

I think the function should remain pure. There's no reason to mutate inputs here and it makes the code harder to reason about.

As it stands, the function is pure, but the signature is a little misleading. When you take pointer inputs, it leads the reader to the assumption that you'll mutate the objects in-place. It is a bit odd to then see a pointer as an output. It might be good for clarity to just avoid the use of pointers here.

(I also acknowledge that I was passing pointers around in my own Merge function, so I'm guilty of setting a bad example here.)

Methods vs functions

When I asked whether we had to define these methods, I was actually wondering if we couldn't call mergo.Merge() directly. As you say, this doesn't work if either input is nil, so we do need to wrap it.

I have no strong preference between methods/functions, what you have looks good to me. I also very much appreciate the way that Merge inputs/outputs a single type now. Much clearer!

I did idly wonder if there is a way that we can make this function take an interface{} so we can avoid defining it for every type. I don't mind either way since once generics are available we can probably safely use them for this task. No need to make any further changes for now though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing @Miles-Garnsey and sorry for the noise. In a comment that I can't find anymore, you referred to these methods.

I wasn't aware of those, my apologies. Indeed they are good candidates for Mergo. Do you want me to change them as well? Judging from the implementation though, it seems you designed them so that the passed argument overwrites the receiver. I think the same could be achieved with Mergo using WithOverwriteWithEmptyValue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the default behavior (true replaces false) is not desired for a given field, then it is a sign that that field should rather be a *bool.

I agree with this, but it doesn't solve our problem where a user defines things at a cluster level but has one 'odd' DC or rack that must be treated differently and which they want to explicitly set false.

Even if the value is a *bool, a a *true defined at the cluster level will still override a *false at the rack/DC level.

The problem with this logic is that if a value is set true anywhere in the cluster or DC levels, the lower DC/rack levels can't ever override it. I do not think that is the behaviour we want, is it?

But then this wouldn't be a merge anymore, would it? You would be effectively skipping the field from merge.

If you look at my Merge logic from the original methods I wrote here you'll see that the behaviour differs from what we have here in two ways:

  1. Fields from b are always merged into a if they are non-nil in b. This will happen even if they are non-nil (AND non-zero) in a. All values defined in b override those defined in a.
  2. The logic doesn't hinge on zero values, whether a value is zero or not, if a value in b is non-nil, it will override the value in a.

For a concrete example of how this might be a problem, my Stargate telemetry data structures and the current Mergo behaviour is going have the following behaviour:

  1. A user has set up 2 DCs and both have Prometheus enabled for Stargate at cluster level.
  2. They go to set up another DC where Prometheus isn't available (I don't know why, maybe they're using a different solution in that DC).
  3. They will set Prometheus = false in that DC. But that setting will be overwritten with true 🙀

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I've done some digging and I'm uncertain whether WithOverwriteWithEmptyValue will work to avoid the issues I'm raising. When I use it here I'm seeing behaviour where the lack of a value in s2 overwrites s1's Field2 with nil.

This isn't what we want, we're trying to implement proper 3 value logic here for bools where a nil is always overwritten, but which value wins is decided by whether it is on the left or the right, not by whether it is true or false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've replied offline already but I'm repeating my thoughts here for visibility.

The default merge behavior is IMO the right one.

For boolean fields, the default merge behavior results in the following truth table:

Cluster-level value DC-level value Merge Result Comments
false false false
false true true DC-level explicit opt-in
true false true Cluster-level opt-in (imposed on all DCs)
true true true Cluster- and DC-level explicit opt-in

You'll notice that this is a standard boolean OR truth table. We disagree on row 3: you would like that to result in false (=explicit DC-level opt-out), I would like that to result in true (forced opt-in).

As stated already, my main reason for advocating for true here is that there is no way in Go to distinguish between:

  • a false that was set explicitly by the user to clearly express their intent to opt-out, as in Foo{Enabled: false};
  • and a false that was simply the result of the user not setting any explicit value, which could be interpreted as their intent to implicitly inherit whatever value was defined at the cluster level, as in Foo{}.

Due to this ambiguity, I'm in favor of letting the cluster level true value win.

You'd rightfully argue: but how can we emulate an explicit DC-level opt-out (that is, set DC level to false) if the cluster-level is set to true (opt-in)? We'd simply switch to *bool and a ternary logic as below:

Cluster-level value DC-level value Merge Result Comments
nil nil nil Apply convention: true or false
nil false false DC-level explicit opt-out
nil true true DC-level explicit opt-in
false nil false Explicit cluster-level opt-out
false false false Cluster- and DC-level explicit opt-out
false true true DC-level explicit opt-in overrides explicit cluster-level opt-out
true nil true Cluster-level opt-in (imposed on all DCs)
true false true Cluster-level opt-in (imposed on all DCs)
true true true Cluster- and DC-level explicit opt-in

Note that with the above table, we have the possibility of an explicit DC-level opt-out (row 2).

Again I suppose you would like to change row 8 to yield false. In this case indeed, the false value is not a zero value, so we know that the value was explicitly set, and that the user intent is clearly to explicitly opt-out the DC, cancelling the cluster-level opt-in. It does make some sense to me, but why bother? With row 2, we already have another way to express the situation where something is true by default (and by convention), but can be overwritten to false at DC-level.

At this point, I'm inclined to say that we need to disagree and commit, wdyt?

Copy link
Member

@Miles-Garnsey Miles-Garnsey Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your commentary is fair. I do not believe that a Merge() is a boolean OR operation, because OR commutes and the order of the arguments matters for my conception of Merge().

I also believe that explicit opt out should be supported. It currently is, and I'm hesitant to approve here because this is a breaking change that will modify the behaviour of existing clusters (not that I think that is a hard blocker in this case, but something to be aware of).

I should also note that the Mergo behaviour w.r.t. bools differs from the merge patch strategy supported by k8s, which is (in my view) what our Merge() functions should be replicating.

More importantly, because this functionality is key to how we work with hierarchies in the CR, I think it should probably be discussed amongst the broader team. I do apologise that I'm not able to provide an answer to move us forward here, please take it as a sign of how important I think this single family of functions really is.

controllers/k8ssandra/stargate.go Outdated Show resolved Hide resolved
pkg/images/images.go Show resolved Hide resolved
pkg/images/images.go Show resolved Hide resolved
@@ -77,7 +77,7 @@ func TestImageString(t *testing.T) {
}
}

func TestImageApplyDefaults(t *testing.T) {
func TestImageMerge(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a TODO for me: I need to come back and look at how many tests remain after you've removed the below ones and how much coverage they provide.

pkg/medusa/reconcile.go Outdated Show resolved Hide resolved
pkg/stargate/deployments_test.go Show resolved Hide resolved
@adutra adutra force-pushed the stargate-coalesce branch 2 times, most recently from cb15980 to 20d99ad Compare April 29, 2022 14:37
@adutra
Copy link
Contributor Author

adutra commented May 27, 2022

Rebased on top of main and force-pushed.

@Miles-Garnsey
Copy link
Member

A note that we are doing a lot of the work on this PR on the Goalesce repo (for things I think are probably issues), and on Slack (for questions).

This is still progressing even though this ticket looks quiet.

@adutra
Copy link
Contributor Author

adutra commented Jun 2, 2022

This is still progressing even though this ticket looks quiet.

I think we should close it now, and maybe re-open one day. But there are too many impediments right now for us to be able to use any available library to achieve what we want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

K8SSAND-1460 ⁃ Cluster- and dc-level templates should be deep-merged
4 participants