
Could we just use kustomize for transformations? #3121

Open · bgrant0607 opened this issue May 10, 2022 · 28 comments

Labels: area/hydrate, enhancement (New feature or request), triaged (Issue has been triaged by adding an `area/` label)

Comments

@bgrant0607 (Contributor) commented May 10, 2022

I'm filing this braindump for discussion.

Using kustomize for configuration generation and transformation was the original plan when we first created kpt. Just stamping out variants of groups of resources could be done with kustomize, obviously. Though verbose, the replacement transformer also has the advantage over the original kustomize patches of being able to adapt to other input formats.

kpt and kustomize both were designed with Config as Data principles. The configuration data is kept separate from the customization code. This enables automated transformations and validation to be performed on the configuration data. Both tools also eschew parameter-based interfaces to bodies of configuration, eliminating the pervasive problem of excessive parameterization, and instead embrace transparently exposing and manipulating the resources contained within.

While both kpt and kustomize support customization of KRM resources via a transformation-based approach, there are some differences.

From the FAQ:

kustomize

  • Builds the final configuration out of place, primarily using the overlay pattern.
  • Treats base layers as immutable, but enables nearly arbitrary overrides.

kpt

  • Optimizes for WYSIWYG configuration and in-place customization.
  • Allows edits to the configuration in-place without creating complex patches.
  • Supports rebase with resource merge strategy allowing for edited config to be updated.
  • Enables workflows that combine programmatic changes with manual edits.
  • Aims to support mutating and validating admission control on derived packages.

The approach shifted away from kustomize as a whole when we introduced functions, originally primarily for validation. The desire for parent / upstream packages to impose admission control on derived / downstream packages is different from the kustomize "override anything" model (kind of like "inner" vs "super" in object-oriented programming languages). Support for in-place nested subpackages and the desire for GitOps friendliness created differences in file structure compared to kustomize bases. The focus on WYSIWYG, in-place configuration creates some differences also (e.g., KRM functions need to be idempotent, and additional bookkeeping data needs to be maintained). I also don't want all transformations to need to be run synchronously on a "DRY" form. I want to be able to run some generators and transformations imperatively to generate declarative output, and also want to be able to apply automation (e.g., security remediation mutations) asynchronously, analogous to Operators.

In terms of a ~minor usability issue, kustomize requires specifying all of the input files individually (not my original intent, fwiw), so we'd want to generate that list. We'd effectively want an idempotent version of "kustomize edit add resource *.yaml" (that didn't match kustomization.yaml).
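
For illustration, the generated file might look like the following, an idempotent equivalent of running kustomize edit add resource for every resource file while skipping kustomization.yaml itself (file names are hypothetical):

# kustomization.yaml, with the resource list generated from the package contents
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - configmap.yaml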

We'd also want to improve the kustomize build output formatting (e.g., file naming) to have parity with kpt.

We want to automate adding a new variant, similar to the appctl environment kustomize scaffolding generator. We would generate a new kustomization.yaml pointing at the base. We did find with that approach that users wanted to see the output of the kustomize build early, before committing config changes -- that was one motivation for the WYSIWYG approach, or at least eager rendering.

IIRC, if the base has a kustomization.yaml, it's evaluated first. At least with the classic kustomize feature set, the base can't directly modify the behavior of variant transformations, and variants can't directly modify the behavior of base transformations. That has some limitations: for instance, I don't think there's a way for a base to apply validation checks or mutators to derived variants. Patches always override the base.

Components and/or compositions may make it possible to run an upstream pipeline in the downstream Kustomization, but they would need to be constructed in an opinionated way.

I also think there's evidence that a single pipeline per layer isn't sufficient for execution order. For instance, there are known issues with generators (#2528). We at least need a registry of certain types of functions so that we can run them automatically based on the KRM types in the ResourceList, much as how controllers self-select what types they operate on in Kubernetes.

If maintaining patches in kustomize manually, the user would always write the patches with respect to the output of applying transformations to the base. In kpt, the upstream is the base. If we wanted to convert in-place edits to patches automatically, that suggests keeping the patch information somewhere, generated from the hydrated form. I don't think we'd want to generate a new overlay for each set of edits for all of history, so new patches would need to be merged into existing patches. Because other transforms could be added or changed, we'd want to re-run kustomize build before editing, sans patches, then derive patches by diffing the hydrated form with the edited hydrated form. (OTOH, git computes diffs between multiple revisions or branches.)

That approach would definitely imply the need for a layer like porch to interpose all interactions with the stored configuration. Probably all edits should be on the rendered ResourceList in that case. All analysis and validation, too. Especially if only the DRY form were stored.

Back to function inputs: Function inputs need to come from the downstream data. Applying all patches last doesn't work -- we wouldn't want to apply patches to function inputs after running the functions. We would need to explicitly identify the inputs to change the patch ordering or merge logic. Preprocessors (#2385) are a workaround rather than a solution. I doubt that even the addition of prefix, infix, and postfix execution would be sufficient. When a function should run depends on its purpose.

The ordering challenge is one of the things that convinces me that we can't just treat name-changing functions as string munging and hope for correct behavior. We need to understand that we are transforming abstract resource names to concrete resource names, and most likely need to carry the original names through transformations. As with compiler debug info, there's a variety of information we need to carry through -- filenames/paths and comments are other examples.

So maybe we could use kustomize, or parts of it, in an opinionated way under the hood, but it's unclear whether we could make this work well with arbitrary kustomization.yaml files. The introduction of new kustomize types such as Composition creates the opportunity to leverage additional pieces of kustomize, maybe. The Pipeline resource was originally separate from the Kptfile. We could go back to that approach, particularly if we plan to automatically manage all of the files.

Why does this matter?

  1. Maximizing our ability to leverage investment in kustomize
  2. Resolving some challenges with update merge behavior (e.g., upstream function preservation and ordering in the Kptfile; see #2544, "Maintain a version field, similar to upstream info")

Anyway, once we have a conclusion or consensus, we should document it and the rationale for it.

@justinsb @droot

bgrant0607 added the `enhancement` (New feature or request) label on May 10, 2022
@justinsb (Contributor)

Thanks for starting the discussion here @bgrant0607

I claim that our solution needs to be clearly defined and easy to understand. I claim this because we don't want packages to be write-only, we want content and changes to be reviewed by others, and we want everyone to have confidence both in the current state of package instances and in their state going forwards.

We have some feature requirements:

  1. We want to be able to take an upstream package and change it.
  2. We want to support direct-yaml changes ("patches").
  3. We want to be able to support updating the base package and merging in previous changes in a way that preserves the intent of those changes.
  4. We want to support functions as well, so those functions can encapsulate more complex logic, with the goal of making the base package simpler.

Today our challenge is that functions (4) and direct-yaml manipulation (2) don't always compose predictably, particularly when we start rebasing (3). For example, when we direct-edit a value that a function has written, it's not obvious whether we should end up with the value from the function or from the patch, particularly after rebasing. In kpt today, I believe the function value will always "win" (assuming we remember to rerun kpt fn render), but the concern is that this reduces the value of direct-yaml editing.
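
To make that concrete, a hedged sketch of the interaction, assuming a typical set-namespace mutator (the image tag and values are illustrative):

# Kptfile pipeline in the package:
pipeline:
  mutators:
    - image: gcr.io/kpt-fn/set-namespace:v0.4.1
      configMap:
        namespace: team-a

# If a user hand-edits metadata.namespace in a resource file to "team-b",
# the next `kpt fn render` sets it back to "team-a": the function value
# wins unless the function config itself is changed.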

We could have multiple solutions, though I think we'd all be happier with one solution that works well for everyone.

I suggest we look at the two extremes:

  1. Working purely with the hydrated configuration is one option. When someone is reading a package, they will look at all the yaml objects, rather than at how those objects were made. As an analogy, if I'm reviewing Java code with lots of getters and setters, I assume the author used an IDE to generate those, but I don't know for sure and I review the code, not the macro invocations. The upside here is that my mental model is simple - there is nothing but the objects. The big downside is that I must review all the objects, and I get no help in doing so - where did these objects come from? We also push the "update" problem onto users - likely they would re-execute the same functions, but we have no way of helping them or enforcing that.
  2. Working purely with a linear pipeline of transformations (including patches) is another option, and this option is probably most similar to kustomize (a minimal kustomization-style sketch of this option follows the list). When someone is reviewing a package, we would expect them to look primarily at the pipeline, and to mentally think through the transformation steps (which puts a practical limit on how complex they should be). We can of course show the object yaml (even at each stage) if that is helpful. We can support patches here, but for more complicated transformations the ordering of the patches might matter (a problem you referenced with "Function inputs need to come from the downstream data"). We could encourage function arguments to come from the pipeline, but this reduces the power of the system (though it might still be a valid tradeoff). Updating the base package feels like an easy model to understand here, and will likely "do the right thing" (circularly assuming the intent is accurately described by the pipeline!).
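
A minimal sketch of option 2 in kustomize terms (paths and names are illustrative); the reviewer reads this pipeline rather than the hydrated objects:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base            # the upstream package
namePrefix: dev-       # a semantic transformation
patches:
  - path: set-replicas.yaml
    target:
      kind: Deployment
      name: example-app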

I think the linear pipeline model is workable, but only if we effectively disallow "Function inputs need to come from the downstream data". Personally I think that's a great place to start (we can always relax restrictions, it's hard to tighten them).

Let's assume that's not acceptable though - that for example we want objects that generate objects, or we want iterative evaluation of functions. I think we have to make some change to the kpt model to allow that. Personally I struggle to find a sweet spot that still allows/encourages "function inputs coming from the downstream data"; because I think that makes it harder to understand the pipeline and pushes me to read both the pipeline and the yaml, which feels like more work than just reviewing the pipeline. I'll think more about this, but this is why I favor starting with the simple one-off linear pipeline.

@bgrant0607 (Contributor, Author) commented May 10, 2022

With respect to functions "winning": I want functions to win, in the case that they are intended to act as mutating admission control. This is the "inner" model of extension: the base decides how it can be extended. Other functions may be intended more as authoring automation. Those could be invoked imperatively, but in some cases we will need to distinguish between these two purposes.

With respect to working with hydrated configuration: I am working towards a future where people don't need to deal with YAML directly. It's a data representation. I wouldn't want a human to review all the low-level changes. Reviewing the DRY form can be complicated, too, which is why some people currently review both DRY and rendered forms. I'd prefer to be able to add validation automation and analysis, and to use UX techniques to mitigate complexity of complicated domains.

With respect to a linear pipeline of transformations: I am pretty sure we'd have to generate the order from higher-level semantics, rather than allow the user to specify the order directly.

If we disallowed function inputs from downstream sources, I'm not sure how one would provide inputs to transform an abstract package into a deployable package. Or does that just mean all inputs would have to be specified in the downstream Kptfile?

@justinsb (Contributor) commented May 10, 2022

> With respect to functions "winning": I want functions to win, in the case that they are intended to act as mutating admission control. This is the "inner" model of extension: the base decides how it can be extended. Other functions may be intended more as authoring automation. Those could be invoked imperatively, but in some cases we will need to distinguish between these two purposes.

I agree with the idea that functions have two purposes - @droot and I were talking about "hard-control" and "soft-control" of values. "Hard control" is where the value shouldn't be changeable other than through the function, for example because it's core to the identity or nature of a package instance, such as the namespace, and we don't want to validate the function only to then find that all the values have been changed. "Soft-control" is where it is more a suggested value, such as setting resource limits on podspecs, but users might well want to override the values.

We might need to further differentiate between authoring of a blueprint package and authoring of a package instance. I would expect more "soft" invocations in the package instance, because I think we want to keep the blueprint minimal to maximize reuse. As you point out, "soft" invocations might not carry across kpt pkg get, if they are considered blueprint authoring accelerators.

> With respect to working with hydrated configuration: I am working towards a future where people don't need to deal with YAML directly. It's a data representation. I wouldn't want a human to review all the low-level changes. Reviewing the DRY form can be complicated, too, which is why some people currently review both DRY and rendered forms. I'd prefer to be able to add validation automation and analysis, and to use UX techniques to mitigate complexity of complicated domains.

Agreed, but I worry that there isn't yet trust that we can perform enough automated validation of the rendered configuration to make this safe. Moreover, people still seem to think in terms of the DRY steps, in my opinion - what is the version of the upstream package, what "core" values were provided, what additional functions/patches were made. That said, I'm intrigued by the idea of techniques that do anything other than provide something equivalent to the pipeline - did you have something in mind?

> With respect to a linear pipeline of transformations: I am pretty sure we'd have to generate the order from higher-level semantics, rather than allow the user to specify the order directly.

It would certainly be better if the order didn't matter. I think one of the lessons of kustomize is that requiring the order not to matter, or enforcing an order on steps, is sometimes limiting and makes it harder to write transformations. I think a good place to start could be warning if the order was significant in a pipeline. (?)

> If we disallowed function inputs from downstream sources, I'm not sure how one would provide inputs to transform an abstract package into a deployable package. Or does that just mean all inputs would have to be specified in the downstream Kptfile?

Exactly that, yes. All function arguments would be in the downstream Kptfile. Most packages would likely have one or more recommended / required functions, and the downstream Kptfile would fork the upstream package and then invoke those functions. There are a few possible mechanisms here - for example the upstream Kptfile could declare that some functions are required / recommended and the forking could ensure they are included in the downstream Kptfile. Another way would be to consider the function and the package together, so users could specify only the function and then the package would be implicit. This is more similar to how operators work, which I think could be a nice synergy. The other thing I like is that these function arguments are likely the "headline" values that should be shown in a UI, and we can imagine associating a high-level UI with the function invocation; that's harder to do with a raw package.
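
A hedged sketch of such a downstream Kptfile (the repo URL, function image/version, and values are illustrative, not a spec):

apiVersion: kpt.dev/v1
kind: Kptfile
metadata:
  name: my-instance
upstream:
  type: git
  git:
    repo: https://example.com/blueprints
    directory: /app
    ref: v1.0
  updateStrategy: resource-merge
pipeline:
  mutators:
    # the arguments for the upstream's recommended/required function live
    # here in the downstream Kptfile, not in the resources themselves
    - image: gcr.io/kpt-fn/set-namespace:v0.4.1
      configMap:
        namespace: my-instance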

@bgrant0607 (Contributor, Author)

Regarding coupling functions to packages: I explicitly do not want to require writing new functions for every package. At that point, one might as well use one of the "as code" tools. I see the same patterns recur for packages that contain particular resource types. I would like to be able to reuse functions implementing those patterns. One example: resource generators. As with imperative tools, I'd like to invoke them once to generate the declarative representation, and then not need to repeatedly invoke them. They should be authoring accelerators, not exclusive generator implementations. If we did degenerate to that approach, we'd also suffer parameter proliferation, and there would be no "headline" values. I do think we need to associate functions with specific resource types, as with Operators, but not with packages, which should be unencapsulated bundles of resources.

@justinsb (Contributor)

> I explicitly do not want to require writing new functions for every package.

I don't think we would require writing a new function. But I think your point is that if a package had 3 required functions, we should keep the identity of those 3 functions, rather than combining them into one function. That feels like an approach we should add to the list of viable approaches that we're building here. (I'll try to write it up directly next)

> If we did degenerate to that approach, we'd also suffer parameter proliferation,

It would certainly be possible to over-parameterize, but that's always going to be the case once we allow functions. Generally we should talk more about why over-parameterization is painful and steer people away from functions with lots of parameters.

If we encourage people to write simple base packages that are the lowest-common-denominator, and for package authors to express changes in terms of minimal, composable functions (or encouraging specific patches), I think we will be in good shape. I don't know that our decisions here will influence that particularly.

> I do think we need to associate functions with specific resource types, as with Operators, but not with packages, which should be unencapsulated bundles of resources.

I am thinking of the output of functions as unencapsulated bundles of resources, i.e. the same as packages. I can't think of any reason why we need to treat the output of a function differently from a raw-yaml package. The one area is that perhaps we want to demarcate which values are "hard-controlled" and which are "soft-controlled", but functions probably have the advantage there. I then think of packages as parameterless functions from a consumption point of view; and from an authoring point of view I see packages as an enabler for writing generator functions. (i.e. everything works together and becomes more powerful, IMO, if we follow this line of thinking)

@justinsb (Contributor)

To try to document the approach you're exploring/proposing @bgrant0607:

A blueprint package has a set of recommended & required functions. When we create an instance of that blueprint package, the required functions must be invoked (and the user must provide any missing parameters), and recommended functions can be invoked (at creation time, maybe also in subsequent changes).

Generally, using a TBD mechanism, functions can produce "hard-controlled" values where it is an error if the user tries to change the value with a patch or function.

If we make all function values hard-controlled, I do think we avoid the problem of conflicting yaml edits. I don't think we've fully solved the ordering problem - particularly if we allow function arguments to come from the resources instead of "out-of-band" (the Kptfile). If a user makes a direct yaml-patch, I think we need to be sure to execute / re-execute the functions that read that value.

This model feels more complicated than the linear pipeline model (though there is likely a simplification I'm missing). Unless we literally want to treat this like controllers & admission-controllers. We could absolutely say that "kpt packages produce the same result as running kubernetes with corresponding admission controllers and controllers" - that feels like a well-defined model also. I know you've mentioned this a few times - was that just an analogy, or do we want to treat this as a specification of the behaviour?

@bgrant0607 (Contributor, Author)

"A blueprint package has a set of recommended & required functions"

Not all functions used on resources in a package need to be specified in the Kptfile. Imperative generation and asynchronous mutation need to be viable; excluding them is extremely limiting, crippling in fact. For example, the UI plugin generates the resources added to a blueprint.

Not all functions used on resources in a package need to be associated with a specific package. Reusable, off-the-shelf functions are applicable to particular sets of resource types, and may be extensible to support additional types, as with kustomize transformers. In particular, I want to encourage horizontal transforms.

Maybe "package" and "blueprint" aren't the best terms. I want to make configuration generation and transformation cheap and easy, not artisanal. Reusability should come from functions that aren't coupled to specific packages, but are usable across all packages that share some common contents, whether those packages were explicitly derived from the same source or not.

Some programming language analogies:

  • Most configuration generators use a strongly encapsulated object-oriented approach. These generators degenerate to struct constructors as the generators become widely used. All the resource attributes exist for reasons, so some use case will want to access each of them. Struct constructors aren't like reusable library routines with stable, well defined semantics and interfaces that hide their implementations. OTOH, horizontal concerns are similar to aspects from https://en.wikipedia.org/wiki/Aspect-oriented_programming, which cut across object boundaries. We need ways to identify the "join points", or places where the logic should apply. kustomize transformers use lists of resource field paths. Ideally we'd be able to attach information to OpenAPI schemas.
  • Related to that, we should leverage duck typing (https://en.wikipedia.org/wiki/Duck_typing) wherever possible. Strongly typed generators that are common in "as code" SDKs obstruct transformations that need to apply to heterogeneous types.
  • I'm imagining blueprints as somewhat similar to prototypes: https://en.wikipedia.org/wiki/Prototype-based_programming
  • I also want blueprints created by platform teams to be able to constrain changes to derived instances through admission-control-like behavior (or like github PR status checks). In a way, the parent decides how it can be extended, similar to the "inner" mechanism (https://www.cs.utah.edu/plt/publications/oopsla04-gff.pdf), as opposed to the more common "super" mechanism, where derived instances can override whatever they please.

@bgrant0607 (Contributor, Author)

"kpt packages produce the same result as running kubernetes with corresponding admission controllers and controllers" -- this is a good starting point for how to think about it.

This diagram is relevant:
https://raw.githubusercontent.com/GoogleContainerTools/kpt/main/docs/design-docs/CaD%20Overview.svg

I am thinking of the approach as more similar to approaches that operate directly on the live state than to monolithic parameter-driven generator programs / processes that presume exclusive actuation.

The declarative configuration data are the source of truth. Operate on the data.

I tried to capture that in the CaD design doc:
https://github.com/GoogleContainerTools/kpt/blob/main/docs/design-docs/06-config-as-data.md

@justinsb (Contributor)

> Not all functions used on resources in a package need to be specified in the Kptfile.

It wasn't my intent to imply that - I agree with you there. We want a rich library of functions that can be used across packages.

"kpt packages produce the same result as running kubernetes with corresponding admission controllers and controllers" -- this is a good starting point for how to think about it.

Are there any significant differences that you would call out? I feel it's easy to communicate this model, there's some implementation work but it's well specified.

@bgrant0607 (Contributor, Author)

Significant differences: We'll need to start simple and work through more complex cases to see where it breaks down.

Kubernetes has a number of types of clients: admission controllers, controllers, CLIs, CLI plugins, UIs, UI plugins, etc.

If we start with a deployment package and ignore complications like upstream blueprints and nested subpackages, a ResourceList should be kind of like a slice of a cluster.

Imperative functions, UIs, etc. should be similar to imperative client tools.

Some functions in the Kptfile pipeline should be similar to admission control.

Function config is a concept that doesn't exist in Kubernetes, so that's worth thinking about more. Controllers know what resource types (or parts of resource types, such as in the scheduler's case) they are responsible for, for the most part. Where multiple controllers could be responsible, usually a field or annotation can be used to select ownership for a specific resource: kubernetes/kubernetes#31571.
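
For reference, a minimal ResourceList as passed to a KRM function, showing where the function config rides alongside the package resources (contents are illustrative):

apiVersion: config.kubernetes.io/v1
kind: ResourceList
items:
  - apiVersion: v1
    kind: ConfigMap
    metadata:
      name: example-config
      namespace: team-a
functionConfig:
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: set-namespace-fn-config
  data:
    namespace: team-a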

We need a way to map resource types to functions, especially for generator functions. Does the KRM function registry do that? In Kubernetes, the CRD would be registered with the apiserver and a controller would watch resources of that type and of the types generated. https://github.com/metacontroller/metacontroller had an explicit mechanism for that IIRC.

For transformations of an upstream, we may need the idea of a deployment hook that we've discussed, so we can distinguish those transformations from admission control. The variant-constructor pattern using the package name as input is only part of the solution. kubectl has kubeconfig, terraform has provider config, gcloud has config, etc. We can capture that information when creating a deployment package and add it to the package. Building in resource name and namespace transformations isn't sufficient because there are other identifiers that need to be personalized, such as RBAC RoleBinding groups, GCP projects and resource ids in Config Connector resources, and so on. I do expect to need to be able to record the mappings explicitly, though.

In general, we need to put more thought into the upgrade merge process and what we expect to be possible and not possible.

@bgrant0607 (Contributor, Author)

Some thoughts from my morning ride before I forget the other half of them...

If we look at the namespace provisioning example, the blueprint acts as a factory for stamping out namespaces and their associated policy resources.
https://kpt.dev/guides/namespace-provisioning-ui
https://kpt.dev/guides/namespace-provisioning-cli

The inputs to the factory need to come from the downstream variant. In this example, we fully automated the generation of the input using the degenerate case of the variant-constructor pattern: just the package name was sufficient to forge a unique identity for the variant.
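
As I recall, the mechanics look roughly like the following; treat the file name, ConfigMap name, and function version as illustrative of the package-context convention rather than a spec:

# package-context.yaml, generated when the variant is created
apiVersion: v1
kind: ConfigMap
metadata:
  name: kptfile.kpt.dev
data:
  name: my-namespace   # the package name, i.e. the variant's identity
---
# excerpt of the blueprint's Kptfile pipeline that consumes it
pipeline:
  mutators:
    - image: gcr.io/kpt-fn/set-namespace:v0.4.1
      configPath: package-context.yaml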

We're trying to enable platform engineers to easily create such factories for stamping out common patterns of resources that need to be provisioned, without needing to write YAML, patches, templates, Operators, or other configuration generation programs, except in high-value cases, such as greater than 10x non-boilerplate expansion. Platform users should be able to tweak and refine those blueprints, but within guardrails established by the platform team -- it's not the kustomize "override whatever you want" case. Deviating more significantly may require breaking the connection to the upstream. That's OK. Once we can operate on packages in bulk through the package orchestrator, many things become possible. I've also already proposed a rebase operation (#2548), but there are other ways to enforce standards across packages. I want to be able to inject admission control functions at the repo level in porch, for example. Another rebase case: create a deployment package from live resources, abstract-ify it by inserting placeholders in obvious places similar to kubectl get --export to create a blueprint, rebase the deployment package on the blueprint, and add a function to reverse the abstract-ification. Anyway, I digress.

An abstract blueprint needs to be able to specify functions to transform the upstream into downstreams, with the inputs coming from downstreams. That's not exactly the same as a deployment hook. We could run these functions on kpt pkg get and kpt pkg update before merging. Upstream admission control could be run after merging (maybe). The advantage of these transformation functions having well defined types rather than being package-specific types or parameters is that we could fully generate them in some cases, enable the user to select among common input sources, autocomplete and validate values, and so on. GUIs, CLIs, and some Terraform providers can do that, but I rarely see anyone put that kind of investment into individual packages, which tend to be a lot less reusable than one would hope. I don't want specific packages to require much investment. I want them to be cheap.

@bgrant0607 (Contributor, Author)

A few more thoughts after discussions today:

Just supporting imperative operations on a declarative data model in versioned storage is pretty simple -- nothing new there. Compared to operating on the live state directly, already that provides pre-deployment validation, admission control, review and approval, versioning and undo, bulk operations (e.g., set a particular attribute across 1000 packages), with similar WYSIWYG experience and tool interoperability on the source of truth.

Add some kind of "macros", as with functions implementing the variant-constructor pattern or kustomize transformers, and variants can be stamped out from a factory package. That could be done decades ago, no problem. The data model is a bit under-defined (e.g., inter-resource references, allocated values, generated values).

So what is the hard part? Merge updates inherited from an updated parent. For all its challenges, I think users want that kind of inheritance (as opposed to just imperative backfill-style operations). It's been implemented over and over again. It might be worth some compromises to get all the other benefits and capabilities, but it's worth trying to provide. Update an attribute in the upstream, and propagate the diff to the downstreams. Obviously we need to maintain mappings for all changed merge keys, particularly resource names. We need to decide how to deal with conflicts (e.g., downstream wins). We need to decide function execution order after the downstream provides inputs to functions specified by upstream, and potentially adds additional functions, while the upstream specifies the variant constructor functions (macros) and admission control (I think in terms of the "inner" inheritance model). Generators also need to have their functions invoked on downstream inputs, but may also need to apply the transformation functions, implying they may need to run first, but not just on upstream package data.

Anyway, Kubernetes has several types of clients and components: UIs, CLIs, admission controllers, asynchronous controllers, etc. WYSIWYG config should be similar. The big difference is the inheritance model. We need to start with the behaviors we need and work backwards.

@bgrant0607 (Contributor, Author)

I acknowledge the need to create reviewable diffs.

@bgrant0607 (Contributor, Author)

Some notes on current merge behavior. Thanks for the pointers.

Load 3 trees: https://github.com/GoogleContainerTools/kpt/blob/main/internal/util/merge/merge3.go#L66-L104
Match objects by GKVNN: https://github.com/kubernetes-sigs/kustomize/blob/188e35fbfd5fcaee208e01027b37db11ee183b49/kyaml/kio/filters/merge3.go#L218-L232
Merge each object: https://github.com/kubernetes-sigs/kustomize/blob/master/kyaml/yaml/merge3/merge3.go#L13-L20
Conflict resolution:
https://github.com/kubernetes-sigs/kustomize/blob/188e35fbfd5fcaee208e01027b37db11ee183b49/kyaml/yaml/merge3/visitor.go
https://github.com/kubernetes-sigs/kustomize/blob/master/kyaml/yaml/walk/map.go#L32-L55
https://github.com/kubernetes-sigs/kustomize/blob/188e35fbfd5fcaee208e01027b37db11ee183b49/kyaml/yaml/walk/associative_sequence.go#L155

Since object names and other merge keys could be changed, we need to keep track of the original keys (in comments currently). If a hand edit changes a merge key, I'm not sure how we'd catch that currently.

Upstream changes (usually?) win at the moment, but I don't think that's what we want. In general, I think we want downstream to win. Upstream values are "defaults". A change of default shouldn't change the value that the downstream wanted. The upstream patch should only apply if the downstream values still matched the previous upstream values.
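
A worked example of the semantics I have in mind (values are made up):

# original upstream:  spec: {replicas: 3}
# downstream edit:    spec: {replicas: 5}   # deliberate override of the default
# updated upstream:   spec: {replicas: 4}   # new default
#
# Desired 3-way merge result in the downstream: replicas stays 5.
# The upstream change (3 -> 4) should only apply if the downstream value
# still matched the previous upstream value (3).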

Kptfiles, of course, need special merge logic to deal with updating upstream info, setting downstream version info, merging function pipelines, and so on.

I created an area/pkg-update label. We should go through and find issues reported with current "kpt pkg update" behavior.

@bgrant0607 (Contributor, Author)

Another potentially useful PL analogy, from @seh: CLOS. Method definitions are not tied to specific classes, just as most functions should not be tied to specific packages, and many functions should apply to multiple resource types.
https://lispcookbook.github.io/cl-cookbook/clos.html

No surprise that it shares history with AOP.

droot added the `triaged` (Issue has been triaged by adding an `area/` label) label on May 17, 2022
@droot (Contributor) commented Jun 8, 2022

Great discussion in this thread. I created a CUJ to make some of these more concrete in the context of the namespace provisioning use case: #3294.

@bgrant0607 (Contributor, Author)

Another issue uncovered by #3292: kustomize identifier transformers keep track of previous identifiers in annotations, but not in an idempotent manner, and they also don't output the annotations.

With in-place transformation, we need both idempotence and persistence, because functions may be run on the configuration multiple times and new functions may be run in the future.

@yuwenma (Contributor) commented Jun 16, 2022

> Another issue uncovered by #3292: kustomize identifier transformers keep track of previous identifiers in annotations, but not in an idempotent manner, and they also don't output the annotations.
>
> With in-place transformation, we need both idempotence and persistence, because functions may be run on the configuration multiple times and new functions may be run in the future.

Getting an idempotent result requires the very first identifier to be stable (even in cases where users manually change the ID or update the package). kpt keeps track of the upstream via the kpt-merge key, so I'm thinking of using the kpt-merge value as the original identifier. It would be something like calling a builtin converter in kpt fn eval and kpt fn render.
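
For context, the merge key is carried today as a comment on the metadata field, something like the following as I understand it (the namespace/name values are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata: # kpt-merge: default/example-app
  name: dev-example-app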

Ideally, kpt and KRM functions should use the same language for identifier transformation.

@bgrant0607 (Contributor, Author)

@yuwenma Are you proposing to propagate the kpt-merge comments on identifiers to annotations? Or do the typescript and starlark SDKs make comments accessible? I would rather propagate user-oriented comments in the orchestrator and not deal with comments at all in the SDKs, since that makes the SDKs simpler and less error-prone.

@droot (Contributor) commented Jun 17, 2022

Some quick thoughts wrt kpt-merge: whenever we create a new variant of a package, we need to track the identifier of upstream resources in the downstream package. This is sort of like tracking upstream information, but at the resource level (I know we have discussed this in the context of the Kptfile for tracking the pipeline in upstream/downstream). I don't see any downside in storing them as annotations. Thinking more, for any package-level operation at the inheritance chain boundary that can generate new resources (e.g., pkg clone, pkg update, pkg rebase), the upstream identifier needs to be updated in the downstream resources. We will need special handling of such annotations during pkg update, because upstream resources may have their own upstream identifiers as well and we don't want to clobber those.

Visualizing the yamls:

Upstream resource in an abstract package:

apiVersion: apps/v1
metadata:
  name: example-app
  annotations:
    app: frontend
spec:
  ...

Downstream resource (after some transformations):

apiVersion: apps/v1
metadata:
  name: dev-example-app
  annotations:
    app: frontend
    internal.kpt.dev/upstream-id: example-app
spec:
  ...

@bgrant0607 (Contributor, Author)

Good point, @droot. As discussed in #2544, when a package is cloned, it becomes the upstream to the new downstream, even if it was a downstream to its own upstream.

@yuwenma (Contributor) commented Jun 17, 2022

> @yuwenma Are you proposing to propagate the kpt-merge comments on identifiers to annotations? Or do the typescript and starlark SDKs make comments accessible? I would rather propagate user-oriented comments in the orchestrator and not deal with comments at all in the SDKs, since that makes the SDKs simpler and less error-prone.

Sorry I didn't make it clear. Yes, by "not deal with comments at all in the SDKs" I mean having the kpt CLI or the orchestrator feed the annotation info from the current merge key, so the KRM functions only deal with annotations (it would be great if it could provide the upstream group and kind as well).

My main blocker previously was that KRM functions themselves do not have a reliable source for the very first original identifier. kpt-merge provides that missing piece, and no annotation lifecycle management is needed.

> I don't see any downside in storing them as annotations.

@droot That's great to know!

@droot (Contributor) commented Jun 17, 2022

> Good point, @droot. As discussed in #2544, when a package is cloned, it becomes the upstream to the new downstream, even if it was a downstream to its own upstream.

Yes.

@bgrant0607 (Contributor, Author)

Update behavior is being discussed in #3329.

Guardrails are covered by #3279 and/or #3218.

@bgrant0607 (Contributor, Author) commented Jul 1, 2022

Logging this here so that I don't lose track of it -- the WIP PR for Composition (not yet merged).
https://github.com/kubernetes-sigs/kustomize/pull/4323/files

The KEP:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-cli/2299-kustomize-plugin-composition

@bgrant0607 (Contributor, Author)

kubectl edit <resource> --output-patch can be used to generate patches from a WYSIWYG YAML editor.

@bgrant0607 (Contributor, Author)

A few brief thoughts on Composition, since I've been looking at it.

At first glance it's similar to kpt pipelines, but there are some differences.

transformersFrom:
- path: base

transformers:
- apiVersion: builtin
  kind: ResourceGenerator
  files:
  - cm.yaml
  • Explicit imports are awkward with a package-centric model if references can be to paths outside the current directory, which they probably would be.
  • The ResourceGenerator being an explicit transform and requiring files to be explicitly enumerated is awkward and toilsome. At minimum, a generator would be needed, like https://github.com/bazelbuild/bazel-gazelle for bazel. It also seems problematic for a model that mixes imperative functions on the package contents (e.g., a UI) and the function pipeline.
  • The function input configs being inlined into Composition don't provide a great experience either. I prefer the configPath pattern in the Kptfile, which is more consistent with the resource-per-file pattern (a configPath-based sketch follows this list). Functions are going to need some flexibility in where they pull their input values from anyway, as discussed in #3155 (Make it easier to write value transformers). Forcing all input values to be in the Kptfile would mean that Kptfiles would need to be generated to aggregate function inputs from variant-specific contexts. We have discussed some kind of "package set" controller or operation that would generate packages from sets of inputs, such as for different clusters or environments or other sources of variants, but I didn't want to make generation necessary even for just one variant, or for blueprints. Eventually we may land there, once we develop enough tooling, though.
  • As I commented in #2041 (Support for required function parameters), we'd need a tool to generate the KRM boilerplate for function inputs.
  • transformerOverrides assumes the out-of-place patching model.
  • I guess transformerOrder is to weave together imported and additional transformers.
  • It looks like it eliminated the distinction between different types of functions, such as generators and validators.
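
A sketch of the configPath pattern referenced above, keeping the function config in its own file per the resource-per-file convention (names and versions are illustrative):

# Kptfile excerpt
pipeline:
  mutators:
    - image: gcr.io/kpt-fn/set-namespace:v0.4.1
      configPath: ns-config.yaml
---
# ns-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ns-config
data:
  namespace: team-a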

Pipeline was originally a separate object type. I proposed adding it to Kptfile initially for simplicity, discoverability, and to reduce opportunity for errors, like multiple pipelines in the same package.

It's looking like we'll want distinct pipelines for variant construction and for admission control, in which case we may want to support specification of the pipelines by reference.

In the case that function inputs are transformed by other functions we'll want to toposort the functions. It would be nice if we could do that automatically. One reason we added the pipeline was that we didn't have enough information to do this, unfortunately. The "iterate until convergence" proposal might also work, but could also generate spurious errors and be confusing to users.

@bgrant0607 (Contributor, Author) commented Apr 27, 2024

In Jan 2021 we had a follow-up discussion on this topic. I didn't see a design doc added to the repo, so I thought it would be useful to relay the meat of the discussion.

Writing patches and configuration transformations for kustomize is not that simple if the built-in transformers don’t cover your needs. WYSIWYG is an alternative model that essentially enables in-line patching. What’s tricky is combining patches with automated transformations.

kpt pkg get and kpt pkg update support a similar mode of operation to git clone and git rebase; users can make a local copy of an upstream package (clone), make in-place changes to the local copy, and then re-apply their in-place changes on-top of upstream changes (rebase).

kustomize uses a different mode of operation that does not support in-place changes, but rather constructs a pipeline of changes on top of a (potentially remote) base. kpt functions provide similar functionality, but haven’t yet been rationalized with kpt pkg.

We attempted to identify an abstract model for this and show that kustomize patches and kpt pkg are quite similar, virtually isomorphic.

As compared to git, kpt differs in rebase strategy primarily in that we perform a 3-way merge that is KRM-schema aware (in particular, merge keys). In contrast, git performs a purely textual merge. (Git supports pluggable merge tools, so we could add schema-aware merge to git, but we’d also need to address other gaps in git, such as the ability to extract a subtree of a repo.)

We generically model the problem-space as a series of yaml changes that are performed by two actors: the package author and the package consumer. The challenge is to allow the package consumer to incorporate additional changes made upstream by the package author, while also preserving their own changes.

In this framing, we have not specified that the changes are described by patches or by git commits. Using patches or a sequence of git commits is a valid and universally applicable way to represent a sequence of changes to files. An interesting observation from git: git does not actually store patches, instead it stores complete tree snapshots for every revision. Diffs/patches are computed by comparing two revisions; merges are client-side operations and distinct from the git storage layer.

We can apply the same insight to understanding kpt. The first operation in kpt is defined by cloning an upstream package, and then additional operations are defined by commits of that local directory, which imply a series of patches. We could therefore potentially translate a local kpt package history into a kustomization.yaml; the upstream package is the remote base, and we add a kustomize patch operation for each commit. Similarly we could translate a kustomization.yaml into kpt, with a commit for each stage of the transformation pipeline.

The limitation of the kpt/git model is now apparent: kustomize supports operations beyond patch, such as namePrefix and commonLabels. While the effect of these operations for a given input can be represented by a patch, these operations carry additional semantic meaning during a rebase that is not captured by a patch. For instance, when an additional object is added upstream, we would expect that after a rebase the namePrefix or commonLabels are applied to the new object. (A git/code analogy: if I apply a refactoring to rename a method and upstream later adds another call to that method, then on a rebase a patch won't rename the added call.)
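
For reference, the kind of semantic operations in question (an illustrative kustomization.yaml fragment):

resources:
  - ../base
namePrefix: prod-   # reapplied to every object on each build, including objects added upstream later
commonLabels:
  env: prod         # likewise reapplied on every build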

At present, users would have to re-run functions after updating in order to apply them to new upstream configuration. This is a key reason it is recommended that functions should be idempotent: invariants can only be maintained by re-running all functions after any of their inputs have changed.

This behavior is somewhat ad-hoc. If instead we define it based on the proposed model, that should be easier for users to understand.

To do so, we consider kpt-native changes as including the semantic meaning. If we execute a kpt function, we could record that kpt operation alongside the change itself, effectively automatically constructing a Pipeline.

However, if changes are made externally from kpt via in-place editing, we do not currently have a mechanism to record the meaning or sequencing of the patch with other changes. We could track git commits containing external edits and sequence them with other pipeline changes in order to support overrides. We could potentially translate the in-place changes to literal kustomize-style patches, using a new command. We may want to explicitly mark overrides.

On a rebase, kpt could reapply each of the existing commits/patches. However, when it encounters a kpt function, it should re-run the kpt function rather than applying the patch.

This provides a compelling reason to wrap logic in a kpt function rather than directly executing it: we can provide a rebase experience more true to the user’s intent (i.e. a rebase is less likely to introduce a regression that was addressed by a local change).

If the user runs a kpt function and then makes local changes without a git commit in between, we should construct a diff from the version after kpt function execution. The easy way to do so is by forcing a git commit after each kpt higher-level operation, but we can also do so with local storage (.kpt directory?) or by re-executing functions.

To record the structured operations in git, we could either encode them into a structured section of the commit message (cf kubernetes release-notes) or by using git notes. We could also use a shadow branch to record additional state (including committing after every operation?). We could also record the operations in the Kptfile or Pipeline file. Recording in the Kptfile is likely to produce a UX that is more consistent with how kustomize is used.

The kpt rebase operation likely should feel more like a git rebase operation; we likely need to expose the commits that we are rebasing and the potential for them to fail / conflict. This model may be easier for users to understand, by being less “magical”.

We probably also need an additional, smarter diff after a rebase operation, to review the changes introduced by the rebase. We want to see what changed vs the previous local state, and why: was it upstream, or was it one of my local operations? Crucially, we don't want to see changes that applied unchanged in the previous version (i.e. this isn't just git diff origin/master). For the presentation, we probably want something akin to git blame. This is probably constructable by recording the SHA prior to the rebase and then doing a blame; i.e. this may be porcelain (not plumbing).

We also don’t want the patch set to grow monotonically over time. The problem is analogous to maintaining patches in a fork and/or commit squashing.

Also, if we want the in-place model to be competitive with out-of-place hydration, the rebase process needs to be more automated. Currently updating each variant of a common base package requires manual steps.

I am convinced that a single function pipeline for kpt is not enough.

We did split mutators and validators, similar to Kubernetes admission control:
https://kpt.dev/book/04-using-functions/01-declarative-function-execution
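
A minimal example of that split in a Kptfile pipeline (function images and versions are illustrative):

apiVersion: kpt.dev/v1
kind: Kptfile
metadata:
  name: example
pipeline:
  mutators:
    - image: gcr.io/kpt-fn/set-labels:v0.2.0
      configMap:
        app: backend
  validators:
    - image: gcr.io/kpt-fn/kubeval:v0.3.0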

Since kpt essentially uses an inheritance model, I would also look at inner and super OOPL inheritance models. I would also consider taking some kinds of functions out of packages and treating them more like controllers.
