Skip to content

Conversation

@vazirim
Copy link

@vazirim vazirim commented Dec 17, 2019

Description of the change:

Proposal doc to integrate Composable SDK into Operator-SDK.

Motivation for the change:

Composable SDK allows operator developers to add cross-resource references to their CRDs,
enabling dynamic configuration.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 17, 2019
@vazirim vazirim changed the title Composable integration in Operator-SDK Proposal: Composable integration in Operator-SDK Dec 17, 2019
@vazirim vazirim changed the title Proposal: Composable integration in Operator-SDK Proposal Doc: Composable integration in Operator-SDK Dec 20, 2019
Using the Composable SDK means that the CRD needs to have permission to access objects that are being referenced.
This means that the operator developer needs to consider what kinds of objects will be permitted for references and
add appropriate permissions for the RBAC rules of the CRD.

Copy link
Contributor

@camilamacedo86 camilamacedo86 Jan 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vazirim,

Really thank you for share your ideas and contribute with, first at all 🥇.

I have a few questions/concerns which follow. Please let me know if I misunderstood something.

So, with this solution, in a CR of a CRD for example ( Memcached ), we can map many references for pre-exisent resources in the cluster and then, it will be used to find the data by the ResolveObject. Following code from the tutorial just to clarify my questions :-)

       // Resolve the memcached instance
	resolved := &cachev1alpha1.MemcachedResolved{}
	comperr := sdk.ResolveObject(r.client, r.config, memcached, resolved, request.Namespace)
	if comperr != nil {
		if comperr.ShouldBeReturned {
			return reconcile.Result{}, comperr.Error
		}
		reqLogger.Info("Error during resolve, but not returned", comperr.Error)
		return reconcile.Result{}, nil
	}
  • How it will deal with error scenarios, for example, if 3 of 10 resources mapped for Memcached cannot be found? Will the Memcached resource be created or not? Has it a way to tell that some ref is optional?
  • If the object, for example, a ConfigMap be found but it has not the key mapped in the CR how will it work?
  • How would be the performance of it if the user adds too many resources refs to be "resolved"?
  • How to validate the data in the CR? Has it already something implemented for that?
  • Also, I think that would not be possible to call forever the reconcile in the case of something nothing works.
  • And, in real scenarios, I am afraid that would not possible to solve it by changing the options in the manager to re-trigger the reconcile every few seconds as in the demo. It would need to have a watch/observer impl for each resource referenced.

WDYT folks @estroz @joelanford @shawn-hurley ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add one more, which is should there be some way to specify a default if it can not be resolved?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @camilamacedo86,

Thanks for your feedback!

  • If 3 out of 10 resources cannot be found, then ResolveObject returns an error. It either resolves it all, or nothing. It also employs caching to get a consistent view. The cache is made fresh for each invocation of ResolveObject.

  • If the object is found, but the data in it doesn't exist, then ResolveObject also returns an error.

  • The performance hasn't been an issue so far because we have only seen CRs with a handful of references and no more. Resolving those references amounts to looking up in etcd. In any case, caching also alleviates performance.

  • We have a validation that happens inside the Composable operator controller (so that the underlying template is well-formed). For the Composable-SDK, since the developer is modifying the CRD definition, and the open-api spec of the CRD, we get validation through that.

  • The decision to retry is up to the consumer of the composable-SDK. But any operator will have a similar problem -- you need to decide for various error scenarios whether it makes sense to requeue another Reconcile cycle or not. So this is not specific to composable-SDK.

  • Sure, having watchers works as well. In the operators we have developed, we like to resync periodically to check for the health of what the operator is managing.

--Mandana

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shawn-hurley, yes we had thought of having defaults in the context of the Composable operator because it helped with validation (we could do a dry-run deployment). But we opted for not having defaults. The idea being that we want to have the ability of resolving data-dependencies dynamically. So if a resources has not been created yet, we retry until it's online. We have found this use case to be useful.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shawn-hurley The latest version of the Composable SDK (v.0.1.4) has support for default values.

The Composable SDK offers the following function for resolving the value of cross-resource references.

```golang
func ResolveObject(r client.Client, config *rest.Config, object interface{}, resolved interface{}, composableNamespace string) *ComposableError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does ResolveObject() use the rest.Config parameter for? I'm curious if we can get by passing in just the client.Client?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config is used to get a CachedDiscoveryInterface (k8s.io/client-go/discovery), which is used to lookup a resource kind given a kind and an apiversion. This in turn gives information that is used to lookup the reference object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, it means that it will not respect the operator scope. For example, if we pass just an ns for the manager here it would be looking for in all ns of the cluster. Am I right?

If yes, IHMO it is something that needs to be changed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the object reference have no namespace specified, then it will look only in the namespace of the manager. This gets passed as the composableNamespace parameter to ResolveObject(). That's the default namespace to use.

If the object reference has a namespace specified, it will look there instead.

The Config is used to get a CachedDiscoveryInterface and all we need to know is to see if the kind of the object reference is namespaced or not. This will inform how to create a NamespacedName for object lookup. If it is namespaced, then the convention above is used to determine the namespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vazirim,

Tks for the clarifications.

IHMO in order to the Compasable work aligned with it would need to use the manager/cache which is provided instead of a new cache. Otherwise, shows that It would allow for example, besides the operator be set up in the manager/cache with a List of namespaces looking for the resource out of this scope. Also, it can have more options setup by default which IHMO should be respected in all project. Is it make sense?

PS.: The manager can be set up with some options. Regards the ns it can be "" which will allow cache all cluster, or a list of namespaces or just 1 namespace. See: https://godoc.org/sigs.k8s.io/controller-runtime/pkg/manager#Options

Please, let me know if I misunderstood something as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ResolveObject is indeed using the client provided by the controller:

func ResolveObject(r client.Client, config *rest.Config, object interface{}, resolved interface{}, composableNamespace string) *ComposableError {

The CachedDiscoveryInterface is obtained from the controller's client as well (the client passed into the ResolveObject)...

I am not sure I understand the concern (?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation @vazirim! I see now why the rest.Config is important. But now I have a follow-up. :)

Have you considered making a ResolvesObject interface:

For example,

type ResolvesObject interface {
  // ResolveObject resolves object references. It uses a context
  // for cancellation.
  ResolveObject(ctx context.Context, in, out interface{}) error
}

And a struct that implements it:

type KubernetesResourceResolver struct {
  defaultNamespace string
  client client.Client
  resourcesClient discovery.ServerResourcesInterface
}

With this, the resolver could be setup and included as a field in the reconciler implementation, it would simplify the user-facing API, and it would make it more extensible for other possible future use cases (and writing unit tests).

Copy link
Author

@vazirim vazirim Feb 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joelanford The latest version of the Composable SDK (v0.1.4) now has an interface like you have suggested. In addition to the ResolveObject interface, we have the struct that implements it:

type KubernetesResourceResolver struct {
	Client          client.Client
	ResourcesClient discovery.ServerResourcesInterface
}

We simplified and got rid of the default namespace here.
The resolveObject now has this interface:

func (k KubernetesResourceResolver) ResolveObject(ctx context.Context, object interface{}, resolved interface{}) error 

The assumption is that the object passed has a namespace, and that serves as the default namespace. We also got rid of ComposableError and added functions to determine the nature of the error (see below).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption is that the object passed has a namespace

Forgive me if this has already been answered or addressed. Does this mean that cluster-scoped resources are not supported? Or can I pass an object where namespace is empty and that means cluster-scoped?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cluster-scoped resources are supported as well. We check to see if the reference is to a cluster-scoped resource or not. If not, and if the namespace is omitted from the GetValueFrom reference, then we use the default namespace, which is the namespace of the object itself (that contains the reference).


The `ResolveObject` function uses caching for looking up object kinds, as well as for the objects themselves, in order
to ensure that a consistent view of the data is obtained. If any data is not available at the time of the lookup,
it returns an error. So this function either resolves the entire object or it doesn't -- there are no partial results.
Copy link
Member

@joelanford joelanford Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's at least one interesting scenario that operator developer's using these object references would need to consider, and that is that the object resolves successfully and is created, and then at some point later, one of the referenced objects in a GetValueFrom field is deleted.

In that case, it looks like ResolveObject() will fail, but the caller may want to have some custom logic to handle the specific reason for the failure.

For example, if the error was a temporary connection issue, a caller would want to retry. But if the error is that a referenced object is not found, I probably still want to start retrying, but I might also want to delete the other objects that I created during a previous reconciliation that relied the resolved object that is now no longer resolving. Does that make sense?

Does the returned ComposableError.Error field have enough detail to allow this sort of decision making in the calling code?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently ComposableError does not have this level of detail.

}
```

The type `ComposableError` contains the error, if any, and a boolean `ShouldBeReturned` to indicate whether the calling reconciler should return this error or not. This is used to distinguish errors that are minor and could be fixed by immediately retrying from errors that may indicate a stronger failure, such as an ill-formed yaml (in which case there is no need to retry right away). If there are no errors, then the `ResolveObject` returns nil.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given my previous comment, it seems like there could be a gray area for deciding how ShouldBeReturned should be set. I wonder if it would make sense to expose a separate helper function in the composable SDK to handle this such that ResolveObject() could return the error directly.

For example:

func IsUnresolvableError(err error) bool { ... }
err := composable.ResolveObject( ... )
if composable.IsUnresolvableError(err) {
    return reconcile.Result{}, err
} else if apierrors.IsNotFound(err) {
    // do some extra logic
} else {
    return reconcile.Result{}, err
}

This way callers could use composable's helper function OR define their own custom error handling, and it would get error handling logic out of ResolveObject.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's a good idea to separate this concern, and add more detail to the kinds of error situations that can happen. Will open a work item.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joelanford The latest version of Composable SDK (v0.1.4) got rid of ComposableError. We now have a series of functions as you suggested:

func IsIllFormedRef(err error) bool 

func IsKindNotFound(err error) bool 

func IsObjectNotFound(err error) bool 

func IsValueNotFound(err error) bool 

func IsRefNotFound(err error) bool 

IsIllFormedRef means there is a problem with the way the cross-resource reference has been written in the yaml. IsKindNotFound means the kind for the reference does not exist. IsObjectNotFound means the object does not exist, and IsValueNotFound means the value does not exist within the object. The last one simply means either IsKindNotFound, or IsObjectNotFound, or IsValueNotFound is true.

Rather than hardwiring the same value everywhere, I can define cross-resource references to a unique configmap that contains
the data. This makes it easier to change the data if needed, and can allow all resources to be configured
dynamically (i.e. changes in the configmap are absorbed and resource that point to it are updated).

Copy link
Contributor

@camilamacedo86 camilamacedo86 Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same need could be solved by just GET object in the controller and then, updating the CR resource with the values. Not necessary is required to add all values used in the CR. It can be obtained programmatically, dynamically and in realtime as well.

IHMO the user's stories needs to be improved because the idea here would not be allowed something that is not possible today. It is just to add a facility, I mean another way to get RefFrom without need to implement it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that rather than hardwiring that in the controller, each CR of the CRD could have fields that point to different objects. So this gives a lot of flexibility.


To add support for cross-resource references in CRD definitions and controllers (Composable SDK) and make this
functionality available through the Operator-SDK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is missing: How it would be added in the SDK?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would add the types that are here:
https://github.com/IBM/composable/blob/master/sdk/composable-types.go

so that operator developers can use the ObjectRef to specify that fields have this type.

We also add the method ResolveObject (possibly with a modified interface so that the receiver is the client instead of the first argument) as one more way in which the developer can access objects in etcd (like Get, Update etc...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think would be nice to have your thoughts on how it should be added to the project too in the doc. Really tks for all.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok will add that.
Thanks for all the feedback!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@camilamacedo86 I changed the proposal doc according to the changes above for the interface.

## Summary

Kubernetes object specifications often require constant values for their fields. When deploying an entire application
with many different resources, this limitation often results in the need for staged deployments, because some resources have to be deployed first in order to determine what data to provide for the specifications of dependent resources. This undermines the declarative nature of Kubernetes object specification and requires workflows, manual step-by-step instructions and/or brittle automated scripts for the deployment of applications as a whole.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean for the referencing object and its lifecycle when the referenced object does not exist yet? Will it be created?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmesser If the referenced object does not exist, then the referencing object's controller will keep retrying until that object is created. When that object is created, then the controller can resolve the referencing object and continue with whatever it needs to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! How long would it keep trying?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends on the consumer of the library. In the memcached tutorial, this is done indefinitely.

We can imagine having a message in the Status of the object saying that it is waiting for some object to be created. If there is a bug in the spec of the object reference, the user could catch that and fix it. In any case, when the error is fixed or the object is created, then the controller can resolve the referencing object and move forward.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. @joelanford I think if we were to take this in, it would probably be a good default behavior to write out a status when waiting on referenced values.

Signed-off-by: Mandana Vaziri <mvaziri@us.ibm.com>
Signed-off-by: Mandana Vaziri <mvaziri@us.ibm.com>
@vazirim
Copy link
Author

vazirim commented Apr 7, 2020

I wonder if there are any plans for merging this proposal?
TIA

@estroz
Copy link
Member

estroz commented Apr 23, 2020

@vazirim some thoughts on this contribution:

The SDK should definitely make composable visible to operator authors, as it solves advanced use cases. However I'm concerned about the maturity of the composable operator and the community commitment required to support a potentially immature project in the SDK itself. My suggestion is to add the helpers proposed here to an operator-framework library and/or discuss composable usage in documentation.

To further discuss how to proceed I'd like to nail down the specifics of contributing, with graduation criteria, in #2770. For now this PR should stay open while we figure out a contribution methodology.

@vazirim
Copy link
Author

vazirim commented Apr 24, 2020

@estroz would it help if I did a code PR against operator-sdk?

@estroz
Copy link
Member

estroz commented Apr 24, 2020

@vazirim not quite. To put what I said above differently: until we figure out a general framework for contributing code from external projects to operator-framework ones, neither this proposal nor its code should be merged.

@joelanford
Copy link
Member

@vazirim Apologies for being MIA on this proposal and leaving it hanging so far! And thanks for popping it to the top of the stack.

We haven't had a community proposal of this scale before so you're unfortunately the guinea pigs and we were unfortunately caught flat footed and unprepared when all of the questions were answered and the next step was to merge. 😳 🤣

In the coming weeks we will be putting together docs and guidelines for accepting community-contributed features of this scale and establish some governance and procedures.

@vazirim
Copy link
Author

vazirim commented Apr 30, 2020

@joelanford, @estroz Thanks for the update!

@openshift-bot
Copy link

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2020
@vazirim
Copy link
Author

vazirim commented Jul 29, 2020

Hi @joelanford, this PR is going stale!
Any news on the integration?

@camilamacedo86
Copy link
Contributor

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2020
@joelanford
Copy link
Member

@vazirim I opened operator-framework/community#22. I should have linked that in this PR already!

It adds some governance around projects in the operator-framework org. Our current thinking is to bring this (and other similar features) in as "incubating" projects with a separate repository to let them mature and build a community within operator framework before "graduating".

I think we need to get the project governance agreed upon and merged, and then we can proceed with this as the first candidate for incubation. Sorry this is taking so long!

@openshift-bot
Copy link

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 31, 2020
@openshift-bot
Copy link

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 30, 2020
@openshift-bot
Copy link

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Copy link

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs discussion size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants