
Cluster API should not use controller refs for resources that it does not create #4014

Open
detiber opened this issue Dec 11, 2020 · 25 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@detiber
Member

detiber commented Dec 11, 2020

What steps did you take and what happened:

Attempting to create a Cluster using an external tool such as Crossplane causes issues: both CAPI and Crossplane want to hold the singular controller reference for the resources that Crossplane creates.

We've used controller refs in the past to avoid the possibility of multiple Clusters owning the same resources, but since this hampers interoperability, non-controller ownerReferences should be used instead, and the verification that a resource isn't already "owned" by another Cluster should be handled separately.

What did you expect to happen:

Crossplane should be able to create a Cluster API Cluster

Anything else you would like to add:

Anywhere that Cluster API controllers create resources themselves, they should continue to use controller references.
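For context, the distinction hinges on the `Controller` field of an owner reference: any number of plain owner refs may coexist, but only one owner may be the controller. A minimal sketch of the conflict, using simplified stand-in types rather than the real `metav1.OwnerReference` and controller-runtime helpers:

```go
package main

import (
	"errors"
	"fmt"
)

// OwnerReference is a simplified stand-in for metav1.OwnerReference.
type OwnerReference struct {
	Kind       string
	Name       string
	Controller bool // at most one owner may set this to true
}

// addOwnerRef appends a plain (non-controller) owner reference.
// Any number of these may coexist on an object.
func addOwnerRef(refs []OwnerReference, ref OwnerReference) []OwnerReference {
	ref.Controller = false
	return append(refs, ref)
}

// addControllerRef marks ref as the controlling owner. It fails if a
// controller is already set — the conflict hit when both CAPI and an
// external tool (Crossplane, Metacontroller, ...) claim the slot.
func addControllerRef(refs []OwnerReference, ref OwnerReference) ([]OwnerReference, error) {
	for _, r := range refs {
		if r.Controller {
			return refs, errors.New("object already has a controller: " + r.Kind + "/" + r.Name)
		}
	}
	ref.Controller = true
	return append(refs, ref), nil
}

func main() {
	refs := addOwnerRef(nil, OwnerReference{Kind: "Cluster", Name: "my-cluster"})
	refs, err := addControllerRef(refs, OwnerReference{Kind: "Composition", Name: "crossplane"})
	fmt.Println(len(refs), err) // 2 <nil>

	// A second controller ref is rejected — there is only one slot.
	_, err = addControllerRef(refs, OwnerReference{Kind: "Cluster", Name: "my-cluster"})
	fmt.Println(err != nil) // true
}
```

The proposal in this issue amounts to CAPI calling the equivalent of `addOwnerRef` for resources it did not create, reserving the controller slot for resources it creates itself.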

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 11, 2020
@vincepri
Member

IIRC (although this was long ago) we've only used controller-type owner references where the lifecycle of that resource is managed by a Cluster API controller, while on the other resources we add the Cluster owner reference (or the parent) to make sure we have a way to tie things up together.

Would you be able to give more details on what the error is? Owner references are also used for clusterctl move as of today, and it seems that semantically that's the right place to understand ownership.

Regarding the Crossplane integration, I'm curious to understand how that would work 😄 given that we've had some similar ideas in mind for the infrastructure providers, is that something you could share?

@hasheddan
Contributor

@vincepri I can give some more insight here, and would also love to attend a community meeting to present some ideas / demos. CAPI and Crossplane can interact at two main levels:

  1. CAPI could rely on Crossplane providers to supply the backing infrastructure for cloud providers as Kubernetes objects. Crossplane providers are similar to the CAPI infrastructure providers except they are more general purpose and support all types across a provider rather than just those needed to spin up a k8s cluster.
  2. Crossplane can orchestrate CAPI types using composition. You can see an extremely rough sketch of what this could look like in a demo we are putting together. This is the area where we are running into issues, as Crossplane's composition engine adds controller references to the objects it composes, which CAPI relies on controlling for its operations. We specifically hit this with the PacketCluster needing to be controlled by the CAPI Cluster type.

The Cluster is kind of an interesting case because I would typically not put controller references on a user-created object (i.e. the PacketCluster is part of the CAPI quickstart output that a user applies, or in this case, Crossplane applies), but I understand the rationale for it here. The controller refs are just one example of how Crossplane and CAPI can encounter situations where they don't play nicely, but I was actually pleasantly surprised that there were fewer conflicts than I was expecting.

I think a broader discussion about how CAPI and Crossplane can integrate would be beneficial here, and I am more than happy to create some content that can be presented. In many ways, the components of CAPI are similar to many Crossplane providers.

@fabriziopandini
Member

/milestone v0.4.0
given that some investigation work is going on

@k8s-ci-robot k8s-ci-robot added this to the v0.4.0 milestone Mar 5, 2021
@fabriziopandini
Member

@hasheddan we are trying to close API changes for the v1alpha4 version.
Is there any actionable outcome from your investigation?

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 10, 2021
@hasheddan
Contributor

@fabriziopandini yes, there has been significant work around using Crossplane as a drop-in replacement for types in cluster-api-provider-gcp. The work is being tracked in crossplane-contrib/provider-gcp#303, and most of it has taken place on livestreams that have been recorded. I would love to give a demo to the CAPI community soon (potentially next week) -- what would you recommend as the best forum for that?

Note that I believe any Crossplane integration will not need to block the v1alpha4 API changes, as one of the benefits of the model is that we can mold the composite resources (which are the drop-in types for the existing CAPI provider types) to fit any API.

/cc @vincepri

@vincepri
Member

I think we can delay a change in references until we find a suitable approach that works across all the proposed integrations. For now, we can possibly use what's in #4135 to enable a Crossplane integration.

/milestone Next

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.4.0, Next Mar 10, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 8, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lukebond

/reopen

I'd like to reopen this issue as originally titled; it has deviated into a discussion about a Crossplane integration, but I think it's more general.

Above, @vincepri wrote:

IIRC (although this was long ago) we've only used controller-type owner references where the lifecycle of that resource is managed by a Cluster API controller, while on the other resources we add the Cluster owner reference (or the parent) to make sure we have a way to tie things up together.

From looking at the code, I think this is almost correct (granted, it was a few years ago!) except for one thing: when adding the Cluster owner reference, it is actually a controller-type ref that's being added. See here: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/cluster/cluster_controller_phases.go#L95-L98 - when reconciling objects such as AWSCluster, it adds a controller ref linking them to the Cluster resource. Later, the AWS provider checks for that link before starting the infra work: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/controllers/awscluster_controller.go#L146 - if you dig into that function call, you will see that it only checks for an ownerReference, not a controller ref.

I would like to see cluster-api only create controller refs for resources it creates itself (e.g. the AWS provider creating machines from a machine deployment), and use plain owner refs everywhere else that it needs to tie things together.

My use case, for context, is very similar to the OP's, except instead of Crossplane it's Metacontroller. It's best practice to take a controller ref on resources your controller creates, and that's what Metacontroller and Crossplane are doing. In my case, I am creating the following resources from a Metacontroller operator: cluster, AWS cluster, machine template, machine deployment, k3s control plane, k3s config. Metacontroller takes a controller ref, which means that CAPI can't take one and the AWS provider won't start the infra work. I forked Metacontroller to make setting controller: true in the ownerReference configurable, and it works, but I think the problem is actually on the cluster-api side and I don't expect them to take that PR.
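The asymmetry described here (CAPI writes a controller ref, the infra provider only checks for a plain owner ref) can be sketched with simplified stand-in types; `hasOwnerRef` and `getControllerOf` below are hypothetical helpers mirroring the shape of the real checks, not the actual CAPI/CAPA code:

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for metav1.OwnerReference.
type OwnerReference struct {
	Kind       string
	Name       string
	Controller bool
}

// hasOwnerRef reports whether any owner reference — controller or not —
// points at the given kind/name. This mirrors the gate the AWS provider
// applies before starting infrastructure work.
func hasOwnerRef(refs []OwnerReference, kind, name string) bool {
	for _, r := range refs {
		if r.Kind == kind && r.Name == name {
			return true
		}
	}
	return false
}

// getControllerOf returns the single controlling owner, if any,
// mirroring the behavior of metav1.GetControllerOf.
func getControllerOf(refs []OwnerReference) *OwnerReference {
	for i := range refs {
		if refs[i].Controller {
			return &refs[i]
		}
	}
	return nil
}

func main() {
	// A plain (non-controller) owner reference already satisfies the
	// provider-side gate, and leaves the controller slot free for an
	// external orchestrator such as Metacontroller or Crossplane.
	refs := []OwnerReference{{Kind: "Cluster", Name: "my-cluster", Controller: false}}
	fmt.Println(hasOwnerRef(refs, "Cluster", "my-cluster")) // true
	fmt.Println(getControllerOf(refs) == nil)               // true
}
```

If this sketch holds for the real code paths, the `Controller: true` that CAPI writes is stricter than what the downstream check actually requires.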

@k8s-ci-robot
Contributor

@lukebond: You can't reopen an issue/PR unless you authored it or you are a collaborator.


@CecileRobertMichon
Contributor

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jan 17, 2023
@k8s-ci-robot
Contributor

@CecileRobertMichon: Reopened this issue.


@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 17, 2023
@CecileRobertMichon
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 17, 2023
@sbueringer
Member

sbueringer commented Jan 18, 2023

Sounds reasonable to me at a first glance.

I think we would have to differentiate between "standalone" clusters and clusters using ClusterClass. With ClusterClass, the CAPI controller actually creates the infra cluster, control plane, and MachineDeployments as well.

@sbueringer
Member

cc @killianmuldoon given recent work on ownerRefs

@enxebre
Member

enxebre commented Jan 18, 2023

Revisiting this seems reasonable to me. IIRC, the rationale behind using controller: true originally was mainly to prevent multiple Clusters from fighting over the "descendant" resources, e.g. MachineDeployments.

This is also pending better docs: #5487
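That original rationale could plausibly be preserved without the exclusive controller slot, e.g. by rejecting an owner reference to a second object of the same kind. The following is a hedged sketch with simplified types; `addUniqueKindOwnerRef` is a hypothetical helper, not an existing CAPI function:

```go
package main

import (
	"errors"
	"fmt"
)

// OwnerReference is a simplified stand-in for metav1.OwnerReference.
type OwnerReference struct {
	Kind string
	Name string
}

// addUniqueKindOwnerRef appends ref unless an owner reference of the
// same Kind but a different Name already exists. This prevents two
// Clusters from claiming the same descendant resource without
// consuming the exclusive controller slot.
func addUniqueKindOwnerRef(refs []OwnerReference, ref OwnerReference) ([]OwnerReference, error) {
	for _, r := range refs {
		if r.Kind == ref.Kind && r.Name != ref.Name {
			return refs, errors.New("already owned by another " + r.Kind + ": " + r.Name)
		}
		if r == ref {
			return refs, nil // idempotent: the ref is already present
		}
	}
	return append(refs, ref), nil
}

func main() {
	refs, _ := addUniqueKindOwnerRef(nil, OwnerReference{Kind: "Cluster", Name: "a"})
	_, err := addUniqueKindOwnerRef(refs, OwnerReference{Kind: "Cluster", Name: "b"})
	fmt.Println(err != nil) // true: a second Cluster is rejected
}
```

The trade-off is that this guard lives in CAPI's reconcile logic rather than being enforced by the API machinery itself, which is part of why the change needs care.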

@fabriziopandini
Member

Just another data point on this: the code base is full of assumptions that OwnerReferences are in place, and unfortunately there is no good documentation linking owners to those assumptions (AFAIK the only exception is https://cluster-api.sigs.k8s.io/clusterctl/provider-contract.html#ownerreferences-chain, which explicitly documents why it relies on this info).

As a consequence, changing owner references should not be taken lightly, because it can lead to unexpected problems all around. We should figure out how to prevent regressions and how to communicate the change to all the Cluster API consumers that could have similar assumptions in their code.

@sbueringer
Member

Agreed. I think we should first "lock down" and document the current behavior (which is what Killian was and is working on, e.g. with #7606; I'm not sure if the ownerRef being a controller ref is part of that).

Then we can evolve it from there.

@fabriziopandini
Member

/triage accepted
/remove-priority important-soon
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 21, 2023
@killianmuldoon
Contributor

/assign

I'll come back to this, organise the documentation, and see if there are places where we can move controller references to non-exclusive OwnerRefs.

@killianmuldoon
Contributor

killianmuldoon commented Aug 29, 2023

We've now merged documentation and tests to cover the current state of ownerReferences: #9153

I probably don't have time to follow up on the second part of this issue, i.e. deciding which ownerReferences should remain controller refs and which should not, so I'm unassigning myself for now.

/unassign

@killianmuldoon
Contributor

/help

@k8s-ci-robot
Contributor

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Aug 29, 2023