
feat: wip check only controller ref to decide if a pod is replicated #5419

Conversation

Member

@vadasambar vadasambar commented Jan 13, 2023

This PR is still WIP. I will mention the reviewers here once I am done. Although it might not be in the best shape as a WIP PR, if the reviewers have any feedback, I would love to have it. 🙏

Signed-off-by: vadasambar surajrbanakar@gmail.com

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #5387

Special notes for your reviewer:

Does this PR introduce a user-facing change?

All pods with a controller reference of any sort are treated as replicated. This means cluster-autoscaler would not block scale down on these pods unless the `cluster-autoscaler.kubernetes.io/safe-to-evict: false` or `cluster-autoscaler.kubernetes.io/daemonset-pod: true` annotation is set on them.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

TBD


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 13, 2023
@k8s-ci-robot
Contributor

Welcome @vadasambar!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 13, 2023
@vadasambar
Member Author

This PR is still WIP. I will mention the reviewers here once I am done. Although it might not be in the best shape as a WIP PR, if the reviewers have any feedback, I would love to have it. 🙏

@@ -121,25 +121,7 @@ func GetPodsForDeletionOnNodeDrain(
// so OwnerReference doesn't have its own Namespace field
controllerNamespace := pod.Namespace

if refKind == "ReplicationController" {
Member Author

@vadasambar vadasambar Jan 13, 2023

We can't remove the listers for the removed code from ListerRegistry because we use them somewhere else (ref1, ref2)

@@ -106,102 +105,16 @@ func GetPodsForDeletionOnNodeDrain(
continue
}

isDaemonSetPod := false
Member Author

Not needed anymore.

@@ -84,7 +83,6 @@ func GetPodsForDeletionOnNodeDrain(

pods = []*apiv1.Pod{}
daemonSetPods = []*apiv1.Pod{}
checkReferences := listers != nil
Member Author

Not needed anymore.

Signed-off-by: vadasambar <surajrbanakar@gmail.com>
- forgot to add this in the last commit

Signed-off-by: vadasambar <surajrbanakar@gmail.com>
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@vadasambar vadasambar force-pushed the feature/5387/allow-scale-down-with-custom-controller-pods branch from c03ea27 to 5df6e31 Compare January 16, 2023 05:47
- might need refactoring
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 16, 2023
- so that the changes to the interface are minimal
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
- i.e., the pod is orphan
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
}

// GetDynamicLister returns the lister for a particular GVR
func (g *GenericListerFactory) GetLister(gvr schema.GroupVersionResource, namespace string) dynamiclister.Lister {
Member Author

A PoC of this can be found here: https://github.com/vadasambar/dynamic-lister

// The assumption in the code below is that the controllerRef/owner of
// a pod resource will always be in the same namespace
// TODO: find a better way to handle this
l := listers.GenericListerFactory().GetLister(schema.GroupVersionResource{
Member Author

I am using the generic lister for all resources, even when we have a dedicated lister for some in-built resources like DaemonSets, StatefulSets, etc., to simplify the code at the cost of the additional overhead of initializing on-demand listers (per namespace). Not sure if this is the best approach. Would love any feedback on this 🙏

Member

How about making it an implementation detail of GetLister? If there's a dedicated lister already, it would use that, otherwise fall back to the dynamic one?

Member Author

I like the idea. Thank you for the suggestion!

Member Author

@vadasambar vadasambar Jan 31, 2023

This idea might not work because there is no common Lister interface we can use as a return type. We could use interface{} as the return type, but then the user would have to typecast the lister, i.e., they'd have to know the lister type beforehand.
e.g.,
The Lister interface for the dynamic/unstructured type is different from ReplicaSet's Lister interface, which is different from Job's Lister interface, which is different from StatefulSet's Lister, etc.

I don't think there's a common Lister interface for in-built types but assuming there was, we would still have the same problem since unstructured's Lister interface would be different from the common Lister interface for in-built types.
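
To illustrate the point (a rough sketch with hypothetical names, not code from this PR): with interface{} as the return type, every caller has to know and assert the concrete lister type up front.

package example

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	appsv1listers "k8s.io/client-go/listers/apps/v1"
)

// getAnyLister is a stand-in for a registry method that returns "some lister".
// Because the generated listers share no common interface, the best it can do
// is interface{}.
func getAnyLister(gvk schema.GroupVersionKind) interface{} {
	// Placeholder: look up or construct the right lister for gvk here.
	return nil
}

func listReplicaSets() {
	l := getAnyLister(schema.GroupVersionKind{Group: "apps", Version: "v1", Kind: "ReplicaSet"})
	// The caller must already know the concrete lister type to use it.
	rsLister, ok := l.(appsv1listers.ReplicaSetLister)
	if !ok {
		return // wrong assumption about the lister type
	}
	_ = rsLister // e.g. rsLister.ReplicaSets(namespace).Get(name)
}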

Member Author

I wonder if there's an elegant way to handle this problem.

Member

Right, you'd have to wrap existing listers into the dynamic one and just use that return type.

Member Author

Let me think about this and get back. I am imagining something like a BigLister interface which would have the in-built listers along with the dynamic lister.

Member Author

@vadasambar vadasambar Feb 9, 2023

Right, you'd have to wrap existing listers into the dynamic one and just use that return type.

Just making sure I am understanding this right.

I am guessing the idea is to replace the return type of the built-in listers (Deployments, StatefulSets, etc.) with the dynamic lister

e.g.,

// ListerRegistry is a registry providing various listers to list pods or nodes matching conditions
type ListerRegistry interface {
	AllNodeLister() NodeLister
	ReadyNodeLister() NodeLister
	ScheduledPodLister() PodLister
	UnschedulablePodLister() PodLister
	PodDisruptionBudgetLister() PodDisruptionBudgetLister
	DaemonSetLister() v1appslister.DaemonSetLister
	ReplicationControllerLister() v1lister.ReplicationControllerLister
	JobLister() v1batchlister.JobLister
	ReplicaSetLister() v1appslister.ReplicaSetLister
	StatefulSetLister() v1appslister.StatefulSetLister
	GetLister(gvr schema.GroupVersionKind, namespace string) dynamiclister.Lister
}


to

// ListerRegistry is a registry providing various listers to list pods or nodes matching conditions
type ListerRegistry interface {
	AllNodeLister() dynamiclister.Lister
	ReadyNodeLister() dynamiclister.Lister
	ScheduledPodLister() dynamiclister.Lister
	UnschedulablePodLister() dynamiclister.Lister
	PodDisruptionBudgetLister() dynamiclister.Lister
	DaemonSetLister() dynamiclister.Lister
	ReplicationControllerLister() dynamiclister.Lister
	JobLister() dynamiclister.Lister
	ReplicaSetLister() dynamiclister.Lister
	StatefulSetLister() dynamiclister.Lister
	GetLister(gvr schema.GroupVersionKind, namespace string) dynamiclister.Lister
}

Member Author

I like the idea.

One problem I see is that if we use dynamiclister.Lister as the return type (or a wrapper type), accessing fields won't be as straightforward as just doing something like obj.Spec.Replicas (which is possible for in-built types).
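
To illustrate the trade-off (a sketch, not code from this PR): typed objects expose fields directly, while objects from the dynamic lister come back as unstructured maps.

package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// With a typed lister the object is a real struct.
func typedReplicas(rs *appsv1.ReplicaSet) int32 {
	if rs.Spec.Replicas != nil {
		return *rs.Spec.Replicas
	}
	return 0
}

// With the dynamic lister the object is *unstructured.Unstructured, so
// fields have to be pulled out of a nested map with helper functions.
func unstructuredReplicas(obj *unstructured.Unstructured) int64 {
	replicas, found, err := unstructured.NestedInt64(obj.Object, "spec", "replicas")
	if err != nil || !found {
		return 0
	}
	return replicas
}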

Member Author

I will think a bit more around #5419 (comment)

Signed-off-by: vadasambar <surajrbanakar@gmail.com>
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
…terface

Signed-off-by: vadasambar <surajrbanakar@gmail.com>
- in drain test for starters
- the test is still failing because the scheme is empty
- and type meta for the objects is empty
- because of which scheme can't be set in the code either
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
}
} else {
if refKind != "" {
gv := strings.Split(controllerRef.APIVersion, "/")
Member

I think this would be safer using ParseGroupVersion from https://github.com/kubernetes/apimachinery/blob/master/pkg/runtime/schema/group_version.go#L211 instead of trying to parse manually.

Member Author

Makes sense. I will use this. 👍
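
For reference, a minimal sketch of the suggested change (the wrapper function is illustrative, not the PR's exact code):

package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// groupVersionOf parses an owner reference's APIVersion ("apps/v1", "v1", ...)
// with the library helper instead of splitting on "/" by hand. ParseGroupVersion
// also handles the core group, whose APIVersion contains no "/" at all.
func groupVersionOf(controllerRef *metav1.OwnerReference) (schema.GroupVersion, error) {
	return schema.ParseGroupVersion(controllerRef.APIVersion)
}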


@@ -46,6 +57,19 @@ type ListerRegistry interface {
JobLister() v1batchlister.JobLister
ReplicaSetLister() v1appslister.ReplicaSetLister
StatefulSetLister() v1appslister.StatefulSetLister
GenericListerFactory() GenericListerFactory
Member

What do we need such two-level interface for? Why not just have GenericLister(gvr schema.GroupVersionResource, namespace string) dynamiclister.Lister directly here?

Member Author

The main reason is so that tests can implement this interface (like this, for example). I am still going through the unit tests. This interface might not be necessary, but when I look at the tests it seems like having it would be useful. I would be happy to remove it if I am wrong because, like you said, not having the interface simplifies the code.

Member Author

Also, having a factory like this means GenericLister can be used independently of the ListerRegistry (not sure if we have a use case like that in cluster-autoscaler, but I thought having that de-coupling might prove useful in the future).

// or it's assumed to be a cluster-scoped resource
// The assumption in the code below is that the controllerRef/owner of
// a pod resource will always be in the same namespace
// TODO: find a better way to handle this
Member

Can we always list controllers from all namespaces? This is equivalent to what existing listers do, right?

Member Author

Not sure if I understand your question. The code in the master branch assumes the controller/owner resource is in the same namespace as the pod. The code I wrote does the same thing, i.e., it assumes the controller/owner resource is in the same namespace as the pod. This approach has a caveat: if the pod owner is a cluster-scoped resource, our logic wouldn't work correctly (I haven't seen a case like this yet).

The comment above basically puts this caveat into words, and the TODO is for handling the case of a cluster-scoped resource as the owner. Maybe I should tweak the wording a bit.

Let me know if that doesn't answer the question 🙏

Member

Ack, I think you're right.

@x13n
Member

x13n commented Jan 30, 2023

/assign

Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@vadasambar
Member Author

vadasambar commented Jan 31, 2023

The code has a problem: I pass Kind in as the Resource of the GVR in #5419, but that won't work. Resource has to be plural (e.g., Deployment or deployment doesn't work; it has to be deployments. You can experiment with this using the dynamic-lister PoC by changing this line).

We could use UnsafeGuessKindToResource, but as the name implies it'd still be a guess, and using it seems to be discouraged (rightfully so).
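
For context, a small sketch of the Kind-vs-Resource problem and the guess-based conversion mentioned above (illustrative only):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// The owner reference only carries a Kind, while the dynamic client/lister
// needs the plural, lowercase resource name.
func main() {
	gvk := schema.GroupVersionKind{Group: "apps", Version: "v1", Kind: "Deployment"}

	// UnsafeGuessKindToResource lowercases and pluralizes the Kind with
	// simple heuristics; it happens to be right for "deployments", but it
	// is only a guess and can be wrong for irregular plurals.
	plural, singular := meta.UnsafeGuessKindToResource(gvk)
	fmt.Println(plural.Resource, singular.Resource) // "deployments" "deployment"
}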

Signed-off-by: vadasambar <surajrbanakar@gmail.com>
- use `ListerRegistry` interface in place of the factory
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@@ -380,6 +381,7 @@ func buildAutoscaler(debuggingSnapshotter debuggingsnapshot.DebuggingSnapshotter
kubeClientConfig.Burst = autoscalingOptions.KubeClientBurst
kubeClientConfig.QPS = float32(autoscalingOptions.KubeClientQPS)
kubeClient := createKubeClient(kubeClientConfig)
dynamicClient := dynamic.NewForConfigOrDie(kubeClientConfig)
Member

Can you reuse kubeClient here to share qps rate limits?

Member Author

I am not sure if we can create a dynamicClient out of kubeClient because createKubeClient returns a kube_client.Interface which actually is a ClientSet. I don't think we can use ClientSet for creating a dynamicClient (would be happy to be corrected here).

Since we pass kubeClientConfig to dynamic.NewForConfigOrDie, all of the QPS/Burst settings are applied to dynamicClient here, which can be traced as:

Member

kube_client.Interface is a superset of rest.Interface, which is what dynamic.New expects. I didn't test this, but from docs it seems like what we want?

Member Author

Looks like I missed that obvious function. Thanks for the link. I will test it out.

Member Author

It works (can be tested by running this PoC). I have updated the code.
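
For reference, a sketch of the shared-client approach discussed in this thread (function and variable names are illustrative, and it assumes a client-go version where dynamic.New accepts a rest.Interface, as linked above; the PoC is the actual verification):

package example

import (
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// buildClients wires the dynamic client on top of the typed clientset's REST
// client, so both go through the same QPS/burst rate limiter instead of each
// getting their own.
func buildClients(cfg *rest.Config) (kubernetes.Interface, dynamic.Interface, error) {
	kubeClient, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, nil, err
	}
	dynamicClient := dynamic.New(kubeClient.CoreV1().RESTClient())
	return kubeClient, dynamicClient, nil
}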

}

// // GenericListerFactory is a factory for creating
Member

Can you remove the commented out code?


resource, found, err := unstructured.NestedString(crd.Object, "spec", "names", "plural")
if !found {
fmt.Println(fmt.Errorf("couldn't find the field 'spec.names.plural' on the CRD '%s'", crd.GetName()))
Member

Let's use klog for logging, like in other parts of the code.

Member Author

Makes sense 👍
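
A minimal sketch of the switch (klog.Errorf is the pattern used elsewhere in cluster-autoscaler; the helper name is illustrative):

package example

import "k8s.io/klog/v2"

// Instead of fmt.Println(fmt.Errorf(...)), report the problem through klog so
// it ends up in the same log stream and at the right severity.
func logMissingPlural(crdName string) {
	klog.Errorf("couldn't find the field 'spec.names.plural' on the CRD %q", crdName)
}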


// this lister to listers map
// so that another attempt is made
// to create the lister and sync cache
klog.Error("couldn't sync cache")
Member

This will leak reflector goroutines, right? They won't get added to the cache and end up hanging forever.

Member Author

Let me take another look at this.


go reflector.Run(stopChannel)

// Wait for reflector to sync the cache for the first time
// TODO: check if there's a better way to do this (listing all the nodes seems wasteful)
Member Author

TODO: remove this line (irrelevant)

@@ -368,3 +549,75 @@ func NewConfigMapListerForNamespace(kubeClient client.Interface, stopchannel <-c
go reflector.Run(stopchannel)
return lister
}

// // NewGenericListerFactory initializes a new generic lister factory
Member Author

TODO: remove this code block


// GetLister returns the lister for a particular GVR
func (g listerRegistryImpl) GetLister(gvr schema.GroupVersionKind, namespace string) dynamiclister.Lister {
crd, err := g.crdLister.Get(fmt.Sprintf("%s/%s", gvr.Group, gvr.Kind))
Member Author

I am retrieving the Resource name from the CRD (crdLister is defined here). I use the Resource name to create the GVR, which is needed to create the List and Watch functions in the GenericLister here.

The Resource has to be plural. The ownerRef on a pod doesn't have a Resource, it has a Kind. We can't convert Kind to Resource easily because Kind is singular and Resource is plural and lowercase.
e.g., if the Kind is Deployment, the Resource would be deployments (if you use deployment in place of deployments, the List and Watch functions don't work; to check this for yourself you can change nodes to node in this PoC).
It's not easy to predict the correct plural of the Kind, which is why we need to look up the CRD to check the plural name. This approach gives us accurate results but also feels like overkill.

Would love any suggestion on this.
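
For reference, a sketch of the CRD-based lookup described above (the function name and error handling are illustrative, not the PR's exact code):

package example

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// gvrFromCRD resolves the plural resource name for a Kind by reading
// spec.names.plural from its (unstructured) CRD object, then builds the
// GroupVersionResource the dynamic lister needs.
func gvrFromCRD(crd *unstructured.Unstructured, gvk schema.GroupVersionKind) (schema.GroupVersionResource, error) {
	plural, found, err := unstructured.NestedString(crd.Object, "spec", "names", "plural")
	if err != nil || !found {
		return schema.GroupVersionResource{}, fmt.Errorf("couldn't find 'spec.names.plural' on the CRD %q: %v", crd.GetName(), err)
	}
	return schema.GroupVersionResource{
		Group:    gvk.Group,
		Version:  gvk.Version,
		Resource: plural, // e.g. "deployments", not "Deployment"
	}, nil
}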

Member

That's a good question. I don't think there's a better way though, as the plural string for the resource doesn't appear anywhere in the owner ref.

- for dynamic client
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vadasambar
Once this PR has been reviewed and has the lgtm label, please ask for approval from x13n. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vadasambar
Member Author

This pull-request has been approved by: vadasambar

Not sure why it says this. You can't approve your own PR on GitHub.

Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@x13n
Member

x13n commented Feb 9, 2023

After thinking more about this change I convinced myself it might be introducing too much complexity for little extra benefit. My reasoning is:

  • There is a way out: if there are workloads affected by this change, users can always configure safe-to-evict annotation to indicate the behavior they want.
  • Nodes may get restarted for other reasons than just CA - version upgrades & node crashes/repairs will also cause pods owned by custom controllers to disappear sometimes.
  • I have a bit of general architectural concern about making CA aware of so many additional controllers in the cluster. Where possible I think we should strive for simplicity.
  • Extra QPS to kube-apiserver may throttle other CA behavior (e.g. draining nodes in scale down).
  • If, for whatever reason, it turns out we need to have such an elaborate mechanism, we can always revisit this in the future.

Given the above, I'd suggest limiting the scope and start with blocking scale down only on bare pods (i.e. having no owner refs) and on owner refs pointing to missing known controllers (i.e. the ones we support today, to prevent backwards incompatible changes). WDYT?

@vadasambar
Member Author

vadasambar commented Feb 10, 2023

After thinking more about this change I convinced myself it might be introducing too much complexity for little extra benefit. My reasoning is:

  • There is a way out: if there are workloads affected by this change, users can always configure safe-to-evict annotation to indicate the behavior they want.

  • Nodes may get restarted for other reasons than just CA - version upgrades & node crashes/repairs will also cause pods owned by custom controllers to disappear sometimes.

  • I have a bit of general architectural concern about making CA aware of so many additional controllers in the cluster. Where possible I think we should strive for simplicity.

  • Extra QPS to kube-apiserver may throttle other CA behavior (e.g. draining nodes in scale down).

  • If, for whatever reason, it turns out we need to have such an elaborate mechanism, we can always revisit this in the future.

Given the above, I'd suggest limiting the scope and start with blocking scale down only on bare pods (i.e. having no owner refs) and on owner refs pointing to missing known controllers (i.e. the ones we support today, to prevent backwards incompatible changes). WDYT?

I kind of agree. I didn't think our idea would turn into something this big and complex. I like the idea of querying for the owner of the pods, but I have similar concerns about introducing this level of complexity (for not that big of a benefit).

I think we can close this PR and keep it for posterity, while I create another PR with the limited scope as a first step. If we feel like we need an elaborate solution like the current PR, we can come back to it again.

@x13n
Member

x13n commented Feb 13, 2023

Thanks! If you're planning to follow up with a separate PR, I'm going to close this one.

/close

@k8s-ci-robot
Contributor

@x13n: Closed this PR.

In response to this:

Thanks! If you're planning to follow up with a separate PR, I'm going to close this one.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vadasambar
Member Author

Thanks! If you're planning to follow up with a separate PR, I'm going to close this one.

/close

Thanks. I will create a follow-up PR

Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Don't block scale down on pods owned by custom controllers
3 participants