
feat: check only controller ref to decide if a pod is replicated #5507

Conversation

vadasambar
Member

@vadasambar vadasambar commented Feb 14, 2023

This PR is a follow-up to #5419 (comment)

This PR is still WIP. I will mention the reviewers here once I am done. Although it might not be in the best shape as a WIP PR, if the reviewers have any feedback, I would love to have it. 🙏

Signed-off-by: vadasambar surajrbanakar@gmail.com

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #5387

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added: New flag `--allow-scale-down-on-custom-controller-owned-pods`. If this flag is set to true, cluster-autoscaler doesn't block node scale-down if a pod owned by a custom controller is running on the node.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

TBD


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 14, 2023
@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 14, 2023
// Using names like FooController is discouraged
// https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#naming-conventions
// vadasambar: I am using it here just because `FooController` is easier to understand than, say, `FooSet`
OwnerReferences: GenerateOwnerReferences("Foo", "FooController", "apps/v1", ""),
Member Author

@vadasambar vadasambar Feb 21, 2023

Any other suggestions in place of FooController are welcome.
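For context, a minimal sketch of what a helper along these lines might look like (the real GenerateOwnerReferences in the test utilities may differ; the values are illustrative):

package test

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// GenerateOwnerReferences builds a single controller owner reference for a test pod.
// Sketch only; not necessarily the exact helper used by cluster-autoscaler's tests.
func GenerateOwnerReferences(name, kind, api string, uid types.UID) []metav1.OwnerReference {
	isController := true
	return []metav1.OwnerReference{{
		Name:       name, // e.g. "Foo"
		Kind:       kind, // e.g. "FooController" (a made-up custom controller kind)
		APIVersion: api,  // e.g. "apps/v1"
		UID:        uid,
		Controller: &isController, // the controller ref check keys off this field
	}}
}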

@x13n
Member

x13n commented Feb 28, 2023

/assign

I would preserve the existing behavior for known controllers - perhaps behind a flag - so that there's no regression. We can eventually delete the flag after a few releases, but I'd like to avoid breaking existing users.

@vadasambar
Member Author

vadasambar commented Mar 1, 2023

/assign

I would preserve the existing behavior for known controllers - perhaps behind a flag - so that there's no regression. We can eventually delete the flag after a few releases, but I'd like to avoid breaking existing users.

Makes sense to me (don't want to surprise users with a new default behavior). Let me think about this.
Thank you for the feedback (I always get to learn something because of your feedback).

@vadasambar
Member Author

I had a few hiccups trying to get a custom CA running properly on GKE (all good now). I want to test this PR manually on a GKE cluster. The rough idea is something like this:

  • add a debug log to show that a pod with an owner reference was skipped during scale-down (see the sketch after this list)
  • create a custom CA in a GKE cluster with the PR image tag and the --v flag set to a log level high enough to show the owner-reference debug log
  • schedule a workload with N replicas that needs a new node, e.g., set pod anti-affinity so that N-1 nodes scale out to N nodes
  • run a pod with an owner reference
  • delete the N-replica workload
  • tail the cluster-autoscaler logs to make sure it skips the pod with the owner reference
  • if I see the owner-reference log, screenshot it and paste it in the PR
  • change the PR from draft to ready for review
  • ask for review
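A minimal sketch of the kind of debug log mentioned in the first step (assuming klog, which cluster-autoscaler uses; the message, verbosity level, and surrounding check are illustrative, not the PR's exact code):

// Illustrative only: log when a pod is not treated as blocking for scale-down
// because it has a controller owner reference.
if ref := ControllerRef(pod); ref != nil {
	klog.V(5).Infof("pod %s/%s is controlled by %s %s; not blocking scale-down",
		pod.Namespace, pod.Name, ref.Kind, ref.Name)
}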

@x13n
Member

x13n commented Mar 3, 2023

Rather than on logs, I'd just rely on behavior. In this case you could do something like this:

  • Scale cluster to N nodes using N replica workload you described.
  • Put static pods on N-1 nodes and one custom-controlled pod on the last node.
  • Delete the workload.
  • Observe whether the node with the custom-controlled pod gets scaled down.

Note: N=2 might be problematic due to system workloads, but N=3 should work. Alternatively, you can use a separate nodepool to put all non-DS system workloads there using taints/tolerations, so that they don't interfere with the test.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2023
maxFreeDifferenceRatio = flag.Float64("max-free-difference-ratio", config.DefaultMaxFreeDifferenceRatio, "Maximum difference in free resources between two similar node groups to be considered for balancing. Value is a ratio of the smaller node group's free resource.")
maxAllocatableDifferenceRatio = flag.Float64("max-allocatable-difference-ratio", config.DefaultMaxAllocatableDifferenceRatio, "Maximum difference in allocatable resources between two similar node groups to be considered for balancing. Value is a ratio of the smaller node group's allocatable resource.")
forceDaemonSets = flag.Bool("force-ds", false, "Blocks scale-up of node groups too small for all suitable Daemon Sets pods.")
allowScaleDownOnCustomControllerOwnedPods = flag.Bool("allow-scale-down-on-custom-controller-owned-pods", false, "Don't block node scale-down if a pod owned by a custom controller is running on the node.")
Member Author

Better naming suggestions are welcome (the flag name feels a little too long).

Member Author

The default is set to false right now to preserve backwards compatibility. This flag would be set to true in the future (we might remove it completely and make true the default behavior).

Contributor

@vadasambar Just an idea, maybe you could follow the skip-nodes-with-* scheme, for instance,
skip-nodes-with-custom-controller-pods?

Member Author

@gregth thank you for the suggestion. I thought the new flag name skip-nodes-with-custom-controller-pods wouldn't tell the user whether the skipping happens during scale-up or scale-down, but it looks like we already use the skip-nodes-with-* naming for skipping things when scaling down nodes:

skipNodesWithSystemPods = flag.Bool("skip-nodes-with-system-pods", true, "If true cluster autoscaler will never delete nodes with pods from kube-system (except for DaemonSet or mirror pods)")
skipNodesWithLocalStorage = flag.Bool("skip-nodes-with-local-storage", true, "If true cluster autoscaler will never delete nodes with pods with local storage, e.g. EmptyDir or HostPath")

Making the naming consistent makes sense to me. I will change it to skip-nodes-with-custom-controller-pods. Thank you.

P.S.: If you have any other suggestions, I would love to know.

} else {
replicated = true
} else {
checkReferences := listers != nil
Member Author

The code in the else part is just a copy-paste of our current code. I have moved a couple of variable declarations into the else part so that it's easier to remove in the future.
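Roughly, the shape of the change under discussion (a sketch, not the verbatim diff; the flag name matches what it was called at this point in the review):

if allowScaleDownOnCustomControllerOwnedPods {
	// New behavior: any pod with a controller owner reference counts as replicated,
	// regardless of whether CA knows the controller kind.
	if ControllerRef(pod) != nil {
		replicated = true
	}
} else {
	// Legacy behavior: the existing per-controller checks, with declarations such as
	// `checkReferences := listers != nil` moved inside this branch so the whole block
	// is easy to delete in one piece later.
}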

@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 6, 2023
@vadasambar
Member Author

vadasambar commented Mar 8, 2023

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

* [18902e8](https://github.com/kubernetes/autoscaler/commits/18902e8fd48a0f789608b69f42d25d02ab140406) fix: remove `@` in `@vadasambar`

📝

P.S.: Fixed

@vadasambar vadasambar force-pushed the feature/5387/allow-scale-down-with-custom-controller-pods-2 branch from b718749 to 13921ae Compare March 8, 2023 05:27
@k8s-ci-robot k8s-ci-robot removed do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 8, 2023
@vadasambar
Member Author

vadasambar commented Mar 9, 2023

Testing

Created a new nodepool in GKE with taints (pool-1)
[screenshot]

Set the flag to true:
[screenshot]

Scaled up the pool-1 nodepool using the following workload:

# +kubectl
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-scale-up-with-pod-anti-affinity
  namespace: default
spec:
  selector:
    matchLabels:
      app: node-scale-up-with-pod-anti-affinity
  replicas: 1
  template:
    metadata:
      labels:
        app: node-scale-up-with-pod-anti-affinity
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: pool-1
      tolerations:
        - effect: NoSchedule
          key: test
          operator: Equal
          value: "true"
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - node-scale-up-with-pod-anti-affinity
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: node-scale-up-with-pod-anti-affinity
        image: registry.k8s.io/pause:2.0

Deployed the test pod with a custom controller owner reference:

apiVersion: v1
kind: Pod
metadata:
  name: custom-controller-owned-pod
  ownerReferences:
  - apiVersion: foos
    kind: FooSet
    controller: true
    name: custom-controller-owned-pod-56fdfd787b
    uid: 1c6544a7-12e7-426c-bd2d-7ac858d18d7d 
spec:
  nodeName: gke-cluster-1-pool-1-b66d130e-9mrw # name of the node that was created to accommodate the scale-up workload
  tolerations:
    - effect: NoSchedule 
      key: test  # pool-1 nodepool is tainted with this taint
      operator: Equal
      value: "true"
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent

Removed the scale-up workload to see how CA behaves. Sure enough, it removes the pod with the custom controller owner reference:
[screenshot]

You can see some debug logs like `inside new loop` and `allowScaleDownOnCustomController...`. These were added to debug a problem where CA didn't skip the custom-controller-owned pod. It turns out I had forgotten to add `controller: true` in the owner reference.
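For anyone hitting the same issue: the controller ref check only sees owner references with controller: true. Assuming the drain code resolves the controller owner the usual apimachinery way (something like metav1.GetControllerOf), a minimal illustration:

package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	isController := true
	pod := &apiv1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "custom-controller-owned-pod",
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "foos",
				Kind:       "FooSet",
				Name:       "custom-controller-owned-pod-56fdfd787b",
				Controller: &isController, // without this, GetControllerOf returns nil
			}},
		},
	}
	// GetControllerOf returns the owner reference whose Controller field is true, or nil.
	fmt.Println(metav1.GetControllerOf(pod) != nil) // true; false if Controller is left unset
}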

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 11, 2023
@vadasambar vadasambar force-pushed the feature/5387/allow-scale-down-with-custom-controller-pods-2 branch from 3d24178 to 88c4a2d Compare March 13, 2023 04:24
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 13, 2023
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 15, 2023
maxNodeGroupBinpackingDuration = flag.Duration("max-nodegroup-binpacking-duration", 10*time.Second, "Maximum time that will be spent in binpacking simulation for each NodeGroup.")
skipNodesWithSystemPods = flag.Bool("skip-nodes-with-system-pods", true, "If true cluster autoscaler will never delete nodes with pods from kube-system (except for DaemonSet or mirror pods)")
skipNodesWithLocalStorage = flag.Bool("skip-nodes-with-local-storage", true, "If true cluster autoscaler will never delete nodes with pods with local storage, e.g. EmptyDir or HostPath")
scaleDownNodesWithCustomControllerPods = flag.Bool("scale-down-nodes-with-custom-controller-pods", false, "If true cluster autoscaler will delete nodes with pods owned by custom controllers")
Member Author

This is the new flag. It looks like adding it changed the whitespace alignment in every flag definition.

NodeGroupBackoffResetTimeout: *nodeGroupBackoffResetTimeout,
MaxScaleDownParallelism: *maxScaleDownParallelismFlag,
MaxDrainParallelism: *maxDrainParallelismFlag,
GceExpanderEphemeralStorageSupport: *gceExpanderEphemeralStorageSupport,
Member Author

Same here. Extra space was added because I added ScaleDownNodesWithCustomControllerPods on line 332 below.

@@ -231,6 +158,104 @@ func GetPodsForDeletionOnNodeDrain(
return pods, daemonSetPods, nil, nil
}

func legacyCheckForReplicatedPods(listers kube_util.ListerRegistry, pod *apiv1.Pod, minReplica int32) (replicated bool, isDaemonSetPod bool, blockingPod *BlockingPod, err error) {
Member Author

The name legacyCheckForReplicatedPods doesn't account for checking for daemon set pods or blocking pods, but legacyCheckForReplicatedAndDaemonSetAndBlockingPods feels too long, so I have kept it as is for now.

Member Author

@vadasambar vadasambar Mar 15, 2023

Note that the return values (replicated bool, isDaemonSetPod bool, blockingPod *BlockingPod, err error) are named. I did this because I think it improves readability when you hover over the function call in an IDE, e.g.,
[screenshot]

This might not be the best approach since it can confuse someone looking at lines 162 and 165 below (the variables are already defined because they are named return values). Open to other ideas here.
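A minimal, generic illustration of the named-returns point (not the actual drain code):

package main

import "fmt"

// With named return values, hovering over checkPod in an IDE shows what each
// returned bool means instead of just (bool, bool, error).
func checkPod(hasControllerRef, isMirror bool) (replicated bool, isDaemonSetPod bool, err error) {
	replicated = hasControllerRef
	isDaemonSetPod = false
	if isMirror {
		err = fmt.Errorf("mirror pods are handled separately")
	}
	// The named values are already declared, so a bare `return` would also work here,
	// which is the potentially confusing part mentioned above.
	return replicated, isDaemonSetPod, err
}

func main() {
	r, d, err := checkPod(true, false)
	fmt.Println(r, d, err) // true false <nil>
}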

checkReferences := listers != nil
isDaemonSetPod = false

controllerRef := ControllerRef(pod)
Member Author

@vadasambar vadasambar Mar 15, 2023

This is the same code as our current code here, abstracted into a function. I have made some minor changes:

  1. Change the return values to include replicated and isDaemonSetPod. Also, remove []*apiv1.Pod since we don't need to return it from this function; returning the pod list is done in the else part here.
  2. Don't append to the daemonSetPods slice here. Instead, just return isDaemonSetPod and let the calling function do the appending (sketched below), because I thought it was cleaner and we need the appending even if we remove this function in the future.
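So the call site ends up looking roughly like this (a sketch of the shape described in the two points above, not the verbatim diff):

replicated, isDaemonSetPod, blockingPod, err := legacyCheckForReplicatedPods(listers, pod, minReplica)
if err != nil {
	return []*apiv1.Pod{}, []*apiv1.Pod{}, blockingPod, err
}
if isDaemonSetPod {
	// the caller, not legacyCheckForReplicatedPods, appends to daemonSetPods
	daemonSetPods = append(daemonSetPods, pod)
}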

for _, test := range tests {
// run all tests for scaleDownNodesWithCustomControllerPods=false
// TODO(vadasambar): remove this when we get rid of scaleDownNodesWithCustomControllerPods
for _, test := range append(tests, testOpts{
Member Author

I am doing the append in the for loop declaration itself. This might not be the best approach. I didn't want to create two variables called testsWithScaleDownNodesWithCustomControllerPodsDisabled and testsWithScaleDownNodesWithCustomControllerPodsEnabled. Any better suggestions are welcome.

Member

What is wrong with being explicit? I'd do something along the lines of:

  1. Make custom controller pods flag a part of the test case struct.
  2. Define a list of shared test cases.
  3. Define a list of flag-disabled test cases (copy the shared ones and add the extra one)
  4. Define a list of flag-enabled test cases (copy the shared ones, flip the flag on all and add the extra one)
  5. Define the body of the test once and run it for a combined list (flag-enabled + flag-disabled).

WDYT?

Member

Alternatively, just copy all test cases and have a single list in the first place :)

Member Author

Thank you for the suggestion. I like the idea of putting the flag on the test struct and combining all tests into one. Let me try it out.


}

// run all tests for scaleDownNodesWithCustomControllerPods=true
for _, test := range append(tests, testOpts{
Member Author

@vadasambar vadasambar Mar 15, 2023

Note that I am running the same tests twice. The code here is practically the same as the code beyond the line above. The only difference is that scaleDownNodesWithCustomControllerPods is set to false here and set to true in the test cases beyond the above line.

I did think of abstracting the duplicate code into a separate function, but I am not so sure after thinking about the comment here, since abstracting the tests into a function would add another level of indirection. I am a little on the fence about doing that now. Any suggestions here are welcome.

@vadasambar vadasambar requested review from x13n and removed request for gregth March 15, 2023 06:26
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 20, 2023
SkipNodesWithSystemPods: true,
SkipNodesWithLocalStorage: true,
MinReplicaCount: 0,
SkipNodesWithCustomControllerPods: true,
Member Author

I have set SkipNodesWithCustomControllerPods: true to preserve the current behavior of these tests.

if err != nil {
return []*apiv1.Pod{}, []*apiv1.Pod{}, blockingPod, err
}
} else {
if controllerRef != nil {
Member

Looks like controllerRef is not used anywhere else, so maybe just directly call ControllerRef() here?

Member Author

Updated. Thank you for pointing this out.

if len(pods) != len(test.expectPods) {
t.Fatalf("Wrong pod list content: %v", test.description)
}
for i := range tests {
Member

My suggestion was to duplicate the test cases and flip the flag on the copy to ensure the new flag doesn't affect the vast majority of test cases.

Member Author

@vadasambar vadasambar Mar 20, 2023

I think there's another problem with the code: the test cases for flag=false are missing.

My suggestion was to duplicate the test cases and flip the flag on the copy to ensure the new flag doesn't affect the vast majority of test cases.

I see. Just to confirm my understanding: you mean something like this, right?

tests := []testOpts{
	// *flag: true test cases*
	{
		// shared test case 1
		flag: true,
	},
	{
		// shared test case 2
		flag: true,
	},
	{
		// additional test case 1
		flag: true,
	},
	// *flag: false test cases*
	{
		// shared test case 1
		flag: false,
	},
	{
		// shared test case 2
		flag: false,
	},
	{
		// additional test case 1
		flag: false,
	},
}

for _, test := range tests {
	// execute test case
}

shared test case 1, shared test case 2 and additional test case 1 are duplicated with some minor tweaks (if required) and flipped flags.

Member Author

If my understanding is correct, one problem I see is that we would have to change the names of the duplicated test cases slightly to differentiate them from the original test cases, so that it's easy to identify which test case failed.

Member Author

Which I think should be fine, since it would be a one-time thing for now.

But if we have a similar flag in the future, say --skip-nodes-with-ignore-local-vol-storage-annotation, we would need 4 sets of test cases ([current-flag=true, future-flag=true], [true, false], [false, false], [false, true]) which are almost identical, with slight differences. This will get difficult to maintain as we go forward. One argument here is that we can deal with it once we get to it.

Member Author

I think we might need a more future-proof solution here but it would also increase the code complexity.

Member Author

Maybe I can duplicate the test cases for now, and there can be another issue to work through a better solution for the test-case duplication/maintainability problem.

Member Author

@vadasambar vadasambar Mar 21, 2023

we would have to change name of the duplicated test cases

This can be solved by adding the flag to the error log output here so that the error output includes the flag value. Something like this:

drain_test.go:955: Custom-controller-managed non-blocking pod: unexpected non-error, skipNodesWithCustomControllerPods: false
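i.e. something along these lines in the failure path (hypothetical sketch; the actual assertion and field names in drain_test.go may differ):

// Hypothetical: include the flag value in the failure message so the flag=true
// and flag=false runs of the same shared test case can be told apart.
if err == nil {
	t.Fatalf("%s: unexpected non-error, skipNodesWithCustomControllerPods: %v",
		test.description, test.skipNodesWithCustomControllerPods)
}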

Member Author

Not sure if this is what you wanted to tell me. This could be a possible solution:

sharedTests := []testOpts{
	{
		// shared test case 1
	},
	{
		// shared test case 2
	},
}

allTests := []testOpts{}
for _, sharedTest := range sharedTests {
	sharedTest.skipNodesWithCustomControllerPods = true
	allTests = append(allTests, sharedTest)
	sharedTest.skipNodesWithCustomControllerPods = false
	allTests = append(allTests, sharedTest)
}

allTests = append(allTests, testOpts{
	// additional test case 1
	skipNodesWithCustomControllerPods: true,
})

allTests = append(allTests, testOpts{
	// additional test case 1
	skipNodesWithCustomControllerPods: false,
})

// if a similar flag is added in the future, the same pattern repeats
for _, sharedTest := range sharedTests {
	sharedTest.newFlagInFuture = true
	allTests = append(allTests, sharedTest)
	sharedTest.newFlagInFuture = false
	allTests = append(allTests, sharedTest)
}

allTests = append(allTests, testOpts{
	// additional test case 2 for new flag in future
	newFlagInFuture: false,
})

allTests = append(allTests, testOpts{
	// additional test case 2 for new flag in future
	newFlagInFuture: true,
})

for _, test := range allTests {
	// execute test case
}

Member Author

I have updated this PR with a reference implementation based on the above idea. Happy to change it based on your review comments.

Member

LGTM, thanks!

for _, sharedTest := range sharedTests {
// to execute the same shared tests for when the skipNodesWithCustomControllerPods flag is true
// and when the flag is false
sharedTest.skipNodesWithCustomControllerPods = true
Member

nit: You could update sharedTest.description to append a flag here, so it only affects the shared tests.

Member Author

Updated. 👍

@x13n
Member

x13n commented Mar 21, 2023

Thanks for the changes! The code looks good to me now, can you just squash the commits before merging?

// make sure you shallow copy the test like this
// before you modify it
// (so that modifying one test doesn't affect another)
enabledTest := sharedTest
Member Author

@vadasambar vadasambar Mar 22, 2023

Note that I am creating a shallow copy of sharedTest so that modifying it for the first append doesn't affect the second one. For example, without the copy:

for _, sharedTest := range sharedTests {
	sharedTest.skipNodesWithCustomControllerPods = true
	sharedTest.description = fmt.Sprintf("%s with skipNodesWithCustomControllerPods: %v", 
		sharedTest.description, sharedTest.skipNodesWithCustomControllerPods)
	allTests = append(allTests, sharedTest)

	sharedTest.skipNodesWithCustomControllerPods = false
	sharedTest.description = fmt.Sprintf("%s with skipNodesWithCustomControllerPods: %v", 
		sharedTest.description, sharedTest.skipNodesWithCustomControllerPods)
	allTests = append(allTests, sharedTest)
}

fmt.Println("allTests[0]", allTests[0])
fmt.Println("allTests[1]", allTests[1])

Prints

allTests[0] {RC-managed pod with skipNodesWithCustomControllerPods:true [0xc00018ad80] [] [0xc000498c60] [] false [0xc00018ad80] [] <nil> true}
allTests[1] {RC-managed pod with skipNodesWithCustomControllerPods:true with skipNodesWithCustomControllerPods:false [0xc00018ad80] [] [0xc000498c60] [] false [0xc00018ad80] [] <nil> false}

Note the doubled-up description and the last bool (which is our flag).

To overcome this, I use a shallow copy.
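A sketch of the loop with the shallow copy applied (same idea as the snippet above, just copying before mutating):

for _, sharedTest := range sharedTests {
	// shallow copy, so mutating one variant doesn't leak into the other
	enabledTest := sharedTest
	enabledTest.skipNodesWithCustomControllerPods = true
	enabledTest.description = fmt.Sprintf("%s with skipNodesWithCustomControllerPods: %v",
		enabledTest.description, enabledTest.skipNodesWithCustomControllerPods)
	allTests = append(allTests, enabledTest)

	disabledTest := sharedTest
	disabledTest.skipNodesWithCustomControllerPods = false
	disabledTest.description = fmt.Sprintf("%s with skipNodesWithCustomControllerPods: %v",
		disabledTest.description, disabledTest.skipNodesWithCustomControllerPods)
	allTests = append(allTests, disabledTest)
}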

Member Author

Output when we use the shallow copy:

allTests[0] {RC-managed pod with skipNodesWithCustomControllerPods:true [0xc000202d80] [] [0xc0004b54a0] [] false [0xc000202d80] [] <nil> true}
allTests[1] {RC-managed pod with skipNodesWithCustomControllerPods:false [0xc000202d80] [] [0xc0004b54a0] [] false [0xc000202d80] [] <nil> false}

Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit 144a64a)

fix: set `replicated` to true if controller ref is set to `true`
- forgot to add this in the last commit

Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit f8f4582)

fix: remove `checkReferences`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

(cherry picked from commit 5df6e31)

test(drain): add test for custom controller pod
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add flag to allow scale down on custom controller pods
- set to `false` by default
- `false` will be set to `true` by default in the future
- right now, we want to ensure backwards compatibility and make the feature available if the flag is explicitly set to `true`
- TODO: this code might need some unit tests. Look into adding unit tests.
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: remove `at` symbol in prefix of `vadasambar`
- to keep it consistent with previous such mentions in the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(utils): run all drain tests twice
- once for  `allowScaleDownOnCustomControllerOwnedPods=false`
- and once for `allowScaleDownOnCustomControllerOwnedPods=true`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs(utils): add description for `testOpts` struct
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: update FAQ with info about `allow-scale-down-on-custom-controller-owned-pods` flag
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `allow-scale-down-on-custom-controller-owned-pods` -> `skip-nodes-with-custom-controller-pods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `allowScaleDownOnCustomControllerOwnedPods` -> `skipNodesWithCustomControllerPods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(utils/drain): fix failing tests
- refactor code to add custom controller pod test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: fix long code comments
- clean-up print statements
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: move `expectFatal` right above where it is used
- makes the code easier to read
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: fix code comment wording
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address PR comments
- abstract legacy code to check for replicated pods into a separate function so that it's easier to remove in the future
- fix param info in the FAQ.md
- simplify tests and remove the global variable used in the tests
- rename `--skip-nodes-with-custom-controller-pods` -> `--scale-down-nodes-with-custom-controller-pods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename flag `--scale-down-nodes-with-custom-controller-pods` -> `--skip-nodes-with-custom-controller-pods`
- refactor tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: update flag info
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: forgot to change flag name on a line in the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `ControllerRef()` directly instead of `controllerRef`
- we don't need an extra variable
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: create tests consolidated test cases
- from looping over and tweaking shared test cases
- so that we don't have to duplicate shared test cases
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: append test flag to shared test description
- so that the failed test is easy to identify
- shallow copy tests and add comments so that others do the same
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@vadasambar vadasambar force-pushed the feature/5387/allow-scale-down-with-custom-controller-pods-2 branch from dee5eae to ff6fe58 Compare March 22, 2023 05:21
@vadasambar
Member Author

Thanks for the changes! The code looks good to me now, can you just squash the commits before merging?

Thank you for bearing with me. I have squashed the commits and added a comment.

@vadasambar
Member Author

@x13n unless you have any other comments, I think I'm done from my side.

@x13n
Member

x13n commented Mar 24, 2023

Great, thank you for the changes!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 24, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vadasambar, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 24, 2023
@k8s-ci-robot k8s-ci-robot merged commit b8ba233 into kubernetes:master Mar 24, 2023