Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logarithmic timestamp comparison for downscaling #99212

Merged

Conversation

damemi
Copy link
Contributor

@damemi damemi commented Feb 18, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implements logarithmically-compared timestamps for replica scale downs, from #96898:

Compares ready and creation timestamps in a logarithmic scale. This allows for some level of randomness when Pods are quick-sorted to get downscaling candidates.

Used base 2. This means that (roughly) if a Pod A has been created/running for less than half the time of Pod B, then Pod A will be downscaled first. But if Pod A has been created/running for more than half the time of Pod B, they can be equally downscaled.

Which issue(s) this PR fixes:

Ref kubernetes/enhancements#2185 and #96748

Special notes for your reviewer:

This is a proposal that has very low overhead compared to #96748. Since behavior is not backwards compatible, we could release with a FeatureGate first. Then, from feedback, we can adjust the logarithmic base or, if we find out that the behavior might not be desired by everyone, we could make it a configuration option.

Does this PR introduce a user-facing change?

When downscaling ReplicaSets, ready and creation timestamps are compared in a logarithmic scale.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 18, 2021
Copy link
Contributor Author

@damemi damemi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@damemi
Copy link
Contributor Author

damemi commented Feb 18, 2021

/retest

@damemi damemi force-pushed the alculquicondor-log-timestamp branch from 088c236 to d6ec6a6 Compare February 18, 2021 21:50
@alculquicondor
Copy link
Member

You forgot to add the KEP to the description :)

Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are missing the feature gate

@@ -808,6 +809,9 @@ func (s ActivePods) Less(i, j int) bool {
// 7. If the pods' creation times differ, the pod that was created more recently
// comes before the older pod.
//
// In 5 and 7, times are compared in a logarithmic scale. This allows a level
// of randomness among equivalent Pods when sorting.
//
// If none of these rules matches, the second pod comes before the first pod.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the UUID comparison that was suggested in the KEP?

@@ -494,6 +495,45 @@ func TestSpecReplicasChange(t *testing.T) {
}
}

func TestLogarithmicScaleDown(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is ensuring that the existing behavior somewhat still holds. That is good. We should run it with the feature gate enabled and disabled (I would remove the Logarithmic part from the test name)

Another test we could have is rather opposite: have all the pods run at roughly the same time, downscale, and see if some level of randomness holds. Do you think something like this is possible? Maybe we can run the same test X times and ensure that at least in one of them, the removed pod was not the absolute youngest.

In the test plan we also suggested emulating the story https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2185-random-pod-select-on-replicaset-downscale#test-plan
We can do that in a follow up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have all the pods run at roughly the same time, downscale, and see if some level of randomness holds ... Maybe we can run the same test X times

There is an inherent problem with randomness in that we can never be certain it will be the randomness we expect :) Something like this will introduce flakes, which can be minimized by increasing the number of test runs, but still always possible. Are there examples of other tests which take a similar approach?

What we really wish to show from such a test is that pods created on a similar logarithmic time scale have an equal chance of being chosen for downscaling. In other words, that their ranks are calculated properly. I believe the unit tests cover this level of detail sufficiently.

The integration test is simply showing that the basic logic still works when removed from the vacuum of unit tests. I think that the user story you linked will make a good e2e, and I could actually add that to this PR (I don't see any reason to wait, and that will flesh out test cases here more comfortably). Wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the unit tests cover this level of detail sufficiently.

I agree, but thanks for giving it a thought.

Still, modify the test to run for feature gate enabled and disabled.

Fine with me to add the 2e2 test for the user story in this PR.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2021
@damemi damemi force-pushed the alculquicondor-log-timestamp branch from d6ec6a6 to d99be17 Compare March 3, 2021 15:49
@damemi
Copy link
Contributor Author

damemi commented Mar 3, 2021

@alculquicondor I added the UID sorting right before the call to actually sort the pods in 5ffacd0a98bb855bcbe067b2560a6f56c8775254. Let me know if that looks good, it should at least give a pseudo-random start before pods of similar ranks are sorted.

I am still working on how to add the e2e. I'm thinking these steps:

  1. Create X pods
  2. Wait 30 seconds (this should be enough that those pods will have a rank of 4)
  3. Create Y more pods
  4. Wait another 30 seconds (now the first pods should be ~60s old w/rank 6, and the new pods have rank 4)
  5. Downscale by Y pods
  6. Confirm that the remaining pods are all from the original group X

My goal is to create 2 groups of pods, spaced far enough apart that they will have different ranks. However I add the 30s waits to reduce flakes caused by latency (in other words, the difference between a base-2 rank 2 and rank 3 is only 4 seconds, but rank 5 and 6 is 32 seconds).

What do you think of this approach? I want to avoid adding too much "scheduling" dependency into the test (ie, assuming that pods will be evenly distributed among nodes) because that introduces flakes unless the nodes are evenly balanced and I think it's irrelevant to this anyway.

@@ -802,6 +802,10 @@ func getPodsToDelete(filteredPods, relatedPods []*v1.Pod, diff int) []*v1.Pod {
// diff will always be <= len(filteredPods), so not need to handle > case.
if diff < len(filteredPods) {
podsWithRanks := getPodsRankedByRelatedPodsOnSameNode(filteredPods, relatedPods)
// First obtain a pseudo-random ordering by sorting by pod UID
sort.Slice(podsWithRanks.Pods, func(i, j int) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just add the comparison in the existing Less? I feel like that should perform better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that come before or after the timestamp comparison?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a comparison criteria with the lowest priority, so after.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case I should also update logarithmicOrAfterZero to check if the 2 ranks are equal (and continue), rather than just returning... or I can add the UID check right into logarithmicOrAfterZero. Wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably better to be explicit about UIDs by having that comparison outside of timestamp comparison.

@alculquicondor
Copy link
Member

Re: E2E

I think we should do what was suggested in the KEP: simulate the scenario from the story and make sure spreading is good, with some allowance for skew.

@damemi damemi force-pushed the alculquicondor-log-timestamp branch from c43188e to 469e24d Compare March 3, 2021 17:53
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2021
@damemi damemi force-pushed the alculquicondor-log-timestamp branch from 30808c2 to bb80f72 Compare March 4, 2021 20:53
Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good for squash

pkg/controller/controller_utils.go Outdated Show resolved Hide resolved
pkg/controller/controller_utils.go Outdated Show resolved Hide resolved
@damemi damemi force-pushed the alculquicondor-log-timestamp branch from bb80f72 to d5054e2 Compare March 5, 2021 15:04
@alculquicondor
Copy link
Member

/assign @kow3ns

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
Some minor nits.

@@ -807,6 +808,10 @@ func (s ActivePods) Less(i, j int) bool {
// 8. If the pods' creation times differ, the pod that was created more recently
// comes before the older pod.
//
// In 5 and 8, times are compared in a logarithmic scale. This allows a level
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// In 5 and 8, times are compared in a logarithmic scale. This allows a level
// In 6 and 8, times are compared in a logarithmic scale. This allows a level

if rankDiff > 0 {
return false
}
return s.Pods[i].UID < s.Pods[j].UID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is identical for both pieces, how about making this a function rather than repeating the code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to keep the UID comparison in this method, for visibility. But a shorter alternative would be

diff := logarithmicRankDiff(s.Pods[i].CreationTimestamp, s.Pods[j].CreationTimestamp, s.Now)
if diff == 0 {
  return s.Pods[i].UID < s.Pods[j].UID
}
return diff < 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I did it my way to keep the UID comparison physically last since it has the lowest priority. But if this is cleaner, we can do that too

r2 = -1
} else {
r2 = int64(math.Log2(float64(d2)))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r1 := int64(-1)
r2 := int64(-1)
if d1 > 0 {
	r1 = int64(math.Log2(float64(d1)))
}
if d2 > 0 {
	r2 = int64(math.Log2(float64(d2)))
}

Seems simpler, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 5, 2021
@damemi damemi force-pushed the alculquicondor-log-timestamp branch 2 times, most recently from 71c0029 to 15fc96f Compare March 5, 2021 20:29
@damemi
Copy link
Contributor Author

damemi commented Mar 5, 2021

Thanks for the review @soltysh. I pushed and rebased with the feedback, and also added the feature gate @alculquicondor pointed out from #99212 (review). Please let me know if there is anything else I need to do

Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to enable the feature gate for the tests.

I had long forgotten about the gate 😂

Change-Id: I0657ea0ce41b98fdee1a5307b5826a10deaff98c
@damemi damemi force-pushed the alculquicondor-log-timestamp branch from 15fc96f to a8d105a Compare March 5, 2021 20:58
Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 5, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants