
Add KEP for random ReplicaSet downscale #2233

Merged
merged 1 commit into kubernetes:master on Feb 5, 2021

Conversation

@damemi (Contributor) commented Jan 6, 2021

This introduces the KEP for random ReplicaSet downscale (as previously discussed in kubernetes/kubernetes#96748). It currently addresses the sections needed for alpha designation; looking for feedback from sig-apps and sig-scheduling.

re: #2185

/sig apps
/sig scheduling

@k8s-ci-robot added the sig/apps, sig/scheduling, and cncf-cla: yes labels on Jan 6, 2021
@k8s-ci-robot added the kind/kep and size/XL labels on Jan 6, 2021
@damemi (Contributor, Author) left a comment:

/cc @alculquicondor @ahg-g @kubernetes/sig-apps-feature-requests

@k8s-ci-robot added the kind/feature label on Jan 6, 2021
@damemi force-pushed the random-downscale branch 2 times, most recently from 2522b5f to 0ed4d54, on January 6, 2021
@k8s-ci-robot added the size/L label and removed the size/XL label on Jan 6, 2021
@alculquicondor (Member):

Don't mark this as "fixes". The issue remains open until graduation to GA.

@damemi (Contributor, Author) commented Jan 6, 2021:

@alculquicondor thanks, updated

- Unit and e2e tests

Beta (v1.22):
- Enable RandomReplicaSetDownscale feature gate by default
Member:

We should have some user feedback criteria.

Would silent consensus be enough?

Contributor Author:

Asking for user feedback is probably a good idea, but otherwise I think no news is good news

For example, let's assume base 10 is used. Then we have the following
mapping for different durations:

| Duration | Scale |
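For illustration of the duration-to-scale mapping above, here is a minimal Go sketch of one way it could be computed, assuming the bucket is floor(log_base(elapsed seconds)) with sub-second durations clamped to 0; `scaleOf` is a hypothetical helper for this thread, not the KEP's actual code, and the exact rounding may differ:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// scaleOf maps an elapsed duration to a logarithmic "scale" bucket.
// Illustrative only: it assumes floor(log_base(elapsed seconds)), with
// sub-second durations clamped to bucket 0.
func scaleOf(elapsed time.Duration, base float64) int {
	seconds := elapsed.Seconds()
	if seconds < 1 {
		return 0
	}
	return int(math.Floor(math.Log(seconds) / math.Log(base)))
}

func main() {
	for _, d := range []time.Duration{
		30 * time.Second, 5 * time.Minute, 2 * time.Hour, 24 * time.Hour,
	} {
		fmt.Printf("%v -> scale %d\n", d, scaleOf(d, 10))
	}
}
```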
Member:

perhaps clarify that scale translates to the pod rank.

Member:

rank as defined in the code? It doesn't

Member:

Not literally as defined in the code, but that it translates to the order.

Member:

It does not. It just changes how timestamps are compared, which is the last criterion for Pod comparisons.

Member:

Ok, maybe I am being too brief since I am commenting from my phone (you also need to give me the benefit of the doubt :) ). What I meant is that this is the new sorting key.

Member:

Hopefully this gets explained with the other comment.
It can be seen as a sorting key, but saying that would create confusion, as it is not the only criterion.

Member:

there are multiple sorting keys, not just one.

creation timestamp.

Instead of directly comparing timestamps, the algorithm compares the elapsed
times since the timestamp until the current time but in a logarithmic scale,
Member:

"since the timestamp"

Which timestamp? creation timestamp?

Member:

there are 2: creation and ready

Member:

so we have two sorting keys?

Contributor Author:

I'm confused by this now too, I thought this was just referring to creationTimestamp (picked this part up from Aldo's draft)

Member:

Diving into implementation details here:
There is not a single sorting key. There are sorting criteria
https://github.com/kubernetes/kubernetes/blob/cac9339/pkg/controller/controller_utils.go#L784-L809

Two of these criteria are ready time and creation time. We should affect both.
Feel free to include some of these details in the KEP, @damemi

Member:

right, so criteria 5 and 7 each will be updated so that the key for each is not the timestamp, but the "scale".
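To make that concrete, here is a minimal Go sketch of what a scale-based comparison for one of those criteria could look like; the stored timestamps are untouched, only the comparison changes. `logScale` and `compareByScale` are hypothetical names used for illustration, not the controller's actual helpers:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// logScale buckets an elapsed duration as floor(log10(seconds)); the base
// and rounding are assumptions made for this sketch.
func logScale(elapsed time.Duration) int {
	s := elapsed.Seconds()
	if s < 1 {
		return 0
	}
	return int(math.Floor(math.Log10(s)))
}

// compareByScale compares two timestamps (e.g. ready or creation time) by the
// scale of the time elapsed since each, rather than by the raw values. It
// returns 0 when both fall into the same bucket, letting later criteria
// (or a pseudo-random tiebreak) decide.
func compareByScale(now, t1, t2 time.Time) int {
	s1, s2 := logScale(now.Sub(t1)), logScale(now.Sub(t2))
	switch {
	case s1 < s2:
		return -1 // t1 is "younger" at this granularity
	case s1 > s2:
		return 1
	default:
		return 0
	}
}

func main() {
	now := time.Now()
	a := now.Add(-40 * time.Second) // scale 1
	b := now.Add(-90 * time.Second) // scale 1: ties with a despite different raw times
	c := now.Add(-3 * time.Hour)    // scale 4
	fmt.Println(compareByScale(now, a, b), compareByScale(now, a, c)) // 0 -1
}
```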

We propose a randomized approach to the algorithm for Pod victim selection
during ReplicaSet downscale:

1. Do a random shuffle of ReplicaSet Pods.
Member:

why is this necessary?

Member:

I'm not sure that quicksort with a random pivot guarantees that every order is equally likely.
If it doesn't, then we would favor removing Pods in the order provided by the lister.

Member:

the last sorting key can be a random number or the pods' uuid

Member:

uid sounds good. The random number would have to be produced before calling sort (and not in the Less function) otherwise it leads to undefined behavior.

Member:

Yes, the random numbers need to be assigned beforehand, hence the uuid suggestion.
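A minimal Go sketch of that point, using a simplified `entry` type as a stand-in for a pod: the pseudo-random tiebreak is assigned per element before sorting, never generated inside the Less function (which would make comparisons inconsistent across calls and break sort's contract). Using the pod UID directly would avoid the extra field entirely, since UIDs are already effectively random:

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// entry is an illustrative stand-in for a pod plus its tiebreak key.
type entry struct {
	name string
	key  int64 // pre-assigned random tiebreak (a UID would serve the same purpose)
}

func main() {
	names := []string{"pod-a", "pod-b", "pod-c", "pod-d"}
	entries := make([]entry, 0, len(names))
	for _, n := range names {
		// The random key is produced up front, outside the Less function.
		entries = append(entries, entry{name: n, key: rand.Int63()})
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].key < entries[j].key })
	for _, e := range entries {
		fmt.Println(e.name)
	}
}
```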

Contributor Author:

Do we want to sort the pods before or after grouping them into their scale buckets? I thought the goal was to select randomly from the youngest bucket

Member:

The goal is that Pods that belong to the same scale bucket are in a random order.

Maybe we don't need to get into the implementation detail of how we achieve that in the KEP, but what's important to note is the precedence of the buckets with regard to the other sorting criteria https://github.com/kubernetes/kubernetes/blob/cac9339/pkg/controller/controller_utils.go#L784-L809

Contributor Author:

> The goal is that Pods that belong to the same scale bucket are in a random order.
> Maybe we don't need to get into the implementation detail

Agree, just making sure I understood the intent.


### User Stories

#### Story 1
Member:

focusing on the upgrade story is more convincing in my opinion

Contributor Author:

True, this story could probably just be shortened to the last 2 steps (5,6): where a new domain is added and upscaled to 3N, containing all the youngest pods. Then downscaled to 2N and all the pods from the new domain are removed. Is that what you mean?

Contributor Author:

Updated to just focus on an upgrade, please check my wording :)

Member:

Not sure if that's what Abdullah meant, but thanks for the simplification.

For completeness, I would say:

A deployment could become imbalanced after:

  • An entire failure domain fails and becomes unavailable
  • A failure domain is torn down for maintenance or upgrade
  • A new failure domain is added to a cluster

The imbalance cycle goes as follows:

Member:

This sounds reasonable, but the upgrade case I had in mind is not necessarily about adding a new failure domain, but about a rolling upgrade of nodes: consider a region with three zones, where the nodes get upgraded one zone at a time. As Aldo mentioned, you can say that all of those cases can lead to imbalance.

@alculquicondor (Member):

/assign @janetkuo @kow3ns

@alculquicondor (Member):

Note that this proposal is complementary to #1828

@damemi (Contributor, Author) commented Jan 14, 2021:

Tried to address the feedback so far in a new commit (that I intend to squash)

@damemi force-pushed the random-downscale branch 2 times, most recently from 074f9c7 to 16d8044, on January 14, 2021
status: implementable
creation-date: 2020-12-15
reviewers:
- "@janetkuo"
Contributor:

Add me as a reviewer.


1. Sort ReplicaSet pods by pod UUID.
2. Obtain wall time, and add it to [`ActivePodsWithRanks`](https://github.com/kubernetes/kubernetes/blob/dc39ab2417bfddcec37be4011131c59921fdbe98/pkg/controller/controller_utils.go#L815)
3. Call sorting algorithm with a modified time comparison for start and
Contributor:

Did you consider an algorithm which does step 2 as described above but still includes node affinity as it's currently implemented? Instead of the initial sort, extend the algorithm so that it randomly picks from a bucket if len(bucket) > 1, which would limit the amount of sorting needed.

Contributor:

An additional problem I'm seeing here is that since you're modifying the start and creation timestamps, you'll end up copying the pod resources, which will increase memory consumption during this phase.

Member:

We are not copying the timestamp. We just compare it differently.

You can see the proof-of-concept implementation in kubernetes/kubernetes#96898


We propose a randomized approach to the algorithm for Pod victim selection
during ReplicaSet downscale:

1. Sort ReplicaSet pods by pod UUID.
Member:

You could state the purpose of this: to obtain a pseudo-random shuffle.

Also, it doesn't have to be the first step. It can just be another comparison criterion for ActivePodsWithRanks.Less after comparing timestamps.
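A rough sketch of where that criterion could sit, with the pod UID compared only after the (log-scaled) ready and creation timestamps. The `pod` struct, `logScale`, and `less` below are simplified stand-ins for this thread, not ActivePodsWithRanks.Less; the real function's earlier criteria (node assignment, phase, readiness, restart counts, ...) are elided:

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// pod is a simplified stand-in with only the fields relevant to this sketch.
type pod struct {
	uid       string
	readyAt   time.Time
	createdAt time.Time
}

// logScale buckets an elapsed duration as floor(log10(seconds)); assumed rounding.
func logScale(elapsed time.Duration) int {
	if s := elapsed.Seconds(); s >= 1 {
		return int(math.Floor(math.Log10(s)))
	}
	return 0
}

// less orders pods for downscale: ready-time scale first, then creation-time
// scale, then the UID as a pseudo-random tiebreak fixed before sorting.
func less(now time.Time, p1, p2 pod) bool {
	if s1, s2 := logScale(now.Sub(p1.readyAt)), logScale(now.Sub(p2.readyAt)); s1 != s2 {
		return s1 < s2 // pod in the younger bucket sorts first
	}
	if s1, s2 := logScale(now.Sub(p1.createdAt)), logScale(now.Sub(p2.createdAt)); s1 != s2 {
		return s1 < s2
	}
	return p1.uid < p2.uid
}

func main() {
	now := time.Now()
	pods := []pod{
		{"uid-c", now.Add(-45 * time.Second), now.Add(-50 * time.Second)},
		{"uid-a", now.Add(-40 * time.Second), now.Add(-55 * time.Second)},
		{"uid-b", now.Add(-3 * time.Hour), now.Add(-3 * time.Hour)},
	}
	sort.Slice(pods, func(i, j int) bool { return less(now, pods[i], pods[j]) })
	for _, p := range pods {
		fmt.Println(p.uid) // uid-a, uid-c (same buckets, UID tiebreak), then uid-b
	}
}
```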

@damemi force-pushed the random-downscale branch 2 times, most recently from 83bf8bf to fb205f5, on January 26, 2021
@damemi (Contributor, Author) commented Jan 26, 2021:

Updated to add @soltysh to reviewers. Was there anything else I missed?

@damemi (Contributor, Author) commented Feb 1, 2021:

Bumping this for @kubernetes/sig-apps-pr-reviews as we approach KEP freeze

@alculquicondor (Member):

/assign @wojtek-t
for PRR

@wojtek-t (Member) left a comment:

PRR itself looks fine, but I would like to see SIG approval first.

Add summary, motivation, detailed design and alternatives.

Signed-off-by: Aldo Culquicondor <acondor@google.com>
@soltysh (Contributor) left a comment:

Nits.
/lgtm


### Risks and Mitigations

Certain users might be relaying in the existing downscaling heuristic. However,
Contributor:

Suggested change:
- Certain users might be relaying in the existing downscaling heuristic. However,
+ Certain users might be relaying on the existing downscaling heuristic. However,

@k8s-ci-robot added the lgtm label on Feb 5, 2021
@kow3ns (Member) commented Feb 5, 2021:

/approve

@wojtek-t (Member) commented Feb 5, 2021:

/approve for Alpha PRR (will require more for beta)

@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi, kow3ns, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Feb 5, 2021
@k8s-ci-robot merged commit 77a84d2 into kubernetes:master on Feb 5, 2021
@k8s-ci-robot added this to the v1.21 milestone on Feb 5, 2021