Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Pod Topology Spread KEP to new template #1796

Merged
merged 1 commit into from May 19, 2020

Conversation

Huang-Wei
Copy link
Member

ref #895

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 18, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory approved Indicates a PR has been approved by an approver from all required OWNERS files. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. labels May 18, 2020
@Huang-Wei
Copy link
Member Author

/assign @alculquicondor

keps/sig-scheduling/895-pod-topology-spread/README.md Outdated Show resolved Hide resolved

* **How can an operator determine if the feature is in use by workloads?**

Operator can query `pod.spec.topologySpreadConstraints` field and identify if
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the metric "plugin_execution_duration_seconds" which has a "plugin" sub-field

* **What are the SLIs (Service Level Indicators) an operator can use to
determine the health of the service?**

N/A.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric "plugin_execution_duration_seconds" is the perfect candidate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are to mention plugin_execution_duration_seconds, clarify that this is an indicator for latency. Ideally, to measure the "health of the service", we also want a "functionality" indicator, an indictor for the spread of pods that use the feature, which we don't have.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might not be possible to do, since this is a Pod-level feature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is possible, but needs work, and should likely be done by an external component. One can also simplify things by for example monitoring only Deployments that use topology spread in their template...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we can put it as a suggestion for operators.

keps/sig-scheduling/895-pod-topology-spread/README.md Outdated Show resolved Hide resolved
keps/sig-scheduling/895-pod-topology-spread/README.md Outdated Show resolved Hide resolved
* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs][]?**

No.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

framework_extension_point_duration_seconds and other latency SLI

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say an SLI/SLO can be setup for pod scheduling latency, this feature is replacing default spreading, so it is unlikely to increase pod scheduling latency. framework_extension_point_duration_seconds will not be a metric that we use to measure an SLO, it is more for debugging purposes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true, but there is a difference: default spreading uses only the nodes that pass filters, whereas we use all nodes in Topology spreading. It might we worth saying that some increase in latency might happen, and that it can be monitored with the above metric.


* **How does this feature react if the API server and/or etcd is unavailable?**

Running workloads won't be impacted. Submissions of new workloads using this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say something like: "no new scheduling impact", but PRR to confirm.

@Huang-Wei Huang-Wei force-pushed the ga-pod-topology-spread branch 2 times, most recently from 84d2919 to 3885fbe Compare May 19, 2020 19:37
@Huang-Wei
Copy link
Member Author

Thanks for the review.

Comments addressed. @alculquicondor @ahg-g PTAL.

Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need a Production Readiness reviewer, according to new guidelines :(

keps/sig-scheduling/895-pod-topology-spread/README.md Outdated Show resolved Hide resolved
keps/sig-scheduling/895-pod-topology-spread/README.md Outdated Show resolved Hide resolved

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

N/A.
- Metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` < 500ms.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very high number. Our benchmarks put the plugin at around 2ms (score) and 3ms (filter) for 10k pods #89487

But I'm not sure how to state a SLO from those numbers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about <= 100 ms on 90-percentile?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's still high, but more reasonable. But I think we can mention the number of pods for that SLO to hold.

@Huang-Wei
Copy link
Member Author

@alculquicondor I'm not sure it's mandatory to have a PRR reviewer for each KEP. If so, @wojtek-t could you kindly review the PRR part? Thanks.

/assign @wojtek-t
(feel free to unassign or assign others)

@alculquicondor
Copy link
Member

It's not enforced for 1.19, but it will be in the future. So I think I could LGTM, given that the freeze is today. Could you squash?

@Huang-Wei
Copy link
Member Author

@alculquicondor squashed.

@alculquicondor
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7e828cc into kubernetes:master May 19, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 19, 2020
@Huang-Wei Huang-Wei deleted the ga-pod-topology-spread branch May 19, 2020 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants