Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graduate DS maxSurge to beta #2665

Merged

Conversation

ravisantoshgudimetla
Copy link
Contributor

Most of the questions in the PRR section seems answered and if needed they can be answered when GA'ing.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 30, 2021
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 30, 2021
@JamesLaverack
Copy link
Member

xref #1591

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 6, 2021
@ravisantoshgudimetla ravisantoshgudimetla force-pushed the promote-maxSurge-beta branch 2 times, most recently from 6bcf815 to 7527dfe Compare May 6, 2021 21:51
@ravisantoshgudimetla ravisantoshgudimetla changed the title [wip] Graduate DS maxSurge to beta Graduate DS maxSurge to beta May 6, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2021
Copy link
Member

@ehashman ehashman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ravisantoshgudimetla,

The required PRR sections for beta are not currently filled out. Please complete them so I can review. They are:

  • Rollout, Upgrade and Rollback Planning
  • Monitoring Requirements
  • Dependencies
  • Scalability (completed)
  • Troubleshooting

@@ -27,7 +30,7 @@ feature-gates:
- kube-controller-manager
disable-supported: true
stage: "alpha"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should read beta, right?

Copy link
Contributor Author

@ravisantoshgudimetla ravisantoshgudimetla May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this should be intended state or current state? I can change this to beta, if it has to be intended state. To be clear, I thought it has to be changed to beta once the implementation merges :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should be the intended state.

@ehashman ehashman self-assigned this May 6, 2021
Copy link
Contributor Author

@ravisantoshgudimetla ravisantoshgudimetla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the initial feedback @ehashman

_This section must be completed when targeting beta graduation to a release._. This statement confused me, I was thinking this has to be done when we are targeting from beta to GA. I'll fill those values and get back to you. Apologies for lack of understanding on my part.

@ehashman
Copy link
Member

ehashman commented May 7, 2021

No problem, looking forward to your responses.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 7, 2021
Copy link
Contributor Author

@ravisantoshgudimetla ravisantoshgudimetla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehashman - I added content for the following sub-sections

  • Rollout, Upgrade and Rollback Planning
  • Monitoring Requirements
  • Dependencies
  • Troubleshooting
    PTAL and let me know, if I am missing any other info

- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though the SLI is rather "manual" we should be guarantee some reasonable SLO for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it now.

@@ -328,18 +309,10 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
This feature will not work if the API server or etcd is unavailable as the controller-manager won't be even able get events for StatefulSets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get events nor provide updates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't update an event too. ;). I changed the sentence to provide more clarity.


* **What specific metrics should inform a rollback?**
None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can you tell if the feature is failing then?

Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
Manually tested. No issues were found.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please list the specific scenarios that were/will be tested?

Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
By checking the StatefulSet's RollingUpdate strategy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide the exact key in the spec to check or an example?

- Details:
- [x] Other (treat as last resort)
- Details: The number of pods that are created above the desired amount of pods during an update when this feature is enabled can be compared to maxSurge value available in
the StatefulSet definition. This can be used to determine the health of this feature
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are existing metrics in Kubernetes that describe this, can you please list them under the metrics section?

- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
All the surge pods created should be within the value(% or number) of maxSurge field provided 99.99% of the time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this SLO.

From the summary:

This will allow daemonset workloads to implement zero-downtime upgrades.

Thus, I would expect your SLOs to be targeted at high availability for daemonset workloads, and describe to cluster admins how to measure this.

@@ -328,18 +309,10 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
This feature will not work if the API server or etcd is unavailable as the controller-manager won't be even able get events or updates for StatefulSets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if it's mid-rollout and this occurs? How can someone handle that failure?

levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be filled out. No feature in Kubernetes is failure-proof :) I'd like you to think of some possible ways that the feature could fail.

@k8s-ci-robot k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label May 11, 2021
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 11, 2021
Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nits.
/approve
from sig-apps pov

Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
By checking the StatefulSet's `.spec.strategy.rollingUpdate.maxSurge` field. The additional workload pods created should be respecting the value specified in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this is about DaemonSet not StatefulSet 😉

- Details:
- [x] Other (treat as last resort)
- Details: The number of pods that are created above the desired amount of pods during an update when this feature is enabled can be compared to maxSurge value available in
the StatefulSet definition. This can be used to determine the health of this feature.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/StatefulSet/DaemonSet

@@ -328,18 +317,24 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
This feature will not work if the API server or etcd is unavailable as the controller-manager won't be even able get events or updates for StatefulSets. If the API server and/or etcd is unavailable during the mid-rollout, the featuregate would not be enabled and controller-manager wouldn't start since it cannot communicate with the API server
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/StatefulSet/DaemonSet

@@ -93,6 +93,8 @@ tags, and then generate with `hack/update-toc.sh`.
- [Updating the component manifests](#updating-the-component-manifests)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't expect any other changes in this PR.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 11, 2021
Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2021
Copy link
Member

@ehashman ehashman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
for PRR

Comment on lines +254 to +255
All the surge pods created should be within the value(% or number) of maxSurge field provided 99.99% of the time. The additinal pods created should ensure that the workload
service is available 99.99% of time during updates.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 9's is a very aggressive target, I don't think it makes sense as a universal SLO. But this is not blocking.

keps/sig-apps/1591-daemonset-surge/README.md Show resolved Hide resolved
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ehashman, ravisantoshgudimetla, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2021
@k8s-ci-robot k8s-ci-robot merged commit c09cc13 into kubernetes:master May 11, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone May 11, 2021
Copy link
Contributor Author

@ravisantoshgudimetla ravisantoshgudimetla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehashman - Thank you for reviewing the PR from PRR side.

@soltysh - Thank you for reviewing the PR from sig-apps side.

ravisantoshgudimetla added a commit to ravisantoshgudimetla/website that referenced this pull request May 23, 2021
kubernetes/kubernetes#101742 merged which made this feature beta. 
This is the companion docs PR for kubernetes/enhancements#2665
ravisantoshgudimetla added a commit to ravisantoshgudimetla/website that referenced this pull request Jun 7, 2021
kubernetes/kubernetes#101742 merged which made this feature beta. 
This is the companion docs PR for kubernetes/enhancements#2665
ravisantoshgudimetla added a commit to ravisantoshgudimetla/website that referenced this pull request Jun 7, 2021
kubernetes/kubernetes#101742 merged which made this feature beta.
This is the companion docs PR for kubernetes/enhancements#2665
ravisantoshgudimetla added a commit to ravisantoshgudimetla/website that referenced this pull request Jul 14, 2021
kubernetes/kubernetes#101742 merged which made this feature beta.
This is the companion docs PR for kubernetes/enhancements#2665

Co-authored-by: Karen Bradshaw <kbhawkey@gmail.com>
ravisantoshgudimetla added a commit to ravisantoshgudimetla/website that referenced this pull request Jul 21, 2021
kubernetes/kubernetes#101742 merged which made this feature beta.
This is the companion docs PR for kubernetes/enhancements#2665

Co-authored-by: Karen Bradshaw <kbhawkey@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants