Graduate DS maxSurge to beta #2665
xref #1591
c34cae3 to 3c3440e
6bcf815 to 7527dfe
The required PRR sections for beta are not currently filled out. Please complete them so I can review. They are:
- Rollout, Upgrade and Rollback Planning
- Monitoring Requirements
- Dependencies
- Scalability (completed)
- Troubleshooting
@@ -27,7 +30,7 @@ feature-gates:
- kube-controller-manager
disable-supported: true
stage: "alpha"
This should read beta, right?
So, should this be the intended state or the current state? I can change it to beta if it has to be the intended state. To be clear, I thought it had to be changed to beta once the implementation merges :(
Yes, this should be the intended state.
Thank you for the initial feedback @ehashman. The statement _This section must be completed when targeting beta graduation to a release._ confused me; I thought this had to be done when targeting graduation from beta to GA. I'll fill in those values and get back to you. Apologies for the lack of understanding on my part.
No problem, looking forward to your responses.
@ehashman - I added content for the following sub-sections
- Rollout, Upgrade and Rollback Planning
- Monitoring Requirements
- Dependencies
- Troubleshooting
PTAL and let me know if I am missing any other info.
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
None
Even though the SLI is rather "manual", we should guarantee some reasonable SLO for it.
I changed it now.
@@ -328,18 +309,10 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
This feature will not work if the API server or etcd is unavailable as the controller-manager won't be even able get events for StatefulSets.
get events nor provide updates.
Isn't an update an event too? ;) I changed the sentence to provide more clarity.
9b3c9ed to a389cb8
* **What specific metrics should inform a rollback?**
None
How can you tell if the feature is failing then?
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
Manually tested. No issues were found.
Can you please list the specific scenarios that were/will be tested?
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
By checking the StatefulSet's RollingUpdate strategy.
Can you provide the exact key in the spec to check or an example?
- Details:
- [x] Other (treat as last resort)
- Details: The number of pods that are created above the desired amount of pods during an update when this feature is enabled can be compared to maxSurge value available in
the StatefulSet definition. This can be used to determine the health of this feature
There are existing metrics in Kubernetes that describe this, can you please list them under the metrics section?
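The comparison described in the quoted details (pods above the desired count versus `maxSurge`) can be sketched as follows. This is only an illustration: `resolve_max_surge` and `surge_within_limit` are hypothetical helper names, and rounding a percentage up to a whole pod is an assumption about how the limit is resolved, not something stated in this thread.

```python
def resolve_max_surge(max_surge, desired):
    """Turn an integer or an 'N%' string into an absolute pod count."""
    if isinstance(max_surge, int):
        return max_surge
    percent = int(max_surge.rstrip("%"))
    # Assumption: a percentage is rounded up to the next whole pod.
    return (percent * desired + 99) // 100


def surge_within_limit(running_pods, desired, max_surge):
    """Check that pods above the desired count stay within maxSurge."""
    surge = max(0, running_pods - desired)
    return surge <= resolve_max_surge(max_surge, desired)


# 25 desired pods with maxSurge "10%" allows up to 3 surge pods here.
print(surge_within_limit(27, 25, "10%"))
print(surge_within_limit(29, 25, "10%"))
```

The pod counts would have to come from the cluster (for example from the DaemonSet status or an existing metric); how they are collected is left open here.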
- 99% percentile over day of absolute value from (job creation time minus expected
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
All the surge pods created should be within the value(% or number) of maxSurge field provided 99.99% of the time.
I don't understand this SLO.
From the summary:
This will allow daemonset workloads to implement zero-downtime upgrades.
Thus, I would expect your SLOs to be targeted at high availability for daemonset workloads, and describe to cluster admins how to measure this.
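One way a cluster admin might measure availability during an update is to sample the DaemonSet status fields `numberAvailable` and `desiredNumberScheduled` and compute the share of samples in which every desired node had an available pod. The sketch below assumes the samples have already been collected; the function name, the sampling approach, and the example values are all hypothetical.

```python
def availability_ratio(samples):
    """Fraction of samples in which every desired node had an available pod.

    Each sample is a (number_available, desired_number_scheduled) pair.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ok = sum(1 for available, desired in samples if available >= desired)
    return ok / len(samples)


# Hypothetical samples taken while a rollout was in progress.
samples = [(5, 5), (5, 5), (4, 5), (5, 5)]
print(availability_ratio(samples))  # 0.75
```

The resulting ratio could then be compared against whatever availability target the SLO ends up stating.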
@@ -328,18 +309,10 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
This feature will not work if the API server or etcd is unavailable as the controller-manager won't be even able get events or updates for StatefulSets.
What happens if it's mid-rollout and this occurs? How can someone handle that failure?
levels that could help debug the issue?
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
None
This needs to be filled out. No feature in Kubernetes is failure-proof :) I'd like you to think of some possible ways that the feature could fail.
a389cb8 to 692049d
A couple of nits.
/approve
from sig-apps pov
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
By checking the StatefulSet's `.spec.strategy.rollingUpdate.maxSurge` field. The additional workload pods created should be respecting the value specified in the
Nit: this is about DaemonSet not StatefulSet 😉
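For reference, in the `apps/v1` API a DaemonSet keeps its rolling-update settings under `.spec.updateStrategy` (`.spec.strategy` is the Deployment field name), so the key to check would be `.spec.updateStrategy.rollingUpdate.maxSurge`. A minimal sketch of reading it from a manifest follows; the manifest is written as a Python dict, and its name and values are made up for illustration.

```python
# Hypothetical DaemonSet manifest, reduced to the fields relevant here.
daemonset = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "example-agent"},  # hypothetical name
    "spec": {
        "updateStrategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": "10%", "maxUnavailable": 0},
        },
    },
}


def get_max_surge(manifest):
    """Return the configured maxSurge, or None if the field is absent."""
    rolling = (
        manifest.get("spec", {})
        .get("updateStrategy", {})
        .get("rollingUpdate", {})
    )
    return rolling.get("maxSurge")


print(get_max_surge(daemonset))  # 10%
```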
- Details:
- [x] Other (treat as last resort)
- Details: The number of pods that are created above the desired amount of pods during an update when this feature is enabled can be compared to maxSurge value available in
the StatefulSet definition. This can be used to determine the health of this feature.
s/StatefulSet/DaemonSet
@@ -328,18 +317,24 @@ details). For now, we leave it here.
_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
This feature will not work if the API server or etcd is unavailable as the controller-manager won't be even able get events or updates for StatefulSets. If the API server and/or etcd is unavailable during the mid-rollout, the featuregate would not be enabled and controller-manager wouldn't start since it cannot communicate with the API server
s/StatefulSet/DaemonSet
@@ -93,6 +93,8 @@ tags, and then generate with `hack/update-toc.sh`.
- [Updating the component manifests](#updating-the-component-manifests)
I don't expect any other changes in this PR.
692049d to e749751
/lgtm
/approve
for PRR
All the surge pods created should be within the value(% or number) of maxSurge field provided 99.99% of the time. The additional pods created should ensure that the workload
service is available 99.99% of time during updates.
4 9's is a very aggressive target, I don't think it makes sense as a universal SLO. But this is not blocking.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ehashman, ravisantoshgudimetla, soltysh
kubernetes/kubernetes#101742 merged which made this feature beta. This is the companion docs PR for kubernetes/enhancements#2665
kubernetes/kubernetes#101742 merged which made this feature beta. This is the companion docs PR for kubernetes/enhancements#2665 Co-authored-by: Karen Bradshaw <kbhawkey@gmail.com>
Most of the questions in the PRR section seem answered, and if needed they can be answered when GA'ing.