Controller changes for perma failed deployments #35691
Conversation
// If a progress deadline is set and there is no Progressing condition yet,
// record that a new replica set was found so progress can be estimated.
if deployment.Spec.ProgressDeadlineSeconds != nil && cond == nil {
	msg := fmt.Sprintf("Found new replica set %q", rsCopy.Name)
	condition := deploymentutil.NewDeploymentCondition(extensions.DeploymentProgressing, api.ConditionTrue, deploymentutil.FoundNewRSReason, msg)
	deploymentutil.SetDeploymentCondition(&deployment.Status, *condition)
How many extra deployment status writes does this PR add (on average) to a deployment creation and a deployment update? I.e., for this scenario:
- Create a new deployment -> deployment reaches success
- Modify deployment -> new code is rolled out
What do the before-and-after REST calls to D and RS look like?
This really depends on the size and the fenceposts of the deployment. The faster a rollout can advance (the bigger (spec.replicas + maxSurge) - (spec.replicas - maxUnavailable) is) and the fewer pods a deployment has, the fewer writes there will be.
- Deployment with 1 pod: I don't notice any difference.
- Deployment with 3 pods: ~7-8 writes in master, ~10 writes with progressDeadline set.
- Deployment with 10 pods: ~40 writes in master, ~50 writes with progressDeadline set.
This PR adds no additional RS writes.
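To make the fencepost math above concrete, here is a hypothetical worked example; the numbers are illustrative, not measurements from this PR:

package main

import "fmt"

func main() {
	// Illustrative values only: a deployment with 10 replicas,
	// maxSurge=3 and maxUnavailable=2.
	replicas, maxSurge, maxUnavailable := 10, 3, 2

	// A rollout may run up to replicas+maxSurge pods while keeping at
	// least replicas-maxUnavailable available, so each step can move at
	// most the difference between those two bounds.
	step := (replicas + maxSurge) - (replicas - maxUnavailable)
	fmt.Println(step) // 5: a larger step means fewer iterations and fewer status writes
}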
Ok
@@ -418,8 +425,8 @@ func testRollingUpdateDeploymentEvents(f *framework.Framework) {
 	newRS, err := deploymentutil.GetNewReplicaSet(deployment, c)
 	Expect(err).NotTo(HaveOccurred())
 	Expect(newRS).NotTo(Equal(nil))
-	Expect(events.Items[0].Message).Should(Equal(fmt.Sprintf("Scaled up replica set %s to 1", newRS.Name)))
-	Expect(events.Items[1].Message).Should(Equal(fmt.Sprintf("Scaled down replica set %s to 0", rsName)))
+	Expect(events.Items[0].Message).Should(Equal(fmt.Sprintf("Created new replica set %q and scaled up to 1", newRS.Name)))
Would this cause version skewed tests to fail?
Are there any other tests outside of this repo that need to be taken into account?
// Create a nginx deployment.
deploymentName := "nginx"
badImage := "nginx:404"
nit: it's not bad, but non-existent
thirty := int32(30)
d := newDeployment(deploymentName, replicas, podLabels, nginxImageName, badImage, extensions.RecreateDeploymentStrategyType, nil)
d.Spec.ProgressDeadlineSeconds = &thirty
framework.Logf("Creating deployment %q with a bad image", deploymentName)
Again: not bad, but non-existent
mention ProgressDeadlineSeconds = 30s or at least mention it's set?
ok
badImage := "nginx:404"
thirty := int32(30)
d := newDeployment(deploymentName, replicas, podLabels, nginxImageName, badImage, extensions.RecreateDeploymentStrategyType, nil)
d.Spec.ProgressDeadlineSeconds = &thirty
Suggest using By to describe test steps where appropriate.
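For reference, a minimal sketch of the suggestion; By is Ginkgo's step-description helper, and the step text here is made up (c, ns, and d come from the surrounding test):

By("creating a deployment with a non-existent image and a 30s progress deadline")
_, err := c.Extensions().Deployments(ns).Create(d)
Expect(err).NotTo(HaveOccurred())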
There is one just below
Expect(err).NotTo(HaveOccurred())

// Check that deployment is created fine.
deployment, err := c.Extensions().Deployments(ns).Get(deploymentName)
Why not just use the returned deployment from Create?
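I.e., roughly the following sketch against the client API already used in this diff, so the extra Get can be dropped:

// Create returns the persisted object, so it can be used directly.
deployment, err := c.Extensions().Deployments(ns).Create(d)
Expect(err).NotTo(HaveOccurred())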
ok
case n < 0.8:
	// pause / resume the deployment
	if deployment.Spec.Paused {
		framework.Logf("%02d: pausing deployment %q", i, deployment.Name)
and scale
Same comment as #35691 (comment)
Expect(err).NotTo(HaveOccurred())

case n < 0.8:
	// pause / resume the deployment
toggling
podList, err := c.Core().Pods(ns).List(opts)
Expect(err).NotTo(HaveOccurred())
if len(podList.Items) == 0 {
	framework.Logf("%02d: no deployment pods", i)
... to delete
default:
	// arbitrarily delete deployment pods
	framework.Logf("%02d: deleting one or more deployment pods for deployment %q", i, deployment.Name)
arbitrarily deleting...
iterations := 20
for i := 0; i < iterations; i++ {
	if r := rand.Float32(); r < 0.6 {
		time.Sleep(time.Duration(float32(i) * r * float32(time.Second)))
What's the reason behind this number?
It's for creating a random wait between actions. The further the test proceeds, the longer the waits become, so we can test various action rates.
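As a worked example of the sleep formula (the values are illustrative, not from an actual test run):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative values: iteration i = 10, random factor r = 0.5.
	i, r := 10, float32(0.5)
	d := time.Duration(float32(i) * r * float32(time.Second))
	fmt.Println(d) // 5s; early iterations (small i) sleep far less, so the action rate varies
}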
This was fairly effective at exposing races and bad logic in the deployment controller in openshift.
Found the first race in the scaling code: http://pastebin.com/fwwEfJnp
Fast updates can block rollouts because the replica annotations are not updated correctly in the scale code.
I would not have expected the resource quota e2e test flake in https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/35691/pull-kubernetes-e2e-gce-gci/2732/ @derekwaynecarr
@@ -343,6 +343,14 @@ func NewUIDTrackingControllerExpectations(ce ControllerExpectationsInterface) *UIDTrackingControllerExpectations {
 	return &UIDTrackingControllerExpectations{ControllerExpectationsInterface: ce, uidStore: cache.NewStore(UIDSetKeyFunc)}
 }
 
+// Reasons for pod events
+const (
+	FailedCreatePodReason = "FailedCreate"
Godoc
// progress or tried to create a replica set, or resumed a paused deployment and
// compare against progressDeadlineSeconds.
from := condition.LastTransitionTime
delta := time.Duration(*deployment.Spec.ProgressDeadlineSeconds) * time.Second
Add a TODO here that this is clock dependent and should be based on observed time from this controller, not stored time.
What's the difference between stored time that has been set based on observation by this controller (what we currently have) and observed time that is not stored?
The time stored on the object may have been set by a different controller with a different current time, rate of time change, or transient problem. In a distributed system any sort of "store time, then read it later from a different process" action is fundamentally broken. Instead, each component in the cluster needs to maintain its own clock (based on when it sees something happen) and then act only according to the time the local process keeps. That's not 100% safe (the admin on the machine could adjust the time forward or backwards), but it prevents clock skew from having an impact.
Here's an example:
- Controller process 1 observes a new deployment update, sets a condition
- Node that controller process 1 is on dies, controller fails over to node 2 with controller process 2
- Node 2's clock is 30s in the future
- Controller process 2 compares stored time from etcd (recorded by node 1's clock) against now, sees that the progress deadline window has passed, and records "failure", even though the wall clock measurement is ~1-2s from when process 1 actually saw the deployment update.
The correct way to track time is for each controller to maintain its own clock, and it can only act when it observes a change. I.e.:
- Controller process 2 observes progress deadline is set AND condition is set, starts its own clock at now (t0)
- Controller process 2 waits to update condition until at least progressDeadlineSeconds have elapsed since t0.
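A minimal sketch of that local-clock approach; the names and structure here are hypothetical, not the controller's actual code:

package deadline

import (
	"sync"
	"time"
)

// progressClock records, per deployment key, when *this* process first
// observed the Progressing condition, rather than trusting a timestamp
// stored by a different process with a different clock.
type progressClock struct {
	mu       sync.Mutex
	observed map[string]time.Time
}

func newProgressClock() *progressClock {
	return &progressClock{observed: map[string]time.Time{}}
}

// deadlineExceeded starts a local clock on first observation and reports
// failure only once the deadline elapses on this process's own clock.
func (p *progressClock) deadlineExceeded(key string, deadline time.Duration) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	t0, ok := p.observed[key]
	if !ok {
		// First observation by this process: start the local clock now (t0).
		p.observed[key] = time.Now()
		return false
	}
	// Act only on time measured locally, never on stored time.
	return time.Since(t0) >= deadline
}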
Thanks for the explanation. This is not a problem with the main controller running on a single master, since it's the only component that updates conditions. I guess the master going down in an HA setup would break this. In the future, when we support custom controllers that run on nodes, this problem will become more apparent. Dropping a reference in #29229.
// and when new pods scale up or old pods scale down. Progress is not estimated for paused
// deployments or when users don't really care about it, i.e. progressDeadlineSeconds is not
// specified.
// TODO: Look for permanent failures in the new replica set such as image pull errors or
Maybe I'm misunderstanding this TODO, but image pull and crash looping are not perm failures. Crash looping can be because a dependency is not up, and image pull can be because we are waiting for another pod to push an image.
Removed for now - #18568 is the issue for identifying permanent failures
A few comments, but needs tests passing.
@smarterclayton all green
@mfojtik do you want to have a last pass?
if err != nil {
	return err
}
if failed {
remove this if, just keep TODO
ok
}

// If there is no progressDeadlineSeconds set, remove any Progressing condition.
if d.Spec.ProgressDeadlineSeconds == nil {
move this check up?
ok
LGTM
Will wait until #34645 is merged and then I will tag this.
This commit adds support for failing deployments based on a timeout parameter defined in the spec. If there is no progress for the amount of time defined as progressDeadlineSeconds then the deployment will be marked as failed by adding a condition with a ProgressDeadlineExceeded reason in it. Progress in the context of a deployment means the creation or adoption of a new replica set, scaling up new pods, and scaling down old pods.
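For illustration, the failure condition would be built roughly like this, reusing the helpers visible elsewhere in this diff; the message text and the assumption that the condition's status flips to False on failure are mine:

// Sketch only: mark the deployment as failed once progressDeadlineSeconds
// elapses without progress. Helper and type names appear in this PR;
// the message is illustrative.
msg := fmt.Sprintf("Deployment %q has timed out progressing.", deployment.Name)
condition := deploymentutil.NewDeploymentCondition(
	extensions.DeploymentProgressing,
	api.ConditionFalse,
	"ProgressDeadlineExceeded",
	msg,
)
deploymentutil.SetDeploymentCondition(&deployment.Status, *condition)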
@mfojtik ptal in the last commit, annotations weren't updated correctly and the rollout would block. Every time we get into…
Here is the place where we check rs annotations in order to identify scaling events.
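The idea behind that check, as a self-contained sketch: the annotation key matches the one the deployment controller stamps on replica sets, but the function itself is a hypothetical reconstruction, not the actual code:

package main

import (
	"fmt"
	"strconv"
)

// The deployment controller records, on each replica set, the replica
// count the deployment had when the RS was last synced.
const desiredReplicasAnnotation = "deployment.kubernetes.io/desired-replicas"

// isScalingEvent: if the stored desired count no longer matches the
// deployment's spec.replicas, the deployment was scaled rather than rolled
// out, so annotations must be synced first -- otherwise fast updates can
// leave them stale and block the rollout.
func isScalingEvent(rsAnnotations map[string]string, deploymentReplicas int32) bool {
	v, ok := rsAnnotations[desiredReplicasAnnotation]
	if !ok {
		return false
	}
	desired, err := strconv.ParseInt(v, 10, 32)
	if err != nil {
		return false
	}
	return int32(desired) != deploymentReplicas
}

func main() {
	anns := map[string]string{desiredReplicasAnnotation: "3"}
	fmt.Println(isScalingEvent(anns, 5)) // true: deployment was scaled from 3 to 5
}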
/lgtm
This needs a release note; can you update?
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue
…ployments
Automatic merge from submit-queue

kubectl: enhancements for deployment progress deadline

Changes:
* add deployment conditions in the describer
* abort 'rollout status' for deployments that have exceeded their progress deadline

Depends on #35691. @kubernetes/kubectl @kubernetes/deployment Fixes #31319
This PR adds support for reporting failed deployments based on a timeout
parameter defined in the spec. If there is no progress for the amount
of time defined as progressDeadlineSeconds then the deployment will be
marked as failed by a Progressing condition with a ProgressDeadlineExceeded
reason.
Follow-up to #19343
Docs at kubernetes/website#1337
Fixes #14519
@kubernetes/deployment @smarterclayton