ProgressDeadlineExceeded not set outside of Deployment rollouts #106054
@wking: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.
Working through Code Search results, looking for
Summarizing:
Ok, so
That's not what I'd expected, and it seems to be overloading "available" a bit, but I can wrap my head around it as "we have exactly the available, updated pods we want". So with a history like:
Currently 4 gets the
For now I can see one issue with the inconsistency of the Progressing condition (I want to dive into the impact of this feature more later). Meaning of the old behaviour:
Meaning of the new behaviour:
So the difference with the new behaviour is that we cannot infer whether we have a healthy rollout when some pods are disrupted, but we have a better signal that the pods are disrupted. Also, in either case we can always resolve to
*edit: ability to progress is achieved if there was a complete deployment at some point in time for this rollout (revision)
One thing to note is that since this is fixing a rare behaviour (deployment pods being disrupted after a successful rollout), the things this would break would also be rare. For example, it could lead to race conditions in consumers which are polling the deployment status.
You can take a look at this case here: status_check.go#L308 & status_check.go#L220
You can observe the same polling issue here. Since it is waiting for a rollout, and this observation could be skipped, we might not get to this part (resource_kubernetes_deployment.go#L254) even though the rollout happened. I know it is very unlikely to happen, but once it happens it will be very hard to figure out what went wrong. This polling pattern is very common (as also seen in your examples) and could potentially cause problems (in rare cases).
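To make the race concrete, here is a minimal sketch of that polling pattern (hypothetical namespace and Deployment names; not code from the linked files). A consumer that polls for the completed state can miss it entirely if a disruption flips the status back between two polls:

```sh
# Poll until the Deployment looks "complete". If a post-rollout disruption
# can drop availableReplicas again, the complete state may come and go
# entirely inside one sleep interval, and this loop never observes it.
while true; do
  have="$(kubectl -n my-namespace get deployment my-deployment \
    -o jsonpath='{.status.availableReplicas}')"
  want="$(kubectl -n my-namespace get deployment my-deployment \
    -o jsonpath='{.spec.replicas}')"
  if [ "${have:-0}" -eq "${want}" ]; then
    echo "rollout complete"
    break
  fi
  sleep 5  # the race window: a flap inside this sleep is never observed
done
```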
External disruption isn't that rare, although unrecoverable external disruption may be. For example, if you are keeping up with the tip of an OpenShift distribution channel, you are draining and rebooting every node in your cluster every week.
That's a pretty thin race, where the disruption occurred within one polling period of the successful rollout. But sure, I'm open to alternatives. If there's no fixing
So to reiterate, the issue is to have an option, as a cluster admin, to look at any Deployment and see if any of its replicas lost Availability, and to have an easier time reading disrupted/misbehaving deployments.
This could indeed be solved by a new condition, called for example
Although, one problem is that the condition would be set to
Do any of these options sound feasible? Also, as a workaround you can currently set
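Purely for illustration of the shape such a dedicated condition could take (the `ReplicasAvailable` type and its reason below are hypothetical placeholders, not names proposed in this thread):

```console
$ kubectl -n my-namespace get deployment my-deployment \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
Available=True (MinimumReplicasAvailable)
Progressing=True (NewReplicaSetAvailable)
ReplicasAvailable=False (ReplicaUnavailable)
```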
@atiratree wrt adding conditions I'd like you to keep in mind this work: kubernetes/enhancements#2833
@soltysh since we are trying to be analogous to the Deployment
@wking I am trying to document the current behaviour in more detail here: kubernetes/website#31226 Regarding the proposed changes to the
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
We need to make some progress on giving users deployment-level summarization of useful level-driven transitions (that match the user-driven intent of making changes). A deployment or replicaset in steady state that is failing to get back to an available state for longer than a certain reasonable period should definitely be summarized as a condition, but I agree that changing Progressing is probably not the right place, because Progressing is about the creation of a new replica set (update), not an existing replica set. Certainly Available=False Reason=(because of creating new replicas) is a proxy for that, but I don't see the progress deadline as having a role there, because we can aggressively update deployment status when we can't create replicas. We might want to have the reason change to something minReadySeconds-related, but that is already associated with Available.
/lifecycle stale
I don't think we want the deployment controller to panic on short-term scheduling issues, and
/remove-lifecycle stale
/lifecycle stale
/lifecycle rotten
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
I'd floated #93933 earlier with a PR attempting to address this, and more recently opened rhbz#1983823. Here's a third try at pitching this proposal.
/sig apps
Proposal
When a Deployment has a single underlying ReplicaSet that had previously completed, but which now lacks the target count of ready replicas, and the underlying ReplicaSet fails to make progress for more than `progressDeadlineSeconds`, the Deployment controller should set `Progressing=False` with `ProgressDeadlineExceeded`.
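Under this proposal, a previously-complete Deployment that stays disrupted past the deadline would eventually report something like the following. This is a sketch of the intended end state with placeholder names, not current behavior (today the controller keeps reporting `Progressing=True` with `NewReplicaSetAvailable` in this situation, as the Reproducer section shows):

```console
$ kubectl -n my-namespace get deployment my-deployment \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
Available=True (MinimumReplicasAvailable)
Progressing=False (ProgressDeadlineExceeded)
```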
Benefits
Deployment owners get a clear signal that the Deployment controller (via the delegated ReplicaSet controller) is struggling to recover the desired state. In situations where the Deployment controller continues to be unable to make progress, additional disruption will eventually push us into `Available=False`. But `Progressing=False` with `ProgressDeadlineExceeded` is a way the Deployment controller can request assistance before things get bad enough to go `Available=False`.

Downsides
Maybe someone depends on the current behavior and would be broken by the proposed pivot. Feedback welcome, if anyone can think of a use case that might be vulnerable.
Context
The only `Progressing` condition in kubernetes/kubernetes is `DeploymentProgressing`, so we don't have to worry about changes to the Deployment controller's handling being inconsistent with other core controllers.

The dev-facing godoc for Deployment's `Progressing` is, and has been since it landed:

> DeploymentProgressing means the deployment is progressing. Progress for a deployment is considered when a new replica set is created or adopted, and when new pods scale up or old pods scale down. Progress is not estimated for paused deployments or when progressDeadlineSeconds is not specified.

"when new pods scale up or old pods scale down" is the fuzzy bit where this proposal is working.
From the user-facing Deployment docs:
This last is a bit vague. For example, if the Deployment is completing a rollout and the final Pod becomes ready for min ready seconds, the wording on that last condition is satisfied, but ~~the Deployment actually uses that event to transition to `Progressing=False`~~ (edit: actually, `ProgressDeadlineExceeded` is the only `Progressing=False` case). Anyhow, it's this last entry where this proposal is working, as the Deployment controller attempts to pass along information about how well the target ReplicaSet controller is doing at reconciling the desired state.

From later in the user-facing Deployment docs:
This same "stuck trying to deploy its newest ReplicaSet" is the situation I'm concerned with. And again the docs are a bit vague. "the minimum required new replicas are available" seems like it's about `Available` despite belonging to a sentence discussing `Progressing`. That line landed with the original `ProgressDeadlineExceeded` docs, so it's pretty old. But "your Deployment may get stuck" is the situation I'm trying to detect.

Reproducer
Using OpenShift's `oc`, which in this case is a fairly thin shim around `kubectl`, on a 1.21.1 cluster.

Looking at a happy, leveled deployment:
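A sketch of the kind of check (assuming the `router-default` Deployment in `openshift-ingress`, the ingress-router example mentioned under Alternatives below):

```console
$ oc -n openshift-ingress get deployment router-default
$ oc -n openshift-ingress get deployment router-default \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
```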
Making it impossible for that deployment to get new Pods:
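One way to arrange this, as an assumption rather than the thread's elided method: cordon every node, so replacement Pods for the existing ReplicaSet have nowhere to schedule (no pod-template change, so no new rollout starts):

```console
$ for node in $(oc get nodes -o name); do oc adm cordon "${node#node/}"; done
```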
Disrupt the workload by deleting a Pod (e.g. maybe we're draining a node in preparation to reboot):
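For example (placeholder Pod name; pick any one of the Deployment's Pods):

```console
$ oc -n openshift-ingress get pods
$ oc -n openshift-ingress delete pod router-default-<hash>-<suffix>
```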
Checking back in on the Deployment:
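A sketch of re-querying the conditions, this time including the messages (same placeholder names as above):

```console
$ oc -n openshift-ingress get deployment router-default \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'
```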
It has gone `Progressing=True`, and while the target ReplicaSet is `Available=True`, ~~the "has successfully progressed" in the `message` is a bit weird (I'd expect something about why we weren't `Progressing=False`, like "is working to create additional pods")~~ (edit: actually, `ProgressDeadlineExceeded` is the only `Progressing=False` case, and you can see above that we were `Progressing=True` with the same `reason` and `message` in the leveled case too).

And after the 10m (default) `progressDeadlineSeconds`:

So no change there, despite going more than 10m without progress. The proposal is to adjust this result to be `Progressing=False` with `ProgressDeadlineExceeded`.

Dropping down into the ReplicaSet:
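A sketch of that inspection (ReplicaSet names carry a generated hash, so list first; ReplicaSets only define a `ReplicaFailure` condition type, so the conditions query typically prints nothing here):

```console
$ oc -n openshift-ingress get replicasets
$ oc -n openshift-ingress get replicaset router-default-<hash> \
    -o jsonpath='{.status.conditions}'
```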
So the Deployment controller is definitely not getting a lot of help from the ReplicaSet controller.
Alternatives
Prometheus alerts
In my OpenShift cluster, I do have `KubePodNotReady` firing with:

But not all clusters will have Prometheus/Alertmanager installed. And if this was a sufficient guard for this situation, we wouldn't have needed `ProgressDeadlineExceeded` at all. Another benefit of `ProgressDeadlineExceeded` over the alerts is that `progressDeadlineSeconds` is a Deployment-specific knob, and having Deployment-specific alerts watching over the shoulder of a quiet Deployment controller seems pretty heavy, compared to making the Deployment controller a bit more forthcoming.

Looking over the Deployment controller's shoulder
Deployment owners can work around the Deployment controller's current behavior by reaching around to find ReplicaSets. And then work around the ReplicaSet controller's current behavior (no conditions at all!) by reaching around to find Pods. And then see whether there are long-stuck Pods or other issues. But again, if this was a sufficient guard for this situation, we wouldn't have needed `ProgressDeadlineExceeded` at all. We grew `ProgressDeadlineExceeded` for the rollout case, because it's more convenient and reliable to have this watching code once in the Deployment controller, where all Deployment owners can benefit from the central analysis.
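A sketch of that reach-around (placeholder names; assumes the Deployment's `matchLabels` selector can be reshaped directly into an equality label selector):

```sh
ns=my-namespace
deploy=my-deployment
# Pull the Deployment's selector and reshape {"k":"v",...} into k=v,... for -l.
selector="$(kubectl -n "$ns" get deployment "$deploy" \
  -o jsonpath='{.spec.selector.matchLabels}' | tr -d '{}" ' | tr ':' '=')"
# Walk Deployment -> ReplicaSets -> Pods by hand, looking for stuck Pods.
kubectl -n "$ns" get replicasets -l "$selector"
kubectl -n "$ns" get pods -l "$selector" -o wide
```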
Saying `Progressing` as a whole is rollout-specific

Another internally-consistent approach would be to say "`Progressing` is just about Deployment rollouts and the direct operands of the Deployment controller" and to explicitly exclude downstream operands like the Pods operated on by the ReplicaSet controller. In this case, the Deployment controller would not stay `Progressing=True` after the rollout completed. But while this is internally-consistent, it doesn't seem all that helpful for Deployment owners, who would then need to walk the whole controller stack to see how their workload was doing.

Having each controller be responsible for reporting any concerning behavior for the workload up to higher levels of the controller stack scales more easily, because each controller only needs to understand how its direct operands will report issues. In this vein, it would certainly be possible for the Deployment controller to delegate lack-of-progress detection to the ReplicaSet controller, and just pass it up the stack if/when the ReplicaSet controller reported it on the target ReplicaSet.
Ignoring issues as long as `Available=True`

`Available=False` is a pretty unambiguous signal, but depending on the workload, this can be pretty serious. For the ingress-router example I picked for my reproducer, it means "none of your cluster workloads have functional ingress anymore", which is about as bad as it gets and certainly in the midnight-admin-page space for some clusters. On the other hand, signals that get Deployment owners involved earlier on, when it's clear that reconciliation/recovery is having issues, but before the existing pods have all been deleted, allow for calmer, working-hours intervention.