KEP-3973: Consider Terminating Pods in Deployments #4357

Merged
merged 1 commit into from Feb 8, 2024

atiratree
Member

  • One-line PR description: Deployment should consider terminating pods when performing new rollouts and scaling.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 12, 2023
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Dec 12, 2023
@kannon92
Contributor

/cc

Contributor

@kannon92 kannon92 left a comment

Looks really good! I have a few minor comments. We should get someone from api-review to decide on the name but otherwise, I think it looks good.

We should also consider metrics for this feature even if we don't implement them in alpha.

// DeploymentPodReplacementPolicy specifies the policy for creating Deployment Pod replacements.
// Default is a mixed behavior depending on the DeploymentStrategy
// +enum
type DeploymentPodReplacementPolicy string
Contributor

I find the mix of past and present tense a bit odd.

Maybe instead of Terminated, we could use Terminal?

Contributor

IMO I would maybe ask in api-review to see if we can get some kind of consensus on name.

Member Author

I only saw Terminal being used in reference to the phases. I think Terminated is less ambiguous.

But for sure, let's bring this up in the api review.

Maybe another option is to explicitly name this SucceededOrFailed? Was this already discussed in api review? Job has TerminatingOrFailed in their case, right?

I actually like Terminated, just thought about bringing this up

Member Author

the PR still has to go through API review

  • IIRC for Jobs, failed has a different meaning than just the pod status. The pod may fail or may not according to the PodFailurePolicy or if it started terminating prematurely (has deletionTimestamp). So it does not always correlate with the pod status.
  • jobs are also using different combinations: TerminatingOrFailed, which is not applicable for Deployments

Since deployments do not care about the final phase of the pod, it might be simpler to call this Terminated. And to have less confusion when comparing this API to the Jobs API.

Contributor

I will add my two cents, that the distinction between terminating and terminated might be confusing. It should not block the KEP, since this can be decided at the API review phase during implementation and updated here afterwards. Still, I'd probably try to use 2 distinct words, maybe terminated and final?

Member Author

I am not sure if we want to introduce a new terminology (final), but rather have some term that can be associated with the pod lifecycle.

I agree, we can revisit this discussion in the implementation phase.

Contributor

Naming idea:

podReplacementPolicy: TerminationStarted
podReplacementPolicy: TerminationComplete

(I agree this does not need to be decided until implementation)

Member Author

+1 for TerminationStarted/TerminationComplete. Still, it is a bit different from the Jobs API, but I suppose if we are already incompatible, we can do the naming a bit differently.

Member Author

I have updated the KEP to the suggestion above.

We are distinguishing between terminating and terminated pods.
- Terminating pods are running pods with a `deletionTimestamp`.
- Terminated pods are pods with a `deletionTimestamp` that have reached the `Succeeded` or `Failed` phase
and are subsequently removed from etcd.
Contributor

If the pod is removed from etcd, does that mean that we no longer track them?

Member Author

We will stop tracking them once they reach the terminal phase. There is a small delay before the pod gets removed from etcd, though. I am not sure if it is important to mention, but some people might find it odd that the Pod count is higher in this situation (although these are not running pods!).
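
The terminating/terminated distinction discussed in this thread can be sketched in Go as follows. This is only an illustration, not the controller's code: `Pod`, `PodPhase`, and `newPod` are simplified, hypothetical stand-ins for the real API types, and treating every non-terminal pod with a `deletionTimestamp` as terminating (rather than only `Running` ones) is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// PodPhase and Pod are simplified, hypothetical stand-ins for the core/v1 API types.
type PodPhase string

const (
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

type Pod struct {
	DeletionTimestamp *time.Time // nil until deletion has been requested
	Phase             PodPhase
}

// newPod is a hypothetical helper that builds a Pod for the examples below.
func newPod(phase PodPhase, deleting bool) Pod {
	p := Pod{Phase: phase}
	if deleting {
		now := time.Now()
		p.DeletionTimestamp = &now
	}
	return p
}

// inTerminalPhase reports whether the pod has reached Succeeded or Failed.
func inTerminalPhase(p Pod) bool {
	return p.Phase == PodSucceeded || p.Phase == PodFailed
}

// isTerminating: deletion was requested, but the pod has not reached a terminal phase yet.
func isTerminating(p Pod) bool {
	return p.DeletionTimestamp != nil && !inTerminalPhase(p)
}

// isTerminated: deletion was requested and the pod has reached Succeeded or Failed.
func isTerminated(p Pod) bool {
	return p.DeletionTimestamp != nil && inTerminalPhase(p)
}

func main() {
	pods := []Pod{
		newPod(PodRunning, false),
		newPod(PodRunning, true),
		newPod(PodFailed, true),
	}
	for _, p := range pods {
		fmt.Printf("phase=%s terminating=%t terminated=%t\n",
			p.Phase, isTerminating(p), isTerminated(p))
	}
}
```

A terminated pod may still be visible for a short time before it is removed from etcd, which is the delay mentioned in the reply above.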


- <test>: <link to test coverage>

##### e2e tests
Contributor

While promoting this feature to Beta in the Job controller, we ran into a couple of caveats when trying to hold pods in the terminating state long enough to assert on status. I'll link the PR here, kubernetes/kubernetes#121491, as it might save somebody some time.

Member Author

I expect the deployment tests to be a bit different, but it might come in handy. Thanks!

@atiratree atiratree force-pushed the pod-replacement-policy branch 5 times, most recently from 4bd355c to 85d26c1 Compare December 16, 2023 00:32
@atiratree atiratree force-pushed the pod-replacement-policy branch 2 times, most recently from 12f9823 to ac5de31 Compare January 22, 2024 15:25
@kannon92
Contributor

@atiratree any traction on getting this queued for 1.30 features?

@atiratree
Member Author

@kannon92 yes, we went over this one in the planning

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 26, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2024
@atiratree
Member Author

the PR should be ready for PRR
/assign @wojtek-t

Contributor

@jpbetz jpbetz left a comment

Did an early pass on API review for this KEP. This is non-blocking; we can merge the KEP and finish API review during implementation, but I figured I'd get started on feedback.

// This is an alpha field. Enable DeploymentPodReplacementPolicy to be able to
// use this field.
// +optional
TerminatingReplicas int32
Contributor

This looks right, but just to confirm, +optional indicates that this will be an omitempty field?

Contributor

@jpbetz jpbetz Feb 2, 2024

Curious about relationships between counters here. I see "non-terminated" used in comments, I assume that is different than what "terminatingReplicas" will count but it would be best if we state it as clearly as possible.

Also, do any of the Replicas, UpdatedReplicas, ... counts include the TerminatingReplicas count? If so we should document clearly. Is there any relationship between the counts (e.g. does ReadyReplicas + TerminatingReplicas = Replicas?)

	// Total number of non-terminated pods targeted by this deployment (their labels match the selector).
	// +optional
	Replicas int32

	// Total number of non-terminated pods targeted by this deployment that have the desired template spec.
	// +optional
	UpdatedReplicas int32

	// Total number of ready pods targeted by this deployment.
	// +optional
	ReadyReplicas int32

	// Total number of available pods (ready for at least minReadySeconds) targeted by this deployment.
	// +optional
	AvailableReplicas int32

	// Total number of unavailable pods targeted by this deployment. This is the total number of
	// pods that are still required for the deployment to have 100% available capacity. They may
	// either be pods that are running but not yet available or pods that still have not been created.
	// +optional
	UnavailableReplicas int32

Which counters (if any) also count terminatingReplicas?

Member Author

@atiratree atiratree Feb 2, 2024

> This looks right, but just to confirm, +optional indicates that this will be an omitempty field?

Yes. I have added the tags for clarity.

> Curious about relationships between counters here. I see "non-terminated" used in comments, I assume that is different than what "terminatingReplicas" will count but it would be best if we state it as clearly as possible.

I have also added the other relevant fields to the KEP.

I have replaced non-terminated with non-terminating as I think it describes that better. It should include pods that do not have a .metadata.deletionTimestamp field set.

Terminating pods have a .metadata.deletionTimestamp field set.

I have tried to make the fields clearer. But having non-terminating (no .metadata.deletionTimestamp) everywhere seems too verbose, so I have added the full explanation only to the TerminatingReplicas field

> Also, do any of the Replicas, UpdatedReplicas, ... counts include the TerminatingReplicas count?
> Which counters (if any) also count terminatingReplicas?

No, they do not. And they cannot - to be backwards compatible.

> Is there any relationship between the counts (e.g. does ReadyReplicas + TerminatingReplicas = Replicas?)

There is not. The number of terminating replicas is effectively unbounded, for example when you use the Terminating policy, do multiple rollouts in a row, and have a long TerminationGracePeriodSeconds.
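
The answers above can be illustrated with a small Go sketch, assuming a simplified model in which a pod is either terminating (has a `deletionTimestamp`) or not. The `pod` and `status` types and the `summarize` function are hypothetical names for this illustration only. Terminating pods are counted separately and are excluded from the existing counters, so no sum such as `ReadyReplicas + TerminatingReplicas = Replicas` is implied.

```go
package main

import "fmt"

// pod is a hypothetical, simplified view of a pod owned by a Deployment.
type pod struct {
	terminating bool // .metadata.deletionTimestamp is set
	updated     bool // matches the desired template spec
	ready       bool
}

// status mirrors a subset of the counters discussed in the review.
type status struct {
	Replicas            int32
	UpdatedReplicas     int32
	ReadyReplicas       int32
	TerminatingReplicas int32
}

// summarize counts terminating pods only in TerminatingReplicas and
// keeps them out of every pre-existing counter.
func summarize(pods []pod) status {
	var s status
	for _, p := range pods {
		if p.terminating {
			s.TerminatingReplicas++
			continue
		}
		s.Replicas++
		if p.updated {
			s.UpdatedReplicas++
		}
		if p.ready {
			s.ReadyReplicas++
		}
	}
	return s
}

func main() {
	// Three live pods plus five pods left terminating by earlier rollouts.
	pods := []pod{
		{updated: true, ready: true},
		{updated: true, ready: true},
		{updated: true},
	}
	for i := 0; i < 5; i++ {
		pods = append(pods, pod{terminating: true})
	}
	fmt.Printf("%+v\n", summarize(pods))
}
```

In this example the terminating count exceeds the replica count, which matches the point that repeated rollouts with a long grace period can leave an arbitrary number of terminating pods behind.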

```golang
type ReplicaSetStatus struct {
...
// The number of terminating replicas for this replica set.
Contributor

@jpbetz jpbetz Feb 2, 2024

As the number of replica status counters increases, I'd like to make sure the relationships between the counts are clear. We have:

	// Replicas is the number of actual replicas.
	Replicas int32

	// The number of pods that have labels matching the labels of the pod template of the replicaset.
	// +optional
	FullyLabeledReplicas int32

	// The number of ready replicas for this replica set.
	// +optional
	ReadyReplicas int32

	// The number of available replicas (ready for at least minReadySeconds) for this replica set.
	// +optional
	AvailableReplicas int32

Does FullyLabeledReplicas , Replicas or ReadyReplicas count terminating replicas?

Are we guaranteed ReadyReplicas + TerminatingReplicas = Replicas ? Any other relationships between counters we should clarify?

Member Author

Same answer as #4357 (comment).

Strangely, the comments are a bit different in the /api/ and /apis/ types, and I am not sure why.

Also the style of each field is a bit different.

Can we go over the fields and make the documentation consistent when we do the PR in k/k?

Contributor

> Can we go over the fields and make the documentation consistent when we do the PR in k/k?

That would be great.

@atiratree atiratree force-pushed the pod-replacement-policy branch 2 times, most recently from 0325a9d to f3d60fb Compare February 2, 2024 20:27
@atiratree
Member Author

The policy naming has been updated as follows:

Terminating -> TerminationStarted
Terminated -> TerminationComplete
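
As a rough sketch of how the renamed values might look as Go constants, and of how they could influence replacement decisions. The constant names follow the naming above, but their exact form is left to API review; `podsToCreate` is a hypothetical helper, and the semantics it encodes (terminating pods stop occupying a slot under `TerminationStarted`, but keep occupying one under `TerminationComplete`) are an assumption based on the option names. Rollout limits such as surge are deliberately ignored.

```go
package main

import "fmt"

// DeploymentPodReplacementPolicy is the type quoted from the KEP.
type DeploymentPodReplacementPolicy string

// Assumed constant names; the final API may differ.
const (
	TerminationStarted  DeploymentPodReplacementPolicy = "TerminationStarted"
	TerminationComplete DeploymentPodReplacementPolicy = "TerminationComplete"
)

// podsToCreate is a hypothetical helper: it returns how many replacement pods
// could be created given the desired count, the live (non-terminating) pods,
// and the pods that are still terminating.
func podsToCreate(policy DeploymentPodReplacementPolicy, desired, live, terminating int32) int32 {
	occupied := live
	if policy == TerminationComplete {
		// Assumption: pods still terminating keep their slot until fully gone.
		occupied += terminating
	}
	if occupied >= desired {
		return 0
	}
	return desired - occupied
}

func main() {
	// 3 desired replicas, 1 live pod, 2 pods still terminating.
	fmt.Println(podsToCreate(TerminationStarted, 3, 1, 2))
	fmt.Println(podsToCreate(TerminationComplete, 3, 1, 2))
}
```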

@soltysh
Contributor

soltysh commented Feb 7, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2024
@atiratree
Member Author

@wojtek-t can you please take a look?

Member

@wojtek-t wojtek-t left a comment

Small comments from PRR pov.

### Version Skew Strategy


This feature doesn't depend on the version for nodes.
Member

But you can have skew between kube-apiserver and kube-controller-manager. Everything should work fine, but can you please briefly describe what happens if FG is enabled in one of those and not the other?

Member Author

added an explanation

By disabling the feature:
- Extra pods can appear during a deployment rollout or scaling. This can increase the number of pods
that need to be scheduled, and it can have an impact on the resource consumption.
- Actors reading `.status.TerminatingReplicas` for ReplicaSet and Deployments will observe 0 pods.
Member

So if the FG is disabled, the ReplicaSet/Deployment controllers will actually set the TerminatingReplicas to 0? Or rather will not touch it at all?

[In particular the controller will certainly not touch it if rolled back to a version that doesn't support the field]

Member Author

the status will be reconciled in normal deployment/replicaset flow and the value will not be preserved - it will be omitted (0)
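
To make "omitted (0)" concrete, here is a minimal Go sketch of how an `int32` field with an `omitempty` JSON tag serializes. The `DeploymentStatus` struct below is a trimmed, hypothetical stand-in for the real type, and the tag names are assumptions. A zero value is dropped from the output, so a reader cannot distinguish "0 terminating pods" from "field not set".

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeploymentStatus is a trimmed, hypothetical stand-in for the real status type.
type DeploymentStatus struct {
	Replicas            int32 `json:"replicas,omitempty"`
	TerminatingReplicas int32 `json:"terminatingReplicas,omitempty"`
}

// encode marshals the status to JSON and panics on the (unexpected) error path.
func encode(s DeploymentStatus) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// TerminatingReplicas is zero, so omitempty drops it from the output.
	fmt.Println(encode(DeploymentStatus{Replicas: 3}))
	// A non-zero value is serialized normally.
	fmt.Println(encode(DeploymentStatus{Replicas: 3, TerminatingReplicas: 2}))
}
```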

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2024

If the feature is not enabled on the apiserver, and it is enabled in the kube-controller-manager, then
- The feature cannot be used for new workloads.
- Workloads that use this feature will revert to their default behavior. This might affect existing workloads as
Member

If FG in kube-controller-manager is enabled, this isn't true right?
The controller will continue to use the new behavior...

Member Author

@atiratree atiratree Feb 8, 2024

I was thinking about an apiserver that does not support the field. But yes, it will function if it supports the field.

@wojtek-t
Member

wojtek-t commented Feb 8, 2024

/lgtm
/approve PRR

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: atiratree, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2024
@k8s-ci-robot k8s-ci-robot merged commit a95e6d8 into kubernetes:master Feb 8, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 8, 2024
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

8 participants