- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The Job status has a field active which counts the number of Job Pods that
are in Running or Pending phases. In this KEP, we add a field ready that
counts the number of Job Pods that have a Ready condition, with the same
best effort guarantees as the existing active field.
Job Pods can remain in the Pending phase for a long time in clusters with
tight resources or when image pulls take a long time. Since the Job.status.active
field includes Pending Pods, this can give a false impression of progress
to end users or other controllers. This is more important when the pods serve
as workers and need to communicate among themselves.
A separate Job.status.ready field can provide more information for users
and controllers, reducing the need to listen to Pod updates themselves.
Note that other workload APIs (such as ReplicaSet and StatefulSet) have a
similar field: .status.readyReplicas.
- Add the field `Job.status.ready` that keeps a count of Job Pods with the `Ready` condition.
- Provide strong guarantees for the accuracy of the count. Due to the asynchronous nature of Kubernetes, there can be more or fewer Pods currently ready than what the count reports.
Add the field .status.ready to the Job API. The job controller updates the
field based on the number of Pods that have the Ready condition.
- An increase in Job status updates. To mitigate this, the job controller batches the Pod updates that happen within a window of X ms before syncing a Job. From experiments using E2E load tests, X=1s was found to be a reasonable value.
```go
type JobStatus struct {
	...
	Active    int32
	Ready     *int32 // new field
	Succeeded int32
	Failed    int32
}
```

The Job controller already lists the Pods to populate the active, succeeded
and failed fields. To count ready pods, the job controller will filter the
pods that have the Ready condition.
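For illustration, a minimal sketch of that filtering is shown below; the package and helper names are hypothetical and do not mirror the actual job controller code.

```go
package jobutil

import (
	corev1 "k8s.io/api/core/v1"
)

// countReadyPods returns the number of Pods whose Ready condition is True,
// assuming the controller has already listed the Job's Pods.
func countReadyPods(pods []*corev1.Pod) int32 {
	var ready int32
	for _, pod := range pods {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready++
				break
			}
		}
	}
	return ready
}
```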
- Unit and integration tests covering:
- Count of ready pods.
- Feature gate disablement.
- Verify that existing E2E and conformance tests for Job pass.
- Added e2e test for the count of ready pods.
- Feature gate disabled by default.
- Unit and integration tests passing.
- Feature gate enabled by default.
- Existing E2E and conformance tests passing.
- Scalability tests for Jobs of varying sizes, up to 500 parallelism. There should be no significant degradation in E2E time after enabling the feature gate.
Using a clusterloader test that creates 338 jobs (a total of ~3000 pods) on a 100-node cluster, with 100 QPS for the job controller, where each pod sleeps for 30s, I obtained the following results (averaged over 3 runs, measured from the time the first job was created):
- Feature disabled, no batching of pod updates : 68s
- Feature enabled, batching pod updates for 0.5s: 72s (+5.9%)
- Feature enabled, batching pod updates for 1s: 71s (+4.4%)
- Every bug report is fixed.
- E2e test for the count of ready pods.
- Lock the feature gate and document its deprecation
In GA+2 release:
- Remove the feature gate definition
- Job controller ignores the feature gate
No changes required for existing clusters to use the enhancement.
The feature doesn't affect nodes.
In the first release, a version skew between apiservers might cause the new field to remain at zero even if there are Pods ready.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: JobReadyPods
- Components depending on the feature gate:
- kube-controller-manager
- kube-apiserver
- Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?
Yes, the Job controller might update the Job status more frequently to report ready pods.
Yes, the loss of information is acceptable as the field is only informative.
The Job controller will start populating the field again.
We have unit tests (see link) for
the status.ready field when the feature is enabled or disabled.
Similarly, we have integration tests (see link
and link)
for the feature being enabled or disabled.
However, due to an omission, we graduated to Beta without feature gate transition (enablement or disablement) tests. With graduation to stable it is too late to add these tests, so we are sticking with manual tests only (see here).
The field is only informative, it doesn't affect running workloads.
- An increase in `job_sync_duration_seconds`.
- A reduction in `job_syncs_total`.
A manual test on Beta was performed, as follows:
- Create a cluster in 1.28 with the `JobReadyPods` feature gate disabled (=false).
- Simulate upgrade by modifying control-plane manifests to enable `JobReadyPods`.
- Create long running Job A, ensure that the ready field is populated.
- Simulate downgrade by modifying control-plane manifests to disable `JobReadyPods`.
- Verify that the ready field in Job A is cleaned up shortly after the startup of the Job controller completes.
- Create long running Job B, ensure that the ready field is not populated.
- Simulate upgrade by modifying control-plane manifests to enable `JobReadyPods`.
- Verify that the ready fields of Job A and B are tracked again shortly after the startup of the Job controller completes.
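As a hedged sketch of the "ensure that the ready field is populated" checks above, the client-go snippet below reads the field directly; the namespace, Job name, and kubeconfig location are illustrative.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig from the default location (illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Fetch the long running Job created during the manual test.
	job, err := clientset.BatchV1().Jobs("default").Get(context.TODO(), "job-a", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// With JobReadyPods enabled, .status.ready is set; with the gate
	// disabled, the Job controller clears the field.
	if job.Status.Ready != nil {
		fmt.Printf("ready pods: %d\n", *job.Status.Ready)
	} else {
		fmt.Println(".status.ready is not populated")
	}
}
```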
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
The feature applies to all Jobs, unless the feature gate is disabled.
- API .status
  - Other field: ready
The 99th percentile of Job status sync (processing + API calls) is below 2s when the controller doesn't create new Pods or track finishing Pods.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: `job_sync_duration_seconds`, `job_syncs_total`
  - Components exposing the metric: kube-controller-manager
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
No.
- API: PUT Job/status
  Estimated throughput: at most one additional API call for each Job Pod reaching the Ready condition, per second. The reason is that the update of the `.status.ready` field triggers another reconciliation of the Job controller. In order to control the number of reconciliations, the Job controller batches and deduplicates reconciliation requests within each second. The mechanism is based on a delaying reconciliation queue, where requests are added using the `AddAfter` function. If another reconciliation request is already planned within the second, the one triggered by the `.status.ready` update is skipped.
  Originating component: job-controller
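As a rough sketch of that batching and deduplication, the example below uses the client-go delaying work queue; the Job key and the fixed 1s delay are illustrative, and the real controller wires this into its Pod event handlers rather than a standalone program.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.NewDelayingQueue()
	defer queue.ShutDown()

	// Several Pod readiness updates for the same Job arrive in quick
	// succession; each re-enqueues the same Job key with a 1s delay.
	for i := 0; i < 5; i++ {
		queue.AddAfter("default/my-job", time.Second)
	}

	// The queue keeps a single pending entry per key, so after the delay
	// the controller performs one status sync for the batched updates.
	key, _ := queue.Get()
	fmt.Println("syncing", key)
	queue.Done(key)
}
```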
No.
No.
- API: Job/status
  Estimated increase in size: new field of less than 10 B.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
No change from existing behavior of the Job controller.
No.
- Check reachability between kube-controller-manager and apiserver.
- If `job_sync_duration_seconds` is too high, check the number of requests in the apiserver coming from the kube-system/job-controller service account. Consider increasing the number of inflight requests for the apiserver or tuning API priority and fairness to give more priority to the job-controller requests.
- If the steps above are insufficient, disable the `JobTrackingWithFinalizers` feature gate in the apiserver and kube-controller-manager and report an issue.
- 2021-08-19: Proposed KEP starting in alpha status, including full PRR questionnaire.
- 2022-01-05: Proposed graduation to beta.
- 2022-03-20: Merged PR#107476 with beta implementation
The only drawback is an increase in API calls. However, this is capped by the number of times a Pod flips its ready status, which is usually once for each Pod created.
- Add `Job.status.running`, counting Pods in the `Running` phase. The `Running` phase doesn't take into account preparation work before the worker is ready to accept connections. On the other hand, the `Ready` condition is configurable through a readiness probe. If the Pod doesn't have a readiness probe configured, the `Ready` condition is equivalent to the `Running` phase. In other words, `Job.status.ready` provides the same behavior as `Job.status.running`, with the advantage of being configurable.
- We considered exploring different batch periods for regular pod updates versus finished pod updates, so we could do fewer pod readiness updates without compromising how fast we can declare a job finished. However, the feature has been enabled for a long time already and no bugs or requests were raised around the choice of batch period. Moreover, the introduced batch period is considered an important element of the Job controller, and is no longer guarded by the feature gate since PR#118615, which was released in 1.28.
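To make the comparison concrete, the hedged sketch below constructs a Job whose Pods define a readiness probe; the image, probe path, and port are illustrative. Without the probe, the `Ready` condition would simply track the `Running` phase.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/yaml"
)

func main() {
	parallelism := int32(3)
	job := batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "workers"},
		Spec: batchv1.JobSpec{
			Parallelism: &parallelism,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "example.com/worker:latest", // illustrative
						// The probe controls when the Pod's Ready condition
						// becomes True, and hence when it counts towards
						// .status.ready.
						ReadinessProbe: &corev1.Probe{
							ProbeHandler: corev1.ProbeHandler{
								HTTPGet: &corev1.HTTPGetAction{
									Path: "/healthz",
									Port: intstr.FromInt(8080),
								},
							},
						},
					}},
				},
			},
		},
	}
	out, err := yaml.Marshal(job)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```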