KEP-2879: Track ready Pods in Job status

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The Job status has a field active that counts the number of Job Pods in the Running or Pending phase. In this KEP, we add a field ready that counts the number of Job Pods that have the Ready condition, with the same best-effort guarantees as the existing active field.

Motivation

Job Pods can remain in the Pending phase for a long time in clusters with tight resources or when image pulls take a long time. Since the Job.status.active field includes Pending Pods, it can give end users or other controllers a false impression of progress. This matters even more when the Pods serve as workers and need to communicate among themselves.

A separate Job.status.ready field can provide more information to users and controllers, reducing their need to watch Pod updates directly.

Note that other workload APIs (such as ReplicaSet and StatefulSet) have a similar field: .status.readyReplicas.

Goals

  • Add the field Job.status.ready that keeps a count of Job Pods with the Ready condition.

Non-Goals

  • Provide strong guarantees for the accuracy of the count. Due to the asynchronous nature of Kubernetes, there can be more or fewer Pods currently ready than the count reports.

Proposal

Add the field .status.ready to the Job API. The job controller updates the field based on the number of Pods that have the Ready condition.
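
As an illustration of how a consumer could use the new field instead of watching Pods, here is a minimal client-go sketch; the kubeconfig path, namespace and Job name are assumptions for the example, not part of the KEP.

// Sketch: reading the new Job.status.ready field with client-go.
// The kubeconfig path, namespace and Job name below are illustrative.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	job, err := clientset.BatchV1().Jobs("default").Get(context.TODO(), "my-job", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	if job.Status.Ready != nil {
		// ready counts Pods with the Ready condition; active also includes Pending Pods.
		fmt.Printf("active=%d ready=%d\n", job.Status.Active, *job.Status.Ready)
	} else {
		fmt.Println("ready is not populated (feature gate disabled or older control plane)")
	}
}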

Risks and Mitigations

  • An increase in Job status updates. To mitigate this, the Job controller holds the Pod updates that happen within a period X before syncing a Job. From experiments using E2E load tests, X=1s was found to be a reasonable value.

Design Details

API

type JobStatus struct {
	...
	Active    int32
	Ready     *int32  // new field
	Succeeded int32
	Failed    int32
}

Changes to the Job controller

The Job controller already lists the Pods to populate the active, succeeded and failed fields. To count ready Pods, it additionally filters for Pods that have the Ready condition, as sketched below.
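
As a rough sketch (not the actual controller code), the filter amounts to checking each Pod's Ready condition; countReadyPods is an illustrative helper name.

// Illustrative helper: count Pods whose Ready condition is True.
package example

import corev1 "k8s.io/api/core/v1"

func countReadyPods(pods []*corev1.Pod) int32 {
	var ready int32
	for _, p := range pods {
		for _, cond := range p.Status.Conditions {
			// A Pod is considered ready when its PodReady condition is True.
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready++
				break
			}
		}
	}
	return ready
}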

Test Plan

  • Unit and integration tests covering:
    • Count of ready pods.
    • Feature gate disablement.
  • Verify that existing E2E and conformance tests for Job pass.
  • Add an E2E test for the count of ready pods.

Graduation Criteria

Alpha

  • Feature gate disabled by default.
  • Unit and integration tests passing.

Beta

  • Feature gate enabled by default.

  • Existing E2E and conformance tests passing.

  • Scalability tests for Jobs of varying sizes, with parallelism up to 500. There should be no significant degradation in E2E time after enabling the feature gate.

    Using a clusterloader test that creates 338 jobs (~3000 pods total) on a 100-node cluster, with 100 QPS for the job controller, where each pod sleeps for 30s, I obtained the following results (averaged over 3 runs, measured from the time the first job was created):

    • Feature disabled, no batching of pod updates: 68s
    • Feature enabled, batching pod updates for 0.5s: 72s (+5.9%)
    • Feature enabled, batching pod updates for 1s: 71s (+4.4%)

GA

  • All reported bugs are fixed.
  • E2E test for the count of ready pods.
  • Lock the feature gate and document its deprecation.

Deprecation

In GA+2 release:

  • Remove the feature gate definition
  • Job controller ignores the feature gate

Upgrade / Downgrade Strategy

No changes are required for existing clusters to use the enhancement.

Version Skew Strategy

The feature doesn't affect nodes.

In the first release, a version skew between apiservers might cause the new field to not be populated even if there are ready Pods.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: JobReadyPods
    • Components depending on the feature gate:
      • kube-controller-manager
      • kube-apiserver
Does enabling the feature change any default behavior?

Yes, the Job controller might update the Job status more frequently to report ready Pods.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. The loss of information is acceptable, as the field is only informative.

What happens if we reenable the feature if it was previously rolled back?

The Job controller will start populating the field again.

Are there any tests for feature enablement/disablement?

We have unit tests (see link) for the status.ready field when the feature is enabled or disabled. Similarly, we have integration tests (see link and link) for the feature being enabled or disabled.

However, due to an omission, we graduated to Beta without feature gate transition (enablement or disablement) tests. With graduation to stable it is too late to add these tests, so we are sticking with manual tests (see here).

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The field is only informative; it doesn't affect running workloads.

What specific metrics should inform a rollback?
  • An increase in job_sync_duration_seconds.
  • A reduction in job_syncs_total.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

A manual test on Beta was performed, as follows:

  1. Create a cluster on version 1.28 with the JobReadyPods feature gate disabled (=false).
  2. Simulate an upgrade by modifying the control-plane manifests to enable JobReadyPods.
  3. Create a long-running Job A and ensure that the ready field is populated.
  4. Simulate a downgrade by modifying the control-plane manifests to disable JobReadyPods.
  5. Verify that the ready field of Job A is cleaned up shortly after the startup of the Job controller completes.
  6. Create a long-running Job B and ensure that the ready field is not populated.
  7. Simulate an upgrade by modifying the control-plane manifests to enable JobReadyPods.
  8. Verify that the ready field of Jobs A and B is tracked again shortly after the startup of the Job controller completes.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The feature applies to all Jobs, unless the feature gate is disabled.

How can someone using this feature know that it is working for their instance?
  • API .status
    • Other field: ready
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The 99th percentile of Job status syncs (processing + API calls) is below 2s when the controller doesn't create new Pods or track finishing Pods.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name: job_sync_duration_seconds, job_syncs_total.
    • Components exposing the metric: kube-controller-manager
Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?
  • API: PUT Job/status

    Estimated throughput: at most one additional API call per second for each Job Pod that reaches the Ready condition. The reason is that updating the .status.ready field triggers another reconciliation in the Job controller.

    In order to control the number of reconciliations, the Job controller batches and deduplicates reconciliation requests within each second.

    The mechanism is based on the reconciliation delaying queue, to which requests are added using the AddAfter function. If another reconciliation request is already planned within a second, the one triggered by the .status.ready update is skipped; a simplified sketch follows below.

    Originating component: job-controller
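
A simplified sketch of that batching, using client-go's delaying work queue; the controller struct, handler name and fixed one-second period are illustrative assumptions, not the actual Job controller code.

// Illustrative sketch: batching Job syncs triggered by Pod updates.
// AddAfter keeps a single pending entry per key, so many Pod readiness
// changes within the batch period collapse into one Job reconciliation.
package example

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

const syncBatchPeriod = 1 * time.Second

type jobController struct {
	queue workqueue.RateLimitingInterface
}

func newJobController() *jobController {
	return &jobController{
		queue: workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()),
	}
}

// onPodUpdate is called from the Pod informer event handlers with the
// key of the owning Job.
func (c *jobController) onPodUpdate(jobKey string) {
	// Delay the sync by the batch period; if the key is already waiting,
	// no duplicate reconciliation is enqueued.
	c.queue.AddAfter(jobKey, syncBatchPeriod)
}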

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?
  • API: Job/status

    Estimated increase in size: New field of less than 10B.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No change from existing behavior of the Job controller.

What are other known failure modes?

None.

What steps should be taken if SLOs are not being met to determine the problem?
  1. Check reachability between kube-controller-manager and apiserver.
  2. If job_sync_duration_seconds is too high, check the number of requests to the apiserver coming from the kube-system/job-controller service account. Consider increasing the number of in-flight requests for the apiserver or tuning API priority and fairness to give more priority to the job-controller requests.
  3. If the steps above are insufficient, disable the JobReadyPods feature gate in the apiserver and kube-controller-manager, and report an issue.

Implementation History

  • 2021-08-19: Proposed KEP starting in alpha status, including full PRR questionnaire.
  • 2022-01-05: Proposed graduation to beta.
  • 2022-03-20: Merged PR#107476 with beta implementation

Drawbacks

The only drawback is an increase in API calls. However, this is capped by the number of times a Pod flips its Ready status, which is usually once for each Pod created.

Alternatives

  • Add Job.status.running, counting Pods in the Running phase. The Running phase doesn't take into account preparation work before the worker is ready to accept connections. On the other hand, the Ready condition is configurable through a readiness probe. If the Pod doesn't have a readiness probe configured, the Ready condition is equivalent to the Running phase.

    In other words, Job.status.ready provides the same behavior as Job.status.running, with the advantage of being configurable.

  • We considered exploring different batch periods for regular Pod updates versus finished Pod updates, so that we could perform fewer Pod readiness updates without compromising how fast we can declare a Job finished.

    However, the feature has already been enabled for a long time and no bugs or requests were raised around the choice of batch period. Moreover, the batch period is considered an important element of the Job controller and, since PR#118615 (already released in 1.28), is no longer guarded by the feature gate.