Job controller: Updates might override stale data #105199

@alculquicondor

While investigating #105179, @liggitt and I discovered that the job controller issues a GET request for the Job before every Job status update.

```go
func (jm *Controller) updateJobStatus(job *batch.Job) error {
	jobClient := jm.kubeClient.BatchV1().Jobs(job.Namespace)
	var err error
	for i := 0; i <= statusUpdateRetries; i = i + 1 {
		var newJob *batch.Job
		newJob, err = jobClient.Get(context.TODO(), job.Name, metav1.GetOptions{})
		if err != nil {
			break
		}
		newJob.Status = job.Status
		if _, err = jobClient.UpdateStatus(context.TODO(), newJob, metav1.UpdateOptions{}); err == nil {
			break
		}
	}
	return err
}
```

This is problematic because it can mask conflicts between the state the job sync worked from and the latest state of the Job: the write carries the freshly fetched resourceVersion, so it succeeds even when the status was computed from a stale copy. In particular, this can leave stale UIDs or counters in the status when tracking Jobs with finalizers.
This might not have been a problem in the past, because the job controller always recomputed the status from scratch. With finalizer-based tracking, however, the existing status is part of the input to the sync.
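
To make the failure mode concrete, here is a minimal, self-contained sketch. It does not use client-go: `Job`, `Status`, `store`, and `updateWithGet` are hypothetical stand-ins, and the `store` only imitates the API server's resourceVersion check. The scenario (a sync that ran against a cached copy one version behind) is an assumed illustration, not a trace from the real controller.

```go
package main

import (
	"errors"
	"fmt"
)

// Status and Job are hypothetical, stripped-down stand-ins for the batch API types.
type Status struct{ Active, Succeeded int }

type Job struct {
	ResourceVersion int
	Status          Status
}

var errConflict = errors.New("conflict: stale resourceVersion")

// store imitates the API server: a write is rejected unless it carries
// the current resourceVersion.
type store struct{ job Job }

func (s *store) get() Job { return s.job }

func (s *store) updateStatus(j Job) error {
	if j.ResourceVersion != s.job.ResourceVersion {
		return errConflict
	}
	j.ResourceVersion++
	s.job = j
	return nil
}

// updateWithGet follows the pattern quoted above: fetch the latest object,
// overwrite its status with the one the sync computed, and write it back.
func updateWithGet(s *store, synced Job) error {
	latest := s.get()
	latest.Status = synced.Status
	return s.updateStatus(latest)
}

func main() {
	// The server already counted two succeeded Pods (resourceVersion 2).
	s := &store{job: Job{ResourceVersion: 2, Status: Status{Succeeded: 2}}}
	// The sync ran against an older cached copy (resourceVersion 1).
	stale := Job{ResourceVersion: 1, Status: Status{Active: 1, Succeeded: 1}}

	err := updateWithGet(s, stale)
	fmt.Println(err, s.job.Status.Succeeded) // prints "<nil> 1"
}
```

The write succeeds and the counter silently goes from 2 back to 1, because the fresh GET supplied a resourceVersion that the stale status never earned.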

The solution is to skip the Job GET and let the sync fail on conflict. A conflict implies that the Job was updated, and that update has already put the Job back in the workqueue.

/sig apps

Labels: kind/bug, needs-triage, sig/apps