When investigating #105179, @liggitt and I discovered that the job controller issues a GET request for the Job before every Job status update.

```go
func (jm *Controller) updateJobStatus(job *batch.Job) error {
	jobClient := jm.kubeClient.BatchV1().Jobs(job.Namespace)
	var err error
	for i := 0; i <= statusUpdateRetries; i = i + 1 {
		var newJob *batch.Job
		newJob, err = jobClient.Get(context.TODO(), job.Name, metav1.GetOptions{})
		if err != nil {
			break
		}
		newJob.Status = job.Status
		if _, err = jobClient.UpdateStatus(context.TODO(), newJob, metav1.UpdateOptions{}); err == nil {
			break
		}
	}

	return err
}
```

This is problematic because it can mask any mismatch between the state the job sync worked from and the latest state of the Job. In particular, it can leave stale UIDs or counters in the status when tracking job status with finalizers.
It might not have been a problem in the past because the job controller would always recompute the status from scratch. However, when tracking with finalizers, the existing status is part of the input to the sync.
The solution is to skip the GET and let the sync fail on conflict. A conflict implies that the Job was updated in the meantime, and that update has already put the Job back in the workqueue.
/sig apps
Code reference: kubernetes/pkg/controller/job/job_controller.go, lines 1357 to 1373 in 752c4b7