Job controller: Updates might override stale data #105199

Closed
alculquicondor opened this issue Sep 22, 2021 · 2 comments · Fixed by #105214
Labels
kind/bug needs-triage sig/apps

Comments

Member

@alculquicondor alculquicondor commented Sep 22, 2021

When investigating #105179, @liggitt and I discovered that the job controller does a GET request of the job before issuing any Job status update.

func (jm *Controller) updateJobStatus(job *batch.Job) error {
    jobClient := jm.kubeClient.BatchV1().Jobs(job.Namespace)
    var err error
    for i := 0; i <= statusUpdateRetries; i = i + 1 {
        var newJob *batch.Job
        newJob, err = jobClient.Get(context.TODO(), job.Name, metav1.GetOptions{})
        if err != nil {
            break
        }
        newJob.Status = job.Status
        if _, err = jobClient.UpdateStatus(context.TODO(), newJob, metav1.UpdateOptions{}); err == nil {
            break
        }
    }
    return err
}

This is problematic because it can mask any mismatch between the state the job sync was computed from and the latest state of the Job: the fresh GET picks up the newest resourceVersion, so the update succeeds even if the Job changed after the sync read it. In particular, this can cause UIDs or counters to hold stale data when tracking job status with finalizers.
It might not have been a problem in the past because the job controller would always recompute the status from scratch. However, when tracking with finalizers, the existing status is part of the input to the sync.

The solution is to skip the Job GET and let the sync fail in case of a conflict. A conflict implies that the Job was updated in the meantime, so it is already back in the workqueue because of that update.

/sig apps

@alculquicondor alculquicondor added the kind/bug label Sep 22, 2021
@k8s-ci-robot k8s-ci-robot added the sig/apps label Sep 22, 2021
Contributor

@k8s-ci-robot k8s-ci-robot commented Sep 22, 2021

@alculquicondor: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot k8s-ci-robot added the needs-triage label Sep 22, 2021
Member Author

@alculquicondor alculquicondor commented Sep 23, 2021

/assign
