
Never clean backoff in job controller #63650

Merged

merged 3 commits into kubernetes:master from soltysh:issue62382 on Jun 6, 2018

Conversation

@soltysh (Contributor) commented May 10, 2018

What this PR does / why we need it:
In #60985 I added a mechanism that allows an immediate job status update; unfortunately, that seriously broke the backoff logic. I'm sorry for that. I've changed the immediate mechanism so that it NEVER cleans the backoff; for the cases where we want a fast status update, it uses a zero backoff instead.
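
To illustrate the approach described above, here is a minimal sketch (not a verbatim copy of the PR; the getBackoff helper and exact signatures are assumptions) of enqueueing a Job key with an explicit delay instead of ever forgetting it, so the requeue history that backs the backoff limit is preserved:

```go
package jobsketch

import (
	"fmt"
	"time"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// enqueueJob re-enqueues a Job key without ever clearing its backoff history.
// getBackoff stands in for the controller's per-key exponential backoff lookup.
func enqueueJob(queue workqueue.RateLimitingInterface, obj interface{}, immediate bool,
	getBackoff func(key string) time.Duration) {
	key, err := cache.MetaNamespaceKeyFunc(obj)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("couldn't get key for object %+v: %v", obj, err))
		return
	}

	backoff := time.Duration(0)
	if !immediate {
		backoff = getBackoff(key)
	}

	// Never call queue.Forget(key) here: forgetting the key would reset the
	// requeue count that is later compared against job.Spec.BackoffLimit.
	queue.AddAfter(key, backoff)
}
```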

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #62382

Special notes for your reviewer:
/assign @janetkuo

Release note:

Fix regression in `v1.JobSpec.backoffLimit` that caused failed Jobs to be restarted indefinitely.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 10, 2018
@dims (Member) commented May 10, 2018

LGTM 👍

@janetkuo (Member) left a comment


The change looks good. Would you add a test to catch this regression?
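
For illustration, a rough sketch (not necessarily the test that was eventually added; the image, timeouts, and backoff limit are assumptions) of the kind of regression test being requested: run a Job whose pods always fail, with a small backoffLimit, and assert that the Job is marked Failed instead of restarting pods indefinitely.

```go
package e2esketch

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForJobToExceedBackoffLimit creates an always-failing Job with
// backoffLimit=1 and waits until it carries a Failed condition.
func waitForJobToExceedBackoffLimit(c kubernetes.Interface, ns string) error {
	backoffLimit := int32(1)
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "backofflimit"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: v1.PodTemplateSpec{
				Spec: v1.PodSpec{
					RestartPolicy: v1.RestartPolicyNever,
					Containers: []v1.Container{{
						Name:    "c",
						Image:   "busybox:1.24",
						Command: []string{"/bin/sh", "-c", "exit 1"},
					}},
				},
			},
		},
	}
	if _, err := c.BatchV1().Jobs(ns).Create(job); err != nil {
		return err
	}

	// Before this fix the backoff was reset on every status update and the
	// controller kept creating new pods; with the fix the Job should reach a
	// Failed condition after backoffLimit+1 pod failures.
	return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		j, err := c.BatchV1().Jobs(ns).Get("backofflimit", metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range j.Status.Conditions {
			if cond.Type == batchv1.JobFailed && cond.Status == v1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}
```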

@janetkuo janetkuo added this to the v1.10 milestone May 14, 2018
@janetkuo janetkuo added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/bug Categorizes issue or PR as related to a bug. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels May 14, 2018
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@janetkuo @soltysh

Pull Request Labels
  • sig/apps: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@janetkuo (Member) commented:

Haven't heard back from you so I opened #63990 with your commit and a test for it.

@kow3ns kow3ns added this to In Progress in Workloads May 31, 2018
@dims (Member) commented Jun 3, 2018

/milestone v1.11

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.10, v1.11 Jun 3, 2018
@cblecker (Member) commented Jun 3, 2018

Holding as @janetkuo has #63990 open
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2018
@dims dims mentioned this pull request Jun 4, 2018
@soltysh (Contributor, Author) commented Jun 5, 2018

/retest

@soltysh (Contributor, Author) commented Jun 5, 2018

I'd prefer to merge the fix asap, backport it to 1.10, and deal with flakes (if they appear) afterwards.

As proof, I'm just reaching the 300th iteration of the test without a flake, so I'd say the flakiness is low probability, as well as hard to fix :/

@@ -393,7 +393,9 @@ func (jm *JobController) processNextWorkItem() bool {
 	}
 
 	utilruntime.HandleError(fmt.Errorf("Error syncing job: %v", err))
-	jm.queue.AddRateLimited(key)
+	if !errors.IsConflict(err) {
+		jm.queue.AddRateLimited(key)
Member:

If status update failed, we still need to requeue this job; otherwise, the job status may never be updated.

Member:

As a workaround, we can probably retry on update conflict. But in the long run we should not use # of requeues to compare against backoff limit.

soltysh (Contributor, Author):

Yeah, I totally agree that the re-queues are not good for that. They're causing us more harm than good :(

soltysh (Contributor, Author):

I've created #64787 to track this.
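
As an aside, the retry-on-conflict workaround suggested earlier in this thread is commonly written with client-go's retry helper; a minimal sketch under that assumption (updateJobStatus is a hypothetical helper, not the controller's actual updater):

```go
package jobsketch

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateJobStatus re-reads the Job and retries the status update whenever the
// apiserver answers with a 409 Conflict.
func updateJobStatus(c kubernetes.Interface, ns, name string, mutate func(*batchv1.Job)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		job, err := c.BatchV1().Jobs(ns).Get(name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		mutate(job)
		_, err = c.BatchV1().Jobs(ns).UpdateStatus(job)
		return err
	})
}
```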

// new failures happen when status does not reflect the failures and active
// is different than parallelism, otherwise the previous controller loop
// failed updating status so even if we pick up failure it is not a new one
exceedsBackoffLimit := jobHaveNewFailure && (active != *job.Spec.Parallelism) &&
Member:

Would you explain why active != *job.Spec.Parallelism is needed?

soltysh (Contributor, Author):

This is for the case when the controller failed an update but went through the whole process, which means that the number of active pods is exactly what it should be. There's no other way right now to figure out that the previous loop failed.
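
Putting the thread together, a sketch of the check being discussed (a paraphrase of the excerpts above, not a verbatim copy; previousRetry stands for the number of requeues recorded for this key):

```go
package jobsketch

import batchv1 "k8s.io/api/batch/v1"

// exceedsBackoffLimit sketches how the controller decides a Job has used up
// its backoff budget.
func exceedsBackoffLimit(job *batchv1.Job, active, failed int32, previousRetry int) bool {
	// A failure only counts as new if the status has not recorded it yet.
	jobHaveNewFailure := failed > job.Status.Failed
	// When active == parallelism, the previous loop went through the whole
	// process and only its status update failed, so a failure picked up now is
	// not a new one; the failure therefore only counts when active differs
	// from parallelism.
	return jobHaveNewFailure &&
		(active != *job.Spec.Parallelism) &&
		(int32(previousRetry)+1 > *job.Spec.BackoffLimit)
}
```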

@@ -494,8 +496,12 @@ func (jm *JobController) syncJob(key string) (bool, error) {
 	var failureReason string
 	var failureMessage string
 
-	jobHaveNewFailure := failed > job.Status.Failed
-	exceedsBackoffLimit := jobHaveNewFailure && (int32(previousRetry)+1 > *job.Spec.BackoffLimit)
+	jobHaveNewFailure := (failed > job.Status.Failed)
Member:

This could happen when Job Status failed to be updated?

soltysh (Contributor, Author):

This did not change; it is as it was before I tried the different scenarios, so there is nothing new in that change. During tests I just added the parentheses here.

@soltysh (Contributor, Author) commented Jun 5, 2018

@janetkuo updated, ptal once again

@soltysh soltysh force-pushed the issue62382 branch 3 times, most recently from a430523 to abd9e0d on June 5, 2018 20:23
st := job.Status

var err error
for i, job := 0, job; i < statusUpdateRetries; i, job = i+1, refresh(jobClient, job) {
Member:

i <= statusUpdateRetries

if err == nil {
return newJob
} else {
return job
Member:

nit: we should probably give up updating status when GET returns an error
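
For clarity, a sketch of the update-with-retry helper being reviewed, folding in both suggestions above (loop with i <= statusUpdateRetries, and give up if the refreshing GET fails); the name, signature, and retry count here are assumptions rather than the upstream helper:

```go
package jobsketch

import (
	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const statusUpdateRetries = 3 // assumed value, for illustration only

// updateJobStatusWithRetries applies update to the Job and pushes the new
// status, refreshing the object and retrying on 409 Conflict, and giving up
// early if the refreshing GET fails.
func updateJobStatusWithRetries(c kubernetes.Interface, ns string, job *batchv1.Job,
	update func(*batchv1.Job)) (*batchv1.Job, error) {
	var lastErr error
	// One initial attempt plus statusUpdateRetries retries (the "<=" suggestion).
	for i := 0; i <= statusUpdateRetries; i++ {
		update(job)
		updated, err := c.BatchV1().Jobs(ns).UpdateStatus(job)
		if err == nil {
			return updated, nil
		}
		lastErr = err
		if !apierrors.IsConflict(err) {
			break // only conflicts are worth retrying
		}
		// Refresh before the next attempt; if the GET fails, give up rather
		// than retrying with a stale copy (the "nit" above).
		fresh, getErr := c.BatchV1().Jobs(ns).Get(job.Name, metav1.GetOptions{})
		if getErr != nil {
			return job, getErr
		}
		job = fresh
	}
	return job, lastErr
}
```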

@janetkuo (Member) commented Jun 5, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: janetkuo, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@janetkuo (Member) commented Jun 5, 2018

/retest

@k8s-ci-robot (Contributor) commented Jun 6, 2018

@soltysh: The following test failed, say /retest to rerun them all:

Test: pull-kubernetes-kubemark-e2e-gce · Commit: d80ed53 · Rerun command: /test pull-kubernetes-kubemark-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cblecker (Member) commented Jun 6, 2018

/test pull-kubernetes-verify

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 64009, 64780, 64354, 64727, 63650). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 34759c2 into kubernetes:master Jun 6, 2018
Workloads automation moved this from In Progress to Done Jun 6, 2018
@soltysh soltysh deleted the issue62382 branch June 6, 2018 08:39
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jun 7, 2018
k8s-github-robot pushed a commit that referenced this pull request Jun 11, 2018
#63650-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #58972: Fix job's backoff limit for restart policy OnFailure #63650: Never clean backoff in job controller

Cherry pick of #58972 #63650 on release-1.10.

#58972: Fix job's backoff limit for restart policy OnFailure
#63650: Never clean backoff in job controller

Fixes #62382.

**Release Note:**
```release-note
Fix regression in `v1.JobSpec.backoffLimit` that caused failed Jobs to be restarted indefinitely.
```
@linyouchong (Contributor) commented Jul 24, 2018

@soltysh I ran the test locally and it failed 4 times.
Failure scenario 1:
kubectl get pod --all-namespaces -w

e2e-tests-job-td9cs   backofflimit-wvhgv   0/1       Pending   0         0s
e2e-tests-job-td9cs   backofflimit-wvhgv   0/1       Pending   0         0s
e2e-tests-job-td9cs   backofflimit-wvhgv   0/1       ContainerCreating   0         0s
e2e-tests-job-td9cs   backofflimit-wvhgv   0/1       Error     0         2s
e2e-tests-job-td9cs   backofflimit-nkfcw   0/1       Pending   0         0s
e2e-tests-job-td9cs   backofflimit-nkfcw   0/1       Pending   0         0s
e2e-tests-job-td9cs   backofflimit-nkfcw   0/1       ContainerCreating   0         0s
e2e-tests-job-td9cs   backofflimit-nkfcw   0/1       Error     0         2s
e2e-tests-job-td9cs   backofflimit-wvhgv   0/1       Terminating   0         11s
e2e-tests-job-td9cs   backofflimit-wvhgv   0/1       Terminating   0         11s
e2e-tests-job-td9cs   backofflimit-bvqb6   0/1       Pending   0         0s
e2e-tests-job-td9cs   backofflimit-bvqb6   0/1       Pending   0         0s
e2e-tests-job-td9cs   backofflimit-bvqb6   0/1       ContainerCreating   0         0s
e2e-tests-job-td9cs   backofflimit-bvqb6   0/1       Error     0         2s
e2e-tests-job-td9cs   backofflimit-bvqb6   0/1       Terminating   0         9s
e2e-tests-job-td9cs   backofflimit-bvqb6   0/1       Terminating   0         9s
e2e-tests-job-td9cs   backofflimit-nkfcw   0/1       Terminating   0         18s
e2e-tests-job-td9cs   backofflimit-nkfcw   0/1       Terminating   0         18s

related log:

 Expected
      <v1.PodPhase>: Pending
  to equal
      <v1.PodPhase>: Failed

Failure scenario 2:
kubectl get pod --all-namespaces -w

e2e-tests-job-896l6   backofflimit-646jx   0/1       Pending   0         0s
e2e-tests-job-896l6   backofflimit-646jx   0/1       Pending   0         0s
e2e-tests-job-896l6   backofflimit-646jx   0/1       ContainerCreating   0         0s
e2e-tests-job-896l6   backofflimit-646jx   0/1       Error     0         2s
e2e-tests-job-896l6   backofflimit-zsjxs   0/1       Pending   0         0s
e2e-tests-job-896l6   backofflimit-zsjxs   0/1       Pending   0         0s
e2e-tests-job-896l6   backofflimit-zsjxs   0/1       ContainerCreating   0         0s
e2e-tests-job-896l6   backofflimit-zsjxs   0/1       Error     0         2s
e2e-tests-job-896l6   backofflimit-646jx   0/1       Terminating   0         15s
e2e-tests-job-896l6   backofflimit-646jx   0/1       Terminating   0         15s
e2e-tests-job-896l6   backofflimit-zsjxs   0/1       Terminating   0         18s
e2e-tests-job-896l6   backofflimit-zsjxs   0/1       Terminating   0         18s

related log:

Jul 24 04:32:45.718: Not enough pod created expected at least 2, got []v1.Pod{v1.Pod{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"backofflimit-kncb7", GenerateName:"backofflimit-", Namespace:"e2e-tests-job-5g6xd", SelfLink:"/api/v1/namespaces/e2e-tests-job-5g6xd/pods/backofflimit-kncb7", UID:"21364a62-8f1c-11e8-9d2a-525400328f1b", ResourceVersion:"55083", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63668017961, loc:(*time.Location)(0x674e200)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"controller-uid":"1fd49ea7-8f1c-11e8-9d2a-525400328f1b", "job":"backofflimit", "job-name":"backofflimit"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference{v1.OwnerReference{APIVersion:"batch/v1", Kind:"Job", Name:"backofflimit", UID:"1fd49ea7-8f1c-11e8-9d2a-525400328f1b", Controller:(*bool)(0xc4217c0be8), BlockOwnerDeletion:(*bool)(0xc4217c0be9)}}, Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.PodSpec{Volumes:[]v1.Volume{v1.Volume{Name:"data", VolumeSource:v1.VolumeSource{HostPath:(*v1.HostPathVolumeSource)(nil), EmptyDir:(*v1.EmptyDirVolumeSource)(0xc4215b3f00), GCEPersistentDisk:(*v1.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*v1.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*v1.GitRepoVolumeSource)(nil), Secret:(*v1.SecretVolumeSource)(nil), NFS:(*v1.NFSVolumeSource)(nil), ISCSI:(*v1.ISCSIVolumeSource)(nil), Glusterfs:(*v1.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*v1.PersistentVolumeClaimVolumeSource)(nil), RBD:(*v1.RBDVolumeSource)(nil), FlexVolume:(*v1.FlexVolumeSource)(nil), Cinder:(*v1.CinderVolumeSource)(nil), CephFS:(*v1.CephFSVolumeSource)(nil), Flocker:(*v1.FlockerVolumeSource)(nil), DownwardAPI:(*v1.DownwardAPIVolumeSource)(nil), FC:(*v1.FCVolumeSource)(nil), AzureFile:(*v1.AzureFileVolumeSource)(nil), ConfigMap:(*v1.ConfigMapVolumeSource)(nil), VsphereVolume:(*v1.VsphereVirtualDiskVolumeSource)(nil), Quobyte:(*v1.QuobyteVolumeSource)(nil), AzureDisk:(*v1.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*v1.PhotonPersistentDiskVolumeSource)(nil), Projected:(*v1.ProjectedVolumeSource)(nil), PortworxVolume:(*v1.PortworxVolumeSource)(nil), ScaleIO:(*v1.ScaleIOVolumeSource)(nil), StorageOS:(*v1.StorageOSVolumeSource)(nil)}}, v1.Volume{Name:"default-token-xqtxk", VolumeSource:v1.VolumeSource{HostPath:(*v1.HostPathVolumeSource)(nil), EmptyDir:(*v1.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*v1.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*v1.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*v1.GitRepoVolumeSource)(nil), Secret:(*v1.SecretVolumeSource)(0xc421d8b5c0), NFS:(*v1.NFSVolumeSource)(nil), ISCSI:(*v1.ISCSIVolumeSource)(nil), Glusterfs:(*v1.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*v1.PersistentVolumeClaimVolumeSource)(nil), RBD:(*v1.RBDVolumeSource)(nil), FlexVolume:(*v1.FlexVolumeSource)(nil), Cinder:(*v1.CinderVolumeSource)(nil), CephFS:(*v1.CephFSVolumeSource)(nil), Flocker:(*v1.FlockerVolumeSource)(nil), DownwardAPI:(*v1.DownwardAPIVolumeSource)(nil), FC:(*v1.FCVolumeSource)(nil), AzureFile:(*v1.AzureFileVolumeSource)(nil), ConfigMap:(*v1.ConfigMapVolumeSource)(nil), VsphereVolume:(*v1.VsphereVirtualDiskVolumeSource)(nil), Quobyte:(*v1.QuobyteVolumeSource)(nil), AzureDisk:(*v1.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*v1.PhotonPersistentDiskVolumeSource)(nil), Projected:(*v1.ProjectedVolumeSource)(nil), 
PortworxVolume:(*v1.PortworxVolumeSource)(nil), ScaleIO:(*v1.ScaleIOVolumeSource)(nil), StorageOS:(*v1.StorageOSVolumeSource)(nil)}}}, InitContainers:[]v1.Container(nil), Containers:[]v1.Container{v1.Container{Name:"c", Image:"docker.artsz.zte.com.cn/cci/usee/busybox:1.24", Command:[]string{"/bin/sh", "-c", "exit 1"}, Args:[]string(nil), WorkingDir:"", Ports:[]v1.ContainerPort(nil), EnvFrom:[]v1.EnvFromSource(nil), Env:[]v1.EnvVar(nil), Resources:v1.ResourceRequirements{Limits:v1.ResourceList(nil), Requests:v1.ResourceList(nil)}, VolumeMounts:[]v1.VolumeMount{v1.VolumeMount{Name:"data", ReadOnly:false, MountPath:"/data", SubPath:"", MountPropagation:(*v1.MountPropagationMode)(nil)}, v1.VolumeMount{Name:"default-token-xqtxk", ReadOnly:true, MountPath:"/var/run/secrets/kubernetes.io/serviceaccount", SubPath:"", MountPropagation:(*v1.MountPropagationMode)(nil)}}, VolumeDevices:[]v1.VolumeDevice(nil), LivenessProbe:(*v1.Probe)(nil), ReadinessProbe:(*v1.Probe)(nil), Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullPolicy:"IfNotPresent", SecurityContext:(*v1.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, RestartPolicy:"Never", TerminationGracePeriodSeconds:(*int64)(0xc4217c0c38), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"default", DeprecatedServiceAccount:"default", AutomountServiceAccountToken:(*bool)(nil), NodeName:"10.114.51.188", HostNetwork:false, HostPID:false, HostIPC:false, ShareProcessNamespace:(*bool)(nil), SecurityContext:(*v1.PodSecurityContext)(0xc421d8b680), ImagePullSecrets:[]v1.LocalObjectReference(nil), Hostname:"", Subdomain:"", Affinity:(*v1.Affinity)(nil), SchedulerName:"default-scheduler", Tolerations:[]v1.Toleration{v1.Toleration{Key:"node.kubernetes.io/not-ready", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(0xc4217c0c80)}, v1.Toleration{Key:"node.kubernetes.io/unreachable", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(0xc4217c0ca0)}}, HostAliases:[]v1.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), DNSConfig:(*v1.PodDNSConfig)(nil)}, Status:v1.PodStatus{Phase:"Failed", Conditions:[]v1.PodCondition{v1.PodCondition{Type:"Initialized", Status:"True", LastProbeTime:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63668017961, loc:(*time.Location)(0x674e200)}}, Reason:"", Message:""}, v1.PodCondition{Type:"Ready", Status:"False", LastProbeTime:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63668017961, loc:(*time.Location)(0x674e200)}}, Reason:"ContainersNotReady", Message:"containers with unready status: [c]"}, v1.PodCondition{Type:"PodScheduled", Status:"True", LastProbeTime:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63668017961, loc:(*time.Location)(0x674e200)}}, Reason:"", Message:""}}, Message:"", Reason:"", NominatedNodeName:"", HostIP:"10.114.51.188", PodIP:"172.22.27.182", StartTime:(*v1.Time)(0xc4215b3f40), InitContainerStatuses:[]v1.ContainerStatus(nil), ContainerStatuses:[]v1.ContainerStatus{v1.ContainerStatus{Name:"c", State:v1.ContainerState{Waiting:(*v1.ContainerStateWaiting)(nil), Running:(*v1.ContainerStateRunning)(nil), Terminated:(*v1.ContainerStateTerminated)(0xc4201460e0)}, 
LastTerminationState:v1.ContainerState{Waiting:(*v1.ContainerStateWaiting)(nil), Running:(*v1.ContainerStateRunning)(nil), Terminated:(*v1.ContainerStateTerminated)(nil)}, Ready:false, RestartCount:0, Image:"busybox:1.24", ImageID:"docker-pullable://docker.artsz.zte.com.cn/cci/usee/busybox@sha256:d3b73e79f6841be61246107e620a33278d0c6a64a6254bcb03c7e6f4f8d77626", ContainerID:"docker://e00a6c301d359466a2a881de5aac8050e35558be1f75c2623ebabeed8ad4ba4f"}}, QOSClass:"BestEffort"}}}

@antoineco (Contributor):

For posterity, this was backported in 1.10.5 -> #64813

Linked issue closed by this pull request: Backoff Limit for Job does not work on Kubernetes 1.10.0 (#62382) · 9 participants