Job TTLs not working #1533

Closed
arashd opened this issue Feb 9, 2022 · 2 comments · Fixed by #1614

arashd commented Feb 9, 2022

I'm running into an issue where the "Ttl Seconds After Finished" setting on my TFJobs isn't being respected. I suspect it's because the reconcile loop, where CleanupJob runs, doesn't run frequently enough for all jobs.

As an example: I start a TFJob with a TTL of 1 minute. The reconcile loop runs any time there's a state change (including when the job finishes successfully). The job isn't deleted upon finishing because the TTL hasn't passed yet. The reconcile loop then doesn't run again until, in this case, 7 hours after the job finished. At that point the job does get cleaned up, along with other jobs that are past their TTL.

My question is: is it expected behaviour that the reconcile loop doesn't run at shorter intervals for finished jobs? Is there a way to find out where that default 7-hour period is set? If so, would it make sense to change it to a smaller default to accommodate the TTL feature?

Here are the training-operator logs for the job from start to finish:

time="2022-02-09T01:42:43Z" level=info msg="TFJob test-ttl-6 is created."
time="2022-02-09T01:42:43Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:43Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:43Z" level=info msg="Need to create new pod: chief-0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Controller test-ttl-6 created pod test-ttl-6-chief-0" job=.test-ttl-6 pod=.test-ttl-6-chief-0 uid=
time="2022-02-09T01:42:44Z" level=info msg="need to create new service: chief-0" job=adelijani.test-ttl-6 replica-type=chief uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
2022-02-09T01:42:44.424Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"adelijani","name":"test-ttl-6","uid":"5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39","apiVersion":"kubeflow.org/v1","resourceVersion":"806803740"}, "reason": "SuccessfulCreatePod", "message": "Created pod: test-ttl-6-chief-0"}
time="2022-02-09T01:42:44Z" level=info msg="Controller test-ttl-6 created service test-ttl-6-chief-0"
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
2022-02-09T01:42:44.436Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"TFJob","namespace":"adelijani","name":"test-ttl-6","uid":"5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39","apiVersion":"kubeflow.org/v1","resourceVersion":"806803740"}, "reason": "SuccessfulCreateService", "message": "Created service: test-ttl-6-chief-0"}
time="2022-02-09T01:42:44Z" level=info msg="Finished updating TFJobs Status \"test-ttl-6\" (8.171651ms)" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Finished updating TFJobs Status \"test-ttl-6\" (5.933567ms)" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=warning msg="Reconcile Tensorflow Job error Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-ttl-6\": the object has been modified; please apply your changes to the latest version and try again"
2022-02-09T01:42:44.450Z ERROR controller-runtime.manager.controller.tfjob-controller Reconciler error {"name": "test-ttl-6", "namespace": "adelijani", "error": "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-ttl-6\": the object has been modified; please apply your changes to the latest version and try again"}
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39
time="2022-02-09T01:42:44Z" level=info msg="Reconciling for job test-ttl-6"
time="2022-02-09T01:42:44Z" level=warning msg="The restart policy of replica Chief of the job test-ttl-6 is not OnFailure or Always. Not counted in backoff limit."
time="2022-02-09T01:42:44Z" level=info msg="TFJob=adelijani/test-ttl-6, ReplicaType=Chief expected=1, running=0, failed=0" job=adelijani.test-ttl-6 uid=5eeb5c65-b33b-4a56-ba8d-c02ce3bccb39

zw0610 (Member) commented Feb 10, 2022

Is there a way to find out where that default 7 hours period is set?

I believe so. When the manager is constructed, an option named SyncPeriod should do the job. You will need to modify the main.go file to set that option.

However, I believe we might find another way to support this feature again, now that the workqueue has been disabled in reconcile mode. @Jeffwan

arashd (Author) commented Feb 10, 2022

Thanks for the helpful guidance. Looking forward to hearing if there's a way to change this without touching code.
