Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JobSet TTL to clean up completed workloads #279

Closed
Tracked by #350 ...
danielvegamyhre opened this issue Aug 29, 2023 · 21 comments · Fixed by #443
Closed
Tracked by #350 ...

JobSet TTL to clean up completed workloads #279

danielvegamyhre opened this issue Aug 29, 2023 · 21 comments · Fixed by #443
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature.

Comments

@danielvegamyhre
Copy link
Contributor

danielvegamyhre commented Aug 29, 2023

We should look into having JobSet respect TTL so completed JobSets can be deleted from etcd and not take up space

https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/

@kannon92
Copy link
Contributor

So I am a bit confused by this.

Are you saying we should have a field in JobSet that will purge JobSet from etcd giving a TTL?

Or are you wanting someone to test if one sets TTL in a Job, will the Job finish and then get deleted. And even if the JobSet was marked as success (when Jobs all finish), the JobSet will recreate the finished job because it was deleted?

@kannon92
Copy link
Contributor

So individual jobs can use ttl without any issue.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: simple-with-ttl
spec:
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0
        ttlSecondsAfterFinished: 60
        completions: 1
        parallelism: 1
        template:
          spec:
            containers:
            - name: leader
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 100
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        ttlSecondsAfterFinished: 60
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 100
~                            

Jobs are cleaned up and deleted from etcd but JobSet will still mark this was successful.

I think we could add a ttl for JobSet as the JobSet will not be deleted. IMO this would be a nice feature for any CRD though..

@danielvegamyhre
Copy link
Contributor Author

Thanks for testing ttlSecondsAfterFinished for child jobs! So to clarify, the jobset controller did not recreate the child jobs that were cleaned up after the TTL? Just want to confirm since I created this issue when a user mentioned this happened to them, jobs were recreated. They could have just been misunderstanding what they were seeing though.

@kannon92
Copy link
Contributor

kannon92 commented Nov 2, 2023

Yea I didn’t see jobs recreated in my case.

@danielvegamyhre
Copy link
Contributor Author

Sounds good, thanks for taking a look at this.

@danielvegamyhre
Copy link
Contributor Author

I think we actually need a TTL in JobSet to improve the user experience with Kueue integration. Right now, users have to manually delete all Workloads since JobSet has no TTL, so they either need to do it manually or have some process doing it periodically. If JobSet had a TTL, the JobSet + Workload object would be cleaned up automatically.

@danielvegamyhre danielvegamyhre changed the title Investigate JobSet handling of TTL to clean up completed jobs Investigate JobSet TTL to clean up completed workloads Nov 6, 2023
@kannon92
Copy link
Contributor

kannon92 commented Nov 7, 2023

Can you clean up the description? I think this could be a good first issue if we clear the objectives.

Add a ttlafterFinished field, delete finished jobs sets after that time.

My questions I would have is do we delete jobs and pods of finished jobsets? That would make debugging difficult but if they finished I guess we don’t need it.

@kannon92
Copy link
Contributor

/help
/good-first-issue

@k8s-ci-robot
Copy link
Contributor

@kannon92:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/help
/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Nov 23, 2023
@dejanzele
Copy link
Contributor

dejanzele commented Nov 30, 2023

I have capacity to help with this issue if no one else is looking at it.

@dejanzele
Copy link
Contributor

Can you clean up the description? I think this could be a good first issue if we clear the objectives.

Add a ttlafterFinished field, delete finished jobs sets after that time.

My questions I would have is do we delete jobs and pods of finished jobsets? That would make debugging difficult but if they finished I guess we don’t need it.

I guess we should only agree on the final question from @kannon92 and it should be a clear path to the finish line.

@dejanzele
Copy link
Contributor

/assign

@dejanzele
Copy link
Contributor

@danielvegamyhre what is your opinion on what Kevin asked for cleaning up jobs and pods after jobsets finishes?

@danielvegamyhre
Copy link
Contributor Author

After the jobset finishes and the TTL has been reached, then we want the jobset and it's child resources (jobs, pods, service, etc.) to be deleted.

@dejanzele
Copy link
Contributor

Thanks @danielvegamyhre, I am good to proceed :)

@kannon92
Copy link
Contributor

/retitle JobSet TTL to clean up completed workloads

@k8s-ci-robot k8s-ci-robot changed the title Investigate JobSet TTL to clean up completed workloads JobSet TTL to clean up completed workloads Mar 20, 2024
@kannon92
Copy link
Contributor

I was testing this feature. I noticed that pods and jobs were getting deleted with ttl but I did not see jobsets getting deleted.

@ahg-g
Copy link
Contributor

ahg-g commented Apr 12, 2024

That is not good, do you want to open an issue with any logs you have noticed?

@ahg-g
Copy link
Contributor

ahg-g commented Apr 12, 2024

@dejanzele can we add an e2e test as well?

@kannon92
Copy link
Contributor

Sorry, I was mistaken.

I created the example ttl job and noticed that jobset was still around but jobs/pods were deleted.

But I tried testing again and I did actually see the jobset being deleted eventually. I'm not sure they will happen at the same time but I do see ttl working with JobSet/Jobs/Pods.

@dejanzele
Copy link
Contributor

@ahg-g ok, I'll create a PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
5 participants