Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The cleanPolicy field of PyTorchJob is incorrectly placed. #966

Closed
lvxian-kjl opened this issue Jul 12, 2023 · 13 comments · Fixed by #1024
Closed

The cleanPolicy field of PyTorchJob is incorrectly placed. #966

lvxian-kjl opened this issue Jul 12, 2023 · 13 comments · Fixed by #1024

Comments

@lvxian-kjl
Copy link

lvxian-kjl commented Jul 12, 2023

The cleanPolicy, activeDeadlineSeconds, and ttlSecondsAfterFinished within spec should be under spec.runPolicy, not directly under spec.

@panpan0000
Copy link

panpan0000 commented Dec 18, 2023

+1

I just see the same problem.

I will have to run submitJob CLI and generate a /tmp/$xxx.yaml... temporary file , then manually update the position of cleanPodPolicy , then kubectl apply again...

image

it's a bit troublesome.

v0.9.0+447b534 still have this problem.

@panpan0000
Copy link

Hi, @cheyang , considering the incompatibility between the kubeflow trainer-operator CRD with arena code , what's your though ?

@panpan0000
Copy link

If no objection, I'm doing the fix now

panpan0000 added a commit to panpan0000/arena that referenced this issue Dec 18, 2023
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
panpan0000 added a commit to panpan0000/arena that referenced this issue Dec 18, 2023
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
panpan0000 added a commit to panpan0000/arena that referenced this issue Dec 18, 2023
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
panpan0000 added a commit to panpan0000/arena that referenced this issue Dec 18, 2023
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
@Syulin7
Copy link
Collaborator

Syulin7 commented Dec 18, 2023

@lvxian-kjl @panpan0000 In previous tf-operator and pytorch-operator, the cleanPolicy was under spec, but in the training-operator CRD, it has been moved to spec.RunPolicy, which is a breaking change. Considering compatibility with users who are currently using it, there has been no direct switch to the training-operator CRD.

@Syulin7
Copy link
Collaborator

Syulin7 commented Dec 18, 2023

If no objection, I'm doing the fix now

Thank you for your PR, but it would impact existing users. We are also considering how to make it compatible with the training-operator CRD, such as adding a new template in the arena charts, and making a distinction when installing arena.

@panpan0000
Copy link

panpan0000 commented Dec 18, 2023

Thank you @Syulin7 , but people's intuitions are "arena will be the CLI tool to manage the jobs , after installing kubeflow and its training-operator"
But the current arena will never submit any job for up to date kubeflow standard training-operator and CRDs.
So every new user will be blocked here and has to give up arena. I'm afraid that's not the good way out for community.

May I suggest that :
(1) create a maintenance branch to support those OLD users with old kubeflow CRD, especially for those users in AliYun. with NOTES in README to tell the max compatbile kubeflow version, to avoid confusion .
(2) the main branch just supports latest kubeflow CRD, and add a NOTE in README, to tell people the minimal kubeflow version for this branch.

@Syulin7
Copy link
Collaborator

Syulin7 commented Dec 18, 2023

@panpan0000 Thank you for your suggestion, I agree. Supporting the training-operator CRD and maintaining backward compatibility are both very important.

Creating a new branch is a solution, but it may add complexity and cause confusion. I am trying to achieve compatibility with the training-operator CRD by identifying the CRD and using different templates when submitting jobs. In summary, our goal is to support the latest training-operator CRD.

@panpan0000
Copy link

so did you mean the old CRD having the different apiVersion ? (now it's kubeflow.org/v1, so the old one maybe v1beta2..etc )?

@Syulin7
Copy link
Collaborator

Syulin7 commented Dec 20, 2023

@Syulin7 Syulin7 changed the title pytorchjob的cleanPolicy字段层级错误 The cleanPolicy field of PyTorchJob is incorrectly placed. Jan 12, 2024
@Syulin7
Copy link
Collaborator

Syulin7 commented Jan 16, 2024

#1024 fixed this issue, If you have any suggestions, please let me know. @panpan0000 @lvxian-kjl

@panpan0000
Copy link

sorry, I found the release binary missing so cannot test the update
#1045

@Syulin7
Copy link
Collaborator

Syulin7 commented Mar 4, 2024

sorry, I found the release binary missing so cannot test the update #1045

@panpan0000 I just updated, PTLA.

@panpan0000
Copy link

I just verified with v0.9.12, the issue has gone , thank you @Syulin7

my verification log

kubeflow training-operator version v1-5525468
arena: v0.9.12
kubernetes v1.25.3
arena submit pytorch --namespace=peter --name=latest-arena-verify  --workers=1 --image=nvcr.io/nvidia/pytorch:23.03-py3  bash


arena get  latest-arena-verify --type pytorchjob -n peter
Name:        latest-arena-verify
Status:      SUCCEEDED


kubectl -n peter get pytorchjobs latest-arena-verify -o yaml
spec:
  pytorchReplicaSpecs:
    Master:
      template:
        metadata:
          name: latest-arena-verify
        spec:
          containers:
          - command:
          ..... 
  runPolicy:
    cleanPodPolicy: Running
    suspend: false

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants