The cleanPolicy field of PyTorchJob is incorrectly placed. #966

lvxian-kjl · 2023-07-12T03:03:16Z

The cleanPolicy, activeDeadlineSeconds, and ttlSecondsAfterFinished within spec should be under spec.runPolicy, not directly under spec.

panpan0000 · 2023-12-18T03:09:26Z

+1

I just see the same problem.

I will have to run submitJob CLI and generate a /tmp/$xxx.yaml... temporary file , then manually update the position of cleanPodPolicy , then kubectl apply again...

it's a bit troublesome.

v0.9.0+447b534 still have this problem.

panpan0000 · 2023-12-18T05:12:17Z

Hi, @cheyang , considering the incompatibility between the kubeflow trainer-operator CRD with arena code , what's your though ?

panpan0000 · 2023-12-18T06:21:18Z

If no objection, I'm doing the fix now

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

Syulin7 · 2023-12-18T07:27:33Z

@lvxian-kjl @panpan0000 In previous tf-operator and pytorch-operator, the cleanPolicy was under spec, but in the training-operator CRD, it has been moved to spec.RunPolicy, which is a breaking change. Considering compatibility with users who are currently using it, there has been no direct switch to the training-operator CRD.

Syulin7 · 2023-12-18T07:35:19Z

If no objection, I'm doing the fix now

Thank you for your PR, but it would impact existing users. We are also considering how to make it compatible with the training-operator CRD, such as adding a new template in the arena charts, and making a distinction when installing arena.

panpan0000 · 2023-12-18T08:55:00Z

Thank you @Syulin7 , but people's intuitions are "arena will be the CLI tool to manage the jobs , after installing kubeflow and its training-operator"
But the current arena will never submit any job for up to date kubeflow standard training-operator and CRDs.
So every new user will be blocked here and has to give up arena. I'm afraid that's not the good way out for community.

May I suggest that :
(1) create a maintenance branch to support those OLD users with old kubeflow CRD, especially for those users in AliYun. with NOTES in README to tell the max compatbile kubeflow version, to avoid confusion .
(2) the main branch just supports latest kubeflow CRD, and add a NOTE in README, to tell people the minimal kubeflow version for this branch.

Syulin7 · 2023-12-18T12:26:01Z

@panpan0000 Thank you for your suggestion, I agree. Supporting the training-operator CRD and maintaining backward compatibility are both very important.

Creating a new branch is a solution, but it may add complexity and cause confusion. I am trying to achieve compatibility with the training-operator CRD by identifying the CRD and using different templates when submitting jobs. In summary, our goal is to support the latest training-operator CRD.

panpan0000 · 2023-12-19T02:57:01Z

so did you mean the old CRD having the different apiVersion ? (now it's kubeflow.org/v1, so the old one maybe v1beta2..etc )?

Syulin7 · 2023-12-20T09:57:58Z

No, the apiVersion is all kubeflow.org/v1.

https://github.com/kubeflow/training-operator/blob/92389064014d3e661e1a69cb9a33cb996e493e10/pkg/apis/tensorflow/v1/types.go#L56C6-L56C6

Syulin7 · 2024-01-16T03:12:22Z

#1024 fixed this issue, If you have any suggestions, please let me know. @panpan0000 @lvxian-kjl

panpan0000 · 2024-03-04T02:27:37Z

sorry, I found the release binary missing so cannot test the update
#1045

Syulin7 · 2024-03-04T02:30:33Z

sorry, I found the release binary missing so cannot test the update #1045

@panpan0000 I just updated, PTLA.

panpan0000 · 2024-03-04T03:41:14Z

I just verified with v0.9.12, the issue has gone , thank you @Syulin7

my verification log

kubeflow training-operator version v1-5525468
arena: v0.9.12
kubernetes v1.25.3

arena submit pytorch --namespace=peter --name=latest-arena-verify  --workers=1 --image=nvcr.io/nvidia/pytorch:23.03-py3  bash


arena get  latest-arena-verify --type pytorchjob -n peter
Name:        latest-arena-verify
Status:      SUCCEEDED

kubectl -n peter get pytorchjobs latest-arena-verify -o yaml

spec:
  pytorchReplicaSpecs:
    Master:
      template:
        metadata:
          name: latest-arena-verify
        spec:
          containers:
          - command:
          ..... 
  runPolicy:
    cleanPodPolicy: Running
    suspend: false

panpan0000 added a commit to panpan0000/arena that referenced this issue Dec 18, 2023

Fix: kubeflow#966, correct cleanPolicy position

1c6ae0a

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

panpan0000 added a commit to panpan0000/arena that referenced this issue Dec 18, 2023

Fix: kubeflow#966: golang part for CleanPolicy position

c18318e

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

panpan0000 added a commit to panpan0000/arena that referenced this issue Dec 18, 2023

Fix: kubeflow#966: golang part for CleanPolicy position

01397ed

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

panpan0000 added a commit to panpan0000/arena that referenced this issue Dec 18, 2023

Fix: kubeflow#966, correct cleanPolicy position

e7834bd

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

Syulin7 mentioned this issue Jan 12, 2024

Compatible with training-operator CRD. #1024

Merged

Syulin7 changed the title ~~pytorchjob的cleanPolicy字段层级错误~~ The cleanPolicy field of PyTorchJob is incorrectly placed. Jan 12, 2024

google-oss-prow bot closed this as completed in #1024 Jan 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The cleanPolicy field of PyTorchJob is incorrectly placed. #966

The cleanPolicy field of PyTorchJob is incorrectly placed. #966

lvxian-kjl commented Jul 12, 2023 •

edited by Syulin7

Loading

panpan0000 commented Dec 18, 2023 •

edited

Loading

panpan0000 commented Dec 18, 2023

panpan0000 commented Dec 18, 2023

Syulin7 commented Dec 18, 2023

Syulin7 commented Dec 18, 2023 •

edited

Loading

panpan0000 commented Dec 18, 2023 •

edited

Loading

Syulin7 commented Dec 18, 2023

panpan0000 commented Dec 19, 2023

Syulin7 commented Dec 20, 2023

Syulin7 commented Jan 16, 2024

panpan0000 commented Mar 4, 2024

Syulin7 commented Mar 4, 2024

panpan0000 commented Mar 4, 2024

The cleanPolicy field of PyTorchJob is incorrectly placed. #966

The cleanPolicy field of PyTorchJob is incorrectly placed. #966

Comments

lvxian-kjl commented Jul 12, 2023 • edited by Syulin7 Loading

panpan0000 commented Dec 18, 2023 • edited Loading

panpan0000 commented Dec 18, 2023

panpan0000 commented Dec 18, 2023

Syulin7 commented Dec 18, 2023

Syulin7 commented Dec 18, 2023 • edited Loading

panpan0000 commented Dec 18, 2023 • edited Loading

Syulin7 commented Dec 18, 2023

panpan0000 commented Dec 19, 2023

Syulin7 commented Dec 20, 2023

Syulin7 commented Jan 16, 2024

panpan0000 commented Mar 4, 2024

Syulin7 commented Mar 4, 2024

panpan0000 commented Mar 4, 2024

lvxian-kjl commented Jul 12, 2023 •

edited by Syulin7

Loading

panpan0000 commented Dec 18, 2023 •

edited

Loading

Syulin7 commented Dec 18, 2023 •

edited

Loading

panpan0000 commented Dec 18, 2023 •

edited

Loading