HPA support for PyTorch Elastic #1751

Closed
tsuiot opened this issue Feb 6, 2023 · 15 comments · Fixed by #1752
Comments

@tsuiot

tsuiot commented Feb 6, 2023

1. Background

Using training-operator/examples/pytorch/elastic/imagenet/imagenet.yaml with an HPA for PyTorch Elastic.
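
For reference, the elastic section of that example looks roughly like the sketch below (the exact replica bounds, metric target, and image in the upstream YAML may differ); the elasticPolicy.metrics block is what drives the HPA that scales the Worker replicas:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-example-imagenet
spec:
  elasticPolicy:
    rdzvBackend: c10d          # worker-0 hosts the c10d rendezvous store on port 23456
    minReplicas: 1
    maxReplicas: 4             # upper bound for HPA scale-up (illustrative)
    maxRestarts: 100
    metrics:                   # used to build the HPA for the Worker replicas
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
  pytorchReplicaSpecs:
    Worker:
      replicas: 2              # initial worker count before any scale-up
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-elastic-example-imagenet  # assumed image name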

2. Scale-up leaves the PyTorchJob stuck in the Running state

Screenshots of the Jobs list, PytorchJob.Status, and HPA.status were attached (not reproduced here).

In this case, worker-3, which was scaled up by the HPA, is Pending at first due to NotEnoughResources and only gets scheduled after worker-0 through worker-2 have Succeeded. This leaves the PyTorchJob in the Running state indefinitely, while worker-3 keeps running and restarting as it tries to communicate with worker-0.

@johnugeorge
Member

Which image are you using? Is this always reproducible?

@tsuiot
Author

tsuiot commented Feb 6, 2023

I built the image myself from the master branch. Yes.

@johnugeorge
Member

Does this happen only when you get into a "NotEnoughResources" situation? What about normal situations? Is it the case that the underlying elastic job completed but the PyTorchJob doesn't show Succeeded?

@tsuiot
Author

tsuiot commented Feb 7, 2023

1. Yes.
2. If all pods are scheduled and complete, the PyTorchJob changes to Completed.
3. No, it appears when some pods have completed while others are Pending or Failed, and the PyTorchJob doesn't show Succeeded even though the training has actually finished.

@Syulin7
Contributor

Syulin7 commented Feb 7, 2023

When we use PyTorch Elastic (e.g. https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml):

  1. The worker replicas start at two.
  2. The HPA scales the worker replicas to 4.
  3. For some reason (like NotEnoughResources), worker-2 is not running.
  4. The other workers complete (including worker-0), so worker-2 can no longer connect to worker-0.
# kubectl get pod
elastic-example-imagenet-worker-0   0/1     Completed           0                3h16m
elastic-example-imagenet-worker-1   0/1     Completed           0                3h16m
elastic-example-imagenet-worker-2   1/1     Running             20 (5m13s ago)   3h15m
elastic-example-imagenet-worker-3   0/1     Completed           0                3h15m

# kubectl logs elastic-example-imagenet-worker-2 -p
[E socket.cpp:860] [c10d] The client socket has timed out after 60s while trying to connect to (elastic-example-imagenet-worker-0, 23456).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 155, in _create_tcp_store
    store = TCPStore(
TimeoutError: The client socket has timed out after 60s while trying to connect to (elastic-example-imagenet-worker-0, 23456).
  5. The PyTorchJob is still Running because the success criteria are not met:
if rtype == kubeflowv1.PyTorchJobReplicaTypeWorker {
    // TODO(gaocegege): Support SuccessPolicy
    if expected == 0 {
        msg := fmt.Sprintf("PyTorchJob %s/%s successfully completed.",
            pytorchjob.Namespace, pytorchjob.Name)
        r.recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobSucceededReason, msg)
        if jobStatus.CompletionTime == nil {
            now := metav1.Now()
            jobStatus.CompletionTime = &now
        }
        err := commonutil.UpdateJobConditions(jobStatus,
            commonv1.JobSucceeded, commonutil.JobSucceededReason, msg)
        if err != nil {
            commonutil.LoggerForJob(pytorchjob).Infof("Append pytorchjob condition error: %v", err)
            return err
        }
        trainingoperatorcommon.SuccessfulJobsCounterInc(pytorchjob.Namespace, kubeflowv1.PytorchJobFrameworkName)
    } else if running > 0 {
        // Some workers are still running, leave a running condition.
        msg := fmt.Sprintf("PyTorchJob %s/%s is running.",
            pytorchjob.Namespace, pytorchjob.Name)
        err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobRunning, commonutil.JobRunningReason, msg)
        if err != nil {
            commonutil.LoggerForJob(pytorchjob).Infof("Append pytorchjob condition error: %v", err)
            return err
        }

Supporting successPolicy on PyTorchJob could fix this:

  1. SuccessPolicyAllWorkers - the current default success policy for PyTorchJob.
  2. SuccessPolicyChiefWorker - the job succeeds once worker-0 has completed (same as TFJob's SuccessPolicyDefault).

@johnugeorge @tsuiot WDYT?
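
As a rough illustration of option 2 only: if it were exposed in the API, a successPolicy field on the PyTorchJob spec (modeled on TFJob's spec.successPolicy; the controller snippet above still carries a TODO for SuccessPolicy, so this field is hypothetical at the time of this discussion) might look like:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-example-imagenet
spec:
  # Hypothetical field for illustration only, not part of the current PyTorchJob API:
  # mark the job Succeeded as soon as worker-0 completes.
  successPolicy: ChiefWorker
  pytorchReplicaSpecs:
    Worker:
      replicas: 2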

@johnugeorge
Member

This is related to a bug in reporting success for Elastic runs.

Related: #1711 (comment)

@Syulin7
Contributor

Syulin7 commented Feb 7, 2023

For elastic mode, if worker-0 completes, should we default to setting the PyTorchJob to Succeeded?

@johnugeorge
Member

johnugeorge commented Feb 7, 2023

Yes. I think if any worker succeeds, we can mark the job Succeeded. Worker-0 is safer since it hosts the c10d rendezvous.

@johnugeorge
Member

@tsuiot Can you try the fix from #1752?

@tsuiot
Author

tsuiot commented Feb 8, 2023

Yes.

@Syulin7
Contributor

Syulin7 commented Feb 8, 2023

@tsuiot Did you run into any problem with the test?
I tested it myself and the job shows Succeeded. @johnugeorge

$ kubectl get pod
NAME                                READY   STATUS             RESTARTS          AGE
elastic-example-imagenet-worker-0   0/1     Completed          0                 19h
elastic-example-imagenet-worker-1   0/1     CrashLoopBackOff   188 (2m39s ago)   19h
elastic-example-imagenet-worker-2   0/1     Completed          0                 19h
elastic-example-imagenet-worker-3   0/1     Completed          0                 19h
$ kubectl get pytorchjob
NAME                       STATE       AGE
elastic-example-imagenet   Succeeded   19h

@tsuiot
Author

tsuiot commented Feb 8, 2023

1. The PyTorchJob changes to Succeeded with #1752.
2. The new worker pod is still running and restarting. Should we delete the scaled-up pods that are not in a normal state once the PyTorchJob has Succeeded?

@Syulin7
Contributor

Syulin7 commented Feb 8, 2023

2. The new worker pod is still running and restarting. Should we delete the scaled-up pods that are not in a normal state once the PyTorchJob has Succeeded?

PytorchJob.RunPolicy.CleanPodPolicy defines the policy for killing pods after the job completes (the default value is None for PyTorchJob).

// SetDefaults_PyTorchJob sets any unspecified values to defaults.
func SetDefaults_PyTorchJob(job *PyTorchJob) {
    // Set default cleanpod policy to None.
    if job.Spec.RunPolicy.CleanPodPolicy == nil {
        policy := commonv1.CleanPodPolicyNone
        job.Spec.RunPolicy.CleanPodPolicy = &policy
    }

But in the comment in common, the default value is documented as Running, which is ambiguous.

type RunPolicy struct {
	// CleanPodPolicy defines the policy to kill pods after the job completes.
	// Default to Running.
	CleanPodPolicy *CleanPodPolicy `json:"cleanPodPolicy,omitempty"`

@tsuiot you can set PytorchJob.RunPolicy.CleanPodPolicy to Running and try again.
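
A minimal sketch of that override, assuming the common spec.runPolicy.cleanPodPolicy field discussed above:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-example-imagenet
spec:
  runPolicy:
    cleanPodPolicy: Running   # delete pods that are still running once the job completes
  pytorchReplicaSpecs:
    Worker:
      replicas: 2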

@johnugeorge Should we set the PyTorchJob CleanPodPolicy default to Running?

@tsuiot
Author

tsuiot commented Feb 8, 2023

/LGTM

@johnugeorge
Member

We can fix the comment for now. It is better to keep the behavior consistent across all jobs.
