
feat: enable pytorch elastic training fashion based on torch elastic #267

Open
wants to merge 4 commits into master

Conversation


@wanziyu commented Aug 15, 2022

I. Describe what this PR does

This PR designs the elastic training APIs, adds a torch-elastic controller, and implements the elastic training control flow in both the torch-elastic controller and the pytorch controller. Currently, the scaling algorithm is based on the real-time batch training latency collected from running pod logs.

  • Elastic training APIs on the PyTorchJob spec (a consolidated sketch follows this list).
  • Elastic training control flow implemented in the torch-elastic controller.
  • A PyTorch elastic training job example.
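
For orientation, here is a consolidation of the new spec fields as they surface in the diff fragments quoted below; grouping them into one snippet is illustrative only, and the MaxReplicas doc comment is paraphrased:

// Elastic training fields added to the PyTorchJob spec (consolidated
// here for readability; see the diff fragments below for the source).

// EnableElastic decides whether torch elastic is enabled for the job.
EnableElastic bool `json:"enableElastic"`

// MaxReplicas is the upper bound when scaling out workers.
MaxReplicas *int32 `json:"maxReplicas,omitempty"`

// RDZVBackend and RdzvEndpoint configure the torch elastic rendezvous,
// e.g. a "c10d" backend with a "host:port" endpoint.
RDZVBackend  string `json:"rdzvBackend,omitempty"`
RdzvEndpoint string `json:"rdzvEndpoint"`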

II. Does this pull request fix one issue?

#251


// EnableElastic decides whether torch elastic is enabled for job.
// +optional
EnableElastic bool `json:"enableElastic"`
Collaborator

add omitempty json tag
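
i.e., presumably something like:

EnableElastic bool `json:"enableElastic,omitempty"`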

MaxReplicas *int32 `json:"maxReplicas,omitempty"`

RDZVBackend string `json:"rdzvBackend,omitempty"`
RdzvEndpoint string `json:"rdzvEndpoint"`
Collaborator

Is rdzvEndpoint required, and therefore must not be omitempty?

Author

Yes, RDZVBackend and RdzvEndpoint are both required when EnableElastic is true.
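
A minimal sketch of how that requirement could be enforced at validation time; validateElasticPolicy and the exact spec type are assumptions for illustration, not code from this PR:

// Hypothetical validation helper: when elastic training is enabled,
// both rendezvous fields must be set. Assumes the "fmt" import and
// the training API package alias used elsewhere in this PR.
func validateElasticPolicy(spec *training.PyTorchJobSpec) error {
	if !spec.EnableElastic {
		return nil
	}
	if spec.RDZVBackend == "" || spec.RdzvEndpoint == "" {
		return fmt.Errorf("rdzvBackend and rdzvEndpoint are required when enableElastic is true")
	}
	return nil
}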

@SimonCqk (Collaborator)

@wanziyu hi, thanks for your contribution! Before we can merge your PR into the master branch, you should sign off your commits first.

LastReplicas int32 `json:"lastReplicas,omitempty"`

// Continue represents whether the job needs to continue scaling.
Continue bool `json:"continue,omitempty"`
Collaborator

Will the user amend this continue field manually? If not, who will pause the scaling progress?

Author

The torchelastic controller will amend this continue field. When continue is set to false in the last control loop, the controller will pause the scaling progress.
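
A hedged sketch of what that check might look like inside the control loop; the shape of ElasticStatus here is inferred from the fragments in this PR:

// Hypothetical control-loop excerpt: if the previous loop set Continue
// to false, pause scaling instead of computing a new replica count.
if status, ok := jobStatus.ElasticStatus[rtype]; ok && !status.Continue {
	// Scaling is paused until the controller resets Continue.
	return nil
}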

@@ -162,11 +162,28 @@ func (jc *JobController) FilterServicesForReplicaType(services []*v1.Service, re
return result, nil
}

// calculateServiceSliceSize compares the max service index with the desired replicas and returns the larger size
func calculateServiceSliceSize(services []*v1.Service, replicas int) int {
Collaborator

Is it necessary? In the GetServiceSlices function, serviceSlices will be re-allocated when index > size.

Author

If calculateServiceSliceSize is not used, an "index out of range" error occurs when scaling in workers and services (i.e., updating worker replicas from 2 to 1).
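
A hedged sketch of the function under discussion, based on its doc comment above; the replica-index label key is an assumption:

// calculateServiceSliceSize (sketch): return whichever is larger, the
// desired replicas or the highest existing service index + 1, so that
// stale high-index services stay addressable while scaling in.
func calculateServiceSliceSize(services []*v1.Service, replicas int) int {
	size := replicas
	for _, svc := range services {
		// apiv1.ReplicaIndexLabel is assumed to hold the service's index.
		index, err := strconv.Atoi(svc.Labels[apiv1.ReplicaIndexLabel])
		if err == nil && index+1 > size {
			size = index + 1
		}
	}
	return size
}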

…-elastic

Signed-off-by: wanziyu <ziyuwan@zju.edu.cn>
Signed-off-by: wanziyu <ziyuwan@zju.edu.cn>
Signed-off-by: wanziyu <ziyuwan@zju.edu.cn>
@codecov-commenter

Codecov Report

Merging #267 (90832ba) into master (171c0d7) will increase coverage by 0.18%.
The diff coverage is 2.98%.

@@            Coverage Diff             @@
##           master     #267      +/-   ##
==========================================
+ Coverage   28.93%   29.12%   +0.18%     
==========================================
  Files          88       89       +1     
  Lines        5985     6260     +275     
==========================================
+ Hits         1732     1823      +91     
- Misses       4000     4174     +174     
- Partials      253      263      +10     
Flag Coverage Δ
unittests 29.12% <2.98%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
apis/training/v1alpha1/pytorchjob_defaults.go 17.85% <0.00%> (-1.38%) ⬇️
apis/training/v1alpha1/pytorchjob_types.go 100.00% <ø> (ø)
apis/training/v1alpha1/zz_generated.deepcopy.go 14.02% <0.00%> (-0.65%) ⬇️
controllers/pytorch/elastic_scale.go 34.04% <ø> (ø)
controllers/pytorch/pytorchjob_controller.go 0.52% <0.00%> (-0.09%) ⬇️
controllers/pytorch/util.go 0.00% <0.00%> (ø)
pkg/job_controller/job.go 24.92% <0.00%> (+0.23%) ⬆️
pkg/job_controller/service.go 0.00% <0.00%> (ø)
pkg/job_controller/util.go 20.83% <100.00%> (-1.39%) ⬇️
... and 11 more


ctrl "sigs.k8s.io/controller-runtime"
)

func SetupWithManager(mgr ctrl.Manager) error {
Collaborator

Could the torchelastic controller implementation be a sub-module of /pytorch? The directory structure would look cleaner.

jobStatus.ElasticStatus[rtype].ElasticCondition = apiv1.ElasticStop
}

func updateElasticStatusForContinueJob(pytorchJob *training.PyTorchJob, currentReplicas, newReplicas int32, rtype apiv1.ReplicaType) {
Collaborator

updateElasticStatusForContinueJob ... updateElasticStatusForMaxMetricJob have some code duplication that could be further abstracted.
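
For example, the shared bookkeeping could be factored into one helper along these lines; the CurrentReplicas field and the pointer-valued ElasticStatus map are assumptions, not code from this PR:

// Hypothetical shared helper for the updateElasticStatusFor* functions.
func updateElasticStatus(jobStatus *apiv1.JobStatus, rtype apiv1.ReplicaType,
	lastReplicas, newReplicas int32, cond apiv1.ElasticConditionType, cont bool) {
	s := jobStatus.ElasticStatus[rtype]
	s.LastReplicas = lastReplicas
	s.CurrentReplicas = newReplicas
	s.Continue = cont
	s.ElasticCondition = cond
}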

return true, nil
}

func (ts *TorchElasticController) restartWorkerInKruiseProtocol(job *trainingv1alpha1.PyTorchJob, pod *corev1.Pod, expectedWorldSize, expectedGeneration string) (completed bool, err error) {
Collaborator

Reuse restartWorkerInKruiseProtocol under the /pytorch package instead of duplicating it here.

Signed-off-by: wanziyu <ziyuwan@zju.edu.cn>