
feat: enable pytorch elastic training fashion based on torch elastic #267

Open
wants to merge 4 commits into master

Conversation


@wanziyu commented Aug 15, 2022

I. Describe what this PR does

This PR designs the elastic training APIs, adds a torch-elastic controller, and implements the elastic training control flow in both the torch-elastic controller and the pytorch controller. Currently, the scaling algorithm is based on the real-time batch training latency collected from running pod logs.

  • Elastic training APIs on the PyTorchJob spec (a consolidated sketch follows this list).
  • Elastic training control flow implemented in the torch-elastic controller.
  • A PyTorch elastic training job example.
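
For orientation, here is a consolidation of the new spec fields as they surface in the diff fragments quoted below; grouping them into one snippet is illustrative only, and the MaxReplicas doc comment is paraphrased:

// Elastic training fields added to the PyTorchJob spec (consolidated
// here for readability; see the diff fragments below for the source).

// EnableElastic decides whether torch elastic is enabled for the job.
EnableElastic bool `json:"enableElastic"`

// MaxReplicas is the upper bound when scaling out workers.
MaxReplicas *int32 `json:"maxReplicas,omitempty"`

// RDZVBackend and RdzvEndpoint configure the torch elastic rendezvous,
// e.g. a "c10d" backend with a "host:port" endpoint.
RDZVBackend  string `json:"rdzvBackend,omitempty"`
RdzvEndpoint string `json:"rdzvEndpoint"`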

II. Does this pull request fix one issue?

#251


// EnableElastic decides whether torch elastic is enabled for job.
// +optional
EnableElastic bool `json:"enableElastic"`
Collaborator

add omitempty json tag
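
i.e., presumably something like:

EnableElastic bool `json:"enableElastic,omitempty"`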

MaxReplicas *int32 `json:"maxReplicas,omitempty"`

RDZVBackend string `json:"rdzvBackend,omitempty"`
RdzvEndpoint string `json:"rdzvEndpoint"`
Collaborator

Is rdzvEndpoint required, and therefore must not be omitempty?

Author

Yes, RDZVBackend and RdzvEndpoint are both required when EnableElastic is true.
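
A minimal sketch of how that requirement could be enforced at validation time; validateElasticPolicy and the exact spec type are assumptions for illustration, not code from this PR:

// Hypothetical validation helper: when elastic training is enabled,
// both rendezvous fields must be set. Assumes the "fmt" import and
// the training API package alias used elsewhere in this PR.
func validateElasticPolicy(spec *training.PyTorchJobSpec) error {
	if !spec.EnableElastic {
		return nil
	}
	if spec.RDZVBackend == "" || spec.RdzvEndpoint == "" {
		return fmt.Errorf("rdzvBackend and rdzvEndpoint are required when enableElastic is true")
	}
	return nil
}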

@SimonCqk (Collaborator)

@wanziyu hi, thanks for your contribution! Before we can merge your PR into the master branch, you should sign off your commits first.

LastReplicas int32 `json:"lastReplicas,omitempty"`

// Continue represents whether the job needs to continue scaling.
Continue bool `json:"continue,omitempty"`
Collaborator

Will the user amend this continue field manually? If not, who will pause the scaling progress?

Author

The torchelastic controller will amend this continue field. When continue is set to false in the last control loop, the controller will pause the scaling progress.
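
A hedged sketch of what that check might look like inside the control loop; the shape of ElasticStatus here is inferred from the fragments in this PR:

// Hypothetical control-loop excerpt: if the previous loop set Continue
// to false, pause scaling instead of computing a new replica count.
if status, ok := jobStatus.ElasticStatus[rtype]; ok && !status.Continue {
	// Scaling is paused until the controller resets Continue.
	return nil
}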

@@ -162,11 +162,28 @@ func (jc *JobController) FilterServicesForReplicaType(services []*v1.Service, re
return result, nil
}

// calculateServiceSliceSize compares the max service index with the desired replicas and returns the larger size
func calculateServiceSliceSize(services []*v1.Service, replicas int) int {
Collaborator

Is it necessary? In the GetServiceSlices function, serviceSlices will be re-allocated when index > size.

Author

If calculateServiceSliceSize is not used, an "index out of range" error occurs when scaling in workers and services (i.e., updating worker replicas from 2 to 1).
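
A hedged sketch of the function under discussion, based on its doc comment above; the replica-index label key is an assumption:

// calculateServiceSliceSize (sketch): return whichever is larger, the
// desired replicas or the highest existing service index + 1, so that
// stale high-index services stay addressable while scaling in.
func calculateServiceSliceSize(services []*v1.Service, replicas int) int {
	size := replicas
	for _, svc := range services {
		// apiv1.ReplicaIndexLabel is assumed to hold the service's index.
		index, err := strconv.Atoi(svc.Labels[apiv1.ReplicaIndexLabel])
		if err == nil && index+1 > size {
			size = index + 1
		}
	}
	return size
}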

…-elastic

Signed-off-by: wanziyu <ziyuwan@zju.edu.cn>
Signed-off-by: wanziyu <ziyuwan@zju.edu.cn>
Signed-off-by: wanziyu <ziyuwan@zju.edu.cn>
@codecov-commenter

Codecov Report

Merging #267 (90832ba) into master (171c0d7) will increase coverage by 0.18%.
The diff coverage is 2.98%.

@@            Coverage Diff             @@
##           master     #267      +/-   ##
==========================================
+ Coverage   28.93%   29.12%   +0.18%     
==========================================
  Files          88       89       +1     
  Lines        5985     6260     +275     
==========================================
+ Hits         1732     1823      +91     
- Misses       4000     4174     +174     
- Partials      253      263      +10     
Flag Coverage Δ
unittests 29.12% <2.98%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
apis/training/v1alpha1/pytorchjob_defaults.go 17.85% <0.00%> (-1.38%) ⬇️
apis/training/v1alpha1/pytorchjob_types.go 100.00% <ø> (ø)
apis/training/v1alpha1/zz_generated.deepcopy.go 14.02% <0.00%> (-0.65%) ⬇️
controllers/pytorch/elastic_scale.go 34.04% <ø> (ø)
controllers/pytorch/pytorchjob_controller.go 0.52% <0.00%> (-0.09%) ⬇️
controllers/pytorch/util.go 0.00% <0.00%> (ø)
pkg/job_controller/job.go 24.92% <0.00%> (+0.23%) ⬆️
pkg/job_controller/service.go 0.00% <0.00%> (ø)
pkg/job_controller/util.go 20.83% <100.00%> (-1.39%) ⬇️
... and 11 more


ctrl "sigs.k8s.io/controller-runtime"
)

func SetupWithManager(mgr ctrl.Manager) error {
Collaborator

Could the torchelastic controller implementation be a sub-module of /pytorch? The directory structure would look cleaner.

jobStatus.ElasticStatus[rtype].ElasticCondition = apiv1.ElasticStop
}

func updateElasticStatusForContinueJob(pytorchJob *training.PyTorchJob, currentReplicas, newReplicas int32, rtype apiv1.ReplicaType) {
Collaborator

updateElasticStatusForContinueJob ... updateElasticStatusForMaxMetricJob have some code duplication that could be further abstracted.
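
For example, the shared bookkeeping could be factored into one helper along these lines; the CurrentReplicas field and the pointer-valued ElasticStatus map are assumptions, not code from this PR:

// Hypothetical shared helper for the updateElasticStatusFor* functions.
func updateElasticStatus(jobStatus *apiv1.JobStatus, rtype apiv1.ReplicaType,
	lastReplicas, newReplicas int32, cond apiv1.ElasticConditionType, cont bool) {
	s := jobStatus.ElasticStatus[rtype]
	s.LastReplicas = lastReplicas
	s.CurrentReplicas = newReplicas
	s.Continue = cont
	s.ElasticCondition = cond
}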

return true, nil
}

func (ts *TorchElasticController) restartWorkerInKruiseProtocol(job *trainingv1alpha1.PyTorchJob, pod *corev1.Pod, expectedWorldSize, expectedGeneration string) (completed bool, err error) {
Collaborator

Reuse restartWorkerInKruiseProtocol under the /pytorch package instead of duplicating it here.

Signed-off-by: wanziyu <ziyuwan@zju.edu.cn>