This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

Distribution across multi-gpu nodes #128

Closed
SeanNaren opened this issue Jan 25, 2019 · 7 comments

Comments

@SeanNaren

SeanNaren commented Jan 25, 2019

Thanks for the work on this! This is somewhat tied to #30, but I'm used to using DistributedDataParallel with a script similar to this one to get multi-GPU speed/performance beyond the DataParallel wrapper.

I've started using Kubeflow for single-GPU nodes, but I'm curious whether there is any way I could train across two separate 8-GPU nodes while using DistributedDataParallel locally on each 8-GPU node. Anything I can do to help include this?

@jwwandy
Contributor

jwwandy commented Jan 28, 2019

@SeanNaren Not sure if this solves your problem. Distributed training with DistributedDataParallel across multi-GPU nodes can be achieved by setting the MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables before calling init_process_group. When using the kubeflow pytorch-operator, the pod controller sets these environment variables for each pod: every pod gets a unique rank, the world size equals the total number of pods, and the master address and port point to the master pod. If you run multiple processes per pod, you have to convert the world size yourself (world_size = number of pods * number of processes per pod) and assign your ranks accordingly. This can also be done with https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py
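
A minimal worker-side sketch of what that looks like (the backend, model, and sizes below are placeholders, not from this thread): since the operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE, init_process_group can pick them up via the env:// init method.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# The operator-injected MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are read
# automatically by the env:// init method.
dist.init_process_group(backend="nccl", init_method="env://")

# With one process per pod, pin this process to a single GPU; with several
# processes per pod, use the local rank passed by the launch utility instead.
device = torch.device("cuda", 0)
model = torch.nn.Linear(128, 128).to(device)  # placeholder model
model = DistributedDataParallel(model, device_ids=[device.index])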

@SeanNaren
Author

@jwwandy For now this seems like a good enough solution, and it is working with a modified version of the launch script! Once I've migrated the changes to a public branch I'll close this ticket with a link to the fixes.

@hyperparameters

@jwwandy How do you set the IP address of the master pod (the --master_addr argument)? When the job is created, the master and worker pods come up at the same time, so how do we know the master pod's IP beforehand?
All the other environment variables can be set; I need some help with this one.
Also, do we have to launch our script with python -m torch.distributed.launch?

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/question 0.75


@jwwandy
Contributor

jwwandy commented Aug 14, 2020

@hyperparameters It has been a while since I last worked on this. In my experience, pytorch-operator sets the MASTER_ADDR environment variable for your worker pods, so it should work fine if you collect the environment variables and pass them into your PyTorch script. However, Kubeflow has released numerous new versions in the past 18 months, so the information I provided might be outdated. Try it and let me know if there's anything I can help with.

The code where the operator sets up the env vars is here:

for i := range podTemplateSpec.Spec.Containers {
	if len(podTemplateSpec.Spec.Containers[i].Env) == 0 {
		podTemplateSpec.Spec.Containers[i].Env = make([]v1.EnvVar, 0)
	}
	podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, v1.EnvVar{
		Name:  "MASTER_PORT",
		Value: strconv.Itoa(int(masterPort)),
	})
	podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, v1.EnvVar{
		Name:  "MASTER_ADDR",
		Value: masterAddr,
	})
	podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, v1.EnvVar{
		Name:  "WORLD_SIZE",
		Value: strconv.Itoa(int(totalReplicas)),
	})
	podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, v1.EnvVar{
		Name:  "RANK",
		Value: strconv.Itoa(rank),
	})
	podTemplateSpec.Spec.Containers[i].Env = append(podTemplateSpec.Spec.Containers[i].Env, v1.EnvVar{
		Name:  "PYTHONUNBUFFERED",
		Value: "0",
	})
}
return nil
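
If you run several processes per pod, one possible per-pod entry point (a sketch only; train.py and the GPU count are assumptions, not part of the operator) is to wrap the training script with torch.distributed.launch and feed it the operator-provided variables, letting the launch script derive the true world size as nnodes * nproc_per_node:

import os
import subprocess
import sys

# Hypothetical wrapper: translate the operator's per-pod RANK and WORLD_SIZE
# into the node-level arguments that torch.distributed.launch expects.
nproc_per_node = 8  # assumption: 8 GPUs per pod

cmd = [
    sys.executable, "-m", "torch.distributed.launch",
    "--nproc_per_node", str(nproc_per_node),
    "--nnodes", os.environ["WORLD_SIZE"],    # number of pods
    "--node_rank", os.environ["RANK"],       # this pod's rank
    "--master_addr", os.environ["MASTER_ADDR"],
    "--master_port", os.environ["MASTER_PORT"],
    "train.py",                              # assumed training script
]
subprocess.check_call(cmd)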

@hyperparameters

@jwwandy Thank you for the quick response. Yes, it worked without the need to explicitly set the variables.
Also, if you could point me to any good examples of how to set up multi-node training with PyTorch, that would be really helpful.

@calvin11ung

@jwwandy Do you have sample code showing how you achieved this with the pytorch-operator? I'm currently struggling with this problem.
