
How to implement Multi-GPU for training #10

Open
nku-zhichengzhang opened this issue May 25, 2021 · 3 comments
Comments

@nku-zhichengzhang

I've tried to use data parallelism for multi-GPU training, but it doesn't work: the model only runs on my first GPU.
When I replace the model with a plain ResNet that has no extra tensor operations, it parallelizes normally (see the sketch below), so the extra tensor operations may be what breaks data parallelism.
Could you tell me how to run the model in parallel with DataParallel or DistributedDataParallel?
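For reference, the ResNet sanity check mentioned above looked roughly like this (a minimal sketch; torchvision's resnet18 and the dummy batch stand in for my actual setup):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Plain ResNet with no custom tensor ops: nn.DataParallel spreads this
# across all visible GPUs as expected.
model = nn.DataParallel(resnet18()).cuda()
images = torch.randn(8, 3, 224, 224).cuda()  # dummy batch
out = model(images)                          # every GPU shows activity here
```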

@upxinxin

Hello, sorry to disturb you. I've run into the same problem. Have you solved it yet?

@nku-zhichengzhang
Author

> Hello, sorry to disturb you. I've run into the same problem. Have you solved it yet?

nope

@lyxok1
Owner

lyxok1 commented Jun 22, 2021

@zzc000930 Hi, if you use nn.DataParallel for multi-GPU training, please make sure that (1) the batch size is divisible by the number of GPUs, (2) the second dimension of the data (the number of objects) is the same across the batch, which can be achieved by randomly sampling objects and zero-padding the missing ones, and (3) the variable num_objects is wrapped in a tensor before being passed to the model. An example of nn.DataParallel usage can be found in our basic baseline branch; hope this helps.
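Putting the three conditions together, a minimal runnable sketch might look like the following (ToyNet, the tensor shapes, and the argument names are placeholders, not this repository's actual model or data pipeline):

```python
import torch
import torch.nn as nn

NUM_GPUS = max(torch.cuda.device_count(), 1)
BATCH_SIZE = 4 * NUM_GPUS      # (1) batch size divisible by the number of GPUs
MAX_OBJECTS = 5                # (2) fixed object dimension shared by the whole batch

class ToyNet(nn.Module):
    """Toy stand-in for the real network, only to show the calling convention."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)

    def forward(self, frames, masks, num_objects):
        # each replica receives its own slice of frames, masks and num_objects
        return self.conv(frames) * masks[:, :1]

def pad_objects(masks, max_objects=MAX_OBJECTS):
    """Zero-pad (or truncate) the object dimension so every sample has the same N;
    random sampling could be used instead of truncation when N > max_objects."""
    b, n = masks.shape[:2]
    if n < max_objects:
        pad = torch.zeros(b, max_objects - n, *masks.shape[2:],
                          dtype=masks.dtype, device=masks.device)
        masks = torch.cat([masks, pad], dim=1)
    return masks[:, :max_objects]

model = nn.DataParallel(ToyNet()).cuda()

frames = torch.randn(BATCH_SIZE, 3, 64, 64).cuda()   # dummy frames
masks = torch.rand(BATCH_SIZE, 3, 64, 64).cuda()     # 3 objects per sample: (B, N, H, W)
masks = pad_objects(masks)                           # -> (B, MAX_OBJECTS, H, W)
# (3) wrap the per-sample object counts in a tensor so nn.DataParallel
# scatters them along the batch dimension like any other input
num_objects = torch.full((BATCH_SIZE,), 3).cuda()
out = model(frames, masks, num_objects)              # work is split across the GPUs
```

The point of (3) is that nn.DataParallel splits tensors along the batch dimension but simply copies plain Python scalars to every replica, so keeping num_objects as a (B,)-shaped tensor lets each replica see the counts for its own samples.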

As for distributed training, we are working on it to make multi-GPU training more flexible; a commit will follow once it is finished.
