
How to implement Multi-GPU for training #10

Open
nku-zhichengzhang opened this issue May 25, 2021 · 3 comments
Comments

@nku-zhichengzhang

I've tried to use data parallelism for multi-GPU training, but it doesn't work: the model only runs on my first GPU.
When I replace the model with a plain ResNet that has no extra tensor operations, it parallelizes normally (see the sketch below), so the extra tensor operations may be what breaks data parallelism.
Could you tell me how to run the model in parallel with DataParallel or DistributedDataParallel?
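For reference, the ResNet sanity check mentioned above looked roughly like this (a minimal sketch; torchvision's resnet18 and the dummy batch stand in for my actual setup):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Plain ResNet with no custom tensor ops: nn.DataParallel spreads this
# across all visible GPUs as expected.
model = nn.DataParallel(resnet18()).cuda()
images = torch.randn(8, 3, 224, 224).cuda()  # dummy batch
out = model(images)                          # every GPU shows activity here
```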

@upxinxin

Hello, sorry to disturb you. I've run into the same problem. Have you solved it yet?

@nku-zhichengzhang
Author

> Hello, sorry to disturb you. I've run into the same problem. Have you solved it yet?

nope

@lyxok1
Owner

lyxok1 commented Jun 22, 2021

@zzc000930 Hi, if you use nn.DataParallel for multi-GPU training, please make sure that (1) the batch size is divisible by the number of GPUs, (2) the second dimension of the data (the number of objects) is the same across the batch, which can be achieved by randomly sampling objects and zero-padding the missing ones, and (3) the variable num_objects is wrapped in a tensor before being passed to the model. An example of nn.DataParallel usage can be found in our basic baseline branch; hope this helps.
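Putting the three conditions together, a minimal runnable sketch might look like the following (ToyNet, the tensor shapes, and the argument names are placeholders, not this repository's actual model or data pipeline):

```python
import torch
import torch.nn as nn

NUM_GPUS = max(torch.cuda.device_count(), 1)
BATCH_SIZE = 4 * NUM_GPUS      # (1) batch size divisible by the number of GPUs
MAX_OBJECTS = 5                # (2) fixed object dimension shared by the whole batch

class ToyNet(nn.Module):
    """Toy stand-in for the real network, only to show the calling convention."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)

    def forward(self, frames, masks, num_objects):
        # each replica receives its own slice of frames, masks and num_objects
        return self.conv(frames) * masks[:, :1]

def pad_objects(masks, max_objects=MAX_OBJECTS):
    """Zero-pad (or truncate) the object dimension so every sample has the same N;
    random sampling could be used instead of truncation when N > max_objects."""
    b, n = masks.shape[:2]
    if n < max_objects:
        pad = torch.zeros(b, max_objects - n, *masks.shape[2:],
                          dtype=masks.dtype, device=masks.device)
        masks = torch.cat([masks, pad], dim=1)
    return masks[:, :max_objects]

model = nn.DataParallel(ToyNet()).cuda()

frames = torch.randn(BATCH_SIZE, 3, 64, 64).cuda()   # dummy frames
masks = torch.rand(BATCH_SIZE, 3, 64, 64).cuda()     # 3 objects per sample: (B, N, H, W)
masks = pad_objects(masks)                           # -> (B, MAX_OBJECTS, H, W)
# (3) wrap the per-sample object counts in a tensor so nn.DataParallel
# scatters them along the batch dimension like any other input
num_objects = torch.full((BATCH_SIZE,), 3).cuda()
out = model(frames, masks, num_objects)              # work is split across the GPUs
```

The point of (3) is that nn.DataParallel splits tensors along the batch dimension but simply copies plain Python scalars to every replica, so keeping num_objects as a (B,)-shaped tensor lets each replica see the counts for its own samples.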

As for distributed training, we are working on it to make multi-GPU training more flexible; a commit will follow once it is finished.
