When I use multiple GPUs to train my model, RuntimeError: Caught RuntimeError in replica 0 on device 0 #3

zyegao · 2021-12-22T07:36:37Z

When I tested DP mode, the error occurred.

Following https://di-engine-docs.readthedocs.io/zh_CN/latest/best_practice/multi_gpu_example.html, I changed these files:
(1) gobigger_no_spatial_config.py

(2) gobigger_vsbot_baseline_simple_main.py

jayyoung0802 · 2021-12-28T13:52:22Z

Regarding this issue, we have summarized the following points:
1. [3, 3, 16] is [evaluator_num, players_per_team, 4*4 (direction * action)].
2. In the policy, we merge the first two dimensions, so the shape becomes [9, 16].
3. When using DP, the 9 rows are split across the GPUs into chunks of 4 and 5 rows, giving [4, 16] and [5, 16].
4. Neither [4, 16] nor [5, 16] can be reshaped to [batch, 3, 16], because 4 and 5 are not divisible by 3, so an error is reported.
5. The solution is to use DP only on the encoder, not on the whole GoBigger model.
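
To make the shape arithmetic in points 2 to 4 concrete, here is a minimal pure-Python sketch. It assumes two GPUs and that DP splits the batch dimension into ceil-sized chunks (the way `torch.chunk` does); the helper names `split_batch` and `can_regroup` are hypothetical and are not part of DI-engine or GoBigger.

```python
import math


def split_batch(n_rows, n_devices):
    """Return the chunk sizes when n_rows are split into ceil-sized chunks.

    Assumption: this mirrors how DP scatters a batch along dimension 0.
    """
    chunk = math.ceil(n_rows / n_devices)
    sizes = []
    remaining = n_rows
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def can_regroup(n_rows, players_per_team):
    """A [n_rows, 16] chunk reshapes to [batch, players_per_team, 16]
    only if n_rows is a multiple of players_per_team."""
    return n_rows % players_per_team == 0


evaluator_num, players_per_team, action_dim = 3, 3, 16

# Point 2: the policy merges the first two dimensions, [3, 3, 16] -> [9, 16].
merged_rows = evaluator_num * players_per_team

# Point 3: DP splits the 9 rows over 2 GPUs.
chunk_sizes = split_batch(merged_rows, n_devices=2)

# Point 4: neither chunk is a multiple of players_per_team.
regroupable = [can_regroup(rows, players_per_team) for rows in chunk_sizes]

print(chunk_sizes, regroupable)
```

The same check suggests why the full, unsplit batch works: 9 rows regroup cleanly into 3 teams of 3, while any split that cuts through a team does not.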

zyegao mentioned this issue Dec 23, 2021

When I use multiple GPUs to train my model, RuntimeError: Caught RuntimeError in replica 0 on device 0 opendilab/DI-engine#162

Closed

PaParaZz1 assigned jayyoung0802 Dec 23, 2021

jayyoung0802 closed this as completed Dec 29, 2021
