🐛 Describe the bug
Running mpirun to launch distributed training on 2 nodes (2x8 GPUs) hangs in the colossalai.launch_from_openmpi() function. All 16 processes are visible via the top command on the two nodes.
Launch command:

```
mpirun --allow-run-as-root -np 16 -hostfile hosts python train.py --config configs/config.py --host 10.80.210.83 --port 29500
```
The hosts file contains the following content:

```
10.80.210.83 slots=8
10.80.209.79 slots=8
```
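To narrow down where the hang occurs, it may help to first confirm that all 16 ranks actually receive the OpenMPI environment variables (`OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE`, `OMPI_COMM_WORLD_LOCAL_RANK`) that `launch_from_openmpi` reads before the distributed rendezvous. The sketch below is a hypothetical standalone diagnostic, not part of ColossalAI or my training script:

```python
import os
import socket

def read_ompi_env():
    """Read the per-process rank variables that OpenMPI exports
    and that colossalai.launch_from_openmpi relies on."""
    required = (
        "OMPI_COMM_WORLD_RANK",
        "OMPI_COMM_WORLD_SIZE",
        "OMPI_COMM_WORLD_LOCAL_RANK",
    )
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise RuntimeError(f"not launched via OpenMPI, missing: {missing}")
    return {
        "rank": int(os.environ["OMPI_COMM_WORLD_RANK"]),
        "world_size": int(os.environ["OMPI_COMM_WORLD_SIZE"]),
        "local_rank": int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]),
    }

# Only print when actually running under mpirun, so the module
# stays importable outside an MPI launch.
if "OMPI_COMM_WORLD_RANK" in os.environ:
    info = read_ompi_env()
    print(f"{socket.gethostname()}: rank {info['rank']}/{info['world_size']}, "
          f"local_rank {info['local_rank']}")
```

Running this with the same `mpirun` command in place of `train.py` should print one line per rank from both nodes; if some ranks are missing, the problem is in the MPI launch rather than in the NCCL/Gloo rendezvous on port 29500.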
Environment
No response