Horovod
Bandicoot edited this page Mar 1, 2024 · 3 revisions
# install prerequisites
sudo apt install -y g++
# build and install Open MPI from source
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.0.tar.gz
tar -xf openmpi-4.1.0.tar.gz
cd openmpi-4.1.0
./configure --prefix=/usr/local
# <...lots of output...>
make all && sudo make install
- Run on a single machine with 4 GPUs:
$ horovodrun -np 4 python train_dalle.py --image_text_folder=/path/to/your/dataset --distributed_backend horovod
- Run on 4 machines with 4 GPUs each:
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train_dalle.py --image_text_folder=/path/to/your/dataset --distributed_backend horovod
- Horovod autotuning:
$ mpirun -x HOROVOD_AUTOTUNE=1 -x HOROVOD_AUTOTUNE_LOG=/tmp/autotune_log.csv ... train_dalle.py --image_text_folder=/path/to/your/dataset --distributed_backend horovod
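In the multi-machine command above, the value passed to `-np` (16) equals the sum of the per-host slot counts in the `-H` list (4 + 4 + 4 + 4). The sketch below derives that number from the host string so the two cannot drift apart. The helper name `total_slots` is hypothetical and not part of Horovod, and it assumes every entry in the list has the form `host:slots`.

```shell
# Hypothetical helper (not part of Horovod): sum the slot counts in a
# comma-separated host list of the form host1:n1,host2:n2,...
# Assumes every entry carries an explicit ":slots" suffix.
total_slots() (
  IFS=','
  total=0
  for entry in $1; do
    # ${entry##*:} strips everything up to and including the last colon.
    total=$(( total + ${entry##*:} ))
  done
  echo "$total"
)

HOSTS="server1:4,server2:4,server3:4,server4:4"
# Print the resulting command rather than running it.
echo "horovodrun -np $(total_slots "$HOSTS") -H $HOSTS python train_dalle.py --image_text_folder=/path/to/your/dataset --distributed_backend horovod"
```

The function body runs in a subshell, so the changed `IFS` does not leak into the calling shell.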
If you are running inside a Docker container, check whether you have a docker0 network interface. If you do, you will need to follow specific instructions to ensure that this interface is ignored. See https://horovod.readthedocs.io/en/stable/mpi.html for further details.
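As a minimal sketch of the docker0 check, the snippet below looks for the interface and, if it is present, prints the extra `mpirun` options that exclude it. The two options shown (`-mca btl_tcp_if_exclude lo,docker0` and `-x NCCL_SOCKET_IFNAME=^lo,docker0`) are the ones we understand the linked Horovod page to recommend; please verify them there for your Horovod and Open MPI versions. The helper name `docker0_exclude_flags` and its optional directory argument are hypothetical, and the check assumes a Linux host where interfaces are listed under `/sys/class/net`.

```shell
# Hypothetical helper: print extra mpirun options when a docker0 interface exists.
# The optional first argument (directory that lists network interfaces) is an
# assumption added for testability; it defaults to Linux's /sys/class/net.
docker0_exclude_flags() (
  net_dir="${1:-/sys/class/net}"
  if [ -e "$net_dir/docker0" ]; then
    # Exclude loopback and docker0 from both Open MPI's TCP transport and NCCL.
    echo "-mca btl_tcp_if_exclude lo,docker0 -x NCCL_SOCKET_IFNAME=^lo,docker0"
  fi
)

# Prints an empty flag list when no docker0 interface is found.
echo "extra mpirun flags: $(docker0_exclude_flags)"
```

The printed options would be added to the `mpirun` command line alongside any other `-x` settings such as `HOROVOD_AUTOTUNE=1`.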