Bandicoot edited this page Mar 1, 2024 · 3 revisions

Install MPI

# install prerequisites (assumes a Debian/Ubuntu system with apt)
sudo apt-get install -y g++ make
# download and unpack Open MPI 4.1.0
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.0.tar.gz
tar -xf openmpi-4.1.0.tar.gz
cd openmpi-4.1.0
./configure --prefix=/usr/local
# <...lots of output...>
make all && sudo make install
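After the build finishes, it can help to confirm that the freshly installed binaries are the ones your shell picks up. The following is a minimal sketch, not part of the original instructions; `check_cmd` is a hypothetical helper name, and it assumes `/usr/local/bin` is on your `PATH` (the `--prefix=/usr/local` used above installs the binaries there).

```shell
# Report whether a command is on PATH; print its location when found.
# check_cmd is a hypothetical helper, not part of Open MPI or Horovod.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# If mpirun resolves, print its version; it should report Open MPI 4.1.0.
check_cmd mpirun && mpirun --version
# The compiler wrapper is installed alongside mpirun.
check_cmd mpicc || true
```

If `mpirun` resolves to a different location than `/usr/local/bin`, another MPI installation is probably shadowing this one.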

Usage

  1. Run on a single machine with 4 GPUs:
$ horovodrun -np 4 python train_dalle.py --image_text_folder=/path/to/your/dataset --distributed_backend horovod
  2. Run on 4 machines with 4 GPUs each:
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train_dalle.py --image_text_folder=/path/to/your/dataset --distributed_backend horovod
  3. Run with Horovod autotuning enabled:
$ mpirun -x HOROVOD_AUTOTUNE=1 -x HOROVOD_AUTOTUNE_LOG=/tmp/autotune_log.csv ... train_dalle.py --image_text_folder=/path/to/your/dataset --distributed_backend horovod
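In the multi-machine example, `-np` is the total number of processes (machines times GPUs per machine) and `-H` lists each host with its slot count. As a rough sketch, the two arguments can be derived from a host list like this; `horovod_args` is a hypothetical helper written for illustration, and it assumes every host has the same number of GPUs.

```shell
# Build the -np and -H arguments from a per-host GPU count and a host list.
# Usage: horovod_args GPUS_PER_HOST HOST [HOST...]
# Hypothetical helper; assumes all hosts have the same GPU count.
horovod_args() {
    gpus=$1
    shift
    hosts=""
    for h in "$@"; do
        # Append "host:slots", comma-separated after the first entry.
        hosts="${hosts:+$hosts,}$h:$gpus"
    done
    echo "-np $(( gpus * $# )) -H $hosts"
}

horovod_args 4 server1 server2 server3 server4
# prints: -np 16 -H server1:4,server2:4,server3:4,server4:4
```

The output can be pasted after `horovodrun`, followed by the same `python train_dalle.py ...` command shown above.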

Docker (Doesn't affect vast.ai)

If you are running inside a Docker container, check whether the container has a docker0 network interface. If it does, you need to tell MPI to ignore that interface; otherwise, multi-node runs may fail or hang while trying to communicate over it. See https://horovod.readthedocs.io/en/stable/mpi.html for the exact instructions.
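One way to perform this check on Linux is to look for the interface under `/sys/class/net`. The snippet below is only a sketch: `iface_exists` is a hypothetical helper, and the `mpirun` flags it prints (`btl_tcp_if_exclude` for Open MPI, `NCCL_SOCKET_IFNAME` for NCCL) should be verified against the linked Horovod page before you rely on them.

```shell
# Return 0 if the named network interface exists.
# The optional second argument is the sysfs directory to look in;
# it defaults to /sys/class/net (Linux only). Hypothetical helper.
iface_exists() {
    [ -e "${2:-/sys/class/net}/$1" ]
}

if iface_exists docker0; then
    echo "docker0 found: exclude it, e.g."
    echo "  mpirun -mca btl_tcp_if_exclude lo,docker0 -x NCCL_SOCKET_IFNAME=^lo,docker0 ..."
else
    echo "no docker0 interface found"
fi
```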