This repository has been archived by the owner on Jan 7, 2023. It is now read-only.
When I run Intel Caffe on multiple nodes (four nodes) with MLSL on AMD CPUs, training stops at iteration 0. On a single node it runs fine.
When I run htop on every node:
My run command is: ./scripts/run_intelcaffe.sh --hostfile /opt/caffe/mpd.hosts --network tcp --netmask enp3s0f0 --caffe_bin /opt/caffe/build/tools/caffe --solver /opt/caffe/models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt
I think something is wrong with MLSL. My MLSL version is
I suspect MLSL because training works when I run with my own OpenMPI.
Hi @Tron-x, could you please specify how you launch IntelCaffe over OpenMPI?
As far as I know IntelCaffe uses MLSL only for multi-node communications.
MLSL uses Intel MPI under the hood, but it can be rebuilt with OpenMPI support by specifying MPIRT = openmpi in the MLSL Makefile.
Hi @mshiryaev, when I use OpenMPI, I launch IntelCaffe with a command such as:
I use five nodes. Every node launches 8 processes with OpenMP threads set to 8, and each node has 64 cores.
Hi @Tron-x
Besides @mshiryaev's suggestion to build MLSL with OpenMPI, could you please also try setting the environment variable "I_MPI_HYDRA_TOPOLIB=hwloc" to check whether this helps with the out-of-the-box MLSL/Intel MPI setup?
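For reference, a minimal sketch of how that retry might look, reusing the launch command from the issue description. It assumes that exporting the variable in the launching shell is enough for Intel MPI's Hydra launcher to pick it up, and that the script is run from the directory that contains scripts/. Neither assumption is confirmed in this thread.

```shell
# Sketch only: set the suggested topology library, then rerun the original launch.
# Assumption: the variable is read from the environment of the launching shell.
export I_MPI_HYDRA_TOPOLIB=hwloc

# Launcher path taken from the issue description (relative to the current directory).
LAUNCHER=./scripts/run_intelcaffe.sh

if [ -x "$LAUNCHER" ]; then
  # Same arguments as in the original report.
  "$LAUNCHER" \
    --hostfile /opt/caffe/mpd.hosts \
    --network tcp \
    --netmask enp3s0f0 \
    --caffe_bin /opt/caffe/build/tools/caffe \
    --solver /opt/caffe/models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt
else
  # Nothing is launched if the script is not found at the expected path.
  echo "launcher not found: $LAUNCHER" >&2
fi
```

If training still stops at iteration 0 with this setting, that would point back toward the other suggestion of rebuilding MLSL against OpenMPI.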