This repository has been archived by the owner on Jan 7, 2023. It is now read-only.

Running Intel Caffe multi-node with MLSL on AMD CPUs stops at iteration 0 #19

Open
Tron-x opened this issue Sep 6, 2018 · 3 comments

Comments

@Tron-x

Tron-x commented Sep 6, 2018

When I run Intel Caffe on multiple nodes (four nodes) with MLSL on AMD CPUs, something goes wrong: training stops at iteration 0. Running on a single node works fine.
image
This is what htop shows on every node:
image
My run command is: ./scripts/run_intelcaffe.sh --hostfile /opt/caffe/mpd.hosts --network tcp --netmask enp3s0f0 --caffe_bin /opt/caffe/build/tools/caffe --solver /opt/caffe/models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt

I think something is wrong with MLSL. My MLSL version is:
image
I suspect MLSL because everything works when I run with my own OpenMPI.

@mshiryaev
Contributor

Hi @Tron-x, could you please specify how you launch IntelCaffe over OpenMPI?
As far as I know, IntelCaffe uses only MLSL for multi-node communication.
MLSL uses Intel MPI under the hood but can be rebuilt with OpenMPI support: specify MPIRT = openmpi in the MLSL Makefile.

@Tron-x
Author

Tron-x commented Sep 6, 2018

Hi @mshiryaev, when I use OpenMPI, I launch IntelCaffe with a command such as:
image
I use five nodes. Every node launches 8 processes, the OpenMP thread count is set to 8, and each node has 64 cores.
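
The launch command itself was posted as a screenshot, so the following is only a rough sketch of what a layout like the one described might look like with Open MPI's `mpirun`. It assumes the OpenMP setting of 8 applies per process (8 processes x 8 threads would then fill the 64 cores of a node). The host file and solver paths are hypothetical placeholders, and the script only prints the command instead of running it.

```shell
#!/bin/sh
# Layout taken from the description: 5 nodes, 8 processes per node, 8 OpenMP threads.
NODES=5
RANKS_PER_NODE=8
THREADS_PER_RANK=8   # assumption: the OpenMP thread count is per process

TOTAL_RANKS=$((NODES * RANKS_PER_NODE))
CORES_USED_PER_NODE=$((RANKS_PER_NODE * THREADS_PER_RANK))

# Hypothetical placeholders, not paths from the original post.
HOSTFILE="hosts.txt"
SOLVER="solver.prototxt"

# --map-by ppr:N:node places N ranks on each node; -x exports a variable to all ranks.
LAUNCH_CMD="mpirun -np ${TOTAL_RANKS} --hostfile ${HOSTFILE} --map-by ppr:${RANKS_PER_NODE}:node -x OMP_NUM_THREADS=${THREADS_PER_RANK} /opt/caffe/build/tools/caffe train --solver=${SOLVER}"

# Dry run: show the derived numbers and the command without launching anything.
echo "total ranks:         ${TOTAL_RANKS}"
echo "cores used per node: ${CORES_USED_PER_NODE}"
echo "${LAUNCH_CMD}"
```

Printing the command first makes it easy to check that ranks times threads does not exceed the cores available on a node before starting a real run.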

@SmorkalovME

Hi @Tron-x
In addition to @mshiryaev's suggestion to build MLSL with OpenMPI, could you please also try setting the environment variable "I_MPI_HYDRA_TOPOLIB=hwloc" to check whether this helps with the out-of-the-box MLSL/Intel MPI setup?
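
As a minimal sketch of that suggestion, the variable can be exported in the shell that starts the job, before the same launch script from the original report is called. Whether `run_intelcaffe.sh` passes the environment through to the Intel MPI launcher unchanged is an assumption here, so the snippet only prints what would be run.

```shell
#!/bin/sh
# Ask Intel MPI's Hydra process manager to use hwloc for topology detection.
export I_MPI_HYDRA_TOPOLIB=hwloc

# Same launch command as in the original report (multi-node AlexNet, TCP over enp3s0f0).
LAUNCH_CMD="./scripts/run_intelcaffe.sh --hostfile /opt/caffe/mpd.hosts --network tcp --netmask enp3s0f0 --caffe_bin /opt/caffe/build/tools/caffe --solver /opt/caffe/models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt"

# Dry run: confirm the setting and show the command; execute the command manually to test.
echo "I_MPI_HYDRA_TOPOLIB=${I_MPI_HYDRA_TOPOLIB}"
echo "${LAUNCH_CMD}"
```

If training still stops at iteration 0 with this setting, the rebuild of MLSL against OpenMPI remains the other option mentioned in this thread.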
