This repository has been archived by the owner on Jan 7, 2023. It is now read-only.

mlsl_test with MLSL_NUM_SERVERS>0 error #14

Closed
zj88 opened this issue Jun 3, 2018 · 25 comments

zj88 commented Jun 3, 2018

Hi,

I'm unable to run the mlsl_test app with MLSL_NUM_SERVERS=1. When MLSL_NUM_SERVERS=0, everything works fine. I'm using PBSPro to submit the job to two KNL nodes. This is the error I get:

/opt/pbs/default/bin/pbs_tmrsh: host "r1i4n33" is not a node in job <27928.lic01>
=>> PBS: job killed: walltime 79 exceeded limit 60
[mpiexec@r1i4n32] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@r1i4n32] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@r1i4n32] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@r1i4n32] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@r1i4n32] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@r1i4n32] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

I'm using the Intel MPI bundled with MLSL; 'which mpirun' returns:
<mlsl root path>/intel64/bin/mpirun
and 'mpirun -version':
Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)
Copyright (C) 2003-2017, Intel Corporation. All rights reserved.

There may be an issue with MPI and PBS paths, but then it would probably not work for MLSL_NUM_SERVERS=0 either. Similar reports:
https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/743142
https://software.intel.com/pt-br/forums/intel-clusters-and-hpc-technology/topic/713369
env | grep I_MPI:
I_MPI_HYDRA_DEBUG=1
I_MPI_DEBUG_OUTPUT=debug_output-27928.lic01.txt
I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/default/bin/pbs_tmrsh
I_MPI_DEBUG=5
I_MPI_HYDRA_BOOTSTRAP=rsh
I_MPI_ROOT=<mlsl root path>

I am not sure how I should request resources from PBS when using MLSL parameter servers, but I tried many configurations, all resulting in the above error:
#PBS -l select=2:ncpus=64:mpiprocs=2:ompthreads=63
mpirun -n 4 -ppn 2 ./mlsl_test 1

#PBS -l select=2:ncpus=64:mpiprocs=1:ompthreads=63
mpirun -n 2 -ppn 1 ./mlsl_test 1

I also tried with:
export MLSL_SERVER_AFFINITY=63
export MLSL_SERVER_CREATION_TYPE=1 (also 0)

mpirun -n 2 -ppn 1 hostname
outputs correctly:
r1i4n32
r1i4n33

MLSL used is l_mlsl_2017.1.016.

Do you have any ideas what may be wrong here?
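For reference, the pieces above can be combined into one submission script. This is only a sketch of the reported setup: `<mlsl root path>` stays a placeholder as in the report, and sourcing `mlslvars.sh` is my assumption about how the MLSL/MPI environment was loaded.

```shell
#!/bin/bash
#PBS -l select=2:ncpus=64:mpiprocs=2:ompthreads=63
#PBS -l walltime=00:01:00

# Assumption: the MLSL env script is sourced to pick up the bundled Intel MPI;
# <mlsl root path> remains a placeholder, as in the original report.
source <mlsl root path>/intel64/bin/mlslvars.sh

export MLSL_NUM_SERVERS=1
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/default/bin/pbs_tmrsh
export I_MPI_DEBUG=5

cd $PBS_O_WORKDIR
mpirun -n 4 -ppn 2 ./mlsl_test 1
```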

@SmorkalovME

Hi Zbigniew,

Could you please check if you observe the same issue when setting "I_MPI_HYDRA_BOOTSTRAP=ssh" variable instead of using pbs/rsh?

Thanks,
Mikhail


zj88 commented Jun 4, 2018

Hi Mikhail,

Thanks for your reply. When I set "I_MPI_HYDRA_BOOTSTRAP=ssh" I get:

usage: /opt/pbs/default/bin/pbs_tmrsh [-n][-l username] host [-n][-l username] command
/opt/pbs/default/bin/pbs_tmrsh --version
=>> PBS: job killed: walltime 78 exceeded limit 60
[mpiexec@r1i4n32] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@r1i4n32] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@r1i4n32] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@r1i4n32] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@r1i4n32] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@r1i4n32] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

I can track some commands with pbs_tmrsh in the output:

Proxy launch args: <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id

[mpiexec@r1i4n32] Launch arguments: <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id 0

[mpiexec@r1i4n32] Launch arguments: /opt/pbs/default/bin/pbs_tmrsh -x -q r1i4n33 <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id 1

@SmorkalovME

Thanks Zbigniew. Would you please collect the output with the I_MPI_HYDRA_DEBUG=1 environment variable set and your initial launch approach (I_MPI_HYDRA_BOOTSTRAP=rsh)? Please note that this debug output may contain all env variables set in your session, so if you have something private there, please cut it off.


zj88 commented Jun 5, 2018

I replaced some info with placeholders <...>. Hope it's readable.

mlsl_debug.txt

@SmorkalovME

Thanks - it does help! Could you please try setting "MLSL_HOSTNAME_TYPE=1" in addition to what you already have in your env? If this doesn't help please recollect the debug output w/ I_MPI_HYDRA_DEBUG=1 and MLSL_HOSTNAME_TYPE=1.


zj88 commented Jun 6, 2018

Sure, here you are:
mlsl_debug.txt

At least we've got a different error now, and it's no longer timing out:
Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x73f5c0)
PMPI_Ibcast(989).: Invalid communicator
Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x74a380)
PMPI_Ibcast(989).: Invalid communicator
[proxy:1:1@r1i4n33] HYD_pmcd_pmip_control_cmd_cb (../../pm/pmiserv/pmip_cb.c:3481): assert (!closed) failed
[proxy:1:1@r1i4n33] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[proxy:1:1@r1i4n33] main (../../pm/pmiserv/pmip.c:558): demux engine error waiting for event

By the way, I had this error before, when setting "I_MPI_HYDRA_BOOTSTRAP=ssh" and also "I_MPI_HYDRA_BOOTSTRAP_EXEC=ssh" (MLSL_HOSTNAME_TYPE unset).


zj88 commented Jun 6, 2018

Actually, setting MLSL_HOSTNAME_TYPE=2 seems to work! No error returned. Could you please check if this is the correct output:
mlsl_debug.txt

@VinnitskiV

Hi Zbigniew,

Unfortunately, MLSL_HOSTNAME_TYPE=2 doesn't work without the MLSL_IFACE_IDX or MLSL_IFACE_NAME env variables.

Could you please collect the full output (out and err; if you have something private there, please cut it off) with I_MPI_HYDRA_DEBUG=1, MLSL_LOG_LEVEL=5, and MLSL_HOSTNAME_TYPE=1? Also, add "-l" to your mpirun command line, like this:

  • mpirun -n 2 -ppn 1 -l -hosts ***.

--
BR,
Vladimir


zj88 commented Jun 7, 2018

out:
mlsl_debug.txt

err:
[1] Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
[1] PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x73f680)
[1] PMPI_Ibcast(989).: Invalid communicator
[0] Fatal error in PMPI_Ibcast: Invalid communicator, error stack:
[0] PMPI_Ibcast(1047): MPI_Ibcast(buffer=0x2aaab4021540, count=294912, datatype=MPI_FLOAT, comm=comm=0x1, request=0x74a3c0)
[0] PMPI_Ibcast(989).: Invalid communicator
[proxy:1:1@r1i0n3] HYD_pmcd_pmip_control_cmd_cb (../../pm/pmiserv/pmip_cb.c:3481): assert (!closed) failed
[proxy:1:1@r1i0n3] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[proxy:1:1@r1i0n3] main (../../pm/pmiserv/pmip.c:558): demux engine error waiting for event

@VinnitskiV

Thanks, Zbigniew,

Could you please try to reproduce this issue with the latest Intel MLSL version: Intel(R) MLSL 2018 Update 1 Preview?

--
BR,
Vladimir


zj88 commented Jun 7, 2018

Hi Vladimir,

It seems to be working well with the latest MLSL, but please take a look:
mlsl_debug.txt
The error file is empty; the MLSL version used is mlsl_2018.1.005.


VinnitskiV commented Jun 8, 2018

Yes, it's working correctly.

So, for future runs you should set MLSL_HOSTNAME_TYPE=1.

If you don't have any other questions, we will close this issue.

--
BR,
Vladimir


zj88 commented Jun 8, 2018

Yes, you can close it. Thanks Mikhail and Vladimir for your help!

@SmorkalovME

Thanks Zbigniew. Please let us know in case you face any further issues.


zj88 commented Aug 15, 2018

Hi again,

Unfortunately I am having a similar issue while running on 16 nodes. This time, however, it occurs when setting MLSL_NUM_SERVERS>3. So this config runs correctly:

#PBS -l select=16:ncpus=64:mpiprocs=1:ompthreads=61
export MLSL_NUM_SERVERS=3
export MLSL_SERVER_AFFINITY="61,62,63"
export MLSL_SERVER_CREATION_TYPE=0
export MLSL_HOSTNAME_TYPE=1
mpirun -n 16 -ppn 1 -l -f $HOSTFILE ./mlsl_test 1

while this one times out:

#PBS -l select=16:ncpus=64:mpiprocs=1:ompthreads=60
export MLSL_NUM_SERVERS=4
export MLSL_SERVER_AFFINITY="60,61,62,63"
export MLSL_SERVER_CREATION_TYPE=0
export MLSL_HOSTNAME_TYPE=1
mpirun -n 16 -ppn 1 -l -f $HOSTFILE ./mlsl_test 1

Do you have any ideas what could be wrong here? I'm attaching output files:
mlsl_debug-success.txt
mlsl_debug-fail.txt
The MLSL version used is mlsl_2018.1.005.

Also, running on 8 nodes with 4 MLSL servers is successful.

@SmorkalovME SmorkalovME reopened this Aug 22, 2018
@SmorkalovME SmorkalovME self-assigned this Aug 22, 2018
@VinnitskiV

@zj88
Hello, could you please rebuild MLSL in debug mode and collect the results:

make clean && make ENABLE_DEBUG=1 && make install


zj88 commented Aug 23, 2018

Thanks for the answer. I have built MLSL in debug mode; here is the output:
mlsl_output.txt

I had this config:
#PBS -l select=16:ncpus=64:mpiprocs=1
export MLSL_SERVER_CREATION_TYPE=0
export MLSL_HOSTNAME_TYPE=1
export OMP_NUM_THREADS=60
export KMP_HW_SUBSET=1t
export MLSL_NUM_SERVERS=4
export MLSL_SERVER_AFFINITY=6,7,8,9
export KMP_AFFINITY="proclist=[0-5,10-63],granularity=thread,explicit"

@VinnitskiV

@zj88
Thank you, it looks like the problem is the short walltime.
Could you try using 5 minutes?


zj88 commented Sep 3, 2018

Sorry for the delay, the machine was busy.
Unfortunately the output is almost the same for 5 min:
mlsl_output.txt
I'm not sure what could be wrong here.

@VinnitskiV

@zj88
Could you run with MLSL_SERVER_CREATION_TYPE=1?


zj88 commented Sep 14, 2018

Similar result:
mlsl_output.txt
I also tried 10 min, but it ends with the same error code and almost the same output.

@VinnitskiV

@zj88 Thank you,
Can you use ssh instead of rsh? Like this:
export I_MPI_HYDRA_BOOTSTRAP=ssh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=
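In the job script this would look like the sketch below. The `unset` form is an equivalent alternative I'm adding; an empty value and an unset variable should both make Hydra fall back to its default ssh launcher.

```shell
# Switch the Hydra process manager from pbs_tmrsh/rsh to plain ssh.
export I_MPI_HYDRA_BOOTSTRAP=ssh
# Clear the launcher override so Hydra uses the default ssh binary;
# unsetting the variable should have the same effect as setting it empty.
unset I_MPI_HYDRA_BOOTSTRAP_EXEC
```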


zj88 commented Oct 21, 2018

Please take a look:
mlsl_output.txt
Thanks!

@VinnitskiV

@zj88 Thank you,
Could you run with either:
export I_MPI_FABRICS=tcp or export I_MPI_FABRICS=ofa
If it doesn't help (no "PASSED" lines in the output), please send me the debug output with MLSL_NUM_SERVERS=3.


zj88 commented Nov 27, 2018

export I_MPI_FABRICS=tcp helped.
Works great now, thanks a lot!
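For anyone landing here later, here is a sketch of the final working configuration as far as this thread states it. MLSL_SERVER_CREATION_TYPE was last set to 1 in the thread, but the value used in the final run isn't stated explicitly, so it is omitted here.

```shell
#PBS -l select=16:ncpus=64:mpiprocs=1:ompthreads=60
export MLSL_NUM_SERVERS=4
export MLSL_SERVER_AFFINITY="60,61,62,63"
export MLSL_HOSTNAME_TYPE=1      # fixed the original hostname/launch failure
export I_MPI_HYDRA_BOOTSTRAP=ssh # ssh bootstrap, per the last exchange above
export I_MPI_FABRICS=tcp         # fixed the MLSL_NUM_SERVERS>3 timeout
mpirun -n 16 -ppn 1 -l -f $HOSTFILE ./mlsl_test 1
```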

@zj88 zj88 closed this as completed Nov 27, 2018