mlsl_test with MLSL_NUM_SERVERS>0 error #14
Comments
Hi Zbigniew, Could you please check if you observe the same issue when setting the "I_MPI_HYDRA_BOOTSTRAP=ssh" variable instead of using pbs/rsh? Thanks,
Hi Mikhail, Thanks for your reply. When I set "I_MPI_HYDRA_BOOTSTRAP=ssh" I get:

usage: /opt/pbs/default/bin/pbs_tmrsh [-n][-l username] host [-n][-l username] command

I can track some commands with pbs_tmrsh in the output:

Proxy launch args: <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id

[mpiexec@r1i4n32] Launch arguments: <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id 0

[mpiexec@r1i4n32] Launch arguments: /opt/pbs/default/bin/pbs_tmrsh -x -q r1i4n33 <mlsl root path>/intel64/bin/pmi_proxy --control-port r1i4n32:39059 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --launcher-exec /opt/pbs/default/bin/pbs_tmrsh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1938569249 --usize -2 --proxy-id 1
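Judging from the launch arguments above, Hydra keeps /opt/pbs/default/bin/pbs_tmrsh as the launcher executable even after the bootstrap is switched to ssh. A minimal sketch of overriding the launcher binary explicitly (the /usr/bin/ssh path is an assumption; later in this thread this same pair of variables is reported to produce a different error, so treat it as a diagnostic step, not a guaranteed fix):

```shell
# Force both the bootstrap mechanism and the launcher binary, so Hydra
# does not fall back to the PBS-provided pbs_tmrsh wrapper.
# /usr/bin/ssh is an assumed path; check it with `command -v ssh`.
export I_MPI_HYDRA_BOOTSTRAP=ssh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=/usr/bin/ssh
```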
Thanks Zbigniew. Would you please collect the output with the I_MPI_HYDRA_DEBUG=1 environment variable set and your initial launch approach (I_MPI_HYDRA_BOOTSTRAP=rsh)? Please note that this debug output may contain all env variables set in your session, so if you have something private there, please cut it out.
I replaced some info with variables <...>. Hope it's readable.
Thanks - it does help! Could you please try setting "MLSL_HOSTNAME_TYPE=1" in addition to what you already have in your env? If this doesn't help, please recollect the debug output with I_MPI_HYDRA_DEBUG=1 and MLSL_HOSTNAME_TYPE=1.
Sure, here you are: At least we've got a different error now and it's not timing out anymore. By the way, I had this error before, when setting "I_MPI_HYDRA_BOOTSTRAP=ssh" and also "I_MPI_HYDRA_BOOTSTRAP_EXEC=ssh" (MLSL_HOSTNAME_TYPE unset).
Actually, setting MLSL_HOSTNAME_TYPE=2 seems to work! No error returned. Could you please check if this is the correct output:
Hi Zbigniew, Unfortunately, MLSL_HOSTNAME_TYPE=2 doesn't work without the MLSL_IFACE_IDX or MLSL_IFACE_NAME env variables. Could you please collect the full output (out and err; if you have something private there, please cut it out) with I_MPI_HYDRA_DEBUG=1, MLSL_LOG_LEVEL=5 and MLSL_HOSTNAME_TYPE=1? Also, add "-l" in your mpirun command line, like this:
--
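The exact command example referenced in the comment above did not survive in the thread. A hedged sketch of what the requested debug collection might look like, based only on the variables and flags named so far (the process counts and the test binary come from the earlier runs in this report; everything else is an assumption):

```shell
# Debug-collection environment requested above.
export I_MPI_HYDRA_DEBUG=1    # verbose Hydra process-manager output
export MLSL_LOG_LEVEL=5       # most verbose MLSL logging level
export MLSL_HOSTNAME_TYPE=1   # hostname-resolution mode suggested above

# "-l" prefixes each output line with its MPI rank, so the interleaved
# logs stay attributable; stdout and stderr go to separate files, as
# requested ("out and err").
CMD='mpirun -l -n 2 -ppn 1 ./mlsl_test 1'
echo "$CMD > job.out 2> job.err"
```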
out: err: |
Thanks, Zbigniew. Could you please try to reproduce this issue with the latest Intel MLSL version: Intel(R) MLSL 2018 Update 1 Preview?
--
Hi Vladimir, It seems to be working well with the latest MLSL, but please take a look:
Yes, it's working right. So, for the next runs you must set MLSL_HOSTNAME_TYPE=1. If you don't have other questions, we will close this issue.
--
Yes, you can close it. Thanks Mikhail and Vladimir for your help!
Thanks Zbigniew. Please let us know in case you face any further issues.
Hi again, Unfortunately I am running into a similar issue when running on 16 nodes. This time, however, it happens when setting MLSL_NUM_SERVERS>3. This config runs correctly:

#PBS -l select=16:ncpus=64:mpiprocs=1:ompthreads=61

while this one times out:

#PBS -l select=16:ncpus=64:mpiprocs=1:ompthreads=60

Do you have any ideas what could be wrong here? I'm attaching the output files. Also, running on 8 nodes with 4 MLSL servers is successful.
@zj88
Thanks for the answer. I have built MLSL in debug mode; here is the output: I had this config:
@zj88 |
Sorry for the delay, the machine was busy.
@zj88 |
Similar result:
@zj88 Thank you,
Please take a look:
@zj88 Thank you,
Hi,
I'm unable to run the mlsl_test app with MLSL_NUM_SERVERS=1. When MLSL_NUM_SERVERS=0, everything works fine. I'm using PBSPro to submit the job to two KNL nodes. This is the error I get:
/opt/pbs/default/bin/pbs_tmrsh: host "r1i4n33" is not a node in job <27928.lic01>
=>> PBS: job killed: walltime 79 exceeded limit 60
[mpiexec@r1i4n32] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@r1i4n32] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@r1i4n32] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@r1i4n32] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@r1i4n32] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@r1i4n32] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
I'm using the Intel MPI bundled with MLSL; 'which mpirun' returns:
<mlsl root path>/intel64/bin/mpirun
and 'mpirun -version':
Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)
Copyright (C) 2003-2017, Intel Corporation. All rights reserved.
There may be an issue with the MPI and PBS paths, but then it would probably not work for MLSL_NUM_SERVERS=0 either. Similar reports:
https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/743142
https://software.intel.com/pt-br/forums/intel-clusters-and-hpc-technology/topic/713369
env | grep I_MPI:
I_MPI_HYDRA_DEBUG=1
I_MPI_DEBUG_OUTPUT=debug_output-27928.lic01.txt
I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/default/bin/pbs_tmrsh
I_MPI_DEBUG=5
I_MPI_HYDRA_BOOTSTRAP=rsh
I_MPI_ROOT=<mlsl root path>
I am not sure how I should shape the PBS resource request when using MLSL parameter servers, but I tried many configurations, all resulting in the above error:
#PBS -l select=2:ncpus=64:mpiprocs=2:ompthreads=63
mpirun -n 4 -ppn 2 ./mlsl_test 1
#PBS -l select=2:ncpus=64:mpiprocs=1:ompthreads=63
mpirun -n 2 -ppn 1 ./mlsl_test 1
also with
export MLSL_SERVER_AFFINITY=63
export MLSL_SERVER_CREATION_TYPE=1 (also tried 0)
mpirun -n 2 -ppn 1 hostname
outputs correctly:
r1i4n32
r1i4n33
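Putting the fragments above together, a hedged sketch of a complete job script (the select line, bootstrap variables, and mpirun invocation are taken from this report; the walltime, working-directory handling, and the guard around mpirun are assumptions added for illustration):

```shell
#!/bin/sh
#PBS -l select=2:ncpus=64:mpiprocs=1:ompthreads=63
#PBS -l walltime=00:05:00

# Bootstrap settings from this report: launch proxies through the
# PBS-provided rsh wrapper.
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=/opt/pbs/default/bin/pbs_tmrsh

export MLSL_NUM_SERVERS=1   # the failing case; 0 works per this report

cd "${PBS_O_WORKDIR:-.}"

# Guarded so the sketch is harmless outside a real cluster.
if command -v mpirun >/dev/null 2>&1; then
    mpirun -n 2 -ppn 1 ./mlsl_test 1
fi
```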
The MLSL version used is l_mlsl_2017.1.016.
Do you have any ideas what may be wrong here?