@rhc54
Contributor

@rhc54 rhc54 commented May 30, 2016

Enable simulation of large-scale clusters by allowing multiple daemons per node. Setting the ras_base_multiplier parameter to a value greater than 1 causes ORTE to replicate each allocated node by that factor. A daemon is spawned for each replica, letting ORTE function as if it were running on a much larger cluster.

Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.
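
As a rough mental model of the replication described above, here is a minimal Python sketch. It is not ORTE's implementation: the function name `replicate_nodes`, the hostnames, and the `(node, replica)` tuple representation are all hypothetical and exist only to show how the multiplier scales the daemon count.

```python
def replicate_nodes(allocated, multiplier):
    """Return one entry per (real node, replica index) pair.

    Hypothetical model only: each entry stands for one daemon that
    would be spawned when every allocated node is replicated
    `multiplier` times.
    """
    if multiplier < 1:
        raise ValueError("multiplier must be >= 1")
    return [(node, replica)
            for node in allocated
            for replica in range(multiplier)]


# Two real nodes with a multiplier of 4 are treated as eight nodes.
allocated = ["node01", "node02"]  # made-up hostnames
simulated = replicate_nodes(allocated, 4)
print(len(simulated))  # 8
```

Presumably the parameter is set like any other MCA parameter (for example `-mca ras_base_multiplier 4` on the mpirun command line), but that usage is an assumption and is not shown in this PR.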

@jsquyres You and Peter might find this helpful

@rhc54
Contributor Author

rhc54 commented May 30, 2016

@jladd-mlnx @miked-mellanox Looks like the stack trace stuff is working! Is the failure intended to test it, or is this an actual hang in MXM?

This PR isn't related to the error, which is why I'm asking.

@mike-dubman
Member

@rhc54 - cool! The hanging command line is a pure openib BTL multi-threaded test.

The MXM frames come purely from the progress function.

05:41:51 + /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 2 -bind-to core --report-state-on-timeout --get-stack-traces --timeout 300 -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,openib /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/message_rate_th 8

@rhc54
Contributor Author

rhc54 commented May 30, 2016

Very good - thx! I'll let @hjelmn or someone deal with the openib issue - might even be a PR waiting for it now.

@rhc54 rhc54 merged commit 8762574 into open-mpi:master May 30, 2016
@rhc54 rhc54 deleted the topic/sim branch May 30, 2016 04:29
@hjelmn
Member

hjelmn commented May 30, 2016

5 mins was too short for the test. For some reason (need to investigate) ob1+openib does poorly on that test. Please restore the timeout to 10 mins until we can figure out why it is slow.

@mike-dubman
Member

increased to 10m
