@hjelmn
Member

hjelmn commented Aug 24, 2015

No description provided.

hjelmn added 2 commits August 24, 2015 16:00
When using eager RDMA in debug builds the openib btl generates a
sequence number for each send. The code independently updated the head
index and the sequence number for the eager rdma transaction. If
multiple threads enter this code at the same time and run in the
following order:

thread 1: update sequence (0 -> 1)
thread 2: update sequence (1 -> 2)
thread 2: update head (0 -> 1)
thread 1: update head (1 -> 2)

the sequence number for head[0] gets 1 and the sequence number for
head[1] gets 0. The fix is to generate the sequence number from the
head index.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
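
The idea in the commit message above can be sketched as follows. This is a minimal illustration using C11 atomics, not the actual openib btl code: the names `eager_rdma_ring_t`, `eager_rdma_claim_t`, `eager_rdma_claim`, and `EAGER_RDMA_SLOTS` are hypothetical, and the ring size is an assumption. A single atomic fetch-and-add hands out one ticket, and both the slot index and the debug sequence number are derived from that ticket, so no interleaving of threads can pair a slot with the wrong sequence number.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ring size, chosen only for this sketch. */
#define EAGER_RDMA_SLOTS 16u

/* Hypothetical stand-in for the per-endpoint eager RDMA state. */
typedef struct {
    atomic_uint_fast32_t head;  /* monotonically increasing ticket counter */
} eager_rdma_ring_t;

/* Result of claiming one eager RDMA slot. */
typedef struct {
    uint32_t slot;  /* index into the ring */
    uint16_t seq;   /* debug sequence number for this slot */
} eager_rdma_claim_t;

/*
 * Claim the next slot. There is only one shared update (the fetch-and-add),
 * so the slot and the sequence number always come from the same ticket.
 * With two independently updated counters, a thread could take sequence N
 * and then lose the race for head index N to another thread.
 */
static eager_rdma_claim_t eager_rdma_claim(eager_rdma_ring_t *ring)
{
    assert(ring != NULL);

    uint32_t ticket = (uint32_t) atomic_fetch_add(&ring->head, 1);

    eager_rdma_claim_t claim;
    claim.slot = ticket % EAGER_RDMA_SLOTS;  /* position in the ring */
    claim.seq  = (uint16_t) ticket;          /* derived from the same ticket */
    return claim;
}
```

With `head` initialized to 0 via `atomic_init`, two consecutive calls to `eager_rdma_claim` return slot 0 with sequence 0 and slot 1 with sequence 1, whichever thread makes each call.
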
There were several issues preventing the openib btl from running in
thread multiple mode:

 - Missing locks in UDCM when generating a loopback endpoint. Fixed in
   open-mpi/ompi@8205d79.

 - Incorrect sequence numbers generated in debug mode. This did not
   prevent the openib btl from running but instead produced incorrect
   error messages in debug builds.

 - Recursive locking of the rcache lock caused by the malloc
   hooks. This is fixed by open-mpi#827.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn
Member Author

hjelmn commented Aug 24, 2015

This PR depends on #827.

@miked-mellanox Please take a look. With these fixes and #827, the openib btl runs without issue with MPI_THREAD_MULTIPLE. I know Mellanox only cares about MXM these days, but there are benefits to having the btl work.

@mike-dubman
Member

@hjelmn -

  • Jenkins does not run openib with threads; that test is disabled.
  • I can re-enable it.
  • I tried to run this PR with the threads test, but it failed with a segfault:
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 2 -bind-to core -mca btl_openib_if_include mlx5_0:1  -mca pml ob1 -mca btl openib /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/latency_th 8
  • Did it work for you?

@hjelmn
Member Author

hjelmn commented Aug 27, 2015

@miked-mellanox That test was passing for me in several different configurations. Did you apply the patches from #827?

@hjelmn
Member Author

hjelmn commented Aug 31, 2015

@miked-mellanox Can you enable the thread tests for the openib btl on this PR? Master now has the code needed to make this work.

@mike-dubman
Member

@hjelmn - fixed.

@mike-dubman
Member

bot:retest

@mike-dubman
Member

@hjelmn - the tests passed, with only warnings:

14:30:47 + timeout -s SIGKILL 10m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 2 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -mca pml ob1 -mca btl self,openib /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/message_rate_th 8
14:30:48 [warn] opal_libevent2022_event_base_loop: reentrant invocation.  Only one event_base_loop can run on each event_base at once.
14:30:50 Thread 4, message rate 4248.267450 messages/sec
14:30:50 [warn] opal_libevent2022_event_base_loop: reentrant invocation.  Only one event_base_loop can run on each event_base at once.
14:30:50 [warn] opal_libevent2022_event_base_loop: reentrant invocation.  Only one event_base_loop can run on each event_base at once.

hjelmn added a commit that referenced this pull request Sep 1, 2015
hjelmn merged commit f926796 into open-mpi:master Sep 1, 2015
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Aug 23, 2016
rmaps base help: update binding error messages
hjelmn deleted the openib_thread_fix branch March 21, 2018 17:42