
Multi-GPU staggered Dslash hanging on 8/16 GPUs #100

Closed
jpfoley opened this issue Jan 23, 2013 · 5 comments

Comments

@jpfoley
Member

jpfoley commented Jan 23, 2013

I just noticed this problem on Blue Waters yesterday when I was testing the MPI build. staggered_dslash_test and staggered_invert_test run fine on 4 GPUs, but hang in tests involving 8 and 16 GPUs. The bug was introduced in one of the commits of November 27 and 28. The code in the master branch worked fine before that. Blue Waters is down for maintenance today, but I will check whether the same problem occurs in the QMP build once it's back up.

@maddyscientist
Member

When going from 4 to 8 GPUs, was there anything else that changed? I am curious if the 4 to 8 GPU transition is when another dimension is partitioned.

@jpfoley
Member Author

jpfoley commented Jan 23, 2013

I had partitioned the z and t directions.
(1,1,2,2) runs but (1,1,2,4) doesn't.
I did run on a larger lattice on 8 GPUs, however.


@maddyscientist
Member

Ok, it looks like the partitioning isn't the issue then. I'll take a look later today.

@alexstrel
Member

I had the same problem with multi-node execution; a single node (i.e., 2 GPUs in my case) seemed to be OK.

@maddyscientist
Member

This problem is not related to the number of nodes; rather, it only seems to occur when the grid size in one of the dimensions is 4 or greater.
E.g., (1,2,2,2) runs, but (1,1,1,4) does not. Continuing to investigate.
