Clover first solve succeeds second fails #74

Closed
bjoo opened this issue Jul 17, 2012 · 14 comments

@bjoo (Member) commented Jul 17, 2012

In a recent master branch we observed the following behaviour at JLab in the clover solver:

  • i) The first solve succeeds
  • ii) The second solve appears to converge to the wrong answer

This was tested in commit ID: 0929692
This behaviour is not present in version: a0c7a3b (which was identified by Frank as stable).

Specific output:

First solve OK:

BiCGstab: 2181 iterations, r2 = 2.350111e-13
BiCGstab: 2182 iterations, r2 = 2.314445e-13
BiCGstab: Reliable updates = 21
BiCGstab: Converged after 2182 iterations, relative residua: iterated = 4.984836e-07, true = 7.308019e-07
Solution = 2.970304
Reconstructed: CUDA solution = 2.970304, CPU copy = 2.970303
Cuda Space Required
Spinor:0.1640625 GiB
Gauge :0 GiB
InvClover :0 GiB
QUDA_BICGSTAB_CLOVER_SOLVER: time=46.575336 s Performance=1648.30932630335 GFLOPS Total Time (incl. load gauge)=51.613748 s
QUDA_BICGSTAB_CLOVER_SOLVER: 2182 iterations. Rsd = 1.274569e-06 Relative Rsd = 7.56500540505804e-07

Second solve: only 606 iterations, and QUDA and Chroma disagree on the residual:

BiCGstab: 606 iterations, r2 = 2.035114e-13
BiCGstab: Reliable updates = 10
BiCGstab: Converged after 606 iterations, relative residua: iterated = 4.677403e-07, true = 4.998825e-07
Solution = 1.833076
Reconstructed: CUDA solution = 1.833076, CPU copy = 1.833076
Cuda Space Required
Spinor:0.1640625 GiB
Gauge :0 GiB
InvClover :0 GiB
QUDA_BICGSTAB_CLOVER_SOLVER: time=5.873358 s Performance=3641.36232313031 GFLOPS Total Time (incl. load gauge)=6.105013 s
ERROR: QUDA Solver residuum is outside tolerance: QUDA resid=0.0460195576642859 Desired =5e-07 Max Tolerated = 5e-06
QUDA_BICGSTAB_CLOVER_SOLVER: 606 iterations. Rsd = 0.07754742 Relative Rsd = 0.0460195576642859
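
(For context: the error above reflects an independent host-side check of the relative residual of the returned solution against the requested tolerance. A minimal sketch of that kind of check, in plain C++ and not Chroma's actual code:)

#include <cmath>
#include <vector>

// Schematic true-residual check: given the residual vector r = b - A*x
// recomputed on the host, compare ||r|| / ||b|| with the maximum tolerated
// value (5e-06 in the log above). The second solve fails this test.
double relative_residual(const std::vector<double>& r, const std::vector<double>& b)
{
    double r2 = 0.0, b2 = 0.0;
    for (size_t i = 0; i < r.size(); ++i) r2 += r[i] * r[i];
    for (size_t i = 0; i < b.size(); ++i) b2 += b[i] * b[i];
    return std::sqrt(r2 / b2);
}

bool solve_accepted(const std::vector<double>& r, const std::vector<double>& b,
                    double max_tolerated)
{
    return relative_residual(r, b) <= max_tolerated;
}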

Similar behaviour was reported from Edge recently.

@maddyscientist (Member)

Balint, if I am to debug this, can you give me an appropriate Chroma build package, so that I can build and run it without needing to know the details, and swap in different versions of QUDA as I see fit until the change that causes this bug is tracked down?

@ghost assigned maddyscientist Aug 13, 2012
@maddyscientist (Member)

Ok, I think now is the time to test this. I have just merged in my BQCD branch, and I can see no problems with successive solves there. Balint, can you do a fresh pull and check that this issue has gone?

@ghost assigned bjoo Aug 15, 2012
@maddyscientist (Member)

All final Chroma issues should now be fixed as of commit 1b77cc9. Leaving this open until Balint confirms this.

@maddyscientist (Member)

I have finally managed to reproduce the originally reported issue in the QUDA tests. The successive-solve problem only occurs with multiple GPUs on separate nodes, i.e., over InfiniBand. The problem occurs for all precision combinations.

Working on locating the source now.

@maddyscientist (Member)

I still haven't isolated this, but I have found that the problem occurs with Wilson as well as clover, and it happens with the MPI as well as the QMP back end. Need to sleep now.

@jpfoley (Member) commented Sep 12, 2012

This sounds pretty similar to the problems we were having with GPUDirect. Where were you running?

@maddyscientist (Member)

I was running on an internal cluster, using two nodes with one M2090 per node. Having interactive access is a huge bonus to this type of debugging.

I just tried to repro with the staggered invert test and I cannot: successive solves always converge. This doesn't rule out that the GPU Direct bug is related to this one.

Testing now with GPU Direct disabled.....

@maddyscientist (Member)

Ok, with GPU Direct disabled, the bug goes away. I guess this is related to the other GPU Direct issues. This is with OpenMPI 1.5.4 and CUDA 4.0. Continuing to investigate...

@jpfoley (Member) commented Sep 12, 2012

You ran with Rolf's flag, I guess.
I think Steve was seeing this same problem on Keeneland.
The runtime flag seems to work on some platforms, but not on others.

@bjoo (Member, Author) commented Sep 12, 2012

Hi Mike,
I can also do a quick rebuild with MVAPICH2, with and without GPU Direct, if having a different MPI helps.

Best,
B

@maddyscientist (Member)

I hadn't been running with Rolf's flags. I ran with them now, and the issue goes away. So this is definitely the same problem.

@maddyscientist (Member)

Just made huge progress. The current QMP and MPI back ends use cudaHostAlloc to create pinned memory, which is then also pinned for use by IB. The alternative is to simply do a malloc and then use cudaHostRegister to pin the memory for CUDA. There should be no difference... but doing the latter makes the issue go away.
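
(For reference, a minimal sketch of the two allocation strategies described above, using the plain CUDA runtime API; the function names and buffer handling are illustrative, not QUDA's actual code:)

#include <cstdlib>
#include <cuda_runtime.h>

// Strategy 1: let the CUDA runtime allocate page-locked host memory directly.
// The same buffer is subsequently pinned/used by the IB stack for GPU Direct.
void* alloc_pinned_cuda(size_t bytes)
{
    void* ptr = nullptr;
    cudaHostAlloc(&ptr, bytes, cudaHostAllocDefault);
    return ptr;
}

// Strategy 2: allocate with malloc, then page-lock the region afterwards with
// cudaHostRegister. In principle equivalent, but this is the variant that made
// the failing-second-solve symptom disappear.
void* alloc_pinned_registered(size_t bytes)
{
    void* ptr = std::malloc(bytes);
    cudaHostRegister(ptr, bytes, cudaHostRegisterDefault);
    return ptr;
}

void free_pinned_cuda(void* ptr)       { cudaFreeHost(ptr); }
void free_pinned_registered(void* ptr) { cudaHostUnregister(ptr); std::free(ptr); }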

The reason this does not show up with Frank's stable version is that the FaceBuffer is now recreated with every invertQuda call, whereas previously it was reused between invertQuda calls. This is consistent with the fact that, with current master, the first solve works correctly but subsequent solves do not. There appears to be something wrong with reallocating pinned memory directly when the FaceBuffer is recreated.

Looking into this more, but it appears I have a fix, even if I don't totally understand it yet.
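
(A rough sketch of the per-solve lifecycle described above, with hypothetical names rather than QUDA's actual code: because the halo buffers are recreated for every solve, the free-and-repin path is only exercised from the second solve onwards.)

#include <cuda_runtime.h>

// Hypothetical per-solve buffer lifecycle: each invert call creates and destroys
// its own pinned halo buffer, so only the first solve sees an allocation that has
// never been through a free/re-pin cycle.
void run_solves(int num_solves, size_t halo_bytes)
{
    for (int s = 0; s < num_solves; ++s) {
        void* halo = nullptr;
        cudaHostAlloc(&halo, halo_bytes, cudaHostAllocDefault); // per-call "FaceBuffer"
        // ... exchange halos over InfiniBand and run the BiCGstab solve ...
        cudaFreeHost(halo);
    }
}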

@maddyscientist (Member)

Balint, please try commit 3b21a83 when you have a chance. I believe this fixes the issue.

@bjoo (Member, Author) commented Sep 14, 2012

I tried a variety of modes here with WEAK_FIELD tests on 16 GTX480 GPUs (24x24x24x128 lattice spread over a virtual geometry of 1x1x1x16; CentOS 5.5, CUDA 4.2, MVAPICH2-1.8, driver version 304.43, certified):

HALF(12)-SINGLE(12) - OK
SINGLE(12)-SINGLE(12) - OK
HALF(12)-SINGLE(18) - OK
HALF(12)-DOUBLE(18) - OK
SINGLE(12)-DOUBLE(18) - OK
DOUBLE(18)-DOUBLE(18) - OK

(Clearly there are more combinations, e.g. the 8-reconstruct variants, but I think these are potentially the most important ones.)

In addition, I ran a user job which performed something like 192 QUDA inversion calls using HALF(12)-DOUBLE(18), in 16 full propagator calculations with I/O in between the propagator calculations, and that job also ran through fine.

At this time I am happy to sign off on this and close this issue. If user experience reveals further problems we can open a new issue. Great job on sorting this out. It was a nasty one.

@bjoo closed this as completed Sep 14, 2012