Clover first solve succeeds second fails #74

Closed
bjoo opened this issue Jul 17, 2012 · 14 comments

@bjoo (Member) commented Jul 17, 2012

In a recent master branch we observed the following behaviour at JLab in the clover solver:

  • i) The first solve succeeds
  • ii) The second solve appears to converge to the wrong answer

This was tested in commit ID: 0929692
This behaviour is not present in version: a0c7a3b (which was identified by Frank as stable).

Specific output:

First solve OK:

BiCGstab: 2181 iterations, r2 = 2.350111e-13
BiCGstab: 2182 iterations, r2 = 2.314445e-13
BiCGstab: Reliable updates = 21
BiCGstab: Converged after 2182 iterations, relative residua: iterated = 4.984836e-07, true = 7.308019e-07
Solution = 2.970304
Reconstructed: CUDA solution = 2.970304, CPU copy = 2.970303
Cuda Space Required
Spinor:0.1640625 GiB
Gauge :0 GiB
InvClover :0 GiB
QUDA_BICGSTAB_CLOVER_SOLVER: time=46.575336 s Performance=1648.30932630335 GFLOPS Total Time (incl. load gauge)=51.613748 s
QUDA_BICGSTAB_CLOVER_SOLVER: 2182 iterations. Rsd = 1.274569e-06 Relative Rsd = 7.56500540505804e-07

Second solve: only 606 iterations, and QUDA and Chroma disagree on the residual:

BiCGstab: 606 iterations, r2 = 2.035114e-13
BiCGstab: Reliable updates = 10
BiCGstab: Converged after 606 iterations, relative residua: iterated = 4.677403e-07, true = 4.998825e-07
Solution = 1.833076
Reconstructed: CUDA solution = 1.833076, CPU copy = 1.833076
Cuda Space Required
Spinor:0.1640625 GiB
Gauge :0 GiB
InvClover :0 GiB
QUDA_BICGSTAB_CLOVER_SOLVER: time=5.873358 s Performance=3641.36232313031 GFLOPS Total Time (incl. load gauge)=6.105013 s
ERROR: QUDA Solver residuum is outside tolerance: QUDA resid=0.0460195576642859 Desired =5e-07 Max Tolerated = 5e-06
QUDA_BICGSTAB_CLOVER_SOLVER: 606 iterations. Rsd = 0.07754742 Relative Rsd = 0.0460195576642859
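
(For context: the error above reflects an independent host-side check of the relative residual of the returned solution against the requested tolerance. A minimal sketch of that kind of check, in plain C++ and not Chroma's actual code:)

#include <cmath>
#include <vector>

// Schematic true-residual check: given the residual vector r = b - A*x
// recomputed on the host, compare ||r|| / ||b|| with the maximum tolerated
// value (5e-06 in the log above). The second solve fails this test.
double relative_residual(const std::vector<double>& r, const std::vector<double>& b)
{
    double r2 = 0.0, b2 = 0.0;
    for (size_t i = 0; i < r.size(); ++i) r2 += r[i] * r[i];
    for (size_t i = 0; i < b.size(); ++i) b2 += b[i] * b[i];
    return std::sqrt(r2 / b2);
}

bool solve_accepted(const std::vector<double>& r, const std::vector<double>& b,
                    double max_tolerated)
{
    return relative_residual(r, b) <= max_tolerated;
}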

Similar behaviour was reported from Edge recently.

@maddyscientist (Member)

Balint, if I am to debug this, can you give me an appropriate Chroma build package, so that I can build and run it without needing to know the details, and swap in different versions of QUDA as I see fit until the change that causes this bug is tracked down?

@ghost assigned maddyscientist Aug 13, 2012
@maddyscientist (Member)

Ok, I think now is the time to test this. I have just merged in my BQCD branch, and I can see no problems with successive solves there. Balint, can you do a fresh pull and check that this issue has gone?

@ghost assigned bjoo Aug 15, 2012
@maddyscientist (Member)

All final Chroma issues should now be fixed as of commit 1b77cc9. Leaving this open until Balint confirms this.

@maddyscientist (Member)

I have finally managed to reproduce the originally reported issue in the QUDA tests. The successive-solve problem only occurs with multiple GPUs on separate nodes, i.e., over InfiniBand. The problem occurs for all precision combinations.

Working on locating the source now.

@maddyscientist (Member)

I still haven't isolated this, but I have found that the problem occurs with Wilson as well as clover, and it happens with the MPI as well as the QMP back end. Need to sleep now.

@jpfoley (Member) commented Sep 12, 2012

This sounds pretty similar to the problems we were having with GPUDirect. Where were you running?

@maddyscientist (Member)

I was running on an internal cluster, using two nodes with one M2090 per node. Having interactive access is a huge bonus to this type of debugging.

I just tried to repro with the staggered invert test and I cannot: successive solves always converge. This doesn't rule out that the GPU Direct bug is related to this one.

Testing now with GPU Direct disabled.....

@maddyscientist (Member)

Ok, with GPU Direct disabled, the bug goes away. I guess this is related to the other GPU Direct issues. This is with OpenMPI 1.5.4 and CUDA 4.0. Continuing to investigate...

@jpfoley (Member) commented Sep 12, 2012

You ran with Rolf's flag, I guess.
I think Steve was seeing this same problem on Keeneland.
The runtime flag seems to work on some platforms, but not on others.

@bjoo (Member, Author) commented Sep 12, 2012

Hi Mike,
I can also do a quick rebuild with MVAPICH2, with and without GPU Direct, if having a different MPI helps.

Best,
B

@maddyscientist (Member)

I hadn't been running with Rolf's flags. I ran with them now, and the issue goes away. So this is definitely the same problem.

@maddyscientist (Member)

Just made huge progress. The current QMP and MPI back ends use cudaHostAlloc to create pinned memory, which is then also pinned for use by IB. The alternative is to simply do a malloc and then use cudaHostRegister to pin the memory for CUDA. There should be no difference... but doing the latter makes the issue go away.
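
(For reference, a minimal sketch of the two allocation strategies described above, using the plain CUDA runtime API; the function names and buffer handling are illustrative, not QUDA's actual code:)

#include <cstdlib>
#include <cuda_runtime.h>

// Strategy 1: let the CUDA runtime allocate page-locked host memory directly.
// The same buffer is subsequently pinned/used by the IB stack for GPU Direct.
void* alloc_pinned_cuda(size_t bytes)
{
    void* ptr = nullptr;
    cudaHostAlloc(&ptr, bytes, cudaHostAllocDefault);
    return ptr;
}

// Strategy 2: allocate with malloc, then page-lock the region afterwards with
// cudaHostRegister. In principle equivalent, but this is the variant that made
// the failing-second-solve symptom disappear.
void* alloc_pinned_registered(size_t bytes)
{
    void* ptr = std::malloc(bytes);
    cudaHostRegister(ptr, bytes, cudaHostRegisterDefault);
    return ptr;
}

void free_pinned_cuda(void* ptr)       { cudaFreeHost(ptr); }
void free_pinned_registered(void* ptr) { cudaHostUnregister(ptr); std::free(ptr); }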

The reason this does not show up with Frank's stable version is that the FaceBuffer is now recreated with every invertQuda call, whereas previously it was reused between invertQuda calls. This is consistent with the fact that, with current master, the first solve works correctly but subsequent solves do not. There appears to be something wrong with reallocating pinned memory directly when the FaceBuffer is recreated.

Looking into this more, but it appears I have a fix, even if I don't totally understand it yet.
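
(A rough sketch of the per-solve lifecycle described above, with hypothetical names rather than QUDA's actual code: because the halo buffers are recreated for every solve, the free-and-repin path is only exercised from the second solve onwards.)

#include <cuda_runtime.h>

// Hypothetical per-solve buffer lifecycle: each invert call creates and destroys
// its own pinned halo buffer, so only the first solve sees an allocation that has
// never been through a free/re-pin cycle.
void run_solves(int num_solves, size_t halo_bytes)
{
    for (int s = 0; s < num_solves; ++s) {
        void* halo = nullptr;
        cudaHostAlloc(&halo, halo_bytes, cudaHostAllocDefault); // per-call "FaceBuffer"
        // ... exchange halos over InfiniBand and run the BiCGstab solve ...
        cudaFreeHost(halo);
    }
}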

@maddyscientist (Member)

Balint, please try commit 3b21a83 when you have a chance. I believe this fixes the issue.

@bjoo (Member, Author) commented Sep 14, 2012

I tried a variety of modes here with WEAK_FIELD tests on 16 GTX480 GPUs (24x24x24x128 lattice spread over a virtual geometry of 1x1x1x16; CentOS 5.5, CUDA 4.2, MVAPICH2-1.8, driver version 304.43, certified):

HALF(12)-SINGLE(12) - OK
SINGLE(12)-SINGLE(12) - OK
HALF(12)-SINGLE(18) - OK
HALF(12)-DOUBLE(18) - OK
SINGLE(12)-DOUBLE(18) - OK
DOUBLE(18)-DOUBLE(18) - OK

(Clearly there are more combinations, e.g. the 8-reconstruct variants, but I think these are potentially the most important ones.)

In addition, I ran a user job which performed something like 192 QUDA inversion calls using HALF(12)-DOUBLE(18), in 16 full propagator calculations with I/O in between the propagator calculations, and that job also ran through fine.

At this time I am happy to sign off on this and close this issue. If user experience reveals further problems we can open a new issue. Great job on sorting this out. It was a nasty one.

@bjoo closed this as completed Sep 14, 2012