Clover: first solve succeeds, second fails #74
Comments
Balint, if I am to debug this, can you give me an appropriate Chroma build package, so I can build and run in ignorance and swap out different versions of QUDA as I see fit until the change that causes this bug is tracked down?
Ok, I think now is the time to test this. I have just merged in my BQCD branch, and I can see no problems with successive solves there. Balint, can you do a fresh pull and check that this issue has gone?
All final Chroma issues should now be fixed as of commit 1b77cc9. Leaving this open until Balint confirms this.
I have finally managed to reproduce the originally reported issue in the QUDA tests. The successive-solve problem only occurs with multiple GPUs on separate nodes, i.e., over InfiniBand. The problem occurs for all precision combinations. Working on locating the source now.
I still haven't isolated this yet, but I have found the problem occurs with Wilson as well as clover, and it happens with MPI as well as QMP back ends. Need to sleep now.
This sounds pretty similar to the problems we were having with GPU Direct.
I was running on an internal cluster, using two nodes with one M2090 per node. Having interactive access is a huge bonus for this type of debugging. I just tried to repro with the staggered invert test and I cannot - successive solves always converge. This doesn't exclude that the GPU Direct bug is related to this one. Testing now with GPU Direct disabled...
Ok, with GPU Direct disabled, the bug goes away. I guess this is related to the other GPU Direct issues. This is with OpenMPI 1.5.4 and CUDA 4.0. Continuing to investigate...
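For context, GPU Direct in this CUDA 4.x / OpenMPI 1.5.x setup means the IB stack and CUDA share the same pinned host buffers during the halo exchange, roughly along the lines of the sketch below (hypothetical function and rank names; this is not QUDA's actual face-exchange code):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical halo exchange: device -> pinned host buffer -> MPI.
 * With GPU Direct enabled, the IB driver registers the very same pinned
 * pages returned by cudaHostAlloc, avoiding an extra staging copy;
 * disabling it forces the IB stack to bounce through its own buffers. */
void exchange_face(const void *d_send, void *d_recv, size_t bytes,
                   int fwd_rank, int back_rank, cudaStream_t stream)
{
  void *h_send, *h_recv;
  cudaHostAlloc(&h_send, bytes, cudaHostAllocDefault);
  cudaHostAlloc(&h_recv, bytes, cudaHostAllocDefault);

  MPI_Request req[2];
  MPI_Irecv(h_recv, (int)bytes, MPI_BYTE, back_rank, 0, MPI_COMM_WORLD, &req[0]);

  /* Stage the boundary from the GPU into the shared pinned buffer, then send. */
  cudaMemcpyAsync(h_send, d_send, bytes, cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);
  MPI_Isend(h_send, (int)bytes, MPI_BYTE, fwd_rank, 0, MPI_COMM_WORLD, &req[1]);

  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
  cudaMemcpyAsync(d_recv, h_recv, bytes, cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);

  cudaFreeHost(h_send);
  cudaFreeHost(h_recv);
}
```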
You ran with Rolf's flag, I guess.
I didn't run with Rolf's flags. I did now, and found that the issue goes away. So this is definitely the same problem.
Just made huge progress. The current QMP and MPI backends use cudaHostAlloc to create pinned memory, which is then also pinned by the IB stack. The alternative is to simply do a malloc and then use cudaHostRegister to pin the memory for CUDA. There should be no difference... but doing the latter makes the issue go away. The reason this does not show up with Frank's stable version is that the FaceBuffer is now recreated with every invertQuda call, whereas previously it was reused between invertQuda calls. This is consistent with the fact that with current master the first solve works correctly but subsequent solves do not. There appears to be something wrong with reallocating pinned memory directly when the FaceBuffer is recreated. Looking more into this, but it appears I have a fix, even if I don't totally understand it yet.
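A minimal sketch of the two pinning strategies described above (illustrative buffer size and loop; not QUDA's actual FaceBuffer code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define FACE_BYTES (1 << 20)   /* illustrative face-buffer size */

/* Strategy 1: cudaHostAlloc returns CUDA-pinned memory directly; with
 * GPU Direct the IB stack then shares these same pinned pages. */
static void *pin_with_cudaHostAlloc(size_t bytes)
{
  void *ptr = NULL;
  cudaHostAlloc(&ptr, bytes, cudaHostAllocDefault);
  return ptr;   /* release with cudaFreeHost(ptr) */
}

/* Strategy 2: plain malloc, then cudaHostRegister to pin the pages for
 * CUDA. In principle equivalent, but this path made the bug go away. */
static void *pin_with_hostRegister(size_t bytes)
{
  void *ptr = malloc(bytes);
  cudaHostRegister(ptr, bytes, cudaHostRegisterDefault);
  return ptr;   /* release with cudaHostUnregister(ptr); free(ptr) */
}

int main(void)
{
  /* Mimic the FaceBuffer being destroyed and recreated for every
   * invertQuda call, i.e. the pinned buffers are cycled per solve. */
  for (int solve = 0; solve < 2; solve++) {
    void *buf = pin_with_hostRegister(FACE_BYTES);  /* or pin_with_cudaHostAlloc */
    /* ... halo exchange would use buf here ... */
    cudaHostUnregister(buf);
    free(buf);
    printf("solve %d: pinned face buffer cycled\n", solve);
  }
  return 0;
}
```

The only intended difference is where the pages come from (the CUDA host allocator versus a plain malloc that is registered afterwards); the observation above is that freeing and re-allocating the cudaHostAlloc'd buffers on every solve appears to interact badly with the IB registration, while the register/unregister path does not.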
Balint, please try commit 3b21a83 when you have a chance. I believe this fixes the issue.
I tried a variety of modes here with WEAK_FIELD tests on 16 GTX480 GPUs: HALF(12)-SINGLE(12) - OK (clearly there are more combinations, e.g. the 8 reconstructs etc., but I think these are the most important ones). In addition I ran a user job which performed something like 192 calls to QUDA inversions. At this time I am happy to sign off on this and close this issue. If user experience reveals further problems we can open a new issue. Great job on sorting this out. It was a nasty one.
In a recent master branch we observed the following behaviour at JLab in the clover solver:
This was tested in commit ID: 0929692
This behaviour is not present in version: a0c7a3b (which was identified by Frank as stable).
Specific output:
First solve OK:
BiCGstab: 2181 iterations, r2 = 2.350111e-13
BiCGstab: 2182 iterations, r2 = 2.314445e-13
BiCGstab: Reliable updates = 21
BiCGstab: Converged after 2182 iterations, relative residua: iterated = 4.984836e-07, true = 7.308019e-07
Solution = 2.970304
Reconstructed: CUDA solution = 2.970304, CPU copy = 2.970303
Cuda Space Required
Spinor:0.1640625 GiB
Gauge :0 GiB
InvClover :0 GiB
QUDA_BICGSTAB_CLOVER_SOLVER: time=46.575336 s Performance=1648.30932630335 GFLOPS Total Time (incl. load gauge)=51.613748 s
QUDA_BICGSTAB_CLOVER_SOLVER: 2182 iterations. Rsd = 1.274569e-06 Relative Rsd = 7.56500540505804e-07
Second solve: only 606 iterations, and QUDA and Chroma disagree:
BiCGstab: 606 iterations, r2 = 2.035114e-13
BiCGstab: Reliable updates = 10
BiCGstab: Converged after 606 iterations, relative residua: iterated = 4.677403e-07, true = 4.998825e-07
Solution = 1.833076
Reconstructed: CUDA solution = 1.833076, CPU copy = 1.833076
Cuda Space Required
Spinor:0.1640625 GiB
Gauge :0 GiB
InvClover :0 GiB
QUDA_BICGSTAB_CLOVER_SOLVER: time=5.873358 s Performance=3641.36232313031 GFLOPS Total Time (incl. load gauge)=6.105013 s
ERROR: QUDA Solver residuum is outside tolerance: QUDA resid=0.0460195576642859 Desired =5e-07 Max Tolerated = 5e-06
QUDA_BICGSTAB_CLOVER_SOLVER: 606 iterations. Rsd = 0.07754742 Relative Rsd = 0.0460195576642859
Similar behaviour was reported from Edge recently.