-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
non-degenerate twisted mass dslash is broken #161
Comments
Can you use git bisect to work out where the bug happened? |
I'll do this. Actually, it looked strange for me why rec 12 for single precision is effected, might be just a side effect of some tricky problem: reconstruction stuff was not changed for quite a long period of time... |
This is the bad commit bd97f59 |
Looks like my bug then. I wonder if the DslashFaceBuffer functor properly implements the older dslashCuda function. That could be where the bug lies. I'll try to take a look tomorrow. Apologies Alexei. |
Some progress here. Different policies are used for different dslash types, in particular wilson, clover, staggered, improved staggered, twisted clover all use QUDA_DSLASH2 and domain wall, mobius and twisted mass all use QUDA_DSLASH. Switching twisted mass to use QUDA_DSLASH2 fixes the bug. So the bug looks like it is with the QudaDslashFaceBuffer policy. Justin this may also be the source of your problem with fused mobius. |
Ok, I spoke too soon. Changing policy doesn't fix the issue, but I've found that the issue has a lack of reproducibility to it. 8 / 18 reconstruct sometimes break as well. This is a nasty one. |
Now I'm more confused than ever. Switching back to the old dslashCuda2, instead of using the new dslash policy class doesn't fix the issue. Continuing to investigate. Justin, at this rate, I would suggest that you don't waste your time working on this until I work more what's going on. |
Some more data before I drop into bed. The errors always occur on the first or last time slice. Changing the volume affects the correctness. Given the lack of reproducibility, I should probably do another git bisect to double check where the error originates. |
Yuck. Another low-level error reaching up from QUDA's depths? |
Ok, line 217 in dslash_twisted_mass.cu introduced the bug for non-deg dslash: argument in->Volume() must be replaced by bulk_threads and in->GhostFace() must be changed by ghost_threads. Let me check. |
Mike, Alexei - thanks for tracking this down. It was a bug I introduced, and a rather obvious one at that. I was clearly in cut-and-paste mode when I introduced this bug. |
Does this bug explain why it worked on say 4^4 volume but broke on 8^4? Sent from my iPhone On Oct 24, 2014, at 8:49, "Justin Foley" <notifications@github.commailto:notifications@github.com> wrote: Mike, Alexei - thanks for tracking this down. It was a bug I introduced, and a rather obvious one at that. I was clearly in cut-and-paste mode when I introduced this bug. Reply to this email directly or view it on GitHubhttps://github.com//issues/161#issuecomment-60406522. This email message is for the sole use of the intended recipient(s) and may contain reply email and destroy all copies of the original message. |
Asymmetric preconditioning looks ok. now I get problems when choose QUDA_MATPC_EVEN_EVEN. But this is a seperate issue, that could appear in some early commits. |
Ok, Alexei's patch (6450fba) does fix this bug. It had the effect of doubling the number of threads running which was bound to do strange things. |
Do we need to file another issue then for QUDA_MATPC_EVEN_EVEN? |
yes, it should be a separate issue, I agree. At least we have a working non-deg dslash now for the asymmetic case. |
I also just pushed the same fix twisted-clover fermions, which also got the same copy and paste error. In fact it looks like twisted_clover had this issue for a long time, and could explain the segmentation faults Alex was getting on single GPU twisted clover a long time ago. (This issue was my copy and paste error not yours - I was the culprit who applied your policy to the twisted mass dslash.) |
this bug affected 2-flavor dslash (that is why one needs to correctly adjust a number of threads for this particular case). twisted-clover does not support non-degenerate doublet as far as I know, so I fear the problem you mentioned might still exist. |
It's also possible the twisted clover bug got fixed by fixing another bug. On the QUDA_MATPC_EVEN_EVEN issue, this problem only shoes up when solving a QUDA_MAT_SOLUTION and does not show up when doing QUDA_MATPC_SOLUTION. Also, the dslash_test passes. This suggests the problem is in the Dirac::reconstruct or Dirac::prepare functions. |
Now I have a bit of a break with the projects, I'll try to test the last I'm consideing writing the CPU counterpart. It takes time, but once done, Ciao, Alex El 24/10/2014, a las 19:25, mikeaclark notifications@github.com escribió: It's also possible the twisted clover bug got fixed by fixing another bug. On the QUDA_MATPC_EVEN_EVEN issue, this problem only shoes up when solving — |
I've pushed some changes for twisted-mass and twisted-clover, but it appears that twisted-clover is presently broken (dslash_test and invert_test are not working). But I'm confused, since there is no twisted-clover CPU counterpart, so how either test ever work? Do you have to force the clover term to be unit somehow to do the test instead of the default which is to construct the clover term on the device, but this doesn't change the CPU comparison which is against a dslash with no clover term. Anyway, the degenerate and non-degenerate twisted-mass dslash is now built in parallel for faster compilation time and I've added a flag "--flavor" for the dslash_test and invert_test for easy switching between the two. I think I've also fixed the twisted-clover bugs (subject to it getting the correct answer). There were two issues there (allocating texture objects for non-allocated fields, and an issue related to using 8-/12-reconstruction gauge fields when constructing the clover field. |
For testing I used the tmLQCD package, which is the only one implementation Ciao, Alex El 28/10/2014, a las 21:28, mikeaclark notifications@github.com escribió: I've pushed some changes for twisted-mass and twisted-clover, but it Anyway, the degenerate and non-degenerate twisted-mass dslash is now built I think I've also fixed the twisted-clover bugs (subject to it getting the — |
Ok, that's reassuring then that I can stop trying to work out if I broke it then! |
non-degenerate dslash operator is currently broken, also a single GPU version . In particular, the single GPU dslash test fails for single precision and reconstruction 12 (while all other options looked ok). Up to now, I noticed that this bug was absent in commit 3bfcdcc.
The text was updated successfully, but these errors were encountered: