Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-degenerate twisted mass dslash is broken #161

Closed
alexstrel opened this issue Oct 14, 2014 · 23 comments
Closed

non-degenerate twisted mass dslash is broken #161

alexstrel opened this issue Oct 14, 2014 · 23 comments
Labels
Milestone

Comments

@alexstrel
Copy link
Member

non-degenerate dslash operator is currently broken, also a single GPU version . In particular, the single GPU dslash test fails for single precision and reconstruction 12 (while all other options looked ok). Up to now, I noticed that this bug was absent in commit 3bfcdcc.

@alexstrel alexstrel added the bug label Oct 14, 2014
@maddyscientist
Copy link
Member

Can you use git bisect to work out where the bug happened?

@alexstrel
Copy link
Member Author

I'll do this. Actually, it looked strange for me why rec 12 for single precision is effected, might be just a side effect of some tricky problem: reconstruction stuff was not changed for quite a long period of time...

@alexstrel alexstrel added this to the QUDA 0.7.0 milestone Oct 14, 2014
@maddyscientist maddyscientist modified the milestone: QUDA 0.7.0 Oct 15, 2014
@maddyscientist
Copy link
Member

This is the bad commit bd97f59

@jpfoley
Copy link
Member

jpfoley commented Oct 24, 2014

Looks like my bug then. I wonder if the DslashFaceBuffer functor properly implements the older dslashCuda function. That could be where the bug lies. I'll try to take a look tomorrow. Apologies Alexei.

@maddyscientist
Copy link
Member

Some progress here. Different policies are used for different dslash types, in particular wilson, clover, staggered, improved staggered, twisted clover all use QUDA_DSLASH2 and domain wall, mobius and twisted mass all use QUDA_DSLASH. Switching twisted mass to use QUDA_DSLASH2 fixes the bug. So the bug looks like it is with the QudaDslashFaceBuffer policy.

Justin this may also be the source of your problem with fused mobius.

@maddyscientist
Copy link
Member

Ok, I spoke too soon. Changing policy doesn't fix the issue, but I've found that the issue has a lack of reproducibility to it. 8 / 18 reconstruct sometimes break as well. This is a nasty one.

@maddyscientist
Copy link
Member

Now I'm more confused than ever. Switching back to the old dslashCuda2, instead of using the new dslash policy class doesn't fix the issue. Continuing to investigate. Justin, at this rate, I would suggest that you don't waste your time working on this until I work more what's going on.

@maddyscientist
Copy link
Member

Some more data before I drop into bed. The errors always occur on the first or last time slice. Changing the volume affects the correctness. Given the lack of reproducibility, I should probably do another git bisect to double check where the error originates.

@jpfoley
Copy link
Member

jpfoley commented Oct 24, 2014

Yuck. Another low-level error reaching up from QUDA's depths?

@alexstrel
Copy link
Member Author

Ok, line 217 in dslash_twisted_mass.cu introduced the bug for non-deg dslash: argument in->Volume() must be replaced by bulk_threads and in->GhostFace() must be changed by ghost_threads. Let me check.

@jpfoley
Copy link
Member

jpfoley commented Oct 24, 2014

Mike, Alexei - thanks for tracking this down. It was a bug I introduced, and a rather obvious one at that. I was clearly in cut-and-paste mode when I introduced this bug.

@maddyscientist
Copy link
Member

Does this bug explain why it worked on say 4^4 volume but broke on 8^4?

Sent from my iPhone

On Oct 24, 2014, at 8:49, "Justin Foley" <notifications@github.commailto:notifications@github.com> wrote:

Mike, Alexei - thanks for tracking this down. It was a bug I introduced, and a rather obvious one at that. I was clearly in cut-and-paste mode when I introduced this bug.

Reply to this email directly or view it on GitHubhttps://github.com//issues/161#issuecomment-60406522.


This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by

reply email and destroy all copies of the original message.

@alexstrel
Copy link
Member Author

Asymmetric preconditioning looks ok. now I get problems when choose QUDA_MATPC_EVEN_EVEN. But this is a seperate issue, that could appear in some early commits.

@maddyscientist
Copy link
Member

Ok, Alexei's patch (6450fba) does fix this bug. It had the effect of doubling the number of threads running which was bound to do strange things.

@maddyscientist
Copy link
Member

Do we need to file another issue then for QUDA_MATPC_EVEN_EVEN?

@alexstrel
Copy link
Member Author

yes, it should be a separate issue, I agree. At least we have a working non-deg dslash now for the asymmetic case.

@maddyscientist
Copy link
Member

I also just pushed the same fix twisted-clover fermions, which also got the same copy and paste error. In fact it looks like twisted_clover had this issue for a long time, and could explain the segmentation faults Alex was getting on single GPU twisted clover a long time ago.

(This issue was my copy and paste error not yours - I was the culprit who applied your policy to the twisted mass dslash.)

@alexstrel
Copy link
Member Author

this bug affected 2-flavor dslash (that is why one needs to correctly adjust a number of threads for this particular case). twisted-clover does not support non-degenerate doublet as far as I know, so I fear the problem you mentioned might still exist.

@maddyscientist
Copy link
Member

It's also possible the twisted clover bug got fixed by fixing another bug.

On the QUDA_MATPC_EVEN_EVEN issue, this problem only shoes up when solving a QUDA_MAT_SOLUTION and does not show up when doing QUDA_MATPC_SOLUTION. Also, the dslash_test passes. This suggests the problem is in the Dirac::reconstruct or Dirac::prepare functions.

@AlexVaq
Copy link
Member

AlexVaq commented Oct 28, 2014

Now I have a bit of a break with the projects, I'll try to test the last
release for correctness with twisted-clover and let you know what happens.

I'm consideing writing the CPU counterpart. It takes time, but once done,
testing is much easier...

Ciao,

Alex

El 24/10/2014, a las 19:25, mikeaclark notifications@github.com escribió:

It's also possible the twisted clover bug got fixed by fixing another bug.

On the QUDA_MATPC_EVEN_EVEN issue, this problem only shoes up when solving
a QUDA_MAT_SOLUTION and does not show up when doing QUDA_MATPC_SOLUTION.
Also, the dslash_test passes. This suggests the problem is in the
Dirac::reconstruct or Dirac::prepare functions.


Reply to this email directly or view it on GitHub
#161 (comment).

@maddyscientist
Copy link
Member

I've pushed some changes for twisted-mass and twisted-clover, but it appears that twisted-clover is presently broken (dslash_test and invert_test are not working). But I'm confused, since there is no twisted-clover CPU counterpart, so how either test ever work? Do you have to force the clover term to be unit somehow to do the test instead of the default which is to construct the clover term on the device, but this doesn't change the CPU comparison which is against a dslash with no clover term.

Anyway, the degenerate and non-degenerate twisted-mass dslash is now built in parallel for faster compilation time and I've added a flag "--flavor" for the dslash_test and invert_test for easy switching between the two.

I think I've also fixed the twisted-clover bugs (subject to it getting the correct answer). There were two issues there (allocating texture objects for non-allocated fields, and an issue related to using 8-/12-reconstruction gauge fields when constructing the clover field.

@AlexVaq
Copy link
Member

AlexVaq commented Oct 28, 2014

For testing I used the tmLQCD package, which is the only one implementation
of twisted-clover on CPUs, that I know. So the test was kind of useless. I
mean, I used it as intermediate step, because if the clover coeficcient was
small enough, twisted-clover should be like twisted mass; and for mu=0 you
should get clover. But no, there is no test in QUDA for twisted-clover, I
was using tmLQCD, which was a pain.

Ciao,

Alex

El 28/10/2014, a las 21:28, mikeaclark notifications@github.com escribió:

I've pushed some changes for twisted-mass and twisted-clover, but it
appears that twisted-clover is presently broken (dslash_test and
invert_test are not working). But I'm confused, since there is no
twisted-clover CPU counterpart, so how either test ever work? Do you have
to force the clover term to be unit somehow to do the test instead of the
default which is to construct the clover term on the device, but this
doesn't change the CPU comparison which is against a dslash with no clover
term.

Anyway, the degenerate and non-degenerate twisted-mass dslash is now built
in parallel for faster compilation time and I've added a flag "--flavor"
for the dslash_test and invert_test for easy switching between the two.

I think I've also fixed the twisted-clover bugs (subject to it getting the
correct answer). There were two issues there (allocating texture objects
for non-allocated fields, and an issue related to using
8-/12-reconstruction gauge fields when constructing the clover field.


Reply to this email directly or view it on GitHub
#161 (comment).

@maddyscientist
Copy link
Member

Ok, that's reassuring then that I can stop trying to work out if I broke it then!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants