Static structure factor S(q) #660
Conversation
Paraphrasing a discussion I had offline with @bdice, I'm a little confused about how the RDF-based calculation actually has better scaling behavior than the "direct" calculation. The RDF is also an
Following up from discussion with @vyasr, here's some timing as of commit 3e9066b. This isn't quite enough data to assess "scaling behavior" but it's what I could quickly gather.
I'm not sure what the breakdown of time spent looks like, but this addresses the general question of "is RDF faster" with a clear yes. However, based on this small amount of data, the scaling might be something like O(N^2) for both methods, as @vyasr had hypothesized, since the RDF is long-ranged.
I just took a brief look at the code and I don't have any immediate thoughts on why either method as implemented here would be so much faster than the other. My intuition would be that if you looked at the code before the loop over k in both. This seems like a fun problem to dig into, but unfortunately not one I have time for at the moment. I'll keep an eye on this though and maybe I'll get a chance before you finish up the PR :)
@DomFijan You refactored the calculation to include all point-point distances in commit 03dd664, including those greater than the largest
The results from the calculation weren't matching the "direct" implementation results, which looked very sane and were matching Dynasor. Once I discovered that ASE implements Debye-formula scattering, I made a test that would directly compare the two Debye implementations. They weren't matching as one would expect, but ASE was matching the "direct" implementation pretty well (not perfectly, because of low k-values). So the first thing I thought of was to revert the change we made a couple of months ago to go from the whole system to a ball query (if memory serves me correctly). And that fixed it.

Note that ASE calculates the scattering WITHOUT PBCs exclusively, as far as I could figure out, while our implementation allows PBCs if the box has PBCs turned on. Turning on PBCs gives worse results at low k-values than without PBCs (that would make sense), but the results are very similar in general.

As for why this is correct physically or mathematically, and why the ball query doesn't seem to work well, I can only speculate. The number of distance pairs gained when going from an L/2 ball query to the whole system is considerable (same order of magnitude as the ball-query count; I did not expect this), and it is almost exclusively the mid to low k-values that get affected. It could be that the nature of the sinc function makes the problem of undersampling worse (I think this is the most likely explanation, given how Debye compares to the "direct" method and how Debye with and without PBCs look comparatively).

Do you think we should state in the docs that our implementation is supposed to match ASE's? On another note, I believe a rework of ASE's Debye implementation is in the works on their GitLab.
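To make the all-pairs change above concrete, here is a minimal sketch of the Debye scattering equation summed over every pair distance in the system (no L/2 ball-query cutoff). The function name and structure are illustrative only, not freud's API:

```python
import numpy as np

def debye_structure_factor(positions, k_values):
    """Debye scattering equation: S(k) = 1 + (2/N) * sum_{i<j} sin(k r_ij)/(k r_ij).

    Uses ALL pair distances, matching the all-pairs approach described
    above, rather than truncating pairs at an L/2 ball-query cutoff.
    O(N^2) in both time and memory as written.
    """
    n = len(positions)
    # All unique pair distances.
    diffs = positions[:, None, :] - positions[None, :, :]
    r_ij = np.linalg.norm(diffs, axis=-1)
    r = r_ij[np.triu_indices(n, k=1)]
    s_k = np.empty(len(k_values), dtype=float)
    for i, k in enumerate(k_values):
        # np.sinc(x) = sin(pi x)/(pi x), so sin(k r)/(k r) = np.sinc(k r / pi).
        s_k[i] = 1.0 + (2.0 / n) * np.sum(np.sinc(k * r / np.pi))
    return s_k
```

Dropping the cutoff means the slowly decaying sinc contributions from large r are kept, which is exactly the mid-to-low-k region the comment says is affected.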
Note that box is allowed to change when calculating trajectory average static structure factor.
I'm unsure about this comment:
Note that box is allowed to change when calculating trajectory average static structure factor.
I'm leaving it in but I think it's a little awkward. We discuss accumulating and resetting, but usually not "trajectory average." The accumulation/reset doesn't necessarily have to be over an MD trajectory (could be independent replicas, or Monte Carlo which isn't really a trajectory, etc.). It also doesn't fit with the rest of freud. I'm still not sure if changing boxes produce a valid result... we know that NVT (constant box) averages are fine but I can't say for sure if NPT averages, for instance, would be valid. Especially without periodic boundaries.
I agree that this doesn't seem to fit the general language of freud. I will change this in the new PR.
I can't see why NPT simulations would produce incorrect results, especially in this implementation, where there is no underlying k-space grid; you histogram distances directly, which (aside from applying PBCs correctly) shouldn't care about box properties. The box properties are supposed to fluctuate around some equilibrium value anyway if a proper ensemble is sampled, so this shouldn't be a problem. Is there anything that bothers you in particular here? We could also test this, by calculating S(q) for a system at the same thermodynamic state point using NPT and NVT and comparing. But if proper equilibrium is reached in both, there is no way the results would differ. That would be equivalent to some other structural property differing between NPT and NVT simulations under the same conditions, which shouldn't be the case(?).
As for calculations without PBCs, these probably don't make much sense, aside from possibly having a HUGE experimental sample of a couple of million or more particles, where one would want to calculate scattering numerically from the particle positions.
I guess I don’t have specific objections aside from the inconsistent language. I know our accumulation code for RDF and several other classes has to normalize by the box volume, which means we require a constant box volume for correct results (we could improve this behavior by tracking more info about each frame’s box, and perhaps reducing data over threads for each frame?).
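To illustrate why that RDF-style accumulation assumes a constant box: the usual normalization divides each distance histogram by an ideal-gas count that depends on the density N/V, so summing counts over frames with different volumes and normalizing once gives a biased g(r). A minimal sketch (my own illustration, not freud's internals):

```python
import numpy as np

def rdf_normalization(counts, bin_edges, n_particles, box_volume):
    """Normalize a raw pair-distance histogram into g(r).

    The ideal-gas reference count per bin is rho * shell_volume * N / 2,
    with rho = N / box_volume, so the result depends explicitly on the
    box volume. Accumulating counts over many frames and then applying
    this normalization once is only correct if the volume was constant.
    """
    rho = n_particles / box_volume
    shell_volumes = (4.0 / 3.0) * np.pi * (bin_edges[1:] ** 3 - bin_edges[:-1] ** 3)
    ideal_counts = rho * shell_volumes * n_particles / 2.0
    return counts / ideal_counts
```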
I see. That makes sense in those cases. S(k) doesn't care about volume (right?), so it shouldn't be an issue here.
As for the other functions, perhaps a good approach would be to track whether the box changes in the compute function and reduce only when a box change occurs? Would such an approach make more sense? Then the box wouldn't have to be stored for each step, and reduction would happen only when it is actually needed.
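The reduce-on-box-change idea could be sketched roughly like this. The class and method names here are hypothetical, and a scalar stands in for the real per-frame histogram, just to show the bookkeeping:

```python
class BoxAwareAccumulator:
    """Sketch of 'reduce only when the box changes': keep a running
    partial sum per box, and finalize each chunk with its own volume
    normalization whenever a new box appears. Hypothetical, not freud code.
    """

    def __init__(self):
        self._current_volume = None
        self._partial_sum = 0.0
        self._partial_frames = 0
        self._finalized = []  # (normalized_chunk, n_frames) pairs

    def compute(self, box_volume, frame_value):
        if self._current_volume is not None and box_volume != self._current_volume:
            self._reduce()  # box changed: flush the previous chunk
        self._current_volume = box_volume
        self._partial_sum += frame_value
        self._partial_frames += 1

    def _reduce(self):
        # Normalize this chunk by its own box volume before storing it,
        # so frames with different volumes never share one normalization.
        chunk = self._partial_sum / (self._current_volume * self._partial_frames)
        self._finalized.append((chunk, self._partial_frames))
        self._partial_sum = 0.0
        self._partial_frames = 0

    def result(self):
        self._reduce()
        total = sum(n for _, n in self._finalized)
        return sum(c * n for c, n in self._finalized) / total
```

The cost is that a reduction happens once per box change rather than once per frame, which is the saving described above when the box is constant.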
Yes, I think this might be fine since it doesn't normalize by box volumes. I think it is possible to make changing boxes work for RDF and other functions using a strategy like you describe, but I think it would require careful attention. See also #616 for previous conversations on this topic.
@DomFijan Thanks for the info in your comment above. Your explanation and extensive testing are very helpful. I don't think we need to do anything else. I made a final pass and fixed a bunch of small things. I'll merge once tests pass. Then we can move on to #820, and then finalize the example notebooks.
Why was this PR merged before getting an approving review?
@tommy-waltmann I have been discussing with @DomFijan via Slack in the
Description
A commonly desired quantity from simulations is S(q), the static structure factor. The recently introduced `diffraction` module (#596) is an ideal location for this feature. In my understanding, the structure factor can be calculated two different ways: directly (which is expensive, `O(N^2)` for a system of `N` particles), or indirectly via a Fourier transform of a radial distribution function. I have heard there is significant controversy over the choice of method and the regimes in which each is correct. I hope to offer both methods in this pull request, as well as some clarity in the documentation for when each might be (in)appropriate for use.

The second thing to outline is the scope of this pull request. This is a first pass, and will only support systems with a single species, and only one set of particle positions (i.e. `points == query_points`). This PR is intentionally held to a very narrow scope, to reduce the complexity of the initial implementation and to solidify testing requirements for the "base case" upon which further features may someday be added, such as:
- separate `points` and `query_points` (as well as hints on how to normalize the result correctly)
- averaging over multiple frames (`reset=False` approach -- but may require additional normalizations!)

Motivation and Context
Discussed with @ramanishsingh and also desired for my own research.
The reason to implement this feature in freud (rather than point users to another existing package) is that it complements and can leverage the fast neighbor-finding and other features of freud, the feature itself can be implemented in parallelized C++, and it fits in the scope of colloidal-scale simulation analysis that freud emphasizes.
Resolves: #652
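For reference, the "direct" route mentioned in the description can be sketched as a plain complex-exponential sum over particles for each k-vector. This is a minimal, unoptimized illustration under my own naming, not the PR's C++ implementation:

```python
import numpy as np

def direct_structure_factor(positions, k_vectors):
    """Direct calculation: S(k) = |sum_j exp(-i k . r_j)|^2 / N per k-vector.

    This exponential-sum form costs O(N * N_k) per frame; the O(N^2)
    cost referenced in the description appears when S(k) is instead
    built from all pair distances (Debye-style).
    """
    n = len(positions)
    # phases has shape (N_k, N): the dot product k . r_j for every pair.
    phases = k_vectors @ positions.T
    amplitudes = np.exp(-1j * phases).sum(axis=1)
    return (amplitudes * amplitudes.conj()).real / n
```

As a sanity check, S(k=0) always equals N, since every phase factor is 1.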
TODO (help welcome)
How Has This Been Tested?
I plan to compare this code against a few existing implementations to verify its accuracy.
Validate against: