
Static structure factor S(q) #660

Merged
merged 106 commits into from Sep 20, 2021

Conversation

bdice
Member

@bdice bdice commented Sep 29, 2020

Description

A commonly desired quantity from simulations is S(q), the static structure factor. The recently introduced diffraction module (#596) is an ideal location for this feature. In my understanding, the structure factor can be calculated in two different ways: directly (which is expensive, O(N^2) for a system of N particles), or indirectly via a Fourier transform of a radial distribution function. I have heard there is significant controversy over the choice of method and the regimes in which each is correct. I hope to offer both methods in this pull request, as well as some clarity in the documentation about when each might be (in)appropriate for use.
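For concreteness, the direct method for an isotropic system reduces to the Debye scattering equation. Here is a minimal numpy sketch of that sum (illustrative only: the function name and signature are mine, not freud's API, and this ignores periodic boundaries):

```python
import numpy as np

def debye_structure_factor(positions, q_values):
    """Direct S(q) via the Debye scattering equation (no PBCs):

        S(q) = 1 + (1/N) * sum_{i != j} sin(q * r_ij) / (q * r_ij)

    O(N^2) in the number of particles.
    """
    positions = np.asarray(positions, dtype=float)
    N = len(positions)
    # All unique pair distances (i < j); each pair is counted once,
    # hence the factor of 2 below.
    diffs = positions[:, None, :] - positions[None, :, :]
    r = np.linalg.norm(diffs, axis=-1)[np.triu_indices(N, k=1)]
    q = np.asarray(q_values, dtype=float)
    # np.sinc(x) = sin(pi*x)/(pi*x), so sin(q*r)/(q*r) = np.sinc(q*r/pi).
    return 1.0 + (2.0 / N) * np.sum(np.sinc(np.outer(q, r) / np.pi), axis=1)
```

A convenient sanity check: for two particles a distance d apart, this reduces to S(q) = 1 + sin(qd)/(qd).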

The second thing to outline is the scope of this pull request. This is a first-pass, and will only support systems with a single species, and only one set of particle positions (i.e. points == query_points). This PR is intentionally held to a very narrow scope, to reduce the complexity of the initial implementation and to solidify testing requirements for the "base case" upon which further features may someday be added, such as:

  • multi-species structure factors (via points and query_points as well as hints on how to normalize the result correctly)
  • accumulation over multiple frames (using freud's standard reset=False approach -- but may require additional normalizations!)
  • particle form factors (e.g. via a secondary array of values)
  • bonded contributions e.g. from polymers (I think there might be another term for this, but not sure)

Motivation and Context

Discussed with @ramanishsingh and also desired for my own research.

The reason to implement this feature in freud (rather than point users to another existing package) is that it complements and can leverage the fast neighbor-finding and other features of freud, the feature itself can be implemented in parallelized C++, and it fits in the scope of colloidal-scale simulation analysis that freud emphasizes.

Resolves: #652

TODO (help welcome)

  • Consolidate implementation code from Cython into C++
  • Validate results approximately match against a known reference
  • Validate FFT-RDF method against direct method for q values greater than 4pi/L (or something like that) where they should agree
  • Write docs and seek expert knowledge (literature) about when each method is valid
  • Write tests to ensure behavior is not broken if/when new features are added

How Has This Been Tested?

I plan to compare this code against a few existing implementations to verify its accuracy.

Validate against:

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds or improves functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation improvement (updates to user guides, docstrings, or developer docs)

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • I have updated the documentation (if relevant).
  • I have added tests that cover my changes (if relevant).
  • All new and existing tests passed.
  • I have updated the credits.
  • I have updated the Changelog.

@bdice bdice requested a review from a team as a code owner September 29, 2020 22:05
@bdice bdice requested a review from vyasr September 29, 2020 22:05
@bdice bdice marked this pull request as draft September 29, 2020 22:06
@bdice bdice removed the request for review from vyasr September 29, 2020 22:06
@bdice bdice self-assigned this Sep 29, 2020
@bdice bdice added enhancement New feature or request diffraction labels Sep 29, 2020
@bdice bdice added this to the v2.4 milestone Sep 29, 2020
@bdice bdice marked this pull request as ready for review October 2, 2020 19:23
@vyasr vyasr mentioned this pull request Oct 6, 2020
@vyasr
Collaborator

vyasr commented Oct 7, 2020

Paraphrasing a discussion I had offline with @bdice, I'm a little confused about how the RDF-based calculation actually has better scaling behavior than the "direct" calculation. The RDF is also an O(N^2) calculation; it's typically faster only because people use short cutoffs to reduce the number of interactions. If you use the maximum cutoff that is currently being used, I would expect that in a uniformly distributed system the RDF would be faster by the ratio of the volume of the ball of radius L/2 to the volume of the box, since the RDF ignores neighbors outside that ball. Otherwise, the two should have the same scaling behavior. Before we finish this off I would like to see benchmarks to validate that, and to consider how much benefit we get from providing both implementations. Not saying it's not worthwhile, just want to be sure.
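For reference, the RDF route evaluates a Fourier-sine transform of g(r) rather than summing over pairs at each q. A rough numpy sketch of that integral (illustrative; the function name and trapezoidal quadrature are mine, while the PR's C++ code uses Simpson integration):

```python
import numpy as np

def sq_from_rdf(r, g_r, rho, q_values):
    """S(q) from a sampled RDF for an isotropic 3D system:

        S(q) = 1 + 4*pi*rho * int_0^{r_max} r^2 * (g(r) - 1) * sin(qr)/(qr) dr

    Truncating at r_max (e.g. L/2) assumes g(r) ~= 1 beyond the cutoff.
    """
    r = np.asarray(r, dtype=float)
    h = np.asarray(g_r, dtype=float) - 1.0  # total correlation function
    S = np.empty(len(q_values))
    for k, q in enumerate(q_values):
        # np.sinc(x) = sin(pi*x)/(pi*x), so this is sin(q*r)/(q*r).
        integrand = r**2 * h * np.sinc(q * r / np.pi)
        # Trapezoidal rule over the sampled bins.
        S[k] = 1.0 + 4.0 * np.pi * rho * np.sum(
            0.5 * (integrand[1:] + integrand[:-1]) * np.diff(r)
        )
    return S
```

The transform itself costs only O(n_bins * n_q); the O(N^2) part is building g(r), which is exactly the point about cutoffs above.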

@bdice
Member Author

bdice commented Oct 20, 2020

Following up on a discussion with @vyasr, here are some timings as of commit 3e9066b. This isn't quite enough data to assess scaling behavior, but it's what I could quickly gather.

| Time (s) | N=1000 | N=2000 | Notes |
| --- | --- | --- | --- |
| Python RDF | 0.361 | 0.369 | Uses freud (C++) for RDF; cost of Python FFT/integration is probably much greater. |
| Python Direct | 7.30 | 39.0 | |
| C++ RDF | 0.0056 | 0.0162 | |
| C++ Direct | 1.25 | 5.09 | |

I'm not sure what the breakdown of time spent looks like, but this addresses the general question of "is RDF faster" with a clear yes. However, based on this small amount of data, the scaling might be something like O(N^2) for both methods as @vyasr had hypothesized, since the RDF is long-ranged.
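With two sizes per method, a rough power-law exponent can be backed out from t ∝ N^p, i.e. p = log(t2/t1) / log(N2/N1). This is just arithmetic on the table above, not a proper scaling study:

```python
import math

def scaling_exponent(t1, t2, n1=1000, n2=2000):
    """Estimate p in t ~ N^p from two (N, time) samples."""
    return math.log(t2 / t1) / math.log(n2 / n1)

# Applied to the C++ timings in the table:
print(round(scaling_exponent(1.25, 5.09), 2))      # Direct: ~2.03, consistent with O(N^2)
print(round(scaling_exponent(0.0056, 0.0162), 2))  # RDF: ~1.53 over this (very narrow) range
```

Two points obviously cannot distinguish O(N^2) from, say, O(N^2) with a large constant offset, so this should be read as indicative only.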

@vyasr
Collaborator

vyasr commented Oct 20, 2020

I just took a brief look at the code and I don't have any immediate thoughts on why either method as implemented here would be so much faster than the other. My intuition would be that if you looked at the code before the loop over k in both accumulateDirect and accumulateRDF the performance would be comparable up to some small scaling factor; both RDF::compute and box.computeAllDistances are O(N^2) calculations, with the primary difference that RDF::compute has a larger prefactor because it also performs binning of the distances (meaning that the RDF code prior to the k loop should be slower). AFAICT the code in the k loop is also nearly identical; they're two nested for loops (with the inner loop hidden either by std::for_each or util::simpson_integrate) over the exact same variables. There must be some operation in the direct case that's more expensive, but it's not apparent to me what it is.

This seems like a fun problem to dig into, but unfortunately not one I have time for at the moment. I'll keep an eye on this though and maybe I'll get a chance before you finish up the PR :)

@bdice
Member Author

bdice commented Sep 19, 2021

@DomFijan You refactored the calculation to include all point-point distances in commit 03dd664, including those greater than the largest r_max = L_min/2 permitted by the box. Can you comment on how you justified that change? Theory? Literature? Matching to other expected results? We should add a comment explaining that in the C++ code and/or Python docstrings. That's the only comment I have left on this PR and it can be discussed after merging, so I'm going to review #820 and then merge these PRs tomorrow.

bdice added a commit that referenced this pull request Sep 19, 2021
@DomFijan
Contributor

DomFijan commented Sep 19, 2021

@DomFijan You refactored the calculation to include all point-point distances in commit 03dd664, including those greater than the largest r_max = L_min/2 permitted by the box. Can you comment on how you justified that change? Theory? Literature? Matching to other expected results? We should add a comment explaining that in the C++ code and/or Python docstrings. That's the only comment I have left on this PR and it can be discussed after merging, so I'm going to review #820 and then merge these PRs tomorrow.

The results from the calculation weren't matching the "direct" implementation results, which looked very sane and matched Dynasor. Once I discovered that ASE implements Debye formula scattering, I made a test that directly compares the two Debye implementations. They weren't matching as one would expect, but ASE was matching the "direct" implementation pretty well (not perfectly, because of low k-values). So the first thing I thought of was to revert the change we made a couple of months ago to go from the whole system to a ball query (if memory serves me correctly). And that fixed it. Note that ASE calculates the scattering WITHOUT PBCs exclusively, as far as I could figure out, while our implementation allows PBCs if the box has PBCs turned on. Turning on PBCs gives worse results at low k-values than without PBCs (that would make sense), but the results are very similar in general.

As for why this is correct physically or mathematically, and why the ball query doesn't seem to work well, I can only speculate. The number of distance pairs gained when going from an L/2 ball query to the whole system is considerable (the same order of magnitude; I did not expect this), and it is almost exclusively mid to low k-values that are affected. It could be that the nature of the sinc function makes the problem of undersampling worse (I think this is the most likely explanation, given how the Debye method compares to the "direct" method and how the Debye results with and without PBCs look comparatively).
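The pair-count observation is easy to reproduce with a toy configuration (random points in an open, non-periodic box; purely illustrative, not the freud implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
L, N = 10.0, 200
pos = rng.uniform(0.0, L, size=(N, 3))

# All unique pair distances in the open (non-periodic) box.
d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
r_all = d[np.triu_indices(N, k=1)]

# The subset an r_max = L/2 ball query would keep.
r_ball = r_all[r_all < L / 2.0]

# The dropped pairs all sit at r >= L/2, i.e. in the slowly decaying
# sinc tail that contributes most at low q.
print(len(r_all), len(r_ball))
```

With uniform random points, the ball query typically retains only a fraction of the pairs, yet both counts are the same order of magnitude, matching the observation above.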

Do you think we should state in the docs that our implementation is supposed to match ASE's? On another note, I believe a rework of ASE's Debye implementation is in the works on their GitLab.

Comment on lines +125 to +126
Note that box is allowed to change when calculating trajectory
average static structure factor.
Member Author

@bdice bdice Sep 20, 2021


I'm unsure about this comment:

Note that box is allowed to change when calculating trajectory average static structure factor.

I'm leaving it in, but I think it's a little awkward. We discuss accumulating and resetting, but usually not "trajectory average." The accumulation/reset doesn't necessarily have to be over an MD trajectory (it could be independent replicas, or Monte Carlo, which isn't really a trajectory, etc.). It also doesn't fit with the rest of freud. I'm still not sure whether changing boxes produce a valid result: we know that NVT (constant box) averages are fine, but I can't say for sure whether NPT averages, for instance, would be valid, especially without periodic boundaries.

Contributor


I agree that this doesn't seem to fit the general language of freud. I will change this in the new PR.

I can't see why NPT simulations would produce incorrect results, especially in this implementation, where there is no underlying k-space grid: distances are histogrammed directly, which (aside from applying PBCs correctly) shouldn't care about box properties. The box properties are supposed to fluctuate around some equilibrium value anyway if a proper ensemble is sampled, so this shouldn't be a problem. Is there anything that bothers you in particular here? We could also test this by calculating S(q) for a system at the same thermodynamic state point using NPT and NVT and comparing. But if proper equilibrium is reached in both, there is no way the results would differ; that would be equivalent to some other structural property differing between NPT and NVT simulations under the same conditions, which shouldn't be the case(?).

As for calculations without PBCs, these probably don't make much sense, aside from possibly having a HUGE experimental sample of a couple million or more particles where one would want to calculate scattering numerically from the particle positions.

Member Author


I guess I don't have specific objections aside from inconsistent language. I know our accumulation code for RDF and several other classes has to normalize by the box volume, which means we require a constant box volume for correct results (we could improve this behavior by tracking more info about each frame's box, and perhaps reducing data over threads for each frame?).

Contributor

@DomFijan DomFijan Sep 20, 2021


I see. That makes sense in those cases. S(k) doesn't care about volume (right?), so it shouldn't be an issue here.

As for other functions, perhaps a good approach would be to track whether the box changes in the compute function, and reduce only when a box change occurs. Would such an approach make more sense? Then the box wouldn't have to be stored for each step, and reduction would happen only when it is actually needed.
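That bookkeeping could look something like the following (a hypothetical sketch of the idea, not freud's internals; the class and method names are mine):

```python
class BoxChangeAccumulator:
    """Accumulate per-frame values, reducing only when the box changes.

    Per-frame data is buffered and folded into the running totals
    whenever a new box is seen, so the box does not have to be stored
    for every frame and reduction happens only when actually needed.
    """

    def __init__(self):
        self._box = None
        self._pending = []   # frame values computed under the current box
        self._total = 0.0
        self._frames = 0

    def compute(self, box, frame_value):
        if self._box is not None and box != self._box:
            self._reduce()   # box changed: fold buffered frames in now
        self._box = box
        self._pending.append(frame_value)

    def _reduce(self):
        self._total += sum(self._pending)
        self._frames += len(self._pending)
        self._pending = []

    @property
    def average(self):
        self._reduce()
        return self._total / self._frames
```

For example, feeding two frames under one box and one frame under a slightly changed box yields the mean of all three values while triggering only a single mid-stream reduction.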

Member Author


Yes, I think this might be fine since it doesn't normalize by box volumes. I think it is possible to make changing boxes work for RDF and other functions using a strategy like you describe, but I think it would require careful attention. See also #616 for previous conversations on this topic.

@bdice
Member Author

bdice commented Sep 20, 2021

@DomFijan Thanks for the info in your comment above. Your explanation and extensive testing are very helpful. I don't think we need to do anything else.

I made a final pass and fixed a bunch of small things. I'll merge once tests pass. Then we can move on to #820, and then finalize the example notebooks.

@bdice bdice merged commit 9b4db2b into master Sep 20, 2021
@bdice bdice deleted the feature/structure-factor branch September 20, 2021 03:28
@tommy-waltmann
Collaborator

Why was this PR merged before getting an approving review?

@bdice
Member Author

bdice commented Sep 20, 2021

@tommy-waltmann I have been discussing with @DomFijan via Slack in the #structure-factor channel and we agreed to merge this and make further improvements in a new PR. I feel it meets the quality standards of the package and left comments above to indicate my approval of @DomFijan’s hard work, but I was not able to “approve” the PR since I was the initial author. Your suggestions for improvement would be very welcome, and we can apply them in a new PR. This just needed to move forward (after a year of development) so that we can finalize #820. This feature is also “unstable” (as is DiffractionPattern) so we can adapt its API and behavior freely.

@tommy-waltmann tommy-waltmann added this to the v2.7.0 milestone Oct 1, 2021
@tommy-waltmann tommy-waltmann mentioned this pull request Nov 2, 2021
Successfully merging this pull request may close these issues.

1-d structure factor calculation