New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix disjoint set size in EnvironmentMotifMatch #979
Conversation
@Charlottez112 please confirm that this fixes the bug that you observe. @erteich when you added this feature to freud, did you ever test the scenario that I document in the PR description (passing a The other thing that we should probably double check is that everything is correctly handling the case where different particles have different numbers of neighbors. From a quick scan that seems to be OK, and I tested Charlotte's code with a few different values of |
dj.m_max_num_neigh = motif_size; | ||
auto counts = nlist.getCounts(); | ||
auto* begin = counts.get(); | ||
auto* end = begin + counts.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated, it would be nice to add begin
and end
methods to ManagedArray
to support iterators. Completely out of scope for this PR, though.
I just verified that with freud 1.0.0 I can trigger a seg fault with this script:
where |
Yes this fixes the seg fault!
I think you're right that the underlying issue is the mismatch between the number of particles in the local environment computed by In my code, the See code:
|
@Charlottez112 to verify that it's not a bug, you can try things like setting the threshold to some eps << 1 and see if that finds a match. My guess is that the problem isn't a bug in the code, just the fact that you have to tune the method very precisely when |
I think it's still sort of a bug. Because all the clusters found by I feel like this is something |
That's what I expected. I assume that if you lower the threshold you will also get matches without removing the 0 vectors since the 0 vectors probably just lower the similarity score (I don't recall the exact calculation). My inclination is that this is a problem that should be fixed by changing |
I don't quite follow why lowering the threshold would do the trick? Do you mean increase the input |
Sorry yeah I meant increasing, I'm just used to thinking about thresholds in the opposite way that it's defined in the environment matching code. However, I just took a look at the code and I guess it really only affects the pairwise comparisons between vectors. I don't know how the code handles the case where there's a mismatched number of vectors. I don't remember if we really have any knobs to turn there, or if it really is up to the user to ensure that the number of neighbors match. |
Hi all, @vyasr , thank you so much for tracking this down. To answer your question about how the code used to handle mismatched numbers of neighbors, I originally wrote it such that the motif against which we wanted to match had to contain the same number of neighbors as that used to construct the MatchEnv instance. So when a user would instantiate MatchEnv(rmax, k), then call MatchMotif with a certain motif, there is a line in MatchEnv.cc, in v0 and v1 of freud, that asserts that the number of reference points in the motif against which we want to match must match the value of k used to construct MatchEnv. For reference, that is line 751 of this MatchEnv.cc version, for example: https://github.com/glotzerlab/freud/blob/v1.2.2/cpp/environment/MatchEnv.cc So to make a long story short, I think the original spirit with which I wrote the code was fairly unambiguous, but not flexible- for example, if a user wanted to match multiple motifs with different numbers of neighbors, they would just create multiple instances of MatchEnv with different k values, each of which matched the number of neighbors for each of the motifs. Then when freud became a lot more flexible in terms of how users could pass neighbor lists to these methods, things got more ambiguous and complicated. At the very least for right now I think we should add something to the documentation recommending users create neighbor lists for the MatchMotif class that make particles have the same number of neighbors, and that this number of neighbors matches the number of points in the motif. That, at the very least, will make for an unambiguous comparison. I agree that the padding of the environments with 0 vectors is completely unnecessary; to be honest I probably wrote it that way just because I didn't think of a better way. Those 0 vectors are not physical, and would not correlate with any real environment vector unless a particle was directly on top of another. Then there would literally be no distance between a particle and its nearest neighbor. Perhaps that can happen in some very strange simulation cases, but it seems exceedingly unlikely. More generally, I think that the API of this class should be slightly overhauled- I think it contains relics of the way I originally wrote it that now simply don't make sense with the new API of freud. Changing EnvironmentCluster to return a list of vectors, and removing that global_search flag are two things I can think of off the top of my head. I think that under the hood there might be other left-overs of how I wrote things that should be removed, but I would have to take a closer look to be sure. I would be happy to work on this as a separate branch to be incorporated into a future version of freud. |
And one other thing is @Charlottez112 , if in your code you were to, say, use the following lines:
Then I think that would be quite unambiguous. Then for each motif, with its N reference points (neighbors), you would scan through the simulation for similar motifs by looking at the N nearest neighbors of each particle in the simulation, and seeing if they match the motif. I also think nothing should be done under the hood, in agreement with @vyasr , and that this is not really a bug- this way, the search process is doing exactly what it says it is doing, and leaving it to the user to interpret. For example, perhaps a particle's environment is a nested set of polyhedra. Perhaps there is a tetrahedral inner shell, and a dodecahedral outer shell, as is the case in some complicated intermetallic phases. Then one could compare all 4-nearest neighbor environments to a tetrahedron motif, and all 24-nearest neighbor environments to a tetrahedron+dodecahedron motif. Those two tests might tell the user different pieces of information, if for example some environments had the inner shell but not the outer shell. I feel like my example got a little convoluted, but hopefully that provides some justification for the way the code was originally written! Does that make sense? |
Yeah I did something similar and it worked indeed. I also agree with you guys that the 0 vectors are probably not something
|
@erteich thanks for pointing out how this worked in 1.x, that totally makes sense. These issues are definitely a case of me generalizing the code with my refactors for 2.0 without completely understanding how the different components were originally designed to work together. Now that I understand, I think that the changes I would propose are the following:
Does that sound reasonable to everyone? |
What about we say in the document that the vectors in the cluster provided by the user should all point from the origin? Is it what
Otherwise I have no more complains. Thank you guys so much! |
Great discussion on this so far, and I am on board with @vyasr suggestions. I would like to add that any proposed changes targeting freud v3.0 should have an issue created for them describing the necessary changes in detail (can refer to this discussion). Please also add the issue to the 3.0 milestone. |
Thanks to you both for all your work on this! @vyasr , thank you for your suggestions on how to move forward with this cleanly. Just to clarify, the user either provides a number of neighbors as a query argument, or a custom neighbor list, and everything else results in an immediate error, right? I took a closer look at the code and realized I built in another sort of fail-safe mechanism to discourage the comparison of environments that have different numbers of vectors: In the isSimilar method that actually compares two environments and returns whether they have matched (are similar), I wrote it such that environments with different numbers of vectors automatically never match. My reasoning was that, for this comparison method, the goal is to build a 1-1 map between vectors of each environment. The 1-1 map is ill-defined if the domains are different sizes. That fail-safe can be seen for example in lines 274-279 here: https://github.com/glotzerlab/freud/blob/master/cpp/environment/MatchEnv.cc I apologize that this was never made explicit in the documentation, outside of the comments in the code itself. I would suggest that we add that to the documentation, both in the EnvironmentCluster class (meaning, environments of different numbers of vectors are never found "matching" with each other) and in the MotifMatch class (environments of different numbers of vectors than the motif against which we are querying will never "match" that motif). Due to this fail-safe mechanism in the matching code, I think there could be another option for how to address our issue, instead of validating the neighbor list (which is definitely in keeping with the original version of the code as I wrote it). Instead, we could just throw a Warning every time the user calls MotifMatch, saying that only environments of the same number of vectors as the user-provided motif will be tested for "matching" against that motif. What do you think of that? The onus would be on the user to read that Warning and pay attention to it as they interpret their results, but the advantage would be greater flexibility for the user, I suppose. Say that the user really wanted to test if any close-packed environments were icosahedral around any particles- they could provide the 12 points of an icosahedron as the motif against which they want to match, and then some kind of rmax corresponding to the first well of the RDF as the query argument for building the neighbor list. Then all particles that don't even have 12 neighbors within that nearest neighbor shell characterized by rmax would just be returned as not similar in their environment to the icosahedron. That type of use case might crop up more often than I originally thought when I wrote the code and included that assertion that everyone had to have the same number of neighbors in MotifMatch. It would have one extra advantage, which would be that if the code ever becomes more flexible, including more "isSimilar" methods that compare point sets of different sizes against each other (possibly plugging into the ICP library or others), then the MotifMatch code could utilize them. What do you guys think? I am also very happy to include the neighbor list validation for now and remove it later if the code ever includes more "isSimilar" methods.. I see advantages to both strategies. @Charlottez112 , indeed, all environment vectors are defined with respect to the origin. EnvironmentCluster builds environments as sets of vectors pointing from each particle to its neighbors, so the vectors are relative to each particle, effectively meaning that they can be considered to be pointing from the origin (with the particle at the origin). Does that make sense? We can definitely include something in the documentation indicating that the motif provided in MotifMatch should be a set of vectors with respect to the origin, if you think that would be clearer! To answer your previous questions about my example:
|
@erteich I like your approach better. That way we can still use |
@Charlottez112 yes! that's a nice and succinct way of putting it. Based on the implementation of the code as it is now, that is correct. |
@alacour was pinged for review here, but since he's not an in-house peep anymore, I wouldn't be surprised if he didn't review. I've had a look at it, and there are still a couple of things to do here:
|
PR #980 I created takes care of the doc. I will create another PR to add warnings in the code and test the warnings. I haven't thought this through though - should we raise a warning every time the number of particles in the query particle's env does not match the template motif? Should this also be implemented in I assume the 3rd item in your list @tommy-waltmann refers to changing the output of |
@Charlottez112 yes the third item is about changing output of For different environment sizes, yes, I would warn every time in For warnings related to extra zeros in the output of I would not add any extra warnings in the |
It'll be a couple more days before I can come back and finish up this PR, but I do plan to and will respond to other PRs/issues that @Charlottez112 or anyone else opens in the meantime when I can. |
There isn't much left to do here, we could have this PR add the warnings when the environment sizes don't match instead of making a third PR for it as @Charlottez112 suggested earlier in the thread. @Vyas someone else could do that if you won't have the time for it for a while. |
I've updated this PR to throw a warning whenever the motif contains the zero vector, and I've added a test for that case.
I agree, this is something that is just solved with documentation in EnvironmentCluster. The warning that I've added here should address the problem case where someone takes a cluster environment from EnvironmentCluster and tries to use it as a motif in EnvironmentMotifMatch.
I skipped this for now because it's not easily doable with freud's data model. Since we decided not to restrict the user's flexibility in the |
I think this is good to go whenever everyone is happy. We can follow up separately on the docs PR and anything else. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is good to go. Thank you guys so much!
I'm late, but this looks great to me- thanks to you all!! |
Description
EnvironmentMotifMatch constructs an environment for each particle and matches it to the environment of a reference particle (the motif). The size of the local environment for each particle may, however, be larger than the motif size, since any subset of that environment may be sufficient to match the motif. Therefore, the
m_max_num_neigh
parameter cannot be set to the size of the motif, but must instead be computed dynamically from the NeighborList.The underlying bug looks like it has always been present, but prior to freud 2.0 it was much harder to encounter because it would require a user to manually construct a NeighborList and then pass a value of
k
(the number of neighbors) to the constructor of theMatchEnv
object that did not match the one used for constructing theNeighborList
. The default constructedNeighborList
withinMatchEnv
would always match the value ofk
correctly. With freud 2.0, it became much easier for users to specify alternative neighbor specifications with the new query syntax, and the value ofk
was no longer part of the class definition. #489 introduced the specific error that made it easy to hit this error case by always setting the value ofEnvDisjointSet.m_max_num_neigh
to the size of the motif, so any neighbor specification resulting in particles with more neighbors in the NeighborList than the size of the motif would trigger this error.Motivation and Context
Resolves: #633
How Has This Been Tested?
Both the original script in #633 (with the data included there) and the example documented by @Charlottez112 on #978 seg fault on my machine without this change, and both of them complete successfully once these changes are included.
Types of changes
Checklist: