Fix disjoint set size in EnvironmentMotifMatch #979

Merged
merged 7 commits into from Jul 4, 2022

Conversation

vyasr
Collaborator

@vyasr vyasr commented Jun 21, 2022

Description

EnvironmentMotifMatch constructs an environment for each particle and matches it to the environment of a reference particle (the motif). The size of the local environment for each particle may, however, be larger than the motif size, since any subset of that environment may be sufficient to match the motif. Therefore, the m_max_num_neigh parameter cannot be set to the size of the motif, but must instead be computed dynamically from the NeighborList.

The underlying bug looks like it has always been present, but prior to freud 2.0 it was much harder to encounter because it would require a user to manually construct a NeighborList and then pass a value of k (the number of neighbors) to the constructor of the MatchEnv object that did not match the one used for constructing the NeighborList. The default-constructed NeighborList within MatchEnv would always match the value of k correctly. With freud 2.0, it became much easier for users to specify alternative neighbor specifications with the new query syntax, and the value of k was no longer part of the class definition. #489 introduced the specific change that made this error case easy to hit: it always sets EnvDisjointSet.m_max_num_neigh to the size of the motif, so any neighbor specification resulting in particles with more neighbors in the NeighborList than the size of the motif triggers the error.
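
A minimal sketch of the sizing problem, in plain Python rather than freud's C++. The helper `max_neighbors_per_particle` and the flat list of per-bond query indices are hypothetical stand-ins, not freud API; the only point is that the per-particle capacity has to be derived from the neighbor list itself rather than from the motif size.

```python
from collections import Counter


def max_neighbors_per_particle(query_indices):
    """Return the largest number of bonds owned by any single particle.

    `query_indices` holds, for each bond in a neighbor list, the index of
    the particle the bond starts from (a hypothetical flat representation).
    """
    counts = Counter(query_indices)
    return max(counts.values(), default=0)


# Toy neighbor list: particle 0 has 6 neighbors, particle 1 has 4.
query_indices = [0] * 6 + [1] * 4
motif_size = 4

# Storage sized by motif_size would be too small for particle 0;
# storage sized from the neighbor list is large enough for every particle.
capacity = max_neighbors_per_particle(query_indices)
print(motif_size, capacity)
```

With a capacity taken from the motif, particle 0 in this toy case would need two more slots than were allocated, which is the kind of out-of-bounds write that can show up as a seg fault.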

Motivation and Context

Resolves: #633

How Has This Been Tested?

Both the original script in #633 (with the data included there) and the example documented by @Charlottez112 on #978 seg fault on my machine without this change, and both of them complete successfully once these changes are included.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds or improves functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation improvement (updates to user guides, docstrings, or developer docs)

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • I have updated the documentation (if relevant).
  • I have added tests that cover my changes (if relevant).
  • All new and existing tests passed.
  • I have updated the credits.
  • I have updated the Changelog.

@vyasr vyasr self-assigned this Jun 21, 2022
@vyasr vyasr added the bug Something isn't working label Jun 21, 2022
@vyasr vyasr changed the title from "Fix/issue 633" to "Fix disjoint set size in EnvironmentMotifMatch" Jun 21, 2022
@vyasr vyasr mentioned this pull request Jun 21, 2022
7 tasks
@vyasr
Collaborator Author

vyasr commented Jun 21, 2022

@Charlottez112 please confirm that this fixes the bug that you observe.

@erteich when you added this feature to freud, did you ever test the scenario that I document in the PR description (passing a NeighborList as nlist to MatchEnv.matchMotif with more neighbors per point than the value specified as k to the constructor of MatchEnv)? That is a much more obvious user error than what is happening now, but assuming my understanding of the code is correct that should have seg faulted as well.

The other thing that we should probably double check is that everything is correctly handling the case where different particles have different numbers of neighbors. From a quick scan that seems to be OK, and I tested Charlotte's code with a few different values of r_max just to be sure, but it's probably worth validating that with some validation tests like we discussed on our call.

dj.m_max_num_neigh = motif_size;
auto counts = nlist.getCounts();
auto* begin = counts.get();
auto* end = begin + counts.size();
Collaborator Author

Unrelated, it would be nice to add begin and end methods to ManagedArray to support iterators. Completely out of scope for this PR, though.

@vyasr
Collaborator Author

vyasr commented Jun 21, 2022

> @erteich when you added this feature to freud, did you ever test the scenario that I document in the PR description (passing a NeighborList as nlist to MatchEnv.matchMotif with more neighbors per point than the value specified as k to the constructor of MatchEnv)? That is a much more obvious user error than what is happening now, but assuming my understanding of the code is correct that should have seg faulted as well.

I just verified that with freud 1.0.0 I can trigger a seg fault with this script:

import freud
import gsd.hoomd
import numpy as np

# load the file
frame = gsd.hoomd.open('anneal_run_1_FINAL.gsd')[-1]
# motif to match
motif_CP = np.array([[(0.56667), 0.0, 0.0],
                     [0.0, (0.56667), 0.0],
                     [-0.56667, 0.0, 0.0],
                     [0.0, -0.56667, 0.0]])

box = freud.box.Box.from_box(frame.configuration.box)
k = 4
match = freud.environment.MatchEnv(box, 0.65, k)

k2 = 6
nn = freud.locality.NearestNeighbors(box.Lx/3, k2)
nn.compute(box, frame.particles.position, frame.particles.position)
A = match.matchMotif(frame.particles.position, motif_CP, 0.40, registration=False, nlist=nn.nlist)

where anneal_run_1_FINAL.gsd is the file contained in the zip archive @willzygmunt posted on #633. Setting k2 > k triggers the same seg fault that is observed now. The new querying syntax exposes this issue much more obviously than before, though.

@Charlottez112
Contributor

Charlottez112 commented Jun 21, 2022

> @Charlottez112 please confirm that this fixes the bug that you observe.

Yes this fixes the seg fault!

> The other thing that we should probably double check is that everything is correctly handling the case where different particles have different numbers of neighbors. From a quick scan that seems to be OK, and I tested Charlotte's code with a few different values of r_max just to be sure, but it's probably worth validating that with some validation tests like we discussed on our call.

I think you're right that the underlying issue is the mismatch between the number of particles in the local environment computed by MotifMatch and the number of 0 vectors in the cluster provided.

In my code, the env_cluster function found 26 environments with 11, 12, and 13 coordination numbers. When I changed num_neighbors to 13, it was able to find matches for the clusters that have 13 non-zero vectors. But if I decreased num_neighbors to 11 or 12, it wasn't able to find any match.

See code:

import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
import freud

# Generate a perfect gamma-Brass crystal using its conventional unit cell
a = 2
unit_cell = [a, a, a, 0, 0, 0]
basis_positions = [[-0.67226005, -0.67226005, -0.67226005],
                   [-0.67226005, -0.32773998, -0.32773998],
                   [-0.32773998, -0.32773998, -0.67226005],
                   [-0.32773998, -0.67226005, -0.32773998],
                   [-0.17225999, -0.17225999, -0.17225999],
                   [-0.17225999, -0.82774   , -0.82774   ],
                   [-0.82774   , -0.82774   , -0.17225999],
                   [-0.82774   , -0.17225999, -0.82774   ],
                   [-0.89219   , -0.89219   , -0.89219   ],
                   [-0.89219   , -0.10780999, -0.10780999],
                   [-0.10780999, -0.10780999, -0.89219   ],
                   [-0.10780999, -0.89219   , -0.10780999],
                   [-0.39218998, -0.39218998, -0.39218998],
                   [-0.39218998, -0.60781   , -0.60781   ],
                   [-0.60781   , -0.60781   , -0.39218998],
                   [-0.60781   , -0.39218998, -0.60781   ],
                   [-0.64421   , -1.        , -1.        ],
                   [-1.        , -0.35579   , -1.        ],
                   [-0.35579   , -1.        , -1.        ],
                   [-1.        , -0.64421   , -1.        ],
                   [-1.        , -1.        , -0.35579   ],
                   [-1.        , -1.        , -0.64421   ],
                   [-0.14420998, -0.5       , -0.5       ],
                   [-0.5       , -0.85579   , -0.5       ],
                   [-0.85579   , -0.5       , -0.5       ],
                   [-0.5       , -0.14420998, -0.5       ],
                   [-0.5       , -0.5       , -0.85579   ],
                   [-0.5       , -0.5       , -0.14420998],
                   [-0.68844   , -0.68844   , -0.96326005],
                   [-0.68844   , -0.31155998, -0.03673998],
                   [-0.31155998, -0.31155998, -0.96326005],
                   [-0.31155998, -0.68844   , -0.03673998],
                   [-0.96326005, -0.68844   , -0.68844   ],
                   [-0.03673998, -0.68844   , -0.31155998],
                   [-0.96326005, -0.31155998, -0.31155998],
                   [-0.03673998, -0.31155998, -0.68844   ],
                   [-0.68844   , -0.03673998, -0.31155998],
                   [-0.68844   , -0.96326005, -0.68844   ],
                   [-0.31155998, -0.03673998, -0.68844   ],
                   [-0.31155998, -0.96326005, -0.31155998],
                   [-0.18844   , -0.18844   , -0.46326   ],
                   [-0.18844   , -0.81156003, -0.53674   ],
                   [-0.81156003, -0.81156003, -0.46326   ],
                   [-0.81156003, -0.18844   , -0.53674   ],
                   [-0.46326   , -0.18844   , -0.18844   ],
                   [-0.53674   , -0.18844   , -0.81156003],
                   [-0.46326   , -0.81156003, -0.81156003],
                   [-0.53674   , -0.81156003, -0.18844   ],
                   [-0.18844   , -0.53674   , -0.81156003],
                   [-0.18844   , -0.46326   , -0.18844   ],
                   [-0.81156003, -0.53674   , -0.18844   ],
                   [-0.81156003, -0.46326   , -0.81156003]]
uc = freud.data.UnitCell(box=unit_cell,
                         basis_positions=basis_positions)
n = 2
noise = 0.0
box, points = uc.generate_system(n, sigma_noise=noise)

first_well = 0.68
env_cluster = freud.environment.EnvironmentCluster()

threshold = 0.2 * first_well
neighbor = {'num_neighbors': 13}
# neighbor = {'r_max': 0.8}
env_neighbor = {'r_max': first_well}

env_cluster.compute((box, points), threshold=threshold,
                    neighbors=neighbor,
                    env_neighbors=env_neighbor,
                    registration=False,
                    global_search=True)
env_match = freud.environment.EnvironmentMotifMatch()
for cluster in env_cluster.cluster_environments:
    env_match.compute((box, points), cluster, threshold,
                      neighbor, registration=False)
    num_particles = 0
    for vector in cluster:
        if np.power(vector, 2).sum() > 0:
            num_particles += 1
    print(f'num particles in provided cluster: {num_particles}')
    print(f'    matches found: {np.sum(env_match.matches)}')

@vyasr
Collaborator Author

vyasr commented Jun 21, 2022

@Charlottez112 to verify that it's not a bug, you can try things like setting the threshold to some eps << 1 and see if that finds a match. My guess is that the problem isn't a bug in the code, just the fact that you have to tune the method very precisely when num_neighbors != motif_size. If you can't get matches with any combination of parameters (I assume @erteich will have a much better sense for the parametrization than I will) then I'd start worrying about another bug.

@vyasr vyasr marked this pull request as ready for review June 21, 2022 15:33
@vyasr vyasr requested a review from a team as a code owner June 21, 2022 15:34
@vyasr vyasr requested review from alacour and removed request for a team June 21, 2022 15:34
@Charlottez112
Contributor

Charlottez112 commented Jun 21, 2022

> @Charlottez112 to verify that it's not a bug

I think it's still sort of a bug, because all the clusters found by env_cluster have the shape (13, 3), but only the ones with coordination number = 13 contain no 0 vectors.
For the ones with coordination number = 11 and 12, after I deleted the 0 vectors in the cluster and used the new clusters for comparison, env_match was able to match all of them. I didn't need to increase the threshold (threshold = 0.136).

I feel like this is something EnvironmentMotifMatch should do under the hood. Or at least we need to add documentation to warn users about the 0 vectors.

@vyasr
Collaborator Author

vyasr commented Jun 21, 2022

> For the ones with coordination number = 11 and 12, after I deleted the 0 vectors in the cluster and used the new clusters for comparison, env_match was able to match all of them. I didn't need to increase the threshold (threshold = 0.136).

That's what I expected. I assume that if you lower the threshold you will also get matches without removing the 0 vectors since the 0 vectors probably just lower the similarity score (I don't recall the exact calculation). My inclination is that this is a problem that should be fixed by changing EnvironmentCluster to return a list of vectors rather than using a large numpy array with 0 vectors. The data is ragged (not every environment is the same size) so I don't think it makes sense to return the data in a rectangular array. I wouldn't consider this a bug, but a nice enhancement. I would definitely advocate putting this into a separate PR since it would be an API break and therefore something that needs to go into freud 3.0. I don't think we want EnvironmentMotifMatch to do this under the hood because depending on the input data you could actually have a point at the origin, right? So the current behavior of inserting zeros isn't necessarily unambiguous.
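
As a rough illustration of the ragged-output idea, here is a sketch that turns zero-padded environments into a list of variable-length arrays. `unpad_environments` is a hypothetical helper, not current or planned freud API, and the (n_env, max_size, 3) input layout is an assumption. It also inherits the ambiguity mentioned above: a genuine vector that is exactly zero would be dropped along with the padding.

```python
import numpy as np


def unpad_environments(padded):
    """Split a zero-padded (n_env, max_size, 3) array into a ragged list.

    Rows that are exactly (0, 0, 0) are treated as padding and removed.
    """
    padded = np.asarray(padded, dtype=float)
    ragged = []
    for env in padded:
        keep = np.any(env != 0, axis=1)  # True for rows that are not all zero
        ragged.append(env[keep])
    return ragged


# Two environments padded to 3 vectors each; the second has only 2 real ones.
padded = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
])
envs = unpad_environments(padded)
print([len(env) for env in envs])
```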

@Charlottez112
Contributor

> That's what I expected. I assume that if you lower the threshold you will also get matches without removing the 0 vectors since the 0 vectors probably just lower the similarity score (I don't recall the exact calculation).

I don't quite follow why lowering the threshold would do the trick? Do you mean increase the input threshold? I also tried that if that's what you suggested. I increased the threshold to 0.9 and got 0 matches for all clusters.

@vyasr
Collaborator Author

vyasr commented Jun 21, 2022

Sorry, yeah, I meant increasing. I'm just used to thinking about thresholds in the opposite way from how they're defined in the environment matching code. However, I just took a look at the code and I guess it really only affects the pairwise comparisons between vectors. I don't know how the code handles the case where there's a mismatched number of vectors. I don't remember if we really have any knobs to turn there, or if it really is up to the user to ensure that the number of neighbors match.

@erteich
Contributor

erteich commented Jun 22, 2022

Hi all,

@vyasr, thank you so much for tracking this down. To answer your question about how the code used to handle mismatched numbers of neighbors, I originally wrote it such that the motif against which we wanted to match had to contain the same number of neighbors as that used to construct the MatchEnv instance. So when a user would instantiate MatchEnv(rmax, k), then call MatchMotif with a certain motif, there is a line in MatchEnv.cc, in v0 and v1 of freud, that asserts that the number of reference points in the motif against which we want to match must match the value of k used to construct MatchEnv. For reference, that is line 751 of this MatchEnv.cc version, for example:

https://github.com/glotzerlab/freud/blob/v1.2.2/cpp/environment/MatchEnv.cc

So to make a long story short, I think the original spirit with which I wrote the code was fairly unambiguous, but not flexible. For example, if a user wanted to match multiple motifs with different numbers of neighbors, they would just create multiple instances of MatchEnv with different k values, each of which matched the number of neighbors for each of the motifs. Then when freud became a lot more flexible in terms of how users could pass neighbor lists to these methods, things got more ambiguous and complicated. At the very least for right now I think we should add something to the documentation recommending users create neighbor lists for the MatchMotif class that make particles have the same number of neighbors, and that this number of neighbors matches the number of points in the motif. That, at the very least, will make for an unambiguous comparison.

I agree that the padding of the environments with 0 vectors is completely unnecessary; to be honest I probably wrote it that way just because I didn't think of a better way. Those 0 vectors are not physical, and would not correlate with any real environment vector unless a particle was directly on top of another. Then there would literally be no distance between a particle and its nearest neighbor. Perhaps that can happen in some very strange simulation cases, but it seems exceedingly unlikely. More generally, I think that the API of this class should be slightly overhauled. I think it contains relics of the way I originally wrote it that now simply don't make sense with the new API of freud. Changing EnvironmentCluster to return a list of vectors, and removing that global_search flag are two things I can think of off the top of my head. I think that under the hood there might be other left-overs of how I wrote things that should be removed, but I would have to take a closer look to be sure. I would be happy to work on this as a separate branch to be incorporated into a future version of freud.

@erteich
Contributor

erteich commented Jun 22, 2022

And one other thing, @Charlottez112: if in your code you were to, say, use the following lines:

env_match = freud.environment.EnvironmentMotifMatch()
for cluster in env_cluster.cluster_environments:
    non_zero_vecs = np.dot(cluster, cluster.T).diagonal() != 0
    pared_cluster = cluster[non_zero_vecs]
    num_neigh = len(pared_cluster)
    neighbor = {'num_neighbors': num_neigh}
    env_match.compute((box, points), pared_cluster, threshold,
                      neighbor, registration=False)

Then I think that would be quite unambiguous. Then for each motif, with its N reference points (neighbors), you would scan through the simulation for similar motifs by looking at the N nearest neighbors of each particle in the simulation, and seeing if they match the motif. I also think nothing should be done under the hood, in agreement with @vyasr, and that this is not really a bug: this way, the search process is doing exactly what it says it is doing, and leaving it to the user to interpret. For example, perhaps a particle's environment is a nested set of polyhedra. Perhaps there is a tetrahedral inner shell, and a dodecahedral outer shell, as is the case in some complicated intermetallic phases. Then one could compare all 4-nearest neighbor environments to a tetrahedron motif, and all 24-nearest neighbor environments to a tetrahedron+dodecahedron motif. Those two tests might tell the user different pieces of information, if for example some environments had the inner shell but not the outer shell. I feel like my example got a little convoluted, but hopefully that provides some justification for the way the code was originally written! Does that make sense?

@Charlottez112
Contributor

> Then I think that would be quite unambiguous. Then for each motif, with its N reference points (neighbors), you would scan through the simulation for similar motifs by looking at the N nearest neighbors of each particle in the simulation, and seeing if they match the motif.

Yeah I did something similar and it worked indeed. I also agree with you guys that the 0 vectors are probably not something EnvironmentMotifMatch should handle. However, I do have some questions regarding your example and what I observed in my tests above:

  • In your example, it seems like the user should set the neighbor list of env_match to include the 4 nearest neighbors if the goal is to compare the tetrahedral shell, and set it to include the 24 nearest neighbors to compare against the template outer shell?

  • If the neighbor list includes the 24 nearest neighbors, and the inner shell of the query environment does not match the template cluster, this would still be considered a match right?

  • I feel like this example is different from the scenario in @vyasr's description of the PR. In this example, a different number of particles in the neighbor lists means totally different environments. For what Vyas described, I can only think of one case, where a particle is on the bond that connects 2 other particles, which means that particle need not be included in the neighbor list.
    Besides, it seems to me that right now, even though we can pass in a larger neighbor list and EnvironmentMotifMatch can construct a larger environment for the query particles, for the code to make a correct comparison the neighbor list still needs to contain exactly the same number of particles as the template cluster.
    In my test, the largest coordination number is 13, but only environments with 13 particles were matched. This is also true when the environments are compared with clusters with only non-zero vectors. In another test on the same system, I set neighbors to {'num_neighbors': 16} and threshold=0.6 to compare with clusters with non-zero vectors, but no matches were found.
    Therefore I think it makes sense to enforce that the number of neighbors is the same as the coordination number of the cluster, like what @erteich did originally.

@vyasr
Collaborator Author

vyasr commented Jun 23, 2022

@erteich thanks for pointing out how this worked in 1.x, that totally makes sense. These issues are definitely a case of me generalizing the code with my refactors for 2.0 without completely understanding how the different components were originally designed to work together. Now that I understand, I think that the changes I would propose are the following:

  • Make a follow-up PR in which EnvironmentMotifMatch validates the NeighborList. Specifically, there are two valid options:
    • The user provides a custom NeighborList, in which case all particles must have exactly the same number of neighbors as the motif. This can be checked (similarly to what is done in this PR) at the beginning of the code.
    • The user provides neighbors = {'num_neighbors': *}. The user may specify additional keys like exclude_ii or r_guess, but they may not specify r_min or r_max because that could lead to scenarios where the required number of neighbors cannot be found.
    • All other cases result in an immediate error. If a user wants to do a distance-based query like neighbors={'r_max': 3.0}, they must manually create the NeighborList (e.g. using freud.locality.LinkCell) and pass in neighbors=nlist because we have no way of validating that the correct number of neighbors will be found a priori.
  • For now, we can add a small check in EnvironmentMotifMatch (probably just in Python) that checks whether a particular motif contains any zeros. If it does, we raise a warning (but not an Exception) indicating that the user likely forgot to filter out the meaningless zeros generated by EnvironmentCluster.
  • We make a separate PR targeting freud 3.0 in which EnvironmentCluster returns a list of numpy arrays (1 per position) rather than a 2D numpy array with zeros. Then, we remove the warning from EnvironmentMotifMatch.
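
The NeighborList validation proposed in the first bullet could look roughly like the sketch below. Everything here is hypothetical (the function name `validate_neighbors`, passing per-particle counts separately, and the use of ValueError); it is not freud code. The proposal does not say whether num_neighbors must equal the motif size in the query-dict case, so the sketch does not enforce that.

```python
def validate_neighbors(neighbors, motif_size, neighbor_counts=None):
    """Raise ValueError unless the neighbor specification fits the motif.

    neighbors: either a query dict such as {'num_neighbors': 4}, or any
        other object standing in for a prebuilt neighbor list.
    neighbor_counts: per-particle neighbor counts, required when a
        prebuilt neighbor list is passed.
    """
    if isinstance(neighbors, dict):
        # Option 2: a num_neighbors query without distance cutoffs.
        if 'num_neighbors' not in neighbors:
            raise ValueError("query arguments must specify 'num_neighbors'")
        forbidden = {'r_min', 'r_max'}.intersection(neighbors)
        if forbidden:
            raise ValueError(f"keys not allowed: {sorted(forbidden)}")
        return

    # Option 1: a custom neighbor list with exactly motif_size neighbors
    # for every particle.
    if neighbor_counts is None:
        raise ValueError("neighbor counts are required for a custom list")
    if any(count != motif_size for count in neighbor_counts):
        raise ValueError("every particle must have motif_size neighbors")


# Accepted: a num_neighbors query with an extra, permitted key.
validate_neighbors({'num_neighbors': 4, 'exclude_ii': True}, motif_size=4)
```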

Does that sound reasonable to everyone?

@Charlottez112
Contributor

>   • For now, we can add a small check in EnvironmentMotifMatch (probably just in Python) that checks whether a particular motif contains any zeros. If it does, we raise a warning (but not an Exception) indicating that the user likely forgot to filter out the meaningless zeros generated by EnvironmentCluster.

What if we say in the documentation that the vectors in the cluster provided by the user should all point from the origin? Is that what EnvironmentCluster is doing right now?

> Does that sound reasonable to everyone?

Otherwise I have no more complaints. Thank you guys so much!

@tommy-waltmann
Collaborator

Great discussion on this so far, and I am on board with @vyasr's suggestions.

I would like to add that any proposed changes targeting freud v3.0 should have an issue created for them describing the necessary changes in detail (can refer to this discussion). Please also add the issue to the 3.0 milestone.

@erteich
Contributor

erteich commented Jun 23, 2022

Thanks to you both for all your work on this!

@vyasr, thank you for your suggestions on how to move forward with this cleanly. Just to clarify, the user either provides a number of neighbors as a query argument, or a custom neighbor list, and everything else results in an immediate error, right? I took a closer look at the code and realized I built in another sort of fail-safe mechanism to discourage the comparison of environments that have different numbers of vectors: in the isSimilar method that actually compares two environments and returns whether they have matched (are similar), I wrote it such that environments with different numbers of vectors automatically never match. My reasoning was that, for this comparison method, the goal is to build a 1-1 map between vectors of each environment. The 1-1 map is ill-defined if the domains are different sizes. That fail-safe can be seen for example in lines 274-279 here:

https://github.com/glotzerlab/freud/blob/master/cpp/environment/MatchEnv.cc

I apologize that this was never made explicit in the documentation, outside of the comments in the code itself. I would suggest that we add that to the documentation, both in the EnvironmentCluster class (meaning, environments of different numbers of vectors are never found "matching" with each other) and in the MotifMatch class (environments of different numbers of vectors than the motif against which we are querying will never "match" that motif).
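
To make the fail-safe concrete, here is a deliberately simplified sketch. It is not the isSimilar implementation in MatchEnv.cc: the function `is_similar`, the greedy pairing, and the plain Euclidean distance test are all assumptions for illustration (a greedy pairing can also miss a valid 1-1 map that a more careful search would find). The part that mirrors the discussion is the size check at the top.

```python
import math


def is_similar(env_a, env_b, threshold):
    """Return True if each vector in env_a pairs with a distinct vector in env_b.

    Environments with different numbers of vectors never match, because a
    one-to-one map between them cannot exist.
    """
    if len(env_a) != len(env_b):
        return False
    unused = list(env_b)
    for vec in env_a:
        # Greedily take the first unused partner within the threshold.
        partner = next((w for w in unused if math.dist(vec, w) < threshold), None)
        if partner is None:
            return False
        unused.remove(partner)
    return True


square = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
noisy = [(0.0, 1.02, 0.0), (0.98, 0.0, 0.0), (0.0, -1.0, 0.01), (-1.0, 0.0, 0.0)]
padded = noisy + [(0.0, 0.0, 0.0)]  # same environment plus one padding vector

same_size = is_similar(square, noisy, threshold=0.1)
with_padding = is_similar(square, padded, threshold=10.0)
print(same_size, with_padding)
```

In this sketch the padded environment fails to match no matter how large the threshold is, which is consistent with raising the threshold to 0.9 and still getting zero matches for the zero-padded clusters.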

Due to this fail-safe mechanism in the matching code, I think there could be another option for how to address our issue, instead of validating the neighbor list (which is definitely in keeping with the original version of the code as I wrote it). Instead, we could just throw a Warning every time the user calls MotifMatch, saying that only environments of the same number of vectors as the user-provided motif will be tested for "matching" against that motif. What do you think of that? The onus would be on the user to read that Warning and pay attention to it as they interpret their results, but the advantage would be greater flexibility for the user, I suppose. Say that the user really wanted to test if any close-packed environments were icosahedral around any particles: they could provide the 12 points of an icosahedron as the motif against which they want to match, and then some kind of rmax corresponding to the first well of the RDF as the query argument for building the neighbor list. Then all particles that don't even have 12 neighbors within that nearest neighbor shell characterized by rmax would just be returned as not similar in their environment to the icosahedron. That type of use case might crop up more often than I originally thought when I wrote the code and included that assertion that everyone had to have the same number of neighbors in MotifMatch. It would have one extra advantage, which would be that if the code ever becomes more flexible, including more "isSimilar" methods that compare point sets of different sizes against each other (possibly plugging into the ICP library or others), then the MotifMatch code could utilize them. What do you guys think? I am also very happy to include the neighbor list validation for now and remove it later if the code ever includes more "isSimilar" methods. I see advantages to both strategies.
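
A small sketch of the warn-instead-of-validate option, using the icosahedron case as the example. The helper `comparable_particles`, the warning text, and the choice of RuntimeWarning are hypothetical; the neighbor counts are made up to stand in for the result of an r_max query.

```python
import warnings


def comparable_particles(neighbor_counts, motif_size):
    """Return a per-particle mask of environments that can be compared.

    Particles whose neighbor count differs from the motif size are marked
    False, and a single warning is emitted if any such particle exists.
    """
    mask = [count == motif_size for count in neighbor_counts]
    skipped = mask.count(False)
    if skipped:
        warnings.warn(
            f"{skipped} of {len(mask)} environments do not have "
            f"{motif_size} vectors and cannot match the motif.",
            RuntimeWarning,
        )
    return mask


# Made-up neighbor counts from a distance-based query; 12-vector motif.
counts = [12, 11, 12, 13]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mask = comparable_particles(counts, motif_size=12)
print(mask, len(caught))
```

Particles marked False would simply be reported as not matching, so r_max queries stay usable and the user is told why some environments were never candidates.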

@Charlottez112, indeed, all environment vectors are defined with respect to the origin. EnvironmentCluster builds environments as sets of vectors pointing from each particle to its neighbors, so the vectors are relative to each particle, effectively meaning that they can be considered to be pointing from the origin (with the particle at the origin). Does that make sense? We can definitely include something in the documentation indicating that the motif provided in MotifMatch should be a set of vectors with respect to the origin, if you think that would be clearer!
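
For anyone building a motif by hand, a sketch of what "vectors with respect to the origin" means in practice. The helper `environment_vectors` is hypothetical, and the optional minimum-image wrap assumes a cubic periodic box of side `box_length`; freud's own box handling is more general than this.

```python
import numpy as np


def environment_vectors(center, neighbor_positions, box_length=None):
    """Return vectors pointing from `center` to each neighbor.

    If `box_length` is given, a cubic periodic box is assumed and the
    minimum-image convention is applied to each vector.
    """
    vectors = np.asarray(neighbor_positions, dtype=float) - np.asarray(center, dtype=float)
    if box_length is not None:
        # Wrap each component into [-box_length / 2, box_length / 2].
        vectors -= box_length * np.round(vectors / box_length)
    return vectors


# The first neighbor sits across the periodic boundary from the center.
center = [4.5, 0.0, 0.0]
neighbors = [[-4.5, 0.0, 0.0], [4.5, 1.0, 0.0]]
motif = environment_vectors(center, neighbors, box_length=10.0)
print(motif)
```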

To answer your previous questions about my example:

  • Yes, the user should set the neighbor list of env_match to include the 4 nearest neighbors if the goal is to compare the tetrahedral shell, and set it to include the 24 nearest neighbors if the goal is to compare both the inner and outer shell. The user would provide all 24 vectors (inner + outer) as the motif in MotifMatch.
  • If the neighbor list includes the 24 nearest neighbors, and the inner shell of the query environment does not match the template cluster, this would not be considered a match, depending on the threshold and how much the inner shell does or does not match the inner shell of the template 24-vector cluster.
  • I'm not completely sure exactly which tests you performed and described in your final bullet point, but hopefully my comments above, regarding the fact that the environment comparison method isSimilar (used both for Cluster and MotifMatch) reports that environments of different numbers of vectors do not match, will help to explain your results!

@Charlottez112
Contributor

@erteich I like your approach better. That way we can still use r_max.
Regarding the examples, I think it really just boils down to 1) the same number of neighbors doesn't necessarily guarantee the same environment, and 2) a different number of neighbors definitely means different environments, right?

@erteich
Contributor

erteich commented Jun 23, 2022

@Charlottez112 yes! that's a nice and succinct way of putting it. Based on the implementation of the code as it is now, that is correct.

@tommy-waltmann
Collaborator

tommy-waltmann commented Jun 27, 2022

@alacour was pinged for review here, but since he's not an in-house peep anymore, I wouldn't be surprised if he didn't review. I've had a look at it, and there are still a couple of things to do here:

  • raise appropriate warnings in EnvironmentMotifMatch for when the cluster size doesn't match the motif size, and when there are extra zeros in the cluster returned by EnvironmentCluster
  • add a test case which makes sure warnings are raised in the above cases
  • make an issue documenting the proposed API changes moving to v3 (Done, see Remove zero vectors from the clusters computed by EnvironmentCluster #981)

@Charlottez112
Contributor

PR #980 I created takes care of the doc. I will create another PR to add warnings in the code and test the warnings. I haven't thought this through though - should we raise a warning every time the number of particles in the query particle's env does not match the template motif? Should this also be implemented in EnvironmentCluster?

I assume the 3rd item in your list @tommy-waltmann refers to changing the output of EnvironmentCluster.compute to a list of arrays, which may or may not have the same length?

@tommy-waltmann
Collaborator

@Charlottez112 yes the third item is about changing output of compute, but I'm just asking for there to be an issue so we can keep track of it at this point. It's API breaking, so it will have to wait until later.

For different environment sizes, yes, I would warn every time in EnvironmentMotifMatch when the query particle's env does not have the same size as the input motif.

For warnings related to extra zeros in the output of EnvironmentCluster, I am on board with @vyasr's suggestion that we validate the motif given to EnvironmentMotifMatch, making sure it doesn't have any zero vectors in it.

I would not add any extra warnings in the EnvironmentCluster code for now, but we could document that the smaller environments get padded with zeros if their size is less than the size of the largest cluster.
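
To make the padding behavior concrete, here is a minimal sketch of how smaller environments end up with trailing zero vectors. It only illustrates the behavior described above; `pad_environments` is a hypothetical helper, not freud's implementation, and it uses plain lists of tuples rather than arrays.

```python
def pad_environments(environments):
    """Pad each list of 3D bond vectors with zero vectors to a common length.

    Hypothetical helper: the common length is the size of the largest
    environment, mirroring the padding described in the comment above.
    """
    largest = max((len(env) for env in environments), default=0)
    zero = (0.0, 0.0, 0.0)
    return [list(env) + [zero] * (largest - len(env)) for env in environments]


envs = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],  # three neighbors
    [(1.0, 0.0, 0.0)],                                      # one neighbor
]
padded = pad_environments(envs)
```

After padding, the second environment carries two zero vectors that do not correspond to real neighbors, which is why passing such an environment on as a motif is a problem.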

@vyasr
Collaborator Author

vyasr commented Jun 30, 2022

It'll be a couple more days before I can come back and finish up this PR, but I do plan to and will respond to other PRs/issues that @Charlottez112 or anyone else opens in the meantime when I can.

@tommy-waltmann
Collaborator

There isn't much left to do here. This PR could add the warnings for mismatched environment sizes, instead of a third PR as @Charlottez112 suggested earlier in the thread. @vyasr, someone else could do that if you won't have time for it for a while.

@vyasr
Collaborator Author

vyasr commented Jul 2, 2022

> For warnings related to extra zeros in the output of EnvironmentCluster, I am on board with @vyasr's suggestion that we validate the motif given to EnvironmentMotifMatch, making sure it doesn't have any zero vectors in it.

I've updated this PR to throw a warning whenever the motif contains the zero vector, and I've added a test for that case.
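
A rough sketch of what such a check and its test could look like. The function name `check_motif`, the warning class, and the message text are assumptions for illustration; the thread does not show the code that was actually merged.

```python
import warnings


def check_motif(motif, tol=0.0):
    """Warn if any vector in the motif is (numerically) the zero vector.

    Hypothetical helper; returns True when a zero vector was found.
    """
    has_zero = any(all(abs(c) <= tol for c in vec) for vec in motif)
    if has_zero:
        warnings.warn(
            "The motif contains a zero vector, which does not describe "
            "a real neighbor (assumed message).",
            RuntimeWarning,
        )
    return has_zero


def test_zero_vector_motif_warns():
    # A motif with one real bond vector and one zero vector.
    motif = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        check_motif(motif)
    assert any(issubclass(w.category, RuntimeWarning) for w in caught)
```

With a test runner such as pytest, the same assertion could be written with `pytest.warns(RuntimeWarning)`.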

> I would not add any extra warnings in the EnvironmentCluster code for now, but we could document that the smaller environments get padded with zeros if their size is less than the size of the largest cluster.

I agree, this is something that is just solved with documentation in EnvironmentCluster. The warning that I've added here should address the problem case where someone takes a cluster environment from EnvironmentCluster and tries to use it as a motif in EnvironmentMotifMatch.

> For different environment sizes, yes, I would warn every time in EnvironmentMotifMatch when the query particle's env does not have the same size as the input motif.

I skipped this for now because it's not easily doable with freud's data model. Since we decided not to restrict the user's flexibility in the neighbors argument (which I am on board with), the size of a particle's environment is not known in Python. The issue is that the NeighborList is not materialized until we get into C++ except in the case where the user provides a custom NeighborList as the neighbors argument. We need to implement some sort of C++ logging framework in order to be able to provide this warning. If someone feels strongly about this warning I would support looking into a good solution for logging (that would also help with @Charlottez112's request for progress meters in C++), but I don't think it's critical.
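
For the one case where the information is available in Python (the user supplies an explicit neighbor list), the check might look roughly like the sketch below. It assumes the neighbor list has already been reduced to a flat sequence holding the query-particle index of each bond; `warn_on_size_mismatch` and its arguments are hypothetical and not part of freud.

```python
import warnings
from collections import Counter


def warn_on_size_mismatch(query_indices, num_particles, motif_size):
    """Warn when any particle's environment size differs from the motif size.

    query_indices: query-particle index of each bond in an explicit
    neighbor list (assumed input format). Returns the mismatched indices.
    """
    counts = Counter(query_indices)
    mismatched = [
        i for i in range(num_particles) if counts.get(i, 0) != motif_size
    ]
    if mismatched:
        warnings.warn(
            f"{len(mismatched)} particle(s) have an environment size "
            f"different from the motif size ({motif_size}).",
            RuntimeWarning,
        )
    return mismatched
```

This does not help for the default path, where the neighbors are only computed in C++, which is why a C++ logging mechanism would be needed for full coverage.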

@vyasr
Collaborator Author

vyasr commented Jul 2, 2022

I think this is good to go whenever everyone is happy. We can follow up separately on the docs PR and anything else.

Contributor

@Charlottez112 Charlottez112 left a comment

I agree this is good to go. Thank you guys so much!

@vyasr vyasr merged commit f0b10ae into master Jul 4, 2022
@vyasr vyasr deleted the fix/issue_633 branch July 4, 2022 17:48
@erteich
Contributor

erteich commented Jul 5, 2022

I'm late, but this looks great to me. Thanks to you all!!

@tommy-waltmann tommy-waltmann added this to the v2.11.0 milestone Jul 14, 2022
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

Bug when trying to use environment.EnvironmentMotifMatch with a neighborlist
5 participants