Refactored graph matching code and added many new features #960

Merged: 98 commits, Aug 30, 2022

Conversation

bdpedigo
Collaborator

@bdpedigo bdpedigo commented Jun 24, 2022

  • Does this PR have a descriptive title that could go in our release notes?
  • Does this PR add any new dependencies?
    • No
  • Does this PR modify any existing APIs?
    • Is the change to the API backwards compatible?
      • Very much no
  • Have you built the documentation (reference and/or tutorial) and verified the generated documentation is appropriate?

Reference Issues/PRs

Fixes #959
Fixes #858
Closes #792
Closes #425
Closes #346

What does this implement/fix? Highlights

Design decisions

  • I opted to use a function rather than a class for the user-facing graph matching tools. This is because graph matching is not exactly a "model," or in other words I think people would be very unlikely to ever "fit" a permutation to one pair of graphs and then "predict" using the estimated permutation on another.
  • I tried to make the interface somewhat like https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html in the sense that I am now outputting two sets of indices, where the first corresponds to the input matrix A and the second to the input matrix B. When the two matrices are the same size, the first set of indices is basically worthless: it will always be just the sorted indices of A, or in other words np.arange(n). This is much like what happens with linear_sum_assignment in scipy. However, for differently sized matrices (in particular when A is bigger than B), not every node in A will get a match, so the first set of indices matters. See "Understanding output of Graph Matching for unequal sized matches" #925 for some discussion (it was confusing what happens with differently sized matrices).
  • I added some better logging/timing features because I often want to know what aspects of the algorithm are taking a long time when I'm matching large matrices.
  • I opted to add the transport functionality simply as extra parameters - this is mainly because all of the other parameters would be the same, and I felt it would be annoying to have a completely separate function to do this small adjustment to the matching algorithm.
    • The downside is that this adds several parameters to the algorithm which won't matter when transport=False. Tons of parameters can be lame. But I did try to alleviate this by labeling those parameters with the transport_ prefix.
  • I opted to only support the numpy.random.Generator syntax. This is because the RandomState syntax is considered legacy by numpy so I don't think any new code should bend over backwards to support it, IMO https://numpy.org/doc/stable/reference/random/legacy.html#legacy
  • I opted to use for loops for the multigraph cases for readability and compatibility with sparse arrays/matrices. There is a way to do similar operations with np.einsum but it is very hard to read and reason about, IMO.
  • I changed how the input initialization is specified because I did not like having a single parameter that could be a string, array, or float (which was tough to reason about internally).
  • Rather than shuffle the entire input matrix to begin with as we did in the past, I simply shuffle/unshuffle every time linear_sum_assignment is called, since this is the only place this matters. I think the computational cost of the shuffle is minimal, and this makes the rest of the code easier to reason about since we don't have as many shuffle/unshuffle operations to do.
    • I almost wanted to do the same for the seed/unseed stuff, but I didn't go that far.
  • Decided to axe adding numba for now, just want to get in what we have here so far. Will make a new issue and revisit, potentially.
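
The shuffle/unshuffle idea around linear_sum_assignment can be illustrated with a minimal sketch. This is not the code from this PR: the helper name shuffled_linear_sum_assignment is hypothetical, and shuffling both rows and columns is an assumption made for illustration. It takes a numpy.random.Generator, in line with the Generator-only decision above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def shuffled_linear_sum_assignment(cost, rng, maximize=False):
    """Hypothetical helper (not the PR's implementation).

    Randomly permutes rows and columns of the cost matrix, solves the
    assignment problem on the shuffled matrix, then maps the result back
    to the original row/column labels.
    """
    n_rows, n_cols = cost.shape
    row_perm = rng.permutation(n_rows)
    col_perm = rng.permutation(n_cols)

    # shuffled[i, j] == cost[row_perm[i], col_perm[j]]
    shuffled = cost[np.ix_(row_perm, col_perm)]
    rows, cols = linear_sum_assignment(shuffled, maximize=maximize)

    # "Unshuffle": translate shuffled positions back to original indices.
    orig_rows = row_perm[rows]
    orig_cols = col_perm[cols]

    # Sort by row so the output looks like scipy's (row_ind ascending).
    order = np.argsort(orig_rows)
    return orig_rows[order], orig_cols[order]


rng = np.random.default_rng(0)  # Generator API, not the legacy RandomState
cost = np.array([[4, 1, 3], [2, 0, 5], [3, 2, 2]])
rows, cols = shuffled_linear_sum_assignment(cost, rng)
```

Because the shuffle is undone immediately after the solve, the caller never sees shuffled indices; the randomness only affects which optimum is returned when several assignments tie.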

Remaining work [Done now]

This is currently a work in progress, making the PR to document progress and what needs to be done:

  • implement padding
  • implement wrapper function(s)
  • implement computation of "score" (objective function value)
  • type-hints
  • tests
  • documentation
  • implement handling of outlier cases (such as all nodes are seeded)
  • tutorials updated accordingly

@bdpedigo bdpedigo changed the title Reimplemented graph matching code and added new features Refactored graph matching code and added many new features Jun 29, 2022
@bdpedigo bdpedigo requested a review from daxpryce August 12, 2022 17:36
@bdpedigo
Collaborator Author

@daxpryce I think I've addressed everything as well as all the stuff I said I was gonna do, lmk if you have any other comments.

I bumped the major version since this is API-breaking - hoping to release soon after this gets in

@bdpedigo
Collaborator Author

@dokato would you have any interest in reviewing this PR? (No pressure if not.) If so, lmk and I can get you the permissions.

@dokato
Contributor

dokato commented Aug 19, 2022

@bdpedigo sure, I'll have a look. I tested parts of it already anyway.

@ebridge2
Collaborator

Big note: I think it would be useful to include the actual estimated permutation matrix, Phat, as a return for all graph matching related things. The place this might become hairy is in the case of paddings being performed, so perhaps there is some minor discussion to take place (possibly) by zoom later today.

Some notes:

  1. How does indices_A and indices_B behave for padding? reading through the code, it looks like it is going to include match details for nodes which are just padding, which seems possibly misleading. Feels like if padding is performed, perhaps for whichever network has padding done to it, you might want to note that that node is just padding somewhere in the indices, and you could just leave the corresponding match in the other network with a value of None for the outstanding nodes that were matched with a padded node.
  2. why not just have self.padded_A and self.padded_B as class attributes, instead of self.padded and self.padded_B? It seems more direct/clear to just have attributes for each of A and B, and not having a nested outcome that you have to do any mental adjustments for.

@bdpedigo
Collaborator Author

bdpedigo commented Aug 19, 2022

@ebridge2 some responses to your thoughts above:

Big note: I think it would be useful to include the actual estimated permutation matrix, Phat, as a return for all graph matching related things. The place this might become hairy is in the case of paddings being performed, so perhaps there is some minor discussion to take place (possibly) by zoom later today.

Can you explain why? I can see why for the book, but I think for most practical purposes you never actually want to see this matrix, and it can be created in one line. And I do feel it adds some complexity, as you mention. We could show how to make this matrix easily, and explain?
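
As a rough sketch of building that matrix from the two index arrays: the example below assumes equally sized matrices, made-up index values, and the convention that P[i, j] = 1 means node i of A is matched to node j of B. The opposite (transposed) convention is equally plausible, so treat the orientation as an assumption.

```python
import numpy as np

# Made-up matching for illustration: indices_A[k] in A is matched to
# indices_B[k] in B. With equal sizes, indices_A is just np.arange(n).
indices_A = np.array([0, 1, 2, 3])
indices_B = np.array([2, 0, 3, 1])
n = 4

# Build the permutation matrix: zeros everywhere, ones at the matches.
P = np.zeros((n, n), dtype=int)
P[indices_A, indices_B] = 1

# Under this convention, P @ B @ P.T reorders B the same way that
# indexing with indices_B on both axes does.
B = np.arange(n * n).reshape(n, n)
B_permuted = B[np.ix_(indices_B, indices_B)]
```

In most workflows the indexed form B[np.ix_(indices_B, indices_B)] is all that is needed, which is why the dense matrix is rarely worth returning.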

How does indices_A and indices_B behave for padding? reading through the code, it looks like it is going to include match details for nodes which are just padding, which seems possibly misleading. Feels like if padding is performed, perhaps for whichever network has padding done to it, you might want to note that that node is just padding somewhere in the indices, and you could just leave the corresponding match in the other network with a value of None for the outstanding nodes that were matched with a padded node.

I designed these to behave exactly like row_inds, col_inds for https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html because I wanted to base this on an existing API, and I think this is intuitive. In the past, it was hard to tell, for instance, which elements of A got matched if A was the larger matrix. So if n_min is min(len(A), len(B)), then indices_A and indices_B will both have exactly n_min entries.

I dislike the idea of including Nones because then you cannot use these arrays to index into the adjacency matrix.
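
To illustrate the behavior being mirrored, here is a small sketch using scipy's linear_sum_assignment directly on a rectangular cost matrix. It is a stand-in for the graph matching output rather than a call into this PR's code, and the random cost and adjacency matrices are made up for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# 5 candidates on the "A" side, 3 on the "B" side: not every row can match.
cost = rng.random((5, 3))
row_ind, col_ind = linear_sum_assignment(cost)

# Both outputs have exactly n_min entries, and row_ind says which rows
# (nodes of the larger side) actually received a match.
n_min = min(cost.shape)

# Because the outputs are plain integer arrays, they can index an
# adjacency matrix directly to extract the matched subgraph.
A = rng.integers(0, 2, size=(5, 5))
A_matched = A[np.ix_(row_ind, row_ind)]
```

Integer arrays like these slot straight into NumPy indexing; an array containing None would have object dtype and could not be used this way without filtering first.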

why not just have self.padded_A and self.padded_B as class attributes, instead of self.padded and self.padded_B? It seems more direct/clear to just have attributes for each of A and B, and not having a nested outcome that you have to do any mental adjustments for.

Are you talking about self.padded and self._padded_B in the solver? These are different things. The former is a parameter from the user that is just stored. The latter is a private attribute that I use to keep track of which matrix was padded; I opted for one boolean instead of two.
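
A minimal sketch of why one boolean is enough, since at most one of the two matrices is ever padded. The helper name pad_smaller and the plain zero-padding are assumptions for illustration, not the solver's actual code.

```python
import numpy as np


def pad_smaller(A, B):
    """Hypothetical helper: zero-pad the smaller square matrix.

    Returns both matrices at the common size plus a single flag that is
    True when B was the one padded. If the flag is False, either A was
    padded or the inputs were already the same size.
    """
    n = max(A.shape[0], B.shape[0])
    padded_B = B.shape[0] < A.shape[0]

    def _pad(M):
        out = np.zeros((n, n), dtype=M.dtype)
        out[: M.shape[0], : M.shape[1]] = M
        return out

    return _pad(A), _pad(B), padded_B


A = np.ones((4, 4))
B = np.ones((2, 2))
A_pad, B_pad, padded_B = pad_smaller(A, B)
```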

@bdpedigo
Collaborator Author

@daxpryce any chance we can add @dokato with reviewing powers?

@daxpryce
Contributor

@daxpryce any chance we can add @dokato with reviewing powers?

he'll need to accept the invite, but it is done

@bdpedigo
Collaborator Author

@daxpryce any chance we can add @dokato with reviewing powers?

he'll need to accept the invite, but it is done

many thanks!

@dokato
Contributor

dokato commented Aug 22, 2022

Sorry, busy weekend. I accepted the invite now @bdpedigo

Contributor

@dokato dokato left a comment


Looking good, I tested it on a few experiments, so far all run smoothly. A few minor things to consider.

graspologic/match/wrappers.py (outdated, resolved)
graspologic/match/wrappers.py (outdated, resolved)
graspologic/nominate/VNviaSGM.py (resolved)
@bdpedigo bdpedigo merged commit facb6a8 into dev Aug 30, 2022
@bdpedigo bdpedigo deleted the new-gm branch August 30, 2022 17:59