
Refactored graph matching code and added many new features #960

Merged

98 commits merged into dev from new-gm on Aug 30, 2022

Conversation

bdpedigo
Collaborator

@bdpedigo bdpedigo commented Jun 24, 2022

  • Does this PR have a descriptive title that could go in our release notes?
  • Does this PR add any new dependencies?
    • No
  • Does this PR modify any existing APIs?
    • Is the change to the API backwards compatible?
      • Very much no
  • Have you built the documentation (reference and/or tutorial) and verified the generated documentation is appropriate?

Reference Issues/PRs

Fixes #959
Fixes #858
Closes #792
Closes #425
Closes #346

What does this implement/fix? Highlights

Design decisions

  • I opted to use a function rather than a class for the user-facing graph matching tools. This is because graph matching is not exactly a "model," or in other words I think people would be very unlikely to ever "fit" a permutation to one pair of graphs and then "predict" using the estimated permutation on another.
  • I tried to make the interface somewhat like https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html in the sense that I now output two sets of indices, where the first corresponds to the input matrix A and the second to the input matrix B. When the two matrices are the same size, the first set of indices is basically worthless: it will always be just the sorted indices of A, in other words np.arange(n). This is much like what happens with linear_sum_assignment in scipy. However, for differently sized matrices (in particular when A is bigger than B), not everyone in A will get a match, so this matters. See Understanding output of Graph Matching for unequal sized matches #925 for some discussion (it was confusing what happens with differently sized matrices).
  • I added some better logging/timing features because I often want to know what aspects of the algorithm are taking a long time when I'm matching large matrices.
  • I opted to add the transport functionality simply as extra parameters - this is mainly because all of the other parameters would be the same, and I felt it would be annoying to have a completely separate function to do this small adjustment to the matching algorithm.
    • The downside is that this adds several parameters to the algorithm which won't matter when transport=False. Tons of parameters can be lame, but I did try to alleviate this by labeling those parameters with the transport_ prefix.
  • I opted to only support the numpy.random.Generator syntax, since the RandomState syntax is considered legacy by numpy and I don't think any new code should bend over backwards to support it (see https://numpy.org/doc/stable/reference/random/legacy.html#legacy).
  • I opted to use for loops for the multigraph cases for readability and compatibility with sparse arrays/matrices. There is a way to do similar operations with np.einsum but it is very hard to read and reason about, IMO.
  • I changed how the input initialization is specified because I did not like having a single parameter that could be a string, array, or float (which was tough to reason about internally).
  • Rather than shuffle the entire input matrix to begin with as we did in the past, I simply shuffle/unshuffle every time linear_sum_assignment is called, since this is the only place it matters. I think the computational cost of the shuffle is minimal, and this makes the rest of the code easier to reason about since we don't have as many shuffle/unshuffle operations to do.
    • I almost wanted to do the same for the seed/unseed stuff, but I didn't go that far.
  • Decided to axe adding numba for now, just want to get in what we have here so far. Will make a new issue and revisit, potentially.
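
The two-index output convention described above can be seen directly in scipy's linear_sum_assignment, which the new interface mirrors. A minimal sketch with made-up cost matrices (not graspologic code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Square cost matrix: the first index array is always just arange(n),
# matching the "basically worthless" first index set described above.
cost_square = np.array([[4, 1, 3],
                        [2, 0, 5],
                        [3, 2, 2]])
row_ind, col_ind = linear_sum_assignment(cost_square)
print(row_ind)  # [0 1 2] -- sorted indices, carries no information
print(col_ind)  # [1 0 2] -- the assignment that minimizes total cost

# Rectangular cost matrix (more rows than columns): not every row
# gets a match, so now the first index array matters.
cost_rect = np.array([[4, 1],
                      [2, 0],
                      [5, 6]])
row_ind, col_ind = linear_sum_assignment(cost_rect)
# Both arrays have exactly min(3, 2) == 2 entries:
print(row_ind)  # [0 1] -- the rows that were matched
print(col_ind)  # [1 0] -- their assigned columns
```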

Remaining work [Done now]

This is currently a work in progress, making the PR to document progress and what needs to be done:

  • implement padding
  • implement wrapper function(s)
  • implement computation of "score" (objective function value)
  • type-hints
  • tests
  • documentation
  • implement handling of outlier cases (such as all nodes are seeded)
  • tutorials updated accordingly

@bdpedigo bdpedigo changed the title Reimplemented graph matching code and added new features Refactored graph matching code and added many new features Jun 29, 2022
@bdpedigo bdpedigo requested a review from daxpryce August 12, 2022 17:36
@bdpedigo
Collaborator Author

@daxpryce I think I've addressed everything as well as all the stuff I said I was gonna do, lmk if you have any other comments.

I bumped the major version since this is API-breaking - hoping to release soon after this gets in

@bdpedigo
Collaborator Author

@dokato would you have any interest in reviewing this PR (no pressure if not). If so, lmk and I can get you the permissions.

@dokato
Contributor

dokato commented Aug 19, 2022

@bdpedigo sure, I'll have a look. I tested parts of it already anyway.

@ebridge2
Collaborator

Big note: I think it would be useful to include the actual estimated permutation matrix, Phat, as a return for all graph matching related things. The place this might become hairy is in the case of paddings being performed, so perhaps there is some minor discussion to take place (possibly) by zoom later today.

Some notes:

  1. How do indices_A and indices_B behave for padding? Reading through the code, it looks like they will include match details for nodes which are just padding, which seems possibly misleading. If padding is performed, then for whichever network had padding done to it, you might want to note somewhere in the indices that a node is just padding, and you could leave the corresponding match in the other network as None for nodes that were matched with a padded node.
  2. why not just have self.padded_A and self.padded_B as class attributes, instead of self.padded and self.padded_B? It seems more direct/clear to just have attributes for each of A and B, and not having a nested outcome that you have to do any mental adjustments for.

@bdpedigo
Collaborator Author

bdpedigo commented Aug 19, 2022

@ebridge2 some responses to your thoughts above:

> Big note: I think it would be useful to include the actual estimated permutation matrix, Phat, as a return for all graph matching related things. The place this might become hairy is in the case of paddings being performed, so perhaps there is some minor discussion to take place (possibly) by zoom later today.

Can you explain why? I can see why for the book, but I think in most practical uses you never actually want to see this matrix, and it can be created in one line. And I do feel it adds some complexity, as you mention. We could show how to make this matrix easily, and explain?
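
For reference, here is a sketch of the one-line construction: `perm` below is a stand-in for a matched index array, and the convention P_hat[i, perm[i]] = 1 is an assumption about orientation, not the library's documented behavior:

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
perm = rng.permutation(n)  # stand-in for the matched indices

# One line: row i of P_hat is the perm[i]-th standard basis vector,
# so P_hat[i, perm[i]] == 1 and P_hat @ v reorders v as v[perm].
P_hat = np.eye(n)[perm]

print(P_hat.argmax(axis=1))  # recovers perm
```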

> How does indices_A and indices_B behave for padding? reading through the code, it looks like it is going to include match details for nodes which are just padding, which seems possibly misleading. Feels like if padding is performed, perhaps for whichever network has padding done to it, you might want to note that that node is just padding somewhere in the indices, and you could just leave the corresponding match in the other network with a value of None for the outstanding nodes that were matched with a padded node.

I designed these to behave exactly like row_ind, col_ind from https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html because I wanted to base them on an existing API, and I think this is intuitive. In the past it was hard to tell, for instance, which elements of A got matched when A was the larger matrix. So if n_min is min(len(A), len(B)), then indices_A and indices_B will both have exactly n_min entries.

I dislike the idea of including Nones because then you cannot use these arrays to index into the adjacency matrix.
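
To illustrate the indexing point, a sketch with a made-up match between a 5-node A and a 3-node B (the index arrays here are hypothetical, not real matching output):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.integers(0, 2, size=(5, 5))  # larger network
B = rng.integers(0, 2, size=(3, 3))  # smaller network

# Hypothetical output: exactly min(len(A), len(B)) == 3 entries each.
indices_A = np.array([0, 2, 4])  # which nodes of A were matched
indices_B = np.array([1, 0, 2])  # their partners in B

# Because both arrays are pure integer indices, they can be used
# directly to align the matched subgraphs -- None entries would
# make this fancy indexing impossible.
A_sub = A[indices_A][:, indices_A]
B_perm = B[indices_B][:, indices_B]
print(A_sub.shape, B_perm.shape)  # (3, 3) (3, 3)
```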

> why not just have self.padded_A and self.padded_B as class attributes, instead of self.padded and self.padded_B? It seems more direct/clear to just have attributes for each of A and B, and not having a nested outcome that you have to do any mental adjustments for.

Are you talking about self.padded and self._padded_B in the solver? These are different things. The former is a parameter from the user that is just stored. The latter is a private attribute that I use to keep track of which matrix was padded; I opted for one boolean instead of two.

@bdpedigo
Collaborator Author

@daxpryce any chance we can add @dokato with reviewing powers?

@daxpryce
Contributor

> @daxpryce any chance we can add @dokato with reviewing powers?

he'll need to accept the invite, but it is done

@bdpedigo
Collaborator Author

> > @daxpryce any chance we can add @dokato with reviewing powers?

> he'll need to accept the invite, but it is done

many thanks!

@dokato
Contributor

dokato commented Aug 22, 2022

Sorry, busy weekend. I accepted the invite now @bdpedigo

@dokato
Contributor

@dokato left a comment


Looking good, I tested it on a few experiments, so far all run smoothly. A few minor things to consider.

Review threads on graspologic/match/wrappers.py (outdated, resolved) and graspologic/nominate/VNviaSGM.py (resolved).
@bdpedigo bdpedigo merged commit facb6a8 into dev Aug 30, 2022
@bdpedigo bdpedigo deleted the new-gm branch August 30, 2022 17:59