Bring threshold inside BaseReconstructor class #290

Open

wants to merge 23 commits into base: master

Conversation

@sdmccabe (Collaborator) commented Feb 18, 2020

A checklist of things that need to be done for this:

  • draft the PR
  • figure out if we want to do this
  • make sure we want return self everywhere it's currently being used
  • figure out what in the class should be private by convention
  • finalize changes to BaseReconstructor
  • update documentation of other reconstruction methods
  • update netrdexplorer
  • update notebooks/ folder
  • make sure RTD and README reflect the changes
  • push new version and signal that there's a breaking change in the release notes

See #271.

This is presented as a proposal; it's **not ready for merging**. (The tests pass
but a ton of doc changes are missing.)

This consolidates a lot of the thresholding and graph creation logic that exists
in individual reconstructors and in the utilities submodule within the
`BaseReconstructor` class.

Usage changes from something like

```python
>>> reconstructor = ReconstructionAlgorithm()
>>> G = reconstructor.fit(TS, <some_params>, threshold_type='degree', avg_k=16)
>>> # or alternately, G = reconstructor.results['graph']
```

to

```python
>>> reconstructor = ReconstructionAlgorithm()
>>> G = reconstructor.fit(TS, <some_params>).threshold('degree', avg_k=16).to_graph()
```

The advantages of such an approach are as follows:

1. **Minimizes code repetition.** Each of the reconstructors (with one or two
exceptions) currently imports `create_graph()` and `threshold()` and calls those
functions just prior to returning the graph. By moving this functionality into
the base class and using inheritance, there's less duplication and upkeep.
2. **(Mostly) gets rid of kwargs.** The use of `kwargs` within the
reconstruction and thresholding logic has always been extremely messy and
error-prone. Because we no longer need to pass in `threshold_type` and
threshold-specific arguments to `Reconstructor.fit()`, we don't need to capture
`kwargs` in that method call. The general-purpose `Reconstructor.threshold()`
still uses `kwargs` to dispatch threshold-method-specific arguments, but now the
specific thresholders can be accessed as class methods, with their arguments
passed as positional arguments instead.
3. **More expressive (and consistent) transformations.** There's a lot of
ambiguity right now about which methods use which threshold methods by default,
which ones remove self-loops, etc. Under this system, `fit()` just estimates the
weight matrix; it does no thresholding, binarization, or self-loop removal.
These are handled explicitly by the user through method chaining. So, for
example,

```python
G = (R.fit(TS)
      .remove_self_loops()
      .threshold_on_degree(16)
      .binarize()
      .to_graph())
```

This order of operations also removes ambiguity about the relationship between
`threshold_on_degree()` and `remove_self_loops()`; i.e., should those count
towards the thresholding or not? Now, you can decide for yourself by reversing
the order.
4. **More flexibility in return types.** For a lot of experiments, we didn't
actually need the graphs, but we needed the underlying matrix and had to fish
around `self.results` for the right matrix to use. This got easier after we
standardized on `self.results['weights_matrix']` but was still pretty
inaccessible. Now, the return type is made flexible through the addition of the
`.to_graph()` and `.to_matrix()` methods (see the short sketch after this list).
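To make point 4 concrete, here is a minimal, hypothetical usage sketch; the method names follow the description above, but the exact signatures may differ from what ends up in the PR:

```python
import numpy as np
from netrd.reconstruction import CorrelationMatrix

TS = np.random.rand(10, 200)  # toy time series: 10 nodes, 200 observations
R = CorrelationMatrix()

# Same chain, two return types: the thresholded weights matrix...
W = R.fit(TS).threshold_on_degree(4).to_matrix()
# ...or the corresponding networkx graph.
G = R.fit(TS).threshold_on_degree(4).to_graph()
```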

There are some notable downsides to go with the benefits, however.

1. **This approach has a lot of statefulness.** Let's say I'm working in
Jupyter, using a slow reconstruction method. I create my slow reconstructor:

```python
R = netrd.reconstruction.SlowReconstructor()
R = R.fit(TS)
```

I'm not sure what thresholding to use, so I play around a bit.

```python
R = R.threshold_on_degree(4)
R = R.threshold_on_degree(16)
```

This isn't going to work, because the first `threshold_on_degree()` has already
chucked out a lot of the information from the weights matrix generated by the
`fit()` call. This is the big problem here; I think it's a pretty serious
weakness but not necessarily a fatal one. If there are ways to avoid unexpected
statefulness, though, I'm all ears.
2. **Possible memory inefficiency.** The internal `BaseReconstructor` methods
are copying dense matrices around a lot, which is a possible memory hog. I don't
think this is too big of a deal, because the time complexity of most of the
reconstruction algorithms limits their use to small-to-moderate graph sizes
anyway.

@sdmccabe (Collaborator, Author) commented Feb 18, 2020

I've avoided messing with docstrings because I assume some things will change, but I will try to add some draft docstrings to `BaseReconstructor` at some point. (Update: I've added a draft of the docstrings.)

@sdmccabe (Collaborator, Author)

`CorrelationSpanningTree` is removed because the spanning tree is a method for thresholding/backboning a graph; I have added, in addition to the previously existing threshold methods, a `minimum_spanning_tree()` method. So the old CST can be obtained:

```python
from netrd.reconstruction import CorrelationMatrix
G = CorrelationMatrix().fit(TS) \
                       .remove_self_loops() \
                       .minimum_spanning_tree() \
                       .to_graph()
```

@sdmccabe (Collaborator, Author)

On `update_matrix()`, `update_graph()`, `to_matrix()`, and `to_graph()`: There's a lot of work being done to make sure the graph representation and the matrix representation are kept consistent, on the assumption that either could be generated by `fit()`. In the methods that actually exist in netrd, almost all use a matrix representation (`OptimalCausationEntropy` is the exception; it uses an adjacency list).

If we committed to `self.matrix` as the canonical way of representing the network, and then only generated the graph from the matrix when we returned it, a lot of this abstraction could be avoided. The question is whether or not this is worthwhile; what if we later have a method that doesn't fit this workflow?

@leotrs (Collaborator) commented Feb 19, 2020

Statefulness

I think keeping the original output of the reconstruction should be a priority. A possible workaround is to have methods that destroy the original output create a new Reconstructor object and return that instead. So `R2 = R.threshold_on_degree(4)`, or `R.threshold_on_degree(4, inplace=True)`, both work, depending on what the user wants. The second one is more memory efficient but destroys the output. The first one is a memory hog but keeps it. This also mimics some pandas methods so there's that too.
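A rough sketch of that copy-vs-inplace pattern as a reconstructor method, with hypothetical names (`_degree_threshold` is an illustrative helper, not actual netrd code, and the PR does not implement this yet):

```python
import copy

def threshold_on_degree(self, avg_k, inplace=False):
    # Work either on self (destructive, memory-friendly) or on a deep copy
    # (non-destructive, keeps the original fitted output intact).
    obj = self if inplace else copy.deepcopy(self)
    obj.matrix = _degree_threshold(obj.matrix, avg_k)  # hypothetical helper
    # Mimicking pandas semantics, the in-place variant returns None.
    return None if inplace else obj
```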

Inefficiency

What if we didn't copy dense matrices a lot? There's a whole PR here pertaining to the pervasive use of sparse matrices. Also, I'd rather have one limiting resource (time) than two (time + memory).

Representation

I'm all for using `self.matrix` as the canonical representation. That's what most of the methods are doing already anyway. We already have `.to_graph()`, so keeping a separate graph representation around shouldn't be necessary. I can't see how this would break the workflow.

This commit does not address any of the statefulness concerns. It focuses on the
issues of memory usage and the canonical representation of the reconstructed
network. To the former, it uses sparse matrices to represent thresholded graphs.
To the latter, it removes `update_graph()` and some of the conditionals; the
idea is that the following workflow is implied:

1. The reconstructor is initialized with `self.matrix` and `self.graph` as
`None`.
2. The `fit()` method creates a matrix of weights that resides at `self.matrix`.
3. Optionally, the matrix is thresholded, turning `self.matrix` from a numpy
array to a scipy sparse matrix.
4. Optionally, other operations are performed, like the removal of self loops or
binarization. These functions must work with both dense and sparse matrices.
5. The graph is accessed through the `to_graph()` method, which creates the
networkx object from `self.matrix`. This graph is stored at `self.graph` and
returned if `to_graph()` is called again.
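A compact sketch of that implied workflow; attribute and method names follow the description above, but this is illustrative rather than the PR's actual code:

```python
import networkx as nx
from scipy import sparse


class ReconstructorSketch:
    """Hypothetical illustration of steps 1-5 above; not the PR's actual code."""

    def __init__(self):
        # Step 1: both representations start out empty.
        self.matrix = None
        self.graph = None

    def update_matrix(self, new_matrix):
        # Steps 2-4 funnel through here; any change to the weights
        # invalidates the cached graph.
        self.matrix = new_matrix
        self.graph = None

    def to_graph(self):
        # Step 5: build the networkx object lazily and cache it.
        if self.graph is None:
            if sparse.issparse(self.matrix):
                self.graph = nx.from_scipy_sparse_matrix(self.matrix)  # networkx 2.x API
            else:
                self.graph = nx.from_numpy_array(self.matrix)
        return self.graph
```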
@sdmccabe (Collaborator, Author)

My last commit hopefully addresses the inefficiency and representation points somewhat. Now, to the statefulness point.

I think `return self` is a valid approach and that the statefulness isn't too bad here. But if we wanted to go the route of reducing statefulness, this SO post points to the SQLAlchemy code base, in particular its use of the `Generative` class and `@_generative` decorator, as an example of how to implement this. It's not just a matter of changing all the `return self`s to `return self.copy()`s, because you've already changed class attributes by the time you're making the copy.

@leotrs (Collaborator) commented Feb 20, 2020

Can we come up with a particular example of why statefulness is Bad and how this generative thing is going to help that particular issue?

The problem of copying after modification is easily solved by deciding which object to work on (self, or a copy) at the top of the function. This is easy if we use the inplace kwarg.

@sdmccabe (Collaborator, Author) commented Feb 20, 2020

I think the statefulness issue is only likely to come up in interactive use.

If I'm reading it right, the `@_generative` decorator basically does what you propose. It makes the copy at the top of the function:

```python
s = self.__class__.__new__(self.__class__)
s.__dict__ = self.__dict__.copy()
```
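For context, a decorator built around that copy step might look roughly like the following; this is a sketch only, not SQLAlchemy's (or this PR's) actual code:

```python
import functools


def generative(method):
    """Run the wrapped method on a copy of the object and return the copy."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # Copy the state *before* the wrapped method mutates anything,
        # so the original fitted reconstructor is left untouched.
        new = self.__class__.__new__(self.__class__)
        new.__dict__ = self.__dict__.copy()
        method(new, *args, **kwargs)
        return new

    return wrapper
```

Note that this is a shallow copy: it protects against attributes being rebound, not against in-place mutation of a matrix shared between the copies.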

I don't really care for the `inplace` kwarg, assuming it has similar semantics to pandas's `inplace` or in-place numpy functions like `np.fill_diagonal` and returns `None`. If it doesn't return `None`, that's also a possible source of confusion for people who would expect it to return `None`.

@leotrs (Collaborator) commented Feb 20, 2020

Language barrier here: by "I don't really care about inplace", do you mean to say you don't want to implement it?

In any case, if generative does that, then I guess why not? As long as we are sure it is doing what we want (and only what we want).

@sdmccabe (Collaborator, Author) commented Feb 20, 2020

I don't really like the design pattern of in-place operations in general. If other people want it, I'd be ok with implementing it down the road, but I think for now I'd rather focus on getting a single workflow set up and revisit inplace later.

@leotrs (Collaborator) commented Feb 20, 2020

I disagree, but you have the last word here.

Can generative also execute in-place?

All non-`fit()` methods that return `self` now operate on copies of the object.
This should receive test coverage.

Trying it out interactively, it seems to work as expected, returning new copies
of things where appropriate. Memory usage didn't appreciably increase,
presumably because the threshold functions are creating sparse matrices, but we
should revisit when matrices are (and aren't) being copied now that the whole
class is being copied too.

For now I've just repeated the relevant lines of code. In a general cleanup it
would probably make sense to use a decorator for this, perhaps just copying
from the MIT-licensed SQLAlchemy repo.

This commit renames `self.graph` and `self.matrix` to `self._graph` and
`self._matrix`, a way of signalling that they are intended to be private. As a
result, `self._matrix` should only be modified using `self.update_matrix()`;
that change is also added in each reconstructor file.
@sdmccabe (Collaborator, Author)

A couple of test warnings to track down before merging:

```
tests/test_reconstruction.py::test_graph_size
tests/test_reconstruction.py::test_graph_size
  /home/main/.pyenv/versions/anaconda3-2019.03/lib/python3.7/site-packages/scipy/sparse/compressed.py:213: RuntimeWarning: invalid value encountered in greater
    res = self._with_data(op(self.data, other), copy=True)
```

There shouldn't be any comparison operations happening on sparse matrices right now, so I'm not sure exactly where it's coming from.

From some googling, the issue here might be something to do with NaNs in the matrix.
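A minimal sketch of one way that warning can fire, assuming a NaN ends up among the stored values of a sparse matrix (a guess at the mechanism, not a confirmed diagnosis of the failing test):

```python
import numpy as np
from scipy import sparse

# NaN is nonzero, so it is kept in the sparse matrix's .data array.
mat = sparse.csr_matrix(np.array([[0.0, np.nan], [1.0, 0.0]]))

# Scalar comparisons are applied element-wise to .data, and np.nan > 0.5
# emits "RuntimeWarning: invalid value encountered in greater".
result = mat > 0.5
```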

@sdmccabe (Collaborator, Author)

@leotrs the main stuff is ready for review; everything else in the checklist is window dressing (and contingent on finalizing the design).

leotrs marked this pull request as ready for review March 12, 2020 12:53

@leotrs (Collaborator) left a comment

Random question: do we ever operate (i.e. multiply) with the matrices? Multiplication with dense/sparse matrices has to be handled carefully.

Also, of course the diff won't catch changes that needed to be made but weren't, so I trust you double-checked all of the files that needed editing. How is our test coverage here?

Comment on lines +300 to +305
```python
def _binarize_sparse(self, mat):
    return np.abs(mat.sign())

def _binarize_dense(self, mat):
    return np.abs(np.sign(mat))
```

Why do we need both?

```python
G = nx.empty_graph(n=N)
self.results['graph'] = G
return G
self.update_matrix(np.zeros((N, N)))
```

Can we use sparse matrices for dummy matrices like this? Or is it too clunky later?

@sdmccabe (Collaborator, Author)

> Random question: do we ever operate (i.e. multiply) with the matrices? Multiplication with dense/sparse matrices has to be handled carefully.

I don't think we do. Likely any matrix multiplication would happen in `fit()`, before `update_matrix()` is called.

> Also, of course the diff won't catch changes that needed to be made but weren't, so I trust you double-checked all of the files that needed editing. How is our test coverage here?

I wouldn't trust me! The test coverage is so-so; we've never had great coverage in the reconstruction submodule. The threshold logic seems to work, and the CCM test, because it was testing for a known result, helped me catch a bunch of bugs. It would be nice (as always) if we had more tests that could apply to all the reconstruction methods.
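As a hedged illustration of what such a generic test could look like under the new API (the reconstructor list, array shapes, and threshold value here are made up for the example):

```python
import numpy as np
import pytest
from netrd.reconstruction import CorrelationMatrix  # extend with other reconstructors

N, L = 10, 200
TS = np.random.default_rng(0).normal(size=(N, L))  # toy time series: N nodes, L steps


@pytest.mark.parametrize("cls", [CorrelationMatrix])
def test_graph_size(cls):
    # Every reconstructor should yield a graph on N nodes under the new
    # fit().threshold_*().to_graph() chain.
    G = cls().fit(TS).threshold_on_degree(4).to_graph()
    assert G.number_of_nodes() == N
```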
