Bring threshold inside BaseReconstructor class #290

Open

wants to merge 23 commits into base: master

Conversation

@sdmccabe (Collaborator) commented Feb 18, 2020

A checklist of things that need to be done for this:

  • draft the PR
  • figure out if we want to do this
  • make sure we want return self everywhere it's currently being used
  • figure out what in the class should be private by convention
  • finalize changes to BaseReconstructor
  • update documentation of other reconstruction methods
  • update netrdexplorer
  • update notebooks/ folder
  • make sure RTD and README reflect the changes
  • push new version and signal that there's a breaking change in the release notes

See #271.

This is presented as a proposal; it's **not ready for merging**. (The tests pass
but a ton of doc changes are missing.)

This consolidates a lot of the thresholding and graph creation logic that exists
in individual reconstructors and in the utilities submodule within the
`BaseReconstructor` class.

Usage changes from something like

```python
>>> reconstructor = ReconstructionAlgorithm()
>>> G = reconstructor.fit(TS, <some_params>, threshold_type='degree', avg_k=16)
>>> # or alternately, G = reconstructor.results['graph']
```

to

```python
>>> reconstructor = ReconstructionAlgorithm()
>>> G = reconstructor.fit(TS, <some_params>).threshold('degree', avg_k=16).to_graph()
```

The advantages of such an approach are as follows:

1. **Minimizes code repetition.** Each of the reconstructors (with one or two
exceptions) currently imports `create_graph()` and `threshold()` and calls those
functions just prior to returning the graph. By moving this functionality into
the base class and using inheritance, there's less duplication and upkeep.
2. **(Mostly) gets rid of kwargs.** The use of `kwargs` within the
reconstruction and thresholding logic has always been extremely messy and
error-prone. Because we no longer need to pass in `threshold_type` and
threshold-specific arguments to `Reconstructor.fit()`, we don't need to capture
`kwargs` in that method call. The general-purpose `Reconstructor.threshold()`
still uses `kwargs` to dispatch threshold-method-specific arguments, but now the
specific thresholders can be accessed as class methods, with their arguments
passed as positional arguments instead.
3. **More expressive (and consistent) transformations.** There's a lot of
ambiguity right now about which methods use which threshold methods by default,
which ones remove self-loops, etc. Under this system, `fit()` just estimates the
weight matrix; it does no thresholding, binarization, or self-loop removal.
These are handled explicitly by the user through method chaining. So, for
example,

```python
G = (R.fit(TS)
      .remove_self_loops()
      .threshold_on_degree(16)
      .binarize()
      .to_graph())
```

This order of operations also removes ambiguity about the relationship between
`threshold_on_degree()` and `remove_self_loops()`; i.e., should those count
towards the thresholding or not? Now, you can decide for yourself by reversing
the order.
4. **More flexibility in return types.** For a lot of experiments, we didn't
actually need the graphs, but we needed the underlying matrix and had to fish
around `self.results` for the right matrix to use. This got easier after we
standardized on `self.results['weights_matrix']` but was still pretty
inaccessible. Now, the return type is made flexible through the addition of the
`.to_graph()` and `.to_matrix()` methods (see the short sketch after this list).
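To make point 4 concrete, here is a minimal, hypothetical usage sketch; the method names follow the description above, but the exact signatures may differ from what ends up in the PR:

```python
import numpy as np
from netrd.reconstruction import CorrelationMatrix

TS = np.random.rand(10, 200)  # toy time series: 10 nodes, 200 observations
R = CorrelationMatrix()

# Same chain, two return types: the thresholded weights matrix...
W = R.fit(TS).threshold_on_degree(4).to_matrix()
# ...or the corresponding networkx graph.
G = R.fit(TS).threshold_on_degree(4).to_graph()
```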

There are some notable downsides to go with the benefits, however.

1. **This approach has a lot of statefulness.** Let's say I'm working in
Jupyter, using a slow reconstruction method. I create my slow reconstructor:

```python
R = netrd.reconstruction.SlowReconstructor()
R = R.fit(TS)
```

I'm not sure what thresholding to use, so I play around a bit.

```python
R = R.threshold_on_degree(4)
R = R.threshold_on_degree(16)
```

This isn't going to work, because the first `threshold_on_degree()` has already
chucked out a lot of the information from the weights matrix generated by the
`fit()` call. This is the big problem here; I think it's a pretty serious
weakness but not necessarily a fatal one. If there are ways to avoid unexpected
statefulness, though, I'm all ears.
2. **Possible memory inefficiency.** The internal `BaseReconstructor` methods
are copying dense matrices around a lot, which is a possible memory hog. I don't
think this is too big of a deal, because the time complexity of most of the
reconstruction algorithms limits their use to small-to-moderate graph sizes
anyway.

@sdmccabe (Collaborator, Author) commented Feb 18, 2020

I've avoided messing with docstrings because I assume some things will change, but I will try to add some draft docstrings to `BaseReconstructor` at some point. (Update: I've added a draft of the docstrings.)

@sdmccabe (Collaborator, Author)

`CorrelationSpanningTree` is removed because the spanning tree is a method for thresholding/backboning a graph; I have added, in addition to the previously existing threshold methods, a `minimum_spanning_tree()` method. So the old CST can be obtained:

```python
from netrd.reconstruction import CorrelationMatrix
G = CorrelationMatrix().fit(TS) \
                       .remove_self_loops() \
                       .minimum_spanning_tree() \
                       .to_graph()
```

@sdmccabe (Collaborator, Author)

On `update_matrix()`, `update_graph()`, `to_matrix()`, and `to_graph()`: There's a lot of work being done to make sure the graph representation and the matrix representation are kept consistent, on the assumption that either could be generated by `fit()`. In the methods that actually exist in netrd, almost all use a matrix representation (`OptimalCausationEntropy` is the exception; it uses an adjacency list).

If we committed to `self.matrix` as the canonical way of representing the network, and then only generated the graph from the matrix when we returned it, a lot of this abstraction could be avoided. The question is whether or not this is worthwhile; what if we later have a method that doesn't fit this workflow?

@leotrs (Collaborator) commented Feb 19, 2020

Statefulness

I think keeping the original output of the reconstruction should be a priority. A possible workaround is to have methods that destroy the original output create a new Reconstructor object and return that instead. So `R2 = R.threshold_on_degree(4)`, or `R.threshold_on_degree(4, inplace=True)`, both work, depending on what the user wants. The second one is more memory efficient but destroys the output. The first one is a memory hog but keeps it. This also mimics some pandas methods so there's that too.
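A rough sketch of that copy-vs-inplace pattern as a reconstructor method, with hypothetical names (`_degree_threshold` is an illustrative helper, not actual netrd code, and the PR does not implement this yet):

```python
import copy

def threshold_on_degree(self, avg_k, inplace=False):
    # Work either on self (destructive, memory-friendly) or on a deep copy
    # (non-destructive, keeps the original fitted output intact).
    obj = self if inplace else copy.deepcopy(self)
    obj.matrix = _degree_threshold(obj.matrix, avg_k)  # hypothetical helper
    # Mimicking pandas semantics, the in-place variant returns None.
    return None if inplace else obj
```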

Inefficiency

What if we didn't copy dense matrices a lot? There's a whole PR here pertaining to the pervasive use of sparse matrices. Also, I'd rather have one limiting resource (time) than two (time + memory).

Representation

I'm all for using `self.matrix` as the canonical representation. That's what most of the methods are doing already anyway. We already have `.to_graph()`, so keeping a separate graph representation around shouldn't be necessary. I can't see how this would break the workflow.

This commit does not address any of the statefulness concerns. It focuses on the
issues of memory usage and the canonical representation of the reconstructed
network. To the former, it uses sparse matrices to represent thresholded graphs.
To the latter, it removes `update_graph()` and some of the conditionals; the
idea is that the following workflow is implied:

1. The reconstructor is initialized with `self.matrix` and `self.graph` as
`None`.
2. The `fit()` method creates a matrix of weights that resides at `self.matrix`.
3. Optionally, the matrix is thresholded, turning `self.matrix` from a numpy
array to a scipy sparse matrix.
4. Optionally, other operations are performed, like the removal of self loops or
binarization. These functions must work with both dense and sparse matrices.
5. The graph is accessed through the `to_graph()` method, which creates the
networkx object from `self.matrix`. This graph is stored at `self.graph` and
returned if `to_graph()` is called again.
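A compact sketch of that implied workflow; attribute and method names follow the description above, but this is illustrative rather than the PR's actual code:

```python
import networkx as nx
from scipy import sparse


class ReconstructorSketch:
    """Hypothetical illustration of steps 1-5 above; not the PR's actual code."""

    def __init__(self):
        # Step 1: both representations start out empty.
        self.matrix = None
        self.graph = None

    def update_matrix(self, new_matrix):
        # Steps 2-4 funnel through here; any change to the weights
        # invalidates the cached graph.
        self.matrix = new_matrix
        self.graph = None

    def to_graph(self):
        # Step 5: build the networkx object lazily and cache it.
        if self.graph is None:
            if sparse.issparse(self.matrix):
                self.graph = nx.from_scipy_sparse_matrix(self.matrix)  # networkx 2.x API
            else:
                self.graph = nx.from_numpy_array(self.matrix)
        return self.graph
```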
@sdmccabe (Collaborator, Author)

My last commit hopefully addresses the inefficiency and representation points somewhat. Now, to the statefulness point.

I think `return self` is a valid approach and that the statefulness isn't too bad here. But if we wanted to go the route of reducing statefulness, this SO post points to the SQLAlchemy code base, in particular its use of the `Generative` class and `@_generative` decorator, as an example of how to implement this. It's not just a matter of changing all the `return self`s to `return self.copy()`s, because you've already changed class attributes by the time you're making the copy.

@leotrs (Collaborator) commented Feb 20, 2020

Can we come up with a particular example of why statefulness is Bad and how this generative thing is going to help that particular issue?

The problem of copying after modification is easily solved by deciding which object to work on (self, or a copy) at the top of the function. This is easy if we use the inplace kwarg.

@sdmccabe (Collaborator, Author) commented Feb 20, 2020

I think the statefulness issue is only likely to come up in interactive use.

If I'm reading it right, the `@_generative` decorator basically does what you propose. It makes the copy at the top of the function:

```python
s = self.__class__.__new__(self.__class__)
s.__dict__ = self.__dict__.copy()
```
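For context, a decorator built around that copy step might look roughly like the following; this is a sketch only, not SQLAlchemy's (or this PR's) actual code:

```python
import functools


def generative(method):
    """Run the wrapped method on a copy of the object and return the copy."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # Copy the state *before* the wrapped method mutates anything,
        # so the original fitted reconstructor is left untouched.
        new = self.__class__.__new__(self.__class__)
        new.__dict__ = self.__dict__.copy()
        method(new, *args, **kwargs)
        return new

    return wrapper
```

Note that this is a shallow copy: it protects against attributes being rebound, not against in-place mutation of a matrix shared between the copies.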

I don't really care for the `inplace` kwarg, assuming it has similar semantics to pandas's `inplace` or in-place numpy functions like `np.fill_diagonal` and returns `None`. If it doesn't return `None`, that's also a possible source of confusion for people who would expect it to return `None`.

@leotrs (Collaborator) commented Feb 20, 2020

Language barrier here: by "I don't really care about inplace", do you mean to say you don't want to implement it?

In any case, if generative does that, then I guess why not? As long as we are sure it is doing what we want (and only what we want).

@sdmccabe (Collaborator, Author) commented Feb 20, 2020

I don't really like the design pattern of in-place operations in general. If other people want it, I'd be ok with implementing it down the road, but I think for now I'd rather focus on getting a single workflow set up and revisit inplace later.

@leotrs (Collaborator) commented Feb 20, 2020

I disagree, but you have the last word here.

Can generative also execute in-place?

All non-`fit()` methods that return `self` now operate on copies of the object.
This should receive test coverage.

Trying it out interactively, it seems to work as expected, returning new copies
of things where appropriate. Memory usage didn't appreciably increase,
presumably because the threshold functions are creating sparse matrices, but we
should revisit when matrices are (and aren't) being copied now that the whole
class is being copied too.

For now I've just repeated the relevant lines of code. In a general cleanup it
would probably make sense to use a decorator for this, perhaps just copying
from the MIT-licensed SQLAlchemy repo.

This commit renames `self.graph` and `self.matrix` to `self._graph` and
`self._matrix`, a way of signalling that they are intended to be private. As a
result, `self._matrix` should only be modified using `self.update_matrix()`;
that change is also added in each reconstructor file.
@sdmccabe (Collaborator, Author)

A couple of test warnings to track down before merging:

```
tests/test_reconstruction.py::test_graph_size
tests/test_reconstruction.py::test_graph_size
  /home/main/.pyenv/versions/anaconda3-2019.03/lib/python3.7/site-packages/scipy/sparse/compressed.py:213: RuntimeWarning: invalid value encountered in greater
    res = self._with_data(op(self.data, other), copy=True)
```

There shouldn't be any comparison operations happening on sparse matrices right now, so I'm not sure exactly where it's coming from.

From some googling, the issue here might be something to do with NaNs in the matrix.
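A minimal sketch of one way that warning can fire, assuming a NaN ends up among the stored values of a sparse matrix (a guess at the mechanism, not a confirmed diagnosis of the failing test):

```python
import numpy as np
from scipy import sparse

# NaN is nonzero, so it is kept in the sparse matrix's .data array.
mat = sparse.csr_matrix(np.array([[0.0, np.nan], [1.0, 0.0]]))

# Scalar comparisons are applied element-wise to .data, and np.nan > 0.5
# emits "RuntimeWarning: invalid value encountered in greater".
result = mat > 0.5
```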

@sdmccabe (Collaborator, Author)

@leotrs the main stuff is ready for review; everything else in the checklist is window dressing (and contingent on finalizing the design).

leotrs marked this pull request as ready for review March 12, 2020 12:53

@leotrs (Collaborator) left a comment

Random question: do we ever operate (i.e. multiply) with the matrices? Multiplication with dense/sparse matrices has to be handled carefully.

Also, of course the diff won't catch changes that needed to be made but weren't, so I trust you double-checked all of the files that needed editing. How is our test coverage here?

Comment on lines +300 to +305
```python
def _binarize_sparse(self, mat):
    return np.abs(mat.sign())

def _binarize_dense(self, mat):
    return np.abs(np.sign(mat))
```

Why do we need both?

```python
G = nx.empty_graph(n=N)
self.results['graph'] = G
return G
self.update_matrix(np.zeros((N, N)))
```

Can we use sparse matrices for dummy matrices like this? Or is it too clunky later?

@sdmccabe (Collaborator, Author)

> Random question: do we ever operate (i.e. multiply) with the matrices? Multiplication with dense/sparse matrices has to be handled carefully.

I don't think we do. Likely any matrix multiplication would happen in `fit()`, before `update_matrix()` is called.

> Also, of course the diff won't catch changes that needed to be made but weren't, so I trust you double-checked all of the files that needed editing. How is our test coverage here?

I wouldn't trust me! The test coverage is so-so; we've never had great coverage in the reconstruction submodule. The threshold logic seems to work, and the CCM test, because it was testing for a known result, helped me catch a bunch of bugs. It would be nice (as always) if we had more tests that could apply to all the reconstruction methods.
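As a hedged illustration of what such a generic test could look like under the new API (the reconstructor list, array shapes, and threshold value here are made up for the example):

```python
import numpy as np
import pytest
from netrd.reconstruction import CorrelationMatrix  # extend with other reconstructors

N, L = 10, 200
TS = np.random.default_rng(0).normal(size=(N, L))  # toy time series: N nodes, L steps


@pytest.mark.parametrize("cls", [CorrelationMatrix])
def test_graph_size(cls):
    # Every reconstructor should yield a graph on N nodes under the new
    # fit().threshold_*().to_graph() chain.
    G = cls().fit(TS).threshold_on_degree(4).to_graph()
    assert G.number_of_nodes() == N
```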
