Bring threshold inside BaseReconstructor class #290
Conversation
See netsiphd#271. This is presented as a proposal; it's **not ready for merging**. (The tests pass but a ton of doc changes are missing.)

This consolidates a lot of the thresholding and graph creation logic that exists in individual reconstructors and in the utilities submodule within the `BaseReconstructor` class. The advantages of such an approach are as follows:

1. **Minimizes code repetition.** Each of the reconstructors (with one or two exceptions) currently imports `create_graph()` and `threshold()` and calls those functions just prior to returning the graph. By moving this functionality into the base class and using inheritance, there's less duplication and upkeep.
2. **(Mostly) gets rid of kwargs.** The use of `kwargs` within the reconstruction and thresholding logic has always been extremely messy and error-prone. Because we no longer need to pass `threshold_type` and threshold-specific arguments to `Reconstructor.fit()`, we don't need to capture `kwargs` in that method call. The general-purpose `Reconstructor.threshold()` still uses `kwargs` to dispatch threshold-method-specific arguments, but now the specific thresholders can be accessed as class methods, with their arguments passed as positional arguments, instead.
3. **More expressive (and consistent) transformations.** There's a lot of ambiguity right now about which methods use which threshold methods by default, which ones remove self-loops, etc. Under this system, `fit()` just estimates the weight matrix; it does no thresholding, binarization, or self-loop removal. These are handled explicitly by the user through method chaining. So, for example,

   ```python
   G = (R.fit(TS)
         .remove_self_loops()
         .threshold_on_degree(16)
         .binarize()
         .to_graph())
   ```

   This order of operations also removes ambiguity about the relationship between `threshold_on_degree()` and `remove_self_loops()`; i.e., should those count towards the thresholding or not? Now, you can decide for yourself by reversing the order.
4. **More flexibility in return types.** For a lot of experiments, we didn't actually need the graphs, but we needed the underlying matrix and had to fish around `self.results` for the right matrix to use. This got easier after we standardized on `self.results['weights_matrix']` but was still pretty inaccessible. Now, the return type is made flexible through the addition of the `.to_graph()` and `.to_matrix()` methods.

There are some notable downsides to go with the benefits, however.

1. **This approach has a lot of statefulness.** Let's say I'm working in Jupyter, using a slow reconstruction method. I create my slow reconstructor:

   ```python
   R = netrd.reconstruction.SlowReconstructor()
   R = R.fit(TS)
   ```

   I'm not sure what thresholding to use, so I play around a bit.

   ```python
   R = R.threshold_on_degree(4)
   R = R.threshold_on_degree(16)
   ```

   This isn't going to work, because the first `threshold_on_degree()` has already chucked out a lot of the information from the weights matrix generated by the `fit()` call. This is the big problem here; I think it's a pretty serious weakness but not necessarily a fatal one. If there are ways to avoid unexpected statefulness, though, I'm all ears.
2. **Possible memory inefficiency.** The internal `BaseReconstructor` methods are copying dense matrices around a lot, which is a possible memory hog. I don't think this is too big of a deal, because the time complexity of most of the reconstruction algorithms limits their use to small-to-moderate graph sizes anyway.
```python
from netrd.reconstruction import CorrelationMatrix

G = CorrelationMatrix().fit(TS) \
                       .remove_self_loops() \
                       .minimum_spanning_tree() \
                       .to_graph()
```
If we committed to …
**Statefulness**

I think keeping the original output of the reconstruction should be a priority. A possible workaround is to have methods that destroy the original output create a new `Reconstructor` object and return that instead (as sketched below).

**Inefficiency**

What if we didn't copy dense matrices a lot? There's a whole PR here pertaining to the pervasive use of sparse matrices. Also, I'd rather have one limiting resource (time) than two (time + memory).

**Representation**

I'm all for using `self.matrix` as the canonical representation. That's what most of the methods are doing already anyway. We already have …
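A minimal sketch of that copy-returning workaround; the class and the particular threshold method here are illustrative stand-ins, not netrd's actual implementation:

```python
import copy

import numpy as np


class Reconstructor:
    """Hypothetical minimal reconstructor for illustration only."""

    def __init__(self, matrix):
        self.matrix = matrix  # the fitted weights matrix

    def threshold_in_range(self, low, high):
        # return a modified copy rather than mutating self, so the
        # original fitted weights survive interactive experimentation
        new = copy.deepcopy(self)
        mask = (new.matrix >= low) & (new.matrix <= high)
        new.matrix = np.where(mask, new.matrix, 0)
        return new
```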
This commit does not address any of the statefulness concerns. It focuses on the issues of memory usage and the canonical representation of the reconstructed network. For the former, it uses sparse matrices to represent thresholded graphs. For the latter, it removes `update_graph()` and some of the conditionals; the idea is that the following workflow is implied:

1. The reconstructor is initialized with `self.matrix` and `self.graph` as `None`.
2. The `fit()` method creates a matrix of weights that resides at `self.matrix`.
3. Optionally, the matrix is thresholded, turning `self.matrix` from a numpy array into a scipy sparse matrix.
4. Optionally, other operations are performed, like the removal of self-loops or binarization. These functions must work with both dense and sparse matrices.
5. The graph is accessed through the `to_graph()` method, which creates the networkx object from `self.matrix`. This graph is stored at `self.graph` and returned directly if `to_graph()` is called again.
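A minimal sketch of step 5 under that workflow, assuming the attributes described above; densifying a sparse matrix just for the conversion is a simplification on my part, not necessarily what the commit does:

```python
import networkx as nx
from scipy import sparse

def to_graph(self):
    # build the networkx graph lazily from the canonical matrix,
    # caching it so repeated calls return the same object
    if self.graph is None:
        mat = self.matrix
        if sparse.issparse(mat):
            mat = mat.toarray()  # densify only for the conversion step
        self.graph = nx.from_numpy_array(mat)
    return self.graph
```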
My last commit hopefully addresses the inefficiency and representation points somewhat. Now, to the statefulness point. I think …
Can we come up with a particular example of why statefulness is Bad, and how this generative thing is going to help that particular issue? The problem of copying after modification is easily solved by deciding which object to work on (self, or a copy) at the top of the function. This is easy if we use the …
I think the statefulness issue is only likely to come up in interactive use. If I'm reading it right, the …
I don't really care for the …
Language barrier here: by "I don't really care about …" do you mean …? In any case, if …
I don't really like the design pattern of in-place operations in general. If other people want it, I'd be OK with implementing it down the road, but I think for now I'd rather focus on getting a single workflow set up and revisit …
I disagree, but you have the last word here. Can …
A couple of test warnings to track down before merging:
There shouldn't be any comparison operations happening on sparse matrices right now, so I'm not sure exactly where it's coming from.
All non-`fit()` methods that return `self` now operate on copies of the object. This should receive test coverage. Trying it out interactively, it seems to work as expected, returning new copies of things where appropriate. Memory usage didn't appreciably increase, presumably because the threshold functions are creating sparse matrices, but we should revisit whether matrices still need to be copied individually now that the whole object is being copied too. For now I've just repeated the relevant lines of code; in a general cleanup it would probably make sense to use a decorator for this, probably just copying from the MIT-licensed SQLAlchemy repo.
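For reference, such a decorator might look something like this sketch (`returns_copy` is a hypothetical name, not what the PR actually implements):

```python
import copy
from functools import wraps

def returns_copy(method):
    """Run the wrapped method against a deep copy of self, so chained
    operations never mutate the original reconstructor."""
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        return method(copy.deepcopy(self), *args, **kwargs)
    return wrapper
```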
This commit renames `self.graph` and `self.matrix` to `self._graph` and `self._matrix`, a way of signalling that they are intended to be private. As a result, `self._matrix` should only be modified using `self.update_matrix()`; that change is also added in each reconstructor file.
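A minimal sketch of what such a setter might look like; the cache-invalidation detail is an assumption on my part, not necessarily what the commit does:

```python
def update_matrix(self, new_mat):
    # single mutation point for the private matrix; dropping the cached
    # graph keeps to_graph() consistent after the matrix changes
    self._matrix = new_mat
    self._graph = None
```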
From some googling, the issue here might be something to do with …
@leotrs the main stuff is ready for review; everything else in the checklist is window dressing (and contingent on finalizing the design).
Random question: do we ever operate (i.e. multiply) with the matrices? Multiplication with dense/sparse matrices has to be handled carefully.
Also, of course the diff won't catch changes that needed to be made but weren't, so I trust you double-checked all of the files that needed editing. How is our test coverage here?
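To illustrate why it has to be handled carefully (standard numpy/scipy semantics, not code from this PR):

```python
import numpy as np
from scipy import sparse

A = np.arange(4).reshape(2, 2)
S = sparse.csr_matrix(A)

A * A          # dense ndarray: `*` is the elementwise product
S * S          # scipy sparse matrix: `*` is MATRIX multiplication
S.multiply(S)  # elementwise product for sparse matrices
```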
```python
def _binarize_sparse(self, mat):
    return np.abs(mat.sign())

def _binarize_dense(self, mat):
    return np.abs(np.sign(mat))
```
Why do we need both?
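(For comparison, a single method dispatching on type would look something like this sketch; not the PR's code:)

```python
import numpy as np
from scipy import sparse

def _binarize(self, mat):
    # one method handling both cases instead of two private helpers
    if sparse.issparse(mat):
        return abs(mat.sign())
    return np.abs(np.sign(mat))
```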
```diff
-G = nx.empty_graph(n=N)
-self.results['graph'] = G
-return G
+self.update_matrix(np.zeros((N, N)))
```
Can we use sparse matrices for dummy matrices like this? Or is it too clunky later?
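(For reference, an all-zeros sparse dummy is cheap to construct; this is standard scipy usage, not code from this PR:)

```python
from scipy import sparse

N = 100  # illustrative size
# an N x N all-zeros sparse matrix; stores no explicit entries
dummy = sparse.csr_matrix((N, N))
```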
I don't think we do. Likely any matrix multiplication would happen in …
I wouldn't trust me! The test coverage is so-so; we've never had great coverage in the reconstruction submodule. The threshold logic seems to work, and the CCM test, because it was testing for a known result, helped me catch a bunch of bugs. It would be nice (as always) if we had more tests that could apply to all the reconstruction methods.
A checklist of things that need to be done for this:

- [ ] `return self` everywhere it's currently being used
- [ ] document `BaseReconstructor`
- [ ] update `netrdexplorer`
- [ ] update the `notebooks/` folder
- [ ] update the `README` to reflect the changes