Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor CCA ordination classes and add new constrained ordination methods #60

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

grovduck
Copy link
Member

@grovduck grovduck commented Oct 2, 2023

This PR partially addresses #49 once completed. At present, it only addresses CCA and will require further changes based on class designs that have changed. Here are the current changes and some comments which need to be addressed before proceeding.

Changes

  • Remove all unreferenced properties in OrdConstrained and renamed to ConstrainedOrdination
  • Change properties to variables when property is not called outside class
  • Change ContrainedOrdination to abstract class with two abstract methods: _transform_X and _transform_Y.
  • Refactored CCA to just implement _transform_X and _transform_Y
  • Added RDA transformer (redundancy analysis) as another estimator that inherits from ConstrainedOrdination
  • Added GNNRDARegressor as another GNN technique. This needs to be further refactored in order to reduce duplication with GNNRegressor.

Issues

  • While the design of CCA and RDA is reasonable (and two other ordination methods which follow this pattern can further be implemented), there is way too much duplication in the associated transformers (CCATransformer and RDATransformer) as well as NN estimators (GNNRegressor and GNNRDARegressor). This particular combination (RDA + kNN) is not fully supported in yaImpute and, to my knowledge, hasn't been widely used as a mapping technique. However, RDA can be used as a technique to compare two different multivariate datasets and, as such, can be used as a valid estimator. We may want to think about providing the method (i.e. "cca" vs. "rda" vs. others yet to be implemented) as a hyperparameter to a ConstrainedTransformer.
  • I think there may be some duplication with the new helper function called is_2d_numeric_numpy_array and constraints of the transformers themselves. We may not need to be as stringent with all checks as the transformer would likely cover most cases.
  • I collapsed non-referenced properties in what was OrdConstrained (now ConstrainedOrdination) into local variables and only retained the properties that are used by the enclosing transformer. This might limit the utility of these classes for other purposes, but we can always expose these local variables if needed.
  • I've added a __call__ method to ConstrainedOrdination as a means for setting the needed instance properties. The return value of __call__ is the instance itself which gets passed back to the enclosing transformer (e.g. CCATransformer). Not sure if this is a good pattern or not.

Update (2023.10.04)

  • CCATransformer has now been replaced with a generic ConstrainedTransformer class that takes a method hyperparameter argument that currently accepts [cca, rda] and defaults to cca.
  • Likewise GNNRegressor now takes a method hyperparameter and includes ConstrainedTransformer as its transformer.
  • Removed __call__ from ConstrainedOrdination in favor of placing it all into the __init__ function.

Update (2023.10.17)

  • Abstract methods ConstrainedOrdination._transform_X and ConstrainedOrdination._transform_Y have been combined into a single abstract method ConstrainedOrdination._set_initialization_attributes which is implemented in subclasses
  • ConstrainedOrdination._check_inputs now returns X and Y arrays rather than modifying the instance attributes directly
  • Cleaned up logic in ConstrainedTransformer to ensure that passed constrained_method is a valid option

- Remove all unreferenced properties in `OrdConstrained` and renamed to
  `ConstrainedOrdination`
- Change properties to variables when property is not called outside
  class
- Change `ContrainedOrdination` to abstract class with two abstract
  methods: `_transform_X` and `_transform_Y`.
- Refactored `CCA` to just implement `_transform_X` and `_transform_Y`
- Added `RDA` transformer (redundancy analysis) as another estimator
  that inherits from `ConstrainedOrdination`
- Added `GNNRDARegressor` as another GNN technique.  This needs to be
  further refactored in order to reduce duplication with `GNNRegressor`.
@grovduck
Copy link
Member Author

grovduck commented Oct 2, 2023

@aazuspan, this is a bit of a mess right now, but wanted to get your eyes on it before I went too much further with it. As I mention above, we can implement RDA and a couple of other ordination techniques fairly easily from the new ConstrainedOrdination superclass, but I think we'll want to think about how that cascades up to the transformer and then to the estimator. Please let me know if my explanation isn't clear enough. I've provided very scant documentation in the code as well, so I can enhance that if they are confusing.

Mainly, I wanted to get this committed and pushed so that I can look at your work on #59.

(Looks like I'm failing a documentation check, so I'll probably need a bit of hand-holding to know what I need to do to make that check pass.)

@grovduck grovduck self-assigned this Oct 2, 2023
@grovduck grovduck added refactor Code cleanup without changing functionality estimator Related to one or more estimators labels Oct 2, 2023
@grovduck grovduck added this to the Complete estimators milestone Oct 2, 2023
@grovduck grovduck changed the title WIP: Refactor ordination classes to remove extraneous details Refactor ordination classes and add new family of constrained ordination techniques Oct 2, 2023
@grovduck grovduck changed the title Refactor ordination classes and add new family of constrained ordination techniques Refactor ordination classes and add new constrained ordination methods Oct 2, 2023
Copy link
Contributor

@aazuspan aazuspan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take another pass tomorrow and try to look a little closer at some of the details, but from what I've seen this looks great! Huge improvement! As you mentioned, there's room for reducing duplication, but that hopefully shouldn't be too hard once we figure out the best approach (inheritance vs. parameters or something else).

I've added a call method to ConstrainedOrdination as a means for setting the needed instance properties. The return value of call is the instance itself which gets passed back to the enclosing transformer (e.g. CCATransformer). Not sure if this is a good pattern or not.

This is one of the areas I need to take a closer look at, but was there an advantage to this approach over setting everything through __init__? If I understand right, you would have to re-initialize the ordination to get a different result out of __call__, but I'm not clear if there might be a case where you would want to initialize without calling or call multiple times.

I think there may be some duplication with the new helper function called is_2d_numeric_numpy_array and constraints of the transformers themselves. We may not need to be as stringent with all checks as the transformer would likely cover most cases.

Yeah, I noted something similar in one of my comments. I agree there's some redundancy between the validation checks, although I'm not sure whether it makes more sense to do the checks up front in the transformer or keep that complexity at a lower level in the ordination...

This particular combination (RDA + kNN) is not fully supported in yaImpute and, to my knowledge, hasn't been widely used as a mapping technique.

Cool, it's exciting that we'll be able to offer additional functionality beyond yaImpute! Out of curiosity, how did you generate the test data?

(Looks like I'm failing a documentation check, so I'll probably need a bit of hand-holding to know what I need to do to make that check pass.)

You can ignore that! I had to set up RTD to build docs for all PRs to make sure it was building correctly for #59, but because this branch doesn't have any config for the docs, RTD rightfully complains. I can disable that setting if it gets too annoying (it will trigger every new commit), but otherwise we can just ignore and merge with the failing check until the docs get merged in.

src/sknnr/_gnn_rda.py Outdated Show resolved Hide resolved
src/sknnr/_gnn_rda.py Outdated Show resolved Hide resolved
src/sknnr/transformers/_cca.py Outdated Show resolved Hide resolved
src/sknnr/transformers/_cca.py Outdated Show resolved Hide resolved
src/sknnr/transformers/_cca.py Outdated Show resolved Hide resolved
src/sknnr/transformers/_cca.py Outdated Show resolved Hide resolved
src/sknnr/transformers/_cca_transformer.py Outdated Show resolved Hide resolved
src/sknnr/transformers/_cca_transformer.py Outdated Show resolved Hide resolved
src/sknnr/transformers/_cca.py Outdated Show resolved Hide resolved
src/sknnr/transformers/_cca.py Outdated Show resolved Hide resolved
- New file for ConstrainedOrdination superclass
- Separate RDA class into new file
- Abstract methods ConstrainedOrdination._transform_X and
  ConstrainedOrdiation._transform_Y now modify `X` and `Y` arrays in
  place
- Create new ConstrainedTransformer class which replaces CCATransformer
  and takes "method" as a hyperparameter
- Modify GNNRegressor to take "method" as a hyperparameter and to use
  ConstrainedTransformer instead of CCATransformer
- Fix tests to correctly "uncollect" parameterizations that are not
  logical
- Rename all "gnn_moscow_*" test data to "gnn_cca_moscow_*"
- Miscellaneous changes based on comments
@grovduck
Copy link
Member Author

grovduck commented Oct 4, 2023

@aazuspan, thanks for the great review. Very helpful comments (even on a pretty messy draft) and I think I've addressed most of your comments although there are a few still left to resolve. I'm also thinking about just having this PR address the CCA/RDA side of things and then I'll start another PR on CCorA. Otherwise, I think this will be too unwieldy.

A couple of comments below, but I mostly have inline responses to your review.

I've added a call method to ConstrainedOrdination as a means for setting the needed instance properties. The return value of call is the instance itself which gets passed back to the enclosing transformer (e.g. CCATransformer). Not sure if this is a good pattern or not.

This is one of the areas I need to take a closer look at, but was there an advantage to this approach over setting everything through __init__? If I understand right, you would have to re-initialize the ordination to get a different result out of __call__, but I'm not clear if there might be a case where you would want to initialize without calling or call multiple times.

No, you're right, there is no reason to have a __call__ method in there. I've removed it with this round and put it back into __init__. It felt a bit "dirty" to do all the actual work in __init__, but I think that's preferable to an unfamiliar API.

You can ignore that! I had to set up RTD to build docs for all PRs to make sure it was building correctly for #59, but because this branch doesn't have any config for the docs, RTD rightfully complains. I can disable that setting if it gets too annoying (it will trigger every new commit), but otherwise we can just ignore and merge with the failing check until the docs get merged in.

Sounds good. My guess is that we'll get #59 in place first and then I'll merge those changes into this branch.

@grovduck
Copy link
Member Author

grovduck commented Oct 4, 2023

Also, I'd love your review of test_port.py on this go around. We now have two reasons to "uncollect" tests:

  1. If an estimator does not support n_components and a value for n_components was specified
  2. If an estimator does not support method and a value for constrained_method was specified 1

For 2 though it gets even a bit more tricky in that we want to exclude the GNNRegressor tests that don't pass a constrained_method. So we really need to uncollect if the estimator supports method and a value for constrained_method was not specified.

There is also the issue that only one function can be specified for the pytest.mark.uncollect_if marker, so now there is a new function called estimator_does_not_support_parametrization that "or"s the two other conditions.

You'll also see that I've renamed the test data files called "gnn_moscow_" to "gnn_cca_moscow_" so that the method variable finds the correct dataset when calling load_moscow_stjoes_results. There's definitely some duplicated logic across the tests that I'm not particularly happy with, so if you have suggestions there, I'd love to hear it.

Footnotes

  1. You'll clearly note the inconsistency in naming here - does it make sense to change method to constrained_method in GNNRegressor and ConstrainedTransformer?

@aazuspan
Copy link
Contributor

aazuspan commented Oct 4, 2023

I'm also thinking about just having this PR address the CCA/RDA side of things and then I'll start another PR on CCorA.

Sounds good!

My guess is that we'll get #59 in place first and then I'll merge those changes into this branch.

Yeah, this is a good idea since we'll need to make some changes to the API Reference for the renamed CCATransformer.

Also, I'd love your review of test_port.py on this go around.

Definitely - I'll take a close look at the testing side tomorrow!

@grovduck
Copy link
Member Author

grovduck commented Oct 4, 2023

Out of curiosity, how did you generate the test data?

I forgot to answer this question. It's an ugly mess, but I created local copies of yaImpute functions (specifically yai and newtargets) and added a "gnn_rda" section that uses a call to vegan's rda instead of cca. If/when I finish this off, I can do the same with the other two methods: capscale and dbrda, but need to be a bit more confident that I know generally how they differ from cca/rda.

@lemma-osu lemma-osu deleted a comment from aazuspan Oct 5, 2023
@aazuspan
Copy link
Contributor

aazuspan commented Oct 5, 2023

Also, I'd love your review of test_port.py on this go around.

Thanks for the thorough explanation of the updated uncollecting system! I think your solution looks great, considering all the complexity of different cases we have to cover.

It's possible there's some way we could clean up the data loading and parameterization system now that we have a better idea of what we need, but I don't have any specific ideas, and I'm thinking it's probably not worth the time or effort to do a big redesign when it's ultimately going to be replaced by #42.

There is also the issue that only one function can be specified for the pytest.mark.uncollect_if marker, so now there is a new function called estimator_does_not_support_parametrization that "or"s the two other conditions.

This seems like a good workaround. I suspect we could probably modify pytest_collection_modifyitems to work with multiple uncollect_if markers by iterating over all markers instead of using get_closest_marker, but that seems like a lot of trouble for very little benefit, especially given that our port tests are going to eventually be replaced by #42.

You'll clearly note the inconsistency in naming here - does it make sense to change method to constrained_method in GNNRegressor and ConstrainedTransformer?

I'm fine with either! I don't think that we should change it for the sake of our tests, but it is a more descriptive name, so I don't think it would hurt to switch.

To help with the confusion between the method/constrained_method in the tests, I think we could modify the test code to refer to the estimator name as name or id or something else instead of method, but I don't have a strong feeling on that.

- Combine ConstrainedOrdination._transform_X (and _Y) into new
  abstractmethod ConstrainedOrdination._set_initialization_attributes
  which sets instance-level attributes
- Change parameter `method` to `constrained_method` in GNNRegressor and
  ConstrainedTransformer
- Set selection of subclass ordination based on dictionary lookup rather
  than if/else logic
@grovduck
Copy link
Member Author

grovduck commented Oct 5, 2023

It's possible there's some way we could clean up the data loading and parameterization system now that we have a better idea of what we need, but I don't have any specific ideas, and I'm thinking it's probably not worth the time or effort to do a big redesign when it's ultimately going to be replaced by #42.

I might be confused, but I think we'll still have to have the uncollect system in place to run the right combinations, but we won't have to build the correct estimator name in order to fetch the yaImpute-based files. Is that right? Or are you thinking that we won't test all combinations once we have regression testing in place?

I'm fine with either! I don't think that we should change it for the sake of our tests, but it is a more descriptive name, so I don't think it would hurt to switch.

To help with the confusion between the method/constrained_method in the tests, I think we could modify the test code to refer to the estimator name as name or id or something else instead of method, but I don't have a strong feeling on that.

I decided to change method to constrained_method in GNNRegressor and ConstrainedOrdination even though it felt a bit redundant to have ConstrainedOrdination.constrained_method. I kept the tests more or less the same.

Some other very small changes:

  • Changed is_2d_numeric_array to have required keyword arguments
  • Combined zero_sum_rows and zero_sum_columns to zero_sum_vectors with an axis parameter. It too has required keyword arguments.

That feels safer to me, but it might be too strigent?

@aazuspan
Copy link
Contributor

aazuspan commented Oct 6, 2023

I might be confused, but I think we'll still have to have the uncollect system in place to run the right combinations, but we won't have to build the correct estimator name in order to fetch the yaImpute-based files. Is that right?

I hadn't thought that far ahead, apparently! Of course you're right that we'll still need to parameterize and uncollect.

Changed is_2d_numeric_array to have required keyword arguments

My personal preference with required kwargs is to use them when it isn't clear what the args should be or when the order is arbitrary, so if it were just me I would probably keep arr positional but make axis a required kwarg. But that's just me! It certainly doesn't hurt to require them.

Combined zero_sum_rows and zero_sum_columns to zero_sum_vectors with an axis parameter

Good call!

@grovduck grovduck changed the title Refactor ordination classes and add new constrained ordination methods Refactor CCA ordination classes and add new constrained ordination methods Oct 17, 2023
- Changed first argument to be positional, rather than required keyword
- Edit error message to return list rather than dict_keys
@grovduck
Copy link
Member Author

My personal preference with required kwargs is to use them when it isn't clear what the args should be or when the order is arbitrary, so if it were just me I would probably keep arr positional but make axis a required kwarg. But that's just me! It certainly doesn't hurt to require them.

I agree with your advice on this. I've reverted these functions such that arr is positional and axis is a required kwarg.

@grovduck
Copy link
Member Author

Aaaand it looks like we'll need a docs change already 😉. I'll address that once I decide about adding the other two methods for ConstrainedOrdination.

https://readthedocs.org/projects/sknnr/builds/22259462/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
estimator Related to one or more estimators refactor Code cleanup without changing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants