Allow for user to specify reduced number of components (axes) #33 #40
Conversation
This feature introduces a new mixin (`ComponentReducerMixin`) to handle dimension reduction for transformers which create "axes" as linear combinations of their `X` features (presently `CCATransformer` and `CCorATransformer`). For these transformers, there is an upper limit on the number of axes that can be used, determined either by matrix rank or by tests for the significance of the axes. The user can specify a hyperparameter (`n_components`) on these transformers to use fewer than the maximum number of axes, which in turn changes the n-dimensional space used for NN finding.

A few relevant changes:
- `CCA` and `CCorA` ordination classes now have a `max_components` property to ensure consistency, so that either can be used in conjunction with `ComponentReducerMixin`
- `CCA` has been refactored to combine the `coefficients` and `axis_weights` into a single matrix, which has been moved into the `projector` method (similar to `CCorA`)
- Both `CCA` and `CCorA`'s `projector` method now accommodate a parameter for `n_components`
- Tests have been added to ensure that transformed `X` data has the correct number of components and to raise errors if the user-specified value is outside the range 0 - `max_components`
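To make the shape of the feature concrete, here is a minimal sketch of what a component-reducing mixin along these lines might look like. This is not the actual sknnr implementation; the `set_n_components` method name and error message are assumptions, and only `n_components`, `max_components`, and the 0 - `max_components` range check come from the description above.

```python
class ComponentReducerMixin:
    """Hypothetical sketch: cap the number of ordination axes a transformer keeps."""

    def __init__(self, n_components=None):
        self.n_components = n_components

    def set_n_components(self, max_components):
        # Use every available axis unless the user asked for fewer.
        if self.n_components is None:
            self.n_components_ = max_components
        elif 0 <= self.n_components <= max_components:
            self.n_components_ = self.n_components
        else:
            raise ValueError(
                f"n_components={self.n_components} must be between 0 and "
                f"{max_components} (the maximum for this ordination)."
            )
```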
@aazuspan, I think this is still pretty rough, but wanted to get some feedback from you in terms of design and specifically the tests. I'll add in some specific comments inlined with the code, but here are the general points:
This is looking great, @grovduck! I left some relatively minor comments, but I think overall this is looking like a really clean design. I think the mixin approach worked great.
> I followed the naming of sklearn's `PCA` and use `n_components` to describe the number of axes to retain. Note that I'm much more used to using the term "axes" instead of "components", but did so to remain consistent.
I think this was the right call!
> I haven't yet created the data to test the estimators with reduced transformers, but I'd like to include those as part of this PR. These would be porting tests again, but I'll need to do this outside of yaImpute.

Great, it makes sense to me to include that in this PR too. I assume that means the ability to specify `n_components` when instantiating estimators would happen here, too?
I'll spend a little more time thinking about test cases tomorrow and let you know if anything else occurs to me.
- Replace estimator attributes `cca_` and `ccora_` with `ordination_`
- Remove `_n_features_out` method and call `self.n_components_` from `get_feature_names_out`
- Remove estimator attribute `projector_` and call ordination's `projector` from `transform`
- Group `CCATransformer` and `CCorATransformer` into `TEST_ORDINATION_TRANSFORMERS`
- Add `match` clause to `pytest.raises`
As always, a great review catching lots of little errors. Still need to work on estimator tests, but I think I've addressed most of your comments.
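For anyone unfamiliar with the `match` clause mentioned in the commit above, it asserts that the raised error's message matches a regex, so the test fails if a different `ValueError` sneaks through. A standalone illustration (the helper function here is invented for the example, not sknnr code):

```python
import pytest


def check_components(n_components, max_components):
    """Toy stand-in for the transformers' validation logic."""
    if not 0 <= n_components <= max_components:
        raise ValueError(f"n_components={n_components} is outside 0-{max_components}")


def test_invalid_n_components_message():
    # match= is treated as a regex against the string of the raised exception
    with pytest.raises(ValueError, match="outside 0-3"):
        check_components(5, max_components=3)
```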
The changes look great! I just noted a few things that I missed on my first pass.
@aazuspan, I've got the bones of this for
Do we want to change code to accommodate these new files? Is this a one-off for testing, or will we want to keep these data files around? I like the "reassurance" that we're getting the correct values, but I don't know how long we want to retain all the porting info. Do you have any opinions on this?
This is a great question. I originally thought the port tests would be temporary, but I agree with you that it's nice to know that everything is staying accurate, as well as functional, so I'm fully in favor of keeping them around in some form or another. Eventually, I suppose we will reach a point where we're confident everything is working and could theoretically test against previous versions of
For now, I guess I would lean towards integrating
- Use 3 components as that will be testable with both GNN and MSN
- Add `n_components` optional keyword to `load_moscow_stjoes_results`
- Test for both `kneighbors` and `predict` on `GNNRegressor` and `MSNRegressor`
- Pass `n_components` from estimator to corresponding transformer
OK, I've got a straw man for testing in there using this suggestion. At present, I'm only testing against a 3-component version of both GNN and MSN. A couple of other things to note:
Sorry this PR is turning into such a beast - I appreciate your help!
Oooo, super cool. I'd want to play around with these packages to fully understand them, but they seem totally applicable for our use. Nice find!
The changes look great - no notes from me! At first glance I was hoping we could parameterize `n_components` as `[None, 3]` and run it with all estimators to avoid duplicating those tests, but I'm guessing you probably already thought of that. I poked around a little while and couldn't come up with any good solution (apparently there's no way to silently skip a test in pytest), so I'd say go ahead and merge if you're ready!
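A sketch of the parameterization being discussed, using a conditional `pytest.skip` since pytest has no built-in way to drop a case silently; the estimator classes here are trivial stand-ins, not the real sknnr estimators:

```python
import pytest


class GNNRegressor:
    """Stand-in for an estimator that supports n_components."""

    def __init__(self, n_components=None):
        self.n_components = n_components


class RawKNNRegressor:
    """Stand-in for an estimator with no n_components hyperparameter."""


@pytest.mark.parametrize("estimator", [GNNRegressor, RawKNNRegressor])
@pytest.mark.parametrize("n_components", [None, 3])
def test_kneighbors(estimator, n_components):
    if n_components is not None and not hasattr(estimator(), "n_components"):
        # Reported as a skip -- pytest can't silently hide the combination here.
        pytest.skip("estimator does not support n_components")
    assert estimator() is not None  # real assertions against ported results go here
```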
For MSNRegressor, I "cheated" by actually running sknnr with reduced components to get the neighbor, distance, and predicted files, so there is no mystery why those tests are passing. I felt a little dirty about doing it this way, but I felt that it was going to take a while to do through yaImpute and I don't have it set up in pynnmap. And it will be good for future tests as well. I'm 99.9% sure that the test files I produced really are the correct answer (famous last words ...)
Good to know! If you're confident in the answers, that's good enough for me :)
> `GNNRegressor` and `MSNRegressor` are looking way too similar to not be refactored into a common base class (at some point). We could easily abstract out the transformer in the base class and specify these in the derived classes. Probably good to tackle when I do the `CCA` and `CCorA` reworking.
I had the same thought, but I'm happy to postpone that for a future PR.
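One possible shape for the base-class refactor floated above, with tiny stand-in transformers so the sketch runs on its own; the class and method names are illustrative, not the eventual sknnr design:

```python
from abc import ABC, abstractmethod


class CCATransformer:
    """Stand-in ordination transformer."""

    def __init__(self, n_components=None):
        self.n_components = n_components

    def fit(self, X, y):
        return self


class CCorATransformer(CCATransformer):
    pass


class TransformedKNNRegressor(ABC):
    """Shared behavior for estimators that find neighbors in an ordination space."""

    def __init__(self, n_neighbors=5, n_components=None):
        self.n_neighbors = n_neighbors
        self.n_components = n_components

    @abstractmethod
    def _get_transformer(self):
        """Return the ordination transformer that defines the NN space."""

    def fit(self, X, y):
        # Forward n_components so the transformer controls the reduced space.
        self.transformer_ = self._get_transformer().fit(X, y)
        # ...fit the k-neighbors search on the transformed X here...
        return self


class GNNRegressor(TransformedKNNRegressor):
    def _get_transformer(self):
        return CCATransformer(n_components=self.n_components)


class MSNRegressor(TransformedKNNRegressor):
    def _get_transformer(self):
        return CCorATransformer(n_components=self.n_components)
```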
> In `test_port.py`, functions `test_kneighbors_reduced_components` and `test_predict_reduced_components`, I have `n_components = 3` hard-wired in there. Should we parametrize this even if we don't have other `n_components` to test?
Nah, I think this is a reasonable approach. If we eventually update the tests to use dynamically generated results with one of those tools we could parameterize over a range of `n_components`, but this looks good for now!
> Oooo, super cool. I'd want to play around with these packages to fully understand them, but they seem totally applicable for our use. Nice find!
Yeah, I'm pretty excited about those! Just glancing through the docs, it sounds like that should simplify our test dataset system a lot.
Again, you give me way too much credit. I hadn't thought of removing that duplication, but it makes sense. You're way ahead of me here, but is this what you were thinking?
It's a little verbose and has too much logic, but I'm happy to change to this if you like this form better. It's not silently skipping (as you say), but is that OK? I read through that thread you sent and it looks like the optimal solution would be to filter out the combinations that don't make sense (e.g.
Yes, that's almost exactly the solution I came up with. It feels odd to me to have tests that will always be skipped, but maybe that's not a big deal? It is nice not to duplicate the logic of those two tests.
Correct me if I'm wrong, but my understanding is that this would require combining the two parameterized args (
Do you have any preference? I think I lean towards either leaving it as-is with duplicated tests or conditionally skipping the invalid tests, and I could live with either.
In the thread you sent, I think this post was the closest I saw to a suggested solution. I'm playing around with it a bit - it looks like you use the
If that's too much, I think I'll probably keep the duplicated tests in there, just because it sounds like the consensus is that
Here's a worked example, copying most of what was in that post:
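The code block and its output from this comment didn't survive the copy, so the following is a reconstruction of the commonly cited `uncollect_if` recipe (a custom mark plus a `pytest_collection_modifyitems` hook in `conftest.py`); the exact code in the original comment may have differed, and the test here is purely illustrative:

```python
# conftest.py
def pytest_configure(config):
    # Register the custom mark so pytest doesn't warn about it.
    config.addinivalue_line(
        "markers", "uncollect_if(func): exclude parameterization combinations"
    )


def pytest_collection_modifyitems(config, items):
    removed, kept = [], []
    for item in items:
        marker = item.get_closest_marker("uncollect_if")
        if marker is not None and marker.kwargs["func"](**item.callspec.params):
            removed.append(item)
        else:
            kept.append(item)
    if removed:
        config.hook.pytest_deselected(items=removed)
        items[:] = kept


# test_example.py
import pytest


def is_invalid_combo(x, y):
    return x is None and y > 0


@pytest.mark.uncollect_if(func=is_invalid_combo)
@pytest.mark.parametrize("x", [None, 1])
@pytest.mark.parametrize("y", [0, 1])
def test_xy(x, y):
    assert y >= 0
```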
The corresponding output reports the excluded combinations as deselected in pytest's run summary rather than as skips.
EDIT: Didn't notice your last post! That looks great, thanks for implementing it! I don't understand exactly how the
- Use new pytest mark to specify how to exclude parameterization combinations
- Remove duplicated logic in tests
Arrgh, I'm losing steam on this PR ... A couple of blockers/annoyances:
I could use your feedback on these issues. The first is a blocker for actually pushing the code, and it might be easier to provide feedback on the second two after I commit/push my new changes.
Wow, dict unions are new to me! I feel like I always miss out on the cool new features by maintaining backwards-compatible packages...
Back to the issue at hand, I'm open to dropping
As far as how we remove it, I'm totally open to you just disabling it in this PR and then cleaning up the rest of the pre-commit stuff in #41. That's probably not strictly the cleanest way to do it, but I think the alternative of doing #41 first would end up blocking this for a while and probably create a headache when you have to merge in all the changes that will stem from running the
As recommended, I'll take a closer look at the other questions once we get to a resolution here!
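For reference, the dict unions mentioned above are the PEP 584 `|` / `|=` operators added in Python 3.9, which is exactly the kind of feature a backwards-compatible package has to hold off on. A tiny illustration with made-up keys:

```python
defaults = {"n_neighbors": 5, "n_components": None}
overrides = {"n_components": 3}

# Requires Python 3.9+; on older versions use {**defaults, **overrides} instead.
params = defaults | overrides
assert params == {"n_neighbors": 5, "n_components": 3}
```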
Excellent, all
To your earlier code suggestions:
I've done this and added a link to the specific comment in the
Cool, I didn't know you could do this here, but I think I've done it correctly.
I actually liked your naming suggestion for the function, so I kept it. I didn't inline it because we can use it for both tests (I think you meant inlining the function within the test itself, right?)
I've done this as well and now
Hopefully, the two questions/concerns I raised earlier will now make sense. Thanks for sticking with me on this one - it's turned out to be a bit epic!
This all looks great! I suggested one minor change, but otherwise (knock on wood) I think this should be ready to merge!
> I actually liked your naming suggestion for the function, so I kept it. I didn't inline it because we can use it for both tests (I think you meant inlining the function within the test itself, right?)
That's what I meant, but wasn't thinking about the fact it would be duplicated. I agree with your choice.
> We can filter out the impossible combinations of (`method`, `n_components`), but in the test we also need to change the hyperparameters based on the estimator type as well. This means we're writing the same check (i.e. `hasattr(estimator(), "n_components")`) both in the new `estimator_does_not_support_n_components` function as well as the test itself.
Yeah, no way to repurpose the `estimator_does_not_support_n_components` since it takes the weird `result` argument instead of an estimator. I'm okay with this, and maybe more importantly can't think of any way to improve it!
> In order to use `estimator_does_not_support_n_components` in both `test_kneighbors` and `test_predict`, the function signature has to accommodate `result` and `n_components` from `test_kneighbors`, but also the `weighted` parameter from `test_predict`. For now, the function signature looks like `def estimator_does_not_support_n_components(result, n_components, **kwargs)`, but I'm not sure that's the right approach.
I see! That function signature is a little ugly, but I think your solution makes the most sense and it's pretty clear that it will just ignore any additional arguments.
`tests/test_port.py` (outdated):
```python
if n_components is not None and not hasattr(estimator(), "n_components"):
    return True
```
Refactoring nit:
```diff
- if n_components is not None and not hasattr(estimator(), "n_components"):
-     return True
+ return n_components is not None and not hasattr(estimator(), "n_components")
```
I told you I was getting tired! 😉
Co-authored-by: Aaron Zuspan <50475791+aazuspan@users.noreply.github.com>
LGTM!
- Remove sourcery from pre-commit per discussion in #40
- Replace flake8 and isort with ruff
- Add new checks via ruff including pyupgrade

This also brings everything into compliance with the new checks:
- Replace nested context managers with single block
- Remove unused param from `test_estimators_support_dataframes`
- Move noqas in docstrings. Ruff wants them placed at the end of the multiline string, not on the offending line.
- Remove unused noqas.