Dataframe compatibility for transformers #34

Merged: 6 commits into fb_add_estimators from transformer_dataframe_outputs, Jun 7, 2023

Conversation

@aazuspan (Contributor) commented Jun 6, 2023

This PR closes #20 by making our transformers compatible with the set_output API, allowing them to output dataframes based on configuration.

@grovduck, this ballooned into a pretty big PR, so I could definitely use a second set of eyes to make sure I didn't go off the rails anywhere! I tried to lay out the details and reasoning of the big changes below. Most of these were discussed in the issue, but a few things came up during implementation. Happy to discuss anything!

set_output

This method is automatically available for all BaseEstimator subclasses via the _SetOutputMixin once they have a get_feature_names_out method.
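For illustration, here's a minimal sketch (not the sknnr code) of how this plays out in scikit-learn >= 1.2: a transformer that defines get_feature_names_out and pulls in _SetOutputMixin (here via TransformerMixin) gets a working set_output for free.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CenteringTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer used only to illustrate the mixin behavior."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_

    def get_feature_names_out(self, input_features=None):
        return np.asarray([f"center{i}" for i in range(self.mean_.shape[0])], dtype=object)


X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
# set_output is available via the mixin because get_feature_names_out is defined
out = CenteringTransformer().set_output(transform="pandas").fit_transform(X)
print(type(out))  # <class 'pandas.core.frame.DataFrame'>
```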

get_feature_names_out

This was already available for StandardScalerWithDOF via OneToOneFeatureMixin, and adding that mixin to MahalanobisTransformer made it available there.

CCATransformer and CCorATransformer can't use that mixin due to dimensionality reduction. We considered using ClassNamePrefixFeaturesOutMixin but decided against it since the feature names would be too long (e.g. ccatransformer0). Instead, we implement the method from scratch for those two transformers. This required adding an _n_features_out property, which is based on the shape of the eigenvalues and projector attributes, respectively. These should continue to work even if additional dimensionality reduction is added by #33.
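Roughly, the hand-rolled approach looks like this (a sketch only; attribute names are illustrative rather than the exact sknnr implementation):

```python
import numpy as np


class CCATransformerSketch:
    """Illustrative only; mirrors the shape of the approach, not the actual code."""

    @property
    def _n_features_out(self):
        # Assumed: one output feature per retained ordination axis / eigenvalue.
        return len(self.eigenvalues_)

    def get_feature_names_out(self, input_features=None):
        return np.asarray([f"cca{i}" for i in range(self._n_features_out)], dtype=object)
```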

feature_names_in_

After some discussion, we decided that feature_names_in_ for our transformed estimators should return the names of the original features passed by the user rather than the transformed features that are used internally to fit the estimator. To implement that, we add a corresponding property that just grabs the attr from the stored transform_.
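Something along these lines (a sketch; the exact attribute names may differ):

```python
class TransformedEstimatorSketch:
    """Illustrative only; assumes the fitted transformer is stored as ``transform_``."""

    @property
    def feature_names_in_(self):
        # Report the names the user originally passed to fit, as recorded by the
        # transformer, rather than the transformed names (e.g. cca0, cca1, ...).
        return self.transform_.feature_names_in_
```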

_check_feature_names

Predicting with a KNeighborsRegressor runs a variety of checks via _validate_data. One of those checks, _check_feature_names, will fail if the feature names of X don't match the feature_names_in_ that were stored during fitting. Because we override that property, we also need to override this check to avoid that error. However, if we still want to raise warnings when feature names are missing entirely, we need to copy the relevant portion of that method from sklearn, which will need to be licensed under BSD-3. Those warnings are tested by test_estimators_warn_for_missing_features.
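Conceptually, the override looks something like this (a rough sketch, not the BSD-3 licensed code actually borrowed from sklearn):

```python
import warnings


class TransformedEstimatorSketch:
    def _check_feature_names(self, X, *, reset):
        # Skip sklearn's name-matching check: feature_names_in_ is overridden to
        # report pre-transformation names, so it will never match the transformed
        # names seen internally.
        if reset:
            return
        fitted_names = getattr(self, "feature_names_in_", None)
        input_names = getattr(X, "columns", None)
        # Keep (a simplified version of) the warning for missing feature names; the
        # real override borrows this logic from sklearn under its BSD-3 license.
        if fitted_names is not None and input_names is None:
            warnings.warn(
                "X does not have valid feature names, but this estimator was "
                "fitted with feature names"
            )
```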

@grovduck, how important do you think those warnings are? During my usage of sklearn they've never alerted me to any bugs (and have mostly just annoyed me when I'm intentionally using array data after fitting with a dataframe), but if we're aiming for maximum consistency, I suppose we should probably keep them? It just adds a little maintenance and licensing complexity. The alternative would be to just override with an empty check that does nothing.

Pinning sklearn

The set_output API and a few helper classes (including OneToOneFeatureMixin) were added in scikit-learn=1.2 in Dec. 2022, so this pins to that release. I don't love that restriction since it's still a relatively recent release, but we rely pretty heavily on its functionality so I don't see a good alternative.

Removing NamedFeatureArray

Previously, we had to use NamedFeatureArray to persist feature names after they were removed by transformers in our transformed estimators. Now, we can use set_output on the transformer based on the type of the input data, so that dataframes (and feature names) are preserved through transformation. To avoid permanently changing configuration on the transformer, this is done with a helper function set_temp_output that resets transformer configuration once it's done. Note that I'm currently using the presence of an iloc attribute on X to determine if it's a dataframe. This isn't a perfect approach, but it's the same duck-typing check that sklearn uses internally.
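A minimal sketch of the set_temp_output idea (the real helper may be structured differently, e.g. it might restore the previous configuration rather than resetting to "default"):

```python
from contextlib import contextmanager


@contextmanager
def set_temp_output(transformer, temp_mode):
    """Temporarily switch a transformer's output container, then reset it."""
    transformer.set_output(transform=temp_mode)
    try:
        yield transformer
    finally:
        transformer.set_output(transform="default")


def transform_like_input(transformer, X):
    # hasattr(X, "iloc") is the same duck-typing check sklearn uses internally to
    # decide whether an input is a dataframe.
    mode = "pandas" if hasattr(X, "iloc") else "default"
    with set_temp_output(transformer, mode):
        return transformer.transform(X)
```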

Consistency tests

This PR adds a lot of new tests. The majority of these are what I'm calling "consistency" tests, which just test that our transformers and estimators behave similarly to a reference from sklearn in terms of attributes and output types. I think this is a pretty unconventional way to test, but given that consistency with sklearn is one of our priorities, it seemed like a reasonable approach. The alternative would be to map out what the output types should be under all of these conditions and then manually test against those, rather than against the reference implementations.
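For a flavor of what these look like, here's a hypothetical consistency test (the import path and the pairing with a reference estimator are assumptions, not the exact tests in this PR):

```python
import numpy as np
import pandas as pd
import pytest
from sklearn.preprocessing import StandardScaler

from sknnr.transformers import StandardScalerWithDOF  # assumed import path


@pytest.mark.parametrize("output_mode", ["default", "pandas"])
def test_output_type_matches_sklearn_reference(output_mode):
    X = pd.DataFrame(
        np.random.default_rng(0).normal(size=(10, 3)), columns=["a", "b", "c"]
    )
    ours = StandardScalerWithDOF().set_output(transform=output_mode)
    reference = StandardScaler().set_output(transform=output_mode)
    # Under the same configuration, the outputs should be the same container type.
    assert type(ours.fit_transform(X)) is type(reference.fit_transform(X))
```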

aazuspan and others added 3 commits June 6, 2023 11:04
All transformers now support `get_feature_names_out` and `set_output`
methods. The first method was manually implemented for CCA and CCorA
and was inherited from `OneToOneFeatureMixin` for Mahalanobis. The
second method was automatically available once `get_feature_names_out`
was implemented, because all transformers subclass `BaseEstimator` and
indirectly `_SetOutputMixin`. To get `get_feature_names_out` working,
this also implements `_n_features_out` properties for CCA and CCorA.

Tests for these new features are included to ensure that the length of
feature names matches the output features of each transformer, and that
set_output is callable.

Tests are passing, but warnings are raised when estimators are fit with
dataframes. This will be fixed once we use `set_output` to set the
transformer mode in our estimators and remove the `NamedFeatureArray`
hack.

`NamedFeatureArray`, which was used to trick sklearn into storing
feature names from array inputs, is now removed. Instead, we use the
`set_output` method on transformers to ensure that they pass dataframes
through to allow estimators to store feature names.

`feature_names_in_` was overridden for all transformed estimators to
return feature names from the transformer rather than the estimator.
This means that the property will return the names of features that
the user passed in to fit the estimator, rather than the transformed
features that were used internally (e.g. cca0, cca1, etc).

Overriding `feature_names_in_` caused `_check_feature_names` to fail
when called during fitting because the `feature_names_in_` intentionally
mismatch the names seen during fit. To overcome that, we override that
method to remove the affected check. We still need to handle warnings if
feature names are incorrectly missing, so we currently borrow part of the
implementation for that method from sklearn (BSD-3 license).

This commit modifies some calls to `_validate_data` from the previous
commit to avoid overriding X. This is done because `_validate_data` casts
to array, which can cause issues when a transformer calls a subsequent
transformer (e.g. MahalanobisTransformer calls StandardScalerWithDOF)
with a dataframe input, as feature names will not match between the
transformers, leading to user warnings when predicting. Instead, X is
explicitly cast when needed using `np.asarray` and validation is
always done without overriding X. Note that `_validate_data` must be
called by all transformers that do not call it via a superclass because
this indirectly stores feature names.

While implementing this, the difficulty of tracking what output types
are expected from what transformers and estimators with what inputs and
configuration options became VERY clear, so we now have some basic
"consistency" tests that compare all of our transformers and estimators
with a comparable sklearn implementation to check output types and attrs
under a range of situations.

A very small change is that parametrized estimator checks are now passed
classes instead of instances because this leads to more helpful pytest
errors.

- Pin scikit-learn>=1.2. This is the first release with the `set_output`
	API, and added a number of features and classes that we rely on and
	test against (e.g. OneToOneFeatureMixin).
- Remove unused `ClassNamePrefixFeaturesOutMixin`
- Remove unused `input_features` arg to `get_feature_names_out`.
	This is only used for validating or replacing feature names in
	the sklearn implementation, so isn't relevant to our transformers
	that potentially apply dimensionality reduction.
- Fix output types for test functions
@aazuspan added the enhancement and testing labels Jun 6, 2023
@aazuspan requested a review from @grovduck June 6, 2023 19:41
@grovduck (Member) commented Jun 6, 2023

Awesome, @aazuspan. I'll put fresh eyes onto it in the morning ...

@grovduck (Member) left a comment


@aazuspan, such amazing work. There is literally zero chance that I could have done this! It gives me confidence that whatever slop I write, it will at least be guarded by the careful checks you've put in here.

> This PR adds a lot of new tests. The majority of these are what I'm calling "consistency" tests, which just test that our transformers and estimators behave similarly to a reference from sklearn in terms of attributes and output types. I think this is a pretty unconventional way to test, but given that consistency with sklearn is one of our priorities, it seemed like a reasonable approach. The alternative would be to map out what the output types should be under all of these conditions and then manually test against those, rather than against the reference implementations.

I think this is completely reasonable to test against known sklearn transformers/estimators to be sure that they behave similarly, even if it is unconventional. It seems to me that it will bake in checks against those estimators should sklearn change anything on their API, right?

The only other test I could think of was to make sure that the expected output ("pandas" vs "default") was returned if our transformers were put into a pipeline, but perhaps that is implicitly covered by one of your existing transformer tests.

I'm going to approve - I didn't have any substantive changes to make, mostly just a few clarification questions.

Thanks again for the great work.

Review comments (resolved): pyproject.toml, src/sknnr/_base.py, src/sknnr/transformers/_cca_transformer.py, src/sknnr/transformers/_ccora_transformer.py, tests/test_estimators.py, tests/test_transformers.py
@aazuspan (Contributor, Author) commented Jun 7, 2023

> It seems to me that it will bake in checks against those estimators should sklearn change anything on their API, right?

Yes, that was my thought! Hopefully that won't happen, but better to know if it does.

> The only other test I could think of was to make sure that the expected output ("pandas" vs "default") was returned if our transformers were put into a pipeline, but perhaps that is implicitly covered by one of your existing transformer tests.

Were you thinking of a case where set_output is called on the pipeline? Looking at the source for pipeline.set_output, that should work as long as our transformers have a set_output method, so I do think it's indirectly covered. But if you think there's something that's worth covering more explicitly, I'm always open to more tests!

@grovduck (Member) commented Jun 7, 2023

> Were you thinking of a case where set_output is called on the pipeline? Looking at the source for pipeline.set_output, that should work as long as our transformers have a set_output method, so I do think it's indirectly covered. But if you think there's something that's worth covering more explicitly, I'm always open to more tests!

Yes, exactly what I was thinking and it makes sense that these would always be indirectly covered (I wrote a quick snippet to prove it to myself as well). I sometimes lean on "tests-as-feature-documentation" and often have redundancy in what I'm testing, so definitely not necessary to include additional tests.
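Something along the lines of that quick snippet (the transformer choice and import path here are illustrative, not necessarily what was actually run):

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

from sknnr.transformers import MahalanobisTransformer  # assumed import path

pipeline = Pipeline(
    [("transform", MahalanobisTransformer()), ("knn", KNeighborsRegressor())]
)
# Pipeline.set_output forwards the configuration to each transforming step, so our
# transformers participate automatically once they expose set_output.
pipeline.set_output(transform="pandas")
```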

@grovduck (Member) commented Jun 7, 2023

LGTM! Once again, awesome stuff!

@aazuspan (Contributor, Author) commented Jun 7, 2023

Thanks Matt, merging now!

@aazuspan merged commit 58396ff into fb_add_estimators Jun 7, 2023
10 checks passed
@aazuspan deleted the transformer_dataframe_outputs branch June 7, 2023 22:53
@aazuspan linked an issue Jun 7, 2023 that may be closed by this pull request
aazuspan added a commit that referenced this pull request Jun 8, 2023
This overhauls the feature validation to dramatically simplify
changes that were introduced by #22 and #34. I think the easiest way
to explain the changes of this commit is to explain how we ended up
with that complexity.

In #22, we needed to ensure that our estimators got feature names
when fitted with dataframe inputs. This was complicated by the fact
that they received transformed inputs, and our transformers cast
dataframes to arrays, dropping the names. We dealt with this using a
`NamedFeatureArray` class that would retain names after being cast.

In #34, we implemented the `set_output` API and used this instead
to ensure that our transformers returned dataframe outputs. At the
same time, we decided that estimator feature names should match the
original names, not the transformed names, and implemented
the `feature_names_in_` property.

What we failed to realize is that the `feature_names_in_` property
removed any need to extract feature names from the transformed inputs
because that extraction is handled entirely by the transformers! This
removed the need to use the `set_output` API.

The other major change is that the `_check_feature_names` method that
we overrode in #34 does not need to be implemented directly because
we can instead use that method from the transformer (via `_validate_data`)
without having to worry about mismatched feature names during fitting.
Successfully merging this pull request may close these issues: Transformed regressors drop dataframe feature names.