Dataframe compatibility for transformers #34
Conversation
All transformers now support the `get_feature_names_out` and `set_output` methods. `get_feature_names_out` was implemented manually for CCA and CCorA and inherited from `OneToOneFeatureMixin` for Mahalanobis. `set_output` became available automatically once `get_feature_names_out` was implemented, because all transformers subclass `BaseEstimator` and, indirectly, `_SetOutputMixin`. To get `get_feature_names_out` working, this also implements `_n_features_out` properties for CCA and CCorA.

Tests for these new features are included to ensure that the length of the feature names matches the number of output features of each transformer, and that `set_output` is callable. Tests are passing, but warnings are raised when estimators are fit with dataframes. This will be fixed once we use `set_output` to set the transformer mode in our estimators and remove the `NamedFeatureArray` hack.
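As a rough illustration of what those tests verify, a sketch along these lines would cover both behaviors (the import path, fit signatures, and data here are hypothetical, not taken from the PR):

```python
import numpy as np
import pytest

# Hypothetical import path; the real module layout may differ.
from sknnr.transformers import CCATransformer, CCorATransformer, MahalanobisTransformer


@pytest.mark.parametrize(
    "transformer_cls", [CCATransformer, CCorATransformer, MahalanobisTransformer]
)
def test_feature_names_match_output(transformer_cls):
    rng = np.random.default_rng(42)
    X, y = rng.uniform(size=(20, 5)), rng.uniform(size=(20, 3))
    transformer = transformer_cls().fit(X, y)
    X_t = transformer.transform(X)
    # One output feature name per transformed column
    assert len(transformer.get_feature_names_out()) == np.asarray(X_t).shape[1]
    # set_output should be available once get_feature_names_out exists
    assert callable(transformer.set_output)
```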
`NamedFeatureArray`, which was used to trick sklearn into storing feature names from array inputs, is now removed. Instead, we use the `set_output` method on transformers to ensure that they pass dataframes through, allowing estimators to store feature names.

`feature_names_in_` was overridden for all transformed estimators to return feature names from the transformer rather than the estimator. This means that the property will return the names of the features that the user passed in to fit the estimator, rather than the transformed features that were used internally (e.g. cca0, cca1, etc.). Overriding `feature_names_in_` caused `_check_feature_names` to fail when called during fitting, because `feature_names_in_` intentionally mismatches the names seen during fit. To overcome that, we override that method to remove the affected check. We still need to warn when feature names are unexpectedly missing, so we currently borrow part of the implementation of that method from sklearn (BSD-3 license).

This commit also modifies some calls to `_validate_data` from the previous commit to avoid overriding X. `_validate_data` casts to array, which can cause issues when a transformer calls a subsequent transformer (e.g. MahalanobisTransformer calls StandardScalerWithDOF) with a dataframe input: feature names will not match between the transformers, leading to user warnings when predicting. Instead, X is explicitly cast when needed using `np.asarray`, and validation is always done without overriding X. Note that `_validate_data` must be called by all transformers that do not call it via a superclass, because it indirectly stores feature names.

While implementing this, the difficulty of tracking which output types are expected from which transformers and estimators with which inputs and configuration options became VERY clear, so we now have some basic "consistency" tests that compare all of our transformers and estimators with a comparable sklearn implementation to check output types and attrs under a range of situations. A very small change is that parametrized estimator checks are now passed classes instead of instances, because this leads to more helpful pytest errors.
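A minimal sketch of the validate-without-overriding pattern (the class here is illustrative, not one of the actual transformers):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ExampleTransformer(TransformerMixin, BaseEstimator):
    """Minimal sketch: validate X without overwriting it."""

    def fit(self, X, y=None):
        # _validate_data indirectly stores n_features_in_ and
        # feature_names_in_, but we discard its return value so that X keeps
        # its original type (e.g. a dataframe passed to a nested transformer).
        self._validate_data(X, reset=True)
        self.mean_ = np.asarray(X).mean(axis=0)  # cast explicitly where needed
        return self

    def transform(self, X):
        self._validate_data(X, reset=False)
        return np.asarray(X) - self.mean_
```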
- Pin `scikit-learn>=1.2`. This is the first release with the `set_output` API, and it added a number of features and classes that we rely on and test against (e.g. `OneToOneFeatureMixin`).
- Remove unused `ClassNamePrefixFeaturesOutMixin`.
- Remove the unused `input_features` arg to `get_feature_names_out`. This is only used for validating or replacing feature names in the sklearn implementation, so it isn't relevant to our transformers that potentially apply dimensionality reduction.
- Fix output types for test functions.
Awesome, @aazuspan. I'll put fresh eyes onto it in the morning ...
@aazuspan, such amazing work. There is literally zero chance that I could have done this! It makes me have confidence that whatever slop I write will at least be guarded by the careful checks you've put in here.
> This PR adds a lot of new tests. The majority of these are what I'm calling "consistency" tests, which just test that our transformers and estimators behave similarly to a reference from sklearn in terms of attributes and output types. I think this is a pretty unconventional way to test, but given that consistency with sklearn is one of our priorities, it seemed like a reasonable approach. The alternative would be to map out what the output types should be under all of these conditions and then manually test against those, rather than against the reference implementations.
I think this is completely reasonable to test against known sklearn transformers/estimators to be sure that they behave similarly, even if it is unconventional. It seems to me that it will bake in checks against those estimators should sklearn change anything in their API, right?
The only other test I could think of was to make sure that the expected output ("pandas" vs "default") was returned if our transformers were put into a pipeline, but perhaps that is implicitly covered by one of your existing transformer tests.
I'm going to approve - I didn't have any substantive changes to make, mostly just a few clarification questions.
Thanks again for the great work.
Yes, that was my thought! Hopefully that won't happen, but better to know if it does.
Were you thinking of a case where …

Yes, exactly what I was thinking, and it makes sense that these would always be indirectly covered (I wrote a quick snippet to prove it to myself as well). I sometimes lean on "tests-as-feature-documentation" and often have redundancy in what I'm testing, so it's definitely not necessary to include additional tests.

LGTM! Once again, awesome stuff!

Thanks Matt, merging now!
This overhauls the feature validation to dramatically simplify changes that were introduced by #22 and #34. I think the easiest way to explain the changes in this commit is to explain how we ended up with that complexity.

In #22, we needed to ensure that our estimators got feature names when fitted with dataframe inputs. This was complicated by the fact that they received transformed inputs, and our transformers cast dataframes to arrays, dropping the names. We dealt with this using a `NamedFeatureArray` class that would retain names after being cast.

In #34, we implemented the `set_output` API and used this instead to ensure that our transformers returned dataframe outputs. At the same time, we decided that estimator feature names should match the original names, not the transformed names, and implemented the `feature_names_in_` property. What we failed to realize is that the `feature_names_in_` property removed any need to extract feature names from the transformed inputs, because that extraction is handled entirely by the transformers! This removed the need to use the `set_output` API.

The other major change is that the `_check_feature_names` method that we overrode in #34 does not need to be implemented directly, because we can instead use that method from the transformer (via `_validate_data`) without having to worry about mismatched feature names during fitting.
This PR closes #20 by making our transformers compatible with the `set_output` API, allowing them to output dataframes based on configuration.
@grovduck, this ballooned into a pretty big PR, so I could definitely use a second set of eyes to make sure I didn't go off the rails anywhere! I tried to lay out the details and reasoning of the big changes below. Most of these were discussed in the issue, but a few things came up during implementation. Happy to discuss anything!
`set_output`

This method is automatically available for all `BaseEstimator` subclasses via the `_SetOutputMixin` once they have a `get_feature_names_out` method.
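For example (the import path is hypothetical, and this assumes the transformer fits on `X` alone):

```python
import pandas as pd

# Hypothetical import path; the real module layout may differ.
from sknnr.transformers import MahalanobisTransformer

X = pd.DataFrame(
    {"elev": [120.0, 250.0, 310.0, 180.0], "precip": [10.0, 32.0, 21.0, 15.0]}
)

transformer = MahalanobisTransformer().set_output(transform="pandas")
X_t = transformer.fit_transform(X)  # returns a DataFrame with named columns
```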
`get_feature_names_out`

This was already available for `StandardScalerWithDOF` via `OneToOneFeatureMixin`, and adding that subclass to `MahalanobisTransformer` made it available there. `CCATransformer` and `CCorATransformer` can't use that subclass due to dimensionality reduction. We considered using `ClassNamePrefixFeaturesOutMixin` but decided against it since the feature names would be too long (e.g. `ccatransformer0`). Instead, we just implement the method from scratch for those two transformers. This required adding an `_n_features_out` property, which is based on the shape of the `eigenvalues` and `projector` attributes, respectively. These should continue to work even if additional dimensionality reduction is added by #33.
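Roughly, the from-scratch implementation for the CCA case is shaped like this (a simplified sketch; the fitted attribute name is illustrative):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CCATransformer(TransformerMixin, BaseEstimator):
    """Simplified sketch; the real class also fits and applies the CCA."""

    @property
    def _n_features_out(self):
        # Output dimensionality comes from the fitted solution, e.g. the
        # number of eigenvalues (attribute name is illustrative).
        return len(self.eigenvalues_)

    def get_feature_names_out(self):
        # Dimensionality reduction means output names can't be derived
        # one-to-one from input names, so generate them by component index.
        return np.asarray(
            [f"cca{i}" for i in range(self._n_features_out)], dtype=object
        )
```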
`feature_names_in_`

After some discussion, we decided that `feature_names_in_` for our transformed estimators should return the names of the original features passed by the user, rather than the transformed features that are used internally to fit the estimator. To implement that, we add a corresponding property that just grabs the attr from the stored `transform_`.
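In sketch form (the estimator name is illustrative; `transform_` is the stored transformer described above):

```python
from sklearn.neighbors import KNeighborsRegressor


class TransformedKNeighborsRegressor(KNeighborsRegressor):  # illustrative name
    @property
    def feature_names_in_(self):
        # Report the original names stored by the fitted transformer rather
        # than the transformed names (e.g. cca0, cca1) used internally. The
        # companion _check_feature_names override (next section) keeps fit
        # from trying to set this read-only property.
        return self.transform_.feature_names_in_
```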
`_check_feature_names`
Predicting with a `KNeighborsRegressor` runs a variety of checks via `_validate_data`. One of those checks, `_check_feature_names`, will fail if the feature names of `X` don't match the `feature_names_in_` that were stored during fitting. Because we override that property, we also need to override this check to avoid that error. However, if we still want to raise warnings when feature names are missing entirely, we need to copy the relevant portion of that method from sklearn, which will need to be licensed under BSD-3. Those warnings are tested by `test_estimators_warn_for_missing_features`.

@grovduck, how important do you think those warnings are? During my usage of sklearn they've never alerted me to any bugs (and have mostly just annoyed me when I'm intentionally using array data after fitting with a dataframe), but if we're aiming for maximum consistency, I suppose we should probably keep them? It just adds a little maintenance and licensing complexity. The alternative would be to just override with an empty check that does nothing.
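For reference, the override is roughly shaped like this (heavily simplified; the warning logic is adapted from sklearn's `_check_feature_names`, BSD-3):

```python
import warnings


def _check_feature_names(self, X, *, reset):
    """Simplified sketch of the overridden check.

    Skips sklearn's name-mismatch check, since feature_names_in_ is
    intentionally overridden, but keeps a warning when X is missing the
    feature names that the estimator was fitted with.
    """
    if reset:
        # Don't store names here; feature_names_in_ is a read-only property
        # that defers to the fitted transformer.
        return

    fitted_with_names = getattr(self, "feature_names_in_", None) is not None
    if fitted_with_names and not hasattr(X, "columns"):
        warnings.warn(
            "X does not have valid feature names, but the estimator was "
            "fitted with feature names."
        )
```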
Pinning sklearn
The `set_output` API and a few helper classes (including `OneToOneFeatureMixin`) were added in `scikit-learn=1.2` in Dec. 2022, so this pins to that release. I don't love that restriction since it's still a relatively recent release, but we rely pretty heavily on its functionality, so I don't see a good alternative.
Removing `NamedFeatureArray`

Previously, we had to use `NamedFeatureArray` to persist feature names after they were removed by transformers in our transformed estimators. Now, we can use `set_output` on the transformer based on the type of the input data, so that dataframes (and feature names) are preserved through transformation. To avoid permanently changing configuration on the transformer, this is done with a helper function `set_temp_output` that resets the transformer configuration once it's done. Note that I'm currently using the presence of an `iloc` attribute on `X` to determine if it's a dataframe. This isn't a perfect approach, but it seems to be the go-to in sklearn (for example).
Consistency tests

This PR adds a lot of new tests. The majority of these are what I'm calling "consistency" tests, which just test that our transformers and estimators behave similarly to a reference from `sklearn` in terms of attributes and output types. I think this is a pretty unconventional way to test, but given that consistency with `sklearn` is one of our priorities, it seemed like a reasonable approach. The alternative would be to map out what the output types should be under all of these conditions and then manually test against those, rather than against the reference implementations.
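As a concrete example of the pattern, one of these tests might look like this (the import path is hypothetical):

```python
import pandas as pd
import pytest
from sklearn.preprocessing import StandardScaler

# Hypothetical import path; the real module layout may differ.
from sknnr.transformers import StandardScalerWithDOF


@pytest.mark.parametrize("output_mode", ["default", "pandas"])
def test_output_type_matches_sklearn(output_mode):
    X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
    ours = StandardScalerWithDOF().set_output(transform=output_mode)
    reference = StandardScaler().set_output(transform=output_mode)
    # Under identical configuration, our transformer should return the same
    # container type as the sklearn reference.
    assert type(ours.fit_transform(X)) is type(reference.fit_transform(X))
```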