Dataframe compatibility for transformers #34

Merged: 6 commits into fb_add_estimators from transformer_dataframe_outputs, Jun 7, 2023

Conversation

@aazuspan (Contributor) commented Jun 6, 2023

This PR closes #20 by making our transformers compatible with the set_output API, allowing them to output dataframes based on configuration.

@grovduck, this ballooned into a pretty big PR, so I could definitely use a second set of eyes to make sure I didn't go off the rails anywhere! I tried to lay out the details and reasoning of the big changes below. Most of these were discussed in the issue, but a few things came up during implementation. Happy to discuss anything!

set_output

This method is automatically available for all BaseEstimator subclasses via the _SetOutputMixin once they have a get_feature_names_out method.
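For illustration, here's a minimal sketch (not the sknnr code) of how this plays out in scikit-learn >= 1.2: a transformer that defines get_feature_names_out and pulls in _SetOutputMixin (here via TransformerMixin) gets a working set_output for free.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CenteringTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer used only to illustrate the mixin behavior."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_

    def get_feature_names_out(self, input_features=None):
        return np.asarray([f"center{i}" for i in range(self.mean_.shape[0])], dtype=object)


X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
# set_output is available via the mixin because get_feature_names_out is defined
out = CenteringTransformer().set_output(transform="pandas").fit_transform(X)
print(type(out))  # <class 'pandas.core.frame.DataFrame'>
```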

get_feature_names_out

This was already available for StandardScalerWithDOF via OneToOneFeatureMixin, and adding that mixin to MahalanobisTransformer made it available there.

CCATransformer and CCorATransformer can't use that mixin due to dimensionality reduction. We considered using ClassNamePrefixFeaturesOutMixin but decided against it since the feature names would be too long (e.g. ccatransformer0). Instead, we implement the method from scratch for those two transformers. This required adding an _n_features_out property, which is based on the shape of the eigenvalues and projector attributes, respectively. These should continue to work even if additional dimensionality reduction is added by #33.
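Roughly, the hand-rolled approach looks like this (a sketch only; attribute names are illustrative rather than the exact sknnr implementation):

```python
import numpy as np


class CCATransformerSketch:
    """Illustrative only; mirrors the shape of the approach, not the actual code."""

    @property
    def _n_features_out(self):
        # Assumed: one output feature per retained ordination axis / eigenvalue.
        return len(self.eigenvalues_)

    def get_feature_names_out(self, input_features=None):
        return np.asarray([f"cca{i}" for i in range(self._n_features_out)], dtype=object)
```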

feature_names_in_

After some discussion, we decided that feature_names_in_ for our transformed estimators should return the names of the original features passed by the user rather than the transformed features that are used internally to fit the estimator. To implement that, we add a corresponding property that just grabs the attr from the stored transform_.
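Something along these lines (a sketch; the exact attribute names may differ):

```python
class TransformedEstimatorSketch:
    """Illustrative only; assumes the fitted transformer is stored as ``transform_``."""

    @property
    def feature_names_in_(self):
        # Report the names the user originally passed to fit, as recorded by the
        # transformer, rather than the transformed names (e.g. cca0, cca1, ...).
        return self.transform_.feature_names_in_
```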

_check_feature_names

Predicting with a KNeighborsRegressor runs a variety of checks via _validate_data. One of those checks, _check_feature_names, will fail if the feature names of X don't match the feature_names_in_ that were stored during fitting. Because we override that property, we also need to override this check to avoid that error. However, if we still want to raise warnings when feature names are missing entirely, we need to copy the relevant portion of that method from sklearn, which will need to be licensed under BSD-3. Those warnings are tested by test_estimators_warn_for_missing_features.
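Conceptually, the override looks something like this (a rough sketch, not the BSD-3 licensed code actually borrowed from sklearn):

```python
import warnings


class TransformedEstimatorSketch:
    def _check_feature_names(self, X, *, reset):
        # Skip sklearn's name-matching check: feature_names_in_ is overridden to
        # report pre-transformation names, so it will never match the transformed
        # names seen internally.
        if reset:
            return
        fitted_names = getattr(self, "feature_names_in_", None)
        input_names = getattr(X, "columns", None)
        # Keep (a simplified version of) the warning for missing feature names; the
        # real override borrows this logic from sklearn under its BSD-3 license.
        if fitted_names is not None and input_names is None:
            warnings.warn(
                "X does not have valid feature names, but this estimator was "
                "fitted with feature names"
            )
```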

@grovduck, how important do you think those warnings are? During my usage of sklearn they've never alerted me to any bugs (and have mostly just annoyed me when I'm intentionally using array data after fitting with a dataframe), but if we're aiming for maximum consistency, I suppose we should probably keep them? It just adds a little maintenance and licensing complexity. The alternative would be to just override with an empty check that does nothing.

Pinning sklearn

The set_output API and a few helper classes (including OneToOneFeatureMixin) were added in scikit-learn=1.2 in Dec. 2022, so this pins to that release. I don't love that restriction since it's still a relatively recent release, but we rely pretty heavily on its functionality so I don't see a good alternative.

Removing NamedFeatureArray

Previously, we had to use NamedFeatureArray to persist feature names after they were removed by transformers in our transformed estimators. Now, we can use set_output on the transformer based on the type of the input data, so that dataframes (and feature names) are preserved through transformation. To avoid permanently changing configuration on the transformer, this is done with a helper function set_temp_output that resets transformer configuration once it's done. Note that I'm currently using the presence of an iloc attribute on X to determine if it's a dataframe. This isn't a perfect approach, but it's the same duck-typing check that sklearn uses internally.
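A minimal sketch of the set_temp_output idea (the real helper may be structured differently, e.g. it might restore the previous configuration rather than resetting to "default"):

```python
from contextlib import contextmanager


@contextmanager
def set_temp_output(transformer, temp_mode):
    """Temporarily switch a transformer's output container, then reset it."""
    transformer.set_output(transform=temp_mode)
    try:
        yield transformer
    finally:
        transformer.set_output(transform="default")


def transform_like_input(transformer, X):
    # hasattr(X, "iloc") is the same duck-typing check sklearn uses internally to
    # decide whether an input is a dataframe.
    mode = "pandas" if hasattr(X, "iloc") else "default"
    with set_temp_output(transformer, mode):
        return transformer.transform(X)
```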

Consistency tests

This PR adds a lot of new tests. The majority of these are what I'm calling "consistency" tests, which just test that our transformers and estimators behave similarly to a reference from sklearn in terms of attributes and output types. I think this is a pretty unconventional way to test, but given that consistency with sklearn is one of our priorities, it seemed like a reasonable approach. The alternative would be to map out what the output types should be under all of these conditions and then manually test against those, rather than against the reference implementations.
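For a flavor of what these look like, here's a hypothetical consistency test (the import path and the pairing with a reference estimator are assumptions, not the exact tests in this PR):

```python
import numpy as np
import pandas as pd
import pytest
from sklearn.preprocessing import StandardScaler

from sknnr.transformers import StandardScalerWithDOF  # assumed import path


@pytest.mark.parametrize("output_mode", ["default", "pandas"])
def test_output_type_matches_sklearn_reference(output_mode):
    X = pd.DataFrame(
        np.random.default_rng(0).normal(size=(10, 3)), columns=["a", "b", "c"]
    )
    ours = StandardScalerWithDOF().set_output(transform=output_mode)
    reference = StandardScaler().set_output(transform=output_mode)
    # Under the same configuration, the outputs should be the same container type.
    assert type(ours.fit_transform(X)) is type(reference.fit_transform(X))
```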

aazuspan and others added 3 commits June 6, 2023 11:04
All transformers now support `get_feature_names_out` and `set_output`
methods. The first method was manually implemented for CCA and CCorA
and was inherited from `OneToOneFeatureMixin` for Mahalanobis. The
second method was automatically available once `get_feature_names_out`
was implemented, because all transformers subclass `BaseEstimator` and
indirectly `_SetOutputMixin`. To get `get_feature_names_out` working,
this also implements `_n_features_out` properties for CCA and CCorA.

Tests for these new features are included to ensure that the length of
feature names matches the output features of each transformer, and that
set_output is callable.

Tests are passing, but warnings are raised when estimators are fit with
dataframes. This will be fixed once we use `set_output` to set the
transformer mode in our estimators and remove the `NamedFeatureArray`
hack.

`NamedFeatureArray`, which was used to trick sklearn into storing
feature names from array inputs, is now removed. Instead, we use the
`set_output` method on transformers to ensure that they pass dataframes
through to allow estimators to store feature names.

`feature_names_in_` was overridden for all transformed estimators to
return feature names from the transformer rather than the estimator.
This means that the property will return the names of features that
the user passed in to fit the estimator, rather than the transformed
features that were used internally (e.g. cca0, cca1, etc).

Overriding `feature_names_in_` caused `_check_feature_names` to fail
when called during fitting because the `feature_names_in_` intentionally
mismatch the names seen during fit. To overcome that, we override that
method to remove the affected check. We still need to handle warnings if
feature names are incorrectly missing, so we currently borrow part of the
implementation for that method from sklearn (BSD-3 license).

This commit modifies some calls to `_validate_data` from the previous
commit to avoid overriding X. This is done because `_validate_data` casts
to array, which can cause issues when a transformer calls a subsequent
transformer (e.g. MahalanobisTransformer calls StandardScalerWithDOF)
with a dataframe input, as feature names will not match between the
transformers, leading to user warnings when predicting. Instead, X is
explicitly cast when needed using `np.asarray` and validation is
always done without overriding X. Note that `_validate_data` must be
called by all transformers that do not call it via a superclass because
this indirectly stores feature names.

While implementing this, the difficulty of tracking what output types
are expected from what transformers and estimators with what inputs and
configuration options became VERY clear, so we now have some basic
"consistency" tests that compare all of our transformers and estimators
with a comparable sklearn implementation to check output types and attrs
under a range of situations.

A very small change is that parametrized estimator checks are now passed
classes instead of instances because this leads to more helpful pytest
errors.

- Pin scikit-learn>=1.2. This is the first release with the `set_output`
	API, and added a number of features and classes that we rely on and
	test against (e.g. OneToOneFeatureMixin).
- Remove unused `ClassNamePrefixFeaturesOutMixin`
- Remove unused `input_features` arg to `get_feature_names_out`.
	This is only used for validating or replacing feature names in
	the sklearn implementation, so isn't relevant to our transformers
	that potentially apply dimensionality reduction.
- Fix output types for test functions
@aazuspan added the enhancement and testing labels Jun 6, 2023
@aazuspan requested a review from @grovduck June 6, 2023 19:41
@grovduck (Member) commented Jun 6, 2023

Awesome, @aazuspan. I'll put fresh eyes onto it in the morning ...

@grovduck (Member) left a comment


@aazuspan, such amazing work. There is literally zero chance that I could have done this! It gives me confidence that whatever slop I write, it will at least be guarded by the careful checks you've put in here.

> This PR adds a lot of new tests. The majority of these are what I'm calling "consistency" tests, which just test that our transformers and estimators behave similarly to a reference from sklearn in terms of attributes and output types. I think this is a pretty unconventional way to test, but given that consistency with sklearn is one of our priorities, it seemed like a reasonable approach. The alternative would be to map out what the output types should be under all of these conditions and then manually test against those, rather than against the reference implementations.

I think this is completely reasonable to test against known sklearn transformers/estimators to be sure that they behave similarly, even if it is unconventional. It seems to me that it will bake in checks against those estimators should sklearn change anything on their API, right?

The only other test I could think of was to make sure that the expected output ("pandas" vs "default") was returned if our transformers were put into a pipeline, but perhaps that is implicitly covered by one of your existing transformer tests.

I'm going to approve - I didn't have any substantive changes to make, mostly just a few clarification questions.

Thanks again for the great work.

Review comments (resolved): pyproject.toml, src/sknnr/_base.py, src/sknnr/transformers/_cca_transformer.py, src/sknnr/transformers/_ccora_transformer.py, tests/test_estimators.py, tests/test_transformers.py
@aazuspan (Contributor, Author) commented Jun 7, 2023

> It seems to me that it will bake in checks against those estimators should sklearn change anything on their API, right?

Yes, that was my thought! Hopefully that won't happen, but better to know if it does.

> The only other test I could think of was to make sure that the expected output ("pandas" vs "default") was returned if our transformers were put into a pipeline, but perhaps that is implicitly covered by one of your existing transformer tests.

Were you thinking of a case where set_output is called on the pipeline? Looking at the source for pipeline.set_output, that should work as long as our transformers have a set_output method, so I do think it's indirectly covered. But if you think there's something that's worth covering more explicitly, I'm always open to more tests!

@grovduck (Member) commented Jun 7, 2023

> Were you thinking of a case where set_output is called on the pipeline? Looking at the source for pipeline.set_output, that should work as long as our transformers have a set_output method, so I do think it's indirectly covered. But if you think there's something that's worth covering more explicitly, I'm always open to more tests!

Yes, exactly what I was thinking and it makes sense that these would always be indirectly covered (I wrote a quick snippet to prove it to myself as well). I sometimes lean on "tests-as-feature-documentation" and often have redundancy in what I'm testing, so definitely not necessary to include additional tests.
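Something along the lines of that quick snippet (the transformer choice and import path here are illustrative, not necessarily what was actually run):

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

from sknnr.transformers import MahalanobisTransformer  # assumed import path

pipeline = Pipeline(
    [("transform", MahalanobisTransformer()), ("knn", KNeighborsRegressor())]
)
# Pipeline.set_output forwards the configuration to each transforming step, so our
# transformers participate automatically once they expose set_output.
pipeline.set_output(transform="pandas")
```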

@grovduck (Member) commented Jun 7, 2023

LGTM! Once again, awesome stuff!

@aazuspan (Contributor, Author) commented Jun 7, 2023

Thanks Matt, merging now!

@aazuspan merged commit 58396ff into fb_add_estimators Jun 7, 2023
10 checks passed
@aazuspan deleted the transformer_dataframe_outputs branch June 7, 2023 22:53
@aazuspan linked an issue Jun 7, 2023 that may be closed by this pull request
aazuspan added a commit that referenced this pull request Jun 8, 2023
This overhauls the feature validation to dramatically simplify
changes that were introduced by #22 and #34. I think the easiest way
to explain the changes of this commit is to explain how we ended up
with that complexity.

In #22, we needed to ensure that our estimators got feature names
when fitted with dataframe inputs. This was complicated by the fact
that they received transformed inputs, and our transformers cast
dataframes to arrays, dropping the names. We dealt with this using a
`NamedFeatureArray` class that would retain names after being cast.

In #34, we implemented the `set_output` API and used this instead
to ensure that our transformers returned dataframe outputs. At the
same time, we decided that estimator feature names should match the
original names, not the transformed names, and implemented
the `feature_names_in_` property.

What we failed to realize is that the `feature_names_in_` property
removed any need to extract feature names from the transformed inputs
because that extraction is handled entirely by the transformers! This
removed the need to use the `set_output` API.

The other major change is that the `_check_feature_names` method that
we overrode in #34 does not need to be implemented directly because
we can instead use that method from the transformer (via `_validate_data`)
without having to worry about mismatched feature names during fitting.
Successfully merging this pull request may close these issues: Transformed regressors drop dataframe feature names.