Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken warning and reduce complexity of feature validation #35 #36

Merged
merged 1 commit into from
Jun 8, 2023

Conversation

aazuspan
Copy link
Contributor

@aazuspan aazuspan commented Jun 8, 2023

This closes #35 and simplifies feature validation. This comment and this commit go into more depth on the changes here, but the bullet points are:

  1. Remove set_temp_output because we no longer need transformers to pass dataframes into transformed estimators since we can get feature names from the transformer.
  2. Remove the warning logic from TransformedKNeighborsMixin._check_feature_names. That is now handled by calling transform_._validate_data. This method now just serves to override the internal check that would incorrectly fail when our estimator is fit with transformed features that mismatch the feature_names_in_ from the transformer.
  3. Update tests to make sure we can fit with a dataframe with no feature names without getting a warning.
  4. Parametrize a few related test cases just to improve error reporting.

This overhauls the feature validation to dramatically simplify
changes that were introduced by #22 and #34. I think the easiest way
to explain the changes of this commit are to explain how we ended up
with that complexity.

In #22, we needed to ensure that our estimators got feature names
when fitted with dataframe inputs. This was complicated by the fact
that they received transformed inputs, and our transformers cast
dataframes to arrays, dropping the names. We dealt with this using a
`NamedFeatureArray` class that would retain names after being cast.

In #34, we implemented the `set_output` API and used this instead
to ensure that our transformers returned dataframe outputs. At the
same time, we decided that estimator feature names should match the
names that original names, not the transformed names, and implemented
the `feature_names_in_` property.

What we failed to realize is that the `feature_names_in_` property
removed any need to extract feature names from the transformed inputs
because that extraction is handled entirely by the transformers! This
removed the need to use the `set_output` API.

The other major change is that the `_check_feature_names` method that
we overrode in #34 does not need to be implemented directly because
we can instead use that method from the transformer (via `_validate_data`)
without having to worry about mismatched feature names during fitting.
@aazuspan aazuspan added bug Something isn't working enhancement New feature or request estimator Related to one or more estimators labels Jun 8, 2023
@aazuspan aazuspan added this to the Core Estimators milestone Jun 8, 2023
@aazuspan aazuspan requested a review from grovduck June 8, 2023 22:38
@aazuspan aazuspan self-assigned this Jun 8, 2023
Copy link
Member

@grovduck grovduck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Onward!

@@ -69,23 +69,26 @@ def test_estimators_support_continuous_multioutput(estimator, moscow_euclidean):
estimator.predict(moscow_euclidean.X)


@pytest.mark.parametrize("with_names", [True, False])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice parameterization. That's pretty cool!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't used parametrize much before this, but I'm finding it's pretty powerful!

@aazuspan aazuspan merged commit 5c083e9 into fb_add_estimators Jun 8, 2023
10 checks passed
@aazuspan aazuspan deleted the simplify_feature_validation branch June 8, 2023 23:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request estimator Related to one or more estimators
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Warning when fitting transformed estimator with dataframe with no feature names
2 participants