Warning when fitting transformed estimator with dataframe with no feature names #35

Closed
aazuspan opened this issue Jun 7, 2023 · 5 comments · Fixed by #36
Comments

aazuspan commented Jun 7, 2023

This might set a record for the shortest time between merging a PR and finding a new bug that it caused...

When fitting a transformed estimator with a dataframe without feature names, _check_feature_names incorrectly warns: UserWarning: X has feature names, but GNNRegressor was fitted without feature names.

To reproduce:

from sklearn.datasets import load_linnerud
import pandas as pd
from sknnr import GNNRegressor

X, y = load_linnerud(return_X_y=True)
# Note that columns are not specified
X_df = pd.DataFrame(X)
est = GNNRegressor().fit(X_df, y)

This wasn't caught by our tests because we specify column names for all of our dataframe inputs, and this will only occur when the dataframe has no column names. What happens is that est.transform_ gets fit with a dataframe with no feature names, but outputs a dataframe that does have feature names (in this case, [cca0, cca1]). Once fit works its way through the MRO to KNeighborsRegressor.fit, it calls _validate_data which calls our overridden _check_feature_names that complains about the mismatched feature names.
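
For illustration, here is roughly what that mismatch looks like, continuing the reproduction above (a sketch; the transform_ attribute and cca names are taken from this discussion and may not match the library exactly):

# The transformer was fitted on a dataframe with integer column labels,
# so sklearn records no feature names for it...
print(hasattr(est.transform_, "feature_names_in_"))  # False

# ...but in pandas output mode it still emits generated column names,
# which the overridden _check_feature_names later compares against the
# (nonexistent) original names, triggering the warning.
X_transformed = est.transform_.transform(X_df)
print(list(X_transformed.columns))  # e.g. ['cca0', 'cca1']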

A quick fix for this is to change the condition under which we use pandas output mode on the transformer so that we only use it if sklearn can extract feature names, but I feel like that whole section might be unnecessarily complex, so I'll give this a little more thought before I push anything...
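
To make that condition concrete, a hypothetical helper (not actual sknnr code) could check whether sklearn would be able to extract feature names from X, i.e. whether X is a dataframe whose column labels are all strings:

def _should_use_pandas_output(X) -> bool:
    # Hypothetical check: sklearn only extracts feature names from a
    # dataframe whose columns are all strings, so only switch the
    # transformer to pandas output mode in that case.
    columns = getattr(X, "columns", None)
    return columns is not None and all(isinstance(c, str) for c in columns)

With the X_df from the reproduction above this returns False, so the transformer would stay in default (array) output mode and the mismatched names would never be generated.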

aazuspan added the bug and estimator labels Jun 7, 2023
aazuspan added this to the Core Estimators milestone Jun 7, 2023
aazuspan self-assigned this Jun 7, 2023
aazuspan added a commit that referenced this issue Jun 8, 2023
This overhauls the feature validation to dramatically simplify
changes that were introduced by #22 and #34. I think the easiest way
to explain the changes in this commit is to explain how we ended up
with that complexity.

In #22, we needed to ensure that our estimators got feature names
when fitted with dataframe inputs. This was complicated by the fact
that they received transformed inputs, and our transformers cast
dataframes to arrays, dropping the names. We dealt with this using a
`NamedFeatureArray` class that would retain names after being cast.

In #34, we implemented the `set_output` API and used this instead
to ensure that our transformers returned dataframe outputs. At the
same time, we decided that estimator feature names should match the
original names, not the transformed names, and implemented
the `feature_names_in_` property.

What we failed to realize is that the `feature_names_in_` property
removed any need to extract feature names from the transformed inputs
because that extraction is handled entirely by the transformers! This
removed the need to use the `set_output` API.

The other major change is that the `_check_feature_names` method that
we overrode in #34 does not need to be implemented directly because
we can instead use that method from the transformer (via `_validate_data`)
without having to worry about mismatched feature names during fitting.
grovduck commented Jun 8, 2023

@aazuspan, great explanation of the issue and I understand what's going on. Correct me if I'm wrong, but this definitely feels like a bit of a corner case, in that a user would have to read in X as an array, then convert to a dataframe without specifying column names, right?

I'm a bit conflicted over what the expected behavior should be. One argument would be that given that the user (intentionally/unintentionally) passed X as a dataframe, it feels like we should respect that choice (as far as column names) even if column names are meaningless. But I think that goes against your proposed fix, because I think you'd set the output mode to be default, correct?

aazuspan commented Jun 8, 2023

I think you're right that this is definitely an edge case, and probably not worth worrying about beyond maintaining consistency with sklearn. However...

While trying to solve this, I realized that I added a ton of unnecessary complexity in #34. The commit message above goes into some depth, but essentially I got stuck on the idea that we needed the transformer to pass dataframes to our estimator so that we could get feature names. That became totally unnecessary once we added the feature_names_in_ property, since we now pull the names straight from the transformer.
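
As a sketch (the exact sknnr implementation may differ), the feature_names_in_ property described here amounts to delegating to the fitted transformer:

class _Sketch:  # hypothetical stand-in for the transformed estimator
    @property
    def feature_names_in_(self):
        # Report the original input names from the fitted transformer
        # instead of extracting them from the transformed X.
        return self.transform_.feature_names_in_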

Removing that requirement simplifies the _apply_transform method a lot and allows us to ditch the overly confusing set_temp_output function.
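
Roughly, that lets _apply_transform collapse to a thin wrapper (again a hypothetical sketch, not the actual code):

class _Sketch:  # hypothetical stand-in for the transformed estimator
    def _apply_transform(self, X):
        # The transformer can return a plain array now; feature names come
        # from the transformer itself via feature_names_in_, so there is no
        # set_output / set_temp_output juggling here.
        return self.transform_.transform(X)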

With those gone, and after a few hours tracing through exactly what methods get called with what data, I realized that we could also simplify _check_feature_names. Basically, that method gets called in two circumstances:

  1. When fitting the transformed estimator, it gets called by KNeighborsRegressor.fit and compares the transformed X (e.g. cca0) against the original feature names (e.g. PSME_BA). This is the check that gives us problems and the reason we need to override it, because unlike sklearn, we don't want those to match. This is also the check that was causing the error in this issue, and is indirectly tested by test_estimators_support_dataframes.
  2. When predicting, it gets called by KNeighborsMixin.kneighbors and compares the new X (e.g. PSME_BA) to the X that the transformer was fit with (hopefully PSME_BA). We do want this check to ensure that a user doesn't accidentally predict with different features than they fit with. That's the behavior tested by test_estimators_warn_for_missing_features.

Trying to accomplish both of those objectives (don't warn when fitting but do warn when predicting) was why I had to copy some of the logic from the sklearn implementation. However, I realize now that we can avoid that complexity (and the licensing concerns that come with copying sklearn code) by simply disabling that check and instead running it via the transformer in _apply_transform. That allows us to totally disable the check in case 1 above, but get the full check via the transformer in case 2.
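
Concretely, that amounts to turning the estimator-level check into a no-op and adding the transformer-side check to the _apply_transform sketch above (hypothetical code; method names follow this discussion):

class _Sketch:  # hypothetical stand-in for the transformed estimator
    def _check_feature_names(self, X, *, reset):
        # Case 1: during fit, the transformed names (e.g. cca0) intentionally
        # differ from the original names, so skip the estimator-level check.
        return

    def _apply_transform(self, X):
        # Case 2: let the fitted transformer compare X's feature names to the
        # names it was fit with, so predicting with different features than
        # were used for fitting still warns (the behavior covered by
        # test_estimators_warn_for_missing_features).
        self.transform_._check_feature_names(X, reset=False)
        return self.transform_.transform(X)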

Sorry for the whiplash of adding a bunch of features then pulling them all out again! I realize this is a lot to digest, so I'm happy to clarify or discuss further. But if this all makes sense and you don't see any glaring holes in my logic, I'm ready to make a PR that would fix this and simplify the validation.

grovduck commented Jun 8, 2023

I'm so glad you understand everything, but I'm right at about 30-40% on this 😄. I'm going to take the cop-out that I generally understand how you've simplified things (and the code is much simpler) and trust that you have a good handle on all of this. I'm good with you moving forward with the PR and promise to stare at it more such that I get to 60-80%!

aazuspan commented Jun 8, 2023

That's fair enough!

aazuspan added a commit that referenced this issue Jun 8, 2023
aazuspan commented Jun 8, 2023

Resolved by #36
