Store feature names for transformed estimators #22
Conversation
Transformed estimators like GNNRegressor run a transformer on X before fitting or predicting. When X is a dataframe, transforming converts it into an array, preventing sklearn from extracting feature names. To fix this, we wrap the transformed array in an ndarray subclass called NamedFeatureArray that can store a `columns` attribute before being passed to `fit` or `predict`. This tricks sklearn into thinking it is a dataframe and allows feature names to be successfully accessed and set on the estimator. To accomplish this cleanly, we move all the actual transformation steps out of the individual estimators and into the TransformedKNeighborsMixin methods. If we need to implement different `predict` methods for different estimators in the future, they can be re-implemented at the estimator level using the `_transform` method of their superclass. To prevent regressions, this commit also expands the dataframe support test to check that feature names are correctly stored.
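The trick described above relies on sklearn duck-typing on a `columns` attribute when extracting feature names. A minimal sketch of the idea (illustrative, not the actual package code) might look like this:

```python
import numpy as np


class NamedFeatureArray(np.ndarray):
    """An ndarray subclass that carries a `columns` attribute, so code
    that duck-types on `columns` (like sklearn's feature-name extraction)
    treats it like a dataframe."""

    def __new__(cls, array, columns=None):
        obj = np.asarray(array).view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        # Called on views/slices so the attribute survives array ops.
        if obj is None:
            return
        self.columns = getattr(obj, "columns", None)


X = NamedFeatureArray(np.zeros((3, 2)), columns=["a", "b"])
print(list(X.columns))  # ['a', 'b']
```

Because the result is still a plain ndarray in every other respect, it can be passed straight into `fit` or `predict` without further changes.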
Hey @aazuspan, beautiful PR. I like this design quite a bit and it feels really natural. To your questions:
> I think […]
Yes, let's do this (I love how clean those top-level estimators are now!), although I think there are a couple of scenarios to consider. The first is an estimator like RF-NN, which will use a "distance" measure of node similarity, where the distance between a pixel and plot is given by […] The second is a bit more nuanced. For […]
> leaving […]
Totally understandable. Like you, I imagine #21 will take a while to get right, and it's nice to have tests in place for this one for now. Especially because it didn't look like that was part of the "natural" suite of tests.
> I do like the […]
Yeah, I agree. That looks like a good organization to use if we end up needing a lot more utility code, but for now it's probably overkill.
This is a very interesting point... So […] In any case, it sounds like this probably won't interact with […]
Thanks for the explanation here! I think I follow the complication, and your logic of using model hyperparameters and […]
Looks great! Go forward, I say!
Exactly right.
This is such a great question, and I think the answer is complicated. I'll try to give a synopsis of how RF-NN works and where I see the […]
Obviously, I think there is still a bit of work to get this implemented and I'm not clear on the path forward yet. But if you see clear paths, please continue to ask questions. We can definitely lift this from here into a separate issue as well, although it may just be thoughts at this point. Moved discussion of this issue to #24.
Very interesting, thanks for the detailed explanations! I can see why you're saving RF-NN for last given the additional complexity over the other estimators, but it sounds like you've already thought through a lot of the nuances.
This is probably a good idea! I think it's possible to go overboard setting up every foreseeable future issue, but this seems like it's clearly on the roadmap and it would be good to consolidate discussion in one place.
All transformers now support `get_feature_names_out` and `set_output` methods. The first method was manually implemented for CCA and CCorA and was inherited from `OneToOneFeatureMixin` for Mahalanobis. The second method was automatically available once `get_feature_names_out` was implemented, because all transformers subclass `BaseEstimator` and indirectly `_SetOutputMixin`. To get `get_feature_names_out` working, this also implements `_n_features_out` properties for CCA and CCorA. Tests for these new features are included to ensure that the length of feature names matches the output features of each transformer, and that set_output is callable. Tests are passing, but warnings are raised when estimators are fit with dataframes. This will be fixed once we use `set_output` to set the transformer mode in our estimators and remove the `NamedFeatureArray` hack.
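The manual implementation described for CCA and CCorA can be sketched roughly as below. The transformer, its feature-name prefix, and the `_n_features_out` value are all illustrative, not the project's actual code; the key point is that once `get_feature_names_out` exists on a `TransformerMixin`/`BaseEstimator` subclass, sklearn's `set_output` API (sklearn ≥ 1.2) becomes usable as well.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class FirstTwoComponents(TransformerMixin, BaseEstimator):
    """Toy transformer that keeps the first two columns of X."""

    def fit(self, X, y=None):
        self.n_features_in_ = np.asarray(X).shape[1]
        return self

    def transform(self, X):
        return np.asarray(X)[:, :2]

    @property
    def _n_features_out(self):
        # Number of output features; CCA/CCorA would derive this
        # from the fitted decomposition.
        return 2

    def get_feature_names_out(self, input_features=None):
        # Prefix "cc" is arbitrary here; each transformer picks its own.
        return np.asarray(
            [f"cc{i}" for i in range(self._n_features_out)], dtype=object
        )


t = FirstTwoComponents().fit(np.zeros((4, 5)))
print(list(t.get_feature_names_out()))  # ['cc0', 'cc1']
```

Testing that `len(get_feature_names_out())` equals the width of `transform(X)` is then a one-line assertion per transformer.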
`NamedFeatureArray`, which was used to trick sklearn into storing feature names from array inputs, is now removed. Instead, we use the `set_output` method on transformers to ensure that they pass dataframes through, allowing estimators to store feature names. `feature_names_in_` was overridden for all transformed estimators to return feature names from the transformer rather than the estimator. This means the property will return the names of the features that the user passed in to fit the estimator, rather than the transformed features that were used internally (e.g. cca0, cca1, etc.).

Overriding `feature_names_in_` caused `_check_feature_names` to fail when called during fitting, because `feature_names_in_` intentionally mismatches the names seen during fit. To overcome that, we override that method to remove the affected check. We still need to handle warnings if feature names are incorrectly missing, so we currently borrow part of that method's implementation from sklearn (BSD-3 license).

This commit also modifies some calls to `_validate_data` from the previous commit to avoid overwriting X. This is done because `_validate_data` casts to array, which can cause issues when a transformer calls a subsequent transformer (e.g. MahalanobisTransformer calls StandardScalerWithDOF) with a dataframe input: feature names will not match between the transformers, leading to user warnings when predicting. Instead, X is explicitly cast when needed using `np.asarray`, and validation is always done without overwriting X. Note that `_validate_data` must be called by all transformers that do not call it via a superclass, because it indirectly stores feature names.
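The `feature_names_in_` override amounts to a property that delegates to the fitted transformer. A self-contained sketch (all class names here are hypothetical stand-ins, not the package's actual classes):

```python
import numpy as np


class FakeTransformer:
    """Stand-in for a fitted transformer that recorded input names."""

    def __init__(self, names):
        self.feature_names_in_ = np.asarray(names, dtype=object)


class TransformedEstimator:
    """Toy estimator whose feature names come from its transformer."""

    def __init__(self, transformer):
        self.transform_ = transformer

    @property
    def feature_names_in_(self):
        # Report the user's original names, not the transformed names
        # that were used internally (e.g. "cca0", "cca1").
        return self.transform_.feature_names_in_


est = TransformedEstimator(FakeTransformer(["elev", "precip"]))
print(list(est.feature_names_in_))  # ['elev', 'precip']
```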
While implementing this, the difficulty of tracking what output types are expected from what transformers and estimators with what inputs and configuration options became VERY clear, so we now have some basic "consistency" tests that compare all of our transformers and estimators with a comparable sklearn implementation to check output types and attrs under a range of situations. A very small change is that parametrized estimator checks are now passed classes instead of instances because this leads to more helpful pytest errors.
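A consistency test in this spirit might look like the sketch below. The parametrized classes and the compared attributes are placeholders (here the class is trivially compared against itself); the structural points are that classes, not instances, are parametrized, so pytest names the failing class, and that outputs are compared against a reference sklearn implementation.

```python
import numpy as np
import pytest
from sklearn.neighbors import KNeighborsRegressor


@pytest.mark.parametrize("estimator_class", [KNeighborsRegressor])
def test_output_types_match_sklearn(estimator_class):
    rng = np.random.default_rng(42)
    X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
    ours = estimator_class(n_neighbors=3).fit(X, y)
    reference = KNeighborsRegressor(n_neighbors=3).fit(X, y)
    # Same stored attributes and output types regardless of implementation.
    assert ours.n_features_in_ == reference.n_features_in_
    assert isinstance(ours.predict(X), type(reference.predict(X)))
```

The same test body can then be repeated with dataframe inputs and different `set_output` configurations to cover the situations that motivated it.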
This overhauls the feature validation to dramatically simplify changes that were introduced by #22 and #34. I think the easiest way to explain the changes in this commit is to explain how we ended up with that complexity. In #22, we needed to ensure that our estimators got feature names when fitted with dataframe inputs. This was complicated by the fact that they received transformed inputs, and our transformers cast dataframes to arrays, dropping the names. We dealt with this using a `NamedFeatureArray` class that would retain names after being cast. In #34, we implemented the `set_output` API and used this instead to ensure that our transformers returned dataframe outputs. At the same time, we decided that estimator feature names should match the original input names, not the transformed names, and implemented the `feature_names_in_` property. What we failed to realize is that the `feature_names_in_` property removed any need to extract feature names from the transformed inputs, because that extraction is handled entirely by the transformers! This removed the need to use the `set_output` API. The other major change is that the `_check_feature_names` method that we overrode in #34 does not need to be implemented directly, because we can instead use that method from the transformer (via `_validate_data`) without having to worry about mismatched feature names during fitting.
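A pure-Python sketch of the simplified flow (class and attribute names are illustrative, not the package's actual code): the transformer alone records input feature names during validation, the estimator fits on the plain transformed array, and name lookups are delegated, so neither `set_output` nor a `NamedFeatureArray` wrapper is needed.

```python
import numpy as np


class MiniTransformer:
    """Stand-in transformer; sklearn's `_validate_data` would do this."""

    def fit(self, X):
        # Record names if X is dataframe-like, as validation does.
        cols = getattr(X, "columns", None)
        if cols is not None:
            self.feature_names_in_ = np.asarray(cols, dtype=object)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float)  # a plain ndarray is fine now


class MiniTransformedEstimator:
    def fit(self, X, y=None):
        self.transform_ = MiniTransformer().fit(X)
        X_t = self.transform_.transform(X)
        self._train_shape = X_t.shape  # placeholder for real fitting
        return self

    @property
    def feature_names_in_(self):
        # Names come from the transformer, never from transformed X.
        return self.transform_.feature_names_in_


class FrameLike:
    """Tiny dataframe stand-in with `columns` (avoids a pandas import)."""

    def __init__(self, data, columns):
        self.data, self.columns = data, columns

    def __array__(self, dtype=None, copy=None):
        return np.array(self.data, dtype=dtype)


est = MiniTransformedEstimator().fit(FrameLike([[1, 2], [3, 4]], ["a", "b"]))
print(list(est.feature_names_in_))  # ['a', 'b']
```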
Hey @grovduck, I decided to move ahead with this since, as you mentioned, it should hopefully simplify #21 a little bit.
This would close #20 by wrapping dataframe `X` arrays in the new `NamedFeatureArray` after transforming and before passing them on to `fit`, `predict`, or `kneighbors` methods. By storing their `columns` attribute, this allows `sklearn` to access and set feature names that would otherwise be lost.

A few questions/things for you to consider as you look over this:

- Should we keep `NamedFeatureArray` in `_base`? It's not directly related to `sklearn`, which makes me think it could go elsewhere, but let me know what you think.
- I moved `predict` out of the individual estimators and into `TransformedKNeighborsMixin`. When we discussed this before, we decided to leave `predict` duplicated in our estimators for now in case they need to be implemented differently, but since this would require modifying them all anyways, I went ahead and combined them. Let me know if you think that's premature and I should re-implement them in the subclasses.
- I tested `feature_names_in_` directly rather than trying to take advantage of the `estimator_checks` module. I figure we'll have to get a lot more familiar with that module as we work on "Get all `sklearn` estimator checks passing" #21, so if it makes sense we can switch to one of the built-in tests then.

Just noticed there's definitely a typo in the commit title 🤦‍♂️

EDIT: One more question, what do you think about having a `_transform` method on objects with a `transform_` attribute? Is this too confusing? Maybe the method should be renamed to something like `_apply_transform`?