Transformed regressors drop dataframe feature names #20
Sorry to tack on to an already extensive issue, but I've got another possible option and another related problem to consider.
However, while experimenting with this I discovered another problem: successfully adding feature names to the estimator introduces new warnings when calling `predict`, since the transformed arrays no longer carry those names. Short of overriding sklearn's private feature-name validation, I don't see an obvious way around that. It's possible I'm too deep into this issue and I'm just getting tunnel vision, so please check my logic and let me know if there might be other solutions I haven't considered.
@aazuspan, such a great writeup of the issue and possible solutions. It takes a lot of time to put this much effort into an issue like this, so I really appreciate it. I think I fully understand the issue and offer a few responses:
Yes, I agree with this approach. Just for my own edification: now that both superclasses (`TransformedKNeighborsMixin` and `KNeighborsRegressor`) will define `fit`, how does Python decide which one gets called?
This one took me a while to noodle on. Of the three options, I feel that option 3 (subclassing `np.ndarray` to support a `columns` attribute) is the strongest. Could this pattern also give us opportunities to store the dataframe index (IDs) as another attribute, if they'd otherwise be lost with regular arrays?
👍. I think this is a splendid idea. Given that this has already popped up in two issues, it seems like an important step to get in place earlier rather than later.
(Just uncommented the compatibility test and, oof, that's a lot of failures.) Correct me if I'm wrong, but it doesn't seem like the feature name checks are included in the estimator checks?
My thinking was that we'll call `super().fit()` from the mixin's `fit`, and the MRO will take care of routing that to `KNeighborsRegressor.fit`.
Admittedly, it is a little tough to track the flow of data there, but I think this will end up cleaner than a more functional approach, since we'll need to run step 3 when we call `predict` as well.
I think moving step 3 above into a single method like this is a good idea, which can be called from both `fit` and `predict`.
I'm on the same page with option 3, and after putting together a quick implementation, I think it's a pretty clean solution overall. It was going to be a lot to paste in here, so I threw it into a Gist if you want to take a look and let me know if you have any thoughts! Any preference on the name for the `ndarray` subclass? I've tentatively called it `NamedFeatureArray`.
Good idea! My loose thinking is that we can store IDs as a fit attribute on the model before the data is transformed, but there may be a snag there that would be better handled by storing them on the arrays. I'll keep this in mind.
Great! I have a working fix for this issue using option 3 (pending your feedback on names and implementation), but I suppose I should probably hold off on making a PR... Don't want to dig us into a deeper hole.
You're 100% right. I assumed ...
Once again, you amaze me 🤯. Super elegant solution and a neat way of taking advantage of MRO. (I do always get a little baffled by calling `super()` in multiple-inheritance situations.) I like the private `_transform` method as well.
That name sounds perfectly fine to me - faithful to the concept of features in `sklearn`.
You're too kind ... your solution (now that I understand it) seems like a better approach.
I'll leave this decision up to you. It seems like your fix here resolves a couple of the checks, so I can't imagine that you'd mess anything up by creating a PR for this issue before tackling the checks, but you're the better judge here.
I didn't see anything that includes that check (along with a few others) as part of a wrapping function like `check_estimator`. Thanks for the deep thinking on this one.
Transformed estimators like GNNRegressor run a transformer on X before fitting or predicting. When X is a dataframe, transforming converts it into an array, preventing sklearn from extracting feature names. To fix this, we wrap the transformed array in an ndarray subclass called NamedFeatureArray that is capable of storing a `columns` attribute prior to passing it to `fit` or `predict`. This tricks sklearn into thinking that it is a dataframe and allows feature names to be successfully accessed and set on the estimator. To accomplish this cleanly, we move all the actual transformation steps out of the individual estimators and into the TransformedKNeighborsMixin methods. If we need to implement different `predict` methods for different estimators in the future, they can be re-implemented at the estimator level to use the _transform method of their superclass. To prevent regressions, this commit also expands the dataframe support test to check that feature names are correctly stored.
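In sketch form, the wrapper described in that commit could look something like this (a minimal illustration of the `columns` trick, not the committed implementation):

```python
import numpy as np


class NamedFeatureArray(np.ndarray):
    """Array that carries a dataframe-style `columns` attribute.

    sklearn only looks for a `columns` attribute when extracting feature
    names, so carrying one on an ndarray subclass is enough for
    `feature_names_in_` to be set during fitting.
    """

    def __new__(cls, array, columns=None):
        obj = np.asarray(array).view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        # Preserve `columns` through views, slices, and copies.
        self.columns = getattr(obj, "columns", None)


X = NamedFeatureArray(np.zeros((3, 2)), columns=["a", "b"])
print(X.columns)  # ['a', 'b']
```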
Yeah, trying to figure out `super()` calls across multiple inheritance always bends my brain a little too.
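A stripped-down illustration of how the calls resolve (toy classes standing in for ours):

```python
class Regressor:
    def fit(self, X):
        print("Regressor.fit")
        return self


class TransformedMixin(Regressor):
    def fit(self, X):
        print("TransformedMixin.fit (transform X here)")
        return super().fit(X)  # resolved via the MRO, not a hardcoded parent


class MyEstimator(TransformedMixin):
    pass


MyEstimator().fit(None)
print([cls.__name__ for cls in MyEstimator.__mro__])
# ['MyEstimator', 'TransformedMixin', 'Regressor', 'object']
```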
Thanks for helping me think it through--wanted to make sure I wasn't shooting us in the foot just to get a quick fix merged!
Likewise! I'm mostly used to working on code in a vacuum, so being able to bounce ideas around is a big help.
Resolved by #22
Hey @aazuspan, continuing my bad habit of responding to already closed issues ... I found this video earlier this week. The main point of the video is that all transformers should support column names if either the global `set_config(transform_output="pandas")` option is set or `set_output(transform="pandas")` is called on the individual transformer. After a fairly deep dive, and using `set_output` as the entry point, I think this applies to our transformers as well.
What I'm not entirely clear on is whether this obviates the need for `NamedFeatureArray`.
Great point! I came across some articles mentioning that sklearn transformers support dataframes as a config option while researching this, but wasn't thinking about our transformers as part of the public API at that point. If we want them to be usable outside of our estimators (which I'm pretty sure we do), I think you're 100% right that they need to support those config options.
After poking around, it looks like all of our transformers can support this with relatively little work. EDIT: In order for a transformer that subclasses `BaseEstimator` to support `set_output`, it just needs to implement `get_feature_names_out`; the rest comes along automatically via `_SetOutputMixin`.
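A toy example of that requirement in action (assuming scikit-learn>=1.2; the transformer below is made up for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DemoTransformer(TransformerMixin, BaseEstimator):
    """set_output support arrives for free (via _SetOutputMixin) once
    get_feature_names_out is defined."""

    def fit(self, X, y=None):
        self.n_features_in_ = np.asarray(X).shape[1]
        return self

    def transform(self, X):
        return np.asarray(X) * 2.0

    def get_feature_names_out(self, input_features=None):
        return np.asarray(
            [f"demo{i}" for i in range(self.n_features_in_)], dtype=object
        )


X = np.arange(6.0).reshape(3, 2)
out = DemoTransformer().set_output(transform="pandas").fit_transform(X)
print(type(out).__name__, list(out.columns))  # DataFrame ['demo0', 'demo1']
```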
You may be a step ahead of me, but for this to work we need our transformer to return arrays when given arrays and dataframes when given dataframes, right? Are we doing this by calling `set_output(transform="pandas")` on the transformers inside our estimators?
I noticed that ... EDIT: To use `set_output`, we'll also need to require scikit-learn>=1.2, since that's the first release with that API.
Good question! I think if we can get our transformers to always respect their input data types within the estimators (without requiring users to set any config), we should be able to remove `NamedFeatureArray`.
@aazuspan, great deep dive!
Yes, that was the same conclusion that I came to as well.
Just so I'm clear, this is an explanation of how this works for those estimators that can set the output feature names to be the same as the input feature names, correct? In our case, this would (currently) be just the `MahalanobisTransformer`. If you think you have a clear understanding of what needs to be done, I'd love for you to take a first stab at this. But I'm happy to circle back to this as well.
Exactly! Any thoughts on how to implement `get_feature_names_out` for `CCA` and `CCorA`, where the output features don't map one-to-one to the inputs? My instinct is to name them after the transformer, e.g. `cca0`, `cca1`, and so on.
Happy to! I think I have a relatively clear picture of how this will work, although that may change once I get into the nitty-gritty details. One thing I'm particularly unsure about is how best to test this, so let me know if you have any thoughts there. In any case, I'll hold off until MSN is merged since that will be affected.
That's perfect! Exactly what I was thinking in terms of naming.
I just noticed that the `CCA` and `CCorA` transformers can reduce dimensionality, so the number of output features won't always match the number of inputs. My understanding is that we plan to make the number of output components (is there a more accurate term?) configurable for the `CCA` and `CCorA` transformers, in which case we could derive the names from that parameter. But maybe there's a more direct way we can do this now, just by checking the shape of an attribute on the transformer after it's fit. This is more similar to how `sklearn` handles this internally (sketched below). Since you have a much better idea of the inner workings of these transformers, what do you think about these options, and if you want to go with the second one, can you point me in the direction of attrs that would store the output shapes for `CCA` and `CCorA`?
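For reference, this is roughly the sklearn pattern I mean for the second option (a sketch; `ClassNamePrefixFeaturesOutMixin` and its `_n_features_out` hook are the pieces sklearn's own dimensionality reducers use, assuming scikit-learn>=1.1):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassNamePrefixFeaturesOutMixin, TransformerMixin


class ToyReducer(ClassNamePrefixFeaturesOutMixin, TransformerMixin, BaseEstimator):
    """Keeps only the first two columns; output names come from the class name."""

    def fit(self, X, y=None):
        self._n_components = min(2, np.asarray(X).shape[1])
        return self

    def transform(self, X):
        return np.asarray(X)[:, : self._n_components]

    @property
    def _n_features_out(self):
        # ClassNamePrefixFeaturesOutMixin builds feature names from this.
        return self._n_components


X = np.random.default_rng(0).normal(size=(5, 4))
print(ToyReducer().fit(X).get_feature_names_out())  # ['toyreducer0' 'toyreducer1']
```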
Great question! (and one I've meant to circle back to ...). For `CCA`, the number of output components is the number of retained eigenvalues, so `cca_.eigenvalues.shape[0]` should give you what you need.
For `CCorA`, there's an equivalent attribute that stores the number of retained axes. (I can't promise that I won't go back and fiddle a bit with the API of both of these classes, including adding a more convenient property for this.) Hopefully that gives you enough to go on for now?
@aazuspan, I thought that it might be better to track the dimensionality reduction in a separate issue [#33], given that there are more considerations than just naming the axes. For now, if you just want to go with the properties I named above to get this working, I think that would be better. Thoughts?
That's exactly what I was looking for, thanks! I agree that we should track dimensionality reduction separately, but can move ahead here. A couple more questions. First, I set up a quick test for this that fits each transformer (currently just using the `load_linnerud` data) and checks the shape of the feature names:

```python
import pytest
from sklearn.datasets import load_linnerud


@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer):
    """Test that all transformers get feature names out."""
    X, y = load_linnerud(return_X_y=True)
    feature_names = transformer.fit(X=X, y=y).get_feature_names_out()
    assert feature_names.shape == (X.shape[1],)
```

This fails for `CCATransformer`, which seems to always reduce the dimensionality by one. If that's the case, I'll probably switch to testing each transformer's feature names separately so we can also confirm that the names are correct, too. Does that sound like a good plan?

Second question: how do you feel about the fact that the output feature names for `CCorATransformer` would default to something like `ccoratransformer0`, `ccoratransformer1`, etc.?
Was there something in the code that made you think it would always reduce dimensionality by one? It very well could be that I'm overlooking something! Another question - did this not fail for `CCorATransformer` as well?
Yeah, not ideal, eh? My preference would be that the names were shorter, e.g. `cca0` and `ccora0`, rather than the full class-name prefixes.
I see, thanks for explaining!
No, this was my very naive empirical test of throwing a bunch of randomized numpy arrays of different shapes at it and checking the outputs! Given that no dimensionality reduction occurs with the Moscow data, how do you feel about this as a test for `get_feature_names_out`?

```python
@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer, moscow_euclidean):
    """Test that all transformers get feature names out."""
    X = moscow_euclidean.X
    y = moscow_euclidean.y
    feature_names = transformer.fit(X=X, y=y).get_feature_names_out()
    assert feature_names.shape == (X.shape[1],)
```
It didn't, at least not with the linnerud data.
Actually, just implementing `get_feature_names_out` manually makes the naming trivial:

```python
def get_feature_names_out(self, input_features=None) -> np.ndarray:
    return np.asarray([f"ccora{i}" for i in range(self._n_features_out)], dtype=object)
```
I think this is still interesting, though. Did they always return n-1 dimensions based on n features? The actual code that sets the number of eigenvalues is here, which basically takes the minimum of the rank of the least-squares regression and the number of positive eigenvalues. Although I don't fully understand matrix rank and how it relates to least-squares regression, I think rank differs based on whether you have under-, well-, or over-determined systems, which depends on the shape of the arrays passed. I might play around with this a bit more to try to understand what should be expected, but can't promise that I'll be able to provide a coherent answer!
I'm struggling with this one. If the test is meant to show the expected behavior (i.e. the number of features should always equal the number of axes), I think it could be misleading. Based on what you've already found, along with the optimization of "meaningful" axes in both `CCA` and `CCorA`, dimensionality reduction seems like expected behavior rather than an edge case.
That's because I gave you the wrong attribute! Sorry about that. The number of output features should come from a different attribute than the one I pointed you to earlier.
I like this option better, if you're OK with it. But let me know if you feel differently.
Yep, here's the code I was experimenting with if you want to take a closer look:

```python
import numpy as np
from sknnr.transformers import CCATransformer

for n_features in range(1, 20):
    n_samples = 30
    X = np.random.normal(loc=10, size=(n_samples, n_features))
    y = np.random.normal(loc=10, size=(n_samples, n_features))
    n_features_out = CCATransformer().fit(X, y).cca_.eigenvalues.shape[0]
    print(n_features, n_features_out)
```
That's okay, I'm not sure I'll ever fully grok the stats, but as long as someone does and the tests pass, I'm happy!
Well said! I felt slightly uneasy about the test, and I think you captured what I didn't like about it. I do think we should have some test of output shape for `get_feature_names_out`, so what if we compare against the shape of the transformed data instead?

```python
@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer, moscow_euclidean):
    """Test that all transformers get feature names out."""
    fit_transformer = transformer.fit(X=moscow_euclidean.X, y=moscow_euclidean.y)
    feature_names = fit_transformer.get_feature_names_out()
    X_transformed = fit_transformer.transform(X=moscow_euclidean.X)
    assert feature_names.shape == (X_transformed.shape[1],)
```
No problem! I hate to say it, but I think this is exactly what Github Copilot suggested when I first started writing the test...
I have a slight reservation that none of the test data actually exercises the dimensionality reduction path, though.
OK, I think I got it. If you have more features in `X` than the data can support independently, the rank of the least-squares solution drops and you get fewer output components.
It may be more predictable than I think, but it might require doing some checking of input array shape to determine. I think if you have completely collinear columns in `X`, the rank (and therefore the number of components) will drop as well.
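A quick way to see the rank effect with plain numpy (not the transformer itself):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(30, 4))
X_collinear = np.hstack([X, X[:, :1] * 2.0])  # append a perfectly collinear column

print(np.linalg.matrix_rank(X))            # 4
print(np.linalg.matrix_rank(X_collinear))  # still 4, despite 5 columns
```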
I like this approach much better. Thanks for being flexible!
Replaced by the machines!
That is pretty interesting, given the prevalence of estimators with tests just like this for it to have learned from.
New question for you, @grovduck! When you fit a transformed estimator with a dataframe, what should `feature_names_in_` return: the original dataframe column names, or the names of the transformed features that were actually passed to the underlying regressor? Currently, `test_estimators_support_dataframes` assumes that it should be the names of the features from the dataframe (e.g. the raw column names). The docstring for `KNeighborsRegressor` says that `feature_names_in_` holds the names of features seen during `fit`, which for us would technically be the transformed names (e.g. `cca0`, `cca1`).

EDIT: Another consideration is that once dimensionality reduction is implemented, there will also be a shape mismatch between the two sets of names.

I think I tentatively lean towards returning the names that were actually used to fit the estimator (i.e. the transformed names), but I could be talked out of it. What do you think?
Oof, this is a good question, and getting these names to work is a bit of a pain, eh? First off, I don't think I have a good answer, but I'll ramble for a bit ... I think fundamentally, I'm still viewing these estimators as more or less pipelines even if we're not using that workflow. In that sense, our estimators are composed of a transformer and a regressor. Each of these may have their own `feature_names_in_`. In a bit of a thought experiment, here's a pipeline with a transformer and a regressor fit on a dataframe.
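Something like this sketch (a `StandardScaler` stands in for our transformers, with `set_output` so a dataframe flows through to the regressor step):

```python
from sklearn.datasets import load_linnerud
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_linnerud(return_X_y=True, as_frame=True)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())])
pipe.set_output(transform="pandas")  # the scaler hands a dataframe to the knn step
pipe.fit(X, y)

print(pipe.feature_names_in_)            # the original dataframe columns
print(pipe["scaler"].feature_names_in_)  # same original columns
print(pipe["knn"].feature_names_in_)     # scaled but identically named columns
print(type(pipe.predict(X)))             # <class 'numpy.ndarray'>, not a dataframe
```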
So this seems to follow what I would expect, other than the last step, where we're not returning a dataframe (but instead a numpy array) from the call to `predict`. I feel like I'm totally off on a tangent, so reel me back in.
Yeah, this is turning into a tougher issue than I expected!
You're probably right here, and the pipeline is a good analogy. The challenge is getting our estimators to work as pipelines while also meeting all the assumptions and checks associated with regressors. My first thought to solve this was to define a `feature_names_in_` property that defers to the fitted transformer:

```python
class TransformedKNeighborsMixin(KNeighborsRegressor):
    """
    Mixin for KNeighbors regressors that apply transformations to the feature data.
    """

    @property
    def feature_names_in_(self):
        return self.transform_.feature_names_in_

    @feature_names_in_.setter
    def feature_names_in_(self, value):
        ...

    @feature_names_in_.deleter
    def feature_names_in_(self):
        ...
```

The problem with this is that `_check_feature_names` will complain during fitting, because the names the property reports intentionally mismatch the names seen by the underlying regressor, so we'd also have to override that private method, which feels a bit heavy-handed.

Any alternatives you can think of, or does this approach seem okay to you?
I also assumed (and maybe even claimed before) that sklearn estimators return dataframes when predicting from dataframes, but it looks like that's not the case and `predict` always returns an array.
Yes, I definitely understand what you're saying about it feeling a bit heavy handed, but I fully trust that you've thought through the other possibilities and this might be the only way around this issue. I will defer to your judgment here.
Good point! I think I was incorrectly thinking about transformers when I wrote that (thinking back to the video I saw). In that video, he explicitly says that dataframe output is not yet supported for `predict`. Thanks for all your hard work on this one - it doesn't sound like it's been the most enjoyable dive.
All transformers now support `get_feature_names_out` and `set_output` methods. The first method was manually implemented for CCA and CCorA and was inherited from `OneToOneFeatureMixin` for Mahalanobis. The second method was automatically available once `get_feature_names_out` was implemented, because all transformers subclass `BaseEstimator` and indirectly `_SetOutputMixin`. To get `get_feature_names_out` working, this also implements `_n_features_out` properties for CCA and CCorA. Tests for these new features are included to ensure that the length of feature names matches the output features of each transformer, and that set_output is callable. Tests are passing, but warnings are raised when estimators are fit with dataframes. This will be fixed once we use `set_output` to set the transformer mode in our estimators and remove the `NamedFeatureArray` hack.
`NamedFeatureArray`, which was used to trick sklearn into storing feature names from array inputs, is now removed. Instead, we use the `set_output` method on transformers to ensure that they pass dataframes through to allow estimators to store feature names. `feature_names_in_` was overridden for all transformed estimators to return feature names from the transformer rather than the estimator. This means that the property will return the names of features that the user passed in to fit the estimator, rather than the transformed features that were used internally (e.g. cca0, cca1, etc).

Overriding `feature_names_in_` caused `_check_feature_names` to fail when called during fitting because the `feature_names_in_` intentionally mismatch the names seen during fit. To overcome that, we override that method to remove the affected check. We still need to handle warnings if feature names are incorrectly missing, so we currently borrow part of the implementation for that method from sklearn (BSD-3 license).

This commit modifies some calls to `_validate_data` from the previous commit to avoid overriding X. This is done because `_validate_data` casts to array, which can cause issues when a transformer calls a subsequent transformer (e.g. MahalanobisTransformer calls StandardScalerWithDOF) with a dataframe input, as feature names will not match between the transformers, leading to user warnings when predicting. Instead, X is explicitly cast when needed using `np.asarray` and validation is always done without overriding X. Note that `_validate_data` must be called by all transformers that do not call it via a superclass because this indirectly stores feature names.

While implementing this, the difficulty of tracking what output types are expected from what transformers and estimators with what inputs and configuration options became VERY clear, so we now have some basic "consistency" tests that compare all of our transformers and estimators with a comparable sklearn implementation to check output types and attrs under a range of situations. A very small change is that parametrized estimator checks are now passed classes instead of instances because this leads to more helpful pytest errors.
- Pin scikit-learn>=1.2. This is the first release with the `set_output` API, and added a number of features and classes that we rely on and test against (e.g. OneToOneFeatureMixin).
- Remove unused `ClassNamePrefixFeatureOutMixin`.
- Remove unused `input_features` arg to `get_feature_names_out`. This is only used for validating or replacing feature names in the sklearn implementation, so isn't relevant to our transformers that potentially apply dimensionality reduction.
- Fix output types for test functions.
Resolved (hopefully for good!) by #34
Hey @grovduck, I'm starting to sound like a broken record, but there's another issue blocking dataframe indexes in #2. This is a bit of a long one, so I tried to lay it out below.
The problem
All of our `TransformedKNeighborsMixin` estimators are incompatible with dataframes in that they don't store feature names. I didn't think to check this before, but updating the dataframe test to check for feature names fails for everything but `RawKNNRegressor`. This happens because they all run `X` through transformers that convert the dataframes to arrays before they get to `KNeighborsRegressor.fit`, where the features would be retrieved and stored. The same thing would happen with `sklearn` transformers, so I think we should probably solve this in `TransformedKNeighborsMixin` rather than in the transformers.

EDIT: As detailed in the next post, once we solve the issue of losing feature names when fitting, we need to also retain feature names when predicting to avoid warnings.
Possible solutions
First of all, I think we should move the actual transformation out of the `fit` method for each estimator and into a `fit` method for `TransformedKNeighborsMixin`. That should probably be done regardless of this issue just to reduce some duplication, and also allows us to make sure everything gets fit the same way. Then, I think we need to modify that `fit` method to make sure it sets appropriate feature names after transformation.

To get feature names, all that sklearn does is look for a `columns` attribute on `X`. If we could copy that `columns` attribute onto the transformed array before passing it to `KNeighborsRegressor.fit` we'd be set, but there's no way to directly set attributes on Numpy arrays because they are implemented in C.

I think that leaves us with a few options:

1. Use `sklearn.utils.validation._get_feature_names` to get and validate the feature names before transforming, then manually set them as `feature_names_in_` after fitting. I don't love this because it requires us to use a private method that could disappear, get renamed, change return types, etc. The upside is that we would know our feature names are retrieved consistently with `sklearn`.
2. Copy `sklearn.utils.validation._get_feature_names` into our code base. That bypasses the private method issue, but adds some maintenance cost, and we would need to carefully consider how to do that consistently with the `sklearn` license. As with option 1, we would still need to handle setting the `feature_names_in_` attribute.
3. Subclass `ndarray` to support a `columns` attribute and pass that in to fit. As long as `sklearn` doesn't change how they identify features (which seems unlikely), we could let `sklearn` handle getting and setting feature names, and I think it would be transparent to users. I did confirm that the `_fit_X` attribute seems to store a numpy array regardless of what goes into it. Like option 2, this adds some maintenance cost.
4. Call the `_check_feature_names` method with the non-transformed `X` after fitting (sketched after this list). This will set feature names on the model and fix the issue of losing feature names when fitting. The downside is that we're again using a private method.

I don't love any of these options, so let me know what you think or if any other solutions occur to you.
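To make option 4 concrete, a rough sketch (it leans on sklearn's private `_check_feature_names` method, present in the sklearn versions discussed here, which is exactly the fragility noted above; the `StandardScaler` is a stand-in for our transformers):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler


class Option4Demo(KNeighborsRegressor):
    """Illustration only: fit on the transformed array, then re-run the
    feature-name check against the untransformed X."""

    def fit(self, X, y):
        self.transform_ = StandardScaler().fit(X)
        super().fit(self.transform_.transform(X), y)  # names lost (array input)
        self._check_feature_names(X, reset=True)      # names restored from X
        return self


X = pd.DataFrame(np.random.default_rng(0).normal(size=(20, 3)), columns=["a", "b", "c"])
est = Option4Demo(n_neighbors=3).fit(X, np.arange(20.0))
print(est.feature_names_in_)  # ['a' 'b' 'c']
```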
Estimator checks
I noticed that the `sklearn.estimator_checks` would have caught this, so I wonder if we should prioritize getting those checks to pass before we add any more functionality? I think that may be a big lift, but would at least prevent us from accidentally breaking estimators in the future. Also, it may be easier to do now than after they get more complex and would keep us from accidentally writing a lot of redundant tests.

EDIT: This would also catch warnings for predicting without feature names that I mention in the next post.
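For anyone who hasn't used them, the standard wiring looks roughly like this (the import path for `RawKNNRegressor` is a guess):

```python
from sklearn.utils.estimator_checks import parametrize_with_checks

from sknnr import RawKNNRegressor  # hypothetical import path


@parametrize_with_checks([RawKNNRegressor()])
def test_sklearn_compatibility(estimator, check):
    check(estimator)
```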