
Transformed regressors drop dataframe feature names #20

Closed · aazuspan opened this issue May 12, 2023 · 26 comments · Fixed by #22 or #34

Labels: bug (Something isn't working) · estimator (Related to one or more estimators)
@aazuspan (Contributor) commented May 12, 2023

Hey @grovduck, I'm starting to sound like a broken record, but there's another issue blocking dataframe indexes in #2. This is a bit of a long one, so I tried to lay it out below.

The problem

All of our TransformedKNeighborsMixin estimators are incompatible with dataframes in the sense that they don't store feature names. I didn't think to check this before, but updating the dataframe test to check for feature names makes it fail for everything but RawKNNRegressor.

import pandas as pd
import pytest
from numpy.testing import assert_array_equal

# get_kneighbor_estimator_instances and moscow_euclidean are project test
# helpers/fixtures.
@pytest.mark.parametrize("estimator", get_kneighbor_estimator_instances())
def test_estimators_support_dataframes(estimator, moscow_euclidean):
    """All estimators should fit and predict data stored as dataframes."""
    num_features = moscow_euclidean.X.shape[1]
    feature_names = [f"col_{i}" for i in range(num_features)]

    X_df = pd.DataFrame(moscow_euclidean.X, columns=feature_names)
    y_df = pd.DataFrame(moscow_euclidean.y)

    estimator.fit(X_df, y_df)
    estimator.predict(X_df)
    assert_array_equal(estimator.feature_names_in_, feature_names)

This happens because they all run X through transformers that convert the dataframes to arrays before they reach KNeighborsRegressor.fit, where the feature names would be retrieved and stored. The same thing would happen with sklearn transformers, so I think we should probably solve this in TransformedKNeighborsMixin rather than in the transformers.

EDIT: As detailed in the next post, once we solve the issue of losing feature names when fitting, we need to also retain feature names when predicting to avoid warnings.

Possible solutions

First of all, I think we should move the actual transformation out of the fit method for each estimator and into a fit method for TransformedKNeighborsMixin. That should probably be done regardless of this issue just to reduce some duplication, and also allows us to make sure everything gets fit the same way. Then, I think we need to modify that fit method to make sure it sets appropriate feature names after transformation.

To get feature names, all that sklearn does is look for a columns attribute on X. If we could copy that columns attribute onto the transformed array before passing it to KNeighborsRegressor.fit, we'd be set, but there's no way to directly set arbitrary attributes on NumPy arrays because they are implemented in C.

I think that leaves us with a few options:

  1. Use sklearn.utils.validation._get_feature_names to get and validate the feature names before transforming, then manually set them as feature_names_in_ after fitting. I don't love this because it requires us to use a private method that could disappear, get renamed, change return types, etc. The upside is that we would know our feature names are retrieved consistently with sklearn.
  2. Same as above but copy sklearn.utils.validation._get_feature_names into our code base. That bypasses the private method issue, but adds some maintenance cost and we would need to carefully consider how to do that consistently with the sklearn license. As with option 1, we would still need to handle setting the feature_names_in_ attribute.
  3. Subclass ndarray to support a columns attribute and pass that in to fit (a minimal sketch of this pattern follows below). As long as sklearn doesn't change how they identify features (which seems unlikely), we could let sklearn handle getting and setting feature names, and I think it would be transparent to users. I did confirm that the _fit_X attribute seems to store a numpy array regardless of what goes into it. Like option 2, this adds some maintenance cost.
  4. Use the _check_feature_names method with the non-transformed X after fitting. This will set feature names on the model and fix the issue of losing feature names when fitting. The downside is that we're again using a private method.

I don't love any of these options, so let me know what you think or if any other solutions occur to you.
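For reference, option 3 might look something like the standard ndarray subclassing pattern from the NumPy docs (a minimal sketch, not a final implementation — the class name is just a placeholder):

import numpy as np

class NamedFeatureArray(np.ndarray):
    """Array that carries a `columns` attribute, so sklearn's feature-name
    detection treats it like a dataframe."""

    def __new__(cls, array, columns=None):
        obj = np.asarray(array).view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        # Called on view casting and slicing; propagate the attribute.
        if obj is not None:
            self.columns = getattr(obj, "columns", None)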

Estimator checks

I noticed that the sklearn.utils.estimator_checks suite would have caught this, so I wonder if we should prioritize getting those checks to pass before we add any more functionality? I think that may be a big lift, but it would at least prevent us from accidentally breaking estimators in the future. It may also be easier to do now than after they get more complex, and it would keep us from accidentally writing a lot of redundant tests.

EDIT: This would also catch warnings for predicting without feature names that I mention in the next post.

@aazuspan added the bug and estimator labels May 12, 2023
@aazuspan self-assigned this May 12, 2023
@aazuspan (Contributor Author) commented:

Sorry to tack on to an already extensive issue, but I've got another possible option and another related problem to consider.

BaseEstimator objects, including all our estimators, have a _check_feature_names method that can set feature names from a dataframe. I edited the original post to include this as Option 4.

However, while experimenting with this I discovered...

Another problem

Successfully adding feature names to the estimator introduces new warnings when calling predict. This is because, like fit, the transformation converts dataframe inputs to arrays. When predict is run without feature names on a model that was fitted with feature names, sklearn complains.

Short of overriding predict to skip the check for feature names (definitely not a good approach), I'm not sure there's any way to solve this secondary issue other than modifying X between transforming it and passing it in to super().predict(). So with that in mind, maybe Option 3 (subclassing ndarray to add a columns attribute and make it look like a dataframe) is the only remaining solution that tackles both problems?

It's possible I'm too deep into this issue and I'm just getting tunnel vision, so please check my logic and let me know if there might be other solutions I haven't considered.

@grovduck (Member) commented:

@aazuspan, such a great writeup of the issue and possible solutions. It takes a lot of time to put this much effort into an issue like this, so I really appreciate it. I think I fully understand the issue and offer a few responses:

I think we should move the actual transformation out of the fit method for each estimator and into a fit method for TransformedKNeighborsMixin

Yes, I agree with this approach. Just for my own edification: both superclasses (TransformedKNeighborsMixin and IDNeighborsRegressor) will now have fit methods, and you'll want TransformedKNeighborsMixin.fit to run first. Will you call it explicitly (i.e. TransformedKNeighborsMixin.fit(self, X)) and then call super().fit(X)? But fit should return the estimator itself, so I'm a bit confused about how this would work. Could you instead just create a TransformedKNeighborsMixin.transform method?

I think that leaves us with a few options ...

This one took me a while to noodle on. Of the three options, I feel that option 3 (subclassing np.ndarray using the RealisticInfoArray pattern on the page you linked) might be the best choice. I like that it doesn't require any tweaks to the sklearn code at all and it should be transparent to the user. (While I was writing, I just saw your additional post which narrows in on this remedy!). I do like this approach (option 3) better than option 4.

Could this pattern also give us opportunities to store the dataframe index (IDs) as another attribute if they'd be lost otherwise with regular arrays?

I wonder if we should prioritize getting those checks to pass before we add any more functionality

👍. I think this is a splendid idea. Given that it's already popped up two issues, it seems like an important step to get into place earlier rather than later.

@grovduck (Member) commented May 12, 2023

(Just uncommented the compatibility test and, oof, that GNNRegressor is going to take some work!)

Correct me if I'm wrong, but it doesn't seem like the check_dataframe_column_names_consistency is actually part of the suite of checks that gets run on these estimators? How did you know that this would have tripped that check?

@aazuspan (Contributor Author) commented:

Just for my own edification, because now both superclasses (TransformedKNeighborsMixin and IDNeighborsRegressor) will have fit methods and you'll want to call TransformedKNeighborsMixin.fit first, will you call that explicitly (i.e. TransformedKNeighborsMixin.fit(self, X)) and then call super().fit(X)?

My thinking was that we'll call super().fit in each case and rely on MRO to fit them in the order below (using EuclideanKNNRegressor as an example):

  1. EuclideanKNNRegressor.fit - Sets the transform_ attribute.
  2. IDNeighborsRegressor.fit - Potentially sets the index_ attribute.
  3. TransformedKNeighborsMixin.fit - Checks for transform_ attribute, applies the transformation, and in the case of dataframes, stores the transformed X in our array subclass.
  4. KNeighborsRegressor.fit - Actually does the fitting.

Admittedly, it is a little tough to track the flow of data there, but I think this will end up cleaner than a more functional approach, since we'll need to run step 3 when we call fit, predict, or kneighbors on all of the transformed estimators.
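In code, that cooperative pattern might look roughly like this (a sketch only: the transformer and arguments shown for step 1 are illustrative, IDNeighborsRegressor.fit from step 2 is omitted, and _transform is the helper discussed below):

class TransformedKNeighborsMixin(KNeighborsRegressor):
    def fit(self, X, y):
        # Step 3: apply the stored transformer, then continue down the MRO.
        X_transformed = self._transform(X)
        return super().fit(X_transformed, y)

class EuclideanKNNRegressor(IDNeighborsRegressor, TransformedKNeighborsMixin):
    def fit(self, X, y):
        # Step 1: fit and store the transformer, then hand off via MRO.
        self.transform_ = StandardScalerWithDOF(ddof=1).fit(X)
        return super().fit(X, y)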

Could you instead just create the TransformedKNeighborsMixin.transform method?

I think moving step 3 above into a single method like this is a good idea, which can be called from fit, predict, or kneighbors.

I do like this approach (option 3) better than option 4.

I'm on the same page with option 3, and after putting together a quick implementation, I think it's a pretty clean solution overall. It was going to be a lot to paste in here, so I threw it into a Gist if you want to take a look and let me know if you have any thoughts!

Any preference on the name for the ndarray subclass? As you'll see, my first thought was NamedFeatureArray, but I'm open to ideas!

Could this pattern also give us opportunities to store the dataframe index (IDs) as another attribute if they'd be lost otherwise with regular arrays?

Good idea! My loose thinking is that we can store IDs as a fit attribute on the model before the data is transformed, but there may be a snag there that would be better handled by storing it on the arrays. I'll keep this in mind.

Given that it's already popped up two issues, it seems like an important step to get into place earlier rather than later.

Great! I have a working fix for this issue using option 3 (pending your feedback on names and implementation), but I suppose I should probably hold off on making a PR... Don't want to dig us into a deeper hole.

Correct me if I'm wrong, but it doesn't seem like the check_dataframe_column_names_consistency is actually part of the suite of checks that gets run on these estimators?

You're 100% right. I assumed check_estimator ran everything in that module, and I'm not totally sure why it doesn't... I suppose we'll have to run some manually then, unless there's some configuration I'm missing?

@grovduck (Member) commented:

After putting together a quick implementation, I think it's a pretty clean solution overall

Once again, you amaze me 🤯. Super elegant solution and neat way of taking advantage of MRO. (I do always get a little baffled by calling super() in a superclass, in this case thinking that it's going to call the superclass of TransformedKNeighborsMixin, but it's really calling the next-in-line superclass of EuclideanKNNRegressor, right? In this way, it's just picking up information as it goes down the MRO chain. Neat!)

I like the private _transform method in that class as well - that looks like a good use of it.

As you'll see, my first thought was NamedFeatureArray

That name sounds perfectly fine to me - faithful to the concept of features in sklearn.

Good idea! My loose thinking is that we can store IDs as a fit attribute on the model before the data is transformed

You're too kind ... your solution (now that I understand it) seems like a better approach.

Great! I have a working fix for this issue using option 3 (pending your feedback on names and implementation), but I suppose I should probably hold off on making a PR

I'll leave this decision up to you. It seems like your fix here resolves a couple of the checks, so I can't imagine that you mess anything up by creating a PR for this issue before tackling the checks, but you're the better judge here.

I suppose we'll have to run some manually then, unless there's some configuration I'm missing

I didn't see anything that includes that check (along with a few others) as part of a wrapping function like yield_all_checks. I'm not sure why they are left out of the set of checks.

Thanks for the deep thinking on this one.

aazuspan added a commit that referenced this issue May 15, 2023
Transformed estimators like GNNRegressor run a transformer on X
before fitting or predicting. When X is a dataframe, transforming
converts it into an array, preventing sklearn from extracting
feature names. To fix this, we wrap the transformed array in an
ndarray subclass called NamedFeatureArray that is capable of
storing a `columns` attribute prior to passing it to `fit` or
`predict`. This tricks sklearn into thinking that it is a
dataframe and allows feature names to be successfully accessed
and set on the estimator.

To accomplish this cleanly, we move all the actual transformation
steps out of the individual estimators and into the
TransformedKNeighborsMixin methods. If we need to implement
different `predict` methods for different estimators in the
future, they can be re-implemented at the estimator level to use
the _transform method of their superclass.

To prevent regressions, this commit also expands the dataframe
support test to check feature names are correctly stored.
@aazuspan linked a pull request May 15, 2023 that will close this issue
@aazuspan (Contributor Author) commented:

I do always get a little baffled by calling super() in a superclass, in this case thinking that it's going to call the superclass of TransformedKNeighborsMixin, but it's really calling the next-in-line superclass of EuclideanKNNRegressor, right? In this way, it's just picking up information as it goes down the MRO chain.

Yeah, trying to figure out super with multiple inheritance can be a little mind-bending, but I think you've got it. The trick is that in each of those calls to fit, self is still an instance of EuclideanKNNRegressor, which is how super is able to access the next method in the MRO.

It seems like your fix here resolves a couple of the checks, so I can't imagine that you mess anything up by creating a PR for this issue before tackling the checks

Thanks for helping me think it through--wanted to make sure I wasn't shooting us in the foot just to get a quick fix merged!

Thanks for the deep thinking on this one.

Likewise! I'm mostly used to working on code in a vacuum, so being able to bounce ideas around is a big help.

@aazuspan (Contributor Author) commented:

Resolved by #22

@grovduck (Member) commented:

Hey @aazuspan, continuing my bad habit of responding to already closed issues ...

I found this video earlier this week. The main point of the video is that all transformers should support column names when either the global set_config(transform_output="pandas") is set or the transformer-specific transformer.set_output(transform="pandas") is called and the passed X is a dataframe. What was interesting to me was that some transformers (like PolynomialFeatures) create new output names because they produce more columns as a result of fit and transform.

After a fairly deep dive and using PolynomialFeatures as inspiration, I think I understand how this works a bit better. I think, at the very least, we need to support column names from our transformers (including the ones that transform the X array into reduced-dimension arrays, like CCATransformer). I think three main things need to happen (a sketch follows the list):

  1. Add _SetOutputMixin as a base class on our transformers so that set_output is available to us.
  2. Call self._check_feature_names or self._validate_data in fit with reset=True. This will set self.feature_names_in_ correctly if X is a dataframe.
  3. Create a get_feature_names_out method on each transformer that correctly sets the output transform names. Without it, transform complains as I think you found out.
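Roughly, I'm picturing something like this per transformer (a sketch with assumed names, not working code):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SomeOrdinationTransformer(TransformerMixin, BaseEstimator):
    # Step 1: ensure _SetOutputMixin is in the MRO (TransformerMixin
    # already pulls it in for scikit-learn >= 1.2).

    def fit(self, X, y=None):
        # Step 2: sets n_features_in_ and, for dataframe input,
        # feature_names_in_.
        X = self._validate_data(X, reset=True)
        ...
        return self

    def get_feature_names_out(self, input_features=None):
        # Step 3: name the transformed output axes (_n_features_out is
        # assumed to be defined by the transformer).
        return np.asarray(
            [f"axis{i}" for i in range(self._n_features_out)], dtype=object
        )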

What I'm not entirely clear on is whether this obviates the need of NamedFeatureArray. I think it's entirely possible that you took even a deeper dive than I did 😄, so it's possible that I'm still missing something. I'm going to keep this issue closed for now, but let me know what you think.

@aazuspan (Contributor Author) commented May 25, 2023

Great point! I came across some articles mentioning that sklearn transformers support dataframes as a config option while researching this, but wasn't thinking about our transformers as part of the public API at that point. If we want them to be usable outside of our estimators (which I'm pretty sure we do), I think you're 100% right that they need to support those config options.

Add _SetOutputMixin as a base class on our transformers so that set_output is available to us.

After poking around, it looks like all BaseTransformer subclasses, including ours, already subclass _SetOutputMixin. In fact, StandardScalerWithDOF and CCATransformer already have set_output methods. MahalanobisTransformer doesn't, but I haven't totally worked out why yet. I think it comes down to this _auto_wrap_is_configured check, but I haven't wrapped my head around that yet.

EDIT: In order for a transformer that subclasses _SetOutputMixin to access the set_output method, it needs to pass the _auto_wrap_is_configured check, which requires that the transformer has a get_feature_names_out attr. So I think solving that first will automatically give us access to set_output.

Call self._check_feature_names or self._validate_data in fit with reset=True. This will set self.feature_names_in_ correctly if X is a dataframe.

You may be a step ahead of me, but for this to work we need our transformer to return arrays when given arrays and dataframes when given dataframes, right? Are we doing this by calling set_output on the transformer based on the input X type in TransformedKNeighborsMixin._apply_transform? That was my first thought, but I haven't worked through all the potential difficulties there.
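For the sake of discussion, the idea would be something like this (hypothetical; assumes the fitted transformer is stored on transform_):

def _apply_transform(self, X):
    # Match the transformer's output type to the input type: dataframes
    # stay dataframes, arrays stay arrays.
    output = "pandas" if hasattr(X, "columns") else "default"
    self.transform_.set_output(transform=output)
    return self.transform_.transform(X)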

Create a get_feature_names_out method on each transformer that correctly sets the output transform names. Without it, transform complains as I think you found out.

I noticed that StandardScalerWithDOF already gets this method from OneToOneFeatureMixin, but I'm assuming we can't use that for our transformers that apply dimensionality reduction?

EDIT: To use OneToOneFeatureMixin.get_feature_names_out, an estimator must set n_features_in_ when it's fit. Aside from StandardScalerWithDOF, none of our transformers do this because they never call fit from a superclass. We can fix this by running self._validate_data at the start of each custom fit method which, among other things, sets that attr. It can also be used to validate data types and other aspects of the data, so maybe we can just copy the StandardScaler implementation? In any case, I'm pretty sure we need to set this attr regardless of how we handle get_feature_names_out, and this will probably show up in the estimator checks in #21 as well.

What I'm not entirely clear on is whether this obviates the need of NamedFeatureArray

Good question! I think if we can get our transformers to always respect their input data types within the estimators (without requiring users to set any config), we should be able to remove NamedFeatureArray because our X_transformed will now be a dataframe if a dataframe goes in.

@aazuspan reopened this May 25, 2023
@grovduck (Member) commented:

@aazuspan, great deep dive!

In order for a transformer that subclasses _SetOutputMixin to access the set_output method, it needs to pass the _auto_wrap_is_configured check, which requires that the transformer has a get_feature_names_out attr. So I think solving that first will automatically give us access to set_output.

Yes, that was the same conclusion that I came to as well.

To use OneToOneFeatureMixin.get_feature_names_out, an estimator must set n_features_in_ when it's fit. Aside from StandardScalerWithDOF, none of our transformers do this because they never call fit from a superclass. We can fix this by running self._validate_data at the start of each custom fit method which, among other things, sets that attr. It can also be used to validate data types and other aspects of the data, so maybe we can just copy the StandardScaler implementation? In any case, I'm pretty sure we need to set this attr regardless of how we handle get_feature_names_out, and this will probably show up in the estimator checks in #21 as well.

Just so I'm clear, this is an explanation of how this works for those estimators that can set the output feature names to be the same as the input feature names, correct? In our case, this would (currently) be StandardScalerWithDOF and MahalanobisTransformer, right? You're not suggesting that we use the OneToOneFeatureMixin on CCATransformer or CCorATransformer, correct? Only that we need to ensure that we call self._validate_data in their fit methods, right? That was where I had landed as well.

If you think you have a clear understanding of what needs to be done, I'd love for you to take a first stab at this. But I'm happy to circle back to this as well.

@aazuspan (Contributor Author) commented:

In our case, this would (currently) be StandardScalerWithDOF and MahalanobisTransformer, right? You're not suggesting that we use the OneToOneFeatureMixin on CCATransformer or CCorATransformer, correct? Only that we need to ensure that we call self._validate_data in their fit methods, right? That was where I had landed as well.

Exactly! Any thoughts on how to implement get_feature_names_out for CCA and CCorA? What do you think about using ClassNamePrefixFeaturesOutMixin like PCA does, where the feature names would be sequentially prefixed based on the name of the transformer?

If you think you have a clear understanding of what needs to be done, I'd love for you to take a first stab at this.

Happy to! I think I have a relatively clear picture of how this will work, although that may change once I get into the nitty gritty details. One thing I'm particularly unsure on is how best to test this, so let me know if you have any thoughts there. In any case, I'll hold off until MSN is merged since that will be affected.

@grovduck (Member) commented:

What do you think about using ClassNamePrefixFeaturesOutMixin like PCA does, where the feature names would be sequentially prefixed based on the name of the transformer?

That's perfect! Exactly what I was thinking in terms of naming.

@aazuspan (Contributor Author) commented:

I just noticed that the ClassNamePrefixFeaturesOutMixin.get_feature_names_out method requires that an _n_features_out attr is set on the transformer. For PCA, this is handled with this property.

My understanding is that we plan to make the number of output components (is there a more accurate term?) configurable for the CCA and CCorA ordinations, right? In that case, I imagine we would set an attribute when the transformer is instantiated, similar to PCA, and we could reference that for _n_features_out.

But maybe there's a more direct way we can do this now, just by checking the shape of an attribute on the transformer after it's fit. This is more similar to how PCA checks the shape of its components_ attr.

Since you have a much better idea of the inner workings of these transformers, what do you think about these options, and if you want to go with the second one, can you point me in the direction of attrs that would store the output shapes for CCATransformer and CCorATransformer?

@grovduck (Member) commented May 31, 2023

Since you have a much better idea of the inner workings of these transformers, what do you think about these options, and if you want to go with the second one, can you point me in the direction of attrs that would store the output shapes for CCATransformer and CCorATransformer?

Great question! (and one I've meant to circle back to ...). For CCATransformer, after the transformer is fit, we can use self.cca_.eigenvalues.shape[0] (it's a 1-d numpy array, so we could use len if preferred). And your intuition is right on, we'll want to have an option to set the number of output axes to use, so I think the logic goes like this:

  1. Create an __init__ in CCATransformer that sets an attribute (something like n_axes) which would default to None
  2. At the end of fit, either modify self.n_axes or create a new estimator attribute that stores self.cca_.eigenvalues.shape[0] if self.n_axes is None, otherwise stores min(self.n_axes, self.cca_.eigenvalues.shape[0]). This step may not be entirely necessary and we could just code this into the property below.
  3. Create a property _n_features_out that references the attribute from above (or implements that logic; see the sketch after this list)
  4. Modify transform to only use up to _n_features_out axes.
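Putting that together, the sketch I have in mind looks something like this (names follow the list above; the real implementation may differ):

from sklearn.base import BaseEstimator, TransformerMixin

class CCATransformer(TransformerMixin, BaseEstimator):
    def __init__(self, n_axes=None):
        self.n_axes = n_axes  # step 1

    @property
    def _n_features_out(self):
        # Steps 2-3: cap the retained axes at what the fit produced.
        n_fit_axes = self.cca_.eigenvalues.shape[0]
        return n_fit_axes if self.n_axes is None else min(self.n_axes, n_fit_axes)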

For CCorATransformer, I think it's probably safest to use self.ccora_.projector.shape[0]. There is a property in there called n_vec which would return the same number (and is used in projector), but I don't know how much of the internals of CCorA (and, for that matter, CCA) we want to expose. Because we already use projector in transform, that seems safest to me.

(I can't promise that I won't go back and fiddle a bit with the API of both of these classes, including adding a projector property to CCA, but to keep you going, I won't do anything now.)

Hopefully that gives you enough to go on now?

@grovduck (Member) commented:

@aazuspan, I thought it might be better to track the dimensionality reduction in a separate issue (#33), given that there are more considerations than just naming the axes. For now, if you just want to go with the properties I named above to get this working, I think that would be better. Thoughts?

@aazuspan (Contributor Author) commented:

That's exactly what I was looking for, thanks! I agree that we should track dimensionality reduction separately, but can move ahead here.

A couple more questions. First, I set up a quick test for this that fits each transformer (currently just using the linnerud dataset) and compares the shape of get_feature_names_out to the number of features in the X data, assuming that they should be the same.

import pytest
from sklearn.datasets import load_linnerud

@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer):
    """Test that all transformers get feature names out."""
    X, y = load_linnerud(return_X_y=True)

    feature_names = transformer.fit(X=X, y=y).get_feature_names_out()
    assert feature_names.shape == (X.shape[1],)

This fails for CCATransformer, and after playing around with it, it looks like CCA will always reduce dimensionality by one feature. Am I understanding that right?

If that's the case, I'll probably switch to testing each transformer's feature names separately so we can also confirm that the names are correct, too. Does that sound like a good plan?

Second question: how do you feel about the fact that the output feature names for CCATransformer and CCorATransformer will be ccatransformer0, and ccoratransformer0, respectively (because they subclass ClassNamePrefixFeaturesOutMixin)? Those are obviously a little wordy and it feels kind of odd to include transformer in the feature names. We could think about shortening the names of the transformers, making them more akin to PCA, although we would obviously have to distinguish our CCA from the builtin CCA. We don't necessarily need to decide on this right now, unless we wanted to modify the implementation of get_feature_names_out to shorten feature names without renaming classes.

@grovduck (Member) commented May 31, 2023

This fails for CCATransformer, and after playing around with it, it looks like CCA will always reduce dimensionality by one feature. Am I understanding that right?

CCA shouldn't always reduce dimensionality by one - it should be data-dependent, e.g.:

import pandas as pd
from sklearn.datasets import load_linnerud
from sknnr.transformers import CCATransformer

X, y = load_linnerud(return_X_y=True)
txfr = CCATransformer().fit(X, y)
print(X.shape[1], txfr.cca_.eigenvalues.shape[0])
# prints (3, 2)

X = pd.read_csv("../tests/data/moscow_env.csv").iloc[:, 1:]
y = pd.read_csv("../tests/data/moscow_spp.csv").iloc[:, 1:]
txfr = CCATransformer().fit(X, y)
print(X.shape[1], txfr.cca_.eigenvalues.shape[0])
# prints: 28 28

Was there something in the code that made you think it would always reduce dimensionality by one? It very well could be that I'm overlooking something!

Another question - did this not fail for CCorATransformer? This is where the statistical fit test usually reduces the number of axes quite a bit.

Second question: how do you feel about the fact that the output feature names for CCATransformer and CCorATransformer will be ccatransformer0, and ccoratransformer0, respectively (because they subclass ClassNamePrefixFeaturesOutMixin)?

Yeah, not ideal, eh? My preference would be that the names were cca0 and ccora0, but I understand the issues that renaming those classes would cause (and the fact that our CCorA is really supposed to produce the same results as sklearn's CCA). I don't think I'm in favor of renaming the transformers. Does it feel too hacky to post-process the output of ClassNamePrefixFeaturesOutMixin.get_feature_names_out to splice out "transformer"? Having said that, I can live with whatever feels right to you (even a rename of the transformer class names).

@aazuspan (Contributor Author) commented Jun 1, 2023

I see, thanks for explaining!

Was there something in the code that made you think it would always reduce dimensionality by one? It very well could be that I'm overlooking something!

No, this was my very naive empirical test of throwing a bunch of randomized numpy arrays of different shapes at it and checking the outputs!

Given that no dimensionality reduction occurs with the Moscow data, how do you feel about this as a test for get_feature_names_out?

@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer, moscow_euclidean):
    """Test that all transformers get feature names out."""
    X = moscow_euclidean.X
    y = moscow_euclidean.y

    feature_names = transformer.fit(X=X, y=y).get_feature_names_out()
    assert feature_names.shape == (X.shape[1],)

Another question - did this not fail for CCorATransformer? This is where the statistical fit test usually reduces the number of axes quite a bit.

It didn't, at least not with linnerud or the np.random.normal arrays I was using.

Does it feel too hacky to post-fix the output of ClassNamePrefixFeaturesOutMixin.get_feature_names_out to splice out transformer?

Actually, just implementing get_feature_names_out from scratch for those two transformers would probably be simpler than calling the superclass and modifying the output. I would probably lean towards that, with the only downside being a little bit of duplication between the two transformers. Something like:

import numpy as np

def get_feature_names_out(self, input_features=None) -> np.ndarray:
    return np.asarray([f"ccora{i}" for i in range(self._n_features_out)], dtype=object)

@grovduck (Member) commented Jun 1, 2023

No, this was my very naive empirical test of throwing a bunch of randomized numpy arrays of different shapes at it and checking the outputs!

I think this is still interesting, though. Did they always return n-1 dimensions for n features? The actual code that sets the number of eigenvalues is here, which basically takes the minimum of the rank of the least-squares regression and the number of positive eigenvalues. Although I don't fully understand matrix rank and how it relates to least-squares regression, I think rank differs based on whether you have under-, well-, or over-determined systems, which depends on the shapes of the arrays passed. I might play around with this a bit more to try to understand what should be expected, but can't promise that I'll be able to provide a coherent answer!

Given that no dimensionality reduction occurs with the Moscow data, how do you feel about this as a test for get_feature_names_out?

I'm struggling with this one. If the test is meant to show the expected behavior (i.e. the number of features should always equal the number of axes), I think it could be misleading. Based on what you've already found along with the optimization of "meaningful" axes in both CCA and CCorA, I wouldn't necessarily expect this to be true. But if the test is meant to capture what is happening with this specific data, then I'm OK with it. I just think the non-deterministic nature of these two ordination methods makes testing tricky.

It didn't, at least not with linnerud or the np.random.normal arrays I was using.

That's because I gave you the wrong attribute! Sorry about that. The number of output features should be self.ccora_.projector.shape[1] (the columns, not rows). If you run transform on the Moscow test data, you'll get this:

import pandas as pd
from sknnr.transformers import CCorATransformer

X = pd.read_csv("../tests/data/moscow_env.csv").iloc[:, 1:]
y = pd.read_csv("../tests/data/moscow_spp.csv").iloc[:, 1:]
txfr = CCorATransformer().fit(X, y)
print(txfr.ccora_.projector.shape[1])
# Prints 5
print(txfr.transform(X).shape)
# Prints (165, 5)

Actually, just implementing get_feature_names_out from scratch for those two transformers would probably be simpler than calling the superclass and modifying the output.

I like this option better if you're OK with this. But let me know if you feel differently.

@aazuspan (Contributor Author) commented Jun 1, 2023

Did they always return n-1 dimensions based on n features?

Yep, here's the code I was experimenting with if you want to take a closer look:

import numpy as np
from sknnr.transformers import CCATransformer

for n_features in range(1, 20):
    n_samples = 30

    X = np.random.normal(loc=10, size=(n_samples, n_features))
    y = np.random.normal(loc=10, size=(n_samples, n_features))

    n_features_out = CCATransformer().fit(X, y).cca_.eigenvalues.shape[0]
    print(n_features, n_features_out)

I might play around with this a bit more to try to understand what should be expected, but can't promise that I'll be able to provide a coherent answer!

That's okay, I'm not sure I'll ever fully grok the stats, but as long as someone does and the tests pass, I'm happy!

If the test is meant to show the expected behavior (i.e. the number of features should always equal the number of axes), I think it could be misleading. Based on what you've already found along with the optimization of "meaningful" axes in both CCA and CCorA, I wouldn't necessarily expect this to be true. But if the test is meant to capture what is happening with this specific data, then I'm OK with it.

Well said! I felt slightly uneasy about the test, and I think you captured what I didn't like about it. I do think we should have some test of output shape for get_feature_names_out, since it would otherwise be very easy for a bug to go unnoticed there. What do you think about comparing the shape of the feature names to the shape of the transformed X, like below? It requires a little more computation, but probably not enough to ever notice with the small number of transformers we have.

@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer, moscow_euclidean):
    """Test that all transformers get feature names out."""
    fit_transformer = transformer.fit(X=moscow_euclidean.X, y=moscow_euclidean.y)
    feature_names = fit_transformer.get_feature_names_out()
    X_transformed = fit_transformer.transform(X=moscow_euclidean.X)

    assert feature_names.shape == (X_transformed.shape[1],)

That's because I gave you the wrong attribute! Sorry about that. The number of output features should be self.ccora_.projector.shape[1] (the columns, not rows).

No problem! I hate to say it, but I think this is exactly what GitHub Copilot suggested when I first started writing the test...

I like this option better if you're OK with this. But let me know if you feel differently.

I have a slight reservation that none of the sklearn transformers include "Transformer" in the name, but I also don't have a better suggestion. Maybe we can revisit naming again in the future once the rest of the API is in place, but for now I'm perfectly happy implementing get_feature_names_out from scratch, since it's such a simple method.

@grovduck (Member) commented Jun 1, 2023

Yep, here's the code I was experimenting with if you want to take a closer look

OK, I think I got it. If you have more features in y than you do in X, then the number of eigenvalues should be equal to the number of features in X. Try this:

import numpy as np
from sknnr.transformers import CCATransformer

for n_features in range(1, 20):
    n_samples = 30

    X = np.random.normal(loc=10, size=(n_samples, n_features))
    y = np.random.normal(loc=10, size=(n_samples, n_features + 1))

    n_features_out = CCATransformer().fit(X, y).cca_.eigenvalues.shape[0]
    print(n_features, n_features_out)

It may be more predictable than I think, but it might require checking the input array shapes to determine what to expect. I think if you have completely collinear columns in y, that will reduce the rank as well.

What do you think about comparing the shape of the feature names to the shape of the transformed X, like below?

I like this approach much better. Thanks for being flexible!

I hate to say it, but I think this is exactly what Github Copilot suggested

Replaced by the machines!

I have a slight reservation that none of the sklearn transformers include "Transformer" in the name

That is pretty interesting, given the prevalence of estimators with Classifier and Regressor in their names. I'm happy to revisit as well, but I think moving forward with the custom get_feature_names_out makes sense!

@aazuspan (Contributor Author) commented Jun 1, 2023

New question for you, @grovduck!

When you fit a GNNRegressor or MSNRegressor with a dataframe, what do you think feature_names_in_ should be?

Currently, test_estimators_support_dataframes assumes that it should be the name of the features from the dataframe (e.g. PIPO_BA, PSME_BA, etc). However, implementing feature names for our transformers broke that test because feature_names_in_ now returns the names of the components (e.g. cca0, cca1, etc) that were returned from the transformer.

The docstring for KNeighborsRegressor says that feature_names_in_ should return "names of features seen during fit", which is technically the transformed components in our case. However, my concern is that users will expect to see the names of the features that were used to fit the transformer, not the estimator.

EDIT: Another consideration is that once dimensionality reduction is implemented, there will also be a shape mismatch between the two sets of names.

I think I tentatively lean towards returning the names that were actually used to fit the estimator (cca...), but making that clear in the docstring and pointing users to use estimator.transform_.feature_names_in_ if they want the features that were used for the transformer, but maybe that's exposing the user too much to the implementation details of the estimator?

What do you think?

@grovduck (Member) commented Jun 2, 2023

Oof, this is a good question and getting these names to work is a bit of a pain, eh? First off, I don't think I have a good answer, but I'll ramble for a bit ...

I think fundamentally, I'm still viewing these estimators as more or less pipelines even if we're not using that workflow. In that sense, our estimators are composed of a transformer and a regressor. Each of these may have their own feature_names_in_ - the transformer has the original feature names (e.g. PIPO_BA) and the regressor has the transformed feature names (e.g. cca0), but the container itself should have the original feature names rather than the transformed feature names.

In a bit of a thought experiment, here's a pipeline with a PCA and a KNeighborsRegressor.

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import load_linnerud
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    [("PCA", PCA(n_components=2)), ("KNN", KNeighborsRegressor(n_neighbors=3))]
)

X, y = load_linnerud(return_X_y=True, as_frame=True)

# Component-wise
transformer = pipe[:-1].fit(X, y)
transformer.set_output(transform="pandas")
print(transformer.feature_names_in_)
# Reports ['Chins', 'Situps', 'Jumps']
print(transformer.get_feature_names_out())
# Reports ['pca0' 'pca1']

X_transformed = transformer.transform(X)
print(X_transformed.columns)
# Reports ['pca0', 'pca1']

regressor = pipe[-1].fit(X_transformed, y)
print(regressor.feature_names_in_)
# Reports ['pca0', 'pca1']

# All at once
model = pipe.fit(X, y)
model.set_output(transform="pandas")
print(model.feature_names_in_)
# Reports ['Chins', 'Situps', 'Jumps']
predicted = model.predict(X)
print(predicted.columns)
# AttributeError: 'numpy.ndarray' object has no attribute 'columns'

So this seems to follow what I would expect, other than the last step, where predict returns a numpy array instead of a dataframe. It seems like we could accommodate that (i.e. return a dataframe) by post-processing the numpy output, if we set feature_names_in_ to the original attribute names for these estimators.

I feel like I'm totally off on a tangent, so reel me back in.

@aazuspan (Contributor Author) commented Jun 2, 2023

Oof, this is a good question and getting these names to work is a bit of a pain, eh?

Yeah, this is turning into a tougher issue than I expected!

In that sense, our estimators are composed of a transformer and a regressor. Each of these may have their own feature_names_in_ - the transformer has the original feature names (e.g. PIPO_BA) and the regressor has the transformed feature names (e.g. cca0), but the container itself should have the original feature names rather than the transformed feature names.

You're probably right here, and the pipeline is a good analogy. The challenge is getting our estimators to work as pipelines while also meeting all the assumptions and checks associated with regressors.

My first thought to solve this was to define a feature_names_in_ property on the TransformedKNeighborsMixin that returns the attr from the wrapped transformer (note this requires defining property setters and deleters to prevent sklearn from modifying that attr when fitting).

class TransformedKNeighborsMixin(KNeighborsRegressor):
    """
    Mixin for KNeighbors regressors that apply transformations to the feature data.
    """
    @property
    def feature_names_in_(self):
        return self.transform_.feature_names_in_

    @feature_names_in_.setter
    def feature_names_in_(self, value):
        ...

    @feature_names_in_.deleter
    def feature_names_in_(self):
        ...

The problem with this is that BaseEstimator._check_feature_names gets called during prediction and throws an error because the feature names we overrode don't match the names that were used to fit the estimator, and the only solution I've come up with is to override _check_feature_names. Maybe that's okay given that we're essentially delegating responsibility for feature names to the transformer, but it feels a little heavy handed...
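For concreteness, the override I'm picturing would delegate the consistency check to the fitted transformer, something like this (a sketch against sklearn's private _check_feature_names(X, *, reset) signature as of 1.2):

def _check_feature_names(self, X, *, reset):
    # Skip the estimator-level check (our feature_names_in_ intentionally
    # differs from the transformed names seen during fit) and verify X
    # against the transformer, which saw the original feature names.
    if not reset and hasattr(self, "transform_"):
        self.transform_._check_feature_names(X, reset=False)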

Any alternatives you can think of, or does this approach seem okay to you?

So this seems to follow what I would expect, other than the last step where we're not returning a dataframe (but instead a numpy array) from the call to predict. It seems like we can accommodate for that (i.e. returning a dataframe) by post-fixing the numpy output if we set feature_names_in_ to the original attribute names for these estimators.

I also assumed (and maybe even claimed before) that sklearn estimators return dataframes when predicting from dataframes, but it looks like that's not the case and predict will always return an array, even outside a pipeline. They can accept dataframes and should store feature names, but never seem to output dataframes based on a few quick experiments I ran. If that's the case, I'm not sure we want to change that behavior for our estimators. What do you think?

@grovduck (Member) commented Jun 2, 2023

Any alternatives you can think of, or does this approach seem okay to you?

Yes, I definitely understand what you're saying about it feeling a bit heavy handed, but I fully trust that you've thought through the other possibilities and this might be the only way around this issue. I will defer to your judgment here.

They can accept dataframes and should store feature names, but never seem to output dataframes based on a few quick experiments I ran. If that's the case, I'm not sure we want to change that behavior for our estimators. What do you think?

Good point! I think I was incorrectly thinking about transformers when I wrote that (thinking back to the video I saw). In that video, he explicitly says that dataframe output is not yet supported for predict, predict_proba, etc. Sorry for the confusion on this one - I agree that we should keep the regressor output as arrays.

Thanks for all your hard work on this one - it doesn't sound like it's been the most enjoyable dive.

aazuspan added a commit that referenced this issue Jun 6, 2023
All transformers now support `get_feature_names_out` and `set_output`
methods. The first method was manually implemented for CCA and CCorA
and was inherited from `OneToOneFeatureMixin` for Mahalanobis. The
second method was automatically available once `get_feature_names_out`
was implemented, because all transformers subclass `BaseEstimator` and
indirectly `_SetOutputMixin`. To get `get_feature_names_out` working,
this also implements `_n_features_out` properties for CCA and CCorA.

Tests for these new features are included to ensure that the length of
feature names matches the output features of each transformer, and that
set_output is callable.

Tests are passing, but warnings are raised when estimators are fit with
dataframes. This will be fixed once we use `set_output` to set the
transformer mode in our estimators and remove the `NamedFeatureArray`
hack.
aazuspan added a commit that referenced this issue Jun 6, 2023
`NamedFeatureArray`, which was used to trick sklearn into storing
feature names from array inputs, is now removed. Instead, we use the
`set_output` method on transformers to ensure that they pass dataframes
through to allow estimators to store feature names.

`feature_names_in_` was overridden for all transformed estimators to
return feature names from the transformer rather than the estimator.
This means that the property will return the names of features that
the user passed in to fit the estimator, rather than the transformed
features that were used internally (e.g. cca0, cca1, etc).

Overriding `feature_names_in_` caused `_check_feature_names` to fail
when called during fitting, because `feature_names_in_` intentionally
mismatches the names seen during fit. To overcome that, we override that
method to remove the affected check. We still need to handle warnings if
feature names are incorrectly missing, so we currently borrow part of the
implementation for that method from sklearn (BSD-3 license).

This commit modifies some calls to `_validate_data` from the previous
commit to avoid overriding X. This is done because `_validate_data` casts
to array, which can cause issues when a transformer calls a subsequent
transformer (e.g. MahalanobisTransformer calls StandardScalerWithDOF)
with a dataframe input, as feature names will not match between the
transformers, leading to user warnings when predicting. Instead, X is
explicitly cast when needed using `np.asarray` and validation is
always done without overriding X. Note that `_validate_data` must be
called by all transformers that do not call it via a superclass because
this indirectly stores feature names.

While implementing this, the difficulty of tracking what output types
are expected from what transformers and estimators with what inputs and
configuration options became VERY clear, so we now have some basic
"consistency" tests that compare all of our transformers and estimators
with a comparable sklearn implementation to check output types and attrs
under a range of situations.

A very small change is that parametrized estimator checks are now passed
classes instead of instances because this leads to more helpful pytest
errors.
aazuspan added a commit that referenced this issue Jun 6, 2023
- Pin scikit-learn>=1.2. This is the first release with the `set_output`
	API, and added a number of features and classes that we rely on and
	test against (e.g. OneToOneFeatureMixin).
- Remove unused `ClassNamePrefixFeaturesOutMixin`
- Remove unused `input_features` arg to `get_feature_names_out`.
	This is only used for validating or replacing feature names in
	the sklearn implementation, so isn't relevant to our transformers
	that potentially apply dimensionality reduction.
- Fix output types for test functions
@aazuspan linked a pull request Jun 7, 2023 that will close this issue
@aazuspan (Contributor Author) commented Jun 7, 2023

Resolved (hopefully for good!) by #34

@aazuspan closed this as completed Jun 7, 2023
@grovduck added this to the Core Estimators milestone Jun 7, 2023