
Store feature names for transformed estimators #22

Merged
merged 2 commits into fb_add_estimators from feature_names on May 15, 2023

Conversation

aazuspan
Contributor

@aazuspan aazuspan commented May 15, 2023

Hey @grovduck, I decided to move ahead with this since, as you mentioned, it should hopefully simplify #21 a little bit.

This would close #20 by wrapping dataframe X inputs in the new NamedFeatureArray after transforming and before passing them on to the fit, predict, or kneighbors methods. By storing their columns attribute, this allows sklearn to access and set feature names that would otherwise be lost.
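For reference, here is a minimal sketch of the kind of ndarray subclass described above (an illustration only; the actual NamedFeatureArray implementation may differ):

# Hypothetical sketch, not the PR's exact code
import numpy as np

class NamedFeatureArray(np.ndarray):
    """ndarray subclass that carries a `columns` attribute like a dataframe."""

    def __new__(cls, array, columns=None):
        obj = np.asarray(array).view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        # Propagate `columns` through views and slices so it survives indexing.
        if obj is not None:
            self.columns = getattr(obj, "columns", None)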

A few questions/things for you to consider as you look over this:

  1. What do you think about having NamedFeatureArray in _base? It's not directly related to sklearn, which makes me think it could go elsewhere, but let me know what you think.
  2. This moves predict out of the individual estimators and into TransformedKNeighborsMixin. When we discussed this before, we decided to leave predict duplicated in our estimators for now in case they need to be implemented differently, but since this change would require modifying them all anyway, I went ahead and combined them. Let me know if you think that's premature and I should re-implement them in the subclasses.
  3. To keep things simple, I just wrote a manual test for feature_names_in_ rather than trying to take advantage of the estimator_checks module. I figure we'll have to get a lot more familiar with that module as we work on Get all sklearn estimator checks passing #21, so if it makes sense we can switch to one of the built-in tests then.

Just noticed there's definitely a typo in the commit title 🤦‍♂️

EDIT: One more question: what do you think about having a _transform method on objects with a transform_ attribute? Is this too confusing? Maybe the method should be renamed to something like _apply_transform?

Transformed estimators like GNNRegressor run a transformer on X
before fitting or predicting. When X is a dataframe, transforming
converts it into an array, preventing sklearn from extracting
feature names. To fix this, we wrap the transformed array in an
ndarray subclass called NamedFeatureArray that is capable of
storing a `columns` attribute prior to passing it to `fit` or
`predict`. This tricks sklearn into thinking that it is a
dataframe and allows feature names to be successfully accessed
and set on the estimator.

To accomplish this cleanly, we move all the actual transformation
steps out of the individual estimators and into the
TransformedKNeighborsMixin methods. If we need to implement
different `predict` methods for different estimators in the
future, they can be re-implemented at the estimator level to use
the _transform method of their superclass.

To prevent regressions, this commit also expands the dataframe
support test to check that feature names are correctly stored.
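A hypothetical sketch of the mixin pattern this commit describes (it assumes the mixin is combined with a KNeighborsRegressor subclass and that the transformer preserves the original feature count; the commit's _transform helper was later renamed _apply_transform):

# Illustrative sketch only, not the PR's exact code
class TransformedKNeighborsMixin:
    """Mixin that applies the fitted transform_ before neighbor queries."""

    def _apply_transform(self, X):
        X_transformed = self.transform_.transform(X)
        if hasattr(X, "columns"):
            # Re-attach column names so sklearn can store feature_names_in_.
            X_transformed = NamedFeatureArray(X_transformed, columns=X.columns)
        return X_transformed

    def predict(self, X):
        return super().predict(self._apply_transform(X))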
@aazuspan aazuspan added bug Something isn't working estimator Related to one or more estimators labels May 15, 2023
@aazuspan aazuspan linked an issue May 15, 2023 that may be closed by this pull request
@grovduck
Member

Hey @aazuspan, beautiful PR. I like this design quite a bit and it feels really natural. To your questions:

What do you think about having NamedFeatureArray in _base?

I think _base seems perfectly appropriate unless there is some need for a _utils module. Knowing myself, once I have a utilities type module, I throw everything that doesn't fit elsewhere in there(!), so probably not a good habit to get into. It looks like sklearn has a utils package that has individual modules in there, so I guess it could be sknnr/utils/_named_feature_array.py or something like that, but I think that might be slicing it too thinly. I'm totally fine with keeping it in _base if you are.

This moves predict out of the individual estimators and into TransformedKNeighborsMixin.

Yes, let's do this (I love how clean those top-level estimators are now!), although I think there are a couple of scenarios to consider. The first is an estimator like RF-NN, which will use a "distance" measure of node similarity, where the distance between a pixel and a plot is given by (1 - (number of shared terminal nodes across trees)). So this is no longer a KNeighborsRegressor and won't fit this pattern anyway.

The second is a bit more nuanced. For Euclidean and Mahalanobis, the transformations just scale the input covariates themselves, i.e. the "axes" in the multivariate neighbor space are still associated with the covariates. In this way, it doesn't make sense to eliminate any of the axes when finding neighbors. However, both MSN and GNN (and a few others that we might consider implementing) create axes as linear combinations of the input covariates, so they can also serve as dimension reduction tools. For example, we typically run GNN using only a subset of the axes - this could either be a set number or a proportion of the cumulative sum of the eigenvalues. I think this would be one or more hyperparameters of these methods and would default to the full set of axes. And I think the dimension reduction can still happen in fit on these methods, e.g.

# Untested
class GNNRegressor(IDNeighborsRegressor, TransformedKNeighborsMixin):
    def __init__(self, num_cca_axes=None):
        self.num_cca_axes = num_cca_axes

    def fit(self, X, y, spp=None):
        # CCATransformer will be responsible for returning the correct number of axes
        self.transform_ = CCATransformer(self.num_cca_axes).fit(X, y=y, spp=spp)
        return super().fit(X, y)
leaving TransformedKNeighborsMixin.predict intact. I haven't experimented with this yet, so I may not be considering everything.

To keep things simple, I just wrote a manual test for feature_names_in_

Totally understandable. Like you, I imagine #21 will take a while to get right, so it's nice to have tests in place for this one for now, especially because it didn't look like that was part of the "natural" suite of tests.

One more question, what do you think about having a _transform method on objects with a transform_ attribute. Is this too confusing? Maybe the method should be renamed to something like _apply_transform?

I do like the _apply_transform alternative, unless you would rather be more explicit in the naming of the estimated attribute transform_. I'm OK with either approach ... I think it's more natural to think of methods as verbs, so I have a slight preference for renaming to _apply_transform.

@aazuspan
Contributor Author

aazuspan commented May 15, 2023

It looks like sklearn has a utils package that has individual modules in there, so I guess it could be sknnr/utils/_named_feature_array.py or something like that, but I think that might be slicing it too thinly.

Yeah, I agree. That looks like a good organization to use if we end up needing a lot more utility code, but for now it's probably overkill.

The first is an estimator like RF-NN which will use a "distance" measure of node similarity, where the distance between a pixel and plot is given by (1 - (number of shared terminal nodes across tree)). So this ceases to be a KNeighborsRegressor anymore and won't fit this pattern anyway.

This is a very interesting point... So RFNN will not identify neighbors in the same way as the other KNeighborsRegressor estimators, but it will potentially generate predictions in the same way, i.e. using weighted means of nearest neighbors, right? And I assume we will need to be able to access kneighbors from the estimator? Is my thinking right that we effectively want it to inherit fit from RandomForestRegressor to build and train the trees, predict from KNeighborsRegressor to calculate weighted means of nearest neighbors, and kneighbors from a custom implementation that uses node similarity?

In any case, it sounds like this probably won't interact with TransformedKNeighborsMixin, so luckily not something we need to fully figure out yet.

I think this would be one or more hyperparameters of these methods and would default to the full set of axes. And I think the dimension reduction can still happen in fit on these methods

Thanks for the explanation here! I think I follow the complication, and your logic of using model hyperparameters and fit to run the dimensionality reduction makes sense to me.

I think it's more natural to think of methods as verbs so I have a slight preference for renaming to _apply_transform.

100%, I'll make that change! (Edit: changed!)

@grovduck
Member

Looks great! Go forward, I say!

@aazuspan aazuspan merged commit d2c6aac into fb_add_estimators May 15, 2023
10 checks passed
@aazuspan aazuspan deleted the feature_names branch May 15, 2023 21:00
@grovduck
Member

grovduck commented May 15, 2023

This is a very interesting point... So RFNN will not identify neighbors in the same way as the other KNeighborsRegressor estimators, but it will potentially generate predictions in the same way, i.e. using weighted means of nearest neighbors, right? And I assume we will need to be able to access kneighbors from the estimator?

Exactly right.

Is my thinking right that we effectively want it to inherit fit from RandomForestRegressor to build and train the trees, predict from KNeighborsRegressor to calculate weighted means of nearest neighbors, and kneighbors from a custom implementation that uses node similarity?

This is such a great question and I think the answer is complicated. I'll try to give a synopsis of how RF-NN works and where I see the sklearn estimators fitting.

  1. fit - One or more y attributes (and typically multiple X covariates) are used to fit different forests, one forest per y attribute. The way they've implemented it in yaImpute, each forest is actually a classification problem - the y attributes passed are either categorical attributes (think vegetation class) or continuous attributes that are binned into classes using some classification scheme (equal interval, quantile, natural breaks, etc.). However, the actual prediction from random forests doesn't matter (this is the mind twist) because we only care about the node IDs where the references and targets land at the deepest level, so maybe it doesn't totally matter whether we inherit from RandomForestRegressor or RandomForestClassifier. To me, the RFNN estimator is composed of one or more of these estimators (has-a) rather than being one of these (is-a). We'll be leaning on the apply method of each forest to return the leaf node IDs. I imagine that we will have an estimator attribute that holds the 2-D array of node IDs, with (m forests x n trees) columns by p reference rows, as a result of fitting.
  2. kneighbors and predict - As opposed to all the other estimators we've introduced so far, which have a neighbor space greater than 1D, the distances here are strictly based on the inverse of the number of nodes that the reference and target share in common (although a twist could be that each y attribute - or forest - represents its own "axis" and we'd have as many axes as y attributes - I haven't tried this). So I think you're absolutely right that kneighbors would have to come from the RFNN estimator, because the neighbor finding isn't based on Euclidean coordinates. But I'm trying to figure out if we can lean on KNeighborsRegressor if we use the callable weights parameter, because given weights and y attributes, the actual calculation of predicted attributes will be exactly the same. At the same time, it would be pretty trivial to do the predict outside of KNeighborsRegressor - I think it would just be np.average with a weights parameter (see the sketch after this list).
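To make the node-similarity idea concrete, here is a rough, hypothetical sketch using a stock RandomForestRegressor (this is not the project's implementation; the single forest, the similarity-as-weights choice, and all names below are assumptions for illustration):

# Hypothetical sketch of node-similarity neighbors, not project code
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=50, n_features=5, random_state=0)
X_ref, y_ref, X_target = X[:40], y[:40], X[40:]

# One forest here for simplicity; RF-NN would fit one forest per y attribute.
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_ref, y_ref)

# apply() returns the leaf node ID of each sample in each tree.
ref_nodes = forest.apply(X_ref)        # shape (n_references, n_trees)
target_nodes = forest.apply(X_target)  # shape (n_targets, n_trees)

# Similarity = number of trees in which a target and a reference share a leaf.
shared = (target_nodes[:, None, :] == ref_nodes[None, :, :]).sum(axis=-1)

# Take the k most similar references and predict with a weighted mean.
k = 5
neighbors = np.argsort(-shared, axis=1)[:, :k]
weights = np.take_along_axis(shared, neighbors, axis=1).astype(float)
predictions = np.array([
    np.average(y_ref[idx], weights=w if w.sum() else None)
    for idx, w in zip(neighbors, weights)
])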

Obviously, I think there is still a bit of work to get this implemented and I'm not clear on the path forward yet. But if you see clear paths, please continue to ask questions. We can definitely lift this from here into a separate issue as well, although it may just be thoughts at this point.

Moved discussion of this issue to #24.

@aazuspan
Contributor Author

Very interesting, thanks for the detailed explanations! I can see why you're saving RF-NN for last given the additional complexity over the other estimators, but it sounds like you've already thought through a lot of the nuances.

We can definitely lift this from here into a separate issue as well, although it may just be thoughts at this point.

This is probably a good idea! I think it's possible to go overboard setting up every foreseeable future issue, but this seems like it's clearly on the roadmap and it would be good to consolidate discussion in one place.

aazuspan added a commit that referenced this pull request Jun 6, 2023
All transformers now support `get_feature_names_out` and `set_output`
methods. The first method was manually implemented for CCA and CCorA
and was inherited from `OneToOneFeatureMixin` for Mahalanobis. The
second method was automatically available once `get_feature_names_out`
was implemented, because all transformers subclass `BaseEstimator` and
indirectly `_SetOutputMixin`. To get `get_feature_names_out` working,
this also implements `_n_features_out` properties for CCA and CCorA.

Tests for these new features are included to ensure that the number of
feature names matches the number of output features of each transformer, and that
set_output is callable.

Tests are passing, but warnings are raised when estimators are fit with
dataframes. This will be fixed once we use `set_output` to set the
transformer mode in our estimators and remove the `NamedFeatureArray`
hack.
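For illustration, a hedged sketch of the pattern this commit describes (the CCATransformer internals shown here, such as the placeholder projector_, are assumptions rather than the project's actual code):

# Illustrative sketch only
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CCATransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        X = self._validate_data(X)  # stores n_features_in_ / feature_names_in_
        self.projector_ = np.eye(X.shape[1])  # placeholder for the fitted CCA axes
        return self

    def transform(self, X):
        return np.asarray(X) @ self.projector_

    @property
    def _n_features_out(self):
        # Number of output axes produced by this transformer.
        return self.projector_.shape[1]

    def get_feature_names_out(self, input_features=None):
        # Output axes are named for the transformation, not the input covariates.
        return np.asarray([f"cca{i}" for i in range(self._n_features_out)], dtype=object)

With get_feature_names_out in place, calling set_output(transform="pandas") on an instance makes transform return a dataframe with those names, which is what later allows the estimators to keep feature names without the NamedFeatureArray hack.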
aazuspan added a commit that referenced this pull request Jun 6, 2023
`NamedFeatureArray`, which was used to trick sklearn into storing
feature names from array inputs, is now removed. Instead, we use the
`set_output` method on transformers to ensure that they pass dataframes
through to allow estimators to store feature names.

`feature_names_in_` was overridden for all transformed estimators to
return feature names from the transformer rather than the estimator.
This means that the property will return the names of features that
the user passed in to fit the estimator, rather than the transformed
features that were used internally (e.g. cca0, cca1, etc).

Overriding `feature_names_in_` caused `_check_feature_names` to fail
when called during fitting, because `feature_names_in_` intentionally
mismatches the names seen during fit. To overcome that, we override that
method to remove the affected check. We still need to handle warnings if
feature names are incorrectly missing, so we currently borrow part of the
implementation for that method from sklearn (BSD-3 license).

This commit modifies some calls to `_validate_data` from the previous
commit to avoid overwriting X. This is done because `_validate_data` casts
to array, which can cause issues when a transformer calls a subsequent
transformer (e.g. MahalanobisTransformer calls StandardScalerWithDOF)
with a dataframe input, as feature names will not match between the
transformers, leading to user warnings when predicting. Instead, X is
explicitly cast when needed using `np.asarray` and validation is
always done without overwriting X. Note that `_validate_data` must be
called by all transformers that do not call it via a superclass because
this indirectly stores feature names.

While implementing this, the difficulty of tracking what output types
are expected from what transformers and estimators with what inputs and
configuration options became VERY clear, so we now have some basic
"consistency" tests that compare all of our transformers and estimators
with a comparable sklearn implementation to check output types and attrs
under a range of situations.

A very small change is that parametrized estimator checks are now passed
classes instead of instances because this leads to more helpful pytest
errors.
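A hedged sketch of the feature_names_in_ override this commit describes (class and attribute names are inferred from context rather than copied from the project):

# Illustrative sketch only
class TransformedKNeighborsMixin:
    @property
    def feature_names_in_(self):
        # Report the names the user passed to fit, which the fitted transformer
        # recorded during its own fit, rather than the transformed names
        # (cca0, cca1, ...) that the estimator saw internally.
        return self.transform_.feature_names_in_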
aazuspan added a commit that referenced this pull request Jun 8, 2023
This overhauls the feature validation to dramatically simplify
changes that were introduced by #22 and #34. I think the easiest way
to explain the changes in this commit is to explain how we ended up
with that complexity.

In #22, we needed to ensure that our estimators got feature names
when fitted with dataframe inputs. This was complicated by the fact
that they received transformed inputs, and our transformers cast
dataframes to arrays, dropping the names. We dealt with this using a
`NamedFeatureArray` class that would retain names after being cast.

In #34, we implemented the `set_output` API and used this instead
to ensure that our transformers returned dataframe outputs. At the
same time, we decided that estimator feature names should match the
original names, not the transformed names, and implemented
the `feature_names_in_` property.

What we failed to realize is that the `feature_names_in_` property
removed any need to extract feature names from the transformed inputs
because that extraction is handled entirely by the transformers! This
removed the need to use the `set_output` API.

The other major change is that the `_check_feature_names` method that
we overrode in #34 does not need to be implemented directly because
we can instead use that method from the transformer (via `_validate_data`)
without having to worry about mismatched feature names during fitting.
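A small demonstration of that insight using a stock sklearn transformer (StandardScaler stands in here for the project's transformers):

# Illustration with a stock transformer, not project code
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"elev": [100.0, 250.0, 300.0], "precip": [800.0, 1200.0, 950.0]})
scaler = StandardScaler().fit(X)

# The transformer already extracted the original names during its own fit...
print(scaler.feature_names_in_)  # ['elev' 'precip']

# ...and its own validation warns on mismatched names at transform time, so an
# estimator can delegate to the transformer instead of re-checking names itself.
scaler.transform(X)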