
Store feature names for transformed estimators #22

Merged
merged 2 commits into fb_add_estimators from feature_names on May 15, 2023

Conversation

aazuspan
Contributor

@aazuspan aazuspan commented May 15, 2023

Hey @grovduck, I decided to move ahead with this since, as you mentioned, it should hopefully simplify #21 a little bit.

This would close #20 by wrapping dataframe X inputs in the new NamedFeatureArray after transforming and before passing them on to the fit, predict, or kneighbors methods. By storing their columns attribute, this allows sklearn to access and set feature names that would otherwise be lost.
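For reference, here is a minimal sketch of the kind of ndarray subclass described above (an illustration only; the actual NamedFeatureArray implementation may differ):

# Hypothetical sketch, not the PR's exact code
import numpy as np

class NamedFeatureArray(np.ndarray):
    """ndarray subclass that carries a `columns` attribute like a dataframe."""

    def __new__(cls, array, columns=None):
        obj = np.asarray(array).view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        # Propagate `columns` through views and slices so it survives indexing.
        if obj is not None:
            self.columns = getattr(obj, "columns", None)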

A few questions/things for you to consider as you look over this:

  1. What do you think about having NamedFeatureArray in _base? It's not directly related to sklearn, which makes me think it could go elsewhere, but let me know what you think.
  2. This moves predict out of the individual estimators and into TransformedKNeighborsMixin. When we discussed this before, we decided to leave predict duplicated in our estimators for now in case they need to be implemented differently, but since this change would require modifying them all anyway, I went ahead and combined them. Let me know if you think that's premature and I should re-implement them in the subclasses.
  3. To keep things simple, I just wrote a manual test for feature_names_in_ rather than trying to take advantage of the estimator_checks module. I figure we'll have to get a lot more familiar with that module as we work on Get all sklearn estimator checks passing #21, so if it makes sense we can switch to one of the built-in tests then.

Just noticed there's definitely a typo in the commit title 🤦‍♂️

EDIT: One more question: what do you think about having a _transform method on objects with a transform_ attribute? Is this too confusing? Maybe the method should be renamed to something like _apply_transform?

Transformed estimators like GNNRegressor run a transformer on X
before fitting or predicting. When X is a dataframe, transforming
converts it into an array, preventing sklearn from extracting
feature names. To fix this, we wrap the transformed array in an
ndarray subclass called NamedFeatureArray that is capable of
storing a `columns` attribute prior to passing it to `fit` or
`predict`. This tricks sklearn into thinking that it is a
dataframe and allows feature names to be successfully accessed
and set on the estimator.

To accomplish this cleanly, we move all the actual transformation
steps out of the individual estimators and into the
TransformedKNeighborsMixin methods. If we need to implement
different `predict` methods for different estimators in the
future, they can be re-implemented at the estimator level to use
the _transform method of their superclass.

To prevent regressions, this commit also expands the dataframe
support test to check that feature names are correctly stored.
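A hypothetical sketch of the mixin pattern this commit describes (it assumes the mixin is combined with a KNeighborsRegressor subclass and that the transformer preserves the original feature count; the commit's _transform helper was later renamed _apply_transform):

# Illustrative sketch only, not the PR's exact code
class TransformedKNeighborsMixin:
    """Mixin that applies the fitted transform_ before neighbor queries."""

    def _apply_transform(self, X):
        X_transformed = self.transform_.transform(X)
        if hasattr(X, "columns"):
            # Re-attach column names so sklearn can store feature_names_in_.
            X_transformed = NamedFeatureArray(X_transformed, columns=X.columns)
        return X_transformed

    def predict(self, X):
        return super().predict(self._apply_transform(X))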
@aazuspan aazuspan added bug Something isn't working estimator Related to one or more estimators labels May 15, 2023
@aazuspan aazuspan linked an issue May 15, 2023 that may be closed by this pull request
@grovduck
Member

Hey @aazuspan, beautiful PR. I like this design quite a bit and it feels really natural. To your questions:

What do you think about having NamedFeatureArray in _base?

I think _base seems perfectly appropriate unless there is some need for a _utils module. Knowing myself, once I have a utilities type module, I throw everything that doesn't fit elsewhere in there(!), so probably not a good habit to get into. It looks like sklearn has a utils package that has individual modules in there, so I guess it could be sknnr/utils/_named_feature_array.py or something like that, but I think that might be slicing it too thinly. I'm totally fine with keeping it in _base if you are.

This moves predict out of the individual estimators and into TransformedKNeighborsMixin.

Yes, let's do this (I love how clean those top-level estimators are now!), although I think there are a couple of scenarios to consider. The first is an estimator like RF-NN, which will use a "distance" measure of node similarity, where the distance between a pixel and a plot is given by (1 - (number of shared terminal nodes across trees)). So this is no longer a KNeighborsRegressor and won't fit this pattern anyway.

The second is a bit more nuanced. For Euclidean and Mahalanobis, the transformations just scale the input covariates themselves, i.e. the "axes" in the multivariate neighbor space are still associated with the covariates. In this way, it doesn't make sense to eliminate any of the axes when finding neighbors. However, both MSN and GNN (and a few others that we might consider implementing) create axes as linear combinations of the input covariates, so they can also serve as dimension reduction tools. For example, we typically run GNN using only a subset of the axes - this could either be a set number or a proportion of the cumulative sum of the eigenvalues. I think this would be one or more hyperparameters of these methods and would default to the full set of axes. And I think the dimension reduction can still happen in fit on these methods, e.g.

# Untested
class GNNRegressor(IDNeighborsRegressor, TransformedKNeighborsMixin):
    def __init__(self, num_cca_axes=None):
        self.num_cca_axes = num_cca_axes

    def fit(self, X, y, spp=None):
        # CCATransformer will be responsible for returning the correct number of axes
        self.transform_ = CCATransformer(self.num_cca_axes).fit(X, y=y, spp=spp)
        return super().fit(X, y)
leaving TransformedKNeighborsMixin.predict intact. I haven't experimented with this yet, so I may not be considering everything.

To keep things simple, I just wrote a manual test for feature_names_in_

Totally understandable. Like you, I imagine #21 will take a while to get right, so it's nice to have tests in place for this one for now, especially because it didn't look like that was part of the "natural" suite of tests.

One more question, what do you think about having a _transform method on objects with a transform_ attribute. Is this too confusing? Maybe the method should be renamed to something like _apply_transform?

I do like the _apply_transform alternative, unless you would rather be more explicit in the naming of the estimated attribute transform_. I'm OK with either approach ... I think it's more natural to think of methods as verbs, so I have a slight preference for renaming to _apply_transform.

@aazuspan
Contributor Author

aazuspan commented May 15, 2023

It looks like sklearn has a utils package that has individual modules in there, so I guess it could be sknnr/utils/_named_feature_array.py or something like that, but I think that might be slicing it too thinly.

Yeah, I agree. That looks like a good organization to use if we end up needing a lot more utility code, but for now it's probably overkill.

The first is an estimator like RF-NN which will use a "distance" measure of node similarity, where the distance between a pixel and plot is given by (1 - (number of shared terminal nodes across tree)). So this ceases to be a KNeighborsRegressor anymore and won't fit this pattern anyway.

This is a very interesting point... So RFNN will not identify neighbors in the same way as the other KNeighborsRegressor estimators, but it will potentially generate predictions in the same way, i.e. using weighted means of nearest neighbors, right? And I assume we will need to be able to access kneighbors from the estimator? Is my thinking right that we effectively want it to inherit fit from RandomForestRegressor to build and train the trees, predict from KNeighborsRegressor to calculate weighted means of nearest neighbors, and kneighbors from a custom implementation that uses node similarity?

In any case, it sounds like this probably won't interact with TransformedKNeighborsMixin, so luckily not something we need to fully figure out yet.

I think this would be one or more hyperparameters of these methods and would default to the full set of axes. And I think the dimension reduction can still happen in fit on these methods

Thanks for the explanation here! I think I follow the complication, and your logic of using model hyperparameters and fit to run the dimensionality reduction makes sense to me.

I think it's more natural to think of methods as verbs so I have a slight preference for renaming to _apply_transform.

100%, I'll make that change! (Edit: changed!)

@grovduck
Member

Looks great! Go forward, I say!

@aazuspan aazuspan merged commit d2c6aac into fb_add_estimators May 15, 2023
10 checks passed
@aazuspan aazuspan deleted the feature_names branch May 15, 2023 21:00
@grovduck
Member

grovduck commented May 15, 2023

This is a very interesting point... So RFNN will not identify neighbors in the same way as the other KNeighborsRegressor estimators, but it will potentially generate predictions in the same way, i.e. using weighted means of nearest neighbors, right? And I assume we will need to be able to access kneighbors from the estimator?

Exactly right.

Is my thinking right that we effectively want it to inherit fit from RandomForestRegressor to build and train the trees, predict from KNeighborsRegressor to calculate weighted means of nearest neighbors, and kneighbors from a custom implementation that uses node similarity?

This is such a great question and I think the answer is complicated. I'll try to give a synopsis of how RF-NN works and where I see the sklearn estimators fitting.

  1. fit - One or more y attributes (and typically multiple X covariates) are used to fit different forests, one forest per y attribute. The way they've implemented it in yaImpute, each forest is actually a classification problem - the y attributes passed are either categorical attributes (think vegetation class) or continuous attributes that are binned into classes using some classification scheme (equal interval, quantile, natural breaks, etc.). However, the actual prediction from random forests doesn't matter (this is the mind twist) because we only care about the node IDs where the references and targets land at the deepest level, so maybe it doesn't totally matter whether we inherit from RandomForestRegressor or RandomForestClassifier. To me, the RFNN estimator is composed of one or more of these estimators (has-a) rather than being one of these (is-a). We'll be leaning on the apply method of each forest to return the leaf node IDs. I imagine that we will have an estimator attribute that holds the 2-D array of node IDs, with (m forests x n trees) columns by p reference rows, as a result of fitting.
  2. kneighbors and predict - As opposed to all the other estimators we've introduced so far, which have a neighbor space greater than 1D, the distances here are strictly based on the inverse of the number of nodes that the reference and target share in common (although a twist could be that each y attribute - or forest - represents its own "axis" and we'd have as many axes as y attributes - I haven't tried this). So I think you're absolutely right that kneighbors would have to come from the RFNN estimator, because the neighbor finding isn't based on Euclidean coordinates. But I'm trying to figure out if we can lean on KNeighborsRegressor if we use the callable weights parameter, because given weights and y attributes, the actual calculation of predicted attributes will be exactly the same. At the same time, it would be pretty trivial to do the predict outside of KNeighborsRegressor - I think it would just be np.average with a weights parameter (see the sketch after this list).
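To make the node-similarity idea concrete, here is a rough, hypothetical sketch using a stock RandomForestRegressor (this is not the project's implementation; the single forest, the similarity-as-weights choice, and all names below are assumptions for illustration):

# Hypothetical sketch of node-similarity neighbors, not project code
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=50, n_features=5, random_state=0)
X_ref, y_ref, X_target = X[:40], y[:40], X[40:]

# One forest here for simplicity; RF-NN would fit one forest per y attribute.
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_ref, y_ref)

# apply() returns the leaf node ID of each sample in each tree.
ref_nodes = forest.apply(X_ref)        # shape (n_references, n_trees)
target_nodes = forest.apply(X_target)  # shape (n_targets, n_trees)

# Similarity = number of trees in which a target and a reference share a leaf.
shared = (target_nodes[:, None, :] == ref_nodes[None, :, :]).sum(axis=-1)

# Take the k most similar references and predict with a weighted mean.
k = 5
neighbors = np.argsort(-shared, axis=1)[:, :k]
weights = np.take_along_axis(shared, neighbors, axis=1).astype(float)
predictions = np.array([
    np.average(y_ref[idx], weights=w if w.sum() else None)
    for idx, w in zip(neighbors, weights)
])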

Obviously, I think there is still a bit of work to get this implemented and I'm not clear on the path forward yet. But if you see clear paths, please continue to ask questions. We can definitely lift this from here into a separate issue as well, although it may just be thoughts at this point.

Moved discussion of this issue to #24.

@aazuspan
Contributor Author

Very interesting, thanks for the detailed explanations! I can see why you're saving RF-NN for last given the additional complexity over the other estimators, but it sounds like you've already thought through a lot of the nuances.

We can definitely lift this from here into a separate issue as well, although it may just be thoughts at this point.

This is probably a good idea! I think it's possible to go overboard setting up every foreseeable future issue, but this seems like it's clearly on the roadmap and it would be good to consolidate discussion in one place.

aazuspan added a commit that referenced this pull request Jun 6, 2023
All transformers now support `get_feature_names_out` and `set_output`
methods. The first method was manually implemented for CCA and CCorA
and was inherited from `OneToOneFeatureMixin` for Mahalanobis. The
second method was automatically available once `get_feature_names_out`
was implemented, because all transformers subclass `BaseEstimator` and
indirectly `_SetOutputMixin`. To get `get_feature_names_out` working,
this also implements `_n_features_out` properties for CCA and CCorA.

Tests for these new features are included to ensure that the number of
feature names matches the number of output features of each transformer, and that
set_output is callable.

Tests are passing, but warnings are raised when estimators are fit with
dataframes. This will be fixed once we use `set_output` to set the
transformer mode in our estimators and remove the `NamedFeatureArray`
hack.
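For illustration, a hedged sketch of the pattern this commit describes (the CCATransformer internals shown here, such as the placeholder projector_, are assumptions rather than the project's actual code):

# Illustrative sketch only
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CCATransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        X = self._validate_data(X)  # stores n_features_in_ / feature_names_in_
        self.projector_ = np.eye(X.shape[1])  # placeholder for the fitted CCA axes
        return self

    def transform(self, X):
        return np.asarray(X) @ self.projector_

    @property
    def _n_features_out(self):
        # Number of output axes produced by this transformer.
        return self.projector_.shape[1]

    def get_feature_names_out(self, input_features=None):
        # Output axes are named for the transformation, not the input covariates.
        return np.asarray([f"cca{i}" for i in range(self._n_features_out)], dtype=object)

With get_feature_names_out in place, calling set_output(transform="pandas") on an instance makes transform return a dataframe with those names, which is what later allows the estimators to keep feature names without the NamedFeatureArray hack.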
aazuspan added a commit that referenced this pull request Jun 6, 2023
`NamedFeatureArray`, which was used to trick sklearn into storing
feature names from array inputs, is now removed. Instead, we use the
`set_output` method on transformers to ensure that they pass dataframes
through to allow estimators to store feature names.

`feature_names_in_` was overridden for all transformed estimators to
return feature names from the transformer rather than the estimator.
This means that the property will return the names of features that
the user passed in to fit the estimator, rather than the transformed
features that were used internally (e.g. cca0, cca1, etc).

Overriding `feature_names_in_` caused `_check_feature_names` to fail
when called during fitting, because `feature_names_in_` intentionally
mismatches the names seen during fit. To overcome that, we override that
method to remove the affected check. We still need to handle warnings if
feature names are incorrectly missing, so we currently borrow part of the
implementation for that method from sklearn (BSD-3 license).

This commit modifies some calls to `_validate_data` from the previous
commit to avoid overwriting X. This is done because `_validate_data` casts
to array, which can cause issues when a transformer calls a subsequent
transformer (e.g. MahalanobisTransformer calls StandardScalerWithDOF)
with a dataframe input, as feature names will not match between the
transformers, leading to user warnings when predicting. Instead, X is
explicitly cast when needed using `np.asarray` and validation is
always done without overwriting X. Note that `_validate_data` must be
called by all transformers that do not call it via a superclass because
this indirectly stores feature names.

While implementing this, the difficulty of tracking what output types
are expected from what transformers and estimators with what inputs and
configuration options became VERY clear, so we now have some basic
"consistency" tests that compare all of our transformers and estimators
with a comparable sklearn implementation to check output types and attrs
under a range of situations.

A very small change is that parametrized estimator checks are now passed
classes instead of instances because this leads to more helpful pytest
errors.
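A hedged sketch of the feature_names_in_ override this commit describes (class and attribute names are inferred from context rather than copied from the project):

# Illustrative sketch only
class TransformedKNeighborsMixin:
    @property
    def feature_names_in_(self):
        # Report the names the user passed to fit, which the fitted transformer
        # recorded during its own fit, rather than the transformed names
        # (cca0, cca1, ...) that the estimator saw internally.
        return self.transform_.feature_names_in_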
aazuspan added a commit that referenced this pull request Jun 8, 2023
This overhauls the feature validation to dramatically simplify
changes that were introduced by #22 and #34. I think the easiest way
to explain the changes in this commit is to explain how we ended up
with that complexity.

In #22, we needed to ensure that our estimators got feature names
when fitted with dataframe inputs. This was complicated by the fact
that they received transformed inputs, and our transformers cast
dataframes to arrays, dropping the names. We dealt with this using a
`NamedFeatureArray` class that would retain names after being cast.

In #34, we implemented the `set_output` API and used this instead
to ensure that our transformers returned dataframe outputs. At the
same time, we decided that estimator feature names should match the
original names, not the transformed names, and implemented
the `feature_names_in_` property.

What we failed to realize is that the `feature_names_in_` property
removed any need to extract feature names from the transformed inputs
because that extraction is handled entirely by the transformers! This
removed the need to use the `set_output` API.

The other major change is that the `_check_feature_names` method that
we overrode in #34 does not need to be implemented directly because
we can instead use that method from the transformer (via `_validate_data`)
without having to worry about mismatched feature names during fitting.
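A small demonstration of that insight using a stock sklearn transformer (StandardScaler stands in here for the project's transformers):

# Illustration with a stock transformer, not project code
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"elev": [100.0, 250.0, 300.0], "precip": [800.0, 1200.0, 950.0]})
scaler = StandardScaler().fit(X)

# The transformer already extracted the original names during its own fit...
print(scaler.feature_names_in_)  # ['elev' 'precip']

# ...and its own validation warns on mismatched names at transform time, so an
# estimator can delegate to the transformer instead of re-checking names itself.
scaler.transform(X)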