Unwanted double transformation using TransformedKNeighborsMixin.predict #23

Closed · grovduck opened this issue May 19, 2023 · 8 comments · Fixed by #25

@grovduck (Member) commented May 19, 2023:

In #22, we moved the fit, predict, and kneighbors methods out of the individual estimators and into the TransformedKNeighborsMixin class. I believe predict now unintentionally applies each estimator's transform twice. The current code for TransformedKNeighborsMixin.predict is:

```python
def predict(self, X):
    """Predict using transformed feature data."""
    X_transformed = self._apply_transform(X)
    return super().predict(X_transformed)
```

This applies the transform to the X array and then calls KNeighborsRegressor.predict. However, KNeighborsRegressor.predict in turn calls self.kneighbors, which resolves back to TransformedKNeighborsMixin.kneighbors and applies the transform a second time, now to the already-transformed data.
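To make the call chain concrete, here is a minimal, runnable sketch of the dispatch. The class bodies are stand-ins for the real scikit-learn/sknnr implementations, and the inheritance is flattened for brevity:

```python
class KNeighborsRegressor:
    """Stand-in for sklearn.neighbors.KNeighborsRegressor."""

    def predict(self, X):
        # sklearn's predict calls self.kneighbors internally; Python
        # resolves that call to the mixin's override below
        return self.kneighbors(X)

    def kneighbors(self, X):
        return X  # stand-in for the real neighbor search


class TransformedKNeighborsMixin(KNeighborsRegressor):
    def _apply_transform(self, X):
        print("applying transform")
        return X

    def predict(self, X):
        X_transformed = self._apply_transform(X)  # transform #1
        return super().predict(X_transformed)

    def kneighbors(self, X):
        X_transformed = self._apply_transform(X)  # transform #2
        return super().kneighbors(X_transformed)


TransformedKNeighborsMixin().predict([[1.0]])  # prints "applying transform" twice
```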

This probably went unnoticed because we aren't explicitly testing predict in test_port.py at present, only kneighbors. Presumably the check distances would not have matched after the double transformation.

This did surface in the implementation of the MSNRegressor (not yet pushed), where the canonical correlation analysis transform reduces the number of features to the significant CCA axes. It failed both the test_estimators_support_continuous_multioutput and test_estimators_support_dataframes tests because, on the first transform pass, we subset the X array down to fewer features, but then try to pass this reduced-dimension array back to StandardScalerWithDOF, which was originally fit with the full number of features.
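For illustration (this is not the actual sknnr code), the same class of error can be reproduced with a plain StandardScaler, which presumably validates the feature count the same way StandardScalerWithDOF does:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(np.ones((10, 5)))  # fit on 5 features
X_reduced = np.ones((10, 3))                     # CCA keeps only 3 axes

# Raises ValueError: X has 3 features, but StandardScaler is expecting
# 5 features as input.
scaler.transform(X_reduced)
```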

If this diagnosis is correct, I believe the solution is simply to remove the TransformedKNeighborsMixin.predict method. Any call to an estimator's predict then routes to KNeighborsRegressor.predict, which in turn calls TransformedKNeighborsMixin.kneighbors and applies the transform exactly once before neighbor finding. I have tested this approach locally (including on the new MSNRegressor estimator) and it passes all current tests. A sketch of the fixed dispatch follows.
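Continuing the sketch above, a self-contained version of the proposed fix (FixedMixin is a hypothetical name): with no predict override, the transform runs exactly once:

```python
class KNeighborsRegressor:  # same stand-in as in the sketch above
    def predict(self, X):
        return self.kneighbors(X)

    def kneighbors(self, X):
        return X


class FixedMixin(KNeighborsRegressor):
    def _apply_transform(self, X):
        print("applying transform")
        return X

    def kneighbors(self, X):
        # the only place the transform is applied
        return super().kneighbors(self._apply_transform(X))


FixedMixin().predict([[1.0]])  # prints "applying transform" once
```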

@grovduck added the bug label on May 19, 2023
@grovduck (Member, Author) commented:

I've got a (pretty shabby) port of the MSNRegressor ready to go, but would like to wait until this one is resolved before I push that.

@aazuspan (Contributor) commented:

Great catch! I think your analysis and solution are spot on, so I'd say go ahead with the fix.

> This probably went unnoticed because we aren't explicitly testing predict in test_port.py at present, only kneighbors. Presumably the check distances would not have matched after the double transformation.

Yes, that makes sense to me. How hard do you think it would be to add that test? It would be nice to have that in place to prevent this issue (or similar) in the future, but if it would hold this fix up too much or require additional changes it could probably wait for a separate PR.

@grovduck (Member, Author) commented:

> How hard do you think it would be to add that test? It would be nice to have that in place to prevent this issue (or similar) in the future, but if it would hold this fix up too much or require additional changes it could probably wait for a separate PR.

@aazuspan, thanks for the feedback. As much as I'd like to get this test in now, I think I might push this change without it for a couple of different reasons:

  1. By calling predict, we're currently returning the k=5 means of the IDs and, because we don't store the distances anywhere, it seems difficult to do the distance check without explicitly calling kneighbors. I'm having a hard time figuring out how we'd verify that we're only doing a single transformation just by calling predict, but I'm feeling a bit slow today!
  2. Ultimately, the y we're passing will be the stand attributes and predict should return the (weighted) means of those (related to #2, Decide how to handle fitting, prediction, and ID data). So I think it's probably not worth putting a lot of effort into testing predict right now until we address that.

Please let me know if you feel differently. I've got a branch ready to push that will simply delete the predict method from TransformedKNeighborsMixin.

@aazuspan (Contributor) commented:

> Ultimately, the y we're passing will be the stand attributes and predict should return the (weighted) means of those (related to #2). So I think it's probably not worth putting a lot of effort into testing predict right now until we address that.

Correct me if I'm wrong here, but once the double transformation is removed, our estimators will be capable of correctly predicting weighted means of stand attributes (or any multi-output target), won't they? In other words, ignoring the double transformation, isn't the code below a working example of generating predictions with GNN?

```python
from sknnr import GNNRegressor
from sklearn.datasets import load_linnerud

# X and y are both (20, 3): three exercise features and three
# physiological targets, so predict should return a (20, 3) array
X, y = load_linnerud(return_X_y=True)
GNNRegressor().fit(X, y).predict(X)
```

I know we're still missing some features like being able to configure dimensionality reduction and fit with different features than we predict with, but is there still core functionality missing in the prediction stage?

> I'm having a hard time figuring out how we'd verify that we're only doing a single transformation just by calling predict

Assuming my understanding above is correct, my thought was that we could test the accuracy of stand attribute predictions for each of our estimators against predictions made by yaImpute, with the assumption that a double transformation would lead to incorrect predicted attributes.
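For what it's worth, a rough sketch of what such a port test might look like (the data file, fixture names, and tolerance here are all hypothetical):

```python
import numpy as np

from sknnr import GNNRegressor


def test_gnn_predict_matches_yaimpute(moscow_X, moscow_y):
    """Hypothetical: compare predictions against values exported from yaImpute."""
    expected = np.loadtxt("tests/data/gnn_predicted.csv", delimiter=",")
    predicted = GNNRegressor().fit(moscow_X, moscow_y).predict(moscow_X)
    np.testing.assert_allclose(predicted, expected, rtol=1e-5)
```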

> I've got a branch ready to push that will simply delete the predict method from TransformedKNeighborsMixin.

I'm 100% willing to push ahead with this fix and handle the testing later, especially if there are some bigger steps needed to get prediction running correctly. Just want to make sure I'm on the right page with the points above!

@grovduck (Member, Author) commented:

You're tracking this exactly right, and the example you gave should work correctly, as you say. I actually started working on this last week, in side-by-side R and Python sessions, and the numbers weren't lining up. I hadn't yet circled back to why that was, but it was one of the next things on the docket. As you suggest, I was planning on exporting similar prediction data and testing predict against it, which would catch this.

I'm probably being way too linear, but here was the roadmap as I saw it:

  1. Get this one patched (Matt)
  2. Push the MSNRegressor. It has working tests assuming that this one is patched (Matt)
  3. Possibly circle back to both #2 (Decide how to handle fitting, prediction, and ID data) and #3 (Add load functions for example datasets) so that we commit to passing in the full y matrices, which will change the test_port.py tests (Mostly Aaron)
  4. Write tests to ensure that predict passes (Both of us)

I'm going to go ahead and create the PR for this one and hopefully follow up shortly with MSN. Sound OK?

@aazuspan (Contributor) commented:

That plan sounds good to me, thanks for laying it out!

I don't have much experience with using milestones or projects in GitHub, but maybe those are worth looking into to try and solidify the big-picture roadmap?

@grovduck (Member, Author) commented:

> I don't have much experience with using milestones or projects in GitHub, but maybe those are worth looking into to try and solidify the big-picture roadmap?

Nor I! If you think that they'd be helpful, I'm definitely willing to put some effort into learning how they work. I guess we just informally laid down a milestone to get through these next steps. I do sometimes struggle with remembering what issue or PR our discussions are in, so if this helps organize that, it would be worth it to me.

One thing that I'm realizing is that it's much harder to be lazy when working with someone on a project. Up front, it's definitely more work, but so worth it!

@grovduck (Member, Author) commented:

Closed via #25
