Unwanted double transformation using TransformedKNeighborsMixin.predict #23

Closed · grovduck opened this issue May 19, 2023 · 8 comments · Fixed by #25

@grovduck (Member) commented May 19, 2023:

In #22, we moved the fit, predict, and kneighbors methods out of the individual estimators and into the TransformedKNeighborsMixin class. I believe predict now unintentionally applies each estimator's transform twice. The current code for TransformedKNeighborsMixin.predict is:

```python
def predict(self, X):
    """Predict using transformed feature data."""
    X_transformed = self._apply_transform(X)
    return super().predict(X_transformed)
```

This applies the transform to the X array and then calls KNeighborsRegressor.predict. However, KNeighborsRegressor.predict in turn calls self.kneighbors, which resolves back to TransformedKNeighborsMixin.kneighbors and applies the transform a second time, now to the already-transformed data.
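To make the call chain concrete, here is a minimal, runnable sketch of the dispatch. The class bodies are stand-ins for the real scikit-learn/sknnr implementations, and the inheritance is flattened for brevity:

```python
class KNeighborsRegressor:
    """Stand-in for sklearn.neighbors.KNeighborsRegressor."""

    def predict(self, X):
        # sklearn's predict calls self.kneighbors internally; Python
        # resolves that call to the mixin's override below
        return self.kneighbors(X)

    def kneighbors(self, X):
        return X  # stand-in for the real neighbor search


class TransformedKNeighborsMixin(KNeighborsRegressor):
    def _apply_transform(self, X):
        print("applying transform")
        return X

    def predict(self, X):
        X_transformed = self._apply_transform(X)  # transform #1
        return super().predict(X_transformed)

    def kneighbors(self, X):
        X_transformed = self._apply_transform(X)  # transform #2
        return super().kneighbors(X_transformed)


TransformedKNeighborsMixin().predict([[1.0]])  # prints "applying transform" twice
```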

This probably went unnoticed because we aren't explicitly testing predict in test_port.py at present, only kneighbors. Presumably the check distances would not have matched after the double transformation.

This did surface in the implementation of the MSNRegressor (not yet pushed), where the canonical correlation analysis transform reduces the number of features to the significant CCA axes. It failed both the test_estimators_support_continuous_multioutput and test_estimators_support_dataframes tests because, on the first transform pass, we subset the X array down to fewer features, but then try to pass this reduced-dimension array back to StandardScalerWithDOF, which was originally fit with the full number of features.
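For illustration (this is not the actual sknnr code), the same class of error can be reproduced with a plain StandardScaler, which presumably validates the feature count the same way StandardScalerWithDOF does:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(np.ones((10, 5)))  # fit on 5 features
X_reduced = np.ones((10, 3))                     # CCA keeps only 3 axes

# Raises ValueError: X has 3 features, but StandardScaler is expecting
# 5 features as input.
scaler.transform(X_reduced)
```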

If this diagnosis is correct, I believe the solution is simply to remove the TransformedKNeighborsMixin.predict method. Any call to an estimator's predict then routes to KNeighborsRegressor.predict, which in turn calls TransformedKNeighborsMixin.kneighbors and applies the transform exactly once before neighbor finding. I have tested this approach locally (including on the new MSNRegressor estimator) and it passes all current tests. A sketch of the fixed dispatch follows.
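Continuing the sketch above, a self-contained version of the proposed fix (FixedMixin is a hypothetical name): with no predict override, the transform runs exactly once:

```python
class KNeighborsRegressor:  # same stand-in as in the sketch above
    def predict(self, X):
        return self.kneighbors(X)

    def kneighbors(self, X):
        return X


class FixedMixin(KNeighborsRegressor):
    def _apply_transform(self, X):
        print("applying transform")
        return X

    def kneighbors(self, X):
        # the only place the transform is applied
        return super().kneighbors(self._apply_transform(X))


FixedMixin().predict([[1.0]])  # prints "applying transform" once
```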

@grovduck added the bug label on May 19, 2023
@grovduck (Member, Author) commented:

I've got a (pretty shabby) port of the MSNRegressor ready to go, but would like to wait until this one is resolved before I push that.

@aazuspan (Contributor) commented:

Great catch! I think your analysis and solution are spot on, so I'd say go ahead with the fix.

> This probably went unnoticed because we aren't explicitly testing predict in test_port.py at present, only kneighbors. Presumably the check distances would not have matched after the double transformation.

Yes, that makes sense to me. How hard do you think it would be to add that test? It would be nice to have that in place to prevent this issue (or similar) in the future, but if it would hold this fix up too much or require additional changes it could probably wait for a separate PR.

@grovduck (Member, Author) commented:

> How hard do you think it would be to add that test? It would be nice to have that in place to prevent this issue (or similar) in the future, but if it would hold this fix up too much or require additional changes it could probably wait for a separate PR.

@aazuspan, thanks for the feedback. As much as I'd like to get this test in now, I think I might push this change without it for a couple of different reasons:

  1. By calling predict, we're currently returning the k=5 means of the IDs and, because we don't store the distances anywhere, it seems difficult to do the distance check without explicitly calling kneighbors. I'm having a hard time figuring out how we'd verify that we're only doing a single transformation just by calling predict, but I'm feeling a bit slow today!
  2. Ultimately, the y we're passing will be the stand attributes and predict should return the (weighted) means of those (related to #2, Decide how to handle fitting, prediction, and ID data). So I think it's probably not worth putting a lot of effort into testing predict right now until we address that.

Please let me know if you feel differently. I've got a branch ready to push that will simply delete the predict method from TransformedKNeighborsMixin.

@aazuspan (Contributor) commented:

> Ultimately, the y we're passing will be the stand attributes and predict should return the (weighted) means of those (related to #2). So I think it's probably not worth putting a lot of effort into testing predict right now until we address that.

Correct me if I'm wrong here, but once the double transformation is removed, our estimators will be capable of correctly predicting weighted means of stand attributes (or any multi-output target), won't they? In other words, ignoring the double transformation, isn't the code below a working example of generating predictions with GNN?

```python
from sknnr import GNNRegressor
from sklearn.datasets import load_linnerud

# X and y are both (20, 3): three exercise features and three
# physiological targets, so predict should return a (20, 3) array
X, y = load_linnerud(return_X_y=True)
GNNRegressor().fit(X, y).predict(X)
```

I know we're still missing some features like being able to configure dimensionality reduction and fit with different features than we predict with, but is there still core functionality missing in the prediction stage?

> I'm having a hard time figuring out how we'd verify that we're only doing a single transformation just by calling predict

Assuming my understanding above is correct, my thought was that we could test the accuracy of stand attribute predictions for each of our estimators against predictions made by yaImpute, with the assumption that a double transformation would lead to incorrect predicted attributes.
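For what it's worth, a rough sketch of what such a port test might look like (the data file, fixture names, and tolerance here are all hypothetical):

```python
import numpy as np

from sknnr import GNNRegressor


def test_gnn_predict_matches_yaimpute(moscow_X, moscow_y):
    """Hypothetical: compare predictions against values exported from yaImpute."""
    expected = np.loadtxt("tests/data/gnn_predicted.csv", delimiter=",")
    predicted = GNNRegressor().fit(moscow_X, moscow_y).predict(moscow_X)
    np.testing.assert_allclose(predicted, expected, rtol=1e-5)
```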

> I've got a branch ready to push that will simply delete the predict method from TransformedKNeighborsMixin.

I'm 100% willing to push ahead with this fix and handle the testing later, especially if there are some bigger steps needed to get prediction running correctly. Just want to make sure I'm on the right page with the points above!

@grovduck (Member, Author) commented:

You're tracking this exactly right, and the example you gave should work correctly, as you say. I actually started working on this last week, in side-by-side R and Python sessions, and the numbers weren't lining up. I hadn't yet circled back to why that was, but it was one of the next things on the docket. As you suggest, I was planning on exporting similar prediction data and testing predict against it, which would catch this.

I'm probably being way too linear, but here was the roadmap as I saw it:

  1. Get this one patched (Matt)
  2. Push the MSNRegressor. It has working tests assuming that this one is patched (Matt)
  3. Possibly circle back to both #2 (Decide how to handle fitting, prediction, and ID data) and #3 (Add load functions for example datasets) so that we commit to passing in the full y matrices, which will change the test_port.py tests (Mostly Aaron)
  4. Write tests to ensure that predict passes (Both of us)

I'm going to go ahead and create the PR for this one and hopefully follow up shortly with MSN. Sound OK?

@aazuspan (Contributor) commented:

That plan sounds good to me, thanks for laying it out!

I don't have much experience with using milestones or projects in GitHub, but maybe those are worth looking into to try and solidify the big-picture roadmap?

@grovduck (Member, Author) commented:

> I don't have much experience with using milestones or projects in GitHub, but maybe those are worth looking into to try and solidify the big-picture roadmap?

Nor I! If you think that they'd be helpful, I'm definitely willing to put some effort into learning how they work. I guess we just informally laid down a milestone to get through these next steps. I do sometimes struggle with remembering what issue or PR our discussions are in, so if this helps organize that, it would be worth it to me.

One thing that I'm realizing is that it's much harder to be lazy when working with someone on a project. Up front, it's definitely more work, but so worth it!

@grovduck (Member, Author) commented:

Closed via #25
