Unwanted double transformation using TransformedKNeighborsMixin.predict #23
I've got a (pretty shabby) port of the …
Great catch! I think your analysis and solution are spot on, so I'd say go ahead with the fix.
Yes, that makes sense to me. How hard do you think it would be to add that test? It would be nice to have that in place to prevent this issue (or similar) in the future, but if it would hold this fix up too much or require additional changes it could probably wait for a separate PR.
@aazuspan, thanks for the feedback. As much as I'd like to get this test in now, I think I might push this change without it for a couple of different reasons:
Please let me know if you feel differently. I've got a branch ready to push that will simply delete the `TransformedKNeighborsMixin.predict` method.
Correct me if I'm wrong here, but once the double transformation is removed, our estimators will be capable of correctly predicting weighted means of stand attributes (or any multi-output target), won't they? In other words, ignoring the double transformation, isn't the code below a working example of generating predictions with GNN?

```python
from sknnr import GNNRegressor
from sklearn.datasets import load_linnerud

X, y = load_linnerud(return_X_y=True)
GNNRegressor().fit(X, y).predict(X)
```

I know we're still missing some features, like being able to configure dimensionality reduction and fit with different features than we predict with, but is there still core functionality missing in the prediction stage?
Assuming my understanding above is correct, my thought was that we could test the accuracy of stand attribute predictions for each of our estimators against predictions made by …
I'm 100% willing to push ahead with this fix and handle the testing later, especially if there are some bigger steps needed to get prediction running correctly. Just want to make sure I'm on the right page with the points above!
You're tracking this exactly right, and the example you gave should work correctly, as you say. I actually started working on this last week, in side-by-side R and Python sessions, and the numbers weren't lining up. I hadn't yet circled back to figure out why, but it was one of the next things on the docket. As you say, I was planning on exporting similar prediction data and testing it for `predict`, which would have caught this. I'm probably being way too linear, but here was the roadmap as I saw it:
I'm going to go ahead and create the PR for this one and hopefully follow up shortly with MSN. Sound OK?
That plan sounds good to me, thanks for laying it out! I don't have much experience with using milestones or projects in GitHub, but maybe those are worth looking into to help solidify the big-picture roadmap?
Nor I! If you think that they'd be helpful, I'm definitely willing to put some effort into learning how they work. I guess we just informally laid down a milestone to get through these next steps. I do sometimes struggle with remembering what issue or PR our discussions are in, so if this helps organize that, it would be worth it to me. One thing that I'm realizing is that it's much harder to be lazy when working with someone on a project. Up front, it's definitely more work, but so worth it!
Closed via #25
In #22, we brought the `fit`, `predict`, and `kneighbors` methods under the `TransformedKNeighborsMixin` class and took them out of the estimators. On `predict`, I believe we are unintentionally applying each estimator's transform twice. The current code for `TransformedKNeighborsMixin.predict` is this:
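A minimal sketch of the pattern being described, showing both overrides (the `transform_` attribute and the `ExampleTransformedRegressor` class are assumptions for illustration, not the verbatim sknnr source):

```python
from sklearn.neighbors import KNeighborsRegressor


class TransformedKNeighborsMixin:
    """Sketch of the pre-fix mixin; assumes a fitted transformer is stored
    on the estimator as `transform_` (name assumed for illustration)."""

    def predict(self, X):
        # First transform: scale/project X before predicting.
        X_transformed = self.transform_.transform(X)
        # KNeighborsRegressor.predict internally calls self.kneighbors, which
        # (below) transforms X a second time -- the double transformation.
        return super().predict(X_transformed)

    def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
        # Second transform: X arriving here from predict() has already been
        # transformed once.
        if X is not None:
            X = self.transform_.transform(X)
        return super().kneighbors(X, n_neighbors=n_neighbors,
                                  return_distance=return_distance)


class ExampleTransformedRegressor(TransformedKNeighborsMixin, KNeighborsRegressor):
    """Stand-in for an sknnr estimator such as GNNRegressor."""
```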
This will apply the transform to the `X` array and then call `KNeighborsRegressor.predict`. However, that method in turn calls `self.kneighbors`, which routes back to `TransformedKNeighborsMixin.kneighbors` and applies the transform again, this time to the already transformed data.
This probably went unnoticed because we aren't explicitly testing `predict` in `test_port.py` at present, only `kneighbors`. Presumably the check distances would not have matched after the double transformation.
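For reference, a hedged sketch of what such a `predict` test could look like (the `port_data` and `reference_predictions` fixtures are hypothetical placeholders, not actual `test_port.py` fixtures):

```python
import numpy as np

from sknnr import GNNRegressor


def test_gnn_predict_matches_reference(port_data, reference_predictions):
    """Compare predict() output against values exported from the reference
    implementation (both fixtures here are hypothetical placeholders)."""
    X, y = port_data
    estimator = GNNRegressor().fit(X, y)
    np.testing.assert_allclose(estimator.predict(X), reference_predictions, rtol=1e-6)
```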
This did surface in the implementation of the `MSNRegressor` (not yet pushed), where the canonical correlation analysis transform reduces the number of features based on significant CCA axes. This failed both the `test_estimators_support_continuous_multioutput` and `test_estimators_support_dataframes` tests, because on the first transform pass we subset the `X` array down to fewer features, but then try to pass this reduced-dimension array back to `StandardScalerWithDOF`, which was originally fit with the full number of features.
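The failure mode described above can be reproduced in miniature with plain scikit-learn pieces standing in for the sknnr transform (the names, shapes, and use of `StandardScaler` here are only for illustration):

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((20, 5))   # 5 original features
y = rng.random((20, 3))

scaler = StandardScaler().fit(X)                      # fit on 5 features
cca = CCA(n_components=2).fit(scaler.transform(X), y)

# First pass: 5 features -> 2 CCA axes.
X_reduced = cca.transform(scaler.transform(X))

# Second pass (the double transformation): the reduced array goes back
# through the scaler, which still expects 5 features -> raises ValueError.
scaler.transform(X_reduced)
```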
If this diagnosis seems correct, I believe the solution is simply to remove the `TransformedKNeighborsMixin.predict` method. Any call to an estimator's `predict` then gets routed to `KNeighborsRegressor.predict`, which in turn calls `TransformedKNeighborsMixin.kneighbors` and applies the appropriate transformation before neighbor finding. I have tested this approach locally (including with the new `MSNRegressor` estimator), and it passes all current tests.
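Under the same assumptions as the earlier sketch, the proposed fix amounts to dropping the `predict` override so the transform is applied exactly once, inside `kneighbors`:

```python
class TransformedKNeighborsMixin:
    """After the fix: no predict() override. KNeighborsRegressor.predict
    receives untransformed X and calls kneighbors(), which applies the
    (assumed) `transform_` exactly once."""

    def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
        if X is not None:
            X = self.transform_.transform(X)
        return super().kneighbors(X, n_neighbors=n_neighbors,
                                  return_distance=return_distance)
```

With that, `KNeighborsRegressor.predict` receives the untransformed `X`, and the only transform happens during neighbor finding.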