Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of Most Similar Neighbor (MSN) #27

Merged
merged 2 commits into from
May 30, 2023
Merged

Conversation

grovduck
Copy link
Member

This PR implements Most Similar Neighbor (MSN) (Moeur and Stage, 1998) which uses Canonical Correlation Analysis to associate X and y matrices. Linear coefficients for both the X and y matrices are calculated such that the calculated fitted scores of each matrix maximize the correlation between matrices.

This implementation generally follows the code presented for method msn in yaImpute. I was unable to directly port the R function cancor into Python. The R implementation uses both QR decomposition and SVD decomposition of matrices to arrive at the X and y coefficients and the QR decomposition can introduce reordering of columns that I was unable to reproduce in Python. I finally stumbled on statsmodels.CanCorr which uses SVD decomposition (exclusively) but throws an error if the matrices do not have full rank after the decomposition. I have borrowed heavily from this implementation and we need to make sure we are adhering to their license. Unfortunately, I wasn't able to figure out how to use sklearn's own CCA to produce the correct output. This may be a place for further follow-up.

Likewise, there is a test (ftest_cor) ported directly from the yai function in yaImpute which (presumably) finds the significant CCorA axes to use based on p-values from an F-test (I think I have this right?). This requires us to include scipy as a requirement, which is already a requirement of scikit-learn.

  • Moeur, M. and Stage, A.R., 1995. Most similar neighbor: an improved sampling inference procedure for natural resource planning. Forest science, 41(2), pp.337-359.

- canonical correlation analysis implementation follows statsmodels.CanCorr
- ftest_cor ported directly from yai function in yaImpute
@grovduck grovduck requested a review from aazuspan May 25, 2023 21:56
@grovduck grovduck added enhancement New feature or request estimator Related to one or more estimators labels May 25, 2023
Copy link
Contributor

@aazuspan aazuspan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this review is a bit of a bummer! All the code looks great, but I think we have some work to do to be compliant with the licenses of statsmodels and (especially) yaImpute.

I'm not sure we need to get everything sorted out right now, and maybe we leave this for a final step before the first release, since some of this code will potentially change before then.

Let me know what you think. If you're okay with it, I'd be fine to hold off on getting into the nitty gritty of licensing and just make sure we're noting in every docstring where code is ported or adapted from, so that we can come back to it later.

src/sknnr/transformers/_ccora.py Outdated Show resolved Hide resolved
src/sknnr/transformers/_ccora.py Show resolved Hide resolved
@grovduck
Copy link
Member Author

Sorry, this review is a bit of a bummer! All the code looks great, but I think we have some work to do to be compliant with the licenses of statsmodels and (especially) yaImpute.

No bummer at all - this is really important to address and I agree that we need to figure this out. I just took a bit of a deep dive myself into yaImpute licensing history and it looks like they are GPL v3 because they relied on libraries that were GPL v3 (specifically ANN) - see here.

You might be able to answer this question, though. When we use another package and call its functionality directly, are we beholden to their licensing in the same way that would be required by porting? For example, scikit-learn uses BSD-3 licensing, but because we're using this code rather than porting this code, are the terms different?

I'm not sure we need to get everything sorted out right now, and maybe we leave this for a final step before the first release, since some of this code will potentially change before then. Let me know what you think. If you're okay with it, I'd be fine to hold off on getting into the nitty gritty of licensing and just make sure we're noting in every docstring where code is ported or adapted from, so that we can come back to it later.

I agree that we probably don't want to high-center on this right now, but we definitely need to create an issue (possibly associated with a milestone!) that tracks this. That's a really good idea to put markers in docstrings where we've ported code from others - that would be a first step before including the actual license information.

This is a new world to me and it all seems a bit squishy. But erring on the side of caution definitely seems appropriate.

@aazuspan
Copy link
Contributor

I just took a bit of a deep dive myself into yaImpute licensing history and it looks like they are GPL v3 because they relied on libraries that were GPL v3 (specifically ANN) - see here.

Good point! It looks like this will depend on which files we ported from then, since they seem to give a very permissive license for anything outside of the ANN code:

=====All other files including src/rfoneprox.cpp ==
All files in this package, other than those refered to above, were prepared by 
Nicholas L. Crookston, a U.S. Government employee working on official time 
and are in the public domain. 

So it may be worth noting in docstrings which files were ported, since that will determine the appropriate license.

When we use another package and call its functionality directly, are we beholden to their licensing in the same way that would be required by porting? For example, scikit-learn uses BSD-3 licensing, but because we're using this code rather than porting this code, are the terms different?

No, at least in the case of BSD-3 and MIT there are no restrictions on usage of code, only on redistribution or modification. I think GPL might be a little trickier in this area, and it seems like it's open to interpretation around whether GPL allows "dynamic linking" and whether importing in Python even counts as "dynamic linking".

I agree that we probably don't want to high-center on this right now, but we definitely need to create an issue (possibly associated with a milestone!) that tracks this. That's a really good idea to put markers in docstrings where we've ported code from others - that would be a first step before including the actual license information.

Sounds good! For this PR, can you add those markers for the code included here? After that, I'm good to merge!

This is a new world to me and it all seems a bit squishy. But erring on the side of caution definitely seems appropriate.

Yes, squishy is definitely my impression too! I think the truth is that very few projects follow all the rules correctly, and there's virtually zero chance that anyone notices or cares. We're probably going above and beyond with this level of effort.

@grovduck
Copy link
Member Author

Sounds good! For this PR, can you add those markers for the code included here? After that, I'm good to merge!

@aazuspan, can you check on the markers I left? Two issues:

  1. I used REFERENCE as a marker to be able to refer to these at a later point
  2. I used permalinks for derived lines of codes which exceed line length so I left #noqa: E501s at the ends of these lines

Neither of these decisions am I wedded to, so please suggest alternatives if you'd like.

Copy link
Contributor

@aazuspan aazuspan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used REFERENCE as a marker to be able to refer to these at a later point

We could use a notes and references section following numpydoc, but if these are just placeholders then I'm fine with it as is.

I used permalinks for derived lines of codes which exceed line length so I left #noqa: E501s at the ends of these lines

I never like having to do this, but I'm not sure there's any better alternative. Sphinx does have a way to cross-reference documentation from other projects and I imagine there's something similar for mkdocs, but I think a permalink is probably a better option in case their implementation or license changes.

I'm good to merge if you are!

src/sknnr/transformers/_ccora.py Show resolved Hide resolved
@aazuspan aazuspan mentioned this pull request May 30, 2023
@grovduck
Copy link
Member Author

... if these are just placeholders then I'm fine with it as is.

Yes, I think these are just placeholders and will eventually be removed. I'm going to merge this one to keep things going, but I've left a note in #29 to remind us to deal with these.

@grovduck grovduck merged commit 8289c2a into fb_add_estimators May 30, 2023
@grovduck grovduck deleted the msn branch May 30, 2023 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request estimator Related to one or more estimators
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants