Implementation of Most Similar Neighbor (MSN) #27

grovduck · 2023-05-25T21:56:12Z

This PR implements Most Similar Neighbor (MSN) (Moeur and Stage, 1998) which uses Canonical Correlation Analysis to associate X and y matrices. Linear coefficients for both the X and y matrices are calculated such that the calculated fitted scores of each matrix maximize the correlation between matrices.

This implementation generally follows the code presented for method msn in yaImpute. I was unable to directly port the R function cancor into Python. The R implementation uses both QR decomposition and SVD decomposition of matrices to arrive at the X and y coefficients and the QR decomposition can introduce reordering of columns that I was unable to reproduce in Python. I finally stumbled on statsmodels.CanCorr which uses SVD decomposition (exclusively) but throws an error if the matrices do not have full rank after the decomposition. I have borrowed heavily from this implementation and we need to make sure we are adhering to their license. Unfortunately, I wasn't able to figure out how to use sklearn's own CCA to produce the correct output. This may be a place for further follow-up.

Likewise, there is a test (ftest_cor) ported directly from the yai function in yaImpute which (presumably) finds the significant CCorA axes to use based on p-values from an F-test (I think I have this right?). This requires us to include scipy as a requirement, which is already a requirement of scikit-learn.

Moeur, M. and Stage, A.R., 1995. Most similar neighbor: an improved sampling inference procedure for natural resource planning. Forest science, 41(2), pp.337-359.

- canonical correlation analysis implementation follows statsmodels.CanCorr - ftest_cor ported directly from yai function in yaImpute

aazuspan

Sorry, this review is a bit of a bummer! All the code looks great, but I think we have some work to do to be compliant with the licenses of statsmodels and (especially) yaImpute.

I'm not sure we need to get everything sorted out right now, and maybe we leave this for a final step before the first release, since some of this code will potentially change before then.

Let me know what you think. If you're okay with it, I'd be fine to hold off on getting into the nitty gritty of licensing and just make sure we're noting in every docstring where code is ported or adapted from, so that we can come back to it later.

src/sknnr/transformers/_ccora.py

grovduck · 2023-05-26T18:01:20Z

Sorry, this review is a bit of a bummer! All the code looks great, but I think we have some work to do to be compliant with the licenses of statsmodels and (especially) yaImpute.

No bummer at all - this is really important to address and I agree that we need to figure this out. I just took a bit of a deep dive myself into yaImpute licensing history and it looks like they are GPL v3 because they relied on libraries that were GPL v3 (specifically ANN) - see here.

You might be able to answer this question, though. When we use another package and call its functionality directly, are we beholden to their licensing in the same way that would be required by porting? For example, scikit-learn uses BSD-3 licensing, but because we're using this code rather than porting this code, are the terms different?

I'm not sure we need to get everything sorted out right now, and maybe we leave this for a final step before the first release, since some of this code will potentially change before then. Let me know what you think. If you're okay with it, I'd be fine to hold off on getting into the nitty gritty of licensing and just make sure we're noting in every docstring where code is ported or adapted from, so that we can come back to it later.

I agree that we probably don't want to high-center on this right now, but we definitely need to create an issue (possibly associated with a milestone!) that tracks this. That's a really good idea to put markers in docstrings where we've ported code from others - that would be a first step before including the actual license information.

This is a new world to me and it all seems a bit squishy. But erring on the side of caution definitely seems appropriate.

aazuspan · 2023-05-26T18:48:04Z

I just took a bit of a deep dive myself into yaImpute licensing history and it looks like they are GPL v3 because they relied on libraries that were GPL v3 (specifically ANN) - see here.

Good point! It looks like this will depend on which files we ported from then, since they seem to give a very permissive license for anything outside of the ANN code:

=====All other files including src/rfoneprox.cpp ==
All files in this package, other than those refered to above, were prepared by 
Nicholas L. Crookston, a U.S. Government employee working on official time 
and are in the public domain.

So it may be worth noting in docstrings which files were ported, since that will determine the appropriate license.

When we use another package and call its functionality directly, are we beholden to their licensing in the same way that would be required by porting? For example, scikit-learn uses BSD-3 licensing, but because we're using this code rather than porting this code, are the terms different?

No, at least in the case of BSD-3 and MIT there are no restrictions on usage of code, only on redistribution or modification. I think GPL might be a little trickier in this area, and it seems like it's open to interpretation around whether GPL allows "dynamic linking" and whether importing in Python even counts as "dynamic linking".

I agree that we probably don't want to high-center on this right now, but we definitely need to create an issue (possibly associated with a milestone!) that tracks this. That's a really good idea to put markers in docstrings where we've ported code from others - that would be a first step before including the actual license information.

Sounds good! For this PR, can you add those markers for the code included here? After that, I'm good to merge!

This is a new world to me and it all seems a bit squishy. But erring on the side of caution definitely seems appropriate.

Yes, squishy is definitely my impression too! I think the truth is that very few projects follow all the rules correctly, and there's virtually zero chance that anyone notices or cares. We're probably going above and beyond with this level of effort.

grovduck · 2023-05-26T22:05:40Z

Sounds good! For this PR, can you add those markers for the code included here? After that, I'm good to merge!

@aazuspan, can you check on the markers I left? Two issues:

I used REFERENCE as a marker to be able to refer to these at a later point
I used permalinks for derived lines of codes which exceed line length so I left #noqa: E501s at the ends of these lines

Neither of these decisions am I wedded to, so please suggest alternatives if you'd like.

aazuspan

I used REFERENCE as a marker to be able to refer to these at a later point

We could use a notes and references section following numpydoc, but if these are just placeholders then I'm fine with it as is.

I used permalinks for derived lines of codes which exceed line length so I left #noqa: E501s at the ends of these lines

I never like having to do this, but I'm not sure there's any better alternative. Sphinx does have a way to cross-reference documentation from other projects and I imagine there's something similar for mkdocs, but I think a permalink is probably a better option in case their implementation or license changes.

I'm good to merge if you are!

src/sknnr/transformers/_ccora.py

grovduck · 2023-05-30T21:31:47Z

... if these are just placeholders then I'm fine with it as is.

Yes, I think these are just placeholders and will eventually be removed. I'm going to merge this one to keep things going, but I've left a note in #29 to remind us to deal with these.

Implementation of MSN ported from yaImpute

28023d4

- canonical correlation analysis implementation follows statsmodels.CanCorr - ftest_cor ported directly from yai function in yaImpute

grovduck requested a review from aazuspan May 25, 2023 21:56

grovduck added enhancement New feature or request estimator Related to one or more estimators labels May 25, 2023

aazuspan reviewed May 26, 2023

View reviewed changes

src/sknnr/transformers/_ccora.py Outdated Show resolved Hide resolved

src/sknnr/transformers/_ccora.py Show resolved Hide resolved

Add reference markers for derived and licensed external code

f9fc703

aazuspan approved these changes May 30, 2023

View reviewed changes

src/sknnr/transformers/_ccora.py Show resolved Hide resolved

aazuspan mentioned this pull request May 30, 2023

Licensing #29

Open

grovduck merged commit 8289c2a into fb_add_estimators May 30, 2023

grovduck deleted the msn branch May 30, 2023 21:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implementation of Most Similar Neighbor (MSN) #27

Implementation of Most Similar Neighbor (MSN) #27

grovduck commented May 25, 2023

aazuspan left a comment

grovduck commented May 26, 2023

aazuspan commented May 26, 2023

grovduck commented May 26, 2023

aazuspan left a comment

grovduck commented May 30, 2023

Implementation of Most Similar Neighbor (MSN) #27

Implementation of Most Similar Neighbor (MSN) #27

Conversation

grovduck commented May 25, 2023

aazuspan left a comment

Choose a reason for hiding this comment

grovduck commented May 26, 2023

aazuspan commented May 26, 2023

grovduck commented May 26, 2023

aazuspan left a comment

Choose a reason for hiding this comment

grovduck commented May 30, 2023