Implementation of Most Similar Neighbor (MSN) #27
- canonical correlation analysis implementation follows statsmodels.CanCorr
- ftest_cor ported directly from yai function in yaImpute
Sorry, this review is a bit of a bummer! All the code looks great, but I think we have some work to do to be compliant with the licenses of statsmodels and (especially) yaImpute.
I'm not sure we need to get everything sorted out right now, and maybe we leave this for a final step before the first release, since some of this code will potentially change before then.
Let me know what you think. If you're okay with it, I'd be fine to hold off on getting into the nitty gritty of licensing and just make sure we're noting in every docstring where code is ported or adapted from, so that we can come back to it later.
No bummer at all - this is really important to address and I agree that we need to figure this out. I just took a bit of a deep dive myself into You might be able to answer this question, though. When we use another package and call its functionality directly, are we beholden to their licensing in the same way that would be required by porting? For example,
I agree that we probably don't want to high-center on this right now, but we definitely need to create an issue (possibly associated with a milestone!) that tracks this. That's a really good idea to put markers in docstrings where we've ported code from others - that would be a first step before including the actual license information. This is a new world to me and it all seems a bit squishy. But erring on the side of caution definitely seems appropriate.
Good point! It looks like this will depend on which files we ported from then, since they seem to give a very permissive license for anything outside of the ANN code:
So it may be worth noting in docstrings which files were ported, since that will determine the appropriate license.
No, at least in the case of BSD-3 and MIT there are no restrictions on usage of code, only on redistribution or modification. I think GPL might be a little trickier in this area, and it seems like it's open to interpretation around whether GPL allows "dynamic linking" and whether importing in Python even counts as "dynamic linking".
Sounds good! For this PR, can you add those markers for the code included here? After that, I'm good to merge!
Yes, squishy is definitely my impression too! I think the truth is that very few projects follow all the rules correctly, and there's virtually zero chance that anyone notices or cares. We're probably going above and beyond with this level of effort.
@aazuspan, can you check on the markers I left? Two issues:
I'm not wedded to either of these decisions, so please suggest alternatives if you'd like.
I used REFERENCE as a marker to be able to refer to these at a later point
We could use a notes and references section following numpydoc, but if these are just placeholders then I'm fine with it as is.
I used permalinks for derived lines of code, which exceed the line length, so I left `# noqa: E501` at the ends of these lines
I never like having to do this, but I'm not sure there's any better alternative. Sphinx does have a way to cross-reference documentation from other projects and I imagine there's something similar for mkdocs, but I think a permalink is probably a better option in case their implementation or license changes.
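For reference, the two conventions discussed above might look something like the sketch below. Everything in it is hypothetical: the function `center_values`, the wording after `REFERENCE`, and the URL are placeholders for illustration only, not the markers actually used in this PR.

```python
def center_values(values):
    """Subtract the mean from each value.

    REFERENCE: adapted from <upstream package>, file <path>; license to be confirmed.
    """
    # Placeholder permalink, not a real URL. A commit-pinned link stays valid even
    # if upstream later changes the code or its license; E501 is silenced because
    # the link cannot be wrapped.
    # https://example.com/<owner>/<repo>/blob/<commit-sha>/<path>#L10-L20  # noqa: E501
    mean = sum(values) / len(values)
    return [v - mean for v in values]
```

Searching the codebase for `REFERENCE` would then list every spot that still needs proper license information.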
I'm good to merge if you are!
Yes, I think these are just placeholders and will eventually be removed. I'm going to merge this one to keep things going, but I've left a note in #29 to remind us to deal with these.
This PR implements Most Similar Neighbor (MSN) (Moeur and Stage, 1998), which uses Canonical Correlation Analysis to associate `X` and `y` matrices. Linear coefficients for both the `X` and `y` matrices are calculated such that the fitted scores of each matrix maximize the correlation between matrices.

This implementation generally follows the code presented for method `msn` in `yaImpute`. I was unable to directly port the R function `cancor` into Python. The R implementation uses both QR decomposition and SVD of matrices to arrive at the `X` and `y` coefficients, and the QR decomposition can introduce reordering of columns that I was unable to reproduce in Python. I finally stumbled on `statsmodels.CanCorr`, which uses SVD (exclusively) but throws an error if the matrices do not have full rank after the decomposition. I have borrowed heavily from this implementation and we need to make sure we are adhering to their license. Unfortunately, I wasn't able to figure out how to use sklearn's own `CCA` to produce the correct output. This may be a place for further follow-up.

Likewise, there is a test (`ftest_cor`) ported directly from the `yai` function in `yaImpute` which (presumably) finds the significant CCorA axes to use based on p-values from an F-test (I think I have this right?). This requires us to include `scipy` as a requirement, which is already a requirement of `scikit-learn`.
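To make the QR-then-SVD idea concrete, here is a minimal NumPy sketch of canonical correlation analysis. It is a textbook-style illustration, not the code in this PR, not the `statsmodels.CanCorr` implementation, and not a port of R's `cancor`. The function name `cca_sketch` is made up, and the sketch assumes both matrices have full column rank and uses an unpivoted QR, so it sidesteps the column-reordering issue mentioned above rather than solving it.

```python
import numpy as np


def cca_sketch(X, Y):
    """Canonical correlations and coefficients via thin QR followed by SVD.

    Assumes X (n x p) and Y (n x q) have full column rank after centering.
    """
    # Center the columns so that inner products behave like covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Thin QR: each Q has orthonormal columns spanning the centered data.
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)

    # The singular values of Qx.T @ Qy are the canonical correlations,
    # returned in descending order.
    U, cancorr, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)

    # Map the singular vectors back to coefficients on the centered variables.
    x_coef = np.linalg.solve(Rx, U)
    y_coef = np.linalg.solve(Ry, Vt.T)
    return cancorr, x_coef, y_coef
```

Multiplying the centered `X` and `Y` by `x_coef` and `y_coef` gives paired score columns whose Pearson correlations equal the returned canonical correlations, which is a handy sanity check when comparing against another implementation.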