Feature request: phik correlation metrics #319

Open
gokceneraslan opened this issue Nov 19, 2019 · 5 comments

Comments

@gokceneraslan
Contributor

gokceneraslan commented Nov 19, 2019

It'd be great to add other correlations as distances, such as Spearman's rho or the new correlation coefficient called phi_k. I am aware that it's possible to use any distance metric, either by passing metric='precomputed' or by implementing a custom function, but it'd be more convenient to pass metric='spearmanrho' or 'phik' directly.
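For reference, a minimal sketch of the metric='precomputed' workaround mentioned above, in plain Python with no dependencies. The helper names (`average_ranks`, `spearman_distance`, `pairwise_spearman`) are hypothetical and not part of UMAP, scipy, or scikit-learn, and defining the distance as 1 - rho is an assumption modeled on how correlation distances are usually defined.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_distance(x, y):
    """Assumed definition: 1 - Spearman's rho (Pearson correlation of ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x = sum(rx) / n
    mean_y = sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        # Rank correlation is undefined for a constant vector; returning 1.0
        # (treated as uncorrelated) is an arbitrary choice made for this sketch.
        return 1.0
    return 1.0 - cov / (var_x * var_y) ** 0.5


def pairwise_spearman(rows):
    """Build the full symmetric distance matrix for a list of samples."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = spearman_distance(rows[i], rows[j])
    return dist
```

The resulting matrix, converted to a NumPy array, is what would be handed to UMAP with metric='precomputed'. It needs O(n^2) memory and time, which is the inconvenience a built-in metric would avoid.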

@sleighsoft
Collaborator

You could contribute to scipy upstream, and then everyone would be able to use these metrics, not only in UMAP. Under the hood, UMAP mostly uses pairwise_distances from sklearn.metrics, with some special cases for sparse matrices.

@gokceneraslan
Contributor Author

gokceneraslan commented Nov 19, 2019

@sleighsoft thanks for the reply. Actually, scipy already has a Spearman's rho implementation, but since it's not defined as a scipy distance, it's not accessible from scikit-learn's pairwise_distances.

Even if it were available as a distance, it would be useful only for small datasets, since UMAP uses scikit-learn's pairwise distances only if the dataset is small (n < 4096) and kNN approximation is not forced by the user (see this and that). In all other cases, it falls back to UMAP's own distances.py file, where distances are implemented with numba's JIT support.
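The size-based fallback described above can be sketched roughly as follows. This only illustrates the rule as stated in this comment; the function name, the flag name, and the returned labels are hypothetical and are not UMAP's actual API.

```python
# Sample-count threshold quoted in the comment above.
SMALL_DATA_THRESHOLD = 4096


def choose_neighbor_search(n_samples, force_approximation=False):
    """Hypothetical helper: report which distance path would be taken."""
    if n_samples < SMALL_DATA_THRESHOLD and not force_approximation:
        # Small data: exact pairwise distances via scikit-learn.
        return "exact_pairwise"
    # Otherwise: approximate kNN using the numba-compiled distances.py metrics.
    return "approximate_knn"
```

Under this rule, a metric that exists only in scipy or scikit-learn is reachable only on the first branch, which is why a native implementation is being requested.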

So, what I'm proposing is to implement Spearman's rho (and maybe phi_k too, if anyone is willing to) in distances.py as a "native distance" in UMAP, just like Pearson correlation and many other distance metrics that are implemented from scratch although they are available in scikit-learn/scipy. That way we can use it both with small matrices in a pairwise fashion and with large matrices via approximated kNN. I hope it's clear.

@sleighsoft
Collaborator

sleighsoft commented Nov 20, 2019

Oh I see! Thanks for clarifying the issue.
It is probably a good thing that you pinged the scipy people.

Btw, I cannot find Pearson in distances.py. Is it named correlation instead?

@sleighsoft
Collaborator

@gokceneraslan I looked at this again, and I believe the best place for this would actually be the pynndescent project: https://github.com/lmcinnes/pynndescent/blob/master/pynndescent/distances.py

It has the same number of metrics as UMAP and I assume it will be the default for UMAP.

@sleighsoft
Collaborator

Added a PR.

@sleighsoft changed the title from "Feature request: Other correlation metrics" to "Feature request: phik correlation metrics" on Feb 3, 2020