Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

self implemented distance metric #200

Open
jianshu93 opened this issue Sep 17, 2022 · 3 comments
Open

self implemented distance metric #200

jianshu93 opened this issue Sep 17, 2022 · 3 comments

Comments

@jianshu93
Copy link

Dear Pynndescent team,

I am wondering whether there could be an option for this library to provide an API to allow external distance metric so that some unique distance metric can be used., e.g., for biological applications, distance computation was actually based on MinHash of genomic profiles, none of the provided metric is fast in this case.

Thanks,

Jianshu

@lmcinnes
Copy link
Owner

You can provide a custom metric if you wish as long as it can be jit compiled by numba. Just specify the function as the metric instead of a metric name as a string. So for example you might have

@numba.njit()
def custom_distance_function(a, b):
    ...

index = NNDescent(metric=custom_distance_function)

The biggest hurdle here is that numba supports a (relatively large) subset of python, but can't see inside other functions, so you will need your own implementation of the distance metric, you can't just call another library (unless they have jit compiled code, or it is a C library and you use CFFI in numba). Note that a custom metric may not be as fast as many of the pre-provided metrics since the latter have had a lot of optimization work go into them.

Lastly, if you are interested in using minhash, perhaps you could consider, instead, using jaccard on a sparse matrix representation of your data -- since minhash is really just an approximation of jaccard anyway, why not use the real thing? As long as you convert the data to scipy.sparse format it will be relatively fast, and may even be competitive with naive minhash implementations.

@jianshu93
Copy link
Author

hello Leland,

Thanks for the suggestion on MinHash, the problem for me is, the matrix representation of my data (dimension) is huge, 10^6 or above (and there are 10^6 of such data vector ), so to have fast approximation of Jaccard index, there are no choice for me right now but MinHash approximation. I understand the jit complied part and the MinHash software I am relying on is actually a C++ library. So it may be difficult to incorporate it or supported by jit since I am not family with JIT at all.

I am thinking kgraph C++ library here(https://github.com/aaalgo/kgraph) since Jaccard is a metric.

Thanks for your help.

Jianshu

@lmcinnes
Copy link
Owner

The ability to compute jaccard using sparse structure is dependent on the number of non-zeros per row, not on the actual dimension. So 10^6 dimensions might be fine if most of the data are zeros. This is very often the case for data with that many dimensions.

If you still need to use minhash then as long as you can import the function via ctypes then you should be able to use it within a numba jitted function -- you'll just have to be careful with what you pass it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants