
Add support for callable metrics #93

Closed
jgraving opened this issue Aug 31, 2019 · 6 comments

Labels
enhancement New feature or request

Comments

@jgraving
Contributor

jgraving commented Aug 31, 2019

Expected behaviour

pynndescent supports passing callable metrics compiled with numba, so I would expect openTSNE to be able to support this.

Actual behaviour

When I pass a numba-compiled callable metric, openTSNE throws a ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-44-b188b4c74e96> in <module>()
----> 1 embedding_train = tsne.fit(X_train)

/home/jake/.local/lib/python3.6/site-packages/openTSNE/tsne.py in fit(self, X)
   1040 
   1041         """
-> 1042         embedding = self.prepare_initial(X)
   1043 
   1044         try:

/home/jake/.local/lib/python3.6/site-packages/openTSNE/tsne.py in prepare_initial(self, X)
   1118             metric_params=self.metric_params,
   1119             n_jobs=self.n_jobs,
-> 1120             random_state=self.random_state,
   1121         )
   1122 

/home/jake/.local/lib/python3.6/site-packages/openTSNE/affinity.py in __init__(self, data, perplexity, method, metric, metric_params, symmetrize, n_jobs, random_state)
    124         k_neighbors = min(self.n_samples - 1, int(3 * self.perplexity))
    125         self.knn_index, self.__neighbors, self.__distances = build_knn_index(
--> 126             data, method, k_neighbors, metric, metric_params, n_jobs, random_state
    127         )
    128 

/home/jake/.local/lib/python3.6/site-packages/openTSNE/affinity.py in build_knn_index(data, method, k, metric, metric_params, n_jobs, random_state)
    274             metric_params=metric_params,
    275             n_jobs=n_jobs,
--> 276             random_state=random_state,
    277         )
    278 

/home/jake/.local/lib/python3.6/site-packages/openTSNE/nearest_neighbors.py in __init__(self, metric, metric_params, n_jobs, random_state)
     11     def __init__(self, metric, metric_params=None, n_jobs=1, random_state=None):
     12         self.index = None
---> 13         self.metric = self.check_metric(metric)
     14         self.metric_params = metric_params
     15         self.n_jobs = n_jobs

/home/jake/.local/lib/python3.6/site-packages/openTSNE/nearest_neighbors.py in check_metric(self, *args, **kwargs)
    184             )
    185 
--> 186         return super().check_metric(*args, **kwargs)
    187 
    188     def build(self, data, k):

/home/jake/.local/lib/python3.6/site-packages/openTSNE/nearest_neighbors.py in check_metric(self, metric)
     58         if metric not in self.VALID_METRICS:
     59             raise ValueError(
---> 60                 f"`{self.__class__.__name__}` does not support the `{metric}` "
     61                 f"metric. Please choose one of the supported metrics: "
     62                 f"{', '.join(self.VALID_METRICS)}."

ValueError: `NNDescent` does not support the `CPUDispatcher(<function kld at 0x7f7fe2311158>)` metric. Please choose one of the supported metrics: euclidean, l2, manhattan, taxicab, l1, chebyshev, linfinity, linfty, linf, minkowski, seuclidean, standardised_euclidean, wminkowski, weighted_minkowski, mahalanobis, canberra, cosine, correlation, haversine, braycurtis, hamming, jaccard, dice, matching, kulsinski, rogerstanimoto, russellrao, sokalsneath, sokalmichener, yule.
Steps to reproduce the behaviour

Here is a minimal working example:

import numpy as np
from openTSNE import TSNE
from openTSNE.callbacks import ErrorLogger
from numba import njit

# numba-compiled KL divergence (in bits) between two probability vectors
@njit(fastmath=True)
def kld(p, q):
    result = 0.0
    for i in range(p.shape[0]):
        logp = np.log(p[i]) if p[i] > 0 else 0
        logq = np.log(q[i]) if q[i] > 0 else 0
        result += p[i] * (logq - logp)
    return -result / np.log(2.)

tsne = TSNE(
    perplexity=30,
    metric=kld,  # passing the callable triggers the ValueError above
    callbacks=ErrorLogger(),
    n_jobs=1,
)

x_train = np.random.beta(0.5, 0.5, size=(500, 10))
embedding_train = tsne.fit(x_train)
@pavlin-policar
Owner

Almost all the problems I've encountered while maintaining openTSNE have been pynndescent-related. As such, I would be very hesitant to add anything that ties openTSNE even closer to pynndescent.

However, it should be fairly simple to achieve what you want. I've made it so that the neighbors parameter can be an instance of openTSNE.nearest_neighbors.KNNIndex, so you can just copy over the NNDescent class and remove the check that disallows custom metrics. You can probably just change check_metric to return True, and I'd guess that would do it.
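A rough sketch of that workaround might look like this (untested; PermissiveNNDescent is a hypothetical name, and the check_metric hook follows the traceback above):

from openTSNE import nearest_neighbors

# Hypothetical subclass of the pynndescent wrapper that skips metric
# validation, so a callable passes straight through to pynndescent.
class PermissiveNNDescent(nearest_neighbors.NNDescent):
    def check_metric(self, metric):
        # Accept anything, including numba-compiled callables.
        return metric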

@pavlin-policar pavlin-policar added the wontfix This will not be worked on label Aug 31, 2019
@jgraving
Contributor Author

jgraving commented Aug 31, 2019

Sorry, I wasn't clear. I'm not suggesting you tie openTSNE closer to pynndescent. Passing a callable metric is a standard feature in many packages; scikit-learn allows it for many of its algorithms, including its implementation of t-SNE. Adding a check for whether the metric is callable is fairly straightforward. If you're short on time/motivation, I can make a PR once I have time to look through the code. You're of course free to reject it, but I think this is a reasonable feature for a t-SNE implementation to have, and it would be potentially useful to quite a few people.
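For illustration, the check could be as small as this (a sketch against the check_metric shown in the traceback, not openTSNE's actual code):

def check_metric(self, metric):
    # Sketch only: accept callables up front, otherwise fall back to the
    # existing name-based validation from the traceback above.
    if callable(metric):
        return metric
    if metric not in self.VALID_METRICS:
        raise ValueError(
            f"`{self.__class__.__name__}` does not support the `{metric}` "
            f"metric. Please choose one of the supported metrics: "
            f"{', '.join(self.VALID_METRICS)}."
        )
    return metric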

I'm trying to apply openTSNE to these methods for analyzing animal behavior:
https://royalsocietypublishing.org/doi/10.1098/rsif.2014.0672
as an example analysis for my recently developed package:
https://github.com/jgraving/behavelet
Applying t-SNE to these data requires using KL-divergence as the distance metric between normalized time-frequency vectors. If accomplishing that requires users to subclass and modify your code, I will likely just find something else, as I'd like this to be straightforward for people to use.

Again, I appreciate the work you've done on this, and I'm not trying to be pushy. I just wanted to let you know that openTSNE would be more widely useful with this feature.

@pavlin-policar
Owner

Hmm, I'm all for custom distance metrics, but the "numba-compiled callable" part threw me off a little. I'm not very familiar with numba, so is it possible to call numba-compiled functions from regular old Python code? What I'm getting at is that openTSNE supports both exact neighbor search via scikit-learn and approximate neighbor search via pynndescent, and I'd like to keep some consistency between the two. My concern was that if the function is numba-compiled, it wouldn't be possible to use that callable in scikit-learn's BallTree.

Putting all numba-related things aside, adding custom distance metrics was definitely on my to-do list, but I've been, and will likely remain, short on time for a while, so if you're willing to open a PR, I'd be more than happy to take a look. The changes should be fairly minimal anyhow.

@pavlin-policar pavlin-policar changed the title Add support for numba-compiled callable metrics Add support for callable metrics Aug 31, 2019
@pavlin-policar pavlin-policar added enhancement New feature or request and removed wontfix This will not be worked on labels Aug 31, 2019
@jgraving
Contributor Author

jgraving commented Aug 31, 2019

Ah, I see. Sorry, I assumed you were familiar with numba, as it's one of the main dependencies of pynndescent. numba is a library that JIT-compiles pure Python/numpy functions to make them faster, so a numba-compiled function works just like any other Python function in practice. scikit-learn's BallTree accepts user-defined metric functions, including numba-compiled ones. Here is an example:

from sklearn.neighbors import BallTree
from numba import njit
import numpy as np

# numba-compiled L1 (Manhattan) distance
@njit(fastmath=True)
def l1(x, y):
    return np.sum(np.abs(x - y))

# BallTree accepts the compiled callable just like a plain Python function
tree = BallTree(np.random.normal(size=(1000, 10)), metric=l1)
distances, indices = tree.query(np.random.normal(size=(100, 10)), k=5)

The only issue I can see is that if the function isn't numba-compiled and is passed to pynndescent, pynndescent will raise an uninformative error message from numba. Whether or how to deal with this might require some thought; alternatively, make it clear in the docstring and leave it for pynndescent to handle. I assume a user savvy enough to pass a custom metric would expect that error messages are a possibility.

I'll take a look at the code and submit a PR with the changes.

@pavlin-policar
Owner

pavlin-policar commented Aug 31, 2019

The only issue I can see is that if the function isn't numba-compiled and is passed to pynndescent, pynndescent will raise an uninformative error message from numba.

I would hope that numba has some way to check whether a function is compiled. If it isn't, we could compile it ourselves within the pynndescent wrapper to avoid potential errors. A decorator is just a function, after all.

After a bit of searching, it looks fairly straightforward: check whether the function is an instance of numba's Dispatcher type and, if it isn't, just call numba.njit()(f) on it.
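A minimal sketch of that check (ensure_compiled is a hypothetical helper name, and the Dispatcher import path has moved between numba releases, so treat it as an assumption):

import numba
from numba.core.dispatcher import Dispatcher  # numba.dispatcher.Dispatcher in older releases

def ensure_compiled(f):
    # Already compiled (e.g. the CPUDispatcher shown in the error above)?
    # Pass it through unchanged.
    if isinstance(f, Dispatcher):
        return f
    # Otherwise JIT-compile it, mirroring what the @numba.njit decorator does.
    return numba.njit()(f)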

@pavlin-policar
Owner

Fixed via #94.
