PyNNDescent on GPU #136
Comments
This sounds like great work -- I never imagined this would be easy at all, and it sounds like you have made a great deal of progress. I don't think it is unreasonable to have GPU support for only some subset of metrics (and, potentially, not supporting sparse input data, as that is definitely not going to play well in terms of dynamically allocated arrays, concatenates and so on). I would love to see what you have that works with euclidean distance. I would hope that cosine / angular distance should be manageable as well, and that would cover a vast amount of use cases for many people. Something like Wasserstein is just very hard.
I've got a working implementation of this. It's quite fast compared to other options I've seen: a 5k x 5k Wasserstein distance matrix in ~18 s on the GPU (including a near-negligible compile time). I also have an njit'd version that's fairly fast (~1 min?). Something like a 200x speedup from …
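For context, here is a minimal sketch of what an njit'd 1D Wasserstein distance can look like. This is an illustrative reimplementation, not the code referenced in the comment above, and it assumes two equal-length, equal-weight empirical samples (in which case the distance reduces to the mean absolute difference of the sorted samples):

```python
import numpy as np
from numba import njit

@njit(fastmath=True)
def wasserstein_1d(u, v):
    """1D Wasserstein (earth mover's) distance between two equal-weight
    empirical samples of the same length."""
    u_sorted = np.sort(u)
    v_sorted = np.sort(v)
    total = 0.0
    for i in range(u_sorted.shape[0]):
        total += abs(u_sorted[i] - v_sorted[i])
    return total / u_sorted.shape[0]

# Example usage (hypothetical data):
# rng = np.random.default_rng(0)
# d = wasserstein_1d(rng.random(1000), rng.random(1000))
```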
Where in the code are distances calculated in "batches" (e.g. with …)?
That is the big catch. There are very few places where distances are computed in large batches; almost everything is in relatively small batches. In terms of the main loop of the algorithm (as opposed to initialization), it is in pynndescent/pynndescent/pynndescent_.py, line 160 (commit ee915d2).

The only thing with …
Yes, but it is probably possible to make a decent estimate of an upper bound, so as long as memory is relatively cheap you can overallocate and leave a sentinel value at the end. That does leave issues plumbing it back into the rest of the code, but I suspect a little wrapping of the actual GPU compute might work. Does that seem reasonable to you?
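A rough sketch of the overallocate-and-sentinel idea follows. The kernel, array names, and threshold are hypothetical illustrations (this is not pynndescent's actual API), and running it requires a CUDA-capable GPU; the point is only the pattern of reserving a worst-case number of output slots per thread and padding the unused ones with a sentinel:

```python
import math
import numpy as np
from numba import cuda

SENTINEL = -1.0  # marks unused slots in the overallocated output

@cuda.jit
def neighbors_within_threshold(points, threshold, out_dists, out_counts):
    """For each point, write distances to other points under `threshold`
    into a fixed-size row of `out_dists` (sized to an upper bound)."""
    i = cuda.grid(1)
    if i >= points.shape[0]:
        return
    k = 0
    max_slots = out_dists.shape[1]
    for j in range(points.shape[0]):
        if j == i:
            continue
        d = 0.0
        for c in range(points.shape[1]):
            diff = points[i, c] - points[j, c]
            d += diff * diff
        d = math.sqrt(d)
        if d < threshold and k < max_slots:
            out_dists[i, k] = d
            k += 1
    out_counts[i] = k

# Host side: overallocate to an upper bound, pre-fill with the sentinel,
# then trim each row back to out_counts[i] entries after the kernel runs.
n, dim, upper_bound = 1000, 8, 64
points = np.random.rand(n, dim).astype(np.float32)
out_dists = np.full((n, upper_bound), SENTINEL, dtype=np.float32)
out_counts = np.zeros(n, dtype=np.int32)
neighbors_within_threshold[(n + 127) // 128, 128](
    points, np.float32(0.25), out_dists, out_counts
)
```

The "plumbing back into the rest of the code" would then amount to a thin host-side wrapper that slices each row down to its real length before handing results to the CPU code.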
I know I'm getting back pretty late on this - but yes, that does seem reasonable. This has been on my mind the last few months as I've been working on publishing mat_discover. Here's the distance matrix implementation I've been using: https://github.com/sparks-baird/dist-matrix. I think that even without using the GPU facilities, Wasserstein could be sped up by using the …

To know if it's worth running on the GPU, I'd probably need to estimate how many distance calculations are done in a single call to …

Calculating the full distance matrix is fairly straightforward, so I don't know if there is a (hacky) workaround of precomputing the distance matrix and accessing elements of the matrix rather than recomputing. Obviously, this would break down for problems where the full pairwise distance matrix can't fit into memory. The …
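A sketch of that (hacky) precompute-then-lookup idea, assuming the full n x n matrix fits in memory. The lookup "metric" and the way indices are passed in are illustrative assumptions, not pynndescent's actual interface; any batch distance routine (such as the one in dist-matrix) could stand in for the naive double loop here:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def precompute_distance_matrix(X):
    """Precompute the full pairwise 1D Wasserstein matrix for the rows of X.
    Only feasible when the n x n matrix fits in memory."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = wasserstein_distance(X[i], X[j])
            D[i, j] = d
            D[j, i] = d
    return D

def lookup_metric(i, j, D):
    """Hypothetical metric that ignores the raw vectors and just indexes
    into the precomputed matrix by row index."""
    return D[int(i), int(j)]
```

The appeal is that the expensive distances are computed once, in one large GPU-friendly batch, and the nearest-neighbor search itself only ever does cheap lookups.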
The number on a single call is often surprisingly small: anywhere from 400 to 3600 total distance computations (i.e. between 20 and 60 points, all-to-all). That may be a poor fit for GPU architectures. It seems like vector quantization or inverted-index style algorithms might be more amenable to these sorts of GPU-oriented approaches.
@lmcinnes Saw your twitter response back in January about GPU implementations.

I've been implementing a `cuda.jit` Numba implementation of the Wasserstein distance. Ran into lots of issues with replacing functions like `np.zeros`, `np.concatenate` (anything that dynamically allocates memory), and ended up copying manual implementations (sort of for-loop, C-style implementations) for e.g. `np.argsort`. Not so bad for Euclidean. The main issue is the (very) limited support of NumPy. I recently started learning Numba, so an expert might have a different opinion or be able to do it more easily. I was able to get Euclidean distance running on the GPU with Numba, but am still running into various issues with something more involved like Wasserstein.

For example, I haven't been able to get local array allocation working with `cuda.local.array` and have generally run into problems with slicing. An alternative is to pass everything in as arguments that are initialized at the top level, which is more work than just using overloaded NumPy functions that I write. I think what that means is: if I want to operate on two vectors and their concatenated version, I need to pass all three in as arguments at the host (CPU) level. If I use `np.diff`, which returns something of a different length, I also need to pass that in as an argument. The code will start to look a bit ugly, especially with nested functions - for Wasserstein I probably need about 5 additional input arguments. I'd rather be able to get it to work with `cuda.local.array` first and check the performance, but I might need to jump straight to this option.

Happy to share what I've tried and where I'm at if this is of interest.
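To make the two options concrete, here is a minimal sketch of both approaches on a toy problem (norm of the concatenation of two row vectors). The kernel names, shapes, and `MAX_LEN` bound are illustrative assumptions, not code from this issue; the key constraint is that `cuda.local.array` needs a compile-time-constant size, while the scratch-buffer variant pushes all allocation to the host:

```python
import math
import numpy as np
from numba import cuda, float64

MAX_LEN = 128  # cuda.local.array sizes must be compile-time constants;
               # assumes the combined vector length never exceeds this.

@cuda.jit
def concat_norm_local(u, v, out):
    """Variant 1: build the concatenation in a thread-local array."""
    i = cuda.grid(1)
    if i >= u.shape[0]:
        return
    scratch = cuda.local.array(MAX_LEN, float64)
    n = u.shape[1] + v.shape[1]
    for k in range(u.shape[1]):
        scratch[k] = u[i, k]
    for k in range(v.shape[1]):
        scratch[u.shape[1] + k] = v[i, k]
    total = 0.0
    for k in range(n):
        total += scratch[k] * scratch[k]
    out[i] = math.sqrt(total)

@cuda.jit
def concat_norm_scratch(u, v, scratch, out):
    """Variant 2: the host preallocates one scratch row per thread and
    passes it in, so the kernel never allocates anything itself."""
    i = cuda.grid(1)
    if i >= u.shape[0]:
        return
    n = u.shape[1] + v.shape[1]
    for k in range(u.shape[1]):
        scratch[i, k] = u[i, k]
    for k in range(v.shape[1]):
        scratch[i, u.shape[1] + k] = v[i, k]
    total = 0.0
    for k in range(n):
        total += scratch[i, k] * scratch[i, k]
    out[i] = math.sqrt(total)
```

Variant 2 is what "pass everything in as arguments initialized at the top level" amounts to: every intermediate (the concatenation, the `np.diff` result, etc.) becomes one more host-allocated array in the kernel signature, which is why the argument list grows so quickly.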