Multi-threaded implementation #12

tomwhite · 2019-03-07T12:44:40Z

Currently, pynndescent doesn't take full advantage of multiple cores on a machine. There are some methods annotated with the Numba parallel option, but these don't provide any noticeable speed up when more cores are available.

This PR allows nearest neighbor computation to take advantage of multiple cores by processing heaps in parallel chunks. The basic idea is an application of the MapReduce approach described in section 2.8 of Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures by Dong et al.

Note that in this PR all computation is carried out on a single machine using the standard Python concurrent.futures library, which complements Numba very well, as shown in this example.

Going into slightly more detail, a distributed heap can be updated in a MapReduce-style operation as follows.

We have a large array of all the heap updates held in memory and a thread pool to update chunks in parallel in two phases, called "Map" and "Reduce". Each Map partition holds an array of heap updates, corresponding to a row in the following diagram.

The Map phase adds updates in the order they were generated (from heap_push calls). After all the updates have been added for a partition, the updates for that partition are sorted by row number. Then the chunk boundaries corresponding to partitions are found, so that in the Reduce phase the array can process all the updates for a given partition together, and apply them to the target heap (not illustrated).

For the Map phase the array is processed row-wise, and in the Reduce phase it is processed column-wise (although the column bounds are different for each row owing to the different chunk boundaries). The result is an updated distributed heap.

This PR also adds tests that check the threaded implementation produces the same result as the regular implementation. Since NN-Descent uses a source of randomness, when running the tests the random seed is set to the value of each row index to ensure the same deterministic result is obtained. This per-row seeding is only used for tests. However, to make threaded runs more reproducible when run normally (outside of tests) each thread is seeded from the initial random seed, which means that the result from two threaded runs (with the same number of threads) is the same if the same initial seed is used in each case.

In terms of performance, the multi-threaded implementation outperforms the regular implementation with 4 cores or more. In one experiment, regular pynndescent took 27 minutes for a million rows (D=128, NN=25), compared to just over 3 minutes with 64 threads.

There is more work to consider how to wire up the implementation to the standard NNDescent object, but I wanted to get feedback on this approach before going any further.

lmcinnes · 2019-03-07T16:43:11Z

This looks fantastic! It is going to take me a little while to find the time to work through this properly and review it, but please don't mistake that delay for a lack of excitement about this pull request! Definitely looking forward to digging into this one, and would be extremely happy to get this merged.

lmcinnes · 2019-03-11T20:02:42Z

@tomwhite I still want to have some time to play around with this, but I have at least gone through and satisfied myself that I understand how it all works, and wanted to get some initial feedback as soon as possible.

First and foremost, this looks great! It is at heart very simple (which is a definite positive), and so fairly easy to follow, but very slickly done. I appreciate that you've kept the code very clean and that it matches nicely with the single threaded version.

On the question of going any further -- I think it definitely looks worth doing, and welcome whatever you view as appropriate for making it fit with the NNDescent object. I would also be happy to discuss options if you like. I haven't had the opportunity to do much benchmarking myself, but it sounds as if the single threaded case might be worth keeping for the 4-core or less case (depending on what the overhead is like for the threaded case).

Unless I've missed something it looks like your threaded version does not currently make use of the RP-trees when initialising the NNDescent. At least in the single threaded case this has been quite beneficial for performance. It is possible that this might get the threaded performance matching in the 4 or less core case. I will have to experiment and try some benchmarking to know for sure.

Lastly, given how well this works I would be interested in your thoughts on how well this might extend to working with dask (which has a concurrent.futures like interface) for out-of-core versions of pynndescent?

P.S.: Ignore the Codacy complaint, I turned it some time ago and was unimpressed but forgot to turn it off. It's off now.

lmcinnes · 2019-03-11T20:27:42Z

One other thing to note: since about 0.38 numba has supported passing jitted functions as parameters. This obviates the need to use the indirection of the make_nndescent functions that produce jitted functions -- instead you can simply pass dist and dist_args directly to the function in question. I've been in the process of shifting pynndescent to this, but have been busy with other projects so it isn't exactly complete yet.

tomwhite · 2019-03-12T16:06:32Z

@lmcinnes thanks for looking at the code, and for your kind comments. Let me address your points in turn.

To make it fit with the NNDescent object, I thought having an algorithm='threaded' might be a reasonable way of doing this. I will try that out to see if it works, but let me know if you've got another approach in mind.

Regarding the existing code path, I agree that it is worth keeping since it is faster for less than 4 cores. Or even more with RP-trees. You are correct that I haven't looked at parallelising RP-trees, but I think there are various ways that it would be possible, the simplest being initialising each tree in a separate thread. But that could be a later addition I think.

BTW for benchmarking I've been using these scripts. If these are generally useful I'd be happy to contribute them too.

Regarding Dask, I'm very interested in this direction, and in fact, running on a distributed engine like Dask or Spark was my initial motivation for optimising pynndescent (for Scanpy). Obviously, the MapReduce technique naturally transfers to these engines, although there may be challenges regarding the amount of data that needs to be shuffled for each iteration. This is well worth investigating for the potential gains for scaling to very large datasets though.

Finally, thanks for the info about the Numba upgrade. I'll rebase this PR to reflect the fact that make_nndescent is no longer needed.

lmcinnes · 2019-03-13T15:32:08Z

The algorithm='threaded' looks good for now. Ideally it might be beneficial to mirror sklearn's n_jobs approach to handling multiple threads/processes, but that can always be dealt with in the future.

I have been doing some benchmarking on datasets I expect to be more representative of real data than purely random data and the performance numbers seem to be comparable to what you describe, so I am fairly confident about this at this point. I am ready to hit the merge button as soon as you like -- did you have anything else you wanted to add before merging, or just get this merged and then look at anything else later?

tomwhite · 2019-03-13T15:52:57Z

@lmcinnes I don't have anything else to add before merging. I'll look at sklearn's n_jobs API separately.

lmcinnes · 2019-03-13T16:05:42Z

Excellent, thanks for all your work on this! I will likely be porting some or all of this over to umap at some point soon as well.

tomwhite · 2019-03-13T16:13:55Z

Thanks @lmcinnes! I'll submit some follow-up PRs that are related to this one.

Re umap, are there plans to have it use pynndescent as a dependency at some point? I'd be happy to help in that effort.

lmcinnes · 2019-03-13T20:50:03Z

I look forward to future PRs!

Adding a dependency on pynndescent is tempting from the point of view of removing duplication, but it does mean I would have to get pynndescent in a rather better state (testing, documentation, functionality), and right now umap is both in heavy development and also widely used so I was hoping to get to a good state on umap and associated new features before setting to work on pynndescent in earnest and getting to where it needs to be.

tomwhite added 3 commits March 12, 2019 16:49

Multi-threaded implementation

e387ec7

Remove make_nn_descent

086bd31

Add algorithm='threaded' to NNDescent

036ffe9

tomwhite force-pushed the threaded branch from 5a819b1 to 036ffe9 Compare March 13, 2019 11:10

lmcinnes merged commit 4b1e5be into lmcinnes:master Mar 13, 2019

stsievert mentioned this pull request Mar 19, 2019

Multi CPU / GPU capabilities? lmcinnes/umap#37

Open

This was referenced Apr 3, 2019

Test that setting random_state for NNDescent produces deterministic outputs #14

Merged

Fix non-deterministic search #11

Closed

pavlin-policar mentioned this pull request Apr 7, 2019

Sparse matrix support pavlin-policar/openTSNE#28

Closed

tomwhite mentioned this pull request May 23, 2019

Use pynndescent dependency to support threaded nearest neighbors scverse/scanpy#659

Closed

dkobak mentioned this pull request May 31, 2019

Why does transform() have exaggeration=2 by default? pavlin-policar/openTSNE#62

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-threaded implementation #12

Multi-threaded implementation #12

tomwhite commented Mar 7, 2019

lmcinnes commented Mar 7, 2019

lmcinnes commented Mar 11, 2019

lmcinnes commented Mar 11, 2019

tomwhite commented Mar 12, 2019

lmcinnes commented Mar 13, 2019

tomwhite commented Mar 13, 2019

lmcinnes commented Mar 13, 2019

tomwhite commented Mar 13, 2019

lmcinnes commented Mar 13, 2019

Multi-threaded implementation #12

Multi-threaded implementation #12

Conversation

tomwhite commented Mar 7, 2019

lmcinnes commented Mar 7, 2019

lmcinnes commented Mar 11, 2019

lmcinnes commented Mar 11, 2019

tomwhite commented Mar 12, 2019

lmcinnes commented Mar 13, 2019

tomwhite commented Mar 13, 2019

lmcinnes commented Mar 13, 2019

tomwhite commented Mar 13, 2019

lmcinnes commented Mar 13, 2019