ENH: use parallel processing in the cython code for CCMetric #1051

wants to merge 10 commits into
base: master



grlee77 commented May 14, 2016

This PR adds parallel computation to the CCMetric Cython routines and is built on top of #1050 (the first 5 commits listed here are from that PR). I went ahead and uploaded this now to give a concrete demonstration of how the refactored code from #1050 could be used.

There are really only a few minor changes despite the number of modified lines:

  • change `with nogil` to `with nogil, parallel()`
  • replace `range` with `prange` in the outermost loop of a routine
  • in two of the functions, an extra dimension of size equal to the parallel loop range had to be added to the `sums` and `lines` variables. This is a simple way to ensure that threads never write to the same memory. Total memory does not increase substantially because these variables are still much smaller than the `factors`, `static` and `moving` arrays.
  • a test of the threaded accuracy and performance was added
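The extra-dimension idea from the list above can be sketched in plain Python. This is only an analogue under stated assumptions, not the Cython code in this PR: `total_with_private_rows` and `process_slice` are hypothetical names, and a `ThreadPoolExecutor` stands in for `prange`. Python threads here only illustrate the buffer layout and are not expected to reproduce the speedups reported below.

```python
from concurrent.futures import ThreadPoolExecutor


def total_with_private_rows(volume, n_workers=4):
    """Sum a nested list volume[k][i][j] using one scratch row per outer index.

    `sums` gets a leading dimension equal to the outer (parallel) loop range,
    so iteration k only ever writes to sums[k] and threads cannot collide.
    """
    n_outer = len(volume)
    n_cols = len(volume[0][0])
    sums = [[0.0] * n_cols for _ in range(n_outer)]  # extra leading dimension

    def process_slice(k):
        row = sums[k]  # scratch space private to this outer iteration
        for line in volume[k]:
            for j, value in enumerate(line):
                row[j] += value

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() waits for every task and re-raises any worker exception
        list(pool.map(process_slice, range(n_outer)))

    # Serial reduction after all workers have finished
    return sum(sum(row) for row in sums)


# Small hypothetical volume: 5 slices of 3 x 4 values
volume = [[[float(k + i + j) for j in range(4)] for i in range(3)]
          for k in range(5)]
serial = sum(v for plane in volume for line in plane for v in line)
threaded = total_with_private_rows(volume)
```

The cost of this layout is one scratch row per outer index instead of a single shared row, which is the small memory increase mentioned in the list.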

Benchmark Results

The following were generated for the ndim = 3 case (a 3D image of size 96 x 96 x 96) by test_cc_threads on two different computers. The precomputation routines take by far the longest and scale well with the number of cores. The other routines are already an order of magnitude or more faster, but still benefit from threading in the case shown below. These timings are from a single run each, but reflect the same sorts of speedups I saw with IPython's timeit command.

Linux workstation with dual 8-core Xeon cpu

cpu count 32
thread count 16
default threads 16
1 threads:
      pre: 0.5550892353057861 s
      forward: 0.022975683212280273 s
      back: 0.023041486740112305 s
2 threads:
      pre: 0.29018259048461914 s
      forward: 0.01249837875366211 s
      back: 0.013074159622192383 s
16 threads:
      pre: 0.06494855880737305 s
      forward: 0.0053369998931884766 s
      back: 0.004540681838989258 s

quad core macbook

cpu count 8
thread count 4
default threads 4
1 threads:
      pre: 1.2701809406280518 s
      forward: 0.019701004028320312 s
      back: 0.01971602439880371 s
2 threads:
      pre: 0.655609130859375 s
      forward: 0.011089086532592773 s
      back: 0.011534929275512695 s
4 threads:
      pre: 0.3601951599121094 s
      forward: 0.007358074188232422 s
      back: 0.006587028503417969 s
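
As a rough reading of the numbers above, the following sketch converts the "pre" timings into speedup and parallel efficiency (speedup divided by thread count). The timings are copied from the two runs and rounded to four significant figures; `speedup_table` is a hypothetical helper, and since each timing comes from a single run the results are indicative only.

```python
# "pre" timings in seconds, copied from the runs above (rounded)
timings = {
    "xeon": {1: 0.5551, 2: 0.2902, 16: 0.06495},
    "macbook": {1: 1.270, 2: 0.6556, 4: 0.3602},
}


def speedup_table(times):
    """Map thread count -> (speedup, efficiency) relative to the 1-thread run."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in times.items()}


results = {machine: speedup_table(times) for machine, times in timings.items()}

for machine, table in results.items():
    for n, (speedup, efficiency) in table.items():
        print(f"{machine:8s} {n:2d} threads: "
              f"speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")
```

By this measure the precomputation step is about 8.5 times faster with 16 threads on the workstation and about 3.5 times faster with 4 threads on the MacBook.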

@grlee77 grlee77 force-pushed the grlee77:crosscorr_parallel branch from 1f401da to fa6db2c May 14, 2016


omarocegueda commented May 14, 2016

Amazing results @grlee77! This is very exciting! =D
Some time ago I also tried to parallelize these routines, but I was doing something wrong and didn't observe any substantial speedup. Thanks to your PR, it seems like I will finally learn how to properly implement this kind of thing in Cython!


arokem commented Oct 21, 2016

Hey @omarocegueda : is this ready from your point of view?

Sorry @grlee77 for letting this hang for so long. Would you mind rebasing this on current master?


omarocegueda commented Oct 21, 2016

Hi Ariel!,
I'm closing this PR since this enhancement parallelizes the old version of the CCMetric (the one whose computation time increases quadratically with the local window radius). The algorithm introduced with PR #1060, even without parallelization, is substantially faster than the parallel version of the old algorithm introduced by this PR. The plan is to parallelize the new algorithm after @grlee77's PR #1050 is merged, to obtain even better performance.
