Unified GPU framework using either CUDA or OpenCL #250

Merged: 9 commits merged into master from opencl on May 30, 2018
Conversation

mperrin
Owner

@mperrin mperrin commented May 15, 2018

This branch both adds support for OpenCL and refactors the CUDA implementation somewhat.

  • Essentially all GPU- and FFTW-related functionality is now relocated into accel_math.py, inside wrapper functions, in particular one called fft_2d. This allows the algorithmic code in the other files to be written without having to concern itself with how a given FFT is implemented. In principle this extends and generalizes the _fft and _inv_fft methods on the FresnelWavefront class (which in fact just become stubs that pass through to accel_math.fft_2d). This change involved refactoring key algorithmic parts of both fresnel.py and poppy_core.py. (A minimal sketch of the dispatch pattern follows this list.)
  • OpenCL-based implementation of GPU-accelerated FFTs added. Dependencies are pyopencl, conda, gpyfft. The first two can be installed from conda-forge. The latter is in a user-contributed conda package, so use conda install -c ljbo3 gpyfft. A new setting poppy.conf.use_opencl allows enabling or disabling this code on machines with OpenCL-compatible GPUs.
  • Work-in-progress partial implementation of toggling between single- and double-precision floating point math, controlled by a new setting poppy.conf.double_precision, which defaults to True. Many things work in single-precision mode, but currently many test cases fail, presumably because they're not smart enough to adjust tolerances appropriately. Best to ignore this setting for now for most purposes.
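
As referenced in the first bullet above, here is a minimal sketch of the fft_2d dispatch idea, using hypothetical per-backend helper names (_cuda_fft, _opencl_fft) rather than the actual code in accel_math.py:

    # Simplified sketch of the fft_2d dispatch pattern; the real implementation
    # also handles fftshifting, normalization, and plan handling.
    import numpy as np

    def fft_2d(wavefront, forward=True):
        """Compute a 2D FFT using the fastest available backend."""
        if _USE_CUDA:          # CUDA takes precedence if both GPU paths are available
            return _cuda_fft(wavefront, forward=forward)
        elif _USE_OPENCL:
            return _opencl_fft(wavefront, forward=forward)
        elif _USE_FFTW:
            import pyfftw
            func = pyfftw.interfaces.numpy_fft.fft2 if forward else pyfftw.interfaces.numpy_fft.ifft2
            return func(wavefront)
        else:
            return np.fft.fft2(wavefront) if forward else np.fft.ifft2(wavefront)

Calling code in fresnel.py and poppy_core.py then only ever calls accel_math.fft_2d and never needs to know which backend is active.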

All tests pass locally for me, using any combination of plain numpy, numexpr+FFTW, CUDA, or OpenCL. On the other hand, the test setups are getting very machine-dependent, so further testing is much appreciated.

Some areas for future work before merging this:

  1. Refactoring the CUDA functionality out of the FresnelWavefront class lost the ability to cache CUDA FFT plans as part of FresnelWavefront objects. I tried caching them as a module-level variable in accel_math.py, but for unclear reasons doing so led to a memory leak on the GPU (buffers not being freed somehow?), which eventually led to calculations failing entirely due to inability to malloc new buffers on the GPU. I couldn't debug where the leak was, but simply not caching the plans avoids it entirely. The time to create a new CUDA FFT plan each time appears to be < 1 ms, so it's not zero but is relatively negligible. I'm not totally happy with this, but reliability comes first.
  2. The single-precision stuff is all use-at-your-own-risk. Needs more debugging and attention to the unit test tolerances.
  3. Needs docs!
  4. Travis doesn't support CI for GPUs, so all the meaningful testing for this will need to be run locally. Need to investigate the available CUDA compute servers at STScI.
  5. Benchmark functions would be nice to include. Probably a few relevant cases, perhaps (1) a bare FFT, (2) a coronagraphic multi-plane OpticalSystem calculation in the Fraunhofer regime, and (3) a multi-plane Fresnel calculation. (A minimal sketch of case (1) follows this list.)
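
As a starting point for case (1), a minimal sketch of a bare-FFT timing helper; the function name benchmark_fft and its use of accel_math.fft_2d are assumptions, and the eventual benchmark functions may look quite different:

    # Hypothetical sketch of a bare-FFT benchmark (case 1 above).
    import timeit
    import numpy as np

    def benchmark_fft(npix=2048, repeat=5):
        """Time a forward 2D FFT of an npix x npix complex128 array with the current settings."""
        from poppy import accel_math
        wavefront = np.ones((npix, npix), dtype=np.complex128)
        times = timeit.repeat(lambda: accel_math.fft_2d(wavefront, forward=True),
                              repeat=repeat, number=1)
        print("best of {}: {:.3f} s for a {}x{} FFT".format(repeat, min(times), npix, npix))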

@mperrin
Owner Author

mperrin commented May 15, 2018

@douglase here's the PR I promised for you to take a look at. In particular I'm hoping you can confirm that my various edits to your code didn't break anything, or change the performance negatively. (Looks good as far as I can tell, but I'm cautious until I hear it's all working on your hardware as well)

@coveralls

coveralls commented May 15, 2018

Coverage Status

Coverage increased (+0.1%) to 63.672% when pulling db00838 on opencl into 34ea05c on master.

Contributor

@douglase douglase left a comment

I'm seeing a noticeable runtime hit with this version. For a test Fresnel optical system the runtime is 13.9 s ± 120 ms per loop vs. 9.43 s ± 76.4 ms per loop on the same hardware.

        45873 function calls (45670 primitive calls) in 14.009 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        7    5.988    0.855    7.081    1.012 accel_math.py:108(fft_2d)
       91    2.076    0.023    2.076    0.023 driver.py:284(safe_cuda_api_call)
       39    1.620    0.042    1.623    0.042 necompiler.py:732(evaluate)
        2    1.292    0.646    6.129    3.065 fresnel.py:597(_propagate_ptp)
        8    0.798    0.100    3.136    0.392 poppy_core.py:183(__imul__)
       83    0.519    0.006    0.519    0.006 {method 'reduce' of 'numpy.ufunc' objects}
        4    0.337    0.084    0.337    0.084 {method 'take' of 'numpy.ndarray' objects}
        1    0.251    0.251    0.251    0.251 {built-in method numpy.core.multiarray.where}
        1    0.189    0.189    0.564    0.564 optics.py:1035(get_transmission)
        1    0.162    0.162    0.162    0.162 numeric.py:2030(indices)
        1    0.121    0.121    0.166    0.166 poppy_core.py:175(normalize)
        5    0.099    0.020    0.099    0.020 {method 'copy' of 'numpy.ndarray' objects}
     11/8    0.076    0.007    3.233    0.404 fresnel.py:825(__imul__)
        2    0.071    0.036    0.071    0.036 {built-in method numpy.core.multiarray.copyto}
        1    0.068    0.068    0.068    0.068 {method 'fill' of 'numpy.ndarray' objects}
     23/1    0.066    0.003   14.005   14.005 utils.py:1277(unit_check_wrapper)

vs the current master:

    42845 function calls (42646 primitive calls) in 9.332 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       74    1.643    0.022    1.644    0.022 driver.py:255(safe_cuda_api_call)
       39    1.629    0.042    1.632    0.042 necompiler.py:732(evaluate)
        2    1.286    0.643    4.233    2.116 fresnel.py:643(_propagate_ptp)
        9    0.881    0.098    0.881    0.098 {method 'copy' of 'numpy.ndarray' objects}
        8    0.825    0.103    3.167    0.396 poppy_core.py:179(__imul__)
        2    0.572    0.286    1.465    0.732 _utils.py:29(_Xfftn)
        2    0.377    0.188    1.859    0.930 fresnel.py:388(_inv_fft)
        6    0.333    0.055    0.333    0.055 {method 'take' of 'numpy.ndarray' objects}
        5    0.261    0.052    0.944    0.189 fresnel.py:365(_fft)
        1    0.243    0.243    0.243    0.243 {built-in method numpy.core.multiarray.where}
        1    0.193    0.193    0.561    0.561 optics.py:1018(get_transmission)
        1    0.166    0.166    0.166    0.166 numeric.py:1936(indices)
       54    0.134    0.002    0.134    0.002 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.124    0.124    0.170    0.170 poppy_core.py:171(normalize)
        2    0.095    0.048    0.486    0.243 _utils.py:55(_Xfftn)
     11/8    0.075    0.007    3.263    0.408 fresnel.py:871(__imul__)
        2    0.072    0.036    0.072    0.036 {built-in method numpy.core.multiarray.copyto}
        1    0.070    0.070    0.070    0.070 {method 'fill' of 'numpy.ndarray' objects}
     23/1    0.065    0.003    9.328    9.328 utils.py:1277(unit_check_wrapper)

Separately, testing with a pre-made FFT plan gives about a 10% speed-up in the FFT, 145 ms vs. 164 ms on an NVIDIA K80 for a direct FFT of a 4096x4096 complex128 array. This is consistent with the time it takes to create a plan (23 ms on the same system). But the profiling results above make it look like there are a lot more calls to the CUDA driver. It's not the new library: testing pyculib vs. accelerate.cuda, the FFT times are identical.

if (not forward) and fftshift:  # inverse shift before backwards FFTs
    wavefront = np.fft.ifftshift(wavefront)

print(" Pre FFT", wavefront.dtype)
Contributor

Did you want this to be a _log.debug?

method = 'numpy'

_log.debug("using {2} FFT of {0} array, FFT_direction={1}".format(str(wavefront.shape), 'forward' if forward else 'backward', method))
print("using {2} FFT of {0} array, FFT_direction={1}".format(str(wavefront.shape), 'forward' if forward else 'backward', method))
Contributor

extra line?

poppy/fresnel.py Outdated
#if _USE_CUDA:
#_log.debug("_USE_CUDA enabled, will not retain intermediates")
# MP: Why disable this here? Should still work
#retain_intermediates=False
Contributor

This is because the CUDA plan was part of the wavefront and can't be pickled or copied; 234bfad shows where I added this and rewrote some tests to still succeed without wavefront copies. If we're not going to use the plan that way, then that isn't an issue.

@mperrin
Owner Author

mperrin commented May 16, 2018

Yeah, I realized last night after submitting this that I had accidentally left a bunch of debugging print statements in there. Those should mostly just come out entirely.

I suspect the unnecessary IO is part of what's causing the slowdowns.

@mperrin
Owner Author

mperrin commented May 16, 2018

Oh, actually it's worse than just unnecessary IO, since those debug statements were also summing the total intensity before and after each FFT. That was leftover from when I was trying to debug some normalization issues.

@mperrin
Owner Author

mperrin commented May 16, 2018

plus I wasn't using your _fftshift wrapper properly.



_USE_CUDA = (conf.use_cuda and _CUDA_AVAILABLE)
_USE_OPENCL = (conf.use_numexpr and _OPENCL_AVAILABLE)
_USE_NUMEXPR = (conf.use_numexpr and _NUMEXPR_AVAILABLE)
Contributor

@douglase douglase May 16, 2018

shouldn't there be a _USE_FFTW for uniformity?

Owner Author

Hmm, actually I think these all may be redundant; the check that matters happens at runtime, on line 167 and following, and there it does have a _USE_FFTW.

The underlying question is one of semantics: whether the .use_x flags can be toggled freely at runtime, or whether they have to be set by editing the config file prior to importing poppy.

Contributor

Also, what if _USE_CUDA and _USE_OPENCL are both True? Is that handled somewhere else?

Owner Author

@mperrin mperrin May 16, 2018

It's handled implicitly by the order of the if/elif inside fft_2d. _USE_CUDA has precedence.

Contributor

At the moment I'm flipping these bools to run my benchmarks; that's why I'm paying particular attention to them.

Owner Author

Right. I was trying to get it so that one could flip the poppy.conf.use_x bools and it would have the desired effect. On the general principle that what you're trying to do is configure a setting, so it should be possible via the configuration settings system. :-)

But I see that doesn't quite work right for all the cases of interest right now, since the file global versions are still accessed in a bunch of places.

Owner Author

Well, it only costs a couple of microseconds to evaluate (conf.use_cuda and _CUDA_AVAILABLE) on the fly when needed, so there is a very small cost to putting that check directly in the fftshift function and so on. That would be one way to let the conf settings be toggled at runtime and still take effect.

Owner Author

What do you think? Should I try to update the code so that it checks the use_x conf settings on each instance? That seems a little inelegant, and those extra microseconds do add up eventually... It certainly works to have module-level state in the _USE_X flags, but those are implicitly "internal, private API" given the leading underscore, rather than public-facing user options.
It would be nice if updating the use_x would also update the _USE_X automatically, but the settings system doesn't work that way.

One compromise could be if e.g. there was some function accel_math.check_math_settings(), and that got called for instance inside Wavefront.__init__(). That way, a user could adjust the poppy.conf.use_x settings, and as long as they created at least one new wavefront prior to the next propagation (essentially guaranteed) then the updated settings would take effect.

How does that sound?
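
A minimal sketch of that compromise; the function name check_math_settings is the one proposed above, but the exact set of flags and conf attribute names here are assumptions:

    # Hypothetical sketch, not the final implementation.
    # accel_math.py
    def check_math_settings():
        """Re-evaluate the module-level _USE_X flags from the user-facing conf settings."""
        global _USE_CUDA, _USE_OPENCL, _USE_NUMEXPR, _USE_FFTW
        _USE_CUDA = conf.use_cuda and _CUDA_AVAILABLE
        _USE_OPENCL = conf.use_opencl and _OPENCL_AVAILABLE
        _USE_NUMEXPR = conf.use_numexpr and _NUMEXPR_AVAILABLE
        _USE_FFTW = conf.use_fftw and _FFTW_AVAILABLE

    # poppy_core.py, inside Wavefront.__init__:
    #     accel_math.check_math_settings()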

Contributor

It's all a bit confusing, but that seems like a reasonable path. I don't think the config file is as critical as having a setting that users can adjust at run time in their own scripts.

@douglase
Contributor

a6b6328 runtimes are looking good

pyfftw.interfaces.cache.set_keepalive_time(30)

test_array = np.zeros(wavefront.shape)
test_array = do_fft(test_array, overwrite_input=True, planner_effort='FFTW_MEASURE',
Contributor

If or when we restore FFT planning to CUDA, we could implement it here as well.

Owner Author

I'm starting to suspect I may have been wrong before about there being a memory leak from storing the plans. It's possible the root cause was actually a memory leak when test cases fail during pytest: exceptions are raised and variables do not get cleaned up properly. If so, now that the tests are passing, that should not happen.

I'm cautiously optimistic this may be the case, and am running the test suite repeatedly on my laptop with CUDA to see...

Owner Author

I am increasingly convinced this is the case!





if _USE_CUDA:
Contributor

TO

Owner Author

Not sure what this comment means?

Contributor

It was supposed to be "TO DO: figure out how to rewrite this in OpenCL".

Owner Author

Right! And the lack of that is a big part of why I wrote fft_2d to prioritize CUDA rather than OpenCL if both are available...

@mperrin
Owner Author

mperrin commented May 16, 2018

I was able to run the test suite a dozen times in a row with no memory errors, which is far beyond the problems I was seeing before. I'm almost entirely convinced the leak was indeed due to failure to cleanup GPU state when calculations error out part way through. Did you encounter anything like that before?
In any case I re-enabled the plan cache, using the version where it's a file-level global inside accel_math, so it should not interfere with Wavefront copying or pickling.
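
For reference, a minimal sketch of what a file-global plan cache keyed on array shape and dtype might look like (the helper name and the pyculib call details are assumptions; the actual caching code may differ):

    # Hypothetical sketch of a module-level CUDA FFT plan cache in accel_math.py,
    # keyed on array shape and dtype so plans are reused across wavefronts without
    # being attached to (and pickled or copied with) the Wavefront objects.
    _CUFFT_PLANS = {}

    def _get_cufft_plan(wavefront):
        from pyculib import fft as cufft   # assumed backend; pyculib replaces accelerate.cuda
        key = (wavefront.shape, wavefront.dtype.name)
        if key not in _CUFFT_PLANS:
            _CUFFT_PLANS[key] = cufft.FFTPlan(shape=wavefront.shape,
                                              itype=wavefront.dtype,
                                              otype=wavefront.dtype)
        return _CUFFT_PLANS[key]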

# OpenCL cfFFT only can FFT certain array sizes.
# This check is more stringent that necessary - opencl can handle powers of a few small integers
# but this simple version helps during development
if _USE_OPENCL and not ispowerof2(wavefront.shape[0]):
Owner Author

@douglase are there constraints on which array sizes are allowed for CUDA FFTs? How well does it handle odd-sized arrays? Not that I expect those to be common, but we should check for and avoid cases where it would fail.

FWIW, the clfft library says it supports "powers of 2, 3, 5, 7, 11 and 13". But here I just check for powers of 2, because that's the common case, and I don't know how to write a computationally efficient fast check for "factorable into powers of primes no greater than 13".

I don't necessarily put a high priority on optimal GPU acceleration of weird-sized arrays; I just want to make sure the code doesn't encounter any situation that causes a fatal error. Although for some situations it is useful to FFT such arrays; for instance if you need bigger than 4096^2, it's still about 2x faster to FFT an array of size 6144^2 than to jump all the way to 8192^2.

Contributor

@douglase douglase May 16, 2018

Ah, this explains why suddenly my odd-sized test was the same speed as FFTW.

CUDA definitely is happy with odd sizes, but takes a slight hit for unusually sized arrays, as seen in this figure from the current master branch, where everything is a power of two except for a 2800x2800 test:

[figure: benchmarks12coresv100 — FFT benchmark timings vs. array dimension]

This is consistent with the cuFFT documentation (https://docs.nvidia.com/cuda/cufft/index.html#introduction):

"
Features:

  • Algorithms highly optimized for input sizes that can be written in the form 2^a×3^b×^5^c^×7^d. In general the smaller the prime factor, the better the performance, i.e., powers of two are fastest.
  • An O(nlogn) algorithm for every input data size
    ...
    "

Contributor

So I would just pass any sized array to CUDA

Owner Author

OK, great, thanks. So we can be flexible on CUDA for any size FFT, and I will update the OpenCL side to check more carefully for powers of primes up to 13. I've just written a function to do that check and it only adds about a microsecond.

When you wrote "suddenly my odd-sized test was the same speed as FFTW", does that mean you're running an OpenCL test there too in addition to CUDA? There's nothing that should slow down that odd-sized test on the CUDA side as far as I know.

Note that I also added a check in the fftshift code for being a multiple of 32. Presumably we could generalize that to use a smaller block size if some array is not a multiple of 32, but I wasn't going to pursue that for now.
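
A minimal sketch of that kind of radix check (the function name is hypothetical and the actual function added in the PR may differ):

    # Hypothetical sketch of a radix check for clFFT-compatible array sizes.
    def _is_clfft_compatible(n):
        """Return True if n is factorable into powers of primes no greater than 13."""
        for prime in (2, 3, 5, 7, 11, 13):
            while n % prime == 0:
                n //= prime
        return n == 1

For example, _is_clfft_compatible(6144) is True (6144 = 2^11 × 3), while _is_clfft_compatible(4097) is False (4097 = 17 × 241).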

Contributor

Hmm, nope, I'm not using OpenCL, and it's only the OpenCL case you are testing. Maybe it's a coincidence of this system. I will keep digging.

Contributor

@douglase douglase May 17, 2018

Never mind, it was just a coincidence of that machine/run. I am seeing a speed-up from CUDA at 2800x2800, albeit a slightly smaller one. It's a little hard to judge how much is due to undersized error bars vs. changes in the code without doing a few more runs to compare.
[figure: benchmarks12coresv100 — FFT benchmark timings vs. array dimension]

Owner Author

Can you point me at the benchmark code you're using for that figure? You pointed me to a notebook a while ago, but I don't know if that's still the up-to-date place to find it. I'll make a similar figure for OpenCL.

Might want to tweak the x-axis label to say "array dimensions per axis", to emphasize that the full array size is the square of that.

Contributor

That notebook is still the most current.

@mperrin
Owner Author

mperrin commented May 30, 2018

@douglase I'm going to go ahead and merge this PR now. I had hoped to get this poppy release done and out the door this week. At this point I'm not expecting to do any additional technical work on this in the near term; I need to concentrate efforts on writing up for SPIE etc.

Incidentally, on my iMac Pro I've ended up in an interesting and surprising situation where plain numpy appears to be outperforming both FFTW and OpenCL. I have tested and benchmarked like crazy since this seemed nuts to me, but it proves to be the case. Apparently there are major speedups in numpy's intrinsic FFT in the Intel-optimized, MKL-linked version of numpy which is now in Conda. This ends up being a very big deal on the iMac Pro since its CPU (Xeon W 2140) supports the AVX-512 SIMD extension, i.e. each core can operate on the equivalent of four complex128s at once. On this particular hardware that ends up equalling or slightly beating the current result of FFTW (not as well optimized for the SIMD instructions as Intel's hand-tuned code?), and substantially outdoing even this fairly high-end AMD GPU (which is tuned much more for single-precision graphics performance than for double-precision GPGPU work; the chip has many fewer double-precision-capable cores than its total number of cores).

Surprising but true, and goes to show that performance optimization has to be tuned for each set of hardware. Totally different optimal paths for my MacBook Pro laptop, iMac Pro desktop, and your EWS-instance GPGPU compute instances... Makes for a more complicated story but appears to be the case.
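
For anyone wanting to reproduce this comparison, a quick sketch of how one might check whether the installed numpy is MKL-linked and time its built-in FFT (this is not the benchmark notebook referenced above, just a minimal standalone check):

    # Minimal standalone check: is numpy MKL-linked, and how fast is its fft2?
    import timeit
    import numpy as np

    np.__config__.show()   # MKL-linked builds typically report 'mkl' libraries here

    a = np.random.randn(2048, 2048) + 1j * np.random.randn(2048, 2048)   # complex128
    best = min(timeit.repeat(lambda: np.fft.fft2(a), repeat=5, number=1))
    print("numpy fft2 of 2048x2048 complex128: {:.4f} s".format(best))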
