Unified GPU framework using either CUDA or OpenCL #250

Merged: 9 commits merged into master from opencl on May 30, 2018
Conversation

mperrin
Owner

@mperrin mperrin commented May 15, 2018

This branch both adds support for OpenCL and refactors the CUDA implementation somewhat.

  • Essentially all GPU- and FFTW-related functionality is now relocated into accel_math.py, inside wrapper functions, in particular one called fft_2d. This allows the algorithmic code in the other files to be written without having to concern itself with how a given FFT is implemented. In principle this extends and generalizes the _fft and _inv_fft methods on the FresnelWavefront class (which in fact just become stubs that pass through to accel_math.fft_2d). This change involved refactoring key algorithmic parts of both fresnel.py and poppy_core.py. (A minimal sketch of the dispatch pattern follows this list.)
  • OpenCL-based implementation of GPU-accelerated FFTs added. Dependencies are pyopencl, conda, gpyfft. The first two can be installed from conda-forge. The latter is in a user-contributed conda package, so use conda install -c ljbo3 gpyfft. A new setting poppy.conf.use_opencl allows enabling or disabling this code on machines with OpenCL-compatible GPUs.
  • Work-in-progress partial implementation of toggling between single- and double-precision floating point math, controlled by a new setting poppy.conf.double_precision, which defaults to True. Many things work in single-precision mode, but currently many test cases fail, presumably because they're not smart enough to adjust tolerances appropriately. Best to ignore this setting for now for most purposes.
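
As referenced in the first bullet above, here is a minimal sketch of the fft_2d dispatch idea, using hypothetical per-backend helper names (_cuda_fft, _opencl_fft) rather than the actual code in accel_math.py:

    # Simplified sketch of the fft_2d dispatch pattern; the real implementation
    # also handles fftshifting, normalization, and plan handling.
    import numpy as np

    def fft_2d(wavefront, forward=True):
        """Compute a 2D FFT using the fastest available backend."""
        if _USE_CUDA:          # CUDA takes precedence if both GPU paths are available
            return _cuda_fft(wavefront, forward=forward)
        elif _USE_OPENCL:
            return _opencl_fft(wavefront, forward=forward)
        elif _USE_FFTW:
            import pyfftw
            func = pyfftw.interfaces.numpy_fft.fft2 if forward else pyfftw.interfaces.numpy_fft.ifft2
            return func(wavefront)
        else:
            return np.fft.fft2(wavefront) if forward else np.fft.ifft2(wavefront)

Calling code in fresnel.py and poppy_core.py then only ever calls accel_math.fft_2d and never needs to know which backend is active.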

All tests pass locally for me, using any combination of plain numpy, numexpr+FFTW, CUDA, or OpenCL. On the other hand, the test setups are getting very machine-dependent, so further testing is much appreciated.

Some areas for future work before merging this:

  1. Refactoring the CUDA functionality out of the FresnelWavefront class lost the ability to cache CUDA FFT plans as part of FresnelWavefront objects. I tried caching them as a module-level variable in accel_math.py, but for unclear reasons doing so led to a memory leak on the GPU (buffers not being freed somehow?), which eventually led to calculations failing entirely due to inability to malloc new buffers on the GPU. I couldn't debug where the leak was, but simply not caching the plans avoids it entirely. The time to create a new CUDA FFT plan each time appears to be < 1 ms, so it's not zero but is relatively negligible. I'm not totally happy with this, but reliability comes first.
  2. The single-precision stuff is all use-at-your-own-risk. Needs more debugging and attention to the unit test tolerances.
  3. Needs docs!
  4. Travis doesn't support CI for GPUs, so all the meaningful testing for this will need to be run locally. Need to investigate the available CUDA compute servers at STScI.
  5. Benchmark functions would be nice to include. Probably a few relevant cases, perhaps (1) a bare FFT, (2) a coronagraphic multi-plane OpticalSystem calculation in the Fraunhofer regime, and (3) a multi-plane Fresnel calculation. (A minimal sketch of case (1) follows this list.)
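
As a starting point for case (1), a minimal sketch of a bare-FFT timing helper; the function name benchmark_fft and its use of accel_math.fft_2d are assumptions, and the eventual benchmark functions may look quite different:

    # Hypothetical sketch of a bare-FFT benchmark (case 1 above).
    import timeit
    import numpy as np

    def benchmark_fft(npix=2048, repeat=5):
        """Time a forward 2D FFT of an npix x npix complex128 array with the current settings."""
        from poppy import accel_math
        wavefront = np.ones((npix, npix), dtype=np.complex128)
        times = timeit.repeat(lambda: accel_math.fft_2d(wavefront, forward=True),
                              repeat=repeat, number=1)
        print("best of {}: {:.3f} s for a {}x{} FFT".format(repeat, min(times), npix, npix))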

@mperrin
Owner Author

mperrin commented May 15, 2018

@douglase here's the PR I promised for you to take a look at. In particular I'm hoping you can confirm that my various edits to your code didn't break anything, or change the performance negatively. (Looks good as far as I can tell, but I'm cautious until I hear it's all working on your hardware as well)

@coveralls

coveralls commented May 15, 2018

Coverage Status

Coverage increased (+0.1%) to 63.672% when pulling db00838 on opencl into 34ea05c on master.

Contributor

@douglase douglase left a comment

I'm seeing a noticeable runtime hit with this version. For a test Fresnel optical system the runtime is 13.9 s ± 120 ms per loop vs. 9.43 s ± 76.4 ms per loop on the same hardware.

        45873 function calls (45670 primitive calls) in 14.009 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        7    5.988    0.855    7.081    1.012 accel_math.py:108(fft_2d)
       91    2.076    0.023    2.076    0.023 driver.py:284(safe_cuda_api_call)
       39    1.620    0.042    1.623    0.042 necompiler.py:732(evaluate)
        2    1.292    0.646    6.129    3.065 fresnel.py:597(_propagate_ptp)
        8    0.798    0.100    3.136    0.392 poppy_core.py:183(__imul__)
       83    0.519    0.006    0.519    0.006 {method 'reduce' of 'numpy.ufunc' objects}
        4    0.337    0.084    0.337    0.084 {method 'take' of 'numpy.ndarray' objects}
        1    0.251    0.251    0.251    0.251 {built-in method numpy.core.multiarray.where}
        1    0.189    0.189    0.564    0.564 optics.py:1035(get_transmission)
        1    0.162    0.162    0.162    0.162 numeric.py:2030(indices)
        1    0.121    0.121    0.166    0.166 poppy_core.py:175(normalize)
        5    0.099    0.020    0.099    0.020 {method 'copy' of 'numpy.ndarray' objects}
     11/8    0.076    0.007    3.233    0.404 fresnel.py:825(__imul__)
        2    0.071    0.036    0.071    0.036 {built-in method numpy.core.multiarray.copyto}
        1    0.068    0.068    0.068    0.068 {method 'fill' of 'numpy.ndarray' objects}
     23/1    0.066    0.003   14.005   14.005 utils.py:1277(unit_check_wrapper)

vs the current master:

    42845 function calls (42646 primitive calls) in 9.332 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       74    1.643    0.022    1.644    0.022 driver.py:255(safe_cuda_api_call)
       39    1.629    0.042    1.632    0.042 necompiler.py:732(evaluate)
        2    1.286    0.643    4.233    2.116 fresnel.py:643(_propagate_ptp)
        9    0.881    0.098    0.881    0.098 {method 'copy' of 'numpy.ndarray' objects}
        8    0.825    0.103    3.167    0.396 poppy_core.py:179(__imul__)
        2    0.572    0.286    1.465    0.732 _utils.py:29(_Xfftn)
        2    0.377    0.188    1.859    0.930 fresnel.py:388(_inv_fft)
        6    0.333    0.055    0.333    0.055 {method 'take' of 'numpy.ndarray' objects}
        5    0.261    0.052    0.944    0.189 fresnel.py:365(_fft)
        1    0.243    0.243    0.243    0.243 {built-in method numpy.core.multiarray.where}
        1    0.193    0.193    0.561    0.561 optics.py:1018(get_transmission)
        1    0.166    0.166    0.166    0.166 numeric.py:1936(indices)
       54    0.134    0.002    0.134    0.002 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.124    0.124    0.170    0.170 poppy_core.py:171(normalize)
        2    0.095    0.048    0.486    0.243 _utils.py:55(_Xfftn)
     11/8    0.075    0.007    3.263    0.408 fresnel.py:871(__imul__)
        2    0.072    0.036    0.072    0.036 {built-in method numpy.core.multiarray.copyto}
        1    0.070    0.070    0.070    0.070 {method 'fill' of 'numpy.ndarray' objects}
     23/1    0.065    0.003    9.328    9.328 utils.py:1277(unit_check_wrapper)

Separately, testing with a pre-made FFT plan gives about a 10% speed-up in the FFT, 145 ms vs. 164 ms on an NVIDIA K80 for a direct FFT of a 4096x4096 complex128 array. This is consistent with the time it takes to create a plan (23 ms on the same system). But the profiling results above make it look like there are a lot more calls to the CUDA driver. It's not the new library: testing pyculib vs. accelerate.cuda, the FFT times are identical.

if (not forward) and fftshift:  # inverse shift before backwards FFTs
    wavefront = np.fft.ifftshift(wavefront)

print(" Pre FFT", wavefront.dtype)
Contributor

Did you want this to be a _log.debug?

method = 'numpy'

_log.debug("using {2} FFT of {0} array, FFT_direction={1}".format(str(wavefront.shape), 'forward' if forward else 'backward', method))
print("using {2} FFT of {0} array, FFT_direction={1}".format(str(wavefront.shape), 'forward' if forward else 'backward', method))
Contributor

extra line?

poppy/fresnel.py Outdated
#if _USE_CUDA:
#_log.debug("_USE_CUDA enabled, will not retain intermediates")
# MP: Why disable this here? Should still work
#retain_intermediates=False
Contributor

This is because the CUDA plan was part of the wavefront and can't be pickled or copied; 234bfad shows where I added this and rewrote some tests to still succeed without wavefront copies. If we're not going to use the plan that way, then that isn't an issue.

@mperrin
Owner Author

mperrin commented May 16, 2018

Yeah, I realized last night after submitting this that I had accidentally left a bunch of debugging print statements in there. Those should mostly just come out entirely.

I suspect the unnecessary IO is part of what's causing the slowdowns.

@mperrin
Owner Author

mperrin commented May 16, 2018

Oh, actually it's worse than just unnecessary IO, since those debug statements were also summing the total intensity before and after each FFT. That was leftover from when I was trying to debug some normalization issues.

@mperrin
Owner Author

mperrin commented May 16, 2018

plus I wasn't using your _fftshift wrapper properly.



_USE_CUDA = (conf.use_cuda and _CUDA_AVAILABLE)
_USE_OPENCL = (conf.use_numexpr and _OPENCL_AVAILABLE)
_USE_NUMEXPR = (conf.use_numexpr and _NUMEXPR_AVAILABLE)
Contributor

@douglase douglase May 16, 2018

shouldn't there be a _USE_FFTW for uniformity?

Owner Author

Hmm, actually I think these all may be redundant; the check that matters happens at runtime, on line 167 and following, and there it does have a _USE_FFTW.

The underlying question is one of semantics: whether the .use_x flags can be toggled freely at runtime, or whether they have to be set by editing the config file prior to importing poppy.

Contributor

Also, what if _USE_CUDA and _USE_OPENCL are both True? Is that handled somewhere else?

Owner Author

@mperrin mperrin May 16, 2018

It's handled implicitly by the order of the if/elif inside fft_2d. _USE_CUDA has precedence.

Contributor

At the moment I'm flipping these bools to run my benchmarks; that's why I'm paying particular attention to them.

Owner Author

Right. I was trying to get it so that one could flip the poppy.conf.use_x bools and it would have the desired effect. On the general principle that what you're trying to do is configure a setting, so it should be possible via the configuration settings system. :-)

But I see that doesn't quite work right for all the cases of interest right now, since the file global versions are still accessed in a bunch of places.

Owner Author

Well, it only costs a couple of microseconds to evaluate (conf.use_cuda and _CUDA_AVAILABLE) on the fly when needed, so there is a very small cost to putting that check directly in the fftshift function and so on. That would be one way to let the conf settings be toggled at runtime and still take effect.

Owner Author

What do you think? Should I try to update the code so that it checks the use_x conf settings on each instance? That seems a little inelegant, and those extra microseconds do add up eventually... It certainly works to have module-level state in the _USE_X flags, but those are implicitly "internal, private API" given the leading underscore, rather than public-facing user options.
It would be nice if updating the use_x would also update the _USE_X automatically, but the settings system doesn't work that way.

One compromise could be if e.g. there was some function accel_math.check_math_settings(), and that got called for instance inside Wavefront.__init__(). That way, a user could adjust the poppy.conf.use_x settings, and as long as they created at least one new wavefront prior to the next propagation (essentially guaranteed) then the updated settings would take effect.

How does that sound?
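
A minimal sketch of that compromise; the function name check_math_settings is the one proposed above, but the exact set of flags and conf attribute names here are assumptions:

    # Hypothetical sketch, not the final implementation.
    # accel_math.py
    def check_math_settings():
        """Re-evaluate the module-level _USE_X flags from the user-facing conf settings."""
        global _USE_CUDA, _USE_OPENCL, _USE_NUMEXPR, _USE_FFTW
        _USE_CUDA = conf.use_cuda and _CUDA_AVAILABLE
        _USE_OPENCL = conf.use_opencl and _OPENCL_AVAILABLE
        _USE_NUMEXPR = conf.use_numexpr and _NUMEXPR_AVAILABLE
        _USE_FFTW = conf.use_fftw and _FFTW_AVAILABLE

    # poppy_core.py, inside Wavefront.__init__:
    #     accel_math.check_math_settings()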

Contributor

It's all a bit confusing, but that seems like a reasonable path. I don't think the config file is as critical as having a setting that users can adjust at run time in their own scripts.

@douglase
Contributor

a6b6328 runtimes are looking good

pyfftw.interfaces.cache.set_keepalive_time(30)

test_array = np.zeros(wavefront.shape)
test_array = do_fft(test_array, overwrite_input=True, planner_effort='FFTW_MEASURE',
Contributor

If or when we restore FFT planning to CUDA, we could implement it here as well.

Owner Author

I'm starting to suspect I may have been wrong before about there being a memory leak from storing the plans. It's possible the root cause was actually a memory leak when test cases fail during pytest: exceptions are raised and variables do not get cleaned up properly. If so, now that the tests are passing, that should not happen.

I'm cautiously optimistic this may be the case, and am running the test suite repeatedly on my laptop with CUDA to see...

Owner Author

I am increasingly convinced this is the case!





if _USE_CUDA:
Contributor

TO

Owner Author

Not sure what this comment means?

Contributor

It was supposed to be "TO DO: figure out how to rewrite this in OpenCL".

Owner Author

Right! And the lack of that is a big part of why I wrote fft_2d to prioritize CUDA rather than OpenCL if both are available...

@mperrin
Owner Author

mperrin commented May 16, 2018

I was able to run the test suite a dozen times in a row with no memory errors, which is far beyond the problems I was seeing before. I'm almost entirely convinced the leak was indeed due to failure to cleanup GPU state when calculations error out part way through. Did you encounter anything like that before?
In any case I re-enabled the plan cache, using the version where it's a file-level global inside accel_math, so it should not interfere with Wavefront copying or pickling.
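
For reference, a minimal sketch of what a file-global plan cache keyed on array shape and dtype might look like (the helper name and the pyculib call details are assumptions; the actual caching code may differ):

    # Hypothetical sketch of a module-level CUDA FFT plan cache in accel_math.py,
    # keyed on array shape and dtype so plans are reused across wavefronts without
    # being attached to (and pickled or copied with) the Wavefront objects.
    _CUFFT_PLANS = {}

    def _get_cufft_plan(wavefront):
        from pyculib import fft as cufft   # assumed backend; pyculib replaces accelerate.cuda
        key = (wavefront.shape, wavefront.dtype.name)
        if key not in _CUFFT_PLANS:
            _CUFFT_PLANS[key] = cufft.FFTPlan(shape=wavefront.shape,
                                              itype=wavefront.dtype,
                                              otype=wavefront.dtype)
        return _CUFFT_PLANS[key]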

# OpenCL cfFFT only can FFT certain array sizes.
# This check is more stringent that necessary - opencl can handle powers of a few small integers
# but this simple version helps during development
if _USE_OPENCL and not ispowerof2(wavefront.shape[0]):
Owner Author

@douglase are there constraints on which array sizes are allowed for CUDA FFTs? How well does it handle odd-sized arrays? Not that I expect those to be common, but we should check for and avoid cases where it would fail.

FWIW, the clfft library says it supports "powers of 2, 3, 5, 7, 11 and 13". But here I just check for powers of 2, because that's the common case, and I don't know how to write a computationally efficient fast check for "factorable into powers of primes no greater than 13".

I don't necessarily put a high priority on optimal GPU acceleration of weird-sized arrays; I just want to make sure the code doesn't encounter any situation that causes a fatal error. Although for some situations it is useful to FFT such arrays; for instance if you need bigger than 4096^2, it's still about 2x faster to FFT an array of size 6144^2 than to jump all the way to 8192^2.

Contributor

@douglase douglase May 16, 2018

Ah, this explains why suddenly my odd-sized test was the same speed as FFTW.

CUDA definitely is happy with odd sizes, but takes a slight hit for unusually sized arrays, as seen in this figure from the current master branch, where everything is a power of two except for a 2800x2800 test:

[figure: benchmarks12coresv100 — FFT benchmark timings vs. array dimension]

This is consistent with the cuFFT documentation (https://docs.nvidia.com/cuda/cufft/index.html#introduction):

"
Features:

  • Algorithms highly optimized for input sizes that can be written in the form 2^a×3^b×^5^c^×7^d. In general the smaller the prime factor, the better the performance, i.e., powers of two are fastest.
  • An O(nlogn) algorithm for every input data size
    ...
    "

Contributor

So I would just pass any sized array to CUDA

Owner Author

OK, great, thanks. So we can be flexible on CUDA for any size FFT, and I will update the OpenCL side to check more carefully for powers of primes up to 13. I've just written a function to do that check and it only adds about a microsecond.

When you wrote "suddenly my odd-sized test was the same speed as FFTW", does that mean you're running an OpenCL test there too in addition to CUDA? There's nothing that should slow down that odd-sized test on the CUDA side as far as I know.

Note that I also added a check in the fftshift code for being a multiple of 32. Presumably we could generalize that to use a smaller block size if some array is not a multiple of 32, but I wasn't going to pursue that for now.
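
A minimal sketch of that kind of radix check (the function name is hypothetical and the actual function added in the PR may differ):

    # Hypothetical sketch of a radix check for clFFT-compatible array sizes.
    def _is_clfft_compatible(n):
        """Return True if n is factorable into powers of primes no greater than 13."""
        for prime in (2, 3, 5, 7, 11, 13):
            while n % prime == 0:
                n //= prime
        return n == 1

For example, _is_clfft_compatible(6144) is True (6144 = 2^11 × 3), while _is_clfft_compatible(4097) is False (4097 = 17 × 241).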

Contributor

Hmm, nope, I'm not using OpenCL, and it's only the OpenCL case you are testing. Maybe it's a coincidence of this system. I will keep digging.

Contributor

@douglase douglase May 17, 2018

Never mind, it was just a coincidence of that machine/run. I am seeing a speed-up from CUDA at 2800x2800, albeit a slightly smaller one. It's a little hard to judge how much is due to undersized error bars vs. changes in the code without doing a few more runs to compare.
[figure: benchmarks12coresv100 — FFT benchmark timings vs. array dimension]

Owner Author

Can you point me at the benchmark code you're using for that figure? You pointed me to a notebook a while ago, but I don't know if that's still the up-to-date place to find it. I'll make a similar figure for OpenCL.

Might want to tweak the x-axis label to say "array dimensions per axis", to emphasize that the full array size is the square of that.

Contributor

That notebook is still the most current.

@mperrin
Owner Author

mperrin commented May 30, 2018

@douglase I'm going to go ahead and merge this PR now. I had hoped to get this poppy release done and out the door this week. At this point I'm not expecting to do any additional technical work on this in the near term; I need to concentrate efforts on writing up for SPIE etc.

Incidentally, on my iMac Pro I've ended up in an interesting and surprising situation where plain numpy appears to be outperforming both FFTW and OpenCL. I have tested and benchmarked like crazy since this seemed nuts to me, but it proves to be the case. Apparently there are major speedups in numpy's intrinsic FFT in the Intel-optimized, MKL-linked version of numpy which is now in Conda. This ends up being a very big deal on the iMac Pro since its CPU (Xeon W 2140) supports the AVX-512 SIMD extension, i.e. each core can operate on the equivalent of four complex128s at once. On this particular hardware that ends up equalling or slightly beating the current result of FFTW (not as well optimized for the SIMD instructions as Intel's hand-tuned code?), and substantially outdoing even this fairly high-end AMD GPU (which is tuned much more for single-precision graphics performance than for double-precision GPGPU work; the chip has many fewer double-precision-capable cores than its total number of cores).

Surprising but true, and goes to show that performance optimization has to be tuned for each set of hardware. Totally different optimal paths for my MacBook Pro laptop, iMac Pro desktop, and your EWS-instance GPGPU compute instances... Makes for a more complicated story but appears to be the case.
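
For anyone wanting to reproduce this comparison, a quick sketch of how one might check whether the installed numpy is MKL-linked and time its built-in FFT (this is not the benchmark notebook referenced above, just a minimal standalone check):

    # Minimal standalone check: is numpy MKL-linked, and how fast is its fft2?
    import timeit
    import numpy as np

    np.__config__.show()   # MKL-linked builds typically report 'mkl' libraries here

    a = np.random.randn(2048, 2048) + 1j * np.random.randn(2048, 2048)   # complex128
    best = min(timeit.repeat(lambda: np.fft.fft2(a), repeat=5, number=1))
    print("numpy fft2 of 2048x2048 complex128: {:.4f} s".format(best))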
