Unified GPU framework using either CUDA or OpenCL #250
Conversation
@douglase here's the PR I promised for you to take a look at. In particular I'm hoping you can confirm that my various edits to your code didn't break anything, or change the performance negatively. (Looks good as far as I can tell, but I'm cautious until I hear it's all working on your hardware as well.)
I'm seeing a noticeable runtime hit with this version. For a test Fresnel optical system the runtime is 13.9 s ± 120 ms per loop vs. 9.43 s ± 76.4 ms per loop on the same hardware.
45873 function calls (45670 primitive calls) in 14.009 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
7 5.988 0.855 7.081 1.012 accel_math.py:108(fft_2d)
91 2.076 0.023 2.076 0.023 driver.py:284(safe_cuda_api_call)
39 1.620 0.042 1.623 0.042 necompiler.py:732(evaluate)
2 1.292 0.646 6.129 3.065 fresnel.py:597(_propagate_ptp)
8 0.798 0.100 3.136 0.392 poppy_core.py:183(__imul__)
83 0.519 0.006 0.519 0.006 {method 'reduce' of 'numpy.ufunc' objects}
4 0.337 0.084 0.337 0.084 {method 'take' of 'numpy.ndarray' objects}
1 0.251 0.251 0.251 0.251 {built-in method numpy.core.multiarray.where}
1 0.189 0.189 0.564 0.564 optics.py:1035(get_transmission)
1 0.162 0.162 0.162 0.162 numeric.py:2030(indices)
1 0.121 0.121 0.166 0.166 poppy_core.py:175(normalize)
5 0.099 0.020 0.099 0.020 {method 'copy' of 'numpy.ndarray' objects}
11/8 0.076 0.007 3.233 0.404 fresnel.py:825(__imul__)
2 0.071 0.036 0.071 0.036 {built-in method numpy.core.multiarray.copyto}
1 0.068 0.068 0.068 0.068 {method 'fill' of 'numpy.ndarray' objects}
23/1 0.066 0.003 14.005 14.005 utils.py:1277(unit_check_wrapper)
vs the current master:
42845 function calls (42646 primitive calls) in 9.332 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
74 1.643 0.022 1.644 0.022 driver.py:255(safe_cuda_api_call)
39 1.629 0.042 1.632 0.042 necompiler.py:732(evaluate)
2 1.286 0.643 4.233 2.116 fresnel.py:643(_propagate_ptp)
9 0.881 0.098 0.881 0.098 {method 'copy' of 'numpy.ndarray' objects}
8 0.825 0.103 3.167 0.396 poppy_core.py:179(__imul__)
2 0.572 0.286 1.465 0.732 _utils.py:29(_Xfftn)
2 0.377 0.188 1.859 0.930 fresnel.py:388(_inv_fft)
6 0.333 0.055 0.333 0.055 {method 'take' of 'numpy.ndarray' objects}
5 0.261 0.052 0.944 0.189 fresnel.py:365(_fft)
1 0.243 0.243 0.243 0.243 {built-in method numpy.core.multiarray.where}
1 0.193 0.193 0.561 0.561 optics.py:1018(get_transmission)
1 0.166 0.166 0.166 0.166 numeric.py:1936(indices)
54 0.134 0.002 0.134 0.002 {method 'reduce' of 'numpy.ufunc' objects}
1 0.124 0.124 0.170 0.170 poppy_core.py:171(normalize)
2 0.095 0.048 0.486 0.243 _utils.py:55(_Xfftn)
11/8 0.075 0.007 3.263 0.408 fresnel.py:871(__imul__)
2 0.072 0.036 0.072 0.036 {built-in method numpy.core.multiarray.copyto}
1 0.070 0.070 0.070 0.070 {method 'fill' of 'numpy.ndarray' objects}
23/1 0.065 0.003 9.328 9.328 utils.py:1277(unit_check_wrapper)
Separately, testing the plan gives about a 10% speed-up in the FFT, 145 ms vs. 164 ms on an NVIDIA K80 for a 4096x4096 complex128 array for a direct array transform. This is consistent with the time it takes to run plan (23 ms on the same system). But the profiling results above make it look like there are a lot more calls to the CUDA driver. It's not the new library: testing pyculib vs. accelerate.cuda, the FFT times are identical.
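Profiles like the tables above can be reproduced with Python's built-in profiler, sorted by internal time. A minimal sketch (the `workload` function here is just a stand-in for the actual poppy propagation being measured):

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for the actual poppy propagation call; any
    # CPU-heavy function works for demonstration purposes.
    total = 0.0
    for i in range(1, 200_000):
        total += i ** 0.5
    return total

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by internal time ("tottime"), matching the tables above.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("tottime")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

Comparing two such reports call-by-call (as done above) is what exposes the extra `safe_cuda_api_call` invocations.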
poppy/accel_math.py
Outdated
if (not forward) and fftshift: #inverse shift before backwards FFTs
    wavefront = np.fft.ifftshift(wavefront)

print("    Pre FFT", wavefront.dtype)
Did you want this to be a _log.debug?
poppy/accel_math.py
Outdated
method = 'numpy'

_log.debug("using {2} FFT of {0} array, FFT_direction={1}".format(str(wavefront.shape), 'forward' if forward else 'backward', method))
print("using {2} FFT of {0} array, FFT_direction={1}".format(str(wavefront.shape), 'forward' if forward else 'backward', method))
extra line?
poppy/fresnel.py
Outdated
#if _USE_CUDA:
#    _log.debug("_USE_CUDA enabled, will not retain intermediates")
#    # MP: Why disable this here? Should still work
#    #retain_intermediates=False
This is because the CUDA plan was part of the wavefront and can't be pickled and can't be copied, 234bfad shows I added this and rewrote some tests to still succeed without wavefront copies. If we're not going to use the plan, then that isn't an issue.
Yeah, I realized last night after submitting this that I had accidentally left a bunch of debugging print statements in there. Those should mostly just come out entirely. I suspect the unnecessary IO is part of what's causing the slowdowns.
oh, actually it's worse than just unnecessary IO, since those debug statements were also summing the total intensity before and after each FFT. This was leftover from when I was trying to debug some normalization issues.
plus I wasn't using your _fftshift wrapper properly.
poppy/accel_math.py
Outdated
_USE_CUDA = (conf.use_cuda and _CUDA_AVAILABLE)
_USE_OPENCL = (conf.use_numexpr and _OPENCL_AVAILABLE)
_USE_NUMEXPR = (conf.use_numexpr and _NUMEXPR_AVAILABLE)
shouldn't there be a _USE_FFTW for uniformity?
hmm, actually these all may be redundant, I think - the check that matters happens at runtime, on line 167 and following. And there it does have a _USE_FFTW.
The underlying question is one of semantics: whether the .use_x flags can be toggled freely at runtime, or whether they have to be set by editing the config file prior to importing poppy.
also, what if _USE_CUDA and _USE_OPENCL are both True, is that handled somewhere else?
It's handled implicitly by the order of the if/elif inside fft_2d. _USE_CUDA has precedence.
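In sketch form, the precedence looks like the following (flags are function parameters here for illustration; in the actual module they are the `_USE_CUDA` / `_USE_OPENCL` globals, and the GPU branches call the real cuFFT/clFFT paths rather than numpy):

```python
import numpy as np

def fft_2d(wavefront, forward=True, use_cuda=False, use_opencl=False):
    """Dispatch sketch: the if/elif ordering gives CUDA precedence."""
    if use_cuda:
        backend = "cuda"      # real code would take the cuFFT path here
    elif use_opencl:
        backend = "opencl"    # clFFT path
    else:
        backend = "numpy"
    # This sketch always computes the transform with numpy.
    result = np.fft.fft2(wavefront) if forward else np.fft.ifft2(wavefront)
    return result, backend

arr = np.ones((4, 4), dtype=complex)
_, backend_both = fft_2d(arr, use_cuda=True, use_opencl=True)
# With both flags set, the first matching branch (CUDA) wins.
```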
at the moment, I'm flipping these bools to run my benchmarks; that's why I'm paying particular attention to them.
Right. I was trying to get it so that one could flip the poppy.conf.use_x bools and it would have the desired effect. On the general principle that what you're trying to do is configure a setting, so it should be possible via the configuration settings system. :-)
But I see that doesn't quite work right for all the cases of interest right now, since the file-global versions are still accessed in a bunch of places.
well, it only costs a couple microseconds to just evaluate (conf.use_cuda and _CUDA_AVAILABLE) on the fly when needed. So there is a very small cost to putting that check directly in the fftshift function and so on. That would be one way to let the conf settings be toggled at runtime and work.
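A minimal sketch of that on-the-fly approach (the `_Conf` class and `_use_cuda_now` name are hypothetical stand-ins, not poppy's actual API):

```python
class _Conf:
    """Stand-in for poppy's configuration object (hypothetical here)."""
    use_cuda = False

conf = _Conf()
_CUDA_AVAILABLE = True   # in the real module this is set by an import check at load time

def _use_cuda_now():
    # Evaluating the conf flag at call time (a couple microseconds at most)
    # means toggling conf.use_cuda in a user script takes effect immediately,
    # with no re-import required.
    return bool(conf.use_cuda and _CUDA_AVAILABLE)

before = _use_cuda_now()   # the setting starts off
conf.use_cuda = True
after = _use_cuda_now()    # the toggle is picked up on the next call
```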
What do you think? Should I try to update the code so that it checks the use_x conf settings on each instance? That seems a little inelegant, and those extra microseconds do add up eventually... It certainly works to have module-level state in the _USE_X flags, but those are implicitly "internal, private API" given the leading underscore, rather than public-facing user options.
It would be nice if updating use_x would also update _USE_X automatically, but the settings system doesn't work that way.
One compromise could be if e.g. there was some function accel_math.check_math_settings(), and that got called for instance inside Wavefront.__init__(). That way, a user could adjust the poppy.conf.use_x settings, and as long as they created at least one new wavefront prior to the next propagation (essentially guaranteed), then the updated settings would take effect.
How does that sound?
It's all a bit confusing but that seems like a reasonable path. I don't think the config file is as critical as having a setting that they can adjust at run time in their own script.
a6b6328 runtimes are looking good
pyfftw.interfaces.cache.set_keepalive_time(30)

test_array = np.zeros(wavefront.shape)
test_array = do_fft(test_array, overwrite_input=True, planner_effort='FFTW_MEASURE',
if or when we restore FFT planning to CUDA, we could implement it here as well.
I'm starting to suspect I may have been wrong before about there being a memory leak from storing the plans. It's possible the root cause was actually a memory leak when test cases fail during pytest: exceptions are raised and variables do not get cleaned up properly. Now that the tests are passing, that should not happen.
I'm cautiously optimistic this may be the case, and am running the test suite repeatedly on my laptop with CUDA to see...
I am increasingly convinced this is the case!
poppy/accel_math.py
Outdated
if _USE_CUDA:
TO
Not sure what this comment means?
it was supposed to be TO DO: figure out how to rewrite this in OpenCL
Right! And the lack of that is a big part of why I wrote fft_2d to prioritize CUDA rather than OpenCL if both are available...
I was able to run the test suite a dozen times in a row with no memory errors, which is far beyond the problems I was seeing before. I'm almost entirely convinced the leak was indeed due to failure to clean up GPU state when calculations error out partway through. Did you encounter anything like that before?
poppy/accel_math.py
Outdated
# OpenCL clFFT can only FFT certain array sizes.
# This check is more stringent than necessary - OpenCL can handle powers of a few small integers
# but this simple version helps during development
if _USE_OPENCL and not ispowerof2(wavefront.shape[0]):
@douglase are there constraints on which array sizes are allowed for CUDA FFTs? How well does it handle odd-sized arrays? Not that I expect those to be constant, but we should check for and avoid cases where it would fail.
FWIW, the clFFT library says it supports "powers of 2, 3, 5, 7, 11 and 13", but here I just check for powers of 2, because that's the common case, and I don't know how to write a computationally efficient check for "factorable into powers of primes no greater than 13".
I don't necessarily put a high priority on optimal GPU acceleration of weird-sized arrays; I just want to make sure the code doesn't encounter any situation that causes a fatal error. Although for some situations it is useful to FFT such arrays; for instance if you need bigger than 4096^2, it's still about 2x faster to FFT an array of size 6144^2 than to jump all the way to 8192^2.
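One possible shape for those size checks (a sketch only; the actual ispowerof2 helper and the primes-up-to-13 check in the PR may be written differently, and `is_clfft_size` is a hypothetical name):

```python
def ispowerof2(n):
    """True if n is a positive exact power of two (single bit set)."""
    return n > 0 and (n & (n - 1)) == 0

def is_clfft_size(n):
    """True if n factors entirely into primes <= 13, the sizes clFFT supports.

    Trial division by the six allowed primes; for realistic array
    dimensions this runs in roughly a microsecond.
    """
    if n <= 0:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        while n % p == 0:
            n //= p
    return n == 1
```

For example, 6144 = 2^11 × 3 passes `is_clfft_size` even though it fails `ispowerof2`, which is exactly the 6144^2 case mentioned above.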
Ah, this explains why suddenly my odd-sized test was the same speed as FFTW.
CUDA definitely is happy with odd sizes, but takes a slight hit for unusually sized arrays, as seen in this figure from the current master branch, where everything is a power of two except for a 2800x2800 test:
This is consistent with the cuFFT documentation (https://docs.nvidia.com/cuda/cufft/index.html#introduction):
"Features:
- Algorithms highly optimized for input sizes that can be written in the form 2^a × 3^b × 5^c × 7^d. In general the smaller the prime factor, the better the performance, i.e., powers of two are fastest.
- An O(n log n) algorithm for every input data size
..."
So I would just pass any sized array to CUDA
OK, great, thanks. So we can be flexible on CUDA for any size FFT, and I will update the OpenCL side to check more carefully for multiples of primes up to 13. I've just written a function to do that check and it only adds about a microsecond.
When you wrote "suddenly my odd-sized test was the same speed as FFTW", does that mean you're running the OpenCL test there too in addition to CUDA? There's nothing that should slow down that odd-sized test on the CUDA side as far as I know.
Note that I also added a check in the fftshift code for the array size being a multiple of 32. Presumably we could generalize that to use a smaller block size if some array is not a multiple of 32, but I wasn't going to pursue that for now.
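If that generalization were pursued, one hypothetical fallback is to pick the largest power-of-two block that evenly divides the array side (this `choose_block_size` is illustrative, not the PR's code):

```python
def choose_block_size(n, preferred=32):
    """Largest power-of-two block size <= preferred that evenly divides n.

    Illustrative fallback for the fftshift kernel's multiple-of-32
    assumption: unusual sizes get a smaller (slower) block rather
    than being rejected outright.
    """
    block = preferred
    while block > 1 and n % block:
        block //= 2
    return block
```

A 4096-wide array keeps the full 32-wide block, a 2800-wide array falls back to 16, and a fully odd size degrades to 1 (effectively unblocked).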
hmm, nope, not using OpenCL, and it's only the OpenCL case you are testing. Maybe it's a coincidence on this system. I will keep digging.
Can you point me at the benchmark code you're using for that figure? You pointed me to a notebook a while ago but I don't know if that's still the up-to-date place to find it. I'll make a similar figure for OpenCL.
Might want to tweak the X axis label to say "array dimensions per axis", to emphasize that the full array size is the square of that.
that notebook is still the most current.
@douglase I'm going to go ahead and merge this PR now. I had hoped to get this poppy release done and out the door this week. At this point I'm not expecting to do any additional technical work on this in the near term - need to concentrate efforts on writing up for SPIE etc. Incidentally, on my iMac Pro I've ended up in an interesting and surprising situation where plain numpy appears to be outperforming both FFTW and OpenCL. Have tested and benchmarked like crazy since this seemed nuts to me, but it proves to be the case. Apparently there are major speedups in numpy's intrinsic FFT in the Intel-optimized MKL-linked version of numpy which is now in Conda. This ends up being a very big deal on the iMac Pro since its CPU (Xeon W 2140) supports the AVX-512 SIMD extension, i.e. each core can operate on the equivalent of 4 complex128s at once. On this particular hardware that ends up equalling or slightly beating the current result of FFTW (not as well optimized for the SIMD instructions as Intel's hand-tuned code?), and substantially outdoing even this fairly high-end AMD GPU (which is tuned much more for single-precision graphics performance than for double-precision GPGPU work; the chip has many fewer double-precision-capable cores than its total number of cores). Surprising but true, and it goes to show that performance optimization has to be tuned for each set of hardware. Totally different optimal paths for my MacBook Pro laptop, iMac Pro desktop, and your EWS-instance GPGPU compute instances... Makes for a more complicated story but appears to be the case.
This branch both adds support for OpenCL and refactors the CUDA implementation somewhat.
- All FFT calls are now consolidated in accel_math.py, inside wrapper functions, in particular one called fft_2d. This allows the algorithmic code in the other files to be written without having to concern itself with how a given FFT is implemented. In principle this extends and generalizes the _fft and _inv_fft methods on the FresnelWavefront class (which in fact just become stubs that pass through to accel_math.fft_2d). This change involved refactoring key algorithmic parts of both fresnel.py and poppy_core.py.
- OpenCL support: conda install -c ljbo3 gpyfft. A new setting poppy.conf.use_cuda allows enabling or disabling this code, on machines with OpenCL-compatible GPUs.
- A new setting poppy.conf.double_precision, which defaults to True. Many things work in single precision mode but currently many test cases fail, presumably since they're not smart enough to adjust tolerances appropriately. Best to ignore this setting for now for most purposes.
All tests pass locally for me, using any combo of plain numpy, numexpr+FFTW, CUDA, or OpenCL. On the other hand this starts to get very machine-dependent in test setups, so further testing is much appreciated.
Some areas for future work before merging this:
- FFT plan caching: plans were originally cached in accel_math.py, but for unclear reasons doing so led to a memory leak on the GPU (buffers not being freed somehow?), which eventually led to calculations totally failing due to inability to malloc new buffers on the GPU. I couldn't debug where the leak was, but simply not caching the plans avoids it entirely. The time to create a new CUDA FFT plan each time appears to be < 1 ms, so it's not zero but is relatively negligible. I'm not totally happy with this but reliability comes first.