amd-fftw: problems with linking and performance #82
As I commented on issue 74, I was finally able to reproduce the issue on Ubuntu 24.04, and your finding fixed it there. I also tried changing the ulimit settings, and that at least did not help. One possibility is that the root cause is some kind of alignment bug that occurs when compiling binaries on certain platforms. I suggest we resolve issue 74 by removing the dynamic-dispatcher option, but keep this bug open.
We can do that as a stopgap, sure. My thinking was that if we can't show that
Of note is that AMD itself only mentions this targets EPYC. It's unclear how much benefit consumer hardware gets (it's the same micro-arch, so you'd expect at least some benefit, but still, if it optimizes for the cache sizes on EPYC this could be a detriment). Also, this specifically targets CPU optimizations, nothing to do with GPU, so I'd assume this is not relevant to anyone using an Intel CPU + AMD GPU combo (minority or not). All in all, I think this library has something to prove if it wants to be part of an SDK.

I've rebuilt the source AMD supplies from scratch (outside ROCm) with GCC, clang and hipcc, and while this eliminates the optimization problem, the resulting code is at best on par with the vanilla code, at least on my Ryzen 5 7600 with double precision. That said, I'm no expert in optimizing numerics, so it might be that there are modes that do get huge gains, but if so I don't know how to bench them.
The reason the library ends up unoptimized when building in ROCm is simply that we set the compiler flags ourselves, which overrides autoconf's default optimization flags.

We should double-check that all the other projects that use autoconf (including quite fundamental ones like OpenMPI) are not similarly affected, and devise some sort of general fix if possible. Appending things like `-O3` to the flags would do it.

Unfortunately, building the library with optimization, while reducing the size a bit, does not fix the segfault-on-dynamic-dispatch issue. Could have been nice. :P
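The autoconf pitfall described here can be sketched as follows (the `-fPIC` flag is just an illustrative stand-in; the real build passes its own set of flags):

```shell
# Autoconf only applies its default optimization flags (for gcc, "-g -O2")
# when CFLAGS is unset at configure time. Passing any CFLAGS of our own
# replaces that default wholesale, so the build silently drops to -O0:
./configure CFLAGS="-fPIC"          # no optimization at all
# Re-adding the optimization level explicitly restores it:
./configure CFLAGS="-fPIC -O3"      # optimized build
```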
In other news, I've found the root cause of the segfault: the AMD version of the lib uses ifuncs (recently of xz backdoor fame :P) to switch between AVX implementations at runtime (but only for two functions, oddly enough; some very specific optimization going on there, or maybe experimental stuff). Unfortunately, the way they use them is incompatible with loading the library through `dlopen` with immediate symbol resolution.

This only happens if the library is loaded with immediate resolving; if loaded lazily, no segfault happens. However, ifuncs really ought to be written in such a way that they're safe either way, and ideally not used at all, for these and other reasons.

My current "fix" consists of crudely making static copies of the ifunc resolvers.
I tried using clang instead of gcc (CC=clang CXX=clang++) and then got linking errors related to the same ifuncs.
The clang errors are a separate thing, in that it apparently uses implicit ifuncs as a result of
Quick thought while I was struggling with this beast: I propose we simply build the library with gcc and
I agree with you that the current patch in master dropping --enable-dynamic-dispatcher is fine, as with it the system works on all Linux distros. The addition of "-O3" is also important.

On Mageia I was able to benchmark all combinations, and also fftw-3.3.10, and it's hard to see the benefit of this dynamic-dispatcher, at least on my 6900HS laptop. (It would be interesting to run these tests on EPYC.)

But I think mtune=native is fine and march=native is not needed; the results between those two seemed quite the same to me. (Can you double-check?) Some people may want to run the system on a different machine than where it's built, and that may give more incompatibilities.

Anyway, here are the results from one run on each machine. I just ran the tests without reboots or temperature control between runs. When running multiple times, results can vary by a few percent, so these are more like guidelines than 100% scientific.
If you have some dirty patches for getting amd-fftw working on your machine with enable-dynamic-dispatching, those could be worth saving.
Yeah, I was just about to say that I forgot building on a VM is a semi-common use case where such flags can backfire.

For perf comparisons I took the average of 10 runs because, as you've observed, the results wobble a bit and they're all within spitting distance of each other. I don't know if the native bench of fftw is necessarily representative of how real applications are going to use it. I'll try again on my machine (including some bigger sizes) to see how much difference the additional flags make, if any. There is no downside to keeping general options like mtune=native.
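For reference, the difference between the two tuning flags being discussed (a sketch, not the project's actual build line; `plan.c` is a hypothetical source file):

```shell
# -mtune=native: schedule/tune for the build machine's CPU but keep the
#   baseline instruction set, so the binary still runs on other x86-64
#   machines (safe for packages and VM images that move around).
gcc -O3 -mtune=native -c plan.c
# -march=native: additionally use every ISA extension the build CPU has
#   (AVX2, AVX-512, ...); the binary may die with SIGILL elsewhere.
gcc -O3 -march=native -c plan.c
```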
I have a really dirty patch, and I think I can upgrade it to a not-so-dirty patch (in terms of code duplication) that would still keep AMD's dynamic dispatch working.
I agree; the addition of the missing -O3 is a huge improvement anyway that we should apply now. I did that just by patching the configure file; I don't know if you have a better way. Anyway, you noticed it, so can you make the patch?

I think one good thing could be to start a test/benchmark project that would collect all kinds of existing tests for a fast check. Some of the tests I have run can take hours, so I'm looking for something faster that's easier to run often. Sometimes I test with the gpu benchmark and with some other image learning algorithms like ViT.

Mika
I think a better way is to change the configure flags.

For benchmarking, the main issue for me would be identifying scenarios where fftw (so CPU-based FFT) actually gets stressed in a task that people fire up their notebooks for, so not the synthetic benchmark that comes with fftw. I'm basically a noob in this area; I'm still dickering around with ready-made docker images for my LLMs. :P

I agree that some sort of general benchmark framework would be handy. At the very least something that can show order-of-magnitude differences between vanilla builds and ROCm builds, so we can catch oopsies like this one where all optimization is actually missing. However, this would also need some way to check the vanilla builds, and since we don't, well, build those, it's a bit tricky. For some of the most common stuff we could probably use a popular ready-to-run container; it needn't support anything but CPU for sanity checks.
Behold: the benchmark results none of us asked for, but here they are anyway. :P Notes:
The conclusions are:
I will provide a pull to add
Add inline copies of ifunc resolvers to fix the segfault. This allows us to turn --enable-dynamic-dispatcher back on (in case it ever does anyone any good). Signed-off-by: Jeroen Mostert <jeroen.mostert@cm.com>
Follow-up from #74 with a more narrow focus.
On my system (Manjaro, latest rolling) building `amd-fftw` results in a broken `libfftw3.so` library: attempting to `dlopen` it causes a segfault. This can be fixed by changing the build so that the `--enable-dynamic-dispatcher` option is not passed, but while this produces a working library, it results in what appears to be a badly or completely unoptimized code path. For comparison, the results of a benchmark of this build on a Ryzen 5 7600:

[benchmark output not captured]

Now compare this to a benchmark of vanilla `fftw-3.3.10` built from source with GCC 14.1.1 with `configure --enable-sse2 --enable-avx --enable-avx2 --enable-avx512 --enable-openmp --enable-shared`:

[benchmark output not captured]

These are both single-CPU tests. It's obvious that the supposedly optimized AMD build is unfortunately nothing of the sort.

That raises two questions:

1. Is the lack of optimization caused by dropping `--enable-dynamic-dispatcher`, or is it a more basic issue? As setting this gives me a library that won't load at all, I can't test this.
2. How much faster than vanilla `fftw` is this build really supposed to be? Can someone verify? It's good to know if it's worth going through the hassle to get it to build, as opposed to just using the system `fftw`. If it doesn't consistently give big wins, it's at least worth considering making it optional to avoid any issues.