Really bad performance with TensorFlow #669
-
Hi, first of all, good job on getting the ROCm stack up on Arch; it was a huge hassle a couple of months back when I tried. Anyway, I'm getting really bad performance with TensorFlow using the rocm-arch 3.8 backend on my Radeon VII. Benchmark logs for Arch and for another run on Ubuntu 20.04, with more than a 5x difference, are attached. I also installed rocm-dkms from the AUR and booted into 5.4.70-1-lts with no difference. Can someone explain what could be throttling this?
Replies: 24 comments
-
@DarjanKrijan thanks for reporting this. Based on your observations, the issue is most likely not TensorFlow itself, so we should go one level deeper. If you could go through the dependencies that TensorFlow requires (hip-rocclr, etc.), run their test suites on both Arch and Ubuntu, and report the results (timings), we can take this conversation further. I'm afraid that without this information we can't get to the bottom of it. Before that, though, let's try to resolve the differences shown in your log files.
-
Differences in your log files: on Arch we basically get an incorrect value reported. I guess we should start with these two. Going through TF's codebase, here is where the value should be properly set. This looks like an issue with HIP, so I guess we should start by making sure the HIP tests work.
-
@tpkessler, do you have any idea why the reported memory bandwidth would be incorrect here?
-
Hmm, it's reassuring that the GPU is being used, but the code is the same between Arch and Ubuntu (they are on the same versions). It is most likely a discrepancy in how the ROCm packages are built on Arch vs Ubuntu, imo.
Let's try to limit the independent variables and do our tests with the tensorflow-rocm pip package instead of the AUR package. For this particular issue, though, we should try to run the same HIP function tensorflow-rocm is calling in a standalone HIP C++ file and see if we can reproduce it outside the TensorFlow environment.
https://github.com/ROCm-Developer-Tools/HIP/blob/master/CONTRIBUTING.md#unit-testing-environment
Ideally we want these unit tests to happen in the PKGBUILD.
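To go along with the suggestion of testing against the tensorflow-rocm pip package, a quick sanity check such as the following could confirm which TensorFlow build is actually being imported and whether the GPU is visible before comparing Arch and Ubuntu numbers. This is my own sketch, not something from the thread; the exact keys returned by get_build_info() (e.g. is_rocm_build) vary between TensorFlow versions.

```python
# Sanity check: which TensorFlow build is in use, and is the GPU visible?
import tensorflow as tf

print("TF version:", tf.__version__)
# get_build_info() is available on TF >= 2.3; key names may differ by version.
print(dict(tf.sysconfig.get_build_info()))
print("GPUs:", tf.config.list_physical_devices("GPU"))
```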
-
With my Vega 56 I get the correct bandwidth. Do you both use a Radeon VII? TensorFlow reports the same. But: 98% of my VRAM is being blocked by TensorFlow, so the allocation of tensors needed for the benchmark fails.
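A minimal sketch (my own, assuming the TF 2.x config API) of how to keep TensorFlow from reserving nearly all of the VRAM up front, so that allocations made later by the benchmark do not fail:

```python
# Ask TensorFlow to grow GPU memory on demand instead of grabbing ~all VRAM.
# This must run before the GPU is initialized, i.e. before any op touches it.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```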
-
I had a look at the source code and have no idea how to compile and run them. The directory
-
Also, another log with the pip package is attached; ROCm Fusion is enabled here but doesn't seem to be the culprit. I will check whether it somehow works when I symlink (some of) the Ubuntu packages into /opt/rocm, to see if the performance changes (probably later today/tonight).
-
I see, thanks for the hint. These are my results: far better than your results on Arch Linux and, extrapolating from the Radeon VII numbers, on par with Ubuntu.
-
I forgot to mention: I'm using the pip package as suggested by @acxz.
-
Interesting, yours seems to work properly. Maybe compare the installed rocm-arch packages? I haven't modified any PKGBUILDs except the tensorflow one, slightly (but I'm using the pip package for now anyway).
-
I haven't tested on my machine (I'm not near my AMD machine); I was just trying to debug based on the logs.
Hmm, interesting. I gave it a shot and couldn't find the targets. I guess we need to look deeper into this, or we might need to open an issue over at the HIP repo. It is reassuring that it works for you, @tpkessler. @DarjanKrijan, can you try the same batch size?
Well, I guess the logs are now exactly the same, which is good, but that means the problem is somewhere deeper. I'm going to put the title back to what it was previously, since the log differences are not an issue anymore.
Also, even if the PKGBUILDs are the same version, there can still be differences depending on the environment they were built in. It is recommended to build the packages in a clean chroot or to use the arch4edu binary packages to keep the builds consistent. I doubt that is the issue here, but it is still important to keep in mind. Coincidentally
-
I just used the tensorflow-rocm pip package that time around, which you can verify from the log. Reducing the batch size actually improved things, though not to the expected levels. Back to my theory that bad compute kernels are produced by the backend on my Radeon VII - some bad access patterns might be hogging the memory bandwidth or clogging it with requests... that would explain why the throughput is better with smaller batch sizes.
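A rough way to quantify the batch-size effect is a small throughput sweep. The sketch below is my own stand-in for tf_cnn_benchmarks (synthetic data, a stock Keras ResNet50); the absolute numbers won't match the benchmark, but the relative scaling across batch sizes should show the same pattern.

```python
# Toy throughput sweep over batch sizes on synthetic data.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

for batch in (32, 64, 128):
    x = np.random.rand(batch * 4, 224, 224, 3).astype("float32")
    y = np.random.randint(0, 1000, size=batch * 4)
    model.fit(x, y, batch_size=batch, epochs=1, verbose=0)   # warm-up
    start = time.time()
    model.fit(x, y, batch_size=batch, epochs=3, verbose=0)
    print(f"batch {batch}: {3 * batch * 4 / (time.time() - start):.1f} images/sec")
```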
-
These are the packages I have installed: packages
-
So I did some tinkering. Symlinking /opt/rocm to the Ubuntu stack on my other boot didn't work out; it segfaults left and right. I reinstalled with the packages provided by arch4edu (before, I had built the ROCm stack myself, directly from the AUR), but now I'm getting other errors about a missing ISA or something. Checking the error messages led me to this. Could some rocm-arch/arch4edu packages be missing the gfx906 (Radeon VII) ISA?
-
That could be the case: if arch4edu isn't compiling on gfx906 hardware (so that the target can be autodetected), or if gfx906 is not a specified build target, then I would expect you to see these errors. The way around this is either to figure out which packages are missing the gfx906 target and add it, or to build the stack on a gfx906 machine so that the target is autodetected when building from source. The first solution is the proper one, although the second is probably the one you want to go with, even though you just removed your source builds and would have to rebuild the entire stack (feels bad).
-
Having the same issue with a Vega FE: bad performance, and a performance improvement with smaller batch sizes. TensorFlow uses the entire VRAM according to both 'radeontop' and 'rocm-smi --showmeminfo all', regardless of batch size. There is no significant difference when using XLA, eager vs graph mode, or TF_ROCM_FUSION_ENABLE. The TF version is 2.4, built from source from the PKGBUILD provided here. I'll try building all necessary packages locally instead of using arch4edu and report back if there is any difference.

rocminfo output:
```
===================== HSA System Attributes =====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
```
-
Something I noticed, but I don't know if it is the issue yet:
-
Hi @tarcey, have you compared the performance of tensorflow on Arch Linux with that on Ubuntu?
The first is a wrapper around the CUDA compiler designed by nVidia; the latter is what you have to use if you have an AMD GPU. Does the problem of full VRAM also appear in other applications? I'm thinking of OpenCL code or pure ROCm code like matrix factorisation by
-
I measure the performance using tf_cnn_benchmarks.py and compare with the results other people have been getting with the same or similar hardware at the same settings. I get only around 60-70 images/sec on resnet50, even though the GPU is fully utilized and pinned at a 1.6GHz shader clock (I adjusted the power limit). If all else fails I will probably resort to dual-booting another distro, or try Docker first.
I have dug a bit deeper into the memory usage issue, and it seems that this is not an issue at all but expected behavior from TensorFlow. There are options to disable this behavior, e.g. by setting the TF_FORCE_GPU_ALLOW_GROWTH environment variable to true, but it does not resolve the performance issue. To address something that you mentioned earlier: I do not get the memory bandwidth issue that you and others described, so the bad performance in TF might not be related to that. At the same time, the fact that smaller batch sizes give better throughput does point to a memory access issue.
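For reference, a minimal sketch of the environment-variable approach mentioned above; the variable has to be set before TensorFlow initializes the GPU, so it goes before the import (this only changes the allocation behaviour, not the throughput):

```python
# Enable on-demand VRAM growth via the environment variable.
import os
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf  # import after setting the variable

print(tf.config.list_physical_devices("GPU"))
```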
-
I found the problem/solution: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/master/doc/src/perfdatabase.md
It seems MIOpen ships with outdated or miscalculated tuning parameters for its algorithms on some GPUs. I stumbled on this when comparing the 56cu and 64cu variants of the gfx900 in /opt/rocm/miopen/share/miopen/db/. According to the measurements there, one would get the impression that the 56cu variant is much faster than the 64cu variant, which can't be right.
Running the benchmark again with MIOPEN_FIND_ENFORCE=DB_UPDATE yields much more sensible results. After this, running it again at a different batch size (128), but without the environment variable, leads to very low performance again (~50-60), so it seems the tuning parameters are specific to the batch size. After tuning that batch size as well, rerunning both tests without the environment variable leads to the expected (good) results, which is good, because the tuning takes a rather long time. Overall it is very close to what I was expecting (~180 img/sec). I already tested this with rock-dkms and a compatible kernel and get ~170 img/sec, and further, enabling XLA with the
Notes:
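A sketch of how this workaround could be applied from Python rather than exporting the variable in the shell. The model and shapes here are just illustrative stand-ins; any convolution workload at the batch size you actually care about will trigger the tuning, and the tuned parameters are then persisted in the per-user MIOpen perf DB for later runs without the variable.

```python
# Force MIOpen to re-tune and update its performance database for this workload.
import os
os.environ["MIOPEN_FIND_ENFORCE"] = "DB_UPDATE"

import numpy as np
import tensorflow as tf  # import after setting the variable

model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
x = np.random.rand(64, 224, 224, 3).astype("float32")
y = np.random.randint(0, 1000, size=64)
model.fit(x, y, batch_size=64, epochs=1)  # slow: this run performs the tuning
```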
-
Wow, that's an impressive analysis! Thank you very much, @tarcey. Currently, TensorFlow does not work on my PC (maybe related to Python 3.9), so I cannot check this with my Vega 56. Can you reproduce the results, @DarjanKrijan?
-
I did not have the time to check again. Last time I tried, the version numbers of the packages were mixed (3.X and 4.0) and something didn't work out. This is problematic in general; maybe arch4edu should point to a stable branch where all the versions are updated at once and then build a clean stack from that. I will try again towards the weekend.
-
Considering it has been more than a year since the last comment on this thread, I'm marking this solved with @tarcey's analysis.