Mixed precision doesn't work properly on NVIDIA A10G GPUs #125
Comments
Can you try using
@sanatmpa1, maybe my description wasn't fully clear - all tests were performed using
@sachinprasadhs, please let me know if you need any more information about the issue found. If not, can you please change the label
@reedwm could you take a look?
Can you try again with the latest
Thanks @reedwm. I confirmed that the error still occurs on the latest
With a Titan RTX, I can reproduce with CUDA 11.2 and cudnn 8.1.1. However, I cannot reproduce with CUDA 11.3 and cudnn 8.2.4. So presumably this will be fixed when TensorFlow upgrades CUDA and/or cudnn. Even if I limit the memory usage to 10 GB instead of 24 GB, it still runs with CUDA 11.3/cudnn 8.2.4, so there is clearly a memory issue with the earlier version of CUDA/cudnn. @sanjoy, do you know when we plan on upgrading CUDA and cudnn? Is this planned for TF 2.8? @awpr, any ideas what could be causing this, or of any ways to fix it for CUDA 11.2? It seems strange that an algorithm requires 4.7 GB of memory. @goncinious, as a temporary workaround, you can manually upgrade CUDA and cudnn if you know how. Understandably, this is difficult, however.
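For reference, limiting a GPU's usable memory as described above is typically done with a logical device configuration; a minimal sketch, assuming the standard tf.config API rather than whatever exact invocation was used in that test:

```python
import tensorflow as tf

# Cap the first visible GPU at roughly 10 GiB by exposing it as a single
# logical device with an explicit memory limit (value is in MiB).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=10 * 1024)],
    )
```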
Thanks @reedwm. I've tested with NVIDIA TensorRT official Docker images, which have newer versions of CUDA/cuDNN, and I still hit the same error (see logs attached). I've tested 2 image versions that match your CUDA/cuDNN versions - both 21.04 (CUDA 11.3 / cuDNN 8.2.0) and 21.09 (CUDA 11.4 / cuDNN 8.2.4).
Steps to reproduce
Logs
Unfortunately, I still cannot reproduce with the newer CUDA/cudnn versions. I tried running the docker commands in your previous post on an A100 but could not reproduce, even when limiting the memory to 10GiB to try to reproduce the error. @nluehr, do you have access to GPUs with compute capability 8.6 that you can try to reproduce this issue on?
Reproduced on an RTX 3090 (compute capability 8.6, 24GB of memory)
Ok, so this will be fixed for all tested GPUs in cudnn 8.3.1. @sanjoy, do we plan to update cudnn to at least 8.3.1 anytime soon?
Thank you both for looking into it. I can confirm that updating cuDNN manually in the TensorRT 21.04 container to 8.3.1 or using TensorRT 21.11 (which uses CUDA 11.5 / cuDNN 8.3.0) fixes the issue. I have two questions for @nluehr for further clarification:
Did you also update CUDA here? Nevertheless, I think this is still a workaround, as we need to manually update the cuDNN version, and this CUDA/cuDNN version combination isn't part of TensorFlow's tested build configurations - https://www.tensorflow.org/install/source#gpu. Therefore, knowing whether a CUDA/cuDNN update in TensorFlow will happen (and having an ETA) would be very useful.
You can upgrade cudnn to newer minor versions (e.g., 8.3.0 over 8.2.x) without rebuilding TensorFlow. I did not update CUDA in my tests. It is generally safe to use a cuDNN built against a later CUDA of the same CUDA major version (e.g., you can use a cudnn built against CUDA 11.5 with CUDA 11.3). If you also update the CUDA toolkit, I believe you would need to rebuild TensorFlow. As you point out, "generally works" and officially tested and supported are different things. If you are looking for TensorFlow containers built and tested with the latest cuDNN/CUDA combinations, you might check out the NGC TensorFlow releases.
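As a quick sanity check for a mixed setup like this, the build-time CUDA/cuDNN versions and the GPU's compute capability can be read from TensorFlow itself. A minimal sketch, assuming TF 2.4+; note it reports the versions TF was compiled against, not the runtime libraries that were swapped in:

```python
import tensorflow as tf

# Versions TensorFlow was built against (not the libraries loaded at runtime).
build = tf.sysconfig.get_build_info()
print("built with CUDA", build.get("cuda_version"), "/ cuDNN", build.get("cudnn_version"))

# Compute capability of each visible GPU.
for gpu in tf.config.list_physical_devices("GPU"):
    details = tf.config.experimental.get_device_details(gpu)
    print(gpu.name, "compute capability:", details.get("compute_capability"))
```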
@nluehr - Thanks for your reply. After further investigation, I found that while upgrading cuDNN fixed the OOM issue observed with mixed precision on A10g GPUs (CC=8.6), the model output became non-deterministic when running on multi-GPU with mirrored strategy (i.e. it gives slightly different outputs every time I run it on the same input volume). Crucially, I found that the output was deterministic when using 1 GPU only or when I switched to full precision, suggesting that something is broken with mixed precision when used on multi-GPUs with the latest compute capability. Note that the behaviour was always deterministic when using mixed precision on GPUs with older compute capability (i.e. Tesla T4, which has CC=7.5).
Steps to reproduce
My tests were performed on the latest official TensorFlow GPU Docker image (v2.7.0), using the cuDNN upgrade solution as suggested. See steps below to reproduce the results obtained:
Follow the steps below to check that it works on 1-GPU:
Logs
a10g_cudnn_updated_test_identical_4_gpu.log
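A minimal sketch of the kind of repeat-inference check described above, with a hypothetical stand-in model and input (the real test.py and 3D U-Net are in the linked Colab):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy("mixed_float16")

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs
with strategy.scope():
    # Tiny stand-in for the real 3D U-Net; shapes are placeholders.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv3D(8, 3, padding="same", input_shape=(64, 64, 64, 1)),
        tf.keras.layers.Conv3D(1, 3, padding="same", dtype="float32"),
    ])

x = np.ones((4, 64, 64, 64, 1), dtype=np.float32)

# In the report above, outputs from separate runs of the same script were
# compared; comparing two predict() calls on fixed weights is the same idea.
out1 = model.predict(x)
out2 = model.predict(x)
print("bitwise identical:", np.array_equal(out1, out2))
```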
Determinism is not guaranteed by default, and as you observed, nondeterminism might only occur in specific cases. Running
Thank you for your reply, @reedwm. I do understand why determinism is difficult to guarantee in training (e.g. data sampling randomisation), but at inference it isn't that easy to understand, as the model and inputs are fixed. Do you mind expanding a bit on the sources of this non-determinism at inference time? My guess is that the split of both data and model operations across GPUs might cause discrepancies, but more detail would be very helpful.
The split of data across GPUs can cause discrepancies, but these discrepancies can typically be removed by calling
Another source of nondeterminism comes from using a process in TensorFlow called "autotuning". For many ops, such as convolutions, there are multiple different algorithms that can be used to compute the op. For example, convolutions can be computed as FFTs, or using matrix multiplications, or with various other algorithms. With autotuning, TensorFlow tries each algorithm the first time the op is run, then uses the fastest algorithm for subsequent runs of the op. However, if multiple algorithms take approximately the same amount of time to run, it is nondeterministic which algorithm will be fastest, so the algorithm TensorFlow selects is nondeterministic. Different algorithms may have slightly different results on the same inputs, so autotuning can cause nondeterminism. Autotuning is disabled with
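The exact switch is cut off above. Purely as an illustration, one environment variable that turns cuDNN autotuning off (my assumption, not necessarily the one being referred to) has to be set before TensorFlow is imported:

```python
import os

# Assumed switch: disable cuDNN autotuning so algorithm selection falls back
# to heuristics instead of timing-based (and thus nondeterministic) choices.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

import tensorflow as tf  # noqa: E402  (import after setting the variable)
```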
Thanks @reedwm for your insights on GPU determinism - that was very useful. Previously, I found that the cuDNN upgrade fixed the OOM issue observed with mixed precision on A10g GPUs (CC=8.6) (see #125); however, I then found that it breaks on a GPU with older compute capability (NVIDIA Tesla T4, CC=7.5) with a
In summary:
Steps to reproduce
My tests were performed on the latest official TensorFlow GPU Docker image (v2.8.0), using the cuDNN upgrade solution as suggested. See steps below to reproduce the results obtained:
Logs
The error is because you are running out of memory. It's possible that future cuDNN versions use more memory, although the overall memory usage of the model should not significantly increase. When you got the
Thanks - no, determinism isn't enabled in the script I'm using, so cuDNN >=8.3 seems to be using more memory than before. Given that I'm restricted to using this model size and input shape on a Tesla T4 GPU (16GB) - are there any options I could try to reduce the memory usage?
@nluehr, @awpr, any ideas why cuDNN 8.3 is using more memory than 8.1 on certain GPUs (despite using less on others)? I think the frontend API is not being used in either case, since TF is still compiled with cudnn 8.1, so it's not due to the frontend API. It's suspicious that 4718624784 bytes (4.7 GB) is being allocated. @goncinious, the only advice I have is to try reducing the batch size. If training,
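A minimal sketch of the batch-size suggestion for inference, with a hypothetical stand-in model (the real model and block shapes come from the Colab):

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in for the real 3D U-Net; shapes are placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.Conv3D(4, 3, padding="same", input_shape=(64, 64, 64, 1)),
])

blocks = np.ones((12, 64, 64, 64, 1), dtype=np.float32)

# batch_size=1 pushes one block through at a time, so only a single block's
# activations and convolution workspace must fit in GPU memory at once.
outputs = model.predict(blocks, batch_size=1)
```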
I don't know why cuDNN 8.3 would use more memory, but it might be informative to check what algorithm and how much scratch space 8.1 was using for the same op, to see whether the newer version is using more scratch space for the same algorithms or is no longer able to use a different less memory-hungry algorithm -- that would help narrow down whether the issue is with over-allocating memory or with breaking/removing an algorithm we were previously relying on.
I can repro the errors found in #125. I think this is an issue caused by the cudnn heuristics, which keep changing from version to version and also return different results across platforms. So, I would suggest updating to the latest cudnn, as you have already done. Then, you can try either of the two ways to work around the issue:
The weird thing about this issue is that the algorithm that requires 4718624784 bytes of workspace should be skipped in the first place, before even attempting the allocation, since it exceeds the default max limit of 4GB. I am still investigating the root cause. But I think the above should be sufficient to help in this case. Please let me know if that works for you. @goncinious
I think the root cause is the workspace used in the
Since this algo is the only one that works for this conv case and we have a 4GB max limit for the allocator, the two "4.7GB" cases will simply fail, which matches your observation in #125. I will file a bug with our cudnn team. And on your side, please try the above WARs for now. Thanks.
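The two recommended workarounds are truncated above. Purely as an illustration (these variable names are my guesses based on the frontend-API and 4 GB workspace-limit discussion, not confirmed suggestions from this thread), environment-variable workarounds of this kind are set before TensorFlow is imported:

```python
import os

# Guessed names, for illustration only -- the actual recommendations are
# truncated in this thread.
os.environ["TF_CUDNN_USE_FRONTEND"] = "1"              # opt in to the cuDNN frontend API
os.environ["TF_CUDNN_WORKSPACE_LIMIT_IN_MB"] = "6144"  # raise the conv workspace cap above 4 GB

import tensorflow as tf  # noqa: E402  (import after setting the variables)
```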
@kaixih, thank you very much for investigating and finding what looks like the root cause. I can confirm that both solutions (
Note that I noticed a significant difference in the initialisation time between the two methods (it hangs a bit after Running
I guess the new API takes longer to load than the older one, but is it expected? Do you know in which version of cuDNN the new frontend API will become "default"?
Thank you. Will the ticket be available somewhere I can access? This would be useful, so I can track progress on it as well.
Yes, when using the frontend API the warmup usually takes longer, since more engines are exposed than with the previous algorithms and we need more time to sweep over them. But after the autotuning phase, the frontend API should be faster than or at least equal to the old APIs. If not, there is a bug.
I believe the frontend API will become the default when TF is built against cudnn 8.2 or later. And we (NVIDIA) recommend using the frontend API and updating cudnn to the latest version.
I have already created the bug ticket, but it is internal to NVIDIA. Sorry for that. I think I can update this thread when I get some feedback. Btw, can you please share what the use case is, at a high level? Is it a real model or just some benchmarking? Thanks.
Thanks a lot for the very quick turnaround and for the informative answers! I found a case where the two solutions proposed with the env variables differ w.r.t. the number of blocks fed to model.predict() on a Tesla T4 GPU:
The latter case with
It can be reproduced by following the same steps as in #125 and replacing the number of blocks 5->12 (line 18) in the Colab. Not sure if this adds more information to the issue found, but I wanted to know what you think about why it fails with
Thank you - updating it here should be fine.
Sure, it's a real clinical use case: a 3D U-Net for segmentation of a large organ from a CT scan as input. The CT is too large to fit in GPU memory, so the input is first split into large blocks (each of shape 320^3 voxels), which are then fed to the model. Having a large block is important here, as we want to capture as much context as possible.
Logs
Thanks for sharing the info.
Yes, more blocks mean you use more layers, and more weights will stay in GPU memory. As mentioned in #125, we generally recommend that users switch to the frontend API.
Sorry if I didn't explain that well - by blocks I meant the number of inputs that are passed to
Ah, I see. I only noticed the
Anyway, in this case, I think the size of the model's weights should be constant. And I actually tried your colab code by modifying test_input = np.ones(shape=(15, *input_shape), dtype=np.float32) with
I believe you haven't upgraded cuDNN to >= 8.3 before running the script. However, we do need to upgrade it, so inference can work without OOM errors on both Tesla T4 (compute capability=7.5) and A10g GPUs (compute capability=8.6). To upgrade cuDNN, I'm using steps 3-5 in #125. If you first upgrade cuDNN and then run the script with #inputs=15 (as you did), you should be able to reproduce my findings.
A large U-Net 3D model configured with mixed precision fails with No algorithm worked! (see full a10g.log attached) when running inference on an NVIDIA A10G 20GB GPU (compute capability 8.6). Using the tensorflow/tensorflow:nightly-gpu Docker image, the error points to an out-of-memory issue (see full log a10g_tf_nightly.log attached).
I'm able to overcome the issue by using full precision instead (i.e. by setting mixed_precision.set_global_policy("float32")).
The same model configured with mixed precision works fine on the previous-generation Tesla T4 GPU (compute capability 7.5), which has even less GPU memory - 16GB (see full t4_tesla.log attached).
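For reference, a minimal sketch of the policy switch described above (the actual model lives in the linked Colab; only the global-policy calls are shown):

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Configuration that hits the OOM / "No algorithm worked!" error on the A10G:
mixed_precision.set_global_policy("mixed_float16")

# Workaround that avoids the error, at the cost of float16 speed/memory savings:
# mixed_precision.set_global_policy("float32")

print(mixed_precision.global_policy())
```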
System information
tensorflow:latest-gpu Docker image (sha256@fc5eb0604722c7bef7b499bb007b3050c4beec5859c2e0d4409d2cca5c14d442)
nvidia-smi outputs for both GPU types provided in attachments.
Describe the expected behavior
Mixed precision mode should not exhaust all GPU memory on the newest generation of NVIDIA A10G.
Standalone code to reproduce the issue
Steps to reproduce:
Start instance with A10G GPU
Start interactive Docker container and pass test.py (copy from Colab)
Other info / logs
a10g.log
a10g_tf_nightly.log
t4_tesla.log
a10g_nvidia_smi.log
t4_tesla_nvidia_smi.log