Issue with get-cuda-devices on 4x RTX 6000 Ada #916
Hi @WarrenSchultz, are both the …
Neither is running inside a container; they run straight from the Ubuntu shell.
Thanks for your reply. That's a bit strange, because we have never encountered a scenario where …
@arjunsuresh I'd rebooted earlier, but checked again now after getting TensorRT installed. Still no luck. I also tried completely wiping the CM folder and rebuilding from scratch with the same procedure I used on the other machines.
@arjunsuresh OK, more data. I went into the BIOS and disabled two of the GPUs, and it ran fine. Enabled the third card; also fine. Enabled the fourth, and it failed again. Given the error "Error: problem obtaining number of CUDA devices: 2", could this be a simple counting error in a script somewhere? When I ran it with 3 GPUs, I saved the output for reference.
We'll check that @WarrenSchultz. But it's unlikely, because we have tested MLPerf on a 4-GPU system.
Thanks. And just to be thorough, I tried switching which card was disabled, and it didn't affect the outcome, so it's not a card-specific issue.
Hi @WarrenSchultz. We didn't try the script on machines with 2+ GPUs. It may be a counting issue indeed. Let me check it ...
By the way, this error happens here: https://github.com/mlcommons/ck/blob/master/cm-mlops/script/get-cuda-devices/print_cuda_devices.cu#L19
I see some related discussions at pytorch/pytorch#40671. However, I don't think we have a problem with a driver/CUDA mismatch, since it works when 1 card is disabled (if I understood correctly). @WarrenSchultz - maybe you can debug this code on your system to check what happens? Is there a way to print the full error in this C++ code? Thanks a lot for your feedback - very appreciated!
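For reference, a minimal sketch of one way to surface the full error text, assuming the failing call is cudaGetDeviceCount as in print_cuda_devices.cu (the surrounding main() is illustrative, not the exact script). As a side note, cudaError_t code 2 is cudaErrorMemoryAllocation, which the CUDA runtime reports as "out of memory":

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int ndev = 0;
    // Capture the status code instead of only checking success/failure
    cudaError_t err = cudaGetDeviceCount(&ndev);
    if (err != cudaSuccess) {
        // cudaGetErrorName/cudaGetErrorString translate the numeric code
        // (e.g. 2) into readable text such as "out of memory"
        fprintf(stderr, "Error: problem obtaining number of CUDA devices: %s (%s, code %d)\n",
                cudaGetErrorString(err), cudaGetErrorName(err), static_cast<int>(err));
        return 1;
    }
    printf("Number of CUDA devices: %d\n", ndev);
    return 0;
}
```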
@gfursin Yeah, I don't think that's the problem. You are correct: it works on three cards or fewer, just not four (or possibly more, but I have no way to check that :)
Unfortunately, my C coding experience is about 20 years out of date at this point. I'm muddling through a bit, but not to the level needed for proper debugging.
@WarrenSchultz can you please share how much host memory the system has?
@arjunsuresh Sorry, I meant to include that. Originally it had 64 GB available; I've since bumped the amount allocated to WSL2 to 112 GB, but that had no effect.
Thank you @WarrenSchultz for explaining. Sorry, I'm not able to guess a solution here. It's not a proper fix, but one option is to just hardcode ndev as 1 so that the script doesn't fail here. We can then see if the benchmark run goes well.
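A rough sketch of that fallback, for illustration only (the real print_cuda_devices.cu also prints per-device properties, which this omits):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int ndev = 0;
    cudaError_t err = cudaGetDeviceCount(&ndev);
    if (err != cudaSuccess) {
        // Temporary workaround only: pretend one device is present so the
        // detection step doesn't abort, then see if the benchmark itself runs.
        fprintf(stderr, "cudaGetDeviceCount failed (%s); forcing ndev = 1\n",
                cudaGetErrorString(err));
        ndev = 1;
    }
    printf("Number of CUDA devices: %d\n", ndev);
    return 0;
}
```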
@arjunsuresh Thanks, I'll give that a shot. While digging around for solutions, I was looking at the tensorrt json file and was wondering about the value of "accelerators per node" being listed as 1 for the system (with 3 GPUs).
@arjunsuresh Hm. So, bearing in mind this is above my experience level, I tried debugging it myself and setting the value to 1, and got the same result. I then fed it through ChatGPT, which came up with code that returned the string value of the error: "Error: problem obtaining number of CUDA devices: out of memory". Doing some more looking online, it seems this may have to do with the model not fitting in GPU memory vs. system memory (which seems odd, since it works with fewer GPUs), unless it's trying to load the model from all the GPUs into the memory of a single GPU? (Which doesn't particularly make sense, but this is outside my experience by far at this point. :) I saw some guidance about changing the batch size, but passing those arguments to CM didn't have any effect.
@arjunsuresh Well, it appears the root cause is WSL2 not handling that many GPUs properly. From the link: "Workaround: I found that calling torch.cuda.device_count() before torch.cuda.is_available() circumvents the error. However, this workaround requires modifying each script to include this extra call."
Thank you for reporting the issue with the system_json -- we'll fix that, but it is only for reporting purposes and should not affect the runs.
"I tried debugging myself and setting the value to 1, and got the same result"
I've been digging into this, and posted an update to a thread on the NVIDIA developer forums. Short version: there's an initialization issue happening at the driver level somewhere. Using … Unfortunately, I'm doing performance benchmarking, and I need to actually see all four GPUs' performance together.
Oh okay. Do you think docker could help?
I think it's at the Windows driver level, but it couldn't hurt to try. What do you suggest?
Sure. Are you following these instructions for running BERT? In that case you can just switch to the "using docker" section.
Yup, that's what I've been using; I'll give it a shot, thanks. Looking at some of the other posts people made about the issue, it seems docker is affected as well, but I'm fine with using docker if I'm lucky enough that it works. :)
Same issue with Docker, unfortunately. Was worth trying, thanks.
Thank you @WarrenSchultz for trying. If it is a driver issue, could an older version help? Is dual-booting to Linux an option?
I'm currently discussing with my team whether dual-boot is an option. The driver version doesn't seem to matter. I saw a post that Ubuntu 20.04 worked for them, but I had no luck.
@arjunsuresh Still discussing whether switching to Ubuntu is an option. In the meantime, I was checking things with lower GPU counts. When I rebuilt the environment with a single GPU and then increased it to multiple GPUs after the initial test, it didn't give an unmatched-configuration error but instead threw an error because the QPS value was the wrong number type. The trace is below. What is the correct way to generate a new custom spec with CM? I've tried finding it in the docs, but have had no luck.
@WarrenSchultz Sorry for the late reply. This is due to a bug in the code which we hadn't noticed because we were always giving target_qps as an input. This PR should fix it.
Now that I've gotten the prerequisites working, I've run into another issue. Trying to run a simple BERT99 test, the get-cuda-devices section failed, so I ran it independently and got the same result. Thoughts? What other debug logs could I dig up, if that would help?
This is an Intel-based workstation with 4x RTX 6000 Ada Generation cards. Drivers are up to date, matching the versions on the other systems I've tested this on, set up with the same procedure.
(nvidia-smi output omitted)