QUDA GPU Meltdown "bug" #166
Comments
Thorsten, this is not a QUDA issue. If the GPUs are overheating, then this sounds like there are serious issues with the JUDGE system, e.g., airflow problems. I would suggest you contact the system administrator of JUDGE and explain your problem. It's possible that there are airflow issues on this cluster. I note that almost all GPUs in a server environment are passively cooled, since an external fan provides much better cooling than the fans on actively cooled cards. You can't build a GPU cluster without using passive cooling (unless you are willing to use liquid cooling). GPUs should automatically have this "heat control" built in, whereby if they get too hot they will down-clock to bring the temperature under control. What temperatures are you seeing? A typical passively cooled GPU system should see temperatures of around 60C; the maximum reliable temperature is about 90C. Moreover, QUDA actually consumes only a fraction of the TDP of a GPU: because it is memory-bandwidth bound, you typically only get 10-20% of peak throughput. What this means is that on a GPU with a TDP of 235 watts, running QUDA you will only consume 100-150 watts at most.
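(For reference: the temperature and power draw asked about here can be read at runtime through NVML, the library behind nvidia-smi. A minimal sketch, assuming device index 0 and omitting most error handling:)

```c
/* Minimal NVML sketch (not from this thread): report temperature and power
 * draw for GPU 0, to compare against the ~60C typical / ~90C max figures
 * and the 235 W TDP mentioned above. Compile with -lnvidia-ml. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int temp_c = 0, power_mw = 0;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp_c);
        nvmlDeviceGetPowerUsage(dev, &power_mw);  /* reported in milliwatts */
        printf("GPU 0: %u C, %.1f W\n", temp_c, power_mw / 1000.0);
    }
    nvmlShutdown();
    return 0;
}
```

(The same readings are available without any code via `nvidia-smi -q -d TEMPERATURE,POWER`.)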
Hi Mike, the problem only occurs for me when I use QUDA. I have asked the admins for some performance and power-consumption data. The GPUs are old, but they still should not die, I agree. However, they said that they replace every single one that fails, so if I just continue running, I basically get a new cluster to run on. Best
And, what is very strange, is that the M2050s die almost immediately, after an hour or so, while the M2070s can run for 12 hours without (almost) any issue. This morning I got the first DBE on one of two M2070 GPUs.
I have an NVIDIA colleague who is based at Jülich. Perhaps I should put you two in direct contact, as he will be better placed to help resolve what's going on.
Closing, since this is not a QUDA bug.
Hey,
sorry for the pathetic naming, but cool "bugs" have to have cool names :).
I run QUDA on JUDGE at FZ Jülich in Germany. It has M2050 as well as M2070 Tesla GPUs, and they are passively cooled (yes, that exists). When I run dozens of inversions on a card I get double-bit errors, meaning that the ECC parity check cannot recover some flipped bits in memory. Shortly after these messages, the GPU dies due to overheating. It crashes in the QUDA part of the code, and I guess the reason for that is its efficiency. So it is not strictly a "bug" but an inconvenient feature. Would it therefore be possible to add some heat control as an optional feature, so that the code halts for a couple of cycles if the GPU gets too hot? Is that possible? This sounds crazy, but on the JUDGE machine the problem is serious: I have already burnt >50 GPUs within 3 months.
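(QUDA itself has no such throttle, but an application-side check along these lines is possible; a rough sketch, with the temperature threshold and sleep interval purely illustrative:)

```c
/* Hypothetical user-side "heat control" (not a QUDA feature): call this
 * between inversions; it blocks until the GPU cools below max_temp_c.
 * Assumes NVML has been initialised and dev obtained via
 * nvmlDeviceGetHandleByIndex(). */
#include <unistd.h>
#include <nvml.h>

static void wait_until_cool(nvmlDevice_t dev, unsigned int max_temp_c)
{
    unsigned int temp_c = 0;
    while (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp_c) == NVML_SUCCESS
           && temp_c > max_temp_c) {
        sleep(10);  /* let the card shed heat before launching more work */
    }
}
```

(Calling e.g. wait_until_cool(dev, 85) before each solve would trade throughput for temperature, though it would not address the airflow problem discussed above.)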
Shall I compile with host- and device-debug first to see what really happens, and then send the output to someone?
Best
Thorsten