CUDA memory leak via cu-device.cc? #473
Conversation
OK, well I have never seen anything in the documentation of cudaDeviceReset (http://stackoverflow.com/questions/11608350/proper-use-of-cudadevicereset). We should work around it if we can; however, I think it would be better to [...]

Dan
Adding Jeremy Appleyard from Nvidia, who has been helpful with NVidia [...]
(This time with the actual email address.)
Hey, thanks for the reply. FWIW, we have only seen this problem on our newer hardware with CUDA 7.0; we had previously managed to complete training with the same recipe and data using AWS GPU boxes and CUDA 6.5. If you or Jeremy need more info on our setup we can provide that. I did try to put it in the destructor, with logging to ensure it was called, but in my test case the memory leak still appeared. I wondered if the device was being re-initialized somehow but didn't have time to investigate further. I'll try to pick it up again this week.
Hi, AFAIK the cudaDeviceReset() call is redundant, as the OS should dispose [...] @matth: please check if it resolves your issue, thanks!

Cheers,
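For context, the change under discussion puts cudaDeviceReset() in the CuDevice destructor. Below is a minimal sketch of that pattern; the class and method names are illustrative, not Kaldi's actual cu-device.cc code:

```cpp
// Minimal sketch: calling cudaDeviceReset() from a singleton's destructor.
// Names are illustrative, not Kaldi's actual code.
#include <cstdio>
#include <cuda_runtime.h>

class CuDeviceSketch {
 public:
  ~CuDeviceSketch() {
    if (active_) {
      // Destroys the primary context and releases the device's resources
      // for this process; runs when the static instance is torn down at exit.
      cudaError_t err = cudaDeviceReset();
      fprintf(stderr, "cudaDeviceReset() returned: %s\n",
              cudaGetErrorString(err));
    }
  }
  void Initialize(int device_id) {
    cudaSetDevice(device_id);
    active_ = true;
  }

 private:
  bool active_ = false;
};

// A static instance whose destructor fires during normal program exit.
static CuDeviceSketch g_cu_device;

int main() {
  g_cu_device.Initialize(0);
  // ... GPU work would happen here ...
  return 0;  // g_cu_device's destructor calls cudaDeviceReset().
}
```

The catch with this placement is that a static singleton's destructor runs during C++ static teardown, after main() returns, so its ordering relative to other static objects and the CUDA runtime can matter.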
Hey @vesis84, thanks, I just tested against your latest commit using my example (https://github.com/matth/kaldi-cuda-leak-example). Unfortunately, with the cudaDeviceReset call in the destructor the memory leak still occurs. This is similar to what I found earlier. It's very strange, as the destructor definitely does get called. Also, yes, I would expect that the memory should be freed when the program exits, regardless of a call to deviceReset. The fact that it never gets released at all and takes a reboot to come back is very worrying. Perhaps indicative of a more serious error!

Thanks, Matt
This phenomenon occurs on my machines as well; one machine's hardware configuration is: [...]

In training (using the GPU), every iteration "leaks" about 80 MB of memory, so we quickly run out of memory.
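One way to quantify a per-iteration loss of system memory like this is to sample available memory around each iteration. A hedged, Linux-specific sketch follows; the helper is hypothetical and not part of Kaldi:

```cpp
// Sketch: log Linux system memory around a training iteration to quantify
// a per-iteration leak. Linux-specific; not part of Kaldi.
#include <cstdio>

// Returns MemAvailable (or MemFree on older kernels) in kB, or -1 on failure.
static long AvailableKb() {
  FILE *f = fopen("/proc/meminfo", "r");
  if (!f) return -1;
  char line[256];
  long kb = -1;
  while (fgets(line, sizeof(line), f)) {
    if (sscanf(line, "MemAvailable: %ld kB", &kb) == 1) break;
    if (kb < 0) sscanf(line, "MemFree: %ld kB", &kb);
  }
  fclose(f);
  return kb;
}

int main() {
  long before = AvailableKb();
  // ... run one training iteration here ...
  long after = AvailableKb();
  printf("iteration consumed %ld kB of system memory\n", before - after);
  return 0;
}
```

Note that because this leak reportedly survives process exit, per-process RSS would not capture it; sampling system-wide available memory can.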
I'm not sure what you mean by @matth's solution -- perhaps you refer to [...]
Here is what Jeremy Appleyard told me:

"Sounds like a driver bug to me. As you say, updating the drivers should be [...] If the problem persists, the best solution would be to file a bug with [...] The bug submission form should give a bug ID for tracking with which I can [...] Thanks, Jeremy"

I suspect that it will be hard for you to provide reproducing code, as your [...]
"leak system memory" : maybe the driver BUG causes kernel memory leak(I noticed that installing CUDA compile with kernel), some driver release notes mentioned kernel memory leak Fixed a kernel memory leak ... on Maxwell-based GPUs. I hope this information can help people who have similar trouble. |
Hello,
We've come across a memory leak when using nnet-train-simple; we have tracked it down to not calling cudaDeviceReset() on exit.
The effects of this bug are quite severe: during our training stage we see around 30 MB of memory lost for each nnet iteration. Over the number of iterations in our training scripts we consume all the memory on the machine (32 GB) and cannot complete training.
The very strange thing is that this memory is never reclaimed at program exit and requires a machine reboot to recover. I am separately trying to isolate this and report it to NVIDIA if applicable.
The attached pull request is our 'fix' for the issue. It's perhaps not ideal, but I couldn't find a better place to call deviceReset from; I tried the CuDevice destructor, but that didn't seem to work.
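For reference, one alternative placement that sidesteps static-destructor ordering is to register the reset with atexit(). This is a hedged sketch, not the actual pull request; the function name is illustrative:

```cpp
// Sketch: ensure cudaDeviceReset() runs at normal program exit by
// registering it with atexit(). Illustrative only; not the actual
// Kaldi pull request.
#include <cstdlib>
#include <cuda_runtime.h>

static void ResetCudaDevice() {
  // Destroys the primary context and releases all device resources
  // held by this process.
  cudaDeviceReset();
}

int main() {
  cudaSetDevice(0);              // Initialize the CUDA context.
  std::atexit(ResetCudaDevice);  // Called after main() returns.
  // ... GPU training work would happen here ...
  return 0;
}
```

A handler registered with atexit() during main() runs before the destructors of static objects constructed earlier, which keeps the reset ahead of most static teardown.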
I have full steps to reproduce the bug in this repo:
https://github.com/matth/kaldi-cuda-leak-example
Some of the environmental info is outlined below: [...]