CUDA memory leak via cu-device.cc ? #473

Closed · wants to merge 1 commit

Conversation

@matth commented Feb 1, 2016

Hello,

We've come across a memory leak when using nnet-train-simple; we have tracked it down to cudaDeviceReset() not being called on exit.

The effects of this bug are quite severe: during our training stage we see around 30MB of memory lost for each nnet iteration.

Over the course of the iterations in our training scripts we consume all of the memory on the machine (32GB) and cannot complete training.

The very strange thing is that this memory is never reclaimed at program exit and requires a machine reboot to recover. I am separately trying to isolate this and report it to NVIDIA if applicable.

The attached pull request is our 'fix' for the issue. It's perhaps not ideal, but I couldn't find a better place to call deviceReset() from; I tried the CuDevice destructor, but that didn't seem to work.
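
For illustration, here is a minimal, self-contained sketch of the kind of workaround the pull request applies: explicitly calling cudaDeviceReset() just before the binary exits. The allocation below is only a stand-in for the real training code, and the exact call site in Kaldi may differ.

```cpp
// Minimal sketch (not the actual Kaldi change): explicitly tear down the
// CUDA context before the process exits.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  float *d_buf = nullptr;
  cudaMalloc(&d_buf, 30 * 1024 * 1024);  // stand-in for per-iteration GPU work
  // ... training computation would happen here ...
  cudaFree(d_buf);

  // Workaround: release the device context explicitly on exit.
  cudaError_t err = cudaDeviceReset();
  if (err != cudaSuccess)
    std::fprintf(stderr, "cudaDeviceReset failed: %s\n", cudaGetErrorString(err));
  return 0;
}
```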

I have full steps to reproduce the bug in this repo:

https://github.com/matth/kaldi-cuda-leak-example

Some of the environment info is outlined below:

    $ uname -a

    Linux 3.19.0-47-generic #53~14.04.1-Ubuntu SMP Mon Jan 18 16:09:14 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

    $ cat /proc/driver/nvidia/version

    NVRM version: NVIDIA UNIX x86_64 Kernel Module  358.16  Mon Nov 16 19:25:55 PST 2015
    GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)

    $ nvidia-smi --query-gpu=index,name --format=csv

    index, name
    0, GeForce GTX TITAN X
    1, GeForce GTX TITAN X

    $ nvcc --version

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2015 NVIDIA Corporation
    Built on Mon_Feb_16_22:59:02_CST_2015
    Cuda compilation tools, release 7.0, V7.0.27

@danpovey (Contributor) commented Feb 1, 2016

OK, well I have never seen anything in the documentation of cudaDeviceReset to say that it is mandatory. In fact, I've seen a forum post saying it was optional and only helped certain diagnostics programs. So this sounds to me like a driver bug.

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1gef69dd5c6d0206c2b8d099abac61f217

http://stackoverflow.com/questions/11608350/proper-use-of-cudadevicereset

We should work around it if we can; however, I think it would be better to have it done in the destructor of class CuDevice, so that it will be automatically called at program exit without requiring any change in binary-level code.

Dan

@danpovey (Contributor) commented Feb 1, 2016

Adding Jeremy Appleyard from Nvidia, who has been helpful with NVidia
driver issues in the past.

@danpovey (Contributor) commented Feb 1, 2016

(this time with the actual email address)

@matth (Author) commented Feb 2, 2016

Hey, thanks for the reply. FWIW we have only seen this problem on our newer hardware with CUDA 7.0; we had previously managed to complete training with the same recipe and data using AWS GPU boxes and CUDA 6.5. If you or Jeremy need more info on our setup, we can provide that.

I did try to put it in the destructor with logging to ensure it was called, but in my test case the memory leak still appeared. I wondered if the device was being re-initialized somehow but didn't have time to investigate further. I'll try to pick it up again this week.

@KarelVesely84 (Contributor) commented
Hi,
I added a call to 'cudaDeviceReset()' to the destructor of 'CuDevice', after the release of CUBLAS, and a log print confirmed that it really gets called.
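
Roughly, the change looks like the sketch below; the member names (active_gpu_id_, handle_) and the exact CUBLAS call are illustrative and may not match what cu-device.cc actually uses.

```cpp
// Illustrative sketch only; the real CuDevice class in cu-device.cc has many
// more members and uses Kaldi's own error-checking macros.
#include <cublas_v2.h>
#include <cuda_runtime.h>

class CuDevice {
 public:
  ~CuDevice() {
    if (active_gpu_id_ >= 0) {   // a GPU was actually selected for this process
      cublasDestroy(handle_);    // release the CUBLAS handle first
      cudaDeviceReset();         // then tear down the CUDA context
    }
  }

 private:
  int active_gpu_id_ = -1;
  cublasHandle_t handle_ = nullptr;
};
```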

AFAIK, the 'cudaDeviceReset()' call is redundant, as the OS should dispose of all the resources of the process anyway. On the other hand, it causes no harm.

bd062ee

@matth: please check if it resolves your issue, thanks!

Cheers,
Karel.

@matth (Author) commented Feb 2, 2016

Hey @vesis84

Thanks, I just tested against your latest commit using my example (https://github.com/matth/kaldi-cuda-leak-example).

Unfortunately, with the cudaDeviceReset call in the destructor the memory leak still occurs. This is similar to what I found earlier. It is very strange, as the destructor definitely does get called.

Also, yes, I would expect that the memory should be freed regardless of a call to deviceReset() when the program exits. The fact that it never gets released at all and takes a reboot to come back is very worrying; perhaps it is indicative of a more serious error!

Thanks,

Matt

@danpovey closed this Feb 4, 2016
@lifeiteng (Contributor) commented May 14, 2016

This phenomenon occurs on my machines; one machine's hardware configuration is:

  • ASUS mainboard with multiple 980 Ti GPUs, Ubuntu 14.04, kernel 3.13.0-86-generic, CUDA 7.5/7.0

In training (using the GPU), every iteration "leaks" about 80MB of memory, and we quickly run out of memory.
@matth's solution works.

@danpovey (Contributor) commented
I'm not sure what you mean by @matth's solution; perhaps you refer to rebooting.
This is clearly a driver (or possibly OS) bug: no operating system currently in use allows a userspace program to leak system memory. I'll forward your email to Jeremy Appleyard from NVidia again.
I'd suggest updating your NVidia drivers to see if they have fixed the problem.

@danpovey (Contributor) commented

Here is what Jeremy Appleyard told me.


Sounds like a driver bug to me. As you say, updating the drivers should be
the first course of action.

If the problem persists, the best solution would be to file a bug with reproducing code through the NVIDIA website. The user would have to register here: https://developer.nvidia.com/accelerated-computing-developer. Once registered, there's a fairly prominent “Submit a New Bug” button in the member area. There's a shell script, nvidia-bug-report.sh, which ships with the drivers and will help the team diagnose the problem.

The bug submission form should give a bug ID for tracking, with which I can keep an eye on it.

Thanks,

Jeremy


I suspect that it will be hard for you to provide reproducing code, as your
setup presumably needs speech data to run, but it's possible that the
output of the nvidia-bug-report.sh script will show them something.
Anyway, first try updating the drivers.

@lifeiteng (Contributor) commented May 15, 2016

NVIDIA Driver Version: 352.39, installed from the CUDA 7.5 local run file.
Installing the newest driver caused the machine to become unreachable, so I can't SSH into it remotely and will not test this again.
Someone else solved the 'leak' problem by upgrading the driver from 361.18-r4 to 364.15.

Regarding "leak system memory": maybe the driver bug causes a kernel memory leak (I noticed that installing CUDA involves compiling against the kernel), and some driver release notes mention a kernel memory leak: "Fixed a kernel memory leak ... on Maxwell-based GPUs."

I hope this information can help people who have similar trouble.
