Memory Leak #255

Closed
staticfloat opened this issue Jun 11, 2017 · 7 comments
Comments

@staticfloat
Contributor

TensorFlow seems to be leaking memory, but I have not yet figured out where this is happening. It's not leaking Julia objects, because whos() can't account for the memory usage. A graph of the free memory in my system is shown here. You can see my system starting to swap out to disk around 20:00. I killed the Julia process around 20:44.

[screenshot: graph of free system memory over time, 2017-06-10 8:51 PM]

My best guess is that we're leaking memory within the TensorFlow C library inside my train loop. I've tried reproducing this with a smaller example like examples/logistic.jl, but of course the leak doesn't show up there. Using gdb to look at the places where mmap() is being called, the calls all come either from Julia's array allocation routines during feed_dict construction, or from Eigen inside the TensorFlow library.

I would post my code, but there's so much of it that it would be unfair to ask you to read it. Do you have any general debugging tips for tracking something like this down?
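
For reference, the overall shape of the loop is roughly this (a stripped-down sketch, not my real model: the graph, shapes, and iteration counts are made up, and the memory logging is just Sys.free_memory()):

```julia
using TensorFlow

# Stand-in for the real training loop: build a tiny graph, then repeatedly
# construct a feed_dict and run the session, logging free system memory so
# a leak below Julia shows up as a steady downward trend.
sess = Session(Graph())
x = placeholder(Float32, shape=[1000])
y = reduce_sum(x)

for i in 1:10_000
    run(sess, y, Dict(x => rand(Float32, 1000)))
    if i % 1000 == 0
        gc()  # rule out pending Julia garbage before reading the number (GC.gc() on Julia 1.x)
        println("iteration $i: ", div(Sys.free_memory(), 2^20), " MB free")
    end
end
```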

@oxinabox
Collaborator

I have also been wondering lately whether it was leaking.
I had a loop that called a function that trained and evaluated my model, in a hyperparameter sweep over the dev set.
Each iteration of the loop was independent and should have used the same amount of memory,
but I was getting OOM'd halfway through the second iteration.

I wish we could use Valgrind for this.

@malmaud
Owner

malmaud commented Jun 12, 2017

Hmm, maybe it's easiest if we just start at the parts of core.jl that deal with memory allocation and deallocation and see if there are obvious mistakes. It would be nice if we could get some kind of test code that could tell us if the memory leak was still happening, even if it's not minimal.
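
The kind of pattern to look for is roughly this (a generic sketch, not core.jl's actual code: CTensor and free_tensor! are made-up names for illustration, but TF_DeleteTensor is the real C API call that has to be reached exactly once per tensor):

```julia
# Generic sketch of the ownership pattern worth auditing: every pointer the
# TensorFlow C library hands back must reach its matching TF_Delete* call
# exactly once, usually via a Julia finalizer on the wrapping object.
mutable struct CTensor                    # illustrative wrapper type
    ptr::Ptr{Cvoid}
    function CTensor(ptr::Ptr{Cvoid})
        t = new(ptr)
        finalizer(free_tensor!, t)        # finalizer(f, x) argument order on Julia 1.x
        return t
    end
end

function free_tensor!(t::CTensor)
    if t.ptr != C_NULL                    # guard against double frees
        ccall((:TF_DeleteTensor, "libtensorflow"), Cvoid, (Ptr{Cvoid},), t.ptr)
        t.ptr = C_NULL
    end
    return nothing
end
```

The usual mistakes in this kind of code are a wrapper with no finalizer at all, a temporary object created inside a call that never gets wrapped, or a free of something the C side still references.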

@malmaud
Owner

malmaud commented Jun 12, 2017

I tried a few changes on https://github.com/malmaud/TensorFlow.jl/tree/jmm/memory_leak, not sure if they will help.

@oxinabox
Collaborator

It would be nice if we could get some kind of test code that could tell us if the memory leak was still happening, even if it's not minimal.

This is expressly what Valgrind is for.
There are long instructions for getting it working with Julia: https://docs.julialang.org/en/latest/devdocs/valgrind/

It is not minimal, and I'm not sure it will actually give useful information, but it should, particularly if the errors are indeed coming from the C side.

Rust's TensorFlow bindings are looking at Valgrinding: tensorflow/rust#69

Failing that,
something to say whether or not a leak is occurring could be cobbled together out of:
manually calling gc at top-level (see here),
and doing something like parsing the output of readstring(`ps -aux --pid=$(getpid())`), as sketched below.
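
Something along these lines (a rough sketch; ps -o rss= is just one way to read the resident set size, and readstring/gc are the Julia 0.6 spellings, read(cmd, String)/GC.gc() later):

```julia
# Rough leak check cobbled together from gc() + ps: force a collection, then
# ask the OS for this process's resident set size in kilobytes. If the number
# keeps climbing across otherwise-identical iterations, the leak is happening
# outside Julia's GC.
function resident_kb()
    gc()
    out = readstring(`ps -o rss= -p $(getpid())`)
    return parse(Int, strip(out))
end

# e.g. log it once per epoch:
# for epoch in 1:20
#     train_one_epoch()  # placeholder for the real work
#     println("epoch $epoch: ", resident_kb(), " kB resident")
# end
```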

@staticfloat
Contributor Author

@malmaud I think you may have plugged it! Memory usage is stable after 20 epochs, which it definitely never was before. Nice work! It still jumps up (significantly!) between epochs, but it drops back down to a stable point, which it did not do before. Cheers!

oxinabox reopened this Jun 13, 2017
@oxinabox
Collaborator

oxinabox commented Jun 13, 2017

@staticfloat I think we should leave this open until that branch gets merged.
Just for good practice.

@malmaud
Owner

malmaud commented Jun 13, 2017

Closed in #256

malmaud closed this as completed Jun 13, 2017