Memory Leak #255

Closed
staticfloat opened this issue Jun 11, 2017 · 7 comments
Comments

@staticfloat
Contributor

TensorFlow seems to be leaking memory, but I have not yet figured out where this is happening. It's not leaking Julia objects, because whos() can't account for the memory usage. A graph of the free memory in my system is shown here. You can see my system starting to swap out to disk around 20:00. I killed the Julia process around 20:44.

[screenshot: graph of free system memory over time, 2017-06-10 8:51 PM]

My best guess is that we're leaking memory within the TensorFlow C library inside my train loop. I've tried reproducing this with a smaller example like examples/logistic.jl, but of course the leak doesn't show up there. Using gdb to look at the places where mmap() is being called, the calls all come either from Julia's array allocation routines during feed_dict construction, or from Eigen inside the TensorFlow library.

I would post my code, but there's so much of it that it would be unfair to ask you to read it. Do you have any general debugging tips for tracking something like this down?
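
For reference, the overall shape of the loop is roughly this (a stripped-down sketch, not my real model: the graph, shapes, and iteration counts are made up, and the memory logging is just Sys.free_memory()):

```julia
using TensorFlow

# Stand-in for the real training loop: build a tiny graph, then repeatedly
# construct a feed_dict and run the session, logging free system memory so
# a leak below Julia shows up as a steady downward trend.
sess = Session(Graph())
x = placeholder(Float32, shape=[1000])
y = reduce_sum(x)

for i in 1:10_000
    run(sess, y, Dict(x => rand(Float32, 1000)))
    if i % 1000 == 0
        gc()  # rule out pending Julia garbage before reading the number (GC.gc() on Julia 1.x)
        println("iteration $i: ", div(Sys.free_memory(), 2^20), " MB free")
    end
end
```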

@oxinabox
Collaborator

I have also been wondering lately whether it was leaking.
I had a loop that called a function that trained and evaluated my model, in a hyperparameter sweep over the dev set.
Each iteration of the loop was independent and should have used the same amount of memory,
but I was getting OOM'd halfway through the second iteration.

I wish we could use Valgrind for this.

@malmaud
Owner

malmaud commented Jun 12, 2017

Hmm, maybe it's easiest if we just start at the parts of core.jl that deal with memory allocation and deallocation and see if there are obvious mistakes. It would be nice if we could get some kind of test code that could tell us if the memory leak was still happening, even if it's not minimal.
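
The kind of pattern to look for is roughly this (a generic sketch, not core.jl's actual code: CTensor and free_tensor! are made-up names for illustration, but TF_DeleteTensor is the real C API call that has to be reached exactly once per tensor):

```julia
# Generic sketch of the ownership pattern worth auditing: every pointer the
# TensorFlow C library hands back must reach its matching TF_Delete* call
# exactly once, usually via a Julia finalizer on the wrapping object.
mutable struct CTensor                    # illustrative wrapper type
    ptr::Ptr{Cvoid}
    function CTensor(ptr::Ptr{Cvoid})
        t = new(ptr)
        finalizer(free_tensor!, t)        # finalizer(f, x) argument order on Julia 1.x
        return t
    end
end

function free_tensor!(t::CTensor)
    if t.ptr != C_NULL                    # guard against double frees
        ccall((:TF_DeleteTensor, "libtensorflow"), Cvoid, (Ptr{Cvoid},), t.ptr)
        t.ptr = C_NULL
    end
    return nothing
end
```

The usual mistakes in this kind of code are a wrapper with no finalizer at all, a temporary object created inside a call that never gets wrapped, or a free of something the C side still references.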

@malmaud
Owner

malmaud commented Jun 12, 2017

I tried a few changes on https://github.com/malmaud/TensorFlow.jl/tree/jmm/memory_leak, not sure if they will help.

@oxinabox
Collaborator

It would be nice if we could get some kind of test code that could tell us if the memory leak was still happening, even if it's not minimal.

This is expressly what Valgrind is for.
There are long instructions for getting it working with Julia: https://docs.julialang.org/en/latest/devdocs/valgrind/

It is not minimal, and I'm not sure it will actually give useful information, but it should, particularly if the errors are indeed coming from the C side.

Rust's TensorFlow bindings are looking at Valgrinding: tensorflow/rust#69

Failing that,
something to say whether or not a leak is occurring could be cobbled together out of:
manually calling gc at top-level (see here),
and doing something like parsing the output of readstring(`ps -aux --pid=$(getpid())`), as sketched below.
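
Something along these lines (a rough sketch; ps -o rss= is just one way to read the resident set size, and readstring/gc are the Julia 0.6 spellings, read(cmd, String)/GC.gc() later):

```julia
# Rough leak check cobbled together from gc() + ps: force a collection, then
# ask the OS for this process's resident set size in kilobytes. If the number
# keeps climbing across otherwise-identical iterations, the leak is happening
# outside Julia's GC.
function resident_kb()
    gc()
    out = readstring(`ps -o rss= -p $(getpid())`)
    return parse(Int, strip(out))
end

# e.g. log it once per epoch:
# for epoch in 1:20
#     train_one_epoch()  # placeholder for the real work
#     println("epoch $epoch: ", resident_kb(), " kB resident")
# end
```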

@staticfloat
Contributor Author

@malmaud I think you may have plugged it! Memory usage is stable after 20 epochs, which it definitely never was before. Nice work! It still jumps up (significantly!) between epochs, but it drops back down to a stable point, which it did not do before. Cheers!

oxinabox reopened this Jun 13, 2017
@oxinabox
Collaborator

oxinabox commented Jun 13, 2017

@staticfloat I think we should leave this open until that branch gets merged.
Just for good practice.

@malmaud
Owner

malmaud commented Jun 13, 2017

Closed in #256

malmaud closed this as completed Jun 13, 2017