Memory leak during model.fit() #5935
Hi @HareeshBahuleyan, I used your model as a target to monitor memory usage.

```python
# ENV: MacBook Pro 2012, Keras 2.0.1, Theano 0.9.0.dev-a4126bcced010b4bf0022ebef3e3080878adc480
import resource
import numpy as np
from keras.callbacks import Callback

class MemoryCallback(Callback):
    def on_epoch_end(self, epoch, log={}):
        # Print the process's peak resident set size after each epoch
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

# ... (model definition)

X_train = np.random.rand(20000, 1, 30, 300)
Y_train = np.random.randint(0, 2, size=20000)
X_test = np.random.rand(20000, 1, 30, 300)
Y_test = np.random.randint(0, 2, size=20000)

model.fit(X_train, Y_train, batch_size=32, epochs=10,
          validation_data=(X_test, Y_test), verbose=0, callbacks=[MemoryCallback()])
```

The result shows no obvious memory leak. Can you add the callback to monitor memory usage on your own data?
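The model definition is elided above; a minimal hypothetical stand-in that makes the snippet runnable (any small binary classifier over the same input shape would do) could be:

```python
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=(1, 30, 300)))   # matches the X_train shape above
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```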
Hi @joelthchao
With batch_size=1, ...
Moreover, when I monitor memory usage with the command ... I am running it on a server with an AMD Opteron(tm) Processor 4284. A similar issue has been raised in #5924.
@HareeshBahuleyan Your memory usage is too low, which is quite weird.
@joelthchao Yes, I noticed that too, in spite of having a larger train and test set.
@HareeshBahuleyan I switched to another environment which is almost the same as yours.
Does your testing script involve any other operations which are not shown here?
@joelthchao No, I don't have any other operations other than this (just loading the data before this).
I get the callback output as ... However, I also monitor the memory usage with the command ... As you can see, the free memory keeps decreasing and the process eventually gets killed (if it is run for too many epochs). Also, as I mentioned, this free memory remains constant with batch_size=1.
@joelthchao I am loading hickle and pickle objects. The script can be found here. Monitoring memory with ... I don't think this is an issue with this specific network. I have tried other network architectures on the same machine and I still face the same issue.
Update: I ran the code on another Ubuntu machine and did not face any memory leak issues. This could mean that the issue is present only on certain CPUs, as mentioned in #5924. Thanks for the help!
I got the same issue. Keras 2.0.2 on Windows with the Theano backend. The memory consumption keeps increasing and finally the program crashed.
I got the same issue on Linux with Keras 2.0.2, but it works fine on macOS Sierra.
@duykienvp Please help run the test script to verify the memory leak problem.
Python 2.7.13, macOS El Capitan 10.11.6: no problem (3083554816).
---- System 1 (Google Cloud Engine): 3679756
---- System 2 (Google Cloud Engine, same machine as System 1): 3015396
---- System 3: 3025858560
@duykienvp
I am getting the same issue, but on Windows. Memory leaks. I am using this example script: https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
I have reproduced the issue on Keras 2.0. and 2.0.3 (Python 2.7.11, Theano 0.9.0, numpy 1.12.1) on Windows 10.
This issue was not happening before I upgraded from Keras 1.2 to 2.0, so I suspect it is with one of the dependent C libraries, which were also upgraded.
Are the Theano devs aware of this? If not, please open an issue there.
Same problem here; CentOS 6, Python 2.7.13, Keras 2.0.2, Theano 0.9.0. Running on CPU. Would appreciate suggestions for a solution.
There is a PR to fix this in Theano.
The problem only happens if Theano can't link directly to BLAS. One workaround that should also speed up computation is to install a good BLAS library that Theano can reuse.
…On Mon, Apr 10, 2017 at 10:05, hft7h11 <notifications@github.com> wrote:
@fchollet <https://github.com/fchollet>
I have commented on the Theano bug ticket. In the meantime this is a bit of a blocker for Keras 2.0 Theano use. Reverting to Theano 0.8.2 fixes the memory leak, however certain layers such as MaxPooling2D seem to depend on Theano 0.9.0 as per #5785.
I can confirm the fix mentioned by @nouiz. Properly linking to MKL solved the problem.
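For anyone unsure whether their install is affected, a quick check of the BLAS linkage (a sketch, assuming Theano 0.9's config attribute names; the flags themselves can be set in the [blas] section of ~/.theanorc) is:

```python
import theano

# An empty string here means Theano found no BLAS library to link against
# and falls back to its internal implementation, which is the case where
# this leak (and a large slowdown) shows up.
print(theano.config.blas.ldflags)
```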
The leak was fixed in the master branch of Theano. I would recommend linking to a good BLAS; this will give you a speed-up at the same time. Otherwise, update Theano to the dev version.
I think this issue can be closed.
@nouiz With the Theano dev version I get a traceback ending in "'module' object has no attribute 'ifelse'". If I just install Theano using pip install Theano==0.9, the code won't break, but I still have the memory issue.
@hft7h11 I got the same error, "'module' object has no attribute 'ifelse'". Is there a good way to solve the problem?
Python 3.6.1, Keras 2.0.2, Tensorflow 1.0.1, Ubuntu 16.04. I load data using pickle and had a similar memory leak when using model.fit.
Link properly against MKL as described in my previous post and it will be solved.
I have the same problem. I tested the same code under Python 2 and it does not have this issue; it only happens with Python 3. @gzapatas His method can solve the leaking problem, thanks! Run ... If you run the program on GPU, other packages are also highly recommended:
I had the same problem. Solved by switching to the TensorFlow backend.
I had the same problem (TensorFlow backend, Python 3.5, Keras 2.1.2).
Hi there, I had the same problem (TensorFlow backend, Python 3.5, Keras 2.1.3, Ubuntu 17.10, GPU: Nvidia GTX 1060, 16 GB RAM, all installed on a 128 GB SSD).
@joelthchao I'm having the same problem with the TensorFlow backend (like @BIGBALLON), on Ubuntu 16.04 with Keras 2.2.0 and tensorflow-gpu 1.5.0. The RAM gets eaten up after each epoch, so if the model is trained for too many epochs I'm out of RAM. I use a custom generator with use_multiprocessing=False and workers=1.
Similar problem on Ubuntu 18.04 with the TensorFlow backend, CPU only: memory is successively eaten by the Keras fit function with a simple 2-layer DNN.
This works.
I have exactly your problem with my custom generator. TensorFlow 1.12, Keras 2.2.4 & Ubuntu 18.04. Did everyone solve it just by typing the following?
sudo apt-get install libblas-dev
Hi everyone,
you can use Google Colaboratory to work around the memory leak.
https://colab.research.google.com/notebooks/gpu.ipynb
good luck
Was this thread ever closed? I have the same issue. Simple feed-forward network with no looping. Ubuntu 18.04, Keras 2.2, TensorFlow 1.13. I've tried most of the solutions in this thread (except going to Colab instead of Jupyter). Nothing seems to fix the memory leak.
Hi David,
why don't you try this command:
'sudo apt-get install libblas-dev'? It also installs the BLAS library that Theano depends on, and after that I didn't have any memory leak anymore.
I also recommend using Google Colab, which is free and very similar to Jupyter.
Thanks Mary. Unfortunately I'm using TensorFlow and not Theano. Also, libblas-dev was already installed anyway. Do you have any other suggestions?
Dear David,
I usually use Google Colaboratory to get around the problem.
Check it out and do not hesitate to ask me your questions.
https://colab.research.google.com/drive/1pGErg2HaWfFVa3jCCFSDRGNjlkDaqzE4#updateTitle=true&folderId=1xFFASkjeHGhgH2EWfUBHY2QzErXu6Dpn
Best,
Maryam
I finally found the issue for me. TensorFlow 1.14 has this memory leak but 1.13 does not.
For Ubuntu and TF 2.0 using the Keras backend, I was able to work around (but not solve) this problem. I recommend re-opening the issue; my "solution" involves potentially double the amount of computation time:
I found the same problem with TF2.2 and tf.keras model.fit().
I am also having this issue with TF2.2 and model.fit in 5-fold cross-validation. I explicitly delete all the objects, clear the session, and call the garbage collector at the end of the fit function for every iteration, but I still get a memory buildup, so I am limited despite having 4 P100s being used in parallel and 120 GB of RAM.
I'm also having this issue with TF2.3.1. Using tf.compat.v1.disable_v2_behavior() fixed it.
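For reference, a minimal sketch of where that call goes (it has to run at the top of the script, before any models or layers are created):

```python
import tensorflow as tf

# Switch TF 2.x back to 1.x-style graph behavior before building anything.
tf.compat.v1.disable_v2_behavior()

# ... build and fit the model as usual afterwards
```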
@justinmulli @yymarcin I suggest doing ...
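The rest of the suggestion is cut off above; a minimal sketch of the clear-session / delete / collect cleanup pattern discussed in the replies below (build_model here is a hypothetical helper that returns a compiled model) might look like:

```python
import gc
from tensorflow import keras

def train_one_fold(build_model, x_train, y_train):
    model = build_model()                      # fresh model for this fold
    model.fit(x_train, y_train, epochs=10, batch_size=32, verbose=0)
    weights = model.get_weights()              # keep only what you need past this fold
    keras.backend.clear_session()              # drop the backend graph state
    del model                                  # drop the Python reference
    gc.collect()                               # then ask the GC to reclaim it
    return weights
```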
Thanks for your reply, but as I mentioned in my comment, I already do those things. I will try what yymarcin did to fix it.
@justinmulli If that does not work, let me know if you actually followed the exact order and instructions I suggested. I do not know what "explicitly deleting objects" means, but using the Python command ...
Thanks again for the quick reply. I basically delete everything, and that includes del model. I clear the session, then delete the model and other objects, and then finally call gc.collect(). I can try calling gc.collect() before (when I get back from holiday next week), but as you said yourself, it shouldn't be any better than calling it after.
I followed the exact order and instructions you suggested and it still does not work.
While I was hoping this would work for me, it doesn't allow using the class_weights argument in model.fit with distributed training and hence results in an error message.
Hi there,
you can all use Google Colaboratory instead.
https://colab.research.google.com/notebooks/intro.ipynb
This works. I had the issue on Kaggle notebooks when fitting a model twice, and this solved it.
I'm using TF 2.3.0 and TF 2.4.0 and see this issue a lot. I dug into the source code and there is a race-condition issue that keeps increasing memory. For the training data generator, the model.fit function initializes it only once, and a for loop inside the class then iterates over the data repeatedly, forever. At the end of every full epoch it closes the process pool and the garbage collector starts to work, but not immediately if your data is huge. At the same time, fit creates a new validation generator immediately, even before the GC finishes. This new validation generator creates a new process pool which inherits the training generator's leftover memory (Python on Linux by default uses a copy-on-write fork to inherit shared memory) and copies it into the new process pool; that is one layer of memory leak. Then, once validation finishes, the same thing happens with the new training-generator process pool, which copies the leftover memory from the validation generator again. It is like rolling a snowball, and the memory keeps growing. I tried adding ... You can validate this by adding a sleep in ...
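The exact patches tried above are cut off; as a rough illustration of the "give the GC time at the epoch boundary" idea, a callback like the following could be passed to fit() (a hypothetical mitigation, not a fix for the underlying race):

```python
import gc
import time
from tensorflow.keras.callbacks import Callback

class EpochBoundaryCleanup(Callback):
    """Collect garbage (and optionally pause) at each epoch boundary, so the
    old generator pool's memory is more likely to be freed before the next
    pool is forked."""
    def __init__(self, pause_seconds=0.0):
        super().__init__()
        self.pause_seconds = pause_seconds

    def on_epoch_end(self, epoch, logs=None):
        gc.collect()
        if self.pause_seconds:
            time.sleep(self.pause_seconds)
```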
I found the same problem. When the program starts running, the memory is only 280 MB, and it has grown to 357 MB by the time the program finishes. Python version: 3.7.12. My code is:
The printout log is:
Hi,
I am trying to train a simple CNN in keras. During training via model.fit(), the system free memory keeps reducing and eventually it runs out of memory with a "Killed" error. When I train it one epoch at a time, I can clearly see the reduction in free memory after each epoch. Is this normal?
However, when I set batch_size = 1, I see that it works fine with no memory leaks.
I am running Keras version 2.0.2 with the Theano backend on CPU.
Thanks,
Hareesh
Edit: I was not facing this issue with Keras version 1.2.
Check that you are up-to-date with the master branch of Keras. You can update with:
pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps
If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.
If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with:
pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps
Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).