Memory leak during model.fit() #5935
Hi @HareeshBahuleyan, I used your model as a target to monitor memory usage.

```python
# ENV: MacBook Pro 2012, Keras 2.0.1, Theano 0.9.0.dev-a4126bcced010b4bf0022ebef3e3080878adc480
import resource
import numpy as np
from keras.callbacks import Callback

class MemoryCallback(Callback):
    def on_epoch_end(self, epoch, log={}):
        # Print the process's peak resident set size after each epoch
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

# ... (model definition)

X_train = np.random.rand(20000, 1, 30, 300)
Y_train = np.random.randint(0, 2, size=20000)
X_test = np.random.rand(20000, 1, 30, 300)
Y_test = np.random.randint(0, 2, size=20000)

model.fit(X_train, Y_train, batch_size=32, epochs=10,
          validation_data=(X_test, Y_test), verbose=0, callbacks=[MemoryCallback()])
```

The result shows no obvious memory leak. Can you add the callback to monitor memory usage on your own data?
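The model definition is elided above; a minimal hypothetical stand-in that makes the snippet runnable (any small binary classifier over the same input shape would do) could be:

```python
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=(1, 30, 300)))   # matches the X_train shape above
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```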
Hi @joelthchao
With batch_size=1, ...
Moreover, when I monitor memory usage with the command ... I am running it on a server with an AMD Opteron(tm) Processor 4284. A similar issue has been raised in #5924.
@HareeshBahuleyan Your memory usage is too low, which is quite weird.
@joelthchao Yes, I noticed that too, in spite of having a larger train and test set.
@HareeshBahuleyan I switched to another environment which is almost the same as yours.
Does your testing script involve any other operations which are not shown here?
@joelthchao No, I don't have any other operations other than this (just loading the data before this).
I get the callback output as ... However, I also monitor the memory usage with the command ... As you can see, the free memory keeps decreasing and the process eventually gets killed (if it is run for too many epochs). Also, as I mentioned, this free memory remains constant with batch_size=1.
@joelthchao I am loading hickle and pickle objects. The script can be found here. Monitoring memory with ... I don't think this is an issue with this specific network. I have tried other network architectures on the same machine and I still face the same issue.
Update: I ran the code on another Ubuntu machine and did not face any memory leak issues. This could mean that the issue is present only on certain CPUs, as mentioned in #5924. Thanks for the help!
I got the same issue. Keras 2.0.2 on Windows with the Theano backend. The memory consumption keeps increasing and finally the program crashed.
I got the same issue on Linux with Keras 2.0.2, but it works fine on macOS Sierra.
@duykienvp Please help run the test script to verify the memory leak problem.
Python 2.7.13, macOS El Capitan 10.11.6: no problem (3083554816).
---- System 1 (Google Cloud Engine): 3679756
---- System 2 (Google Cloud Engine, same machine as System 1): 3015396
---- System 3: 3025858560
@duykienvp
I am getting the same issue, but on Windows. Memory leaks. I am using this example script: https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
I have reproduced the issue on Keras 2.0. and 2.0.3 (Python 2.7.11, Theano 0.9.0, numpy 1.12.1) on Windows 10.
This issue was not happening before I upgraded from Keras 1.2 to 2.0, so I suspect it is with one of the dependent C libraries, which were also upgraded.
Are the Theano devs aware of this? If not, please open an issue there.
Same problem here; CentOS 6, Python 2.7.13, Keras 2.0.2, Theano 0.9.0. Running on CPU. Would appreciate suggestions for a solution.
There is a PR to fix this in Theano.
The problem only happens if Theano can't link directly to BLAS. One workaround that should also speed up computation is to install a good BLAS library that Theano can reuse.
…On Mon, Apr 10, 2017 at 10:05, hft7h11 <notifications@github.com> wrote:
@fchollet <https://github.com/fchollet>
I have commented on the Theano bug ticket. In the meantime this is a bit of a blocker for Keras 2.0 Theano use. Reverting to Theano 0.8.2 fixes the memory leak, however certain layers such as MaxPooling2D seem to depend on Theano 0.9.0 as per #5785.
I can confirm the fix mentioned by @nouiz. Properly linking to MKL solved the problem.
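For anyone unsure whether their install is affected, a quick check of the BLAS linkage (a sketch, assuming Theano 0.9's config attribute names; the flags themselves can be set in the [blas] section of ~/.theanorc) is:

```python
import theano

# An empty string here means Theano found no BLAS library to link against
# and falls back to its internal implementation, which is the case where
# this leak (and a large slowdown) shows up.
print(theano.config.blas.ldflags)
```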
The leak was fixed in the master branch of Theano. I would recommend linking to a good BLAS; this will give you a speed-up at the same time. Otherwise, update Theano to the dev version.
I think this issue can be closed.
@nouiz With the Theano dev version I get a traceback ending in "'module' object has no attribute 'ifelse'". If I just install Theano using pip install Theano==0.9, the code won't break, but I still have the memory issue.
@hft7h11 I got the same error, "'module' object has no attribute 'ifelse'". Is there a good way to solve the problem?
Python 3.6.1, Keras 2.0.2, Tensorflow 1.0.1, Ubuntu 16.04. I load data using pickle and had a similar memory leak when using model.fit.
Link properly against MKL as described in my previous post and it will be solved.
I have the same problem. I tested the same code under Python 2 and it does not have this issue; it only happens with Python 3. @gzapatas His method can solve the leaking problem, thanks! Run ... If you run the program on GPU, other packages are also highly recommended:
I had the same problem. Solved by switching to the TensorFlow backend.
I had the same problem (TensorFlow backend, Python 3.5, Keras 2.1.2).
Hi there, I had the same problem (TensorFlow backend, Python 3.5, Keras 2.1.3, Ubuntu 17.10, GPU: Nvidia GTX 1060, 16 GB RAM, all installed on a 128 GB SSD).
@joelthchao I'm having the same problem with the TensorFlow backend (like @BIGBALLON), on Ubuntu 16.04 with Keras 2.2.0 and tensorflow-gpu 1.5.0. The RAM gets eaten up after each epoch, so if the model is trained for too many epochs I'm out of RAM. I use a custom generator with use_multiprocessing=False and workers=1.
Similar problem on Ubuntu 18.04 with the TensorFlow backend, CPU only: memory is successively eaten by the Keras fit function with a simple 2-layer DNN.
This works.
I have exactly your problem with my custom generator. TensorFlow 1.12, Keras 2.2.4 & Ubuntu 18.04. Did everyone solve it just by typing the following?
sudo apt-get install libblas-dev
Hi everyone,
you can use Google Colaboratory to work around the memory leak.
https://colab.research.google.com/notebooks/gpu.ipynb
good luck
Was this thread ever closed? I have the same issue. Simple feed-forward network with no looping. Ubuntu 18.04, Keras 2.2, TensorFlow 1.13. I've tried most of the solutions in this thread (except going to Colab instead of Jupyter). Nothing seems to fix the memory leak.
Hi David,
why don't you try this command:
'sudo apt-get install libblas-dev'? It also installs the BLAS library that Theano depends on, and after that I didn't have any memory leak anymore.
I also recommend using Google Colab, which is free and very similar to Jupyter.
Thanks Mary. Unfortunately I'm using TensorFlow and not Theano. Also, libblas-dev was already installed anyway. Do you have any other suggestions?
Dear David,
I usually use Google Colaboratory to get around the problem.
Check it out and do not hesitate to ask me your questions.
https://colab.research.google.com/drive/1pGErg2HaWfFVa3jCCFSDRGNjlkDaqzE4#updateTitle=true&folderId=1xFFASkjeHGhgH2EWfUBHY2QzErXu6Dpn
Best,
Maryam
I finally found the issue for me. TensorFlow 1.14 has this memory leak but 1.13 does not.
For Ubuntu and TF 2.0 using the Keras backend, I was able to work around (but not solve) this problem. I recommend re-opening the issue; my "solution" involves potentially double the amount of computation time:
I found the same problem with TF2.2 and tf.keras model.fit().
I am also having this issue with TF2.2 and model.fit in 5-fold cross-validation. I explicitly delete all the objects, clear the session, and call the garbage collector at the end of the fit function for every iteration, but I still get a memory buildup, so I am limited despite having 4 P100s being used in parallel and 120 GB of RAM.
I'm also having this issue with TF2.3.1. Using tf.compat.v1.disable_v2_behavior() fixed it.
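For reference, a minimal sketch of where that call goes (it has to run at the top of the script, before any models or layers are created):

```python
import tensorflow as tf

# Switch TF 2.x back to 1.x-style graph behavior before building anything.
tf.compat.v1.disable_v2_behavior()

# ... build and fit the model as usual afterwards
```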
@justinmulli @yymarcin I suggest doing ...
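The rest of the suggestion is cut off above; a minimal sketch of the clear-session / delete / collect cleanup pattern discussed in the replies below (build_model here is a hypothetical helper that returns a compiled model) might look like:

```python
import gc
from tensorflow import keras

def train_one_fold(build_model, x_train, y_train):
    model = build_model()                      # fresh model for this fold
    model.fit(x_train, y_train, epochs=10, batch_size=32, verbose=0)
    weights = model.get_weights()              # keep only what you need past this fold
    keras.backend.clear_session()              # drop the backend graph state
    del model                                  # drop the Python reference
    gc.collect()                               # then ask the GC to reclaim it
    return weights
```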
Thanks for your reply, but as I mentioned in my comment, I already do those things. I will try what yymarcin did to fix it.
@justinmulli If that does not work, let me know if you actually followed the exact order and instructions I suggested. I do not know what "explicitly deleting objects" means, but using the Python command ...
Thanks again for the quick reply. I basically delete everything, and that includes del model. I clear the session, then delete the model and other objects, and then finally call gc.collect(). I can try calling gc.collect() before (when I get back from holiday next week), but as you said yourself, it shouldn't be any better than calling it after.
I followed the exact order and instructions you suggested and it still does not work.
While I was hoping this would work for me, it doesn't allow using the class_weights argument in model.fit with distributed training and hence results in an error message.
Hi there,
you can all use Google Colaboratory instead.
https://colab.research.google.com/notebooks/intro.ipynb
This works. I had the issue on Kaggle notebooks when fitting a model twice, and this solved it.
I'm using TF 2.3.0 and TF 2.4.0 and see this issue a lot. I dug into the source code and there is a race-condition issue that keeps increasing memory. For the training data generator, the model.fit function initializes it only once, and a for loop inside the class then iterates over the data repeatedly, forever. At the end of every full epoch it closes the process pool and the garbage collector starts to work, but not immediately if your data is huge. At the same time, fit creates a new validation generator immediately, even before the GC finishes. This new validation generator creates a new process pool which inherits the training generator's leftover memory (Python on Linux by default uses a copy-on-write fork to inherit shared memory) and copies it into the new process pool; that is one layer of memory leak. Then, once validation finishes, the same thing happens with the new training-generator process pool, which copies the leftover memory from the validation generator again. It is like rolling a snowball, and the memory keeps growing. I tried adding ... You can validate this by adding a sleep in ...
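The exact patches tried above are cut off; as a rough illustration of the "give the GC time at the epoch boundary" idea, a callback like the following could be passed to fit() (a hypothetical mitigation, not a fix for the underlying race):

```python
import gc
import time
from tensorflow.keras.callbacks import Callback

class EpochBoundaryCleanup(Callback):
    """Collect garbage (and optionally pause) at each epoch boundary, so the
    old generator pool's memory is more likely to be freed before the next
    pool is forked."""
    def __init__(self, pause_seconds=0.0):
        super().__init__()
        self.pause_seconds = pause_seconds

    def on_epoch_end(self, epoch, logs=None):
        gc.collect()
        if self.pause_seconds:
            time.sleep(self.pause_seconds)
```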
I found the same problem. When the program starts running, the memory is only 280 MB, and it has grown to 357 MB by the time the program finishes. Python version: 3.7.12. My code is:
The printout log is:
Hi,
I am trying to train a simple CNN in keras. During training via model.fit(), the system free memory keeps reducing and eventually it runs out of memory with a "Killed" error. When I train it one epoch at a time, I can clearly see the reduction in free memory after each epoch. Is this normal?
However, when I set batch_size = 1, I see that it works fine with no memory leaks.
I am running Keras version 2.0.2 with the Theano backend on CPU.
Thanks,
Hareesh
Edit: I was not facing this issue with Keras version 1.2.
Check that you are up-to-date with the master branch of Keras. You can update with:
pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps
If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.
If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with:
pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps
Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).