This repository has been archived by the owner on Nov 9, 2023. It is now read-only.

Memory usage grows over time and ends with memory error #116

Closed
berniejerom opened this issue Jan 20, 2019 · 16 comments

Comments

@berniejerom

Expected behavior

Memory usage probably shouldn't rise by a few GB over time.

Actual behavior

I am able to run a few hundred epochs, but then I get the memory error.
The training window process's memory usage is stable, but the other Python processes can each grow to about 2 GB.


I get the memory error when the app uses about ~8.5 GB, even though some RAM is still available.
GPU memory usage looks stable, at about 10 GB.

 g [#004905][1596ms] src_loss:0.842 dst_loss:0.318
Traceback (most recent call last):

  File "D:\deepFaceLab\mainscripts\Trainer.py", line 71, in trainerThread
    loss_string = model.train_one_epoch()

  File "D:\deepFaceLab\models\ModelBase.py", line 308, in train_one_epoch
    self.last_sample = self.generate_next_sample()

  File "D:\deepFaceLab\models\ModelBase.py", line 301, in generate_next_sample
    return [next(generator) for generator in self.generator_list]

  File "D:\deepFaceLab\models\ModelBase.py", line 301, in <listcomp>
    return [next(generator) for generator in self.generator_list]

  File "D:\deepFaceLab\samples\SampleGeneratorFace.py", line 55, in __next__
    return next(generator)

  File "D:\deepFaceLab\utils\iter_utils.py", line 57, in __next__
    gen_data = self.cs_queue.get()

  File "c:\users\bern\appdata\local\programs\python\python36\Lib\multiprocessing\queues.py", line 94, in get
    res = self._recv_bytes()
  File "c:\users\bern\appdata\local\programs\python\python36\Lib\multiprocessing\connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "c:\users\bern\appdata\local\programs\python\python36\Lib\multiprocessing\connection.py", line 318, in _recv_bytes
    return self._get_more_data(ov, maxsize)
  File "c:\users\bern\appdata\local\programs\python\python36\Lib\multiprocessing\connection.py", line 344, in _get_more_data
    f.write(ov.getbuffer())
MemoryError
Done.

Steps to reproduce

Train any model. I think it also occurs for LIAEF and DF; sometimes it just takes more time.

Other relevant information

Python 3.6.5 64-bit
Windows 10
GTX 1080 Ti
16 GB RAM + 10 GB swap
CUDA 9.0
cuDNN v7.4.1.5 (also tried the newest one)
Note: I've changed the save period to 2 minutes.

@TooMuchFun
Copy link
Contributor

TooMuchFun commented Jan 20, 2019

Same for me on Ubuntu with 16 GB and a 1070. It lasts ~700 iterations, or 10-15 minutes, before the TF session dies, all output freezes, and GPU usage stops. The Python process remains running, with tqdm updating the status line, but the preview no longer responds and the system becomes very unstable.

@iperov
Owner

iperov commented Jan 20, 2019

8571 MB is the total for all processes.
Which process's memory is growing?

@iperov
Owner

iperov commented Jan 20, 2019

How many src and dst images are used?

@TooMuchFun
Contributor

Between 1000 and 1500 for both. The Python subprocesses are what grow for me, not the main Python thread.

@iperov
Owner

iperov commented Jan 20, 2019

not you

@iperov
Owner

iperov commented Jan 20, 2019

TF session dies - another issue.

@berniejerom
Author

berniejerom commented Jan 20, 2019

src: 319 files, 30 MB
dst: 1570 files, 122 MB
All 4 Python processes are growing, except the one showing the training preview.

With CPU I think it starts at about ~7 GB and it works, though it takes so long that I cannot test whether there is a memory leak.

And I am talking about RAM usage, as the GPU usage seems to be stable.
It happens at about 300 epochs.

It starts at ~1960 MB (preview window) + 4 x 80 MB (those 4 grow to 1.7 GB each).
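
For reference, numbers like these can be collected with a small psutil script that walks the trainer process and its children; a minimal sketch (psutil itself and the 2-second interval are assumptions, not part of DeepFaceLab):

```python
# Minimal sketch: log RSS of the trainer process and its sample-generator
# children every couple of seconds. Requires `pip install psutil`.
import sys
import time
import psutil

def watch(pid, interval=2.0):
    parent = psutil.Process(pid)
    while True:
        for p in [parent] + parent.children(recursive=True):
            try:
                rss_mb = p.memory_info().rss / (1024 * 1024)
                print(f"pid={p.pid:>6}  rss={rss_mb:8.1f} MB")
            except psutil.NoSuchProcess:
                pass  # child exited between listing and query
        print("-" * 32)
        time.sleep(interval)

if __name__ == "__main__":
    watch(int(sys.argv[1]))  # pass the main trainer PID
```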

@iperov
Owner

iperov commented Jan 20, 2019

These 4 processes generate samples for training: 2 for src and 2 for dst.

I changed ModelBase.py to this code

    def train_one_epoch(self):    
...
        while True: #<- new line
            self.last_sample = self.generate_next_sample() 

to generate samples in a loop as fast as possible before training starts,
but there is no memory leak.
[screenshot: 2019-01-21_00-30-05]
I cannot reproduce it in my prebuilt Windows binary.
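
For readers unfamiliar with the pipeline: the traceback points at worker processes pushing numpy sample arrays through a multiprocessing queue to the trainer, which pickles/unpickles every array. A stripped-down sketch of that shape (names such as make_sample are illustrative, not the actual DeepFaceLab code):

```python
# Rough shape of the sample-generator pipeline the traceback points at:
# worker processes pickle numpy arrays into a multiprocessing.Queue and
# the trainer unpickles them on get(). Names here are illustrative only.
import multiprocessing
import numpy as np

def make_sample(shape=(128, 128, 3)):
    # Stand-in for real face-sample generation.
    return np.random.rand(*shape).astype(np.float32)

def worker(queue):
    while True:
        queue.put(make_sample())        # array is pickled here

def main():
    queue = multiprocessing.Queue(maxsize=16)
    workers = [multiprocessing.Process(target=worker, args=(queue,), daemon=True)
               for _ in range(4)]       # 2 for src + 2 for dst, as described above
    for w in workers:
        w.start()
    for _ in range(1000):
        sample = queue.get()            # array is unpickled here
        # ... feed `sample` into one training step ...

if __name__ == "__main__":
    main()
```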

@berniejerom
Author

I did the same thing and it is still growing:

I will try to narrow down the scope in the next few days. Thanks for the support.

@iperov
Owner

iperov commented Jan 20, 2019

And why are you not using the prebuilt Windows binary?

@berniejerom
Author

Sometimes I need to customize the existing code, like with the saving interval, so it is nice to use the repository instead; but actually the need was caused by the out-of-memory problem.

I have no memory leak using the prebuilt binary. Also, when I just copied the code from the prebuilt internals into deepFaceLab, the memory leak was still visible, so it seems to be related to some external module.

@berniejerom
Author

Hi, I have used yolk to compare the installed site-packages in the prebuilt package and the downloaded project.
After some trial and error it seems that numpy 1.16 is causing the issue - the prebuilt package has version 1.15.4.
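
The same comparison can be done without yolk by dumping the installed versions from each environment and diffing the two files; a rough sketch using pkg_resources (the output filename is arbitrary):

```python
# Dump installed package versions for diffing against another environment
# (rough stand-in for the yolk comparison; run once in each environment).
import pkg_resources

with open("packages.txt", "w") as f:
    for dist in sorted(pkg_resources.working_set,
                       key=lambda d: d.project_name.lower()):
        f.write(f"{dist.project_name}=={dist.version}\n")
```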

@iperov
Owner

iperov commented Jan 22, 2019

@berniejerom wow interesting.

@iperov
Owner

iperov commented Jan 22, 2019

I have upgraded numpy to 1.16 and got the memory leak too.

@iperov
Owner

iperov commented Jan 22, 2019

Looks like the memory leak is caused by interprocess pickling of numpy arrays.
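
A minimal way to test that hypothesis outside the project is to pickle/unpickle an array in a tight loop, which is what multiprocessing does when passing samples between processes, and watch RSS; a sketch (the array shape, iteration counts, and use of psutil are assumptions for illustration):

```python
# Sketch of a standalone repro for the suspected leak: repeatedly pickle
# and unpickle a numpy array and print this process's RSS.
import os
import pickle
import numpy as np
import psutil

proc = psutil.Process(os.getpid())
arr = np.random.rand(256, 256, 3).astype(np.float32)

for i in range(200_000):
    _ = pickle.loads(pickle.dumps(arr))
    if i % 20_000 == 0:
        print(f"iter {i:>7}  rss {proc.memory_info().rss / 2**20:7.1f} MB")
```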

@iperov
Owner

iperov commented Jan 22, 2019

numpy/numpy#12793

They will fix it in 1.16.1.
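
Until 1.16.1 is out, pinning numpy back to 1.15.4 (the version in the prebuilt binary) avoids the leak; a defensive check at startup could also flag the bad version (illustrative only, not part of the repo):

```python
# Illustrative guard: warn if the numpy release known to leak is installed.
import warnings
import numpy as np
from distutils.version import LooseVersion

if LooseVersion(np.__version__) == LooseVersion("1.16.0"):
    warnings.warn("numpy 1.16.0 leaks memory when pickling arrays "
                  "(numpy/numpy#12793); pin 1.15.4 or wait for 1.16.1")
```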
