This repository has been archived by the owner on Nov 9, 2023. It is now read-only.

Memory usage grows over time and ends with memory error #116

Closed
berniejerom opened this issue Jan 20, 2019 · 16 comments

Comments

@berniejerom

Expected behavior

Memory usage probably shouldn't rise by a few GB over time.

Actual behavior

I am able to run a few hundred epochs, but then I get the memory error.
The training window process's memory usage is stable, but the other Python processes can each grow to about 2 GB.


I get the memory error when the app uses about ~8.5 GB, even though some RAM is still available.
GPU memory usage looks stable, at about 10 GB.

 g [#004905][1596ms] src_loss:0.842 dst_loss:0.318
Traceback (most recent call last):

  File "D:\deepFaceLab\mainscripts\Trainer.py", line 71, in trainerThread
    loss_string = model.train_one_epoch()

  File "D:\deepFaceLab\models\ModelBase.py", line 308, in train_one_epoch
    self.last_sample = self.generate_next_sample()

  File "D:\deepFaceLab\models\ModelBase.py", line 301, in generate_next_sample
    return [next(generator) for generator in self.generator_list]

  File "D:\deepFaceLab\models\ModelBase.py", line 301, in <listcomp>
    return [next(generator) for generator in self.generator_list]

  File "D:\deepFaceLab\samples\SampleGeneratorFace.py", line 55, in __next__
    return next(generator)

  File "D:\deepFaceLab\utils\iter_utils.py", line 57, in __next__
    gen_data = self.cs_queue.get()

  File "c:\users\bern\appdata\local\programs\python\python36\Lib\multiprocessing\queues.py", line 94, in get
    res = self._recv_bytes()
  File "c:\users\bern\appdata\local\programs\python\python36\Lib\multiprocessing\connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "c:\users\bern\appdata\local\programs\python\python36\Lib\multiprocessing\connection.py", line 318, in _recv_bytes
    return self._get_more_data(ov, maxsize)
  File "c:\users\bern\appdata\local\programs\python\python36\Lib\multiprocessing\connection.py", line 344, in _get_more_data
    f.write(ov.getbuffer())
MemoryError
Done.

Steps to reproduce

Train any model. I think it also occurs for LIAEF and DF; sometimes it just takes more time.

Other relevant information

Python 3.6.5 64-bit
Windows 10
GTX 1080 Ti
16 GB RAM + 10 GB swap
CUDA 9.0
cuDNN v7.4.1.5 (also tried the newest one)
Note: I've changed the save period to 2 minutes.

@TooMuchFun
Copy link
Contributor

TooMuchFun commented Jan 20, 2019

Same for me on Ubuntu with 16 GB and a 1070. It lasts ~700 iterations, or 10-15 minutes, before the TF session dies, all output freezes, and GPU usage stops. The Python process remains running, with tqdm updating the status line, but the preview no longer responds and the system becomes very unstable.

@iperov
Owner

iperov commented Jan 20, 2019

8571 MB is the total for all processes.
Which process's memory is growing?

@iperov
Owner

iperov commented Jan 20, 2019

How many src and dst images are used?

@TooMuchFun
Contributor

Between 1000 and 1500 for both. The Python subprocesses are what grow for me, not the main Python thread.

@iperov
Owner

iperov commented Jan 20, 2019

not you

@iperov
Owner

iperov commented Jan 20, 2019

TF session dies - another issue.

@berniejerom
Author

berniejerom commented Jan 20, 2019

src: 319 files, 30 MB
dst: 1570 files, 122 MB
All 4 Python processes are growing, except the one showing the training preview.

With CPU I think it starts at about ~7 GB and it works, though it takes so long that I cannot test whether there is a memory leak.

And I am talking about RAM usage, as the GPU usage seems to be stable.
It happens at about 300 epochs.

It starts at ~1960 MB (preview window) + 4 x 80 MB (those 4 grow to 1.7 GB each).
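
For reference, numbers like these can be collected with a small psutil script that walks the trainer process and its children; a minimal sketch (psutil itself and the 2-second interval are assumptions, not part of DeepFaceLab):

```python
# Minimal sketch: log RSS of the trainer process and its sample-generator
# children every couple of seconds. Requires `pip install psutil`.
import sys
import time
import psutil

def watch(pid, interval=2.0):
    parent = psutil.Process(pid)
    while True:
        for p in [parent] + parent.children(recursive=True):
            try:
                rss_mb = p.memory_info().rss / (1024 * 1024)
                print(f"pid={p.pid:>6}  rss={rss_mb:8.1f} MB")
            except psutil.NoSuchProcess:
                pass  # child exited between listing and query
        print("-" * 32)
        time.sleep(interval)

if __name__ == "__main__":
    watch(int(sys.argv[1]))  # pass the main trainer PID
```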

@iperov
Owner

iperov commented Jan 20, 2019

These 4 processes generate samples for training: 2 for src and 2 for dst.

I changed ModelBase.py to this code

    def train_one_epoch(self):    
...
        while True: #<- new line
            self.last_sample = self.generate_next_sample() 

to generate samples in a loop as fast as possible before training starts,
but there is no memory leak.
[screenshot: 2019-01-21_00-30-05]
I cannot reproduce it in my prebuilt Windows binary.
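
For readers unfamiliar with the pipeline: the traceback points at worker processes pushing numpy sample arrays through a multiprocessing queue to the trainer, which pickles/unpickles every array. A stripped-down sketch of that shape (names such as make_sample are illustrative, not the actual DeepFaceLab code):

```python
# Rough shape of the sample-generator pipeline the traceback points at:
# worker processes pickle numpy arrays into a multiprocessing.Queue and
# the trainer unpickles them on get(). Names here are illustrative only.
import multiprocessing
import numpy as np

def make_sample(shape=(128, 128, 3)):
    # Stand-in for real face-sample generation.
    return np.random.rand(*shape).astype(np.float32)

def worker(queue):
    while True:
        queue.put(make_sample())        # array is pickled here

def main():
    queue = multiprocessing.Queue(maxsize=16)
    workers = [multiprocessing.Process(target=worker, args=(queue,), daemon=True)
               for _ in range(4)]       # 2 for src + 2 for dst, as described above
    for w in workers:
        w.start()
    for _ in range(1000):
        sample = queue.get()            # array is unpickled here
        # ... feed `sample` into one training step ...

if __name__ == "__main__":
    main()
```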

@berniejerom
Author

I did the same thing and it is still growing:

I will try to narrow down the scope in the next few days. Thanks for the support.

@iperov
Owner

iperov commented Jan 20, 2019

And why are you not using the prebuilt Windows binary?

@berniejerom
Author

Sometimes I need to customize the existing code, like with the saving interval, so it is nice to use the repository instead; but actually the need was caused by the out-of-memory problem.

I have no memory leak using the prebuilt binary. Also, when I just copied the code from the prebuilt internals into deepFaceLab, the memory leak was still visible, so it seems to be related to some external module.

@berniejerom
Author

Hi, I have used yolk to compare the installed site-packages in the prebuilt package and the downloaded project.
After some trial and error it seems that numpy 1.16 is causing the issue - the prebuilt package has version 1.15.4.
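
The same comparison can be done without yolk by dumping the installed versions from each environment and diffing the two files; a rough sketch using pkg_resources (the output filename is arbitrary):

```python
# Dump installed package versions for diffing against another environment
# (rough stand-in for the yolk comparison; run once in each environment).
import pkg_resources

with open("packages.txt", "w") as f:
    for dist in sorted(pkg_resources.working_set,
                       key=lambda d: d.project_name.lower()):
        f.write(f"{dist.project_name}=={dist.version}\n")
```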

@iperov
Owner

iperov commented Jan 22, 2019

@berniejerom wow interesting.

@iperov
Owner

iperov commented Jan 22, 2019

I have upgraded numpy to 1.16 and got the memory leak too.

@iperov
Owner

iperov commented Jan 22, 2019

Looks like the memory leak is caused by interprocess pickling of numpy arrays.
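
A minimal way to test that hypothesis outside the project is to pickle/unpickle an array in a tight loop, which is what multiprocessing does when passing samples between processes, and watch RSS; a sketch (the array shape, iteration counts, and use of psutil are assumptions for illustration):

```python
# Sketch of a standalone repro for the suspected leak: repeatedly pickle
# and unpickle a numpy array and print this process's RSS.
import os
import pickle
import numpy as np
import psutil

proc = psutil.Process(os.getpid())
arr = np.random.rand(256, 256, 3).astype(np.float32)

for i in range(200_000):
    _ = pickle.loads(pickle.dumps(arr))
    if i % 20_000 == 0:
        print(f"iter {i:>7}  rss {proc.memory_info().rss / 2**20:7.1f} MB")
```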

@iperov
Owner

iperov commented Jan 22, 2019

numpy/numpy#12793

They will fix it in 1.16.1.
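
Until 1.16.1 is out, pinning numpy back to 1.15.4 (the version in the prebuilt binary) avoids the leak; a defensive check at startup could also flag the bad version (illustrative only, not part of the repo):

```python
# Illustrative guard: warn if the numpy release known to leak is installed.
import warnings
import numpy as np
from distutils.version import LooseVersion

if LooseVersion(np.__version__) == LooseVersion("1.16.0"):
    warnings.warn("numpy 1.16.0 leaks memory when pickling arrays "
                  "(numpy/numpy#12793); pin 1.15.4 or wait for 1.16.1")
```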
