
BUG: saving a huge object array with np.savez leaks memory on PyPy #15775

Closed
mattip opened this issue Mar 18, 2020 · 13 comments

@mattip (Member) commented Mar 18, 2020

The test_large_zip test in numpy/lib/tests/test_io.py is currently skipped on PyPy since the allocated memory does not seem to be freed, even after multiple calls to gc.collect. The test passes, but later in the test suite calls to subprocess fail; it seems the call to fork does not use copy-on-write, and the non-freed memory requires more than 50% of the available memory.

It is not clear to me where the problem is or which objects are not released. I tried to instrument with weakref.ref but did not find live objects after gc.collect, and valgrind also did not show what is holding the memory.
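
For reference, the weakref instrumentation was roughly along these lines (a sketch, not the exact code; the subclass is only there because a plain ndarray does not accept weak references):

import gc
import weakref

import numpy as np

class Tracked(np.ndarray):
    # a plain ndarray cannot be weak-referenced; a trivial Python-level
    # subclass can
    pass

# much smaller than the real test, just to show the idea
data = np.asarray([np.random.rand(50, 4) for _ in range(1000)],
                  dtype=object).view(Tracked)
ref = weakref.ref(data)
del data
for _ in range(3):
    gc.collect()
# the weakref clears, i.e. no live object is found, yet on PyPy the RSS
# does not come back down
print("still alive?", ref() is not None)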

@anirudh2290 (Member) commented:

@mattip I am interested in taking a look at this issue. Can you please post a link to the part of the test suite that calls subprocess?

@mattip (Member, Author) commented Mar 19, 2020

When you run python runtests.py -mfull, this test runs in the middle of a CI run, and later tests call subprocess.Popen. You can simulate this by changing the test to put out = subprocess.check_output(['ls']) immediately after it. If you use ulimit to restrict memory to the minimum needed for the first part of the test to pass, you should be able to cause a crash even on systems with "unlimited" memory.
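
If it is easier to do from Python than from the shell, capping the address space with resource.setrlimit is roughly equivalent to ulimit -v (a sketch; the limit value is only a placeholder to tune per machine):

import os
import resource
import subprocess

import numpy as np
from numpy.testing import tempdir

# rough in-process equivalent of `ulimit -v <kbytes>`: cap the virtual
# address space before the memory-hungry part runs
limit_kb = 12_600_000            # placeholder, tune per machine
resource.setrlimit(resource.RLIMIT_AS, (limit_kb * 1024, limit_kb * 1024))

test_data = np.asarray([np.random.rand(50, 4) for i in range(800000)],
                       dtype=object)
with tempdir() as tmpdir:
    np.savez(os.path.join(tmpdir, 'test.npz'), test_data=test_data)

# the call that fails when the memory used above is not really released
# before the fork
out = subprocess.check_output(['ls'])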

@anirudh2290 (Member) commented:

@mattip Thanks for the info! I will take a look.

@anirudh2290 (Member) commented:

I tried the following:

import numpy as np
from numpy.testing import tempdir
import os
import gc
import subprocess

def test_leak():
    test_data = np.asarray([np.random.rand(50, 4)
                           for i in range(800000)], dtype=object)
    with tempdir() as tmpdir:
        np.savez(os.path.join(tmpdir, 'test.npz'), test_data=test_data)

test_leak()
gc.collect()
out = subprocess.check_output(['ls'])

Setting ulimit -v 12600000 causes the program to crash even with the subprocess line commented out.
With ulimit -v 12610000 the program runs successfully, even with the subprocess line uncommented.
This shows that the additional subprocess call took at most about 10 MB extra.
Having said that, one likely theory for the failure is related to what Instagram experienced, which also led to the addition of the gc.freeze API to CPython's garbage collector:

The garbage collector's collect() call touches every object's PyGC_head, and this triggers CoW even though the objects weren't written to in the user program.
Since the default gc thresholds are low, this may be happening in some of the subprocess calls in the numpy tests.
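
For illustration, this is roughly the pattern that gc.freeze enables on CPython 3.7+ (a sketch, not something the numpy test suite does): collect, then freeze, then fork, so that collections in the child no longer write to the shared pages holding the parent's objects.

import gc
import os

# parent builds up its long-lived objects, then:
gc.collect()   # drop actual garbage first so it is not frozen
gc.freeze()    # move the survivors to a permanent generation; later
               # collect() calls ignore them and never touch their PyGC_Head

pid = os.fork()
if pid == 0:
    # child: this collection no longer dirties the frozen objects' headers,
    # so copy-on-write is not triggered for those pages
    gc.collect()
    os._exit(0)
os.waitpid(pid, 0)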

One option that we have here is to push the test with high memory requirements (test_large_zip in this case) into its own subprocess, roughly as sketched below. WDYT?
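
A sketch of what I mean (not a final patch): the heavy body runs in a fresh interpreter, so whatever the GC keeps as unreturned memory disappears when the child exits.

import subprocess
import sys
import textwrap

def test_large_zip():
    # run the memory-hungry body in a fresh interpreter so that any memory
    # the GC holds on to is returned to the OS when the child exits
    body = textwrap.dedent("""
        import os
        import numpy as np
        from numpy.testing import tempdir
        test_data = np.asarray([np.random.rand(50, 4) for i in range(800000)],
                               dtype=object)
        with tempdir() as tmpdir:
            np.savez(os.path.join(tmpdir, 'test.npz'), test_data=test_data)
    """)
    subprocess.check_call([sys.executable, '-c', body])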

@mattip (Member, Author) commented Mar 24, 2020

With ulimit -v 12610000 the program runs successfully, even with the subprocess line uncommented.

On CPython or PyPy? In my experience CPython will work but PyPy will crash. While running the test in its own subprocess would probably avoid the problem, I would like to get to the root cause: why is PyPy not releasing the memory and allowing the subprocess call to succeed? What objects are being held?

Once gc.collect runs (you should run it at least 3 times on PyPy to make sure it has broken any reference cycles), any objects used during the call to test_leak should have been released. There should be minimal memory to be copied even if CoW is not triggered. On PyPy, I think the needed memory delta to run the subprocess call is on the order of 4GB.

@anirudh2290 (Member) commented:

On CPython or PyPy?

My testing has been with CPython. I will try with PyPy.

@anirudh2290 (Member) commented:

import numpy as np
from numpy.testing import tempdir
import os
import gc
import sys
from time import sleep
import subprocess

def test_leak():
    test_data = np.asarray([np.random.rand(50, 4)
                            for i in range(800000)], dtype=object)
    with tempdir() as tmpdir:
        np.savez(os.path.join(tmpdir, 'test.npz'), test_data=test_data)
test_leak()
gc.collect()
gc.collect()
gc.collect()
print("gc collect completed")
print("sleeping")
sleep(20)
#out = subprocess.check_output(['ls'])

Here is what I did: I set ulimit -v 21650000 and ran the above script with pypy3-c. This was roughly the smallest limit at which the script runs without a memory error.
It works both with and without the subprocess call. If you reduce the ulimit to 21600000, the code crashes irrespective of the subprocess call. So on my machine the subprocess call requires at most about 50 MB extra.

Also, I tried to test whether there is a leak with the existing code, without the subprocess call. I ran the following script:

import numpy as np
from numpy.testing import tempdir
import os
import gc
import sys
from time import sleep
import subprocess

def test_leak():
    test_data = np.asarray([np.random.rand(50, 4)
                            for i in range(800000)], dtype=object)
    with tempdir() as tmpdir:
        np.savez(os.path.join(tmpdir, 'test.npz'), test_data=test_data)

def main():
    while True:
        test_leak()
        gc.collect()
        gc.collect()
        gc.collect()
        print("gc collect completed")
        print("sleeping")
        sleep(20)

if __name__ == '__main__':
    main()

I wasn't able to reproduce a memory leak with the 21650000 KB ulimit mentioned above. The script ran close to 50 times with no increase in the peak RSS of the process. As mentioned before, I think the subprocess OOM in the numpy tests may be happening because of the following:

1. The garbage collection in one of the tests running in the subprocess touches each object's PyGC_head, triggering writes even though the user program never wrote to those objects.
2. The child process is allocating memory in shared pages, triggering CoW.

The difference between the CPython and PyPy RSS is huge though: close to 9 to 10 GB. The PyPy garbage-collector documentation mentions the following: http://doc.pypy.org/en/latest/cpython_differences.html

If the program allocates a lot of memory, and then frees it all by dropping all references to it, then we might expect to see the RSS to drop. (RSS = Resident Set Size on Linux, as seen by “top”; it is an approximation of the actual memory usage from the OS’s point of view.) This might not occur: the RSS may remain at its highest value. This issue is more precisely caused by the process not returning “free” memory to the OS. We call this case “unreturned memory”.

Probably what we are seeing here is a lot of unreturned memory that the garbage collector is holding on to and reusing for its future allocations.
I used the get_stats call with memory_pressure=True as mentioned in the doc, and the allocated memory it reports is much lower (around 4 GB) than the RSS (21 to 22 GB), which probably indicates unreturned memory.
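
For reference, the query was just this (PyPy-only API as described in the linked doc; the returned object prints a human-readable summary):

import gc

# PyPy-only: summary of memory the GC manages (arenas, raw-malloced blocks,
# nursery, ...), to compare against the RSS reported by the OS
print(gc.get_stats(memory_pressure=True))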

@mattip please let me know if I am missing something.

@mattip (Member, Author) commented Mar 27, 2020

Thanks for the analysis. It seems to make sense. Let's leave this open until we can re-enable PyPy in CI (waiting on a fix for OpenBLAS, gh-15796) and try some alternatives like running this test in its own subprocess.

@anirudh2290 (Member) commented:

Thanks @mattip, I will wait for #2533 to be merged into openblas-libs and for PyPy to be re-enabled before adding the test in its own subprocess.

@anirudh2290 (Member) commented:

Hi @mattip, this is probably not a blocker, but I observed something weird while trying to adjust the requires_memory parameter.

Running the following:


import numpy as np
from numpy.testing import tempdir
import os
import gc
import sys
from time import sleep
import subprocess

def test_leak():
    test_data = np.asarray([np.random.rand(50, 4)
                           for i in range(800000)], dtype=object)
    with tempdir() as tmpdir:
        np.savez(os.path.join(tmpdir, 'test.npz'), test_data=test_data)

if __name__ == '__main__':
    test_leak()
    gc.collect()
    gc.collect()
    gc.collect()

has a very different RSS peak (difference around 10 to 12 GB) when compared to running inside tests:

      def test_large_zip(self):
          def check_large_zip():
              # The test takes at least 6GB of memory, writes a file larger than 4GB
              test_data = np.asarray([np.random.rand(np.random.randint(50,100),4)
                                     for i in range(800000)], dtype=object)
              with tempdir() as tmpdir:
                  np.savez(os.path.join(tmpdir, 'test.npz'), test_data=test_data)
          check_large_zip()
          gc.collect()
          gc.collect()
          gc.collect()

Do you know why we see such a big difference?

@mattip (Member, Author) commented Apr 1, 2020

So the stand-alone test takes 6GB and the in-test one takes 16GB?

@anirudh2290 (Member) commented:

The stand-alone one takes close to 20 GB and the in-test one takes around 8 GB.

@mattip (Member, Author) commented Apr 3, 2020

Closing, gh-15893 was merged.
