Multiarray Double Free or Unmap Pointer error for huge datasets #2995

Closed
asaluja opened this Issue Feb 16, 2013 · 12 comments

3 participants

@asaluja

Hello NumPy team,

I seem to have encountered some unusual behavior when running NumPy/SciPy on massive datasets. Please see below for the error. Also, this ticket/bug report has been cross-posted on stackoverflow.com (http://stackoverflow.com/questions/14906962/python-double-free-error-for-huge-datasets) and the SciPy developers list (http://projects.scipy.org/scipy/ticket/1846). It was upon the suggestion of an SOer that I decided to post here and on the SciPy dev list.

I am running Linux x86-64-bit OpenSuSE 11.4, NumPy version 1.5.1, SciPy version 0.9.0, Python 2.7.

I have a very simple script in Python, but for some reason I get the following error when running a large amount of data:

*** glibc detected *** python: double free or corruption (out): 0x00002af5a00cc010 ***

I am used to these errors coming up in C or C++, when one tries to free memory that has already been freed. However, by my understanding of Python (and especially the way I've written the code), I really don't understand why this should happen.

Here is the code:

#!/usr/bin/python -tt                                                                                                                                                                                                                         

import sys, commands, string
import numpy as np
import scipy.io as io
from time import clock

W = io.loadmat(sys.argv[1])['W']
size = W.shape[0]
numlabels = int(sys.argv[2])
Q = np.zeros((size, numlabels), dtype=np.double)
P = np.zeros((size, numlabels), dtype=np.double)
Q += 1.0 / Q.shape[1]
nu = 0.001
mu = 0.01
start = clock()
mat = -nu + mu*(W*(np.log(Q)-1))
end = clock()
print >> sys.stderr, "Time taken to compute matrix: %.2f seconds"%(end-start)

One may ask, why declare a P and a Q numpy array? I simply do that to reflect the actual conditions (as this code is simply a segment of what I actually do, where I need a P matrix and declare it beforehand).

I have access to a 192GB machine, and so I tested this out on a very large SciPy sparse matrix (2.2 million by 2.2 million, but very sparse, that's not the issue). The main memory is taken up by the Q, P, and mat matrices, as they are all 2.2 million by 2000 dense matrices (size = 2.2 million, numlabels = 2000). The peak memory goes up to 131GB, which comfortably fits in memory. While the mat matrix is being computed, I get the glibc error, and my process automatically goes into the sleep (S) state, without deallocating the 131GB it has taken up.

Given the bizarre (for Python) error (I am not explicitly deallocating anything), and the fact that this works nicely for smaller matrix sizes (around 1.5 million by 2000), I am really not sure where to start to debug this.

As a starting point, I have set "ulimit -s unlimited" before running, but to no avail.

Any help or insight into numpy's behavior with really large amounts of data would be welcome.

Note that this is NOT an out of memory error - I have 196GB, and my process reaches around 131GB and stays there for some time before giving the error above.

As per suggestions, I ran Python with GDB. Interestingly, on one GDB run I forgot to set the stack size limit to "unlimited", and got the following output:

*** glibc detected *** /usr/bin/python: munmap_chunk(): invalid pointer: 0x00007fe7508a9010 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x733b6)[0x7ffff6ec23b6]
/usr/lib64/python2.7/site-packages/numpy/core/multiarray.so(+0x4a496)[0x7ffff69fc496]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4e67)[0x7ffff7af48c7]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x309)[0x7ffff7af6c49]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x7ffff7b25592]
/usr/lib64/libpython2.7.so.1.0(+0xfcc61)[0x7ffff7b33c61]
/usr/lib64/libpython2.7.so.1.0(PyRun_FileExFlags+0x84)[0x7ffff7b34074]
/usr/lib64/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0x189)[0x7ffff7b347c9]
/usr/lib64/libpython2.7.so.1.0(Py_Main+0x36c)[0x7ffff7b3e1bc]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ffff6e6dbfd]
/usr/bin/python[0x4006e9]
======= Memory map: ========
00400000-00401000 r-xp 00000000 09:01 50336181                           /usr/bin/python2.7
00600000-00601000 r--p 00000000 09:01 50336181                           /usr/bin/python2.7
00601000-00602000 rw-p 00001000 09:01 50336181                           /usr/bin/python2.7
00602000-00e5f000 rw-p 00000000 00:00 0                                  [heap]
7fdf2584c000-7ffff0a66000 rw-p 00000000 00:00 0 
7ffff0a66000-7ffff0a6b000 r-xp 00000000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0a6b000-7ffff0c6a000 ---p 00005000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6a000-7ffff0c6b000 r--p 00004000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6b000-7ffff0c6c000 rw-p 00005000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6c000-7ffff0c77000 r-xp 00000000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0c77000-7ffff0e76000 ---p 0000b000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e76000-7ffff0e77000 r--p 0000a000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e77000-7ffff0e78000 rw-p 0000b000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e78000-7ffff0e79000 rw-p 00000000 00:00 0 
7ffff0e79000-7ffff0e9b000 r-xp 00000000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff0e9b000-7ffff109a000 ---p 00022000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109a000-7ffff109b000 r--p 00021000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109b000-7ffff109f000 rw-p 00022000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109f000-7ffff10a0000 rw-p 00000000 00:00 0 
7ffff10a0000-7ffff10a5000 r-xp 00000000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff10a5000-7ffff12a4000 ---p 00005000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a4000-7ffff12a5000 r--p 00004000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a5000-7ffff12a7000 rw-p 00005000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a7000-7ffff12ad000 r-xp 00000000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff12ad000-7ffff14ac000 ---p 00006000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ac000-7ffff14ad000 r--p 00005000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ad000-7ffff14ae000 rw-p 00006000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ae000-7ffff14b5000 r-xp 00000000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff14b5000-7ffff16b4000 ---p 00007000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b4000-7ffff16b5000 r--p 00006000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b5000-7ffff16b6000 rw-p 00007000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b6000-7ffff17c2000 r-xp 00000000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff17c2000-7ffff19c2000 ---p 0010c000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c2000-7ffff19c3000 r--p 0010c000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c3000-7ffff19c6000 rw-p 0010d000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c6000-7ffff19d5000 r-xp 00000000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
7ffff19d5000-7ffff1bd4000 ---p 0000f000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
7ffff1bd4000-7ffff1bd5000 r--p 0000e000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
Program received signal SIGABRT, Aborted.
0x00007ffff6e81ab5 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff6e81ab5 in raise () from /lib64/libc.so.6
#1  0x00007ffff6e82fb6 in abort () from /lib64/libc.so.6
#2  0x00007ffff6ebcdd3 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff6ec23b6 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff69fc496 in ?? () from /usr/lib64/python2.7/site-packages/numpy/core/multiarray.so
#5  0x00007ffff7af48c7 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#6  0x00007ffff7af6c49 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#7  0x00007ffff7b25592 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#8  0x00007ffff7b33c61 in ?? () from /usr/lib64/libpython2.7.so.1.0
#9  0x00007ffff7b34074 in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#10 0x00007ffff7b347c9 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#11 0x00007ffff7b3e1bc in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#12 0x00007ffff6e6dbfd in __libc_start_main () from /lib64/libc.so.6
#13 0x00000000004006e9 in _start ()

When I set the stack size limit to unlimited, I get the following:

*** glibc detected *** /usr/bin/python: double free or corruption (out): 0x00002abb2732c010 ***
^X^C
Program received signal SIGINT, Interrupt.
0x00002aaaab9d08fe in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaab9d08fe in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaab969f2e in _L_lock_9927 () from /lib64/libc.so.6
#2  0x00002aaaab9682d1 in free () from /lib64/libc.so.6
#3  0x00002aaaaaabbfe2 in _dl_scope_free () from /lib64/ld-linux-x86-64.so.2
#4  0x00002aaaaaab70a4 in _dl_map_object_deps () from /lib64/ld-linux-x86-64.so.2
#5  0x00002aaaaaabcaa0 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6  0x00002aaaaaab85f6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7  0x00002aaaaaabc5da in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8  0x00002aaaab9fb530 in do_dlopen () from /lib64/libc.so.6
#9  0x00002aaaaaab85f6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x00002aaaab9fb5cf in dlerror_run () from /lib64/libc.so.6
#11 0x00002aaaab9fb637 in __libc_dlopen_mode () from /lib64/libc.so.6
#12 0x00002aaaab9d60c5 in init () from /lib64/libc.so.6
#13 0x00002aaaab080933 in pthread_once () from /lib64/libpthread.so.0
#14 0x00002aaaab9d61bc in backtrace () from /lib64/libc.so.6
#15 0x00002aaaab95dde7 in __libc_message () from /lib64/libc.so.6
#16 0x00002aaaab9633b6 in malloc_printerr () from /lib64/libc.so.6
#17 0x00002aaaab9682dc in free () from /lib64/libc.so.6
#18 0x00002aaaabef1496 in ?? () from /usr/lib64/python2.7/site-packages/numpy/core/multiarray.so
#19 0x00002aaaaad888c7 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#20 0x00002aaaaad8ac49 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#21 0x00002aaaaadb9592 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#22 0x00002aaaaadc7c61 in ?? () from /usr/lib64/libpython2.7.so.1.0
#23 0x00002aaaaadc8074 in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#24 0x00002aaaaadc87c9 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#25 0x00002aaaaadd21bc in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#26 0x00002aaaab90ebfd in __libc_start_main () from /lib64/libc.so.6
#27 0x00000000004006e9 in _start ()

This makes me believe the basic issue is with the numpy multiarray core module (line #4 in the first output and line #18 in the second). I will bring it up as a bug report in both numpy and scipy just in case.

Has anyone seen this before?

@charris
NumPy member

That's a pretty old version of numpy, any chance you can test with the latest release?

@asaluja

I tested it on a machine with NumPy 1.7.0. This time, I just got a segmentation fault without any output to STDOUT or STDERR. I ran it through GDB, and got the following output:

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaabead970 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00002aaaabead970 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00002aaaac5fcd04 in PyDataMem_FREE (ptr=<optimized out>, $K8=<optimized out>) at numpy/core/src/multiarray/multiarraymodule.c:3510
#2  array_dealloc (self=0xc00ab7edbfc228fe) at numpy/core/src/multiarray/arrayobject.c:416
#3  0x0000000000498eac in PyEval_EvalFrameEx ()
#4  0x000000000049f1c0 in PyEval_EvalCodeEx ()
#5  0x00000000004a9081 in PyRun_FileExFlags ()
#6  0x00000000004a9311 in PyRun_SimpleFileExFlags ()
#7  0x00000000004aa8bd in Py_Main ()
#8  0x00002aaaabe4f76d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x000000000041b9b1 in _start ()

Do you know what the issue is? I will try my best to recompile numpy with debugging symbols so that it can be more informative, but perhaps you may have more insight based on the above than me.

@charris
NumPy member

#2 is where the call to PyDataMem_FREE is made in array_dealloc, which is the deallocation function for PyArray_Type. Can you create/delete big arrays successfully? Tracking this down could be a bit tricky as a lot of things could be going on. I'm wondering if the data pointer has become corrupted, or maybe the memory is being managed by scipy, or maybe something else. Running valgrind might show something, but I don't know if it can handle things that big, and it might be slow too. Someone else can give you better advice about that than I.

@charris
NumPy member

@pv IIRC, there is also a memory limitation with sparse matrices related to the Fortran being compiled with 32 bit integers as indexes.

@asaluja

@charris Yes, if I just comment out mat = -nu + mu*(W*(np.log(Q)-1)) in my code, the code completes successfully, showing that the loading of W (which is not too large memory-wise since it's very sparse, but has large dimensions), P, and Q (dense, 2 million by 2000 matrices) and their subsequent deleting happens without issue. However, while computing the line mat = -nu + mu*(W*(np.log(Q)-1)), the code breaks.

I unraveled this mathematical operation to try and see where exactly it breaks. If I replace the mat = ... line with mat = np.log(Q)-1 , the code completes successfully, however simply computing mat = W*(np.log(Q)-1) causes the crash. The sparse matrix - dense matrix matrix multiplication seems to be the culprit?

I am currently unable to install valgrind on the machine that has numpy 1.7, but if you think it would be useful to see let me know and I'll see what I can do.

To be honest, the data passed in through the first argument is not too large (497MB) and I can send you a pointer to it, but unless you have a large memory machine you may not be able to replicate my error when running this code.

@charris
NumPy member

I suspect the problem is with the sparse code, @pv can help you much more there than myself. It might be worth testing the scipy 12 beta that was just released to see if the problem persists. If you search the scipy list for sparse, dense, multiplication, memory, and such related terms I'm pretty sure you will find some some relevant posts.

@charris
NumPy member

Another question, you say matrix multiplication whereas the '*' operator is normally element-wise multiplication. Are you using the numpy matrix class by any chance?

@asaluja

@charris Yes, I want to do matrix multiplication, not element-wise multiplication. Since W is a sparse matrix, I thought that A*B, where A is a sparse matrix and B is a numpy array (as in my code), always corresponds to matrix multiplication? See http://stackoverflow.com/questions/13740691/multiplication-operator-for-scipy-sparse-matrices for more details. I am not explicitly using the numpy matrix class. One option for me would be to compute the matrix multiplication through numpy.dot(A,B) but my understanding is that this converts both matrices to dense matrices, which would be terrible for me for my sparse matrix.

@pv
NumPy member
pv commented Feb 18, 2013

@charris: using sparse matrices with nnz >= MAX_INT32 will cause problems. I don't think there's any Fortran code involved in the examples given above, but sparsetools has this data type limitation.

If W is very sparse without special structure, then (W Q).nnz ~ numlabels*W.nnz ~ 2000 W.nnz. If W.nnz is not too far from the maximum limit, it seems likely an overflow can occur. (I think sparsetools uses signed indices, so offsets may wrap to negative numbers -> boom. Should be converted to use intp internally even if array data is int32.)

@asaluja: check the number k of nonzero rows in W. If numlabels*k >= 2**31, then unfortunately Scipy doesn't support this large nnz with CSR/CSC. If it's smaller, try to reproduce the issue without io.loadmat.

@asaluja

@pv I think you have identified the issue here. While W is very sparse, all the rows are non-zero. In this particular example, W.nnz = 42666354, but number of non-zero rows in W is 2,192,596. Therefore, 2192596*2000 > 2^31. So, may I safely say that for such problem sizes, I should not use Numpy/Scipy at all?

@pv
NumPy member
pv commented Feb 18, 2013

@asaluja: yes, it is unfortunate, but it is the situation right now that nobody has been interested enough in this to push through the support for 64-bit nnz. There is some experimental work done on this here, though: http://projects.scipy.org/scipy/ticket/1307 https://github.com/pv/scipy-work/commits/ticket/1307 You may try this out, if you are interested.

Alternatively, there is also a second sparse matrix library for Python: http://pysparse.sourceforge.net/ --- but I'm not sure if it's better in this respect.

This discussion should be moved to -> http://projects.scipy.org/scipy/ticket/1307 --- this looks like a bug in Scipy, not in numpy arrays.

@pv pv closed this Feb 18, 2013
@asaluja

Ok thanks @pv I had already opened a ticket on Scipy (http://projects.scipy.org/scipy/ticket/1846), maybe you can consolidate with the ticket you had linked to? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment