Bug in sorting structured numpy array with more than 2^31 elements #427

Closed
parantapa opened this Issue · 5 comments

3 participants

@parantapa

When sorting a structured array with 2^31 or more elements, sorting doesn't
work. The sort function returns immediately without sorting. The following is
a simple test case.

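import numpy as np

# Two unsigned 32-bit fields, automatically named "f0" and "f1"
dt = "u4, u4"
a = np.empty(2 ** 31, dtype=dt)
a["f0"][:] = np.random.randint(1, 1e9, 2 ** 31)
a["f1"][:] = np.random.randint(1, 1e9, 2 ** 31)

# Sort by f0, breaking ties with f1
a.sort(order=["f0", "f1"])

# Check that every adjacent pair is in order (xrange: Python 2)
for i in xrange(len(a) - 1):
    u1, v1 = a[i]
    u2, v2 = a[i + 1]

    assert u1 < u2 or (u1 == u2 and v1 <= v2)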

The above has been tested using Python 2.7 and NumPy version 1.6.2 as well as
the NumPy Git version a72ce7e. The test was done on a 64-bit Linux system with
48G of memory.
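
For anyone reproducing this, the per-element Python loop in the test case is very slow at 2^31 elements. A vectorized version of the same check might look like the sketch below. The helper name is_sorted_by_two_fields is made up for illustration, and the array here is deliberately small, so this only shows the check itself and does not trigger the bug.

```python
import numpy as np

def is_sorted_by_two_fields(a, first="f0", second="f1"):
    """Hypothetical helper: True if `a` is ordered by `first`, then `second`."""
    x, y = a[first], a[second]
    # Same condition as the assert in the test case, applied to all
    # adjacent pairs at once.
    in_order = (x[:-1] < x[1:]) | ((x[:-1] == x[1:]) & (y[:-1] <= y[1:]))
    return bool(in_order.all())

# Small stand-in for the 2**31-element array from the report
rng = np.random.RandomState(0)
n = 1000
small = np.empty(n, dtype="u4, u4")
small["f0"] = rng.randint(1, 50, n)
small["f1"] = rng.randint(1, 50, n)

small.sort(order=["f0", "f1"])
sorted_ok = is_sorted_by_two_fields(small)
```

On the full-size array the comparison temporaries need several extra gigabytes, so the check may have to be run in chunks.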

@ilaudy

I have the same problem with sort, and none of the three kinds works.
It reports a segmentation fault.
Here is the gdb output:

GNU gdb (GDB) 7.1-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/bin/python...(no debugging symbols found)...done.
(gdb) run mockgalaxy.py
Starting program: /usr/bin/python mockgalaxy.py
[Thread debugging using libthread_db enabled]
[New Thread 0x7fffee4e3700 (LWP 22553)]
1024 1105

Program received signal SIGSEGV, Segmentation fault.
_new_argsort (op=, axis=0, which=) at numpy/core/src/multiarray/item_selection.c:880
880 *iptr++ = i;

@ilaudy

Here is my solution:
change line 818 of the file numpy-1.6.2/numpy/core/src/multiarray/item_selection.c
to
long long needcopy = 0, i;
then rebuild and reinstall numpy.
It should work.
Enjoy!
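
A rough illustration of why widening i helps, assuming the original declaration used a 32-bit int (the thread does not quote the original line): an element count of 2^31 is one past the largest signed 32-bit value, so a counter of that width cannot reach it. The snippet uses ctypes only to show the arithmetic; it is not the NumPy code.

```python
import ctypes

n = 2 ** 31  # number of elements in the test case

# Largest value a signed 32-bit integer can hold
int32_max = 2 ** 31 - 1

last_index_fits = (n - 1) <= int32_max  # the last valid index still fits
count_fits = n <= int32_max             # the element count does not

# ctypes does no overflow checking, so this shows the value a 32-bit
# signed integer ends up holding after wrapping around.
wrapped = ctypes.c_int32(n).value

# Width in bytes of a pointer-sized signed integer on this machine
# (8 on a typical 64-bit Linux system).
intp_bytes = ctypes.sizeof(ctypes.c_ssize_t)
```

Here wrapped comes out negative, which fits the symptom of a counter that never compares equal to the real element count.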

@seberg
NumPy member

Nice catch! Would you mind preparing a pull request? The correct datatype for i in this context is npy_intp, just like N. (i is also used for iterating over the keys, whose count I think can safely be assumed to fit in an int, so that part is fine. A check on the sequence length would not hurt, though I admit giving 2**31 keys is pretty absurd; this may be in lexsort rather than in _new_argsort.) The lexsort function seems to have the same problem.
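
For readers less familiar with lexsort: np.lexsort gives the same multi-key ordering as sorting a structured array with order=. A small sketch on a tiny array, which does not exercise the overflow and only shows how the two calls relate:

```python
import numpy as np

rng = np.random.RandomState(1)
n = 1000
a = np.empty(n, dtype="u4, u4")
a["f0"] = rng.randint(1, 50, n)
a["f1"] = rng.randint(1, 50, n)

# np.lexsort treats the LAST key in the tuple as the primary sort key,
# so this orders by f0 first and breaks ties with f1.
idx = np.lexsort((a["f1"], a["f0"]))
by_lexsort = a[idx]

# Same ordering expressed through the structured-array sort
by_sort = np.sort(a, order=["f0", "f1"])

# The sort keys cover every field, so both results must match exactly.
same = (np.array_equal(by_lexsort["f0"], by_sort["f0"])
        and np.array_equal(by_lexsort["f1"], by_sort["f1"]))
```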

@ilaudy

I totally agree with the change of datatype of i. It is much safer with npy_intp.
Go ahead with the pull request.

@seberg
NumPy member

Fixed in master, and there is an open PR for backporting to 1.7.

@seberg seberg closed this