
Cast array size to int64 when loading from archive #7598

Merged: 1 commit into numpy:master on May 9, 2016

Conversation

@drasmuss (Contributor) commented May 2, 2016

When loading an array from an npz archive, the size of the array is computed via numpy.multiply.reduce(shape). The result defaults to int32 on some systems, including 64-bit systems where it is possible to create arrays large enough to make that value overflow. Here is a minimal example illustrating the problem:

import platform
import sys

import numpy as np

print(platform.architecture())
print(sys.version)
print("np version:", np.__version__)
print("default int type:", np.int_)

# shape (2**30, 2): the element count is 2**31, which overflows a signed int32
a = np.empty((2**30, 2), dtype=np.uint8)
with open("tmp.npz", "wb") as f:
    np.savez(f, a)

print(a.shape)
print(np.multiply.reduce(a.shape))  # wraps to -2147483648 when the default int is int32

# loading the archive triggers the overflow inside numpy.lib.format.read_array
with open("tmp.npz", "rb") as f:
    data = np.load(f)
    new_a = data["arr_0"]

output on my system:

('64bit', 'WindowsPE')
3.4.4 |Continuum Analytics, Inc.| (default, Feb 16 2016, 09:54:04) [MSC v.1600 64 bit (AMD64)]
np version: 1.11.0
default int type: <class 'numpy.int32'>
(1073741824, 2)
-2147483648
Traceback (most recent call last):
  File "<...>/tmp.py", line 20, in <module>
    new_a = data["arr_0"]
  File "<...>\lib\site-packages\numpy\lib\npyio.py", line 224, in __getitem__
    pickle_kwargs=self.pickle_kwargs)
  File "<...>\lib\site-packages\numpy\lib\format.py", line 660, in read_array
    array = numpy.empty(count, dtype=dtype)
ValueError: negative dimensions are not allowed

I think the solution is just to change it to numpy.multiply.reduce(shape, dtype=numpy.int64)
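For reference, the count is computed in numpy/lib/format.py's read_array, so the change would look roughly like this (a sketch of the surrounding logic, not the exact source):

# numpy/lib/format.py, read_array (sketch)
if len(shape) == 0:
    count = 1
else:
    # before: count = numpy.multiply.reduce(shape)  -- int32 on some platforms
    count = numpy.multiply.reduce(shape, dtype=numpy.int64)
array = numpy.empty(count, dtype=dtype)  # the line that raised in the traceback above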

@njsmith (Member) commented May 2, 2016

I think it should be dtype=np.intp -- intp is "integer pointer" i.e. 32 bits on machines with 32 bits of addressable memory and 64 bits on machines with 64 bits of addressable memory.
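For illustration, here is what intp resolves to (a minimal check; the exact output depends on the build):

import numpy as np

print(np.intp)            # the platform's pointer-sized integer type
print(np.dtype(np.intp))  # int64 on a 64-bit build, int32 on a 32-bit build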

Would it be possible to add a test? I guess this will be a somewhat heavyweight test since it has to write/read several gibibytes of data, but that's okay if you use the @dec.slow decorator to mark it as "slow".
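Such a test could look something like this (a hedged sketch, not the exact test that was added; it assumes numpy.testing's dec decorators and Python 3's tempfile.TemporaryDirectory):

import os
import tempfile

import numpy as np
from numpy.testing import dec


@dec.slow
def test_large_archive():
    # ~2 GiB array whose element count (2**31) overflows a signed int32
    a = np.empty((2**30, 2), dtype=np.uint8)
    with tempfile.TemporaryDirectory() as tmpdir:
        fname = os.path.join(tmpdir, "archive.npz")
        np.savez(fname, a=a)
        new_a = np.load(fname)["a"]
    # np.empty contents are arbitrary, so compare shapes rather than values
    assert a.shape == new_a.shape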

@drasmuss force-pushed the master branch 2 times, most recently from 7df0f25 to fad6296 on May 3, 2016
@drasmuss (Contributor, Author) commented May 3, 2016

Didn't know about intp, that's definitely better. Added a test as well.

@charris (Member) commented May 3, 2016

@njsmith npy_intp might overflow on 32-bit systems if the file was produced on a 64-bit system, so it would probably be safer to use 64 bits and then check whether the size of the result fits in npy_intp.
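In code, that approach would look roughly like this (a sketch; the check and the error message are illustrative, not numpy's actual source):

count = numpy.multiply.reduce(shape, dtype=numpy.int64)
# an int64 count can exceed what npy_intp holds on a 32-bit build,
# so verify it fits before trying to allocate
if count > numpy.iinfo(numpy.intp).max:
    raise ValueError("array is too large to load on this platform")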

@njsmith (Member) commented May 3, 2016

@charris: oh, great point

@charris (Member) commented May 4, 2016

@drasmuss Needs a bit more work.

@drasmuss (Contributor, Author) commented May 4, 2016

What do you think the behaviour should be in that case? I would lean towards just trying to open the file and then failing with the normal out-of-memory error. We could also raise a custom error here, though, if the default error doesn't seem informative enough.

@charris (Member) commented May 5, 2016

Either way works for me. IIRC (I checked yesterday), count is converted to Py_ssize_t, which is the same as npy_intp these days, so I expect the error should occur when the arguments are converted from Python objects to C. Might check that that actually happens.
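A quick way to probe that (hypothetical snippet; the exact exception depends on the platform and numpy version):

import numpy as np

# a count far larger than any 32-bit Py_ssize_t; on a 32-bit build the
# argument conversion itself should fail (OverflowError) before any
# allocation, while a 64-bit build converts fine and raises MemoryError
np.empty(np.int64(2**62), dtype=np.uint8)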

Commit message: Prevents overflow errors for large arrays on systems where the default int type is int32.
@drasmuss (Contributor, Author) commented May 9, 2016

Alright, changed it back to int64, which I think will cause an out-of-memory error at the array = numpy.empty(count, dtype=dtype) line if someone tries to load a file too large for their system. That seems like the simplest solution, since it is the same error someone would get regardless of any int32/int64 issues. I don't have access to a 32-bit system to confirm that, though.

@charris merged commit cb6a5bf into numpy:master on May 9, 2016
@charris (Member) commented May 9, 2016

Thanks @drasmuss.
