
nibabel.load crashes for very large files on python 3, not python 2 #362

Closed

surchs opened this issue Oct 20, 2015 · 46 comments

Comments

@surchs

surchs commented Oct 20, 2015

Hi,

I just noticed the following error: loading a very large file (such as a resting-state image from the HCP dataset, e.g. with dimensions (91, 109, 91, 1200)) works fine with nibabel on Python 2.7. The same command on Python 3.5 fails with the following error:

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-5-6d7286f1eb52> in <module>()
----> 1 in_vol = in_img.get_data()

/usr/local/lib/python3.5/dist-packages/nibabel/spatialimages.py in get_data(self, caching)
    580         if self._data_cache is not None:
    581             return self._data_cache
--> 582         data = np.asanyarray(self._dataobj)
    583         if caching == 'fill':
    584             self._data_cache = data

/usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py in asanyarray(a, dtype, order)
    512 
    513     """
--> 514     return array(a, dtype, copy=False, order=order, subok=True)
    515 
    516 def ascontiguousarray(a, dtype=None):

/usr/local/lib/python3.5/dist-packages/nibabel/arrayproxy.py in __array__(self)
    142     def __array__(self):
    143         # Read array and scale
--> 144         raw_data = self.get_unscaled()
    145         return apply_read_scaling(raw_data, self._slope, self._inter)
    146 

/usr/local/lib/python3.5/dist-packages/nibabel/arrayproxy.py in get_unscaled(self)
    137                                        offset=self._offset,
    138                                        order=self.order,
--> 139                                        mmap=self._mmap)
    140         return raw_data
    141 

/usr/local/lib/python3.5/dist-packages/nibabel/volumeutils.py in array_from_file(shape, in_dtype, infile, offset, order, mmap)
    512     if hasattr(infile, 'readinto'):
    513         data_bytes = bytearray(n_bytes)
--> 514         n_read = infile.readinto(data_bytes)
    515         needs_copy = False
    516     else:

/usr/lib/python3.5/gzip.py in read(self, size)
    272             import errno
    273             raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 274         return self._buffer.read(size)
    275 
    276     def read1(self, size=-1):

/usr/lib/python3.5/_compression.py in readinto(self, b)
     66     def readinto(self, b):
     67         with memoryview(b) as view, view.cast("B") as byte_view:
---> 68             data = self.read(len(byte_view))
     69             byte_view[:len(data)] = data
     70         return len(data)

/usr/lib/python3.5/gzip.py in read(self, size)
    467             buf = self._fp.read(io.DEFAULT_BUFFER_SIZE)
    468 
--> 469             uncompress = self._decompressor.decompress(buf, size)
    470             if self._decompressor.unconsumed_tail != b"":
    471                 self._fp.prepend(self._decompressor.unconsumed_tail)

OverflowError: Python int too large for C unsigned int

Happy to provide further info, not sure what is going on.

@effigies
Member

What specific file from HCP did you use to produce this crash?

Also, can you try with Python 3.4? I don't believe nibabel is testing on 3.5, yet.

@surchs
Author

surchs commented Oct 20, 2015

I used a resting state nifti file with all 4 sessions combined, hence the 1200 timepoints.

@matthew-brett
Member

Is it possible you have a 32-bit version of Python 3.5?

@surchs
Author

surchs commented Oct 20, 2015

Pretty sure I got 64-bit:

In [1]: import sys

In [2]: sys.maxsize
Out[2]: 9223372036854775807

@matthew-brett
Member

Can you try debug at the IPython prompt after the error, and find out what the value of size is in the traceback?

@surchs
Author

surchs commented Oct 20, 2015

Sure

ipdb> size
4332617728

@matthew-brett
Member

As the error message says, that will just overflow a 32-bit unsigned integer.

Would you mind putting a 1/0 or similar at line 512 of volumeutils.py in order to raise an error, then check the size of n_bytes when reading in Python 2.7? I guess it is the same?

@surchs
Author

surchs commented Oct 20, 2015

It's pretty close:

/usr/local/lib/python3.5/dist-packages/nibabel/volumeutils.py in array_from_file(shape, in_dtype, infile, offset, order, mmap)
    509         return np.array([])
    510     # Read data from file
--> 511     1/0
    512     infile.seek(offset)
    513     if hasattr(infile, 'readinto'):

ZeroDivisionError: division by zero

In [5]: debug
> /usr/local/lib/python3.5/dist-packages/nibabel/volumeutils.py(511)array_from_file()
    510     # Read data from file
--> 511     1/0
    512     infile.seek(offset)

ipdb> n_bytes
4332619200
ipdb> 

@matthew-brett
Member

So, that should overflow a 32-bit unsigned int as well. I guess the difference is in the Python implementations.
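(For reference: 91 × 109 × 91 × 1200 voxels × 4 bytes for float32 is 4,332,619,200 bytes, just over the unsigned 32-bit limit of 2**32 - 1 = 4,294,967,295.)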

@surchs
Author

surchs commented Oct 20, 2015

I'm not sure I'm following. Why would there be a 32-bit constraint when running on a 64-bit Python shell?

@matthew-brett
Member

I don't know why that would happen, but it will depend on how Python implements gzip decompression. My guess is Python changed the implementation at some point, and the newer implementation happens to use an unsigned 32-bit int.

Can you check other Python versions? Sorry to ask...

@surchs
Author

surchs commented Oct 20, 2015

Not easily, no. Sorry as well. I'll try on another machine. But this is seriously weird.

@matthew-brett
Member

Fair enough. I'll do my best to replicate on my laptop.

@effigies
Member

effigies commented Nov 8, 2015

Any movement on this?

@bcipolli
Contributor

bcipolli commented Nov 8, 2015

I will take a look at this.

@bcipolli
Contributor

bcipolli commented Nov 8, 2015

I was able to reproduce the error in Python 3 (and no error in Python 2) by creating a dummy image of the same size:

import nibabel as nib
import numpy as np
import os
img = nib.Nifti1Image(np.zeros((91,109,91,1200), dtype=np.float32), affine=np.eye(4))
nib.save(img, 'test.nii.gz')
del img
try:
    data = nib.load('test.nii.gz').get_data()
finally:
    os.remove('test.nii.gz')

Output:

>>> data = img.get_data()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bcipolli/code/libs/nibabel/nibabel/spatialimages.py", line 572, in get_data
    data = np.asanyarray(self._dataobj)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/core/numeric.py", line 525, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "/Users/bcipolli/code/libs/nibabel/nibabel/arrayproxy.py", line 145, in __array__
    raw_data = self.get_unscaled()
  File "/Users/bcipolli/code/libs/nibabel/nibabel/arrayproxy.py", line 140, in get_unscaled
    mmap=self._mmap)
  File "/Users/bcipolli/code/libs/nibabel/nibabel/volumeutils.py", line 518, in array_from_file
    n_read = infile.readinto(data_bytes)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/gzip.py", line 274, in read
    return self._buffer.read(size)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/gzip.py", line 469, in read
    uncompress = self._decompressor.decompress(buf, size)
OverflowError: Python int too large for C unsigned int

I will debug a bit from here, but the bottom line is that zlib is a compiled C extension, so there's not much to step into (just the inputs to check on the successful and unsuccessful calls).

@bcipolli
Contributor

bcipolli commented Nov 8, 2015

Yep, it's exactly as expected: the call to zlib with len(buf)=8192 and size=4332617728 is failing. Decrease size to a non-overflow value (e.g. 4000000000), and the call completes just fine.

So I guess the question is whether we report a bug, or work around the issue. Will try to find some relevant documentation. To start, perhaps:
https://hg.python.org/cpython/rev/931e1bc090f6/
https://hg.python.org/cpython/rev/0cca6c5513d2
source:
https://bugs.python.org/issue18294
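
For anyone who wants to reproduce the underlying limit without building a 4 GB image, here is a minimal sketch. Assumption: the OverflowError only occurs on Python versions where zlib's max_length argument is a 32-bit C unsigned int (such as the 3.5 builds in the tracebacks above); later CPython releases accept larger values.

# Minimal reproduction of the zlib limit, independent of nibabel.
import gzip
import zlib

buf = gzip.compress(b'\x00' * 8192)                # a small gzip stream
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # 16+ tells zlib to expect a gzip header
d.decompress(buf, 4332617728)                      # max_length > 2**32 - 1 -> OverflowError here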

@effigies
Member

effigies commented Nov 8, 2015

Thanks for tracking that down. Looks like that issue was superseded and they went with this patch. For now, it makes sense to me to report upstream and see if they're willing to fix this before trying to do any workarounds, here.

@arthurmensch

I am having the same issue here. Hasn't the patch above already been merged into the Python 3 default branch?

@effigies
Member

Sorry, to clarify: I'm guessing that's the patch causing the problem.

@matthew-brett
Member

Ben - how about reporting the bug AND working around the issue :)

@effigies
Member

Don't have a good test rig, so trying to set up a test case using Ben's code (and a smaller Travis test suite) and getting weird Travis errors. Did I screw something up, or is this just a thing where I should wait a bit and re-run the tests?

Edit: It was the test I added. Took too much time or memory, so got killed. Will see if I can upload a sample dataset so we're not wasting time writing, at least.

@effigies
Member

Loading a pre-made dataset still causes either a memory or time issue. Not sure testing for this can be automated...

My initial thought for a strategy would be: try loading with gzip, and if that fails, pipe through gunzip and store in a BytesIO.
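
Something like the following is what I have in mind, as a rough sketch of that strategy (the function name and structure are illustrative, not nibabel code):

# Rough sketch: try the gzip module first, fall back to the external gunzip.
import gzip
import io
import subprocess

def load_gz_bytes(fname):
    try:
        with gzip.open(fname, 'rb') as gz:
            return io.BytesIO(gz.read())
    except OverflowError:
        # Pipe through gunzip and buffer the decompressed stream in memory.
        return io.BytesIO(subprocess.check_output(['gunzip', '-c', fname]))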

@bcipolli
Contributor

Just as a note, I don't see max_read_chunk defined in gzip.py, nor in the docs, so the monkey patch done here will no longer work in Python 3.5:
#210
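
For context, the Python 2 workaround leaned on a gzip.GzipFile class attribute, roughly like this (a sketch of the general idea only; the exact code from #210 may differ):

# Python 2 only: GzipFile read in chunks capped at max_read_chunk (10 MiB by
# default), so raising it sped up large reads. The attribute was dropped in
# the Python 3.5 gzip rewrite, so this no longer has any effect there.
import gzip

class BufferedGzipFile(gzip.GzipFile):
    max_read_chunk = 100 * 1024 * 1024  # 100 MiB per internal read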

@bcipolli
Contributor

After further review, I don't see a quick way to do this either. I would suggest that, for a known bug, we should at least wrap the call in a try/except and refer users to this issue; when the issue is solved, that code can be removed. If that's cool, I'd be glad to implement it.

I also think we ought to re-open #209. I will do that now.

@matthew-brett
Member

matthew-brett commented Nov 14, 2015 via email

@bcipolli
Contributor

Don't we need to file a Python bug?

https://bugs.python.org/issue25626 :)

@matthew-brett
Member

Ben - do you get an error reading a file created like this?

head -c 5G < /dev/urandom | gzip -f > out.gz

@matthew-brett
Member

Ah - how about:

dd if=/dev/urandom bs=1m count=5k | gzip -f > out.gz

http://unix.stackexchange.com/questions/32988/why-does-dd-from-dev-random-give-different-file-sizes

@matthew-brett
Member

If you edit the nibabel code so that it does not pass the size argument, but just asks to read the whole file, do you still get a crash?

@bcipolli
Contributor

head -c 5000000000 < /dev/urandom | gzip -f > out.gz works, but is slower than the Python code to create an image. AND it burns my battery :)

And nibabel won't open it; the file type is unrecognized (even if I rename it to .nii.gz; I assume the header info is wrong).

If you edit the nibabel code so that it does not pass the size argument, but just asks to read the whole file, do you still get a crash?

I don't see anywhere that we explicitly pass in the size. We pass in a buffer in volumeutils.py:array_from_file. I also debugged and looked through all the calling code (and the data object itself) and didn't see anything... did you see a spot?

@matthew-brett
Member

Sorry, I meant, do you get the same error as for an image, if you do:

import gzip
gzf = gzip.open('out.gz', 'rb')
res = gzf.read(4332619200)

@bcipolli
Contributor

My thought was to try a buffered read loop, to avoid the problem. This seems to work:

        data_bytes = bytearray(n_bytes)
        n_read = 0
        while n_read < n_bytes:
            to_read = (n_bytes - n_read)
            if to_read > 2**32:
                to_read = 2**32 - 1
            n_read += infile.readinto(data_bytes[n_read:n_read+to_read])

@bcipolli
Contributor

Sorry, I meant, do you get the same error as for an image, if you do:

Ah I see, sorry. Yes, you get the same error.

@bcipolli
Contributor

Or, more succinctly:

        n_read = 0
        while n_read < n_bytes:
            to_read = (n_bytes - n_read) % (2**32 - 1) or (2**32 - 1)
            n_read += infile.readinto(data_bytes[n_read:n_read+to_read])

@matthew-brett
Member

I guess another option is to disable use of readinto for Python 3.5...

@bcipolli
Contributor

Any preference?

@matthew-brett
Member

I guess buffering will be more efficient. But - I think we need to cover the case where the read returns fewer bytes than we thought. How about something like:

n_read = 0
max_read = 2 ** 32 - 1  # Max for unsigned 32-bit integer
while n_read < n_bytes:
    n_wanted = min([n_bytes - n_read, max_read])
    n_got = infile.readinto(data_bytes[n_read:n_read + n_wanted])
    n_read += n_got
    if n_got != n_wanted:
        break
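
A self-contained version of that loop, for reference (a sketch only, not necessarily what should land in nibabel). One detail: readinto needs a memoryview of the target buffer, so that each partial read lands in the shared bytearray rather than in a copied slice:

def read_into_buffer(infile, n_bytes, max_read=2**32 - 1):
    """Read up to n_bytes from infile in chunks no larger than max_read."""
    data_bytes = bytearray(n_bytes)
    view = memoryview(data_bytes)
    n_read = 0
    while n_read < n_bytes:
        n_wanted = min(n_bytes - n_read, max_read)
        n_got = infile.readinto(view[n_read:n_read + n_wanted])
        if not n_got:  # short read: EOF or truncated stream
            break
        n_read += n_got
    return data_bytes, n_read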

@bcipolli
Contributor

Sounds great. Since it's gzip-specific, we should probably put that in _gzip_open, in openers.py.

For testing, I think a unit test on array_from_file using a BytesIO object could be fastest--writing to disk sucks. If you can think of a way to do the Nifti1Image with BytesIO and avoid actually writing to disk, please let me know.

I'll try the unit test in the meantime.
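
For example, something along these lines (a sketch under the assumption that the test machine has well over 4 GB of RAM; array_from_file is the function from volumeutils.py in the tracebacks above):

# Sketch of a BytesIO-based test; it needs > 4 GB of memory, so it would
# probably have to be skipped by default.
import gzip
import io

import numpy as np
from nibabel.volumeutils import array_from_file

def test_gzip_read_over_4gb():
    n = 2 ** 32 + 4                       # just past the unsigned 32-bit limit
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb', compresslevel=1) as gz:
        gz.write(bytes(n))                # zero-filled payload, ~4 GiB
    buf.seek(0)
    arr = array_from_file((n,), np.uint8, gzip.GzipFile(fileobj=buf, mode='rb'))
    assert arr.shape == (n,)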

@bcipolli
Contributor

Er.... we'll also have to wrap GzipFile in that case... but it seems best to keep things gzip-specific...

@matthew-brett
Member

I see that it's possible to pass a file-like object to gzip.GzipFile; I guess we could use that to simulate a long gzipped stream, but we would still need a machine that can handle at least 4 GB images in memory.

@bcipolli
Contributor

we would still need a machine that could handle at least 4GB images in memory.

Well, I guess it'll have to be 4 GB plus the size of the gzip output in memory, vs. the 4 GB (just the image in memory) that seems unavoidable anyway. So roughly 6 GB vs. 4 GB... the major concern is the time.

As I'm running this code, I'm realizing that it's gzip itself that's slow. I'll decrease the compression level in the test...

@matthew-brett
Member

I hate to say this now, but maybe we could do without a test here, on the basis that it's too time-consuming and memory-intensive to test routinely.

Or we could have a separate test repo for these kinds of very long / slow tests, which we trigger Travis on / run on the buildbots from time to time.

@bcipolli
Contributor

Sure thing. I want a test case to make sure things are working as expected before doing a PR, so ... it's useful, even if we disable the test by default.

@bcipolli
Contributor

How about an environment variable SLOW_TESTS that runs slow tests when set; otherwise, they're skipped?

@matthew-brett
Member

Sure. Even something like NIPY_EXTRA_TESTS, which takes a comma-separated list of extra test types, such as slow,bigmem.
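
For instance, a skip decorator along these lines (a sketch; the variable name and the slow tag follow the suggestion above, the rest is illustrative):

# Gate extra tests behind NIPY_EXTRA_TESTS, a comma-separated list of tags.
import os
import unittest

def needs_extra(kind):
    extras = os.environ.get('NIPY_EXTRA_TESTS', '').split(',')
    return unittest.skipUnless(kind in extras,
                               'set NIPY_EXTRA_TESTS=%s to run this test' % kind)

@needs_extra('slow')
class TestHugeGzipImage(unittest.TestCase):
    def test_load_over_4gb(self):
        ...  # the > 4 GB round trip from the comments above goes here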

grlee77 pushed a commit to grlee77/nibabel that referenced this issue Mar 15, 2016