
nibabel.load crashes for very large files on python 3, not python 2 #362

Closed

surchs opened this issue Oct 20, 2015 · 46 comments

Comments

@surchs

surchs commented Oct 20, 2015

Hi,

I just noticed the following error: loading a very large file (such as a resting-state image from the HCP dataset, e.g. with dimensions (91, 109, 91, 1200)) works fine with nibabel on Python 2.7. The same command on Python 3.5 fails with the following error:

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-5-6d7286f1eb52> in <module>()
----> 1 in_vol = in_img.get_data()

/usr/local/lib/python3.5/dist-packages/nibabel/spatialimages.py in get_data(self, caching)
    580         if self._data_cache is not None:
    581             return self._data_cache
--> 582         data = np.asanyarray(self._dataobj)
    583         if caching == 'fill':
    584             self._data_cache = data

/usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py in asanyarray(a, dtype, order)
    512 
    513     """
--> 514     return array(a, dtype, copy=False, order=order, subok=True)
    515 
    516 def ascontiguousarray(a, dtype=None):

/usr/local/lib/python3.5/dist-packages/nibabel/arrayproxy.py in __array__(self)
    142     def __array__(self):
    143         # Read array and scale
--> 144         raw_data = self.get_unscaled()
    145         return apply_read_scaling(raw_data, self._slope, self._inter)
    146 

/usr/local/lib/python3.5/dist-packages/nibabel/arrayproxy.py in get_unscaled(self)
    137                                        offset=self._offset,
    138                                        order=self.order,
--> 139                                        mmap=self._mmap)
    140         return raw_data
    141 

/usr/local/lib/python3.5/dist-packages/nibabel/volumeutils.py in array_from_file(shape, in_dtype, infile, offset, order, mmap)
    512     if hasattr(infile, 'readinto'):
    513         data_bytes = bytearray(n_bytes)
--> 514         n_read = infile.readinto(data_bytes)
    515         needs_copy = False
    516     else:

/usr/lib/python3.5/gzip.py in read(self, size)
    272             import errno
    273             raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 274         return self._buffer.read(size)
    275 
    276     def read1(self, size=-1):

/usr/lib/python3.5/_compression.py in readinto(self, b)
     66     def readinto(self, b):
     67         with memoryview(b) as view, view.cast("B") as byte_view:
---> 68             data = self.read(len(byte_view))
     69             byte_view[:len(data)] = data
     70         return len(data)

/usr/lib/python3.5/gzip.py in read(self, size)
    467             buf = self._fp.read(io.DEFAULT_BUFFER_SIZE)
    468 
--> 469             uncompress = self._decompressor.decompress(buf, size)
    470             if self._decompressor.unconsumed_tail != b"":
    471                 self._fp.prepend(self._decompressor.unconsumed_tail)

OverflowError: Python int too large for C unsigned int

Happy to provide further info, not sure what is going on.

@effigies
Member

What specific file from HCP did you use to produce this crash?

Also, can you try with Python 3.4? I don't believe nibabel is testing on 3.5, yet.

@surchs
Author

surchs commented Oct 20, 2015

I used a resting state nifti file with all 4 sessions combined, hence the 1200 timepoints.

@matthew-brett
Member

Is it possible you have a 32-bit version of Python 3.5?

@surchs
Author

surchs commented Oct 20, 2015

Pretty sure I got 64-bit:

In [1]: import sys

In [2]: sys.maxsize
Out[2]: 9223372036854775807

@matthew-brett
Member

Can you try debug at the IPython prompt after the error, and find out what the value of size is in the traceback?

@surchs
Author

surchs commented Oct 20, 2015

Sure

ipdb> size
4332617728

@matthew-brett
Member

As the error message says, that will just overflow a 32-bit unsigned integer.

Would you mind putting a 1/0 or similar at line 512 of volumeutils.py in order to raise an error, then check the size of n_bytes when reading in Python 2.7? I guess it is the same?

@surchs
Author

surchs commented Oct 20, 2015

It's pretty close:

/usr/local/lib/python3.5/dist-packages/nibabel/volumeutils.py in array_from_file(shape, in_dtype, infile, offset, order, mmap)
    509         return np.array([])
    510     # Read data from file
--> 511     1/0
    512     infile.seek(offset)
    513     if hasattr(infile, 'readinto'):

ZeroDivisionError: division by zero

In [5]: debug
> /usr/local/lib/python3.5/dist-packages/nibabel/volumeutils.py(511)array_from_file()
    510     # Read data from file
--> 511     1/0
    512     infile.seek(offset)

ipdb> n_bytes
4332619200
ipdb> 

@matthew-brett
Member

So, that should overflow a 32-bit unsigned int as well. I guess the difference is in the Python implementations.
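(For reference: 91 × 109 × 91 × 1200 voxels × 4 bytes for float32 is 4,332,619,200 bytes, just over the unsigned 32-bit limit of 2**32 - 1 = 4,294,967,295.)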

@surchs
Author

surchs commented Oct 20, 2015

I'm not sure I'm following. Why would there be a 32-bit constraint when running on a 64-bit Python shell?

@matthew-brett
Member

I don't know why that would happen, but it will depend on how Python implements gzip decompression. My guess is Python changed the implementation at some point, and the newer implementation happens to use an unsigned 32-bit int.

Can you check other Python versions? Sorry to ask...

@surchs
Author

surchs commented Oct 20, 2015

Not easily, no. Sorry as well. I'll try on another machine. But this is seriously weird.

@matthew-brett
Member

Fair enough. I'll do my best to replicate on my laptop.

@effigies
Member

effigies commented Nov 8, 2015

Any movement on this?

@bcipolli
Contributor

bcipolli commented Nov 8, 2015

I will take a look at this.

@bcipolli
Contributor

bcipolli commented Nov 8, 2015

I was able to reproduce the error in Python 3 (and no error in Python 2) by creating a dummy image of the same size:

import nibabel as nib
import numpy as np
import os
img = nib.Nifti1Image(np.zeros((91,109,91,1200), dtype=np.float32), affine=np.eye(4))
nib.save(img, 'test.nii.gz')
del img
try:
    data = nib.load('test.nii.gz').get_data()
finally:
    os.remove('test.nii.gz')

Output:

>>> data = img.get_data()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bcipolli/code/libs/nibabel/nibabel/spatialimages.py", line 572, in get_data
    data = np.asanyarray(self._dataobj)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/core/numeric.py", line 525, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "/Users/bcipolli/code/libs/nibabel/nibabel/arrayproxy.py", line 145, in __array__
    raw_data = self.get_unscaled()
  File "/Users/bcipolli/code/libs/nibabel/nibabel/arrayproxy.py", line 140, in get_unscaled
    mmap=self._mmap)
  File "/Users/bcipolli/code/libs/nibabel/nibabel/volumeutils.py", line 518, in array_from_file
    n_read = infile.readinto(data_bytes)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/gzip.py", line 274, in read
    return self._buffer.read(size)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/gzip.py", line 469, in read
    uncompress = self._decompressor.decompress(buf, size)
OverflowError: Python int too large for C unsigned int

I will debug a bit from here, but the bottom line is that zlib is a compiled C extension, so there's not much to step into (just the inputs to check on the successful and unsuccessful calls).

@bcipolli
Contributor

bcipolli commented Nov 8, 2015

Yep, it's exactly as expected: the call to zlib with len(buf)=8192 and size=4332617728 is failing. Decrease size to a non-overflow value (e.g. 4000000000), and the call completes just fine.

So I guess the question is whether we report a bug, or work around the issue. Will try to find some relevant documentation. To start, perhaps:
https://hg.python.org/cpython/rev/931e1bc090f6/
https://hg.python.org/cpython/rev/0cca6c5513d2
source:
https://bugs.python.org/issue18294
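
For anyone who wants to reproduce the underlying limit without building a 4 GB image, here is a minimal sketch. Assumption: the OverflowError only occurs on Python versions where zlib's max_length argument is a 32-bit C unsigned int (such as the 3.5 builds in the tracebacks above); later CPython releases accept larger values.

# Minimal reproduction of the zlib limit, independent of nibabel.
import gzip
import zlib

buf = gzip.compress(b'\x00' * 8192)                # a small gzip stream
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # 16+ tells zlib to expect a gzip header
d.decompress(buf, 4332617728)                      # max_length > 2**32 - 1 -> OverflowError here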

@effigies
Member

effigies commented Nov 8, 2015

Thanks for tracking that down. Looks like that issue was superseded and they went with this patch. For now, it makes sense to me to report upstream and see if they're willing to fix this before trying to do any workarounds, here.

@arthurmensch

I am having the same issue here. Hasn't the patch above already been merged into the Python 3 default branch?

@effigies
Member

Sorry, to clarify: I'm guessing that's the patch causing the problem.

@matthew-brett
Member

Ben - how about reporting the bug AND working around the issue :)

@effigies
Member

Don't have a good test rig, so trying to set up a test case using Ben's code (and a smaller Travis test suite) and getting weird Travis errors. Did I screw something up, or is this just a thing where I should wait a bit and re-run the tests?

Edit: It was the test I added. Took too much time or memory, so got killed. Will see if I can upload a sample dataset so we're not wasting time writing, at least.

@effigies
Member

Loading a pre-made dataset still causes either a memory or time issue. Not sure testing for this can be automated...

My initial thought for a strategy would be: try loading with gzip, and if that fails, pipe through gunzip and store in a BytesIO.
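
Something like the following is what I have in mind, as a rough sketch of that strategy (the function name and structure are illustrative, not nibabel code):

# Rough sketch: try the gzip module first, fall back to the external gunzip.
import gzip
import io
import subprocess

def load_gz_bytes(fname):
    try:
        with gzip.open(fname, 'rb') as gz:
            return io.BytesIO(gz.read())
    except OverflowError:
        # Pipe through gunzip and buffer the decompressed stream in memory.
        return io.BytesIO(subprocess.check_output(['gunzip', '-c', fname]))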

@bcipolli
Contributor

Just as a note, I don't see max_read_chunk defined in gzip.py, nor in the docs, so the monkey patch done here will no longer work in Python 3.5:
#210
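
For context, the Python 2 workaround leaned on a gzip.GzipFile class attribute, roughly like this (a sketch of the general idea only; the exact code from #210 may differ):

# Python 2 only: GzipFile read in chunks capped at max_read_chunk (10 MiB by
# default), so raising it sped up large reads. The attribute was dropped in
# the Python 3.5 gzip rewrite, so this no longer has any effect there.
import gzip

class BufferedGzipFile(gzip.GzipFile):
    max_read_chunk = 100 * 1024 * 1024  # 100 MiB per internal read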

@bcipolli
Contributor

After further review, I don't see a quick way to do this either. I would suggest that, for a known bug, we should at least wrap the call in a try/except and refer users to this issue; when the issue is solved, that code can be removed. If that's cool, I'd be glad to implement it.

I also think we ought to re-open #209. I will do that now.

@matthew-brett
Member

matthew-brett commented Nov 14, 2015 via email

@bcipolli
Contributor

Don't we need to file a Python bug?

https://bugs.python.org/issue25626 :)

@matthew-brett
Member

Ben - do you get an error reading a file created like this?

head -c 5G < /dev/urandom | gzip -f > out.gz

@matthew-brett
Member

Ah - how about:

dd if=/dev/urandom bs=1m count=5k | gzip -f > out.gz

http://unix.stackexchange.com/questions/32988/why-does-dd-from-dev-random-give-different-file-sizes

@matthew-brett
Member

If you edit the nibabel code so that it does not pass the size argument, but just asks to read the whole file, do you still get a crash?

@bcipolli
Contributor

head -c 5000000000 < /dev/urandom | gzip -f > out.gz works, but is slower than the Python code to create an image. AND it burns my battery :)

And nibabel won't open it; the file type is unrecognized (even if I rename it to .nii.gz; I assume the header info is wrong).

If you edit the nibabel code so that it does not pass the size argument, but just asks to read the whole file, do you still get a crash?

I don't see anywhere that we explicitly pass in the size. We pass in a buffer in volumeutils.py:array_from_file. I also debugged and looked through all the calling code (and the data object itself) and didn't see anything... did you see a spot?

@matthew-brett
Member

Sorry, I meant, do you get the same error as for an image, if you do:

import gzip
gzf = gzip.open('out.gz', 'rb')
res = gzf.read(4332619200)

@bcipolli
Contributor

My thought was to try a buffered read loop, to avoid the problem. This seems to work:

        data_bytes = bytearray(n_bytes)
        n_read = 0
        while n_read < n_bytes:
            to_read = (n_bytes - n_read)
            if to_read > 2**32:
                to_read = 2**32 - 1
            n_read += infile.readinto(data_bytes[n_read:n_read+to_read])

@bcipolli
Contributor

Sorry, I meant, do you get the same error as for an image, if you do:

Ah I see, sorry. Yes, you get the same error.

@bcipolli
Contributor

Or, more succinctly:

        n_read = 0
        while n_read < n_bytes:
            to_read = (n_bytes - n_read) % (2**32 - 1) or (2**32 - 1)
            n_read += infile.readinto(data_bytes[n_read:n_read+to_read])

@matthew-brett
Member

I guess another option is to disable use of readinto for Python 3.5...

@bcipolli
Contributor

Any preference?

@matthew-brett
Member

I guess buffering will be more efficient. But - I think we need to cover the case where the read returns fewer bytes than we thought. How about something like:

n_read = 0
max_read = 2 ** 32 - 1  # Max for unsigned 32-bit integer
while n_read < n_bytes:
    n_wanted = min([n_bytes - n_read, max_read])
    n_got = infile.readinto(data_bytes[n_read:n_read + n_wanted])
    n_read += n_got
    if n_got != n_wanted:
        break
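
A self-contained version of that loop, for reference (a sketch only, not necessarily what should land in nibabel). One detail: readinto needs a memoryview of the target buffer, so that each partial read lands in the shared bytearray rather than in a copied slice:

def read_into_buffer(infile, n_bytes, max_read=2**32 - 1):
    """Read up to n_bytes from infile in chunks no larger than max_read."""
    data_bytes = bytearray(n_bytes)
    view = memoryview(data_bytes)
    n_read = 0
    while n_read < n_bytes:
        n_wanted = min(n_bytes - n_read, max_read)
        n_got = infile.readinto(view[n_read:n_read + n_wanted])
        if not n_got:  # short read: EOF or truncated stream
            break
        n_read += n_got
    return data_bytes, n_read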

@bcipolli
Contributor

Sounds great. Since it's gzip-specific, we should probably put that in _gzip_open, in openers.py.

For testing, I think a unit test on array_from_file using a BytesIO object could be fastest--writing to disk sucks. If you can think of a way to do the Nifti1Image with BytesIO and avoid actually writing to disk, please let me know.

I'll try the unit test in the meantime.
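
For example, something along these lines (a sketch under the assumption that the test machine has well over 4 GB of RAM; array_from_file is the function from volumeutils.py in the tracebacks above):

# Sketch of a BytesIO-based test; it needs > 4 GB of memory, so it would
# probably have to be skipped by default.
import gzip
import io

import numpy as np
from nibabel.volumeutils import array_from_file

def test_gzip_read_over_4gb():
    n = 2 ** 32 + 4                       # just past the unsigned 32-bit limit
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb', compresslevel=1) as gz:
        gz.write(bytes(n))                # zero-filled payload, ~4 GiB
    buf.seek(0)
    arr = array_from_file((n,), np.uint8, gzip.GzipFile(fileobj=buf, mode='rb'))
    assert arr.shape == (n,)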

@bcipolli
Contributor

Er.... we'll also have to wrap GzipFile in that case... but it seems best to keep things gzip-specific...

@matthew-brett
Member

I see that it's possible to pass a file-like object to gzip.GzipFile; I guess we could use that to simulate a long gzipped stream, but we would still need a machine that can handle at least 4 GB images in memory.

@bcipolli
Contributor

we would still need a machine that could handle at least 4GB images in memory.

Well, I guess it'll have to be 4 GB plus the size of the gzip output in memory, vs. the 4 GB (just the image in memory) that seems unavoidable anyway. So roughly 6 GB vs. 4 GB... the major concern is the time.

As I'm running this code, I'm realizing that it's gzip itself that's slow. I'll decrease the compression level in the test...

@matthew-brett
Member

I hate to say this now, but maybe we could do without a test here, on the basis that it's too time-consuming and memory-intensive to test routinely.

Or we could have a separate test repo for these kinds of very long / slow tests, which we trigger Travis on / run on the buildbots from time to time.

@bcipolli
Contributor

Sure thing. I want a test case to make sure things are working as expected before doing a PR, so ... it's useful, even if we disable the test by default.

@bcipolli
Contributor

How about an environment variable SLOW_TESTS that runs slow tests when set; otherwise, they're skipped?

@matthew-brett
Member

Sure. Even something like NIPY_EXTRA_TESTS, which takes a comma-separated list of extra test types, such as slow,bigmem.
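
For instance, a skip decorator along these lines (a sketch; the variable name and the slow tag follow the suggestion above, the rest is illustrative):

# Gate extra tests behind NIPY_EXTRA_TESTS, a comma-separated list of tags.
import os
import unittest

def needs_extra(kind):
    extras = os.environ.get('NIPY_EXTRA_TESTS', '').split(',')
    return unittest.skipUnless(kind in extras,
                               'set NIPY_EXTRA_TESTS=%s to run this test' % kind)

@needs_extra('slow')
class TestHugeGzipImage(unittest.TestCase):
    def test_load_over_4gb(self):
        ...  # the > 4 GB round trip from the comments above goes here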

grlee77 pushed a commit to grlee77/nibabel that referenced this issue Mar 15, 2016