ENH: Improve read pattern in gzip #209

Closed · GaelVaroquaux opened this issue Nov 14, 2013 · 4 comments · Fixed by #210
Comments

@GaelVaroquaux (Member)

Problems

Memory usage

When opening big gzip files, as for instance those shipped by the Human Connectome Project, nibabel takes too much memory (see for instance #208).

The breakage happens in gzip.open and can be explained as follows (see the traceback in #208 to understand better):

  • gzip does not know in advance how much memory it needs to allocate, so it uses 1) for decompression, a growing chunk size, in powers of two and capped at gzip.GzipFile.max_read_chunk, 2) for storing the output, a buffer that it grows progressively by appending each chunk read.
  • for memory, that output buffer is the worst problem: its footprint doubles whenever the append happens, because the append is not in place; a new, larger buffer is created and only then is the old one freed (see the sketch below).
  • if the final data is 4 GB plus epsilon, the gzip opener will at some instant hold two 4 GB buffers, which in @rphlypo's case blows up the memory.
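A minimal sketch of that append pattern (purely illustrative, not nibabel or CPython code) makes the doubling visible:

# Each "append" builds a brand-new, larger buffer before the old one can be
# freed, so peak memory is roughly twice the size of the data read so far.
decompressed = b""
for _ in range(4):
    chunk = b"\0" * (64 * 1024 * 1024)   # stand-in for a decompressed chunk
    decompressed = decompressed + chunk  # allocates len(old) + len(chunk) while old still exists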

Read is slow

In addition to the memory usage, this pattern leads to very inefficient code: the temporary buffer is grown again and again, which means costly memory copies. These copies are especially costly because they are not small compared to the total memory, so the system has to evict memory pages belonging to other programs to satisfy the allocation. This is the typical behavior that makes systems unresponsive when memory usage gets close to 100%. To see how bad the I/O speed is with data whose size is on the same order of magnitude as the total memory, an easy experiment is to compare the time a 'nibabel.get_data()' call takes to the time required to (a timing sketch follows the list):

  • gzip.GzipFile('file.nii.gz').read() (slow)
  • gunzip -c file.nii.gz > /dev/null (almost twice as fast)
  • open in fslview (almost twice as fast)
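A rough way to reproduce the first two comparisons (the file name is a placeholder and absolute times depend on the machine):

import subprocess
import time

import gzip

fname = 'file.nii.gz'  # placeholder

t0 = time.time()
gzip.GzipFile(fname).read()
print('gzip.GzipFile().read(): %.1f s' % (time.time() - t0))

t0 = time.time()
subprocess.call('gunzip -c %s > /dev/null' % fname, shell=True)
print('gunzip -c > /dev/null:  %.1f s' % (time.time() - t0))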

Proposed improvement

The only way I can think of to improve the memory usage would be to preallocate a buffer that is big enough; we roughly know the size. However, the gzip module does not let us do this, and it would require fairly deep monkey-patching that I don't think is reasonable. I give up on this aspect of the problem.
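For reference, the kind of preallocated read that would avoid the doubling looks roughly like the sketch below; it assumes the uncompressed size is known in advance (e.g. from the image header), which is exactly the hook the stock gzip module does not give us:

import gzip

expected_size = 4 * 1024 ** 3    # assumed known, e.g. from the NIfTI header
chunk_size = 10 * 1024 * 1024    # read in 10 MB chunks

buf = bytearray(expected_size)   # allocated once, never copied wholesale
offset = 0
with gzip.open('file.nii.gz') as gzip_file:
    while True:
        chunk = gzip_file.read(chunk_size)
        if not chunk:
            break
        buf[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
del buf[offset:]                 # trim in place if the size guess was too large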

For the speed, the simple option is to increase the 'max_read_chunk' of the GzipFile object. With a file that is 400 MB, here is an experiment that shows the benefit:

import gzip

gzip_file = gzip.open('0010042_rest_tshift_RPI_voreg_mni.nii.gz')
gzip_file.max_read_chunk = 100 * 1024 * 1024  # 100 MB
data = gzip_file.read()

The extra line (the one setting max_read_chunk) gives roughly a factor of two speedup in the above example.

To see the impact on nibabel, a little bit of monkey-patching enables experimentation:

import gzip

gzip.GzipFile.max_read_chunk = 200 * 1024 * 1024  # 200 MB

This also gives roughly a factor of two speedup in nibabel.
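For instance, a quick way to check the effect on nibabel itself (file name and numbers are machine and data dependent):

import time

import gzip
import nibabel

gzip.GzipFile.max_read_chunk = 200 * 1024 * 1024  # 200 MB

t0 = time.time()
data = nibabel.load('0010042_rest_tshift_RPI_voreg_mni.nii.gz').get_data()
print('load + get_data with 200 MB chunks: %.1f s' % (time.time() - t0))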

Action point

I am proposing to submit a patch to nibabel that uses the following strategy for an opener:

gzip_file = gzip.open('0010042_rest_tshift_RPI_voreg_mni.nii.gz')
gzip_file.max_read_chunk = 100 * 1024 * 1024  # 100 MB
data = gzip_file.read()

It should be an easy, local modification. @matthew-brett: would you be in favor of such a patch?
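Concretely, the opener could be wrapped in a small helper along these lines (_gzip_open is a hypothetical name for illustration, not the actual patch in #210):

import gzip

def _gzip_open(fname, mode='rb', max_read_chunk=100 * 1024 * 1024):
    # Open a gzip file with a larger decompression chunk size (100 MB here)
    # so that read() does fewer, larger passes over the data.
    gzip_file = gzip.open(fname, mode)
    gzip_file.max_read_chunk = max_read_chunk
    return gzip_file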

GaelVaroquaux added a commit to GaelVaroquaux/nilearn that referenced this issue Nov 14, 2013
GaelVaroquaux added a commit to GaelVaroquaux/nibabel that referenced this issue Nov 14, 2013
GaelVaroquaux added a commit to nilearn/nilearn that referenced this issue Nov 20, 2013
matthew-brett added a commit that referenced this issue Oct 13, 2014
MRG: fast reads on large gzip files

Fixes #209
@bcipolli (Contributor)

Oh, I don't have permission to re-open this, but the fix made for this won't work on Python 3.5; a new fix needs to be found for that case.

@GaelVaroquaux I opened an issue on the nilearn side of things as well:
nilearn/nilearn#843

@GaelVaroquaux (Member, Author)

@matthew-brett : I agree that this probably needs to be reopened.

@effigies reopened this Nov 16, 2015
@lesteve commented Nov 16, 2015

The commit removing max_read_chunk is python/cpython@845155b. I haven't tried to look into it more than that.
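So on 3.5 the attribute is simply gone from GzipFile and setting it on an instance is ignored by the new read path; any workaround on nibabel's side would have to be guarded, something like the hypothetical helper below:

import gzip

def _set_max_read_chunk(gzip_file, size=100 * 1024 * 1024):
    # max_read_chunk only exists (and is only honoured) before the rewrite in
    # python/cpython@845155b, i.e. on Python < 3.5; on newer versions setting
    # it has no effect, so only bother when the class actually defines it.
    if hasattr(gzip.GzipFile, 'max_read_chunk'):
        gzip_file.max_read_chunk = size
    return gzip_file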

@matthew-brett (Member)

Guys - have a look at the comments in https://bugs.python.org/issue25626

Specifically:

"""
If you think the new 3.5 implementation (with OverflowError fixed) is significantly less efficient, I can look at ways of improving it. But I am a bit skeptical. Usually I find once you process 10s or 100s of kB of data at once, any increases in the buffer size have negligible effect on the little time spent in native Python code, and times actually get worse as your buffers get too big, probably due to ineffective use of fast memory caching.
"""

Do you have evidence with which to reply?
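One way to gather numbers for that thread would be to time chunked reads of different sizes against each other on 3.5 (the file name is a placeholder):

import gzip
import time

fname = 'large_image.nii.gz'  # placeholder

for chunk_size in (64 * 1024, 1024 ** 2, 16 * 1024 ** 2, 256 * 1024 ** 2):
    t0 = time.time()
    n = 0
    with gzip.open(fname, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            n += len(chunk)
    print('chunk %10d bytes: %d bytes read in %.2f s'
          % (chunk_size, n, time.time() - t0))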

grlee77 pushed a commit to grlee77/nibabel that referenced this issue Mar 15, 2016
grlee77 pushed a commit to grlee77/nibabel that referenced this issue Mar 15, 2016
MRG: fast reads on large gzip files

Fixes nipy#209