ENH: Improve read pattern in gzip #209
Comments
See discussion in nipy/nibabel#209
MRG: fast reads on large gzip files Fixes #209
Oh, I don't have permission to re-open this, but the fix made for this won't work for Python 3.5. A new fix needs to be found for that case. @GaelVaroquaux I opened an issue on the …
@matthew-brett: I agree that this probably needs to be reopened.
The commit removing max_read_chunk is python/cpython@845155b. I haven't looked into it more than that.
Guys - have a look at the comments in https://bugs.python.org/issue25626. Specifically: """…""" Do you have evidence with which to reply?
MRG: fast reads on large gzip files Fixes nipy#209
Problems
Memory usage
When opening big gzip files, such as those shipped by the Human Connectome Project, nibabel uses too much memory (see for instance #208).
The breakage happens in gzip.open and can be explained as follows (see the traceback in #208 for details): the decompressed data are accumulated in a temporary buffer that is reallocated each time a new chunk is appended, as sketched below.
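For reference, here is a simplified sketch of that read pattern, modelled on the Python 2.x gzip module (illustrative only, not the actual stdlib source):

```python
def read_all(fileobj, max_read_chunk=10 * 1024 * 1024):
    # Simplified model of gzip.GzipFile.read() in Python 2.x.
    buf = b''
    readsize = 1024
    while True:
        chunk = fileobj.read(readsize)   # one decompressed chunk
        if not chunk:
            break
        # Appending copies the whole accumulated buffer each time,
        # so the copying cost keeps growing with the amount read.
        buf = buf + chunk
        readsize = min(max_read_chunk, readsize * 2)   # capped at ~10 MB
    return buf
```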
Read is slow
In addition to the memory usage, this pattern leads to very inefficient code: the temporary buffer is grown again and again, which causes costly memory copies. These copies are especially costly because their size is not small compared to the total memory, so the system has to move memory pages belonging to other programs to make room for the allocation. This is typically the behavior that renders a system unresponsive when memory usage gets close to 100%. To see how bad the I/O speed is when the data are of the same order of magnitude as the total memory, an easy experiment is to compare the time a 'nibabel.get_data()' call takes on the compressed file to the time required to decompress the file and load the uncompressed image.
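A concrete version of that experiment might look like the following (file names are placeholders, not files from the original report):

```python
import time
import nibabel

t0 = time.time()
data = nibabel.load('img.nii.gz').get_data()    # read through gzip
print('compressed read: %.1f s' % (time.time() - t0))

t0 = time.time()
data = nibabel.load('img.nii').get_data()       # read the uncompressed file
print('uncompressed read: %.1f s' % (time.time() - t0))
```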
Proposed improvement
The only way I can think of to improve the memory usage would be to preallocate a large enough buffer; we roughly know the size. However, the gzip module does not allow us to do this, and it would require fairly deep monkey-patching that I don't think is reasonable. I give up on this aspect of the problem.
For the speed, the simple option is to increase the 'max_read_chunk' attribute of the GzipFile object. With a file that is 400 MB, an experiment along the following lines shows the benefit:
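A minimal sketch of such an experiment (the file name and chunk size are illustrative; max_read_chunk is an attribute of GzipFile in Python 2.7 and was removed in Python 3.5, as noted in the comments above):

```python
import gzip
import time

t0 = time.time()
gzip_file = gzip.open('big_image.nii.gz')      # placeholder ~400 MB file
gzip_file.max_read_chunk = 200 * 10 ** 6       # raise the 10 MB default to 200 MB
data = gzip_file.read()
print('read in %.1f s' % (time.time() - t0))
```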
The single extra line that raises max_read_chunk gives roughly a factor of two in speed in the above experiment.
To see the impact on nibabel, a little bit of monkey-patching enables experimentation:
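For instance (a hedged sketch: setting the class-level attribute works on Python 2.7, where GzipFile still has max_read_chunk, and the file name is a placeholder):

```python
import gzip
import nibabel

# Raise the class-level default before loading through nibabel
# (Python 2.7: GzipFile.max_read_chunk defaults to 10 MB).
gzip.GzipFile.max_read_chunk = 200 * 10 ** 6   # 200 MB

data = nibabel.load('big_image.nii.gz').get_data()   # placeholder file name
```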
This again gives roughly a factor-of-two speedup.
Action point
I am proposing to submit a patch to nibabel that uses the following strategy for its opener: when a gzip file is opened, raise max_read_chunk on the GzipFile object so that the data are read in much larger chunks.
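A rough sketch of what such an opener could look like (the function name and the chunk-size constant are placeholders, not the actual patch; the hasattr guard accounts for Pythons where max_read_chunk has been removed):

```python
import gzip

GZIP_MAX_READ_CHUNK = 100 * 1024 * 1024   # placeholder value, e.g. 100 MB

def gzip_open(filename, mode='rb'):
    """Open a gzip file for nibabel, reading in large chunks."""
    gzip_file = gzip.GzipFile(filename, mode)
    # Only older Pythons (e.g. 2.7) expose max_read_chunk; guard so the
    # opener keeps working where the attribute has been removed.
    if hasattr(gzip_file, 'max_read_chunk'):
        gzip_file.max_read_chunk = GZIP_MAX_READ_CHUNK
    return gzip_file
```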
It should be an easy, local modification. @matthew-brett: would you be in favor of such a patch?