Performance degradation when unpickling fairly large objects #36
Comments
Thank you for reporting and finding the root cause. If we remove the wrapper, what is your plan for supporting pickle on HDFS? There are a few cases where we need to replace the original file object to support certain functionality: for example, reading as text from HDFS and zip, and the internal zip for Python < 3.7. I am thinking about extracting that functionality. In this way, we can for now get rid of the I/O calls going through the file object wrapper, while still having the ability to solve issues like text read on HDFS and internal zip.
To make it clearer, I am thinking about creating a separate wrapper. The current implementation of open_wrapper returns the FileObject defined in fileobject.py or one of its derived classes. If we extract that wrapping, we can apply it only where it is actually needed.
Wrapping where we need it is fine, but the FileObject in fileobject.py is unnecessary for now, especially for the "posix" filesystem and the zip container. Aligning the behaviour of other filesystems and file objects with "posix" may be the best way to prevent potential performance issues like this, at least until we support profiling.
I think we need to wrap the zip container for text reading and internal zip. And how about pickle on HDFS? A simple plan would be extracting the content and putting it into an io.BufferedReader.
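A minimal sketch of that plan, assuming the bytes have already been fetched from HDFS (the helper name is hypothetical, not the project's actual API): hand pickle an `io.BufferedReader` over the in-memory content, so the unpickler gets the buffered interface (including `peek`) it expects.

```python
import io
import pickle


def load_pickle_from_bytes(raw: bytes):
    """Wrap already-fetched bytes in a buffered reader so that
    pickle sees peek()/readline() and can read efficiently."""
    buffered = io.BufferedReader(io.BytesIO(raw))
    return pickle.load(buffered)


# Round-trip a small object through the hypothetical helper.
payload = pickle.dumps({"a": 1, "b": [2, 3]})
restored = load_pickle_from_bytes(payload)
print(restored)
```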
The readline test in the HDFS `test_read_bytes` is removed. The original purpose of this test was to support pickle, and that functionality is already covered by `test_pickle`; having `readline` when opening with 'rb' is odd. A timeout timer is added to `test_pickle`, `test_zip_container`, and `test_hdfs_handler` to prevent issue #36.
This issue for POSIX filesystems is addressed by #38.
@yuyu2172 reported that unpickling a fairly large file on NFS is much slower with ChainerIO than with Python's built-in I/O system. The difference was 10x or even more.
Microbench
Here's complete benchmark results and code: https://gist.github.com/kuenishi/d8d93847e0705c110501d68101fe5f53
Unpickling the long list
l = [0.1] * 10000000
takes 12 seconds with ChainerIO and 0.75 seconds with io.
The main difference in the profile results was the number of read calls: roughly 20M calls vs. almost zero (they don't even appear in cProfile).
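A stdlib-only sketch of that kind of measurement (the linked gist benchmarks ChainerIO; here plain `pickle.loads` stands in, and the list is scaled down 10x to keep the run short). The per-function call counts in the cProfile output are what revealed the excessive read calls.

```python
import cProfile
import pickle
import pstats

# Payload shaped like the one in the report, scaled down 10x.
l = [0.1] * 1_000_000
data = pickle.dumps(l)

profiler = cProfile.Profile()
profiler.enable()
restored = pickle.loads(data)
profiler.disable()

# The ncalls column shows whether reads are coalesced or issued
# once per tiny request.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```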
Root cause
This is because ChainerIO's file object doesn't have a peek method, while Python's io.BufferedReader (a BufferedIOBase implementation) has it. With a patch that adds peek, the number of read calls comes back to a sane level and the performance improves.
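The effect can be reproduced with the stdlib alone (a sketch, not the project's code): a raw file object without `peek` gets hit once per small request by the unpickler, while wrapping it in `io.BufferedReader` (which provides `peek`) coalesces the reads into large chunks.

```python
import io
import pickle


class CountingRaw(io.RawIOBase):
    """Raw file object that counts low-level read calls and, like
    the wrapper described in this issue, has no peek() method."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)
        self.reads = 0

    def readable(self) -> bool:
        return True

    def readinto(self, b) -> int:
        self.reads += 1
        return self._buf.readinto(b)


data = pickle.dumps([0.1] * 100_000, protocol=2)

# Unpickle straight from the raw object: every opcode/argument fetch
# becomes a separate read call.
direct = CountingRaw(data)
pickle.load(direct)

# Unpickle through io.BufferedReader: reads are coalesced into
# buffer-sized chunks, so the raw object sees only a handful.
buffered = CountingRaw(data)
pickle.load(io.BufferedReader(buffered))

print("direct reads:  ", direct.reads)
print("buffered reads:", buffered.reads)
```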
Discussion
I would suggest retreating from the current wrapper strategy on file objects until we truly support profiling. This is mainly because we don't have a perfect solution for now. For example, none of these options are good, and some are even nonsense: 1) adding a peek method is a hacky workaround, 2) wrapping ChainerIO's file object again with io.BufferedReader or io.BufferedWriter is like a matryoshka doll and crazy, 3) making ChainerIO's file object a full implementation of io.BufferedIOBase or io.IOBase. Note that zipfile.ZipExtFile has had peek since at least 3.5.
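A quick stdlib illustration of the two `peek` facts above (not from the issue): `io.BufferedReader` exposes `peek`, and members read from a zip archive (`zipfile.ZipExtFile`) already support it, so neither needs the wrapper.

```python
import io
import zipfile

# io.BufferedReader.peek(): inspect upcoming bytes without consuming
# them (peek may return more than asked, hence the slice).
br = io.BufferedReader(io.BytesIO(b"hello world"))
head = br.peek(5)[:5]
print(head)        # b'hello'
print(br.read(5))  # b'hello' -- peek did not advance the stream

# zipfile.ZipExtFile has had peek() for a long time (since at least
# Python 3.5, per the note above).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello world")
buf.seek(0)
with zipfile.ZipFile(buf) as zf, zf.open("a.txt") as member:
    zip_head = member.peek(5)[:5]
    body = member.read()
print(zip_head)  # b'hello'
print(body)      # b'hello world'
```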