
Performance degradation in unpickling fairly large objects #36

Closed
kuenishi opened this issue Jul 31, 2019 · 5 comments
Labels
cat:bug (Bug report or fix.), prio:high (High priority. Urgent and needs to be worked on as soon as possible.)

Comments

@kuenishi (Member) commented Jul 31, 2019

@yuyu2172 reported that with ChainerIO, unpickling a fairly large file on NFS is much slower than with Python's built-in I/O system. The difference was 10x or even more.

Microbench

Here's complete benchmark results and code: https://gist.github.com/kuenishi/d8d93847e0705c110501d68101fe5f53

Unpickling the long list l = [0.1] * 10000000 takes 12 seconds with ChainerIO and 0.75 seconds with io.
The main difference in the profile results was the number of read calls: roughly 20M calls versus almost zero (read doesn't even appear in the cProfile output).
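
For reference, a minimal sketch of such a measurement (the linked gist has the complete b.py; the exact setup here, including chainerio.open as the entry point, is an assumption):

    # Sketch of the microbenchmark: profile pickle.load through either
    # chainerio or the built-in open(), selected on the command line.
    import cProfile
    import os
    import pickle
    import sys
    import time

    import chainerio

    PKL = 'list.pkl'
    if not os.path.exists(PKL):
        with open(PKL, 'wb') as f:
            pickle.dump([0.1] * 10000000, f)

    name = sys.argv[1]  # 'chainerio' or 'io'
    opener = chainerio.open if name == 'chainerio' else open
    with opener(PKL, 'rb') as f:
        print(name, type(f), 'haspeek:', hasattr(f, 'peek'))
        prof = cProfile.Profile()
        begin = time.time()
        prof.runcall(pickle.load, f)
        print(name, ':', time.time() - begin, 'sec')
        prof.print_stats('cumulative')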

Root cause

This is because ChainerIO's file object doesn't have a peek method, while Python's io.BufferedReader (a BufferedIOBase implementation) has it. Code (from CPython's Modules/_pickle.c):

        /* Prefetch some data without advancing the file pointer, if possible */
        if (self->peek && n < PREFETCH) {
            len = PyLong_FromSsize_t(PREFETCH);
            if (len == NULL)
                return -1;
            data = _Pickle_FastCall(self->peek, len);
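
For illustration, here is a standalone Python sketch (the CountingReader wrapper is hypothetical, not ChainerIO code) showing how the mere presence of peek changes the number of read calls pickle makes. Protocol 2 is used because it has no framing, matching the many-small-reads pattern above:

    # Hypothetical wrapper: counts read() calls; exposes peek() only on demand.
    import io
    import pickle

    class CountingReader:
        def __init__(self, raw, with_peek):
            self._raw = raw
            self.reads = 0
            if with_peek:
                self.peek = self._peek  # pickle looks this attribute up once

        def read(self, size=-1):
            self.reads += 1
            return self._raw.read(size)

        def readline(self):
            return self._raw.readline()

        def _peek(self, size=0):
            return self._raw.peek(size)

    # Protocol 2 has no framing, so every opcode is a separate small read.
    payload = pickle.dumps([0.1] * 100000, protocol=2)
    for with_peek in (False, True):
        f = CountingReader(io.BufferedReader(io.BytesIO(payload)), with_peek)
        pickle.load(f)
        print('with peek' if with_peek else 'no peek', '->', f.reads, 'read() calls')

Without peek, the unpickler issues one read per opcode fetch; with peek, it prefetches a chunk at a time, which is exactly the difference visible in the profiles below.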

With this patch

diff --git a/chainerio/fileobject.py b/chainerio/fileobject.py
index df7fedd..da31ad9 100644
--- a/chainerio/fileobject.py
+++ b/chainerio/fileobject.py
@@ -43,6 +43,9 @@ class FileObject(io.IOBase):
     def fileno(self) -> int:
         return self.base_file_object.fileno()
 
+    def peek(self, size: int = 0) -> bytes:
+        return self.base_file_object.peek(size)
+
     def read(self, size: Optional[int] = None) -> Union[bytes, str]:
         return self.base_file_object.read(size)

The number of read calls comes back to sanity and the performance has improved:

$ python b.py chainerio    
chainerio <class 'chainerio.fileobject.FileObject'> haspeek: True
chainerio <module 'chainerio' from '/home/kuenishi/src/chainerio/chainerio/__init__.py'> : 0.8312304019927979 sec
        156254 function calls in 0.831 seconds
                                
  Ordered by: cumulative time                                     
                                
  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.428    0.428    0.831    0.831 {built-in method _pickle.load}
   39063    0.016    0.000    0.295    0.000 fileobject.py:49(read)
   39063    0.279    0.000    0.279    0.000 {method 'read' of '_io.BufferedReader' objects}
   39063    0.016    0.000    0.109    0.000 fileobject.py:46(peek)
   39063    0.093    0.000    0.093    0.000 {method 'peek' of '_io.BufferedReader' objects}
       1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Discussion

I would suggest retreating from the current wrapper strategy on file objects until we truly support profiling, mainly because we don't have a perfect solution for now. None of the options is good, or they are even nonsense: 1) adding a peek method is a hacky workaround, and 2) wrapping ChainerIO's file object again with io.BufferedReader or io.BufferedWriter is matryoshka-doll levels of crazy.

  • Which file object interface to support must be carefully chosen at open time, e.g. io.BufferedIOBase or io.IOBase (see the probe sketch below).
  • PyArrow's HDFS file object doesn't have peek.
  • Python's zipfile.ZipExtFile has peek since at least 3.5.
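
A quick, self-contained probe of these points (throwaway file names created on the spot; PyArrow is left as a comment since it needs a live HDFS):

    # Probe what interface an opened file object actually offers.
    import io
    import zipfile

    def describe(f):
        print(type(f).__name__,
              '| BufferedIOBase:', isinstance(f, io.BufferedIOBase),
              '| has peek:', hasattr(f, 'peek'))

    with open('probe.bin', 'wb') as f:
        f.write(b'x' * 16)
    with open('probe.bin', 'rb') as f:
        describe(f)                      # BufferedReader: True, True

    with zipfile.ZipFile('probe.zip', 'w') as z:
        z.writestr('a.bin', b'x' * 16)
    with zipfile.ZipFile('probe.zip') as z:
        with z.open('a.bin') as f:
            describe(f)                  # ZipExtFile: True, True (3.5+)
    # PyArrow's HDFS file object would report "has peek: False" (see above).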
kuenishi added the cat:bug and prio:high labels Jul 31, 2019
@belldandyxtq (Member) commented Jul 31, 2019

Thank you for reporting and finding the root cause.

If we remove the wrapper, what is your plan for supporting pickle on HDFS?

There are a few cases where we need to replace the original file object in order to support some functionality, for example reading as text from HDFS and zip, and the internal zip for Python < 3.7.

I am thinking about moving the functionality of the file object's __init__, which is currently used to determine the proper file object type for opening, into a separate function called by the open decorator.

In this way, we can for now stop routing subsequent I/O calls through the file object wrapper while still keeping the ability to solve issues like text read on HDFS and the internal zip.

@belldandyxtq (Member) commented Jul 31, 2019

To make it clearer, I am thinking about creating a file_object_maker as a replacement for __init__ in the current fileobject.py, which is used to cover the cases where some functionality is not supported by the original file objects, like the text read cases in HDFS and zip.

The current implementation of open_wrapper returns the FileObject defined in fileobject.py or one of its derived classes, and the file object replacement takes place in that FileObject's __init__, hence the replacement is bound to the file object.

If we extract the __init__ from the FileObject and make it a function (e.g. file_object_maker), we can still control which kind of file object to return to the user while not returning an always-wrapped FileObject, working around the issues we have now. A sketch of this idea follows.
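
A minimal sketch of that idea; the dispatch conditions and the fs_type parameter are illustrative, not ChainerIO's actual logic:

    # Factory function replacing FileObject.__init__: wrap only when the
    # backend needs help, otherwise hand back the raw file object.
    import io

    def file_object_maker(base_file_object, mode: str, fs_type: str):
        if 'b' not in mode and fs_type in ('hdfs', 'zip'):
            # These backends hand out binary streams only; decode for the user.
            return io.TextIOWrapper(base_file_object)
        # POSIX and HTTP objects already behave like built-in files: return
        # them untouched so read()/peek() hit the real object directly.
        return base_file_object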

@kuenishi (Member Author)

Wrapping where we need it is fine, but the FileObject in fileobject.py is unnecessary for now, especially for the "posix" filesystem and the zip container. Aligning other filesystems' and file objects' behaviour with "posix" may be the best way to prevent potential performance issues like this, at least until we support profiling.

@belldandyxtq (Member) commented Jul 31, 2019

I think we still need to wrap the zip container for text reading and the internal zip.
The only two that don't need a wrapper are POSIX and HTTP.

And how about pickle on HDFS? A simple plan would be to extract the content and put it into an io.BufferedReader inside the file_object_maker, as sketched below.
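
A minimal sketch of that plan (hdfs_file stands in for whatever the HDFS backend returns; the function name is hypothetical):

    # Materialize the HDFS file's bytes, then hand pickle an
    # io.BufferedReader, which provides the peek() fast path.
    import io
    import pickle

    def load_pickle_from_hdfs(hdfs_file):
        buffered = io.BufferedReader(io.BytesIO(hdfs_file.read()))
        return pickle.load(buffered)  # peek() is now available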

belldandyxtq added a commit that referenced this issue Aug 6, 2019
The readline test in hdfs `test_read_bytes` is removed.
The original purpose of this test was to support pickle,
while such functionality is already tested in `test_pickle`.
Having `readline` when opening with 'rb' is weird.

A timeout timer is added to `test_pickle`, `test_zip_container`
and `test_hdfs_handler` to prevent issue #36.
@kuenishi (Member Author) commented Aug 9, 2019

This issue is addressed for POSIX filesystems by #38.
