
Performance degradation in unpickling fairly large objects #36

Closed
kuenishi opened this issue Jul 31, 2019 · 5 comments
Labels
cat:bug (Bug report or fix.), prio:high (High priority. Urgent and needs to be worked on as soon as possible.)

Comments

@kuenishi (Member) commented Jul 31, 2019

@yuyu2172 reported that with ChainerIO, unpickling a fairly large file on NFS is much slower than with Python's built-in I/O system. The difference was 10x or even more.

Microbench

Here's complete benchmark results and code: https://gist.github.com/kuenishi/d8d93847e0705c110501d68101fe5f53

Unpickling the long list l = [0.1] * 10000000 takes 12 seconds with ChainerIO and 0.75 seconds with io.
The main difference in the profile results was the number of read calls: roughly 20M calls versus almost zero (read doesn't even appear in the cProfile output).
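
For reference, a minimal sketch of such a measurement (the linked gist has the complete b.py; the exact setup here, including chainerio.open as the entry point, is an assumption):

    # Sketch of the microbenchmark: profile pickle.load through either
    # chainerio or the built-in open(), selected on the command line.
    import cProfile
    import os
    import pickle
    import sys
    import time

    import chainerio

    PKL = 'list.pkl'
    if not os.path.exists(PKL):
        with open(PKL, 'wb') as f:
            pickle.dump([0.1] * 10000000, f)

    name = sys.argv[1]  # 'chainerio' or 'io'
    opener = chainerio.open if name == 'chainerio' else open
    with opener(PKL, 'rb') as f:
        print(name, type(f), 'haspeek:', hasattr(f, 'peek'))
        prof = cProfile.Profile()
        begin = time.time()
        prof.runcall(pickle.load, f)
        print(name, ':', time.time() - begin, 'sec')
        prof.print_stats('cumulative')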

Root cause

This is because ChainerIO's file object doesn't have a peek method, while Python's io.BufferedReader (a BufferedIOBase implementation) has it. Code (from CPython's Modules/_pickle.c):

        /* Prefetch some data without advancing the file pointer, if possible */
        if (self->peek && n < PREFETCH) {
            len = PyLong_FromSsize_t(PREFETCH);
            if (len == NULL)
                return -1;
            data = _Pickle_FastCall(self->peek, len);
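
For illustration, here is a standalone Python sketch (the CountingReader wrapper is hypothetical, not ChainerIO code) showing how the mere presence of peek changes the number of read calls pickle makes. Protocol 2 is used because it has no framing, matching the many-small-reads pattern above:

    # Hypothetical wrapper: counts read() calls; exposes peek() only on demand.
    import io
    import pickle

    class CountingReader:
        def __init__(self, raw, with_peek):
            self._raw = raw
            self.reads = 0
            if with_peek:
                self.peek = self._peek  # pickle looks this attribute up once

        def read(self, size=-1):
            self.reads += 1
            return self._raw.read(size)

        def readline(self):
            return self._raw.readline()

        def _peek(self, size=0):
            return self._raw.peek(size)

    # Protocol 2 has no framing, so every opcode is a separate small read.
    payload = pickle.dumps([0.1] * 100000, protocol=2)
    for with_peek in (False, True):
        f = CountingReader(io.BufferedReader(io.BytesIO(payload)), with_peek)
        pickle.load(f)
        print('with peek' if with_peek else 'no peek', '->', f.reads, 'read() calls')

Without peek, the unpickler issues one read per opcode fetch; with peek, it prefetches a chunk at a time, which is exactly the difference visible in the profiles below.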

With this patch

diff --git a/chainerio/fileobject.py b/chainerio/fileobject.py
index df7fedd..da31ad9 100644
--- a/chainerio/fileobject.py
+++ b/chainerio/fileobject.py
@@ -43,6 +43,9 @@ class FileObject(io.IOBase):
     def fileno(self) -> int:
         return self.base_file_object.fileno()
 
+    def peek(self, size: int = 0) -> bytes:
+        return self.base_file_object.peek(size)
+
     def read(self, size: Optional[int] = None) -> Union[bytes, str]:
         return self.base_file_object.read(size)

The number of read calls comes back to sanity and the performance has improved:

$ python b.py chainerio    
chainerio <class 'chainerio.fileobject.FileObject'> haspeek: True
chainerio <module 'chainerio' from '/home/kuenishi/src/chainerio/chainerio/__init__.py'> : 0.8312304019927979 sec
        156254 function calls in 0.831 seconds
                                
  Ordered by: cumulative time                                     
                                
  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    0.428    0.428    0.831    0.831 {built-in method _pickle.load}
   39063    0.016    0.000    0.295    0.000 fileobject.py:49(read)
   39063    0.279    0.000    0.279    0.000 {method 'read' of '_io.BufferedReader' objects}
   39063    0.016    0.000    0.109    0.000 fileobject.py:46(peek)
   39063    0.093    0.000    0.093    0.000 {method 'peek' of '_io.BufferedReader' objects}
       1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Discussion

I would suggest retreating from the current wrapper strategy on file objects until we truly support profiling, mainly because we don't have a perfect solution for now. None of the options is good, or they are even nonsense: 1) adding a peek method is a hacky workaround, and 2) wrapping ChainerIO's file object again with io.BufferedReader or io.BufferedWriter is matryoshka-doll levels of crazy.

  • Which file object interface to support must be carefully chosen at open time, e.g. io.BufferedIOBase or io.IOBase (see the probe sketch below).
  • PyArrow's HDFS file object doesn't have peek.
  • Python's zipfile.ZipExtFile has peek since at least 3.5.
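
A quick, self-contained probe of these points (throwaway file names created on the spot; PyArrow is left as a comment since it needs a live HDFS):

    # Probe what interface an opened file object actually offers.
    import io
    import zipfile

    def describe(f):
        print(type(f).__name__,
              '| BufferedIOBase:', isinstance(f, io.BufferedIOBase),
              '| has peek:', hasattr(f, 'peek'))

    with open('probe.bin', 'wb') as f:
        f.write(b'x' * 16)
    with open('probe.bin', 'rb') as f:
        describe(f)                      # BufferedReader: True, True

    with zipfile.ZipFile('probe.zip', 'w') as z:
        z.writestr('a.bin', b'x' * 16)
    with zipfile.ZipFile('probe.zip') as z:
        with z.open('a.bin') as f:
            describe(f)                  # ZipExtFile: True, True (3.5+)
    # PyArrow's HDFS file object would report "has peek: False" (see above).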
kuenishi added the cat:bug and prio:high labels Jul 31, 2019
@belldandyxtq (Member) commented Jul 31, 2019

Thank you for reporting and finding the root cause.

If we remove the wrapper, what is your plan for supporting pickle on HDFS?

There are a few cases where we need to replace the original file object in order to support some functionality, for example reading as text from HDFS and zip, and the internal zip for Python < 3.7.

I am thinking about moving the functionality of the file object's __init__, which is currently used to determine the proper file object type for opening, into a separate function called by the open decorator.

In this way, we can for now stop routing subsequent I/O calls through the file object wrapper while still keeping the ability to solve issues like text read on HDFS and the internal zip.

@belldandyxtq (Member) commented Jul 31, 2019

To make it clearer, I am thinking about creating a file_object_maker as a replacement for __init__ in the current fileobject.py, which is used to cover the cases where some functionality is not supported by the original file objects, like the text read cases in HDFS and zip.

The current implementation of open_wrapper returns the FileObject defined in fileobject.py or one of its derived classes, and the file object replacement takes place in that FileObject's __init__, hence the replacement is bound to the file object.

If we extract the __init__ from the FileObject and make it a function (e.g. file_object_maker), we can still control which kind of file object to return to the user while not returning an always-wrapped FileObject, working around the issues we have now. A sketch of this idea follows.
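
A minimal sketch of that idea; the dispatch conditions and the fs_type parameter are illustrative, not ChainerIO's actual logic:

    # Factory function replacing FileObject.__init__: wrap only when the
    # backend needs help, otherwise hand back the raw file object.
    import io

    def file_object_maker(base_file_object, mode: str, fs_type: str):
        if 'b' not in mode and fs_type in ('hdfs', 'zip'):
            # These backends hand out binary streams only; decode for the user.
            return io.TextIOWrapper(base_file_object)
        # POSIX and HTTP objects already behave like built-in files: return
        # them untouched so read()/peek() hit the real object directly.
        return base_file_object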

@kuenishi (Member Author)

Wrapping where we need it is fine, but the FileObject in fileobject.py is unnecessary for now, especially for the "posix" filesystem and the zip container. Aligning other filesystems' and file objects' behaviour with "posix" may be the best way to prevent potential performance issues like this, at least until we support profiling.

@belldandyxtq (Member) commented Jul 31, 2019

I think we still need to wrap the zip container for text reading and the internal zip.
The only two that don't need a wrapper are POSIX and HTTP.

And how about pickle on HDFS? A simple plan would be to extract the content and put it into an io.BufferedReader inside the file_object_maker, as sketched below.
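
A minimal sketch of that plan (hdfs_file stands in for whatever the HDFS backend returns; the function name is hypothetical):

    # Materialize the HDFS file's bytes, then hand pickle an
    # io.BufferedReader, which provides the peek() fast path.
    import io
    import pickle

    def load_pickle_from_hdfs(hdfs_file):
        buffered = io.BufferedReader(io.BytesIO(hdfs_file.read()))
        return pickle.load(buffered)  # peek() is now available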

belldandyxtq added a commit that referenced this issue Aug 6, 2019
The readline test in hdfs `test_read_bytes` is removed.
The original purpose of this test was to support pickle,
while such functionality is already tested in `test_pickle`.
Having `readline` when opening with 'rb' is weird.

A timeout timer is added to `test_pickle`, `test_zip_container`
and `test_hdfs_handler` to prevent issue #36.
@kuenishi (Member Author) commented Aug 9, 2019

This issue is addressed for POSIX filesystems by #38.
