
Pickle is slow with large objects on HDFS #42

Closed
belldandyxtq opened this issue Aug 16, 2019 · 0 comments · Fixed by #44
belldandyxtq commented Aug 16, 2019

This is related to #36: since `peek` is not implemented in `HdfsFile`, the same issue also occurs when using HDFS. Unlike #36, where the POSIX `peek` was merely hidden by the `FileObject` wrapper, for HDFS the `peek` method needs to be added.

$ python test.py
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/share/java/slf4j-simple.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]
19/08/16 14:53:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
chainerio:  186.07044076919556
$ cat test.py
import time
import pickle
import chainerio

chainerio.set_root("hdfs")
cache_path = 'a_large_file.pkl'

start = time.time()
with chainerio.open(cache_path, 'rb') as f:
    data = pickle.load(f)
print('chainerio: ', time.time() - start)
Wed Jul 24 19:43:47 2019    profile

         466895442 function calls (466887441 primitive calls) in 161.088 seconds

   Ordered by: internal time
   List reduced from 2941 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   74.242   74.242  161.008  161.008 {built-in method _pickle.load}
233293409   46.669    0.000   84.175    0.000 xxx/versions/3.7.2/lib/python3.7/site-packages/chainerio/fileobject.py:50(read)
233293409   37.505    0.000   37.505    0.000 {method 'read' of '_io.BufferedReader' objects}
     1452    0.384    0.000    0.388    0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
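The report above looks like cProfile output sorted by internal time. A minimal way to produce a comparable report (using an in-memory stream as a stand-in for the HDFS file, so the exact call counts will differ) could be:

```python
import cProfile
import io
import pickle
import pstats

# Build a pickled payload in memory; io.BytesIO stands in for the HDFS file.
stream = io.BytesIO(pickle.dumps(list(range(1000))))

# Profile pickle.load and print the ten most expensive functions by tottime,
# mirroring the "Ordered by: internal time" / "restriction <10>" report above.
prof = cProfile.Profile()
prof.enable()
obj = pickle.load(stream)
prof.disable()

pstats.Stats(prof).sort_stats("tottime").print_stats(10)
```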
@belldandyxtq belldandyxtq added the cat:performance Performance in terms of speed or memory consumption. label Aug 16, 2019
belldandyxtq added a commit that referenced this issue Aug 16, 2019
This commit solves the performance issues described in #42.

In order to add the missing `peek` support, the HdfsFile object gets
wrapped with `io.BufferedReader` when opening with 'rb',
which is how the file is opened when using pickle.
kuenishi pushed a commit that referenced this issue Sep 12, 2019
* Improve pickle performance on HDFS

This commit solves the performance issues described in #42.

In order to add the missing `peek` support, the HdfsFile object gets
wrapped with `io.BufferedReader` when opening with 'rb',
which is how the file is opened when using pickle.

* add comment string

* add io.bufferedwriter

* update comment string
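The wrapping described in the commit message can be sketched as follows. This is a simplified illustration, not the actual ChainerIO code: `io.BytesIO` stands in for the raw `HdfsFile`, since, like `HdfsFile` before the fix, it exposes `read` but no `peek`.

```python
import io
import pickle

data = pickle.dumps({"answer": 42, "values": list(range(10))})

# Raw stream, comparable to HdfsFile: readable, but it has no peek().
raw = io.BytesIO(data)
print(hasattr(raw, "peek"))        # False

# The fix: wrap the raw object in io.BufferedReader when opening with 'rb'.
# The wrapper supplies peek() and coalesces the hundreds of millions of tiny
# read() calls that pickle.load issues into large buffered reads.
buffered = io.BufferedReader(raw)
print(hasattr(buffered, "peek"))   # True

obj = pickle.load(buffered)
print(obj["answer"])               # 42
```

The same idea applies on the write side, which is presumably what the "add io.bufferedwriter" commit covers: wrapping with `io.BufferedWriter` batches small writes when opening with 'wb'.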