
Fix zlib and blosc imports in to_msgpack #9783

Merged (11 commits) Apr 13, 2015

Conversation

invisibleroads
Contributor

6717aa0 removed zlib and blosc from the global namespace.

from pandas import read_csv
table = read_csv('aadhaar_data.csv')
table.to_msgpack('d.msg')

# NameError: global name 'blosc' is not defined
table.to_msgpack('d-blosc.msg', compress='blosc')

# NameError: global name 'zlib' is not defined
table.to_msgpack('d-zlib.msg', compress='zlib')

This pull request restores zlib and blosc compression in to_msgpack via local imports.
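The shape of the fix can be sketched as follows (a simplified illustration, not the actual pandas code; `compress_bytes` is a hypothetical name): the optional compressor is imported inside the function that needs it, so the module itself imports cleanly even when blosc is absent.

```python
# Sketch of the local-import pattern (hypothetical helper, not pandas' API):
# import the optional compressor only when that compression is requested,
# so the enclosing module never requires zlib or blosc at import time.
def compress_bytes(data, compress=None):
    if compress == 'zlib':
        import zlib  # local import: resolved only when zlib compression is used
        return zlib.compress(data)
    if compress == 'blosc':
        import blosc  # optional dependency; ImportError surfaces only if used
        return blosc.compress(data, typesize=1)
    return data  # no compression requested
```

An `ImportError` for blosc is then raised only on the code path that actually needs it, which is the behavior this PR restores.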

@shoyer
Member

shoyer commented Apr 2, 2015

Could you please add tests for these? Looks like this functionality was not tested before.

@jreback
Contributor

jreback commented Apr 2, 2015

this needs some tests
and a perf check

they were not local imports originally because of perf

@invisibleroads
Contributor Author

What's a perf check?

@jreback
Contributor

jreback commented Apr 2, 2015

since this is importing each time in a lower-level function, I would like to see if that impacts things

show to/from msgpack with a good-sized frame with mixed types, with compression and without

look at %prun and see if the imports cause an issue

@invisibleroads
Contributor Author

What is prun?

In my experience, module initialization happens only on the first import.
http://stackoverflow.com/questions/296036/does-python-optimize-modules-when-they-are-imported-multiple-times

You can test this by creating a.py

print('hey')

and creating b.py

import a
import a

and then run python b.py to see that "hey" only prints once.

@invisibleroads
Contributor Author

The following code also only prints "hey" once.

import a
import a

def f():
    import a

f()

@invisibleroads
Contributor Author

Lastly, this variation also only prints "hey" once.

def f():
    import a

def g():
    import a

f()
g()

These examples hold true for python2.7 and python3.4.

@jreback
Contributor

jreback commented Apr 2, 2015

use ipython

of course they are only imported once

but the function call and the import check themselves are non-trivial
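The point can be illustrated with a small `timeit` comparison (a sketch; the numbers are machine-dependent and the function names are made up for this example): even after the module is cached, a function-local `import` still runs the import machinery's cache lookup on every call.

```python
# Even a cached re-import pays for an import-machinery lookup on each call;
# a module-level import pays only a plain name lookup thereafter.
import timeit
import zlib as _zlib  # module-level import: executed once

def with_local_import():
    import zlib  # cached in sys.modules, but the lookup still runs each call
    return zlib

def with_global_import():
    return _zlib  # ordinary global-name lookup, no import machinery

n = 100000
t_local = timeit.timeit(with_local_import, number=n)
t_global = timeit.timeit(with_global_import, number=n)
# t_local is typically larger, but the gap is well under a microsecond per
# call -- negligible next to actual (de)compression work, as the profiles
# below bear out.
```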

@jreback
Contributor

jreback commented Apr 2, 2015

@invisibleroads
Contributor Author

Here are some performance results using the following code saved as compress.py.

from cStringIO import StringIO
from pandas import read_csv, read_msgpack
from urllib2 import urlopen

url = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv'
table = read_csv(StringIO(urlopen(url).read()))
print 'prun z = read_msgpack(table.to_msgpack())'

When zlib and blosc are imported globally

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack())

500 function calls (492 primitive calls) in 0.030 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.022    0.022    0.025    0.025 packers.py:134(read)
     1    0.002    0.002    0.003    0.003 {method 'pack' of 'pandas.msgpack.Packer' objects}
    15    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
    15    0.001    0.000    0.001    0.000 {numpy.core.multiarray.array}
     2    0.000    0.000    0.000    0.000 {method 'encode' of 'unicode' objects}

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack(compress='zlib'))

508 function calls (500 primitive calls) in 0.059 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2    0.027    0.013    0.027    0.013 {zlib.compress}
     1    0.024    0.024    0.027    0.027 packers.py:134(read)
     1    0.003    0.003    0.031    0.031 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.002    0.001    0.002    0.001 {zlib.decompress}
    15    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack(compress='blosc'))

532 function calls (524 primitive calls) in 0.053 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.042    0.042    0.046    0.046 packers.py:134(read)
     1    0.003    0.003    0.005    0.005 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.001    0.001    0.001    0.001 {blosc.blosc_extension.compress}
    15    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
    15    0.001    0.000    0.001    0.000 {numpy.core.multiarray.array}
$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack())
10 loops, best of 3: 27.7 ms per loop

$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack(compress='zlib'))
10 loops, best of 3: 49.4 ms per loop

$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack(compress='blosc'))
10 loops, best of 3: 28.7 ms per loop

When zlib and blosc are imported locally

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack())

500 function calls (492 primitive calls) in 0.029 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.022    0.022    0.025    0.025 packers.py:133(read)
     1    0.002    0.002    0.003    0.003 {method 'pack' of 'pandas.msgpack.Packer' objects}
    15    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
    15    0.001    0.000    0.001    0.000 {numpy.core.multiarray.array}
     2    0.001    0.000    0.001    0.000 index.py:1508(get_indexer)

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack(compress='zlib'))
In [2]: prun z = read_msgpack(table.to_msgpack(compress='zlib'))
In [3]: prun z = read_msgpack(table.to_msgpack(compress='zlib'))

509 function calls (501 primitive calls) in 0.081 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.043    0.043    0.048    0.048 packers.py:133(read)
     2    0.027    0.013    0.027    0.013 {zlib.compress}
     1    0.004    0.004    0.031    0.031 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.002    0.001    0.002    0.001 {zlib.decompress}
    15    0.001    0.000    0.001    0.000 {numpy.core.multiarray.array}

479 function calls (471 primitive calls) in 0.053 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2    0.022    0.011    0.022    0.011 {zlib.compress}
     1    0.022    0.022    0.025    0.025 packers.py:133(read)
     1    0.002    0.002    0.025    0.025 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.002    0.001    0.002    0.001 {zlib.decompress}
     1    0.001    0.001    0.053    0.053 <string>:1(<module>)

479 function calls (471 primitive calls) in 0.053 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2    0.022    0.011    0.022    0.011 {zlib.compress}
     1    0.022    0.022    0.025    0.025 packers.py:133(read)
     1    0.002    0.002    0.025    0.025 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.002    0.001    0.002    0.001 {zlib.decompress}
     1    0.002    0.002    0.053    0.053 <string>:1(<module>)

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack(compress='blosc'))
In [2]: prun z = read_msgpack(table.to_msgpack(compress='blosc'))
In [3]: prun z = read_msgpack(table.to_msgpack(compress='blosc'))

596 function calls (588 primitive calls) in 0.056 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.043    0.043    0.047    0.047 packers.py:133(read)
     1    0.003    0.003    0.007    0.007 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.002    0.001    0.002    0.001 {blosc.blosc_extension.compress}
    15    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
    15    0.001    0.000    0.001    0.000 {numpy.core.multiarray.array}

503 function calls (495 primitive calls) in 0.055 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.043    0.043    0.046    0.046 packers.py:133(read)
     1    0.003    0.003    0.006    0.006 {method 'pack' of 'pandas.msgpack.Packer' objects}
     1    0.002    0.002    0.055    0.055 <string>:1(<module>)
     2    0.002    0.001    0.002    0.001 {blosc.blosc_extension.compress}
    13    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}

503 function calls (495 primitive calls) in 0.054 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.042    0.042    0.045    0.045 packers.py:133(read)
     1    0.003    0.003    0.006    0.006 {method 'pack' of 'pandas.msgpack.Packer' objects}
     1    0.002    0.002    0.054    0.054 <string>:1(<module>)
     2    0.002    0.001    0.002    0.001 {blosc.blosc_extension.compress}
    13    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack())
10 loops, best of 3: 28.2 ms per loop

$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack(compress='zlib'))
10 loops, best of 3: 49.1 ms per loop

$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack(compress='blosc'))
10 loops, best of 3: 29.5 ms per loop

@invisibleroads
Contributor Author

I didn't expect such a quick response to the pull request. Thanks for maintaining the pandas package.

class TestCompression(TestPackers):

    def setUp(self):
        super(TestCompression, self).setUp()
Contributor

add the issue number as a comment (on the class)

Contributor Author

Done

@jreback
Contributor

jreback commented Apr 2, 2015

ok, seems reasonable to use the local imports.

  • remove some of the blosc installs from the CI builds (e.g. from 2.6, 2.7_LOCALE); then, in the tests, try importing blosc and skip the test if the import fails.
  • add a release note in v0.16.1; you can put it in bug fixes, using this issue number as the reference number.

@jreback jreback added Bug Msgpack Compat pandas objects compatibility with Numpy or Python functions labels Apr 2, 2015
@jreback jreback added this to the 0.16.1 milestone Apr 2, 2015
@invisibleroads
Contributor Author

This seems to be a documented issue between blosc and conda.
https://groups.google.com/a/continuum.io/forum/#!topic/conda/2a9Fyvzedug
Specifically, conda only installs c-blosc and not python-blosc. Hmm.

@jreback
Contributor

jreback commented Apr 2, 2015

hmm ok let me see if I can have this fixed

@@ -16,6 +16,8 @@ fi
"$TRAVIS_BUILD_DIR"/ci/build_docs.sh 2>&1 > /tmp/doc.log &
# doc build log will be shown after tests

pip install -U blosc # See https://github.com/pydata/pandas/pull/9783
python -c 'import blosc; blosc.print_versions()'
Contributor

perfect!

@jreback
Contributor

jreback commented Apr 2, 2015

@invisibleroads this looks good!

This needs to be tested slightly differently. encode_decode passes the kwargs through, so what you are doing is fine.

BUT, you are adding the compression type to the objects themselves (again, which is ok). I don't think this is actually used on decompression (it's not tested that way now), AND I am not sure that you CAN use it, as you would have to decompress first to look at the compression variable, at which point you are done already.

So I am not sure how to get around this problem, other than having a top-level section which has some uncompressed attributes.

So either we need to take off the compress : .... on the objects themselves, as it doesn't serve any purpose, or test it (and/or figure out how to encode the compression).

@invisibleroads
Contributor Author

I added the compress attributes because read_msgpack was throwing an exception. Specifically, to_msgpack compressed the table indices without setting the compress attribute, and read_msgpack > unconvert > np.fromstring threw an exception because it did not know that the data had been compressed. You can recreate this exception by reverting the extra compress attributes.

After realizing this, I added the compress attribute for every object type where encode applies convert (which does the actual compression) to obj.values. I also checked that decode applies the unconvert function to these same object types. You can see this by searching for all the uses of convert and unconvert in packers.py.

def encode(obj):
    ...
    if isinstance(obj, Index):
        return {
            'data': convert(obj.values),
            'compress': compressor,
        }

def decode(obj):
    ...
    if typ == 'index':
        data = unconvert(
            obj['data'],
            np.typeDict[obj['dtype']],
            obj.get('compress'))

I think the compress attributes are necessary whenever convert/unconvert are used, but I'm not sure how or whether these attributes need to be tested explicitly.
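The failure mode being described can be reproduced in miniature (a sketch using zlib as a stand-in for either compressor, independent of pandas): if the reader does not know the payload was compressed, it hands compressed bytes straight to the array constructor, and the byte count no longer matches the dtype's element size.

```python
# Why the compress attribute is needed: a reader that misses it tries to
# reinterpret compressed bytes as raw int64 data.
import zlib
import numpy as np

values = np.arange(10, dtype='int64')
payload = zlib.compress(values.tobytes())  # analogous to a compressed index

# A reader that sees the compress flag decompresses first -- round trip works:
restored = np.frombuffer(zlib.decompress(payload), dtype='int64')

# A reader that misses the flag will usually fail, because the compressed
# length is rarely a multiple of the 8-byte element size:
try:
    np.frombuffer(payload, dtype='int64')
except ValueError:
    pass  # e.g. "buffer size must be a multiple of element size"
```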

@invisibleroads
Contributor Author

The other alternative is to remove convert/unconvert from those object types that you do not want compressed.

@invisibleroads
Contributor Author

Are you at PyCon? Maybe we can negotiate this during sprints on Monday.

@jreback
Contributor

jreback commented Apr 9, 2015

sure at pycon starting tomorrow (thru Tuesday)

@invisibleroads
Contributor Author

@jreback
You suggested creating a compressed file with an earlier version of pandas to make sure that the newer version of pandas can decompress that file without error. However, the user would not have been able to compress the file in the first place because of the NameError on the missing imports, so luckily we won't need to include a try/except to cover that case.

@invisibleroads
Contributor Author

@jreback
As suggested, I tried compressing a dataframe using an earlier version of pandas' to_msgpack (f2882b8) but was unable to decompress the same file using read_msgpack, because to_msgpack compresses indices (without indicating that they are compressed) and read_msgpack tries to load the indices (without knowing that they are compressed).

In [10]: read_msgpack('timestamps-zlib.msg')
ValueError: string size must be a multiple of element size

In [11]: read_msgpack('timestamps-blosc.msg')
ValueError: Shape of passed values is (2, 13), indices imply (2, 15)

Since the user would not have been able to both compress and decompress with to_msgpack/read_msgpack in any version of pandas, we decided that a try/except is not necessary and the code is ready to be merged as is.

@cpcloud cpcloud merged commit 702be31 into pandas-dev:master Apr 13, 2015