
Fix zlib and blosc imports in to_msgpack #9783

Merged (11 commits) Apr 13, 2015

Conversation

invisibleroads
Contributor

6717aa0 removed zlib and blosc from the global namespace.

from pandas import read_csv
table = read_csv('aadhaar_data.csv')
table.to_msgpack('d.msg')

# NameError: global name 'blosc' is not defined
table.to_msgpack('d-blosc.msg', compress='blosc')

# NameError: global name 'zlib' is not defined
table.to_msgpack('d-zlib.msg', compress='zlib')

This pull request restores zlib and blosc compression in to_msgpack via local imports.
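The shape of the fix can be sketched as follows (a simplified illustration, not the actual pandas code; `compress_bytes` is a hypothetical name): the optional compressor is imported inside the function that needs it, so the module itself imports cleanly even when blosc is absent.

```python
# Sketch of the local-import pattern (hypothetical helper, not pandas' API):
# import the optional compressor only when that compression is requested,
# so the enclosing module never requires zlib or blosc at import time.
def compress_bytes(data, compress=None):
    if compress == 'zlib':
        import zlib  # local import: resolved only when zlib compression is used
        return zlib.compress(data)
    if compress == 'blosc':
        import blosc  # optional dependency; ImportError surfaces only if used
        return blosc.compress(data, typesize=1)
    return data  # no compression requested
```

An `ImportError` for blosc is then raised only on the code path that actually needs it, which is the behavior this PR restores.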

@shoyer
Member

shoyer commented Apr 2, 2015

Could you please add tests for these? Looks like this functionality was not tested before.

@jreback
Contributor

jreback commented Apr 2, 2015

this needs some tests
and a perf check

they were not local imports originally because of perf

@invisibleroads
Contributor Author

What's a perf check?

@jreback
Contributor

jreback commented Apr 2, 2015

since this is importing each time in a lower-level function, I would like to see if that impacts things

show to/from msgpack with a good-sized frame with mixed types, with compression and without

look at %prun and see if the imports cause an issue

@invisibleroads
Contributor Author

What is prun?

In my experience, module initialization happens only on the first import.
http://stackoverflow.com/questions/296036/does-python-optimize-modules-when-they-are-imported-multiple-times

You can test this by creating a.py

print('hey')

and creating b.py

import a
import a

and then run python b.py to see that "hey" only prints once.

@invisibleroads
Contributor Author

The following code also only prints "hey" once.

import a
import a

def f():
    import a

f()

@invisibleroads
Contributor Author

Lastly, this variation also only prints "hey" once.

def f():
    import a

def g():
    import a

f()
g()

These examples hold true for python2.7 and python3.4.

@jreback
Contributor

jreback commented Apr 2, 2015

use ipython

of course they are only imported once

but the function call and the import check themselves are non-trivial
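The point can be illustrated with a small `timeit` comparison (a sketch; the numbers are machine-dependent and the function names are made up for this example): even after the module is cached, a function-local `import` still runs the import machinery's cache lookup on every call.

```python
# Even a cached re-import pays for an import-machinery lookup on each call;
# a module-level import pays only a plain name lookup thereafter.
import timeit
import zlib as _zlib  # module-level import: executed once

def with_local_import():
    import zlib  # cached in sys.modules, but the lookup still runs each call
    return zlib

def with_global_import():
    return _zlib  # ordinary global-name lookup, no import machinery

n = 100000
t_local = timeit.timeit(with_local_import, number=n)
t_global = timeit.timeit(with_global_import, number=n)
# t_local is typically larger, but the gap is well under a microsecond per
# call -- negligible next to actual (de)compression work, as the profiles
# below bear out.
```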

@jreback
Contributor

jreback commented Apr 2, 2015

@invisibleroads
Contributor Author

Here are some performance results using the following code saved as compress.py.

from cStringIO import StringIO
from pandas import read_csv, read_msgpack
from urllib2 import urlopen

url = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv'
table = read_csv(StringIO(urlopen(url).read()))
print 'prun z = read_msgpack(table.to_msgpack())'

When zlib and blosc are imported globally

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack())

500 function calls (492 primitive calls) in 0.030 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.022    0.022    0.025    0.025 packers.py:134(read)
     1    0.002    0.002    0.003    0.003 {method 'pack' of 'pandas.msgpack.Packer' objects}
    15    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
    15    0.001    0.000    0.001    0.000 {numpy.core.multiarray.array}
     2    0.000    0.000    0.000    0.000 {method 'encode' of 'unicode' objects}

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack(compress='zlib'))

508 function calls (500 primitive calls) in 0.059 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2    0.027    0.013    0.027    0.013 {zlib.compress}
     1    0.024    0.024    0.027    0.027 packers.py:134(read)
     1    0.003    0.003    0.031    0.031 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.002    0.001    0.002    0.001 {zlib.decompress}
    15    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack(compress='blosc'))

532 function calls (524 primitive calls) in 0.053 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.042    0.042    0.046    0.046 packers.py:134(read)
     1    0.003    0.003    0.005    0.005 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.001    0.001    0.001    0.001 {blosc.blosc_extension.compress}
    15    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
    15    0.001    0.000    0.001    0.000 {numpy.core.multiarray.array}
$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack())
10 loops, best of 3: 27.7 ms per loop

$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack(compress='zlib'))
10 loops, best of 3: 49.4 ms per loop

$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack(compress='blosc'))
10 loops, best of 3: 28.7 ms per loop

When zlib and blosc are imported locally

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack())

500 function calls (492 primitive calls) in 0.029 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.022    0.022    0.025    0.025 packers.py:133(read)
     1    0.002    0.002    0.003    0.003 {method 'pack' of 'pandas.msgpack.Packer' objects}
    15    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
    15    0.001    0.000    0.001    0.000 {numpy.core.multiarray.array}
     2    0.001    0.000    0.001    0.000 index.py:1508(get_indexer)

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack(compress='zlib'))
In [2]: prun z = read_msgpack(table.to_msgpack(compress='zlib'))
In [3]: prun z = read_msgpack(table.to_msgpack(compress='zlib'))

509 function calls (501 primitive calls) in 0.081 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.043    0.043    0.048    0.048 packers.py:133(read)
     2    0.027    0.013    0.027    0.013 {zlib.compress}
     1    0.004    0.004    0.031    0.031 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.002    0.001    0.002    0.001 {zlib.decompress}
    15    0.001    0.000    0.001    0.000 {numpy.core.multiarray.array}

479 function calls (471 primitive calls) in 0.053 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2    0.022    0.011    0.022    0.011 {zlib.compress}
     1    0.022    0.022    0.025    0.025 packers.py:133(read)
     1    0.002    0.002    0.025    0.025 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.002    0.001    0.002    0.001 {zlib.decompress}
     1    0.001    0.001    0.053    0.053 <string>:1(<module>)

479 function calls (471 primitive calls) in 0.053 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2    0.022    0.011    0.022    0.011 {zlib.compress}
     1    0.022    0.022    0.025    0.025 packers.py:133(read)
     1    0.002    0.002    0.025    0.025 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.002    0.001    0.002    0.001 {zlib.decompress}
     1    0.002    0.002    0.053    0.053 <string>:1(<module>)

$ ipython -i compress.py
In [1]: prun z = read_msgpack(table.to_msgpack(compress='blosc'))
In [2]: prun z = read_msgpack(table.to_msgpack(compress='blosc'))
In [3]: prun z = read_msgpack(table.to_msgpack(compress='blosc'))

596 function calls (588 primitive calls) in 0.056 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.043    0.043    0.047    0.047 packers.py:133(read)
     1    0.003    0.003    0.007    0.007 {method 'pack' of 'pandas.msgpack.Packer' objects}
     2    0.002    0.001    0.002    0.001 {blosc.blosc_extension.compress}
    15    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
    15    0.001    0.000    0.001    0.000 {numpy.core.multiarray.array}

503 function calls (495 primitive calls) in 0.055 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.043    0.043    0.046    0.046 packers.py:133(read)
     1    0.003    0.003    0.006    0.006 {method 'pack' of 'pandas.msgpack.Packer' objects}
     1    0.002    0.002    0.055    0.055 <string>:1(<module>)
     2    0.002    0.001    0.002    0.001 {blosc.blosc_extension.compress}
    13    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}

503 function calls (495 primitive calls) in 0.054 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.042    0.042    0.045    0.045 packers.py:133(read)
     1    0.003    0.003    0.006    0.006 {method 'pack' of 'pandas.msgpack.Packer' objects}
     1    0.002    0.002    0.054    0.054 <string>:1(<module>)
     2    0.002    0.001    0.002    0.001 {blosc.blosc_extension.compress}
    13    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack())
10 loops, best of 3: 28.2 ms per loop

$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack(compress='zlib'))
10 loops, best of 3: 49.1 ms per loop

$ ipython -i compress.py 
In [1]: timeit read_msgpack(table.to_msgpack(compress='blosc'))
10 loops, best of 3: 29.5 ms per loop

@invisibleroads
Contributor Author

I didn't expect such a quick response to the pull request. Thanks for maintaining the pandas package.

class TestCompression(TestPackers):

    def setUp(self):
        super(TestCompression, self).setUp()
Contributor

add the issue number as a comment (on the class)

Contributor Author

Done

@jreback
Contributor

jreback commented Apr 2, 2015

ok, seems reasonable to use the local imports.

  • remove some of the blosc installs from the CI builds (e.g. from 2.6, 2.7_LOCALE); then, in the tests, try importing blosc and skip the test if the import fails.
  • add a release note in v0.16.1; you can put it in bug fixes, using this issue number as the reference number.

@jreback jreback added Bug Msgpack Compat pandas objects compatibility with Numpy or Python functions labels Apr 2, 2015
@jreback jreback added this to the 0.16.1 milestone Apr 2, 2015
@invisibleroads
Contributor Author

This seems to be a documented issue between blosc and conda.
https://groups.google.com/a/continuum.io/forum/#!topic/conda/2a9Fyvzedug
Specifically, conda only installs c-blosc and not python-blosc. Hmm.

@jreback
Contributor

jreback commented Apr 2, 2015

hmm ok let me see if I can have this fixed

@@ -16,6 +16,8 @@ fi
"$TRAVIS_BUILD_DIR"/ci/build_docs.sh 2>&1 > /tmp/doc.log &
# doc build log will be shown after tests

pip install -U blosc # See https://github.com/pydata/pandas/pull/9783
python -c 'import blosc; blosc.print_versions()'
Contributor

perfect!

@jreback
Contributor

jreback commented Apr 2, 2015

@invisibleroads this looks good!

This needs to be tested slightly differently. encode_decode passes the kwargs through, so what you are doing is fine.

BUT, you are adding the compression type to the objects themselves (again, which is ok). I don't think this is actually used on decompression (it's not tested that way now), AND I am not sure that you CAN use it, as you would have to decompress first to look at the compression variable, at which point you are done already.

So I am not sure how to get around this problem, other than having a top-level section which has some uncompressed attributes.

So either we need to take off the compress : .... on the objects themselves, as it doesn't serve any purpose, or test it (and/or figure out how to encode the compression).

@invisibleroads
Contributor Author

I added the compress attributes because read_msgpack was throwing an exception. Specifically, to_msgpack compressed the table indices without setting the compress attribute, and read_msgpack > unconvert > np.fromstring threw an exception because it did not know that the data had been compressed. You can recreate this exception by reverting the extra compress attributes.

After realizing this, I added the compress attribute for every object type where encode applies convert (which does the actual compression) to obj.values. I also checked that decode applies the unconvert function to these same object types. You can see this by searching for all the uses of convert and unconvert in packers.py.

def encode(obj):
    ...
    if isinstance(obj, Index):
        return {
            'data': convert(obj.values),
            'compress': compressor,
        }

def decode(obj):
    ...
    if typ == 'index':
        data = unconvert(
            obj['data'],
            np.typeDict[obj['dtype']],
            obj.get('compress'))

I think the compress attributes are necessary whenever convert/unconvert are used, but I'm not sure how or whether these attributes need to be tested explicitly.
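The failure mode being described can be reproduced in miniature (a sketch using zlib as a stand-in for either compressor, independent of pandas): if the reader does not know the payload was compressed, it hands compressed bytes straight to the array constructor, and the byte count no longer matches the dtype's element size.

```python
# Why the compress attribute is needed: a reader that misses it tries to
# reinterpret compressed bytes as raw int64 data.
import zlib
import numpy as np

values = np.arange(10, dtype='int64')
payload = zlib.compress(values.tobytes())  # analogous to a compressed index

# A reader that sees the compress flag decompresses first -- round trip works:
restored = np.frombuffer(zlib.decompress(payload), dtype='int64')

# A reader that misses the flag will usually fail, because the compressed
# length is rarely a multiple of the 8-byte element size:
try:
    np.frombuffer(payload, dtype='int64')
except ValueError:
    pass  # e.g. "buffer size must be a multiple of element size"
```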

@invisibleroads
Contributor Author

The other alternative is to remove convert/unconvert from those object types that you do not want compressed.

@invisibleroads
Contributor Author

Are you at PyCon? Maybe we can negotiate this during sprints on Monday.

@jreback
Contributor

jreback commented Apr 9, 2015

sure at pycon starting tomorrow (thru Tuesday)

@invisibleroads
Contributor Author

@jreback
You suggested creating a compressed file with an earlier version of pandas to make sure that the newer version of pandas can decompress that file without error. However, the user would not have been able to compress the file in the first place because of the NameError on the missing imports, so luckily we won't need to include a try/except to cover that case.

@invisibleroads
Contributor Author

@jreback
As suggested, I tried compressing a dataframe using an earlier version of pandas' to_msgpack (f2882b8) but was unable to decompress the same file using read_msgpack, because to_msgpack compresses indices (without indicating that they are compressed) and read_msgpack tries to load the indices (without knowing that they are compressed).

In [10]: read_msgpack('timestamps-zlib.msg')
ValueError: string size must be a multiple of element size

In [11]: read_msgpack('timestamps-blosc.msg')
ValueError: Shape of passed values is (2, 13), indices imply (2, 15)

Since the user would not have been able to both compress and decompress with to_msgpack/read_msgpack in any version of pandas, we decided that a try/except is not necessary and the code is ready to be merged as is.

@cpcloud cpcloud merged commit 702be31 into pandas-dev:master Apr 13, 2015