Optimize Pickle Serialization Format #54

Closed · brucehappy opened this issue Nov 3, 2017 · 5 comments

@brucehappy

Using diskcache 2.9.0 with Python 2.7 on Mac:

I am having problems using a tuple of strings as a cache key. After the key is inserted and then retrieved via the key iterator, the retrieved key apparently no longer pickles to the same bytes as the original, so the value cannot be found when looked up in sqlite. I have tried changing the pickle protocol version to no avail. Using a list instead of a tuple works properly. Here is code to demonstrate the problem:

from diskcache import Cache, Index
import shutil
import tempfile

key_part_0 = u"part0"
key_part_1 = u"part1"
to_test = [
    (key_part_0, key_part_1),
    [key_part_0, key_part_1]
]
for key in to_test:
    tmpdir = tempfile.mkdtemp(prefix="discache_test.")
    try:
        dc = Index.fromcache(Cache(tmpdir, eviction_policy='none', size_limit=2**32))  # 4 GiB
        dc[key] = {
            "example0": ["value0"]
        }
        diskcache_key = dc.keys()[0]
        print "Keys equal: %s" % (diskcache_key==key)
        print "Value using original key: %s" % dc[key]
        try:
            print "Value using retrieved key: %s" % dc[diskcache_key]
        except KeyError as e:
            print "Could not find value using retrieved key"
    finally:
        shutil.rmtree(tmpdir, True)

And the output:

Keys equal: True
Value using original key: {'example0': ['value0']}
Could not find value using retrieved key
Keys equal: True
Value using original key: {'example0': ['value0']}
Value using retrieved key: {'example0': ['value0']}

Thanks for any assistance you can provide.

@grantjenks (Owner)

This is a weird one. Excellent repro and details. Thank you for reporting. Everything you describe/document is accurate. There are, in fact, two different pickles that deserialize to the same object. I did not know this was possible and I do not know why it happens, though the very existence of pickletools.optimize suggests the behavior is known.
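
For context, here is a minimal sketch of what pickletools.optimize does: it strips PUT/BINPUT opcodes whose memo slots are never read, so two pickles that differ only in memoization collapse to one canonical byte string (Python 2, to match the repro):

import pickle
import pickletools

key = (u'part0', u'part1')
raw = pickle.dumps(key, 2)
canonical = pickletools.optimize(raw)

# The optimized pickle still round-trips to an equal object...
assert pickle.loads(canonical) == key
# ...but is shorter: the unused BINPUT opcodes are gone.
print len(raw), len(canonical)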

Here's my test case based on your code:

@setup_cache
def test_key_roundtrip(cache):
    key_part_0 = u"part0"
    key_part_1 = u"part1"
    to_test = [
        (key_part_0, key_part_1),
        [key_part_0, key_part_1],
    ]

    for key in to_test:
        cache.clear()
        cache[key] = {'example0': ['value0']}
        keys = list(cache)
        assert len(keys) == 1
        cache_key = keys[0]
        assert cache[key] == {'example0': ['value0']}
        assert cache[cache_key] == {'example0': ['value0']}

This test fails with a KeyError, as your code illustrates. I then monkey-patched the code to observe the differences in the pickles:

@setup_cache
def test_key_roundtrip(cache):
    # <start> Monkey Patch Disk.put to observe the pickle variations.

    import pickletools

    disk_type = type(cache._disk)
    original_put = disk_type.put

    def monkey_put(self, key):
        result, flag = original_put(self, key)
        pickletools.dis(str(result))
        return result, flag

    disk_type.put = monkey_put

    # </end>

    key_part_0 = u"part0"
    key_part_1 = u"part1"
    to_test = [
        (key_part_0, key_part_1),
        [key_part_0, key_part_1],
    ]

    for key in to_test:
        cache.clear()
        cache[key] = {'example0': ['value0']}
        keys = list(cache)
        assert len(keys) == 1
        cache_key = keys[0]
        assert cache[key] == {'example0': ['value0']}
        assert cache[cache_key] == {'example0': ['value0']}

The output now shows:

$ nosetests tests/test_core.py:test_key_roundtrip
E
======================================================================
ERROR: tests.test_core.test_key_roundtrip
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/grantj/repos/python-diskcache/env27/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/grantj/repos/python-diskcache/tests/test_core.py", line 39, in wrapper
    func(cache)
  File "/Users/grantj/repos/python-diskcache/tests/test_core.py", line 1220, in test_key_roundtrip
    assert cache[cache_key] == {'example0': ['value0']}
  File "/Users/grantj/repos/python-diskcache/diskcache/core.py", line 973, in __getitem__
    raise KeyError(key)
KeyError: (u'part0', u'part1')
-------------------- >> begin captured stdout << ---------------------
    0: \x80 PROTO      2
    2: X    BINUNICODE u'part0'
   12: q    BINPUT     1
   14: X    BINUNICODE u'part1'
   24: q    BINPUT     2
   26: \x86 TUPLE2
   27: q    BINPUT     3
   29: .    STOP
highest protocol among opcodes = 2
    0: \x80 PROTO      2
    2: X    BINUNICODE u'part0'
   12: q    BINPUT     1
   14: X    BINUNICODE u'part1'
   24: q    BINPUT     2
   26: \x86 TUPLE2
   27: q    BINPUT     3
   29: .    STOP
highest protocol among opcodes = 2
    0: \x80 PROTO      2
    2: X    BINUNICODE u'part0'
   12: X    BINUNICODE u'part1'
   22: \x86 TUPLE2
   23: q    BINPUT     1
   25: .    STOP
highest protocol among opcodes = 2

--------------------- >> end captured stdout << ----------------------

----------------------------------------------------------------------
Ran 1 test in 0.014s

FAILED (errors=1)

And you'll notice that pickle has changed its serialization of the tuple. Originally:

    0: \x80 PROTO      2
    2: X    BINUNICODE u'part0'
   12: q    BINPUT     1
   14: X    BINUNICODE u'part1'
   24: q    BINPUT     2
   26: \x86 TUPLE2
   27: q    BINPUT     3
   29: .    STOP

And now:

    0: \x80 PROTO      2
    2: X    BINUNICODE u'part0'
   12: X    BINUNICODE u'part1'
   22: \x86 TUPLE2
   23: q    BINPUT     1
   25: .    STOP

Hence the failure. The pickletools module has an optimize function that strips exactly those unused PUT opcodes. If monkey_put applies it to each pickled key:

    def monkey_put(self, key):
        result, flag = original_put(self, key)
        optimize_result = buffer(pickletools.optimize(str(result)))
        return optimize_result, flag

then the problem goes away.

Would you see if that solves the problem for you too? Try using this Disk class:

import pickletools
import sqlite3

from diskcache import Disk

class OptimizingDisk(Disk):
    def put(self, key):
        db_key, raw = Disk.put(self, key)

        # Canonicalize pickled keys so equal keys always serialize identically.
        if not raw and isinstance(db_key, sqlite3.Binary):
            db_key = sqlite3.Binary(pickletools.optimize(str(db_key)))

        return db_key, raw

When you create the Index, you can pass disk=OptimizingDisk to the Cache. Does that work for you?
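
For concreteness, here's a sketch of that wiring based on your repro (the disk=OptimizingDisk argument form follows my suggestion above; double-check it against the Cache signature in your diskcache version):

import tempfile

from diskcache import Cache, Index

tmpdir = tempfile.mkdtemp(prefix="discache_test.")
cache = Cache(tmpdir, disk=OptimizingDisk, eviction_policy='none')
dc = Index.fromcache(cache)

dc[(u'part0', u'part1')] = {'example0': ['value0']}
retrieved_key = dc.keys()[0]
assert dc[retrieved_key] == {'example0': ['value0']}  # no KeyError now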

@brucehappy (Author)

This does indeed work. Thanks very much! I was unaware of pickletools.optimize.

I have no notion of the performance implications of running optimize on every pickle that Disk performs. Perhaps the most prudent next step is to add disk_pickle_optimize as a setting that defaults to False? It could be consulted in both put and store, after pickling, to decide whether to run optimize on the result. Thoughts? A rough sketch of the idea is below.
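
Purely as a hypothetical sketch (disk_pickle_optimize and the helper below are my proposal, not an existing diskcache API), the check could live in one place:

import pickletools

def _finalize_pickle(self, pickled):
    # Hypothetical helper called by Disk.put and Disk.store after pickling.
    # disk_pickle_optimize is the proposed setting, defaulting to False.
    if self.disk_pickle_optimize:
        return pickletools.optimize(pickled)
    return pickled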

@grantjenks (Owner)

The overhead of optimize is not negligible:

In [22]: data = (u'grant', u'jenks')

In [23]: %timeit pickle.dumps(data, 2)
100000 loops, best of 3: 11.1 µs per loop

In [24]: %timeit pickletools.optimize(pickle.dumps(data, 2))
10000 loops, best of 3: 25.5 µs per loop

But I've always kind of known that pickle is a lousy serialization strategy for the way it's being used here (for hashing and equality comparisons). The benchmarks really aren't affected by this because they use byte strings for keys and values (as they should, if you really care about performance).

I think we should call this a bug and just fix it. I'm opposed to adding a disk_pickle_optimize flag, even though that would be more backwards-compatible. Anyone who later submits a bug because pickles became optimized can be given a workaround.

@grantjenks changed the title from "Tuples as cache keys (likely pickling related)" to "Optimize Pickle Serialization Format" on Nov 8, 2017
@grantjenks (Owner)

I started work on this. Hope to push out changes before the New Year.

@grantjenks mentioned this issue Dec 11, 2017
@grantjenks (Owner)

Fixed in 0bd9d61. To be deployed to PyPI in v3.
