Optimize Pickle Serialization Format #54

Closed · brucehappy opened this issue Nov 3, 2017 · 5 comments

@brucehappy

Using diskcache 2.9.0 with Python 2.7 on Mac:

I am having problems using a tuple of strings as a cache key. After the key is inserted and then retrieved via the key iterator, the retrieved key apparently no longer pickles to the same bytes as the original, so the value cannot be found when looked up in sqlite. I have tried changing the pickle protocol version to no avail. Using a list instead of a tuple works properly. Here is code to demonstrate the problem:

from diskcache import Cache, Index
import shutil
import tempfile

key_part_0 = u"part0"
key_part_1 = u"part1"
to_test = [
    (key_part_0, key_part_1),
    [key_part_0, key_part_1]
]
for key in to_test:
    tmpdir = tempfile.mkdtemp(prefix="discache_test.")
    try:
        dc = Index.fromcache(Cache(tmpdir, eviction_policy='none', size_limit=2**32))  # 4 GiB
        dc[key] = {
            "example0": ["value0"]
        }
        diskcache_key = dc.keys()[0]
        print "Keys equal: %s" % (diskcache_key==key)
        print "Value using original key: %s" % dc[key]
        try:
            print "Value using retrieved key: %s" % dc[diskcache_key]
        except KeyError as e:
            print "Could not find value using retrieved key"
    finally:
        shutil.rmtree(tmpdir, True)

And the output:

Keys equal: True
Value using original key: {'example0': ['value0']}
Could not find value using retrieved key
Keys equal: True
Value using original key: {'example0': ['value0']}
Value using retrieved key: {'example0': ['value0']}

Thanks for any assistance you can provide.

@grantjenks (Owner)

This is a weird one. Excellent repro and details. Thank you for reporting. Everything you describe/document is accurate. There are, in fact, two different pickles that deserialize to the same object. I did not know this was possible and I do not know why it happens, though the very existence of pickletools.optimize suggests the behavior is known.
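
For context, here is a minimal sketch of what pickletools.optimize does: it strips PUT/BINPUT opcodes whose memo slots are never read, so two pickles that differ only in memoization collapse to one canonical byte string (Python 2, to match the repro):

import pickle
import pickletools

key = (u'part0', u'part1')
raw = pickle.dumps(key, 2)
canonical = pickletools.optimize(raw)

# The optimized pickle still round-trips to an equal object...
assert pickle.loads(canonical) == key
# ...but is shorter: the unused BINPUT opcodes are gone.
print len(raw), len(canonical)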

Here's my test case based on your code:

@setup_cache
def test_key_roundtrip(cache):
    key_part_0 = u"part0"
    key_part_1 = u"part1"
    to_test = [
        (key_part_0, key_part_1),
        [key_part_0, key_part_1],
    ]

    for key in to_test:
        cache.clear()
        cache[key] = {'example0': ['value0']}
        keys = list(cache)
        assert len(keys) == 1
        cache_key = keys[0]
        assert cache[key] == {'example0': ['value0']}
        assert cache[cache_key] == {'example0': ['value0']}

This test fails with a KeyError, as your code illustrates. I then monkey-patched the code to observe the differences in the pickles:

@setup_cache
def test_key_roundtrip(cache):
    # <start> Monkey Patch Disk.put to observe the pickle variations.

    import pickletools

    disk_type = type(cache._disk)
    original_put = disk_type.put

    def monkey_put(self, key):
        result, flag = original_put(self, key)
        pickletools.dis(str(result))
        return result, flag

    disk_type.put = monkey_put

    # </end>

    key_part_0 = u"part0"
    key_part_1 = u"part1"
    to_test = [
        (key_part_0, key_part_1),
        [key_part_0, key_part_1],
    ]

    for key in to_test:
        cache.clear()
        cache[key] = {'example0': ['value0']}
        keys = list(cache)
        assert len(keys) == 1
        cache_key = keys[0]
        assert cache[key] == {'example0': ['value0']}
        assert cache[cache_key] == {'example0': ['value0']}

The output now shows:

$ nosetests tests/test_core.py:test_key_roundtrip
E
======================================================================
ERROR: tests.test_core.test_key_roundtrip
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/grantj/repos/python-diskcache/env27/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/grantj/repos/python-diskcache/tests/test_core.py", line 39, in wrapper
    func(cache)
  File "/Users/grantj/repos/python-diskcache/tests/test_core.py", line 1220, in test_key_roundtrip
    assert cache[cache_key] == {'example0': ['value0']}
  File "/Users/grantj/repos/python-diskcache/diskcache/core.py", line 973, in __getitem__
    raise KeyError(key)
KeyError: (u'part0', u'part1')
-------------------- >> begin captured stdout << ---------------------
    0: \x80 PROTO      2
    2: X    BINUNICODE u'part0'
   12: q    BINPUT     1
   14: X    BINUNICODE u'part1'
   24: q    BINPUT     2
   26: \x86 TUPLE2
   27: q    BINPUT     3
   29: .    STOP
highest protocol among opcodes = 2
    0: \x80 PROTO      2
    2: X    BINUNICODE u'part0'
   12: q    BINPUT     1
   14: X    BINUNICODE u'part1'
   24: q    BINPUT     2
   26: \x86 TUPLE2
   27: q    BINPUT     3
   29: .    STOP
highest protocol among opcodes = 2
    0: \x80 PROTO      2
    2: X    BINUNICODE u'part0'
   12: X    BINUNICODE u'part1'
   22: \x86 TUPLE2
   23: q    BINPUT     1
   25: .    STOP
highest protocol among opcodes = 2

--------------------- >> end captured stdout << ----------------------

----------------------------------------------------------------------
Ran 1 test in 0.014s

FAILED (errors=1)

And you'll notice that pickle has changed its serialization of the tuple. Originally:

    0: \x80 PROTO      2
    2: X    BINUNICODE u'part0'
   12: q    BINPUT     1
   14: X    BINUNICODE u'part1'
   24: q    BINPUT     2
   26: \x86 TUPLE2
   27: q    BINPUT     3
   29: .    STOP

And now:

    0: \x80 PROTO      2
    2: X    BINUNICODE u'part0'
   12: X    BINUNICODE u'part1'
   22: \x86 TUPLE2
   23: q    BINPUT     1
   25: .    STOP

Hence the failure. The pickletools module has an optimize function that strips exactly those unused PUT opcodes. If monkey_put applies it to each pickled key:

    def monkey_put(self, key):
        result, flag = original_put(self, key)
        optimize_result = buffer(pickletools.optimize(str(result)))
        return optimize_result, flag

then the problem goes away.

Would you see if that solves the problem for you too? Try using this Disk class:

import pickletools
import sqlite3

from diskcache import Disk

class OptimizingDisk(Disk):
    def put(self, key):
        db_key, raw = Disk.put(self, key)

        # Canonicalize pickled keys so equal keys always serialize identically.
        if not raw and isinstance(db_key, sqlite3.Binary):
            db_key = sqlite3.Binary(pickletools.optimize(str(db_key)))

        return db_key, raw

When you create the Index, you can pass disk=OptimizingDisk to the Cache. Does that work for you?
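
For concreteness, here's a sketch of that wiring based on your repro (the disk=OptimizingDisk argument form follows my suggestion above; double-check it against the Cache signature in your diskcache version):

import tempfile

from diskcache import Cache, Index

tmpdir = tempfile.mkdtemp(prefix="discache_test.")
cache = Cache(tmpdir, disk=OptimizingDisk, eviction_policy='none')
dc = Index.fromcache(cache)

dc[(u'part0', u'part1')] = {'example0': ['value0']}
retrieved_key = dc.keys()[0]
assert dc[retrieved_key] == {'example0': ['value0']}  # no KeyError now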

@brucehappy (Author)

This does indeed work. Thanks very much! I was unaware of pickletools.optimize.

I have no notion of the performance implications of running optimize on every pickle that Disk performs. Perhaps the most prudent next step is to add disk_pickle_optimize as a setting that defaults to False? It could be consulted in both put and store, after pickling, to decide whether to run optimize on the result. Thoughts? A rough sketch of the idea is below.
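
Purely as a hypothetical sketch (disk_pickle_optimize and the helper below are my proposal, not an existing diskcache API), the check could live in one place:

import pickletools

def _finalize_pickle(self, pickled):
    # Hypothetical helper called by Disk.put and Disk.store after pickling.
    # disk_pickle_optimize is the proposed setting, defaulting to False.
    if self.disk_pickle_optimize:
        return pickletools.optimize(pickled)
    return pickled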

@grantjenks (Owner)

The overhead of optimize is not negligible:

In [22]: data = (u'grant', u'jenks')

In [23]: %timeit pickle.dumps(data, 2)
100000 loops, best of 3: 11.1 µs per loop

In [24]: %timeit pickletools.optimize(pickle.dumps(data, 2))
10000 loops, best of 3: 25.5 µs per loop

But I've always kind of known that pickle is a lousy serialization strategy for the way it's being used here (for hashing and equality comparisons). The benchmarks really aren't affected by this because they use byte strings for keys and values (as they should, if you really care about performance).

I think we should call this a bug and just fix it. I'm opposed to adding a disk_pickle_optimize flag, even though that would be more backwards-compatible. Anyone who later submits a bug because pickles became optimized can be given a workaround.

@grantjenks changed the title from "Tuples as cache keys (likely pickling related)" to "Optimize Pickle Serialization Format" on Nov 8, 2017
@grantjenks (Owner)

I started work on this. Hope to push out changes before the New Year.

@grantjenks mentioned this issue Dec 11, 2017
@grantjenks (Owner)

Fixed in 0bd9d61. To be deployed to PyPI in v3.
