# Optimize Pickle Serialization Format #54
This is a weird one. Excellent repro and details. Thank you for reporting. Everything you describe/document is accurate. There are, in fact, two different pickles which result in the same object. I did not know this was possible and I do not know why it happens, though there seems to be some evidence for the behavior elsewhere. Here's my test case based on your code:

```python
@setup_cache
def test_key_roundtrip(cache):
    key_part_0 = u"part0"
    key_part_1 = u"part1"
    to_test = [
        (key_part_0, key_part_1),
        [key_part_0, key_part_1],
    ]
    for key in to_test:
        cache.clear()
        cache[key] = {'example0': ['value0']}
        keys = list(cache)
        assert len(keys) == 1
        cache_key = keys[0]
        assert cache[key] == {'example0': ['value0']}
        assert cache[cache_key] == {'example0': ['value0']}
```

This test fails with a `KeyError`, as your code illustrates. I then monkey patch the code to observe the differences in pickle:

```python
@setup_cache
def test_key_roundtrip(cache):
    # <start> Monkey patch Disk.put to observe the pickle variations.
    import pickletools
    disk_type = type(cache._disk)
    original_put = disk_type.put

    def monkey_put(self, key):
        result, flag = original_put(self, key)
        pickletools.dis(str(result))
        return result, flag

    disk_type.put = monkey_put
    # </end>
    key_part_0 = u"part0"
    key_part_1 = u"part1"
    to_test = [
        (key_part_0, key_part_1),
        [key_part_0, key_part_1],
    ]
    for key in to_test:
        cache.clear()
        cache[key] = {'example0': ['value0']}
        keys = list(cache)
        assert len(keys) == 1
        cache_key = keys[0]
        assert cache[key] == {'example0': ['value0']}
        assert cache[cache_key] == {'example0': ['value0']}
```

The output now shows:
And you'll notice that pickle has changed its serialization of the tuple. Originally:
And now:
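The effect can be reproduced without diskcache at all. In this standalone sketch (illustrative, not from the thread), protocol-2 pickling emits memo opcodes (`BINPUT`) for each object, and `pickletools.optimize` strips the unused ones, so one tuple legitimately has two different pickle byte strings:

```python
# Two distinct pickle byte strings that decode to the same tuple:
# pickle.dumps emits memo opcodes (BINPUT) which pickletools.optimize
# removes when nothing references them.
import pickle
import pickletools

data = (u'part0', u'part1')

plain = pickle.dumps(data, 2)
optimized = pickletools.optimize(plain)

assert plain != optimized               # the bytes differ...
assert pickle.loads(plain) == data      # ...yet both decode
assert pickle.loads(optimized) == data  # to the same tuple
```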
Hence the failure. If I change the monkey patch to optimize the pickle:

```python
def monkey_put(self, key):
    result, flag = original_put(self, key)
    optimize_result = buffer(pickletools.optimize(str(result)))
    return optimize_result, flag
```

Then the problem goes away. Would you see if that solves the problem for you too? Try using this Disk class:

```python
class OptimizingDisk(Disk):
    def put(self, key):
        db_key, raw = Disk.put(self, key)
        if not raw and isinstance(db_key, sqlite3.Binary):
            db_key = sqlite3.Binary(pickletools.optimize(str(db_key)))
        return db_key, raw
```

When you create the …
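The key transformation inside `OptimizingDisk.put` can be exercised on its own, with no diskcache dependency. A stdlib-only sketch (using Python 3's `bytes()` in place of the Python 2 `str()`/`buffer()` calls above):

```python
# Stdlib-only sketch of the OptimizingDisk.put transformation: normalize
# a pickled key by stripping unused memo opcodes before it is used as a
# SQLite BLOB. bytes() stands in for Python 2's str()/buffer().
import pickle
import pickletools
import sqlite3

key = (u'part0', u'part1')

db_key = sqlite3.Binary(pickle.dumps(key, 2))
db_key = sqlite3.Binary(pickletools.optimize(bytes(db_key)))

# The normalized form still round-trips to the original key.
assert pickle.loads(bytes(db_key)) == key
```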
---

This does indeed work. Thanks very much! I was unaware of `pickletools.optimize`. I have no notion of the performance implications of using it.
---

The overhead of `pickletools.optimize` looks like this:

```python
In [22]: data = (u'grant', u'jenks')

In [23]: %timeit pickle.dumps(data, 2)
100000 loops, best of 3: 11.1 µs per loop

In [24]: %timeit pickletools.optimize(pickle.dumps(data, 2))
10000 loops, best of 3: 25.5 µs per loop
```

But I've always kind of known that pickle is a lousy serialization strategy as it's being used here (for hashing and equality comparisons). The benchmarks really aren't affected by this because they use byte strings for keys and values (as they should if you really care about performance). I think we should call this a bug and just fix it. I'm opposed to adding the …
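For readers without IPython, a rough stdlib reproduction of the measurement above using the `timeit` module (absolute numbers will vary by machine and Python version):

```python
# Compare plain pickling against pickling plus pickletools.optimize.
import pickle
import pickletools
import timeit

data = (u'grant', u'jenks')
number = 10000

plain = timeit.timeit(lambda: pickle.dumps(data, 2), number=number)
opt = timeit.timeit(
    lambda: pickletools.optimize(pickle.dumps(data, 2)), number=number)

# Convert total seconds to microseconds per operation.
print('dumps:            %.2f us/op' % (plain / number * 1e6))
print('dumps + optimize: %.2f us/op' % (opt / number * 1e6))
```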
---

I started work on this. Hope to push out changes before the New Year.
---

Fixed at 0bd9d61. To be deployed in v3 to PyPI.
---

Using diskcache 2.9.0, via Python 2.7 on Mac:

I am having problems using a tuple of strings as a cache key. Once the key has been placed into the cache and is subsequently retrieved via the key iterator, it apparently no longer matches the pickled form of the key as it was first inserted, leading to the value not being found when looked up in SQLite. I have tried changing the pickle format version to no avail. Using a list instead of a tuple works properly. Here is code to demonstrate the problem:
And the output:
Thanks for any assistance you can provide.
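The demonstration code and its output did not survive extraction. This stdlib-only simulation (illustrative, not the reporter's actual code) models the failure mode described: the cache stores one pickle of the key, but lookup re-pickles the key into different bytes, so the match fails:

```python
# Model the cache as a dict keyed by pickle bytes. If the stored bytes
# and the bytes produced at lookup time differ -- here, optimized vs.
# unoptimized pickles -- the lookup misses even though the keys are
# equal objects.
import pickle
import pickletools

key = (u'part0', u'part1')
store = {pickletools.optimize(pickle.dumps(key, 2)): 'value0'}

# Re-pickling the tuple retrieved from the store yields different bytes...
lookup_bytes = pickle.dumps(pickle.loads(list(store)[0]), 2)
assert lookup_bytes not in store  # ...so the lookup misses,

# even though the deserialized keys compare equal:
assert pickle.loads(lookup_bytes) == key
```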