-
Notifications
You must be signed in to change notification settings - Fork 1.3k
odb: pin index pickle protocol version #7222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dvc/objects/db/index.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmrowla, will this break repo if they were using protocol 5 before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, users may need to remove .dvc/tmp/index in this case (which is the same potential issue we had when we pinned the version for state #5563 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python >=3.8 uses Protocol version 5, so we may as well just don't support compatibility between 3.7 and >=3.8.
What do you think of doing something like:
| self.index = Index(self.index_dir, disk_pickle_protocol=4) | |
| self.index = Index( | |
| self.index_dir, | |
| disk_pickle_protocol=4 if sys.version_info < (3,8) else 5 | |
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that works here since the reason users are still running into this issue is because they are mixing python versions (between 3.7 and >3.7)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the core python default protocol version in 3.8+ is still 4, the problem for us is that diskcache explicitly uses pickle.HIGHEST_PROTOCOL https://github.com/grantjenks/python-diskcache/blob/d55a50ee083784afa9c85e14e41c4a2d132f3111/diskcache/core.py#L60
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For myself, I'd prefer to raise some messages to suggest .dvc/tmp/index here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding pickle protocol version, I think we should ensure compatibility across python versions that are supported by DVC, which means we should be enforcing a pinned version wherever we are using pickle (unless at some point in the future Python officially deprecates whatever ver we are using).
In the state/index use case, I don't think there's any practical benefit to using protocol ver 5 for the way we use diskcache (my understanding of protocol ver 5 is that it improves performance for serializing very large non-native objects that have to implement their own __reduce__ serialization methods, like numpy arrays and pandas dataframes), but with the way we use diskcache we are really just serializing (relatively) small native dicts.
There is an official backport library for pickle protocol v5, so if/when the state/index pickle performance becomes a real problem for us we can go that route, but I don't think it's a pressing need right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmrowla, I agree with all of your points, I have no issues on the technical side of things.
I'm just worried about potentially breaking things here for Python 3.8+ users in the middle of 2.x version. I'm fine with this as long as we are explicit about this being a breaking change and weigh in both options:
- By not fixing this, we break compatibility between 3.7 with 3.8+ Python versions. So, if the user used Python 3.8+ version, they cannot use dvc repo again with Python 3.7.
- By fixing this, we ensure compatibility between all Python versions by breaking compatibility for once for Python 3.8+ users (note all of our binary packages and homebrew package are 3.8+).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skshetry as a short term workaround we could just catch the protocol errors in state/index, and then add a troubleshooting link w/details suggesting users can try removing the relevant temp dirs (md5s/links/index) and then retry the command
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So after some testing, it turns out that changing to protocol version 4 will not break an existing diskcache cache that was generated using protocol version 5 (as long as you are on a python version that supports v5).
It looks like disk_pickle_protocol is only used on writes, and on read it uses the protocol version specified in the object it is reading, rather than the parent cache's default protocol version. So if you change protocol versions, it will just rewrite the cache entry using the new ver on the next time that entry is modified.
This test:
@pytest.mark.parametrize("proto_a, proto_b", [(4, 5), (5, 4)])
def test_protocol_error(tmp_dir, proto_a, proto_b):
with diskcache.Cache(directory=(tmp_dir / "test"), disk_pickle_protocol=proto_a) as cache:
cache["key"] = ("value1", "value2")
with diskcache.Cache(directory=(tmp_dir / "test"), disk_pickle_protocol=proto_b) as cache:
assert ("value1", "value2") == cache["key"]
cache["key"] = ("value3", "value4")
assert ("value3", "value4") == cache["key"]passes on 3.8, 3.9, 3.10 and fails on 3.7 (since 3.7 cannot read/write protocol version 5)
So this should not break compatibility with existing repos for anyone on python 3.8+ (but generating the troubleshooting message for py 3.7 will still be useful)
4ae76b8 to
8048266
Compare
8048266 to
ebab8a6
Compare
|
To be honest, I don't find it worth it to break for every python version, considering it may fail in unexpected places/ways. |
β I have followed the Contributing to DVC checklist.
π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. π
Fixes #7220