Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add __getstate__ and __setstate__ for Nodes #2938

Merged
merged 1 commit into from
Sep 9, 2021
Merged

ENH: Add __getstate__ and __setstate__ for Nodes #2938

merged 1 commit into from
Sep 9, 2021

Conversation

patcao
Copy link
Contributor

@patcao patcao commented Sep 7, 2021

From icexelloss:
We are trying to build "consistent hash" for ibis Expression (hash that doesn't change cross Python interpreter restart for ibis Expression) and the current implementation depends on first pickling the ibis Expression then use something like hashlib.sha256() to compute the hash from the pickled bytes.

During this, we found that because ibis Node class that "_cached_expr" and "_hash" attributes, the pickled bytes can be different for the same table depending on whether those two fields are created, which is not a very desirable behavior. This PR aims to fix the serialization/deserialization method so that they produce consistent results for the equivalent Node object.

@pep8speaks
Copy link

pep8speaks commented Sep 7, 2021

Hello @patcao! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-09 17:46:57 UTC

@datapythonista
Copy link
Contributor

@patcao can you link an issue number if you're working on one. Or provide a description to know why you are proposing this change please. Thank you

@@ -59,6 +59,18 @@ def _pp(x):

return '{}({})'.format(opname, ', '.join(pprint_args))

def __getstate__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add docstring for these two method describing what you are doing and why?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type these things as well

@@ -59,6 +59,18 @@ def _pp(x):

return '{}({})'.format(opname, ', '.join(pprint_args))

def __getstate__(self):
excluded_slots = {'_expr_cached', '_hash'}
return {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not certain how it works when slot is another ibis TableExpr or ColumnExpr, can you explain it works?

@icexelloss
Copy link
Contributor

@datapythonista Chime in a little bit here about the rationale here. We are trying to build "consistent hash" for ibis Expression (hash that doesn't change cross Python interpreter restart for ibis Expression) and the current implementation depends on first pickling the ibis Expression then use sth like hashlib.sha256() to compute the hash from the pickled bytes.

During this, we found that because ibis Node class that "_cached_expr" and "_hash" attributes, the pickled bytes can be different for the same table depending on whether those two fields are created, which is not a very desirable behavior. This PR aims to fix the ser/de method so that they produce consistent results for the equivalent Node object.

@@ -59,6 +59,18 @@ def _pp(x):

return '{}({})'.format(opname, ', '.join(pprint_args))

def __getstate__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type these things as well

@@ -1298,6 +1298,12 @@ def test_pickle_table_expr():
assert t1.equals(t0)


def test_pickle_table_node(table):
n0 = table.op()
n1 = pickle.loads(pickle.dumps(n0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make a testing function that does this e.g. lines 1303 & 1304 and in ibis/tests/util.py

call it : assert_pickle_roundtrip(expr)

@jreback jreback added the expressions Issues or PRs related to the expression API label Sep 7, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also pls add a release note

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment. @icexelloss has a couple of comments. pls add a release note as well.

if slot not in excluded_slots
}

def __setstate__(self, state: Dict[str, Any]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return -> None

@jreback jreback added this to the Next release milestone Sep 8, 2021
Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jreback
Copy link
Contributor

jreback commented Sep 9, 2021

@patcao can you add a release note. ping on green.

@@ -12,6 +12,7 @@ Release Notes
These release notes are for versions of ibis **1.0 and later**. Release
notes for pre-1.0 versions of ibis can be found at :doc:`release-pre-1.0`

* :feature:`2938` Add `__getstate__` and `__setstate__` methods to Node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just say, that serialization-deserialization via pickle is now byte compatible between different processes.

@patcao patcao requested a review from jreback September 9, 2021 18:05
@jreback jreback merged commit 778a878 into ibis-project:master Sep 9, 2021
@jreback
Copy link
Contributor

jreback commented Sep 9, 2021

thanks @patcao

@cpcloud cpcloud mentioned this pull request Sep 9, 2021
@cpcloud cpcloud removed this from the Next release milestone Jan 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expressions Issues or PRs related to the expression API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants