Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Make StructValue picklable #2577

Merged
merged 8 commits into from
Dec 23, 2020

Conversation

timothydijamco
Copy link
Contributor

@timothydijamco timothydijamco commented Dec 22, 2020

StructValue is an expr class (parent class of StructScalar and StructColumn)

Problem (Why is StructValue unpicklable?)

First, StructValue defines a __getattr__ so that users may access fields in the struct through attrs. For example: my_struct_scalar.foo to access the value of the field named 'foo'.

Some background on pickle in case not already familiar: During the pickling process, pickle will check if StructValue has a __getstate__ method (generally speaking, this method can be defined for a class if custom pickling behavior is desired). Pickle will then call that method if found.

Now since StructValue has a custom __getattr__, and pickle is looking for an attribute named '__getstate__', it will trigger the logic that tries to fetch a field named '__getstate__' from the struct, which will fail.

(There's also a __setstate__ method that pickle attempts to call during the unpickling process. The same logic above applies.)

Solution

In StructValue's __getattr__ method, raise an AttributeError if the attribute being asked for is either '__getstate__' or '__setstate__'. This tells pickle that we aren't trying to define an implementation for either method.

Why always raise an error in those cases?

In other words, why not confirm the struct actually does not have any field named '__getstate__' or '__setstate__' in the struct before raising an error? Doing a check like that would cover the edge case that the user has purposely defined a struct with a field named '__getstate__' or '__setstate__'.

This won't work during the unpickling process. Here's a walkthrough:

  1. pickle attempts to call __setstate__
  2. We enter StructValue's __getattr__ method
  3. Since the unpickling process has just begun, we don't have any attributes in StructValue populated. This means that we don't have access to the underlying struct, and can't check if a field named '__setstate__' exists in the struct.

ibis/expr/api.py Outdated
def _destructure(expr: StructColumn) -> DestructColumn:
""" Destructure a ``Struct`` to create a destruct column.
def _struct_get_attr(expr: StructValue, attr_name: str) -> ValueExpr:
"""Get field named `attr_name` from the ``StructValue`` expression `expr`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are now just defining __getstate___ and __setstate___ on a StructValueExpr itself?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is extremely convoluted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are now just defining __getstate___ and __setstate___ on a StructValueExpr itself?

Do you mean defining __getstate__ and __setstate__ but mimic the default implementation (i.e. their behavior if they weren't defined), so that __getattr__ is not triggered?

I think that would work too, but I'm not sure how much better it would be:

  • It feels odd to be explicitly defining __getstate__ and __setstate__ with an implementation that is just their default implementation
  • On the other hand, my current implementation is essentially defining __getattr__ with an exclusion (that says that it should not be used for __getstate__ and __setstate__), which seems a little easier to follow to me

I would agree that both solutions feel like hacks though

ibis/expr/api.py Outdated
def _struct_get_attr(expr: StructValue, attr_name: str) -> ValueExpr:
"""Get field named `attr_name` from the ``StructValue`` expression `expr`.

Parameters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also why are you defining __dir__ on StructValueExpr in the first place? this is the cause of the trouble. it would be much better to provide an attribute which holds the data itself, e.g. .attr or something, then you don't have these namespace collisions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean why we define StructValue.__getattr__? (and that we should remove StructValue.__getattr__, and only allow access to fields of the struct through StructValue.__getitem__?)

The underlying struct is already stored in the way you describe if I'm understanding correctly. Information about the underlying struct is stored in an attribute called StructValue._dtype, and then doing my_struct_value.foo or my_struct_value['foo'] constructs a operation+expression (using information from StructValue._dtype) that would pull the field's value out when executed.

Getting rid of StructValue.__getattr__ sounds like a good solution to me. It would simplify all of this (no need for any pickling-specific fix). I would have to do a little digging to see whether it's used anywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no the current implementation is a hack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting rid of StructValue.getattr sounds like a good solution to me. It would simplify all of this (no need for any pickling-specific fix). I would have to do a little digging to see whether it's used anywhere.

yes do this, its way better

@jreback jreback added the expressions Issues or PRs related to the expression API label Dec 22, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. can you add some documentation / examples (not sure exactly where) about getting these sub-fields

@timothydijamco
Copy link
Contributor Author

I've made the changes—just to confirm we're OK with the fact that this changes the StructValue API so that you can only access fields in the struct through my_struct_value['foo'] (and not my_struct_value.foo)?

Side note: I noticed TableExpr's API also allows either t.bar or t['bar'] (in this case, to get the column named bar from the table t). This is its approach to the pickling issue for reference:

@timothydijamco
Copy link
Contributor Author

Oops messages crossed each other, but yes sure I'll look into adding documentation. Looks like a few tests will have to be fixed as well

@jreback jreback added this to the Next release milestone Dec 23, 2020
@jreback
Copy link
Contributor

jreback commented Dec 23, 2020

can you rebase once more

@jreback jreback merged commit 2114f87 into ibis-project:master Dec 23, 2020
@cpcloud cpcloud removed this from the Next release milestone Jan 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expressions Issues or PRs related to the expression API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants