Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recursive dataclasses #2

Open
JelleZijlstra opened this issue Mar 27, 2021 · 37 comments
Open

Recursive dataclasses #2

JelleZijlstra opened this issue Mar 27, 2021 · 37 comments

Comments

@JelleZijlstra
Copy link
Contributor

Another problem brought up by Joseph Perez on python-dev:

from dataclasses import dataclass

@dataclass
class User:
   name: str
   friends: list[User]

This will break because the implementation of the dataclass decorator will access the class's __annotations__, but at that point User is not defined yet, so the user will get a NameError.

This is a similar problem to #1, and possible solutions are similar:

  • Tell users to hand-stringify their annotation and write list["User"].
  • Hack the language so that instead of throwing a NameError, we do something more useful here.
@PaperclipBadger
Copy link

PaperclipBadger commented Apr 16, 2021

I think we could fix this (and the problem with annotations defined in control flow structures) by having __annotations__ be a lazy mapping with known keys:

class Annotations(MutableMapping):
    def __init__(self):
        self.thunks = {}
        self.items = {}

    def __getitem__(self, key):
        with suppress(KeyError):
            self.items[key] = self.thunks[key]()
            del self.thunks[key]
        return self.items[key]
    
    def __setitem__(self, key, value):
        with suppress(KeyError):
            del self.thunks[key]
        self.items[key] = value
    
    def __delitem__(self, key):
        with suppress(KeyError):
            del self.thunks[key]
        del self.items[key]
    
    def __len__(self):
        return len(self.thunks) + len(self.items)
        
    def __keys__(self):
        return set(self.thunks) | set(self.items)
    
    ...


# now function annotation
def foo(x: int = 3, y: MyType = None) -> float: ...
    
# is sugar for
def foo(x=3, y=None): ...
foo.__annotations__ = Annotations()
foo.__annotations__.thunks['x'] = lambda: int
foo.__annotations__.thunks['y'] = lambda: MyType
foo.__annotations__.thunks['return'] = lambda: float

# class annotation
class MyClass:
    a: int
    b: MyType
    
    if condition:
        c: float

# becomes sugar for
class MyClass:
    __annotations__ = Annotations()
    __annotations__.thunks['a'] = lambda: int
    __annotations__.thunks['b'] = lambda: MyType
    
    if condition:
        __annotations__.thunks['c'] = lambda: float

As long as dataclasses.dataclass (and typing.NamedTuple and the like) only need access to the keys of the annotations mapping (which I think is the case), there should no longer be problems with circular references under this system.

@ericvsmith
Copy link
Collaborator

This will break because the implementation of the dataclass decorator will access the class's __annotations__, but at that point User is not defined yet, so the user will get a NameError.

I don't think that's true, unless I'm misunderstanding. The class is completely defined by the time the decorator is run. User = dataclass(User)

@JelleZijlstra
Copy link
Contributor Author

@ericvsmith the name only gets bound after decorators run. Given this code:

from dataclasses import dataclass
def f():
    @dataclass
    class X: pass

It gets compiled to:

In [17]: dis.dis(f)
  3           0 LOAD_GLOBAL              0 (dataclass)
              2 LOAD_BUILD_CLASS
              4 LOAD_CONST               1 (<code object X at 0x7fa826aa2300, file "<ipython-input-15-a9a7934e921a>", line 3>)
              6 LOAD_CONST               2 ('X')
              8 MAKE_FUNCTION            0
             10 LOAD_CONST               2 ('X')
             12 CALL_FUNCTION            2
             14 CALL_FUNCTION            1
             16 STORE_FAST               0 (X)
             18 LOAD_CONST               0 (None)
             20 RETURN_VALUE

The STORE_FAST storing the name X gets run after the CALL_FUNCTION implementing the decorator.

@ericvsmith
Copy link
Collaborator

@ericvsmith the name only gets bound after decorators run. Given this code:

You learn something new every day!

Maybe we should change that.

@JelleZijlstra
Copy link
Contributor Author

Maybe we should change that.

That would be a viable option for fixing the issue. We'd have to store the name, run the decorator, then store the name again.

It would change observable behavior in cases like this:

def bad_decorator(): raise Exception
try:
    @bad_decorator
    class X:
        pass
except Exception:
    pass
X() # currently undefined, proposed change would make it defined

Though I suppose we could make it del the name if the decorator throws. (But then what if the name was already set to something else before?)

@NeilGirdhar
Copy link

NeilGirdhar commented Feb 15, 2022

Would the proposed solution work for dataclasses defined using a base class, like this one?

By the way, I'm very excited about PEP 649.

@gvanrossum
Copy link

gvanrossum commented Mar 15, 2022

@NeilGirdhar

Would the proposed solution work for dataclasses defined using a base class, like this one?

That link points to a random line in the file (presumably the file got updated since you took the link). Please use permalinks. Even better, post a simpler example here that shows the issue. I'm guessing you meant this but that class definition doesn't really help me understand your question. Jelle's example above is clear and self-contained, I like that.

@larryhastings
Copy link
Owner

larryhastings commented Mar 16, 2022

If I understand correctly, one idea here is to bind the name before calling the decorator. The problem with that approach is that the decorator may return a different class. I did that once, in my "bound inner classes" recipe:

https://code.activestate.com/recipes/577070-bound-inner-classes/

This makes a class declared inside another class behave more like a function declared inside a class: self is automatically passed in as the first positional argument. It does this by wrapping the class in a class implementing the descriptor protocol, so that every access to the class from a bound object returns a child class it constructs with an __init__ that injects a reference to self. It's sadly very magical... but it does work.

If we bound the class before calling the descriptor, then in this case we'd have to bind it a second time, binding it to what's returned by the descriptor. Also, annotations for class members that refer to the class itself would retain a reference to the original class, not the class returned by the descriptor, which means they'd fail comparisons--the User used in the definition of friends would not be equal to the User defined in the module.

p.s. credit where credit is due: Alex Martelli wrote the original version of "bound inner classes", in response to my question on StackOverflow.

@larryhastings
Copy link
Owner

I have one idea: dataclasses could inject its own descriptor for __annotations__ into the class it was dataclass-ifying, and lazily perform its computation there. My understanding is that dataclasses doesn't really care much about annotations. So if they're not needed for the main dataclass-ifying work, this would solve the problem locally. If I'm wrong, and dataclasses needs to refer to the annotations while it's creating the data class, then this probably wouldn't work, as it'd need to delay all processing until after its descriptor returned. Also, you'd still have lookup problems if you called a second descriptor, after (above) @dataclass, and that descriptor examined __annotations__.

@hmc-cs-mdrissi
Copy link

hmc-cs-mdrissi commented Mar 16, 2022

One workaround for dataclasses going with lazy approach would be something like,

from __future__ import annotations

from dataclasses import dataclass

def lazy_dataclass(tp: type) -> type:
  class LazyDataclass:
    def __new__(cls, *args: object, **kwargs: object) -> object:
      if not hasattr(cls, "actual_cls"):
        cls.actual_cls = dataclass(tp)
      return cls.actual_cls(*args, **kwargs)

  return LazyDataclass

Basic usage looks to be fine. The recursive User example also is fine.

@dataclass
class Foo:
    x: int
    y: str

@lazy_dataclass
class Foo2:
    x: int
    y: str

foo1 = Foo(1, "a")
foo2 = Foo2(1, "a")
print(foo1) # Foo(x=1, y='a')
print(foo2) # Foo2(x=1, y='a')

@lazy_dataclass
class User:
   name: str
   friends: list[User] # type: ignore

user1 = User("Alice", friends=[User("Bob", friends=[])])
print(user1) # User(name='Alice', friends=[User(name='Bob', friends=[])])

This does definitely have user visible differences on the the type made although maybe you can get differences minimal enough to preserve public api. You also don't need to make the whole thing lazy. You could have only type analysis portions computed lazily on the first time they're really needed while as much else as possible be computed eagerly. I had same type of issue with a type analysis decorator and ended up moving all introspection logic to internal cached properties and tried to avoid calling them until really needed.

@larryhastings
Copy link
Owner

The easiest workaround would be to simply not use the decorator syntax, instead calling dataclass manually.

from dataclasses import dataclass

class User:
   name: str
   friends: list[User]
User = dataclass(User)

This would work fine with PEP 649 active.

@JelleZijlstra
Copy link
Contributor Author

My understanding is that dataclasses doesn't really care much about annotations

It mostly doesn't car exactly what is in the annotation, but it uses the keys of the dict to find out what the fields are, and it needs to find out when a variable is annotated with InitVar or ClassVar, so it does also need to look at the values.

I don't particularly like the various workarounds suggested above. @hmc-cs-mdrissi's @lazy_dataclass breaks introspectability. Larry's suggestion of calling the decorator after the class is defined introduces another obscure case similar to string annotations: with this change you have to remember that in this one case you can't use the decorator directly, just as now you have to remember to use string annotations in some obscure cases.

@ericvsmith
Copy link
Collaborator

The easiest workaround would be to simply not use the decorator syntax, instead calling dataclass manually.

That would be essentially the same as binding the name before the decorator ran, right?

@larryhastings
Copy link
Owner

That would be essentially the same as binding the name before the decorator ran, right?

Yes, but it works today without changing the language, and it's explicit rather than implicit.

@gvanrossum
Copy link

Solutions for references to the defining class itself (like User = dataclass(User) miss the more general point -- if a data class contains a forward reference in an annotation, the dataclass decorator looks up __annotations__ on the class being decorated, and this will fail. E.g.

@dataclass
class User:
    name: str
    group: Group  # <--- error here

@dataclass
class Group:
    name: str
    admins: list[User]  # Mutual recursion

Existing solutions (without PEP 649) are:

  • from __future__ import annotations (i.e., PEP 563)
  • Explicitly put string quotes around forward references (Group)

PEP 649 as currently written does not solve this, and if we were to chose it in favor of PEP 563, the solution would be to use string quotes.

A modified PEP 649 that returned some kind of error marker (similar to ForwardRef('Group')) would solve this, but Larry has repeatedly rejected that proposal.

@carljm
Copy link

carljm commented Apr 8, 2022

If we are willing to have typing tooling and dataclasses get their hands a bit dirty in Python's introspective flexibility, I think we could have our cake and eat it too. Larry can keep the default behavior of __co_annotations__() totally simple and non-magical, and not introduce new things to the language, but tools that need more magic for handling forward references can inject the magic themselves via existing features of the language.

The trick is that if you need more magical behavior, instead of calling somefunc.__co_annotations__(), you eval(somefunc.__co_annotations__.__code__, myglobals, {}), where myglobals is a customized globals dictionary you build yourself. (End users would generally not do this directly of course, but there could be one or more utilities in third-party code or in the stdlib, like inspect.get_annotations or typing.get_type_hints, that do.)

Let's say you are a documentation tool, and you want a dict with string representations of the annotations (a la PEP 563), without needing to worry about the real objects referenced in the annotations at all or which namespace they live in or whether they are forward refs. You inspect somefunc.__co_annotations__.__code__.co_names and construct a globals dict that has each of those name keys populated with an object somewhat in the spirit of unittest.Mock, that implements __getattr__ and __getitem__ (and maybe __call__, though that's not typically used in annotations) to build up and return extended mock objects, that ultimately have a __str__ that gives you a stringified representation of exactly how they were built.

Dataclasses, which only cares about the keys in __annotations__ and which of those keys' outer "type" is ClassVar or InitVar, can do something similar and check whether the value in the returned dict for a key is the mock object it inserted under those two names. This is arguably cleaner and more robust than the hoops dataclasses has to jump through today. (If those names aren't even in co_names it will already know none of the types are ClassVar or InitVar, but it'll still need to eval __co_annotations__ to get the full list of keys.)

I think a similar approach can be used for any typing-related tooling that wants to get annotations at runtime while protecting itself against the possibility of NameError due to forward references or references to things guarded by if TYPE_CHECKING. More complex variants may want to use the real objects if they are already present in the default __co_annotations__.__globals__, but insert some kind of ForwardRef object for names in co_names that are missing from __globals__.

@gvanrossum
Copy link

Could that magic globals be a defaultdict?

@JelleZijlstra
Copy link
Contributor Author

Thanks @carljm, this is a great idea!

@gvanrossum defaultdict works at runtime, but it's not a great fit for many use cases, because defaultdict's default function doesn't have access to the key. Something like this would work well though:

>>> class x(dict):
...     def __missing__(self, key): return ForwardRef(key)
... 
>>> d = x()
>>> eval("y", d)
ForwardRef('y')

@carljm
Copy link

carljm commented Apr 8, 2022

The one thing I'm not sure about is whether using a non-dict for globals (if it works; afk now and haven't tested) is a language feature or a cpython implementation detail. Docs for eval suggest it has to be a dict? But maybe that's meant to include dict subclasses too. https://docs.python.org/3/library/functions.html#eval

I think the idea can work either way given the option to introspect co_names, but if we can use a custom dict type that does simplify some use cases.

@gvanrossum
Copy link

In CPython, only locals may be a non-dict mapping.

@JelleZijlstra
Copy link
Contributor Author

JelleZijlstra commented Apr 8, 2022

Currently the documentation for exec() specifies that globals must be exactly a dict (not a subclass), but the one for eval() just says dict. python/cpython#25039 is proposing to remove the no-subclass requirement. Subclasses work at runtime for both (at least in 3.11 where I tried).

@ncoghlan
Copy link

ncoghlan commented Apr 15, 2022

Arriving from https://discuss.python.org/t/finding-edge-cases-for-peps-484-563-and-649-type-annotations/14314/2 my own thoughts were along the lines of @carljm's suggestion, but thinking at the decorator level (i.e. running the decorators in general with a pseudo-namespace in place that includes a name binding for the class, rather than only doing that for annotation evaluation).

However, that isn't really any different in practice from simply doing early binding in the namespace containing the class definition, with the problems @larryhastings described above: it works fine for decorators that preserve the class identity (although they may still mutate the class in place, as dataclass does), but is potentially fraught with confusion if the decorators change the class identity.

What's making me more amenable to the idea is that a variant of this problem already exists, and has existed since we introduced the zero-arg super support: __class__ (and zero arg super) always refers to the result of the "class" statement prior to decorator application, so changing identity in a class decorator already has the potential to introduce problems.

>>> def replace_the_class(cls):
...     global replaced_cls
...     replaced_cls = cls
...     return object()
...
>>> @replace_the_class
... class Example:
...     @classmethod
...     def get_defining_class(cls):
...         return __class__
...
>>> Example
<object object at 0x0000022496E848C0>
>>> replaced_cls
<class '__main__.Example'>
>>> replaced_cls.get_defining_class()
<class '__main__.Example'>

Given that precedent, the notion of "when a decorator runs, the name in the defining namespace will refer to the same object as is being passed in as the first argument" (rather than leaving it bound to whatever it was previously bound to) seems less radical.

I also don't think the notion of trapping NameError is viable, as there's no guarantee that NameError will be raised: the class name might already be defined and refer to something else. That's technically also a backwards compatibility challenge for changing the name binding behaviour, but lazy annotations will require a future import anyway, so an eager name binding change could be tied in to that same feature flag.

@gvanrossum
Copy link

gvanrossum commented Apr 15, 2022 via email

@ncoghlan
Copy link

ncoghlan commented Apr 15, 2022

Indeed, looks like that's a genuine existing bug arising from the pre-decoration/post-decoration discrepancy:

>>> @dataclass(slots=True)
... class BaseExample:
...    a: int = 1
...
>>> @dataclass(slots=True)
... class ChildExample(BaseExample):
...   b: int = 2
...   def explicit_super(self):
...     return super(ChildExample, self).__thisclass__
...   def zero_arg_super(self):
...     return super().__thisclass__
...
>>> ChildExample().explicit_super()
<class '__main__.ChildExample'>
>>> ChildExample().zero_arg_super()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 7, in zero_arg_super
TypeError: super(type, obj): obj must be an instance or subtype of type

Seems to be an existing issue report for the problem here: python/cpython#90562

@ncoghlan
Copy link

ncoghlan commented Apr 15, 2022

I think that (and similar) bugs may turn the potential precedent on its head: the existing problems with the __class__ cell and keeping it up to date in class decorators that need to replace the class object suggest we don't want to introduce more ways to run into those kinds of challenges.

That would push more towards approaches like @carljm's suggestion, where work would focus on providing ways to customise the annotation name resolution at runtime for use cases that wanted to implicitly deal with forward references (including references to a class currently being defined).

@ericvsmith
Copy link
Collaborator

Does this imply that in a dataclass created with slots=True, zero-arg super doesn't work?

Yes. For an example of this, see python/cpython#90055

@larryhastings
Copy link
Owner

larryhastings commented Apr 15, 2022 via email

@ericvsmith
Copy link
Collaborator

So, if we change Python so that processing of slots is done lazily, during the creation of the first instance of the class (or a subclass), dataclass could skip its subclass hack and this problem would cease to exist?

Yes.

It also occurs to me that dataclasses could create the new class first, then add methods, instead of creating the class after the methods are added, which is what it does now. This wouldn’t help with user methods, but I think would fix dataclasses’s added methods. But maybe there’s another issue with this, I haven’t tried it yet.

@gvanrossum
Copy link

And beware that other implementations would have to follow suit.

@Conchylicultor
Copy link

Conchylicultor commented Apr 20, 2022

Solutions for references to the defining class itself (like User = dataclass(User) miss the more general point -- if a data class contains a forward reference in an annotation, the dataclass decorator looks up __annotations__ on the class being decorated, and this will fail. E.g.

@dataclass
class User:
    name: str
    group: Group  # <--- error here

@dataclass
class Group:
    name: str
    admins: list[User]  # Mutual recursion

I'm sorry if this was discussed previously but wouldn't the following solve the issues while being simpler than current proposals ?

The idea is to have Python automatically replace undefined closures in annotations by some ForwardRef(lambda: closure).

So the user code above would be automatically transformed by python to:

@dataclasses.dataclass
class User:
    name: str
    group: ForwardRef(lambda: Group)

@dataclasses.dataclass
class Group:
    name: str
    admins: list[ForwardRef(lambda: User)]

Then type_hints would resolve the forward ref by calling the lambda (no eval needed, no descriptor needed).

Here is an example: https://gist.github.com/Conchylicultor/d41f3637e44be7dcd361af37b597451d

print(type_hints(User))  # {'name': <class 'str'>, 'group': <class '__main__.Group'>}
print(type_hints(Group))  # {'name': <class 'str'>, 'admins': list[__main__.User]}

And it also works when classes are defined within functions:

User, Group = user_group_class_factory()  # class defined inside the function

print(type_hints(User))  # {'name': <class 'str'>, 'group': <class '__main__.user_group_class_factory.<locals>.Group'>}
print(type_hints(Group))  # {'name': <class 'str'>, 'admins': list[__main__.user_group_class_factory.<locals>.User]}

What would be the drawback/limitations ? Am I missing something obvious ?

@JelleZijlstra
Copy link
Contributor Author

What would be the drawback/limitations ? Am I missing something obvious ?

The obvious way to write the type for group in your example is Group. I'd like users to be able to write types the obvious way, so they don't have to think about details of scoping while writing code and can move classes around without having to change annotations. We can come up with all kinds of contortions, like nested lambdas, but they don't really improve the situation over string annotations.

@Conchylicultor
Copy link

Conchylicultor commented Apr 21, 2022

The obvious way to write the type for group in your example is Group.

Yes, this would still be the case.
The Group -> ForwardRef conversion would be automatically done by python, so users don't have to think about internals/scope. Everything would just magically work.

Did I missunderstand you ?

they don't really improve the situation over string annotations.

I think it does:

Compared to PEP 563, it solve all limitations pointed out by https://peps.python.org/pep-0649/#motivation

  • Does not require eval
  • Works with local class definition, while PEP 563 require annotations be evaluated at module-level scope.
  • Does not require to stringize their annotations, which is surprising.

Compared to current PEP 649, I think the lazy ForwardRef has advantages:

  • It works with recusive dataclasses (this issue), without complicated hacks (e.g. dataclasses would as-is without any changes)
  • It is more flexible (e.g. one can still track annotations dynamically as they are built in metaclass __prepare__ with a custom __annotations__ dict subclass).
  • One can inspect/modify the annotations without triggering undefined NameError (e.g. from decorators):
class A:
  x: B

# Can inspect typing annotations even when `B` is not yet defined
print(A.__annotations__)  # {'a': ForwardRef(lambda: B)}
# Mutate works as expected
A.__annotations__['y'] = int  

class B:
  pass

# Only works after B has been resolved
type_hints(A)  # {'x': B, 'y': int}
  • It should be fully backward compatible as ForwardRef are only added where NameError was raised previously.
  • I also think it is conceptually simpler than PEP 649.

@carljm
Copy link

carljm commented Apr 21, 2022

@Conchylicultor I think your proposal is effectively similar to what @JelleZijlstra proposed in #3 , with the added wrinkle of replacing the NameError with a ForwardRef wrapping a lambda, to make it simpler to resolve later. I think the main objection to doing this sort of thing by default (e.g. from @larryhastings ) is that code in annotations should not get special treatment for forward references, it's just Python code and "not special enough to break the rules." Thus my proposal above to have it follow the usual rules by default, but use Python's existing flexibility in evaluating code objects to allow for other behaviors when needed. In fact, with PEP 649 and the eval() trick, your proposal is one behavior that could be implemented.

If your proposal aims to be simpler than PEP 649 by not wrapping the overall evaluation of annotations in a function, but eagerly executing annotations and only wrapping undefined names in a ForwardRef, then I think your proposal fails to solve the problem of annotations adding significant memory and CPU overhead in large codebases, even if they are never used at runtime.

@carljm
Copy link

carljm commented Apr 21, 2022

One small wrinkle I noticed with the "eval the code object" trick is that currently eval() does not support eval of code objects with freevars, so the unusual cases where the annotation references a name in non-global scope, where PEP 649 has to create a closure for the annotation function, can't currently be handled via eval().

I think this should be pretty easily resolvable; in principle I don't think there's any reason eval() couldn't gain the ability to also take a closure tuple when evaluating a code object with freevars.

@gvanrossum
Copy link

I still feel that the code objects computing the annotations are sufficiently special to allow whatever special-casing we need to give it the best user experience.

@Conchylicultor
Copy link

Conchylicultor commented Apr 22, 2022

So I tried to implement lazy ForwardRef with the eval proposal from @carljm and it seems to work: https://gist.github.com/Conchylicultor/cf41527fa9f849a832f737acee961466

To make eval works with closure, I had to dynamically disassembling __code__ to replace closure LOAD_DEREF call by globals LOAD_GLOBAL

@dataclasses.dataclass
class User:
    def __co_annotations__():
        return {"name": str, "group": Group}


# Get incomplete type hints
incomplete = type_hints(User)  # {"name": str, "group": ForwardRef(Group)}

@dataclasses.dataclass
class Group:
    def __co_annotations__():
        return {
            "name": str,
            "group": list[User],
            "inexisting": DoesNotExists,  # e.g. a `if TYPE_CHECKIING:` import
        }

# Once `Group` is created, ForwardRef can be resolved
assert incomplete['group'].resolve() is Group

# Calling the annotations after the forward refs are resolved by default.
type_hints(User)  # {"name": str, "group": Group}
type_hints(Group)  # {"name": str, "group": list[User], "inexisting": ForwardRef(DoesNotExists)}

And this also works if User and Group are defined inside a function.

There's some edge cases I'm missing (e.g. __getitem__ & __getattr__ support for ForwardRef). Also not sure if this would work if user overwrite __co_annotations__ or mutate __annotations__['x'] = X.

This would also fix #1 (at least at the same level as PEP 563)

But overall, once all edge cases are fixed, maybe __annotations__ should apply this magic by default rather than raising NameError ?

@ncoghlan
Copy link

The accepted version of PEP 649 covers this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants