Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding/decoding arbitrary tagged subclasses without specifying an explicit union #541

Closed
mishamsk opened this issue Aug 27, 2023 · 4 comments

Comments

@mishamsk
Copy link

Description

Hi,

I am working on a library to model & manipulate arbitrary AST`s (as-in abstract syntax tree). I'd love to use msgspec as a faster alternative to the currently used mashumaro, however, it is impossible without the ability to instruct msgspec to treat any nested Struct-based type as an implicit union of all of it's subclasses. Consider a very simplified hierarchy:

class ASTNode(mshspec.Struct, tag=True):
    """Base for all AST nodes"""
    ...

class Expr(ASTNode):
    """Base for all expressions"""
    pass

class Literal(Expr):
    """Base for all literals"""
    value: Any

class NumberLiteral(Literal):
    value: int

class NoneLiteral(Literal):
    value: None = field(default=None, init=False)

class BinExpr(Expr):
    """Base for all binary expressions"""
    left: Expr
    right: Expr

Binary expression can have any expression in it's left & right attributes, so it should be serialized as a tagged class based on the actual instance and deserialized as if left/right had a type Union[NumberLiteral | NoneLiteral | Literal | Expr]. But in real life there are going to be many more subclasses and maintaining such a union manually is not practical (and ugly too, a union of 100 classes doesn't look good).

I could have used a custom class and define enc/dec_hook, but I expect this to result in something slower than mashumaro, since all of the encoding/deconding will be done in Python, Struct won't be leveraged at all. In such an instance mashumaro should be superior as it precompiles class-specific python code.

Wonder if this is something you'd willing to support?

@dhirschfeld
Copy link

I'm using the below idiom to deserialize any child class:

import operator as op
from functools import reduce


def subclasses(cls):
    for subclass in cls.__subclasses__():
        yield subclass
        yield from subclasses(subclass)
 In[5]: reduce(op.or_, subclasses(ASTNode))
Out[5]: __main__.Expr | __main__.Literal | __main__.NumberLiteral | __main__.NoneLiteral | __main__.BinExpr

Is that a bit hacky? Is there a better way?

@mishamsk
Copy link
Author

@dhirschfeld what you allude to is probably generating decoders, that will go through each possible subclass in turn and try that subclass. However, this means that it must be executed within a dec_hook, which means python, which basically beats the purpose of using msgspec in the first place. The reason why msgspec is fast, is that decoding is done in C-code. But hooks delegate to python.

The only other approach now is just listing the union explicitely, but it is not an option to use something like this:

class BinExpr(Expr):
    """Base for all binary expressions"""
    left: Union[NumberLiteral | NoneLiteral | Literal | BinExpr | Expr]
    right: Union[NumberLiteral | NoneLiteral | Literal | BinExpr | Expr]

as this will not be readable in real life (100s of AST classes) and not maintainable (one would need to keep every such union up to date when adding new classes).

@mishamsk
Copy link
Author

mishamsk commented Dec 5, 2023

@jcrist hey, saw you are out on parental leave (congrats!), so not sure if you'll have time to look into it. no pressure, just a friendly nudge.

I've actually realized that my use case is impossible to implement with the current version msgspec. If I use a Struct or dataclass for the base class (ASTNode in the example below), apparently there is no way to force msgspec to use dec_hook. It just throws a validation error, when trying to decode an field that is typed as a base class, but is actually some subclass (e.g. decoding an encoded BinExpr with left & right both being another BinExpr). And using some entirely bespoke dataclass-like custom class doesn't seem reasonable at all.

So was wondering if you'd consider this use case as meaningful enough to implement at some point in time?

@mishamsk
Copy link
Author

feel a little embarrassed, but apparently this is already working. Not sure why I’ve missed it in my tests before(( awesome lib!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants