Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support generic dataclasses #172

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

onursatici
Copy link

Previously, generic dataclasses were not supported even if they had type arguments, now it is possible to do the following:

T = typing.TypeVar("T")

@dataclasses.dataclass
class SimpleGeneric(typing.Generic[T]):
    data: T

schema_int = marshmallow_dataclass.class_schema(SimpleGeneric[int])  # this works
schema = marshmallow_dataclass.class_schema(SimpleGeneric)  # this doesn't, no type arguments

It is also possible to use such generic aliases nested in another dataclass:

@dataclasses.dataclass
class Nested:
    data: SimpleGeneric[int]
    
schema = marshmallow_dataclass.class_schema(Nested)  # generated schema will validate data has an integer value

@onursatici
Copy link
Author

@lovasoa does this seem like the right approach to you? I wanted to keep the previous behaviour of not generating a schema for a generic dataclass if it does not specify any type arguments, but if they do, it feels like it would be a good feature for marshmallow-dataclass to be able to create a schema, respecting the given type arguments

@dairiki
Copy link
Collaborator

dairiki commented Jan 26, 2022

I don't know how easy it is to do, but it would be nice to generalize some more.

As an example, currently, with this PR, the following does not work, though there are no unfixed type parameters.

from dataclasses import dataclass
from typing import Generic
from typing import TypeVar

from marshmallow_dataclass import class_schema

T = TypeVar("T")

@dataclass
class Simple(Generic[T]):
    x: T

@dataclass
class Nested(Generic[T]):
    nested: Simple[T]

schema_class = class_schema(Nested[int])

(⇒ TypeError: T is not a dataclass and cannot be turned into one.)

@lovasoa
Copy link
Owner

lovasoa commented Apr 21, 2022

Hello ! Sorry for the delay ! Looks like there is a type issue on python 3.7

@onursatici
Copy link
Author

Sorry for the delay here, I will take a look to the mypy complaint and will see if it is low lift to support nested generic dataclasses. It might take a while until I find the time so feel free to close till then

@jemshit
Copy link

jemshit commented Nov 21, 2022

Will this get merged? Is there a workaround without mergin this PR?

@lovasoa
Copy link
Owner

lovasoa commented Nov 21, 2022

I will be happy to merge it when conflicts are fixed and the tests pass.
@jemshit, if you want to pick up the work where it was left off, that would be welcome !

@@ -354,21 +356,27 @@ def class_schema(
del current_frame
_RECURSION_GUARD.seen_classes = {}
try:
return _internal_class_schema(clazz, base_schema, clazz_frame)
return _internal_class_schema(clazz, base_schema, clazz_frame, None)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have to add None here explicitly so we hit the cache for _internal_class_schema. lru_cache uses all given params as key, not adding the default values for not specified params

@onursatici
Copy link
Author

@lovasoa should be good for a review, I haven't ran tests for older python versions though. Appreciate if you could kick off CI again

@onursatici
Copy link
Author

I have also added a test to cover @dairiki 's comment. It should support nested generics now and it should only work when the same concrete type is used for the same TypeVar

@dairiki
Copy link
Collaborator

dairiki commented Nov 30, 2022

@onursatici Tests were failing due to pre-commit hook bitrot/breakage. I've fixed that and merged those fixes here.

Now there is a test failure for python < 3.10. I haven't looked at it carefully yet, to see what the fix (or even the cause) might be.

@onursatici
Copy link
Author

Thanks a lot @dairiki , it seems like __name__ was not an attribute of generic alias prior to 3.10, I can have a look

@onursatici
Copy link
Author

Also this implementation has a bug for repeated use of the same generic dataclasses.

Example:

# class B is a generic dataclass
class A:
  a: B[str]
  a: B[int]
  c: B[int]

Because marshmallow_dataclass supports forward references, the field generated for c would be a Nested field, with the forward reference "B" as its argument.
Now when using this schema, marshmallow will happily resolve the forward schema to be the first schema it has registered for this class B, which is of the B[str] variant, because it does appear as the first field with a B type. Because of this behaviour the schema of A will fail to deserialise field c, because it would treat it as B[str].

one ugly fix would be reordering fields:

class A:
  a: B[int]
  c: B[int]
  a: B[str]

now field c would be correctly identified as B[int] because it is the first B marshmallow sees, from field a.

Fixing this in general without breaking forward reference support is a bit tricky, because marshmallow's class registry is oblivious to generic arguments. We can accept that this is not something we support or log warning, but that would require keeping track of extra state (which variant of this generic class have we seen first, and is it different than the variant we are trying to use a forward reference for now, if so, log warning)

@onursatici
Copy link
Author

@dairiki fixed conflicts and the failing tests for python < 3.10. Also added a test for my comment above

@dairiki
Copy link
Collaborator

dairiki commented Dec 10, 2022

@onursatici Thank you!

The tests under python 3.6 had started failing due to the upgrade of the ubuntu-latest runners from 20.04 to 22.04.
I've fixed that and merged that fix here.

There's still a test failure under python 3.6. (I haven't looked at it in detail.)

@onursatici
Copy link
Author

@dairiki py3.6 should be fine now based on my local testing. Would be nice to kick CI one more time 🤞

@dairiki
Copy link
Collaborator

dairiki commented Dec 15, 2022

@onursatici I still haven't found time to dig into this deeply, so these may be stupid questions, but...

Also this implementation has a bug for repeated use of the same generic dataclasses.

Example:

# class B is a generic dataclass
class A:
  a: B[str]
  a: B[int]
  c: B[int]

Because marshmallow_dataclass supports forward references, the field generated for c would be a Nested field, with the forward reference "B" as its argument. Now when using this schema, marshmallow will happily resolve the forward schema to be the first schema it has registered for this class B, which is of the B[str] variant, because it does appear as the first field with a B type. Because of this behaviour the schema of A will fail to deserialise field c, because it would treat it as B[str].

Does this example depend on having the repeated a field? (Or could the second a field be named b?)

one ugly fix would be reordering fields:

class A:
  a: B[int]
  c: B[int]
  a: B[str]

now field c would be correctly identified as B[int] because it is the first B marshmallow sees, from field a.

Why doesn't this break the (second) a field? Wouldn't it mistakenly be treated as B[int]?

What happens if there are multiple dataclasses with B fields?

class A1:
    x: B[int]
SchemaA1 = class_schema(A1)

class A2:
    y: B[str]
SchemaA2 = class_schema(A2)

Does the same issue pertain?

Fixing this in general without breaking forward reference support is a bit tricky, because marshmallow's class registry is oblivious to generic arguments.

What would fixing that involve?

We can accept that this is not something we support or log warning, but that would require keeping track of extra state (which variant of this generic class have we seen first, and is it different than the variant we are trying to use a forward reference for now, if so, log warning)

This does seem like creepy behavior, especially so if we can't issue a warning when someone bumps into it.

@onursatici
Copy link
Author

@dairiki here is my understanding:

Does this example depend on having the repeated a field? (Or could the second a field be named b?)

my bad, the issue happens when the fields have distinct names, so the example should have been:

# class B is a generic dataclass

# this is not fine
class A:
  a: B[str]
  b: B[int]
  c: B[int]

# this is fine
class A:
  b: B[int]
  c: B[int]
  a: B[str]

Why doesn't this break the (second) a field? Wouldn't it mistakenly be treated as B[int]?

As far as I understand _RECURSION_GUARD.seen_classes in marshmallow_dataclass correctly uses generic arguments as keys, so B[int] and B[str] would not end up in a nested schema field (meaning it will miss the seen_classes cache).
The problem is with marshmallow, which also has a class registry, and it doesn't know about generic arguments.
So for this bug to occur, we would need marshmallow_dataclass.class_schema to create a Nested type for a field, and it only does it if it sees two identical types, including generic arguments. However on deserialisation, marshmallow treats this Nested field type to have a type of the first seen generic variant of the class, because the key it uses for its internal cache (i.e class_registry) does not have the generic arguments of the type it is deserialising.

it goes something like this:

# comments explain the field schemas `class_schema` derives from the dataclass
class A:
  a: B[str]  # field schema correctly created as B[str]
  b: B[int]  # field schema correctly created as B[int]
  c: B[int]  # B[int] is in `seen_classes`, field schema is Nested(B)

# when we do this:
marshmallow_dataclass.class_schema(A)().load(a_object)
# marshmallow does:
# load a first, add its type `B` to class_registry, so the registry has the mapping `B -> schema of B[str]`
# load b, use given field schema, which is for B[int]
# load c, this has a Nested(B) schema, and the deserialiser stored in class_registry for B is for B[str], so this fails


# when we reorder fields like this:
class A:
  b: B[int] # field schema correctly created as B[int]
  c: B[int] # B[int] is in `seen_classes`, field schema is Nested(B)
  a: B[str] # field schema correctly created as B[str]

# now when we do this:
marshmallow_dataclass.class_schema(A)().load(a_object)
# marshmallow does:
# load b first, add its type `B` to class_registry, so the registry has the mapping `B -> schema of B[int]`
# load c, this has a Nested(B) schema, and the deserialiser stored in class_registry for B is for B[int], so this succeeds
# load a, use given field schema, which is for B[str]

Does the same issue pertain?

I don't think it will because none of these classes would have a field schema of Nested. That is when marshmallow's class registry is read as far as I know

What would fixing that involve?

Some thoughts:

  • do not use nested field schema's if the value has generic arguments, but that would break recursive generic types
  • fix marshmallow class_registry upstream so it does respect generic args
  • support recursive types but still do not use nested schemas, but limit the level of type recursion

@dairiki
Copy link
Collaborator

dairiki commented Dec 20, 2022

@onursatici

Also this implementation has a bug for repeated use of the same generic dataclasses.

Aha!

If I understand correctly, here's what's going on.

Currently, we store the class name in _RECURSION_GUARD.seen_classes, keyed by type. I.e. for B[int] we set seen_classes[B[int]] = "B", and for B[str] we set `seen_classes[B[str]] = "B" as well. (See here.)

The marshmallow.fields.Nesteds get instantiated here.
For each given type (e.g. B[int]):

  • The first time through, nested gets set to the result of calling _internal_class_schema. This is a Marshmallow schema class object [ed: class object not instance] — and is the correct schema for the given type. Then field_for_schema returns Nested(nested), which is correct.
  • The second time through for the same type, nested gets set to _RECURSION_GUARD.seen_classes[B[int]], which is the string "B". Then field_for_schema returns Nested("B"). Here's where the incorrect cached value is being returned — Nested looks up the schema from Marshmallow's class registry using the key "B".

Maybe the fix is to somehow cache the schema instance (rather than or in addition to schema class name) in the _RECURSION_GUARD thread-local. (Or maybe cache the Marshmallow field instance?)


Also of note, is these problem cases seem to work in older python versions. (Or at least behave differently.)
I think this has to do with this use of repr(clazz) rather than clazz.__name__ to generate the seen_classes key for python < 3.10. Note B[int].__name__ is "B", while repr(B[int]) is something like "__main__.B[int]".

@dairiki
Copy link
Collaborator

dairiki commented Dec 20, 2022

I think this has to do with this use of repr(clazz) rather than clazz.__name__ to generate the seen_classes key for python < 3.10.

So the name under which a schema gets entered in Marshmallow's class_registry comes from the name specified when the schema class is constructed here. That is whatever "name" we picked up here, which is the same name that gets remembered in _RECURSION_GUARD.seen_classes. That "name" is, e.g., "B" in python >= 3.10 and "<module>.B[int]" in earlier pythons — the later (usually) works, since it includes the generic parameter types, the former doesn't.

So, just including the generic type parameters within the "name" string is enough to mostly fix the issue.
I still think it's probably a better fix to actually cache the schema class for the type rather than just a "name" of that schema class.

@onursatici I can look at this further to see what a clean fix might be, but I likely won't have time to get to that until after New Year's sometime. (In the meantime, if you figure out something, have at it!)

@onursatici
Copy link
Author

@dairiki Thanks for having a look! Yea caching the schema class makes sense to me. I can have a go at it, hopefully soon if I can make the time, will let you know if I can't so we don't do double work

@dairiki
Copy link
Collaborator

dairiki commented Jan 12, 2023

Currently, this PR works, I think, if the dataclass fields have types that are one of:

  • one of the generic type parameters
  • a generic dataclass (with generic args passed on)
  • a non-generic type

It does not appear to work when fields have other generic types, e.g.

T = TypeVar("T")

@dataclasses.dataclass
class Z(Generic[T]):
    z: List[T]

class_schema(Z[int])  # -> TypeError("T is not a dataclass and cannot be turned into one.")

Since we are already passing generic_params_to_args to field_for_schema, I suspect this is fixable (though I haven't tried to do so.)

eta: also, what about:

@dataclasses.dataclass
class ZZ(Generic[T]):
    z: List[Tuple[T, T]]
    t: Z[Tuple[T, T]]

@dairiki
Copy link
Collaborator

dairiki commented Jan 12, 2023

If generic parameters are not specified for a generic dataclass field types, should we assume Any (the same way we do for List and other built-in generics (in _generic_type_add_any))?

E.g., with the current state of this PR:

T = TypeVar("T")

@dataclasses.dataclass
class GenericDataclass(Generic[T]):
    x: T

@dataclasses.dataclass
class Bar:
    x: GenericDataclass[int]
    y: List  # (interpreted as List[Any])

class_schema(Bar) # this works

@dataclasses.dataclass
class Foo:
    fu: GenericDataclass

class_schema(Foo)  # throws => TypeError("T is not a dataclass and cannot be turned into one.")

dairiki added a commit to dairiki/marshmallow_dataclass that referenced this pull request Jan 12, 2023
@dairiki
Copy link
Collaborator

dairiki commented Jan 12, 2023

Maybe the fix is to somehow cache the schema instance (rather than or in addition to schema class name) in the _RECURSION_GUARD thread-local. (Or maybe cache the Marshmallow field instance?)

As the name (RECURSION_GUARD) suggests, we can't do exactly this, since the whole point here is to be able to deal with recursive schemas. At the time we create the Nested field, we have not necessarily finished compiled the schema for the nested type.

One way around this is that marshmallow.fields.Nested can accept a callable for it's nested parameter, thus allowing for passing deferred values.

I've forked your onursatici:os/generic-dataclasses branch and, I think fixed this issue here in dairiki:generic-dataclasses.
In particular, see 024375f.

@dairiki
Copy link
Collaborator

dairiki commented Jan 13, 2023

Implements #230

dairiki added a commit to dairiki/marshmallow_dataclass that referenced this pull request Jan 19, 2023
Manually merge work in PR lovasoa#172 by @onursatici to support generic dataclasses

Refactor _field_for_schema to reduce complexity.
dairiki added a commit to dairiki/marshmallow_dataclass that referenced this pull request Jan 20, 2023
Manually merge work in PR lovasoa#172 by @onursatici to support generic dataclasses

Refactor _field_for_schema to reduce complexity.
dairiki added a commit to dairiki/marshmallow_dataclass that referenced this pull request Jan 20, 2023
Manually merge work in PR lovasoa#172 by @onursatici to support generic dataclasses

Refactor _field_for_schema to reduce complexity.
@onursatici
Copy link
Author

Thanks for having a look, it looks neat. I think I can have a look at this next week, and add the tests for generic types other than dataclasses. Or If you wish you can also take this PR over, both work for me

@dairiki
Copy link
Collaborator

dairiki commented Jan 21, 2023

@onursatici So I've been drawn down the rabbit-hole and have started a larger refactor of the marshmallow_code.

I've just created a draft PR for that (with a few notes) at #232.

The work there pulls in the work here, as well as loads of other stuff.

It also fixes things to work for

@dataclasses.dataclass
class ZZ(Generic[T]):
    z: List[Tuple[T, T]]
    t: Z[Tuple[T, T]]

as noted above by moving the resolution of TypeVars to field_for_schema.


If you're in a hurry to have support for generic dataclasses, I say we should fix this PR and merge it (I think it's almost there) rather than wait for #232. But if you're not in a big rush, let's wait.

@onursatici
Copy link
Author

ah I see, yea that makes sense. Sure I can wait, thanks for doing this

@shooit
Copy link

shooit commented Jun 27, 2023

We would love for this to go in soon! Our use case is to have a generic paginated response shape that can be marshaled out to JSON

Something like:

import typing
from dataclasses import dataclass


Result = typing.TypeVar("Result")


@dataclass
class PaginatedResult(typing.Generic[Result]):
    results: list[Result]
    next_page_token: typing.Optional[str]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants