Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

default_factory values leaking into compiled schema #400

Closed
michal-rogala opened this issue Aug 29, 2023 · 14 comments
Closed

default_factory values leaking into compiled schema #400

michal-rogala opened this issue Aug 29, 2023 · 14 comments

Comments

@michal-rogala
Copy link
Contributor

Non-optional field with default_factory leaks default value into compiled schema:

from uuid import uuid4, UUID

from dataclasses_avroschema.avrodantic import AvroBaseModel
from pydantic import Field

class TestSchema(AvroBaseModel):
     uuid: UUID = Field(default_factory=uuid4)

print(TestSchema.avro_schema())

results in schema:

{
  "type": "record",
  "name": "TestSchema",
  "fields": [
    {
      "name": "uuid",
      "type": {
        "type": "string",
        "logicalType": "uuid"
      },
      "default": "8671a831-9cf5-4875-b0cf-5ed7aad81a95"
    }
  ]
}

As default value is dynamically generated - it makes no sense to have it statically compiled into schema. Also - does having "default" value baked in, make the field optional from application point of view?
cc: @ddevlin

@marcosschroh
Copy link
Owner

marcosschroh commented Aug 29, 2023

@michal-rogala that is the expected behavior. You should not do it if you want always the same default value (which I expect you want). If you want a default value which is a default uuid just return the string or you can have a default factory that returns always the same result

@michal-rogala
Copy link
Contributor Author

michal-rogala commented Aug 29, 2023

What we want to achieve is to have an application automatically fill non-optional values - like UUID or timestamps (which cannot be static by design) when object is being generated. Those values should not propagate to schema "default" field because it does not make sense in that case (like timestamp). One way to work around this is to fill those values in custom constructor but it does not scale with objects with thousands nested fields.

Do you have an idea how to achieve that? Maybe we could implement annotation to field to skip adding those values to schema.

@marcosschroh
Copy link
Owner

@michal-rogala Ah now I understand what you need. There is not a proper way to do it, but:

Do you register the schema every time that you encounter a new class with a different default value? I am asking because even though the default value changes every time still should be possible to serialize/deserialize payloads/events with schemas that are the same except the default value. What do you do with the schemas?

@michal-rogala
Copy link
Contributor Author

So far we don't use schema registry - I was wondering if this behaviour is normal - passing factory as default value implies value is dynamic (uuid/timestamp, etc) and putting it into default value in schema seemed odd.

I believe this issue can be closed now :). Thanks for explanation.

@marcosschroh
Copy link
Owner

You're welcome

I was wondering if this behaviour is normal - passing factory as default value implies value is dynamic (uuid/timestamp, etc) and putting it into default value in schema seemed odd.

I agree that is odd but if you do not register a schema every time then it is fine. Even if the field has a default value in the schema it won't be a problem because you also have it in the default_factory and its output always will be used to serialize the event.

If you register a schema, then having a default value on fields like updated_on (timestamp) it's weird but possible, everything dependens on the use case. I will try to write some documentation about it.

Let's close this issue for now. If you or someone else experience another issue or believes that this is not the right way to go let's re open it.

@michal-rogala
Copy link
Contributor Author

Didn't take too long to reopen this :). Now with serious use case for that.

Background: schemaless_read/write does not support backward schema compatibility - adding optional field to the schema results with fastavro exception when trying to deserialize older message. This is a known limitation and design problem with fastavro. Solution to that is to use regular fastavro read/write which embeds original writer schema along with message.

I deliberately omit solution with schema repository because it requires additional infrastructure and schema fields indicating version - and with existing systems it's often too late. And according to "flawed" Confluent docs - such compatibility should be achievable without schema repository. Using writer_schema option is often impossible without coordination between software components.

Problem: Using custom serialize/deserialize methods using fastavro read/write is possible - but as schema contains unnecessary "default" fields (timestamps, uuids, etc) embedded schema is larger - adding considerable overhead to message size.

Solution: when using default_factory, have an option to decide if this value should be included in the schema

@marcosschroh marcosschroh reopened this Sep 12, 2023
@marcosschroh
Copy link
Owner

We can add an extra option to the Meta called List[str]: exclude_default. This option will contain the name of the fields whose default values should be omitted(default or default_factory)

@michal-rogala
Copy link
Contributor Author

yes, this would would work - or an option to define it at field level - if you have nested dataclasses it would be difficult to manage Meta class.

@joaoe
Copy link

joaoe commented Sep 27, 2023

We can add an extra option to the Meta called List[str]: exclude_default

That sounds good but it would make much more sense to configure that in the Field, whether the default_factory value can be exported.

@tribunsky-kir
Copy link

Hello! I've encountered the same issue and found this discussion helpful. 😉

I agree that is odd but if you do not register a schema every time then it is fine.

That's true, that everything goes smoothly if schema is published only once.

However, in my own (albeit unrepresentative) experience with living in Kubernetes or its analogs, this behavior will just continuously bump the schema version to an excessive level. This is because pods are supposed to be restarted frequently, and each restart will publish a new default value.

Moreover, the same behavior occurs for any restarts of the process with the producer (e.g., local runs against a development cluster). In the case of Kafka, the producer has to push its schema to the schema registry after initialization to get the schema ID. As far as I know, this behavior is default for Java producers and the confluent-kafka-python producer.

So, I agree with @michal-rogala.

@marcosschroh, what is the specific use case for utilizing the value from the factory? In my past projects, I used default_factory specifically to dynamically generate a value 'out of thin air' in my code while keeping the schema unchanged.

Would it be better to retain the option to initialize the default value for the schema from the factory, but turn it off by default?

@marcosschroh
Copy link
Owner

Hi all,

Sorry for the delay. The reason to have default_factory due to the need to have default values for map, array and records.

Example:

Have a schema with favourites_numbers field that has a default value [7, 13] (it needs to be generated using the default_factory because it is a mutable value):

@dataclasses.dataclass
class UserAdvance(AvroModel):
    ...
    favourites_numbers: typing.List[int] = dataclasses.field(default_factory=lambda: [7, 13])

resulting in:

{
  "type": "record",
  "name": "UserAdvance",
  "fields": [
    ...
    {
      "name": "favourites_numbers",
      "type": {
        "type": "array",
        "items": "long",
        "name": "favourites_number"
      },
      "default": [7, 13]
    }
  ],
}

Then the question is: Should default_factory be applied only for map, array and records and any other mutable value?
I think yes, then it will solve this issue.

If end users want to have a default value for map, array and records they have to use the default_factory as default can not be used. It is likely that the default will always be the same. For the other cases, default must be used.

Because it is not possible to use default and default_factory at the same time, we can use it as a distinction to include or not a default in the schema (except for mutable values).

Then:

  • For types like date, time, datetime and uuid that contain a default value which was generated using default_factory (they will change every time), they won't be included in the resulting schema but new instance will have it.

Example:

@dataclasses.dataclass
class UUIDLogicalTypes(AvroModel):
    event_uuid: uuid.UUID = dataclasses.field(default_factory=uuid.uuid4)


In [9]: UUIDLogicalTypes()
Out[9]: UUIDLogicalTypes(event_uuid=UUID('c0c48c38-6cf1-4fb8-ab13-dd3e9563dfd9'))

The resulting schema must be:

"type": "record",
  "name": "UUIDLogicalTypes",
  "fields": [
    {
      "name": "uuid_1",
      "type": {"type": "string", "logicalType": "uuid"}
    }
  ]
}
  • If a default value is needed in the schema, then default must be used. This will result that the schema won't change and the instances will have always the same default (this might not have any sense at all, but it is possible)
@dataclasses.dataclass
class UUIDLogicalTypes(AvroModel):
    event_uuid: uuid.UUID = uuid.UUID('c0c48c38-6cf1-4fb8-ab13-dd3e9563dfd9')

What do you think?

@tribunsky-kir
Copy link

tribunsky-kir commented Jan 6, 2024

Thank you for the detailed answer!

Now I see what the case is: dataclasses disallow mutable default items, unlike Pydantic (my bad, I don't use them often, when I need a lightweight structure I still use named tuples).

Then, applying default_factory only for maps, arrays, and records seems to be more than a reasonable default (and a necessary evil).

And I am strongly agree on types like date, time, datetime and uuid: as their default factories are not 'pure functions' and changes output between calls, they shouldn't be used in schema to define the default value and the default value must be passed explicitly.

My only concern then remains whether the same behavior should be applied to AvroBaseModel, where we can just define mutable defaults. Ideally, it would be the same as for dataclasses - it is easier to understand from the user's perspective. On the other hand, technically, it is not necessary to impose such restrictions: when the default value is mutable, it can still be defined via Field, and the schema generator should not consider it as a special case.

from typing import List

from dataclasses_avroschema import AvroBaseModel
from pydantic import Field


# This just works and should be passed to schema
class PydanticUserAdvance(AvroBaseModel):
    ...
    favourites_numbers: List[int] = Field(default=[7, 13])

...

# At first glance, it should work as in dataclasses, but why schema generator should consider special cases 
# to skip/pass default-factory's value to the schema?
class PydanticUserAdvance(AvroBaseModel):
    ...
    favourites_numbers: List[int] = Field(default_factory=lambda: [7, 13])

What are your thoughts on this?

@marcosschroh
Copy link
Owner

I think it won't be a problem

Dataclasses:

  1. If end users want to define a default in the schema, then the default must be used except for maps, array and records.
  2. If default_factory is used for a type that is not a maps, array or records then it will be ignored in the final schema

Pydantic:

  1. If end users want to define a default in the schema, then the default must be used. This also apply for maps, array and records unlike dataclasses point 1 because with pydantic it is possible have mutable values as default.
  2. If default_factory is used for a type that is not a maps, array or records then it will be ignored in the final schema like in dataclasses.

This leads us to:

Usually, for date, time, datetime or uuid types end user want to use the default_factory so instances will have auto generated values, then the value will be ignored in the final schema (restriction 2). If end users want a default value in the schema for these types, then default must be used (restriction 1), meaning that every instance will have the same value and it will match the python class.

For maps, array or records is not a problem because the work only with restriction 2. If the default factory is dynamic then the schema will change as well, typically it is not the case.

@marcosschroh
Copy link
Owner

marcosschroh commented Jan 13, 2024

Hi all,

I tried to follow the approach that I describe above but I realized that it does not scale, we have users that generated the schema once and then they use it in a different programing language, meaning that we can add custom behavior/rules to specific types. Then, the only solution is that the end user must specifies when a default value should be excluded.

I followed the marshmallow_dataclass approach, which is what @joaoe suggested. It will be possible to use the config exclude_default in the metadata. Here is the PR, you can also take a look the example in the documentation

Let me know whether this satisfy your requirements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants