Handling of schemas whose fields are not in accordance with the rules for Python variables #628

anneum · 2024-05-16T08:45:22Z

Is your feature request related to a problem? Please describe.
When I convert an Avro schema whose field names are not valid Python identifiers to a Pydantic (or Avrodantic) model, I get an invalid model.

For example, if I convert the following schema, I get an invalid model:

{
  "type": "record",
  "name": "SchemaAvroBenchmark",
  "fields": [
    {
      "name": "scope-case",
      "type": "string"
    }
  ]
}

The generated model is invalid because of the - in the attribute name:

from dataclasses_avroschema.pydantic import AvroBaseModel

class SchemaAvroBenchmark(AvroBaseModel):
    scope-case: str
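A generator can detect such names before it emits the class. The following is a minimal sketch using only the standard library; `is_valid_attribute_name` is a hypothetical helper for illustration and not part of dataclasses-avroschema.

```python
import keyword


def is_valid_attribute_name(name: str) -> bool:
    """Return True if `name` can be used as a Python attribute name."""
    # str.isidentifier() also accepts keywords such as "class", so check both.
    return name.isidentifier() and not keyword.iskeyword(name)


schema_fields = ["scope-case", "scope_case2", "class", "2nd"]
invalid = [name for name in schema_fields if not is_valid_attribute_name(name)]
print(invalid)  # ['scope-case', 'class', '2nd']
```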

In my opinion, the serialization_alias parameter of pydantic's Field should be used for this.

from dataclasses_avroschema.pydantic import AvroBaseModel
from pydantic import Field

class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(serialization_alias='scope-case')

When converting the valid model to an Avro schema, the serialization_alias should be used as the field name.

{
    "type": "record",
    "name": "SchemaAvroBenchmark",
    "fields": [
        {
            "aliases": [
                "scope-case"
            ],
            "name": "scope_case",
            "type": "string"
        }
    ]
}

To summarize: Converting from a schema to a model and then back to a schema should result in the same schema.

I am already working on the implementation and getting the right result for such a simple schema (except for importing fields), but would appreciate support. In particular, I am having a lot of trouble with union cases.

marcosschroh · 2024-05-24T09:59:32Z

Hi @anneum

Yes, we should definitely generate proper class attribute names in snake case. However, we should not use aliases unless the original schema has them. The reason is that we should keep the same schema: if we change the field name and add an alias, then the two schemas (the original and the one generated from the model) will be different. Did you try to use the case functionality?

marcosschroh · 2024-05-24T10:07:42Z

Following your example, you could do the following (if the field is generated properly):

from dataclasses_avroschema.pydantic import AvroBaseModel
from dataclasses_avroschema import case


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str

print(SchemaAvroBenchmark.avro_schema(case_type=case.SPINALCASE))

{"type": "record", "name": "SchemaAvroBenchmark", "fields": [{"name": "scope-case", "type": "string"}]}

marcosschroh · 2024-05-24T11:01:36Z

@anneum if you update to the latest version (0.59.1) now it will generate valid python variables.

anneum · 2024-05-24T11:47:37Z

That's great, thanks for that. I see a little problem when we specify the case. It also overwrites field names that do not match the case.

Original Schema:

{
  "type": "record",
  "name": "SchemaAvroBenchmark",
  "fields": [
    {
      "name": "scope-case",
      "type": "string"
    },
    {
      "name": "scope_case2",
      "type": "string"
    }
  ]
}

After the model_generator.render(schema=schema, model_type=ModelType.AVRODANTIC.value):

from dataclasses_avroschema import case
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str
    scope_case2: str

print(SchemaAvroBenchmark.avro_schema(case_type=case.SPINALCASE))

After the conversion into an Avro schema:

{
    "type": "record",
    "name": "SchemaAvroBenchmark",
    "fields": [
        {
            "name": "scope-case",
            "type": "string"
        },
        {
            "name": "scope-case-2",
            "type": "string"
        }
    ]
}
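The loss of information can be reproduced without the library. The sketch below uses a deliberately naive conversion (replacing `_` with `-`), which is an assumption made for illustration and not the library's actual case logic; the library output above additionally separates the trailing digit. Either way, once every field is forced into one case, the schema can no longer tell which names originally contained a hyphen.

```python
def to_spinal_naive(name: str) -> str:
    """Naive snake_case -> spinal-case conversion (illustration only)."""
    return name.replace("_", "-")


original_names = ["scope-case", "scope_case2"]

# Step 1: schema -> model. Hyphens are not allowed in attribute names.
attribute_names = [name.replace("-", "_") for name in original_names]

# Step 2: model -> schema with one case applied to every field.
round_tripped = [to_spinal_naive(name) for name in attribute_names]

print(round_tripped)                    # ['scope-case', 'scope-case2']
print(round_tripped == original_names)  # False
```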

Therefore, I suggest that we use the pydantic Field with the serialization_alias parameter and, instead of converting it to an alias in the schema, use it to store the original name.

The avro_schema() method should then use the serialization_alias (if available) as the field name instead of the attribute name when creating the schema.

from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(..., serialization_alias='scope-case')
    scope_case2: str

print(SchemaAvroBenchmark.avro_schema())
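A minimal sketch of that lookup rule follows. It uses standard-library dataclasses as a stand-in for the pydantic model so that it runs without dependencies. The `serialization_alias` metadata key and the `schema_field_names` helper are assumptions made for illustration; they are not the library's implementation.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SchemaAvroBenchmark:
    # The original Avro name is stored next to the valid Python attribute name.
    scope_case: str = field(metadata={"serialization_alias": "scope-case"})
    scope_case2: str


def schema_field_names(model: type) -> list[str]:
    """Prefer the stored alias and fall back to the attribute name."""
    return [f.metadata.get("serialization_alias", f.name) for f in fields(model)]


print(schema_field_names(SchemaAvroBenchmark))  # ['scope-case', 'scope_case2']
```

With this rule, only the fields that had to be renamed are mapped back, and fields such as `scope_case2` keep their name regardless of any case setting.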

anneum · 2024-05-24T11:53:42Z

As an addition, I think the serialization_alias parameter is meant for just such cases. https://docs.pydantic.dev/latest/concepts/fields/#field-aliases

marcosschroh · 2024-05-24T14:14:16Z

Ok, we can add the alias to the generated model. We need to take into account that if the field already has aliases defined, it gets a bit awkward, because serialization_alias is a string, not a list. Example:

{
    "type": "record",
    "name": "SchemaAvroBenchmark",
    "fields": [
        {
            "aliases": [
                "my-scope-case",
                "renamed-scope-case"
            ],
            "name": "scope-case",
            "type": "string"
        }
    ]
}

Maybe we should use the metadata to add the aliases instead of serialization_alias? In this case the generated model should be:

from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(metadata={"aliases": ["scope-case", ...]})

Then when generating the schema from the model, we will have the extra alias "scope-case". Does it work for you? If it does, then it is quite easy to implement.

anneum · 2024-05-27T07:50:24Z

The metadata field sounds like a good option, but I would use it as a string instead of a list. This is because there can only be one alias that could not be used as a field name because of its name. So if the field name in the schema is not valid (according to python rules), we add the original field name as an alias in the metadata and change the field name for the variable label in the pydantic model to a valid name.

The reason I would have liked to use the serialization_alias is the ability built into pydantic to dump the object with .model_dump(by_alias=True). This allows me to bring objects directly into the shape needed for the schema. This kills two birds with one stone.

from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(..., serialization_alias='scope-case')
    scope_case2: str

SchemaAvroBenchmark(scope_case='foo', scope_case2='bar').model_dump(by_alias=True)
# {'scope-case': 'foo', 'scope_case2': 'bar'}

marcosschroh · 2024-06-05T11:36:04Z

Question: are you using model_dump(by_alias=True) to generate the payload, then encoding it and sending it to kafka? I am asking because the way to do it is using serialize.

from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(serialization_alias='scope-case')


# Serialize to avro-json to send to kafka, just to see the fields (the same happens with avro-binary)
benchmark = SchemaAvroBenchmark.fake()

print(benchmark, "\n\n")
# >>> scope_case='FbeDbPMeawuTwxUbhSaY'

# This can be sent to kafka (it is bytes)
ser = benchmark.serialize(serialization_type="avro-json")

# It will produce an event with the field `scope_case` and not with the alias.
print(ser)
# >>> b'{"scope_case": "FbeDbPMeawuTwxUbhSaY"}'

# This is with the alias, but it is a dict, not bytes. Do not send it to kafka.
print(benchmark.model_dump(by_alias=True))
# >>> {'scope-case': 'FbeDbPMeawuTwxUbhSaY'}

I will work on a PR to add the alias. It will work as follows:

- Generate a valid identifier to use in the python class.
- The original field name will be stored in serialization_alias.
- If the original field already has aliases, then the python class must contain both serialization_alias and metadata={"aliases": [ ...]}.
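The three steps could be sketched as follows. `to_identifier` and `render_field` are hypothetical helpers written for this illustration, and the rendering is simplified (only `str` fields, names starting with a digit are not handled), so the actual PR may differ.

```python
import keyword
import re


def to_identifier(name: str) -> str:
    """Turn an Avro field name into a valid Python identifier (simplified)."""
    candidate = re.sub(r"\W", "_", name)  # replace every non-word character
    if keyword.iskeyword(candidate):
        candidate = f"{candidate}_"
    return candidate


def render_field(avro_field: dict) -> str:
    """Render one model attribute line for a string field."""
    name = avro_field["name"]
    identifier = to_identifier(name)
    aliases = avro_field.get("aliases", [])
    args = []
    if identifier != name:
        # Step 2: keep the original name so the schema can be restored.
        args.append(f"serialization_alias={name!r}")
    if aliases:
        # Step 3: aliases that already exist in the schema go into metadata.
        args.append(f"metadata={{'aliases': {aliases!r}}}")
    default = f" = Field({', '.join(args)})" if args else ""
    return f"{identifier}: str{default}"


print(render_field({"name": "scope-case", "type": "string"}))
# scope_case: str = Field(serialization_alias='scope-case')
print(render_field({"name": "scope_case2", "type": "string"}))
# scope_case2: str
```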

marcosschroh added a commit that referenced this issue May 24, 2024

fix: generate proper python identifiers when generating model from schemas. Related to #628 (517926b)

marcosschroh mentioned this issue May 24, 2024

fix: generate proper python identifiers when generating model from sc… #639 (Merged)

marcosschroh added commits that referenced this issue May 24, 2024

fix: generate proper python identifiers when generating model from schemas. Related to #628 (8154ab8, c2a839d, 0d94200)
fix: generate proper python identifiers when generating model from schemas. Related to #628 (#639) (8e29eee)

marcosschroh added a commit that referenced this issue Jun 7, 2024

fix: add aliases to field when field name is not a valid identifier. Related to #628 (c75d1bc)

marcosschroh mentioned this issue Jun 7, 2024

fix: add aliases to field when field name is not a valid identifier. … #654 (Merged)

marcosschroh added commits that referenced this issue Jun 7, 2024

fix: add aliases to field when field name is not a valid identifier. Related to #628 (d4b62e7)
fix: add aliases to field when field name is not a valid identifier. Related to #628 (#654) (9722254)
