Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of schemas whose fields are not in accordance with the rules for Python variables #628

Open
anneum opened this issue May 16, 2024 · 8 comments

Comments

@anneum
Copy link
Contributor

anneum commented May 16, 2024

Is your feature request related to a problem? Please describe.
When I convert an Avro schema whose fields do not match the rules for Python variables to a Pydantic (or Avrodantic) model, I get a non-valid model.

For example, if I convert the following schema, I get an invalid model:

{
  "type": "record",
  "name": "SchemaAvroBenchmark",
  "fields": [
    {
      "name": "scope-case",
      "type": "string"
    }
  ]
}

Invalid value due to - in the variable name

from dataclasses_avroschema.pydantic import AvroBaseModel

class SchemaAvroBenchmark(AvroBaseModel):
    scope-case: str

In my opinion, the serialization_alias from Fields should be used for this.

from dataclasses_avroschema.pydantic import AvroBaseModel
from pydantic import Field

class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(serialization_alias='scope-case')

When converting the valid model to an Avro schema, the serialization_alias should be used as the field name.

{
    "type": "record",
    "name": "SchemaAvroBenchmark",
    "fields": [
        {
            "aliases": [
                "scope-case"
            ],
            "name": "scope_case",
            "type": "string"
        }
    ]
}

To summarize: Converting from a schema to a model and then back to a schema should result in the same schema.

I am already working on the implementation and getting the right result for such a simple schema (except for importing fields), but would appreciate support. In particular, I am having a lot of trouble with union cases.

@marcosschroh
Copy link
Owner

marcosschroh commented May 24, 2024

Hi @anneum

Yes, we should definitely generate proper class attributes names in snakecase. However, we should not uses aliases unless the original schema has it. The reason is that we should keep the same schema, if we change the field name and we add an aliases then the schemas (original and generated from model) will be different. Did you try to use the case funcionality?

@marcosschroh
Copy link
Owner

Following your example, you could do the following (if the field is generated properly):

from dataclasses_avroschema.pydantic import AvroBaseModel
from dataclasses_avroschema import case


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str

print(SchemaAvroBenchmark.avro_schema(case_type=case.SPINALCASE))

{"type": "record", "name": "SchemaAvroBenchmark", "fields": [{"name": "scope-case", "type": "string"}]}

@marcosschroh
Copy link
Owner

@anneum if you update to the latest version (0.59.1) now it will generate valid python variables.

@anneum
Copy link
Contributor Author

anneum commented May 24, 2024

That's great, thanks for that. I see a little problem when we specify the case. It also overwrites field names that do not match the case.

Original Schema:

{
  "type": "record",
  "name": "SchemaAvroBenchmark",
  "fields": [
    {
      "name": "scope-case",
      "type": "string"
    },
    {
      "name": "scope_case2",
      "type": "string"
    }
  ]
}

After the model_generator.render(schema=schema, model_type=ModelType.AVRODANTIC.value):

from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str
    scope_case2: str

print(SchemaAvroBenchmark.avro_schema(case_type=case.SPINALCASE))

After the conversion into a avro schema:

{
    "type": "record",
    "name": "SchemaAvroBenchmark",
    "fields": [
        {
            "name": "scope-case",
            "type": "string"
        },
        {
            "name": "scope-case-2",
            "type": "string"
        }
    ]
}

Therefore, I suggest that we use the pydantic field with the serialization_alias parameter and instead of converting it to an alias as part of the schema, use it to store the original name.

The avro_schema() method should then use the serialization_alias (if available) as the field name instead of the attribute name when creating the schema.

from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(..., serialization_alias='scope-case')
    scope_case2: str

print(SchemaAvroBenchmark.avro_schema())

@anneum
Copy link
Contributor Author

anneum commented May 24, 2024

As an addition, I think the serialization_alias parameter is meant for just such cases. https://docs.pydantic.dev/latest/concepts/fields/#field-aliases

@marcosschroh
Copy link
Owner

Ok, we can add the alias to the generated model. We need to take into account that if the field has aliases already defined then it will be a bit weird and serialization_alias is a string, not list. Example:

{
    "type": "record",
    "name": "SchemaAvroBenchmark",
    "fields": [
        {
            "aliases": [
                "my-scope-case",
                "renamed-scope-case"
            ],
            "name": "scope-case",
            "type": "string"
        }
    ]
}

Maybe we should use the metadata to add the aliases instead of serialization_alias? In this case the generated model should be:

from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(metadata={"aliases": ["scope-case", ...])

Then when generating the schema from the model, we will have the extra alias "scope-case". Does it work for you? If it does, then it is quite easy to implement.

@anneum
Copy link
Contributor Author

anneum commented May 27, 2024

The metadata field sounds like a good option, but I would use it as a string instead of a list. This is because there can only be one alias that could not be used as a field name because of its name. So if the field name in the schema is not valid (according to python rules), we add the original field name as an alias in the metadata and change the field name for the variable label in the pydantic model to a valid name.

The reason I would have liked to use the serialization_alias is the ability built into pydantic to dump the object with .model_dump(by_alias=True). This allows me to bring objects directly into the shape needed for the schema. This kills two birds with one stone.

from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(..., serialization_alias='scope-case')
    scope_case2: str

SchemaAvroBenchmark(scope_case='foo', scope_case2='bar').model_dump(by_alias=True)
# {'scope-case': 'foo', 'scope_case2': 'bar'}

@marcosschroh
Copy link
Owner

marcosschroh commented Jun 5, 2024

Question: are you using mudel_dump(by_alias=True) to generate the payload, then encode it and send it to kafka? I am asking because the way to do it is using serialize.

from pydantic import Field
from dataclasses_avroschema.pydantic import AvroBaseModel


class SchemaAvroBenchmark(AvroBaseModel):
    scope_case: str = Field(serialization_alias='scope-case')


# serialize to avro-json to send to kafka just to see the fields (the same will happen with avro-binary)
benchmark = SchemaAvroBenchmark.fake()

print(benchmark, "\n\n")
>>> scope_case='FbeDbPMeawuTwxUbhSaY'

# This can be send to kafka (it is bytes)
ser = benchmark.serialize(serialization_type="avro-json")

# It will produce an event with the field `scope_case` and not with the alias. 
print(set)
>>> b'{"scope_case": "FbeDbPMeawuTwxUbhSaY"}'

# This is with alias, but they are not bytes. Do not send to kafka.
print(benchmark.model_dump(by_alias=True))
>>> {"scope-case'": 'FbeDbPMeawuTwxUbhSaY'}

I will work in a PR to add the alias, working as:

  1. Generate a valid identifier to use in the python class
  2. The original field name will be added in the serialization_alias
  3. If the original field has more aliases, then the python class must contain serialization_alias and metadata={"aliases": [ ...]}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants