New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent using field name when using load_from #714

Merged
merged 2 commits into from Jan 6, 2018

Conversation

Projects
None yet
4 participants
@lafrech
Member

lafrech commented Jan 2, 2018

I'm afraid there is an issue with load_from.

In Unmarshaller.deserialize, it is only checked if raw_value is missing.

    if raw_value is missing and field_obj.load_from:

So if incoming data has a key that accidentally matches the field name, the corresponding value is used.

Although the tests explicitly test this, I don't think this is intended, so I assume the tests are wrong too.

This PR fixes that and fixes the tests.

I'm working on dev branch, but I'm now realizing this also affects 2.x. Should I base this PR on 2.x?

I believe this is a bugfix but not a breaking change. It only breaks scenarios when someone would write

    years = fields.Integer(attribute='age', load_from='Years')

and deserialize data with a years (not Years!) field.

@sloria

This comment has been minimized.

Member

sloria commented Jan 3, 2018

The current behavior is expected. The docs say

load_from (str) – Additional key to look for when deserializing. Will only be checked if the field’s name is not found on the input dictionary.

So this is backwards-incompatible. I would be willing to change the behavior, though I'd like to understand the rationale for the original behavior. @hakjoon Do you remember why we decided to check the field name before checking load_from in your PR?

@lafrech

This comment has been minimized.

Member

lafrech commented Jan 3, 2018

Oh, right. It looked intentional in the code but I didn't see the rationale at all, so I thought it might be a mistake.

Note that current behavior does not allow swapping fields likes this:

field_a = fields.Integer(load_from='field_b')
field_b = fields.Integer(load_from='field_a')

Not really a big deal as this would be a bit strange.

However, I don't like the fact that the deserializer accepts a value with a name from the model that is not exposed in the API. The API only exposes the load_from string. The field name is internal, implementation detail.

@lafrech

This comment has been minimized.

Member

lafrech commented Jan 3, 2018

Good news is since it is a breaking change, it is 3.0, so I don't need to rebase to 2.x.

@hakjoon

This comment has been minimized.

Contributor

hakjoon commented Jan 3, 2018

My original thought was that It would allow you to use the same schema to load a foreign API with PascalCase and an API created with this schema with proper snake_case. If we didn’t check attr_name and you loaded a dataset, dumped it you could no longer load the dumped dataset using the same schema.

I don’t know if this is worth supporting, clearly it can lead to confusion. In practice I’ve found it less confusing to just create different schemas when I have weird needs like this. Keeping straight how load_from, dump_to and attribute will interplay can get a little hairy.

(In fact I would love the ability to specify a load only or dump only schema so it would only work one way.)

@lafrech lafrech force-pushed the Nobatek:dev_fix_load_from branch from 5a4ee03 to 7b08c84 Jan 3, 2018

@lafrech

This comment has been minimized.

Member

lafrech commented Jan 3, 2018

Thanks @hakjoon for the clarification.

I've always been confused by load_from, dump_to and attribute, thinking they must overlap, somehow. Now, I think I get it.

In the general case, dump/load is symmetric and if you have a model field that is exposed with another name, you use the API name in the schema, and specify the model name as attribute.

         class MySchema(Schema):
             ApiName = fields.String(attribute='model_name')

dump_to and load_from introduce asymmetry, allowing to specify a different key for dump and load:

         class MySchema(Schema):
             model_name = fields.String(load_from='api_load_name', dump_to='api_dump_name')

If you wanted to reproduce use case 1 using these, you would need to specify both load_from and dump_to and make sure they match:

         class MySchema(Schema):
             model_name = fields.String(load_from='api_name', dump_to='api_name')

As you're saying, this may be a bit confusing. There might be use cases where asymmetry is desirable. I can't tell.

The use case I'm familiar with is an API exposing an object model, documented with apispec. In this case, it is probably a better idea to use two different schemas. In fact, AFAIU, apispec itself doesn't know what to do when dump_to != dump_load so it gives up and exposes the field name (see https://github.com/marshmallow-code/apispec/blob/dev/apispec/ext/marshmallow/swagger.py#L95 and https://github.com/marshmallow-code/apispec/blob/dev/apispec/ext/marshmallow/swagger.py#L407).

I don't have a strong stand against load_from / dump_to (I just don't use them) and maybe my confusion came from not reading the docs thoroughly. Of course, if there is consensus that they provide a feature that is best achieved another way (using different dump/load schemas), then I guess we might as well simplify things and remove them.

I stumbled upon this while trying to address #378 and figured I should solve this before introducing new changes in Unmarshaller.deserialize.

@hakjoon

This comment has been minimized.

Contributor

hakjoon commented Jan 3, 2018

one thing that I forgot to mention was that when load_from was added dump_to didn’t exist so I needed the weird fallback behavior with attr_name. Now that dump_to exists that fallback is probably no longer necessary and I think your suggested change is probably clearer behavior.

I might be in an unusual use case in that I use serilaizers a lot for consuming external APIs so load_from is super useful, but I have moved to making serializers that are only meant for be used for loading in these circumstances instead of trying to make something work both ways.

I feel like I tried using attribute to solve this like you demonstrated but it didn't work for some reason. I can't really remember why but you'll notice the original issue #125 was around exposing attribute for serialization and deserialization.

My main usecase is to be able to do transforms like this.

// input
{
    'ApiKey': 2,
    'weird-key': 3
}
// output
{
   'apikey': 2,
   'weird_key':3
}
@lafrech

This comment has been minimized.

Member

lafrech commented Jan 3, 2018

I checked your example in #125 and I don't see what you couldn't do with attribute.

class UserSchema(Schema):
    name = fields.String(deserialize_from="Name")
    email_addr = fields.String(attribute="email", deserialize_from="EmailAddress")

could be written

class UserSchema(Schema):
    Name = fields.String(attribute='name')
    EmailAddress = fields.String(attribute='email')

However, attribute doesn't solve your example above:

class MySchema(Schema):
    ApiKey = fields.String(attribute='apikey')
    weird-key = fields.String(attribute='weird_key')  # syntax error

This could be solved on user side with meta-programming, but it sucks.

Maybe this is the reason attribute didn't work for you.

A symmetric to attribute would allow this (let's call it key):

class MySchema(Schema):
    apikey = fields.String(key='apikey')
    weird_key = fields.String(key='weird-key')

and in fact, if it existed, it would cover all that attribute does so attribute itself wouldn't be useful, would it?

This would allow dump/load to/from any API with keys that are invalid python keywords on a symmetric fashion (load key = dump key) and with unique keys (only one key accepted for a field).

What load_from / dump_to brings on top of this is

  • asymmetry
  • different keys allowed

If this is correct, then it boils down to whether we want asymmetry (from what we said, different schemas sounds better) and whether we want to allow multiple keys for a field (this sounds wrong).

Should we implement attribute's symmetric parameter? How should it be named? Should we keep attribute, then?

@hakjoon

This comment has been minimized.

Contributor

hakjoon commented Jan 3, 2018

Regarding,

class UserSchema(Schema):
    Name = fields.String(attribute='name')
    EmailAddress = fields.String(attribute='email')

I don't think it ever occurred to me to use attribute like this :)

The proposed use of key makes a lot of sense to me. Honestly that was all I needed when I proposed this as I wanted to move a bunch of incoming API mappers from BPMappers, which had that exact feature, to Marshmallow because it supported a bunch of features I had bolted on top of BPMappers.

I think all the current options can be a little overwhelming, so I am all for simplifying them. For example I can't tell at a glance what would happen here.

class MySchema(Schema):
    email_addr = fields.String(attribute="email", load_from="EmailAddress", dump_to='the_email')

Depending on what Steven wants to support regarding symmetric vs asymmetric options, your proposal would work for my needs.

@sloria

This comment has been minimized.

Member

sloria commented Jan 4, 2018

There shouldn't be a need for key. Just use load_from to specify the key in the input. The name of the field will be the output key.

from marshmallow import Schema, fields


class MySchema(Schema):
    apikey = fields.Int(load_from='ApiKey')
    weird_key = fields.Int(load_from='weird-key')

    class Meta:
        strict = True


in_data = {
    'ApiKey': 2,
    'weird-key': 3
}
out_data = {
    'apikey': 2,
    'weird_key': 3
}

sch = MySchema()
deserialized = sch.load(in_data).data
assert sch.dump(deserialized).data == out_data
@lafrech

This comment has been minimized.

Member

lafrech commented Jan 4, 2018

Sure, but if you want this to be symmetrical, you need to pass both load_from and dump_to, right?

class MySchema(Schema):
    apikey = fields.Int(load_from='ApiKey', dump_only='ApiKey')
    weird_key = fields.Int(load_from='weird-key', dump_only='weird-key')

Our point is that we could simplify things by disallowing asymmetry and use a single argument (let's call it key for now):

class MySchema(Schema):
    apikey = fields.Int(key='ApiKey')
    weird_key = fields.Int(key='weird-key')

And if we do that, then there is no point in attribute anymore (except backward compatibility).

@timc13

This comment has been minimized.

timc13 commented Jan 4, 2018

it would be even cooler if we could specify a key_func in Meta that takes the field name as an argument.

My use case is that I'm camelCasing everything coming in to & going out of the API layer. Right now, I'm doing some monkeypatching like this:

def to_camel_case(snake_str):
    components = snake_str.split('_')
    return components[0] + ''.join(x.title() for x in components[1:])


def camel_case_schema_all_fields():
    for cls_ in BaseSchema.__subclasses__():
        for field_name, field_value in cls_._declared_fields.items():
            field_value.dump_to = to_camel_case(field_name)
            field_value.load_from = to_camel_case(field_name)


camel_case_schema_all_fields()

it would be awesome if I could just do this:

class BaseSchema(Schema):
    class Meta:
        key_func = to_camel_case
@sloria

This comment has been minimized.

Member

sloria commented Jan 5, 2018

Ah, I see now. Thanks for the clarification @lafrech . I like the idea of having just one parameter for specifying both the input and output key, but I'm a bit nervous about what use cases we might break. Sure, having load_from and dump_to is more typing, but it allows the greatest control.

If you want that feature, you could do this:

class BaseSchema(Schema):

    def on_bind_field(self, field_name, field_obj):
        if field_obj.metadata.get('key'):
            key = field_obj.metadata['key']
            field_obj.load_from = key
            field_obj.dump_to = key

So I think for now, we'll move this PR forward then create a RFC issue about the key parameter. Sound good @lafrech and @hakjoon ?

@timc13 Your use case is also addressed with on_bind_field:

class BaseSchema(Schema):

    def on_bind_field(self, field_name, field_obj):
        if self.opts.key_func:
            key = self.opts.key_func(field_name)
            field_obj.load_from = key
            field_obj.dump_to = key
@lafrech

This comment has been minimized.

Member

lafrech commented Jan 5, 2018

OK for the RFC issue about the new key parameter (we can probably find a better name). The proposal would be:

  • Remove load_from and dump_to
  • Remove attribute
  • Add a new parameter as described above

We can't really split this in two, because if we remove load_from and dump_to without adding the new parameter, we break the weird-key use case. However, we could remove load_from and dump_to, add the new parameter and keep attribute. But I don't see the point of keeping attribute as it brings nothing more than what the new parameter would provide.

Setting load_from and dump_to at once can also be done in a BaseSchema. However, my point here is not only to simplify things for the user but also to simplify the code here and in related libs (webargs, apicspec,...) and avoid undefined cases. Like @hakjoon says, there can be combinations that are not easily grasped at first sight. The case where load_from and dump_to are different is a bit broken in apispec IIUC.

I think @timc13's proposal is not directly related. It could be a nice feature to have but it can be added afterwards in a non-breaking way and, as you say, this can be done in a BaseSchema in the meantime.

Regarding this PR, it sounds like it could be merged as current behaviour is calling for trouble and a better way to achieve the same thing would be to create two schemas.

@lafrech lafrech force-pushed the Nobatek:dev_fix_load_from branch from 7b08c84 to b8a2366 Jan 5, 2018

@lafrech

This comment has been minimized.

Member

lafrech commented Jan 5, 2018

Rebased.

CI error is a PyPi connection timeout. I don't have the credentials to relaunch.

@sloria

This comment has been minimized.

Member

sloria commented Jan 5, 2018

Thanks @lafrech . Would you mind updating the documentation for load_from to reflect the new behavior?

@lafrech

This comment has been minimized.

Member

lafrech commented Jan 5, 2018

Done.

@sloria

This comment has been minimized.

Member

sloria commented Jan 5, 2018

Thanks! Will merge this over the weekend.

@sloria sloria merged commit 2f1fa2b into marshmallow-code:dev Jan 6, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@lafrech lafrech deleted the Nobatek:dev_fix_load_from branch Jan 8, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment