Processing API for dumping does not match API for loading #153

Closed
davidism opened this Issue Feb 19, 2015 · 33 comments

@davidism (Contributor) commented Feb 19, 2015

When dumping, I want to manipulate some of the attributes on the object being dumped before actually dumping. So I decorate a preprocessor function. But that only gets used when loading. I could use a Method field, but the data returned from that must be the final serialized form, so there's no way to specify that it's actually a Nested(OtherSchema, many=True) unless I do that serialization manually at the end of the method.

When loading, I want to manipulate the loaded data in exactly the opposite direction from the situation above. So I decorate a data_handler function. But that only gets used when dumping. The solution is more straightforward here: I override the make_object method to manipulate the final output.
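
For context, the make_object override mentioned above looks roughly like this (a sketch; Article here is a hypothetical app-level class, not part of marshmallow):

from marshmallow import Schema, fields

class Article:
    def __init__(self, **attrs):
        self.__dict__.update(attrs)

class ArticleSchema(Schema):
    title = fields.Str()

    def make_object(self, data):
        # manipulate the final loaded output here
        return Article(**data)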

My point is that the names are different everywhere, some of them are decorators while others are methods, and they don't apply in both directions, when it would be very convenient to be able to do so. The loading situation is in better shape than the dumping one, since loading has preprocessor and make_object, except there's still a weird decorator vs. method difference.

There should be a standard way to preprocess and postprocess the entire data during both loading and dumping. One solution is adding pre_dump, post_dump, pre_load, and post_load hooks, either as methods on the schema or as decorators.

@sloria (Member) commented Mar 1, 2015

You raise fair points, @davidism.

The rationale for the current API is as follows:

  • The Schema decorators--preprocessor and data_handler--allow multiple Schemas to use the same pre- and post-processing routines without having to inherit from base classes or mixins. They also hide the fact that multiple functions can be stored within a list.
  • make_object is a method because it is meant to be specific to a single Schema. Whereas pre- and post-processors are written as generic functions, make_object provides a 1-to-1 mapping between a schema and an "app-level" object.
  • There is no preprocessing hook for dumping/serialization because marshmallow is not meant to make updates to app-level objects. All processing happens in your application's business logic before serialization.

Here is the intended flow:

[client input] -> preprocess/validate -> make object -> [app-level object] -> business logic -> [app-level object] -> serialize -> modify serialized data
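
For reference, the decorator-based hooks above are attached roughly like this (a sketch based on the 1.x docs; the exact hook signatures may differ):

from marshmallow import Schema, fields

class UserSchema(Schema):
    name = fields.Str()

@UserSchema.preprocessor
def clean_input(schema, input_data):
    # runs on raw input data before loading
    input_data['name'] = input_data.get('name', '').strip()
    return input_data

@UserSchema.data_handler
def add_type(serializer, data, obj):
    # runs on the serialized result after dumping
    data['type'] = 'user'
    return data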

That said, I am not opposed to changing the API if it is unclear. Perhaps you could describe your use case in more detail and we can discuss possible solutions.

@sloria (Member) commented Mar 20, 2015

After some thought and discussion on #179, I'm leaning towards implementing the pre_load, post_load hooks suggested by @davidism. I think this would be more symmetrical and cohesive than the current API.

Doing so raises some follow-up questions:

  • Should the preprocessor and data_handler decorators be deprecated?
  • Should all hooks become methods? For example, should the validator decorator be replaced with a validate method? accessor? error_handler?
@taion (Contributor) commented Mar 20, 2015

One comment - we should make sure that the pre_load and post_dump hooks are told whether they are dealing with multiple instances or not. Doing so will resolve #177 as well.

@sloria (Member) commented Mar 20, 2015

Good idea, @taion.

@sloria (Member) commented Mar 21, 2015

Another consideration: if all the hooks are implemented as methods, implementing #116 would probably be unnecessary.

@davidism (Contributor) commented Mar 21, 2015

preprocessor should be deprecated; the default pre_load should fall back to using anything decorated with preprocessor for now. Similarly for make_object → post_load, accessor → pre_dump, and data_handler → post_dump. This will require some gymnastics to get everything using a consistent API.

Everything should be overridable methods on the Schema class, since those classes are how everything is collected anyway. If I want to re-use something between schemas, I can refactor the logic to an external function that is called by each Schema's method, or I can use subclassing.
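
For example, the reuse pattern described above might look like this (a sketch assuming the proposed pre_load method exists; the schema and helper names are hypothetical):

from marshmallow import Schema

def strip_strings(data):
    # shared pre-processing logic, reusable across schemas
    return {k: v.strip() if isinstance(v, str) else v
            for k, v in data.items()}

class UserSchema(Schema):
    def pre_load(self, data):
        return strip_strings(data)

class ArticleSchema(Schema):
    def pre_load(self, data):
        return strip_strings(data)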


There is no preprocessing hook for dumping/serialization because marshmallow is not meant to make updates to app-level objects. All processing happens in your application's business logic before serialization.

This bothers me because now I can't just use a Schema to process my data. If I want to reuse a Schema somewhere else, I also have to repeat the other processing. Schemas should be self-contained and convenient to use. If I'm concerned with mutating external data, I can create a copy as part of the pre_ process and mutate that copy instead.

@taion (Contributor) commented Mar 21, 2015

This proposal means that there will no longer be a step between deserializing into a dictionary and validation. Is this intentional? It's not a feature that I can personally see myself using, FWIW.

@davidism (Contributor) commented Mar 21, 2015

It makes more sense to validate the deserialized data, then post_process it only if it's valid. What is the extra step in between right now?

@davidism (Contributor) commented Mar 21, 2015

Here's what the method definitions might look like. I'm probably missing details, as I haven't delved into all the combinations the API currently presents.

  • pre_load(self, data) gets the raw data and returns data to be deserialized. It only ever gets one item at a time, removing the need to know if many is True. (Internally the schema calls it for each item if many is True.)
  • post_load(self, data) gets the validated, deserialized data and returns the final data/object. Like pre_load, it gets one item at a time.
  • pre_dump and post_dump work the same way
  • validate(self, data, errors) performs schema-level validation, after all fields are validated individually during (de)serialization. It puts messages in the errors dict, and returns nothing. Internally, the schema knows if there are errors now by checking if there are values in the error dict. This also means that there's no separate "external" validate vs. load methods, just call load and check errors is not None.
  • validate_<field name>(self, value) gets the value after it has been (de)serialized by the field, to perform extra validation. This is a convenience in addition to specifying a list of validators on the field. The behavior of validate and validate_<field name> is more in line with how Django and WTForms do things.
  • error_handler is removed. Whatever it does can be done in validate.
  • accessor is removed. Getting the data in the right format is the job of pre_dump now.

The logic for dump or load looks like this: pre_ → (de)serialize → validate (schema level) → post_ (only if no errors). Within (de)serialize, the logic for each field is: (de)serialize → validate (field validators) → validate_<field name>.

This means there's no way to return data and errors at the same time. Either the data was valid and errors is None, or the data was invalid and data is None, and there are errors. This avoids getting data in some indeterminate state where some of it is valid, some of it has a default/None applied because of errors, etc.
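
Put together, a schema using these proposed methods might look like this (a sketch of the proposal, not an existing API; User is a hypothetical app-level class):

from marshmallow import Schema, fields

class User:
    def __init__(self, **attrs):
        self.__dict__.update(attrs)

class UserSchema(Schema):
    name = fields.Str()

    def pre_load(self, data):
        # called once per item, even when many is True
        data = dict(data)  # copy so the caller's input isn't mutated
        data['name'] = data.get('name', '').strip()
        return data

    def validate(self, data, errors):
        # schema-level validation: add messages to errors, return nothing
        if not data.get('name'):
            errors.setdefault('name', []).append('Name is required.')

    def post_load(self, data):
        # only runs if no errors were recorded
        return User(**data)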

@sloria (Member) commented Mar 22, 2015

That's a sound plan, @davidism. A few comments:

validate(self, data, errors) performs schema-level validation... It puts messages in the errors dict, and returns nothing.

  • For backwards-compatibility, we could name this validate_schema.
  • Instead of mutating the errors dict, I'm thinking the method should raise a ValidationError, just like a normal validator. This would, however, be limiting because only one error could be raised. The benefit of the "hooks-as-decorators" API is that multiple schema validators can be attached, each one raising its own error.

This also means that there's no separate "external" validate vs. load methods,

I think the validate method is still useful when you want to validate input data without doing expensive deserialization work (e.g. database access).

validate_<field name>(self, value)

I've never really liked this API in Django forms and WTForms. It is redundant and a bit too "magical". I think we can defer implementing this.

accessor is removed. Getting the data in the right format is the job of pre_dump now.

accessor is still necessary because it defines how to pull a single value from an object (e.g. pulling a value by key from a dict, accessing an attribute/property). It does not do any preprocessing.

This means there's no way to return data and errors at the same time.

Can you clarify this? Are you proposing to change the return value of load and dump?

@davidism (Contributor) commented Mar 26, 2015

I think the validate method is still useful when you want to validate input data without doing expensive deserialization work (e.g. database access).

Any deserialization (that I do, at least) that would hit the database would also require valid values. For example, deserializing a list of ids to a list of users would still require validating that the ids are all valid. I can only imagine contrived examples where you would want to deserialize data even if it's not valid. Splitting validate and deserialize just overcomplicates the process, making some of the validation only happen on deserialization.

This means there's no way to return data and errors at the same time.

Can you clarify this? Are you proposing to change the return value of load and dump?

As above, since I can't imagine a good example of wanting some validation to only happen on deserialization, I'm proposing that either there are errors or there is valid deserialized data. The return type can stay the same, but it would always end up being {data}, None or None, {errors}.
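
Under that contract, calling code would always see exactly one of the two (a sketch; handle_errors and save are placeholders):

data, errors = UserSchema().load(payload)
if errors is not None:
    # the load failed: data is None and errors holds every message
    handle_errors(errors)
else:
    # the load succeeded: data is the fully validated result
    save(data)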

Instead of mutating the errors dict, I'm thinking the method should raise a ValidationError, just like a normal validator. This would, however, be limiting because only one error could be raised. The benefit of the "hooks-as-decorators" API is that multiple schema validators can be attached, each one raising its own error.

That's too much boilerplate. If I want 5 possibly simultaneous messages, I now need to write and decorate 5 validation functions. If they're somewhat dependent, or all require the same expensive calculation, the work has been repeated 5 times.

WTForms solves this the way I suggested, by just having you set errors directly wherever they are appropriate during form-level validation. Field validators can both raise errors and set them directly, so they provide both convenience and power without adding overhead.

accessor is still necessary because it defines how to pull a single value from an object

Because of the original problem I reported, accessor is exactly what I use for pre-processing right now. And it doesn't follow the (inconsistent) API of any of the other decorated functions; it operates per field instead of on all the data.

# a sketch of this workaround, using the 1.x accessor decorator
@MySchema.accessor
def get_field(schema, key, obj, default=None):
    if key == 'field_that_needs_preprocessing':
        return preprocessed_data
    if key == 'other_field':
        return other_data
    return get_value(key, obj, default)

pre_ would be able to do the equivalent thing, but with a behavior consistent with all the other methods. The advantage is that there is pre_dump and pre_load, while accessor is only part of the dump process.

@davidism (Contributor) commented Mar 26, 2015

validate_<field name>(self, value)

I've never really liked this API in Django forms and WTForms. It is redundant API and a bit too "magical". I think we can defer implementing this.

If you're going to make everything else methods, might as well go all the way. Currently, you can put field-specific validation as a lambda in the validators list, or make an external function. Lambdas are convenient if the validation is not complex and can fit on one line. Functions are convenient only if they're likely to be reused, which is not the case most times I want a validate_<field name> method. The convenience of having field-specific validation outweighs the perceived magic.

For backwards-compatibility, we could name this validate_schema

If you do end up implementing validate_<field name> methods, this means no field can be named "schema".

@taion (Contributor) commented Mar 26, 2015

For ValidationError, you can always make its constructor accept some sort of iterable input representing multiple errors. It seems a lot more Pythonic to me to use an exception to indicate validation errors instead of returning an option-like tuple. Especially in the case where you combine deserialization and validation, it seems more idiomatic to raise an exception to terminate deserialization if you run into an error condition. You'd just end up with something like:

if errors:
    raise ValidationError(errors)

The database-driven deserialization case is weird and I don't think e.g. DRF has an ideal solution to this. The main issue is that you might want to apply permissioning in validation during deserialization (e.g. restrict related objects to a query set corresponding to only objects for which the user has view permissions). Ideally this would be a property of the view rather than of the schema. See e.g.: encode/django-rest-framework#1985

I don't think there's an ideal answer here. I don't think the schema itself ought to know enough to find validation errors from e.g. the user not being permissioned to view a certain object.

@davidism (Contributor) commented Mar 26, 2015

If you want to stick with raising exceptions in all cases, handling the following semantics seems to cover validation (although it's really ValidationErrors now). This would allow adding one or more errors:

# during field validation, adds one message to the errors for the field
# during schema validation, adds one message to some top-level error key
raise ValidationError('message1')

# during field validation, adds multiple messages to the errors for the field
# during schema validation, adds multiple messages to some top-level error key
raise ValidationError('message1', 'message2')

# during field and schema validation, adds message or messages to the given key
raise ValidationError(field='message1', field2=['message2', 'message3'])

# *args and **kwargs can be used together, will add messages to the field and to other fields
raise ValidationError('message1', field2='message2')
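
One way an exception could support all four forms (a hypothetical sketch, not marshmallow's actual ValidationError):

class ValidationError(Exception):
    def __init__(self, *messages, **field_messages):
        # positional messages attach to the current field (or a
        # top-level key during schema validation); keyword messages
        # attach to the named fields, normalized to lists
        self.messages = list(messages)
        self.field_messages = {
            field: msgs if isinstance(msgs, list) else [msgs]
            for field, msgs in field_messages.items()
        }
        super().__init__(self.messages or self.field_messages)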

I am not saying that processing returns some "options tuple" or something different than what it already does. Dump and load already return a 2-tuple data, errors.


Why shouldn't deserialization have access to permissions or whatever other supplementary data is needed? In Flask, that sort of information is just a thread local away; you don't even need to pass anything in: session, current_user, etc. For Django-likes that don't use thread locals, marshmallow already allows setting context for validation.

@taion (Contributor) commented Mar 26, 2015

Well, if validation can only ever return either an error or the valid data, then there's not a lot of point in returning both data and errors in any event. Currently that's not the case, right? If it's not possible to deserialize data that doesn't validate, there's no reason to return both data and errors - you should always either return the data or raise some exception.

Also, I think from a design perspective filtering queries for e.g. read perms belongs on the view rather than on the schema anyway. And currently schema contexts are only applied for serialization, not for deserialization. That's neither here nor there though. I don't really want my schema loading to e.g. load the instance from the database and apply changes to it, but DRF works that way and I'm sure it's fine for them.

@davidism (Contributor) commented Mar 26, 2015

Well, if validation can only ever return either an error or the valid data, then there's not a lot of point in returning both data and errors in any event.

Yes there is, you need to know which of the two were returned. The other option you advocate is basically always setting strict=True.

currently schema contexts are only applied for serialization, not for deserialization

That's the entire point of this bug report: the dump and load apis are not symmetrical.


I'm really not seeing how you plan to validate data such as user ids without actually talking to the database. And in most cases, you'll want the actual user associated with that id, so why separate the process and require two database queries? If you don't want to query the database at all, don't, and load your real data later.

@taion (Contributor) commented Mar 27, 2015

If you set up a strict dichotomy between having valid data or having errors, then you know exactly which was returned. If there's data, you have a return value. If there's an exception, you catch it. However, even if you think that invalid fields should have no data, you can still have a case where certain fields validate but other fields don't. In those cases, then it makes sense for the deserializer to return both data and errors, where the data may be partial. I'm not a huge fan of this as the default, though.

And, correct, in my case, I separate out pulling data off the wire and validating syntax (which I do with a Schema) from actually resolving against the database, which I don't.

@sloria (Member) commented Mar 29, 2015

If you're going to make everything else methods, might as well go all the way.

I'm not proposing to abandon the idea of validator methods, but I think a more explicit approach would be better than dynamically-generated methods. Something like:

class MySchema(Schema):
    __validators__ = {
        '_schema': ['validate_schema'],
        'field_a': ['validate_field_a']
    }

    field_a = fields.Str()

    def validate_schema(self, data):
        # ...

    def validate_field_a(self, val):
        # ...

Anyway, this is a separate issue (#116).

currently schema contexts are only applied for serialization, not for deserialization

I'm not sure what is meant by this. You can access the schema context in a Field's _deserialize method through self.context.


How about we start with the following?

  • pre_ and post_ methods, as proposed by @davidism
  • Deprecate preprocessor, data_handler, and accessor.

I think these will go a long way in making the API more consistent and meet the use cases discussed without breaking compatibility.

@taion (Contributor) commented Mar 29, 2015

Oops, my bad on the context bit. I like the pre_ and post_ methods.

@sloria (Member) commented Apr 4, 2015

I would welcome any help with this. Even a work-in-progress PR with just the tests for this feature would be a big help.

@taion (Contributor) commented Apr 5, 2015

I have a couple of questions.

  1. What should the semantics of many look like for these new methods? Should they always take in either an object or a list depending on whether many == True, and take many as a kwarg?
  2. If pre_dump is replacing accessor, is it always supposed to return a dict (or depending on the above, a list of dicts) now? Or is it always supposed to return something that you can call getattr on? Or is something like this example now no longer possible: http://marshmallow.readthedocs.org/en/latest/extending.html?highlight=accessor#overriding-how-attributes-are-accessed
@taion (Contributor) commented Apr 5, 2015

My preferred answers/thoughts:

  1. pre_dump and post_load shouldn't need to be aware of many. I'd always want the library to map them over the collection if I am dumping/loading many. However, in certain instances, post_dump and pre_load may need to see the whole collection, but possibly in the most common use cases they won't. We could in principle introspect the signatures of those methods to see if they have a many parameter, and use that to decide what to do (pass in the list and many=True if they have that parameter, or map the function ourselves if the parameter is absent), but this might be too magical.
  2. I don't see a huge issue in keeping accessor around. There's seldom a good reason to override the default accessor, sure, but I think if you want to be able to serialize both objects and dicts without making the user jump through hoops in pre_dump, then there isn't a single consistent format that pre_dump is going to spit out.
@sloria (Member) commented Apr 5, 2015

@taion

  1. Indeed, it would be nice for the pre/post_* hooks to always receive a single datum, regardless of the value of many. As you point out, though, it may be useful for the user to know the value of many--they may want to perform different post-processing on a collection vs. a single value. So perhaps we pass the value of many (see the sketch after this list). I don't think we need to do any introspection; if the user doesn't need the many parameter, they can just do def pre_load(self, data, **kwargs)
  2. I still see value in keeping accessor around for the time being. Let's defer deprecation for now.
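
The sketch mentioned above, showing both signature styles (hypothetical; assumes the hooks receive many as a keyword argument):

from marshmallow import Schema, fields

class UserSchema(Schema):
    name = fields.Str()

    def pre_load(self, data, **kwargs):
        # doesn't care about many; absorbs it via **kwargs
        return data

    def post_load(self, data, many=False, **kwargs):
        # can branch on many when it matters
        return data
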
@taion (Contributor) commented Apr 5, 2015

The problem with passing only a single datum into those hooks is that it makes it impossible to implement something like this: http://marshmallow.readthedocs.org/en/latest/extending.html#example-adding-a-namespace-to-serialized-output, where the final serialized output looks like

{
    'users': [
        {'name': "Keith"},
        {'name': "Mick"}
    ]
}
@sloria (Member) commented Apr 6, 2015

@taion Yes, that is true. To account for cases like that, it might make the most sense to just pass the raw data to pre_load.

@taion (Contributor) commented Apr 6, 2015

Yeah, but then you end up in the same boat of making the more common use case more verbose. What do you think of these two proposals:

  1. Add separate post_dump_raw and pre_load_raw methods to handle this use case.
Schema.pre_dump(item)
Schema.post_dump(item)
Schema.post_dump_raw(item_or_collection, many)
Schema.pre_load_raw(item_or_collection, many)
Schema.pre_load(item)
Schema.post_load(item)
  2. Control everything with decorators, like what we were talking about with @validate_schema on #116.
class UserSchema(Schema):
    @marshmallow.post_dump
    def add_type(item):
        item['type'] = 'user'
        return item

    @marshmallow.post_dump(raw=True)
    def add_envelope(item_or_collection, many):
        key = 'users' if many else 'user'
        return {key: item_or_collection}

I think I like (2) slightly better. It's a bit more complicated, but it would be more consistent with #116, and might be a bit more future-proof in letting us add additional parameters to the decorator as needed.

@sloria (Member) commented Apr 7, 2015

@taion I do find the decorator syntax user-friendly, and it would certainly be more consistent with the ideas in #116. Any reason not to make the hooks bound methods, i.e. pass self as the first argument?

@taion (Contributor) commented Apr 7, 2015

Mistake on my part, sorry. Been writing too much JavaScript lately. Will leave my mistake there to remind myself of my failure. It should of course be:

class UserSchema(Schema):
    @marshmallow.post_dump
    def add_type(self, item):
        item['type'] = 'user'
        return item

    @marshmallow.post_dump(raw=True)
    def add_envelope(self, item_or_collection, many):
        key = 'users' if many else 'user'
        return {key: item_or_collection}
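
Hypothetical usage, assuming the proposed decorators exist, dump keeps returning a (data, errors) pair, and User is a placeholder model:

users = [User(name='Keith'), User(name='Mick')]
data, errors = UserSchema(many=True).dump(users)
# data might look like {'users': [{'type': 'user', ...}, ...]},
# though the order in which the two post_dump hooks run is unspecified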

I'll take a stab at this tomorrow. It will make some things I'm doing cleaner.

One more question, though - given these, does it make sense to deprecate extra as well?

@sloria (Member) commented Apr 7, 2015

@taion No problem. Thanks for your help with this.

Yes, I think extra is made unnecessary with this change.

taion added commits to taion/marshmallow that referenced this issue Apr 9, 2015

@sloria (Member) commented Apr 14, 2015

@davidism Would the proposed API in #191 meet your requirements to close this issue?

EDIT: Fix PR number.

@sloria (Member) commented Sep 14, 2015

There are a few improvements to the pre/post_* and validates* decorators I'd like to make before the 2.0.0 final release.

Problems:

  • As pointed out in #216, there is no way to get access to the target object (the data to be (de)serialized). This is possible, however, with validates_schema via the pass_original param.
  • Also related to #216: The "raw" param is misleading. Users might think that it means that the raw target object is passed, but it does not.

Proposal:

  • Rename raw -> pass_many (see the sketch after this list).
  • Add pass_original to all decorators. Maybe rename to pass_target or pass_input--not sure about this.
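
For illustration, the renamed parameter might be used like this (a sketch, assuming the rename lands as proposed):

from marshmallow import Schema, fields, post_dump

class UserSchema(Schema):
    name = fields.Str()

    @post_dump(pass_many=True)
    def add_envelope(self, data, many):
        # receives the whole collection at once when many is True
        key = 'users' if many else 'user'
        return {key: data}
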
@taion (Contributor) commented Sep 14, 2015

👍

One other difference with raw is that the data is potentially before/after enveloping... but in practice that actually doesn't work well at all, because you can't assert on the order in which those processors are called.

@sloria (Member) commented Sep 15, 2015

Opened an issue for this here: #276
