Processing API for dumping does not match API for loading #153
You raise fair points @davidism. The rationale for the current API is as follows:
Here is the intended flow:
That said, I am not opposed to changing the API if it is unclear. Perhaps you could describe your use case in more detail and we can discuss possible solutions.
After some thought and discussion on #179, I'm leaning towards implementing this. Doing so raises some follow-up questions:
One comment: we should make sure that the
Good idea, @taion.
Another consideration: if all the hooks are implemented as methods, implementing #116 would probably be unnecessary.
Everything should be overridable methods on the Schema class, since those classes are how everything is collected anyway. If I want to re-use something between schemas, I can refactor the logic to an external function that is called by each Schema's method, or I can use subclassing.
This bothers me because now I can't just use a Schema to process my data. If I want to reuse a Schema somewhere else, I also have to repeat the other processing. Schemas should be self-contained and convenient to use. If I'm concerned with mutating external data, I can create a copy as part of the
This proposal means that there will no longer be a step between deserializing into a dictionary and validation. Is this intentional? It's not a feature that I can personally see myself using, FWIW.
It makes more sense to validate the deserialized data, then post_process it only if it's valid. What is the extra step in between right now? |
Here's what the method definitions might look like. I'm probably missing details, as I haven't delved into all the combinations the api currently presents.
The logic for dump or load looks like this: This means there's no way to return data and errors at the same time. Either the data was valid and errors is None, or the data was invalid, data is None, and there are errors. This avoids getting data in some indeterminate state where some of it is valid, some of it has a default/None applied because of errors, etc.
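The data-XOR-errors contract described above can be sketched in plain Python. This is an illustrative sketch only; `load` and the validator mapping here are hypothetical stand-ins, not marshmallow's real API:

```python
def load(raw, validators):
    """Return (data, errors): exactly one of the two is None."""
    errors = {}
    for field, check in validators.items():
        if not check(raw.get(field)):
            errors.setdefault(field, []).append('Invalid value.')
    if errors:
        return None, errors   # invalid: no partial data leaks out
    return dict(raw), None    # valid: no errors

# Valid input -> data, no errors.
data, errors = load({'age': 30}, {'age': lambda v: isinstance(v, int)})
assert data == {'age': 30} and errors is None

# Invalid input -> errors, no data.
data, errors = load({'age': 'x'}, {'age': lambda v: isinstance(v, int)})
assert data is None and errors == {'age': ['Invalid value.']}
```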
That's a sound plan, @davidism. A few comments:
I think the
I've never really liked this API in Django forms and WTForms. It's a redundant API and a bit too "magical". I think we can defer implementing this.
Can you clarify this? Are you proposing to change the return value of
Any deserialization (that I do, at least) that would hit the database would also require valid values. For example, deserializing a list of ids to a list of users would still require validating that the ids are all valid. I can only imagine contrived examples where you would want to deserialize data even if it's not valid. Splitting validate and deserialize just overcomplicates the process, making some of the validation only happen on deserialization.
As above, since I can't imagine a good example of wanting some validation to only happen on deserialization, I'm proposing that either there are errors or there is valid deserialized data. The return type can stay the same, but it would always end up being
That's too much boilerplate. If I want 5 possibly simultaneous messages, I now need to write and decorate 5 validation functions. If they're somewhat dependent, or all require the same expensive calculation, the work is repeated 5 times. WTForms solves this the way I suggested: by just having you set errors directly wherever appropriate during form-level validation. Field validators can both raise errors and set them directly, so they provide both convenience and power without adding overhead.
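The WTForms-style pattern described above can be sketched in a few lines. This is an illustrative stand-in, not WTForms' or marshmallow's real API: form-level validation runs the expensive calculation once and appends several errors directly.

```python
class Form:
    """Illustrative form: validate() sets errors directly, sharing work."""
    def __init__(self, data):
        self.data = data
        self.errors = {}

    def validate(self):
        # One expensive calculation shared by several related checks.
        total = sum(self.data.get('amounts', []))
        if total > 100:
            self.errors.setdefault('amounts', []).append('Total exceeds 100.')
        if total == 0:
            self.errors.setdefault('amounts', []).append('Total must be positive.')
        return not self.errors

assert Form({'amounts': [10, 20]}).validate() is True
bad = Form({'amounts': [60, 50]})
assert bad.validate() is False
assert bad.errors == {'amounts': ['Total exceeds 100.']}
```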
Because of the original problem I reported,
If you're going to make everything else methods, might as well go all the way. Currently, you can put field-specific validation as a lambda in the validators list, or make an external function. Lambdas are convenient if the validation is not complex and can fit on one line. Functions are convenient only if they're likely to be reused, which is not the case most times I want a
If you do end up implementing
For

```python
if errors:
    raise ValidationError(errors)
```

The database-driven deserialization case is weird and I don't think e.g. DRF has an ideal solution to this. The main issue is that you might want to apply permissioning in validation during deserialization (e.g. restrict related objects to a query set corresponding to only objects for which the user has view permissions). Ideally this would be a property of the view rather than of the schema. See e.g.: encode/django-rest-framework#1985

I don't think there's an ideal answer here. I don't think the schema itself ought to know enough to find validation errors from e.g. the user not being permissioned to view a certain object.
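The raise-instead-of-return behavior discussed above might look like this end to end. This is a hedged sketch with hypothetical names (`load_strict`, the validator mapping), not the library's real signatures:

```python
class ValidationError(Exception):
    """Carries the accumulated error dict, as in the snippet above."""
    def __init__(self, errors):
        super().__init__(errors)
        self.errors = errors

def load_strict(raw, validators):
    errors = {f: 'Invalid value.' for f, check in validators.items()
              if not check(raw.get(f))}
    if errors:
        raise ValidationError(errors)
    return dict(raw)

# Valid data is returned; invalid data raises with the error dict attached.
assert load_strict({'age': 5}, {'age': lambda v: isinstance(v, int)}) == {'age': 5}
try:
    load_strict({'age': 'x'}, {'age': lambda v: isinstance(v, int)})
except ValidationError as exc:
    assert exc.errors == {'age': 'Invalid value.'}
```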
If you want to stick with raising exceptions in all cases, handling the following semantics seems to cover validation. (Although it's really ValidationErrors now.) This would allow adding one or more errors
I am not saying that processing returns some "options tuple" or something different from what it already does. Dump and load already return a 2-tuple. Why shouldn't deserialization have access to permissions or whatever other supplementary data is needed? In Flask, that sort of information is just a thread local away; you don't even need to pass in anything:
Well, if validation can only ever return either an error or the valid data, then there's not a lot of point in returning both data and errors in any event. Currently that's not the case, right? If it's not possible to deserialize data that doesn't validate, there's no reason to return both data and errors; you should always either return the data or raise some exception. Also, I think from a design perspective filtering queries for e.g. read perms belongs on the view rather than on the schema anyway. And currently schema contexts are only applied for serialization, not for deserialization. That's neither here nor there, though. I don't really want my schema loading to e.g. load the instance from the database and apply changes to it, but DRF works that way and I'm sure it's fine for them.
Yes there is: you need to know which of the two was returned. The other option you advocate is basically always setting
That's the entire point of this bug report: the dump and load APIs are not symmetrical. I'm really not seeing how you plan to validate data such as user ids without actually talking to the database. And in most cases, you'll want the actual user associated with that id, so why separate the process and require two database queries? If you don't want to query the database at all, don't, and load your real data later.
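The single-query pattern argued for above — validate the ids and fetch the users in one step — could look like this. `fetch_users` is a hypothetical data-access callable standing in for a real database query:

```python
def load_users(ids, fetch_users):
    """Deserialize ids into user records, validating in the same step:
    one round trip, and an unknown id is a validation failure."""
    users = fetch_users(ids)                      # single query for all ids
    missing = set(ids) - {u['id'] for u in users}
    if missing:
        raise ValueError('Unknown user ids: %s' % sorted(missing))
    return users

# Stand-in for a database lookup.
def fake_fetch(ids):
    db = {1: {'id': 1, 'name': 'Keith'}, 2: {'id': 2, 'name': 'Mick'}}
    return [db[i] for i in ids if i in db]

assert load_users([1, 2], fake_fetch) == [
    {'id': 1, 'name': 'Keith'}, {'id': 2, 'name': 'Mick'}]
```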
If you set up a strict dichotomy between having valid data or having errors, then you know exactly which was returned. If there's data, you have a return value. If there's an exception, you catch it. However, even if you think that invalid fields should have no data, you can still have a case where certain fields validate but other fields don't. In those cases, it makes sense for the deserializer to return both data and errors, where the data may be partial. I'm not a huge fan of this as the default, though. And, correct, in my case, I separate out pulling data off the wire and validating syntax (which I do with a
I'm not proposing to abandon the idea of validator methods, but I think a more explicit approach would be better than dynamically-generated methods. Something like:

```python
class MySchema(Schema):
    __validators__ = {
        '_schema': ['validate_schema'],
        'field_a': ['validate_field_a'],
    }

    field_a = fields.Str()

    def validate_schema(self, data):
        # ...

    def validate_field_a(self, val):
        # ...
```

Anyway, this is a separate issue (#116).
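A runnable sketch of how an explicit `__validators__` mapping like the one above could be dispatched. The base `Schema` here is an illustrative stand-in (not marshmallow's), and the error-collection details are assumptions:

```python
class Schema:
    __validators__ = {}

    def validate(self, data):
        """Run each named validator method against its target; collect errors."""
        errors = {}
        for target, method_names in self.__validators__.items():
            # '_schema' validators see the whole payload; others see one field.
            value = data if target == '_schema' else data.get(target)
            for name in method_names:
                try:
                    getattr(self, name)(value)
                except ValueError as exc:
                    errors.setdefault(target, []).append(str(exc))
        return errors

class MySchema(Schema):
    __validators__ = {'field_a': ['validate_field_a']}

    def validate_field_a(self, val):
        if not isinstance(val, str):
            raise ValueError('field_a must be a string')

assert MySchema().validate({'field_a': 'ok'}) == {}
assert MySchema().validate({'field_a': 1}) == {
    'field_a': ['field_a must be a string']}
```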
I'm not sure what is meant by this. You can access the schema context in a Field's

How about we start with the following?
I think these will go a long way in making the API more consistent and meeting the use cases discussed without breaking compatibility.

Oops, my bad on the

I would welcome any help with this. Even a work-in-progress PR with just the tests for this feature would be a big help.
I have a couple of questions.
My preferred answers/thoughts:
The problem with passing only a single datum into those hooks is that it makes it impossible to implement something like this: http://marshmallow.readthedocs.org/en/latest/extending.html#example-adding-a-namespace-to-serialized-output, where the final serialized output looks like

```python
{
    'users': [
        {'name': "Keith"},
        {'name': "Mick"}
    ]
}
```
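A per-item hook can never produce that enveloped shape, because no single item knows about the collection; a "raw" hook that receives the whole payload can. The names below follow the proposal in this thread, not a released API:

```python
def post_dump_raw(item_or_collection, many):
    """Wrap the whole serialized payload in a namespace envelope."""
    key = 'users' if many else 'user'
    return {key: item_or_collection}

out = post_dump_raw([{'name': 'Keith'}, {'name': 'Mick'}], many=True)
assert out == {'users': [{'name': 'Keith'}, {'name': 'Mick'}]}

single = post_dump_raw({'name': 'Keith'}, many=False)
assert single == {'user': {'name': 'Keith'}}
```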
@taion Yes, that is true. To account for cases like that, it might make the most sense to just pass the raw data to
Yeah, but then you end up in the same boat of making the more common use case more verbose. What do you think of these two proposals:

(1)

```python
Schema.pre_dump(item)
Schema.post_dump(item)
Schema.post_dump_raw(item_or_collection, many)

Schema.pre_load_raw(item_or_collection, many)
Schema.pre_load(item)
Schema.post_load(item)
```

(2)

```python
class UserSchema(Schema):
    @marshmallow.post_dump
    def add_type(item):
        item['type'] = 'user'
        return item

    @marshmallow.post_dump(raw=True)
    def add_envelope(item_or_collection, many):
        key = 'users' if many else 'user'
        return {key: item_or_collection}
```

I think I like (2) slightly better. It's a bit more complicated, but it would be more consistent with #116, and might be a bit more future-proof in letting us add additional parameters to the decorator as needed.
Mistake on my part, sorry. Been writing too much JavaScript lately. Will leave my mistake there to remind myself of my failure. It should of course be:

```python
class UserSchema(Schema):
    @marshmallow.post_dump
    def add_type(self, item):
        item['type'] = 'user'
        return item

    @marshmallow.post_dump(raw=True)
    def add_envelope(self, item_or_collection, many):
        key = 'users' if many else 'user'
        return {key: item_or_collection}
```

I'll take a stab at this tomorrow. It will make some things I'm doing cleaner. One more question, though: given these, does it make sense to deprecate
@taion No problem. Thanks for your help with this. Yes, I think
There are a few improvements to the

Problems:
Proposal:
👍 One other difference with
Opened an issue for this here: #276
When dumping, I want to manipulate some of the attributes on the object being dumped before actually dumping. So I decorate a `preprocessor` function. But that only gets used when loading. I could use a `Method` field, but the data returned from that must be the final serialized form, so there's no way to specify that it's actually a `Nested(OtherSchema, many=True)` unless I do that serialization manually at the end of the method.

When loading, I want to manipulate the loaded data in the exact opposite direction as the situation above. So I decorate a `data_handler` function. But that only gets used when dumping. The solution is more straightforward here: I override the `make_object` method to manipulate the final output.

My point is that the names are different everywhere, some of them are decorators while others are methods, and they don't apply in both directions, when it would be very convenient to be able to do so. The loading situation is in better shape than the dumping one, since loading has `preprocessor` and `make_object`, except there's still a weird decorator vs. method difference.

There should be a standard way to preprocess and postprocess the entire data during both loading and dumping. One solution is adding `pre_dump`, `post_dump`, `pre_load`, and `post_load` hooks, either as methods on the schema or as decorators.