Nested schema auto resolution as default and respect modifiers in references by Bangertm · Pull Request #333 · marshmallow-code/apispec

Bangertm · 2018-10-31T18:09:53Z

Relates to #307

This is a proof of concept on combining the logic in inspect_schema_for_auto_referencing with the nesting logic in field2property.

Default schema_name_resolver now passed to MarshmallowPlugin
Logic for adding nested schmeas moved from
MarshmallowPlugin.inspect_schema_for_auto_referencing to
OpenAPIConverter.field2property
Removed using LazyDict in favor of regular Dict

All the tests are passing, but I presume that there are some minor differences in the specs that are not captured by the test. Of potential concern would be cases like this. The second two definition adds would be unnecessary and if the schema name resolved to something other than what the user is putting in as the name there will be extra components in the model. Not sure that is too big of a deal or not.

At this point the cyclomatic complexity of field2property has increased, but there should be some simplifying that can be done. For example I'm not sure that the unbound self referencing cases are valid and the concept of embedding the reference string in field metadata feels like a relic of not having a ref dictionary.

lafrech

This has the same (mostly theoretical) limitation as the current implementation: the container fields (Nested, List(Nested), Dict(Nested)) are hardcoded, it won't work with a custom container field.

But your proposal here reminds me that the whole field2property logic hardcodes the container fields. This is nothing new. In this comment I propose a hook to manage those cases. Perhaps could it be extended to manage the auto-ref logic.

In any case, after a quick look, this implementation looks better than the existing IMHO. This looks like the right place for the job. And the hook I mention in above paragraph could cover the custom container field corner case for both documentation and auto-ref.

lafrech · 2018-11-01T18:32:03Z

I think your recent commits make my comment above irrelevant (unless it was irrelevant from the beginning), as the resolving happens in a single place (https://github.com/marshmallow-code/apispec/pull/333/files#diff-daf6a711ac369a8d051b9606e0a9eca6R367) and there is no need to add specific cases for each container field. This is great.

lafrech · 2018-11-01T18:34:03Z

+            property = self.field2property(
+                field_obj, use_refs=use_refs, dump=dump, load=load, name=name,
+            )
+            jsonschema['properties'][observed_field_name] = property


AFAIU, this change and the change above (OrderedLazyDict() -> OrderedDict()) are related to each other but unrelated to the issue, right?

Last time I checked, I didn't understand this part of the code, so I appreciate that you could make it simpler but I can't comment on the impact.

No these changes are necessary if we want to add a schema to the spec from within field2property. The way it worked before is that for each field a prop_func was created with a default parameter of the field_obj. Then that prop_func is loaded into a LazyDict. A the end of fields2jsonschema the LazyDict is returned but none of the prop_funcs are called. They only get called when accessing the item in the dict.

This strategy was added here. With the plan being that if execution of field2property was delayed until all of the schemas are in the spec (and in the ref dictionary) then when field2property eventually gets called it will find the the nested schemas in the ref, insert a ref into the spec, and stop infinite recursion.

Now if we want to move the functionaity to add a nested schema to the spec in field2property there is a problem with this approach. The first time iterating over the spec's definitions all of the all of the prop_funcs in the spec get called for the first time and field2property tires to add schemas to the dictionary during iteration which throws an error. For some reason things like print spec.to_dict() will work but json.dumps(spec.to_dict()) will not.

I think the original authors of the autoreferencing functionality were running into this, which is why they put they logic to iterate over the fields and add nested schemas to the spec where they did. In definition_helper inspect_schema_for_autoreferencing was adding all of the nested schemas into the spec before any of the field2property functions are called.

The plan in this pr is to use regular dictionaries and call field2property immediately from fields2jsonschema. All nested schemas will get added to the spec along the way via a call to spec.definition from field2property - instead of manually in the original plan or via inspect_schema_for_autoreferencing.

I hope this clarifies things, but I fear it might not...

Bangertm

I think your recent commits make my comment above irrelevant (unless it was irrelevant from the beginning), as the resolving happens in a single place (https://github.com/marshmallow-code/apispec/pull/333/files#diff-daf6a711ac369a8d051b9606e0a9eca6R367) and there is no need to add specific cases for each container field. This is great.

I'm glad you pointed it out for the Dict case because I didn't realize that I had redundant logic in the List case. Can you take a look a the two test cases added to make sure they cover the appropriate case.

Bangertm · 2018-11-01T20:53:51Z

+            property = self.field2property(
+                field_obj, use_refs=use_refs, dump=dump, load=load, name=name,
+            )
+            jsonschema['properties'][observed_field_name] = property


No these changes are necessary if we want to add a schema to the spec from within field2property. The way it worked before is that for each field a prop_func was created with a default parameter of the field_obj. Then that prop_func is loaded into a LazyDict. A the end of fields2jsonschema the LazyDict is returned but none of the prop_funcs are called. They only get called when accessing the item in the dict.

This strategy was added here. With the plan being that if execution of field2property was delayed until all of the schemas are in the spec (and in the ref dictionary) then when field2property eventually gets called it will find the the nested schemas in the ref, insert a ref into the spec, and stop infinite recursion.

Now if we want to move the functionaity to add a nested schema to the spec in field2property there is a problem with this approach. The first time iterating over the spec's definitions all of the all of the prop_funcs in the spec get called for the first time and field2property tires to add schemas to the dictionary during iteration which throws an error. For some reason things like print spec.to_dict() will work but json.dumps(spec.to_dict()) will not.

I think the original authors of the autoreferencing functionality were running into this, which is why they put they logic to iterate over the fields and add nested schemas to the spec where they did. In definition_helper inspect_schema_for_autoreferencing was adding all of the nested schemas into the spec before any of the field2property functions are called.

The plan in this pr is to use regular dictionaries and call field2property immediately from fields2jsonschema. All nested schemas will get added to the spec along the way via a call to spec.definition from field2property - instead of manually in the original plan or via inspect_schema_for_autoreferencing.

I hope this clarifies things, but I fear it might not...

lafrech · 2018-11-07T21:12:33Z

Current implementation of spec.components.schemas does not detect if a name is used twice. The second schema overrides the first. We should probably raise an Exception when trying to register a second schema with an already registered name. While this is not critical when registration is manual, it can be useful when registration is automatic as the resolving can lead to tricky cases.

I think we both have the same nominal use case in mind. All schemas are called EntitySchema and are registered as Entity in the spec. But there can be collisions if user doesn't follow this naming scheme. Like PetSchema and Pet. And letting the user specify another resolver opens the door to all sorts of tricky issues.

I don't think this should be addressed here, but it pushes for a simple check-raise in spec.components.schema.

I opened #340 about this.

lafrech · 2018-11-07T21:44:02Z

On second thought, the auto-ref feature could be a reason not to raise an Exception, because it might auto-register a Schema that was already registered manually...

We could add a test to check if the already registered schema with the same name is the same schema class, but then it becomes a bit smelly.

lafrech · 2018-11-07T21:54:03Z

Do we want to force the feature or just make it default?

In the current state of the PR, the only way to cancel the feature is to pass a no-op resolver to override the default. This is due to my change request of not passing the resolver default in the signature. I realized that afterwards. We could revert that change. It is not a friendly API.

We can go the other way and decide the feature is mandatory. This does not seem unreasonable. After all, when viewing the spec with ReDoc (and perhaps swagger-ui, I don't remember), it is done in the viewer frontend.

If we do that, we should provide a default name for when the schema name can't be resolved (when the resolver returns None). It can be 'Schema_{}'.format(some_incremented_counter).

Then we may write:

        schema_cls = resolve_schema_cls(schema)
        if schema_cls not in self.refs:
            self.spec.components.schema(
                self.thin_layer_above_schema_name_resolver_providing_default_name(schema_cls),
                schema=schema,
                dump=dump,
                load=load,
            )

Or we don't try to think too much and we leave it as it is.

I think this thought was triggered by the self-nesting or circular nesting case. I still don't totally grasp the change you made w.r.t. the delayed resolution (LazyDict). It seems too nice to be true. I was wondering if it would work if the resolver returned None. Wouldn't we fall back on the old issue (before #44)? Making sure all schemas are registered, even with a dummy name, sounds like a simple way around this. Sorry if I'm still missing something.

Don't miss the "good" in "too good to be true". I like the code simplification brought by this PR.

Bangertm · 2018-11-08T00:37:17Z

I'm still trying to wrap my head around the implications of making this feature the default. There are just so many assumptions that have been made along the way.

Regarding collisions, my biggest concern is that a nested schema comes in once with dump=False and then gets referenced someplace where dump=True. That schema second reference will be missing fields.

Part of the rational for excluding fields when dump or load is False is that the schema's parameters are listed in line, but if we have the general plan of using the schema_name_resolver to add the schema to the doc and insert a reference it might negate the need to have dump and load functionality at all.

For example we could have a check early in resolve_schema_dict to use add schemas not currently in the spec using the resolved name and insert a ref instead of inlining the parameters.

I was wondering if it would work if the resolver returned None. Wouldn't we fall back on the old issue (before #44)?

With the current implementation there is a check on whether schema_name_resolver returned a value and if not then the schema is not added to the spec. For circular references, this will revert the the problem before #44. But we could also remove the check. But that would also remove a convenient way for users to opt out of this functionality.

lafrech · 2018-11-09T08:35:49Z

Regarding collisions, my biggest concern is that a nested schema comes in once with dump=False and then gets referenced someplace where dump=True. That schema second reference will be missing fields.

Exactly.

Part of the rational for excluding fields when dump or load is False is that the schema's parameters are listed in line, but if we have the general plan of using the schema_name_resolver to add the schema to the doc and insert a reference it might negate the need to have dump and load functionality at all.

Yeah, what I thought, too. We should double-check that. I wouldn't mind dropping the dump/load logic since schema used as reference should have the dumpOnly/loadOnly attributes correctly set.

For circular references, this will revert the the problem before #44.

OK, I thought so. We can't afford to break the circular reference use case so it looks like we're in the middle of the bridge: either we keep things as they are, either we make the auto-ref mandatory.

Keeping things as they are means back to #307. Either keep, remove or deprecate current implementation (which can be easily introduced by subclassing MarshmallowPlugin).

If we choose to make the feature mandatory, we need to sort out a few things:

ensure the name defaults to something unique
think of a nice way to manage name collisions (is an exception acceptable?)
remove dump/load logic if it has become useless

I wrote:

On second thought, the auto-ref feature could be a reason not to raise an Exception, because it might auto-register a Schema that was already registered manually...

This shouldn't be an issue since we check if the schema is already registered. The issue is when registering manually a schema that has been auto-registered before. We could skip that second registration, but that means the auto-name will be used instead of the user-specified name. Doing this silently sounds wrong, so an exception might be acceptable. If the user wants to specify the name manually, he should register the schema before using it.

This logic could be a blocker for users wanting to specify names manually in case of circular references.

And if not a blocker, it could be an annoyance for users wanting to register manual names as they'd need to ensure they register the schemas in the right order. In practice, I'm not sure it is such an issue. I think I already do that in my code.

We may postpone this to 2.0 if we can't sort it out easily. We'd still need to decide what to do in 1.0.

Bangertm · 2018-11-13T04:13:55Z

Given that this pr is somewhat experimental I added a couple of commits to move towards always adding the schema to the spec and inserting a reference whenever a Marshmallow schema is encountered in a request or response body.

I'm starting to think that this might generally be the most desirable out of the box behavior. All of the schemas have names in the user's program. Part of the reason for creating a openapi spec is to communicate the structure and the names of the objects to the user. Auto-resolving the names makes this easier. In fact we may want to remove the name parameter from Components.schema() and just resolve that too...

OK, I thought so. We can't afford to break the circular reference use case so it looks like we're in the middle of the bridge: either we keep things as they are, either we make the auto-ref mandatory.

The only way to break the circular reference case is to override the default schema_name__resolver with a function that returned None with one of the users' schemas. Given that the default resolver never returns None I'm not sure why anyone would want to create a function that didn't work. If they really wanted to have their own custom names for circular referencing that were different from the names of the schemas in their program, it's easy enough to setup a function that returns a the custom name based on the schema class.

I added code that warns the user if they attempt to add a schema that is already in the spec. I'm not sure we need an exception, but that could easily be changed. I went with a warning for now because if the users manually adds a schema that has already been added to the spec (automatically) the resulting spec will still be valid. It just may not have the names that they were expecting in the ref and may have an "unused" schema.

If we remove the dump/load logic from schema2jsonschema there should be no issues with fields being missing. I'll look at tackling that next.

lafrech · 2018-11-13T08:27:21Z

In fact we may want to remove the name parameter from Components.schema() and just resolve that too...

This method can be used without MarshmallowPlugin.

The only way to break the circular reference case is to override the default schema_name__resolver with a function that returned None with one of the users' schemas. Given that the default resolver never returns None I'm not sure why anyone would want to create a function that didn't work.

So should we specify that custom resolver MUST return a string and not handle the None case? Currently, the None case is silently handled (https://github.com/marshmallow-code/apispec/pull/333/files#diff-daf6a711ac369a8d051b9606e0a9eca6R409) but it will break in case of circular reference resulting in hard to debug errors.

Bangertm · 2018-11-21T21:09:49Z

After thinking about this for a few days I think this is in a pretty good state.

Out of the box:

Whenever a nested schema is encountered it is added to the spec and referenced
Whenever a schema is found in request body or in a response body it is added to the spec and referenced (this eliminates the need for dump and load
Circular referencing schemas are automatically resolved - users who are upgrading will get a warning that they are entering a schema twice (once automatically and then again manually) and will likely want to make some changes to their code to remove the manual entry of the schema and to change the schema's name if they don't like the auto one

With a custom schema_name_resolver function:

Users can provide whatever names they want - easiest would be with a dictionary mapping schema class to a name string
If users really want to have nested schema properties listed inline instead of referenced they can have their resolver return None for a given schema
If the custom name resolver returns None for a schema in a circular referencing chain they get a helpful exception telling them not to do that

I think this seems like a decent mix of desired out of the box behavior with enough customization for users who want that.

The nice thing about this from my perspective is that out of the box every schema in the spec will have a name that matches what I call the schema in my code. I don't have to make sure to add nested schemas to the spec before adding the schemas that nest them and I don't have to provide a name resolver. It also makes the resolving schemas from a doc string much more convenient because those schemas get automatically named as well. I can probably eliminate adding schemas manually altogether can just add the paths.

lafrech · 2018-11-21T21:15:52Z

Great!

I still need to review the code changes more thoroughly but I agree on the principle.

Users can provide whatever names they want - easiest would be with a dictionary mapping schema class to a name string

Yeah, or an access to a class attribute.

class MySchema(Schema):
    schema_name = 'MySuperSchema'
    ...

def resolver(schema):
    return schema.schema_name

lafrech · 2018-12-04T22:26:26Z

I agree on the principle.

I had a quick look at the code. I'll review more thoroughly later.

Edit: This is wrong. This case is covered. I think there is still the (unlikely) possibility of a collision if Pet schema is resolved twice, generating Pet and Pet1 while Pet1 already exists. This might be complicated to address cleanly. Or we could add a UUID (Pet-81c3209c-8aff-45c8-b2cf-67f746b0e07) instead of a counter value. The names would suck, but the collisions would be unlikely. After all, it is just a fallback.

lafrech · 2018-12-22T21:04:30Z

I think this is the way to go.

Do you think you could rebase and update docs?

Bangertm · 2018-12-31T23:01:01Z

Got the branch up to date and added some information in the docs on working with nested fields and modifiers.

sloria

Just did a quick once-over, mostly on the docs. I just had a few nits. Can you also update CHANGELOG.rst? Feel free to create a new entry for "1.0.0 (unreleased)". I didn't look at this in-depth, so I'm not the best person to describe these changes concisely.

No need to block on me for merging. @Bangertm @lafrech Feel free to merge when you're confident with this.

sloria

A few more drive-by nits.

@Bangertm Can you please also update the CHANGELOG?

Bangertm · 2019-01-20T17:08:11Z

@sloria changelog is updated. Can you take a look to make sure it's clear.

sloria

Just a bit more copy editing.

@lafrech Since you're more familiar with the substance of this, would you mind doing the final pass?

lafrech · 2019-01-23T16:10:13Z

@Bangertm, sorry for the delay about this. I just reviewed it again and I think t is good to go once @sloria's comments are addressed.

When this is merged, we can release a RC version (maybe with other fixes from https://github.com/marshmallow-code/apispec/milestone/2 it you have something ready) and we're close to a final v1!

Default schema_name_resolver now passed to MarshmallowPlugin Logic for adding nested schmeas moved from MarshmallowPlugin.inspect_schema_for_auto_referencing to OpenAPIConverter.field2property Removed using LazyDict in favor of regular Dict Warn if schema is added twice Prefer to reference schemas in request and response bodies Catches the RuntimeError and raises an APISpecError indicating that a the schema_name_resolver returned None for a circular reference Also adds a note to the documentation to avoid using a function that returns None for a circular reference

because it doesn't do anything anymore

Key is now a namedtuple of the schema class and its modifiers resolves marshmallow-code#169 Ignoring the Many modifier because putting a schema in an array is is addressed in seperate logic and we would typically want the same ref to a singular schema and an array of objects

…nested fields

lafrech · 2019-01-24T08:14:09Z

Strangely enough, I wasn't notified about the push. Perhaps because it was a force-push.

Thanks @Bangertm for going through all this.

Let's merge this!

Anything else you'd like to add before we release a new RC version?

buxx · 2019-01-31T13:03:01Z

Thank's @lafrech and @Bangertm for your work !

lafrech · 2019-02-07T08:04:42Z

I've been working on a similar caching mechanism and it was a bit more tricky because the key could include dicts (Scille/umongo#165).

I'm tempted to add a comment to avoid potential issues in the future:

    # This works because modifiers are either strings or lists/tuples.
    # If MODIFIERS ever include dicts, this code will need to be adapted.
    for modifier in MODIFIERS:
        attribute = getattr(schema, modifier)
        try:
            hash(attribute)
        except TypeError:
            attribute = tuple(attribute)
        modifiers.append(attribute)

lafrech reviewed Nov 1, 2018

View reviewed changes

Comment thread apispec/ext/marshmallow/__init__.py

Comment thread apispec/ext/marshmallow/__init__.py Outdated

lafrech reviewed Nov 1, 2018

View reviewed changes

Bangertm commented Nov 1, 2018

View reviewed changes

lafrech added this to the 1.0 milestone Nov 3, 2018

Bangertm force-pushed the refactor-nested branch 2 times, most recently from aa4f857 to 3e69588 Compare November 6, 2018 15:12

lafrech reviewed Nov 7, 2018

View reviewed changes

Comment thread apispec/ext/marshmallow/openapi.py

lafrech reviewed Nov 7, 2018

View reviewed changes

Comment thread apispec/ext/marshmallow/__init__.py Outdated

Bangertm force-pushed the refactor-nested branch from 128ee27 to 2e8acc2 Compare November 13, 2018 03:16

Bangertm force-pushed the refactor-nested branch from 1beb35e to 7653300 Compare November 13, 2018 05:39

Bangertm force-pushed the refactor-nested branch 2 times, most recently from a0d3026 to 22f7e9a Compare November 20, 2018 21:17

lafrech added the backwards incompat label Nov 21, 2018