Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#134] Version schema check for ["object","null"] #738

Merged
merged 24 commits into from Nov 23, 2018

Conversation

kindly
Copy link
Contributor

@kindly kindly commented Jul 30, 2018

No description provided.

@@ -67,7 +67,7 @@ def add_versions(schema, location=''):
version_properties["value"] = new_value
schema['properties'][key] = version

elif prop_type == "object":
elif (prop_type == "object" or prop_type == ["object", "null"]) and "properties" in value:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do a more robust test, where prop_type is coerced to a list, and we check for the inclusion of object in that list? We can also have NotImplementedError for cases/combinations for which we don't have defined behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking for object in the list is not good enough due to former_value which has type ["string", "number", "integer", "array", "object", "null"].
There is also Organization:details that is an ["object", "null"] but does not have any properties.
Both these cases probably want to be treated as a whole and is why I chose to be so specific about this case.

I am not sure this tool is the correct place for more in depth schema analysis and make sure we are consistent with our type usage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking for object in the list is not good enough due to…

Indeed - that was my meaning when writing "We can also have NotImplementedError for cases/combinations for which we don't have defined behavior." If the defined behavior is to run the default case (for former_value, etc.) then the code should obviously continue to do that.

My use case is: "As standard editor, I should not have to keep track of the assumptions various tools are making about the structure of the JSON Schema that I am editing."

As with ocdsmerge, it would be good to better document this code, as it has several specific conditions that would benefit from a comment. In the absence of comments, I'm curious as to whether there should be a special case for an array of strings (which in ocdsmerge is treated as wholeListMerge even if wholeListMerge is not set open-contracting/ocds-merge#14).

I am not sure this tool is the correct place for more in depth schema analysis and make sure we are consistent with our type usage.

I don't understand this sentence. Can you expand?

Copy link
Contributor Author

@kindly kindly Jul 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was that we should probably have a tool/test of some kind, separate to these tools, that checks the schema for new or different uses of jsonschema that we currently have not used, or documented, that falls into undefined behaviour.
I do not think that tool/test should exist in this script or ocdsmerge and they are not necessary the best place to comment on the edge cases.

Any new behaviour that falls into default behaviour (without explicit commands) should be documented in http://standard.open-contracting.org/latest/en/schema/merging/

For example if a schema editor writes a oneOf, anyOf ... then the merge tool, or this tool, will not know what to do with them either. There are huge amounts of other edge-cases that we probably have not covered.

This will be difficult to write the NotImplementedErrors without guessing all the possible usages a schema editor can write before they write it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the number of edge cases is that large (the JSON Schema spec is small). If we broke long if-statements into single condition statements, and then added else: raise NotImplementedError to each, I think we'd be most of the way there.

I could see some sense in having one thing that identifies new cases by e.g. raising NotImplementedError for unexpected cases, instead of making that every tool's responsibility. However, if the number of cases is small, then I might prefer doing it everywhere anyway, as we have no assurance that we'll always remember to run this extra check. Furthermore, even if we have one thing that identifies new cases / lists all cases to implement, that doesn't really help us check that those cases are in fact implemented in all tools. If all tools raised errors for cases in the schema that they can't handle, we'd be guaranteed to notice that they aren't handling those cases.

If we are concerned about duplication of effort, we have some options, including:

  1. Write a schema parser with a simple API that emits parts of schema to tools. If the tool doesn't recognize a part, it errors. This adds an intermediary betweens tools and schema, which has pros (centralizing some logic and case-handling) and cons.
  2. Have all tools call a JSON Schema auditor library during initialization, which – I don't know – returns the 'features' the schema uses, which the tool then compares to its known list of implemented features, and fails if a feature isn't implemented.

There may be other options. However, I wouldn't be confident in a new tool that is entirely divorced from other tools.

Copy link
Contributor Author

@kindly kindly Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the number of edge cases is that large (the JSON Schema spec is small). If we broke long if-statements into single condition statements, and then added else: raise NotImplementedError to each, I think we'd be most of the way there.

As I said I do think the the amount of edge cases are large. There 6 different types in jsonschema we have 3 different merge rules and theoretically a schema could have any subset of these. That is 2^9 = 512 different ways to express what can go into into a field that could effect merging behaviour. Of course each if/else could cover a large subset of these, but I would not be confident of coving all combinations. Also there are other cases like not putting in properties or items, what of the types are that are allowed in a list and constructs like anyOf, oneOf.

When writing the original make_validation_schema.py and ocds-merge I wrote a script to analyse all the different combination of types and rules that were actually used and made sure there was logic to cover those in use. As that seemed like the practical option.

This issue, and the ocds-merge one (open-contracting/ocds-merge#14) have come up due to new combinations of these appearing that had not appeared before those were written.

My point about the test, was to check that schema writers are not adding any new combinations of these things, without considering the consequences to merging behaviour.

The only way to do that would to be to fix the combinations to the ones we currently use and if there are any new ones then the tests against the standard or an extension should fail. We could add combinations, but we would have to do that at the same time as modifying the tools.

Only then, with a list of allowed combinations with what there expected behaviour is, can the tools be confident that they are not actually doing some undefined behaviour by accident and can raise a NotImplementedError. In that case a long list of very specific if/else statements would be workable in all tools.

Copy link
Member

@jpmckinney jpmckinney Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having a test makes sense. Can you share the script you wrote to analyze the combination of types and rules, which can perhaps be reused when authoring that test? I imagine we can add the test to https://github.com/open-contracting/standard-maintenance-scripts/blob/master/tests/test_json.py

I started a page documenting what would need to change if we were to change the layout of codelist CSV files, extension.json files, etc. I also have a short list of repos that would need to be changed if extensions were changed. We can consider making a list of things that would need to change if a new pattern is used in the JSON Schema. (I'm also happy to reorganize the above content – I just wrote those to help myself.)

@jpmckinney
Copy link
Member

From my perspective, this PR can be merged, though we should add a note to the changelog first. I propose (please feel free to suggest corrections or edits!):

## [1.X]

### Advisories

1.1.3 changed the merging and versioning behavior of the `unit` field of the `Item` object.

* If you are using compiled releases, `unit` information can now be removed entirely in later releases by setting the field to `null`; previously, only its subfields could be set to `null` and removed.
* If you are using versioned releases, `unit` information is now versioned as a whole; previously, its subfields were versioned individually.

@jpmckinney
Copy link
Member

I've added the note to the changelog.

@kindly
Copy link
Contributor Author

kindly commented Aug 7, 2018

@jpmckinney sorry for taken so long to get round to this.

I do not think this pull request should be merged apart from the changelog.

There is some confusion, the code changes in this pull request change the behaviour back to what was in 1.1.2 and contrary to what is outlined in the changelog.

What I think needs to happen in this branch is:

  • Change the code so that logic is the same but is more explicit about the behaviour. i.e that ["object", null] is treated differently than ["object"]. I think the logic that says that any property that has type null should be versioned as a whole, as otherwise merging/versioning behaviour is undefined.

  • Make a comment in the docs that say that any property that has type null should be versioned as a whole, which is I think the desired behaviour now.

  • Make sure prop_type is coerced as a list as outlined above by @jpmckinney

  • Add the changelog (done)

Later we can get to making sure that all current uses of types and merge rules are consistent and that there are checks in place to make sure nobody is using any unexpected combinations of these rules.

The script I used when getting the current type/merge-rules used was a slightly modified version of this function (removing the conditional parts):
https://github.com/open-contracting/ocds-merge/blob/master/ocdsmerge/merge.py#L32
However, I think I could write a more thorough one now.

@jpmckinney
Copy link
Member

jpmckinney commented Aug 7, 2018

@kindly Aha, right! Please go ahead with your proposed changes. You can create a new branch if that would be simpler or cleaner.

Also, please create the follow-up issues we discussed, e.g. adding a check about unexpected combinations.

@jpmckinney
Copy link
Member

@kindly @robredpath Just pinging on this issue, in case it's not being tracked in the backlog.

@kindly
Copy link
Contributor Author

kindly commented Nov 20, 2018

@jpmckinney
I am finally getting round to looking at this.

What is the status of the changes you have made?
Would you like me to review them?

On the whole (after looking at them for a while) the gist of changes look good but I have not got round to checking the logic thoroughly or checking the output compared to the previous output.

Would it be beneficial for me to do that?

@jpmckinney
Copy link
Member

jpmckinney commented Nov 20, 2018

I had to drop off and work on other things for a bit. Looking back at my work, I had a few things I wanted to follow-up on.

At the OCDS level, I'm not sure if making unit nullable was the right decision, given the impact on versioning (which wasn't well understood when we worked on #630). Reverting to null not being an option if the type includes object would mean removing null from:

  • Core
    • /definitions/Organization/properties/details
    • /definitions/Item/properties/unit
  • Extensions
    • ocds_lots_extension: /definitions/LotDetails
    • ocds_location_extension: /definitions/Location/properties/geometry, /definitions/Location/properties/gazetteer
    • ocds_finance_extension: /definitions/Finance/properties/interestRate
    • ocds-shareholders-extension: /definitions/Organization/properties/beneficialOwnership
    • ocds_performance_failures: /definitions/PerformanceFailure/properties/period
    • ocds_metrics_extension: /definitions/Observation/properties/unit, /definitions/Observation/properties/dimensions
    • ocds_tariffs_extension: /definitions/Observation/properties/unit, /definitions/Observation/properties/dimensions

What do you think? That would avoid 1.1.3 introducing a change in how Unit is versioned. If it sounds reasonable to make this a bug fix, I've staged those changes locally (to see what'd happen).

At a technical level:

  1. The old logic didn't account for the presence of minLength when replacing sections with e.g. StringNullVersioned. In the new code, I continue to drop minLength to maintain the same behavior. However, should this keyword instead be preserved? I assume so.
  2. There's an exceptional case made for /Amendment/changes in the new code, whose versioning behavior seems to have changed in Make make_validation_schema simpler and useful for extensions #412. I don't know what the correct behavior ought to be, but I suppose it doesn't matter as it's deprecated.
  3. title, description, merging properties, etc. are removed from most of the schema, but not all of it. We should probably make those removals in a single, final pass, to avoid this inconsistency.
  4. We can make more patterns like StringNullVersioned, e.g. StringIntegerMinLength1.

Once we resolve this, we can close #737, and we should fix:

open-contracting/ocds-merge#13
open-contracting/sample-data#82

@kindly
Copy link
Contributor Author

kindly commented Nov 21, 2018

@jpmckinney
I originally argued for reverting to null not being an option for type object but mostly I am undecided on this. I think its the correct thing to do for backwards compatibility in terms of organization.details an unit in the core standard even though we it broke 1.1.3, so I agree a bugfix makes sense. Nonetheless, as @timgdavies argued, in terms of usability they are probably better to versioned as a whole.

As to the the technical points:

  1. Yes that was an oversight, I think the minLength changes came in around the same time or after this script this was worked on.

  2. I was aware of amendmant.changes but decided to put my head in the sand about it as it was being deprecated.

  3. Yes I noticed some title and description not being dropped and forgot to do anything about it.

  4. Sounds like a good idea.

@jpmckinney
Copy link
Member

Great. I suggest I make the OCDS-level changes in this PR (updating the changelog), which you can then review, and we can move the technical-level changes into new issue(s). I'll compare the new schema to the 1.1.2 schema and note any changes. When I last did this, the new script had not changed much.

@jpmckinney
Copy link
Member

jpmckinney commented Nov 21, 2018

So, interesting… Organization/details allowed null since it was first introduced. I think that means I leave it as-is. details would be versioned as a whole, regardless.

@jpmckinney
Copy link
Member

jpmckinney commented Nov 21, 2018

@kindly Ready for review. Differences (ignoring whitespace) from 1.1.2 in the generated schema are:

So, as expected.

Copy link
Contributor Author

@kindly kindly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great improvement over the previous script, much cleaner and more readable.

I also agree with the backward compatible stance of unit and keeping organization.details

I can not approve as I originally started this pull request.


warnings.formatwarning = custom_warning_formatter

versioned_template = OrderedDict([
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this more readable, how about putting multi line string here (with triple quotes) of the template as JSON and then loading it ordered i.e something like:

versioned_template_json = '''        
{
"type": "array",
"items": {
  "type": "object",
  "properties": {
    "releaseDate": {
      "format": "date-time",
      "type": "string"
    },
    "releaseID": {
      "type": "string"
    },
    "value": {},
    "releaseTag": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  }
}
'''
versioned_template = json.loads(versioned_template_json, object_pairs_hook=OrderedDict)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I've made that change, and will merge once tests complete.

@jpmckinney jpmckinney merged commit 3bdae83 into 1.1-dev Nov 23, 2018
@jpmckinney jpmckinney deleted the 134-version-schema branch November 23, 2018 20:27
@jpmckinney jpmckinney added this to the 1.1.4 milestone May 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants