Improve nbformat json validator #5658

Merged
merged 25 commits into from Apr 25, 2014

Projects

None yet

4 participants

@jhamrick
Contributor

This pull request fixes some of the issues in #2786:

  • Add a clean function for validating notebooks that returns True|False.
  • Iterate on the spec so that our current notebooks pass the validation test.

The biggest change is moving IPython/nbformat/v3/validator.py to IPython/nbformat/validator.py and adding some code to figure out the correct schema file depending on the notebook format.

I also had to change a few parts of the json schema to make the example notebooks pass validation: changing "source" and "input" to be either strings or arrays, and making the cell prompt optional. These things should probably be addressed in v4.

@jhamrick
Contributor

Obviously, this still needs tests and stuff, but I'll hold off on that until I get feedback about whether the overall changes are ok.

@jhamrick
Contributor

I'd also like to handle this action item:

  • Create a top level command line sub-command for validating notebooks.

But I figured I should probably get input on the format of the command. Something like ipython nbvalidate <notebook>?

@Carreau
Member
Carreau commented Apr 19, 2014

@jhamrick sent you a PR that should fix the test faillure

I don't have anything against a nbvalidate subcommand, though, I'm not sure it would be use a lot. Why not just make nbformat module callable python -m IPython.nbformat --validate ipynb [...] ?

@jhamrick jhamrick Merge pull request #2 from Carreau/nbvalidate
nbvalidate add v3.withref.json to package.
ad5372c
@jhamrick
Contributor

@Carreau I could definitely do that too, just the original check list suggested a subcommand. Maybe @ellisonbg has thoughts?

@Carreau
Member
Carreau commented Apr 20, 2014

I think the subcommand/callable module can be done in separate PR/discussion.
I also think that if/when we do something like that, we should integrate it with integrate it with the trust/sign mechanism.

@Carreau Carreau commented on an outdated diff Apr 20, 2014
IPython/nbformat/validator.py
@@ -0,0 +1,99 @@
+from __future__ import print_function
+#!/usr/bin/env python
@Carreau
Carreau Apr 20, 2014 Member

Not callable anymore, we can remove the shebang.

@takluyver takluyver and 1 other commented on an outdated diff Apr 21, 2014
IPython/nbformat/current.py
def writes_json(nb, **kwargs):
- return versions[current_nbformat].writes_json(nb, **kwargs)
+ """Take a NotebookNode object and write out a JSON string. Report if
+ any JSON format errors are detected.
+
+ """
+ nbjson = versions[current_nbformat].writes_json(nb, **kwargs)
+ num_errors = validate(nbjson)
+ if num_errors > 0:
+ logger.error(
+ "Notebook JSON is invalid (%d errors detected)",
@takluyver
takluyver Apr 21, 2014 Member

We log exactly the same message on read and on write - I think they should be distinguished somehow.

@takluyver takluyver commented on an outdated diff Apr 21, 2014
IPython/nbformat/validator.py
@@ -0,0 +1,99 @@
+from __future__ import print_function
+#!/usr/bin/env python
+# -*- coding: utf8 -*-
+import json
+import os
+
+from IPython.external.jsonschema import Draft3Validator, SchemaError
+import IPython.external.jsonpointer as jsonpointer
+from IPython.utils.py3compat import iteritems
+
+
+from .current import nbformat, nbformat_schema
+schema_path = os.path.join(
+ os.path.split(__file__)[0], "v%d" % nbformat, nbformat_schema)
@takluyver
takluyver Apr 21, 2014 Member

os.path.split()[0] --> os.path.dirname()

@takluyver takluyver and 1 other commented on an outdated diff Apr 21, 2014
IPython/nbformat/current.py
def writes_json(nb, **kwargs):
- return versions[current_nbformat].writes_json(nb, **kwargs)
+ """Take a NotebookNode object and write out a JSON string. Report if
+ any JSON format errors are detected.
+
+ """
+ nbjson = versions[current_nbformat].writes_json(nb, **kwargs)
+ num_errors = validate(nbjson)
@takluyver
takluyver Apr 21, 2014 Member

If I'm following the code correctly, here we're passing nbjson as a string, but above in reads_json we're passing it as an actual dict structure. Does the validation machinery accept both?

@jhamrick
jhamrick Apr 22, 2014 Contributor

Oooh, good catch. I think it just should take the actual dict structure, rather than handling both strings and dict structures.

@takluyver takluyver and 1 other commented on an outdated diff Apr 21, 2014
IPython/nbformat/validator.py
+
+ # load the schema file
+ with open(schema_path, 'r') as fh:
+ schema_json = json.load(fh)
+
+ # resolve internal references
+ v3schema = resolve_ref(schema_json)
+ v3schema = jsonpointer.resolve_pointer(v3schema, '/notebook')
+
+ # count how many errors there are
+ errors = 0
+ v = Draft3Validator(v3schema)
+ for error in v.iter_errors(nbjson):
+ errors = errors + 1
+ if verbose:
+ print(error)
@takluyver
takluyver Apr 21, 2014 Member

Why not just return the list of errors? If the caller wants the number, it can simply call len(). If not, there are many ways you might want to display the errors other than just printing them.

@takluyver takluyver commented on an outdated diff Apr 22, 2014
IPython/nbformat/validator.py
+schema_path = os.path.join(
+ os.path.dirname(__file__), "v%d" % nbformat, nbformat_schema)
+
+
+def isvalid(nbjson):
+ """Checks whether the given notebook JSON conforms to the current
+ notebook format schema. Returns True if the JSON is valid, and
+ False otherwise.
+
+ To see the individual errors that were encountered, please use the
+ `validate` function instead.
+
+ """
+
+ errors = validate(nbjson)
+ return errors == 0
@takluyver
takluyver Apr 22, 2014 Member

This should be comparing against [] rather than 0 now.

@takluyver takluyver commented on an outdated diff Apr 22, 2014
IPython/nbformat/validator.py
+ """Checks whether the given notebook JSON conforms to the current
+ notebook format schema, and returns the list of errors.
+
+ """
+
+ # load the schema file
+ with open(schema_path, 'r') as fh:
+ schema_json = json.load(fh)
+
+ # resolve internal references
+ v3schema = resolve_ref(schema_json)
+ v3schema = jsonpointer.resolve_pointer(v3schema, '/notebook')
+
+ # count how many errors there are
+ v = Draft3Validator(v3schema)
+ errors = [e for e in v.iter_errors(nbjson)]
@takluyver
takluyver Apr 22, 2014 Member

list(v.iter_errors(nbjson))

@takluyver takluyver commented on an outdated diff Apr 22, 2014
IPython/nbformat/current.py
@@ -74,13 +79,33 @@ def parse_py(s, **kwargs):
return nbf, nbm, s
-def reads_json(s, **kwargs):
- """Read a JSON notebook from a string and return the NotebookNode object."""
- return convert(reader_reads(s), current_nbformat)
+def reads_json(nbjson, **kwargs):
+ """Read a JSON notebook from a string and return the NotebookNode
+ object. Report if any JSON format errors are detected.
+
+ """
+ nb = reader_reads(nbjson, **kwargs)
+ nb_current = convert(nb, current_nbformat)
+ errors = validate(nb_current)
+ if len(errors) > 0:
@takluyver
takluyver Apr 22, 2014 Member

This could just be if errors: now.

@jhamrick
Contributor

Thanks for catching those things. I'm going to start writing some tests for this now, too.

@jhamrick
Contributor

Just added a commit with some tests for this code. It is going to fail on Travis, because I included a test for checking whether the JSON is valid for a notebook converted from version 2 to version 3 -- and this test is currently failing, implying that the conversion itself isn't working quite right.

I'm not sure what the best thing to do here is -- should I investigate the conversion code and submit a pull request on that, so that it produces valid JSON? Or something else?

@takluyver
Member

If the error after conversion is a trivial problem, then it would be good to fix it here. If not, or if you don't have the time to look into it, it's OK to skip the test for now and file a new issue about it.

@jhamrick
Contributor

Ok, I looked into this a bit. One problem is an easy fix (the converted adds an orig_nbformat property, which I can just add as an optional property in the JSON schema). The other problem is that v3 notebooks are supposed to have metadata properties for all worksheets and all cells. There are two possible solutions:

  • have the convert code add metadata
  • make metadata be optional in the JSON schema

I think it would probably be better to make it optional in the JSON schema, since my guess is that there are notebooks floating around out there that have been saved as v3 and do not include metadata (and then we should just make sure that when we do v4 we enforce that it's required).

@takluyver
Member

I think that sounds fine, but I'll let others, especially @minrk, weigh in on whether making metadata optional is OK.

@minrk
Member
minrk commented Apr 25, 2014

From older versions of IPython, I believe the metadata key is optional. I think we can do a better job being strict about these definitions in the next nbformat.

@takluyver
Member

OK then, merging. Thanks Jess.

@takluyver takluyver merged commit 6c15f0e into ipython:master Apr 25, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
@minrk minrk added this to the 3.0 milestone May 7, 2014
@jhamrick jhamrick deleted the jhamrick:nbvalidate branch Sep 6, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment