Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enables input removal via preprocessor, template and invalid nb intermediate representation #643

Merged
merged 14 commits into from Aug 21, 2017

Conversation

mpacer
Copy link
Member

@mpacer mpacer commented Aug 7, 2017

This introduces a new step in between Preprocessors and passing the notebook to the template and a new class to manage that step. This also introduces testing infrastructure for this class.

The new class is called a Sieve. The key idea is that this class allows us to have a step after Preprocessors are done that produces an object that is not guaranteed to adhere to the notebook format. This is needed in order to have output format independent ways of removing input.

Accordingly along with this object this introduces the TagRemoveInputSieve. This is equipped with the traitlet TagRemoveInputSieve.remove_input_tags which is a set of strings that are checked to see if a cell's input is to be transformed to None.

This also changes the skeleton to exclude any cells' whose source is None from being included in the output format.

I anticipate discovering more clever ways of using the ability to break the notebook json schema, but this should get us started.

The API for the Sieve class is largely derived from the Preprocessor class, which I took to be a good model since Sieves are (essentially) Preprocessors that allow producing invalid in-memory notebook objects.

NB: Sieves are not to be invoked by the NotebookExporter or any in-place notebook transformation, meaning they should never appear as part of the Exporter class.

Implementation influenced by extensive conversations with:
@Carreau @fperez @ian-r-rose @OliverEvans96

From those conversations additional things were agreed to be improvements, but possibly outside the scope of this PR…

Tasks worth considering:

  • check that it is a pseudo-notebook and not just an arbitrary json-like blob
  • check in the top level preprocessor code that the produced notebook is actually compliant with the nbformat protocol (not just when reading it)
  • add narrative documentation explaining what sieves do
  • consider the appropriateness of the name in the context of not just removing content but modifying content

cc'ing people who probably want to know about how this affects nbconvert & the larger Jupyter ecosystem:
@takluyver @minrk @rgbkrk @ellisonbg

@mpacer mpacer force-pushed the input_tag_sieve branch 2 times, most recently from 175b9c7 to c54646c Compare August 7, 2017 23:01
@takluyver
Copy link
Member

A few design questions:

  • Could we not do this while preserving a valid notebook structure by adding some private metadata to each cell (cell.metadata._nbconvert.hide_input)?
    • Can you give some other examples of use cases you're imagining? This might make it clearer why we need pseudo-notebooks.
  • What are the constraints on pseudo-notebooks? How much like notebooks do they have to be? If we define a new data type, we need to specify it - preferably with a formal schema, but at least a rough description.
  • What's the reason to add a new class of things rather than just relaxing the limitations on preprocessors (perhaps with a marker attribute like preproc.produces_valid_nb = False)?

I may change my mind once I hear more of the reasoning, but my initial reaction is definitely -1 to a new 'pseudo-notebook' format: it doesn't seem to be clearly defined, and I don't see a compelling need. Exchanging data between components in the existing well-specified notebook format is much nicer for interoperability.

@takluyver
Copy link
Member

Thinking about this some more, I'm not even sure if we need a preprocessor to do it - it could be configurable on TemplateExporter, with a test in the template so you can do:

{%- if cell is input_visible -%}

Implemented something like this:

def input_visible(cell):
    tags = cell.get('metadata', {}).get('tags', [])
    return resources.global_content_filter.include_input \
        and not self.remove_input_tags.intersection(tags)

@minrk
Copy link
Member

minrk commented Aug 8, 2017

That's cool! I didn't know about tests.

@takluyver
Copy link
Member

In the video meeting today, a couple of arguments were presented for not relying on cell-level metadata for this. Firstly, that we don't want that to become a public API, where people rely on setting the metadata directly (paraphrased - let me know if I've misunderstood).

I see a few possible ways to deal with this without involving pseudo-notebooks:

  1. A test in the skeleton template, as described above.
  2. Borrow the Python convention that an underscore-prefixed name is private (metadata._nbconvert), but not actually protected (you can do it, but there's a signal that you shouldn't rely on it).
  3. Enforce it by removing that key from metadata before nbconvert sets it, so setting it in the notebook file has no effect.
  4. Add an extra key to cell objects (cell.nbconvert_transient) - this technically makes an invalid notebook, but it's equivalent to a minor bump of the format version - it only adds fields, so most code won't break. We could even make a minor nbformat version to allow it.

The other argument was that we should consider what we can do if we're not constrained to valid notebooks going into exporters.

I'm on board with that, but we should consider it before making that change, and look carefully at whether we can do what we want within valid notebooks (or by only adding new fields). Even if we can't, we need to know which bits of the format pseudo-notebooks are allowed to break.

@rgbkrk
Copy link
Member

rgbkrk commented Aug 8, 2017

Having an intermediate representation (IR) makes sense to me, as does the fact that the in-memory representation in pretty much all the frontends doesn't match the on-disk format. I couldn't imagine standardizing the in-memory representation (much of the time its implementation specific), but I could imagine an intermediate representation of a notebook being similar to the way we think of a compiler's IR.

@takluyver
Copy link
Member

Nbconvert exporters are a public interface - we encourage third parties to implement them and integrate them using entry points. So whatever data goes into them needs to be specified one way or another, it can't be an implementation detail.

One possible spec would be: it's a notebook which may have extra fields, as if it was in format version 4.x+1. Another: it's a notebook where each cell is also required to have a cell.transient key which is a dictionary. I'm not convinced we need either for this use case, but I prefer both of them to allowing extra values for existing keys.

@mpacer
Copy link
Member Author

mpacer commented Aug 8, 2017

The data that goes into the exporter when you call from_notebook_node is always a NotebookNode object. The idea is keep the data going into the exporter as a NotebookNode object.

But, that doesn't constrain the kind of thing that enters the templating step. Currently, we have no schema checking for what goes into the template. Implicitly that suggests that what is to be passed in need not necessarily be a notebook. What this is doing is making that explicit and acknowledging that there maybe notebook-like objects that are able to be handled by the templating system.

@mpacer
Copy link
Member Author

mpacer commented Aug 8, 2017

NB: I'm going to suggest the following idea, because I already did the work to write it out and I think it's worth considering. Still I don't think its a good solution.

Is there a way to extend an existing json schema? If so, what if we were to require that someone how to extend the schema to accommodate their sieve? I can see that becoming a process that is complicated and unable to be composed uniquely. However, it would require people to be specific about how things were being extended…

For example, the thought would be something along the lines of:

{"code_cell": {
  "properties": {
    "source": {
      "$ref": "#/definitions/misc/multiline_string_or_null"
    }
  }
}
{"definitions": {
  "misc": {
    "multiline_string_or_null":{
      "oneOf" : [
        {"type": "string"},
        {
          "type": "array",
          "items": {"type": "string"}
        },
        {"type": "null"}
      ]
    }
  }
}

@ellisonbg
Copy link
Contributor

Thanks for tackling this @mpacer !

I am not quite sure why the tag based input/output filtering can't be implemented as a notebook->notebook transform, with all the pseudo-notebook stuff being an internal implementation detail that doesn't introduce any new public APIs. Can you clarify this point? I apologize in advance that I haven't though much about this problem.

@mpacer
Copy link
Member Author

mpacer commented Aug 9, 2017 via email

@minrk
Copy link
Member

minrk commented Aug 9, 2017

It sees that only hiding input has the problem, yes? Because outputs and cells can be removed as plain transforms. And since we do already have the transient field for data that should not be saved in other frontends, we know that it won't come from files on disk. So I would propose:

  1. for whole objects that can be removed, actually remove them during the preprocess phase
  2. for behaviors that need handling without modifying the data structure (hide/remove input are the only ones?) set a flag in the transient area, which can be handled by templates/exporters.

Are there any cases this doesn't cover?

I do think that the IR being a notebook with additional fields is AOK. The only case where we need to do anything about these extra fields is --to notebook, where we should be sure to pop the extra fields before passing it to nbformat. I don't think we should modify any fields in-place except for complete removal of nodes.

@mpacer
Copy link
Member Author

mpacer commented Aug 9, 2017

We still can't produce an invalid IR in the current Preprocessor steps, so is the idea that I'll just change the way this Sieve works… and now if we're only talking about adding things it probably makes sense to change the name to a metaphor that suggests adding things rather than removing them.

@mpacer
Copy link
Member Author

mpacer commented Aug 9, 2017

--to notebook doesn't rely on the TemplateExporter so we shouldn't need to worry about it if we include this at the level of the TemplateExporter and not Exporter.

@mpacer
Copy link
Member Author

mpacer commented Aug 9, 2017

@takluyver @minrk I 'm going to add added a new issue (#644) to move a lot of the filtering logic to custom tests. I will need to read up on how to enable and register custom tests.

@mpacer mpacer changed the title Input tag sieve builds on #640 to enable input removal too Input tag with transient tag & template changes, builds on #640 to enable input removal Aug 9, 2017
@mpacer mpacer changed the title Input tag with transient tag & template changes, builds on #640 to enable input removal enables input removal via preprocessor, template and invalid nb intermediate representation Aug 10, 2017
@takluyver
Copy link
Member

Thanks, this makes much more sense to me now. :-)

I'd suggest we add a default tag for each kind of filtering, so you can use it without configuring nbconvert by adding one of the standard tags. E.g. nbconvert-remove-input.

@mpacer
Copy link
Member Author

mpacer commented Aug 15, 2017

@takluyver I'd prefer to leave any default value questions for a later date. What we choose to make those be will be likely an opportunity for disagreement.

I would like to move forward with this & then #645 so that we can get a new release of nbconvert out with this functionality.

Barring defaults does anyone else have issues with the PR as it stands?

@mpacer mpacer mentioned this pull request Aug 15, 2017
@takluyver
Copy link
Member

I'm fine with leaving default tags out of this PR, but I'd like to consider them before making a release.

@mpacer
Copy link
Member Author

mpacer commented Aug 15, 2017

I don't support their existence as part of this release. Partially, I want to release before Jupytercon so I can talk about these features in my talk about the document structure and how to use it.

I'm opposed to including them for this release partially because once they are added, we cannot remove them without breaking notebooks and notebook functionality. Whereas we can always add them at a later date without removing functionality. The only challenge then would be conflicting with user tag names, but that would be a concern even if we released it with defaults.

For that reason (among others), I would like us to consider using a different signaling mechanism for unconditional element removal, with tag-based element removal being reserved for conditional element removal with no conditions included as default.

@mpacer
Copy link
Member Author

mpacer commented Aug 18, 2017

@takluyver can I release once I take care of this and #645 and release the new nbformat?

One of the reasons I want to not have a non-empty default is that it gets complicated regarding how to handle traitlet defaults when it's a List/Set.

If you want to cancel the removal of the cells then you need to know to overwrite reassigning the traitlet value to an empty List/Set. That is in contrast to appending/adding to it (which is what I usually suggest people do with List/Sets in general). I know that some people get tripped up with that.

I do agree with you that there should be a way for people to set things to "always" be filtered out, but I think that should not use a magic tag value. Maybe it could live in the metadata.jupyter namespace. But by taking that approach, we can still surface a way to cancel that removal of those cells without having to touch anything about the tags; we could just use a Boolean to handle the cancelling. This way it doesn't force our code to have one mechanism that acts in two really different ways depending upon the nuances of the API people use to customise filtering in their config.

@takluyver
Copy link
Member

Fine on the release.

I agree that configuring collections through traitlets is a bit of a pain. But I would ultimately like to define some magic tag values that can remove cells with default nbconvert config. We added cell tags to give notebook authors an easier way to do this kind of thing than editing JSON cell metadata.

@mpacer
Copy link
Member Author

mpacer commented Aug 18, 2017

So… can this be merged?

@mpacer mpacer modified the milestone: 5.3 Aug 18, 2017
@mpacer
Copy link
Member Author

mpacer commented Aug 20, 2017

@takluyver @minrk @rgbkrk @ellisonbg

Do I have consensus to merge & release this & #645?

@ellisonbg
Copy link
Contributor

Please hold off on merging this. I think we have a significant misunderstanding about all these things.

@ellisonbg
Copy link
Contributor

I looked at part of this, but not all of it.

As I understood what happened at the last in person team meeting, we decided to use official (as in Kyle's PR on nbformat here https://github.com/jupyter/nbformat/pull/94/files) cell metadata (not tags) as the official way to specify hidden input/output in the notebook format, that would be respected across our entire stack. The idea is that we want a single way in the notebook format for this to be represented. Views may have additional ways of representing it, for example, to allow different users to see different views in a real-time collaboration setting. But once the information hits the notebook document, there should be only one way of representing that. I should note that in this context, hidden means "not currently shown, but still present." In the live notebook, this will amount to the content having some sort of "collapse" based UI. I don't see why nbconvert should not respect this metadata. In this case, nbconvert would (for output formats that support it) should keep the content but show it collapsed by default and provide a way for the user to show it. Ideally, we could reuse whatever UI we settle on for JupyterLab.

We also talked about tag based filtering, but the idea there is that nbconvert and the live notebook could hid/show content based on semantic tags that a user adds. For example, a user could add solution tags to certain cells and have nbconvert, create a version of a notebook for students to work. Am I correct that this is what you have added here? What I am a bit confused about is that this PR seems to have a set of particular tags used for removing content. I am opposed to such tags existing as the original idea is that tags should be user-defined and semantically meaningful (solution, test, scratch, idea, etc.).

@mpacer
Copy link
Member Author

mpacer commented Aug 21, 2017

What I am a bit confused about is that this PR seems to have a set of particular tags used for removing content. I am opposed to such tags existing as the original idea is that tags should be user-defined and semantically meaningful (solution, test, scratch, idea, etc.).

@ellisonbg This specifically does not implement any "magic tags"; I had the same understanding as you regarding their user-defined nature. @takluyver has suggested that that would be a desirable feature, but I am not sure where you are seeing that here.

@ellisonbg
Copy link
Contributor

OK great. I saw the remove_input, remove_output, and remove_cell tags in the PR, but missed they were in tests only. As long as there are no specific default flags created by the PR, I am fine with it!

@mpacer
Copy link
Member Author

mpacer commented Aug 21, 2017

@ellisonbg No worries! And thanks for checking!

@minrk minrk merged commit 8dbe72d into jupyter:master Aug 21, 2017
@minrk
Copy link
Member

minrk commented Aug 21, 2017

Thanks, @mpacer! I think this PR is uncontroversial, and we can continue to explore more convenient ways to expose this kind of thing in a future iteration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants