Add preprocessor to remove empty cells. #575

tillahoffmann · 2017-04-25T10:01:53Z

No description provided.

takluyver · 2017-04-25T10:31:55Z

Seems reasonable. I'll let @mpacer take a look.

tillahoffmann · 2017-04-25T11:22:32Z

Just to explain the motivation behind this PR and #569: Adding jupyter notebooks to git is a bit painful because of the generated output. There are a range of scripts that strip the output and other content that may not be relevant for versioning (such as empty cells). But as the notebook format evolves, these scripts become outdated. My hope was to be able to use nbconvert to run git filters to ensure that the format of the notebook is always in sync with the filter script.

mpacer · 2017-04-25T17:15:34Z

As you can guess from my participation in jupyter/notebook#1928 — that sounds great! I'd even be open to having this included in most exporters by default — @takluyver @Carreau is that a reasonable inclusion? That can be a second PR though.

My only issue is that cell.source.strip() used to return True because of a nonempty string being returned is a bit more clever than I'd prefer (in terms of a readable codebase). I'm more ok with it being used in the filter than in the assert statement (where it tripped me up for a second regarding what the test was actually testing (since it's looping through the three cells that should not be there and then asserting something about them as the only assert statement)). At the least, there should be a comment explaining the logic behind it.

It might be better to have it assert the emptiness of the notebook as one branch (of an if:… else:) and then run your loop if it happens to not be empty. That way both conditions are tested.

Carreau · 2017-04-25T17:57:31Z

One of the questions I'll have is what if metadata or other things are set ? Should the cell be considered empty (IIRC gene pattern may use only metadata to store cell informations) ? Nitpicking but what if the notebook is using [this programming language](https://en.wikipedia.org/wiki/Whitespace_(programming_language) ?

Regardless I think it is a useful addition. It's common to have trailing empty cells indeed.

Carreau · 2017-04-25T17:58:11Z

nbconvert/preprocessors/removeempty.py

+# Copyright (c) IPython Development Team.
+# Distributed under the terms of the Modified BSD License.
+
+from traitlets import Set


Set is un necessary, but can be remove in a subsequent PR.

Carreau · 2017-04-25T18:01:26Z

nbconvert/preprocessors/tests/test_removeempty.py

+        nb, res = preprocessor(nb, res)
+
+        for cell in nb.cells:
+            assert cell.source.strip(), "found unexpected empty cell"


Agreed with @mpacer using something like nose_tools.assert_not_equal will also give a better error message, and won't be compiled out if Python is by mistake configured with -O something.

tillahoffmann · 2017-04-25T18:07:10Z

Ok, will have a look at making the tests a bit easier to read. We could define a string traitlet that defines a regular expression such that a cell is considered empty if it matches the regex. Thoughts?

tillahoffmann · 2017-04-26T09:51:45Z

Hm, this is now just a regex-based filter for cells. The only reason it should still be called RemoveEmptyPreprocessor is because of the default value of the pattern.

mpacer · 2017-04-27T06:12:31Z

What you could do is change the current one to RegexRemovePreprocessor, with RemoveEmptyPreprocessor as a special case that is explained in the docs (if not hardcoded)?.

Yes it's an straightforward use case of the RegexRemovePreprocessor, but there still may be a reason to to give a built in RemoveEmptyPreprocessor. People might do a bad job of writing a RemoveEmptyPreprocessor if they don't know to check the docs tutorial. Because of that it might be worthwhile to have it as a prebuilt core Preprocessor that is surfaced in the same submodule.

But even if we don't do that, this should be added as an example to the docs of how to use traitlets to create a subclass of a particular kind of Preprocessors.

Also, please add the config=True attribute to the empty_pattern = Unicode(…) assignment and change the variable name if you are going to generalise the aim to a RegexRemovePreprocessor. Otherwise I can just accept this as is and I'll do that tomorrow.

mpacer · 2017-04-27T19:48:22Z

@Carreau & @takluyver this is looking good to me.

Docs, default value, default inclusion in `TemplateExporter`?

The only thing that needs to happen before merge are documentation improvements. Add the default value to the docstring whatever it ends up being. Add a section in the docs explaining how to use this preprocessor. Specifically pointing out that you need to separately enable the preprocessor and set the regex to a value. That should include some reference to resources on how to use regexes.

If you want my first pass on a structure for displaying those resources: first link to a regex resource better than the python docs about regular expressions, then the python 3 regex (not the re docs), and then at least one tool for testing regexes interactively using python syntax and the regex (better) or re (worse) modules Ideally, that regex testing tool would be in the browser, but finding something like that for python regex syntax might be hard. :/

↑ @blink1073 I'd love to make a JupyterLab extension/widget to do that kind of interactive regex testing. Can we take a stab at implementing that at the spring dev meeting?

I'm open to thoughts as to what the default value should be… I think this is a solid default but I'm not sure if there'd be something more appropriate given the generality. Possibly the null string? If we go with the empty string, I'd strongly suggest including it as a default enabled preprocessor in TemplateExporter.

Re-addition of special use case `RemoveEmptyPreprocessor`

I think we should add a specific RemoveEmptyPreprocessor as an hard coded subclass that is not enabled by default. It's name alone will immediately jump out (hopefully) to users who are skimming for this solution. But I can do that in a separate PR (unless @tillahoffman wants to tackle that too).

If we don't end up wanting this separate RemoveEmptyPreprocessor then I think the default should be as it is now and we should definitely enable this filter by default.

`List` instead of `Unicode`?

Additionally, I'm thinking about the use case of filtering multiple patterns… so switching it to a List rather than but that also seems like it should be a new PR… where this idea is coming from is partially related to what the default value is. If we did that… is there away to ensure that a List traitlet only takes values of a certain type? I'm sure we can check before applying it but it'd be nice to err at the time of traitlet instance creation with Nbconvertapp (since then it will point to the traitlet def and not to wherever the check happens to be).

I know regex can have alternations, but that is relatively clunky to write and test incrementally. Also, if there is a non-null string as the default, you don't want to have to first check what the default is and include it as an alternation to filter more things. At least as a list you can just extend it rather than overwrite it.

Overall

I love this feature, and want to see it reach the power & flexibility that it clearly can have. I'm thinking it's still not too

mpacer · 2017-04-27T20:48:52Z

Possible resources for docs

• http://www.rexegg.com/regex-python.html
• https://pypi.python.org/pypi/regex (Best(only) docs as far as I can tell for this module, which apparently still isn't in the standard library‽)
• https://docs.python.org/3/library/re.html
• https://docs.python.org/3/howto/regex.html#regex-howto
• https://regex101.com/

takluyver · 2017-04-28T16:20:35Z

is there away to ensure that a List traitlet only takes values of a certain type?

Yep, this should work:

patterns = List(Unicode, default_value=[r'\s*$'])

tillahoffmann · 2017-04-28T16:39:31Z

Thanks for all the comments and references. The www.pyregex.com tool is great! 👍

I like the idea of using a list of patterns such that users can extend the patterns without having to manipulate regex patterns.

Unfortunately, I will only have sporadic internet access for the next two weeks but will try to implement the following before:

Extend documentation of the preprocessor (I'm not sufficiently knowledgeable about the traitlets and how to configure them to extend the documentation in that respect).
Use list of unicode strings.
Change default value to the null string \Z as suggested by @mpacer with the assumption that we add a hard-coded RemoveEmtpyPreprocessor with pattern \s*\Z.

nfultz · 2017-05-05T04:29:18Z

This could also be useful for stripping out sensitive info from a notebook, such as AWS keys.

I think using a List is fine for the config, but under the hood you should probably convert everything to a single regex using '|'.join(patterns) or similar.

mpacer · 2017-05-09T22:28:55Z

Improved regex testing suggestion: https://regex101.com/

↑ it actually goes through and explains the logic of the matches.

tillahoffmann · 2017-05-15T14:12:04Z

@nfultz, thank you for the idea. Much faster that way.
@mpacer, docs are updated.

mpacer · 2017-05-15T18:30:38Z

This looks great! Thank you for all the hard work.

I had actually meant docs in the sense of the narrative docs (rather than in the docstring) but that works too!

Merging!

Add preprocessor to remove empty cells.

9fbffcf

takluyver added this to the 5.2 milestone Apr 25, 2017

Carreau reviewed Apr 25, 2017

View reviewed changes

tillahoffmann added 2 commits April 26, 2017 10:25

Use regex traitlet and clean up tests.

32771e9

Compatibility with python2.

d69c620

Rename to RegexRemovePreprocessor.

a8cb3a5

mpacer closed this Apr 27, 2017

mpacer reopened this Apr 27, 2017

Update docs and tests.

e82ddca

mpacer merged commit e670bf6 into jupyter:master May 15, 2017

tillahoffmann deleted the removeempty branch May 15, 2017 18:36

mpacer added to_changelog and removed to_changelog labels May 24, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add preprocessor to remove empty cells. #575

Add preprocessor to remove empty cells. #575

tillahoffmann commented Apr 25, 2017

takluyver commented Apr 25, 2017

tillahoffmann commented Apr 25, 2017

mpacer commented Apr 25, 2017 •

edited

Carreau commented Apr 25, 2017

Carreau Apr 25, 2017

Carreau Apr 25, 2017

tillahoffmann commented Apr 25, 2017

tillahoffmann commented Apr 26, 2017

mpacer commented Apr 27, 2017

mpacer commented Apr 27, 2017

mpacer commented Apr 27, 2017 •

edited

takluyver commented Apr 28, 2017

tillahoffmann commented Apr 28, 2017

nfultz commented May 5, 2017

mpacer commented May 9, 2017

tillahoffmann commented May 15, 2017

mpacer commented May 15, 2017

Add preprocessor to remove empty cells. #575

Add preprocessor to remove empty cells. #575

Conversation

tillahoffmann commented Apr 25, 2017

takluyver commented Apr 25, 2017

tillahoffmann commented Apr 25, 2017

mpacer commented Apr 25, 2017 • edited

Carreau commented Apr 25, 2017

Carreau Apr 25, 2017

Choose a reason for hiding this comment

Carreau Apr 25, 2017

Choose a reason for hiding this comment

tillahoffmann commented Apr 25, 2017

tillahoffmann commented Apr 26, 2017

mpacer commented Apr 27, 2017

mpacer commented Apr 27, 2017

Docs, default value, default inclusion in TemplateExporter?

Re-addition of special use case RemoveEmptyPreprocessor

List instead of Unicode?

Overall

mpacer commented Apr 27, 2017 • edited

takluyver commented Apr 28, 2017

tillahoffmann commented Apr 28, 2017

nfultz commented May 5, 2017

mpacer commented May 9, 2017

tillahoffmann commented May 15, 2017

mpacer commented May 15, 2017

mpacer commented Apr 25, 2017 •

edited

Docs, default value, default inclusion in `TemplateExporter`?

Re-addition of special use case `RemoveEmptyPreprocessor`

`List` instead of `Unicode`?

mpacer commented Apr 27, 2017 •

edited