Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add preprocessor to remove empty cells. #575

Merged
merged 5 commits into from May 15, 2017
Merged

Add preprocessor to remove empty cells. #575

merged 5 commits into from May 15, 2017

Conversation

tillahoffmann
Copy link
Contributor

No description provided.

@takluyver takluyver added this to the 5.2 milestone Apr 25, 2017
@takluyver
Copy link
Member

Seems reasonable. I'll let @mpacer take a look.

@tillahoffmann
Copy link
Contributor Author

Just to explain the motivation behind this PR and #569: Adding jupyter notebooks to git is a bit painful because of the generated output. There are a range of scripts that strip the output and other content that may not be relevant for versioning (such as empty cells). But as the notebook format evolves, these scripts become outdated. My hope was to be able to use nbconvert to run git filters to ensure that the format of the notebook is always in sync with the filter script.

@mpacer
Copy link
Member

mpacer commented Apr 25, 2017

As you can guess from my participation in jupyter/notebook#1928 — that sounds great! I'd even be open to having this included in most exporters by default — @takluyver @Carreau is that a reasonable inclusion? That can be a second PR though.

My only issue is that cell.source.strip() used to return True because of a nonempty string being returned is a bit more clever than I'd prefer (in terms of a readable codebase). I'm more ok with it being used in the filter than in the assert statement (where it tripped me up for a second regarding what the test was actually testing (since it's looping through the three cells that should not be there and then asserting something about them as the only assert statement)). At the least, there should be a comment explaining the logic behind it.

It might be better to have it assert the emptiness of the notebook as one branch (of an if:… else:) and then run your loop if it happens to not be empty. That way both conditions are tested.

@Carreau
Copy link
Member

Carreau commented Apr 25, 2017

One of the questions I'll have is what if metadata or other things are set ? Should the cell be considered empty (IIRC gene pattern may use only metadata to store cell informations) ? Nitpicking but what if the notebook is using [this programming language](https://en.wikipedia.org/wiki/Whitespace_(programming_language) ?

Regardless I think it is a useful addition. It's common to have trailing empty cells indeed.

# Copyright (c) IPython Development Team.
# Distributed under the terms of the Modified BSD License.

from traitlets import Set
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set is un necessary, but can be remove in a subsequent PR.

nb, res = preprocessor(nb, res)

for cell in nb.cells:
assert cell.source.strip(), "found unexpected empty cell"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @mpacer using something like nose_tools.assert_not_equal will also give a better error message, and won't be compiled out if Python is by mistake configured with -O something.

@tillahoffmann
Copy link
Contributor Author

Ok, will have a look at making the tests a bit easier to read. We could define a string traitlet that defines a regular expression such that a cell is considered empty if it matches the regex. Thoughts?

@tillahoffmann
Copy link
Contributor Author

Hm, this is now just a regex-based filter for cells. The only reason it should still be called RemoveEmptyPreprocessor is because of the default value of the pattern.

@mpacer
Copy link
Member

mpacer commented Apr 27, 2017

What you could do is change the current one to RegexRemovePreprocessor, with RemoveEmptyPreprocessor as a special case that is explained in the docs (if not hardcoded)?.

Yes it's an straightforward use case of the RegexRemovePreprocessor, but there still may be a reason to to give a built in RemoveEmptyPreprocessor. People might do a bad job of writing a RemoveEmptyPreprocessor if they don't know to check the docs tutorial. Because of that it might be worthwhile to have it as a prebuilt core Preprocessor that is surfaced in the same submodule.

But even if we don't do that, this should be added as an example to the docs of how to use traitlets to create a subclass of a particular kind of Preprocessors.

Also, please add the config=True attribute to the empty_pattern = Unicode(…) assignment and change the variable name if you are going to generalise the aim to a RegexRemovePreprocessor. Otherwise I can just accept this as is and I'll do that tomorrow.

@mpacer mpacer closed this Apr 27, 2017
@mpacer mpacer reopened this Apr 27, 2017
@mpacer
Copy link
Member

mpacer commented Apr 27, 2017

@Carreau & @takluyver this is looking good to me.

Docs, default value, default inclusion in TemplateExporter?

The only thing that needs to happen before merge are documentation improvements. Add the default value to the docstring whatever it ends up being. Add a section in the docs explaining how to use this preprocessor. Specifically pointing out that you need to separately enable the preprocessor and set the regex to a value. That should include some reference to resources on how to use regexes.

If you want my first pass on a structure for displaying those resources: first link to a regex resource better than the python docs about regular expressions, then the python 3 regex (not the re docs), and then at least one tool for testing regexes interactively using python syntax and the regex (better) or re (worse) modules Ideally, that regex testing tool would be in the browser, but finding something like that for python regex syntax might be hard. :/

@blink1073 I'd love to make a JupyterLab extension/widget to do that kind of interactive regex testing. Can we take a stab at implementing that at the spring dev meeting?

I'm open to thoughts as to what the default value should be… I think this is a solid default but I'm not sure if there'd be something more appropriate given the generality. Possibly the null string? If we go with the empty string, I'd strongly suggest including it as a default enabled preprocessor in TemplateExporter.

Re-addition of special use case RemoveEmptyPreprocessor

I think we should add a specific RemoveEmptyPreprocessor as an hard coded subclass that is not enabled by default. It's name alone will immediately jump out (hopefully) to users who are skimming for this solution. But I can do that in a separate PR (unless @tillahoffman wants to tackle that too).

If we don't end up wanting this separate RemoveEmptyPreprocessor then I think the default should be as it is now and we should definitely enable this filter by default.

List instead of Unicode?

Additionally, I'm thinking about the use case of filtering multiple patterns… so switching it to a List rather than but that also seems like it should be a new PR… where this idea is coming from is partially related to what the default value is. If we did that… is there away to ensure that a List traitlet only takes values of a certain type? I'm sure we can check before applying it but it'd be nice to err at the time of traitlet instance creation with Nbconvertapp (since then it will point to the traitlet def and not to wherever the check happens to be).

I know regex can have alternations, but that is relatively clunky to write and test incrementally. Also, if there is a non-null string as the default, you don't want to have to first check what the default is and include it as an alternation to filter more things. At least as a list you can just extend it rather than overwrite it.

Overall

I love this feature, and want to see it reach the power & flexibility that it clearly can have. I'm thinking it's still not too

@mpacer
Copy link
Member

mpacer commented Apr 27, 2017

Possible resources for docs

http://www.rexegg.com/regex-python.html
https://pypi.python.org/pypi/regex (Best(only) docs as far as I can tell for this module, which apparently still isn't in the standard library‽)
https://docs.python.org/3/library/re.html
https://docs.python.org/3/howto/regex.html#regex-howto
https://regex101.com/

@takluyver
Copy link
Member

is there away to ensure that a List traitlet only takes values of a certain type?

Yep, this should work:

patterns = List(Unicode, default_value=[r'\s*$'])

@tillahoffmann
Copy link
Contributor Author

Thanks for all the comments and references. The www.pyregex.com tool is great! 👍

I like the idea of using a list of patterns such that users can extend the patterns without having to manipulate regex patterns.

Unfortunately, I will only have sporadic internet access for the next two weeks but will try to implement the following before:

  • Extend documentation of the preprocessor (I'm not sufficiently knowledgeable about the traitlets and how to configure them to extend the documentation in that respect).
  • Use list of unicode strings.
  • Change default value to the null string \Z as suggested by @mpacer with the assumption that we add a hard-coded RemoveEmtpyPreprocessor with pattern \s*\Z.

@nfultz
Copy link

nfultz commented May 5, 2017

This could also be useful for stripping out sensitive info from a notebook, such as AWS keys.

I think using a List is fine for the config, but under the hood you should probably convert everything to a single regex using '|'.join(patterns) or similar.

@mpacer
Copy link
Member

mpacer commented May 9, 2017

Improved regex testing suggestion: https://regex101.com/

↑ it actually goes through and explains the logic of the matches.

@tillahoffmann
Copy link
Contributor Author

@nfultz, thank you for the idea. Much faster that way.
@mpacer, docs are updated.

@mpacer
Copy link
Member

mpacer commented May 15, 2017

This looks great! Thank you for all the hard work.

I had actually meant docs in the sense of the narrative docs (rather than in the docstring) but that works too!

Merging!

@mpacer mpacer merged commit e670bf6 into jupyter:master May 15, 2017
@tillahoffmann tillahoffmann deleted the removeempty branch May 15, 2017 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants