New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add preprocessor to remove empty cells. #575

Merged
merged 5 commits into from May 15, 2017

Conversation

Projects
None yet
5 participants
@tillahoffmann
Copy link
Contributor

tillahoffmann commented Apr 25, 2017

No description provided.

@takluyver takluyver added this to the 5.2 milestone Apr 25, 2017

@takluyver

This comment has been minimized.

Copy link
Member

takluyver commented Apr 25, 2017

Seems reasonable. I'll let @mpacer take a look.

@tillahoffmann

This comment has been minimized.

Copy link
Contributor Author

tillahoffmann commented Apr 25, 2017

Just to explain the motivation behind this PR and #569: Adding jupyter notebooks to git is a bit painful because of the generated output. There are a range of scripts that strip the output and other content that may not be relevant for versioning (such as empty cells). But as the notebook format evolves, these scripts become outdated. My hope was to be able to use nbconvert to run git filters to ensure that the format of the notebook is always in sync with the filter script.

@mpacer

This comment has been minimized.

Copy link
Member

mpacer commented Apr 25, 2017

As you can guess from my participation in jupyter/notebook#1928 — that sounds great! I'd even be open to having this included in most exporters by default — @takluyver @Carreau is that a reasonable inclusion? That can be a second PR though.

My only issue is that cell.source.strip() used to return True because of a nonempty string being returned is a bit more clever than I'd prefer (in terms of a readable codebase). I'm more ok with it being used in the filter than in the assert statement (where it tripped me up for a second regarding what the test was actually testing (since it's looping through the three cells that should not be there and then asserting something about them as the only assert statement)). At the least, there should be a comment explaining the logic behind it.

It might be better to have it assert the emptiness of the notebook as one branch (of an if:… else:) and then run your loop if it happens to not be empty. That way both conditions are tested.

@Carreau

This comment has been minimized.

Copy link
Contributor

Carreau commented Apr 25, 2017

One of the questions I'll have is what if metadata or other things are set ? Should the cell be considered empty (IIRC gene pattern may use only metadata to store cell informations) ? Nitpicking but what if the notebook is using [this programming language](https://en.wikipedia.org/wiki/Whitespace_(programming_language) ?

Regardless I think it is a useful addition. It's common to have trailing empty cells indeed.

# Copyright (c) IPython Development Team.
# Distributed under the terms of the Modified BSD License.

from traitlets import Set

This comment has been minimized.

@Carreau

Carreau Apr 25, 2017

Contributor

Set is un necessary, but can be remove in a subsequent PR.

nb, res = preprocessor(nb, res)

for cell in nb.cells:
assert cell.source.strip(), "found unexpected empty cell"

This comment has been minimized.

@Carreau

Carreau Apr 25, 2017

Contributor

Agreed with @mpacer using something like nose_tools.assert_not_equal will also give a better error message, and won't be compiled out if Python is by mistake configured with -O something.

@tillahoffmann

This comment has been minimized.

Copy link
Contributor Author

tillahoffmann commented Apr 25, 2017

Ok, will have a look at making the tests a bit easier to read. We could define a string traitlet that defines a regular expression such that a cell is considered empty if it matches the regex. Thoughts?

tillahoffmann added some commits Apr 26, 2017

@tillahoffmann

This comment has been minimized.

Copy link
Contributor Author

tillahoffmann commented Apr 26, 2017

Hm, this is now just a regex-based filter for cells. The only reason it should still be called RemoveEmptyPreprocessor is because of the default value of the pattern.

@mpacer

This comment has been minimized.

Copy link
Member

mpacer commented Apr 27, 2017

What you could do is change the current one to RegexRemovePreprocessor, with RemoveEmptyPreprocessor as a special case that is explained in the docs (if not hardcoded)?.

Yes it's an straightforward use case of the RegexRemovePreprocessor, but there still may be a reason to to give a built in RemoveEmptyPreprocessor. People might do a bad job of writing a RemoveEmptyPreprocessor if they don't know to check the docs tutorial. Because of that it might be worthwhile to have it as a prebuilt core Preprocessor that is surfaced in the same submodule.

But even if we don't do that, this should be added as an example to the docs of how to use traitlets to create a subclass of a particular kind of Preprocessors.

Also, please add the config=True attribute to the empty_pattern = Unicode(…) assignment and change the variable name if you are going to generalise the aim to a RegexRemovePreprocessor. Otherwise I can just accept this as is and I'll do that tomorrow.

@mpacer mpacer closed this Apr 27, 2017

@mpacer mpacer reopened this Apr 27, 2017

@mpacer

This comment has been minimized.

Copy link
Member

mpacer commented Apr 27, 2017

@Carreau & @takluyver this is looking good to me.

Docs, default value, default inclusion in TemplateExporter?

The only thing that needs to happen before merge are documentation improvements. Add the default value to the docstring whatever it ends up being. Add a section in the docs explaining how to use this preprocessor. Specifically pointing out that you need to separately enable the preprocessor and set the regex to a value. That should include some reference to resources on how to use regexes.

If you want my first pass on a structure for displaying those resources: first link to a regex resource better than the python docs about regular expressions, then the python 3 regex (not the re docs), and then at least one tool for testing regexes interactively using python syntax and the regex (better) or re (worse) modules Ideally, that regex testing tool would be in the browser, but finding something like that for python regex syntax might be hard. :/

@blink1073 I'd love to make a JupyterLab extension/widget to do that kind of interactive regex testing. Can we take a stab at implementing that at the spring dev meeting?

I'm open to thoughts as to what the default value should be… I think this is a solid default but I'm not sure if there'd be something more appropriate given the generality. Possibly the null string? If we go with the empty string, I'd strongly suggest including it as a default enabled preprocessor in TemplateExporter.

Re-addition of special use case RemoveEmptyPreprocessor

I think we should add a specific RemoveEmptyPreprocessor as an hard coded subclass that is not enabled by default. It's name alone will immediately jump out (hopefully) to users who are skimming for this solution. But I can do that in a separate PR (unless @tillahoffman wants to tackle that too).

If we don't end up wanting this separate RemoveEmptyPreprocessor then I think the default should be as it is now and we should definitely enable this filter by default.

List instead of Unicode?

Additionally, I'm thinking about the use case of filtering multiple patterns… so switching it to a List rather than but that also seems like it should be a new PR… where this idea is coming from is partially related to what the default value is. If we did that… is there away to ensure that a List traitlet only takes values of a certain type? I'm sure we can check before applying it but it'd be nice to err at the time of traitlet instance creation with Nbconvertapp (since then it will point to the traitlet def and not to wherever the check happens to be).

I know regex can have alternations, but that is relatively clunky to write and test incrementally. Also, if there is a non-null string as the default, you don't want to have to first check what the default is and include it as an alternation to filter more things. At least as a list you can just extend it rather than overwrite it.

Overall

I love this feature, and want to see it reach the power & flexibility that it clearly can have. I'm thinking it's still not too

@mpacer

This comment has been minimized.

Copy link
Member

mpacer commented Apr 27, 2017

Possible resources for docs

http://www.rexegg.com/regex-python.html
https://pypi.python.org/pypi/regex (Best(only) docs as far as I can tell for this module, which apparently still isn't in the standard library‽)
https://docs.python.org/3/library/re.html
https://docs.python.org/3/howto/regex.html#regex-howto
https://regex101.com/

@takluyver

This comment has been minimized.

Copy link
Member

takluyver commented Apr 28, 2017

is there away to ensure that a List traitlet only takes values of a certain type?

Yep, this should work:

patterns = List(Unicode, default_value=[r'\s*$'])
@tillahoffmann

This comment has been minimized.

Copy link
Contributor Author

tillahoffmann commented Apr 28, 2017

Thanks for all the comments and references. The www.pyregex.com tool is great! 👍

I like the idea of using a list of patterns such that users can extend the patterns without having to manipulate regex patterns.

Unfortunately, I will only have sporadic internet access for the next two weeks but will try to implement the following before:

  • Extend documentation of the preprocessor (I'm not sufficiently knowledgeable about the traitlets and how to configure them to extend the documentation in that respect).
  • Use list of unicode strings.
  • Change default value to the null string \Z as suggested by @mpacer with the assumption that we add a hard-coded RemoveEmtpyPreprocessor with pattern \s*\Z.
@nfultz

This comment has been minimized.

Copy link

nfultz commented May 5, 2017

This could also be useful for stripping out sensitive info from a notebook, such as AWS keys.

I think using a List is fine for the config, but under the hood you should probably convert everything to a single regex using '|'.join(patterns) or similar.

@mpacer

This comment has been minimized.

Copy link
Member

mpacer commented May 9, 2017

Improved regex testing suggestion: https://regex101.com/

↑ it actually goes through and explains the logic of the matches.

@tillahoffmann

This comment has been minimized.

Copy link
Contributor Author

tillahoffmann commented May 15, 2017

@nfultz, thank you for the idea. Much faster that way.
@mpacer, docs are updated.

@mpacer

This comment has been minimized.

Copy link
Member

mpacer commented May 15, 2017

This looks great! Thank you for all the hard work.

I had actually meant docs in the sense of the narrative docs (rather than in the docstring) but that works too!

Merging!

@mpacer mpacer merged commit e670bf6 into jupyter:master May 15, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@tillahoffmann tillahoffmann deleted the tillahoffmann:removeempty branch May 15, 2017

@mpacer mpacer added to_changelog and removed to_changelog labels May 24, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment