Skip to content
This repository

IPEP 2: Input transformations #2293

Closed
takluyver opened this Issue August 12, 2012 · 33 comments

6 participants

Thomas Kluyver Brian E. Granger Matthias Bussonnier Aaron Meurer Fernando Perez Takafumi Arakaki
Thomas Kluyver
Collaborator
  • Created: 2012-08-12
  • Author: Thomas Kluyver

The state of our input transformation machinery has come up a couple of times recently, and I'd promised to look into it.

Requirements

A line-by-line input filter is needed for two main reasons:

  • We need to avoid transforming content inside multi-line strings.
  • Line-based frontends (the terminal and Qt console) decide whether another line is required based on attempting to compile the current buffer as Python code. So constructs like %magic commands, which aren't Python syntax, have to be intercepted as each line is entered.

We also need to do some transformations which are only possible with access to the interactive namespace, i.e. they must be done in the kernel. Examples include the autocall system (which lets you type exit to exit), macros, automagics (using magics without the % prefix) and aliases for shell commands (like ls). We refer to these as 'dynamic transformations'.

Finally, we need an extensible system that third parties can hook into without having to monkeypatch lots of our code.

Current situation

InputSplitter does line-by-line transformation (the name's a little confusing, as its primary role is no longer splitting input). It also handles cell magics, but the implementation feels somewhat awkward to me. For line-based frontends, inputsplitter is run twice: once by the frontend, and again by run_cell(), which is called with the raw, untransformed code.

Prefilter does dynamic transformations using a mixture of Transformer subclasses and Checker/Handler subclass pairs. We've struck the compromise that dynamic transforms only happen on single line cells, because the frontend can't make them valid syntax on its own. This is the primary extension point for third parties, but it's somewhat awkward to use (subclassing from Transformer isn't simple), and doesn't work as extension authors might expect (only transforms single lines).

Several bits of functionality are duplicated in inputsplitter and prefilter: the transformations for %magic, !system, assigning versions of both (foo = %magic), help? (and ?help, morehelp??), escapes for various kinds of call (/callme arg, ,quoteseparate a b c, ;quotetogether a b c), and stripping Python/IPython input prompts. As far as I know, we only use the inputsplitter versions of these functions, since Fernando fixed %paste to use inputsplitter.

Suggestions

  1. I suggest that we make InputSplitter the main point of contact for extension authors to transform input. It works on multi-line cells, and knows to ignore text within strings, which almost all transformations will want to leave alone. This will involve developing InputSplitter to make it easier to extend (details to be fleshed out in discussion), and improving the documentation to point extension authors towards this rather than prefilter.

  2. For all the duplicated functionality, strip it out of prefilter and rely on the inputsplitter versions.

  3. Rename InputSplitter to something more meaningful, before many third parties are depending on it. InputAccumulator? InputFilter?

  4. Allow inputsplitter transformers to maintain state between lines. This should allow a less special-cased system for catching cell magics, as well as correctly stripping the prompts in a pasted block like the following::

>>> a = """1
... 2
... 3"""
  1. Add a later transformation hook which acts on the AST, rather than a string of the code. This would support cases like SymPy's intention to wrap integer literals (reference). The ast module already has a NodeTransformer class to support this kind of thing. This approach is limited to code that is already valid Python syntax before the transformation, but it should be powerful and reliable in those situations.
Brian E. Granger
Owner
Thomas Kluyver
Collaborator

This is a Github issue ;-). Fernando also used a Gist for the actual document, but copied it into the issue. My preference is to have the document in a single location; with two locations you can't be sure that what you're reading is up to date.

Brian E. Granger
Owner

I should have clarified what I mean. There are currently two pages I have to visit:

  1. The github issue.
  2. The gist with the actual doc.

And there are three places people can post comments:

  1. On ipython-dev.
  2. On the issue's page.
  3. On the gist's page.

That is confusing. I agree that Fernando's approach of putting the doc as a gist and pasted into the issue isn't quite right, but I your approach has too many things going on. Why not just put the doc only on the issue and encourage people to leave comments only there (rather than also on the list)? If we want a separate place other than the issue to host the document I recommend putting them on the github wiki. The benefit of this is that any of us can easily edit the IPEP and we will have a centralize record of all the IPEPs. I think that is what I would prefer along with all comments going to the issue.

Matthias Bussonnier
Collaborator

IPEP 3 : How to write an IPEP.

Thomas Kluyver
Collaborator

OK, I've copied it in above. The downside of this approach is that we don't have a history of the IPEP, but I guess that's not so important (I've never looked up the versioning of a PEP). Now perhaps we can discuss the content. ;-)

Aaron Meurer

Will this be able to handle things like the physics extension that attempt to extend the Python syntax?

Aaron Meurer

Why do you want to give the ast of the code instead of the string of the code? Maybe related to some detail of how IPython works that I'm not familiar with?

Fernando Perez
Owner

Too late for meaningful comment, I just wanted to say thank you for getting this going!! That machinery did improve a lot after the big refactor that led to inputsplitter in 0.11, but I knew at the time we'd still need one more pass to completely consolidate and rationalize things.

Now is the time to do it, and hopefully we'll end up with just one, comprehensible module to handle all input transformations, as well as a clear policy of what it is that we do, where users can plug in their own customizations and what they can and can't expect to be able to do there.

Thomas Kluyver
Collaborator

Thanks, Fernando.

Aaron: I would leave the prefilter machinery working as is, but with the same limitation of only acting on a single line, so the physics extension would continue to work as well as it does now. If Georg wants to update it after this refactoring, I would point him towards the new InputSplitter machinery, which should be capable of handling it.

I want to have the option of using an AST because, if your intention is to transform all integer literals in a block of code, the only way to do that reliably is to parse the code (consider an input like 1 or '2 "or 3\' or" 4' or 5. or 6j - regexes will get things wrong). Each extension could call ast.parse() itself, but:

  • If other code-string transformations still need to be applied, the code you try to parse may not be valid Python syntax.
  • A separate hook can be re-used by magic commands like %timeit to modify code passed to them.
  • If you use a lot of extensions that each parse the code, performance will suffer. We can parse it once and modify the tree.

In case it wasn't clear, the AST transformation I'm talking about is in addition to the code-string transformations done by inputsplitter & prefilter. But I would encourage people to use it where possible, and I think it is possible in your case.

Takafumi Arakaki
tkf commented August 13, 2012

Is #2164 (Request: Line breaks in line magic command) related to this IPEP?

Thomas Kluyver
Collaborator

I'm increasingly enthusiastic about the AST-transforming idea. With access to the parse tree and the interactive namespace, all sorts of magic becomes possible, like intercepting references to undefined variables. Of course, there's a lot of stuff that you really shouldn't do, but some of the best bits of IPython are those that skirt the edge of 'never do this'.

Thomas Kluyver
Collaborator

@tkf: It's certainly related. I'm having ideas about how to redesign InputSplitter, so I'll try to work in a way to handle that case. Thanks for pointing it out.

Takafumi Arakaki
tkf commented August 13, 2012

Great!

Aaron Meurer

OK, that makes sense. Currently we use tokenize to do the transformation, but I guess ast would work just as well. This would also solve the problem of what to do with SyntaxErrors for the user, as with your model code with incorrect syntax would never even reach the sat transformer stage.

Though I disagree with leaving line based machinery. As I argued elsewhere, 99% of the time, you really want to act on the whole cell, not a line (#2164 is a great example of this). In the cases when you really do want lines, you can always do input.split("\n"). Users should be encouraged to write code that works on whole cells, not lines, as they tend to forget that things like line continuations are possible. And the way you encourage users to do things with APIs is to make the API do what you want the user to do easily, and not do what you don't want him to do.

Also, since it's clear that you need a string based whole cell transformer, wouldn't a line transformer be redundant anyway?

Aaron Meurer

By the way, I don't remember if I mentioned this elsewhere (maybe that's why you mentioned it), but we would indeed use the AST to catch undefined names for SymPy. Our current method adds an exception handler for NameError, but this leads to subtle bugs, because the code is ran until the NameError is caught, and then ran again, so things like

a = 0
for i in range(10):
    a += 1
    print b

would result in a == 10 if b is defined and a == 11 if b is not defined.

My only complaint with AST is that it's a little easier to do the things I want with tokenize (although I've never actually used ast before, so maybe it's not as hard as the docs make it seem).

Aaron Meurer

By the way, for additional reference, the SymPy hack code that monkey-patches run_cell is now part of the official repo, and can be found at https://github.com/sympy/sympy/blob/master/sympy/interactive/session.py.

Also, issue #1491 is related.

Thomas Kluyver
Collaborator

I think it might actually be easier to do the things you want with AST - i.e. simpler code - but it's less obvious. E.g. to transform integers, I think this should be sufficient:

class RewriteInts(NodeTransformer):
    def visit_Num(self, node):
        if isinstance(node.n, int):
            return Call(func=Name(id='Integer', ctx=Load()),
                               args = [node], keywords=[])

        return node  # Other types of number, like floats

You would register that class with IPython to have it act on all input.


Transforming a whole cell at a time might seem obvious from the outside, but:

  1. Without parsing the code, it's very hard to know what's in a string and what isn't. If the code you want to act on can be parsed as Python code, you should use the proposed AST hook, which will naturally operate on full cells. If you want to extend Python syntax, as we do with our %magic commands, then our machinery in inputsplitter tries hard to make sure we only apply transformations in the right places.

  2. Our line-based frontends rely on being able to form valid Python syntax to judge when a block is complete. So any extensions to Python syntax have to be processed line-by-line for them. The notebook is great, but we're not abandoning our console roots just yet.

To reiterate, where the transformation expects syntactically valid Python as input, I would encourage people to use the AST hook. InputSplitter is only required when you're defining some syntax that's not valid Python itself.

Aaron Meurer

Without parsing the code, it's very hard to know what's in a string and what isn't.

We must be misunderstanding each other, because this is exactly my argument for using cell-based transformations. You can have

In [1]: "1 \
%run rm -rf /\
"

or even

In [1]: """
%run rm -rf /
"""

You have to parse the whole cell to know what is inside a string and what is not.

Our line-based front ends rely on being able to form valid Python syntax to judge when a block is complete.
The notebook is great, but we're not abandoning our console roots just yet.

OK, I see your argument. So I guess the question is, should IPython mask from the user (of the API) whether or not the front-end is line-based or not? If yes, then the line-by-line parsing into a single cell should happen at a different level than the rest of this. If no, then there are some tricky issues, especially since the line-by-line-ness of the console front-end is shaky, as it depends on the state of readline, and has things like semi-working multiline editing enabled.

I don't know the answer here (I also know much less about IPython internals than you, so I hope my input is meaningful). Perhaps there should be a way for transformers to "register" with InputSplitter to let it extend the idea of whether input should continue after a line or not.

To reiterate, where the transformation expects syntactically valid Python as input, I would encourage people to use the AST hook.

I'm not arguing there, but there is still the issue of extending the Python syntax. Most of this will happen with IPython itself, but others may want to do it too. And the public API should be good enough that IPython internals just use it.

Matthias Bussonnier
Collaborator

@asmeurer hope you don't mind, I edited you comment, you were missing a backtick which was making the all thing hard to read.

Thomas Kluyver
Collaborator

Consider if you enter:

for i in range(5):
    %run script_{i}.py
    b = """
    %foo
    """

You can't just hand it to Python's parser, because it will choke on the syntax. We need to implement some degree of parsing ourselves. Going line-by-line, that's a relatively simple regex, and inputsplitter takes care of not applying it to %foo. If our transformer were just handed the whole cell, it would need a much more complex parser to work out that the %run command should be transformed but not %foo.

It does give us a limitation: if you're extending Python syntax, your new syntax needs to be line based (like our magic commands: you can do x = %mymagic, but you can't use it in arbitrary expressions like foo(%mymagic)). But that's a restriction that hasn't yet been terribly inconvenient.

Yes, my thoughts for InputSplitter include a way for transformers to indicate when a block is complete. This would be useful for cell magics, for instance. But that still puts transformers in the same position of being fed code line-by-line; though there may be a way for them to accumulate it and return it in one go, to allow things like line breaks in a magic command (#2164).

Aaron Meurer

@Carreau that's fine.

Aaron Meurer

Hmm. We'll, I understand why you don't want to implement your own parser to truly extend the syntax, though I still don't see why you can't make it possible for others to do that if they really wanted to, and just split the code line-by-line yourself. But I guess there is more logic for line-by-line than input.split('\n'), so even so I guess a line based API could be useful. Personally, "maintaining state between lines" sounds much more tricky (and prone to getting wrong) than parsing proper.

By the way, a third potential transformation we might do in SymPy is the automatic replacement of ^ with **. This has to be done as a string transformation, not an ast transformation, because ^ has a different precedence than **.

Actually, changing precedence of operators might be an interesting thing for us to try too. I know, I know, at some point, we might as well just invent our own language and stop using Python (and actually, one of SymPy's design goals is to embrace Python's designs and not try to change them), but some modules in SymPy use certain operators, in particular the logical operators ^, &, and |, to represent products (like dot, inner, and outer products) that do not have the same usual precedence as the logical operators in SymPy. The usual solution is to tell the user to make sure to parenthesize everything, but it would be cool (again, for interactive purposes only) if we could optionally let users remove that need. I'm not sure if we'll ever do this (it's a tricky problem on its own, without IPython consideration), but I just want to point it out as another example of something that cannot be done on the ast level (because information about parentheses is destroyed at that level).

Thomas Kluyver
Collaborator
Aaron Meurer

^ -> ** is actually doable with line-by-line transformation, as it's just replacing one character (I'm assuming that the same logic used by magic can be used to not transform ^ inside a string).

The tricky one is changing the precedence order of operations. To do that, you'd have to add parentheses, which is not doable without the whole string, at least in the case of line continuations.

Thomas Kluyver
Collaborator
Aaron Meurer

I assume you essentially have to parse the expression
yourself and apply your own precedence rules?

The way I would do it is to just add parentheses (for example, I would change x + y ^ z to x + (y ^ z)). I think this is doable with tokenize. Again, maybe not trivial, but I think it's doable. At the very least, tokenize would do the parsing proper. But this is not IPython's problem, of course. As long as it gives suitable information to the API, it's up to the user to figure it out.

Takafumi Arakaki
tkf commented October 09, 2012

Currently you can't use "variable injection" {var} in cell magic, right? It will be nice to do something like:

%%sh
echo {var}

Is it in the scope of this IPEP?

Thomas Kluyver
Collaborator

I don't think that's within the scope of this work. If we want that, it should be implemented in the relevant cell magics - because it doesn't make sense for all of them, e.g. timeit.

Takafumi Arakaki
tkf commented October 09, 2012

My understanding is that {var} works unconditionally for all line magics, which makes me think that it is part of transformer.

Thomas Kluyver
Collaborator

It is currently applied for any line magic, but it's a separate step after transformation. Transformation produces valid Python code, and the functions called to run magic or system commands expand variables. E.g. !echo {a} is transformed to get_ipython().system(u'echo {a}'). You can check the results of any transformation with %hist -t.

Takafumi Arakaki
tkf commented October 09, 2012

I see. Thanks for the explanation.

Thomas Kluyver
Collaborator

The reworking of input transformation is essentially complete, so I'm going to close this.

Thomas Kluyver takluyver closed this June 29, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.