Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group line and column metadata needs special handling #107

Closed
lysnikolaou opened this issue Dec 10, 2019 · 19 comments
Closed

Group line and column metadata needs special handling #107

lysnikolaou opened this issue Dec 10, 2019 · 19 comments

Comments

@lysnikolaou
Copy link
Collaborator

lysnikolaou commented Dec 10, 2019

When writing the action for the group rule, I found out that parsing something like (yield a) returns the following AST object:

Module(
    body=[
        Expr(
            value=Yield(
                value=Name(
                    id="a", ctx=Load(), lineno=1, col_offset=7, end_lineno=1, end_col_offset=8
                ),
                lineno=1,
                col_offset=1,
                end_lineno=1,
                end_col_offset=8,
            ),
            lineno=1,
            col_offset=0,
            end_lineno=1,
            end_col_offset=9,
        )
    ],
    type_ignores=[],
)

That means that the Expr node's start column offset is 0, although the yield node's is 1. The correct action for group is thus:

group[expr_ty]: '(' a=(yield_expr | full_expression) ')' { a }

but the surrounding Expr node needs to have a different column offset and I can't seem to come up with a way of handling this, so any ideas are welcome!

@gvanrossum
Copy link
Collaborator

Hm, so Expr is the node type for an expression statement. I suspect the problem is not specific to group but occurs for dict and set as well, and perhaps others. But can it happen for interior nodes?

I guess this is another reason why I think we may have to come up with a way to gather line/column information out of band, rather than passing it to the node constructors using EXTRA().

(PS. I reformatted your code so I could read it. The easiest way to do it that I've found is to write it to a file and run black over it... I'm weaponizing this as we speak. :-)

@lysnikolaou
Copy link
Collaborator Author

I suspect the problem is not specific to group but occurs for dict and set as well, and perhaps others.

Having refactored all of the atom rules(list, tuple, dict, comprehensions, set and so on) it is actually specific only to group, because it's the only one where the metadata of the inner node is different to that of the outer statement. And that makes sense, in a dict for example, the dict node starts at the opening brace and ends at the closing brace. The problem with group is that the parentheses are not part of the expression, but are only used for, well, grouping.

PS Very good idea! I was trying to do it using pprint but couldn't get it formatted the way I wanted.

@gvanrossum
Copy link
Collaborator

Ah, so it's because the group disappears. Can we introduce a helper object? Or does it have to be an existing expr_ty?

I propose not to worry about this -- I'm not sure it's worth reproducing this particular behavior. OTOH maybe the solution really is to take the line/col info out of band.

PS. #108

@lysnikolaou
Copy link
Collaborator Author

What exactly do you mean with out of band?

@gvanrossum
Copy link
Collaborator

We could add some code to the generated code that extracts the start line/col from the upcoming token at the top of a rule implementation function, and extracts the end line/col from the last token just before calling the action. Something like this:

    if (p->mark == p->fill) {
        if (fill_token(p) < 0) return -1;
    }
    int start_lineno = p->tokens[mark]->lineno;
    int start_col_offset = p->tokens[mark]->col_offset;

before diving into the first alternative (but after checking the memo), and then in each rule, just before the action is executed:

    res = <action>;  // or res = CONSTRUCTOR(...);

insert this:

    assert(p->mark > 0);
    int end_lineno = p->tokens[p->mark-1]->end_lineno;
    int end_col_offset = p->tokens[p->mark-1]->end_col_offset;

And then, finally, change the EXTRA macro to just be this:

#define EXTRA start_lineno, start_col_offset, end_lineno, end_col_offset, p->arena

We can then drop a whole bunch of infrastructure around EXTRA.

@lysnikolaou
Copy link
Collaborator Author

That seems like a nice solution. It also resembles what Python/ast.c does with nodes a bit more closely. No clue if that's good or bad!

In #109, I'm just ignoring this problem, until we have a final decision on how to proceed with this one.

@gvanrossum
Copy link
Collaborator

gvanrossum commented Dec 11, 2019 via email

@lysnikolaou
Copy link
Collaborator Author

OK! I'll start with it tonight.

@lysnikolaou
Copy link
Collaborator Author

lysnikolaou commented Dec 12, 2019

After spending some time on this, the only problem I'm having now has to do with the elif rule, because the correct column offset for it is the one where the condition starts and not the one where the keyword elif itself starts. What would the correct solution for that be? The first one that jumps to mind, is to just have a different token offset for the elif rule in the c generator and have something like int start_col_offset = p->tokens[mark+1];, which needs the condition p->mark < p->fill to be true, right?

@lysnikolaou
Copy link
Collaborator Author

The first one that jumps to mind, is to just have a different token offset for the elif rule in the c generator and have something like int start_col_offset = p->tokens[mark+1];, which needs the condition p->mark < p->fill to be true, right?

That indeed works and the whole test suite now passes. The problem is as we refactor more of simpy.gram more edge cases might appear, which might overcomplicate the generator. Should I push a PR just for you to see what I've done?

@gvanrossum
Copy link
Collaborator

Hm, this seems due to some idiosyncrasy in ast.c (ast_for_if_stmt()). I wonder if we could submit a PR to ast.c that fixes the column offset there? It could just use CHILD(n, off - 1) in the asdl_seq_SET() call.

@lysnikolaou
Copy link
Collaborator Author

I can do that, should I open an issue on bpo?

@gvanrossum
Copy link
Collaborator

Sure. Link it here.

@lysnikolaou
Copy link
Collaborator Author

Opened an issue at https://bugs.python.org/issue39031.

@lysnikolaou
Copy link
Collaborator Author

Also opened a PR on python/cpython#17582.

@lysnikolaou
Copy link
Collaborator Author

Two new observations:

First of all, after the newest refactorings we've got two new problems.

  1. If there are function decorators, when parsing a function definition, then the decorator is the first thing parse, so it is what lineno/col_offset points at.
  2. If there is a kwarg in the parameter list of a function, then the lineno/col_offset points to the first star, instead of pointing to the parameter name.

Second of all, there is a REALLY SIGNIFICANT runtime increase. Running simpy_cpython now lasts 188-244 seconds on my machine. Not sure what causes this.

@gvanrossum
Copy link
Collaborator

gvanrossum commented Dec 14, 2019 via email

@lysnikolaou
Copy link
Collaborator Author

I'm not 100% sure on that. The runtime increase is only reproducible on that branch on my local environment, but Travis thinks otherwise, as it completed in a little more that 6 seconds in #121. I need to investigate further.

@lysnikolaou
Copy link
Collaborator Author

Closing due to #121.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants