Ignoring contents of lines that aren't recognized #53

cmaughan · 2016-05-22T21:07:33Z

How would you ignore content that isn't recognized by a rule?
Supposing I have something like this:
string : /"(\\.|[^"])*"/ ;
lang : /^/ <string> /$/ ;

This should recognize quoted strings, but would fail on anything else. So how do I define the language so that it collects or discards anything that isn't a string?

orangeduck · 2016-05-22T21:16:03Z

Perhaps you can just OR it with a catch-all and then check after the parse which one got captured?

lang : /^/ (<string> | /.*/) /$/

cmaughan · 2016-05-25T08:45:01Z

That works, I think, thanks. Building the combinators from a grammar is a fantastic way to produce an AST; I like it a lot. Two things that I haven't figured out:

How do I discard items from the AST? Is that like an mpc_null? Is that something that would be easy to add? For example:
num [null] : /[0-9]*/ ;
How do I stop newlines being skipped for an occasional special rule? Turning on whitespace sensitivity isn't an option because it complicates everything else, but having a way to stop a rule at a newline would be cool. Something like:
comment : "//" <any_chars> <match_eol> ;

cmaughan · 2016-05-25T08:46:03Z

(perhaps I can define special rules myself and reference them in the grammar?)

orangeduck · 2016-05-25T08:54:39Z

Hi @cmaughan,

Actually, that is a great idea I hadn't thought of: referencing, from the grammar, rules that were built with the normal parser combinator approach. That is probably the easiest way to do something like discarding contents.

For the newline handling, you may be able to match the whitespace explicitly in a regex. Your previous example could be given as comment : /\\/\\/[^\\n]*\\n/ ; which is essentially "two slashes, zero or more characters that are not a newline, then a newline".
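To make the behaviour of that pattern concrete, here is a minimal standalone sketch. It assumes POSIX regcomp/regexec as a stand-in for mpc's own regex engine, and it assumes the doubled backslashes in the comment above are C string escaping, so the pattern is written unescaped here. The helper names match_line_comment and count_line_comments are hypothetical and are not part of mpc.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns the length of a "//" line comment at the start of `s`,
 * including its terminating newline, or 0 if `s` does not start with one.
 * POSIX regex is used here only as a stand-in for mpc's regex engine. */
size_t match_line_comment(const char *s) {
    regex_t re;
    regmatch_t m;
    size_t len = 0;

    assert(s != NULL);

    /* Two slashes, any run of non-newline characters, then a newline. */
    if (regcomp(&re, "^//[^\n]*\n", REG_EXTENDED) != 0) {
        return 0;
    }
    if (regexec(&re, s, 1, &m, 0) == 0) {
        len = (size_t)(m.rm_eo - m.rm_so);
    }
    regfree(&re);
    return len;
}

/* Counts consecutive line comments at the start of `s` by applying the
 * matcher repeatedly, which is what a `*` repetition in a grammar does. */
int count_line_comments(const char *s) {
    int count = 0;
    size_t len;

    while ((len = match_line_comment(s)) > 0) {
        s += len;
        count++;
    }
    return count;
}
```

Two things follow from the pattern itself: the newline is part of the match, so it ends up in whatever the match is stored in, and a single application only consumes one comment, so several comments in a row need a repetition around the rule.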

cmaughan · 2016-05-25T09:10:44Z

Hi :) I'd wondered about using a regex like that; it seems to work, but doesn't capture 2 comments in my unit test: "\ comment 1\n\ comment 2". Maybe I'm missing something (or the grammar generation is still being a bit aggressive about discarding newlines behind my back). It also has the side effect of capturing the newline into the AST node. But I'll try making custom rules to catch these special cases - a useful thing to be able to do.

cmaughan · 2016-05-25T09:13:13Z

Oh, the 2 line thing was the final rule not checking for many ;)

cmaughan · 2016-05-25T11:00:37Z

I tried defining a parser manually and adding it into the grammar, but I get a crash here; I think this is probably because my mpc_new/mpc_define parser somehow doesn't have the AST info. The value of 'a' is an invalid pointer.
mpc_ast_t *mpc_ast_add_tag(mpc_ast_t *a, const char *t)

orangeduck · 2016-05-25T18:21:33Z

Hmm, perhaps you can post your code so I can see exactly what you tried?

cmaughan · 2016-05-25T18:46:21Z

Does this help?
I just tried creating a recognizer for "#line" and referencing it in the linepragma line...

auto line = mpc_new("line");
mpc_define(line, mpc_oneof("#line"));
    mpca_lang(MPCA_LANG_PREDICTIVE,
        R"(
          number                : /[0-9]+/ ;
          quoted_string         : /"(\\.|[^"])*"/ ;
          linepragma            : <line> <number> <quoted_string>;
          parser                    : /^/ (<linepragma>)* /$/ ;
        )",
        line, number, quoted_string, linepragma, parser, NULL);

orangeduck · 2016-06-05T14:30:18Z

Sorry for the late reply. What is up with the R before the string and the () characters in the string? I also noticed auto, are you using C++ or something?

Did you mean to use mpc_oneof("#line")? That parser recognizes any single one of the characters in the string "#line", not the whole string.

Just wondering what your particular use case is as this code and grammar look a little strange.
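The difference between "one of these characters" and "this exact string" can be sketched without mpc at all. The two helpers below, match_oneof and match_literal, are hypothetical illustrations of the two semantics and are not mpc functions; each returns how many characters of the input it would consume.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "One of" semantics: succeeds if the next input character is any single
 * character from `set`, and consumes exactly that one character.
 * Returns the number of characters consumed (0 on failure). */
size_t match_oneof(const char *input, const char *set) {
    assert(input != NULL && set != NULL);
    /* strchr would also find the terminating '\0', so rule that out. */
    if (*input != '\0' && strchr(set, *input) != NULL) {
        return 1;
    }
    return 0;
}

/* Literal semantics: succeeds only if the input starts with all of `lit`.
 * Returns the number of characters consumed (0 on failure). */
size_t match_literal(const char *input, const char *lit) {
    size_t n;

    assert(input != NULL && lit != NULL);
    n = strlen(lit);
    return strncmp(input, lit, n) == 0 ? n : 0;
}
```

With the input "#line 10", the one-of version consumes only the '#', while the literal version consumes all five characters of "#line". The one-of version also accepts input starting with 'l', 'i', 'n' or 'e', which is rarely what is wanted for a keyword.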

cmaughan · 2016-06-05T15:33:56Z

Yes, I'm using C++11; the R marks a raw string literal (so you don't need to escape things, or write multiple lines of ""). auto just lets the compiler deduce the type of 'line'.
I guess I used 'oneof' incorrectly, but that doesn't change the fact that this crashes in mpc_ast_add_tag (with a bad pointer IIRC). It should still work, whatever the parser for 'line' does, right?

orangeduck · 2016-06-09T15:40:54Z

True - let me investigate this in a bit more detail at the weekend.

cmaughan · 2016-06-09T20:05:59Z

I might get there before you, since I'm getting to the point where I need it to work; will let you know if I have time to figure it out!

orangeduck · 2016-06-11T12:10:24Z

Hi,

Looks like you were right - the error is because line is a parser which returns a const char* - but mpca_lang expects all the input parsers to return mpc_ast_t*.

The fix is to make line into a parser which returns an AST with the thing it parses as the contents. This can be done using the mpcf_str_ast apply function. It is also worthwhile to give the returned tree a tag. Usually string literals (which I think is what you intended to parse) are tagged with string. So the only change needed is to change the definition of line to the following:

mpc_define(line, mpca_tag(mpc_apply(mpc_sym("#line"), mpcf_str_ast), "string"));

Here mpc_sym is a string literal with trailing whitespace removed, mpcf_str_ast converts the string output by this parser to an ast, and mpca_tag tags this ast with the tag "string".

It isn't ideal, but in this case I think it was fine for mpc to crash as it is the programmer's responsibility to make sure all the expected parser input / output types match.

I've pushed an update to the repo with a new test in grammar.c if you want to see exactly how I got it working.

Hope this helps,

Dan

cmaughan · 2016-06-13T17:38:48Z

Thanks for investigating this, I think it makes sense! Is the 'string' tag the same thing that I'd see if I had a grammar statement like this: string : /"[a-z]"/ ; i.e. is it just the assigned tree tag?
I can see how to apply this, so all good.
I'm still trying to figure out if there's an easy way to require a string in the input and discard it from the AST automatically. For example:

int foo = 5;

Suppose I don't care about 'int' and '=' but it's part of the language spec.
I want a simple parser that checks the grammar but discards the unwanted nodes when it builds the tree. Something like this: parser : "int"% <ident> '='% <num> ;

The "%" is like saying 'Require this, but don't put it in the AST'.
I guess walking the tree afterwards is a way to do that, but it seems like it would be convenient to prune it automatically as it is generated. Maybe with mpc_pass, or something?

The only other comment I have is that the error reporting is a bit vague and hard to follow. I often get something like 'expected ', or ', or '.....''. Which can be tricky!
It might be useful to instead print the name of the grammar tags that were tried. Like 'expected <eol>, <string> or <number>'

Anyway, thanks for figuring it out!

orangeduck · 2016-06-13T18:52:55Z

Hi @cmaughan,

The "string" tag is more or less like that. Really it is more like the automatic tags that get added by the grammar, e.g. if a rule was parsed with a regex it will automatically get "regex" in its tags, but basically these are the same concept.

Using the combinators, discarding some part of the input is typically done in the fold function. For example, this parser parses the expression you mentioned (int foo = 5), returns only the elements you wanted as an mpc_ast_t*, and frees the elements that are not required (warning - I've not actually tested this code).

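/* Fold for the four results of "int" <ident> "=" <num>: keep the
   identifier and the number, discard the keyword and the equals sign. */
static mpc_val_t *custom_fold(int n, mpc_val_t **xs) {
    mpc_ast_t *r = mpc_ast_new("parser|>", "");
    (void) n; /* always 4 for this parser */
    /* mpc_ast_new copies the strings it is given... */
    mpc_ast_add_child(r, mpc_ast_new("ident", xs[1]));
    mpc_ast_add_child(r, mpc_ast_new("num", xs[3]));
    /* ...so all four input strings can be freed, wanted or not */
    free(xs[0]); free(xs[1]); free(xs[2]); free(xs[3]);
    return r;
}

/* mpc_and takes n parsers followed by n-1 destructors */
mpc_parser_t *p = mpc_and(4, custom_fold,
    mpc_sym("int"),
    mpc_tok(mpc_ident()),
    mpc_sym("="),
    mpc_digits(),
    free, free, free);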

So finally this parser p returns an mpc_ast_t*, which you can reference directly from mpca_lang.

Probably the normal/natural way to do this is to prune the tree afterwards but I can see the advantage of pruning at parse time so let me think about what might be reasonable syntax to do so. Do you know if YACC/Bison supports this at all?
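For comparison, pruning after the parse can be sketched roughly as follows. This is only an illustration under stated assumptions: node_t is a hypothetical stand-in tree type, not mpc's mpc_ast_t, and the tags "keyword" and "punct" used in the usage note are made up for the example. Strings are borrowed (for example string literals) to keep the sketch short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parse-tree node (NOT mpc's mpc_ast_t).
 * `tag` and `contents` are borrowed pointers and are never freed here. */
typedef struct node {
    const char *tag;
    const char *contents;
    size_t children_num;
    struct node **children;
} node_t;

node_t *node_new(const char *tag, const char *contents) {
    node_t *n = calloc(1, sizeof(node_t));
    if (n == NULL) { abort(); }
    n->tag = tag;
    n->contents = contents;
    return n;
}

void node_add_child(node_t *parent, node_t *child) {
    node_t **grown;

    assert(parent != NULL && child != NULL);
    grown = realloc(parent->children,
                    (parent->children_num + 1) * sizeof(node_t *));
    if (grown == NULL) { abort(); }
    parent->children = grown;
    parent->children[parent->children_num++] = child;
}

/* Frees a node and everything below it. */
void node_free(node_t *n) {
    size_t i;
    for (i = 0; i < n->children_num; i++) {
        node_free(n->children[i]);
    }
    free(n->children);
    free(n);
}

/* Removes, at any depth below `n`, every subtree whose tag equals `tag`.
 * Surviving children keep their relative order. */
void node_prune(node_t *n, const char *tag) {
    size_t i, kept = 0;

    for (i = 0; i < n->children_num; i++) {
        node_t *child = n->children[i];
        if (strcmp(child->tag, tag) == 0) {
            node_free(child);
        } else {
            node_prune(child, tag);
            n->children[kept++] = child;
        }
    }
    n->children_num = kept;
}
```

For a tree built from int foo = 5 with children tagged "keyword", "ident", "punct" and "num", calling node_prune(root, "keyword") and then node_prune(root, "punct") leaves only the ident and num children. The cost compared with pruning in the fold is an extra walk over the tree, plus the allocation of nodes that are thrown away immediately.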

Regarding the error messages: this is actually already supported. You just need to write a human-readable name as a string in between the rule name and the colon, e.g.:

number "Number" : /-?[0-9]+/ ;

Dan

cmaughan · 2016-06-28T17:59:19Z

Thanks for the tip on error strings - that works well and makes things much clearer. Might be worth updating the samples so people know about it.
I don't know if YACC/Bison support pruning the tree - I'd imagine so, but not sure. It's been a long time since I used those tools.

cmaughan closed this as completed May 25, 2016

cmaughan reopened this May 25, 2016

orangeduck closed this as completed May 13, 2018
