Precedence #2

Closed
jkrumbiegel opened this issue Jun 23, 2022 · 10 comments

Comments

@jkrumbiegel

Hi, I was trying to parse regexes with this parser and couldn't get the | operator to work.

For abc|de I get

sequence(
    expr(char()),
    expr(char()),
    expr(either(sequence(expr(char())), pipe(), sequence(expr(char()), expr(char())))),
)

I think this happens because I don't know how to specify that | should take the largest possible sequences on its left and right.

The relevant grammar parts are:

:sequence => P.one_or_more(:expr),
:expr => P.first(:either, :negative_lookahead, :positive_lookahead, :positive_lookbehind, :negative_lookbehind, :zero_or_more, :one_or_more, :repetition, :repetition_at_least, :repetition_from_to, :maybe, :noncapturing_group, :capturing_group, :not_set, :set, :specialized_char, :dot, :normalized_char, :char, :_begin, :_end),
:either => P.seq(:sequence, :pipe, :sequence),

I saw that https://github.com/lukehutch/pikaparser has precedence integers for clauses, but I didn't see anything like that here. Is that a missing piece of the puzzle?

@exaexa
Collaborator

exaexa commented Jun 23, 2022

Hi,

precedence with PEGs is often a bit surprising. The original pikaparser basically compiles the precedence integers into a "base grammar" using the usual fall-through construction with a base case, which removes much of the ambiguity and also lets you nicely specify the fixity of whatever operators you have.

In your case, there's no telling whether the "glue chars together" in :sequence should have lower or higher precedence than the "alternatives" in :either, and what you got is a valid parse with respect to the grammar.

I succeeded in implementing the fall-through gadget manually:

using PikaParser
const P = PikaParser

r = Dict(
    :regex => P.seq(:expr1),  # toplevel
    :expr1 => P.first(:or => P.seq(:expr1, P.token('|'), :expr2), :expr2),  # precedence level 1
    :expr2 => P.one_or_more(:basic),   # precedence level 2
    :basic => P.first(  # easily decidable single-item base cases
        :char => P.satisfy(isletter),
        :parens => P.seq(P.token('('), :expr1, P.token(')')),
    )
)

g = P.make_grammar([:regex], P.flatten(r))

input = collect("abc|d(e|fg|h)ij|kl")

p = P.parse(g, input)

println(P.traverse_match(g, p, P.find_match_at(g, p, :regex, 1), :regex))

(At precedence level 1, you may also try other combinations of :expr1 and :expr2 operands to change the fixity of the |.)
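For instance, a right-associative variant of the | rule keeps the higher-precedence operand on the left and recurses on the right (a minimal sketch reusing the rule names from above; untested):

:expr1 => P.first(
    :or => P.seq(:expr2, P.token('|'), :expr1),  # recursing on the right makes | right-associative
    :expr2,
),  # precedence level 1, right-associative variant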

On the string abc|d(e|fg|h)ij|kl it should give (after a pass through JuliaFormatter) something like:

regex(
    expr1(
        or(
            expr1(
                or(
                    expr1(expr2(basic(char()), basic(char()), basic(char()))),
                    var"or-2"(),
                    expr2(
                        basic(char()),
                        basic(
                            parens(
                                var"parens-1"(),
                                expr1(
                                    or(
                                        expr1(
                                            or(
                                                expr1(expr2(basic(char()))),
                                                var"or-2"(),
                                                expr2(basic(char()), basic(char())),
                                            ),
                                        ),
                                        var"or-2"(),
                                        expr2(basic(char())),
                                    ),
                                ),
                                var"parens-3"(),
                            ),
                        ),
                        basic(char()),
                        basic(char()),
                    ),
                ),
            ),
            var"or-2"(),
            expr2(basic(char()), basic(char())),
        ),
    ),
)

I don't immediately see why your parser would produce the "bad" parse tree, but I expect the problem is a bad interplay of greedy matching and left recursion, which always gives surprises. In this case, if you parsed with normal recursive descent, the :sequence rule would expand to :expr, which then tries :either, which again tries :sequence at the same position and then loops forever. Pikaparser avoids this by reconstructing the parse tree from the end, and it prefers matching the b|c in ab|c as a continuation via the higher-priority :either, which is a perfectly valid :expr (an even better match than the :char, because it doesn't produce a failure later), but that gives you no way to glue the chars together. In short, "priority" in first isn't the same thing as "precedence".

You might be able to quick-fix your grammar by instead using a :chars rule that greedily matches the whole char group, but I didn't try that.
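A minimal sketch of what that quick fix might look like on top of your original rule names (the :chars rule is hypothetical and untested):

:chars => P.one_or_more(:char),           # greedily glue plain chars into one node
:either => P.seq(:chars, :pipe, :chars),  # | now sees whole char runs as its operands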

Also, I should probably add some support for easily creating the precedence chains :D

@exaexa
Collaborator

exaexa commented Jun 23, 2022

Bonus: parsing the regex suffixes with normal grammars is a major PITA, but I guess here you'll have luck with just:

...
:expr2 => P.one_or_more(:expr3),
:expr3 => P.first(:one_or_more, :zero_or_more, ..., :basic),
:one_or_more => P.seq(:basic, P.token('+')),
...
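Filled in a little further under the same naming scheme, such a suffix level might look roughly like this (the :zero_or_more and :maybe rules here are illustrative guesses for the elided parts; untested):

:expr2 => P.one_or_more(:expr3),
:expr3 => P.first(:one_or_more, :zero_or_more, :maybe, :basic),
:one_or_more => P.seq(:basic, P.token('+')),   # x+
:zero_or_more => P.seq(:basic, P.token('*')),  # x*
:maybe => P.seq(:basic, P.token('?')),         # x?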

@exaexa
Collaborator

exaexa commented Jun 23, 2022

Also, with #3 merged you can write the ruleset from above roughly as:

r = Dict(
    P.@precedences (i -> Symbol(:r, i)) c n begin
      :regex => P.seq(n, P.token('|'), c)
      :sequence => P.one_or_more(n)
      P.first(
        :group => P.seq(P.token('('), n, P.token(')')),
        :char => P.satisfy(isletter),
      )
    end
)

(untested)

@jkrumbiegel
Author

Cool, I will play around with this. It already looks like your previous suggestions work, and I now understand better why it didn't work before, so thanks!

@exaexa
Collaborator

exaexa commented Jun 23, 2022

OK. Feel free to report your final grammar; it might be useful to add some extra bits from it to the tests.

@jkrumbiegel
Author

If you want you can have a look here: https://github.com/jkrumbiegel/ReadableRegex.jl/blob/explain-regex/explain_regex.jl

I wanted to do the reverse of what the package is doing: translate a given regex into function code that should hopefully be easier to understand.

For example, explain_regex(r"^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$") gives this (with a tiny bit of manual indentation to make the either arguments more salient):

either(
    BEGIN *
        maybe('-') *
        between(0, 2, char_in('0':1:'9')) *
        maybe(capture('.' * between(1, 2, char_in('0':1:'9')))) *
        END,
    BEGIN *
        maybe('-') *
        capture("100") *
        maybe(capture('.' * between(1, 2, char_in('0')))) *
        END,
)

@exaexa
Collaborator

exaexa commented Jun 23, 2022

Aaaah nice. Regex is a bit write-only tbh, but this should work. If I get it right, you are able to reconstruct actual ReadableRegex structure from any regex pattern?

Anyway, let me know in case you hit any difficulties!

(also, how new is the import Package as P syntax? the last time I tried it didn't work :D :D )

@jkrumbiegel
Author

jkrumbiegel commented Jun 24, 2022

Yes, that's the goal. And yeah, it's mostly write-only, but it's very hard to understand as it is, like spotting a | in the middle of a long regex, etc. So I thought it would be cool to have that split out a bit, and with the ReadableRegex functions you can then more easily try out changes. But anyway, it's mostly a didactic thing :)

@exaexa
Collaborator

exaexa commented Jun 24, 2022

Thinking about that, is there any regular language matcher in Julia? That way we could have completely regex-less regular language matching. (Technically you could translate the regular expressions to pikaparser rules and match them as-is, but compared to the plain DFA/NFA required for regexes there's an awful lot of overhead here...)
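For comparison, here is a minimal sketch of what "plain DFA" matching looks like in Julia, with a hand-written transition table for the toy language (ab)*c (illustrative only, not part of PikaParser):

# states: 1 = start, 2 = just saw 'a', 3 = accept (saw final 'c'); 0 = reject
function dfa_match(input::AbstractString)
    step(state, c) =
        state == 1 && c == 'a' ? 2 :
        state == 1 && c == 'c' ? 3 :
        state == 2 && c == 'b' ? 1 :
        0
    state = 1
    for c in input
        state = step(state, c)
        state == 0 && return false  # dead state, no backtracking needed
    end
    return state == 3
end

dfa_match("ababc")  # true
dfa_match("abac")   # false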

@exaexa
Collaborator

exaexa commented Jun 27, 2022

Closing this because the precedence helpers are going to be released ASAP. Thanks for the input!

@exaexa exaexa closed this as completed Jun 27, 2022