Precedence #2

Closed
jkrumbiegel opened this issue Jun 23, 2022 · 10 comments

Comments

@jkrumbiegel

Hi, I was trying to parse regexes with this parser and couldn't get the | operator to work.

For abc|de I get

sequence(
    expr(char()),
    expr(char()),
    expr(either(sequence(expr(char())), pipe(), sequence(expr(char()), expr(char())))),
)

I think this happens because I don't know how to specify that | should take the largest possible sequences on its left and right.

The relevant grammar parts are:

:sequence => P.one_or_more(:expr),
:expr => P.first(:either, :negative_lookahead, :positive_lookahead, :positive_lookbehind, :negative_lookbehind, :zero_or_more, :one_or_more, :repetition, :repetition_at_least, :repetition_from_to, :maybe, :noncapturing_group, :capturing_group, :not_set, :set, :specialized_char, :dot, :normalized_char, :char, :_begin, :_end),
:either => P.seq(:sequence, :pipe, :sequence),

I saw that https://github.com/lukehutch/pikaparser has precedence integers for clauses, but I didn't see anything like that here. Is that a missing piece of the puzzle?

@exaexa
Collaborator

exaexa commented Jun 23, 2022

Hi,

precedence with PEGs is often a bit surprising. The original pikaparser basically compiles the precedence integers into a "base grammar" using the usual fall-through construction with a base case, which removes much of the ambiguity and also lets you nicely specify the fixity of whatever operators you have.

In your case, there's no telling whether the "glue chars together" in :sequence should have lower or higher precedence than the "alternatives" in :either, and what you got is a valid parse with respect to the grammar.

I succeeded in implementing the fall-through gadget manually:

using PikaParser
const P = PikaParser

r = Dict(
    :regex => P.seq(:expr1),  # toplevel
    :expr1 => P.first(:or => P.seq(:expr1, P.token('|'), :expr2), :expr2),  # precedence level 1
    :expr2 => P.one_or_more(:basic),   # precedence level 2
    :basic => P.first(  # easily decidable single-item base cases
        :char => P.satisfy(isletter),
        :parens => P.seq(P.token('('), :expr1, P.token(')')),
    )
)

g = P.make_grammar([:regex], P.flatten(r))

input = collect("abc|d(e|fg|h)ij|kl")

p = P.parse(g, input)

println(P.traverse_match(g, p, P.find_match_at(g, p, :regex, 1), :regex))

(At precedence level 1, you may also try other combinations of :expr1 and :expr2 operands to change the fixity of the |.)
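For instance, a right-associative variant of the | rule keeps the higher-precedence operand on the left and recurses on the right (a minimal sketch reusing the rule names from above; untested):

:expr1 => P.first(
    :or => P.seq(:expr2, P.token('|'), :expr1),  # recursing on the right makes | right-associative
    :expr2,
),  # precedence level 1, right-associative variant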

On the string abc|d(e|fg|h)ij|kl it should give (after a pass through JuliaFormatter) something like:

regex(
    expr1(
        or(
            expr1(
                or(
                    expr1(expr2(basic(char()), basic(char()), basic(char()))),
                    var"or-2"(),
                    expr2(
                        basic(char()),
                        basic(
                            parens(
                                var"parens-1"(),
                                expr1(
                                    or(
                                        expr1(
                                            or(
                                                expr1(expr2(basic(char()))),
                                                var"or-2"(),
                                                expr2(basic(char()), basic(char())),
                                            ),
                                        ),
                                        var"or-2"(),
                                        expr2(basic(char())),
                                    ),
                                ),
                                var"parens-3"(),
                            ),
                        ),
                        basic(char()),
                        basic(char()),
                    ),
                ),
            ),
            var"or-2"(),
            expr2(basic(char()), basic(char())),
        ),
    ),
)

I don't immediately see why your parser would produce the "bad" parse tree, but I expect the problem is a bad interplay of greedy matching and left recursion, which always gives surprises. In this case, if you parsed with normal recursive descent, the :sequence rule would expand to :expr, which then tries :either, which again tries :sequence at the same position and then loops forever. Pikaparser avoids this by reconstructing the parse tree from the end, and it prefers matching the b|c in ab|c as a continuation via the higher-priority :either, which is a perfectly valid :expr (an even better match than the :char, because it doesn't produce a failure later), but that gives you no way to glue the chars together. In short, "priority" in first isn't the same thing as "precedence".

You might be able to quick-fix your grammar by instead using a :chars rule that greedily matches the whole char group, but I didn't try that.
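A minimal sketch of what that quick fix might look like on top of your original rule names (the :chars rule is hypothetical and untested):

:chars => P.one_or_more(:char),           # greedily glue plain chars into one node
:either => P.seq(:chars, :pipe, :chars),  # | now sees whole char runs as its operands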

Also, I should probably add some support for easily creating the precedence chains :D

@exaexa
Collaborator

exaexa commented Jun 23, 2022

Bonus: parsing the regex suffixes with normal grammars is a major PITA, but I guess here you'll have luck with just:

...
:expr2 => P.one_or_more(:expr3),
:expr3 => P.first(:one_or_more, :zero_or_more, ..., :basic),
:one_or_more => P.seq(:basic, P.token('+')),
...
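Filled in a little further under the same naming scheme, such a suffix level might look roughly like this (the :zero_or_more and :maybe rules here are illustrative guesses for the elided parts; untested):

:expr2 => P.one_or_more(:expr3),
:expr3 => P.first(:one_or_more, :zero_or_more, :maybe, :basic),
:one_or_more => P.seq(:basic, P.token('+')),   # x+
:zero_or_more => P.seq(:basic, P.token('*')),  # x*
:maybe => P.seq(:basic, P.token('?')),         # x?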

@exaexa
Collaborator

exaexa commented Jun 23, 2022

Also, with #3 merged you can write the ruleset from above roughly as:

r = Dict(
    P.@precedences (i -> Symbol(:r, i)) c n begin
      :regex => P.seq(n, P.token('|'), c)
      :sequence => P.one_or_more(n)
      P.first(
        :group => P.seq(P.token('('), n, P.token(')')),
        :char => P.satisfy(isletter),
      )
    end
)

(untested)

@jkrumbiegel
Author

Cool, I will play around with this. It already looks like your previous suggestions work, and I now understand better why it didn't work before, so thanks!

@exaexa
Collaborator

exaexa commented Jun 23, 2022

OK. Feel free to report your final grammar; it might be useful to add some extra bits from it to the tests.

@jkrumbiegel
Author

If you want you can have a look here: https://github.com/jkrumbiegel/ReadableRegex.jl/blob/explain-regex/explain_regex.jl

I wanted to do the reverse of what the package is doing: translate a given regex into function code that should hopefully be easier to understand.

For example, explain_regex(r"^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$") gives this (with a tiny bit of manual indentation to make the either arguments more salient):

either(
    BEGIN *
        maybe('-') *
        between(0, 2, char_in('0':1:'9')) *
        maybe(capture('.' * between(1, 2, char_in('0':1:'9')))) *
        END,
    BEGIN *
        maybe('-') *
        capture("100") *
        maybe(capture('.' * between(1, 2, char_in('0')))) *
        END,
)

@exaexa
Collaborator

exaexa commented Jun 23, 2022

Aaaah nice. Regex is a bit write-only tbh, but this should work. If I get it right, you are able to reconstruct actual ReadableRegex structure from any regex pattern?

Anyway, let me know in case you hit any difficulties!

(also, how new is the import Package as P syntax? the last time I tried it didn't work :D :D )

@jkrumbiegel
Author

jkrumbiegel commented Jun 24, 2022

Yes, that's the goal. And yeah, it's mostly write-only, but it's very hard to understand as it is, like spotting a | in the middle of a long regex, etc. So I thought it would be cool to have that split out a bit, and with the ReadableRegex functions you can then more easily try out changes. But anyway, it's mostly a didactic thing :)

@exaexa
Collaborator

exaexa commented Jun 24, 2022

Thinking about that, is there any regular language matcher in Julia? That way we could have completely regex-less regular language matching. (Technically you could translate the regular expressions to pikaparser rules and match them as-is, but compared to the plain DFA/NFA required for regexes there's an awful lot of overhead here...)
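For comparison, here is a minimal sketch of what "plain DFA" matching looks like in Julia, with a hand-written transition table for the toy language (ab)*c (illustrative only, not part of PikaParser):

# states: 1 = start, 2 = just saw 'a', 3 = accept (saw final 'c'); 0 = reject
function dfa_match(input::AbstractString)
    step(state, c) =
        state == 1 && c == 'a' ? 2 :
        state == 1 && c == 'c' ? 3 :
        state == 2 && c == 'b' ? 1 :
        0
    state = 1
    for c in input
        state = step(state, c)
        state == 0 && return false  # dead state, no backtracking needed
    end
    return state == 3
end

dfa_match("ababc")  # true
dfa_match("abac")   # false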

@exaexa
Collaborator

exaexa commented Jun 27, 2022

Closing this because the precedence helpers are going to be released ASAP. Thanks for the input!

@exaexa exaexa closed this as completed Jun 27, 2022