A way to reject partial parses #117
Comments
In the lexer you could define special whitespace-preceded lexemes. When these are recognized they can be treated as you wish, for example as an error. Marpa::R2 has "lexeme events" and that kind of mechanism might be helpful. I noticed the attempt at explicit representation of whitespace in the grammar for issue #116. This could also be a solution, and it need not be ambiguous. I experimented with this, and my experience suggested caution:
I gave up on it in favor of the "discard" lexemes implemented for Marpa::R2, because getting it all right was like being pecked to death by ducks. Nonetheless, such a scheme does lend itself to the kind of locally non-orthogonal treatments of whitespace you have in mind. |
For context, the EBNF source of my grammar (which I'm converting mechanically into MARPA rules) is in the ebnf markdown blocks here—search for the string "(no-" to find the special rules that don't implicitly allow whitespace. Regarding the whitespace-preceded lexemes idea:
With regard to explicit representation of whitespace: no, it needn't be ambiguous, but the naïve approach I currently have, of interjecting optional space tokens into the rules between every symbol, does in fact cause a lot of ambiguity. Any nullable symbol surrounded by a pair of "any amount of whitespace" (AAoW) symbols will cause whitespace in that position to be parsed ambiguously, as part of either the first AAoW symbol or the second. That's why I would need the NNF transformation: to avoid adding these AAoW symbols around nullable symbols. I don't think I understand what your experience suggested. Is there an automated transformation for it that I can apply to my specification? |
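The doubled-AAoW ambiguity is easy to demonstrate concretely. Below is a minimal sketch in Python (illustrative only; the grammar and symbol names are invented for the example) that counts the ways a run of spaces can be split across a rule of the form `S -> AAoW Nullable AAoW`:

```python
# Toy demonstration of the ambiguity described above.  The rule is
# S -> AAoW Nullable AAoW, where AAoW ("any amount of whitespace")
# derives any run of spaces, including none, and Nullable derives only
# the empty string.  Counting derivations shows that any whitespace in
# that position parses ambiguously.

def aaow(s):
    # AAoW: matches any all-space string, including the empty string.
    return all(c == " " for c in s)

def nullable(s):
    # A nulling symbol: derives only the empty string.
    return s == ""

def count_parses(inp):
    # Brute-force every split of the input across AAoW Nullable AAoW.
    count = 0
    for i in range(len(inp) + 1):
        for j in range(i, len(inp) + 1):
            if aaow(inp[:i]) and nullable(inp[i:j]) and aaow(inp[j:]):
                count += 1
    return count

print(count_parses(""))    # 1 parse: no whitespace, no ambiguity
print(count_parses(" "))   # 2 parses: the space belongs to either AAoW
print(count_parses("  "))  # 3 parses: grows with the whitespace length
```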
I did my work on explicit whitespace as research, possibly toward an automated transformation, but I abandoned the idea in favor of the "discard" lexeme approach. My idea was, for those lexemes which do not allow preceding whitespace, to define "error lexemes", which consist of that lexeme preceded by whitespace. If you see one of these, you know someone has put whitespace in the wrong spot, and you handle it accordingly. You need only define such "error lexemes" for the cases where whitespace is, indeed, an error. In correct input, you won't see any "error lexemes". |
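A minimal sketch of the "error lexeme" idea, in Python with invented token names (the real rules would come from the grammar in question):

```python
import re

# Sketch of the "error lexeme" idea.  For a lexeme that must not be
# preceded by whitespace (here a hypothetical member-access "." token),
# we also define an error lexeme consisting of whitespace plus that
# token.  Seeing ERROR_DOT in the token stream means whitespace appeared
# where it is an error; correct input never produces it.

TOKEN_SPECS = [
    ("ERROR_DOT", r"[ \t]+\."),   # whitespace then "." -- the error lexeme
    ("DOT",       r"\."),
    ("IDENT",     r"[A-Za-z_]\w*"),
    ("WS",        r"[ \t]+"),     # ordinary whitespace, discarded below
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPECS))

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}")
        if m.lastgroup != "WS":   # discard plain whitespace
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(lex("a.b"))   # [('IDENT', 'a'), ('DOT', '.'), ('IDENT', 'b')]
print(lex("a .b"))  # the illegal space surfaces as an ERROR_DOT token
```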
Sure… but I don't have any lexemes that are defined to not allow preceding whitespace, and it's not clear to me that the grammar I am encoding can be represented in those terms. |
Re "I don't have any lexemes that are defined to not allow preceding whitespace" in #117 (comment), my comments were attempting to be responsive to this from #117 (comment):
|
I understand that. What I'm saying is that the whitespace restrictions in my grammar rules are expressed in terms of the LHS nonterminal being recognized, not in terms of the participating lexemes. You could of course rewrite the grammar to attach those restrictions to lexemes, but that would be a massive rewrite that I don't know how to do mechanically. At this point it is starting to look like I should switch to my own implementation of the algorithms, because I can easily add the necessary functionality. |
Any objection to closing this issue? |
🤷 I don't think my problem's been addressed; in case it wasn't clear, I think the fact that I don't have a solution means I'm going to stop using MARPA. The only reason I can see for leaving the issue open would be if you intended to do something about it, but it sounds as though you don't. If I've understood correctly, closing it would be appropriate. |
Tomorrow I'll look this issue over and try to see what I have missed. Is there a write-up somewhere of your requirement or something similar? The decision of which parser to use belongs to the user, and I try to take from these choices what I can for Marpa, and otherwise to respect the user's choice and not take it personally. In case I don't get another opportunity, I want to thank you for all these issues. They include insightful editing suggestions and the detection of an obscure but serious bug. It will take me weeks to work through them all. The API doc will be significantly better, and Marpa will be significantly improved, because of the time you've spent here. |
You're welcome, and thank you for the work you've done here. If it turns out I switch implementations I hope I can keep asking you questions? If so, what's the best medium? |
Best to communicate via a Github issue where it's a specific issue on a repo, and via the IRC channel otherwise. One reason I may have seemed reticent is that I am avoiding the issue of "futures". I do have my own ideas for new directions, which may include things like undoing bits of parsing already done. None of these are things you'll see in the next few weeks. |
I mean questions regarding what MARPA does internally, and why, rather than anything about its future direction. |
I took a closer look at the Val doc. What does the (no-whitespace) notation mean?
It's an EBNF notation I have not seen before. Naively, it seems to say "no whitespace, but yes optional whitespace at the beginning". |
Is it a meta-notation saying that the lines are not tokenized, but neither is the lexical whitespace in the text to be taken literally, instead it is indicated via symbols like whitespace-opt?
Yes, exactly. The name is badly chosen; it should be (no-implicit-whitespace).
On Jun 15, 2022, at 8:03 PM, Jeffrey Kegler ***@***.***> wrote:
Is it a meta-notation saying that the lines are not tokenized, but neither is the lexical whitespace in the text to be taken literally, instead it is indicated via symbols like whitespace-opt?
|
@dabrahams OK. I looked over the Val doc.
Re the "go ambiguous then prune" approaches: I've done some work in creating tools for these, but the IF is not documented in Libmarpa, although it is in Marpa::R2 (as the ASF interface) for Perl users. I'm not enthusiastic about its chances because my work with general parsing has given me an enormous respect for exponential growth --- creating an ambiguous ASF, then pruning, is likely to be expensive. I'll say more if you want, but here I'll move on to an approach that does not exploit ambiguity.
Reading the Val doc, I see its EBNF as having two kinds of rules -- tokenized and explicit (the (no-whitespace) rules). That EBNF could be translated for Marpa by making every rule "explicit". This might be done automatically by inserting optional whitespace between every symbol in the tokenized rules.
Some notes:
1.) To reduce ambiguity, non-medial (initial and final) whitespace should be avoided in the explicit rules. I'd be tempted to outright ban non-medial whitespace in explicit rules, but introduce a (raw) meta-notation to override the ban for those users who believe that they know what they are doing.
2.) The top rule obviously requires some sort of special treatment, because initial and final whitespace is necessary to express many useful grammars, and no ambiguities arise.
3.) Then there is the "weeknights" problem: how to avoid the ambiguity with reading it as the two identifiers "wee knights"? The longest-acceptable-token method, which always goes for "weeknights", is the naive approach, is native to Marpa::R2, and will cover almost all cases. But there may be corner cases, and useful apps for which hacks are needed.
This is like the approach I was investigating when I decided to go with the "discard symbols" alternative. I did this because:
1.) I thought the "discard" approach would be easier to explain. I didn't have the idea of mixing explicit and tokenized rules, which does make things more natural on the notation side.
2.) I thought working out all the corner cases, and providing workarounds, might be quite the project, possibly aggravating 1).
3.) I was nearly 100% sure the discard approach would work -- it was very closely related to what lexers had been doing for decades. The explicit whitespace approach had fewer parallels in practical experience. Those who pioneer unsuccessful approaches in a sense contribute as much as those who try the successful approaches, but I was not eager to be one of the former.
I hope you find this useful. |
On Jun 17, 2022, at 8:51 AM, Jeffrey Kegler ***@***.***> wrote:
@dabrahams <https://github.com/dabrahams> OK. I looked over the Val doc.
Re the "go ambiguous then prune" approaches. I've done some work in creating tools for these, but the IF is not documented in Libmarpa, although it is in Marpa::R2 (as the ASF interface) for Perl users. I'm not enthusiastic about its chances because my work with general parsing has given me an enormous respect for exponential growth --- creating an ambiguous ASF, then pruning, is likely to be expensive.
ASF?
I don’t think that’s true for my particular case, because of both top-down parsing and where the ambiguity occurs in my grammar. However, I wouldn’t dare make such a claim for the general case, which is what you must be concerned with as the library maintainer.
I'll say more if you want, but here I'll move on to an approach that does not exploit ambiguity.
Reading the Val doc, I see its EBNF as having two kinds of rules -- tokenized and explicit (the (no-whitespace) rules). That EBNF could be translated for Marpa by making every rule "explicit". This might be done automatically by inserting optional whitespace between every symbol in the tokenized rules.
That’s what my EBNF->BNF translator already does. You can see that if you look for “spacer” in https://github.com/val-lang/val/blob/parser/Sources/ParseGen/BNF.swift.
As mentioned several times earlier, however, this creates ambiguity when optional whitespace is inserted around nullable symbols. That’s why I keep mentioning doing the nihilist normal form (NNF) transformation; it would allow me to avoid inserting optional whitespace on both sides of a nullable symbol.
Moreover I very strongly suspect that in the case of my grammar, even if I continue with this transformation, the amount of overall processing caused by treating whitespace as a token, and nearly doubling the length of most rules’ right-hand sides, would far outweigh the cost of rejecting a possible scan here and there.
Some notes:
1.) To reduce ambiguity, non-medial (initial and final) whitespace should be avoided in the explicit rules.
That won’t help with the problem I’m seeing; only the very top-level rule has non-medial whitespace in it and I could easily eliminate that by pre-trimming whitespace on the input. That’s not what’s causing the issue in my case.
I'd be tempted to outright ban non-medial whitespace in explicit rules, but introduce a (raw) meta-notation to override the ban for those users who believe that they know what they are doing.
2.) The top rule obviously requires some sort of special treatment, because initial and final whitespace is necessary to express many useful grammars, and no ambiguities arise.
3.) Then there is the "weeknights" problem. How to avoid the ambiguity with reading it as the two identifiers "wee knights"? The longest acceptable token method, which always goes for "weeknights" is the naive approach, is native to Marpa::R2, and will cover almost all cases. But there may be corner cases, and useful apps for which hacks are needed.
I don’t think I’m having that problem. My tokenizer (https://github.com/dabrahams/citron/blob/master/Sources/CitronLexerModule/Scanner.swift) currently already takes the longest acceptable token (it’s the standard lexer approach for programming languages).
This is like the approach I was investigating when I decided to go with the "discard symbols" alternative. I did this because:
1.) I thought the "discard" approach would be easier to explain. I didn't have the idea of mixing explicit and tokenized rules, which does make things more natural on the notation side.
2.) I thought working out all the corner cases, and providing workarounds, might be quite the project, possibly aggravating 1).
3.) I was nearly 100% sure the discard approach would work -- it was very closely related to what lexers had been doing for decades. The explicit whitespace approach had fewer parallels in practical experience. Those who pioneer unsuccessful approaches in a sense contribute as much as those who try the successful approaches, but I was not eager to be one of the former.
I hope you find this useful.
It gives me a more detailed understanding of why you don’t want to change libmarpa APIs to help solve my problem. I hope it’s clear that’s a choice I never begrudged you in the first place, though! In terms of helping me actually solve the problem I have, I’m afraid it doesn’t do much: unless I’m misunderstanding, you’re still catching up with the essential nature of the problem I’m having.
What would be really helpful to me, at this point, is if you would let me take you on a walkthrough of my implementation of MARPA’s essentials—as I understand them—so I can see what I’ve missed. It’s only a few hundred lines of code in Swift (https://github.com/val-lang/Lotsawa) and I can explain anything you may not understand.
|
ASF = Abstract Syntax Forest. This is an alternative evaluation interface developed under Marpa::R2 which uses the bocage to directly traverse the forest of parses. The standard valuator allows parses to be iterated one by one, but if you want to deal with all of an exponential number of them, that's often not adequate. The ASF allows you to prune the forest, eliminating potentially huge numbers of possibilities with each step. To implement its ASF, Marpa::R2 violates the Libmarpa API, using Libmarpa calls which are currently not documented. |
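An illustrative sketch of the pruning idea, in Python (this is a toy packed-forest structure, not the Marpa::R2 ASF API):

```python
# Toy packed-forest sketch (not the Marpa::R2 ASF API).  Each node holds
# alternative factorings, each a list of child nodes.  Pruning one
# alternative at a node removes every parse tree that used it, which is
# how a forest walk can discard exponentially many parses per step.

class ForestNode:
    def __init__(self, symbol, alternatives=None):
        self.symbol = symbol
        self.alternatives = alternatives or []  # list of lists of ForestNode

    def prune(self, keep):
        # Keep only the alternatives accepted by the predicate, recursively.
        self.alternatives = [a for a in self.alternatives if keep(self, a)]
        for alt in self.alternatives:
            for child in alt:
                child.prune(keep)

    def tree_count(self):
        if not self.alternatives:  # a leaf: exactly one reading
            return 1
        total = 0
        for alt in self.alternatives:
            ways = 1
            for child in alt:
                ways *= child.tree_count()
            total += ways
        return total

# An ambiguous node with two factorings: 2 parse trees before pruning, 1 after.
root = ForestNode("S", [[ForestNode("a")], [ForestNode("a"), ForestNode("a")]])
print(root.tree_count())                     # 2
root.prune(lambda node, alt: len(alt) == 1)  # reject the two-child factoring
print(root.tree_count())                     # 1
```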
This discussion has meandered quite a bit, and has gone offline. @dabrahams: Do you have any objection to my closing this? |
none |
Absent feedback to the contrary, and as per #117 (comment), I will close this issue after a couple of days. |
Closed per #117 (comment). |
Some rules in my grammar have constraints that there should be no whitespace between adjacent elements (some also have the constraint that there should be no newlines, but we can ignore that for now). Is there a good way to handle that sort of thing with Marpa? If there were a way to reject the completion of certain rules during parsing, that would be simple.
The first alternative I see is to explicitly represent the possibility of whitespace in the grammar, but to do that without introducing huge ambiguity seems to mean I have to do the Aycock & Horspool nihilist normal form (NNF) transformation on my raw grammar, so I don't introduce whitespace recognition on both sides of a nulling symbol. Since Marpa is surely already doing that analysis, it feels wasteful to implement it myself.
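Assuming the split-by-nullable reading of NNF, the core of that transformation can be sketched in Python (symbol and spacer names are invented for illustration):

```python
# Sketch of the NNF-style rewrite described above (symbol names are
# invented).  Each rule's RHS is split into variants in which every
# nullable symbol is either kept (non-null) or removed (nulled); the
# optional-whitespace "spacer" is then inserted between the symbols that
# remain, so a nulled symbol never leaves two adjacent spacers behind.

def nnf_split(rhs, nullable):
    # Enumerate nulled/non-null variants of the right-hand side.
    variants = [[]]
    for sym in rhs:
        if sym in nullable:
            variants = [v + [sym] for v in variants] + [list(v) for v in variants]
        else:
            variants = [v + [sym] for v in variants]
    return variants

def with_spacers(rhs):
    # Insert an optional-whitespace symbol between every adjacent pair.
    out = []
    for i, sym in enumerate(rhs):
        out.append(sym)
        if i < len(rhs) - 1:
            out.append("spacer")
    return out

for variant in nnf_split(["a", "opt", "b"], nullable={"opt"}):
    print(with_spacers(variant))
# ['a', 'spacer', 'opt', 'spacer', 'b']
# ['a', 'spacer', 'b']   <- only one spacer where "opt" was nulled
```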
The second alternative is to ignore the whitespace constraints until the parse forest is complete and then reject any complete parses that violate the constraints, but that seems inefficient, since many parses may be explored that could have been rejected earlier, and there's no way, once a constraint violation is detected in a complete parse, to avoid checking other complete parses that contain the same partial parse that violated the constraint.
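The completion-time rejection being asked for might look like this, sketched in Python (a hypothetical hook, not an existing Marpa API; the rule name is invented):

```python
# Sketch of the completion-time rejection this issue asks for (a
# hypothetical hook, not an existing Marpa API).  When the recognizer
# completes a rule that carries a no-whitespace constraint, a callback
# inspects the input span and vetoes the completion, killing the partial
# parse early instead of after the forest is built.

NO_WHITESPACE_RULES = {"member-access"}  # invented rule name

def on_completion(rule, input_text, start, end):
    """Return False to reject this rule completion during the parse."""
    if rule in NO_WHITESPACE_RULES:
        if any(c.isspace() for c in input_text[start:end]):
            return False
    return True

print(on_completion("member-access", "a.b", 0, 3))   # True: accepted
print(on_completion("member-access", "a .b", 0, 4))  # False: rejected early
print(on_completion("expression", "a .b", 0, 4))     # True: unconstrained rule
```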