A way to reject partial parses #117
Comments
In the lexer you could define special whitespace-preceded lexemes. When these are recognized they can be treated as you wish, for example as an error. Marpa::R2 has "lexeme events" and that kind of mechanism might be helpful. I noticed the attempt at explicit representation of whitespace in the grammar for issue #116. This could also be a solution, and it need not be ambiguous. I experimented with this, and my experience suggested caution:
I gave up on it in favor of the "discard" lexemes implemented for Marpa::R2, because getting it all right was like being pecked to death by ducks. Nonetheless, such a scheme does lend itself to the kind of locally non-orthogonal treatments of whitespace you have in mind. |
For context, the EBNF source of my grammar (which I'm converting mechanically into MARPA rules) is in the ebnf markdown blocks here—search for the string "(no-" to find the special rules that don't implicitly allow whitespace. Regarding the whitespace-preceded lexemes idea:
With regard to explicit representation of whitespace: no, it needn't be ambiguous, but the naïve approach I currently have, of interjecting optional space tokens into the rules between every symbol, does in fact cause a lot of ambiguity. Any nullable symbol surrounded by a pair of "any amount of whitespace" (AAoW) symbols will cause whitespace in that position to be parsed ambiguously, as part of either the first AAoW symbol or the second. That's why I would need the NNF transformation: to avoid adding these AAoW symbols around nullable symbols. I don't think I understand what your experience suggested. Is there an automated transformation for it that I can apply to my specification? |
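The doubled-AAoW ambiguity is easy to demonstrate concretely. Below is a minimal sketch in Python (illustrative only; the grammar and symbol names are invented for the example) that counts the ways a run of spaces can be split across a rule of the form `S -> AAoW Nullable AAoW`:

```python
# Toy demonstration of the ambiguity described above.  The rule is
# S -> AAoW Nullable AAoW, where AAoW ("any amount of whitespace")
# derives any run of spaces, including none, and Nullable derives only
# the empty string.  Counting derivations shows that any whitespace in
# that position parses ambiguously.

def aaow(s):
    # AAoW: matches any all-space string, including the empty string.
    return all(c == " " for c in s)

def nullable(s):
    # A nulling symbol: derives only the empty string.
    return s == ""

def count_parses(inp):
    # Brute-force every split of the input across AAoW Nullable AAoW.
    count = 0
    for i in range(len(inp) + 1):
        for j in range(i, len(inp) + 1):
            if aaow(inp[:i]) and nullable(inp[i:j]) and aaow(inp[j:]):
                count += 1
    return count

print(count_parses(""))    # 1 parse: no whitespace, no ambiguity
print(count_parses(" "))   # 2 parses: the space belongs to either AAoW
print(count_parses("  "))  # 3 parses: grows with the whitespace length
```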
I did my work on explicit whitespace as research, possibly toward an automated transformation, but I abandoned the idea in favor of the "discard" lexeme approach. My idea was, for those lexemes which do not allow preceding whitespace, to define "error lexemes", which consist of that lexeme preceded by whitespace. If you see one of these, you know someone has put whitespace in the wrong spot, and you handle it accordingly. You need only define such "error lexemes" for the cases where whitespace is, indeed, an error. In correct input, you won't see any "error lexemes". |
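A minimal sketch of the "error lexeme" idea, in Python with invented token names (the real rules would come from the grammar in question):

```python
import re

# Sketch of the "error lexeme" idea.  For a lexeme that must not be
# preceded by whitespace (here a hypothetical member-access "." token),
# we also define an error lexeme consisting of whitespace plus that
# token.  Seeing ERROR_DOT in the token stream means whitespace appeared
# where it is an error; correct input never produces it.

TOKEN_SPECS = [
    ("ERROR_DOT", r"[ \t]+\."),   # whitespace then "." -- the error lexeme
    ("DOT",       r"\."),
    ("IDENT",     r"[A-Za-z_]\w*"),
    ("WS",        r"[ \t]+"),     # ordinary whitespace, discarded below
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPECS))

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}")
        if m.lastgroup != "WS":   # discard plain whitespace
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(lex("a.b"))   # [('IDENT', 'a'), ('DOT', '.'), ('IDENT', 'b')]
print(lex("a .b"))  # the illegal space surfaces as an ERROR_DOT token
```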
Sure… but I don't have any lexemes that are defined to not allow preceding whitespace, and it's not clear to me that the grammar I am encoding can be represented in those terms. |
Re "I don't have any lexemes that are defined to not allow preceding whitespace" in #117 (comment), my comments were attempting to be responsive to this from #117 (comment):
|
I understand that. What I'm saying is that the whitespace restrictions in my grammar rules are expressed in terms of the LHS nonterminal being recognized, not in terms of the participating lexemes. You could of course rewrite the grammar to attach those restrictions to lexemes, but that would be a massive rewrite that I don't know how to do mechanically. At this point it is starting to look like I should switch to my own implementation of the algorithms, because I can easily add the necessary functionality. |
Any objection to closing this issue? |
🤷 I don't think my problem's been addressed; in case it wasn't clear, I think the fact that I don't have a solution means I'm going to stop using MARPA. The only reason I can see for leaving the issue open would be if you intended to do something about it, but it sounds as though you don't. If I've understood correctly, closing it would be appropriate. |
Tomorrow I'll look this issue over and try to see what I have missed. Is there a write-up somewhere of your requirement or something similar? The decision of which parser to use belongs to the user, and I try to take from these choices what I can for Marpa, and otherwise to respect the user's choice and not take it personally. In case I don't get another opportunity, I want to thank you for all these issues. They include insightful editing suggestions and the detection of an obscure but serious bug. It will take me weeks to work through them all. The API doc will be significantly better, and Marpa will be significantly improved, because of the time you've spent here. |
You're welcome, and thank you for the work you've done here. If it turns out I switch implementations I hope I can keep asking you questions? If so, what's the best medium? |
Best to communicate via a Github issue where it's a specific issue on a repo, and via the IRC channel otherwise. One reason I may have seemed reticent is that I am avoiding the issue of "futures". I do have my own ideas for new directions, which may include things like undoing bits of parsing already done. None of these are things you'll see in the next few weeks. |
I mean questions regarding what MARPA does internally, and why, rather than anything about its future direction. |
I took a closer look at the Val doc. What does the (no-whitespace) notation mean?
It's an EBNF notation I have not seen before. Naively, it seems to say "no whitespace, but yes optional whitespace at the beginning". |
Is it a meta-notation saying that the lines are not tokenized, but neither is the lexical whitespace in the text to be taken literally, instead it is indicated via symbols like whitespace-opt?
Yes, exactly. The name is badly chosen; it should be (no-implicit-whitespace).
On Jun 15, 2022, at 8:03 PM, Jeffrey Kegler ***@***.***> wrote:
Is it a meta-notation saying that the lines are not tokenized, but neither is the lexical whitespace in the text to be taken literally, instead it is indicated via symbols like whitespace-opt?
|
@dabrahams OK. I looked over the Val doc.
Re the "go ambiguous then prune" approaches: I've done some work in creating tools for these, but the IF is not documented in Libmarpa, although it is in Marpa::R2 (as the ASF interface) for Perl users. I'm not enthusiastic about its chances because my work with general parsing has given me an enormous respect for exponential growth --- creating an ambiguous ASF, then pruning, is likely to be expensive. I'll say more if you want, but here I'll move on to an approach that does not exploit ambiguity.
Reading the Val doc, I see its EBNF as having two kinds of rules -- tokenized and explicit (the (no-whitespace) rules). That EBNF could be translated for Marpa by making every rule "explicit". This might be done automatically by inserting optional whitespace between every symbol in the tokenized rules.
Some notes:
1.) To reduce ambiguity, non-medial (initial and final) whitespace should be avoided in the explicit rules. I'd be tempted to outright ban non-medial whitespace in explicit rules, but introduce a (raw) meta-notation to override the ban for those users who believe that they know what they are doing.
2.) The top rule obviously requires some sort of special treatment, because initial and final whitespace is necessary to express many useful grammars, and no ambiguities arise.
3.) Then there is the "weeknights" problem: how to avoid the ambiguity with reading it as the two identifiers "wee knights"? The longest-acceptable-token method, which always goes for "weeknights", is the naive approach, is native to Marpa::R2, and will cover almost all cases. But there may be corner cases, and useful apps for which hacks are needed.
This is like the approach I was investigating when I decided to go with the "discard symbols" alternative. I did this because:
1.) I thought the "discard" approach would be easier to explain. I didn't have the idea of mixing explicit and tokenized rules, which does make things more natural on the notation side.
2.) I thought working out all the corner cases, and providing workarounds, might be quite the project, possibly aggravating 1).
3.) I was nearly 100% sure the discard approach would work -- it was very closely related to what lexers had been doing for decades. The explicit whitespace approach had fewer parallels in practical experience. Those who pioneer unsuccessful approaches in a sense contribute as much as those who try the successful approaches, but I was not eager to be one of the former.
I hope you find this useful. |
On Jun 17, 2022, at 8:51 AM, Jeffrey Kegler ***@***.***> wrote:
@dabrahams <https://github.com/dabrahams> OK. I looked over the Val doc.
Re the "go ambiguous then prune" approaches. I've done some work in creating tools for these, but the IF is not documented in Libmarpa, although it is in Marpa::R2 (as the ASF interface) for Perl users. I'm not enthusiastic about its chances because my work with general parsing has given me an enormous respect for exponential growth --- creating an ambiguous ASF, then pruning, is likely to be expensive.
ASF?
I don’t think that’s true for my particular case, because of both top-down parsing and where the ambiguity occurs in my grammar. However, I wouldn’t dare make such a claim for the general case, which is what you must be concerned with as the library maintainer.
I'll say more if you want, but here I'll move on to an approach that does not exploit ambiguity.
Reading the Val doc, I see its EBNF as having two kinds of rules -- tokenized and explicit (the (no-whitespace) rules). That EBNF could be translated for Marpa by making every rule "explicit". This might be done automatically by inserting optional whitespace between every symbol in the tokenized rules.
That’s what my EBNF->BNF translator already does. You can see that if you look for “spacer” in https://github.com/val-lang/val/blob/parser/Sources/ParseGen/BNF.swift.
As mentioned several times earlier, however, this creates ambiguity when optional whitespace is inserted around nullable symbols. That’s why I keep mentioning doing the nihilist normal form (NNF) transformation; it would allow me to avoid inserting optional whitespace on both sides of a nullable symbol.
Moreover I very strongly suspect that in the case of my grammar, even if I continue with this transformation, the amount of overall processing caused by treating whitespace as a token, and nearly doubling the length of most rules’ right-hand sides, would far outweigh the cost of rejecting a possible scan here and there.
Some notes:
1.) To reduce ambiguity, non-medial (initial and final) whitespace should be avoided in the explicit rules.
That won’t help with the problem I’m seeing; only the very top-level rule has non-medial whitespace in it and I could easily eliminate that by pre-trimming whitespace on the input. That’s not what’s causing the issue in my case.
I'd be tempted to outright ban non-medial whitespace in explicit rules, but introduce a (raw) meta-notation to override the ban for those users who believe that they know what they are doing.
2.) The top rule obviously requires some sort of special treatment, because initial and final whitespace is necessary to express many useful grammars, and no ambiguities arise.
3.) Then there is the "weeknights" problem. How to avoid the ambiguity with reading it as the two identifiers "wee knights"? The longest acceptable token method, which always goes for "weeknights" is the naive approach, is native to Marpa::R2, and will cover almost all cases. But there may be corner cases, and useful apps for which hacks are needed.
I don’t think I’m having that problem. My tokenizer (https://github.com/dabrahams/citron/blob/master/Sources/CitronLexerModule/Scanner.swift) currently already takes the longest acceptable token (it’s the standard lexer approach for programming languages).
This is like the approach I was investigating when I decided to go with the "discard symbols" alternative. I did this because:
1.) I thought the "discard" approach would be easier to explain. I didn't have the idea of mixing explicit and tokenized rules, which does make things more natural on the notation side.
2.) I thought working out all the corner cases, and providing workarounds, might be quite the project, possibly aggravating 1).
3.) I was nearly 100% sure the discard approach would work -- it was very closely related to what lexers had been doing for decades. The explicit whitespace approach had fewer parallels in practical experience. Those who pioneer unsuccessful approaches in a sense contribute as much as those who try the successful approaches, but I was not eager to be one of the former.
I hope you find this useful.
It gives me a more detailed understanding of why you don’t want to change libmarpa APIs to help solve my problem. I hope it’s clear that’s a choice I never begrudged you in the first place, though! In terms of helping me actually solve the problem I have, I’m afraid it doesn’t do much: unless I’m misunderstanding, you’re still catching up with the essential nature of the problem I’m having.
What would be really helpful to me, at this point, is if you would let me take you on a walkthrough of my implementation of MARPA’s essentials—as I understand them—so I can see what I’ve missed. It’s only a few hundred lines of code in Swift (https://github.com/val-lang/Lotsawa) and I can explain anything you may not understand.
|
ASF = Abstract Syntax Forest. This is an alternative evaluation interface developed under Marpa::R2 which uses the bocage to directly traverse the forest of parses. The standard valuator allows parses to be iterated one by one, but if you want to deal with all of an exponential number of them, that's often not adequate. The ASF allows you to prune the forest, eliminating potentially huge numbers of possibilities with each step. To implement its ASF, Marpa::R2 violates the Libmarpa API, using Libmarpa calls which are currently not documented. |
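An illustrative sketch of the pruning idea, in Python (this is a toy packed-forest structure, not the Marpa::R2 ASF API):

```python
# Toy packed-forest sketch (not the Marpa::R2 ASF API).  Each node holds
# alternative factorings, each a list of child nodes.  Pruning one
# alternative at a node removes every parse tree that used it, which is
# how a forest walk can discard exponentially many parses per step.

class ForestNode:
    def __init__(self, symbol, alternatives=None):
        self.symbol = symbol
        self.alternatives = alternatives or []  # list of lists of ForestNode

    def prune(self, keep):
        # Keep only the alternatives accepted by the predicate, recursively.
        self.alternatives = [a for a in self.alternatives if keep(self, a)]
        for alt in self.alternatives:
            for child in alt:
                child.prune(keep)

    def tree_count(self):
        if not self.alternatives:  # a leaf: exactly one reading
            return 1
        total = 0
        for alt in self.alternatives:
            ways = 1
            for child in alt:
                ways *= child.tree_count()
            total += ways
        return total

# An ambiguous node with two factorings: 2 parse trees before pruning, 1 after.
root = ForestNode("S", [[ForestNode("a")], [ForestNode("a"), ForestNode("a")]])
print(root.tree_count())                     # 2
root.prune(lambda node, alt: len(alt) == 1)  # reject the two-child factoring
print(root.tree_count())                     # 1
```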
This discussion has meandered quite a bit, and has gone offline. @dabrahams: Do you have any objection to my closing this? |
none |
Absent feedback to the contrary, and as per #117 (comment), I will close this issue after a couple of days. |
Closed per #117 (comment). |
Some rules in my grammar have constraints that there should be no whitespace between adjacent elements (some also have the constraint that there should be no newlines, but we can ignore that for now). Is there a good way to handle that sort of thing with Marpa? If there were a way to reject the completion of certain rules during parsing, that would be simple.
The first alternative I see is to explicitly represent the possibility of whitespace in the grammar, but to do that without introducing huge ambiguity seems to mean I have to do the Aycock & Horspool nihilist normal form (NNF) transformation on my raw grammar, so I don't introduce whitespace recognition on both sides of a nulling symbol. Since Marpa is surely already doing that analysis, it feels wasteful to implement it myself.
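Assuming the split-by-nullable reading of NNF, the core of that transformation can be sketched in Python (symbol and spacer names are invented for illustration):

```python
# Sketch of the NNF-style rewrite described above (symbol names are
# invented).  Each rule's RHS is split into variants in which every
# nullable symbol is either kept (non-null) or removed (nulled); the
# optional-whitespace "spacer" is then inserted between the symbols that
# remain, so a nulled symbol never leaves two adjacent spacers behind.

def nnf_split(rhs, nullable):
    # Enumerate nulled/non-null variants of the right-hand side.
    variants = [[]]
    for sym in rhs:
        if sym in nullable:
            variants = [v + [sym] for v in variants] + [list(v) for v in variants]
        else:
            variants = [v + [sym] for v in variants]
    return variants

def with_spacers(rhs):
    # Insert an optional-whitespace symbol between every adjacent pair.
    out = []
    for i, sym in enumerate(rhs):
        out.append(sym)
        if i < len(rhs) - 1:
            out.append("spacer")
    return out

for variant in nnf_split(["a", "opt", "b"], nullable={"opt"}):
    print(with_spacers(variant))
# ['a', 'spacer', 'opt', 'spacer', 'b']
# ['a', 'spacer', 'b']   <- only one spacer where "opt" was nulled
```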
The second alternative is to ignore the whitespace constraints until the parse forest is complete and then reject any complete parses that violate the constraints, but that seems inefficient, since many parses may be explored that could have been rejected earlier, and there's no way, once a constraint violation is detected in a complete parse, to avoid checking other complete parses that contain the same partial parse that violated the constraint.
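The completion-time rejection being asked for might look like this, sketched in Python (a hypothetical hook, not an existing Marpa API; the rule name is invented):

```python
# Sketch of the completion-time rejection this issue asks for (a
# hypothetical hook, not an existing Marpa API).  When the recognizer
# completes a rule that carries a no-whitespace constraint, a callback
# inspects the input span and vetoes the completion, killing the partial
# parse early instead of after the forest is built.

NO_WHITESPACE_RULES = {"member-access"}  # invented rule name

def on_completion(rule, input_text, start, end):
    """Return False to reject this rule completion during the parse."""
    if rule in NO_WHITESPACE_RULES:
        if any(c.isspace() for c in input_text[start:end]):
            return False
    return True

print(on_completion("member-access", "a.b", 0, 3))   # True: accepted
print(on_completion("member-access", "a .b", 0, 4))  # False: rejected early
print(on_completion("expression", "a .b", 0, 4))     # True: unconstrained rule
```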