can this be theoretically parsed by peg? how? #489

zpdDG4gta8XKpMCd · 2017-02-05T18:48:32Z

i am about to break my head trying to come up with a PEG grammar that would parse according to the following BNF from RFC 2396

      hostname      = *( domainlabel "." ) toplabel [ "." ]
      domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
      toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

i got some serious help with domainlabel and toplabel in #487, so those are not a problem (@gguerreiro, many thanks for that!)

however hostname, it seems, cannot be expressed in PEG because, just like in #487, the whole input is consumed by *( domainlabel "." ), which doesn't know when to stop since toplabel [ "." ] is indistinguishable from it

simplified self-contained illustration:

h = (d '.')* t '.'?
d = [dt]
t = [t]

would parse t and d.d.t and fail on d.d.d, which is totally expected, but it also fails to parse t. and d.d.t., which are both valid cases
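
To make the failure concrete, here is a minimal sketch of the simplified grammar written with hand-rolled combinators. The helpers (lit, cls, seq, star, opt, matches) are hypothetical and are not part of PEG.js; they only mimic PEG semantics, where * is greedy and never gives back what it consumed.

```javascript
// Hypothetical mini combinators: a parser maps (input, pos) to the next
// position on success, or null on failure.
const lit = (s) => (input, pos) =>
  input.startsWith(s, pos) ? pos + s.length : null;
const cls = (chars) => (input, pos) =>
  pos < input.length && chars.includes(input[pos]) ? pos + 1 : null;
const seq = (...parsers) => (input, pos) => {
  for (const p of parsers) {
    pos = p(input, pos);
    if (pos === null) return null;
  }
  return pos;
};
// Greedy repetition: keeps matching until p fails and never backtracks.
const star = (p) => (input, pos) => {
  for (;;) {
    const next = p(input, pos);
    if (next === null || next === pos) return pos;
    pos = next;
  }
};
const opt = (p) => (input, pos) => {
  const next = p(input, pos);
  return next === null ? pos : next;
};
// A full match must consume the entire input.
const matches = (p, input) => p(input, 0) === input.length;

const d = cls("dt");
const t = cls("t");
// h = (d '.')* t '.'?
const h = seq(star(seq(d, lit("."))), t, opt(lit(".")));

for (const s of ["t", "d.d.t", "d.d.d", "t.", "d.d.t."]) {
  console.log(s, matches(h, s));
}
```

For t. and d.d.t. the loop swallows the final "t." as one more (d '.') iteration, so nothing is left for t to match.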


Mingun · 2017-02-05T18:58:14Z

Consider using lookahead, something like:

h = (!t1 d '.')* t1;
t1 = t '.'?;
d= [dt];
t = [t];

zpdDG4gta8XKpMCd · 2017-02-08T04:01:53Z

both t. and d.d.t. can be parsed, but d.t.t cannot anymore

flaviojs · 2017-03-05T05:10:24Z

In this library (or any conformant PEG library) the * operator is greedy and never backtracks (it can't give back even 1 match), so if you want to stop earlier you will have to use lookahead to specify what must come after a domainlabel.

Looking at hostname, any domainlabel must be followed by "." domainlabel or "." toplabel, but since domainlabel contains all possible toplabel values, it's equivalent to just "." domainlabel.

Modifying your simplified self-contained illustration:

h = (d '.' & (d/t))* t '.'?
d = [dt]
t = [t]

which is equivalent to:

h = (d '.' & d)* t '.'?
d = [dt]
t = [t]
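
As a rough check of both lookahead variants from this thread, the sketch below encodes them with hand-rolled combinators. All helper names (lit, cls, seq, star, opt, and, not, matches) are hypothetical and only imitate PEG semantics; none of this is the PEG.js API.

```javascript
// Hypothetical mini combinators: (input, pos) -> next position or null.
const lit = (s) => (input, pos) =>
  input.startsWith(s, pos) ? pos + s.length : null;
const cls = (chars) => (input, pos) =>
  pos < input.length && chars.includes(input[pos]) ? pos + 1 : null;
const seq = (...parsers) => (input, pos) => {
  for (const p of parsers) {
    pos = p(input, pos);
    if (pos === null) return null;
  }
  return pos;
};
const star = (p) => (input, pos) => {
  for (;;) {
    const next = p(input, pos);
    if (next === null || next === pos) return pos;
    pos = next;
  }
};
const opt = (p) => (input, pos) => {
  const next = p(input, pos);
  return next === null ? pos : next;
};
// &p: succeeds without consuming input if p would match here.
const and = (p) => (input, pos) => (p(input, pos) === null ? null : pos);
// !p: succeeds without consuming input if p would NOT match here.
const not = (p) => (input, pos) => (p(input, pos) === null ? pos : null);
const matches = (p, input) => p(input, 0) === input.length;

const d = cls("dt");
const t = cls("t");
const dot = lit(".");

// Negative lookahead: h = (!t1 d '.')* t1 ; t1 = t '.'?
const t1 = seq(t, opt(dot));
const hNot = seq(star(seq(not(t1), d, dot)), t1);

// Positive lookahead: h = (d '.' & d)* t '.'?
const hAnd = seq(star(seq(d, dot, and(d))), t, opt(dot));

for (const s of ["t", "t.", "d.d.t", "d.d.t.", "d.t.t", "d.d.d"]) {
  console.log(s, "!t1:", matches(hNot, s), "&d:", matches(hAnd, s));
}
```

The negative-lookahead version stops the loop at the first label that could be a toplabel, which is why d.t.t is rejected. The positive-lookahead version only refuses to consume a "d ." pair when no further label follows, so the last label is always left over for t.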

dmsnell · 2017-04-19T10:02:40Z

For reference I built a demo grammar you can play around with showing an implementation of what @flaviojs posted.

While we can very closely mimic the EBNF from the RFC I have chosen to slightly alter the format to hint that the domain part must be followed by a top-level domain part. For the most part, however, it's practically identical to the RFC and that's some of the fun of PEGs.

For a glance…

Hostname "Hostname"
  = domain:DomainPrefix+ TopLabel "."?

DomainPrefix "non-terminal domain part"
  = DomainLabel "." & (DomainLabel / TopLabel)

DomainLabel
  = $(AlphaNum (AlphaNum / "-" AlphaNum)*)

TopLabel
  = $(Alpha (AlphaNum / "-" AlphaNum)*)

frantic1048 · 2017-08-30T14:20:59Z

@dmsnell Your grammar http://peg.arcanis.fr/2cx6Sx/2/ seems to lack the standalone t. form (because of the DomainPrefix+ in the Hostname rule).

Recently I've been working on an rST parser with PEG.js. I implemented standalone hyperlinks according to RFC 3986's absolute-URI ABNF definition. (RFC 3986 is an update of RFC 2396.)

Though the rST spec restricts URI schemes to 'known schemes', I don't put that in the grammar; it is better handled in semantic validation.

FYI here's my implementation in the parser named TextAbsoluteURI rule:
https://github.com/frantic1048/Est/blob/master/src/parser.pegjs#L764-L891
And the test:
https://github.com/frantic1048/Est/blob/master/test/grammar.StandAloneHyperlink.js

The URI is already parsed into several meaningful parts by the grammar, however, I just take the whole URI string for processing rST.

As for the hostname = *( domainlabel "." ) toplabel [ "." ] form, I took another approach and rewrote the rule recursively, which achieves the same result. You can try it on http://peg.arcanis.fr/3tU7Hl/4/

This approach is far less elegant than @dmsnell's assertion-based one; I'm just sharing the idea. And I have to refine my past grammars... (/ω＼)

Notice that the construction of the Hostname rule makes sure a Hostname always ends with a TopLabel, and a little [].concat() makes sure Hostname finally returns a flat object.

/*
PEG.js non-greedy match
==========================
BNF as follows:

hostname      = *( domainlabel "." ) toplabel [ "." ]
domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
*/

Hostnames
    = a:Hostname b:("\n" Hostname)*
    { return [a].concat(b.map(z => z[1])) }

Hostname
    = d:DomainLabel "." h:Hostname
    {
      return Object.assign(h, {
        domainlabel: [d].concat(h.domainlabel || [])
      })
    }
    / t:($(TopLabel "."?))
    { return { toplabel: t } }

// be careful about the order of parsing expressions
// specific ones go first
DomainLabel
    // AlphaNum (AlphaNum / "-")* AlphaNum
    // the above is greedy, so apply the same recursive rewrite as in Hostname
    = $(a:AlphaNum b:DomainLabelNonFirst)
    / AlphaNum

DomainLabelNonFirst
    = $((Dash / AlphaNum) DomainLabelNonFirst)
    / AlphaNum

TopLabel
    // same method as DomainLabel
    = $(Alpha TopLabelNonFirst)
    / Alpha

TopLabelNonFirst
    = $((AlphaNum / Dash) TopLabelNonFirst)
    / AlphaNum

Dash = "-"
AlphaNum = Alpha / Num
Alpha = [a-zA-Z]
Num = [0-9]
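
For readers without a PEG.js setup, the recursive Hostname rule can be approximated in plain JavaScript. This is only a sketch under stated assumptions: the function names (matchAt, parseHostname, parseFullHostname) are hypothetical, and the label rules are swapped for sticky regular expressions, which backtrack on their own, instead of the recursive DomainLabelNonFirst and TopLabelNonFirst rules.

```javascript
// Assumption: labels are matched with sticky regexes rather than PEG rules.
const DOMAIN_LABEL = /[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?/y;
const TOP_LABEL = /[a-zA-Z](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?/y;

// Match `re` exactly at `pos`; return the matched text or null.
function matchAt(re, input, pos) {
  re.lastIndex = pos;
  const m = re.exec(input);
  return m === null ? null : m[0];
}

// Hostname = DomainLabel "." Hostname / TopLabel "."?
function parseHostname(input, pos = 0) {
  // First alternative: DomainLabel "." Hostname
  const d = matchAt(DOMAIN_LABEL, input, pos);
  if (d !== null && input[pos + d.length] === ".") {
    const rest = parseHostname(input, pos + d.length + 1);
    if (rest !== null) {
      return {
        domainlabel: [d, ...rest.domainlabel],
        toplabel: rest.toplabel,
        end: rest.end,
      };
    }
  }
  // Second alternative: TopLabel "."? (tried only if the first one failed)
  const t = matchAt(TOP_LABEL, input, pos);
  if (t === null) return null;
  let end = pos + t.length;
  if (input[end] === ".") end += 1;
  return { domainlabel: [], toplabel: input.slice(pos, end), end };
}

// Accept only if the whole input was consumed.
function parseFullHostname(input) {
  const result = parseHostname(input);
  return result !== null && result.end === input.length ? result : null;
}

console.log(parseFullHostname("www.example.com"));
console.log(parseFullHostname("example.com."));
console.log(parseFullHostname("example.123"));
```

Because the recursive alternative is tried first and the TopLabel alternative only runs when it fails, the last label is always checked against TopLabel, which is the same ordering the grammar relies on. As in the grammar, the toplabel value keeps the optional trailing dot.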

StoneCypher mentioned this issue on Feb 2, 2020 in "Shorthand for Semantic Actions" #580 (open)
