Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

can this be theoretically parsed by peg? how? #489

Open
zpdDG4gta8XKpMCd opened this issue Feb 5, 2017 · 5 comments
Open

can this be theoretically parsed by peg? how? #489

zpdDG4gta8XKpMCd opened this issue Feb 5, 2017 · 5 comments

Comments

@zpdDG4gta8XKpMCd
Copy link

zpdDG4gta8XKpMCd commented Feb 5, 2017

i am about to break my head trying to come up with a PEG grammar that would parse according the following BNF of RFC 2396

      hostname      = *( domainlabel "." ) toplabel [ "." ]
      domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
      toplabel      = alpha | alpha *( alphanum | "-" ) alphanum

i got some serious help with domainlabel and toplabel in #487, so are not a problem (@gguerreiro, many thanks for that!)

however hostname, it seems, cannot be expressed in PEG because just like in #487 the whole input is consumed by *(domainlabel ".") which doesn't know when to stop since toplabel [ "." ] is indistinguishable from it

simplified self-contained illustration:

h = (d '.')* t '.'?
d = [dt]
t = [t]

would parse t, d.d.t and fail on d.d.d which is totally expected, but it fails to parse t. and d.d.t. which both a valid cases

@Mingun
Copy link
Contributor

Mingun commented Feb 5, 2017

Consider use lookahead, something like:

h = (!t1 d '.')* t1;
t1 = t '.'?;
d= [dt];
t = [t];

@zpdDG4gta8XKpMCd
Copy link
Author

both t. and d.d.t. can be parsed, but d.t.t cannot anymore

@flaviojs
Copy link

flaviojs commented Mar 5, 2017

This library (or any conformant PEG library) has a greedy expression * with no backtracking (can't go back 1 match), so if you want to stop earlier you will have to use lookahead to specify what must come after a domainlabel.

Looking at hostname, any domainlabel must be followed by "." domainlabel or "." toplabel, but since domainlabel contains all possible toplabel values, it's equivalent to just "." domainlabel.

Modifying you simplified self-contained illustration:

h = (d '.' & (d/t))* t '.'?
d = [dt]
t = [t]

which is equivalent to:

h = (d '.' & d)* t '.'?
d = [dt]
t = [t]

@dmsnell
Copy link

dmsnell commented Apr 19, 2017

For reference I built a demo grammar you can play around with showing an implementation of what @flaviojs posted.

While we can very closely mimic the EBNF from the RFC I have chosen to slightly alter the format to hint that the domain part must be followed by a top-level domain part. For the most part, however, it's practically identical to the RFC and that's some of the fun of PEGs.

For a glance…

Hostname "Hostname"
  = domain:DomainPrefix+ TopLabel "."?

DomainPrefix "non-terminal domain part"
  = DomainLabel "." & (DomainLabel / TopLabel)

DomainLabel
  = $(AlphaNum (AlphaNum / "-" AlphaNum)*)

TopLabel
  = $(Alpha (AlphaNum / "-" AlphaNum)*)

@frantic1048
Copy link

@dmsnell Your grammar http://peg.arcanis.fr/2cx6Sx/2/ seems lacking standalone t. form (the DomainPrefix+ in Hostname rule).


Recently I'm working on an rST parser with PEG.js. I implemented the standalone-hyperlinks according to RFC 3986's absolute-URI ABNF definition. (RFC 3986 is an update of RFC 2396)

Though rST spec restricts URI schemes to 'known schemes' , I don't put it in grammar, it is better to be put in semantic validating.

FYI here's my implementation in the parser named TextAbsoluteURI rule:
https://github.com/frantic1048/Est/blob/master/src/parser.pegjs#L764-L891
And the test:
https://github.com/frantic1048/Est/blob/master/test/grammar.StandAloneHyperlink.js

The URI is already parsed into several meaningful parts by the grammar, however, I just take the whole URI string for processing rST.


And about the hostname = *( domainlabel "." ) toplabel [ "." ] form, I used another way to rewrite the rule into recursion also achieve the same result. You can try it on http://peg.arcanis.fr/3tU7Hl/4/

This way is far no elegant as @dmsnell's assertion way. Just informing of thought. And I have to refine my past grammars... (/ω\)

Notice the construction of Hostname rule make sure Hostname is always ended with TopLabel, and with a little [].concat() to make sure Hostname finally returns a flat object.

/*
PEG.js non greddy match
==========================
BNF as follows:

hostname      = *( domainlabel "." ) toplabel [ "." ]
domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
*/

Hostnames
    = a:Hostname b:("\n" Hostname)*
    { return [a].concat(b.map(z => z[1])) }

Hostname
    = d:DomainLabel "." h:Hostname
    {
      return Object.assign(h, {
        domainlabel: [d].concat(h.domainlabel || [])
      })
    }
    / t:($(TopLabel "."?))
    { return { toplabel: t } }

// be careful about the order of parsing expressions
// specific ones go first
DomainLabel
    // AlphaNum (AlphaNum / "-") c:AlphaNum
    // above is greddy, let's do similar convert like Hostname
    = $(a:AlphaNum b:DomainLabelNonFirst)
    / AlphaNum

DomainLabelNonFirst
    = $((Dash / AlphaNum) DomainLabelNonFirst)
    / AlphaNum

TopLabel
    // same method as DomainLabel
    = $(Alpha TopLabelNonFirst)
    / Alpha

TopLabelNonFirst
    = $((AlphaNum / Dash) TopLabelNonFirst)
    / AlphaNum

Dash = "-"
AlphaNum = Alpha / Num
Alpha = [a-zA-Z]
Num = [0-9]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants