<.ws> poorly defined #1729

Open
jdoege opened this Issue Jan 11, 2018 · 3 comments

jdoege commented Jan 11, 2018

In Language/grammars.pod, ws is defined at line 209 as one or more whitespace characters or a word boundary, i.e. / \s+ | <|w> /.
If, however, you define token ws { \s+ | <|w> } in a grammar, you may find that your previously working parser no longer parses.
After looking in Language/regexes, I found some examples of redefining ws but no statement of how ws is defined by default. Those examples provided a clue that perhaps ws is actually defined as token ws { <!ww> | \s+ }. Searching further, I found a comment by moritz on Stack Overflow saying that ws is, in fact, defined as token ws { <!ww> | \s+ }. https://stackoverflow.com/questions/47728466/perl-6-grammar-doesnt-match-like-i-think-it-should
The default definition of ws should be corrected in grammars.pod and added to the sigspace section of regexes.pod.
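
A minimal sketch of the breakage described above, for anyone who wants to reproduce it. The grammar names DefaultWs and DocumentedWs and the input (abc) are made up for illustration; only the ws body is taken from grammars.pod.

```raku
# Hypothetical grammars for illustration; only the ws body comes from the docs.
grammar DefaultWs {
    rule TOP { '(' \w+ ')' }       # sigspace calls the built-in <.ws>
}

grammar DocumentedWs {
    token ws { \s+ | <|w> }        # the definition grammars.pod currently gives
    rule TOP { '(' \w+ ')' }
}

say DefaultWs.parse('(abc)').so;      # expected: True
say DocumentedWs.parse('(abc)').so;   # expected: False
```

The second parse should fail at the very end of the input: after the closing parenthesis there is neither whitespace nor a word boundary, so the redefined ws cannot match there, whereas the built-in ws can match zero characters.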

One other point that might be made clear is that a grammar that explicitly defines ws behaves a bit differently than one which uses the default definition. Using the default, ws gets thrown away. When ws is explicitly defined, whatever it parses gets put into the parse tree.

TisonKun commented Jan 12, 2018

coke added the docs label Jan 17, 2018

ronaldxs commented Mar 23, 2018

Language/grammars in the middle of the section on <ws> says:

The default C<ws> matches one or more whitespace characters (C<\s>) or a
word boundary (C«<|w>»):

The sigspace section of Language/regexes first implies the same idea, saying:

By default, <.ws> makes sure that words are separated

and then shows, if you already knew what you were looking for, that it is not talking about <|w>, by saying:

"^&" ... will match <.ws> in the middle

which is correct but neither <|w> nor \s+ matches that case.

doc/doc/Language/regexes.pod6

Lines 1546 to 1548 in 9b6ee71

C<m/ photo <.ws> shop <.ws> />. By default, C<< <.ws> >> makes sure that
words are separated, so C<a b> and C<^&> will match C<< <.ws> >> in the
middle, but C<ab> won't.
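
A small sketch of the three cases from the quoted lines, plus the same "^&" input run against the word-boundary-or-spaces pattern. The test strings are the ones the docs mention; the inline group [ \s+ | <|w> ] is only a stand-in for the definition given in grammars.pod, not something the docs themselves show.

```raku
# The three cases from the regexes.pod6 excerpt, using the built-in <.ws>.
say so "a b" ~~ / a <.ws> b /;        # True, per the docs
say so "^&"  ~~ / '^' <.ws> '&' /;    # True, per the docs
say so "ab"  ~~ / a <.ws> b /;        # False, per the docs

# Stand-in for the definition in grammars.pod: spaces or a word boundary.
# Between '^' and '&' there is neither, so this should not match.
say so "^&"  ~~ / '^' [ \s+ | <|w> ] '&' /;   # expected: False
```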

In the Stack Overflow comment, moritz says the <ws> definition is ws { <!ww> \s* }, which is a shade different and close to the source:

moritz stackoverflow: https://stackoverflow.com/questions/47728466/perl-6-grammar-doesnt-match-like-i-think-it-should#comment82426178_47728653
source: https://github.com/perl6/nqp/blob/a2f66567052e827a39cfda6d8908f62532ef3b12/src/NQP/Grammar.nqp#L59-L68

I think it might be helpful for the <!ww> \s* approximation to be in the relevant docs.
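
As a rough check of that approximation, here is a sketch comparing a grammar that relies on the built-in ws with one that spells out token ws { <!ww> \s* }. The grammar names and the sample inputs are hypothetical, and agreement on a handful of strings does not show that the two definitions are identical in every edge case.

```raku
# Hypothetical grammars: one uses the built-in ws, the other the approximation.
grammar BuiltinWs {
    rule TOP { photo shop }
}

grammar ApproxWs {
    token ws { <!ww> \s* }     # moritz's simplification, written out
    rule TOP { photo shop }
}

# Print each input with the result from both grammars side by side.
for 'photo shop', 'photo  shop ', 'photoshop' -> $input {
    say [$input, BuiltinWs.parse($input).so, ApproxWs.parse($input).so];
}
```

On these inputs the two grammars should agree: both accept the strings with whitespace between the words and both reject photoshop, because <!ww> refuses to match in the middle of a word.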

The idea that <ws> matches word separation other than spaces between words is adequately explained, but it is a bit counterintuitive and buried in the middle of the relevant sections. For the regexes sigspace section, a fourth example demonstrating it might be added to the first three, for example:

say so "I used a Photoshop(photo shop)" ~~ m:i:s/ photo shop /;

Or a stronger hint could be added to the second example by adding a period ('.') at the end of the sentence:

say so "I used a photo shop." ~~ m:i:s/ photo shop /;

In the grammars <ws> section the sentence starting with

The default ws

and the block of examples just below it might be moved to the top, and again @moritz's simplification of the <ws> definition to { <!ww> \s* } might be included there.

The two sections should also point to each other.

So I was confused about <ws>, as was the original poster of the issue. My confusion was based in part on looking at @albastev's Modelica grammar, which defines ws as zero or more spaces but has no concept of an actual word break. That, I believe, leads the grammar to add many otherwise unneeded <|w> tests next to keywords.

Thanks to @AlexDaniel for his patience helping to explain some of this on IRC #perl6.

ronaldxs referenced this issue Apr 27, 2018: Ws cleanup #3 (Merged)

JJ commented Jul 24, 2018

Maybe this issue #2211 will also solve this one? At any rate, can you please check now for the definition?
