Inconsistency in the spec regarding nested emphasis of the same type #127

Closed
Knagis opened this issue Sep 16, 2014 · 1 comment

Comments

@Knagis
Contributor

Knagis commented Sep 16, 2014

Closely related to #51 but that one is more about the implementation while this is about the spec itself.

The spec in section 6.4 mentions (note that it only implies the result, the spec does not describe what should be the result of these examples):

The following patterns are less widely supported, but the intent is clear and they are useful (especially in contexts like bibliography entries):
*emph *with emph* in it*
**strong **with strong** in it**

But it then goes on to say:

.9. Emphasis begins with a delimiter that can open emphasis and includes inlines parsed sequentially until a delimiter that can close emphasis, and that uses the same character (_ or *) as the opening delimiter, is reached.

Also, section 6 contains:

Inlines are parsed sequentially from the beginning of the character stream to the end (left to right, in left-to-right languages).


Reading these rules, when parsing *emph *with emph* in it* we should get <em>emph *with emph</em> in it, which is almost certainly not what the author intended. Of course, the rules could also mean that we match the first opener with the first closer and the second opener with the second closer. But while that looks OK in HTML, it cannot be properly represented in an AST.

Note that this "good" scenario does not really expose the parsing problems: many completely wrong parsers could easily get it right (even by just replacing openers with <em> and closers with </em>, without considering how they relate). So a better sample to consider would be

*foo *bar foo*

Now should this be <em>foo *bar foo</em> or *foo <em>bar foo</em>? The stack-based approach I mentioned in #51, which would also work for the existing C reference implementation, allows both answers with a simple modification and no performance cost, so the specification could go either way: each closer matches either the closest or the farthest opener.
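To make the two options concrete, here is a minimal Python sketch of a single-pass, stack-based matcher. It is only an illustration and not the algorithm from #51 or the C reference implementation: the function name `render_emphasis`, the `match` parameter, and the open/close test (a `*` can open if followed by non-whitespace and can close if preceded by non-whitespace) are all assumptions made for this example, and only single-`*` emphasis is handled.

```python
def render_emphasis(text, match="closest"):
    """Render single-* emphasis as <em>...</em> (hypothetical helper).

    match="closest":  a closer pairs with the most recent unmatched opener.
    match="farthest": a closer pairs with the earliest unmatched opener, and
                      any openers in between stay literal asterisks.
    """
    pieces = []    # output fragments; a matched opener is rewritten in place
    openers = []   # stack of indices into `pieces` for unmatched openers
    for i, ch in enumerate(text):
        if ch != "*":
            pieces.append(ch)
            continue
        # Assumed simplification of the spec's "can open" / "can close" rules.
        can_open = i + 1 < len(text) and not text[i + 1].isspace()
        can_close = i > 0 and not text[i - 1].isspace()
        if can_close and openers:
            if match == "closest":
                idx = openers.pop()
            else:
                idx = openers[0]
                openers.clear()
            pieces[idx] = "<em>"
            pieces.append("</em>")
        elif can_open:
            openers.append(len(pieces))
            pieces.append("*")
        else:
            pieces.append("*")   # neither opener nor closer: literal
    return "".join(pieces)


print(render_emphasis("*foo *bar foo*", match="closest"))
# *foo <em>bar foo</em>
print(render_emphasis("*foo *bar foo*", match="farthest"))
# <em>foo *bar foo</em>
print(render_emphasis("*emph *with emph* in it*", match="closest"))
# <em>emph <em>with emph</em> in it</em>
```

The only difference between the two strategies is which stack entry is taken when a closer is found, so either choice costs one pass over the input.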


In short: the specification implies how *a *b* c* has to be parsed, but that contradicts the current rules. Also, there are no tests that verify implementations against these examples.

@jgm
Member

jgm commented Sep 16, 2014

+++ Kārlis Gaņģis [Sep 16 14 10:25 ]:

Closely related to #51 but that one is more about the implementation
while this is about the spec itself.

The spec in section 6.4 mentions (note that it only implies the result,
the spec does not describe what should be the result of these
examples):

The following patterns are less widely supported, but the intent is
clear and they are useful (especially in contexts like bibliography
entries):
*emph *with emph* in it*
**strong **with strong** in it**

But it then goes on to say:

.9. Emphasis begins with a delimiter that can open emphasis and
includes inlines parsed sequentially until a delimiter that can
close emphasis, and that uses the same character (_ or *) as the
opening delimiter, is reached.

Also, section 6 contains:

Inlines are parsed sequentially from the beginning of the character
stream to the end (left to right, in left-to-right languages).
__________________________________________________________________

Reading these rules, when parsing *emph *with emph* in it* we should get
<em>emph *with emph</em> in it, which is almost certainly not what the
author intended.

No, because *with emph* is a single inline element. When we parse
inlines in sequence starting from the initial *, we get:

  • Str(emph )
  • Emph(Str(with emph))
  • Str( in it)

Then we hit a * that can close emphasis, and stop.

Of course, the rules could also mean that we match the first opener
with the first closer and the second opener with the second closer.
But while that looks OK in HTML, it cannot be properly represented in
an AST.

Note that this "good" scenario does not really expose the parsing
problems: many completely wrong parsers could easily get it right
(even by just replacing openers with <em> and closers with </em>,
without considering how they relate). So a better sample to consider
would be

*foo *bar foo*

Now should this be <em>foo *bar foo</em> or *foo <em>bar foo</em>?

Spec says the latter. We continue to take single inline
elements, checking at each point for a closing delimiter.
So, first we take Str(foo ), then Emph(Str(bar foo)).
Then we're out of input, so we never hit a closing
delimiter. Hence the first * gets treated as a literal
*.
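The reading described above can be sketched as a small recursive parser in Python. This is an illustration only, not the reference implementation: the names `parse`, `parse_inlines`, `can_open`, `can_close`, and `add_str` are made up for the example, the open/close test is an assumed simplification (a `*` opens if followed by non-whitespace and closes if preceded by non-whitespace), and only single-`*` emphasis is handled. Nodes are plain tuples, `("Str", text)` or `("Emph", children)`.

```python
def can_open(text, i):
    # Assumed simplification: "*" can open if followed by non-whitespace.
    return i + 1 < len(text) and not text[i + 1].isspace()


def can_close(text, i):
    # Assumed simplification: "*" can close if preceded by non-whitespace.
    return i > 0 and not text[i - 1].isspace()


def add_str(nodes, s):
    # Merge adjacent literal text into a single Str node.
    if nodes and nodes[-1][0] == "Str":
        nodes[-1] = ("Str", nodes[-1][1] + s)
    else:
        nodes.append(("Str", s))


def parse_inlines(text, pos, inside_emph):
    """Take inlines one at a time; return (nodes, next_pos, closed)."""
    nodes = []
    while pos < len(text):
        if text[pos] == "*":
            # Before taking the next inline, check for a closing delimiter.
            if inside_emph and can_close(text, pos):
                return nodes, pos + 1, True
            if can_open(text, pos):
                inner, end, closed = parse_inlines(text, pos + 1, True)
                if closed:
                    nodes.append(("Emph", inner))
                    pos = end
                    continue
            # No closer was found for this opener (or it cannot open):
            # fall through and keep the "*" as literal text.
        add_str(nodes, text[pos])
        pos += 1
    return nodes, pos, False   # ran out of input without a closer


def parse(text):
    nodes, _, _ = parse_inlines(text, 0, False)
    return nodes


print(parse("*foo *bar foo*"))
# [('Str', '*foo '), ('Emph', [('Str', 'bar foo')])]
print(parse("*emph *with emph* in it*"))
# [('Emph', [('Str', 'emph '), ('Emph', [('Str', 'with emph')]), ('Str', ' in it')])]
```

When an opener never finds its closer, this sketch simply re-parses the text after it, so the same span can be scanned more than once.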

The stack-based approach I mentioned in #51, which would also work for
the existing C reference implementation, allows both answers with a
simple modification and no performance cost, so the specification could
go either way: each closer matches either the closest or the farthest
opener.

Could you spell out your algorithm more fully? I am
currently trying to improve emph/strong parsing in the
memoize branch. Still looking at different ideas.

__________________________________________________________________

In short: the specification implies how *a *b* c* has to be parsed, but
that contradicts the current rules. Also, there are no tests that verify
implementations against these examples.

Yes, clearly there is a need for more examples, and also for
clearer explanations.
