Inconsistency in the spec regarding nested emphasis of the same type #127

Knagis · 2014-09-16T17:25:45Z

Closely related to #51 but that one is more about the implementation while this is about the spec itself.

Section 6.4 of the spec gives the following examples (note that it only implies the result; it does not state what the output of these examples should be):

The following patterns are less widely supported, but the intent is clear and they are useful (especially in contexts like bibliography entries):
*emph *with emph* in it*
**strong **with strong** in it**

But it then goes on to say:

9. Emphasis begins with a delimiter that can open emphasis and includes inlines parsed sequentially until a delimiter that can close emphasis, and that uses the same character (_ or *) as the opening delimiter, is reached.

Also, section 6 contains:

Inlines are parsed sequentially from the beginning of the character stream to the end (left to right, in left-to-right languages).

Reading these rules, parsing *emph *with emph* in it* should give <em>emph *with emph</em> in it*, which is almost certainly not what the author intended. Of course, the rules could also mean that we match the first opener with the first closer and the second opener with the second closer. But while that looks OK in HTML, it cannot be properly represented in an AST.

Note that this "good" scenario does not really expose the difficulty of parsing it: many completely wrong parsers could easily get it right (even by just replacing openers with <em> and closers with </em>, without considering how they relate to each other). So a better sample to consider would be

*foo *bar foo*
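
To illustrate why the first sample is too forgiving, here is a minimal sketch (not taken from any real implementation) of the naive substitution described above. The name `naive_render` and the simplified flanking test (a `*` closes if a non-space character precedes it, and opens if a non-space character follows it) are assumptions made for illustration; the spec's actual rules are more detailed.

```python
def naive_render(text):
    """Replace each '*' with a tag based only on its neighbours (no pairing)."""
    out = []
    for i, ch in enumerate(text):
        if ch != "*":
            out.append(ch)
        elif i > 0 and not text[i - 1].isspace():
            out.append("</em>")  # treated as a closer: non-space before it
        elif i + 1 < len(text) and not text[i + 1].isspace():
            out.append("<em>")   # treated as an opener: non-space after it
        else:
            out.append(ch)       # neither: keep the literal '*'
    return "".join(out)


print(naive_render("*emph *with emph* in it*"))
print(naive_render("*foo *bar foo*"))
```

On the first sample this happens to produce properly nested tags. On the second it emits two `<em>` openers and only one `</em>`, so that sample is the one that exposes the missing pairing logic.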

Now, should this be <em>foo *bar foo</em> or *foo <em>bar foo</em>? The stack-based approach I mentioned in #51, which would also work for the existing C reference implementation, allows either answer with a simple modification and no performance cost. So the specification could go both ways: each closer matches either the closest or the farthest opener.
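
The algorithm from #51 is not reproduced in this thread, so the following is only a sketch of what a stack-based matcher with a switchable policy might look like. The names `match_delimiters` and `render`, the policy values `"closest"` and `"farthest"`, and the simplified opener/closer test are all assumptions for illustration; only single `*` delimiters are handled.

```python
def match_delimiters(text, policy="closest"):
    """Map the index of every paired '*' to the tag that replaces it."""
    tags = {}
    openers = []  # stack of indices of still-unmatched openers
    for i, ch in enumerate(text):
        if ch != "*":
            continue
        # Simplified flanking test (assumption, not the spec's full rule).
        closes = i > 0 and not text[i - 1].isspace()
        opens = i + 1 < len(text) and not text[i + 1].isspace()
        if closes and openers:
            if policy == "closest":
                j = openers.pop()   # pair with the most recent opener
            else:
                j = openers[0]      # pair with the earliest opener...
                openers.clear()     # ...and leave the ones in between literal
            tags[j], tags[i] = "<em>", "</em>"
        elif opens:
            openers.append(i)
    return tags


def render(text, policy="closest"):
    """Render text to HTML, leaving unpaired '*' characters as literals."""
    tags = match_delimiters(text, policy)
    return "".join(tags.get(i, ch) for i, ch in enumerate(text))


print(render("*foo *bar foo*", "closest"))
print(render("*foo *bar foo*", "farthest"))
```

Under these assumptions, `"closest"` yields `*foo <em>bar foo</em>` and `"farthest"` yields `<em>foo *bar foo</em>`. Both policies use the same single left-to-right pass, and the only difference is which stack entry the closer is paired with.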

In short: the specification implies how *a *b* c* has to be parsed, but that contradicts the current rules. Also, there are no tests that verify implementations against these examples.


jgm · 2014-09-16T17:48:32Z

+++ Kārlis Gaņģis [Sep 16 14 10:25 ]:

> Closely related to #51 but that one is more about the implementation
> while this is about the spec itself.
>
> The spec in section 6.4 mentions (note that it only implies the result,
> the spec does not describe what should be the result of these
> examples):
>
> The following patterns are less widely supported, but the intent is
> clear and they are useful (especially in contexts like bibliography
> entries):
> *emph *with emph* in it*
> **strong **with strong** in it**
>
> But after that goes and says:
>
> 9. Emphasis begins with a delimiter that can open emphasis and
> includes inlines parsed sequentially until a delimiter that can
> close emphasis, and that uses the same character (_ or *) as the
> opening delimiter, is reached.
>
> Also, section 6 contains:
>
> Inlines are parsed sequentially from the beginning of the character
> stream to the end (left to right, in left-to-right languages).
>
> Reading these rules when parsing *emph *with emph* in it* we should get
> <em>emph *with emph</em> in it* which is certainly not what is probably
> intended by the author.

No, because *with emph* is a single inline element. When we parse
inlines in sequence starting from the initial *, we get:

Str(emph )
Emph(Str(with emph))
Str( in it)

Then we hit a * that can close emphasis, and stop.

> Of course, the rules could also mean that we
> match the first opener with first closer and second opener with second
> closer. But while that looks OK in HTML, it cannot be properly
> represented in AST.
>
> Note that this "good" scenario does not really represent the problems
> in parsing it - many completely wrong parsers could easily get this
> correctly (even by just replacing openers with <em> and closers with
> </em>, without even thinking about their relations). So a better sample
> to consider would be
> *foo *bar foo*
>
> Now should this be <em>foo *bar foo</em> or *foo <em>bar foo</em>?

Spec says the latter. We continue to take single inline
elements, checking at each point for a closing delimiter.
So, first we take Str(foo ), then Emph(Str(bar foo)).
Then we're out of input, so we never hit a closing
delimiter. Hence the first * gets treated as a literal *.
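
The reading described here can be sketched as a small recursive parser. This is an illustration under stated assumptions, not the reference implementation: `parse_inlines`, the tuple-based node shapes, and the simplified opener/closer test are hypothetical, and only single `*` delimiters are handled.

```python
def can_open(text, i):
    # Assumption: '*' can open emphasis if a non-space character follows it.
    return i + 1 < len(text) and not text[i + 1].isspace()


def can_close(text, i):
    # Assumption: '*' can close emphasis if a non-space character precedes it.
    return i > 0 and not text[i - 1].isspace()


def add_text(nodes, s):
    # Merge adjacent text nodes so the result is easier to read.
    if nodes and nodes[-1][0] == "str":
        nodes[-1] = ("str", nodes[-1][1] + s)
    else:
        nodes.append(("str", s))


def parse_inlines(text, pos=0, inside_emph=False):
    """Take inlines one at a time; return (nodes, pos, closed)."""
    nodes = []
    while pos < len(text):
        if text[pos] != "*":
            end = text.find("*", pos)
            end = len(text) if end == -1 else end
            add_text(nodes, text[pos:end])
            pos = end
        elif inside_emph and can_close(text, pos):
            return nodes, pos, True            # hit a closing delimiter: stop
        elif can_open(text, pos):
            children, end, closed = parse_inlines(text, pos + 1, True)
            if closed:
                nodes.append(("emph", children))
                pos = end + 1                  # continue after the closer
            else:
                add_text(nodes, "*")           # no closer found: literal '*'
                pos += 1
        else:
            add_text(nodes, "*")
            pos += 1
    return nodes, pos, False                   # ran out of input


def parse(text):
    return parse_inlines(text)[0]


print(parse("*emph *with emph* in it*"))
print(parse("*foo *bar foo*"))
```

With these assumptions the first input comes out as one emphasis node containing a nested emphasis node, and the second keeps its first `*` as literal text followed by `Emph(bar foo)`. Note that a failed opener causes the same input to be parsed again; this sketch makes no attempt to avoid that repeated work.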

> The stack based approach I mentioned in #51 that would also work for the
> existing C reference implementation allows both answers with a simple
> modification with no performance costs so the specification could go
> both ways - either each closer matches the closest or the farthest
> opener.

Could you spell out your algorithm more fully? I am
currently trying to improve emph/strong parsing in the
memoize branch. Still looking at different ideas.

> In short - the specification implies how *a *b* c* has to be parsed but
> that contradicts the current rules. Also there are no tests that verify
> the implementations with these examples.

Yes, clearly there is a need for more examples, and also for
clearer explanations.

Knagis mentioned this issue Sep 16, 2014

Additional test for emphasis parsing MortenHoustonLudvigsen/CommonMarkSharp#1

Merged

Knagis mentioned this issue Sep 24, 2014

Underscores inside of emphasis. #51

Closed

Knagis closed this as completed Oct 8, 2014
