Ambiguity in block quote definition #460

aidantwoods · 2017-04-12T19:09:10Z

Let's talk about block quotes.

A block quote marker consists of 0-3 spaces of initial indent, plus (a) the character > together with a following space, or (b) a single character > not followed by a space.

The following rules define block quotes:

The following sections need rephrasing.

Basic case. If a string of lines Ls constitute a sequence of blocks Bs, then the result of prepending a block quote marker to the beginning of each line in Ls is a block quote containing Bs.

Which block quote marker? There are two versions of the basic case for each line added.

Laziness. If a string of lines Ls constitute a block quote with contents Bs, then the result of deleting the initial block quote marker from one or more lines in which the next non-whitespace character after the block quote marker is paragraph continuation text is a block quote with Bs as its content. Paragraph continuation text is text that will be parsed as part of the content of a paragraph, but does not occur at the beginning of the paragraph.

Again, we have the problem of "which blockquote marker?".

These are not definitions. At best they are multivalued "functions".

They do not describe which text constitutes a block quote; they describe how some contents Cs may be mapped to a block quote.

These maps are not invertible. As in, a single block quote may map to multiple versions of the contents Cs (by choosing different markers), and you can check that all versions of these contents may be mapped back to this block quote by choosing different markers (though they of course cannot be mapped uniquely).

Because there is no unique way to determine a block quote's contents, these cannot be definitions.

If these points are specified so that the contents do become uniquely defined, then it would be less ambiguous to state the inverse of the current map (so that a block quote can be identified from the lines that actually define it, which is how a parser has to work in practice).

aidantwoods · 2017-04-12T20:47:46Z

Taking the definition at face value, we consider some contents to be contained in a block quote

- - - asdf
  -   sdfg

Which is parsed alone as:

<ul>
<li>
<ul>
<li>
<ul>
<li>asdf</li>
</ul>
</li>
<li>sdfg</li>
</ul>
</li>
</ul>

By prepending the block quote marker > to the first and second lines we get

>- - - asdf
>  -   sdfg

which is parsed as

<blockquote>
<ul>
<li>
<ul>
<li>
<ul>
<li>asdf</li>
</ul>
</li>
</ul>
</li>
<li>sdfg</li>
</ul>
</blockquote>

Note that this is not the result we wanted, nor is it the one the specification gives when we start with some contents Cs and prepend block quote markers to contain them.

Instead of doing this, we could have started with the markdown

- - - asdf
 -   sdfg

which is parsed alone as:

<ul>
<li>
<ul>
<li>
<ul>
<li>asdf</li>
</ul>
</li>
</ul>
</li>
<li>sdfg</li>
</ul>

and prepended the block quote marker > to both lines:

>- - - asdf
> -   sdfg

This time the result we get is per the specification, because we receive the block structure we started with nested in a block quote.

Or

We could have started with the same text and instead prepended > to the first line, and >• to the next, resulting again in:

>- - - asdf
>  -   sdfg

And we would receive again the same html.

However, the markdown we ended up with (by using the process given by the specification) is identical to the first markdown we considered, even though we started with a different content.
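To make the collision concrete, here is a minimal Python sketch of the two constructions above. The helper `prepend_markers` is a hypothetical name introduced for illustration; it only glues a chosen marker onto each line, which is all the basic case asks of a writer.

```python
from typing import List


def prepend_markers(lines: List[str], markers: List[str]) -> List[str]:
    """Prepend one block quote marker per line (hypothetical helper)."""
    return [marker + line for marker, line in zip(markers, lines)]


# First contents: the second line is indented by two spaces.
first = ["- - - asdf", "  -   sdfg"]
# Second contents: the second line is indented by one space.
second = ["- - - asdf", " -   sdfg"]

# A bare '>' on both lines of the first contents.
quoted_first = prepend_markers(first, [">", ">"])
# A bare '>' on line one and '> ' on line two of the second contents.
quoted_second = prepend_markers(second, [">", "> "])

# Different contents, identical quoted text.
print(first != second and quoted_first == quoted_second)  # prints True
```

Both calls produce the same two lines, so nothing in the quoted text itself tells a reader which contents the writer started from.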

The specification MUST describe a process to reverse this procedure in order to be non ambiguous.

aidantwoods · 2017-04-12T21:38:31Z

IMO we should aim to preserve consistency in the given indentation, so that lines that line up in the original text also line up when considered as contents of a block quote. I would propose the following process:

Let Ls be an ordered set of lines
Let A and B be >• and > respectively.
• is a placeholder for a single space character

A block quote marker holds the value of either A or B, and is preceded by 0 to 3 spaces.

A line L begins with the block quote marker B if B is a block quote marker at the start of the line L and A is not a block quote marker at the start of the line L.

Let Cs be an empty ordered set of lines.
Let marker be A.

Foreach line L in Ls:
- If L begins with the block quote marker B and marker is A and there is a non whitespace character after B in L then let marker be B
- If L begins with a block quote marker then let C be the result of removing the block quote marker B and any whitespace which precedes it from the beginning of L
- If L does not begin with a block quote marker then
  - If L is paragraph continuation text in the context of Cs being its preceding lines then let C hold the value of L
  - Otherwise, let C be null
- If C is not null then append C to Cs
- If C is null then stop considering additional items from Ls
If marker is A then remove a single space from the beginning of all lines in Cs

If Cs is non empty then a block quote is defined with contents Cs.

This should ensure that if a B-type block quote marker is used anywhere in the block quote, all lines reflect the indentation that results from it; otherwise the block quote marker A is used. It should also preserve the "intuitive" indentation that can be obtained by looking at the lines with respect to one another.
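The process above can be sketched in Python roughly as follows. This is an illustrative sketch, not the code from any existing parser: the name `block_quote_contents` and the `is_continuation` callback are assumptions made for this example. Deciding whether a line is paragraph continuation text needs the rest of the block parser, so it is passed in as a predicate and defaults to "never", which disables laziness.

```python
import re
from typing import Callable, List

# 0-3 spaces of indent followed by '>' (marker B, without the optional space).
_MARKER = re.compile(r"^ {0,3}>")


def block_quote_contents(
    lines: List[str],
    is_continuation: Callable[[List[str], str], bool] = lambda cs, line: False,
) -> List[str]:
    """Return the contents Cs of the block quote starting at lines[0].

    `is_continuation(cs, line)` is a hypothetical hook that should say whether
    `line` is paragraph continuation text given the preceding contents `cs`.
    """
    contents: List[str] = []
    marker_is_b = False  # "Let marker be A."
    for line in lines:
        match = _MARKER.match(line)
        if match:
            # Remove '>' and the indentation before it; keep any space after it.
            rest = line[match.end():]
            # Marker B: '>' not followed by a space, with non-whitespace after it.
            if rest.strip() and not rest.startswith(" "):
                marker_is_b = True
            contents.append(rest)
        elif is_continuation(contents, line):
            contents.append(line)
        else:
            break  # C is null: stop considering further lines.
    if not marker_is_b:
        # Marker stayed A: drop one leading space from every line that has one.
        contents = [c[1:] if c.startswith(" ") else c for c in contents]
    return contents


print(block_quote_contents([">    1. a", ">2. b"]))   # ['    1. a', '2. b']
print(block_quote_contents([">    1. a", "> 2. b"]))  # ['   1. a', '2. b']
```

In the first call a bare `>` appears on the second line, so no space is stripped anywhere and the first line keeps its four-space indent. In the second call every marker is `> `, so one space is removed from each line.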

jgm · 2017-04-12T22:00:50Z

This definitely needs work, thanks. The approach of the spec (for better or worse) was to specify how to construct each of the block and inline element types (writer's perspective), rather than how to parse (reader's perspective). If all possible ways of constructing elements are specified, then it should be possible to write a parser that recognizes them (and the reference implementations are meant to show that). There are pitfalls, though, to this approach. If two different constructions (for different element types) can result in the same text, then we have a problem. This is the kind of problem you're pointing out for block quotes. There are a couple of places in the text where we resort to specifying precedences, which isn't really in the spirit of the writer's-perspective strategy outlined above, but is necessary to avoid the problem. Perhaps it would have been better to rewrite the spec from the reader's (parser's) perspective, but I don't know if I have energy for that.

aidantwoods · 2017-04-12T22:13:20Z

Perhaps it would have been better to rewrite the spec from the
reader's (parser's) perspective

I'd probably agree with that, but I think it's salvageable without a complete rewrite.

I think the important thing to do is construct a definition from the reader's perspective, and see what that leaves in what the writer can do.

There are a couple of places in the text where we resort
to specifying precedences

For example, if the spec were to say that the writer should stick to a single marker type per block quote, and the shorter marker has higher priority if it is used on any line then I think that would keep everything consistent with the parsing strategy I outlined above.

The key thing I think is that the writer should not be using different marker lengths (so that indentation can be unambiguously preserved). The current reference implementation just grabs the longest marker it can find.
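For contrast, here is a rough Python sketch of the per-line "longest marker" behaviour described above. It is not the reference implementation's code; `strip_longest_marker` is a hypothetical name, and the sketch only models removing `>` plus one optional following space from each line independently.

```python
import re
from typing import Optional

# 0-3 spaces of indent, '>', then one optional space: the longest marker available.
_LONGEST = re.compile(r"^ {0,3}> ?")


def strip_longest_marker(line: str) -> Optional[str]:
    """Strip the longest block quote marker from one line, or return None."""
    match = _LONGEST.match(line)
    return line[match.end():] if match else None


quoted = [">    1. a", ">2. b"]
print([strip_longest_marker(line) for line in quoted])  # ['   1. a', '2. b']
```

Because each line is handled on its own, the first line loses two columns and the second only one, so the two lines no longer keep the relative indentation they had as typed.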

aidantwoods · 2017-04-12T23:33:01Z

I've worked an initial implementation of that algorithm I gave into the parser I'm working on, so the following is now produced (in yaml-ish notation):

>1. > asdf
>   > sdfg

> 1. > asdf
>    > sdfg

>   1. a
>2. b

>    1. a
>2. b

>    1. a
> 2. b

blockquote:
  ol:
    li:
      blockquote:
        p:
          text:
            asdf sdfg

blockquote:
  ol:
    li:
      blockquote:
        p:
          text:
            asdf sdfg

blockquote:
  ol:
    li:
      p:
        text:
          a
    li:
      p:
        text:
          b

blockquote:
  pre:
    code:
      1. a
  ol start="2":
    li:
      p:
        text:
          b

blockquote:
  ol:
    li:
      p:
        text:
          a
    li:
      p:
        text:
          b

Feel free to throw some test cases at me if you like the approach, I think I'm certainly going to use it.

Require that the same block quote marker be used to avoid ambiguity in parsing strategy (compatible with the algorithm described [here](commonmark#460 (comment)))

aidantwoods mentioned this issue Apr 12, 2017

Combination of blockquote and list inconsistency #421

Open

aidantwoods mentioned this issue Apr 27, 2017

Block quote marker choice uniformity #466

Open

jgm added this to the 0.29 milestone Aug 25, 2018

jgm modified the milestones: 0.29, 0.30 Apr 3, 2019

jgm mentioned this issue Jun 14, 2019

Multi-paragraph list item nested in blockquote commonmark/cmark#304

Closed

vassudanagunta mentioned this issue Feb 3, 2020

blockquotes violate principle of uniformity #634

Open

jgm mentioned this issue Feb 17, 2021

Reference implementation incorrectly detects setext heading as paragraph continuation text #675

Closed

jgm mentioned this issue Mar 22, 2024

Possibly undesired behavior uncovered by the spec #766

Closed
