Ambiguity in block quote definition #460

Open
aidantwoods opened this issue Apr 12, 2017 · 5 comments

@aidantwoods

Let's talk about block quotes.

A block quote marker consists of 0-3 spaces of initial indent, plus (a) the character > together with a following space, or (b) a single character > not followed by a space.
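The quoted definition can be sketched as a regular expression. This is only an illustration: the names `BLOCK_QUOTE_MARKER` and `split_marker` are made up here, tabs are ignored, and the optional trailing space is matched greedily, which is one possible reading rather than something the definition itself dictates.

```python
import re

# 0-3 spaces of indent, then '>', then an optional single space.
# Greedy matching means form (b), a bare '>', is only chosen when
# no space follows the '>'.
BLOCK_QUOTE_MARKER = re.compile(r"^ {0,3}> ?")


def split_marker(line):
    """Return (marker, rest) if the line starts with a block quote marker, else None."""
    match = BLOCK_QUOTE_MARKER.match(line)
    if match is None:
        return None
    return line[:match.end()], line[match.end():]
```

Four or more leading spaces do not match, so `split_marker("    > foo")` returns `None`.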

The following rules define block quotes:

The following sections need rephrasing.

  1. Basic case. If a string of lines Ls constitute a sequence of blocks Bs, then the result of prepending a block quote marker to the beginning of each line in Ls is a block quote containing Bs.

Which block quote marker? There are two versions of the basic case for each line added.
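To make the count concrete, here is a small sketch that enumerates every text the basic case can produce from a given set of lines. It assumes zero initial indent and treats the two marker forms as the plain strings `"> "` and `">"`; `all_quotings` is a hypothetical helper, not anything from the specification.

```python
from itertools import product

MARKERS = ("> ", ">")  # forms (a) and (b), with no initial indent


def all_quotings(lines):
    """Yield every text obtainable by prepending one marker to each line."""
    for choice in product(MARKERS, repeat=len(lines)):
        yield [marker + line for marker, line in zip(choice, lines)]
```

With n lines there are 2**n candidate texts before initial indentation is even considered.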

  2. Laziness. If a string of lines Ls constitute a block quote with contents Bs, then the result of deleting the initial block quote marker from one or more lines in which the next non-whitespace character after the block quote marker is paragraph continuation text is a block quote with Bs as its content. Paragraph continuation text is text that will be parsed as part of the content of a paragraph, but does not occur at the beginning of the paragraph.

Again, we have the problem of "which block quote marker?".

These are not definitions. At best they are multivalued "functions".

They do not describe which text constitutes a block quote; they describe how some contents Cs may be mapped to a block quote.

These maps are not invertible: a single block quote may map back to multiple versions of the contents Cs (by choosing different markers), and you can check that each of these versions maps forward to the same block quote under a suitable choice of markers (though of course not uniquely).

Because there is no unique way to determine a block quote's contents, these cannot be definitions.

If these points are respecified so that the contents do become uniquely defined, it would be less ambiguous to state the inverse of the current map, so that a block quote can be identified from the lines that actually define it (which is how a parser has to work in practice).

@aidantwoods

aidantwoods commented Apr 12, 2017

Taking the definition at face value, consider some contents that we want to place in a block quote:

- - - asdf
  -   sdfg

Which is parsed alone as:

<ul>
<li>
<ul>
<li>
<ul>
<li>asdf</li>
</ul>
</li>
<li>sdfg</li>
</ul>
</li>
</ul>

By prepending the block quote marker > to the first and second lines, we get

>- - - asdf
>  -   sdfg

which is parsed as

<blockquote>
<ul>
<li>
<ul>
<li>
<ul>
<li>asdf</li>
</ul>
</li>
</ul>
</li>
<li>sdfg</li>
</ul>
</blockquote>

Note that this is not the result we wanted, nor is it the one the specification describes, in which we start with some contents Cs and prepend block quote markers to wrap those contents.

Instead of doing this, we could have started with the markdown

- - - asdf
 -   sdfg

which is parsed alone as:

<ul>
<li>
<ul>
<li>
<ul>
<li>asdf</li>
</ul>
</li>
</ul>
</li>
<li>sdfg</li>
</ul>

and prepended the block quote marker > to both lines.

>- - - asdf
> -   sdfg

This time the result we get is per the specification, because we receive the block structure we started with nested in a block quote.

Or

We could have started with the same text and instead prepended > to the first line and >• to the next (where • stands for a single space), resulting again in:

>- - - asdf
>  -   sdfg

And we would again receive the same HTML.

However, the markdown we ended up with (by following the process given in the specification) is identical to the first markdown we considered, even though we started with different contents.
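The collision can be checked mechanically. The sketch below uses a hypothetical `quote` helper that prepends one chosen marker per line, and reuses the two contents from the examples above.

```python
def quote(lines, markers):
    """Prepend the chosen marker to each line (one marker per line)."""
    return [marker + line for marker, line in zip(markers, lines)]


first_contents = ["- - - asdf", "  -   sdfg"]   # second line indented 2 spaces
second_contents = ["- - - asdf", " -   sdfg"]   # second line indented 1 space

# Bare '>' on both lines of the first contents...
quoted_first = quote(first_contents, [">", ">"])
# ...versus '>' then '> ' on the second contents.
quoted_second = quote(second_contents, [">", "> "])

collision = quoted_first == quoted_second
```

Here `collision` is `True`: two different contents yield the same quoted text, so the quoted text alone cannot say which contents were meant.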

The specification MUST describe a process to reverse this procedure in order to be unambiguous.

@aidantwoods

IMO we should aim to preserve consistency in the given indentation, so that lines that line up in the original text still line up when considered as contents of a block quote. I would propose the following process:

Let Ls be an ordered set of lines
Let A and B be >• and > respectively.
• is a placeholder for a single space character.

A block quote marker holds the value of either A or B, and is preceded by 0 to 3 spaces.

A line L begins with the block quote marker B if B is a block quote marker at the start of the line L and A is not a block quote marker at the start of the line L.

Let Cs be an empty ordered set of lines.
Let marker be A.

  • For each line L in Ls:

    • If L begins with the block quote marker B and marker is A and there is a non-whitespace character after B in L, then let marker be B
    • If L begins with a block quote marker, then let C be the result of removing the block quote marker B and any whitespace which precedes it from the beginning of L
    • If L does not begin with a block quote marker then
      • If L is paragraph continuation text in the context of Cs being its preceding lines then let C hold the value of L
      • Otherwise, let C be null
    • If C is not null then append C to Cs
    • If C is null then stop considering additional items from Ls
  • If marker is A then remove a single space from the beginning of all lines in Cs

If Cs is non-empty then a block quote is defined with contents Cs.

This should ensure that if a single B-type block quote marker is used anywhere in the block quote, then every line's indentation is measured relative to B; otherwise block quote marker A is used. It should also preserve the "intuitive" indentation that can be obtained by looking at the lines with respect to one another.
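A minimal Python sketch of the process above, under stated assumptions: deciding whether a line is paragraph continuation text needs a real parser, so it is passed in as a hypothetical `is_continuation` callback that defaults to "never", and tabs are ignored. The function and constant names are illustrative only.

```python
import re

MARKER_A = re.compile(r"^ {0,3}> ")  # '>' followed by a space
MARKER_B = re.compile(r"^ {0,3}>")   # '>' alone


def never_lazy(previous_lines, line):
    """Placeholder: a real parser would detect paragraph continuation text here."""
    return False


def block_quote_contents(lines, is_continuation=never_lazy):
    """Return the contents Cs of the block quote starting at lines[0], or []."""
    contents = []
    marker = "A"
    for line in lines:
        match_b = MARKER_B.match(line)
        if match_b is None:
            # No marker: only a lazy continuation line may extend the quote.
            if is_continuation(contents, line):
                contents.append(line)
                continue
            break
        rest = line[match_b.end():]  # strip the indent and the '>' only
        begins_with_b = MARKER_A.match(line) is None
        if begins_with_b and marker == "A" and rest.strip():
            marker = "B"
        contents.append(rest)
    if marker == "A":
        # Every marked line used '> ', so drop that one space everywhere.
        contents = [c[1:] if c.startswith(" ") else c for c in contents]
    return contents
```

For the ambiguous text from the earlier comment, `block_quote_contents([">- - - asdf", ">  -   sdfg"])` returns `["- - - asdf", "  -   sdfg"]`, because the first line forces marker B and the second line then keeps both of its spaces.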

@jgm

jgm commented Apr 12, 2017 via email

@aidantwoods

aidantwoods commented Apr 12, 2017

Perhaps it would have been better to rewrite the spec from the reader's (parser's) perspective

I'd probably agree with that, but I think it's salvageable without a complete rewrite.

I think the important thing to do is construct a definition from the reader's perspective, and see what that leaves in what the writer can do.

There are a couple of places in the text where we resort to specifying precedences

For example, if the spec were to say that the writer should stick to a single marker type per block quote, and that the shorter marker takes priority if it is used on any line, then I think that would keep everything consistent with the parsing strategy I outlined above.

The key thing I think is that the writer should not be using different marker lengths (so that indentation can be unambiguously preserved). The current reference implementation just grabs the longest marker it can find.
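For contrast, here is a sketch of the per-line "longest marker" strategy as described in this comment. It is an illustration of that strategy only, not code taken from the reference implementation, and `strip_longest` is a hypothetical name.

```python
import re

# Per line: strip up to 3 spaces of indent, '>', and one following space if present.
LONGEST_MARKER = re.compile(r"^ {0,3}> ?")


def strip_longest(lines):
    """Remove the longest block quote marker from each line independently."""
    return [LONGEST_MARKER.sub("", line, count=1) for line in lines]
```

Because each line is handled independently, `strip_longest([">- - - asdf", ">  -   sdfg"])` removes one character from the first line and two from the second, so the relative indentation of the two lines shifts by one column.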

@aidantwoods

I've worked an initial implementation of the algorithm I gave into the parser I'm working on. The five inputs below now produce the five outputs that follow them, in the same order (in YAML-ish notation):

>1. > asdf
>   > sdfg

> 1. > asdf
>    > sdfg

>   1. a
>2. b

>    1. a
>2. b

>    1. a
> 2. b

blockquote:
  ol:
    li:
      blockquote:
        p:
          text:
            asdf sdfg

blockquote:
  ol:
    li:
      blockquote:
        p:
          text:
            asdf sdfg

blockquote:
  ol:
    li:
      p:
        text:
          a
    li:
      p:
        text:
          b

blockquote:
  pre:
    code:
      1. a
  ol start="2":
    li:
      p:
        text:
          b

blockquote:
  ol:
    li:
      p:
        text:
          a
    li:
      p:
        text:
          b

Feel free to throw some test cases at me if you like the approach. I think I'm certainly going to use it.

aidantwoods added a commit to aidantwoods/CommonMark that referenced this issue Apr 27, 2017
Require that the same block quote marker be used to avoid ambiguity in parsing strategy (compatible with the algorithm described [here](commonmark#460 (comment)))
@jgm jgm added this to the 0.29 milestone Aug 25, 2018
@jgm jgm modified the milestones: 0.29, 0.30 Apr 3, 2019