"HTML blocks" section vague, doesn't match parsers #177

mildsunrise · 2014-11-01T15:17:35Z

The HTML blocks section provides the definition of an HTML block tag.
Later on, the following example is presented:

<div class
foo

The example header says: "An incomplete HTML block tag may also start an HTML block"

The problem with this is: what's an incomplete HTML block tag? There's no definition that allows me to parse the syntax for such a tag. Does a single < count as an incomplete tag? According to the [current] rules of HTML block parsing, the above would render as <p><div class\nfoo</p>.

But even if we do provide a definition, what's the point in allowing incomplete tags? Could that be useful sometimes? And, more importantly, if an incomplete block tag can start a block, why can't an incomplete comment, or an incomplete CDATA section, or an incomplete processing instruction start a block?

<!--

Markdown here

-->

<![CDATA[ some data

The text was updated successfully, but these errors were encountered:

jgm · 2014-11-01T16:33:05Z

If I recall, my thinking was that someone might start a block with an open tag with lots of attributes. If this gets hard-wrapped, the first line will be an incomplete tag. Since I was going for a parsing strategy that could be handled line-by-line (without lookahead), that motivated allowing HTML blocks to start with incomplete tags.

mildsunrise · 2014-11-01T16:48:17Z

Hmm... this isn't very consistent, because in order to parse this HTML block:

<!-- comment that spans
multiple lines -->

You need lookahead, since incomplete comments aren't allowed. Same with CDATAs, etc, etc.
So if lookahead is required (and used) to implement CommonMark, can't we just be consistent and require valid HTML tags?

jgm · 2014-12-29T07:10:29Z

@jmendeth, for consistency I think we need to say that  comes after). That requires no lookahead. In order to handle comments with blank lines, we should also allow --> to start an HTML block. See also #96.

This will allow some things that aren't valid HTML to be passed through as raw HTML. But I think that's okay -- Markdown has always let users pass through invalid HTML.

mildsunrise · 2014-12-29T09:07:24Z

I'm fine with this. The section needs to be rewritten then, I suppose? And the same with CDATA and the other types?

jgm · 2014-12-29T19:48:13Z

Yes, it needs to be rewritten -- and also CDATA, etc.

EinfachToll · 2015-01-06T12:47:06Z

I also would like to have a more strict definition of what counts as incomplete tag. At the moment,

<div

doesn't count as HTML block, but

<div

does (notice the trailing space). That's not very intuitive. (Tested with commonmark.js.)

mildsunrise · 2015-01-06T12:57:00Z

There are more serious problems. According the spec, this:

<!--

is not an HTML block, but both CMark and CommonMark will accept it as such.
The same happens with CDATA and others types.

jgm · 2015-02-01T06:32:06Z

On reflection, I think it would make sense to make an exception to the usual rule form HTML blocks (they end at a blank line) for the following elements: pre, script, style, and comments. We'd never want to interpret the contents of these elements as CommonMark, so instead of ending at the first blank line, we should keep the block open til we hit a matching closing tag. (I think it is fine to say that if no matching closing tag is found, the entire rest of the document belongs to the HTML block.)

mildsunrise · 2015-02-01T15:41:43Z

While this exception would of course be useful, I'm not sure it's okay to complicate the HTML blocks syntax even more.

I'd leave it as it is for now, get it matching on the spec, and if the community agree we can add it (it's backwards compatible, so no problem).

jgm · 2015-02-01T18:57:15Z

The suggestion was in the service of fixing comments. I don't think we want this to count as a link reference definition inside a comment:

<!--

[foo]: bar

-->

Nothing inside a comment should be parsed as Markdown. My original suggestion, treating --> as an HTML block start, doesn't fix this (and also might capture some -->s that aren't part of comments). So I think to deal with this properly we must keep the HTML block open til we hit an appropriate closer. If we do that for comments, it would make sense to do it for script, style, and pre as well. This would also increase backwards compatibility with original Markdown.

mildsunrise · 2015-02-02T11:11:42Z

Okay, I see what you mean now. Let me make a suggestion that will hopefully keep the syntax simple while still providing these features:

An HTML block MUST be started with a complete comment, block tag, CDATA section, declaration or processing instruction, that may span multiple lines and may contain blank lines. From that, the HTML block ends at the next empty line.

These are all valid, single HTML blocks:
```

html
```
```
   foo
bar
```
```
<?php

  echo '>';

?>
```
```
<div style="
   display: inline-block;
   position: relative;
   top: -2px;

   font: 2em Arial, sans-serif;
   color: #bcbcbc;
"> <span>
```
These are not HTML blocks:
```
<table class="foo"
```
```
<?
bar
```
```

```
If an HTML block starts with a complete tag starting a raw text element or escapable raw text element, the HTML block ends at the next empty line after the corresponding close tag, or end of file.

Raw text elements (style and script) don't contain HTML, so it makes absolutely no sense to have Markdown inside of them. But pre can contain HTML, and there are cases where one wants to have Markdown inside a pre:

<pre>
<code class="language-c">
struct ospf_header
{
  u_char version;                       /* OSPF Version. */
  u_char type;                          /* Packet Type. */
  u_int16_t length;                     /* Packet Length. */
  struct in_addr router_id;             /* Router ID. */
  struct in_addr area_id;               /* Area ID. */
  u_int16_t checksum;                   /* Check Sum. */
  u_int16_t auth_type;                  /* Authentication Type. */
  /* Authentication Data. */
  union
  {
    /* Simple Authentication. */
    u_char auth_data [OSPF_AUTH_SIMPLE_SIZE];
    /* Cryptographic Authentication. */
    struct
    {
      u_int16_t zero;                   /* Should be 0. */
      u_char key_id;                    /* Key ID. */
      u_char auth_data_len;             /* Auth Data Length. */
      u_int32_t crypt_seqnum;           /* Cryptographic Sequence Number. */
    } crypt;
  } u;
};
</code>
<div class="caption">

Code from [quagga](http://nongnu.org/quagga)'s OSPF module.

</div></pre>

Therefore we should not make an exception for pre IMO. What do you think?

jgm · 2015-02-02T15:51:06Z

+++ Xavier Mendez [Feb 02 15 03:11 ]:

An HTML block MUST be started with a complete comment, block tag, CDATA section, declaration or processing instruction, that may span multiple lines and may contain blank lines. From that, the HTML block ends at the next empty line.

The problem with this, as explained above, is that it requires potentially indefinite backtracking in the parsers. We want to avoid that. I think that allowing partially formed tags is a reasonable price for avoiding backtracking.

Raw text elements (style and script) don't contain HTML, so it makes absolutely no sense to have Markdown inside of them. But pre is not one of them; there are cases where one wants to have Markdown inside a pre:

I suppose we could mkae an exception for pre. (It seems a bit odd to me to include the caption in the pre element, though.)

mildsunrise · 2015-02-02T16:04:17Z

The problem with this, as explained above, is that it requires potentially indefinite backtracking in the parsers. We want to avoid that. I think that allowing partially formed tags is a reasonable price for avoiding backtracking.

I misunderstood it then, sorry. When you said earlier: "I think to deal with this properly we must keep the HTML block open til we hit an appropriate closer", what should we do if we don't hit an appropiate closer?
Shouldn't we backtrack?

<!--
foo

I suppose we could mkae an exception for pre. (It seems a bit odd to me to include the caption in the pre element, though.)

Semantically it makes sense. <pre> indicates a block of preformatted code, but the code itself goes under a <code>. The role of <pre> for code is very similar to the role of <figure> for images.

jgm · 2015-02-02T17:14:06Z

+++ Xavier Mendez [Feb 02 15 08:04 ]:

The problem with this, as explained above, is that it requires potentially indefinite backtracking in the parsers. We want to avoid that. I think that allowing partially formed tags is a reasonable price for avoiding backtracking.

I misunderstood it then, sorry. When you said earlier: "I think to deal with this properly we must keep the HTML block open til we hit an appropriate closer", what should we do if we don't hit an appropiate closer?
Shouldn't we backtrack?

No. If I forgot to mention it before, it would work like unclosed fenced code blocks currently work: the entire remainder of the document would be considered part of the HTML block.

I suppose we could mkae an exception for pre. (It seems a bit odd to me to include the caption in the pre element, though.)

Semantically it makes sense. <pre> indicates a block of preformatted code, but the code itself goes under a <code>. The role of <pre> for code is very similar to the role of <figure> for images.

Okay.

mildsunrise · 2015-02-02T17:45:06Z

No. If I forgot to mention it before, it would work like unclosed fenced code blocks currently work: the entire remainder of the document would be considered part of the HTML block.

I've been fiddling around with the idea, and it looks awkward when you write it. But I'm curious, how would you define HTML blocks in the spec then?

jgm · 2015-02-02T18:01:23Z

+++ Xavier Mendez [Feb 02 15 09:45 ]:

No. If I forgot to mention it before, it would work like unclosed fenced code blocks currently work: the entire remainder of the document would be considered part of the HTML block.

I've been fiddling around with the idea, and it looks awkward when you write it. But I'm curious, how would you define HTML blocks in the spec then?

It's not easy to see how to do it, exactly, since the matching closing
tag might not occur at the beginning of a line (though I suppose we
colud require this).

The more I think about raw HTML, the more I think we should just require
all HTML blocks to be segregated in an explicit

<html>
...
</html>

context, with these tags on lines by themselves. This would give
maximum flexibility for writers, and make parsing easy. But of course
this is problematic for backwards compatibility.

mildsunrise · 2015-02-02T18:03:24Z

Just define HTML blocks in general. For instance, how is a comment parsed?

jgm · 2015-02-02T18:11:55Z

+++ Xavier Mendez [Feb 02 15 10:03 ]:

Just define HTML blocks in general. For instance, how is a comment parsed?

Same problem here as with script, style. Defining the start of the
block is easy: a comment block starts with  come at the beginning of a line. (And yes, I'm aware that this
would capture some things that are not valid HTML5 comments. I'm not
too worried about this.)

Note that Gruber's original Markdown syntax guide says that in HTML
blocks, the open and close tag must appear at the beginning of the
line, so this wouldn't be a novel requirement, though it would
break some existing documents, since most current parsers don't
actually require this.

mildsunrise · 2015-02-02T18:17:52Z

Defining the end is harder, unless we require that the closing --> come at the beginning of a line. (And yes, I'm aware that this would capture some things that are not valid HTML5 comments. I'm not too worried about this.)

Ah, you read my mind. Browsers do backtrack, so they would never parse this:

<!--
--foo

bar
-->

as a single comment. But let's go on, how would you define a generic tag HTML block?
I can't find a good way.

jgm · 2015-02-02T18:26:40Z

+++ Xavier Mendez [Feb 02 15 10:17 ]:

Defining the end is harder, unless we require that the closing --> come at the beginning of a line. (And yes, I'm aware that this would capture some things that are not valid HTML5 comments. I'm not too worried about this.)

Ah, you read my mind. Browsers do backtrack, so they would never parse this:

as a single comment.

I think I'm okay with that - garbage in, garbage out.

But let's go on, how would you define a generic tag HTML block?
I can't find a good way.

Same idea. Block starts with a line beginning with a (possibly partial)
open tag.

Block ends with:

a blank line (unless the open tag was script or style)
a line beginning with </script> (if open tag was <script)
a line beginning with </style> (if open tag was <style)

So, yes, this would lead to things being parsed as raw HTML blocks that
are not valid HTML5. But, we have that anyway. If the user wants to
use raw HTML, then it's the user's responsibility to make it correct,
not the Markdown parser's. (And any application using a Markdown
parser with untrusted inputs should always run the output through a
sanitizer for security.) I take it for granted here that things like
<script are not going to occur naturally in any context where the user
doesn't intend raw HTML. --> is a bit trickier, since it might be
used naturally as an arrow.

mildsunrise · 2015-02-02T18:35:07Z

Same idea. Block starts with a line beginning with a (possibly partial) open tag.

Block ends with:
a blank line (unless the open tag was script or style)

Imagine this:

<div style="
   display: inline-block;
   position: relative;
   top: -2px;

   font: 2em Arial, sans-serif;
   color: #bcbcbc;
"> <span>

According to that rule, the above is an HTML block of four lines.
Just as we require complete comments, shouldn't we require a complete tag (or end of file) to end the block?

jgm · 2015-02-02T22:46:59Z

+++ Xavier Mendez [Feb 02 15 10:35 ]:

Imagine this:
<div style="
  display: inline-block;
  position: relative;
  top: -2px;

  font: 2em Arial, sans-serif;
  color: #bcbcbc;
"> <span>
According to that rule, the above is an HTML block of three lines.
Just as we require complete comments, shouldn't we require a complete tag (or end of file) to end the block?

Well, I'm trying to keep things simple. I don't think it's a priority to allow authors to do things like this (blank lines in attributes). Keeping the rule simple and making the blocks easier to parse seems to me something that needs to be weighed against the importance of fringe cases like this.

mildsunrise · 2015-02-06T10:18:38Z

Let's do it, then. I personally would like simple rules but if you want to avoid backtracking at all costs...

jgm · 2015-06-08T19:11:13Z

See http://talk.commonmark.org/t/raw-html-blocks-proposals-comments-wanted/983

mildsunrise · 2015-06-08T20:26:04Z

Hmm, I like that proposal! As long as it's well defined, this issue can be closed when / if it's implemented in the spec.

mildsunrise · 2015-07-26T11:26:55Z

Great! 🎉

I'll try to implement the new section ASAP.

jgm added the spec label Nov 17, 2014

mildsunrise changed the title ~~Incomplete tag~~ HTML blocks" Jan 6, 2015

mildsunrise changed the title ~~HTML blocks"~~ "HTML blocks" section vague, doesn't match parsers Jan 6, 2015

jgm mentioned this issue Feb 1, 2015

HTML comments #96

Closed

puzrin mentioned this issue Feb 11, 2015

multiline HTML comments are improperly encoded markdown-it/markdown-it#51

Closed

mildsunrise mentioned this issue Mar 9, 2015

Support CommonMark-style HTML blocks separated by blank lines hoedown/hoedown#127

Open

BigBadaboom mentioned this issue May 13, 2015

Space on end of first line of HTML block #324

Closed

jgm closed this as completed Jul 25, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"HTML blocks" section vague, doesn't match parsers #177

"HTML blocks" section vague, doesn't match parsers #177

mildsunrise commented Nov 1, 2014

jgm commented Nov 1, 2014

mildsunrise commented Nov 1, 2014

jgm commented Dec 29, 2014

mildsunrise commented Dec 29, 2014

jgm commented Dec 29, 2014

EinfachToll commented Jan 6, 2015

mildsunrise commented Jan 6, 2015

jgm commented Feb 1, 2015

mildsunrise commented Feb 1, 2015

jgm commented Feb 1, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 6, 2015

jgm commented Jun 8, 2015

mildsunrise commented Jun 8, 2015

mildsunrise commented Jul 26, 2015

"HTML blocks" section vague, doesn't match parsers #177

"HTML blocks" section vague, doesn't match parsers #177

Comments

mildsunrise commented Nov 1, 2014

jgm commented Nov 1, 2014

mildsunrise commented Nov 1, 2014

jgm commented Dec 29, 2014

mildsunrise commented Dec 29, 2014

jgm commented Dec 29, 2014

EinfachToll commented Jan 6, 2015

mildsunrise commented Jan 6, 2015

jgm commented Feb 1, 2015

mildsunrise commented Feb 1, 2015

jgm commented Feb 1, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 2, 2015

jgm commented Feb 2, 2015

mildsunrise commented Feb 6, 2015

jgm commented Jun 8, 2015

mildsunrise commented Jun 8, 2015

mildsunrise commented Jul 26, 2015