Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"HTML blocks" section vague, doesn't match parsers #177

Closed
mildsunrise opened this issue Nov 1, 2014 · 26 comments
Closed

"HTML blocks" section vague, doesn't match parsers #177

mildsunrise opened this issue Nov 1, 2014 · 26 comments
Labels

Comments

@mildsunrise
Copy link
Contributor

The HTML blocks section provides the definition of an HTML block tag.
Later on, the following example is presented:

<div class
foo

The example header says: "An incomplete HTML block tag may also start an HTML block"

The problem with this is: what's an incomplete HTML block tag? There's no definition that allows me to parse the syntax for such a tag. Does a single < count as an incomplete tag? According to the [current] rules of HTML block parsing, the above would render as <p>&lt;div class\nfoo</p>.

But even if we do provide a definition, what's the point in allowing incomplete tags? Could that be useful sometimes? And, more importantly, if an incomplete block tag can start a block, why can't an incomplete comment, or an incomplete CDATA section, or an incomplete processing instruction start a block?

<!--

Markdown here

-->
<![CDATA[ some data
@jgm
Copy link
Member

jgm commented Nov 1, 2014

If I recall, my thinking was that someone might start a block with an open tag with lots of attributes. If this gets hard-wrapped, the first line will be an incomplete tag. Since I was going for a parsing strategy that could be handled line-by-line (without lookahead), that motivated allowing HTML blocks to start with incomplete tags.

@mildsunrise
Copy link
Contributor Author

Hmm... this isn't very consistent, because in order to parse this HTML block:

<!-- comment that spans
multiple lines -->

You need lookahead, since incomplete comments aren't allowed. Same with CDATAs, etc, etc.
So if lookahead is required (and used) to implement CommonMark, can't we just be consistent and require valid HTML tags?

@jgm jgm added the spec label Nov 17, 2014
@jgm
Copy link
Member

jgm commented Dec 29, 2014

@jmendeth, for consistency I think we need to say that <!-- starts an HTML block (regardless of whether a --> comes after). That requires no lookahead. In order to handle comments with blank lines, we should also allow --> to start an HTML block. See also #96.

This will allow some things that aren't valid HTML to be passed through as raw HTML. But I think that's okay -- Markdown has always let users pass through invalid HTML.

@mildsunrise
Copy link
Contributor Author

I'm fine with this. The section needs to be rewritten then, I suppose? And the same with CDATA and the other types?

@jgm
Copy link
Member

jgm commented Dec 29, 2014

Yes, it needs to be rewritten -- and also CDATA, etc.

@EinfachToll
Copy link

I also would like to have a more strict definition of what counts as incomplete tag. At the moment,

<div

doesn't count as HTML block, but

<div 

does (notice the trailing space). That's not very intuitive. (Tested with commonmark.js.)

@mildsunrise
Copy link
Contributor Author

There are more serious problems. According the spec, this:

<!--

is not an HTML block, but both CMark and CommonMark will accept it as such.
The same happens with CDATA and others types.

@mildsunrise mildsunrise changed the title Incomplete tag HTML blocks" Jan 6, 2015
@mildsunrise mildsunrise changed the title HTML blocks" "HTML blocks" section vague, doesn't match parsers Jan 6, 2015
@jgm
Copy link
Member

jgm commented Feb 1, 2015

On reflection, I think it would make sense to make an exception to the usual rule form HTML blocks (they end at a blank line) for the following elements: pre, script, style, and comments. We'd never want to interpret the contents of these elements as CommonMark, so instead of ending at the first blank line, we should keep the block open til we hit a matching closing tag. (I think it is fine to say that if no matching closing tag is found, the entire rest of the document belongs to the HTML block.)

@jgm jgm mentioned this issue Feb 1, 2015
@mildsunrise
Copy link
Contributor Author

While this exception would of course be useful, I'm not sure it's okay to complicate the HTML blocks syntax even more.

I'd leave it as it is for now, get it matching on the spec, and if the community agree we can add it (it's backwards compatible, so no problem).

@jgm
Copy link
Member

jgm commented Feb 1, 2015

The suggestion was in the service of fixing comments. I don't think we want this to count as a link reference definition inside a comment:

<!--

[foo]: bar

-->

Nothing inside a comment should be parsed as Markdown. My original suggestion, treating --> as an HTML block start, doesn't fix this (and also might capture some -->s that aren't part of comments). So I think to deal with this properly we must keep the HTML block open til we hit an appropriate closer. If we do that for comments, it would make sense to do it for script, style, and pre as well. This would also increase backwards compatibility with original Markdown.

@mildsunrise
Copy link
Contributor Author

Okay, I see what you mean now. Let me make a suggestion that will hopefully keep the syntax simple while still providing these features:

  1. An HTML block MUST be started with a complete comment, block tag, CDATA section, declaration or processing instruction, that may span multiple lines and may contain blank lines. From that, the HTML block ends at the next empty line.

    These are all valid, single HTML blocks:

    <!-- comment -->
    html
    
      <!--
    
      comment
    
    --> foo
    bar
    
    <?php
    
      echo '>';
    
    ?>
    
    <div style="
       display: inline-block;
       position: relative;
       top: -2px;
    
       font: 2em Arial, sans-serif;
       color: #bcbcbc;
    "> <span>
    

    These are not HTML blocks:

    <table class="foo"
    
    <?
    bar
    
    <!--
    foo--
    
    -->
    
  2. If an HTML block starts with a complete tag starting a raw text element or escapable raw text element, the HTML block ends at the next empty line after the corresponding close tag, or end of file.

Raw text elements (style and script) don't contain HTML, so it makes absolutely no sense to have Markdown inside of them. But pre can contain HTML, and there are cases where one wants to have Markdown inside a pre:

<pre>
<code class="language-c">
struct ospf_header
{
  u_char version;                       /* OSPF Version. */
  u_char type;                          /* Packet Type. */
  u_int16_t length;                     /* Packet Length. */
  struct in_addr router_id;             /* Router ID. */
  struct in_addr area_id;               /* Area ID. */
  u_int16_t checksum;                   /* Check Sum. */
  u_int16_t auth_type;                  /* Authentication Type. */
  /* Authentication Data. */
  union
  {
    /* Simple Authentication. */
    u_char auth_data [OSPF_AUTH_SIMPLE_SIZE];
    /* Cryptographic Authentication. */
    struct
    {
      u_int16_t zero;                   /* Should be 0. */
      u_char key_id;                    /* Key ID. */
      u_char auth_data_len;             /* Auth Data Length. */
      u_int32_t crypt_seqnum;           /* Cryptographic Sequence Number. */
    } crypt;
  } u;
};
</code>
<div class="caption">

Code from [quagga](http://nongnu.org/quagga)'s OSPF module.

</div></pre>

Therefore we should not make an exception for pre IMO. What do you think?

@jgm
Copy link
Member

jgm commented Feb 2, 2015

+++ Xavier Mendez [Feb 02 15 03:11 ]:

  1. An HTML block MUST be started with a complete comment, block tag, CDATA section, declaration or processing instruction, that may span multiple lines and may contain blank lines. From that, the HTML block ends at the next empty line.

The problem with this, as explained above, is that it requires potentially indefinite backtracking in the parsers. We want to avoid that. I think that allowing partially formed tags is a reasonable price for avoiding backtracking.

Raw text elements (style and script) don't contain HTML, so it makes absolutely no sense to have Markdown inside of them. But pre is not one of them; there are cases where one wants to have Markdown inside a pre:

I suppose we could mkae an exception for pre. (It seems a bit odd to me to include the caption in the pre element, though.)

@mildsunrise
Copy link
Contributor Author

The problem with this, as explained above, is that it requires potentially indefinite backtracking in the parsers. We want to avoid that. I think that allowing partially formed tags is a reasonable price for avoiding backtracking.

I misunderstood it then, sorry. When you said earlier: "I think to deal with this properly we must keep the HTML block open til we hit an appropriate closer", what should we do if we don't hit an appropiate closer?
Shouldn't we backtrack?

<!--
foo

I suppose we could mkae an exception for pre. (It seems a bit odd to me to include the caption in the pre element, though.)

Semantically it makes sense. <pre> indicates a block of preformatted code, but the code itself goes under a <code>. The role of <pre> for code is very similar to the role of <figure> for images.

@jgm
Copy link
Member

jgm commented Feb 2, 2015

+++ Xavier Mendez [Feb 02 15 08:04 ]:

The problem with this, as explained above, is that it requires potentially indefinite backtracking in the parsers. We want to avoid that. I think that allowing partially formed tags is a reasonable price for avoiding backtracking.

I misunderstood it then, sorry. When you said earlier: "I think to deal with this properly we must keep the HTML block open til we hit an appropriate closer", what should we do if we don't hit an appropiate closer?
Shouldn't we backtrack?

No. If I forgot to mention it before, it would work like unclosed fenced code blocks currently work: the entire remainder of the document would be considered part of the HTML block.

I suppose we could mkae an exception for pre. (It seems a bit odd to me to include the caption in the pre element, though.)

Semantically it makes sense. <pre> indicates a block of preformatted code, but the code itself goes under a <code>. The role of <pre> for code is very similar to the role of <figure> for images.

Okay.

@mildsunrise
Copy link
Contributor Author

No. If I forgot to mention it before, it would work like unclosed fenced code blocks currently work: the entire remainder of the document would be considered part of the HTML block.

I've been fiddling around with the idea, and it looks awkward when you write it. But I'm curious, how would you define HTML blocks in the spec then?

@jgm
Copy link
Member

jgm commented Feb 2, 2015

+++ Xavier Mendez [Feb 02 15 09:45 ]:

No. If I forgot to mention it before, it would work like unclosed fenced code blocks currently work: the entire remainder of the document would be considered part of the HTML block.

I've been fiddling around with the idea, and it looks awkward when you write it. But I'm curious, how would you define HTML blocks in the spec then?

It's not easy to see how to do it, exactly, since the matching closing
tag might not occur at the beginning of a line (though I suppose we
colud require this).

The more I think about raw HTML, the more I think we should just require
all HTML blocks to be segregated in an explicit

<html>
...
</html>

context, with these tags on lines by themselves. This would give
maximum flexibility for writers, and make parsing easy. But of course
this is problematic for backwards compatibility.

@mildsunrise
Copy link
Contributor Author

Just define HTML blocks in general. For instance, how is a comment parsed?

@jgm
Copy link
Member

jgm commented Feb 2, 2015

+++ Xavier Mendez [Feb 02 15 10:03 ]:

Just define HTML blocks in general. For instance, how is a comment parsed?

Same problem here as with script, style. Defining the start of the
block is easy: a comment block starts with <!-- at the beginning of a
line. Defining the end is harder, unless we require that the closing
--> come at the beginning of a line. (And yes, I'm aware that this
would capture some things that are not valid HTML5 comments. I'm not
too worried about this.)

Note that Gruber's original Markdown syntax guide says that in HTML
blocks, the open and close tag must appear at the beginning of the
line, so this wouldn't be a novel requirement, though it would
break some existing documents, since most current parsers don't
actually require this.

@mildsunrise
Copy link
Contributor Author

Defining the end is harder, unless we require that the closing --> come at the beginning of a line. (And yes, I'm aware that this would capture some things that are not valid HTML5 comments. I'm not too worried about this.)

Ah, you read my mind. Browsers do backtrack, so they would never parse this:

<!--
--foo

bar
-->

as a single comment. But let's go on, how would you define a generic tag HTML block?
I can't find a good way.

@jgm
Copy link
Member

jgm commented Feb 2, 2015

+++ Xavier Mendez [Feb 02 15 10:17 ]:

Defining the end is harder, unless we require that the closing --> come at the beginning of a line. (And yes, I'm aware that this would capture some things that are not valid HTML5 comments. I'm not too worried about this.)

Ah, you read my mind. Browsers do backtrack, so they would never parse this:

as a single comment.

I think I'm okay with that - garbage in, garbage out.

But let's go on, how would you define a generic tag HTML block?
I can't find a good way.

Same idea. Block starts with a line beginning with a (possibly partial)
open tag.

Block ends with:

  • a blank line (unless the open tag was script or style)
  • a line beginning with </script> (if open tag was <script)
  • a line beginning with </style> (if open tag was <style)

So, yes, this would lead to things being parsed as raw HTML blocks that
are not valid HTML5. But, we have that anyway. If the user wants to
use raw HTML, then it's the user's responsibility to make it correct,
not the Markdown parser's. (And any application using a Markdown
parser with untrusted inputs should always run the output through a
sanitizer for security.) I take it for granted here that things like
<script are not going to occur naturally in any context where the user
doesn't intend raw HTML. --> is a bit trickier, since it might be
used naturally as an arrow.

@mildsunrise
Copy link
Contributor Author

Same idea. Block starts with a line beginning with a (possibly partial) open tag.

Block ends with:
a blank line (unless the open tag was script or style)

Imagine this:

<div style="
   display: inline-block;
   position: relative;
   top: -2px;

   font: 2em Arial, sans-serif;
   color: #bcbcbc;
"> <span>

According to that rule, the above is an HTML block of four lines.
Just as we require complete comments, shouldn't we require a complete tag (or end of file) to end the block?

@jgm
Copy link
Member

jgm commented Feb 2, 2015

+++ Xavier Mendez [Feb 02 15 10:35 ]:

Imagine this:

<div style="
  display: inline-block;
  position: relative;
  top: -2px;

  font: 2em Arial, sans-serif;
  color: #bcbcbc;
"> <span>

According to that rule, the above is an HTML block of three lines.
Just as we require complete comments, shouldn't we require a complete tag (or end of file) to end the block?

Well, I'm trying to keep things simple. I don't think it's a priority to allow authors to do things like this (blank lines in attributes). Keeping the rule simple and making the blocks easier to parse seems to me something that needs to be weighed against the importance of fringe cases like this.

@mildsunrise
Copy link
Contributor Author

Let's do it, then. I personally would like simple rules but if you want to avoid backtracking at all costs...

@jgm
Copy link
Member

jgm commented Jun 8, 2015

@mildsunrise
Copy link
Contributor Author

Hmm, I like that proposal! As long as it's well defined, this issue can be closed when / if it's implemented in the spec.

@jgm jgm closed this as completed Jul 25, 2015
@mildsunrise
Copy link
Contributor Author

Great! 🎉

I'll try to implement the new section ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants