
CommonMark compliance #1202

joshbruce started this conversation in General
Apr 4, 2018 · 8 comments

Marked version: 0.3.19

Markdown flavor: CommonMark

Proposal type: other

What pain point are you perceiving?

We are not compliant with the CommonMark specification. The CM spec is the foundation for the GitHub Flavored Markdown specification. (GFM only adds some extensions for things like tables.)

Part of how we got here was that historically we did not strictly stick to the specification. In some cases we implemented custom features not discussed in the specifications (header id, for example).

What solution are you suggesting?

PR #1160 looks to run Marked against the CommonMark spec test cases: http://spec.commonmark.org/0.28/spec.json

So far the results show that we are roughly 60% spec-compliant.

The spec is divided into sections, and each section has a number of test cases. (Counts may be off; they are meant to give an estimated percent compliance for each section.)

| Section | Count | Percent |
| --- | --- | --- |
| Tabs | 7 of 11 | 64% |
| Thematic breaks | 16 of 19 | 84% |
| ATX headings | 13 of 18 | 72% |
| Setext headings | 20 of 26 | 77% |
| Indented code blocks | 11 of 12 | 92% |
| Fenced code blocks | 17 of 28 | 61% |
| HTML blocks | 12 of 43 | 28% |
| Link reference definitions | 21 of 23 | 91% |
| Paragraphs | 6 of 8 | 75% |
| Block quotes | 21 of 25 | 84% |
| List items | 32 of 48 | 67% |
| Lists | 10 of 24 | 42% |
| Backslash escapes | 4 of 13 | 31% |
| Entity and numeric character references | 8 of 12 | 67% |
| Code spans | 10 of 17 | 59% |
| Emphasis and strong emphasis | 62 of 128 | 48% |
| Links | 46 of 84 | 55% |
| Images | 13 of 22 | 59% |
| Autolinks | 14 of 19 | 74% |
| Hard line breaks | 32 of 36 | 89% |
| Soft line breaks | 1 of 2 | 50% |
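
As a rough illustration of how a per-section tally like this could be produced by script rather than by hand, here is a minimal Python sketch. It assumes records shaped like the entries in spec.json (each has `markdown`, `html`, `example`, and `section` fields). The `SAMPLE` records are made-up stand-ins, `toy_render` is a hypothetical placeholder for the renderer under test, and comparing output by exact string equality is an assumption; the actual test harness may normalize HTML first.

```python
def tally_by_section(examples, render):
    """Count passing and total examples per spec section, in first-seen order."""
    tally = {}
    for ex in examples:
        passed, total = tally.get(ex["section"], (0, 0))
        ok = render(ex["markdown"]) == ex["html"]  # exact string match (assumption)
        tally[ex["section"]] = (passed + int(ok), total + 1)
    return tally


# Made-up stand-ins shaped like spec.json entries; not copied from the spec.
SAMPLE = [
    {"example": 1, "section": "Paragraphs", "markdown": "aaa\n", "html": "<p>aaa</p>\n"},
    {"example": 2, "section": "Paragraphs", "markdown": "aaa\n\nbbb\n",
     "html": "<p>aaa</p>\n<p>bbb</p>\n"},
    {"example": 3, "section": "Thematic breaks", "markdown": "***\n", "html": "<hr />\n"},
]


def toy_render(markdown):
    """Hypothetical renderer: wraps the whole input in a single paragraph."""
    return "<p>" + markdown.strip() + "</p>\n"


for section, (passed, total) in tally_by_section(SAMPLE, toy_render).items():
    print(f"{section} {passed} of {total} {round(100 * passed / total)}%")
```

With the toy renderer, only the first sample record passes, so the loop prints one row per section in the same "Section, Count, Percent" shape as the table.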

Replies

@joshbruce Did you use a script to generate this table?
I just did a count based on shouldPassButFails arrays and this is what I got:

We are failing 157 of 624 CommonMark tests, which means we're passing 467 of 624.

That's about 75% spec-compliant in v0.5.0 👍
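
The arithmetic behind that figure can be checked with a short sketch. The function name `compliance` is only illustrative; the two inputs are the failing and total counts quoted above.

```python
def compliance(failing, total):
    """Return (passing count, percent passing rounded to a whole number)."""
    passing = total - failing
    return passing, round(100 * passing / total)


passing, percent = compliance(157, 624)
print(f"{passing} of 624 passing, about {percent}%")  # 467 of 624 passing, about 75%
```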


joshbruce
Aug 17, 2018
Maintainer Author

@styfle: Nice! No I did it by hand back in April. Think we've had two releases since then.

Not sure I have time right now to keep up with maintaining the ticket, maybe we should close it??


joshbruce
Aug 17, 2018
Maintainer Author

Scratch that last about spec compliance - just remembered it's an epic related to all the 0.x stuff. Maybe I can get to it this weekend. (Trying to find a new day job is full-time work as it has been said. 😀)


I counted the remaining shouldPassButFails arrays and we are failing 122 of 624.

So we're passing 502/624 which means v0.6.0 is 80% spec-compliant 👍

| Section | Count | Percent |
| --- | --- | --- |
| Tabs | 10 of 11 | 91% |
| Precedence | 1 of 1 | 100% |
| Thematic breaks | 19 of 19 | 100% |
| ATX headings | 17 of 18 | 94% |
| Setext headings | 24 of 27 | 89% |
| Indented code blocks | 12 of 12 | 100% |
| Fenced code blocks | 28 of 29 | 97% |
| HTML blocks | 43 of 43 | 100% |
| Link reference definitions | 25 of 28 | 89% |
| Paragraphs | 8 of 8 | 100% |
| Blank lines | 1 of 1 | 100% |
| Block quotes | 23 of 25 | 92% |
| List items | 35 of 48 | 73% |
| Lists | 15 of 26 | 58% |
| Inlines | 1 of 1 | 100% |
| Backslash escapes | 11 of 13 | 85% |
| Entity and numeric character references | 13 of 17 | 76% |
| Code spans | 21 of 22 | 95% |
| Emphasis and strong emphasis | 85 of 131 | 65% |
| Links | 70 of 87 | 80% |
| Images | 15 of 22 | 68% |
| Autolinks | 19 of 19 | 100% |
| Raw HTML | 19 of 21 | 90% |
| Hard line breaks | 15 of 15 | 100% |
| Soft line breaks | 2 of 2 | 100% |
| Textual content | 3 of 3 | 100% |

v0.7.0 is > 82%
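
For anyone who wants to re-derive the overall figure from the per-section counts above, a small Python sketch follows. The `(section, passed, total)` tuples are transcribed from that table; the variable names are illustrative only.

```python
# (section, passed, total) transcribed from the v0.7.0 per-section counts above.
SECTIONS = [
    ("Tabs", 10, 11), ("Precedence", 1, 1), ("Thematic breaks", 19, 19),
    ("ATX headings", 17, 18), ("Setext headings", 24, 27),
    ("Indented code blocks", 12, 12), ("Fenced code blocks", 28, 29),
    ("HTML blocks", 43, 43), ("Link reference definitions", 25, 28),
    ("Paragraphs", 8, 8), ("Blank lines", 1, 1), ("Block quotes", 23, 25),
    ("List items", 35, 48), ("Lists", 15, 26), ("Inlines", 1, 1),
    ("Backslash escapes", 11, 13),
    ("Entity and numeric character references", 13, 17),
    ("Code spans", 21, 22), ("Emphasis and strong emphasis", 85, 131),
    ("Links", 70, 87), ("Images", 15, 22), ("Autolinks", 19, 19),
    ("Raw HTML", 19, 21), ("Hard line breaks", 15, 15),
    ("Soft line breaks", 2, 2), ("Textual content", 3, 3),
]

passed = sum(p for _, p, _ in SECTIONS)
total = sum(t for _, _, t in SECTIONS)
print(f"{passed} of {total} = {100 * passed / total:.1f}%")  # 535 of 649 = 82.4%

# Section with the most failing examples, i.e. the most room to improve.
worst = max(SECTIONS, key=lambda s: s[2] - s[1])
print(worst[0], "fails", worst[2] - worst[1])  # Emphasis and strong emphasis fails 46
```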

Emphasis and strong emphasis is the area where we can improve the most.

Keep up the good work, we are getting there slowly but surely.


It appears that a major roadblock to CommonMark compliance is the block-level tokenizing strategy used here. marked uses a recursive block tokenizing strategy, but CommonMark was very clearly written with a line-by-line tokenizing strategy in mind (see the block structure parsing strategy recommended by CommonMark). Many of the failed test cases in the container block sections (block quotes, list items, and lists) stem from this mismatch.

marked's recursive tokenizing strategy works as follows. Whenever it detects the start of a new container block, it:

  1. attempts to find the end of the container block
  2. parses out all of the text that the container block contains
  3. removes line prefixes related to the container block
  4. recursively tokenizes the cleaned contents of the container block

The problem is that, because of CommonMark's lazy continuation lines, it is impossible to determine where a container block ends until you have tokenized its contents. Yet it is very difficult to tokenize the contents of a container block without knowing where the block ends.
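
To make the problem concrete, here is a deliberately simplified Python sketch of the four steps above, handling only block quotes and paragraphs. It is not marked's actual code; the function name `tokenize` and the token dictionaries are assumptions made for illustration.

```python
import re


def tokenize(lines):
    """Recursive block tokenizer sketch: block quotes and paragraphs only."""
    tokens = []
    i = 0
    while i < len(lines):
        if lines[i].startswith(">"):
            # Step 1: guess the end of the container from the ">" prefixes alone.
            j = i
            while j < len(lines) and lines[j].startswith(">"):
                j += 1
            # Steps 2 and 3: collect the container's lines and strip the prefix.
            inner = [re.sub(r"^> ?", "", line) for line in lines[i:j]]
            # Step 4: recursively tokenize the cleaned contents.
            tokens.append({"type": "blockquote", "children": tokenize(inner)})
            i = j
        elif lines[i].strip() == "":
            i += 1  # blank lines separate blocks
        else:
            j = i
            while j < len(lines) and lines[j].strip() and not lines[j].startswith(">"):
                j += 1
            tokens.append({"type": "paragraph", "text": "\n".join(lines[i:j])})
            i = j
    return tokens


print(tokenize(["> foo", "bar"]))
```

For the input `> foo` followed by `bar`, this sketch closes the block quote after the first line and emits `bar` as a separate paragraph. Under CommonMark, `bar` is a lazy continuation line and belongs to the paragraph inside the block quote, which step 1 cannot know without already having parsed the contents.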

The obvious solution is to rewrite the block-level tokenizer to operate line-by-line instead of recursively. It would help achieve CommonMark/GFM compliance, and it would remove some of the "hacky" code that already exists in the tokenizer. Is this something that maintainers of this project would find acceptable if done well? Or is this such a drastic change to the project that a pull request like this would be immediately denied?
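
A line-by-line approach might look roughly like the following Python sketch. It is a toy that handles only block quotes and paragraphs, tracking which blocks are currently open as each line arrives. The name `parse_lines` and the node dictionaries are hypothetical, and the full CommonMark rules are considerably more involved (for example, a line that would start a new block is not treated as a lazy continuation).

```python
import re


def parse_lines(lines):
    """Line-by-line sketch: one pass, tracking the currently open blocks."""
    root = {"type": "document", "children": []}
    quote = None  # open block quote, if any
    para = None   # open paragraph, if any (inside `quote` whenever one is open)
    for line in lines:
        quoted = line.startswith(">")
        content = re.sub(r"^> ?", "", line) if quoted else line
        if content.strip() == "":
            para = None       # a blank line closes the open paragraph
            if not quoted:
                quote = None  # an unprefixed blank line also closes the quote
            continue
        if quoted and quote is None:
            quote = {"type": "blockquote", "children": []}
            root["children"].append(quote)
            para = None       # a new quote interrupts any top-level paragraph
        elif not quoted and para is None:
            quote = None      # nothing to continue lazily, so the quote ends
        # Otherwise the line continues what is open; an unprefixed line that
        # extends a paragraph inside an open quote is a lazy continuation.
        container = quote if quote is not None else root
        if para is None:
            para = {"type": "paragraph", "text": content}
            container["children"].append(para)
        else:
            para["text"] += "\n" + content
    return root


print(parse_lines(["> foo", "bar"])["children"])
```

Here `bar` is appended to the paragraph inside the block quote, because the decision is made per line with knowledge of which paragraph is still open, rather than by guessing the container's extent up front.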


As long as marked is still easy to extend, fast, and secure, and all of the current tests pass, I don't think it matters how it happens.

Full CommonMark compliance would be amazing but I don't know that it is the golden trophy. There are some specs that just don't make actual sense in the real world.

For Example #416:

| markdown | html |
| --- | --- |
| `foo******bar*********baz` | `<p>foo<strong><strong><strong>bar</strong></strong></strong>***baz</p>` |

I don't think sacrificing speed is worth making sure a <strong> can be nested in another <strong> since the browser will display <strong><strong>text</strong></strong> the same as <strong>text</strong> and the ultimate goal of markdown is to be displayed in the browser.


Here are the latest values for reference. I copied these from the latest CI run.

GFM

| Section | Count | Percent |
| --- | --- | --- |
| Tabs | 11 of 11 | 100% |
| Backslash escapes | 11 of 13 | 85% |
| Entity and numeric character references | 13 of 17 | 76% |
| Precedence | 1 of 1 | 100% |
| Thematic breaks | 19 of 19 | 100% |
| ATX headings | 18 of 18 | 100% |
| Setext headings | 24 of 27 | 89% |
| Indented code blocks | 12 of 12 | 100% |
| Fenced code blocks | 29 of 29 | 100% |
| HTML blocks | 44 of 44 | 100% |
| Link reference definitions | 24 of 27 | 89% |
| Paragraphs | 8 of 8 | 100% |
| Blank lines | 1 of 1 | 100% |
| Block quotes | 23 of 25 | 92% |
| List items | 48 of 48 | 100% |
| Lists | 26 of 26 | 100% |
| Inlines | 1 of 1 | 100% |
| Code spans | 22 of 22 | 100% |
| Emphasis and strong emphasis | 131 of 131 | 100% |
| Links | 75 of 90 | 83% |
| Images | 15 of 22 | 68% |
| Autolinks | 15 of 19 | 79% |
| Raw HTML | 19 of 21 | 90% |
| Hard line breaks | 15 of 15 | 100% |
| Soft line breaks | 2 of 2 | 100% |
| Textual content | 3 of 3 | 100% |
| [extension] Tables | 8 of 8 | 100% |
| [extension] Task list items | 2 of 2 | 100% |
| [extension] Strikethrough | 2 of 2 | 100% |
| [extension] Autolinks | 11 of 11 | 100% |
| [extension] Disallowed Raw HTML | 0 of 1 | 0% |

CommonMark

| Section | Count | Percent |
| --- | --- | --- |
| Tabs | 11 of 11 | 100% |
| Backslash escapes | 11 of 13 | 85% |
| Entity and numeric character references | 13 of 17 | 76% |
| Precedence | 1 of 1 | 100% |
| Thematic breaks | 19 of 19 | 100% |
| ATX headings | 18 of 18 | 100% |
| Setext headings | 24 of 27 | 89% |
| Indented code blocks | 12 of 12 | 100% |
| Fenced code blocks | 29 of 29 | 100% |
| HTML blocks | 44 of 44 | 100% |
| Link reference definitions | 24 of 27 | 89% |
| Paragraphs | 8 of 8 | 100% |
| Blank lines | 1 of 1 | 100% |
| Block quotes | 23 of 25 | 92% |
| List items | 48 of 48 | 100% |
| Lists | 26 of 26 | 100% |
| Inlines | 1 of 1 | 100% |
| Code spans | 22 of 22 | 100% |
| Emphasis and strong emphasis | 131 of 131 | 100% |
| Links | 75 of 90 | 83% |
| Images | 15 of 22 | 68% |
| Autolinks | 19 of 19 | 100% |
| Raw HTML | 19 of 21 | 90% |
| Hard line breaks | 15 of 15 | 100% |
| Soft line breaks | 2 of 2 | 100% |
| Textual content | 3 of 3 | 100% |
