CommonMark compliance #1202

joshbruce (Maintainer) · 2018-04-04T16:41:05Z

Marked version: 0.3.19

Markdown flavor: CommonMark

Proposal type: other

What pain point are you perceiving?

We are not compliant with the CommonMark specification. The CM spec is the foundation for the GitHub Flavored Markdown specification. (GFM only adds some extensions for things like tables.)

Part of how we got here was that historically we did not strictly stick to the specification. In some cases we implemented custom features not discussed in the specifications (header id, for example).

What solution are you suggesting?

PR #1160 looks to run Marked against the CommonMark spec test cases: http://spec.commonmark.org/0.28/spec.json

So far the results show that we are roughly 60% spec-compliant.

The spec is divided into sections and each section has a number of test cases. (Counts may be off and are meant to give an estimated percent compliance for each section.)

Section	Count	Percent
Tabs	7 of 11	64%
Thematic breaks	16 of 19	84%
ATX headings	13 of 18	72%
Setext headings	20 of 26	77%
Indented code blocks	11 of 12	92%
Fenced code blocks	17 of 28	61%
HTML blocks	12 of 43	28%
Link reference definitions	21 of 23	91%
Paragraphs	6 of 8	75%
Block quotes	21 of 25	84%
List items	32 of 48	67%
Lists	10 of 24	42%
Backslash escapes	4 of 13	31%
Entity and numeric character references	8 of 12	67%
Code spans	10 of 17	59%
Emphasis and strong emphasis	62 of 128	48%
Links	46 of 84	55%
Images	13 of 22	59%
Autolinks	14 of 19	74%
Hard line breaks	32 of 36	89%
Soft line breaks	1 of 2	50%
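
For reference, a per-section tally like the table above could be produced by a script along these lines. This is a minimal sketch in Python, not the project's actual test harness: it assumes each test case is a record with `section`, `markdown`, and `html` fields (the shape used by the CommonMark `spec.json`), and `toy_render` is a hypothetical stand-in for the converter under test.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def tally_by_section(
    cases: Iterable[Dict[str, str]],
    render: Callable[[str], str],
) -> List[Tuple[str, int, int, int]]:
    """Return (section, passed, total, percent) rows in first-seen order."""
    counts: Dict[str, List[int]] = {}
    for case in cases:
        passed_total = counts.setdefault(case["section"], [0, 0])
        passed_total[1] += 1  # one more test seen in this section
        if render(case["markdown"]) == case["html"]:
            passed_total[0] += 1  # output matched the expected HTML exactly
    return [
        (section, passed, total, round(100 * passed / total))
        for section, (passed, total) in counts.items()
    ]


# Hypothetical stand-in renderer: wraps every input in a paragraph.
def toy_render(markdown: str) -> str:
    return "<p>" + markdown.strip() + "</p>\n"


# Hand-written sample records for illustration, not real spec cases.
sample_cases = [
    {"section": "Paragraphs", "markdown": "aaa\n", "html": "<p>aaa</p>\n"},
    {"section": "Paragraphs", "markdown": "bbb\n", "html": "<p>bbb</p>\n"},
    {"section": "Thematic breaks", "markdown": "***\n", "html": "<hr />\n"},
]

for section, passed, total, percent in tally_by_section(sample_cases, toy_render):
    print(f"{section}\t{passed} of {total}\t{percent}%")
```

Comparing the rendered string to the expected HTML byte for byte is the strictest possible check; a real harness might normalize whitespace first, which would change the counts.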

styfle (Maintainer) · 2018-08-17T15:26:40Z

@joshbruce Did you use a script to generate this table?
I just did a count based on shouldPassButFails arrays and this is what I got:

We are failing 157 of 624 commonmark tests which means we're passing 467 of 624.

That's about 75% spec-compliant in v0.5.0 👍
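
The arithmetic behind that figure, as a small Python sketch (the function name `pass_rate` is made up for illustration):

```python
def pass_rate(failing: int, total: int) -> float:
    """Percentage of tests passing, given how many are failing."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * (total - failing) / total


# 157 failing of 624 leaves 467 passing.
print(f"{pass_rate(157, 624):.1f}%")
```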

joshbruce (Maintainer, Author) · 2018-08-17T15:38:12Z

@styfle: Nice! No I did it by hand back in April. Think we've had two releases since then.

Not sure I have time right now to keep up with maintaining the ticket, maybe we should close it??

joshbruce (Maintainer, Author) · 2018-08-17T15:39:59Z

Scratch that last about spec compliance - just remembered it's an epic related to all the 0.x stuff. Maybe I can get to it this weekend. (Trying to find a new day job is full-time work as it has been said. 😀)

styfle (Maintainer) · 2019-01-03T16:08:56Z

I counted the remaining shouldPassButFails arrays and we are failing 122 of 624.

So we're passing 502/624 which means v0.6.0 is 80% spec-compliant 👍

UziTech (Maintainer) · 2019-09-13T13:36:42Z

Section	Count	Percent
Tabs	10 of 11	91%
Precedence	1 of 1	100%
Thematic breaks	19 of 19	100%
ATX headings	17 of 18	94%
Setext headings	24 of 27	89%
Indented code blocks	12 of 12	100%
Fenced code blocks	28 of 29	97%
HTML blocks	43 of 43	100%
Link reference definitions	25 of 28	89%
Paragraphs	8 of 8	100%
Blank lines	1 of 1	100%
Block quotes	23 of 25	92%
List items	35 of 48	73%
Lists	15 of 26	58%
Inlines	1 of 1	100%
Backslash escapes	11 of 13	85%
Entity and numeric character references	13 of 17	76%
Code spans	21 of 22	95%
Emphasis and strong emphasis	85 of 131	65%
Links	70 of 87	80%
Images	15 of 22	68%
Autolinks	19 of 19	100%
Raw HTML	19 of 21	90%
Hard line breaks	15 of 15	100%
Soft line breaks	2 of 2	100%
Textual content	3 of 3	100%

v0.7.0 is > 82%
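
The overall figure can be rechecked by summing the per-section counts from the table above. A short Python sketch, with the counts copied by hand from that table:

```python
# (passed, total) per section, copied from the v0.7.0 table above.
counts = {
    "Tabs": (10, 11),
    "Precedence": (1, 1),
    "Thematic breaks": (19, 19),
    "ATX headings": (17, 18),
    "Setext headings": (24, 27),
    "Indented code blocks": (12, 12),
    "Fenced code blocks": (28, 29),
    "HTML blocks": (43, 43),
    "Link reference definitions": (25, 28),
    "Paragraphs": (8, 8),
    "Blank lines": (1, 1),
    "Block quotes": (23, 25),
    "List items": (35, 48),
    "Lists": (15, 26),
    "Inlines": (1, 1),
    "Backslash escapes": (11, 13),
    "Entity and numeric character references": (13, 17),
    "Code spans": (21, 22),
    "Emphasis and strong emphasis": (85, 131),
    "Links": (70, 87),
    "Images": (15, 22),
    "Autolinks": (19, 19),
    "Raw HTML": (19, 21),
    "Hard line breaks": (15, 15),
    "Soft line breaks": (2, 2),
    "Textual content": (3, 3),
}

passed = sum(p for p, _ in counts.values())
total = sum(t for _, t in counts.values())
print(f"{passed} of {total} = {100 * passed / total:.1f}%")

# Sections ranked by number of failing tests, largest first.
worst = sorted(counts, key=lambda s: counts[s][1] - counts[s][0], reverse=True)
print(worst[:3])
```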

Emphasis and strong emphasis is the area where we can improve the most.

Keep up the good work, we are getting there slowly but surely.

corbinmcneill · 2020-09-11T22:20:21Z

It appears that a major roadblock to CommonMark compliance is the block-level tokenizing strategy that is being used here. marked uses a recursive block tokenizing strategy, but CommonMark was very clearly written with a line-by-line tokenizing strategy in mind. (See the block structure parsing strategy recommended by CommonMark.) Many of the failed test cases in the container block sections (block quotes, list items, and lists) stem from the fact that CommonMark is fundamentally designed around line-by-line tokenizing while marked uses recursive tokenizing.

marked's recursive tokenizing strategy works as follows. Whenever it detects the start of a new container block, it:

1. attempts to find the end of the container block
2. parses out all of the text that the container block contains
3. removes line prefixes related to the container block
4. recursively tokenizes the cleaned contents of the container block

The problem is that, because of CommonMark's lazy continuation, it is impossible to determine where a container block ends until you've tokenized the contents of that container block. It is, however, very difficult to tokenize the contents of a container block without knowing where the container block ends.
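
To make the difficulty concrete, here is a deliberately naive toy in Python that follows the recursive steps described above for block quotes only. It is an illustration under simplifying assumptions, not marked's actual tokenizer: the token shapes and the rule "the quote ends at the first line without a `>` prefix" are invented for this sketch.

```python
from typing import Dict, List

Token = Dict[str, object]


def tokenize(lines: List[str]) -> List[Token]:
    """Toy recursive block tokenizer: block quotes and paragraphs only."""
    tokens: List[Token] = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # skip blank lines between blocks
        elif lines[i].startswith(">"):
            # Find the end of the container: the run of ">"-prefixed lines.
            end = i
            while end < len(lines) and lines[end].startswith(">"):
                end += 1
            # Take the contained text and strip ">" plus one optional space.
            inner = [
                line[2:] if line.startswith("> ") else line[1:]
                for line in lines[i:end]
            ]
            # Recursively tokenize the cleaned contents.
            tokens.append({"type": "blockquote", "children": tokenize(inner)})
            i = end
        else:
            # Paragraph: consecutive non-blank lines that do not start a quote.
            end = i
            while (
                end < len(lines)
                and lines[end].strip()
                and not lines[end].startswith(">")
            ):
                end += 1
            tokens.append({"type": "paragraph", "text": "\n".join(lines[i:end])})
            i = end
    return tokens


# "bar" is a lazy continuation line. CommonMark keeps it inside the quoted
# paragraph, but the end-finding step has already cut it off, so this toy
# emits it as a separate paragraph after the block quote.
print(tokenize(["> foo", "bar"]))
```

Whether `bar` belongs to the quote depends on what the quote contains: after `> foo` it continues the open paragraph, while after `> # foo` it would start a new paragraph outside the quote. That is why the end of the container cannot be found from prefixes alone.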

The obvious solution is to rewrite the block-level tokenizer to operate line-by-line instead of recursively. It would help achieve CommonMark/GFM compliance, and it would remove some of the "hacky" code that already exists in the tokenizer. Is this something that maintainers of this project would find acceptable if done well? Or is this such a drastic change to the project that a pull request like this would be immediately denied?

UziTech (Maintainer) · 2020-09-12T03:55:16Z

As long as marked is still easy to extend, fast, secure, and all of the current tests pass I don't think it matters how it happens.

Full CommonMark compliance would be amazing but I don't know that it is the golden trophy. There are some specs that just don't make actual sense in the real world.

For example, spec example #416:

markdown	html
`foo******bar*********baz`	`<p>foo<strong><strong><strong>bar</strong></strong></strong>***baz</p>`

I don't think sacrificing speed is worth making sure a `<strong>` can be nested in another `<strong>`, since the browser will display `<strong><strong>text</strong></strong>` the same as `<strong>text</strong>` and the ultimate goal of markdown is to be displayed in the browser.

bilalshaikh42 · 2022-01-04T18:20:05Z

Here are the latest values for reference, copied from the most recent CI run.

GFM

Section	Count	Percent
Tabs	11 of 11	100%
Backslash escapes	11 of 13	85%
Entity and numeric character references	13 of 17	76%
Precedence	1 of 1	100%
Thematic breaks	19 of 19	100%
ATX headings	18 of 18	100%
Setext headings	24 of 27	89%
Indented code blocks	12 of 12	100%
Fenced code blocks	29 of 29	100%
HTML blocks	44 of 44	100%
Link reference definitions	24 of 27	89%
Paragraphs	8 of 8	100%
Blank lines	1 of 1	100%
Block quotes	23 of 25	92%
List items	48 of 48	100%
Lists	26 of 26	100%
Inlines	1 of 1	100%
Code spans	22 of 22	100%
Emphasis and strong emphasis	131 of 131	100%
Links	75 of 90	83%
Images	15 of 22	68%
Autolinks	15 of 19	79%
Raw HTML	19 of 21	90%
Hard line breaks	15 of 15	100%
Soft line breaks	2 of 2	100%
Textual content	3 of 3	100%
[extension] Tables	8 of 8	100%
[extension] Task list items	2 of 2	100%
[extension] Strikethrough	2 of 2	100%
[extension] Autolinks	11 of 11	100%
[extension] Disallowed Raw HTML	0 of 1	0%

CommonMark

Section	Count	Percent
Tabs	11 of 11	100%
Backslash escapes	11 of 13	85%
Entity and numeric character references	13 of 17	76%
Precedence	1 of 1	100%
Thematic breaks	19 of 19	100%
ATX headings	18 of 18	100%
Setext headings	24 of 27	89%
Indented code blocks	12 of 12	100%
Fenced code blocks	29 of 29	100%
HTML blocks	44 of 44	100%
Link reference definitions	24 of 27	89%
Paragraphs	8 of 8	100%
Blank lines	1 of 1	100%
Block quotes	23 of 25	92%
List items	48 of 48	100%
Lists	26 of 26	100%
Inlines	1 of 1	100%
Code spans	22 of 22	100%
Emphasis and strong emphasis	131 of 131	100%
Links	75 of 90	83%
Images	15 of 22	68%
Autolinks	19 of 19	100%
Raw HTML	19 of 21	90%
Hard line breaks	15 of 15	100%
Soft line breaks	2 of 2	100%
Textual content	3 of 3	100%

1 reply

mrienstra Sep 13, 2022

In case anyone is curious: I grabbed the most recent CI run (Sep 12) & diffed it against what @bilalshaikh42 posted above (from Jan 3 CI run), looks like there have been no changes (same number of spec tests, same percentages).
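
A comparison like this can be scripted. The following Python sketch assumes the report is available as tab-separated rows in the same layout as the tables in this thread; `parse_report` and `diff_reports` are hypothetical helper names, and the two sample strings are a couple of rows retyped from above, not full CI output.

```python
import re
from typing import Dict, List, Tuple

Report = Dict[str, Tuple[int, int]]  # section -> (passed, total)

# Matches rows such as "Backslash escapes<TAB>11 of 13<TAB>85%".
ROW = re.compile(r"^(.+?)\t(\d+) of (\d+)\t")


def parse_report(text: str) -> Report:
    """Parse tab-separated rows; header and non-matching lines are skipped."""
    report: Report = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            report[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return report


def diff_reports(old: Report, new: Report) -> List[str]:
    """List sections whose counts differ between two runs."""
    changes = []
    for section in sorted(set(old) | set(new)):
        if old.get(section) != new.get(section):
            changes.append(f"{section}: {old.get(section)} -> {new.get(section)}")
    return changes


jan = "Section\tCount\tPercent\nTabs\t11 of 11\t100%\nBackslash escapes\t11 of 13\t85%\n"
sep = "Section\tCount\tPercent\nTabs\t11 of 11\t100%\nBackslash escapes\t11 of 13\t85%\n"

# An empty list means the two runs report identical counts.
print(diff_reports(parse_report(jan), parse_report(sep)))
```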

UziTech (Maintainer) · 2022-11-12T15:57:21Z

V4.2.2

Backslash escapes is 100% 🎉

GFM

Section	Count	Percent
Tabs	11 of 11	100%
Backslash escapes	13 of 13	100%
Entity and numeric character references	13 of 17	76%
Precedence	1 of 1	100%
Thematic breaks	19 of 19	100%
ATX headings	18 of 18	100%
Setext headings	24 of 27	89%
Indented code blocks	12 of 12	100%
Fenced code blocks	29 of 29	100%
HTML blocks	44 of 44	100%
Link reference definitions	25 of 27	93%
Paragraphs	8 of 8	100%
Blank lines	1 of 1	100%
Block quotes	23 of 25	92%
List items	48 of 48	100%
Lists	26 of 26	100%
Inlines	1 of 1	100%
Code spans	22 of 22	100%
Emphasis and strong emphasis	131 of 131	100%
Links	75 of 90	83%
Images	15 of 22	68%
Autolinks	15 of 19	79%
Raw HTML	19 of 21	90%
Hard line breaks	15 of 15	100%
Soft line breaks	2 of 2	100%
Textual content	3 of 3	100%
[extension] Tables	8 of 8	100%
[extension] Task list items	2 of 2	100%
[extension] Strikethrough	2 of 2	100%
[extension] Autolinks	11 of 11	100%
[extension] Disallowed Raw HTML	0 of 1	0%

CommonMark

Section	Count	Percent
Tabs	11 of 11	100%
Backslash escapes	13 of 13	100%
Entity and numeric character references	13 of 17	76%
Precedence	1 of 1	100%
Thematic breaks	19 of 19	100%
ATX headings	18 of 18	100%
Setext headings	24 of 27	89%
Indented code blocks	12 of 12	100%
Fenced code blocks	29 of 29	100%
HTML blocks	44 of 44	100%
Link reference definitions	25 of 27	93%
Paragraphs	8 of 8	100%
Blank lines	1 of 1	100%
Block quotes	23 of 25	92%
List items	48 of 48	100%
Lists	26 of 26	100%
Inlines	1 of 1	100%
Code spans	22 of 22	100%
Emphasis and strong emphasis	131 of 131	100%
Links	75 of 90	83%
Images	15 of 22	68%
Autolinks	19 of 19	100%
Raw HTML	19 of 21	90%
Hard line breaks	15 of 15	100%
Soft line breaks	2 of 2	100%
Textual content	3 of 3	100%

UziTech (Maintainer) · 2022-11-21T05:13:47Z

V4.2.3

GFM

Section	Count	Percent
Tabs	11 of 11	100%
Backslash escapes	13 of 13	100%
Entity and numeric character references	15 of 17	88%
Precedence	1 of 1	100%
Thematic breaks	19 of 19	100%
ATX headings	18 of 18	100%
Setext headings	26 of 27	96%
Indented code blocks	12 of 12	100%
Fenced code blocks	29 of 29	100%
HTML blocks	44 of 44	100%
Link reference definitions	27 of 27	100%
Paragraphs	8 of 8	100%
Blank lines	1 of 1	100%
Block quotes	23 of 25	92%
List items	48 of 48	100%
Lists	26 of 26	100%
Inlines	1 of 1	100%
Code spans	22 of 22	100%
Emphasis and strong emphasis	131 of 131	100%
Links	75 of 90	83%
Images	15 of 22	68%
Autolinks	15 of 19	79%
Raw HTML	19 of 21	90%
Hard line breaks	15 of 15	100%
Soft line breaks	2 of 2	100%
Textual content	3 of 3	100%
[extension] Tables	8 of 8	100%
[extension] Task list items	2 of 2	100%
[extension] Strikethrough	2 of 2	100%
[extension] Autolinks	11 of 11	100%
[extension] Disallowed Raw HTML	0 of 1	0%

CommonMark

Section	Count	Percent
Tabs	11 of 11	100%
Backslash escapes	13 of 13	100%
Entity and numeric character references	15 of 17	88%
Precedence	1 of 1	100%
Thematic breaks	19 of 19	100%
ATX headings	18 of 18	100%
Setext headings	26 of 27	96%
Indented code blocks	12 of 12	100%
Fenced code blocks	29 of 29	100%
HTML blocks	44 of 44	100%
Link reference definitions	27 of 27	100%
Paragraphs	8 of 8	100%
Blank lines	1 of 1	100%
Block quotes	23 of 25	92%
List items	48 of 48	100%
Lists	26 of 26	100%
Inlines	1 of 1	100%
Code spans	22 of 22	100%
Emphasis and strong emphasis	131 of 131	100%
Links	75 of 90	83%
Images	15 of 22	68%
Autolinks	19 of 19	100%
Raw HTML	19 of 21	90%
Hard line breaks	15 of 15	100%
Soft line breaks	2 of 2	100%
Textual content	3 of 3	100%
