First markdown parser tests #97

fcollonval · 2022-04-20T10:50:44Z

Markdown parsing is a long-standing issue. This PR lays out a structure to test the nbconvert and web frontend parsers against the GitHub Flavored CommonMark tests.

Most of those tests are failing.

Xref: jupyterlab/jupyterlab#272
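
For readers who want a feel for the spec-driven test structure described above, here is a minimal sketch, not the code in this PR. It assumes the spec examples come as JSON records with `example`, `section`, `markdown` and `html` fields; the two records and `toy_render` are made up for illustration, and a real run would plug in the nbconvert or frontend renderer instead.

```python
import json
from typing import Callable, Dict, List

# Two made-up records in the assumed shape of a spec test dump
# (fields: "example", "section", "markdown", "html").
SPEC_JSON = """
[
  {"example": 1, "section": "Paragraphs", "markdown": "aaa\\n", "html": "<p>aaa</p>\\n"},
  {"example": 2, "section": "Thematic breaks", "markdown": "***\\n", "html": "<hr />\\n"}
]
"""


def toy_render(markdown: str) -> str:
    """Hypothetical stand-in renderer: wraps every input in a paragraph."""
    return "<p>" + markdown.strip() + "</p>\n"


def run_spec(render: Callable[[str], str], spec: List[Dict]) -> Dict[str, List[int]]:
    """Render each example and file its id under 'passed' or 'failed'."""
    report: Dict[str, List[int]] = {"passed": [], "failed": []}
    for case in spec:
        outcome = "passed" if render(case["markdown"]) == case["html"] else "failed"
        report[outcome].append(case["example"])
    return report


report = run_spec(toy_render, json.loads(SPEC_JSON))
print(report)
```

The stand-in renderer only knows paragraphs, so the second record ends up in the failed list, which is the kind of pass/fail split reported later in this thread.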

fcollonval · 2022-04-20T11:49:46Z

Some of the errors are due to the additional id and anchor link that nbconvert and JupyterLab add to headings compared to GFM CommonMark, e.g. '<h3 id="foo">foo<a class="anchor-link" href="#foo">¶</a></h3>' vs. '<h3>foo</h3>'.
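
One way to discount this difference before comparing is to strip the extras from the rendered HTML. The sketch below is regex-based and only an illustration: it assumes the anchor carries class `anchor-link` or `jp-InternalAnchorLink` as its first attribute (as in the outputs quoted in this thread), and `strip_heading_extras` is a hypothetical helper name. A real normalizer would more likely go through an HTML parser.

```python
import re

# Anchor link appended inside headings; the class is assumed to come first.
ANCHOR_RE = re.compile(
    r'<a class="(?:anchor-link|jp-InternalAnchorLink)"[^>]*>[^<]*</a>'
)
# id attribute on an opening <h1>..<h6> tag.
HEADING_ID_RE = re.compile(r'<(h[1-6]) id="[^"]*">')


def strip_heading_extras(html: str) -> str:
    """Drop the anchor link and the heading id so only the bare heading remains."""
    html = ANCHOR_RE.sub("", html)
    return HEADING_ID_RE.sub(r"<\1>", html)


nbconvert_html = '<h3 id="foo">foo<a class="anchor-link" href="#foo">¶</a></h3>'
print(strip_heading_extras(nbconvert_html))
```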

fcollonval · 2022-04-21T09:10:15Z

After adding the GFM normalization to the output HTML, the results for JupyterLab are 293 failed, 378 passed.

The artifact to compare the outputs is available at: https://github.com/jupyterlab/benchmarks/actions/runs/2200787289

fcollonval · 2022-04-29T09:11:37Z

A more in-depth analysis shows the following groups of discrepancies:

Desired discrepancies:
- Heading id and link
- Code block styling
- HTML sanitization

| id | section | markdown | commonmark-gfm | JupyterLab |
| --- | --- | --- | --- | --- |
| 10 | Tabs | `#\tFoo\n` | `<h1>Foo</h1>` | `<h1 id="Foo">Foo<a class="jp-InternalAnchorLink" href="#Foo" target="_self">¶</a></h1>` |
| 112 | Fenced code blocks | `` ```ruby\ndef foo(x)\n return 3\nend\n```\n `` | `<pre><code class="language-ruby">def foo(x)\n return 3\nend\n</code></pre>` | `<pre><code class="cm-s-jupyter language-ruby"><span class="cm-keyword">def</span> <span class="cm-def">foo</span>(<span class="cm-variable">x</span>)\n <span class="cm-keyword">return</span> <span class="cm-number">3</span>\n<span class="cm-keyword">end</span>\n</code></pre>` |
| 133 | HTML blocks | `<Warning>\nbar\n</Warning>\n` | `<warning> bar </warning>` | `bar` |
| 140 | HTML blocks | `<script type="text/javascript">\n// JavaScript example\n\ndocument.getElementById("demo").innerHTML = "Hello JavaScript!";\n</script>\nokay\n` | `<script type="text/javascript">// JavaScript example document.getElementById("demo").innerHTML = "Hello JavaScript!";</script><p>okay</p>` | `<p>okay</p>` |
| 141 | HTML blocks | `<style\n type="text/css">\nh1 {color:red;}\n\np {color:blue;}\n</style>\nokay\n` | `<style type="text/css">h1 {color:red;} p {color:blue;}</style><p>okay</p>` | `<p>okay</p>` |
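
The code block styling difference (example 112) could likewise be normalized away before comparison. This is only a sketch under assumptions: it expects flat, non-nested `<span class="cm-...">` wrappers and the `cm-s-jupyter` theme class exactly as they appear in the JupyterLab output above, and `strip_highlighting` is a hypothetical name.

```python
import re

# CodeMirror token wrappers such as <span class="cm-keyword">def</span>.
CM_SPAN_RE = re.compile(r'<span class="cm-[^"]*">(.*?)</span>', re.DOTALL)
# Theme class placed in front of the language class on <code>.
CM_THEME_RE = re.compile(r'class="cm-s-jupyter (language-[^"]*)"')


def strip_highlighting(html: str) -> str:
    """Unwrap highlighting spans and drop the theme class."""
    html = CM_SPAN_RE.sub(r"\1", html)
    return CM_THEME_RE.sub(r'class="\1"', html)


lab_html = (
    '<pre><code class="cm-s-jupyter language-ruby">'
    '<span class="cm-keyword">def</span> <span class="cm-def">foo</span>'
    '(<span class="cm-variable">x</span>)\n'
    ' <span class="cm-keyword">return</span> <span class="cm-number">3</span>\n'
    '<span class="cm-keyword">end</span>\n</code></pre>'
)
print(strip_highlighting(lab_html))
```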

To fix discrepancies:
- Tabulation and new lines characters handling
- [TBC] Unpaired HTML tag

| id | section | markdown | commonmark-gfm | JupyterLab |
| --- | --- | --- | --- | --- |
| 1 | Tabs | `\tfoo\tbaz\t\tbim\n` | `<pre><code>foo\tbaz\t\tbim\n</code></pre>` | `<pre><code>foo baz bim\n</code></pre>` |
| 120 | HTML blocks | `<div>\n hello\n <foo><a>\n` | `<div>hello <foo><a>` | `<div>hello <a rel="nofollow" target="_self"> </a></div>` |
| 121 | HTML blocks | `</div>\nfoo\n` | `</div>foo` | `foo` |
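
To triage the tab and newline group, a small check can flag failures where the two outputs differ only in whitespace. The helper below is a hypothetical sketch, not part of the PR; it treats any run of whitespace as equivalent, which is deliberately loose and only meant for sorting failures into groups.

```python
import re


def differs_only_in_whitespace(expected: str, actual: str) -> bool:
    """True when the strings differ, but not once whitespace runs are collapsed."""

    def collapse(text: str) -> str:
        return re.sub(r"\s+", " ", text)

    return expected != actual and collapse(expected) == collapse(actual)


# Example 1 from the table above, written with real tab and newline characters.
expected_html = "<pre><code>foo\tbaz\t\tbim\n</code></pre>"
actual_html = "<pre><code>foo baz bim\n</code></pre>"
print(differs_only_in_whitespace(expected_html, actual_html))
```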

fcollonval · 2022-05-01T12:07:44Z

Actually, marked runs the CommonMark and GFM tests in its CI. The results as of Jan 4th 2022 are reported here: markedjs/marked#1202 (comment)

jasongrout · 2022-05-19T20:48:52Z

Interestingly, GitHub just added math rendering, so now there is another opinion about exactly which syntax is used to create math: https://github.blog/changelog/2022-05-19-render-mathematical-expressions-in-markdown/

Math also renders in GitHub's notebook preview (see https://github.com/jupyter-widgets/ipywidgets/blob/master/docs/source/examples/Lorenz%20Differential%20Equations.ipynb, for example). It appears to use MathJax 3.2.0.

fcollonval · 2022-05-23T13:33:30Z

@williamstein pointed me to some concerns about the GitHub implementation: https://nschloe.github.io/2022/05/20/math-on-github.html; they raise interesting points to keep in mind for our parser.

fcollonval · 2022-05-23T13:49:22Z

Let's list the desired features of an ideal Markdown parser for JupyterLab:

- Parse GitHub-flavored CommonMark syntax
  - This does not cover the late addition of mermaid-js or GitHub's way of supporting math equations.
- Support math syntax (to be specified)
- Cell attachments
- [TBC] Extensible by JupyterLab extensions
- [TBC] MyST support (see JupyterLab survey analysis)

Reference: Notebook documentation

fcollonval · 2022-05-23T14:24:20Z

WIP Candidates / features matrix

| | marked.js | markdown-it | MyST-parser |
| --- | --- | --- | --- |
| Support gfm | x¹ | x² | x |
| Math syntax | x³ | ? | x |
| Attachment | x⁴ | ? | ? |
| Extensible | | x | ? |
| MyST | | | x |

Some comments:

MyST-parser provides opinionated markdown-it plugins.

What are others using?

- VS Code: markdown-it
- CoCalc: markdown-it

¹ Partly true - see test results
² CommonMark run as part of the CI - GFM features available as plugins
³ Using some pre-processing
⁴ Using a customized link handler

williamstein · 2022-05-23T15:01:47Z

@fcollonval a few days ago I rewrote the upstream markdown-it plugin I'm using for parsing out math, so in CoCalc we fully parse math via a plugin, rather than some sort of hack involving parsing before or after markdown is used (like GitHub and Jupyter both do, I think). Here's the code:

https://github.com/sagemathinc/cocalc/blob/master/src/packages/frontend/markdown/math-plugin.ts

It's MIT licensed. My goal with that code is to align with upstream Jupyter in terms of what is parsed as math. In cases where there is a reasonable difference, I would lobby for Jupyter to change. As an example, my plugin parses this properly as inline math:

consider \begin{math}x^3\end{math} and ...

JupyterLab doesn't detect it as math at all. I think it's reasonable to detect.
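
To make the example concrete, here is a toy detector for the two inline forms, written as a regex pre-scan. This is neither the CoCalc plugin nor what JupyterLab does; it is exactly the kind of outside-the-parser pass the comment above calls a hack, and the single-line `$...$` rule and the name `find_inline_math` are assumptions made for illustration.

```python
import re
from typing import List

INLINE_MATH_RE = re.compile(
    r"\\begin\{math\}(?P<env>.+?)\\end\{math\}"    # \begin{math}...\end{math}
    r"|(?<![\\$])\$(?P<dollar>[^$\n]+?)\$(?!\$)"    # $...$ on a single line
)


def find_inline_math(text: str) -> List[str]:
    """Return the body of every inline math span found in text."""
    return [
        match.group("env") or match.group("dollar")
        for match in INLINE_MATH_RE.finditer(text)
    ]


sample = r"consider \begin{math}x^3\end{math} and $y$"
print(find_inline_math(sample))
```

A plugin that runs inside the Markdown parser can take code spans and escapes into account, which a pre-scan like this cannot.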

jasongrout · 2022-05-25T16:52:49Z

It seems that https://github.github.com/gfm/ has not been updated for math support. Edit: which is what @fcollonval was saying above

jasongrout · 2022-05-25T16:54:08Z

> In cases where there is a reasonable difference, I would lobby for Jupyter to change.

@williamstein - can you give a comprehensive description of what your plugin parses as math to typeset?

JasonWeill · 2022-05-25T16:56:54Z

New Markdown parsers should also address existing Markdown bugs and feature requests, such as:

- markdown: LaTeX sometimes rendered over text jupyterlab#12524 (LaTeX overlapping)
- Render local file with whitespace in the name in Markdown cells fails jupyterlab#12561 / Evaluate markdown-it as a commonmark renderer jupyterlab#272 (URLs with spaces)
- Support ordered list with letters or Roman numerals jupyterlab#12432 (ordered list with letter or Roman numeral ordinals)

williamstein · 2022-05-25T17:07:45Z

> > In cases where there is a reasonable difference, I would lobby for Jupyter to change.
>
> @williamstein - can you give a comprehensive description of what your plugin parses as math to typeset?

It's by definition exactly what this file parses:

https://github.com/sagemathinc/cocalc/blob/master/src/packages/frontend/markdown/math-plugin.ts

when run as the first plugin in markdown-it. It would be a lot of (very valuable) work to convert that into an official spec. My goal with writing and iterating on math-plugin.ts has been to get fidelity with what I think JupyterLab does or should do, and I've incorporated significant feedback from my users. I would not be at all surprised if there are significant bad surprises related to the above linked code though. In fact, I can't wait to test it on the bugs @jweill-aws just listed, and see whether my code is broken on those...

williamstein · 2022-05-25T17:44:55Z

I wrote up some thoughts in a README here along with a notebook testing the issues mentioned above:

https://cocalc.com/wstein/support/markdown-math

williamstein · 2022-05-27T00:42:43Z

There's a related discussion about math + markdown here: https://chat.zulip.org/#narrow/stream/2-general/topic/LaTeX.20math/near/1382932

fcollonval added 2 commits April 20, 2022 12:50

First markdown parser tests

8238d05

[skip ci] Clean-up

03a8e78

fcollonval mentioned this pull request Apr 20, 2022

Evaluate markdown-it as a commonmark renderer jupyterlab/jupyterlab#272

Open

fcollonval added 2 commits April 20, 2022 13:35

Speed tests by using the same web page

3a39553

Set base-url in CI

fceeb6a

Fix some doc string

6bbd444

fcollonval mentioned this pull request Apr 21, 2022

Weekly Team Meetings: Jan-Jun 2022 jupyterlab/frontends-team-compass#135

Closed

fcollonval added 6 commits April 21, 2022 10:14

Add reports generation

861cec8

Use commonmark normalization

8e3d63e

Debug CI

f90969f

Force rootdir

9269f89

Move addoption up

6bd4c3b

Fix test filename

ed6b83f

Force trigger

a11a37f

fcollonval mentioned this pull request Jun 5, 2023

SSC meeting minutes 2023 jupyter/software-steering-council-team-compass#2

Open
