Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First markdown parser tests #97

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

fcollonval
Copy link
Member

@fcollonval fcollonval commented Apr 20, 2022

Discussion about Markdown parsing is a long standing issue. This PR lays out a structure to test nbconvert and web frontend parser based on the GitHub flavored Commonmark tests.

Most of those tests are failing

Xref: jupyterlab/jupyterlab#272

@fcollonval
Copy link
Member Author

Part of the errors are due to the additional id and link nbconvert and JupyterLab are adding on headings compared to gfm commonmark; e.g. '<h3 id="foo">foo<a class="anchor-link" href="#foo">&#182;</a></h3>' VS '<h3>foo</h3>'

@fcollonval
Copy link
Member Author

Adding the normalization of GFM on the output HTML the results for JupyterLab is 293 failed, 378 passed

The artifact to compare the outputs is available at: https://github.com/jupyterlab/benchmarks/actions/runs/2200787289

@fcollonval
Copy link
Member Author

So a more in-depth analysis shows the following group of discrepancies:

  • Desired discrepancies:
    • Heading id and link
    • Code block styling
    • HTML sanitization
id section markdown commonmark-gfm JupyterLab
10 Tabs #\tFoo\n <h1>Foo</h1> <h1 id="Foo">Foo<a class="jp-InternalAnchorLink" href="#Foo" target="_self">¶</a></h1>
112 Fenced code blocks ````ruby\ndef foo(x)\n return 3\nend\n```\n` <pre><code class="language-ruby">def foo(x)\n return 3\nend\n</code></pre> <pre><code class="cm-s-jupyter language-ruby"><span class="cm-keyword">def</span> <span class="cm-def">foo</span>(<span class="cm-variable">x</span>)\n <span class="cm-keyword">return</span> <span class="cm-number">3</span>\n<span class="cm-keyword">end</span>\n</code></pre>
133 HTML blocks <Warning>\n*bar*\n</Warning>\n <warning> *bar* </warning> *bar*
140 HTML blocks <script type="text/javascript">\n// JavaScript example\n\ndocument.getElementById("demo").innerHTML = "Hello JavaScript!";\n</script>\nokay\n <script type="text/javascript">// JavaScript example document.getElementById("demo").innerHTML = "Hello JavaScript!";</script><p>okay</p> <p>okay</p>
141 HTML blocks <style\n type="text/css">\nh1 {color:red;}\n\np {color:blue;}\n</style>\nokay\n <style type="text/css">h1 {color:red;} p {color:blue;}</style><p>okay</p> <p>okay</p>
  • To fix discrepancies:
    • Tabulation and new lines characters handling
    • [TBC] Unpaired HTML tag
id section markdown commonmark-gfm JupyterLab
1 Tabs \tfoo\tbaz\t\tbim\n <pre><code>foo\tbaz\t\tbim\n</code></pre> <pre><code>foo baz bim\n</code></pre>
120 HTML blocks <div>\n *hello*\n <foo><a>\n <div>*hello* <foo><a> <div>*hello* <a rel="nofollow" target="_self"> </a></div>
121 HTML blocks </div>\n*foo*\n </div>*foo* *foo*

@fcollonval
Copy link
Member Author

Actually marked is running the commonmark and gfm tests in its CI. The results as of Jan 4th 2022 are reported there: markedjs/marked#1202 (comment)

@jasongrout
Copy link

jasongrout commented May 19, 2022

Interestingly, Github just added math rendering, so now there is another opinion about exactly what syntax is used to create math: https://github.blog/changelog/2022-05-19-render-mathematical-expressions-in-markdown/

Math is also rendering in github's notebook preview (see https://github.com/jupyter-widgets/ipywidgets/blob/master/docs/source/examples/Lorenz%20Differential%20Equations.ipynb, for example). It appears to use MathJax 3.2.0.

@fcollonval
Copy link
Member Author

@williamstein pointing me out to some concerns about the GitHub implementation: https://nschloe.github.io/2022/05/20/math-on-github.html; they bring interesting points to have in mind for our parser.

@fcollonval
Copy link
Member Author

Let's list the wanted feature for an ideal markdown parser for JupyterLab:

  • Parse GitHub-flavored CommonMark syntax
    • This does not covered late addition of mermaid-js nor their way of supporting math equations.
  • Support Math syntax (to be specified)
  • Cell attachments
  • [TBC] Extensible by JupyterLab extensions
  • [TBC] MyST support (see JupyterLab survey analysis)

Reference: Notebook documentation

@fcollonval
Copy link
Member Author

WIP Candidates / features matrix

marked.js markdown-it MyST-parser
Support gfm x1 x2 x
Math syntax x3 ? x
Attachment x4 ? ?
Extensible x ?
MyST x

Some comments:

  • MyST-parser provide opinionated markdown-it plugins

What other are using?

Footnotes

  1. Partly true - see test results

  2. CommonMark run as part of the CI - GFM features available as plugins

  3. Using some pre processing

  4. Using customized link handler

@williamstein
Copy link

@fcollonval a few days ago I rewrote the upstream markdown-it plugin I'm using for parsing out math, so in cocalc we fully parse math via a plugin, rather than some sort of hack involving parsing before or after markdown is used (like github and jupyter both do, I think). Here's the code:

https://github.com/sagemathinc/cocalc/blob/master/src/packages/frontend/markdown/math-plugin.ts

It's MIT licensed. My goal with that code is to align with upstream Jupyter in fidelity in terms of what is parsed as math. *In cases where there is a reasonable difference, I would lobby for Jupyter to change. As an example, my plugin parses this properly as inline math:

consider \begin{math}x^3\end{math} and ...

JupyterLab doesn't detect it as math at all. I think it's reasonable to detect.

@jasongrout
Copy link

jasongrout commented May 25, 2022

It seems that https://github.github.com/gfm/ has not been updated for math support. Edit: which is what @fcollonval was saying above

@jasongrout
Copy link

In cases where there is a reasonable difference, I would lobby for Jupyter to change.

@williamstein - can you give a comprehensive description of what your plugin parses as math to typeset?

@JasonWeill
Copy link

New Markdown parsers should also address existing Markdown bugs and feature requests, such as:

@williamstein
Copy link

In cases where there is a reasonable difference, I would lobby for Jupyter to change.

@williamstein - can you give a comprehensive description of what your plugin parses as math to typeset?

It's by definition exactly what this file parses:

https://github.com/sagemathinc/cocalc/blob/master/src/packages/frontend/markdown/math-plugin.ts

when run as the first plugin in markdown-it. It would be a lot of (very valuable) work for that to get converted to an official spec. My goal with writing and iterating on math-plugin.ts has been to get fidelity with what I think JupyterLab does or should do, and I've incorporated significant feedback from my users. I would not be at all surprised if there are significant bad surprises related to the above linked code though. In fact, I can't wait to test it on the bugs @jweill-aws just listed, and see if my code isn't all broken or not on those...

@williamstein
Copy link

I wrote up some thoughts in a README here along with a notebook testing the issues mentioned above:

https://cocalc.com/wstein/support/markdown-math

@williamstein
Copy link

There's a related discussion about math + markdown here: https://chat.zulip.org/#narrow/stream/2-general/topic/LaTeX.20math/near/1382932

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants