Discussion: support round trips? #892

chriskrycho · 2024-05-16T14:03:45Z

pulldown-cmark-to-cmark is a really handy crate for things like writing mdBook preprocessors, which I have been doing a bit of while working on The Rust Programming Language. I have noticed, though, that it is challenging for that crate to match its input precisely, and there are a fair number of subtle bugs that are (a) difficult for them to fix and therefore (b) require special handling in preprocessors that use it, e.g. manually re-inserting newlines.

This is not a bug report, though, as I don’t think that kind of round-tripping was part of the design here! Rather, it is intended to start a discussion, on two axes:

1. Is supporting that kind of round-tripping desirable?
2. If so, what changes would be required to support it?
- Maintaining original source spans (e.g. Provide Tag and Event source locations. #725)—but then how do you integrate that info when you insert new Events into a stream of Events (e.g. when rewriting in a preprocessor, extending behavior, etc.)?
- Including list item types?
- Including line wrapping for blockquotes?
- etc.: I am sure there are a bunch of others
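
One way to picture the source-span question from the first bullet is an event wrapper whose span is optional. The following is only a sketch under stated assumptions: `Spanned` and `emit` are hypothetical names, not part of pulldown-cmark, and the spans are treated as flat and non-overlapping, which real nested Start/End events are not. Events that came from the parser carry `Some(range)` and are copied verbatim from the source, while events inserted by a preprocessor carry `None` and fall back to a caller-supplied renderer.

```rust
use std::ops::Range;

/// Hypothetical wrapper (not a pulldown-cmark type): an event plus the
/// byte range of the source it was parsed from, if there is one.
#[derive(Debug, Clone, PartialEq)]
struct Spanned<E> {
    event: E,
    /// `Some(range)` for events parsed from the source text,
    /// `None` for events synthesized by a preprocessor.
    span: Option<Range<usize>>,
}

/// Re-emit a stream of events as text.
///
/// Events with a known span are copied byte-for-byte from `source`, so
/// untouched regions round-trip exactly. Events without a span are passed
/// to `render`, which has to choose its own formatting.
fn emit<E>(source: &str, events: &[Spanned<E>], render: impl Fn(&E) -> String) -> String {
    let mut out = String::new();
    for item in events {
        match &item.span {
            Some(range) => out.push_str(&source[range.clone()]),
            None => out.push_str(&render(&item.event)),
        }
    }
    out
}
```

The sketch leaves the hard parts of the question open: what to do when an inserted event lands inside a span, and how to handle a parsed event whose content was modified but whose span still points at the old text.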

I think there are probably a bunch of open design questions there beyond just what would have to change, so, as I said: opening this for discussion and not assuming it is something the library should do.


ehuss · 2024-05-16T14:08:57Z

If your preprocessor received and emitted an AST instead of markdown, would that make it easier to make transformations? Also, output from the preprocessor could have AST fragments that are markdown, to lean on mdbook to do more of the rendering.

chriskrycho · 2024-05-16T16:46:37Z

The preprocessing itself is actually fairly straightforward. An AST would be nice in some ways, but I have been doing transformations with streams of events from pulldown-cmark for a long time, so that part isn’t really the issue. It’s more that the stream of Events does not carry all of the original newline data, so I need to re-insert newlines to preserve Markdown’s semantics when rewriting events. In that regard, I think you would need something more like a CST, or an AST with a significant amount of extra source data attached, to be fully structure-preserving?

A concrete example: I am taking input like this—

Normal text, yay.

> Note: This is a callout, more or less.
> It can run across lines.

Back to normal text!

—and rewriting it into this, so that it has the correct HTML semantics:

Normal text, yay.

<section class="note" aria-role="note">

Note: This is a callout, more or less.
It can run across lines.

</section>

Back to normal text!

If I don’t take care to insert newlines when generating the HTML for the <section> tags, I end up with this instead:

Normal text, yay.

<section class="note" aria-role="note">
Note: This is a callout, more or less.
It can run across lines.
</section>

Back to normal text!

But because of the rules around block elements, Markdown treats that as plain text within the <section>, not as a paragraph—and the same for other kinds of block content within it.
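
The rewrite described above can be sketched with a stand-in event type. Everything here is an assumption made for illustration: the `Event` enum mirrors only a handful of pulldown-cmark's event names and is not the real type, and `render` is a toy serializer rather than pulldown-cmark-to-cmark. The point is just where the blank lines have to go: one after the opening tag, and one before the closing tag (supplied here by the end of the inner paragraph).

```rust
/// Stand-in for a small subset of pulldown-cmark's events.
/// These are NOT the real pulldown-cmark types.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    StartBlockQuote,
    EndBlockQuote,
    StartParagraph,
    EndParagraph,
    Text(String),
    SoftBreak,
    Html(String),
}

/// Replace blockquote boundaries with raw HTML `<section>` tags.
/// The opening tag is followed by a blank line so that the content
/// after it is parsed as Markdown blocks instead of raw HTML text.
fn blockquotes_to_sections(events: Vec<Event>) -> Vec<Event> {
    events
        .into_iter()
        .map(|event| match event {
            Event::StartBlockQuote => {
                Event::Html("<section class=\"note\" aria-role=\"note\">\n\n".to_string())
            }
            Event::EndBlockQuote => Event::Html("</section>\n\n".to_string()),
            other => other,
        })
        .collect()
}

/// Toy serializer: paragraphs end with a blank line, soft breaks become
/// newlines, and text and raw HTML are written out as-is.
fn render(events: &[Event]) -> String {
    let mut out = String::new();
    for event in events {
        match event {
            Event::Text(t) | Event::Html(t) => out.push_str(t),
            Event::SoftBreak => out.push('\n'),
            Event::EndParagraph => out.push_str("\n\n"),
            Event::StartParagraph | Event::StartBlockQuote | Event::EndBlockQuote => {}
        }
    }
    out
}
```

Feeding the events for the example input through `blockquotes_to_sections` and then `render` should give the second form shown above, with the blank lines around the paragraph inside the `<section>`. Dropping the `\n\n` after the opening tag gives the third form.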

This is all quite doable, but after dealing with it a couple of times—lots of tests for edge cases now!—and reading through the issue tracker on pulldown-cmark-to-cmark, I started thinking about what it would take for it to be easier on the core library side.

kdarkhan · 2024-05-27T23:04:59Z

There is a discussion about roundtripping in the context of fuzz testing at Byron/pulldown-cmark-to-cmark#55 cc @mgeisler

The last time this was attempted, the fuzzer was failing due to the lossiness of the text->markdown->text conversion. That has probably improved a lot with the latest changes in both libraries, but I just tried patching the PR and running it locally, and the fuzzer failed again.

A useful side effect of this kind of round-tripping would be finding implementation bugs in both libraries. When I ran the PR locally, for instance, the fuzzer failed with this input:
* <!CD,\n~ .

pulldown-cmark generates the following events:

[
  Start(List(None))
  Start(Item)
  Start(HtmlBlock)
  Html(Borrowed("<!CD,\n"))
  End(HtmlBlock)
  End(Item)
  End(List(false))
  Start(Paragraph)
  Text(Borrowed("~"))
  End(Paragraph)
]

This does not seem to conform to the spec, due to the extra newline added by pulldown-cmark, based on this

abhillman · 2024-06-17T05:49:22Z

+1 – the parser and its ability to modify events is so good. There is a nice opportunity here.

alyjak · 2024-06-23T10:00:15Z

Do most developers care that round-tripping preserves the original markdown, or that the generated markdown produces the same html as the original? In my case it's the latter. I believe this is easier to get correct, and as a side effect it produces an automatic markdown style formatter, because it will deterministically transform an event stream into markdown text.

Instead of checking markdown_a == markdown_b, check that html_a == html_b:

markdown_a -> events -> markdown_b -> events -> html_b
           \-> events -> html_a
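
The check in the diagram can be written as a small generic function. This is a sketch, not code from either library: `html_round_trip_ok` and the `toy_*` helpers are hypothetical, and the toy parser just splits on whitespace. In real use, the three function arguments would presumably be pulldown-cmark's parser, pulldown-cmark-to-cmark's serializer, and pulldown-cmark's HTML renderer.

```rust
/// Returns true when rendering `markdown_a` to HTML directly gives the same
/// result as first round-tripping it through generated markdown.
fn html_round_trip_ok<E>(
    markdown_a: &str,
    parse: impl Fn(&str) -> Vec<E>,
    to_markdown: impl Fn(&[E]) -> String,
    to_html: impl Fn(&[E]) -> String,
) -> bool {
    // markdown_a -> events -> html_a
    let events_a = parse(markdown_a);
    let html_a = to_html(&events_a);

    // markdown_a -> events -> markdown_b -> events -> html_b
    let markdown_b = to_markdown(&events_a);
    let events_b = parse(&markdown_b);
    let html_b = to_html(&events_b);

    html_a == html_b
}

/// Toy "parser": every whitespace-separated word is one event.
fn toy_parse(markdown: &str) -> Vec<String> {
    markdown.split_whitespace().map(str::to_string).collect()
}

/// Toy markdown serializer: normalizes all whitespace to single spaces.
fn toy_to_markdown(words: &[String]) -> String {
    words.join(" ")
}

/// Toy HTML renderer: wraps the words in a single paragraph.
fn toy_to_html(words: &[String]) -> String {
    format!("<p>{}</p>", words.join(" "))
}
```

With the toy helpers, an input such as `"a   b\nc"` passes the HTML check even though the regenerated markdown (`"a b c"`) differs from the input, which is the formatter-like behavior described above.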

chriskrycho · 2024-07-22T13:37:14Z

@alyjak I don’t think it’s so much “most developers” as “at least some developers”, though obviously which subset matters. Lots of folks, me included (but you can see a fair number of comments to this effect scattered around other discussions on the repo over the years) use pulldown-cmark to do varieties of interesting things which are not just emitting HTML, because the API it exposes makes it very easy to do so. While “does it emit the same HTML” is a good end-to-end test, it’s not the only one we might care about.
