Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: support round trips? #892

Open
chriskrycho opened this issue May 16, 2024 · 4 comments
Open

Discussion: support round trips? #892

chriskrycho opened this issue May 16, 2024 · 4 comments

Comments

@chriskrycho
Copy link

pulldown-cmark-to-cmark is a really handy crate for doing things like writing mdBook preprocessors, which I have been doing a bit of while working on The Rust Programming Language, but I have noticed that it is challenging for that crate to match the input precisely, and there are a fair number of subtle bugs that are (a) difficult for them to fix and therefore (b) require special handling in preprocessors using it, e.g. to manually re-insert newlines.

This is not a bug report, though, as I don’t think that kind of round-tripping was part of the design here! Rather, it is intended to start a discussion, on two axes:

  • Is supporting that kind of round-tripping desirable?
  • If so, what changes would be required to support it?
    • Maintaining original source spans (e.g. Provide Tag and Event source locations. #725)—but then how do you integrate that info when you insert new Events into a stream of Events (e.g. when rewriting in a preprocessor, extending behavior, etc.)?
    • Including list item types?
    • Including line wrapping for blockquotes?
    • etc.: I am sure there are a bunch of others

I think there are probably a bunch of open design questions there beyond just what would have to change, so, as I said: opening this for discussion and not assuming it is something the library should do.

@ehuss
Copy link
Contributor

ehuss commented May 16, 2024

If your preprocessor received and emitted an AST instead of markdown, would that make it easier to make transformations? Also, output from the preprocessor could have AST fragments that are markdown, to lean on mdbook to do the more renderering.

@chriskrycho
Copy link
Author

The preprocessing itself is actually fairly straightforward—an AST would be nice in some ways, but I have been doing transformations with streams of events from pulldown-cmark for a long time, so that part isn’t really the issue. It’s more that the stream of Events ends up with things like not having all the original newline data attached so needing to re-insert newlines to preserve Markdown’s semantics when rewriting events. In that regard, you would need something more like a CST or an AST with a significant amount of extra source data attached to be fully-structure-preserving, I think?

A concrete example: I am taking input like this—

Normal text, yay.

> Note: This is a callout, more or less.
> It can run across lines.

Back to normal text!

—and rewriting it into this, so that it has the correct HTML semantics:

Normal text, yay.

<section class="note" aria-role="note">

Note: This is a callout, more or less.
It can run across lines.

</section>

Back to normal text!

If I don’t take care to insert newlines when generating the HTML for the <section> tags, I end up with this instead:

Normal text, yay.

<section class="note" aria-role="note">
Note: This is a callout, more or less.
It can run across lines.
</section>

Back to normal text!

But because of the rules around block elements, Markdown treats that as plain text within the <section>, not as a paragraph—and the same for other kinds of block content within it.

This is all quite doable, but after dealing with it a couple of times—lots of tests for edge cases now!—and reading through the issue tracker on pulldown-cmark-to-cmark, I started thinking about what it would take for it to be easier on the core library side.

@kdarkhan
Copy link

There is a discussion about roundtripping in the context of fuzz testing at Byron/pulldown-cmark-to-cmark#55 cc @mgeisler

When it was attempted the last time the fuzzer was failing due to lossiness of text->markdown->text conversion. It probably improved a lot with latest changes in both libraries. I just attempted to patch the PR and run it locally and the fuzzer failed again.

A useful feature of this roundtripping would be finding implementation bugs in both libraries. Running the PR locally, for instance, fuzzer failed with input:
* <!CD,\n~ .

pulldown-cmark generates the following events:

[
  Start(List(None))
  Start(Item)
  Start(HtmlBlock)
  Html(Borrowed("<!CD,\n"))
  End(HtmlBlock)
  End(Item)
  End(List(false))
  Start(Paragraph)
  Text(Borrowed("~"))
  End(Paragraph)
]

It does not seem to conform to the spec due to extra newline by pulldown-cmark based on this

@abhillman
Copy link

abhillman commented Jun 17, 2024

+1 – the parser and its ability to modify events is so good. There is a nice opportunity here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants