Add extension to allow Critic Markup pass through #5430

Open
alerque opened this issue Apr 5, 2019 · 25 comments

Comments

@alerque
Contributor

alerque commented Apr 5, 2019

I have reviewed issue #2873 regarding supporting Critic Markup. Personally I seriously think that needs to be revisited (see also #1560), but this is a different issue.

Critic Markup is an extension to Markdown syntax. It is currently quite an ordeal to mix and match the use of CM in a workflow involving Pandoc. The arguments made in the other issue about how CM should be handled only cover use cases where you are either a) outputting to some special non-published format for review or b) resolving the status of an edit. My use case is copy-editing books and translations of books at a publishing company. As such our markup has a much longer lifespan than this, and often we want to pass it through our publishing pipeline with the markup intact.

One of the things we do is normalize our Markdown by passing it through Pandoc periodically. We are also using downstream tools that know what to do with CM in the output.

Pandoc is currently escaping several aspects of CM syntax in Markdown formats. For example:

$ pandoc -f markdown -t markdown <<< "{>>test<<}"
{\>\>test\<\<}

$ pandoc -f markdown -t markdown <<< "{~~test~>result~~}"
{\~~test~\>result\~\~}

This leaves me in a situation where I have to preprocess the text before Pandoc sees it, replacing all the CM with tokens it won't care about, and then convert the tokens back to markup on the other end.
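
For readers who want to picture that workaround, here is a minimal sketch in Python. It is not the script actually used in this workflow; the placeholder names (`CMTOKEN0X` and so on) and the function names `protect` and `restore` are invented for illustration, and the sketch only covers the delimiters that get escaped in the examples above.

```python
import re

# CriticMarkup delimiters whose characters pandoc's Markdown writer escapes
# (the comment and substitution forms shown above).
DELIMS = ["{>>", "<<}", "{~~", "~>", "~~}"]

# Hypothetical placeholder scheme: plain alphanumeric words that pandoc has
# no reason to escape. Choose something that cannot occur in your own text.
TOKENS = {delim: f"CMTOKEN{i}X" for i, delim in enumerate(DELIMS)}
REVERSE = {token: delim for delim, token in TOKENS.items()}

DELIM_RE = re.compile("|".join(re.escape(d) for d in DELIMS))
TOKEN_RE = re.compile("|".join(re.escape(t) for t in REVERSE))


def protect(text: str) -> str:
    """Swap CM delimiters for placeholder tokens before pandoc runs."""
    return DELIM_RE.sub(lambda m: TOKENS[m.group(0)], text)


def restore(text: str) -> str:
    """Swap placeholder tokens back to CM delimiters after pandoc runs."""
    return TOKEN_RE.sub(lambda m: REVERSE[m.group(0)], text)
```

The intended use is `restore(output_of_pandoc(protect(source)))`, where the pandoc step is whatever markdown to markdown invocation the project already runs.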

I think there should be an extension to allow known instances of CM through:

$ pandoc -f markdown+critic -t markdown+critic <<< "{>>test<<}"
{>>test<<}

$ pandoc -f markdown+critic -t markdown+critic <<< "{~~test~>result~~}"
{~~test~>result~~}

This would allow Pandoc to be used as a pre-processor on Markdown files that include Critic Markup without molesting the source.

Note that the {++add++}, {--remove--}, and {==highlight==} syntaxes don't have any characters that get escaped in normal usage; only the strike, change, and comment syntaxes are a problem.

@jgm
Owner

jgm commented Apr 5, 2019

Our earlier discussions about supporting CM at the AST level were thinking that the CM would be parsed into Inserted and Deleted elements, containing (presumably) inlines. There are some issues with that.

A very lightweight change would be to add a +critic extension that essentially just recognizes the CM delimiters (at least in inline contexts) and parses them as RawInline (Format "markdown") "{~~" and the like. This would give you clean pass-through when going from markdown -> markdown. The delimiters would simply disappear in other formats, but you could write filters that changed them into something else.

This seems like a half-measure, but given the conceptual difficulties supporting CM at the AST level (discussed in the other issues), it might be useful. I'd be interested in hearing comments from other CM users.

@jgm
Owner

jgm commented Apr 5, 2019

Another approach would be to have the +critic extension cause {~~test~>result~~} to be parsed as [Str "{~~test~>result~~}"] rather than [Str "{~",Subscript [Str "test"],Str ">result~~}"] as now, and to adjust escaping in the writer.

On this approach, the CM delimiters would appear verbatim in all formats.

@jkr
Collaborator

jkr commented Apr 5, 2019

Or RawInline (Format "CriticMarkup") "{~~test~>result~~}", since we'd already be recognizing it as such and treating it differently anyway. This would make it easier for filter writers to make them into spans or whatever.

@alerque
Contributor Author

alerque commented Apr 5, 2019

I really like the idea of having a way to pass through all the CM syntax as RawInline Markdown. This would make my publishing workflow a lot easier.

Regarding your second approach, wouldn't that mean other inline markup would not get parsed if it happens to fall inside CM? E.g.:

This is my {~~*very* first~>most recent~~} test.

What would happen to the emphasis markup on *very*?

@alerque
Contributor Author

alerque commented Apr 5, 2019

@davidar and @ickc might have some input on this, having written apps related to Pandoc that handle CriticMarkup.

@jgm
Owner

jgm commented Apr 5, 2019

> Or RawInline (Format "CriticMarkup") "{~~test~>result~~}"

> Regarding your second approach, wouldn't that mean other inline markup would not get parsed if it happens to fall inside CM?

Just to be clear, on both approaches the idea was to make the delimiters like {-- or ~> separate elements; that allows the interior contents to be parsed as inlines as usual.
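
To make the "delimiters as separate elements" idea concrete, here is a rough Python sketch that works on pandoc's JSON representation of an inline list. Everything about it is an assumption: the +critic extension does not exist, so the RawInline elements are built by hand to mimic the proposal, and the `critic-*` class names and the function names are made up. It shows how a filter could pair the delimiter elements up and wrap the already-parsed inlines between them in a Span.

```python
# Opening delimiter -> (closing delimiter, hypothetical span class).
PAIRS = {
    "{++": ("++}", "critic-insertion"),
    "{--": ("--}", "critic-deletion"),
    "{~~": ("~~}", "critic-substitution"),
    "{==": ("==}", "critic-highlight"),
    "{>>": ("<<}", "critic-comment"),
}


def raw_markdown(node):
    """Return the text of a RawInline with format "markdown", else None."""
    if node.get("t") == "RawInline" and node["c"][0] == "markdown":
        return node["c"][1]
    return None


def wrap_critic(inlines):
    """Wrap inlines between matching CM delimiter elements in a Span."""
    out, i = [], 0
    while i < len(inlines):
        opener = raw_markdown(inlines[i])
        if opener in PAIRS:
            closer, css_class = PAIRS[opener]
            for j in range(i + 1, len(inlines)):
                if raw_markdown(inlines[j]) == closer:
                    # Span content is [attr, inlines]; attr is [id, classes, key-values].
                    body = inlines[i + 1:j]
                    out.append({"t": "Span", "c": [["", [css_class], []], body]})
                    i = j + 1
                    break
            else:
                # Unmatched opener: pass it through unchanged.
                out.append(inlines[i])
                i += 1
        else:
            out.append(inlines[i])
            i += 1
    return out


def raw(text):
    """Build the RawInline element the proposed extension might emit."""
    return {"t": "RawInline", "c": ["markdown", text]}


# Hand-built stand-in for: my {~~*very* first~>most recent~~} test.
example = [
    {"t": "Str", "c": "my"}, {"t": "Space"},
    raw("{~~"),
    {"t": "Emph", "c": [{"t": "Str", "c": "very"}]}, {"t": "Space"},
    {"t": "Str", "c": "first"},
    raw("~>"),
    {"t": "Str", "c": "most"}, {"t": "Space"}, {"t": "Str", "c": "recent"},
    raw("~~}"),
    {"t": "Space"}, {"t": "Str", "c": "test."},
]
wrapped = wrap_critic(example)
```

In this sketch the emphasis on *very* survives as an ordinary Emph node inside the Span, and the `~>` separator stays in the Span body as its own RawInline for a later step to handle.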

@ickc
Contributor

ickc commented Apr 5, 2019

> Critic Markup is an extension to Markdown syntax.

To be more precise, CriticMarkup is a preprocessor for Markdown syntaxes. The reference implementation of CriticMarkup is actually a preprocessor (i.e. the markdown parser doesn't "see" the CriticMarkup). All existing implementations of CriticMarkup (that I know of) work at the preprocessor level.

I have a tool at https://github.com/ickc/pancritic. It takes the reference implementation of CriticMarkup (which is no longer maintained, so I cleaned it up and improved it a bit) and wraps pandoc inside it. So you can use pancritic as if it were pandoc (with a pandoc-like CLI) if --engine panflute|pypandoc is used.

There's another issue here from a LaTeX package maintainer that has a nicer LaTeX output for CriticMarkup. I'm interested in implementing it but haven't had time yet.

Obviously there are a few issues there too. PRs are appreciated; otherwise I might take a look at them this weekend (don't hold your breath though).

Edit: to be clear, I mean that since CriticMarkup happens at the preprocessor level, any "cleaning up" of the round-trip markdown should also happen at the preprocessor level. This is very easy to do and you have a couple of options.

Edit 2: the "issue" mentioned above is actually in pandoc-discuss: https://groups.google.com/d/msg/pandoc-discuss/sHoQhJsxEXw/9bN7cAwqCQAJ

@alerque
Contributor Author

alerque commented Apr 6, 2019

>> Critic Markup is an extension to Markdown syntax.

> To be more precise, CriticMarkup is a preprocessor for Markdown syntaxes.

Actually I disagree here. It was originally conceived that way and the toolkit on the concept site acts that way, but I think this is both an oversight and a missed opportunity on their part. I would go so far as to say their own documentation is contradictory on this point. Their toolkit covers usage not related to pre-processing as well. Your own attempt to wrap a full-blown version of Pandoc inside a "pre" processor, and your suggestion that any round-trip cleanup would also happen "pre", highlight that this is more than a pre-processor issue. It's both a pre and post issue, which is why wrapping Pandoc makes sense at all.

Right now I'm also both pre- and post-processing content to get it to round-trip. Hence the thought that Pandoc ought to be taught to treat the syntax as part of the document format.

The fact is that anything that exists in Markdown source does so "at rest"; it is part of the file and hence part of the syntax. Assuming that the only thing you would want to do with the syntax is remove it is selling the idea short.

In my use case copy editing whole books, the life cycle of such edits (comments, suggestions, etc.) is much longer lived than a single pipeline. I don't just use Pandoc for final output where a preprocessor would have stripped the CM out. I want to actually do something with it on the output side, and I want to use Pandoc to normalize the source as I go along (part of the project linter is making sure the book source round-trips safely).

@ickc
Contributor

ickc commented Apr 6, 2019

Are you claiming CriticMarkup should/can be implemented as part of the AST, or are you proposing that pandoc have a built-in pre/post-processor for CriticMarkup? If it’s the latter, you ain’t disagreeing with me.

@alerque
Contributor Author

alerque commented Apr 8, 2019

@ickc I'm not sure how to answer that because –as much as I've reviewed the related issues and discussions– I'm not entirely clear on what the difference would be, particularly as an end user. For sure at least the former would be a boon to my workflow(s), but I can't get my head around why the latter wouldn't be better.

There seem to be two main issues:

  1. CriticMarkup has an identity crisis in that it allows inline level markup to span blocks. I could see this being a bigger problem if implemented only as a pre/post processor, while making this a part of the AST would open up the door to a solution: undo any block-wise markup and wrap the block contents in the correct inline markup. Of course the other way would be just to ignore this issue and pass the problem on down the line. Not having CM be part of the AST would mandate that solution, correct? Or is needing to have both block and inline versions of the same syntax the problem? If the latter, wouldn't just requiring it to be inline syntax only solve this? My use cases at least would be fine with this.

  2. CriticMarkup has the potential to introduce actual syntax errors, which are pretty difficult to have in Markdown as it stands. Inline markup just gets passed through as characters if not understood. Emphasis syntax spanning lines just becomes stars. No big deal. CM is a little more complicated in that the tag-like syntax has to be opened and closed. That being said, I don't see how this is any worse than the new inline span syntax, or why resolving this the same way (just passing through any unmatched syntax as characters to the output) wouldn't be just fine.

@jgm
Owner

jgm commented Apr 8, 2019

I have to say, CriticMarkup seems a bit of a mess in its present form. For example, if you try it on

`code {--` this is deleted --}

you get this result:

<code>code &lt;del&gt;</code> this is deleted </del>

which isn't even well-formed HTML. Oddly, their toolchain seems not to be just a preprocessor which does a markdown -> markdown translation prior to converting to HTML. That would make a lot more sense, and it would be easy to implement (10 line script). You'd have problems if your document contained code that had the CM delimiters in it, because they'd be treated as delimiters rather than literal text, but the present system has this problem too.
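
A minimal sketch of the kind of short markdown -> markdown preprocessor meant here, written in Python. It is only an illustration under stated assumptions: the choice of raw HTML tags (`<ins>`, `<del>`, `<mark>`, and a `critic comment` span) and the name `translate` are assumptions, not anything pandoc or the CM toolchain is confirmed to provide.

```python
import re

# Each CM construct is rewritten to raw inline HTML and the rest of the
# Markdown is left alone. DOTALL lets a span run across line breaks.
RULES = [
    (re.compile(r"\{\+\+(.*?)\+\+\}", re.DOTALL), r"<ins>\1</ins>"),
    (re.compile(r"\{--(.*?)--\}", re.DOTALL), r"<del>\1</del>"),
    (re.compile(r"\{~~(.*?)~>(.*?)~~\}", re.DOTALL), r"<del>\1</del><ins>\2</ins>"),
    (re.compile(r"\{==(.*?)==\}", re.DOTALL), r"<mark>\1</mark>"),
    (re.compile(r"\{>>(.*?)<<\}", re.DOTALL), r'<span class="critic comment">\1</span>'),
]


def translate(markdown: str) -> str:
    """Rewrite CriticMarkup spans as raw HTML, purely at the text level."""
    for pattern, replacement in RULES:
        markdown = pattern.sub(replacement, markdown)
    return markdown
```

Because it works on plain text, this sketch shares the limitation described above: delimiters inside code spans are treated as delimiters, so the code-span example is rewritten too.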

One could imagine a CM-like system (perhaps using the same symbols) that created nodes in the pandoc AST instead of acting as a preprocessor. With this system, you wouldn't be able to put CM delimiters inside literal contexts, like code blocks or spans, and there would be some limits to the kinds of edits you could notate. But the advantage would be that, in principle, one could convert a document with the CM marks into, say, a Word document with track changes, or a LaTeX document using the changes package. This is something a preprocessor couldn't give you.

@alerque
Contributor Author

alerque commented Apr 8, 2019

@jgm The original toolchain isn't worth fiddlesticks. No offense to anybody involved, but it was more of a proof of concept than a reference implementation, and it suffers from a litany of ailments. I highly recommend ignoring it in this discussion. Maybe the syntax highlighters were useful for some editors, that's about it. Any serious use I have seen in the wild involves other systems, either home brews or with tools like @ickc's.

The latter thing you describe would be much more useful (even with its limitations) than what we have currently. I'd much rather have limits on what can be marked up this way and be able to interoperate between document formats than not have anything at all. Not being able to use CM markup inside code blocks is a trivial loss considering the primary use for this is prose.

@ickc
Contributor

ickc commented Apr 8, 2019

Why do you think there has been no improvement since the inception of CriticMarkup? It is because the whole concept of making it a markdown syntax (i.e. happening at the AST level) is flawed. CriticMarkup is about tracking changes at the source level, and by definition that can cross any markup boundary, making it impossible to have a spec (unless you really enumerate all the ways it can cross boundaries, but is that really tractable?)

And it is because of that that it was decided historically that pandoc would not adopt it in the AST.

And since pandoc is not in the business of pre/post-processing, once it was decided that it is not part of the AST, it was essentially decided that supporting it would be a third-party effort. (So: do one thing, do it well, and make it composable.)

Of course I’m not opposed to making it part of the AST if @jgm agrees, even if that means a more restrictive CriticMarkup. From time to time he has changed his mind.

@mb21
Collaborator

mb21 commented Apr 9, 2019

I'm guessing there's also relatively little interest in CriticMarkup because a lot of people that use markdown (and pandoc), also use a version control system like git, that comes with diff tools. For example, for prose I use:

git diff --word-diff --patience

@ickc
Contributor

ickc commented Apr 9, 2019

But CriticMarkup is different. It is more like a collaboration tool than a personal diff tool.

For this reason, like @jgm said, if it gains native support and can be converted back and forth to Word’s track changes then it’s going to be very helpful.

In pancritic I implemented output to a LaTeX diff using the changes package. There’s an issue over there requesting converting from docx track change to CriticMarkup. While I might have an idea how to do that, having native pandoc support is much better (all the two-way streets will be much easier).

About CriticMarkup in the AST, I wonder if it is possible to solve the boundary-crossing problem by normalizing syntax (i.e. syntax is closed and opened again whenever a CriticMarkup boundary is crossed).
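
One simple reading of that normalization idea, sketched in Python: when an addition, deletion, or highlight span runs across a paragraph break, close it at the end of the paragraph and reopen it at the start of the next. This is only an assumption about what the normalization could look like; it ignores substitutions, comments, and inline boundaries such as emphasis, and the function name is invented.

```python
import re

# Addition, deletion and highlight spans: "{" + mark + body + mark + "}".
# The backreference \1 requires the closing mark to match the opening one.
SPAN = re.compile(r"\{(\+\+|--|==)(.*?)\1\}", re.DOTALL)


def split_at_paragraphs(markdown: str) -> str:
    """Close and reopen CM spans wherever they cross a blank line."""

    def reopen(match):
        mark, body = match.group(1), match.group(2)
        # The capturing group keeps the paragraph breaks in the result,
        # at the odd indices of the list.
        pieces = re.split(r"(\n{2,})", body)
        out = []
        for index, piece in enumerate(pieces):
            if index % 2 == 1 or not piece:
                out.append(piece)  # paragraph break or empty text: keep as-is
            else:
                out.append("{" + mark + piece + mark + "}")
        return "".join(out)

    return SPAN.sub(reopen, markdown)
```

After this pass every span sits inside a single paragraph, which is the shape an inline-only AST element would need.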

@davidar
Contributor

davidar commented Apr 9, 2019

> converting from docx track change to CriticMarkup

pandiff supports this (fwiw)

> having native pandoc support is much better

Agreed (though this issue seems to have become conflated with #1560 now)

@ickc
Contributor

ickc commented Apr 9, 2019

It’s a different thing: it takes two inputs and produces a diff. CriticMarkup is one of its output formats.

Pancritic takes the diff as part of the document (e.g. authored tracked changes, or CriticMarkup written in the same document).

And does the README not mention it, or what? Because it doesn’t seem to do what you said it does. It mentions both as possible output formats but not converting one to the other.

@ickc
Contributor

ickc commented Apr 9, 2019

> Agreed (though this issue seems to have become conflated with #1560 now)

Skimming through that thread, toward the end the discussion really becomes more CriticMarkup-related. Should the two issues be merged?

However, what I said up there is slightly different. I think the kind of native support one would want for CriticMarkup in the AST is really dedicated AST elements for it. I think @jgm might mean this up there, but I could be wrong.

@davidar
Contributor

davidar commented Apr 9, 2019

@ickc see the last example at the bottom of the README (pandiff test/track_changes_move.docx)

@ickc
Contributor

ickc commented Apr 9, 2019

Interesting. Then I probably should not reinvent the wheel but just close that issue by referring to this (although the 2 languages are different. But they are composable.)

They ought to mention that in the readme though... do you know if it has a CriticMarkup reader? (Pancritic is essentially a CriticMarkup reader, and pandiff, from the readme, is a CriticMarkup writer (which takes different kinds of input).)

@davidar
Contributor

davidar commented Apr 9, 2019

> do you know if it has a CriticMarkup reader?

It does now. This can also serve as a preprocessor for normalising CriticMarkup syntax as you suggested (see test/normalise.{in,out}.md). I'll update the readme.

@alerque
Contributor Author

alerque commented Apr 10, 2019

@mb21 I hear you loud and clear. In fact my own personal workflow is strongly with you, and uses git based tools and an editing workflow that involves branches, merges, etc. In fact when I started a publishing company 4 years ago I had a dream that our entire workflow would revolve around git. At the time I had extensive experience with LaTeX and figured we'd just have everything in that from the get-go. It took about a week of trying to teach some translators how to use LaTeX before I gave that up and knew we needed a different canonical input format. Markdown to the rescue. Markdown has been good to us. In spite of some friction getting people away from WYSIWYG word processors, this has gone well. Four years in, the git based workflow is well established but it is still a hard sell and an ongoing point of frustration.

This year I started allowing some ... how shall I say this ... more free form inline editing. In particular the demand for inline commenting outstripped the usefulness of other available tools. Diffing between commits and branches, commenting on diffs, creating issues, commenting on PRs, etc. could all be cobbled together, but at the end of the day it took so many parts cobbled together that it was restricting authors and editors from doing their actual work. When I started allowing CriticMarkup formatted comments inline the tension let up. Our git based tools suddenly started making more sense to people (because they were doing jobs better suited to them rather than trying to serve outside their realm of expertise). The resistance to using the rest of the toolchain dropped way down. I can now build various output formats from the same sources, some that include inline notes and editing trails, others that don't.

CriticMarkup serves a purpose that other tooling does not serve well, and it serves best when it is an integral part of both the input and output formats — in other words when it can be kept intact through the whole pipeline.

@ickc
Contributor

ickc commented Apr 10, 2019

@alerque, I think @mb21 is just trying to explain why CriticMarkup hasn't gained much interest from the pandoc community. Discussions like this happened a long time ago, repeatedly. That's why so many people have tried to DIY when this is needed.

Pandoc's development isn't based on needs. To a certain extent it isn't even based on volunteering. What I mean is that even if someone spends the time to do the hard work, it might not be merged if the community doesn't agree on that feature. (This doesn't happen often, and these are just my observations.)

The pandoc community seems to like to spend the time needed to decide on the right feature to add, or the right way to implement something. In the case of CriticMarkup, historically there have been enough flaws to deter it from being included in the AST (I mean having AST elements that make it possible to be in Markdown and other formats). And among those opinions @jgm's carries the most weight. But from my experience, once the "philosophical" problems are solved, implementations often come very quickly.

So I think the fastest way to move this forward is to think of a design that is convincing (i.e. doable, not too ugly/hacky, and not too much of a compromise). In the past people have failed to make a convincing case. But recent developments and usage patterns (e.g. Word's track changes), or maybe new ideas, could be changing that. I'm counting on you to convince us ;)

(Just my 2 cents though.)

@ickc
Contributor

ickc commented Apr 10, 2019

> It does now. This can also serve as a preprocessor for normalising CriticMarkup syntax as you suggested (see test/normalise.{in,out}.md). I'll update the readme.

Seems pretty good. I'll try it out later when I get time. Also, did you advertise it in pandoc-discuss and the wiki? I might have missed it. There's a feature request in mine that yours already seems to support, so after trying it out I might just direct people needing that to use yours.

@ttxtea

ttxtea commented Jun 16, 2021

There is a wonderful lua-filter [1] that transforms docx comments to CriticMarkup. Is it not possible to do the same vice versa? So just as a filter in pandoc, not as a preprocessor?

[1] https://gist.github.com/noamross/12e67a8d8d1fb71c4669cd1ceb9bbcf9
