Exporting an MD doc to itself as a cleanup trick #2814

tajmone · 2016-03-25T09:04:22Z

A trick I often use (but don't see mentioned much) to clean up my draft MD documents is to have Pandoc process the document onto itself, using the same format for input and output, plus a few other options to achieve some "cleanup tricks".

>pandoc -f markdown -t markdown -o mydoc.md mydoc.md

Pandoc auto-wraps text by default (at 72 columns, per the manual), which means that all text blocks will be "paginated" to that width, giving a cleaner feel to the raw source. The --columns= option customizes the width.

>pandoc -f markdown -t markdown --columns=120 -o mydoc.md mydoc.md

All lazy syntax is cleaned up by this process, making lists and quotations look clean (to mention just a few).
Pandoc will apply default (or chosen) styling to the document (ATX vs Setext headers, and so on) giving uniformity to the document.
The --smart option will convert straight quotes, dashes, etc.
The --standalone --toc options will create an auto-generated TOC at the beginning of the document, which is quite useful when working with API docs, READMEs, etc. (-s / --standalone is required for this to work).

And there are possibly quite a few other useful hacks one could apply to the document one is working on.

So, with this issue I propose two things:

I think this trick could be added to Documentation/Usage; it's a neat trick for beginners.
Also, it would be nice to have a new option that invokes this quickly. Something like --cleanup, which would require only the source file as a parameter and default to that same file as output.
Some sort of alias for -f markdown -t markdown -o filename.md filename.md.
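
For illustration only, the proposed shortcut could be approximated today with a small wrapper script. The sketch below is in Python; the names `build_cleanup_command` and `cleanup` are hypothetical, and it relies only on the pandoc options already shown above. It is not an existing pandoc feature.

```python
import subprocess


def build_cleanup_command(path, columns=None):
    """Return the pandoc argv that rewrites `path` onto itself as Markdown."""
    cmd = ["pandoc", "-f", "markdown", "-t", "markdown"]
    if columns is not None:
        # Optional custom wrapping width, as in the --columns=120 example.
        cmd.append(f"--columns={columns}")
    # Same file as output and input, mirroring the command in the post.
    cmd += ["-o", path, path]
    return cmd


def cleanup(path, columns=None):
    """Run pandoc on `path`; raises CalledProcessError if pandoc fails."""
    subprocess.run(build_cleanup_command(path, columns), check=True)
```

Calling `cleanup("mydoc.md", columns=120)` would then run the second command from the post, assuming pandoc is on the PATH.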

ghost · 2016-03-25T17:01:53Z

Beware that hard wrapping of prose works really badly with source control when you reflow after edits. A warning to that effect should be added if this trick goes into the docs.

tajmone · 2016-03-26T00:35:54Z

I thought of that, but I haven't yet banged my head against it enough to get a feeling for how disruptive it might be.

Yes: definitely worth a warning.

I believe there are some Markdown diffing tools that can be integrated with source control to get around that, i.e., by doing something like reparsing the text into unwrapped form before diffing, while also keeping track of inline styles. I haven't delved deeply into them, but I remember reading that they make life easier when using version control with MD docs.

https://help.github.com/articles/rendering-differences-in-prose-documents/

Also, I've found some different opinions on the problems regarding prose diffing:

http://www.cirosantilli.com/markdown-style-guide/#line-wrapping

https://community.lsst.org/t/standard-for-wrapping-prose-in-version-controlled-documents/227/2

Some seem to prefer 80 column wrapping to having very long one-line-paragraphs. And there are mentions of Git extra options/switches, like --word-diff or --color-words.

Others prefer a "one sentence, one line" approach.

Others prefer to leave paragraphs as they are and rely on the editor's visual wrapping.

I guess I'll have to check for myself how badly reflowing source text affects versioning, and by that I mean working collaboratively on a large prose text, with simultaneous edits and real scenarios of diff conflicts.

But I guess the whole issue is worth mentioning anyhow wherever version control of prose is involved, i.e., mentioning the benefits and problems of the different approaches, and possible solutions and tools specific to markdown.

jgm · 2016-03-26T05:06:02Z

Try --wrap=preserve if you want to retain line breaks in the
source (which is good for diffing).

ghost · 2016-03-26T07:59:24Z

@tajmone Thank you for the links, I’ll check them out.
@jgm The problem with hard wrapping is not with pandoc itself, which has great options to deal with wrapping however we want (thank you!). The problem is that if you go back to a finished paragraph and change something, the text flow gets messed up and you need a hard reflow; then, when you diff, it will seem that everything in the paragraph has changed, even if it hasn't really.

tajmone · 2016-03-26T09:56:06Z

@andya9 Yes, it's true, but cleanup operations involving hard wrapping could be used wisely, i.e., not in every commit but at some major "wrap up" steps, like final drafts, releases, etc.

But this whole issue does bring up some of the shortcomings of using version control for prose. At least for markdown it's doable, whilst with XML-like docs it's a true nightmare. There is something paradoxical about all this, though: the whole idea of collaborative projects hinges on including good documentation in a human-readable, plain format.

I did some further research, and so far I haven't found any specific tools for handling markdown diffs and conflicts (but I remember that I once came across [on GitHub] a tool for visualizing MD diffs in an elegant format).

I keep thinking that it would be possible to create some smart tool to handle this: taking advantage of Pandoc's powerful features, it could convert each doc to its AST and carry out diffing and conflict merging based on the AST. This would wash away the whole issue of hard wrapping.

But even a simpler approach might do, like re-processing each doc with wrapping disabled and diffing the unwrapped output (all paragraphs would then be rebuilt cleanly as single lines). Of course, one would need to add some smart functionality to it, allowing the user to easily pick and choose which changes to keep and which to discard.

From a script point of view, single EOLs should be considered as spaces, and multiple EOLs as real end of lines. So even some simple shell-script filtering might do the trick.
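
As a rough illustration of that filtering rule, here is a minimal Python sketch (the `unwrap` name is made up for this example). It assumes plain prose paragraphs only: it knows nothing about code blocks, lists, tables, or hard line breaks, all of which a real tool would have to leave alone.

```python
import re


def unwrap(text):
    """Join hard-wrapped lines: single newlines become spaces,
    runs of blank lines remain paragraph breaks."""
    # One or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, every line break is treated as a space.
    joined = (" ".join(line.strip() for line in p.splitlines()) for p in paragraphs)
    return "\n\n".join(joined)
```

Running both versions of a document through such a filter before diffing means a pure reflow produces no difference at all, at the cost of very long lines in the diff.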

I'm wondering whether the use of CriticMarkup simplifies this scenario or adds complexity to it.

I definitely think that the Pandoc documentation should devote a section or page to this issue: there is no way that anyone using MD documents in version-controlled projects is going to avoid bumping into these problems. And whichever approach/solution one might prefer, the bottom line is that in collaborative environments members should stick to some agreed-upon way of doing things. So far, googling the subject hasn't turned up a full-fledged article or tutorial on the issue; instead, there are bits and pieces here and there, and many discussion threads.

ghost · 2016-03-26T10:27:01Z

Yes, there’s much controversy on the net about what’s the best approach among the three, but no definite solution; for example asciidoctor recommends the one sentence-one line solution and provides lots of interesting points about it.
We should really have a section dealing with this problem too, but maybe just providing an overview of different approaches and how to deal with them in markdown (the --wrap option) instead of taking a position.

(About xml, yes: I recently had the same issue on a database-like document and decided to go the yaml route mainly because of this [and then discovered yaml is better in everything anyway 😄])

KurtPfeifle · 2016-03-26T10:32:20Z

> Try --wrap=preserve if you want to retain line breaks in the
> source (which is good for diffing).

I would like to see and use an option which could be named "--wrap=sentence".

This is how I author my own prose-like documents. I admit, it doesn't look nice (if I want to output/publish nice-looking Markdown, I can always use "--columns=90" or whatever...), but it helps me with two things:

It gives me an immediate visual feedback about my sentence lengths. Very long, complicated ones stand out clearly, and so I can break them down into shorter, simpler ones.
It is of course also very good for diffing.

I'm aware that it may not always be straight-forward to define what a "sentence" is. But just assuming any sequence of ". ", "! ", ": " or "?" to mark an end-of-sentence would already be a good start.
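
To make the heuristic concrete, here is a small Python sketch of that "good start". The function name `one_sentence_per_line` is hypothetical, and the rule is deliberately naive: it breaks after any ".", "!", "?" or ":" that is followed by whitespace, so abbreviations such as "e.g. " will be split wrongly.

```python
import re

# A sentence boundary is assumed to be whitespace preceded by . ! ? or :
_BOUNDARY = re.compile(r"(?<=[.!?:])\s+")


def one_sentence_per_line(paragraph):
    """Reflow one paragraph so that each (naively detected) sentence
    sits on its own line."""
    # Collapse existing line breaks and repeated spaces first.
    flat = " ".join(paragraph.split())
    return "\n".join(_BOUNDARY.split(flat))
```

For example, `one_sentence_per_line("It works. Does it? Yes: it\ndoes!")` yields four lines: `It works.`, `Does it?`, `Yes:` and `it does!`.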

Such a "standard" format for people who collaborate to write technical documentation would also help them if they check in their respective contributions into a source control system.

Would such a feature addition to Pandoc be worth considering?

ghost · 2016-03-26T10:36:13Z

@KurtPfeifle I like your idea; it would allow each team to choose their preferred workflow without resorting to external scripts – those would be really easy to write, but they are an added step in the workflow. Among the symbols, we need ";" too.

tajmone · 2016-03-26T10:37:14Z

@andya9 Thanks for the link! And I totally agree: we should not promote any particular approach, but present the whole scenario (at present, there is no mention of the whole issue).

I've been doing some tests in the meantime, with various diff tools. It seems to me that the whole problem changes in magnitude depending on which diff tools one uses.

The creator of GitSense wrote:

> The diffing problem is solved quite easily with the Google Diff, Match, Patch algorithm and our Smart View technology.
https://news.ycombinator.com/item?id=6296861

So, I've been playing around with the google-diff-match-patch online test tool:
https://neil.fraser.name/software/diff_match_patch/svn/trunk/demos/demo_diff.html

I'd say that it provides good visual feedback, and hopefully it might also provide smart merging.

ghost · 2016-03-26T10:39:11Z

> It seems to me that the whole problem changes in magnitude depending on which diff tools one uses.

This is really important to mention, as there’s also the risk of lock-in to a single tool to consider.

tajmone · 2016-03-26T10:42:58Z

@KurtPfeifle your idea of a --wrap=sentence option is really good. The one-sentence-one-line approach is definitely used by a lot of people.

For the other two approaches (wrapping at an established column number, or no wrapping at all), Pandoc is already equipped with the required options.

Since Pandoc's natural use is for prose, and it's usually employed in collaborative environments, this --wrap=sentence option should become a high-priority feature request in my opinion, because whenever I google the issue I always stumble upon the one-sentence-one-line approach.

And converting a doc by hand is just too messy.

@KurtPfeifle, why don't you propose it in a separate issue as a feature request?

tajmone · 2016-03-26T10:48:57Z

Another possible solution:

@gknoy:
> Perhaps you could have a special kind of commit, a "text-rewrap" commit (similar to merge commits), which might let your diff presentation tool alternate between diffing lines, paragraphs, or the like. I agree, the presentation layer really needs to be as granular as your editing team needs to be. I really like how Github shows the differences within a line that is different.
https://news.ycombinator.com/item?id=6296861

jgm · 2016-03-26T19:55:06Z

+++ Kurt Pfeifle [Mar 26 16 03:32 ]:

> > Try --wrap=preserve if you want to retain line breaks in the
> > source (which is good for diffing).
>
> I would like to see and use an option which could be named
> "--wrap=sentence".

This is overkill, I think. Just make sure your source file
has one sentence per line, and use --wrap=preserve.

It's actually not easy for pandoc to determine
the boundaries of sentences.

jgm · 2016-03-26T19:56:03Z

See #2374

ghost · 2016-03-26T20:31:04Z

@jgm On second thought, it would also be tricky because each team would have different rules on what to consider a sentence boundary (e.g. not counting it if the sentence is a short exclamation, and so on); it’s better to have a custom-made script that’s run once at the beginning and then just use --wrap=preserve as you suggested.

tajmone · 2016-03-27T09:43:23Z

@jgm, I've gone through Issue #2374, which you linked. A very interesting thread that sheds light on the issue. So, it was as I thought: operating at the AST level is the best solution.

But I have a further consideration: isn't the ideal approach to execute a diff on three documents, i.e., the common ancestor along with the two conflicting versions?

When it comes to semantics, it's easy to lose track of the many changes that make up a single "correction/edit". I might change a paragraph by deleting a few words and adding some others, and all this would constitute a single intervention from the semantic point of view.

But when the problem is a merge conflict (merging an edited doc back into master when the doc has also changed in the master branch in the meantime), it can become hard to tell whether a single word deletion relates to the original common ancestor or to the version changed in the meantime. After all, since the context is one of semantics, resolving conflicts for merging purposes means having to "semantically" re-align the two differing docs, so the ancestor should play a major role here (not only for human readability, but also for any smart automated filtering approach one might envisage).

Diffing conflicts in the revised doc doesn't really shed much light on the different directions the two editors took, unless you compare them to the original document from which they departed.

I didn't notice this point coming up in that thread; is there some reason for it? Am I missing something? Am I on the wrong track?

In the meantime (in the absence of a Pandoc-AST diffing tool), it looks like using Pandoc to process the conflicting documents (from and to markdown, with the no-wrapping option!) could be the best way to diff conflicting md docs: all block elements would be rendered as single lines before diffing. So, provided one has a diffing tool that works at the word level, with good visualization of very long lines and their atomic differences, this approach would sidestep the problems of EOLs and of all differences in text distribution across wrapped paragraphs.
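
To show what a word-level comparison buys here, the following is a minimal sketch using Python's standard `difflib`. It is not one of the tools discussed in this thread; the `word_diff` name and the bracket markers (modelled loosely on the plain output of `git diff --word-diff`) are choices made for this example only. Because the text is split on any whitespace, differences that are only about where lines wrap disappear.

```python
import difflib


def word_diff(old, new):
    """Return `new` with word-level changes marked against `old`.
    Removed words appear as [-...-], added words as {+...+}."""
    a, b = old.split(), new.split()  # any whitespace, including EOLs, separates words
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i1 < i2:  # words present only in the old text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j1 < j2:  # words present only in the new text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)
```

With this, a paragraph that was merely re-wrapped comes back without any markers, while an edited word is flagged on its own rather than as a whole changed line.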

NOTE: The dwdiff tool that you mention in #2374 looks interesting. Only *nix though (right now I am writing from a Win OS PC, but I was looking for some cross-platform solution to propose in collaborative projects anyhow)

tmerse · 2017-07-12T14:10:25Z

I pipe markdown files through pandoc in order to have consistent formatting.
More precisely: pandoc -f markdown+autolink_bare_uris -t markdown+autolink_bare_uris --atx-headers

Unfortunately, existing metadata blocks are lost in the process. Appending the yaml_metadata_block extension does not help.

pandoc -f markdown+autolink_bare_uris+yaml_metadata_block -t markdown+autolink_bare_uris+yaml_metadata_block --atx-headers

Does this work as intended, or is there an option to preserve metadata blocks during markdown-to-markdown conversion? The file should be self-contained, so external .yaml files are not an option.

jgm · 2017-07-12T20:39:55Z

+++ Tobias Mersmann [Jul 12 17 07:10 ]:

> I pipe markdown files through pandoc in order to have consistent formatting. More precisely: pandoc -f markdown+autolink_bare_uris -t markdown+autolink_bare_uris --atx-headers Unfortunately, existing metadata blocks are lost in the process. Appending the yaml_metadata_block extension won't work either.

You need to add `-s` (or `--standalone`) to your command if you want the metadata.

tajmone · 2017-07-13T09:07:33Z

> I pipe markdown files through pandoc in order to have consistent formatting.

I use this a lot because it's really useful, especially in version-controlled files, but I've not found a way to preserve reference-style links: they are lost in the process, because they are converted to inline-style links.

jgm · 2017-07-13T19:24:37Z

+++ Tristano Ajmone [Jul 13 17 09:07 ]:

> > I pipe markdown files through pandoc in order to have consistent formatting.
>
> I use this a lot because it's really useful, especially in version-controlled files, but I've not found a way to preserve reference-style links: they are lost in the process, because they are converted to inline-style links.

You can use --reference-links, but this will make ALL of your links reference links.

ickc mentioned this issue Apr 24, 2016

CriticMarkup Support? #2873

Open

jgm closed this as completed May 11, 2016

ickc referenced this issue Oct 8, 2016

Added a small clarification on --webtex with Markdown output.

d8600d6

Thanks to @ickc.
