Exporting an MD doc to itself as a cleanup trick #2814

Closed
tajmone opened this issue Mar 25, 2016 · 20 comments

@tajmone
Contributor

tajmone commented Mar 25, 2016

A trick I often use (but don't see mentioned much) to clean up my draft MD documents is to have Pandoc process the document onto itself, using the same format for input and output, plus some other options to achieve further "cleanup tricks".

>pandoc -f markdown -t markdown -o mydoc.md mydoc.md
  • Pandoc defaults to 80 columns auto-wrapping, which means that all text blocks will be "paginated" to 80 columns, giving a cleaner feel to the raw source. Using the --columns= option one can customize the width.
>pandoc -f markdown -t markdown --columns=120 -o mydoc.md mydoc.md
  • All lazy syntax is cleaned up by this process, making lists and quotations look clean (to mention just a few).
  • Pandoc will apply default (or chosen) styling to the document (ATX vs Setext headers, and so on) giving uniformity to the document.
  • The --smart option will convert straight quotes, dashes, etc.
  • The --standalone --toc options will create an auto-generated TOC at the beginning of the document, which is quite useful for working with API docs, READMEs, etc. (The -s / --standalone option is required for this to work.)

And there are possibly quite a few other useful hacks one could apply to the document one is working on.

So, with this issue I propose two things:

  1. I think this trick could be added to Documentation/Usage, it's a neat trick for beginners.
  2. Also, it would be nice to have a new option implemented to invoke this quickly. Something like --cleanup, which would require only the source file as a parameter, defaulting to itself as output.
    Some sort of alias for -f markdown -t markdown -o filename.md filename.md.
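
Until such an option exists, the alias can be approximated with a small wrapper script. What follows is only a sketch: `cleanup_command` and `cleanup` are hypothetical names, and the only pandoc flags used are the ones already shown above (`-f`, `-t`, `-o`, `--columns`). It writes to a temporary file and then swaps it in, on the assumption that this is safer than pointing `-o` straight at the input file.

```python
import os
import shutil
import subprocess
import tempfile


def cleanup_command(path, out_path, columns=None):
    """Build the pandoc argv for a markdown-to-markdown cleanup pass."""
    cmd = ["pandoc", "-f", "markdown", "-t", "markdown"]
    if columns is not None:
        cmd.append(f"--columns={columns}")
    cmd += ["-o", out_path, path]
    return cmd


def cleanup(path, columns=None):
    """Rewrite `path` in place. Requires pandoc to be installed and on PATH."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(suffix=".md", dir=directory)
    os.close(fd)
    try:
        subprocess.run(cleanup_command(path, tmp, columns), check=True)
        shutil.copymode(path, tmp)   # keep the original file permissions
        os.replace(tmp, path)        # swap the cleaned file into place
    finally:
        if os.path.exists(tmp):      # only true if pandoc failed
            os.remove(tmp)
```

Calling `cleanup("mydoc.md", columns=120)` would then correspond to the second command shown above.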
@ghost

ghost commented Mar 25, 2016

Beware that hard wrapping on prose works really badly with source control when you reflow after edits. This should be added as a warning if this trick goes into the docs.

@tajmone
Contributor Author

tajmone commented Mar 26, 2016

I thought of that, but I haven't yet banged my head against it enough to get a feeling for how disruptive it might be.

Yes: definitely worth a warning.

I thought that there are some Markdown diffing tools that can be integrated with source control to obviate that, i.e. something like reparsing the text into unwrapped text before diffing, and keeping track of inline styles as well. I haven't delved deeply into them, but I remember reading that they make life easier when using version control with MD docs.

https://help.github.com/articles/rendering-differences-in-prose-documents/

Also, I've found some different opinions on the problems regarding prose diffing:

http://www.cirosantilli.com/markdown-style-guide/#line-wrapping

https://community.lsst.org/t/standard-for-wrapping-prose-in-version-controlled-documents/227/2

Some seem to prefer 80 column wrapping to having very long one-line-paragraphs. And there are mentions of Git extra options/switches, like --word-diff or --color-words.

Others prefer a "one sentence, one line" approach.

Others prefer to leave paragraphs as they are, and rely on editor's visual wrapping.

I guess I'll have to check for myself how badly reflowing source text affects versioning, and by that I mean working collaboratively on a large prose text, involving simultaneous edits and real scenarios of diff conflicts.

But I guess the whole issue is worth mentioning anyhow whenever version control of prose is involved, i.e. mentioning the benefits and problems of the different approaches, and possible solutions and tools specific to markdown.

@jgm
Owner

jgm commented Mar 26, 2016

Try --wrap=preserve if you want to retain line breaks in the
source (which is good for diffing).

@ghost

ghost commented Mar 26, 2016

@tajmone Thank you for the links, I’ll check them out.
@jgm The problem with hard wrapping is not with pandoc itself, as it has great options to deal with wrapping however we want (thank you!). The problem is that if you go back to a finished paragraph and change something, the text flow gets messed up and you need a hard reflow; so when you diff, it will seem that everything in the paragraph has changed, even if it hasn't.
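
The effect can be reproduced without pandoc. The sketch below is only an illustration with made-up sample sentences: it uses Python's `textwrap` as a stand-in for a hard reflow (an assumption, pandoc's wrapping may differ in detail) and counts how many lines a line-based diff would flag after a small edit to the first sentence.

```python
import difflib
import textwrap


def changed_lines(before, after):
    """Count lines that a line-based diff reports as removed or added."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    count = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            count += (i2 - i1) + (j2 - j1)
    return count


def hard_wrap(sentences, width=30):
    """Join the sentences into one paragraph and reflow it to `width` columns."""
    return textwrap.fill(" ".join(sentences), width=width)


def sentence_per_line(sentences):
    """Keep every sentence on its own line, whatever its length."""
    return "\n".join(sentences)


# Made-up sample paragraph; only the first sentence is edited.
original = [
    "Pandoc wraps text at a fixed column.",
    "That keeps the raw source tidy.",
    "It also makes later edits noisy in diffs.",
]
edited = [
    "Pandoc wraps all paragraph text at a fixed column by default.",
    original[1],
    original[2],
]

wrapped_noise = changed_lines(hard_wrap(original), hard_wrap(edited))
sentence_noise = changed_lines(sentence_per_line(original), sentence_per_line(edited))
print("hard-wrapped:", wrapped_noise, "changed lines")
print("one sentence per line:", sentence_noise, "changed lines")
```

With this sample the reflowed version reports more changed lines than the sentence-per-line version, although the edit is the same; the exact counts depend on the text and the wrap width.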

@tajmone
Contributor Author

tajmone commented Mar 26, 2016

@andya9 Yes, it's true, but cleanup operations pertaining to hard wrapping could be used wisely, i.e. not in every commit but at some major "wrap up" steps, like final drafts, releases, etc.

But this whole issue does bring up some of the shortcomings of using version control for prose. At least for markdown it's doable, whilst with XML-like docs it's a true nightmare. There is something paradoxical about all this, though: the whole idea of collaborative projects hinges on including good documentation in a human-readable, plain format.

I've done some further research, and so far I haven't found any specific tools for handling markdown diffs and conflicts (but I remember that I once came across, on GitHub, a tool for visualizing MD diffs in an elegant format).

I keep thinking that it would be possible to create some smart tool to handle this: taking advantage of Pandoc's powerful features, it could convert each doc to its AST and carry out diffing and conflict merging based on the AST. This would wash away the whole issue of hard wrapping.
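
To make the AST idea concrete, here is a rough sketch. It assumes the JSON layout that recent pandoc versions emit with `-t json` (a top-level object with a `blocks` list, and inline nodes tagged `Str`, `Space`, `SoftBreak`, `Emph`, `Strong`); older releases used a different top-level layout. The sample documents are built by hand rather than produced by pandoc, and `inline_text`, `paragraphs` and `sample` are hypothetical helper names.

```python
import json


def inline_text(inlines):
    """Flatten a list of inline nodes to plain text, ignoring line wrapping."""
    parts = []
    for node in inlines:
        kind = node["t"]
        if kind == "Str":
            parts.append(node["c"])
        elif kind in ("Space", "SoftBreak"):
            # A soft line break in the source is treated like a plain space.
            parts.append(" ")
        elif kind in ("Emph", "Strong"):
            parts.append(inline_text(node["c"]))
        # Other inline types (links, code, ...) are skipped in this sketch.
    return "".join(parts)


def paragraphs(doc_json):
    """Return the text of every top-level paragraph in a pandoc JSON document."""
    doc = json.loads(doc_json)
    return [inline_text(block["c"]) for block in doc["blocks"] if block["t"] == "Para"]


def sample(break_node):
    """Hand-built stand-in for pandoc output: 'Hello <break> *big* world.'"""
    return json.dumps({
        "pandoc-api-version": [1, 23],
        "meta": {},
        "blocks": [{"t": "Para", "c": [
            {"t": "Str", "c": "Hello"},
            {"t": break_node},
            {"t": "Emph", "c": [{"t": "Str", "c": "big"}]},
            {"t": "Space"},
            {"t": "Str", "c": "world."},
        ]}],
    })


wrapped = sample("SoftBreak")   # source had a line break after "Hello"
unwrapped = sample("Space")     # same text on a single line
print(paragraphs(wrapped) == paragraphs(unwrapped))
```

Two sources that differ only in where their lines break compare as equal at this level, which is the property an AST-based diff would rely on. A real tool would also have to handle every other block and inline type.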

But even a simpler approach, like re-processing each doc with wrapping disabled and diffing the unwrapped output, might do (then all paragraphs would be rebuilt cleanly as single lines). Of course, one would need to implement some smart functionality on top of it, allowing the user to pick and choose easily which changes to keep and which to discard.

From a script point of view, single EOLs should be considered as spaces, and multiple EOLs as real end of lines. So even some simple shell-script filtering might do the trick.
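
A minimal sketch of that filter, under the stated rule only (single EOLs become spaces, blank lines separate paragraphs). The function name `unwrap` is made up, and the filter knows nothing about Markdown: it would also flatten code blocks, lists, tables and hard line breaks, which is why running the text through Pandoc with wrapping disabled is the more robust route.

```python
import re


def unwrap(text):
    """Join hard-wrapped lines; blank lines still separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace (including single newlines) to one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


sample = "A paragraph that was\nwrapped by hand.\n\nA second\nparagraph.\n"
print(unwrap(sample))
```

Diffing the output of `unwrap` for two versions of a file then compares whole paragraphs, one per line, regardless of how each author wrapped them.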

I'm wondering whether the use of CriticMarkup simplifies this scenario or adds complexity to it.

I definitely think that Pandoc documentation should devote a section or page to this issue: there is no way that anyone using MD documents in version-controlled projects is going to avoid bumping into these problems. And whichever approach/solution one might prefer, the bottom line is that in collaborative environments members should stick to some agreed-upon way of doing things. And, so far, googling the subject didn't bring up a full-fledged article or tutorial on the issue; instead, bits and pieces here and there, and many discussion threads.

@ghost

ghost commented Mar 26, 2016

Yes, there’s much controversy on the net about what’s the best approach among the three, but no definite solution; for example asciidoctor recommends the one sentence-one line solution and provides lots of interesting points about it.
We should really have a section dealing with this problem too, but maybe just providing an overview of different approaches and how to deal with them in markdown (the --wrap option) instead of taking a position.

(About xml, yes: I recently had the same issue on a database-like document and decided to go the yaml route mainly because of this [and then discovered yaml is better in everything anyway 😄])

@KurtPfeifle

Try --wrap=preserve if you want to retain line breaks in the
source (which is good for diffing).

I would like to see and use an option which could be named "--wrap=sentence".

This is how I author my own prose-like documents. I admit, it doesn't look nice (if I want to output/publish nice-looking Markdown, I can always use "--columns=90" or whatever...), but it helps me with two things:

  1. It gives me immediate visual feedback about my sentence lengths. Very long, complicated ones stand out clearly, and so I can break them down into shorter, simpler ones.
  2. It is of course also very good for diffing.

I'm aware that it may not always be straight-forward to define what a "sentence" is. But just assuming any sequence of ". ", "! ", ": " or "?" to mark an end-of-sentence would already be a good start.
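
As a rough illustration of that heuristic (not of anything Pandoc implements), the sketch below splits a paragraph after ".", "!", ":" or "?" when followed by whitespace. The name `one_sentence_per_line` is made up for the example.

```python
import re

# End-of-sentence marks proposed above: a period, exclamation mark, colon or
# question mark, followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?:])\s+")


def one_sentence_per_line(paragraph):
    """Reflow one paragraph so that each (naively detected) sentence gets its own line."""
    flat = " ".join(paragraph.split())   # undo any existing hard wrapping
    return "\n".join(SENTENCE_END.split(flat))


print(one_sentence_per_line("First one. Second\none! Third: yes? Done."))
print(one_sentence_per_line("See e.g. the manual."))
```

The second call shows where the heuristic falls short: the abbreviation "e.g." is taken for a sentence end, so the line is broken in the middle of a sentence.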

Such a "standard" format for people who collaborate to write technical documentation would also help them if they check in their respective contributions into a source control system.

Would such a feature addition to Pandoc be worth considering?

@ghost

ghost commented Mar 26, 2016

@KurtPfeifle I like your idea, it would allow each team to choose their preferred workflow without resorting to external scripts – these would be really easy to write, but they are an added step in the workflow. Among the symbols, we need ";" too.

@tajmone
Contributor Author

tajmone commented Mar 26, 2016

@andya9 Thanks for the link! And I agree totally: don't promote any particular approach, but present the whole scenario (at present, there is no mention of the issue at all).

I've been doing some tests in the meantime, with various diff tools. It seems to me that the whole problem changes in magnitude depending on which diff tools one uses.

The creator of GitSense wrote:

The diffing problem is solved quite easily with the Google Diff, Match, Patch algorithm and our Smart View technology.
https://news.ycombinator.com/item?id=6296861

So, I've been playing around with google-diff-match-patch online test tool:
https://neil.fraser.name/software/diff_match_patch/svn/trunk/demos/demo_diff.html

I'd say that it provides good visual feedback, and hopefully it might also provide smart merging.

@ghost

ghost commented Mar 26, 2016

It seems to me that the whole problem changes in magnitude depending on which diff tools one uses.

This is really important to mention, as there’s the risk of lock-in to a single tool to consider too

@tajmone
Contributor Author

tajmone commented Mar 26, 2016

@KurtPfeifle your idea of a --wrap=sentence option is really good. The one-sentence-one-line approach is definitely used by a lot of people.

For the other two approaches (wrapping at an established column number, or no wrapping at all), Pandoc is already equipped with the required options.

Since Pandoc's natural use is for prose, and it's usually employed in collaborative environments, this --wrap=sentence option should become a high-priority feature request in my opinion, because whenever I google the issue I always stumble upon the one-sentence-one-line approach.

And converting a doc by hand is just too messy.

@KurtPfeifle, why don't you propose it in a separate issue as a feature request?

@tajmone
Contributor Author

tajmone commented Mar 26, 2016

Another possible solution:

@gknoy:
Perhaps you could have a special kind of commit, a "text-rewrap" commit (similar to merge commits), which might let your diff presentation tool alternate between diffing lines, paragraphs, or the like. I agree, the presentation layer really needs to be as granular as your editing team needs to be. I really like how Github shows the differences within a line that is different.
https://news.ycombinator.com/item?id=6296861

@jgm
Owner

jgm commented Mar 26, 2016

+++ Kurt Pfeifle [Mar 26 16 03:32 ]:

Try --wrap=preserve if you want to retain line breaks in the
source (which is good for diffing).

I would like to see and use an option which could be named
"--wrap=sentence".

This is overkill, I think. Just make sure your source file
has one sentence per line, and use --wrap=preserve.

It's actually not easy for pandoc to determine
the boundaries of sentences.

@jgm
Owner

jgm commented Mar 26, 2016

See #2374

@ghost

ghost commented Mar 26, 2016

@jgm On second thought, it would also be tricky because each team would have different rules on what to consider a sentence boundary (e.g. not counting it if the sentence is a short exclamation, and so on); it’s better to have a custom-made script that’s run once at the beginning, and then just use --wrap=preserve as you suggested.

@tajmone
Contributor Author

tajmone commented Mar 27, 2016

@jgm, I've gone through Issue #2374, which you linked to. A very interesting thread; it sheds light on the issue. So, it was as I thought: operating at the AST level is the best solution.

But I have a further consideration: isn't the ideal approach to execute a diff on three documents, i.e. the common ancestor alongside the two conflicting versions?

When it comes to semantics, it's easy to lose track of the many changes that make up a single "correction/edit". I might change a paragraph by deleting a few words and adding some others, and all this would amount to a single intervention from the semantic point of view.

But when the problem is a merge conflict, i.e. merging an edited doc back into master when the doc has meanwhile also changed in the master branch, it can become hard to tell whether a single word deletion relates to the original common ancestor or to the version changed in the meantime. After all, since the context is one of semantics, resolving conflicts for merging purposes means having to realign the two differing docs "semantically", so the ancestor should play a major role here (not only for human readability, but also for any smart automated filtering approach one might envisage creating).

Diffing conflicts in the revised doc doesn't shed really much light on the different directions the two editors went along — unless you compare them to the original document from which they departed.
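
A toy sketch of what bringing in the ancestor buys, working on whole paragraphs. It is deliberately naive: it assumes all three versions have the same number of paragraphs (none added or removed), `merge3` is a made-up name, and paragraphs are compared after collapsing whitespace so that re-wrapping alone never counts as a change.

```python
def normalize(paragraph):
    """Collapse whitespace so that hard-wrapping differences are ignored."""
    return " ".join(paragraph.split())


def merge3(base, ours, theirs):
    """Merge two edited paragraph lists against their common ancestor.

    Returns (merged, conflicts), where `conflicts` lists the indexes of
    paragraphs that both sides changed in different ways.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch assumes no paragraphs were added or removed")
    merged, conflicts = [], []
    for index, triple in enumerate(zip(base, ours, theirs)):
        b, o, t = (normalize(p) for p in triple)
        if o == t:        # both sides agree (possibly both unchanged)
            merged.append(o)
        elif o == b:      # only "theirs" changed this paragraph
            merged.append(t)
        elif t == b:      # only "ours" changed this paragraph
            merged.append(o)
        else:             # both changed it differently: a human must decide
            merged.append(b)
            conflicts.append(index)
    return merged, conflicts


base = ["Intro text.", "Body text.", "End text."]
ours = ["Intro text, edited.", "Body text.", "End by us."]
theirs = ["Intro text.", "Body\ntext.", "End by them."]
merged, conflicts = merge3(base, ours, theirs)
print(merged, conflicts)
```

Without the ancestor, the first paragraph would look like a disagreement between the two editors; with it, the edit is plainly one-sided and can be taken automatically, leaving only the last paragraph as a true conflict.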

I didn't notice this issue coming up in that thread; is there some reason for it? Am I missing something? Am I on the wrong track?

In the meantime (in the absence of a Pandoc-AST diffing tool) it looks like using Pandoc to process the conflicting documents (from and to markdown, with the no-wrapping option!) could be the best way to diff conflicting MD docs: all block elements would be rendered as single lines before diffing. So, provided one has a diffing tool that works at the word level, with good visualization of very long lines and their atomic differences, this approach would sidestep the problems of EOLs and all the differences in text distribution for wrapped paragraphs.
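
For a feel of what a word-level diff looks like on such long lines, here is a small cross-platform sketch using only the Python standard library. It is an illustration, not a replacement for a real diff tool: `word_diff` is a made-up name, the `[-removed-]` / `{+added+}` markers are borrowed from the notation of `git diff --word-diff`, and it is meant to be applied to one paragraph at a time.

```python
import difflib


def word_diff(old, new):
    """Word-level diff: deletions appear as [-x-], insertions as {+x+}."""
    a, b = old.split(), new.split()   # splitting on any whitespace ignores wrapping
    out = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


print(word_diff("the quick\nbrown fox", "the quick brown fox"))
print(word_diff("the quick brown fox", "the slow brown fox"))
```

The first call differs only in wrapping and produces no markers at all; the second marks just the one word that changed.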

NOTE: The dwdiff tool that you mention in #2374 looks interesting. Only *nix though (right now I am writing from a Win OS PC, but I was looking for some cross-platform solution to propose in collaborative projects anyhow)

@tmerse

tmerse commented Jul 12, 2017

I pipe markdown files through pandoc in order to have consistent formatting.
More precisely: pandoc -f markdown+autolink_bare_uris -t markdown+autolink_bare_uris --atx-headers

Unfortunately, existing metadata blocks are lost in the process. Appending the yaml_metadata_block extension does not help.

pandoc -f markdown+autolink_bare_uris+yaml_metadata_block -t markdown+autolink_bare_uris+yaml_metadata_block --atx-headers

Does this work as intended, or is there an option to preserve metadata blocks during markdown-to-markdown conversion? The file should be self-contained, so external .yaml files are not an option.

@jgm
Owner

jgm commented Jul 12, 2017 via email

@tajmone
Contributor Author

tajmone commented Jul 13, 2017

I pipe markdown files through pandoc in order to have consistent formatting.

I use this a lot because it's really useful, especially in version-controlled files, but I've not found a way to preserve reference-style links: they are lost in the process, because they are converted to inline-style links.

@jgm
Owner

jgm commented Jul 13, 2017 via email
