Exporting an MD doc to itself as a cleanup trick #2814
Comments
Beware that hard wrapping of prose works really badly with source control when you reflow after edits. This should be added as a warning if this trick goes into the docs. |
I thought of that, but I haven't yet banged my head against it to get a feel for how disruptive it might be. Yes: definitely worth a warning. I thought that there are some Markdown diffing tools that can be integrated with source control to obviate that, i.e. something like reparsing the text into unwrapped text before diffing, while also keeping track of inline styles. I haven't delved deeply into them, but I remember reading that they make life easier when using version control with MD docs. https://help.github.com/articles/rendering-differences-in-prose-documents/

Also, I've found some differing opinions on the problems of prose diffing: http://www.cirosantilli.com/markdown-style-guide/#line-wrapping https://community.lsst.org/t/standard-for-wrapping-prose-in-version-controlled-documents/227/2

Some seem to prefer 80-column wrapping to having very long one-line paragraphs. And there are mentions of extra Git options/switches. Others prefer a "one sentence, one line" approach. Others prefer to leave paragraphs as they are and rely on the editor's visual wrapping.

I guess I'll have to check for myself how badly reflowing source text affects versioning, and by that I mean working collaboratively on a large prose text, involving simultaneous edits and real scenarios of diff conflicts. But I guess that the whole issue is worth mentioning anyhow whenever version control of prose is involved, i.e. mentioning the benefits and problems of the different approaches, and possible solutions and tools specific to Markdown. |
Try --wrap=preserve if you want to retain line breaks in the |
@tajmone Thank you for the links, I’ll check them out. |
@andya9 Yes, it's true, but cleanup operations involving hard wrapping could be used wisely, i.e. not in every commit but at some major "wrap up" steps, like final drafts, releases, etc. But this whole issue does bring up some of the shortcomings of using version control for prose. At least for markdown it's doable, whilst with XML-like docs it's a true nightmare. There is something paradoxical about all this, though: the whole idea of collaborative projects hinges on including good documentation in a human-readable, plain format.

I did some further research, and so far I haven't found any specific tools for handling markdown diffs and conflicts (but I remember that I once came across, on GitHub, a tool for visualizing MD diffs in an elegant format). I keep thinking that it would be possible to create some smart tool to handle this: taking advantage of Pandoc's powerful features, it could convert each doc to its AST and carry out diffing and conflict merging based on the AST. This would wash away the whole issue of hard wrapping. But even a simpler approach, like re-processing each doc with wrapping disabled and diffing the unwrapped output, might do (all paragraphs would then be rebuilt cleanly as single lines). Of course, one would need to implement some smart functionality on top of it, allowing the user to easily pick and choose which changes to keep and which to discard. From a script's point of view, single EOLs should be considered as spaces, and multiple EOLs as real ends of lines. So even some simple shell-script filtering might do the trick.

I'm wondering whether the use of CriticMarkup simplifies this scenario or adds complexity to it. I definitely think that the Pandoc documentation should devote a section or page to this issue: there is no way that anyone using MD documents in version-controlled projects is going to avoid bumping into these problems.

And whichever approach/solution one might prefer, the bottom line is that in collaborative environments members should stick to some agreed-on way of doing things. And, so far, googling the subject didn't bring up a full-fledged article or tutorial on the issue; instead, bits and pieces here and there, and many discussion threads. |
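The "single EOLs as spaces, multiple EOLs as real breaks" idea above could be sketched roughly as follows. This is only an illustration in Python (not the shell filter mentioned, and not anything Pandoc provides): the names `unwrap`, `words` and `word_changes` are made up for this example, and the unwrapping is deliberately naive, so it would mangle code blocks, lists and tables.

```python
import difflib
import re

PARA_BREAK = "\u00b6"  # hypothetical marker token standing in for a paragraph break


def unwrap(text):
    """Treat single newlines as spaces and blank lines as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [" ".join(p.split()) for p in paragraphs]


def words(text):
    """Flatten a document into a list of words plus paragraph-break markers."""
    tokens = []
    for paragraph in unwrap(text):
        tokens.extend(paragraph.split(" "))
        tokens.append(PARA_BREAK)
    return tokens


def word_changes(old, new):
    """Return the non-equal opcodes of a word-level diff, ignoring line wrapping."""
    a, b = words(old), words(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

With this, two versions that differ only in where their lines are wrapped produce no changes at all, while a real edit shows up as a small word-level replacement rather than a rewritten paragraph.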
Yes, there’s much controversy on the net about which of the three approaches is best, but no definitive solution; for example, Asciidoctor recommends the one-sentence-per-line solution and provides lots of interesting points about it. (About XML, yes: I recently had the same issue with a database-like document and decided to go the YAML route mainly because of this [and then discovered YAML is better in everything anyway 😄]) |
I would like to see and use an option which could be named This is how I author my own prose-like documents. I admit, it doesn't look nice (if I want to output/publish nice-looking Markdown, I can always use
I'm aware that it may not always be straightforward to define what a "sentence" is. But just treating any occurrence of ". ", "! ", ": " or "?" as an end-of-sentence marker would already be a good start. Such a "standard" format for people who collaborate on technical documentation would also help them when they check their respective contributions into a source control system. Would such a feature addition to Pandoc be worth considering? |
@KurtPfeifle I like your idea, it would allow each team to choose their preferred workflow without resorting to external scripts. Those would be really easy to write, but they are an added step in the workflow. Among the symbols, we need ";" too. |
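For illustration, the kind of external script mentioned here might look like the Python sketch below. It assumes the naive rule proposed above (a sentence ends at ".", "!", "?", ":" or ";" followed by whitespace); the function name `one_sentence_per_line` is invented for this example. It is not Markdown-aware, and it will wrongly split after abbreviations such as "e.g.".

```python
import re

# Assumption: a sentence boundary is one of . ! ? : ; followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?:;])\s+")


def one_sentence_per_line(text):
    """Reflow each paragraph so that every sentence sits on its own line."""
    reflowed = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        flat = " ".join(paragraph.split())  # undo any existing hard wrapping
        reflowed.append("\n".join(SENTENCE_END.split(flat)))
    return "\n\n".join(reflowed)
```

Run once over a document, this gives the "one sentence, one line" layout; after that, line breaks only need to be preserved rather than recomputed.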
@andya9 Thanks for the link! And I totally agree: not promoting any particular approach, but presenting the whole scenario (at present, there is no mention of the whole issue). I've been doing some tests in the meantime, with various diff tools. It seems to me that the magnitude of the whole problem changes depending on which diff tool one uses. The creator of GitSense wrote:
So, I've been playing around with the google-diff-match-patch online test tool: I'd say that it provides good visual feedback, and hopefully it might also provide smart merging. |
This is really important to mention, as there’s the risk of lock-in to a single tool to consider too |
@KurtPfeifle your idea of a For the other 2 approaches (wrapping at an established column number, or no wrapping at all), Pandoc is already equipped with the required options. Since Pandoc's natural use is for prose, and it's usually employed in collaborative environments, this And converting a doc by hand is just too messy. @KurtPfeifle, why don't you propose it in a separate issue as a feature request? |
Another possible solution:
|
+++ Kurt Pfeifle [Mar 26 16 03:32 ]:
This is overkill, I think. Just make sure your source file It's actually not easy for pandoc to determine |
See #2374 |
@jgm On second thought, it would also be tricky because each team would have different rules on what to consider a sentence boundary (e.g. not counting it if the sentence is a short exclamation, and so on); it’s better to have a custom-made script that’s run once at the beginning and then just use `--wrap=preserve` as you suggested. |
@jgm, I've gone through Issue #2374, to which you provided a link. A very interesting thread that sheds light on the issue. So, it was as I thought: operating at the AST level is the best solution.

But I have a further consideration: isn't the ideal approach to execute a diff on three documents, i.e. the common ancestor along with the two conflicting versions? When it comes to semantics, it's easy to lose track of the many changes that make up a single "correction/edit". I might change a paragraph by deleting a few words and adding some others, and all this would amount to a single intervention from the semantic point of view. But when the problem is a conflict in merging an edited doc back into master, where the doc has meanwhile also changed in the master branch, it can become hard to tell whether a single word deletion relates to the original common ancestor or to the version changed in the meantime. After all, since the context is one of semantics, resolving conflicts for merging purposes means having to realign the two diverging docs "semantically", so the ancestor should play a major role here (not only for human readability, but also for any smart automated filtering approach one might envisage creating). Diffing conflicts in the revised doc doesn't really shed much light on the different directions the two editors took, unless you compare them to the original document from which they departed.

I didn't notice this issue coming up in that thread. Is there some reason for it? Am I missing something? Am I on the wrong track?

In the meantime (in the absence of a Pandoc-AST diffing tool), it looks like using Pandoc to process the conflicting documents (from and to markdown, with the no-wrapping option!) could be the best way to diff conflicting MD docs: all block elements would be rendered as single lines before diffing. So, provided one has a diffing tool that works at the word level, with good visualization of very long lines and their atomic differences, this approach would sidestep the problems of EOLs and of differing text distribution in wrapped paragraphs.

NOTE: The dwdiff tool that you mention in #2374 looks interesting. It's *nix only, though (right now I am writing from a Windows PC, but I was looking for a cross-platform solution to propose in collaborative projects anyhow). |
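As a rough sketch of that last idea, the snippet below round-trips each conflicting file through Pandoc with `--wrap=none` and diffs the unwrapped results. It assumes `pandoc` is on the PATH; the helper names (`unwrap_command`, `unwrapped`, `diff_unwrapped`) and the file names in the usage note are made up for illustration. The diff here is a plain line-based one from Python's standard library, so a word-level tool would still be needed on top of it for long paragraphs.

```python
import difflib
import subprocess


def unwrap_command(path):
    """Build the markdown-to-markdown command with wrapping disabled."""
    return ["pandoc", "-f", "markdown", "-t", "markdown", "--wrap=none", path]


def unwrapped(path):
    """Run pandoc and return the document with each block on a single line."""
    result = subprocess.run(
        unwrap_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout


def diff_unwrapped(ours_text, theirs_text):
    """Line-based unified diff of two already-unwrapped documents."""
    return list(
        difflib.unified_diff(
            ours_text.splitlines(),
            theirs_text.splitlines(),
            "ours",
            "theirs",
            lineterm="",
        )
    )
```

Usage would be along the lines of `diff_unwrapped(unwrapped("ours.md"), unwrapped("theirs.md"))`, where both file names are placeholders.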
I pipe markdown files through pandoc in order to have consistent formatting. Unfortunately, existing metadata blocks are lost in the process. Appending the `yaml_metadata_block` extension won't work either.
Does this work as intended, or is there an option to preserve metadata blocks during markdown-to-markdown conversion? The file should be self-contained, so external |
+++ Tobias Mersmann [Jul 12 17 07:10 ]:
> I pipe markdown files through pandoc in order to have consistent
> formatting.
> More precisely: `pandoc -f markdown+autolink_bare_uris -t markdown+autolink_bare_uris --atx-headers`
> Unfortunately, existing metadata blocks are lost in the process.
> Appending the yaml_metadata_block extension won't work either.

You need to add `-s` (or `--standalone`) to your command if
you want the metadata.
|
I use this a lot because it's really useful, especially in version-controlled files, but I've not found a way to preserve reference-style links: they are lost in the process, because they are converted to inline-style links. |
+++ Tristano Ajmone [Jul 13 17 09:07 ]:
> I pipe markdown files through pandoc in order to have consistent
> formatting.
> I use this a lot because it's really useful, especially in version
> controlled files, but I've not found a way to preserve reference-style
> links: they are lost in the process, because they are converted to
> inline-style links.

You can use `--reference-links`, but this will make ALL of
your links reference links.
|
A trick which I often use (but don't see mentioned much) to clean up my draft MD documents is to use Pandoc to process the document to itself (as output) using the same format as input, plus some other options to achieve some "cleanup tricks".
- With the `--columns=` option one can customize the wrapping width.
- The `--smart` option will convert straight quotes, dashes, etc.
- The `--standalone --toc` options will create an auto-generated TOC at the beginning of the document, quite useful when working with API docs, READMEs, etc. (`-s`/`--standalone` is required for this to work.)

And possibly quite a few other useful hacks one could apply to the document one is working on.
So, with this issue I propose two things:
- `--cleanup`, which would require only the source file as a parameter, defaulting to itself as output. Some sort of alias for `-f markdown -t markdown -o filename.md filename.md`.
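Until something like that exists, a small wrapper can stand in for the alias. The Python sketch below is only an illustration and is not part of Pandoc: `cleanup_command` and `cleanup` are invented names, and it assumes `pandoc` is on the PATH. It builds exactly the command above, optionally adding `--columns=` and `--standalone`.

```python
import subprocess


def cleanup_command(path, columns=None, standalone=False):
    """Build the 'process the document to itself' pandoc invocation."""
    cmd = ["pandoc", "-f", "markdown", "-t", "markdown"]
    if columns is not None:
        cmd.append(f"--columns={columns}")  # custom wrapping width
    if standalone:
        cmd.append("--standalone")  # needed for --toc, as noted above
    cmd += ["-o", path, path]  # output defaults to the source file itself
    return cmd


def cleanup(path, **options):
    """Run the cleanup in place; raises CalledProcessError if pandoc fails."""
    subprocess.run(cleanup_command(path, **options), check=True)
```

Since this overwrites the source file, it is probably wise to run it only on files that are already committed to version control.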