pandoc diff #2374

Closed
naught101 opened this issue Aug 28, 2015 · 29 comments

Comments

@naught101

There is a tool called latexdiff that takes two LaTeX files and creates a PDF that is a track-changes-style diff of the two (e.g. deletions in red, additions in blue or similar).

It would be really nice to have such a thing with pandoc, in particular because you can make latexdiff play nicely with git and produce "this is what I've done since you last saw it" PDF documents that are really useful for showing to paper reviewers.

Mostly I'm only interested in this for markup formats (markdown, rst), at least initially.

I don't know if it makes sense to do something like this within pandoc, or as a separate script, but even if you're not interested in adding it, I'd be interested to get your thoughts on how best to go about it as a separate script.

@jgm
Owner

jgm commented Aug 28, 2015

I like the idea.

In principle, it would be possible to create something that
did a diff at the level of the pandoc AST. The tool could
mark up insertions and deletions in inline contexts by
putting them in special Span elements, and in block contexts
by putting them in special Div elements. The result could be
viewed in HTML using some appropriate CSS, or in other
formats. The tool could be made to work on documents
in any format pandoc can read.

It would differ from other diff tools in that it would only
recognize differences in the basic structural elements
recognized by pandoc. So, for example, the two Markdown
strings *hi* and _hi_ would not count as different
according to this tool.

It would not be too hard to implement this as a separate
program using the pandoc library. You'd probably want to
use the Diff library on Hackage for the diff'ing part.
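
A minimal sketch of the inline part of this idea, written in Python against the JSON form of the AST rather than the Haskell types. The class names "inserted" and "deleted" and the helper names are assumptions made for illustration; pandoc does not define them.

```python
import json
from difflib import SequenceMatcher


def _key(node):
    # Dicts are not hashable, so compare nodes by a canonical JSON string.
    return json.dumps(node, sort_keys=True)


def span(css_class, inlines):
    # JSON form of a Span: attributes are [identifier, classes, key-value pairs].
    return {"t": "Span", "c": [["", [css_class], []], inlines]}


def diff_inlines(old, new):
    """Return an inline list with deletions and insertions wrapped in Spans."""
    matcher = SequenceMatcher(a=[_key(n) for n in old],
                              b=[_key(n) for n in new],
                              autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(old[i1:i2])
            continue
        if i2 > i1:  # present only in the old version
            result.append(span("deleted", old[i1:i2]))
        if j2 > j1:  # present only in the new version
            result.append(span("inserted", new[j1:j2]))
    return result


old = [{"t": "Str", "c": "Hi"}, {"t": "Space"}, {"t": "Str", "c": "there"}]
new = [{"t": "Str", "c": "Hi"}, {"t": "Str", "c": "there"}]
marked = diff_inlines(old, new)
```

The marked list could then be placed back into a Para and rendered to HTML, where CSS rules for the two classes would colour the changes. Block-level changes would need the same treatment with Div instead of Span.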


@allefeld
Contributor

I would be really interested in seeing this happen, too. Like the OP I use pandoc to write academic papers, and it is often necessary to show coauthors or reviewers what changes have been made. I experimented with doing the diff on generated LaTeX files using latexdiff, and with doing the diff on the Markdown files by having wdiff insert some CriticMarkup-inspired marks, but the diff files almost always turn out not to be compilable, and repairing them by hand is extremely tedious. wdiff in particular tends not to play well with embedded LaTeX math in the Markdown code. I therefore believe that implementing something like this on the internal AST representation is the way to go. @naught101, have you made any progress on this?

@jgm
Owner

jgm commented Oct 10, 2015

One tricky thing is that the AST is not just a list; it's
a tree-like structure that includes list-like structures.
The Diff library provides functions to give you diffs of
list-like structures, but it's not entirely obvious how
to do diffs on trees. There must be prior art on this
somewhere.
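
One naive way to extend a list diff to a tree, sketched in Python: align the children of two nodes by label with an ordinary sequence diff, then recurse into aligned nodes whose contents differ. The (label, children) representation and the function name are hypothetical, and this is a crude heuristic rather than a proper tree edit distance.

```python
from difflib import SequenceMatcher


def diff_children(old, new):
    """Diff two lists of (label, children) nodes, recursing into aligned nodes."""
    matcher = SequenceMatcher(a=[n[0] for n in old],
                              b=[n[0] for n in new],
                              autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Labels line up; the subtrees below them may still differ.
            for o, n in zip(old[i1:i2], new[j1:j2]):
                if o == n:
                    out.append(("same", o))
                else:
                    out.append(("changed", o[0], diff_children(o[1], n[1])))
        else:
            out.extend(("deleted", o) for o in old[i1:i2])
            out.extend(("inserted", n) for n in new[j1:j2])
    return out


old_doc = [("Para", [("Hi", []), ("there", [])]), ("Para", [("Bye", [])])]
new_doc = [("Para", [("Hi", []), ("you", [])]), ("Para", [("Bye", [])])]
tree_diff = diff_children(old_doc, new_doc)
```

Because alignment looks only at labels, a paragraph inserted at the start of a run of paragraphs would shift every later pairing; a real tool would need a similarity measure on the subtrees as well.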

@allefeld
Contributor

@jgm, is there documentation for the AST structure, i.e. what kind of nodes can be encountered? I couldn't find any.

@jgm
Owner

jgm commented Oct 10, 2015

https://hackage.haskell.org/package/pandoc-types-1.12.4.7/docs/Text-Pandoc-Definition.html


@technocrat

@murfit you can see what AST pandoc will make of your source by using the -t native flag:

$ pandoc -f latex -t native yourdoc.tex > yourdoc.ast

@allefeld
Contributor

@jgm, thanks, but that doesn't really explain the meaning and usage of these structure elements. For instance, can I always assume that the first element of the primary list in the JSON form is of type 'unMeta'? And why doesn't it appear at the beginning of the native form? What's the meaning of the extra parameters of a node of type 'Header'? etc.

@technocrat, yes, I've started to look at both the native and the JSON form, but figuring out things from there is reverse engineering, and I can never be sure that any code that I produce will work on arbitrary Markdown documents. And metadata doesn't appear to be even included in the native form, which means I can never reconstruct the full document from that.

@technocrat

@murfit I'm struggling with the same problem; it's all on hackage, but
my Haskell reading skills aren't there quite yet, which is why I'm going
through the exercise of picking out pieces of the AST and writing
filters for them. Right now I'm stuck on RawBlock, but I'm patient.


@jgm
Owner

jgm commented Oct 10, 2015

@murfit, to get a feel for the AST, I'd recommend using
pandoc -t native to convert some Markdown samples.

The JSON is automatically converted from the native Haskell
structure; it's best to get familiar with the latter, as it
would probably be easiest to write a pandoc-diff tool
in Haskell and have it operate directly on the AST, rather
than going through the JSON representation.


@jgm
Owner

jgm commented Oct 10, 2015

+++ murfit [Oct 10 15 13:24 ]:

@jgm, thanks, but that doesn't really explain the meaning and usage
of these structure elements. For instance, can I always assume that the
first element of the primary list in the JSON form is of type 'unMeta'?
And why doesn't it appear at the beginning of the native form?

When you do pandoc -t native -s, you'll get the metadata
(the complete Pandoc structure). Without -s you'll just
get a list of blocks.
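
For anyone reading the JSON form (pandoc -t json) instead, here is a small Python sketch that separates metadata from blocks. It assumes two layouts: the two-element list with an 'unMeta' object first, as discussed above, and the object with "meta" and "blocks" keys that, as far as I know, later pandoc releases emit. The function name is made up for illustration.

```python
import json


def split_document(ast):
    """Return (meta, blocks) for either JSON layout of a pandoc document."""
    if isinstance(ast, dict):
        # Assumed newer layout: an object with "meta" and "blocks" keys.
        return ast["meta"], ast["blocks"]
    # Older layout: a two-element list, [{"unMeta": {...}}, [blocks]].
    meta, blocks = ast
    return meta["unMeta"], blocks


old_style = json.loads(
    '[{"unMeta": {}}, [{"t": "Para", "c": [{"t": "Str", "c": "Hi!"}]}]]')
new_style = json.loads(
    '{"pandoc-api-version": [1, 17], "meta": {}, '
    '"blocks": [{"t": "Para", "c": [{"t": "Str", "c": "Hi!"}]}]}')

meta, blocks = split_document(old_style)
```

Unlike the native output, the JSON output always carries the metadata, so a diff tool working on it can reconstruct the full document.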

@allefeld
Contributor

@jgm, I understand that Haskell would be the optimal choice, but I don't want to learn a new language just for a single project. I've seen that there is Python support for filters, but for a diff one would need to operate on two files at once. Is there a good Python way to access JSON-formatted AST, preferably with a simple API to traverse the tree, get the JSON representation of subtrees, etc.? (I'm not a Python expert either, but I know the basics and would probably be able to manage.)

An alternative would be to operate on some kind of normalized Markdown, e.g. a form where each sentence or partial sentence is on a single line. The diff then wouldn't go down to the word level, but it would be easier to implement (only two levels: paragraphs and sentences), and the result should still be useful.

@jgm
Owner

jgm commented Oct 11, 2015

There is my jgm/pandocfilters library, which is designed mainly
for writing filters. The walk function it provides is for
convenient tree walking. The source for toJSONFilter
gives a simple example of its use.
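
To give a feel for what tree walking over the JSON AST involves, here is a standard-library-only sketch. It is a stand-in written for illustration, not the pandocfilters API: iter_nodes and plain_text are hypothetical names, and the node shapes are the {"t": ..., "c": ...} objects of the JSON form.

```python
def iter_nodes(node):
    """Yield every {"t": ...} element in a JSON AST fragment, depth first."""
    if isinstance(node, dict):
        if "t" in node:
            yield node
        for value in node.values():
            yield from iter_nodes(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_nodes(item)


def plain_text(node):
    """Concatenate the Str contents of a fragment, turning Space into ' '."""
    parts = []
    for element in iter_nodes(node):
        if element["t"] == "Str":
            parts.append(element["c"])
        elif element["t"] in ("Space", "SoftBreak"):
            parts.append(" ")
    return "".join(parts)


blocks = [
    {"t": "Para", "c": [{"t": "Str", "c": "Hi"}, {"t": "Space"},
                        {"t": "Str", "c": "there"}]},
    {"t": "BulletList", "c": [[{"t": "Plain", "c": [{"t": "Str", "c": "World"}]}]]},
]
node_types = [n["t"] for n in iter_nodes(blocks)]
```

Since this is plain Python over parsed JSON, nothing stops a script from loading two documents and walking both, which a single-input filter cannot do.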

However, I think for this project it would end up being far
easier to use Haskell, even if you currently know python
better.

Although Haskell has lots of complexities, this sort of
project would use only fairly basic features. The hardest
part would be algorithmic, figuring out how to do a diff
on a Pandoc structure, given a way to do a diff on arbitrary
lists (which the Diff library provides).


@jgm
Owner

jgm commented Oct 11, 2015

I just found this Haskell library, which might go almost all
the way:
https://hackage.haskell.org/package/gdiff-1.1/docs/Data-Generic-Diff.html
It does use some advanced features. I'll look into what
would be required.


@jgm
Owner

jgm commented Oct 11, 2015

@murfit Actually your idea of using pandoc (+ maybe some other processing) to produce a canonical document in, say, Markdown, and comparing these makes quite a bit of sense. You could even use a word-level diff algorithm and skip the step of putting each sentence on a line.

@jgm
Owner

jgm commented Oct 11, 2015

I got good results just now comparing two versions of the pandoc README using dwdiff. You can specify the string you want to use as start and end markers for deleted and inserted text. So, for example, you could use <ins> and <del> tags if you were targeting HTML.
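
The same kind of word-level diff with configurable markers can be sketched in Python with difflib. This is not dwdiff's algorithm, only an illustration of the approach; word_diff and its parameters are hypothetical names, and whitespace is normalised by splitting on it.

```python
from difflib import SequenceMatcher


def word_diff(old, new, del_marks=("<del>", "</del>"), ins_marks=("<ins>", "</ins>")):
    """Word-level diff of two strings with configurable start/end markers."""
    a, b = old.split(), new.split()
    out = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i2 > i1:  # words only in the old text
            out.append(del_marks[0] + " ".join(a[i1:i2]) + del_marks[1])
        if j2 > j1:  # words only in the new text
            out.append(ins_marks[0] + " ".join(b[j1:j2]) + ins_marks[1])
    return " ".join(out)


html_style = word_diff("the quick brown fox", "the slow brown fox")
# → "the <del>quick</del> <ins>slow</ins> brown fox"
bracket_style = word_diff("the quick brown fox", "the slow brown fox",
                          del_marks=("[-", "-]"), ins_marks=("{+", "+}"))
# → "the [-quick-] {+slow+} brown fox"
```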

@allefeld
Contributor

@jgm, a word diff gives good results only if the changes are to single words or small phrases within a paragraph. Problems occur when changes cross structural borders. Here are a few examples that I found when diffing two revisions of a paper I'm currently working on, using the default change indicators of dwdiff ([- -], {+ +}); just imagine them replaced by e.g. <del> and <ins> elements.

A change at the end of a footnote which is at the end of a paragraph, and after that a heading is inserted:

^[Footnote text with [-change.]-] {+change; see below.]

## Another Section+}

A change at the end of embedded LaTeX math and in the immediately following text:

$a_0 = [-50\,\%$). If-] {+50\,\%$.+}

A change at the end of a reference:

[-@RefA].-] {+@RefA; @RefB].+}

Problems occur also when changes are localized within elements that don't support change markup:

$[-a-] {+b+}$

Another drawback of a pure word-based diff is that it will ignore changes in whitespace, which might be relevant e.g. when one paragraph is being split into two, or two are joined.

And a whole other world of trouble is implied by more complex structures. Imagine a list item is being split into two list items: word diff will mark "-" as an extra word.

I have come up with a strategy to deal with at least most of these problems:

1) Do a diff on the paragraph (block) level. Since the diff algorithm won't be able to detect that a paragraph has been changed, a change will always show up as the whole old paragraph being deleted and the whole new one being inserted.
2) Use a string distance to find candidate pairs of deleted and added paragraphs that are actually the same paragraph being changed.
3) For each changed paragraph, do a word-based diff within it. For that, however, the diff algorithm has to be taught to treat embedded math, references, footnotes... as "words", i.e. as units that can only be changed as a whole.
4) Some changed blocks can't be treated like this, e.g. metadata blocks, tables, maybe even lists, so it is better to always show them as deleted+added as a whole. (But metadata won't allow even that.)
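
A rough Python sketch of steps 1 and 2, assuming paragraphs are already available as plain strings. The greedy pairing, the word-overlap ratio used as the "string distance", and the 0.5 threshold are illustrative choices, and all names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Word-overlap ratio in [0, 1]; stands in for a string distance.
    return SequenceMatcher(a=a.split(), b=b.split(), autojunk=False).ratio()


def diff_paragraphs(old, new, threshold=0.5):
    """Block-level diff that re-pairs deleted/added paragraphs that look alike."""
    changes = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            changes.extend(("same", p, p) for p in old[i1:i2])
            continue
        deleted, added = old[i1:i2], new[j1:j2]
        pairs, used = {}, set()  # index in added -> index in deleted
        for j, new_par in enumerate(added):
            scored = [(similarity(old_par, new_par), i)
                      for i, old_par in enumerate(deleted) if i not in used]
            if scored:
                score, i = max(scored)
                if score >= threshold:
                    pairs[j] = i
                    used.add(i)
        changes.extend(("deleted", p, None)
                       for i, p in enumerate(deleted) if i not in used)
        for j, new_par in enumerate(added):
            if j in pairs:
                changes.append(("changed", deleted[pairs[j]], new_par))
            else:
                changes.append(("inserted", None, new_par))
    return changes


old_pars = ["Intro text here.",
            "The method uses a simple word diff.",
            "Old conclusion."]
new_pars = ["Intro text here.",
            "The method uses a robust word diff.",
            "Completely different ending paragraph now."]
result = diff_paragraphs(old_pars, new_pars)
```

Each "changed" pair would then go through the word-based diff of step 3, while "deleted" and "inserted" paragraphs are marked as a whole.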

It would be even better to do this in a recursive way, e.g. compare list blocks by treating them as sequences of list items that have to be matched the same way as paragraphs are on the top level. But I think I'll be glad if I manage to implement 1-4.

Do you have tips on which rules to follow for the "treat some elements as words" part of step 3?
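
One possible rule set, given only as an illustration in Python: a tokenizer that keeps inline footnotes, inline math and bracketed citations together as single tokens before the word diff sees them. The patterns are assumptions and deliberately simple; they do not handle nested brackets, display math, or stray dollar signs.

```python
import re

TOKEN = re.compile(r"""
    \^\[[^\]]*\]          # inline footnote, e.g. ^[some note]
  | \$[^$]+\$             # inline math, e.g. $a_0 = 50$
  | \[[^\]]*@[^\]]*\]     # bracketed citation, e.g. [@RefA; @RefB]
  | [^\s$\[\]^]+          # ordinary run of word characters and punctuation
  | \S                    # any leftover single symbol
""", re.VERBOSE)


def tokenize(text):
    """Split text into diff units, keeping math, citations and footnotes whole."""
    return TOKEN.findall(text)


tokens = tokenize("If $a_0 = 50$ holds [@RefA; @RefB], see^[a short note].")
```

With tokens like these as the sequence elements, a change inside a formula replaces the whole formula instead of leaving change markers inside it.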

I'm currently prototyping in a language that shall remain unnamed (because it's embarrassing ;), operating on the level of reformatted Markdown source, calling diff and wdiff as external helpers. This might eventually lead to an implementation in bash & friends. If I encounter problems that can't be solved on this level, I may get back to the idea of doing this on the AST representation. How long do you think it would take me to learn the basics of Haskell necessary for this project?

@jgm
Owner

jgm commented Oct 12, 2015 via email

@technocrat

Or, it could be done whole hog with cmp if it's necessary to pick up white space:

$ echo 'foo bar' > f1
$ echo 'foobar' > f2
$ cmp -bl f1 f2

4  40      142 b
5 142 b    141 a
6 141 a    162 r
7 162 r     12 ^J
cmp: EOF on f2

but that is going to pick up absolutely everything, which may result in more noise than signal.

@allefeld
Contributor

I implemented a version of a Markdown diff in Matlab; with a few changes, it should also run under Octave. The diff is text-based and therefore limited; it produces usable output for my specific use case, but is far from being general (which would need the AST representation). It is not ready for publication, but I can provide the code to any individual who is interested in trying it out. Moreover, I believe the basic logic is sound and can form the basis for a more general approach.

@allefeld
Contributor

@jgm, I'm willing to give a reimplementation in Haskell a shot, and I read some introduction to it. Can you give a little starter's guide? Let's say I have converted a document into native form, which as far as I understand is Haskell code. How do I read it into ghci, and which libraries do I have to install / load to work with it?

@jgm
Owner

jgm commented Oct 19, 2015

Have you gotten this far?

% cabal update && cabal install pandoc Diff
% ghci
GHCi, version 7.10.1: http://www.haskell.org/ghc/  :? for help
Prelude> :m + Text.Pandoc
Prelude Text.Pandoc> readMarkdown def "Hi!\n\n* World"
Right (Pandoc (Meta {unMeta = fromList []}) [Para [Str "Hi!"],BulletList [[Plain [Str "World"]]]])
Prelude Text.Pandoc> let Right doc = readMarkdown def "Hi!\n\n* World"
Prelude Text.Pandoc> doc
Pandoc (Meta {unMeta = fromList []}) [Para [Str "Hi!"],BulletList [[Plain [Str "World"]]]]
Prelude Text.Pandoc> :m + Data.Algorithm.Diff
Prelude Text.Pandoc Data.Algorithm.Diff> :browse
data Diff a = First a | Second a | Both a a
getDiff :: Eq t => [t] -> [t] -> [Diff t]
getDiffBy :: (t -> t -> Bool) -> [t] -> [t] -> [Diff t]
getGroupedDiff :: Eq t => [t] -> [t] -> [Diff [t]]
getGroupedDiffBy :: (t -> t -> Bool) -> [t] -> [t] -> [Diff [t]]
Prelude Text.Pandoc Data.Algorithm.Diff> getDiff [1,2,3,5,6] [1,3,6]
[Both 1 1,First 2,Both 3 3,First 5,Both 6 6]
Prelude Text.Pandoc Data.Algorithm.Diff> getDiff [Str "Hi", Space, Str "there"] [Str "Hi", Str "there"]
[Both (Str "Hi") (Str "Hi"),First Space,Both (Str "there") (Str "there")]

@allefeld
Contributor

@jgm, thanks, that's a good start. However, I quickly got stuck somewhere else. I managed to read a markdown file using System.IO.readFile, and now I'd like to pass its string contents to Text.Pandoc.readMarkdown, but the former gives me an IO String while the latter wants a String. Googling about it I found some explanations that converting from an IO String to a String would violate functional purity, and I have some faint idea what that might be about, but the fact remains that I can't even figure out how to parse a markdown text file.

I'm beginning to feel that implementing this in Haskell amounts to more than just putting together some snippets gathered from other people's code, and that I'd need to get into this whole Haskell programming philosophy thing. Which I'm sure is great, but seriously, it's not what I'm interested in right now. So I'd like to ask you again, do you think it makes sense for me to pursue this? Maybe dealing with the JSON representation in some plain old imperative language is the lesser evil here...

@technocrat

I'm learning Haskell the hard way, too, by working on parsing markdown.
But it's much easier if you take advantage of the toJSONFilter function
(from Text.Pandoc.JSON), which hands you the document as an AST so you
can walk through the structure of a doc making changes as you go. Not
saying it's easy (still trying to suss out how to get from a Div to a
RawBlock to its Str).

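#!/usr/bin/env runghc
-- Pandoc JSON filter: reads the JSON AST on stdin, writes the modified AST to stdout.
import Text.Pandoc.JSON
import Data.Char (toTitle)

main :: IO ()
main = toJSONFilter capitalizeStrings

-- Title-case every character of each Str; leave all other inlines untouched.
-- Assumes Str wraps a String, as in the pandoc-types releases of the time.
capitalizeStrings :: Inline -> Inline
capitalizeStrings (Str s) = Str (map toTitle s)
capitalizeStrings x = x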


@allefeld
Contributor

@technocrat, thanks, but in my understanding a filter can only operate on one file. For a diff I need two input files.

@jgm
Owner

jgm commented Oct 21, 2015

I can't answer the question whether it's worth your time to
get into Haskell. But here's how you can do what you were
trying to do:

<$> will combine an f a with an a -> b to yield an f b, for any functor f. IO is a functor, as well as a monad. So, since readFile "my.md" is IO String and readMarkdown def is String -> Either PandocError Pandoc, readMarkdown def <$> readFile "my.md" is IO (Either PandocError Pandoc).

Note: you've still got a value in the IO monad: once you're in, you can't get out. But <$> allows you to apply a function that takes a plain String as argument across the monad boundary, on an IO String.


@jgm
Owner

jgm commented Nov 20, 2015

Closing this. It's a good idea for a separate project but doesn't belong on this tracker.

@davidar
Contributor

davidar commented May 1, 2018

Actually your idea of using pandoc (+ maybe some other processing) to produce a canonical document in, say, Markdown, and comparing these makes quite a bit of sense. You could even use a word-level diff algorithm and skip the step of putting each sentence on a line.

I've put together a script that does this (using HTML as the canonical intermediate format), if anyone is still interested in this issue:

https://github.com/davidar/pandiff

@technocrat

technocrat commented May 1, 2018 via email

@jgm
Owner

jgm commented May 1, 2018 via email
