Creating a diff between two manuscript versions #54

Open
dhimmel opened this issue Aug 14, 2017 · 17 comments

@dhimmel
Member

dhimmel commented Aug 14, 2017

Oftentimes, it's important (and required in scholarly publishing) to show the changes between two versions of a manuscript. It would be ideal if Manubot users could "track changes" between two manuscript versions.

Pandoc doesn't have built-in support for diffs: jgm/pandoc#2374. Other options would be:

  1. Exporting to LaTeX and using latexdiff
  2. Exporting to DOCX and using LibreOffice's Compare Document feature (currently not accessible via the command line)
  3. Exporting to ODT and using oodiff
  4. Diffing manuscript.md as a text file (perhaps using diff, prettydiff, or rich-text-diff)
  5. Using GitHub's rich diff view preview or react-rich-diff
@dhimmel
Member Author

dhimmel commented Sep 28, 2017

For the Project Rephetio manuscript, now published in eLife, I had to create diffs to show changes in response to reviewers. I ended up enabling DOCX export (dhimmel/rephetio-manuscript@b7b8bd3), and then using Microsoft Word to compare the documents. While manual and thus sub-optimal, this worked. We may want to consider setting BUILD_DOCX=true by default, so these past DOCX versions are automatically created.

@agitter
Member

agitter commented Sep 28, 2017

That's good to know you were able to satisfy the journal. Did you not encounter the image embedding problems I did in #40?

I'm okay defaulting to BUILD_DOCX=true if the DOCX versions are not too broken. I think diffing manuscript.md is also appealing in the long term. I haven't tested git diff variants to know how hard it would be to diff the Markdown and color the modified text with a post-diff script.

@dhimmel
Member Author

dhimmel commented Sep 28, 2017

> Did you not encounter the image embedding problems I did in #40?

Well, we used PNG rather than SVG images, so they exported to DOCX fine. But in this case the export failure would have been a feature, since the journal required that images be uploaded separately!

@vsmalladi
Collaborator

Should we resurrect #40 to merge?

@dhimmel
Member Author

dhimmel commented Oct 2, 2017

> Should we resurrect #40 to merge?

@vsmalladi I'm still leaning against any heavyweight SVG export solution as these are things that really make the most sense to fix upstream. We don't want to place ourselves in a position where we have to maintain this heavy machinery.

@vsmalladi
Collaborator

@dhimmel that makes sense.

@rgieseke
Contributor

rgieseke commented Oct 8, 2017

Here is another approach used with a GitHub-based project, the COP21 project:

https://github.com/okfn/cop21

https://github.com/okfn/cop21/blob/gh-pages/scripts/diff.sh
https://github.com/okfn/cop21/blob/gh-pages/scripts/diff2html.py

Example output
http://cop21.okfnlabs.org/diff/4-dec-vs-9-dec/

Still a bit manual for specific document versions but could likely be automated more.

@agitter
Member

agitter commented Oct 8, 2017

Thanks, the output looks great.

@dhimmel
Member Author

dhimmel commented Oct 9, 2017

> Here is another approach used with a GitHub-based project, the COP21 project:

Thanks @rgieseke. To summarize, this method pipes the output of diff --unified=99999 to a Python 2 script to create an HTML view.
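
For reference, here is a minimal Python 3 sketch of the same idea. It is not the COP21 scripts themselves: it swaps the `diff` command for the standard library's `difflib.unified_diff`, sets the context large enough to cover the whole file, and wraps removed and added lines in `<del>` and `<ins>`. The function name `unified_diff_to_html` is made up for illustration.

```python
import difflib
import html


def unified_diff_to_html(old_text: str, new_text: str) -> str:
    """Render a full-context, line-level diff of two texts as simple HTML."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # Context covering the whole file, in the spirit of `diff --unified=99999`.
    context = max(len(old_lines), len(new_lines))
    diff = list(
        difflib.unified_diff(old_lines, new_lines, n=context, lineterm="")
    )
    if not diff:
        # No differences: treat every line as unchanged context.
        body = [" " + line for line in new_lines]
    else:
        # Drop the two file header lines, then any "@@ ... @@" hunk headers.
        body = [line for line in diff[2:] if not line.startswith("@@")]

    rows = []
    for line in body:
        marker, text = line[:1], html.escape(line[1:])
        if marker == "+":
            rows.append(f"<ins>{text}</ins>")
        elif marker == "-":
            rows.append(f"<del>{text}</del>")
        else:
            rows.append(text)
    return "<pre>\n" + "\n".join(rows) + "\n</pre>"
```

Because the whole document is kept as context, the output reads as the full manuscript with changed lines marked, rather than as isolated hunks.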

@dhimmel
Member Author

dhimmel commented Oct 9, 2017

Draftable

I came across the Draftable webapp to create diffs for PDF and DOCX files. Their example showed that it worked well for diffing two arXiv PDFs. They have an API and a Python package for using the API. The free tier of the API is limited to 200 requests per month. API calls return a URL for viewing the diff.

We could potentially use this tool for creating diffs. The URLs could even be embedded into the CI logs, so you could see the changes a PR would create to the PDF output. Obviously, the whole registration / API key / quota / third-party dependency thing kind of sucks.

There may be an open source PDF diff solution that works as well, such as https://vslavik.github.io/diff-pdf/. We could even create a probot to comment on GitHub PRs with the PDF diff uploaded as an attachment.

@slochower
Collaborator

I wrote a little notebook that will highlight the differences between two manuscript versions in the HTML and PDF. It is not pretty, but in my limited testing, it seems to do an okay job and I personally like it better than using the external tools listed above. The notebook is here, with the limitations listed at the bottom.

For example, I compared manuscript versions b8eeea542ce238bbcaf2023add2aecb86ef726bd and 5bb8dd518c1f744bbb679d76456d285058bf6b8f of meta-review.

Here is the PDF as of b8eeea542ce238bbcaf2023add2aecb86ef726bd:
[screenshot, 2018-08-04 at 10:06:37 am]

Here is the PDF as of 5bb8dd518c1f744bbb679d76456d285058bf6b8f:
[screenshot, 2018-08-04 at 10:08:25 am]

And here is manuscript_diff.pdf:
[screenshot, 2018-08-04 at 10:09:11 am]

Which should match git diff:
[screenshot, 2018-08-04 at 10:10:06 am]

@dhimmel
Member Author

dhimmel commented Aug 7, 2018

@slochower nice approach.

I agree that using HTML tags to color portions of the text in the source markdown document may be the right solution. I don't think it's inelegant to put HTML in the markdown (we already do that for manuscripts in places).

However, as you note, tables and figures and some other more complex constructs might be problematic. Also I find the whole line highlighting problematic. It would be much better to get behavior along the lines of git diff --color-words.

I think your approach of using HTML to demarcate markdown source based on git diff output is a promising direction. Were we to refine it a bit more, I think it could be appropriate for Manubot.
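
One way the word-level behavior could be approximated without parsing git output is sketched below. This is an illustration under assumptions, not Manubot code: it tokenizes on whitespace and uses `difflib.SequenceMatcher`, so the result will not match `git diff --color-words` exactly, and `mark_word_changes` is a hypothetical name.

```python
import difflib
import re


def mark_word_changes(old: str, new: str) -> str:
    """Return `new` with word-level changes wrapped in <del>/<ins> tags."""
    # Keep whitespace runs as their own tokens so spacing is preserved.
    old_tokens = re.findall(r"\s+|\S+", old)
    new_tokens = re.findall(r"\s+|\S+", new)
    matcher = difflib.SequenceMatcher(
        a=old_tokens, b=new_tokens, autojunk=False
    )
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(new_tokens[j1:j2]))
            continue
        if tag in ("delete", "replace"):
            out.append("<del>" + "".join(old_tokens[i1:i2]) + "</del>")
        if tag in ("insert", "replace"):
            out.append("<ins>" + "".join(new_tokens[j1:j2]) + "</ins>")
    return "".join(out)
```

Running this per paragraph rather than on the whole file would likely keep changes local, though tables, figures, and other multi-line constructs would still need special handling.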

@slochower
Collaborator

slochower commented Aug 7, 2018

> Also I find the whole line highlighting problematic. It would be much better to get behavior along the lines of git diff --color-words.

I agree. The issue is getting either vanilla diff or git diff to give us what we need: line numbers and (even better) character level changes in a machine parsable format. I went with regular diff for the proof of concept, because I could get it to easily print which lines have changed. Working with git diff required grep'ing through regular expressions @@.*@@ and was more challenging. I suppose I could always use the patch command from diff to print the original and changed lines, then manually find the differences and print both the old and new versions. I can imagine exactly how this would work if someone changes a word in a sentence, but I can also imagine that large changes could get out of control. What do you think?
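
For what it's worth, extracting line numbers from the `@@ ... @@` hunk headers might look roughly like the sketch below. It assumes the standard unified diff header form `@@ -old_start,old_count +new_start,new_count @@`, where an omitted count means 1; the names `HUNK_HEADER` and `parse_hunks` are made up here.

```python
import re
from typing import List, Tuple

# Matches headers such as "@@ -3 +3 @@" or "@@ -10,0 +11,2 @@ section title".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunks(diff_text: str) -> List[Tuple[int, int, int, int]]:
    """Return (old_start, old_count, new_start, new_count) for each hunk."""
    hunks = []
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match is None:
            continue
        old_start, old_count, new_start, new_count = match.groups()
        hunks.append(
            (
                int(old_start),
                int(old_count) if old_count is not None else 1,
                int(new_start),
                int(new_count) if new_count is not None else 1,
            )
        )
    return hunks
```

Feeding it the output of `git diff --unified=0` should make each hunk cover only changed lines, which gives the line numbers but still not the character-level changes.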

It would probably be pretty fragile, but I suppose we could simply parse the ANSI codes that do the coloring in the output of git diff --color-words.
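
A rough sketch of that ANSI-parsing idea follows. It assumes git's default colors (code 31, red, for removed words and code 32, green, for added words) and that any other escape code ends the current span; a user's color configuration could break either assumption, which is part of why this is fragile. `ansi_words_to_html` is a hypothetical name.

```python
import html
import re

# SGR escape sequences such as "\x1b[31m" (red), "\x1b[32m" (green), "\x1b[m" (reset).
ANSI_SGR = re.compile(r"\x1b\[([0-9;]*)m")


def ansi_words_to_html(colored: str) -> str:
    """Convert red/green ANSI spans into <del>/<ins> HTML tags."""
    out = []
    open_tag = None
    pos = 0
    for match in ANSI_SGR.finditer(colored):
        out.append(html.escape(colored[pos:match.start()]))
        pos = match.end()
        if open_tag is not None:
            # Any escape code closes the span that is currently open.
            out.append(f"</{open_tag}>")
            open_tag = None
        code = match.group(1)
        if code == "31":
            open_tag = "del"
        elif code == "32":
            open_tag = "ins"
        if open_tag is not None:
            out.append(f"<{open_tag}>")
    out.append(html.escape(colored[pos:]))
    if open_tag is not None:
        out.append(f"</{open_tag}>")
    return "".join(out)
```

Other codes, such as the bold and cyan that git uses for file and hunk headers, are dropped without adding tags, so the header lines would still need to be filtered out separately.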

@rgieseke
Contributor

> Pandoc doesn't have builtin support for diffs: jgm/pandoc#2374.

I just learned about pandiff, discussed in the thread above, and it seems amazing (Node-based).

https://github.com/davidar/pandiff

@jmonlong

Adding to the thread that Google Docs now has a feature to compare two documents (in Tools -> Compare documents). So we can build the DOCX output for two versions of the manuscript, upload them to Google Drive, convert them to Google Docs and use this feature.

Just another option like the LibreOffice compare documents. Still manual but some people might prefer Google Docs. The end result is a bit different too so maybe worth trying out if LibreOffice doesn't work properly.

In our experience, going through Google Docs helped with the tables. It was even worth uploading the "diff" DOCX produced by LibreOffice, just to get the tables right. (Maybe it has to do with my version of LibreOffice on Ubuntu.)

Also, Google Docs doesn't seem to be able to print/export the track-changes in PDF except when printing from Chrome.

@castedo

castedo commented Apr 27, 2022

To add to the record, here is a project doing diffs for JATS XML:
https://github.com/milos-cuculovic/jats-diff
The focus of the project seems to be more on the backend algorithm rather than any particular UI or presentation.
