Problem statement

I need a tool to compare two PDF files to catch regressions in mgp2pdf.

  • I know the PDFs have the same number of pages (well, they're supposed to; if they don't, the tool needs to report that as a difference).
  • I care about the visual layout, not just textual contents. Perceptual diff: good, piping pdftotext to GNU diff: bad.
  • I want automation: the tool should be able to compare two sets of PDFs and tell me if they're identical, and if not, produce a report with the differences between each pair of files.
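
For the "two sets of PDFs" part, a minimal sketch of how the pairing could work, assuming the two sets live in two directories and matching files share a name. The function name `pair_pdfs` and the directory layout are my assumptions for illustration, not part of any existing tool.

```python
from pathlib import Path


def pair_pdfs(dir_a, dir_b):
    """Pair up PDFs from two directories by file name.

    Returns (pairs, only_in_a, only_in_b). Files present on one side
    only are returned so they can be reported as differences too.
    """
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    names_a = {p.name for p in dir_a.glob("*.pdf")}
    names_b = {p.name for p in dir_b.glob("*.pdf")}
    # Sort so that the report order is stable between runs.
    pairs = [(dir_a / name, dir_b / name) for name in sorted(names_a & names_b)]
    return pairs, sorted(names_a - names_b), sorted(names_b - names_a)
```

Each pair would then be handed to whatever does the per-file comparison; anything in the last two lists counts as a difference by itself.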

My options are:

  • Find an existing tool.
  • Write a new tool.

I actually once wrote such a tool for an internal project. I can extract and open-source it. I just need a name. This means I have to survey existing tools to avoid name clashes. And hey, maybe one of the existing tools will turn out to do what I want and I can save effort!

Survey of existing tools

Sources:

compare.py in this very repository:

  • Can compare an .mgp with a .pdf, not just two PDFs.
  • Renders the pages to images, makes them translucent, overlays them for manual visual comparison.
  • Can be used interactively.
  • Can produce a "report" (a set of PNG files with translucent original pages overlaid on top of each other) in non-interactive mode.
  • Not automated: doesn't tell you whether the two presentations are identical.
  • Written in Python, uses Pillow and/or Pygame.
  • Relies on external tools: mgp, pdftoppm, ImageMagick.

compare-reportgen-output from that internal project I mentioned:

  • Can compare two sets of PDFs.
  • Renders the pages to images, compares them pixel-by-pixel.
  • Automated.
  • Produces a report with differing pages shown next to each other and a third page with the differences highlighted.
  • Some work needs to be done to make it generic.
  • Written in Python.
  • Relies on external tools: ImageMagick.
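
A rough sketch of what an automated pixel-by-pixel comparison boils down to. This is not the code from that internal project: I'm assuming each page has already been rendered to a flat `bytes` object of 8-bit grayscale pixels (the rendering step is left out), and `compare_documents` is a made-up name.

```python
def compare_documents(pages_a, pages_b):
    """Compare two documents given as lists of rendered pages.

    Each page is a bytes object of grayscale pixels. Returns a list of
    human-readable differences; an empty list means "identical".
    """
    problems = []
    if len(pages_a) != len(pages_b):
        # A page count mismatch is itself a difference worth reporting.
        problems.append(f"page count differs: {len(pages_a)} vs {len(pages_b)}")
    # zip() stops at the shorter document, so only common pages are compared.
    for number, (a, b) in enumerate(zip(pages_a, pages_b), start=1):
        if len(a) != len(b):
            problems.append(f"page {number}: size differs")
        elif a != b:
            changed = sum(x != y for x, y in zip(a, b))
            problems.append(f"page {number}: {changed} pixels differ")
    return problems
```

An exact comparison like this is strict: a single antialiasing change flags the page. Whether that's desirable depends on how deterministic the renderer is.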

DiffPDF

  • Packaged for Ubuntu (apt-get install diffpdf).
  • Upstream no longer open source.
  • Interactive.

ComparePDF

  • Packaged for Ubuntu (apt-get install comparepdf).
  • Automated.
  • Reports "yes" or "no", doesn't show differences, doesn't produce reports.

vslavik/diff-pdf

  • Website: https://vslavik.github.io/diff-pdf/
  • Uses overlaid red/green channels to compose an image from two sources.
  • Automated.
  • Can produce a report as PDF.
  • Has an interactive mode.
  • Written in C++.
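
To illustrate the red/green idea, here is one possible way to compose such an image. It is a sketch of the general technique, not necessarily what diff-pdf does internally: I'm assuming grayscale input pages of equal size (255 = white paper, 0 = black ink), and the choice of blue channel is mine.

```python
def compose_red_green(page_a, page_b):
    """Merge two grayscale pages into a list of (r, g, b) pixels.

    Red comes from page A and green from page B, so identical pixels
    stay gray while differing pixels pick up a colour tint.
    """
    if len(page_a) != len(page_b):
        raise ValueError("pages must have the same dimensions")
    # Blue takes the darker of the two values (an arbitrary choice) so
    # that identical pixels come out as neutral gray.
    return [(a, b, min(a, b)) for a, b in zip(page_a, page_b)]
```

With this particular mapping, ink that appears only on page A shows up green and ink only on page B shows up red, because the page that still has white paper at that spot contributes its full channel.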

JoshData/pdf-diff

  • Compares document text rather than visual layout.
  • Produces a nice PNG report.
  • Written in Python.
  • Relies on external tools: pdftotext.

jnweiger/pdfcompare

  • Can compare document text, annotate the PDF with highlighted changes.
  • Doesn't compare images.
  • Written in Python.
  • Relies on pdftohtml.

magmax/pdfcomparator

  • https://pypi.python.org/pypi/pdfcomparator
  • Compares rendered images.
  • Automated: can report yes/no, can report similarity percentage (using difflib on extracted text).
  • Doesn't produce a report with the differences.
  • Written in Python.
  • Relies on Poppler and Cairo Python bindings to render the pages.
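
For reference, a similarity percentage from difflib can be computed along these lines. This is a sketch using the standard library's `difflib.SequenceMatcher`, not code taken from pdfcomparator; the helper name `text_similarity` is hypothetical, and the text is assumed to be already extracted from the PDFs.

```python
import difflib


def text_similarity(text_a, text_b):
    """Return the similarity of two strings as a percentage (0 to 100)."""
    # ratio() is 2*M/T: M matching characters out of T characters in total.
    matcher = difflib.SequenceMatcher(None, text_a, text_b)
    return 100.0 * matcher.ratio()
```

Note that a number like this says nothing about layout: two PDFs with the same text in different positions would still score 100.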

kspeeckaert/pyPdfCompare