Metadata is silently deleted? #327

Closed
gwern opened this issue Dec 17, 2018 · 6 comments

@gwern
gwern commented Dec 17, 2018

I was experimenting with whether --skip-text --optimize 3 --jbig2-lossy might be a good idea for re-processing my PDFs to save what looks like a ton of space, but I noticed that when a PDF is processed, ocrmypdf drops metadata fields without any warning or instruction?

The documentation makes no mention of metadata being deleted that I can find and the man page implies all fields will simply be copied over (because why would anything else be done):

Set output PDF/A metadata (default: copy input document's metadata)

Sample PDF: https://www.gwern.net/docs/aspirin/2011-rothwell.pdf

$ ocrmypdf --version
7.3.1
$ ocrmypdf --skip-text 2011-rothwell.pdf 2011-rothwell-bigger.pdf ; exiftool -All 2011-rothwell.pdf 2011-rothwell-bigger.pdf
INFO -    1: page already has text! – skipping all processing on this page
INFO -    2: page already has text! – skipping all processing on this page
INFO -    3: page already has text! – skipping all processing on this page
INFO -    4: page already has text! – skipping all processing on this page
INFO -    5: page already has text! – skipping all processing on this page
INFO -    6: page already has text! – skipping all processing on this page
INFO -    7: page already has text! – skipping all processing on this page
INFO -    8: page already has text! – skipping all processing on this page
INFO -    9: page already has text! – skipping all processing on this page
INFO -   10: page already has text! – skipping all processing on this page
INFO -   11: page already has text! – skipping all processing on this page
INFO - Optimize ratio: 1.01 savings: 1.1%
INFO - Output file is a PDF/A-2B (as expected)
======== 2011-rothwell-bigger.pdf
ExifTool Version Number         : 10.80
File Name                       : 2011-rothwell-bigger.pdf
Directory                       : .
File Size                       : 167 kB
File Modification Date/Time     : 2018:12:17 13:45:58-05:00
File Access Date/Time           : 2018:12:17 13:45:59-05:00
File Inode Change Date/Time     : 2018:12:17 13:45:58-05:00
File Permissions                : rw-------
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.6
Linearized                      : No
Author                          : Prof Peter M Rothwell FMedSci
Create Date                     : 2010:12:22 20:04:03+05:30
Modify Date                     : 2018:12:17 18:45:57+00:00
Subject                         : The Lancet, 377 2011 31-41. doi:10.1016/S0140-67361062110-1
XMP Toolkit                     : XMP toolkit 2.9.1-13, framework 1.6
Producer                        : GPL Ghostscript 9.26
Keywords                        : 
Creator Tool                    : ocrmypdf 7.3.1 / Tesseract OCR-PDF 4.0.0-beta.1
Document ID                     : uuid:8ac37963-3a48-11f4-0000-6e22f5d1dcf2
Format                          : application/pdf
Title                           : Effect of daily aspirin on long-term risk of death due to cancer: analysis of individual patient data from randomised trials
Creator                         : Prof Peter M Rothwell FMedSci
Description                     : The Lancet, 377 2011 31-41. doi:10.1016/S0140-67361062110-1
Part                            : 2
Conformance                     : B
Page Count                      : 11
======== 2011-rothwell.pdf
ExifTool Version Number         : 10.80
File Name                       : 2011-rothwell.pdf
Directory                       : .
File Size                       : 502 kB
File Modification Date/Time     : 2016:01:08 13:14:16-05:00
File Access Date/Time           : 2018:12:17 00:09:31-05:00
File Inode Change Date/Time     : 2018:06:26 12:07:31-04:00
File Permissions                : rw-------
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.7
Linearized                      : Yes
Page Mode                       : UseOutlines
XMP Toolkit                     : Adobe XMP Core 4.0-c316 44.253921, Sun Oct 01 2006 17:14:39
Format                          : application/pdf
Identifier                      : doi:10.1016/S0140-6736(10)62110-1
Title                           : Effect of daily aspirin on long-term risk of death due to cancer: analysis of individual patient data from randomised trials
Creator                         : Prof Peter M Rothwell FMedSci, Prof F Gerald R Fowkes FRCPE, Prof Jill FF Belch FRCP, Hisao Ogawa MD, Prof Charles P Warlow FMedSci, Prof Tom W Meade FRS
Subject                         : 
Description                     : The Lancet, 377 (2011) 31-41. doi:10.1016/S0140-6736(10)62110-1
Publisher                       : Elsevier Ltd
Aggregation Type                : journal
Publication Name                : The Lancet
Copyright                       : Copyright © 2011 Elsevier Ltd. All rights reserved.
ISSN                            : 0140-6736
Volume                          : 377
Number                          : 9759
Cover Display Date              : 1 January 2011
Cover Date                      : 2011:01:01
Page Range                      : 31-41
Starting Page                   : 31
Ending Page                     : 41
Digital Object Identifier       : 10.1016/S0140-6736(10)62110-1
Robots                          : noindex
URL                             : http://dx.doi.org/10.1016/S0140-6736(10)62110-1
Elsevier Web PDF Specifications : 6.1
Authoritative Domain            : sciencedirect.com, elsevier.com
Creator Tool                    : Elsevier
Create Date                     : 2010:12:22 20:04:03+05:30
Modify Date                     : 2010:12:29 13:36:28+05:30
Metadata Date                   : 2010:12:29 13:36:28+05:30
Marked                          : True
Keywords                        : 
Producer                        : Acrobat Distiller 6.0 for Windows
Document ID                     : uuid:0ce959b1-bd55-4587-bd60-08c6c6ef001d
Instance ID                     : uuid:e8bc37a9-a759-44c1-b1c5-3adf7b1e8738
Page Count                      : 11
Page Layout                     : SinglePage
Author                          : Prof Peter M Rothwell FMedSci
Authoritative Domain 1          : sciencedirect.com
Authoritative Domain 2          : elsevier.com
    2 image files read

29 vs 52 fields, including meaningful ones like the URL, ISSN, Volume, Number, etc.
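A comparison like the one above can be automated. The following is a minimal sketch that parses exiftool's default `Tag Name : value` listing and reports which tag names disappeared; the function names `parse_exiftool` and `dropped_fields` are made up for this illustration and are not part of exiftool or ocrmypdf.

```python
def parse_exiftool(text: str) -> dict[str, str]:
    """Parse one file's worth of exiftool's default text listing.

    Assumes each metadata line looks like 'Tag Name   : value'.
    Separator lines ('======== file') and summary lines without a
    colon ('2 image files read') are ignored.
    """
    fields: dict[str, str] = {}
    for line in text.splitlines():
        if line.startswith("="):
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue
        fields[name.strip()] = value.strip()
    return fields


def dropped_fields(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return tag names present in `before` but absent from `after`."""
    return sorted(before.keys() - after.keys())
```

Note that exiftool's listing also contains file-system entries such as `File Size`, so a raw field count overstates the amount of embedded metadata in both files.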

The regular output doesn't mention anything about metadata, and the debug --verbose 4 output mentions metadata only in passing with nothing about deletion or not preserving metadata:

Task enters queue = 'ocrmypdf._pipeline.metadata_fixup' 
    Job  = [[.../origin.repaired.pdf, .../layers.rendered.pdf, .../pdfa.ps] -> .../metafix.pdf, <LoggingProxy>, <ocrmypdf._jobcontext.JobContext>] ...
        Missing file [.../metafix.pdf] 
  DEBUG - ['gs', '-dQUIET', '-dBATCH', '-dNOPAUSE', '-dCompatibilityLevel=1.6', '-dNumRenderingThreads=32', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-sOutputFile=/tmp/tmp7bso7ctq', '/tmp/com.github.ocrmypdf.fc6h_vdy/layers.rendered.pdf', '/tmp/com.github.ocrmypdf.fc6h_vdy/pdfa.ps']
  DEBUG - Ghostscript had to remove PDF 'overprinting' from the input file to complete PDF/A conversion. 
    Job  = [[.../origin.repaired.pdf, .../layers.rendered.pdf, .../pdfa.ps] -> .../metafix.pdf, <LoggingProxy>, <ocrmypdf._jobcontext.JobContext>] completed
Completed Task = 'ocrmypdf._pipeline.metadata_fixup' 
@jbarlow83
Collaborator

When doing PDF/A conversion, Ghostscript rewrites the metadata (the whole PDF, actually). Starting in v7.4.0 I added new features to manage metadata rather than relying on Ghostscript, although I haven't entirely cut Ghostscript out when PDF/A is involved. It's news to me, but unsurprising, that Ghostscript doesn't properly replicate all of the input XMP metadata in its output. Ghostscript does have a legitimate need to change some of the metadata, however.

If we turn off Ghostscript and PDF/A conversion with ocrmypdf --output-type pdf, at v7.4.0, it preserves metadata - although some fields get modified, and there's some harmless reformatting at the XML level.
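For anyone scripting this, here is a small sketch that builds the command line with PDF/A conversion turned off. Only the `--skip-text` and `--output-type pdf` options come from this thread; the helper name `ocrmypdf_argv` and its `keep_pdfa` parameter are invented for the example.

```python
def ocrmypdf_argv(src: str, dst: str, keep_pdfa: bool = False) -> list[str]:
    """Build an ocrmypdf command line (hypothetical helper).

    With keep_pdfa=False the command requests plain PDF output, which
    skips the Ghostscript PDF/A conversion step discussed above.
    """
    argv = ["ocrmypdf", "--skip-text"]
    if not keep_pdfa:
        argv += ["--output-type", "pdf"]
    argv += [src, dst]
    return argv


# To actually run it (requires ocrmypdf on PATH):
#   import subprocess
#   subprocess.run(ocrmypdf_argv("in.pdf", "out.pdf"), check=True)
```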

I should mention that PDF metadata is really, really messy. There are two components: one in XML (XMP) and an older PDF-specific one that has to be retained for backward compatibility. They have to be synchronized as much as possible, but it's not actually possible to represent all of the same data in both. The XMP format itself is an amalgamation of multiple XML specifications and has multiple ways of representing the same information.
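To illustrate why the two components cannot be kept fully in sync, here is a rough sketch of the usual correspondence between the older document information dictionary and XMP properties. The table reflects the conventional mapping as I understand it, and `split_by_docinfo_support` is a name made up for this example; treat both as assumptions rather than as what ocrmypdf does internally.

```python
# Conventional pairing of document-info keys with XMP properties
# (assumed here for illustration; check the PDF/A and XMP specs).
DOCINFO_TO_XMP = {
    "Title": "dc:title",
    "Author": "dc:creator",
    "Subject": "dc:description",
    "Keywords": "pdf:Keywords",
    "Creator": "xmp:CreatorTool",
    "Producer": "pdf:Producer",
    "CreationDate": "xmp:CreateDate",
    "ModDate": "xmp:ModifyDate",
}


def split_by_docinfo_support(xmp_props: dict[str, str]):
    """Split XMP properties into those with a document-info slot and the rest."""
    xmp_to_docinfo = {xmp: key for key, xmp in DOCINFO_TO_XMP.items()}
    mirrored: dict[str, str] = {}
    xmp_only: dict[str, str] = {}
    for name, value in xmp_props.items():
        if name in xmp_to_docinfo:
            mirrored[xmp_to_docinfo[name]] = value
        else:
            xmp_only[name] = value
    return mirrored, xmp_only
```

Anything that lands in `xmp_only` has no home in the older structure. Even mirrored fields can be lossy: in the sample file, the XMP `Creator` lists six authors while the older `Author` field holds only one.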

I will leave this open as a reminder to fix metadata handling for PDF/A. That will be complicated since it's a 3-way merge in XML of the input file, ocrmypdf's changes to the metadata, and Ghostscript's changes.

@gwern
Author

gwern commented Dec 17, 2018

I'm not surprised it's something like that. As you say, PDF is a horrifying family of formats, backwards-compatibility shims, and hacks. But data loss can't be excused by noting that the right thing is hard - that's the one thing above all which a tool like ocrmypdf must not do silently or by default.

Until the metadata is preserved, can something be done like erroring out if entire fields are dropped or at least dumping a warning? A heuristic might be something like if the field count is different between start and finish, or if the new metadata is X bytes smaller than the old metadata. Semantically irrelevant minor formatting and legitimate fields shouldn't cause entire fields to be deleted or major changes in metadata size. Then the user is notified and can compare before/after to decide if it's a problem and can re-add metadata manually if they really want the new version with the OCR/compression. At the least, a warning in the man page & manual seems merited?
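The heuristic could be sketched roughly as follows, assuming the before and after metadata are available as flat name-to-value dictionaries. The function name `metadata_loss_warnings` and the `max_shrink` threshold are hypothetical, not anything ocrmypdf provides.

```python
def metadata_loss_warnings(
    before: dict[str, str],
    after: dict[str, str],
    max_shrink: float = 0.5,
) -> list[str]:
    """Return human-readable warnings if metadata appears to have been lost.

    Two checks, both deliberately crude:
      1. any field name present before but missing afterwards;
      2. total metadata size falling below `max_shrink` of the original.
    """
    warnings: list[str] = []

    dropped = sorted(set(before) - set(after))
    if dropped:
        warnings.append(
            f"{len(dropped)} metadata field(s) dropped: {', '.join(dropped)}"
        )

    def size(fields: dict[str, str]) -> int:
        # Approximate the stored size by the UTF-8 length of names and values.
        return sum(len(k.encode("utf-8")) + len(v.encode("utf-8"))
                   for k, v in fields.items())

    size_before, size_after = size(before), size(after)
    if size_before and size_after < size_before * max_shrink:
        warnings.append(
            f"metadata shrank from {size_before} to {size_after} bytes"
        )
    return warnings
```

Comparing field names rather than values keeps reformatted dates or re-serialized XML from triggering false alarms, at the cost of missing truncated values unless the size check catches them.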

@jbarlow83
Collaborator

Semantic XML diff is not trivial. Semantic XMP diff is more difficult, because some attributes in XMP are shorthand for certain child tags and there are some semantically equivalent or almost-equivalent constructs.
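As a small illustration of the shorthand problem: in RDF/XML, a simple property can be written either as an attribute on `rdf:Description` or as a child element, and a textual diff sees these as different. The sketch below normalizes only that one case using the standard library; `simple_properties` is a made-up helper, and it ignores arrays, language alternatives, and structured values, which a real comparison would also need to handle.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def simple_properties(xmp_xml: str) -> dict[str, str]:
    """Collect simple XMP properties regardless of attribute/element form.

    Keys are in ElementTree's '{namespace}local' notation.
    """
    root = ET.fromstring(xmp_xml)
    props: dict[str, str] = {}
    for desc in root.iter(f"{{{RDF}}}Description"):
        # Attribute shorthand: pdf:Producer="..." on rdf:Description.
        for name, value in desc.attrib.items():
            if not name.startswith(f"{{{RDF}}}"):  # skip rdf:about etc.
                props[name] = value
        # Element form: <pdf:Producer>...</pdf:Producer> with text only.
        for child in desc:
            if len(child) == 0 and child.text is not None:
                props[child.tag] = child.text.strip()
    return props


ATTR_FORM = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:pdf="http://ns.adobe.com/pdf/1.3/">'
    '<rdf:Description rdf:about="" pdf:Producer="Acrobat Distiller 6.0 for Windows"/>'
    '</rdf:RDF>'
)
ELEM_FORM = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:pdf="http://ns.adobe.com/pdf/1.3/">'
    '<rdf:Description rdf:about="">'
    '<pdf:Producer>Acrobat Distiller 6.0 for Windows</pdf:Producer>'
    '</rdf:Description></rdf:RDF>'
)
```

The two strings differ byte for byte, yet `simple_properties` returns the same dictionary for both.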

I also can't recommend modifying XMP after PDF/A conversion, since most programs (exiftool among them) are incapable of editing PDF/As without breaking conformance. To be clear, Ghostscript transfers most XMP metadata; it just seems to drop a nonstandard add-on you happen to care about.

I added something to the documentation. You're welcome to submit a PR if you want to help with this issue in some way. Otherwise, this is open source software and like everyone else, my time is limited.

I also provide commercial support for OCRmyPDF, if you need this addressed urgently.

@jbarlow83
Collaborator

I looked into this further and found veraPDF reports the following when your input metadata are attached to a PDF/A:

All properties specified in XMP form shall use either the predefined schemas defined in the XMP Specification, ISO 19005-1 or this part of ISO 19005, or any extension schemas that comply with 6.6.2.3.2.

So, Ghostscript had a reason for dropping this metadata: it is not allowed in conformant PDF/A-2b files. The full list of "predefined schemas" veraPDF refers to is reproduced here:
https://www.pdfa.org/wp-content/until2016_uploads/2011/08/tn0003_metadata_in_pdfa-1_2008-03-182.pdf#page=4
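A rough way to see which properties would trip this rule is to check each property's namespace against the predefined ones. The set below is a partial list written from memory, not copied from the linked note, so it should be verified against that document; `non_predefined` is a hypothetical helper. Properties it flags are not necessarily forbidden, since the rule also permits extension schemas that comply with 6.6.2.3.2.

```python
# Partial, unverified list of predefined XMP schema namespaces.
PREDEFINED_NAMESPACES = {
    "http://purl.org/dc/elements/1.1/": "dc",
    "http://ns.adobe.com/xap/1.0/": "xmp",
    "http://ns.adobe.com/xap/1.0/mm/": "xmpMM",
    "http://ns.adobe.com/xap/1.0/rights/": "xmpRights",
    "http://ns.adobe.com/pdf/1.3/": "pdf",
    "http://ns.adobe.com/photoshop/1.0/": "photoshop",
    "http://ns.adobe.com/tiff/1.0/": "tiff",
    "http://ns.adobe.com/exif/1.0/": "exif",
    "http://www.aiim.org/pdfa/ns/id/": "pdfaid",
}


def non_predefined(property_names: list[str]) -> list[str]:
    """Return property names whose namespace is not in the list above.

    Names are expected in '{namespace-uri}local' notation; names without
    a namespace are flagged as well.
    """
    flagged = []
    for name in property_names:
        uri = name[1:].split("}", 1)[0] if name.startswith("{") else ""
        if uri not in PREDEFINED_NAMESPACES:
            flagged.append(name)
    return flagged
```

For example, a name in a PRISM namespace such as `{http://prismstandard.org/namespaces/basic/2.0/}issn` (URI assumed here) would be flagged, while `{http://purl.org/dc/elements/1.1/}title` would not.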

As far as I'm aware Dublin Core is inclusive of most or all of the "PRISM" metadata that is attached to this file. To retain the metadata, you'd have to find a tool that can rewrite PRISM XMP metadata as Dublin Core (python-xmp-toolkit would be capable) and run it on the file before ocrmypdf.
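The rewriting step might look something like the sketch below, which works on plain dictionaries rather than a real PDF. The PRISM-to-Dublin-Core pairings are my own guesses for illustration, not an established crosswalk, and `fold_prism_into_dc` is a hypothetical name; writing the result back into the file would still need a tool such as python-xmp-toolkit.

```python
# Hypothetical mapping, chosen for illustration only.
PRISM_TO_DC = {
    "prism:doi": "dc:identifier",
    "prism:publicationName": "dc:source",
    "prism:coverDate": "dc:date",
    "prism:url": "dc:relation",
}


def fold_prism_into_dc(
    props: dict[str, str],
) -> tuple[dict[str, list[str]], list[str]]:
    """Rewrite PRISM properties as Dublin Core where a mapping is defined.

    Returns (converted, unmapped): `converted` maps each target property
    to a list of values, since several sources may land on one Dublin
    Core property; `unmapped` lists PRISM names with no mapping, which
    the caller must handle or accept losing.
    """
    converted: dict[str, list[str]] = {}
    unmapped: list[str] = []
    for name, value in props.items():
        if name.startswith("prism:"):
            target = PRISM_TO_DC.get(name)
            if target is None:
                unmapped.append(name)
                continue
        else:
            target = name  # non-PRISM properties pass through unchanged
        converted.setdefault(target, []).append(value)
    return converted, unmapped
```

Fields such as volume and issue number have no obvious Dublin Core slot in this sketch, so they come back in `unmapped` rather than being silently discarded.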

The next release will issue a warning when some metadata gets lost.

@gwern
Author

gwern commented Dec 30, 2018

The next release will issue a warning when some metadata gets lost.

Thanks.

@jbarlow83
Collaborator

Fixed in v8
