testing MediaBox and other metadata #61

Closed
davidcarlisle opened this issue Mar 24, 2018 · 34 comments

@davidcarlisle
Member

Setting up some tests for geometry https://github.com/davidcarlisle/geometry

It would be useful to check that the following does produce a PDF of size 100 x 200, e.g. that pdfinfo reports

Page size: 100 x 200 pts

This was the bug being addressed (with luatex the page size was not affected, leaving it as A4), so it would be good to have a test for this case.

One possibility would be to provide a custom "normalization" function in build.lua that just runs pdfinfo (or a Lua function extracting the same information); other interfaces to this functionality could be imagined.

\documentclass{article}
\usepackage[papersize={100bp,200bp}]{geometry}
\begin{document}
x
\end{document}
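
For illustration, a minimal Lua sketch of that idea (the helper and the PDF file name are hypothetical, not existing l3build code), assuming pdfinfo is on the path:

-- Hypothetical helper: run pdfinfo on the generated PDF and pick out the
-- reported page size, so a test could assert it matches the geometry request.
local function page_size(pdffile)
  local handle = assert(io.popen("pdfinfo " .. pdffile))
  local output = handle:read("*a")
  handle:close()
  -- pdfinfo prints e.g. "Page size:      100 x 200 pts"
  local w, h = output:match("Page size:%s+([%d%.]+) x ([%d%.]+)")
  return tonumber(w), tonumber(h)
end

local w, h = page_size("geometry-papersize.pdf") -- file name is illustrative
assert(w == 100 and h == 200, "MediaBox is not 100 x 200")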

@josephwright
Member

Relates in some ways to #21: probably the same general mechanism for 'non-log' testing will be required.

@josephwright
Member

We could do this in various ways, for example a specific test mode, a switch to add the info to the .log, ...

@car222222

a switch to add the info to the .log, ...

The general ability to add stuff like this (as much as possible) to the .log file would be very widely useful.

@wspr
Contributor

wspr commented Mar 26, 2018 via email

@blefloch
Member

blefloch commented Mar 26, 2018 via email

@wspr
Contributor

wspr commented Mar 26, 2018 via email

@blefloch
Member

blefloch commented Mar 26, 2018 via email

@FrankMittelbach
Member

what about a simple flag in regression-test.tex, such as

\EXTERNALDATA{pdfinfo}{meta}

which outputs

<<PLACEHOLDER pdfinfo meta>>

in the log. The normalization (or a step next to it) could then run pdfinfo -meta on the pdf file and insert the result in place of that placeholder.

That step could support different external commands, though I would limit that to a defined set initially, say just pdfinfo.
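
A rough sketch of how that replacement might look on the Lua side (the function is hypothetical, not an existing interface):

-- Sketch only: replace "<<PLACEHOLDER pdfinfo meta>>" (and similar) in the
-- log text with the output of running pdfinfo with that option on the PDF.
local function expand_placeholders(logtext, pdffile)
  return (logtext:gsub("<<PLACEHOLDER pdfinfo (%S+)>>", function(option)
    local handle = assert(io.popen("pdfinfo -" .. option .. " " .. pdffile))
    local result = handle:read("*a")
    handle:close()
    return result
  end))
end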

@josephwright
Member

The idea of using pdfinfo or similar is quite a good one, at least insofar as testing tagging and similar goes. If you run pdfinfo -rawdates, you get basic info, e.g.

Title:          Testing Tagged PDF with LaTeX
Subject:        Testing paragraph split across pages in Tagged PDF
Author:         Dr. Ross Moore
Creator:        pdfTeX + pdfx.sty with a-1a option
Producer:       pdfTeX
CreationDate:   D:20170117160113+11'00'
ModDate:        D:20170117160113+11'00'
Tagged:         yes
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          4
Encrypted:      no
Page size:      612 x 792 pts (letter)
Page rot:       0
File size:      419873 bytes
Optimized:      no
PDF version:    1.4

(PDF from web.science.mq.edu.au/~ross/TaggedPDF/test-LaTeX-article-unc.pdf), and one can then look at the tag structure using pdfinfo -struct, which starts here

  Div "Topmatter"
    H "title" (block):
       /Placement /Block
       /WritingMode /LrTb
       /TextAlign /Center
       /Padding [10 10 0 0]
    P "author" (block):
       /Placement /Block
       /WritingMode /LrTb
       /TextAlign /Center
       /Padding [10 10 0 0]
      Reference (inline)
        Note (inline)
          Lbl (block)
    P "date" (block):
       /Placement /Block
       /WritingMode /LrTb
       /TextAlign /Center
       /Padding [10 10 0 0]
  NonStruct
    Object 64 0
  TOC "Contents"
    H "Contents" (block)
    TOCI
      Reference (inline)
        Link "to destination section.1" (inline)
          Object 74 0
      TOC "subsections"
        TOCI

I suspect that using a third-party tool is a better plan long-term than trying to use the PDF stream (#10), and avoids the issues that have become apparent in #21.

I'd welcome thoughts on this, particularly from @FrankMittelbach, @car222222 and @u-fischer. My idea at present would be to take the work I've already done on PDF-based testing and alter it to something like

  • Use .pvt to indicate a PDF-based test
  • Include in the .pvt instructions on which PDF tests to run, either in TeX form or perhaps
    as comments (thus parseable by Lua up-front): are there advantages to doing it from TeX?
  • Store the output of the analysis tools as a .tlg or perhaps some related extension (.plg?)
  • Use a simple diff with no need to normalise

That is all pretty easy, so the question is 'does the logic stack up'? It should cover tag testing, but I'm not sure what else might be wanted. It should, though, allow us to add other tools later.

@u-fischer
Member

pdfinfo is certainly quite useful, and it would be good to be able to include its output and compare it with some reference file in some test setups. But it doesn't test everything. E.g. if I remove the \tagmcend from https://github.com/u-fischer/tagpdf/blob/master/source/examples/structure/ex-patch-sectioning-koma.tex one gets an invalid pdf due to the missing EMC operator, but pdfinfo doesn't mind.

So I would need to be able to compare parts of the uncompressed pdf too.

The longer I ponder this the more I think something like the arara rules would be useful: a way to tell per file that, e.g., this test should compare everything in the pdf from 24 0 obj to the next endstream.

It would imho be OK if every lvt only ever does one or two tests, so that one doesn't need tons of different extensions for the reference files.

@josephwright
Member

@u-fischer I'm still getting a handle on this, but I suspect what makes sense for automated testing ('is the output as expected') isn't the same as what is needed to set up the tests ('is the output right'). One sees the same in a lot of box-based tests: in the end, it takes a human looking at the output to be sure they are correct; our tests then pick up if they change. So for tagging, I was imagining using Adobe Pro to check the PDFs are correct, then checking in appropriate data so that it can be verified they don't get broken later.

I did wonder too about picking out objects: that is certainly doable, and of course ends up again with purely text files. Again, one might imagine using the same setup with some form of marker data. If done as 'magic' comments, something like

% l3build checkobjs <numbers>
% l3build pdfinfo --rawdates
% l3build pdfinfo --struct 

or if done at the macro level

\CHECKPDFOBJS{<numbers>}
\CHECKPDFINFO
\CHECKPDFSTRUCT

I guess the latter approach would work with the 'standard' .lvt extension, but it does make the internal logic easier if a separate code path can be indicated by the input file name, so I'd favour using .pvt or similar.
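
Either way the up-front parse can be tiny; a sketch for the 'magic' comment form, using the syntax suggested above (the helper itself is hypothetical):

-- Sketch: collect "% l3build <action> <args>" lines from a test file.
local function magic_lines(testfile)
  local actions = {}
  for line in io.lines(testfile) do
    local action, args = line:match("^%%%s*l3build%s+(%S+)%s*(.*)$")
    if action then
      actions[#actions + 1] = { action = action, args = args }
    end
  end
  return actions
end
-- e.g. returns { { action = "pdfinfo", args = "--rawdates" }, ... }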

@u-fischer
Member

Yes, the initial pdf must certainly be checked for correctness by a human with tools like preflight, pax3, pdfinfo and other things. The automatic testing often "only" needs to compare the two pdfs or the pdfinfo output minus some normalization (I checked miktex against texlive and the differences were few: only producer, id and time), or to restrict the comparison to some parts of the pdf.
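
A sketch of that kind of normalization of the pdfinfo output (the exact field list to drop would need checking against real runs; nothing here is existing code):

-- Sketch: strip the pdfinfo fields that legitimately vary between runs and
-- distributions (producer, dates, file size) before diffing the rest.
local volatile = { "Producer", "CreationDate", "ModDate", "File size" }

local function normalize_pdfinfo(text)
  for _, field in ipairs(volatile) do
    text = text:gsub(field .. ":[^\n]*\n?", "")
  end
  return text
end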

@josephwright
Member

@u-fischer The pdfinfo output can be coerced to be the same: things like date can be fixed (they all should be with l3build).

What I guess I'm wondering is whether we feel testing individual obj blocks plus pdfinfo output is enough for automated testing of pre-checked PDFs. For example, you say that pdfinfo doesn't mind about missing ending tags, but what's important is whether that shows up in the tag dump. Is that the case?

@wspr
Contributor

wspr commented Jul 30, 2018 via email

@josephwright
Member

I've thought more about the extension business, and perhaps needing two is confusing: if one goes with macro-based switches for doing 'info' steps, then it's strange for them to be used with only one type of test input. I think the following logic in l3build would make sense:

  • Run test
  • Make .tlg file and compare
  • Parse .log (or .lvt) for 'magic' lines
  • Collect info in e.g. <name>.tpf (Test PdF) file and compare

@josephwright
Member

(Just to note that having macro-based commands in the .lvt for extracting PDF data doesn't mean they have to be parsed from the log: as the .lvt is likely shorter, it's probably faster to just parse that using Lua in any case.)

@u-fischer
Member

Sorry I think I'm getting lost a bit, I'm not sure I really get the question.

  • I think we need testing based on the (uncompressed) pdf. E.g. to test that the BDC/EMC operators are correct in the page stream, that no objects got lost, that the escaping/encoding of /Alt and /ActualText arguments is correct, that fake spaces are in the stream, that the unicode mapping is there, etc.
  • I think that this pdf testing is conceptually not different to the current log testing: You take the output (pdf here, log there), normalize it, store it and then compare it with the normalized output of the test run.
  • I also think that the needed pdf normalization is not so difficult, probably even quite a bit easier than the log normalization -- if you don't try to make tests which compare pdfs of different engines, as the pdf output of pdftex and, say, xetex is too different: one should make engine-specific tests.

So to get it working one needs variables/tools to tell l3build the output filetype (pdf/log/something else) and the "normalization function" which should be used for a test.

Regarding per-test configuration: there is/will be a need for different test setups: pdftex tests for tagging e.g. need more compilation runs than luatex. But one can get around the problem by using config files and different test folders. On Travis one could probably use some environment variable and set the config list depending on its value. So while it would be neat to have more control, it is not so vital.

@josephwright
Member

@u-fischer OK, it sounds like an approach based on extracting some parts of PDFs into .tlg or similar files is workable. So the only real question is how best to express that in input terms. We have two basic options:

  • Call all input files .lvt and pick up that the PDF should be post-processed based on marker in the file
  • Use a different extension for 'parse the PDF' tests

I can see advantages to both approaches: at a technical level, using a different extension does make branching a bit easier. I'd welcome thoughts on what is clearer to the user.

There's then 'how do we mark parts of the PDF for comparison'. I'm leaning toward a syntax which specifies the begin- and end-of-extraction points, which might read for example

\CHECKPDFSECTION{2 0 obj}{endobj}

which could of course have a focussed variant for objects (just the number).
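
A first sketch of what such an extraction could do on the uncompressed PDF, treating the two markers as plain text (none of this is existing l3build code):

-- Sketch: pull the part of an (uncompressed) PDF between two markers, e.g.
-- from "2 0 obj" to the following "endobj", for storing and comparing.
local function extract_section(pdffile, startmarker, endmarker)
  local handle = assert(io.open(pdffile, "rb"))
  local content = handle:read("*a")
  handle:close()
  local first = content:find(startmarker, 1, true)  -- plain find, no patterns
  if not first then return nil end
  local last = content:find(endmarker, first, true)
  if not last then return nil end
  return content:sub(first, last + #endmarker - 1)
end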

I'll look to adjust the current code to at least demo how this might work: very much a fluid situation!

As you say, PDF-based testing is going to need to be single-engine with different configs: that all seems workable.

@u-fischer
Member

@josephwright I think we have a third basic option:

  • call all input files .lvt and pick up that the PDF should be post-processed based on a setting in the config file.

I would have no problem with putting all the pdf-tests in some testfiles-pdf folder.

Regarding a dedicated extension: how does your idea to parse and compare the output of pdfinfo fit in here?

There's then 'how do we mark parts of the PDF for comparison'. I'm leaning toward a syntax which specifies the begin- and end-of-extraction points, which might read for example

That's a possibility, but imho one should check first whether it is not enough to delete a few OS-specific lines from the pdf.
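
For illustration only (function name and field list made up), such thinning could be as simple as blanking the lines whose values are run- or system-specific:

-- Sketch: blank PDF lines whose values differ per run or system (dates,
-- producer, trailer /ID), leaving the rest of the uncompressed PDF to diff.
local function thin_pdf(pdftext)
  for _, key in ipairs({ "/CreationDate", "/ModDate", "/Producer", "/ID" }) do
    pdftext = pdftext:gsub(key .. "[^\n]*", key)
  end
  return pdftext
end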

@josephwright
Member

Certainly if it's possible then simply 'thinning out' PDFs sounds good: I think the main issue is binary content.

I guess I'm slightly in favour of input files somehow 'knowing' they are for PDF-based tests, either by extension or by content. I'm sure I can find a way of doing pdfinfo: one could just make that a blanket thing.

@josephwright
Member

I think I've got a good idea of how we might look to tackle things now: I'll probably adjust the current code shortly so we can see how 'PDF-to-text' testing works. For the present I'll likely keep the input as .pvt files, simply as that means minimal changes. We can then discuss what is the clearest for the user.

@FrankMittelbach
Member

I would also prefer a separation by extension (or by a magic line inside) - this keeps things pretty clear already at directory level, and the chances that you want to use a single test in both ways are small imho. I think it is important that the test file speaks for itself and doesn't need a config just to decide what to do with it.

@car222222

I am pretty sure that Ulrike has (most of :-) the correct suggestions on this one.

Is this discussion going to Ross?

In particular I strongly agree with these two points from Ulrike:

Point 1:
. . . we need testing based on the (uncompressed) pdf. E.g. to test that the BDC/EMC-operators are correct in the page stream, that no objects got lost, that the escaping/encoding of /Alt and /ActualText arguments is correct, that fake spaces are in the stream, that the unicode mapping is there etc.

Point 2:
. . . pdf testing is conceptually not different to the current log testing: You take the output (pdf here, log there), normalize it, store it and then compare it with the normalized output of the test run.

@u-fischer
Member

@FrankMittelbach

I think it is important that the test file speaks for itself and doesn't need a config just to decide what to do with it

That would certainly be good. But currently a test file already doesn't always speak for itself - you need the configuration file to see which engines should compile it and how often.

@FrankMittelbach
Member

FrankMittelbach commented Jul 30, 2018 via email

@FrankMittelbach
Member

FrankMittelbach commented Jul 30, 2018 via email

@ozross

ozross commented Jul 30, 2018

OK, I'm glad you guys are looking at all this stuff.

I've been using Preflight for several years now, to verify that I'm building the PDFs correctly. This includes checking that:

  • PDF syntax is correct in all regards (incl. /Alt and /ActualText delimiters, etc.);
  • that Metadata is recorded via the XMP packet – some of the author-supplied /Info fields are deprecated in later revisions of PDF/A and PDF 2.0, in favour of using XMP;
  • object streams are correctly delimited: pdfTeX had this wrong for a while, fixed now;
  • fake spaces occur only within BDC ... EMC blocks;
  • BTW, it is allowable to nest BDC ... EMC blocks, but it is rarely useful to do so, as this can result in content being duplicated upon text-extraction; e.g., for screen reading.
  • the parent tree is built correctly, with an entry for each MCID occurring on each page, in the correct order;
  • /ToUnicode map entries are present for all font characters;
  • tagging is consistent with the declared standards;
  • images use the correct colour space: else use Preflight to produce a new version of the image, which essentially wraps the original graphic up with a small piece of extra coding to do a colour conversion;
  • and many, many more things that may crop up, especially when using content/images produced external to TeX software.

There are hundreds of individual tests, grouped according to types, as in the attached image.
[screenshot: Preflight profile showing the test groups, 31 July 2018]

Having a look in detail may give some ideas about which tests can be usefully done using TeX-based tools.

Hope this helps.
Ross

josephwright added a commit that referenced this issue Jul 31, 2018
See issue #10 and issue #61 for discussions.

This first pass only strips out a minimal amount of data:
more normalization may well be needed.

@josephwright
Member

First attempt at stripping info from the PDF file was not exactly encouraging: see the Travis-CI failures. It seems that font data is stored entirely differently in the PDF on Windows and on Linux. More importantly, there's no obvious/easy pattern to pick up on in the very simple PDF I'm using to test this. With something more complex, I'm very doubtful that it would be easy to pull out just the 'right' parts.

I'm going to see if I can get something more self-consistent if I set the various compression settings differently. However, if that doesn't work then we are likely back needing to 'opt in' material for comparison, or using external tools (pdfinfo, etc.), or both.

@u-fischer
Member

Could you send me an example of an uncompressed pdf from linux? (Along with the texfile, or by using one of my example files.)

@josephwright
Member

@u-fischer Problem solved: it's a question of getting the font setup correct (Type 1 vs Type 3): things look good now! I'll track down the remaining minor issues, then I think we'll be able to run tests on the 'massaged' PDF, which also means we can look at the output for debugging.

@josephwright
Member

Right, this does seem to work: I'll probably need to adjust the normalisation over time, but if we retain as much of the 'raw' PDF as we can, other tests (as outlined by @ozross) can be used to set up reference data, whilst l3build can show what has changed when issues occur.

@car222222

there is no difference in opinion really.

Sure. But who ever suggested there was such a difference?

Note also that one of the things I agreed with was “use only uncompressed pdfs”, so maybe there was a difference in suggested actions, if not of opinions.

@josephwright
Member

On further testing, with Type1 fonts we don't have to worry about PDFs varying at all between platforms, or at least not in the cases I've tried. I'll keep a separate extension in case we do have to 'mangle' the PDFs, but this looks much easier than perhaps expected.

@josephwright
Member

I've stuck with removing binary data: when a test fails, this means we do get a useful .diff.
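
For illustration (a sketch only, not the actual normalisation used), removing the binary data can amount to blanking the contents of streams that hold non-printable bytes:

-- Sketch: blank the contents of stream ... endstream blocks that hold
-- non-printable bytes, so the remaining PDF still gives a readable diff.
local function strip_binary_streams(pdftext)
  return (pdftext:gsub("stream\r?\n(.-)endstream", function(body)
    if body:find("[^\t\r\n\32-\126]") then
      return "stream\n(binary data removed)\nendstream"
    end
    -- returning nil keeps text-only streams unchanged
  end))
end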
