testing MediaBox and other metadata #61

Closed
davidcarlisle opened this issue Mar 24, 2018 · 34 comments

@davidcarlisle
Member

Setting up some tests for geometry https://github.com/davidcarlisle/geometry

It would be useful to check that the following does produce a PDF of size 100 x 200, e.g. that pdfinfo reports

Page size: 100 x 200 pts

This was the bug being addressed (with luatex the page size was not affected, leaving it as A4), so it would be good to have a test for this case.

One possibility would be to provide a custom "normalization" function in build.lua that just runs pdfinfo (or a Lua function extracting the same information); other interfaces to this functionality could be imagined.

\documentclass{article}
\usepackage[papersize={100bp,200bp}]{geometry}
\begin{document}
x
\end{document}
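
For illustration, a minimal Lua sketch of that idea (the helper and the PDF file name are hypothetical, not existing l3build code), assuming pdfinfo is on the path:

-- Hypothetical helper: run pdfinfo on the generated PDF and pick out the
-- reported page size, so a test could assert it matches the geometry request.
local function page_size(pdffile)
  local handle = assert(io.popen("pdfinfo " .. pdffile))
  local output = handle:read("*a")
  handle:close()
  -- pdfinfo prints e.g. "Page size:      100 x 200 pts"
  local w, h = output:match("Page size:%s+([%d%.]+) x ([%d%.]+)")
  return tonumber(w), tonumber(h)
end

local w, h = page_size("geometry-papersize.pdf") -- file name is illustrative
assert(w == 100 and h == 200, "MediaBox is not 100 x 200")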

@josephwright
Member

Relates in some ways to #21: probably the same general mechanism for 'non-log' testing will be required.

@josephwright
Member

We could do this in various ways, for example a specific test mode, a switch to add the info to the .log, ...

@car222222

a switch to add the info to the .log, ...

The general ability to add stuff like this (as much as possible) to the .log file would be very widely useful.

@wspr
Contributor

wspr commented Mar 26, 2018 via email

@blefloch
Member

blefloch commented Mar 26, 2018 via email

@wspr
Contributor

wspr commented Mar 26, 2018 via email

@blefloch
Member

blefloch commented Mar 26, 2018 via email

@FrankMittelbach
Member

what about a simple flag in regression-test.tex, such as

\EXTERNALDATA{pdfinfo}{meta}

which outputs

<<PLACEHOLDER pdfinfo meta>>

in the log. The normalization (or a step next to it) could then run pdfinfo -meta on the pdf file and insert the result in place of that placeholder.

That step could support different external commands, though I would limit that to a defined set initially, say just pdfinfo.
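
A rough sketch of how that replacement might look on the Lua side (the function is hypothetical, not an existing interface):

-- Sketch only: replace "<<PLACEHOLDER pdfinfo meta>>" (and similar) in the
-- log text with the output of running pdfinfo with that option on the PDF.
local function expand_placeholders(logtext, pdffile)
  return (logtext:gsub("<<PLACEHOLDER pdfinfo (%S+)>>", function(option)
    local handle = assert(io.popen("pdfinfo -" .. option .. " " .. pdffile))
    local result = handle:read("*a")
    handle:close()
    return result
  end))
end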

@josephwright
Member

The idea of using pdfinfo or similar is quite a good one, at least insofar as testing tagging and similar goes. If you run pdfinfo -rawdates, you get basic info, e.g.

Title:          Testing Tagged PDF with LaTeX
Subject:        Testing paragraph split across pages in Tagged PDF
Author:         Dr. Ross Moore
Creator:        pdfTeX + pdfx.sty with a-1a option
Producer:       pdfTeX
CreationDate:   D:20170117160113+11'00'
ModDate:        D:20170117160113+11'00'
Tagged:         yes
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          4
Encrypted:      no
Page size:      612 x 792 pts (letter)
Page rot:       0
File size:      419873 bytes
Optimized:      no
PDF version:    1.4

(PDF from web.science.mq.edu.au/~ross/TaggedPDF/test-LaTeX-article-unc.pdf), and one can then look at the tag structure using pdfinfo -struct, which starts here

  Div "Topmatter"
    H "title" (block):
       /Placement /Block
       /WritingMode /LrTb
       /TextAlign /Center
       /Padding [10 10 0 0]
    P "author" (block):
       /Placement /Block
       /WritingMode /LrTb
       /TextAlign /Center
       /Padding [10 10 0 0]
      Reference (inline)
        Note (inline)
          Lbl (block)
    P "date" (block):
       /Placement /Block
       /WritingMode /LrTb
       /TextAlign /Center
       /Padding [10 10 0 0]
  NonStruct
    Object 64 0
  TOC "Contents"
    H "Contents" (block)
    TOCI
      Reference (inline)
        Link "to destination section.1" (inline)
          Object 74 0
      TOC "subsections"
        TOCI

I suspect that using a third-party tool is a better plan long-term than trying to use the PDF stream (#10), and avoids the issues that have become apparent in #21.

I'd welcome thoughts on this, particularly from @FrankMittelbach, @car222222 and @u-fischer. My idea at present would be to take the work I've already done on PDF-based testing and alter it to something like

  • Use .pvt to indicate a PDF-based test
  • Include in the .pvt instructions on which PDF tests to run, either in TeX form or perhaps
    as comments (thus parseable by Lua up-front): are there advantages to doing it from TeX?
  • Store the output of the analysis tools as a .tlg or perhaps some related extension (.plg?)
  • Use a simple diff with no need to normalise

That is all pretty easy, so the question is 'does the logic stack up'? It should cover tag testing, but I'm not sure what else might be wanted. It should, though, allow us to add other tools later.

@u-fischer
Member

pdfinfo is certainly quite useful, and it would be good to be able to include its output and compare it with some reference file in some test setups. But it doesn't test everything. E.g. if I remove the \tagmcend from https://github.com/u-fischer/tagpdf/blob/master/source/examples/structure/ex-patch-sectioning-koma.tex one gets an invalid pdf due to the missing EMC operator, but pdfinfo doesn't mind.

So I would need to be able to compare parts of the uncompressed pdf too.

The longer I ponder this the more I think something like the arara rules would be useful: a way to tell per file that, e.g., this test should compare everything in the pdf from 24 0 obj to the next endstream.

It would imho be OK if every lvt only ever does one or two tests, so that one doesn't need tons of different extensions for the reference files.

@josephwright
Member

@u-fischer I'm still getting a handle on this, but I suspect what makes sense for automated testing ('is the output as expected') isn't the same as what is needed to set up the tests ('is the output right'). One sees the same in a lot of box-based tests: in the end, it takes a human looking at the output to be sure they are correct; our tests then pick up if they change. So for tagging, I was imagining using Adobe Pro to check the PDFs are correct, then checking in appropriate data so that it can be verified they don't get broken later.

I did wonder too about picking out objects: that is certainly doable, and of course ends up again with purely text files. Again, one might imagine using the same setup with some form of marker data. If done as 'magic' comments, something like

% l3build checkobjs <numbers>
% l3build pdfinfo --rawdates
% l3build pdfinfo --struct 

or if done at the macro level

\CHECKPDFOBJS{<numbers>}
\CHECKPDFINFO
\CHECKPDFSTRUCT

I guess the latter approach would work with the 'standard' .lvt extension, but it does make the internal logic easier if a separate code path can be indicated by the input file name, so I'd favour using .pvt or similar.
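
Either way the up-front parse can be tiny; a sketch for the 'magic' comment form, using the syntax suggested above (the helper itself is hypothetical):

-- Sketch: collect "% l3build <action> <args>" lines from a test file.
local function magic_lines(testfile)
  local actions = {}
  for line in io.lines(testfile) do
    local action, args = line:match("^%%%s*l3build%s+(%S+)%s*(.*)$")
    if action then
      actions[#actions + 1] = { action = action, args = args }
    end
  end
  return actions
end
-- e.g. returns { { action = "pdfinfo", args = "--rawdates" }, ... }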

@u-fischer
Member

Yes, the initial pdf must certainly be checked for correctness by a human with tools like preflight, pax3, pdfinfo and other things. The automatic testing often "only" needs to compare the two pdfs or the pdfinfo output minus some normalization (I checked miktex against texlive and the differences were few: only producer, id and time), or to restrict the comparison to some parts of the pdf.
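
A sketch of that kind of normalization of the pdfinfo output (the exact field list to drop would need checking against real runs; nothing here is existing code):

-- Sketch: strip the pdfinfo fields that legitimately vary between runs and
-- distributions (producer, dates, file size) before diffing the rest.
local volatile = { "Producer", "CreationDate", "ModDate", "File size" }

local function normalize_pdfinfo(text)
  for _, field in ipairs(volatile) do
    text = text:gsub(field .. ":[^\n]*\n?", "")
  end
  return text
end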

@josephwright
Member

@u-fischer The pdfinfo output can be coerced to be the same: things like date can be fixed (they all should be with l3build).

What I guess I'm wondering is whether we feel testing individual obj blocks plus pdfinfo output is enough for automated testing of pre-checked PDFs. For example, you say that pdfinfo doesn't mind about missing ending tags, but what's important is whether that shows up in the tag dump. Is that the case?

@wspr
Contributor

wspr commented Jul 30, 2018 via email

@josephwright
Member

I've thought more about the extension business, and perhaps needing two is confusing: if one goes with macro-based switches for doing 'info' steps, then it's strange for them to be used with only one type of test input. I think the following logic in l3build would make sense:

  • Run test
  • Make .tlg file and compare
  • Parse .log (or .lvt) for 'magic' lines
  • Collect info in e.g. <name>.tpf (Test PdF) file and compare

@josephwright
Member

(Just to note that having macro-based commands in the .lvt for extracting PDF data doesn't mean they have to be parsed from the log: as the .lvt is likely shorter, it's probably faster to just parse that using Lua in any case.)

@u-fischer
Member

Sorry I think I'm getting lost a bit, I'm not sure I really get the question.

  • I think we need testing based on the (uncompressed) pdf. E.g. to test that the BDC/EMC operators are correct in the page stream, that no objects got lost, that the escaping/encoding of /Alt and /ActualText arguments is correct, that fake spaces are in the stream, that the unicode mapping is there, etc.
  • I think that this pdf testing is conceptually not different to the current log testing: You take the output (pdf here, log there), normalize it, store it and then compare it with the normalized output of the test run.
  • I also think that the needed pdf normalization is not so difficult, probably even quite a bit easier than the log normalization -- if you don't try to make tests which compare pdfs of different engines, as the pdf output of pdftex and, say, xetex is too different: one should make engine-specific tests.

So to get it working one needs variables/tools to tell l3build the output filetype (pdf/log/something else) and the "normalization function" which should be used for a test.

Regarding per-test configuration: there is/will be a need for different test setups: pdftex tests for tagging e.g. need more compilation runs than luatex. But one can get around the problem by using config files and different test folders. On Travis one could probably use some environment variable and set the config list depending on its value. So while it would be neat to have more control, it is not so vital.

@josephwright
Member

@u-fischer OK, it sounds like an approach based on extracting some parts of PDFs into .tlg or similar files is workable. So the only real question is how best to express that in input terms. We have two basic options:

  • Call all input files .lvt and pick up that the PDF should be post-processed based on marker in the file
  • Use a different extension for 'parse the PDF' tests

I can see advantages to both approaches: at a technical level, using a different extension does make branching a bit easier. I'd welcome thoughts on what is clearer to the user.

There's then 'how do we mark parts of the PDF for comparison'. I'm leaning toward a syntax which specifies the begin- and end-of-extraction points, which might read for example

\CHECKPDFSECTION{2 0 obj}{endobj}

which could of course have a focussed variant for objects (just the number).
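
A first sketch of what such an extraction could do on the uncompressed PDF, treating the two markers as plain text (none of this is existing l3build code):

-- Sketch: pull the part of an (uncompressed) PDF between two markers, e.g.
-- from "2 0 obj" to the following "endobj", for storing and comparing.
local function extract_section(pdffile, startmarker, endmarker)
  local handle = assert(io.open(pdffile, "rb"))
  local content = handle:read("*a")
  handle:close()
  local first = content:find(startmarker, 1, true)  -- plain find, no patterns
  if not first then return nil end
  local last = content:find(endmarker, first, true)
  if not last then return nil end
  return content:sub(first, last + #endmarker - 1)
end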

I'll look to adjust the current code to at least demo how this might work: very much a fluid situation!

As you say, PDF-based testing is going to need to be single-engine with different configs: that all seems workable.

@u-fischer
Member

@josephwright I think we have a third basic option:

  • call all input files .lvt and pick up that the PDF should be post-processed based on a setting in the config file.

I would have no problem with putting all the pdf-tests in some testfiles-pdf folder.

Regarding a dedicated extension: how does your idea to parse and compare the output of pdfinfo fit in here?

There's then 'how do we mark parts of the PDF for comparison'. I'm leaning toward a syntax which specifies the begin- and end-of-extraction points, which might read for example

That's a possibility, but imho one should check first whether it is not enough to delete a few OS-specific lines from the pdf.
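
For illustration only (function name and field list made up), such thinning could be as simple as blanking the lines whose values are run- or system-specific:

-- Sketch: blank PDF lines whose values differ per run or system (dates,
-- producer, trailer /ID), leaving the rest of the uncompressed PDF to diff.
local function thin_pdf(pdftext)
  for _, key in ipairs({ "/CreationDate", "/ModDate", "/Producer", "/ID" }) do
    pdftext = pdftext:gsub(key .. "[^\n]*", key)
  end
  return pdftext
end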

@josephwright
Member

Certainly if it's possible then simply 'thinning out' PDFs sounds good: I think the main issue is binary content.

I guess I'm slightly in favour of input files somehow 'knowing' they are for PDF-based tests, either by extension or by content. I'm sure I can find a way of doing pdfinfo: one could just make that a blanket thing.

@josephwright
Member

I think I've got a good idea of how we might look to tackle things now: I'll probably adjust the current code shortly so we can see how 'PDF-to-text' testing works. For the present I'll likely keep the input as .pvt files, simply as that means minimal changes. We can then discuss what is the clearest for the user.

@FrankMittelbach
Member

I would also prefer a separation by extension (or by a magic line inside) - this keeps things pretty clear already at directory level, and the chances that you want to use a single test in both ways are small imho. I think it is important that the test file speaks for itself and doesn't need a config just to decide what to do with it.

@car222222

I am pretty sure that Ulrike has (most of :-) the correct suggestions on this one.

Is this discussion going to Ross?

In particular I strongly agree with these two points from Ulrike:

Point 1:
. . . we need testing based on the (uncompressed) pdf. E.g. to test that the BDC/EMC-operators are correct in the page stream, that no objects got lost, that the escaping/encoding of /Alt and /ActualText arguments is correct, that fake spaces are in the stream, that the unicode mapping is there etc.

Point 2:
. . . pdf testing is conceptually not different to the current log testing: You take the output (pdf here, log there), normalize it, store it and then compare it with the normalized output of the test run.

@u-fischer
Member

@FrankMittelbach

I think it is important that the test file speaks for itself and doesn't need a config just to decide what to do with it

That would certainly be good. But currently a test file already doesn't always speak for itself - you need the configuration file to see which engines should compile it and how often.

@FrankMittelbach
Member

FrankMittelbach commented Jul 30, 2018 via email

@FrankMittelbach
Member

FrankMittelbach commented Jul 30, 2018 via email

@ozross

ozross commented Jul 30, 2018

OK, I'm glad you guys are looking at all this stuff.

I've been using Preflight for several years now, to verify that I'm building the PDFs correctly. This includes checking that:

  • PDF syntax is correct in all regards (incl. /Alt and /ActualText delimiters, etc.);
  • that Metadata is recorded via the XMP packet – some of the author-supplied /Info fields are deprecated in later revisions of PDF/A and PDF 2.0, in favour of using XMP;
  • object streams are correctly delimited: pdfTeX had this wrong for a while, fixed now;
  • fake spaces occur only within BDC ... EMC blocks;
  • BTW, it is allowable to nest BDC ... EMC blocks, but it is rarely useful to do so, as this can result in content being duplicated upon text-extraction; e.g., for screen reading.
  • the parent tree is built correctly, with an entry for each MCID occurring on each page, in the correct order;
  • /ToUnicode map entries are present for all font characters;
  • tagging is consistent with the declared standards;
  • images use the correct colour space: else use Preflight to produce a new version of the image, which essentially wraps the original graphic up with a small piece of extra coding to do a colour conversion;
  • and many, many more things that may crop up, especially when using content/images produced external to TeX software.

There are hundreds of individual tests, grouped according to types, as in the attached image.
[screenshot: Preflight profile showing the test groups, 31 July 2018]

Having a look in detail may give some ideas about which tests can be usefully done using TeX-based tools.

Hope this helps.
Ross

josephwright added a commit that referenced this issue Jul 31, 2018
See issue #10 and issue #61 for discussions.

This first pass only strips out a minimal amount of data:
more normalization may well be needed.

@josephwright
Member

First attempt at stripping info from the PDF file was not exactly encouraging: see the Travis-CI failures. It seems that font data is stored entirely differently in the PDF on Windows and on Linux. More importantly, there's no obvious/easy pattern to pick up on in the very simple PDF I'm using to test this. With something more complex, I'm very doubtful that it would be easy to pull out just the 'right' parts.

I'm going to see if I can get something more self-consistent if I set the various compression settings differently. However, if that doesn't work then we are likely back needing to 'opt in' material for comparison, or using external tools (pdfinfo, etc.), or both.

@u-fischer
Member

Could you send me an example of an uncompressed pdf from linux? (Along with the texfile, or by using one of my example files.)

@josephwright
Member

@u-fischer Problem solved: it's a question of getting the font setup correct (Type 1 vs Type 3): things look good now! I'll track down the remaining minor issues, then I think we'll be able to run tests on the 'massaged' PDF, which also means we can look at the output for debugging.

@josephwright
Member

Right, this does seem to work: I'll probably need to adjust the normalisation over time, but if we retain as much of the 'raw' PDF as we can, other tests (as outlined by @ozross) can be used to set up reference data, whilst l3build can show what has changed when issues occur.

@car222222

there is no difference in opinion really.

Sure. But who ever suggested there was such a difference?

Note also that one of the things I agreed with was “use only uncompressed pdfs”, so maybe there was a difference in suggested actions, if not of opinions.

@josephwright
Member

On further testing, with Type1 fonts we don't have to worry about PDFs varying at all between platforms, or at least not in the cases I've tried. I'll keep a separate extension in case we do have to 'mangle' the PDFs, but this looks much easier than perhaps expected.

@josephwright
Member

I've stuck with removing binary data: when a test fails, this means we do get a useful .diff.
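
For illustration (a sketch only, not the actual normalisation used), removing the binary data can amount to blanking the contents of streams that hold non-printable bytes:

-- Sketch: blank the contents of stream ... endstream blocks that hold
-- non-printable bytes, so the remaining PDF still gives a readable diff.
local function strip_binary_streams(pdftext)
  return (pdftext:gsub("stream\r?\n(.-)endstream", function(body)
    if body:find("[^\t\r\n\32-\126]") then
      return "stream\n(binary data removed)\nendstream"
    end
    -- returning nil keeps text-only streams unchanged
  end))
end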
