testing MediaBox and other metadata #61
Relates in some ways to #21: the same general mechanism for 'non-log' testing will probably be required. |
We could do this in various ways, for example a specific test mode, or a switch to add the info to the |
The general ability to add stuff like this (as much as possible) to the .log file would be very widely useful. |
While a general solution might be good, it’s also going to be more work.
To address David’s needs here, what about a thin wrapper around pdfinfo which adds its output to the .tlg in the same way as recordstatus? E.g., recordpdfinfo=true would list something like
Creator: LaTeX with hyperref package
Producer: […]
CreationDate: […]
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 56
Encrypted: no
Page size: 612 x 792 pts (letter)
Page rot: 0
File size: […]
Optimized: no
PDF version: 1.5
where the […] lines are either stripped out entirely or normalised.
(I’m assuming pdfinfo is included in MiKTeX…)
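Purely as a sketch of the normalisation step (the function name and the exact list of volatile fields are my assumptions, not anything decided here), the […] lines could be filtered out in Lua along these lines:

    -- Sketch only: drop the pdfinfo fields that vary between runs and systems.
    -- The field list is an assumption based on the sample output above.
    local volatile = { "CreationDate", "ModDate", "Producer", "File size" }

    local function normalize_pdfinfo(text)
      local lines = {}
      for line in text:gmatch("[^\r\n]+") do
        local keep = true
        for _, field in ipairs(volatile) do
          if line:find("^" .. field .. ":") then
            keep = false
            break
          end
        end
        if keep then lines[#lines + 1] = line end
      end
      return table.concat(lines, "\n") .. "\n"
    end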
|
Is pdfinfo allowed by restricted shell escape? Perhaps it would be cleaner to do things on the TeX side, since only specific tests will need PDF info?
|
I thought about that, but don’t you need to do this after the PDF has been written out completely? I don’t think we are able to call \ShellEscape that late (but I could be wrong).
|
Ah, ignore me, sorry. Will's right.
|
What about a simple flag in regression-test.tex, such as
which outputs
in the log. The normalization (or a step next to it) could then run pdfinfo -meta on the pdf file and insert the result in place of that placeholder. That step could support different external commands, though I would limit that to a defined set initially (just pdfinfo, say).
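Something along these lines, purely as a sketch (the placeholder text and function name are invented for illustration, not a proposed interface):

    -- Sketch: swap an assumed placeholder line in the log text for the
    -- captured output of `pdfinfo -meta` on the test's PDF.
    local function insert_pdfinfo(logtext, pdffile)
      local handle = io.popen("pdfinfo -meta " .. pdffile)
      local info = handle:read("*a")
      handle:close()
      -- "==PDFINFO==" is an invented marker, not an agreed syntax.
      return (logtext:gsub("==PDFINFO==", (info:gsub("%%", "%%%%"))))
    end
|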
The idea of using
(PDF from web.science.mq.edu.au/~ross/TaggedPDF/test-LaTeX-article-unc.pdf), and then one can look at the tag structure using
I suspect that using a third-party tool is a better plan long-term than trying to use the PDF stream (#10), and avoids the issues that have become apparent in #21. I'd welcome thoughts on this, particularly from @FrankMittelbach, @car222222 and @u-fischer. My idea at present would be to take the work I've already done on PDF-based testing and alter it to something like
That is all pretty easy, so the question is 'does the logic stack up'? It should cover tag testing, but I'm not sure what else might be wanted. It should, though, allow us to add other tools later. |
pdfinfo is certainly quite useful, and it would be good to be able to include its output and compare it with some reference file in some test setups. But it doesn't test everything. E.g. if I remove the … So I would need to be able to compare parts of the uncompressed pdf too. The longer I ponder this the more I think something like the arara rules would be useful: a way to tell by file that e.g. this test should compare in the pdf everything from 24 0 obj to the next endstream. It would be imho ok if every lvt does only one or two tests, so that one doesn't need tons of different extensions for the reference files. |
@u-fischer I'm still getting a handle on this, but I suspect what makes sense for automated testing ('is the output as expected') isn't the same as what is needed to set up the tests ('is the output right'). One sees the same in a lot of box-based tests: in the end, it takes a human looking at the output to be sure they are correct; our tests then pick up if they change. So for tagging, I was imagining using Adobe Pro to check the PDFs are correct, then checking in appropriate data so that we can later verify they don't get broken. I did wonder too about picking out objects: that is certainly doable, and of course ends up again with purely text files. Again, one might imagine using the same set-up with some form of marker data. If done as 'magic' comments, something like
or if done at the macro level
I guess the latter approach would work with the 'standard' |
Yes, the initial pdf must certainly be checked for correctness by a human with tools like preflight, pax3, pdfinfo and other things. The automatic testing often "only" needs to compare the two pdfs or the pdfinfo output minus some normalization (I checked miktex against texlive and the differences were few: only producer, id and time), or to restrict the comparison to some parts of the pdf.
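As a rough sketch of such a normalization (patterns and the function name are only assumptions):

    -- Sketch: blank the producer, id and date entries in an uncompressed PDF
    -- before comparing two files, as these were the only observed differences.
    local function scrub_pdf(pdftext)
      return (pdftext
        :gsub("/Producer%s*%b()", "/Producer ()")
        :gsub("/CreationDate%s*%b()", "/CreationDate ()")
        :gsub("/ModDate%s*%b()", "/ModDate ()")
        :gsub("/ID%s*%[.-%]", "/ID []"))
    end
|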
@u-fischer The … What I guess I'm wondering is: do we feel testing individual |
I guess the latter approach would work with the 'standard' .lvt extension, but it does make the internal logic easier if a separate code path can be indicated by the input file name, so I'd favour using .pvt or similar.
Now that there is the machinery to load different config files at the build.lua level, maybe in light of all the possibilities that are coming up it would make sense to also have per-test config files. E.g.,
m3test001.lvt
m3test001.config.lua
m3test001.(tlg/etc)
This would allow different types of check, engines, numbers of runs, on a per-test basis. WDYT?
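As a sketch of what such a hypothetical m3test001.config.lua might contain (per-test configs are not an existing l3build feature; the variable names simply reuse the usual build.lua ones on the assumption they would carry over):

    -- Hypothetical per-test configuration (assumed feature, not implemented).
    checkengines = { "pdftex" }   -- run this test with pdftex only
    checkruns    = 3              -- tagging tests may need extra compilations
    -- a switch selecting PDF-based rather than log-based comparison could sit here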
|
I've thought more about the extension business, and perhaps needing two is confusing: if one goes with macro-based switches for doing 'info' steps, then it's strange for them to be used with only one type of test input. I think the logic in
|
(Just to note that having macro-based commands in the |
Sorry, I think I'm getting lost a bit; I'm not sure I really get the question.
So to get this working one needs variables/tools to tell l3build the output filetype (pdf/log/something else) and the "normalization function" which should be used for a test. Regarding per-test configuration: there is/will be a need for different test setups: pdftex tests for tagging e.g. need more compilation runs than luatex. But one can get around the problem by using config files and different test folders. On travis one could probably use some environment variable and set the config list depending on its value. So while it would be neat to have more control, it is not so vital.
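For the 'config files and different test folders' route, a minimal sketch of such a configuration (file and folder names invented for illustration):

    -- config-tagging-pdftex.lua (name assumed): tagging tests kept in their
    -- own folder, run with pdftex only and with more compilation runs.
    testfiledir  = "testfiles-tagging"
    checkengines = { "pdftex" }
    checkruns    = 4
|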
@u-fischer OK, it sounds like an approach based on extracting some parts of PDFs into
I can see advantages to both approaches: at a technical level, using a different extension does make branching a bit easier. I'd welcome thoughts on what is clearer to the user. There's then 'how do we mark parts of the PDF for comparison'. I'm leaning toward a syntax which specifies the begin- and end-of-extraction points, which might read for example
which could of course have a focussed variant for objects (just the number). I'll look to adjust the current code to at least demo how this might work: very much a fluid situation! As you say, PDF-based testing is going to need to be single-engine with different configs: that all seems workable.
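As an illustration of the begin/end extraction idea (a sketch only; how the markers would actually be written in the test file is still open, so the function and its arguments are assumptions):

    -- Sketch: pull the part of an uncompressed PDF between two plain-text
    -- markers, e.g. from "24 0 obj" to the following "endstream".
    local function extract_range(pdftext, startmark, endmark)
      local s = pdftext:find(startmark, 1, true)
      if not s then return "" end
      local _, e = pdftext:find(endmark, s, true)
      if not e then return "" end
      return pdftext:sub(s, e) .. "\n"
    end

    -- e.g. extract_range(pdf, "24 0 obj", "endstream")
|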
@josephwright I think we have a third basic option:
I would have no problem with putting all the pdf-tests in some …
Regarding a dedicated extension: how does your idea to parse and compare the output of pdfinfo fit in here?
That's a possibility, but imho one should check first if it is not enough to delete a few os-specific lines from the pdf. |
Certainly if it's possible then simply 'thinning out' PDFs sounds good: I think the main issue is binary content. I guess I'm slightly in favour of input files somehow 'knowing' they are for PDF-based tests, either by extension or by content. I'm sure I can find a way of doing |
I think I've got a good idea of how we might look to tackle things now: I'll probably adjust the current code shortly so we can see how 'PDF-to-text' testing works. For the present I'll likely keep the input as |
I would also prefer a separation by extension (or by magic line inside) - this keeps things pretty clear already at the directory level, and the chances that you want to use a single test in both ways are small imho. I think it is important that the test file speaks for itself and doesn't need a config just to decide what to do with it. |
I am pretty sure that Ulrike has (most of :-) the correct suggestions on this one. Is this discussion going to Ross? In particular I strongly agree with these two points from Ulrike: Point 1: … Point 2: … |
That would certainly be good. But currently a test file doesn't always speak for itself - you need the configuration file to see which engines should compile it and how often. |
@FrankMittelbach <https://github.com/FrankMittelbach>
I think it is important that the test file speaks for itself and
doesn't need a config just to decide what to do with it
That would be certainly good. But currently already a test files doesn't
always speak for itself - you need the configuration file to see which
engines should compile it how often.
OK, granted not to 100%, but on the whole they do (for me at least). The number of compile runs is really optional if you have the default not too low, so a config there is basically an optimization method; in fact, with the new method you can set a fairly high value and it compiles as often as necessary until it gets a match or hits the limit.
|
On 30.07.2018 at 19:47, Chris Rowley wrote:
In particular I strongly agree with these two points from Ulrike:
Point 1:
. . . we need testing based on the (uncompressed) pdf. E.g. to test that the BDC/EMC operators are correct in the page stream, that no objects got lost, that the escaping/encoding of /Alt and /ActualText arguments is correct, that fake spaces are in the stream, that the unicode mapping is there, etc.
Point 2:
. . . pdf testing is conceptually not different to the current log testing: You take the output (pdf here, log there), normalize it, store it and then compare it with the normalized output of the test run.
I seriously doubt that anybody is contesting that. The discussion is around which tools you provide and how much noise compared to benefit they provide. A bit like the \showbox log testing, which is useful but not in all cases; usually showing the whole pages of test files does more harm than good, as it is extremely hard to identify if or when changes are "ok" or "wrong".
But that doesn't mean such tests should not be there; it does mean that simpler partial test possibilities comparing limited info are also needed and helpful.
To me that all looks as if it is going in the right direction, and there is no real difference of opinion.
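(To make Point 2 concrete: whatever the normalization rules end up being, the comparison itself has the same shape as the existing log check, roughly as below; the function names are placeholders.)

    -- Sketch of the common pattern: read output and reference, normalize both,
    -- compare. 'normalize' stands in for whichever rules a given test uses.
    local function readall(name)
      local f = assert(io.open(name, "rb"))
      local s = f:read("*a")
      f:close()
      return s
    end

    local function outputs_match(rawfile, reffile, normalize)
      return normalize(readall(rawfile)) == normalize(readall(reffile))
    end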
|
First attempt at stripping info from the PDF file is not exactly encouraging: see the Travis-CI failures. It seems that font data is stored entirely differently in the PDF on Windows and on Linux. More importantly, there's no obvious/easy pattern to pick up on in the very simple PDF I'm using to test this. With something more complex, I'm very doubtful that it would be easy to pull out just the 'right' parts. I'm going to see if I can get something more self-consistent if I set the various compression settings differently. However, if that doesn't work then we are likely back to needing to 'opt in' material for comparison, or using external tools ( |
Could you send me an example of an uncompressed pdf from linux? (Along with the tex file, or by using one of my example files.) |
@u-fischer Problem solved: it was a question of getting the font set-up correct (Type 1 vs Type 3): things look good now! I'll track down the remaining minor issues, then I think we'll be able to run tests on the 'massaged' PDF, which also means we can look at the output for debugging. |
Right, this does seem to work: I'll probably need to adjust the normalisation over time, but if we retain as much of the 'raw' PDF as we can, other tests (as outlined by @ozross) can be used to set up reference data whilst |
Note also that one of the things I agreed with was “use only uncompressed pdfs”, so maybe there was a difference in suggested actions, if not of opinions. |
On further testing, with Type1 fonts we don't have to worry about PDFs varying at all between platforms, or at least not in the cases I've tried. I'll keep a separate extension in case we do have to 'mangle' the PDFs, but this looks much easier than perhaps expected. |
I've stuck with removing binary data: when a test fails, this means we do get a useful
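A sketch of the sort of 'remove binary data' step meant here (illustrative only; the actual normalisation code may well differ): blank the content of any stream containing non-printable bytes, keeping the rest of the PDF readable.

    -- Sketch: replace binary stream contents in an uncompressed PDF so that
    -- the remaining text can be diffed and read when a test fails.
    local function strip_binary_streams(pdftext)
      return (pdftext:gsub("stream\r?\n(.-)endstream", function(body)
        if body:find("[^\n\r\t\032-\126]") then
          return "stream\n[binary stream removed]\nendstream"
        end
        -- returning nil keeps text streams unchanged
      end))
    end
|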
Setting up some tests for geometry https://github.com/davidcarlisle/geometry
It would be useful to check that the following does produce a pdf of size 100x200, e.g. pdfinfo reports
Page size: 100 x 200 pts
This was the bug being addressed (with luatex the page size was not affected, leaving this as A4) so it would be good to have a test for this case.
One possibility would be to provide a custom "normalization" function in build.lua that just runs pdfinfo (or a lua function extracting the same information); other interfaces to this functionality could be imagined.
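A minimal sketch of that kind of check (the function name and where it hooks into build.lua are assumptions; it simply extracts the Page size line from pdfinfo's output for comparison with the expected value):

    -- Sketch: report the page size of a PDF via pdfinfo, e.g. "100 x 200".
    local function page_size(pdffile)
      local handle = io.popen("pdfinfo " .. pdffile)
      local out = handle:read("*a")
      handle:close()
      return out:match("Page size:%s*(%d+%.?%d* x %d+%.?%d*) pts")
    end

    -- the geometry test above would then expect page_size(...) == "100 x 200"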