Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproducible PS/PDF output (master) #6597

Merged
merged 31 commits into from Dec 9, 2016
Merged

Reproducible PS/PDF output (master) #6597

merged 31 commits into from Dec 9, 2016

Conversation

ghost
Copy link

@ghost ghost commented Jun 16, 2016

This is a rebase of #6595 to the master branch.

Several software packages use matplotlib in their building process (mainly to produce PS or PDF documents). To make their build reproducible, it would be great to make matplotlib output reproducible.

To allow reproducible PS and PDF output:

  • honour SOURCE_DATE_EPOCH for timestamps in PS and PDF files.
    See https://reproducible-builds.org/specs/source-date-epoch/
  • get keys sorted so that hatch patterns, images and markers are included with
    a reproducible order in the PDF file. Another solution is to sort self.hatchPatterns in writeHatches (and similar ordering in writeImages and writeMarkers), but this consumes more memory.

This patch has been submitted in debian bug #827361

See also https://reproducible-builds.org/

* honour SOURCE_DATE_EPOCH for timestamps in PS and PDF files.
  See https://reproducible-builds.org/specs/source-date-epoch/
* get keys sorted so that hatchPatterns, images and markers are included with
  a reproducible order in the PDF file.
See https://reproducible-builds.org/
@ghost ghost mentioned this pull request Jun 16, 2016
@tacaswell tacaswell added this to the 2.1 (next point release) milestone Jun 17, 2016
@tacaswell
Copy link
Member

I think we are already sorting all of those keys on the way out so do not need the ordered dicts.

Can you add an entry in https://github.com/matplotlib/matplotlib/tree/master/doc/users/whats_new and a test (like the deterministic svg test) that this actually works?

attn @jkseppan

@jkseppan
Copy link
Member

I agree that a test is needed. I don't see the reason for the ordered dicts either, but there could be some subtlety that I'm missing. It would be neat to make the date settable by the user (via e.g. the PdfPages constructor) but of course not necessary for this particular use case.

@ghost
Copy link
Author

ghost commented Jun 18, 2016

I added a test for reproducible PDF output (you're right: this is very important!).
The test shows that the ordered dicts are needed for markers and hatches, due to the dict iterations at the beginning of writeHatches and writeMarkers, which are unreproducible with python3.
Even if the same iteration occurs in writeImages, the tests passes without ordering _images. I think this is because the _images keys are integers.
I also added two orderings (related to fonts), that appeared to be essential thanks to the test.

@jkseppan
Copy link
Member

Oh, of course! I was thinking about the page stream, but it also matters in which order the objects are output at the top level of the pdf file. I suspect that the ordering is needed for _images too - maybe the dict implementation is a list for integer-keyed small dictionaries, or something.

@ghost
Copy link
Author

ghost commented Jun 18, 2016

I would like to add tests for postscript too, but I'm afraid the code will be nearly the same as for PDF: do you think I should move the common code somewhere (maybe in a new testing/reproducible.py file), or duplicate it in test_backend_ps.py and test_backend_pdf.py?

@tacaswell
Copy link
Member

.SK...
======================================================================
FAIL: Test SOURCE_DATE_EPOCH support for PS output
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Miniconda-x64\envs\test-environment\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "c:\projects\matplotlib\lib\matplotlib\testing\decorators.py", line 152, in wrapped_callable
    func(*args, **kwargs)
  File "c:\projects\matplotlib\lib\matplotlib\tests\test_backend_ps.py", line 186, in test_source_date_epoch
    _test_source_date_epoch("ps", b"%%CreationDate: Sat Jan  1 00:00:00 2000")
  File "c:\projects\matplotlib\lib\matplotlib\testing\determinism.py", line 122, in _test_source_date_epoch
    assert string in buff
AssertionError
--------------------------------------------------------------------

These failures on appveyor look real.

@tacaswell
Copy link
Member

My naive guess is that there is a unicode vs bytes issues

@ghost
Copy link
Author

ghost commented Jul 8, 2016

These failures on appveyor look real.

Sure! I'm working on it but for now I can't reproduce the problem at home.

Alexis Bienvenüe added 4 commits July 8, 2016 12:16
…th, day of month and times.

This way we don't mind if timestamps are written with leading 0 or space.
@ghost
Copy link
Author

ghost commented Jul 11, 2016

I think this is better now.

  • For parallel tests, I think threads are used, and they share the same environment. So setting the SOURCE_DATE_EPOCH environment variable in a test is not good. So I used subprocesses instead.
  • The timestamp was 'Jan 1' or 'Jan 01' sometimes, so I used Dec 15 instead to solve this issue.

Appveyor is now reporting only a HTTPError: 500 Server Error. I think this is not related to Matplotlib code, but I don't know if I can restart the build.

@tacaswell
Copy link
Member

👍 from me, attn @jkseppan can you do a final review?


The ``SOURCE_DATE_EPOCH`` environment variable can now be used to set
the timestamps value in the PS and PDF outputs, which are then
reproducible.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should list the known limitations, because when downstream users see this promise, they may start depending on it and run into the limitations later. At least usetex with the ps backend should be ruled out, based on comments in the tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a link to the specification that describes the value of this environment variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other possible limitations include at least mathtext, ps.usedistiller, ps.fonttype, and pdf.fonttype. These are features or settings that I think could easily hide a source of nondeterminism and that don't seem to get exercised by the new test. (To get a more reliable list of relevant features than just my guesses, you'd need to run the test with coverage checking and see which parts of the backends get no coverage.)

To be clear, I don't mean we should delay merging until there are tests for everything, but we should be realistic in how we document this. I imagine the reproducible-builds community would appreciate the feature as it is now, with careful documentation of what exactly they can rely on.

@jkseppan
Copy link
Member

Commit c007f49 changes the tests to use a two-digit date in December instead of January 1st. I think the underlying problem should be fixed instead of avoiding it in the tests, since we don't want to tell users that their SOURCE_DATE_EPOCH values must be two-digit days.

@ghost
Copy link
Author

ghost commented Jul 12, 2016

I considered reproducibility this way: with the same tools (same versions of all what you installed on your computer), the same commands lead to the same results.
With this point of view, I accept that with some different versions/implementations of python, the date can be written different ways (leading space or 0 for example). The problem with the tests was that we face different implementations with different behaviors. We don't pretend that the output will be exactly the same with all future versions of all the tools, all implementations of python.
I also accept that ghostscript can change its way to format the timestamp in a future version, but I did not include a test with usetex relying to the current behavior of ghostscript. Maybe this is too prudent.
Let me try to be more explicit about that in the doc.

@Kojoley
Copy link
Member

Kojoley commented Oct 13, 2016

I think the failure is due to ghostscript (some versions now honors SOURCE_DATE_EPOCH, but some don't).

The fail happens on the build where ghostscript is not installed, so bf7387e commit is not the right way to handle such situation.

@ghost
Copy link
Author

ghost commented Oct 15, 2016

The [XPASS(strict)] This test needs a ghostscript installation fail happened because the test should have failed (tagged @needs_ghostscript) but did not fail, because usetex was not properly passed to _determinism_save before af4213e, so that the test ran without using ghostscript.
Correcting this, I then saw that now that usetex is properly passed, the test using usetex failed because of ghostscript, which does not honour SOURCE_DATE_EPOCH in the installed version. So I applied bf7387e so that we can see if the test pass (good ghostscript version) or not, but without failing.

@tacaswell tacaswell closed this Nov 1, 2016
@tacaswell tacaswell reopened this Nov 1, 2016
@tacaswell
Copy link
Member

'power cycled to restart CI (appveyor failures looked due to qt-related packaging issues).

@tacaswell tacaswell changed the title Reproducible PS/PDF output (master) [MRG+1] Reproducible PS/PDF output (master) Nov 1, 2016
Copy link
Member

@tacaswell tacaswell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resuse UTC timezone from dates.py

@@ -135,6 +136,20 @@ def _string_escape(match):
assert False


# tzinfo class for UTC
class UTCtimezone(tzinfo):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have this in mpl/dates.py

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Sorry I missed that.

Copy link
Member

@tacaswell tacaswell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resuse UTC timezone from dates.py

Copy link
Member

@tacaswell tacaswell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resuse UTC timezone from dates.py

@tacaswell tacaswell changed the title [MRG+1] Reproducible PS/PDF output (master) [MRG] Reproducible PS/PDF output (master) Nov 1, 2016
@tacaswell
Copy link
Member

Other than one minor change, this looks good to go!

Don't worry about the coverall failure, that is un-related.

Sorry this has dragged on so long.

@QuLogic
Copy link
Member

QuLogic commented Nov 2, 2016

The coveralls drop is probably due to a change in master, and not this PR; a rebase to latest master might help that, but is not strictly necessary.

Copy link
Member

@QuLogic QuLogic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor things.

------------------------------

The ``SOURCE_DATE_EPOCH`` environment variable can now be used to set
the timestamps value in the PS and PDF outputs. See
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timestamp

default value is "mhi", so that the test includes all these objects.
format : str
format string. The default value is "pdf".
uid : str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not seeing this being used for anything?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uid option is used in test_determinism_all_tex from lib/matplotlib/tests/test_backend_ps.py

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it was only useful when files were used to store the figures. I will remove that.

format string, such as "pdf".
string : str
timestamp string for 2000-01-01 00:00 UTC.
keyword : str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears to be bytes, not str.

@ghost
Copy link
Author

ghost commented Nov 3, 2016

@QuLogic: thanks a lot for your review. I corrected those points.

@QuLogic
Copy link
Member

QuLogic commented Dec 7, 2016

A rebase might fix coveralls (as I'm pretty sure it was a change in master that dropped coverage), but it's not strictly needed. I guess we aren't really waiting on anything else here.

@QuLogic QuLogic changed the title [MRG] Reproducible PS/PDF output (master) [MRG+1] Reproducible PS/PDF output (master) Dec 7, 2016
@QuLogic QuLogic merged commit eadafc6 into matplotlib:master Dec 9, 2016
@QuLogic QuLogic changed the title [MRG+1] Reproducible PS/PDF output (master) Reproducible PS/PDF output (master) Dec 9, 2016
@ghost
Copy link
Author

ghost commented Dec 9, 2016

Thanks to you all for your kind support!

@tacaswell
Copy link
Member

@JojoBoulix Thanks for taking care of this! Sorry it was such a protracted process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants