Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Lower test tolerance #5307
Conversation
mdboom
added the
needs_review
label
Oct 23, 2015
|
In some ways this experiment was a success. The failures are not due to text differences, but due to a bunch of other differences. Not clear whether it's a Mac vs. Linux issue, just general non-reproducibility or what... |
mdboom
referenced
this pull request
Oct 26, 2015
Merged
Use a specific version of Freetype for testing #5306
|
I have been going through updating Cartopy tests so I've probably seen many of these changes. Probably most of them will be due to a small change in the antialiasing at the corner of plots, and possibly the upgrade to a newer Agg in 2a17839. |
|
I don't think the Agg change is the culprit. There's a certain group of tests that just generate slightly different results from run to run on the same machine/platform/environment. Some are obvious (like the random number generator not being reset in the xkcd test), but some are still an enigma. |
|
Oh, you mean after updating the result? Yea, that wouldn't be the Agg change then. |
This was referenced Oct 27, 2015
|
Woot! This is passing. (The Python 2.6 failure is on purpose -- I used To review this PR, I suggest clicking on individual commits except for the one called "Update all test images...". The whole diff here is too large for github to display. It turns out the source of the random test failures on the same platform/machine were due to the spines being stored in a dictionary, and thus their drawing order being non-deterministic. This creates small one-value differences in two pixels in the corner of the axes. Changing this to an ordered dict resolved this from around 68 random failures to 6. The remaining ones with inter-run differences involve boxplot (which I think is a known issue) and axes_grid1. For those, I just turned the tolerance up slightly and filed #5334 to deal with them later. Everything else is now happily running with a tolerance of zero and a direct I still would advise against merging this until some known-to-change-the-baseline-images PRs are merged first, especially #5301 and #5214. Though #5306 (which just turns on the testing version of freetype without changing the tolerances) can be merged without harm at any time. |
|
Very interesting and great work! Question/Devil's Advocate: If I understand this correctly, there is now a disconnect between the images produced in the testing suite and images produced via normal means (modulo any default style differences). The freetype version is only enforced during the unit tests, right? Wouldn't that make it difficult to track down bugs that are reported by users that have roots in different versions of freetype? Furthermore, do we want to eliminate fuzzy matching completely? Would it make sense to keep it and make it possible for users to run the test suite with their system's freetype? Maybe even record how much of a difference occurs with different system configurations? Again, this is all devil's advocate, and I am really pleased that we can tighten down these knobs. |
Yes. Most end users will continue to build against their system freetype or conda's freetype etc. as they always have. That said there's no harm in using the special testing freetype if they want to -- but most packagers, esp. Debian, aren't going to do that.
Possibly. But I don't see that as a big issue: I've never seen an issue of that type appear, where the version of freetype was causing some sort of true problem. I've only ever seen small differences in the antialiasing of fonts that cause problems when comparing test images, but otherwise appear fine to a human being. Those are the kind of differences I would not consider a bug, but the natural refinement of the details in freetype over the years.
The functionality is still there, and individual tests may still turn it on on an individual basis. (In fact the 6 tests mentioned in #5334 do). The only change is that the default is 0. I think it's possible that we may turn it on on some more tests over time if we learn that exact matching isn't important for them.
That's an interesting question. At present (thanks to feedback from @jkseppan), they can still run the tests, but many of the tests will fail due to inexact matching. One possibility is to detect the use of a system freetype and turn the default tolerance up to some small but non-zero value. However, my worry there is that we need to communicate that very clearly to all developers, or there will be mismatches when they commit new test images to the repository. I deliberately took a hard line on this because I want to keep the images working well. We could somehow track this much like one tracks code coverage. I don't know how you do that within nose though without building a bunch of scaffolding on top. It also might not hurt to do "smoke tests" to just compile with various versions of freetype and make sure there's no API changes that bite us (though freetype is old and super stable wrt API at this point, and it's not like it's a small project that no one much notices anymore).
Yeah -- there have been a number of bugs in the past few weeks that slipped right through the unit tests that we would have found if they were less tolerant. That's my real motivator here. |
|
This is great work, will make things much easier! I haven't reviewed the commits, but I found that running |
|
The spine order is addressed by #4434 as well. |
Oh, good to know. The OrderedDict approach is maybe slightly better in that you don't have to remember to sort it in all the places you might iterate over them, but it's a minor enough point. We should be sure to not solve it both ways when this or that PR is merged. |
|
I would much rather solve this with ordered dicts than by sorting on every draw. |
|
Yes -- that makes sense. |
mdboom
added this to the
next major release (2.0)
milestone
Oct 30, 2015
This was referenced Nov 19, 2015
|
Now, all of a sudden we have a bunch of failing tests I can't reproduce locally. The vast majority are SVG-only, which is perhaps a clue. |
|
I could not reproduce the travis failures either. Travis is using inkscape 0.48 from ubuntu, whereas in my debian testing box I have inkscape 0.91. There might be some differences in the png generation between the versions. |
|
The SVG problem appears to be 2.7 only is is due to a different number of digits being written out for the numbers in transforms on 2.7 vs 3.4 where I generated the images. Using |
|
Now we're down to this Python 2 vs 3 issue: From What's New in Python 3:
Never noticed this one before. I think we'll need to use |
That also has lower precision; I don't know if that would be a problem, but $ python2
>>> '%f' % (0.123456789012345678901234567890, )
'0.123457'
>>> str(0.123456789012345678901234567890)
'0.123456789012'
>>> repr(0.123456789012345678901234567890)
'0.12345678901234568'
$ python3
>>> '%f' % (0.123456789012345678901234567890, )
'0.123457'
>>> str(0.123456789012345678901234567890)
'0.12345678901234568'
>>> repr(0.123456789012345678901234567890)
'0.12345678901234568' |
|
If all numbers are written out using repr, won't the file size balloon? |
|
That is true, but based on 39d9e9a, I don't think it applies to all numbers in the file. In any case, I'm only pointing out that |
|
I actually don't think any more precision that %f is needed -- these are pixel positions. I chose it specifically because it was the shortest length of the easily-available options. |
|
We are now officially green. This is ready for a final review and merge (look at individual commits, since the changeset is otherwise too large to view). |
mdboom
changed the title from
WIP: Lower test tolerance to Lower test tolerance
Dec 9, 2015
|
Oh, looking through the changes, it is possible that SVG file size will still increase, because something like |
That's a good point. We could do something like |
|
Is there a short summary of the 6 tests that were bumped up from 0 tolerance? |
They have as-of-yet-undetermined nondeterminacy in them. I'll create a new issue for those, but I don't think we should hold this up any longer trying to investigate them right now. |
This was referenced Dec 10, 2015
mdboom
and others
added some commits
Oct 27, 2015
tacaswell
commented on lib/matplotlib/tests/test_axes.py in 8cad8ff
Dec 10, 2015
|
Why the change from 'a' -> '0'? |
|
Sorry -- the unfortunate need to rebase (given the size of this PR) obliterated my earlier comment. This test creates two figures -- once for normal usage and once for the labeled data approach -- but saves them to the same file, so the results must match exactly. In the labeled data example, |
|
Sorry for asking lots of annoying picky questions. |
tacaswell
added a commit
that referenced
this pull request
Dec 10, 2015
|
|
tacaswell |
61f0eea
|
tacaswell
merged commit 61f0eea
into matplotlib:master
Dec 10, 2015
tacaswell
removed the
needs_review
label
Dec 10, 2015
|
\o/ Thanks for merging! |
|
Are you going to backport to 2.0.x, or should I? |
|
doing it now |
|
Maybe I'm too late with this comment, but it just occurred to me that it might be worth doing a PR of the backport just because of the high potential for breakage here... |
|
not too late ,will do |
|
It does not back-port cleanly.... |
|
Ok -- I can take a crack at it if the conflicts are non-obvious |
|
I'll leave this to you |
mdboom
added a commit
to mdboom/matplotlib
that referenced
this pull request
Dec 10, 2015
|
|
tacaswell + mdboom |
61cc364
|
mdboom
added a commit
to mdboom/matplotlib
that referenced
this pull request
Dec 11, 2015
|
|
tacaswell + mdboom |
65dc69a
|
mdboom
added a commit
to mdboom/matplotlib
that referenced
this pull request
Dec 11, 2015
|
|
tacaswell + mdboom |
57c33f0
|
mdboom
added a commit
to mdboom/matplotlib
that referenced
this pull request
Dec 11, 2015
|
|
tacaswell + mdboom |
a977b8f
|
mdboom
added a commit
to mdboom/matplotlib
that referenced
this pull request
Dec 13, 2015
|
|
tacaswell + mdboom |
6583065
|
mdboom
added a commit
to mdboom/matplotlib
that referenced
this pull request
Dec 13, 2015
|
|
tacaswell + mdboom |
164222a
|
tacaswell
added a commit
that referenced
this pull request
Dec 13, 2015
|
|
tacaswell |
17a3a6d
|
This was referenced Dec 14, 2015
QuLogic
added the
Testing
label
Sep 28, 2016
|
Backported via #5652. |
mdboom commentedOct 23, 2015
This is a test to see if baseline images generated on a Mac with the new local build of freetype functionality will work on Travis with 0 tolerance. Fingers crossed.