Lower test tolerance #5307

Merged
merged 16 commits into matplotlib:master from mdboom/lower-tolerance Dec 10, 2015

Conversation

8 participants
Owner

mdboom commented Oct 23, 2015

This is a test to see if baseline images generated on a Mac with the new local build of freetype functionality will work on Travis with 0 tolerance. Fingers crossed.

mdboom added the needs_review label Oct 23, 2015

Owner

mdboom commented Oct 26, 2015

In some ways this experiment was a success. The failures are not due to text differences, but to a bunch of other differences. Not clear whether it's a Mac vs. Linux issue, just general non-reproducibility, or what...

Member

QuLogic commented Oct 26, 2015

I have been going through updating Cartopy tests so I've probably seen many of these changes. Probably most of them will be due to a small change in the antialiasing at the corner of plots, and possibly the upgrade to a newer Agg in 2a17839.

Owner

mdboom commented Oct 26, 2015

I don't think the Agg change is the culprit. There's a certain group of tests that just generate slightly different results from run to run on the same machine/platform/environment. Some are obvious (like the random number generator not being reset in the xkcd test), but some are still an enigma.

Member

QuLogic commented Oct 26, 2015

Oh, you mean after updating the result? Yea, that wouldn't be the Agg change then.

Owner

mdboom commented Oct 27, 2015

Woot! This is passing. (The Python 2.6 failure is on purpose -- I used OrderedDict without backporting it. Didn't want to expend the effort given that we're killing Python 2.6 in a few days).

To review this PR, I suggest clicking on individual commits, except for the one called "Update all test images...". The whole diff here is too large for GitHub to display.

It turns out the source of the random test failures on the same platform/machine was that the spines were stored in a dictionary, so their drawing order was non-deterministic. This creates small one-value differences in two pixels at the corner of the axes. Changing this to an OrderedDict reduced the random failures from around 68 to 6. The remaining ones with inter-run differences involve boxplot (which I think is a known issue) and axes_grid1. For those, I just turned the tolerance up slightly and filed #5334 to deal with them later. Everything else is now happily running with a tolerance of zero and a direct numpy.array_equal comparison. I think time will tell whether that becomes annoying (as in it catches too many unimportant differences and we end up updating the test images a lot), or whether it gives us more certainty in unit testing.
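The dictionary-ordering issue described above can be sketched as follows (a minimal illustration, not the actual matplotlib code; on CPython before 3.6, plain dict iteration order was arbitrary):

```python
from collections import OrderedDict

# Minimal sketch of the fix: spines were stored in a plain dict, whose
# iteration order was arbitrary on CPython before 3.6, so the order in
# which spines were drawn (and hence a couple of corner pixels) could
# vary from run to run. An OrderedDict always iterates in insertion
# order, making the draw order deterministic.
spines = OrderedDict()
for name in ('left', 'right', 'bottom', 'top'):
    spines[name] = object()  # stand-in for a real Spine artist

# Drawing loops that iterate over the container now see a fixed order.
draw_order = list(spines)
print(draw_order)  # ['left', 'right', 'bottom', 'top']
```

The alternative in #4434 was to sort the keys at draw time, which achieves the same determinism at the cost of a sort in every draw loop.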

I still would advise against merging this until some known-to-change-the-baseline-images PRs are merged first, especially #5301 and #5214. Though #5306 (which just turns on the testing version of freetype without changing the tolerances) can be merged without harm at any time.

Member

WeatherGod commented Oct 27, 2015

Very interesting and great work!

Question/Devil's Advocate: If I understand this correctly, there is now a disconnect between the images produced in the testing suite and images produced via normal means (modulo any default style differences). The freetype version is only enforced during the unit tests, right? Wouldn't that make it difficult to track down bugs that are reported by users that have roots in different versions of freetype?

Furthermore, do we want to eliminate fuzzy matching completely? Would it make sense to keep it and make it possible for users to run the test suite with their system's freetype? Maybe even record how much of a difference occurs with different system configurations? Again, this is all devil's advocate, and I am really pleased that we can tighten down these knobs.

Owner

mdboom commented Oct 27, 2015

Question/Devil's Advocate: If I understand this correctly, there is now a disconnect between the images produced in the testing suite and images produced via normal means (modulo any default style differences). The freetype version is only enforced during the unit tests, right?

Yes. Most end users will continue to build against their system freetype or conda's freetype, etc., as they always have. That said, there's no harm in using the special testing freetype if they want to -- but most packagers, esp. Debian, aren't going to do that.

Wouldn't that make it difficult to track down bugs that are reported by users that have roots in different versions of freetype?

Possibly. But I don't see that as a big issue: I've never seen an issue of that type appear, where the version of freetype was causing some sort of true problem. I've only ever seen small differences in the antialiasing of fonts that cause problems when comparing test images, but otherwise appear fine to a human being. Those are the kind of differences I would not consider a bug, but the natural refinement of the details in freetype over the years.

Furthermore, do we want to eliminate fuzzy matching completely?

The functionality is still there, and individual tests may still opt in on a per-test basis. (In fact, the 6 tests mentioned in #5334 do.) The only change is that the default is 0. I think it's possible that we may enable it for some more tests over time if we learn that exact matching isn't important for them.
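As a rough illustration of what a zero-tolerance comparison means in practice (a sketch, not matplotlib's actual comparison machinery, which lives in matplotlib.testing):

```python
import numpy as np

# With zero tolerance, the expected and actual images must be
# bit-for-bit identical.
expected = np.zeros((4, 4, 3), dtype=np.uint8)
actual = expected.copy()
exact_match = np.array_equal(expected, actual)   # identical arrays

# A single one-value pixel difference is enough to fail the test.
actual[0, 0, 0] += 1
still_match = np.array_equal(expected, actual)   # no longer equal
```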

Would it make sense to keep it and make it possible for users to run the test suite with their system's freetype? Maybe even record how much of a difference occurs with different system configurations?

That's an interesting question. At present (thanks to feedback from @jkseppan), they can still run the tests, but many of the tests will fail due to inexact matching. One possibility is to detect the use of a system freetype and turn the default tolerance up to some small but non-zero value. However, my worry there is that we need to communicate that very clearly to all developers, or there will be mismatches when they commit new test images to the repository. I deliberately took a hard line on this because I want to keep the images working well.

We could somehow track this much like one tracks code coverage. I don't know how you do that within nose though without building a bunch of scaffolding on top.

It also might not hurt to do "smoke tests" to just compile with various versions of freetype and make sure there's no API changes that bite us (though freetype is old and super stable wrt API at this point, and it's not like it's a small project that no one much notices anymore).

Again, this is all devil's advocate, and I am really pleased that we can tighten down these knobs.

Yeah -- there have been a number of bugs in the past few weeks that slipped right through the unit tests that we would have found if they were less tolerant. That's my real motivator here.

Member

zblz commented Oct 28, 2015

This is great work and will make things much easier! I haven't reviewed the commits, but I found that running python setup.py develop on a clean repo (i.e., before python setup.py build) will fail because it tries to download freetype to the non-existent build directory.

Owner

tacaswell commented Oct 29, 2015

The spine order is addressed by #4434 as well.

Owner

mdboom commented Oct 29, 2015

The spine order is addressed by #4434 as well.

Oh, good to know. The OrderedDict approach is maybe slightly better in that you don't have to remember to sort in all the places you might iterate over the spines, but it's a minor enough point. We should be sure not to solve it both ways when this or that PR is merged.

Owner

tacaswell commented Oct 29, 2015

I would much rather solve this with ordered dicts than by sorting on every draw.

Member

zblz commented Oct 29, 2015

@mdboom -- this should be merged after #5301 (which changes baseline images), but before #5214 (which adds new images and therefore would benefit from the fixed freetype). Therefore, the priority would be getting #5301 in, which is a small PR, and then this one.

Owner

mdboom commented Oct 29, 2015

Yes -- that makes sense.

mdboom added this to the next major release (2.0) milestone Oct 30, 2015

Owner

mdboom commented Dec 5, 2015

Now, all of a sudden we have a bunch of failing tests I can't reproduce locally. The vast majority are SVG-only, which is perhaps a clue.

Member

zblz commented Dec 6, 2015

I could not reproduce the Travis failures either. Travis is using Inkscape 0.48 from Ubuntu, whereas on my Debian testing box I have Inkscape 0.91. There might be some differences in the PNG generation between the versions.

mdboom referenced this pull request Dec 7, 2015

Closed

Deterministic SVG output #4434

Owner

mdboom commented Dec 9, 2015

The SVG problem appears to be Python 2.7-only and is due to a different number of digits being written out for the numbers in transforms on 2.7 vs. 3.4, where I generated the images. Using '%f' % x rather than str(x) should hopefully give us consistent results everywhere. Stay tuned...

Owner

mdboom commented Dec 9, 2015

Now we're down to this Python 2 vs 3 issue:

From What's New in Python 3:

The round() function rounding strategy and return type have changed. Exact halfway cases are now rounded to the nearest even result instead of away from zero. (For example, round(2.5) now returns 2 rather than 3.)

Never noticed this one before. I think we'll need to use numpy.round (which uses the new behavior, even on Python 2) to get consistent behavior.
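The difference can be demonstrated with numpy.round, which uses round-half-to-even ("banker's rounding") regardless of the Python version, matching Python 3's built-in round():

```python
import numpy as np

# numpy.round rounds exact halves to the nearest even value on both
# Python 2 and Python 3, unlike Python 2's round(), which rounds
# halves away from zero.
r25 = np.round(2.5)  # nearest even: 2.0, not 3.0
r15 = np.round(1.5)  # nearest even: 2.0
r05 = np.round(0.5)  # nearest even: 0.0
print(r25, r15, r05)
```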

Member

QuLogic commented Dec 9, 2015

'%f' % x rather than str(x)

That also has lower precision; I don't know if that would be a problem, but repr should be consistent:

$ python2
>>> '%f' % (0.123456789012345678901234567890, )
'0.123457'
>>> str(0.123456789012345678901234567890)
'0.123456789012'
>>> repr(0.123456789012345678901234567890)
'0.12345678901234568'

$ python3
>>> '%f' % (0.123456789012345678901234567890, )
'0.123457'
>>> str(0.123456789012345678901234567890)
'0.12345678901234568'
>>> repr(0.123456789012345678901234567890)
'0.12345678901234568'
Owner

efiring commented Dec 9, 2015

If all numbers are written out using repr, won't the file size balloon?

Member

QuLogic commented Dec 9, 2015

That is true, but based on 39d9e9a, I don't think it applies to all numbers in the file. In any case, I'm only pointing out that '%f' would be a lower precision than str, not that repr is the optimal choice.

Owner

mdboom commented Dec 9, 2015

I actually don't think any more precision than %f is needed -- these are pixel positions. I chose it specifically because it produced the shortest output of the easily available options.

Owner

mdboom commented Dec 9, 2015

🎉 🎈 🍬 🍻

We are now officially green. This is ready for a final review and merge (look at individual commits, since the changeset is otherwise too large to view).

mdboom changed the title from WIP: Lower test tolerance to Lower test tolerance Dec 9, 2015

Member

QuLogic commented Dec 9, 2015

Oh, looking through the changes, it is possible that SVG file size will still increase, because something like 0.5 passed through '%f' will create 0.500000.

Owner

mdboom commented Dec 9, 2015

Oh, looking through the changes, it is possible that SVG file size will still increase, because something like 0.5 passed through '%f' will create 0.500000.

That's a good point. We could do something like ('%f' % x).rstrip('0').
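A sketch of that trim, with one caveat worth flagging: for whole numbers rstrip('0') leaves a dangling decimal point, so stripping a trailing '.' as well avoids output like '3.' (the helper name here is hypothetical, not matplotlib's actual formatter):

```python
def short_float(x):
    # '%f' gives six decimal places; strip trailing zeros, then any
    # dangling decimal point ('3.000000' -> '3.' -> '3').
    return ('%f' % x).rstrip('0').rstrip('.')

half = short_float(0.5)    # '0.5'  (was '0.500000')
whole = short_float(3.0)   # '3'    (was '3.000000')
print(half, whole)
```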

Member

QuLogic commented Dec 10, 2015

Is there a short summary of the 6 tests that were bumped up from 0 tolerance?

Owner

mdboom commented Dec 10, 2015

Is there a short summary of the 6 tests that were bumped up from 0 tolerance?

They have as-yet-undetermined nondeterminism in them. I'll create a new issue for those, but I don't think we should hold this up any longer trying to investigate them right now.

Why the change from 'a' -> '0'?

Owner

mdboom replied Dec 10, 2015

Sorry -- the unfortunate need to rebase (given the size of this PR) obliterated my earlier comment.

This test creates two figures -- one for normal usage and one for the labeled-data approach -- but saves them to the same file, so the results must match exactly. In the labeled-data example, tick_label is replaced with data['a'], which is 0. So strictly speaking, only the first example needed to be changed to 0, or we could change both to FOO (something other than a). This wasn't caught when labeled-data support was first added because the a and the 0 in the output image weren't different enough to go above the tolerance.

Owner

tacaswell commented Dec 10, 2015

Sorry for asking lots of annoying picky questions.

@tacaswell tacaswell added a commit that referenced this pull request Dec 10, 2015

@tacaswell tacaswell Merge pull request #5307 from mdboom/lower-tolerance
Lower test tolerance
61f0eea

@tacaswell tacaswell merged commit 61f0eea into matplotlib:master Dec 10, 2015

3 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
coverage/coveralls Coverage increased (+0.5%) to 68.25%

tacaswell removed the needs_review label Dec 10, 2015

Owner

mdboom commented Dec 10, 2015

\o/ Thanks for merging!

Owner

mdboom commented Dec 10, 2015

Are you going to backport to 2.0.x, or should I?

Owner

tacaswell commented Dec 10, 2015

doing it now

Owner

mdboom commented Dec 10, 2015

Maybe I'm too late with this comment, but it just occurred to me that it might be worth doing a PR of the backport just because of the high potential for breakage here...

Owner

tacaswell commented Dec 10, 2015

Not too late, will do.

Owner

tacaswell commented Dec 10, 2015

It does not backport cleanly...

Owner

mdboom commented Dec 10, 2015

OK -- I can take a crack at it if the conflicts are non-obvious.

Owner

tacaswell commented Dec 10, 2015

I'll leave this to you 😈

mdboom referenced this pull request Dec 10, 2015

Merged

Shorter svg files #5651

@mdboom mdboom added a commit to mdboom/matplotlib that referenced this pull request Dec 10, 2015

@tacaswell @mdboom tacaswell + mdboom Merge pull request #5307 from mdboom/lower-tolerance
Lower test tolerance
61cc364

@mdboom mdboom added a commit to mdboom/matplotlib that referenced this pull request Dec 11, 2015

@tacaswell @mdboom tacaswell + mdboom Merge pull request #5307 from mdboom/lower-tolerance
Lower test tolerance
65dc69a

@mdboom mdboom added a commit to mdboom/matplotlib that referenced this pull request Dec 11, 2015

@tacaswell @mdboom tacaswell + mdboom Merge pull request #5307 from mdboom/lower-tolerance
Lower test tolerance
57c33f0

@mdboom mdboom added a commit to mdboom/matplotlib that referenced this pull request Dec 11, 2015

@tacaswell @mdboom tacaswell + mdboom Merge pull request #5307 from mdboom/lower-tolerance
Lower test tolerance
a977b8f

@mdboom mdboom added a commit to mdboom/matplotlib that referenced this pull request Dec 13, 2015

@tacaswell @mdboom tacaswell + mdboom Merge pull request #5307 from mdboom/lower-tolerance
Lower test tolerance
6583065

@mdboom mdboom added a commit to mdboom/matplotlib that referenced this pull request Dec 13, 2015

@tacaswell @mdboom tacaswell + mdboom Merge pull request #5307 from mdboom/lower-tolerance
Lower test tolerance
164222a

@tacaswell tacaswell added a commit that referenced this pull request Dec 13, 2015

@tacaswell tacaswell Merge pull request #5652 from mdboom/backport-lower-tolerance-2.0
Backport #5307 to v2.0.x
17a3a6d

QuLogic added the Testing label Sep 28, 2016

Member

QuLogic commented Oct 16, 2016

Backported via #5652.
