Fix image comparison #1291
actualImage = actualImage.astype(np.int32)

# calculate the per-pixel errors, then compute the root mean square error
num_values = reduce(operator.mul, expectedImage.shape)
np.prod(expectedImage.shape) would do the trick here (obviously your version works, but feels less numpy-y).
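As a quick illustration of the suggestion (the shape value here is hypothetical), both expressions count the number of values in the array:

```python
import operator
from functools import reduce

import numpy as np

# Hypothetical image shape: rows x columns x RGB channels.
shape = (480, 640, 3)

# The reduce-based version used in the patch...
num_values = reduce(operator.mul, shape)

# ...and the more numpy-y one-liner suggested here give the same count.
print(num_values, int(np.prod(shape)))  # → 921600 921600
```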
Ah much better.
@mgiuca-google: This is really good stuff, thank you! As you can see, I have raised a couple of questions, but in principle I think this will be a beneficial change. As I hinted at in my comment on the original issue, I probably wouldn't call the original image comparison test "broken", just that it has some characteristics which may not be ideal for our image testing requirements. On that basis, I wonder if it is worth us maintaining the two functions side by side, primarily so that other users who may want to do image comparison could decide which algorithm to use. This may be a contentious issue, as inevitably it will increase the amount of code that mpl has to maintain...

One nitpick observation: you have built on code which is obviously not PEP8 compliant, resulting in your own code not being strictly PEP8 compliant (although you have followed the guiding principle: "A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is most important."). I would certainly find it an improvement if you were to rename the variables you have added/touched to be more PEP8-y (i.e. from

On the whole, pretty awesome!
Just something I have come across today in my work that might be relevant is the MapReady toolkit: http://www.asf.alaska.edu/downloads/software_tools In it, there is a program called "diffimage" (which, because this is a geoprocessing tool, does a bit more than we are looking for), but it has the following description:

So, what is interesting is the use of the Fourier transform as part of the image-differencing technique. Don't know if that might be an interesting avenue to pursue or not. Cheers!
On Fri, Sep 21, 2012 at 6:44 PM, Benjamin Root notifications@github.com wrote:

Interesting! Good find.

Damon McDougall
Thanks for your comments, @pelson. I have taken care of them.
I have renamed

@WeatherGod good find. I will have a look at that tool later on. The main improvement I'd be interested in over the RMSE algorithm I implemented is whether it can detect minor pixel shifts and assign a small penalty (whereas RMS assigns a large penalty because it just thinks that all of the pixels have changed). It sounds like step 2 (lining up the two images) is designed to solve this, but again, we need to be able to deal with sub-image shifts, not just whole-image shifts.

The new test cases Phil suggested that I add are helpful in judging this requirement. They currently output 22 and 13 respectively. I'd expect them to output some positive value, but much smaller, perhaps about 4 and 2, respectively.
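To see the pixel-shift penalty concretely, here is a minimal sketch (rms_error is an illustrative stand-in for the patched comparison, not its exact code):

```python
import numpy as np

def rms_error(expected, actual):
    # Cast to a signed type first so uint8 subtraction cannot wrap around.
    diff = expected.astype(np.int16) - actual.astype(np.int16)
    return np.sqrt(np.mean(diff.astype(np.float64) ** 2))

# A high-contrast stripe pattern shifted sideways by one pixel: every
# pixel lands on the opposite shade, so per-pixel RMS reports a maximal
# error even though the image content has barely moved.
stripes = np.zeros((8, 8), dtype=np.uint8)
stripes[:, ::2] = 255
shifted = np.roll(stripes, 1, axis=1)
print(rms_error(stripes, shifted))  # → 255.0
```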
@WeatherGod wrote:
I'm not sure if you're advocating using this tool or just borrowing the idea. If you meant the former, I had a brief look at the license agreement and it is incompatible with Matplotlib. It seems to be basically the BSD license, but with the additional BSD-incompatible clause:
@mgiuca-google - if you wouldn't mind rebasing this, I'd like to see if we can get this merged in the next couple of weeks. Previous commenters from #1287 were @mdboom, @dmcdougall and @WeatherGod, so ideally we would get either a 👍, 👎 or an explicit abstention from those before we actually press the merge button (other commenters more than welcome too!). Cheers,
I'm definitely in favor of this in principle. Once this is rebased and we have something to test against, I'd like to kick the tires one more time (since accidentally breaking the test suite would be a major problem). Assuming all that goes well, I'd say this is good to go.
succeed if compare_images succeeds. Otherwise, the test will succeed if
compare_images fails and returns an RMS error almost equal to this value.
"""
from nose.tools import assert_almost_equal |
It would be better to have those imports at the top of the file
Done. Thanks for spotting.
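The docstring in the hunk above describes an "expected failure" pattern; here is a sketch of how such a check might look (check_expected_rms and the dict shape are hypothetical illustrations, not the PR's actual code):

```python
def check_expected_rms(result, expected_rms, places=4):
    """Hypothetical helper mirroring the docstring above.

    `result` stands in for a compare_images()-style return value: None on
    success, or a dict containing an 'rms' entry on failure.
    """
    if expected_rms is None:
        # No expected RMS: the comparison itself must have succeeded.
        assert result is None, 'comparison unexpectedly failed'
    else:
        # Otherwise the comparison must fail, with an RMS error almost
        # equal to the expected value.
        assert result is not None, 'comparison unexpectedly passed'
        assert round(result['rms'] - expected_rms, places) == 0

check_expected_rms(None, None)               # clean comparison passes
check_expected_rms({'rms': 16.00001}, 16.0)  # expected failure passes
```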
I agree with this. The negotiation @mgiuca-google mentions in this PR message should, I think, be carefully considered, particularly given the issues we have comparing images of rasterised text across different versions of FreeType.
Hey guys,

Thanks for bumping this, Phil. I've done a merge up to the current head. (You said "rebase" and I'm not sure if you actually prefer a rebase instead of a merge -- I'm personally not a fan, but if you really want a rebase, let me know and I'll do that.) You guys should be able to pull this branch and run the tests.

The other change I made was in commit 5e22c11: I deleted the section which Michael added in 1283feb, which retries a failed comparison after removing any pixel differences of 1. This change was presumably made to work around the fact that if you have a lot of pixels with a slightly different colour, you will get a big error, such as in my all-127 versus all-128 case. My branch fixes that issue, so I don't think we need this extra case. Let me know if there is another good reason for it.

Now this still fails a lot of tests due to RMS failures. As I said in the original PR, we will have to go through and update either the expected output or the tolerance for each test. I can do this, but it would be good to come to a policy decision first.
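For the all-127 versus all-128 case mentioned above, a per-pixel RMS (sketched here with an illustrative helper, not the patch's exact code) reports only a small error, which is why the retry special case is no longer needed:

```python
import numpy as np

def rms_error(expected, actual):
    # Signed cast avoids uint8 wraparound when subtracting.
    diff = expected.astype(np.int16) - actual.astype(np.int16)
    return np.sqrt(np.mean(diff.astype(np.float64) ** 2))

# Every pixel is off by exactly one shade: the error is 1.0 regardless
# of how many pixels differ, so a small tolerance already absorbs it.
a = np.full((16, 16, 3), 127, dtype=np.uint8)
b = np.full((16, 16, 3), 128, dtype=np.uint8)
print(rms_error(a, b))  # → 1.0
```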
That is perfectly reasonable - we don't want you doing a lot of tedious work if it only goes stale again. I'm confident that if we can blitz through a review of what's here this week, @mgiuca-google can then go through rms values next (or if there are other volunteers to help with that process, it can be shared appropriately).
I did mean rebase, which is generally our preferred way of bringing branches up to date, but the reason why this is preferred over merge eludes me (for a linear history on master???). I'm sure others can fill in the details on that, and on whether or not to undo the merge and rebase instead. Cheers,
Yes -- we definitely want a rebase, not a merge. The merge creates clutter in the history, and it makes it look like the old master is not the "trunk".

Why are there more RMS failures with this change? The images should either be identical to the input (in which case they pass) or any differences should be handled by this new algorithm. If not, then updating the baselines will only cause the comparisons to work for you but fail for me (who produced most of the existing baseline images). Or am I missing something?

I hope to find some time shortly to check this out and poke at it a bit.
Ok -- I see what's happening. It seems like most of these tests are failing due to a subtle text positioning improvement, which shows up mostly in the vector formats. I don't see any failures that look problematic -- in fact, one failure is due to a baseline image still showing an old bug.

I think the thing to do here is "reset" all of the baselines by updating all of them. I'll file a PR against this one shortly to do just that.
@mgiuca-google: github won't let me file a PR against your repo (???). Perhaps you could just manually merge
Well, for what it's worth, if you merge the branch into master with

Thanks for going to the effort of resetting all of those images. I wasn't sure you'd want to do that, but I think it's the best outcome. I was able to manually merge it, but it doesn't seem directly relevant to my branch. Wouldn't it be better to cherry-pick fb68c58 into master (since it should not break with the existing comparison function)? Then this branch is just about fixing the comparison function, and not the images themselves.
Okay, I have done the rebase. Now all of my commits are applied to the current HEAD. I am not sure whether I've done a "rebase" as you intended, though. Did you just want my commits applied to HEAD, or did you actually want me to go back through the history and fix up the commits so that they are all in logical order and pristine? For example, removing the "Use int16 instead of int32 arrays" commit and just using int16 from the start. Also, should it be the case that the tests pass on all of the commits (so, don't commit a failing test case before fixing the code)?

I'm just trying to get an idea of what style of branch you want to accept. If you intend to do a merge --no-ff, then it shouldn't matter if the history is a bit buggy, as long as the final product is fine. If you intend to do a fast-forward merge, then all of the commits need to be sensible.
@@ -316,37 +300,25 @@ def compare_images( expected, actual, tol, in_decorator=False ):
# open the image files and remove the alpha channel (if it exists)
expectedImage = _png.read_png_int( expected )
actualImage = _png.read_png_int( actual )
expectedImage = expectedImage[:,:,:3]
actualImage = actualImage[:,:,:3]
Hmmm. Is there a reason for not taking the alpha channel into account?
I think for the vast majority of our tests, it doesn't matter. But it's conceivable it might for one testing that the background of the figure is transparent for example. It probably saves some time, which is important when considering how long the tests currently take to run.
Note also that for PDF comparisons, Ghostscript never gives an alpha channel, so it's completely redundant there.
Maybe it should be a kwarg compare_alpha (defaulting to False)? I wouldn't want that to hold up this PR because of it.
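What the slicing in the hunk above does, shown on a hypothetical RGBA array:

```python
import numpy as np

# A 2x2 RGBA image; the fourth channel is alpha.
rgba = np.zeros((2, 2, 4), dtype=np.uint8)
rgba[..., 3] = 255  # fully opaque

# Dropping the alpha channel, as the diff does:
rgb = rgba[:, :, :3]
print(rgba.shape, rgb.shape)  # → (2, 2, 4) (2, 2, 3)
```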
Updating the PR (note: I didn't rebase again but I will after further discussion). Here are the remaining issues. Let me know if I've missed some:
I agree with that approach. I'm prepared to accept that some developers' test suites (depending on machine/OS) will fail after merging this PR - it's easy for us to iteratively increase tolerances as needed.

Perhaps, in an ideal world, we would do well to be able to specify regions of different tolerances in the same image. But not here. 😄
Excellent saving! Thanks for doing this.
@@ -299,8 +282,9 @@ def compare_images( expected, actual, tol, in_decorator=False ):
= INPUT VARIABLES
- expected The filename of the expected image.
- actual The filename of the actual image.
- tol The tolerance (a unitless float). This is used to
  determine the 'fuzziness' to use when comparing images.
- tol The tolerance (a colour value difference, where 255 is the
That's my kind of spelling - but is probably inconsistent with the rest of the docs/codebase. Would you mind dropping the "u" from colour? (It feels very alien asking you to do this...)
@pelson in fact:

```
nvaroqua@u900-bdd-1-156t-6917:~/Projects/matplotlib$ git grep color | wc
  15543   81737 1965493
nvaroqua@u900-bdd-1-156t-6917:~/Projects/matplotlib$ git grep colour | wc
     59     396    4876
```
That's probably my fault...
Changed. (Don't worry, I'm used to writing "color" to be consistent with the code around me. I'm usually less careful in comments, but fixed for consistency.)
This tests the image comparison function itself. Currently, all three cases fail, due to a buggy comparison algorithm. In particular, test_image_compare_scrambled shows the algorithm massively under-computing the error, and test_image_compare_shade_difference shows the algorithm massively over-computing the error.
(This regressed when the PIL code was migrated to numpy.)
The previous implementation did not compute the RMS error. It computed the RMS of the difference in the number of occurrences of each colour value (a histogram comparison, rather than a per-pixel one). While this computes 0 for equal images, it is incorrect in general. In particular, it does not detect differences in images with the same pixels in different places. It also cannot distinguish small changes in the colour of a pixel from large ones.
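To make the flaw concrete, here is an illustrative paraphrase (not the codebase's exact code) comparing a histogram-based RMS with a per-pixel one, on two images that contain the same pixel values in different places:

```python
import numpy as np

def histogram_rms(expected, actual):
    # Old approach (paraphrased): compare counts of each colour value.
    he = np.bincount(expected.ravel(), minlength=256)
    ha = np.bincount(actual.ravel(), minlength=256)
    return np.sqrt(np.mean((he - ha).astype(np.float64) ** 2))

def pixel_rms(expected, actual):
    # New approach (paraphrased): compare corresponding pixels directly.
    diff = expected.astype(np.int16) - actual.astype(np.int16)
    return np.sqrt(np.mean(diff.astype(np.float64) ** 2))

a = np.array([[0, 255], [255, 0]], dtype=np.uint8)
b = np.array([[255, 0], [0, 255]], dtype=np.uint8)  # same pixels, scrambled

print(histogram_rms(a, b))  # → 0.0 (difference goes undetected)
print(pixel_rms(a, b))      # → 255.0
```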
This was arbitrary and made no sense. Increased all tolerances by a factor of 10000. Note that some are ridiculously large (e.g., 200 out of 255).
…and sub-image 1-pixel offset.
…ly 1." This was introduced in 1283feb, presumably to hack around the fact that 1-pixel differences can make a very large error value. This is not necessary any more, since the root cause has been fixed.
…is not available.
…line images derived from basn3p02 in pngsuite tests. These are much smaller images than the cosine tests.
…evant under the new algorithm. Added a few new tolerance values, for output I am seeing that is valid but slightly different to the baseline image.
This is possible due to anti-aliasing.
I have rebased from master and there was just one image that had changed in the meantime and needed to be regenerated (
The Travis failure on 2.6 appears to be due to a network failure, not really any fault of ours. Ideally, we'd do something to push Travis to try again, but I'm reasonably confident that we are OK here, given that we have 2.7 and 3.2 working.
I'm just going to bite the bullet and merge this. I'm reasonably confident that the 2.6 test will pass once this is merged. Thanks for all of this work -- I know this was a long-lived PR for a change so pervasive and fundamental to our workflow, but I think it represents a real improvement.
Sweet! Thanks for dealing with this, Michael. It's a relief to have it done.

Indeed. Very nice work @mgiuca-google - thanks for this.
Fixes the compare_image RMS calculation algorithm, so that it computes the RMS of the difference between corresponding pixels, as opposed to the RMS of the histograms between the two images.
See discussion on Issue 1287.
Note: This is not yet ready to merge, since it breaks a lot of tests. Some negotiation is required to figure out whether to update the expected output for each test, or bump up the tolerance.