
Fix image comparison #1291

Merged
merged 20 commits into from Feb 28, 2013

Conversation

mgiuca-google
Contributor

Fixes the compare_images RMS calculation, so that it computes the RMS of the difference between corresponding pixels, as opposed to the RMS of the difference between the two images' histograms.

See discussion on Issue 1287.

Note: This is not yet ready to merge, since it breaks a lot of tests. Some negotiation is required to figure out whether to update the expected output for each test, or bump up the tolerance.
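For illustration, a minimal sketch of the per-pixel RMS calculation described above (not the exact code in this PR; the function name and structure are illustrative):

    import numpy as np

    def rms_difference(expected_image, actual_image):
        # Promote to a signed type so the per-pixel subtraction cannot wrap around.
        diff = expected_image.astype(np.int32) - actual_image.astype(np.int32)
        # Root mean square over all pixels and colour channels.
        return np.sqrt(np.mean(diff.astype(np.float64) ** 2))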

actualImage = actualImage.astype(np.int32)

# calculate the per-pixel errors, then compute the root mean square error
num_values = reduce(operator.mul, expectedImage.shape)
Member

np.prod(expectedImage.shape) would do the trick here (obviously your version works, but feels less numpy-y).
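For reference, the two expressions count the same number of values; a toy example with an assumed 10x20 RGB shape:

    import operator
    from functools import reduce

    import numpy as np

    shape = (10, 20, 3)  # a hypothetical 10x20 RGB image
    assert reduce(operator.mul, shape) == np.prod(shape) == 600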

Contributor Author

Ah much better.

@pelson
Member

pelson commented Sep 21, 2012

@mgiuca-google : This is really good stuff, thank you!

As you can see, I have raised a couple of questions, but in principle, I think this will be a beneficial change. As I hinted at in my comment on the original issue, I probably wouldn't call the original image comparison test "broke", just that it has some characteristics which may not be ideal for our image testing requirements. On that basis, I wonder if it is worth us maintaining the two functions side by side, primarily so that other users who may want to do image comparison could decide which algorithm to use. This may be a contentious issue, as inevitably it will increase the amount of code that mpl has to maintain...

One nitpick observation: you have built on code which is obviously not PEP8 compliant, resulting in your own code not being strictly PEP8 compliant (although you have followed the guiding principle: "A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is most important."). I would certainly find it an improvement if you were to rename the variables you have added/touched to be more PEP8-y (i.e. from camelCase to underscored_variables).

On the whole, pretty awesome!

@WeatherGod
Member

Just something I have come across today in my work that might be relevant is the MapReady toolkit: http://www.asf.alaska.edu/downloads/software_tools

In it, there is a program called "diffimage" which (because this is a geoprocessing tool) does a bit more than we are looking for, but it has the following description:

DESCRIPTION:
   1. diffimage calculates image statistics within each input image
      and calculates the peak signal-to-noise (PSNR) between the two
      images.
   2. diffimage then lines up the two images, to slightly better
      than single-pixel precision, and determines if any geolocation
      shift has occurred between the two and the size of the shift.
      Because an fft-based match is utilized it will work with images of
      any size, but is most efficient when the image dimensions are
      near a power of 2.  The images need not be square.
   3. diffimage then compares image statistics and geolocation shift
      (if it occurred) and determines if the two images are different from
      each other or not.
   4. If there are no differences, the output file will exist but will be
      of zero length.  Otherwise, a summary of the differences will be placed
      in both the output file and the log file (if specified.)

So, what is interesting is the use of the Fourier transform as part of the image differentiating technique. Don't know if that might be an interesting avenue to pursue or not. Cheers!

@dmcdougall
Member

Interesting! Good find.

@mgiuca-google
Contributor Author

Thanks for your comments, @pelson. I have taken care of them.

I would certainly find it an improvement if you were to rename the variables you have added/touched to be more PEP8-y (i.e. from camelCase to underscored_variables).

I have renamed absDiffImage and sumOfSquares to PEP-8 style, but I didn't want to touch expectedImage and actualImage since that will make my patch look much bigger than it should.

@WeatherGod good find. I will have a look at that tool later on. The main improvement I'd be interested in over the RMSE algorithm I implemented would be whether it can detect minor pixel shifts and assign a small penalty (whereas RMS assigns a large penalty because it just thinks that all of the pixels have changed). It sounds like step 2 (lining up the two images) is designed to solve this, but again, we need to be able to deal with sub-image shifts, not just whole-image shifts. The new test cases Phil suggested that I add are helpful in judging this requirement. They currently output 22 and 13 respectively. I'd expect them to output some positive value, but much smaller, perhaps about 4 and 2, respectively.
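To make the "large penalty" point concrete, a small illustration (assumed sizes and values, not taken from the test suite):

    import numpy as np

    rng = np.random.RandomState(0)
    img = rng.randint(0, 256, size=(64, 64)).astype(np.int32)
    shifted = np.roll(img, 1, axis=1)  # shift the whole image by one pixel

    rms = np.sqrt(np.mean((img - shifted).astype(np.float64) ** 2))
    # For noisy content this RMS is large (on the order of 100 out of 255), even
    # though a human would call the images nearly identical, whereas an all-127
    # image compared against an all-128 image gives an RMS of exactly 1.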

@mgiuca-google
Contributor Author

@WeatherGod wrote:

Just something I have come across today in my work that might be relevant is the MapReady toolkit:
http://www.asf.alaska.edu/downloads/software_tools

I'm not sure if you're advocating using this tool or just borrowing the idea. If you meant the former, I had a brief look at the license agreement and it is incompatible with Matplotlib. It seems to be basically the BSD license, but with the additional BSD-incompatible clause:

Redistribution and use of source and binary forms are for noncommercial purposes only.

@pelson
Member

pelson commented Jan 14, 2013

@mgiuca-google - if you wouldn't mind rebasing this, I'd like to see if we can get this merged in the next couple of weeks. Previous commenters from #1287 were @mdboom, @dmcdougall, @WeatherGod, so ideally we would get either a 👍, a 👎, or an explicit abstention from those before we actually press the merge button (other commenters more than welcome too!).

Cheers,

@mdboom
Member

mdboom commented Jan 14, 2013

I'm definitely in favor of this in principle. Once this is rebased and we have something to test again, I'd like to kick the tires one more time (since accidentally breaking the test suite would be a major problem). Assuming all that goes well, I'd say this is good to go.

succeed if compare_images succeeds. Otherwise, the test will succeed if
compare_images fails and returns an RMS error almost equal to this value.
"""
from nose.tools import assert_almost_equal
Member

It would be better to have those imports at the top of the file

Contributor Author

Done. Thanks for spotting.
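For context, a hypothetical sketch of the kind of test helper the excerpted docstring belongs to (the name, the signature, and the assumption that compare_images returns a dict with an 'rms' entry on failure when called with in_decorator=True are illustrative, not necessarily the PR's actual code):

    from matplotlib.testing.compare import compare_images
    from nose.tools import assert_almost_equal

    def image_comparison_expect_rms(im1, im2, tol, expect_rms):
        # Succeed if compare_images succeeds (returns None); otherwise expect
        # it to fail and report an RMS error almost equal to expect_rms.
        results = compare_images(im1, im2, tol=tol, in_decorator=True)
        if expect_rms is None:
            assert results is None
        else:
            assert results is not None
            assert_almost_equal(expect_rms, results['rms'], places=4)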

@dmcdougall
Member

I'm definitely in favor of this in principle. Once this is rebased and we have something to test again, I'd like to kick the tires one more time (since accidentally breaking the test suite would be a major problem). Assuming all that goes well, I'd say this is good to go.

I agree with this. I think the negotiation @mgiuca-google mentions in the PR message should be carefully considered, particularly given the issues we have comparing images of rasterised text across different versions of FreeType.

@mgiuca-google
Contributor Author

Hey guys,

Thanks for bumping this, Phil. I've done a merge up to the current head. (You said "rebase" and I'm not sure if you actually prefer a rebase instead of a merge -- I'm personally not a fan but if you really want a rebase, let me know and I'll do that.) You guys should be able to pull this branch and run the tests.

The other change I made was in commit 5e22c11: I deleted the section which Michael added in 1283feb, which retries a failed comparison after removing any pixel differences of 1. This change was presumably made to work around the fact that if you have a lot of pixels with a slightly different colour, you will get a big error, such as in my all-127 versus all-128 case. My branch fixes that issue, so I don't think we need this extra case. Let me know if there is another good reason for it.

Now this still fails a lot of tests due to RMS failures. As I said in the original PR, we will have to go through and update either the expected output, or the tolerance, for each test. I can do this but it would be good to come to a policy decision first.

@pelson
Member

pelson commented Jan 16, 2013

I can do this but it would be good to come to a policy decision first.

That is perfectly reasonable - we don't want you doing a lot of tedious work if it only goes stale again.

I'm confident that if we can blitz through a review of what's here this week, @mgiuca-google can then go through RMS values next (or if there are other volunteers to help with that process, it can be shared appropriately).

I've done a merge up to the current head. You said "rebase" and I'm not sure if you actually prefer a rebase instead of a merge

I did mean rebase, which is generally our preferred way of bringing branches up to date, but the reason why this is preferred over merge eludes me (for a linear history on master???). I'm sure others can fill in the details on that and whether or not to undo the merge and rebase instead.

Cheers,

@mdboom
Member

mdboom commented Jan 16, 2013

Yes -- we definitely want a rebase, not a merge. The merge creates clutter in the history, and it makes it look like the old master is not the "trunk".

Why are there more RMS failures with this change? The images should either be identical to the input (in which case they pass) or any differences should be handled by this new algorithm. If not, then updating the baselines will only cause the comparisons to work for you but fail for me (who produced most of the existing baseline images). Or am I missing something? I hope to find some time shortly to check this out and poke at it a bit.

@mdboom
Member

mdboom commented Jan 16, 2013

Ok -- I see what's happening. It seems like most of these tests are failing due to a subtle text positioning improvement, which shows up mostly in the vector formats. I don't see any failures that look problematic -- in fact one failure is due to a baseline image still showing an old bug. I think the thing to do here is "reset" all of the baselines by updating all of them. I'll file a PR against this one shortly to do just that.

@mdboom
Member

mdboom commented Jan 16, 2013

@mgiuca-google : github won't let me file a PR against your repo (???). Perhaps you could just manually merge mdboom/matplotlib:update_baseline_images.

@mgiuca-google
Contributor Author

Yes -- we definitely want a rebase, not a merge. The merge creates clutter in the history, and it makes it look like the old master is not the "trunk".

Well, for what it's worth, if you merge the branch into the master with git merge --no-ff, you don't get that bad history. The merge commit's first parent will be the previous commit to master, so that anyone doing a git log --first-parent will see only the trunk, and not the individual commits to the branch. Note: I'm still going to do the rebase, since you asked me to, but I still recommend you merge with --no-ff to avoid splatting my branch commits (at this point, dozens) into the trunk.

Thanks for going to the effort of resetting all of those images. I wasn't sure you'd want to do that, but I think it's the best outcome. I was able to manually merge it, but it doesn't seem directly relevant to my branch. Wouldn't it be better to cherry-pick fb68c58 into master (since it should not break with the existing comparison function)? Then this branch is just about fixing the comparison function, and not the images themselves.

@mgiuca-google
Contributor Author

Okay, I have done the rebase. Now all of my commits are applied to the current HEAD.

I am not sure whether I've done a "rebase" as you intended though. Did you just want my commits applied to HEAD, or did you actually want me to go back through the history and fix up the commits so that they are all in logical order and pristine? For example, removing the "Use int16 instead of int32 arrays" commit and just using int16 from the start. Also, should it be the case that the tests pass on all of the commits (so, don't commit a failing test case before fixing the code)? I'm just trying to get an idea of what style of branch you want to accept.

If you intend to do a merge --no-ff, then it shouldn't matter if the history is a bit buggy, as long as the final product is fine. If you intend to do a fast-forward merge, then all of the commits need to be sensible.

@@ -316,37 +300,25 @@ def compare_images( expected, actual, tol, in_decorator=False ):
# open the image files and remove the alpha channel (if it exists)
expectedImage = _png.read_png_int( expected )
actualImage = _png.read_png_int( actual )
expectedImage = expectedImage[:,:,:3]
actualImage = actualImage[:,:,:3]
Member

Hmmm. Is there a reason for not taking the alpha channel into account?

Member

I think for the vast majority of our tests, it doesn't matter. But it's conceivable it might matter for one testing that the background of the figure is transparent, for example. It probably saves some time, which is important when considering how long the tests currently take to run.

Note also that for PDF comparisons, Ghostscript never gives an alpha channel, so it's completely redundant there.

Maybe it should be a kwarg compare_alpha (defaulting to False)? I wouldn't want that to hold up this PR because of it.
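A minimal sketch of that idea (the compare_alpha keyword is hypothetical and not part of this PR):

    def strip_alpha(image, compare_alpha=False):
        # Drop the alpha channel from an RGBA array unless it should be compared.
        if not compare_alpha and image.ndim == 3 and image.shape[2] == 4:
            return image[:, :, :3]
        return image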

@mgiuca-google
Contributor Author

Updating the PR (note: I didn't rebase again but I will after further discussion). Here are the remaining issues. Let me know if I've missed some:

  • PEP8 compliance. I've fixed the issue that was pointed out.
  • Removing the alpha channel. Interesting that the current code says "remove the alpha channel" but doesn't actually do so. I believe it used to do this, but it was lost in one of the refactorings over the past six months. I think that removing the alpha channel is correct at the moment, because none of the images have alpha and we want to make sure we can compare two images if one has alpha and the other does not. I'm happy to add a compare_alpha argument, but probably not in this PR.
  • Automatically generating some of the expected output images. I don't really want to generate the image algorithmically because a) that makes the tests non-deterministic (since it is a random scramble), and b) it involves lots of new code in the testing infrastructure with new ways to go wrong. The new images total 1.6 MB which is fairly hefty. Instead, I've replaced that rather large image with a tiny one that makes the same point, for a total of 4.3 kB of new images. Note that the old images are still in my commit history, but I'll rebase them away before we merge the PR.
  • The per-test tolerance settings. I decided to reset all of the custom tolerances since they are basically meaningless against the new algorithm (and some of them were huge). Obviously it's hard to choose an appropriate tolerance without having lots of computers to test it on, so maybe we can just set them low to begin with and creep them up as necessary. I've set the default tolerance to 10 out of 255, which allows for small changes in the kerning. For images with lots of text, the tolerance may need to be higher (for example, one image gives me an RMS error of 13.3 compared to Michael's updated baseline image). I'm getting huge RMS errors on the mathtext (presumably because Michael and I have different TeX renderers) -- in some cases up to 50 out of 255. I've set the tolerance to 50 for those tests, but this worries me, because with such a high tolerance, it won't detect a lot of problems with those images.

@pelson
Member

pelson commented Jan 18, 2013

Obviously it's hard to choose an appropriate tolerance without having lots of computers to test it on, so maybe we can just set them low to begin with and creep them up as necessary.

I agree with that approach. I'm prepared to accept that some developers' test suites (depending on machine/OS) will fail after merging this PR - it's easy for us to iteratively increase tolerances as needed.

I've set the tolerance to 50 for those tests, but this worries me, because with such a high tolerance, it won't detect a lot of problems with those images.

Perhaps, in an ideal world, we would do well to be able to specify regions of different tolerances in the same image. But not here. 😄

@pelson
Member

pelson commented Jan 18, 2013

for a total of 4.3 kB of new images

Excellent saving! Thanks for doing this.

@@ -299,8 +282,9 @@ def compare_images( expected, actual, tol, in_decorator=False ):
= INPUT VARIABLES
- expected The filename of the expected image.
- actual The filename of the actual image.
- tol The tolerance (a unitless float). This is used to
determine the 'fuzziness' to use when comparing images.
- tol The tolerance (a colour value difference, where 255 is the
Member

That's my kind of spelling - but is probably inconsistent with the rest of the docs/codebase. Would you mind dropping the "u" from colour? (It feels very alien asking you to do this...)

Member

@pelson in fact:

nvaroqua@u900-bdd-1-156t-6917:~/Projects/matplotlib$ git grep color | wc
   15543   81737 1965493
nvaroqua@u900-bdd-1-156t-6917:~/Projects/matplotlib$ git grep colour | wc
    59     396    4876

Member

That's probably my fault...

@mgiuca-google
Contributor Author

Changed. (Don't worry, I'm used to writing "color" to be consistent with the code around me. I'm usually less careful in comments, but fixed for consistency.)

mgiuca-google and others added 20 commits February 27, 2013 11:48
This tests the image comparison function itself. Currently, all three cases
fail, due to a buggy comparison algorithm.

In particular, test_image_compare_scrambled shows the algorithm massively
under-computing the error, and test_image_compare_shade_difference shows the
algorithm massively over-computing the error.
(This regressed when the PIL code was migrated to numpy.)
The previous implementation did not compute the RMS error. It computed the RMS
in the difference of the number of colour components of each value. While this
computes 0 for equal images, it is incorrect in general. In particular, it does
not detect differences in images with the same pixels in different places. It
also cannot distinguish small changes in the colour of a pixel from large ones.
This was arbitrary and made no sense. Increased all tolerances by a factor of
10000. Note that some are ridiculously large (e.g., 200 out of 255).
…ly 1."

This was introduced in 1283feb, presumably to hack around the fact that 1-pixel
differences can make a very large error value. This is not necessary any more,
since the root cause has been fixed."
…line images

derived from basn3p02 in pngsuite tests.
These are much smaller images than the cosine tests.
…evant under the new algorithm.

Added a few new tolerance values, for output I am seeing that is valid but slightly different to the baseline image.
This is possible due to anti-aliasing.
@mgiuca-google
Contributor Author

I have rebased from master and there was just one image that had changed in the meantime and needed to be regenerated (test_legend/legend_various_labels.png). I've regenerated that image with anti-aliased text and rolled that into mdboom's "Update all of the images to use antialiased text" commit. All tests seem to pass; we'll see what Travis says.

@mdboom
Member

mdboom commented Feb 27, 2013

The Travis failure on 2.6 appears to be due to a network failure, not really any fault of ours. Ideally, we'd do something to push Travis to try again, but I'm reasonably confident that we are ok here, given that we have 2.7 and 3.2 working.

@mdboom
Member

mdboom commented Feb 28, 2013

I'm just going to bite the bullet and merge this. I'm reasonably confident that the 2.6 test will pass once this is merged.

Thanks for all of this work -- I know this was a long-lived PR for being so pervasive and fundamental to our workflow, but I think it represents a real improvement.

mdboom added a commit that referenced this pull request Feb 28, 2013
@mdboom mdboom merged commit f5d86b1 into matplotlib:master Feb 28, 2013
@mgiuca-google
Contributor Author

Sweet! Thanks for dealing with this Michael. It's a relief to have it done.

@mgiuca-google mgiuca-google deleted the fix-image-comparison branch February 28, 2013 02:27
@pelson
Member

pelson commented Feb 28, 2013

Sweet!

Indeed. Very nice work @mgiuca-google - thanks for this.
