
compare_images computes RMS incorrectly #1287

Closed
mgiuca-google opened this issue Sep 20, 2012 · 8 comments

@mgiuca-google
Contributor

TL;DR: The image comparison algorithm is broken. I have a patch branch, but it isn't ready because it makes some test cases fail and I would like some input on how to proceed.

The matplotlib.testing.compare.compare_images function, which compares images for the purpose of testing, implements a particularly poor image comparison algorithm which, among other things, is unable to detect movement of pixels between two images. It claims to be calculating the root mean square (implicitly, by the use of the variable name "rms"), but what it computes is not the RMS difference between the images at all; it is the RMS difference between their histograms.

The algorithm currently works this way. To compare expectedImage and actualImage:

For each colour component c in (r, g, b):
    Build a histogram of the c values in expectedImage
    Build a histogram of the c values in actualImage
    Find the root mean square of the difference between the two histograms
Sum the three root mean square values

To explain this, let's make the simplifying assumption that the image is greyscale. The algorithm counts, in each image, the total number of pixels with the value 0, the total number with the value 1, and so on up to 255. It then considers the images to mismatch if these counts are significantly different.
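In NumPy terms, it is doing something roughly like this (a simplified sketch of my reading of the code, not the code itself; the function name is mine):

```python
import numpy as np

def histogram_rms(expected, actual):
    """Sketch of the current comparison: per-channel histogram RMS."""
    total = 0.0
    for c in range(3):  # r, g, b
        h_expected, _ = np.histogram(expected[..., c], bins=256, range=(0, 256))
        h_actual, _ = np.histogram(actual[..., c], bins=256, range=(0, 256))
        # RMS of the per-bucket count differences; the unit is "number of pixels"
        diff = (h_expected - h_actual).astype(float)
        total += np.sqrt(np.mean(diff ** 2))
    return total
```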

Here are two specific examples of where this approach fails epically. Firstly, imagine you have two images that are extremely similar, but all of the pixels differ by exactly 1 value. For example, all pixels are 127 in the first image, all pixels are 128 in the second. This algorithm will consider these two images to be maximally different, because the pixels are in totally different histogram buckets.

Secondly, imagine you take an image, and then take all of the pixels and randomly shuffle them around. The algorithm will consider these two images to be exactly identical, because even though the pixels are completely messed up, they generate the exact same histograms. A real-world example of where this will not catch a failure is if a label is, for some reason, printed at the completely wrong coordinates.

The correct approach is to take the RMS of the differences between each pair of corresponding pixels. To do this, use the following algorithm:

Subtract the two images from one another, and take the absolute value
Find the root mean square of the delta image

This can be optimised using a histogram (take the histogram of the delta image, and then for each bucket, consider there to be that many pixels of that value). I have prepared a patch which fixes this issue.
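Here is a minimal sketch of both forms (illustrative only; the names are mine and this is not the exact code in my branch):

```python
import numpy as np

def pixelwise_rms(expected, actual):
    """RMS of the per-pixel colour differences; worst case is 255 for 8-bit images."""
    delta = expected.astype(float) - actual.astype(float)
    return np.sqrt(np.mean(delta ** 2))

def pixelwise_rms_via_histogram(expected, actual):
    """Same result, computed from a histogram of the absolute delta image."""
    delta = np.abs(expected.astype(np.int64) - actual.astype(np.int64))
    counts, _ = np.histogram(delta, bins=256, range=(0, 256))
    values = np.arange(256, dtype=float)
    return np.sqrt(np.sum(counts * values ** 2) / counts.sum())
```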

Note that my algorithm produces identical results to ImageMagick's compare -metric rmse command (once you remove the alpha channels).

The other problem with the current algorithm is that the units are meaningless. The unit of measurement is technically "absolute number of pixels", which means that the magnitude is determined by the image size. For a 10x10 image, the worst RMS possible is 100; for a 640x480 image, it is 307200. This makes it impossible to set a reasonable tolerance value. In the corrected solution, the unit of measurement is "colour value difference", where the worst RMS possible is 255. To cope with this change, I have dropped the division by 10000 and multiplied all of the tolerances by 10000 -- I found that 10 (the default) makes a reasonable tolerance under the new scheme.
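Under that scheme, a test would call the comparison with a tolerance expressed in colour-value units, for example (assuming compare_images keeps its signature of (expected, actual, tol); the paths here are purely illustrative):

```python
from matplotlib.testing.compare import compare_images

# compare_images takes two file paths and a tolerance; it returns None when the
# images match within tol, and an error report otherwise.
err = compare_images('baseline_images/test_axes/single_date.png',
                     'result_images/test_axes/single_date.png',
                     tol=10)
if err:
    raise AssertionError(err)
```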

Unfortunately, by fixing this algorithm, a number of test cases have started failing. On my machines (two separate machines running Ubuntu 12.04), I am seeing a lot of the text characters rendered slightly off the expected position. This isn't a problem visually, but it does cause failures under the new algorithm because the text is in a slightly different position.

The vast majority of failing cases are in the mathtext module. These have RMS errors from 10 up to 60. There are three other cases outside of mathtext that now fail:

  • test_axes.test_single_date: RMS 12.861
  • test_simplification.test_clipper: RMS 20.035 -- this is genuinely broken!
  • test_spines.test_spines_axes_positions: RMS 11.431

I think it would be fine to bump up the tolerance for the first and third one; the test_clipper looks to me like it's actually producing incorrect output. I'd like to know if other users have similar (perhaps exactly the same) RMS errors for these cases, with my patch applied. If so, it is probably the case that these expected output images are out of date, and we can regenerate them. If not, then perhaps we can bump up the tolerance a bit to allow for differences in the rendering algorithm. (Though I wouldn't want to go as high as 60.)

I have not created a pull request, since the branch is not yet ready to merge (due to the failing cases). However, hopefully we can work towards fixing up those test cases.

Thanks for your time.

@pelson
Member

pelson commented Sep 20, 2012

@mgiuca-google : Firstly, thank you! Your work on this looks thorough and highly valuable.

The image comparison algorithm is broken.

Based on reading your message, I wouldn't go so far as to say it was broken; simply that an alternative algorithm which has better characteristics exists (ok, so I might say that the name of the existing function is broken).

Firstly, imagine you have two images that are extremely similar, but all of the pixels differ by exactly 1 value. For example, all pixels are 127 in the first image, all pixels are 128 in the second. This algorithm will consider these two images to be maximally different.

Sometimes this is a desirable characteristic (i.e. a measure of how many pixels have changed). I agree that it is not always desirable though. And the fact that the measure scales up with the size of the image doesn't sound great.

Secondly, imagine you take an image, and then take all of the pixels and randomly shuffle them around. The algorithm will consider these two images to be exactly identical, because even though the pixels are completely messed up, they generate the exact same histograms.

I agree. This is bad.

In order to see the whole changeset clearly, my advice would be to create a PR and in it state that it isn't quite ready for merge yet. From there, I would be happy to muck in and see if we can get the remaining tests to pass.

Thanks again,

@dmcdougall
Member

The other problem with the current algorithm is that the units are meaningless. The unit of measurement is technically "absolute number of pixels", which means that the magnitude is determined by the image size. For a 10x10 image, the worst RMS possible is 100; for a 640x480 image, it is 307200. This makes it impossible to set a reasonable tolerance value. In the corrected solution, the unit of measurement is "colour value difference", where the worst RMS possible is 255. To cope with this change, I have dropped the division by 10000 and multiplied all of the tolerances by 10000 -- I found that 10 (the default) makes a reasonable tolerance under the new scheme.

This only matters when comparing RMSE values between tests. Not an unreasonable thing to do but, as you correctly point out, it is meaningless. One way to solve this is to normalise by the number of pixels in the image. Then, the max RMSE for any test will be the same. The problem now, of course, is that an error by one pixel in test A excites a different decimal place in the RMSE than a one pixel error in test B.

The ultimate epic solution of destiny would be to use a standard image size across all tests. The downside of this approach is that every baseline image would need to be regenerated.

@WeatherGod
Member

On Thu, Sep 20, 2012 at 8:44 AM, Damon McDougall notifications@github.com wrote:

The ultimate epic solution of destiny would be to use a standard image size across all tests. The downside of this approach is that every baseline image would need to be regenerated.

Another downside of having all of the test images use the same size would be that we wouldn't be able to test the control of different aspect ratios, different sizes, and 'bbox_tight'.

Image comparison is a tough nut to crack.

Ben Root

@mdboom
Member

mdboom commented Sep 20, 2012

I just want to add that the fact that the algorithm "is unable to detect movement of pixels between two images" is by design. 1-pixel offsets are commonplace given changes in the freetype renderer, and we don't want to consider that a test failure. Perhaps the approach you describe here, compensated by first calculating the "center of mass" of each image and adjusting for the difference, would be the best of both worlds.
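Purely as a sketch of that idea (made-up helper, not code from anywhere): compute an intensity-weighted centroid for each image, shift one image by the integer offset between the centroids, and only then run the pixel-wise comparison.

```python
import numpy as np

def align_by_center_of_mass(expected, actual):
    """Hypothetical: shift `actual` so its ink centroid matches that of `expected`."""
    def centroid(img):
        weight = 255.0 - img.mean(axis=-1)   # treat dark ink as "mass"
        ys, xs = np.indices(weight.shape)
        total = weight.sum()                 # assumes the image is not entirely white
        return np.array([(ys * weight).sum(), (xs * weight).sum()]) / total

    dy, dx = np.round(centroid(expected) - centroid(actual)).astype(int)
    # np.roll wraps around at the edges, which is acceptable for small offsets
    return np.roll(np.roll(actual, dy, axis=0), dx, axis=1)
```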

@dmcdougall
Member

Another downside of having all of the test images use the same size would be that we wouldn't be able to test the control of different aspect ratios, different sizes, and 'bbox_tight'.

Good point.

Another thing I wanted to add: it doesn't matter which (useful) norm you use for calculating the error; it will always depend on the domain.

Perhaps the best option is to normalise it to lie in [0,1]. Then inspection can tell you, pretty quickly, how 'bad' the error is.
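For the pixel-wise measure that would just mean dividing by the worst possible value, e.g. (trivial sketch, name made up):

```python
import numpy as np

def normalised_rms(expected, actual):
    """Pixel-wise RMS scaled into [0, 1]; 255 is the worst case for 8-bit images."""
    delta = expected.astype(float) - actual.astype(float)
    return np.sqrt(np.mean(delta ** 2)) / 255.0
```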

@dmcdougall
Member

Also, thanks @mgiuca-google for an awesome bug report.

@mgiuca-google
Contributor Author

Thanks for the awesome and timely feedback. First, let me acknowledge that yes, image comparison is hard, and I don't mean to impose my views on the project. If there are legitimate reasons for it being the way it is now, then so be it. However, I would like at least the following two things to come out of this bug report:

  • Updating the comments in compare_images to be clear that this is the intended behaviour, and explaining the rationale and drawbacks.
  • Including my test_compare_images test module (with the specific cases modified) to show that the comparison function is working as intended.

@pelson:

In order to see the whole changeset clearly, my advice would be to create a PR and in it state that it isn't quite ready for merge yet.

Done.

@dmcdougall:

The ultimate epic solution of destiny would be to use a standard image size across all tests. The downside of this approach is that every baseline image would need to be regenerated.

Well, an equivalent solution (I believe), and a much easier one, would be to divide the RMS by the total number of pixels in the image (instead of the arbitrary 10000). That solves @WeatherGod's concern. This scaling approach doesn't solve all of my other criticisms, but it does solve the "units" problem.
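Something along these lines (sketch only, not tested code): keep the existing histogram comparison but divide by the pixel count, so that the worst possible value no longer depends on the image dimensions.

```python
import numpy as np

def normalised_histogram_rms(expected, actual):
    """Histogram RMS divided by the pixel count, so the image size drops out."""
    n_pixels = expected.shape[0] * expected.shape[1]
    total = 0.0
    for c in range(3):  # r, g, b
        h_expected, _ = np.histogram(expected[..., c], bins=256, range=(0, 256))
        h_actual, _ = np.histogram(actual[..., c], bins=256, range=(0, 256))
        total += np.sqrt(np.mean((h_expected - h_actual).astype(float) ** 2))
    return total / n_pixels
```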

@mdboom:

I just want to add that the fact that the algorithm "is unable to detect movement of pixels between two images" is by design. 1-pixel offsets are commonplace given changes in the freetype renderer, and we don't want to consider that a test failure. Perhaps the approach you describe here, compensated by first calculating the "center of mass" of each image and adjusting for the difference, would be the best of both worlds.

Okay. I had wondered whether this was by design. After all, the fact that my tests started failing is a result of exactly these minor freetype rendering differences which the current algorithm is designed to ignore. The "center of mass" approach will not work, because what I am seeing is some characters being rendered at a slight offset and others not. So it is more like a kerning issue than an absolute positioning one.

I feel like the best approach is not to design an algorithm that ignores movement of pixels, but rather, to have a tolerance that tolerates minor movements. The current algorithm seems like overkill, since it ignores much more drastic changes than 1-pixel offsets in text rendering. Notice that for monochrome images (such as the text rendering tests), the current algorithm is literally: "calculate the difference in the number of black pixels in the two images." So any two images with a similar amount of black will match. Here are some other errors that the current algorithm will not notice:

  • A line graph that is completely wrong, but has similar line length.
  • Text that renders with completely wrong characters, but with a similar amount of black pixels.
  • Switching the X and Y axes labels on a graph.
  • A bar graph with the bars having their colours randomly reassigned.

On the flip side, my concern is that in order to set the tolerance high enough to tolerate a 1-pixel offset of some characters, it will also tolerate completely different characters (since, surely, the number of pixels that change when you shift a character by 1 pixel is roughly the same as the number of pixels that change when you replace it with a completely different character). That's not ideal, but I still think this approach is overall going to catch more regressions.

@dmcdougall
Member

Closing, since #1291 was merged.
