Fix image comparison #1291
actualImage = actualImage.astype(np.int32)

# calculate the per-pixel errors, then compute the root mean square error
num_values = reduce(operator.mul, expectedImage.shape)
np.prod(expectedImage.shape) would do the trick here (obviously your version works, but feels less numpy-y).
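As a quick illustration of the suggestion (the shape value here is hypothetical), both expressions count the number of values in the array:

```python
import operator
from functools import reduce

import numpy as np

# Hypothetical image shape: rows x columns x RGB channels.
shape = (480, 640, 3)

# The reduce-based version used in the patch...
num_values = reduce(operator.mul, shape)

# ...and the more numpy-y one-liner suggested here give the same count.
print(num_values, int(np.prod(shape)))  # → 921600 921600
```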
Ah much better.
@mgiuca-google: This is really good stuff, thank you! As you can see, I have raised a couple of questions, but in principle I think this will be a beneficial change. As I hinted at in my comment on the original issue, I probably wouldn't call the original image comparison test "broken", just that it has some characteristics which may not be ideal for our image testing requirements. On that basis, I wonder if it is worth us maintaining the two functions side by side, primarily so that other users who may want to do image comparison could decide which algorithm to use. This may be a contentious issue, as inevitably it will increase the amount of code that mpl has to maintain...

One nitpick observation: you have built on code which is obviously not PEP8 compliant, resulting in your own code not being strictly PEP8 compliant (although you have followed the guiding principle: "A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is most important."). I would certainly find it an improvement if you were to rename the variables you have added/touched to be more PEP8-y (i.e. from

On the whole, pretty awesome!
Just something I have come across today in my work that might be relevant is the MapReady toolkit: http://www.asf.alaska.edu/downloads/software_tools In it, there is a program called "diffimage" (which, because this is a geoprocessing tool, does a bit more than we are looking for), but it has the following description:

So, what is interesting is the use of the Fourier transform as part of the image-differencing technique. Don't know if that might be an interesting avenue to pursue or not. Cheers!
On Fri, Sep 21, 2012 at 6:44 PM, Benjamin Root notifications@github.com wrote:

Interesting! Good find.

Damon McDougall
Thanks for your comments, @pelson. I have taken care of them.
I have renamed

@WeatherGod good find. I will have a look at that tool later on. The main improvement I'd be interested in over the RMSE algorithm I implemented is whether it can detect minor pixel shifts and assign a small penalty (whereas RMS assigns a large penalty because it just thinks that all of the pixels have changed). It sounds like step 2 (lining up the two images) is designed to solve this, but again, we need to be able to deal with sub-image shifts, not just whole-image shifts.

The new test cases Phil suggested that I add are helpful in judging this requirement. They currently output 22 and 13 respectively. I'd expect them to output some positive value, but much smaller, perhaps about 4 and 2, respectively.
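To see the pixel-shift penalty concretely, here is a minimal sketch (rms_error is an illustrative stand-in for the patched comparison, not its exact code):

```python
import numpy as np

def rms_error(expected, actual):
    # Cast to a signed type first so uint8 subtraction cannot wrap around.
    diff = expected.astype(np.int16) - actual.astype(np.int16)
    return np.sqrt(np.mean(diff.astype(np.float64) ** 2))

# A high-contrast stripe pattern shifted sideways by one pixel: every
# pixel lands on the opposite shade, so per-pixel RMS reports a maximal
# error even though the image content has barely moved.
stripes = np.zeros((8, 8), dtype=np.uint8)
stripes[:, ::2] = 255
shifted = np.roll(stripes, 1, axis=1)
print(rms_error(stripes, shifted))  # → 255.0
```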
@WeatherGod wrote:
I'm not sure if you're advocating using this tool or just borrowing the idea. If you meant the former, I had a brief look at the license agreement and it is incompatible with Matplotlib. It seems to be basically the BSD license, but with the additional BSD-incompatible clause:
@mgiuca-google - if you wouldn't mind rebasing this, I'd like to see if we can get this merged in the next couple of weeks. Previous commenters from #1287 were @mdboom, @dmcdougall and @WeatherGod, so ideally we would get either a 👍, 👎 or an explicit abstention from those before we actually press the merge button (other commenters more than welcome too!). Cheers,
I'm definitely in favor of this in principle. Once this is rebased and we have something to test against, I'd like to kick the tires one more time (since accidentally breaking the test suite would be a major problem). Assuming all that goes well, I'd say this is good to go.
succeed if compare_images succeeds. Otherwise, the test will succeed if
compare_images fails and returns an RMS error almost equal to this value.
"""
from nose.tools import assert_almost_equal |
It would be better to have those imports at the top of the file
Done. Thanks for spotting.
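The docstring in the hunk above describes an "expected failure" pattern; here is a sketch of how such a check might look (check_expected_rms and the dict shape are hypothetical illustrations, not the PR's actual code):

```python
def check_expected_rms(result, expected_rms, places=4):
    """Hypothetical helper mirroring the docstring above.

    `result` stands in for a compare_images()-style return value: None on
    success, or a dict containing an 'rms' entry on failure.
    """
    if expected_rms is None:
        # No expected RMS: the comparison itself must have succeeded.
        assert result is None, 'comparison unexpectedly failed'
    else:
        # Otherwise the comparison must fail, with an RMS error almost
        # equal to the expected value.
        assert result is not None, 'comparison unexpectedly passed'
        assert round(result['rms'] - expected_rms, places) == 0

check_expected_rms(None, None)               # clean comparison passes
check_expected_rms({'rms': 16.00001}, 16.0)  # expected failure passes
```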
I agree with this. The negotiation @mgiuca-google mentions in this PR message should, I think, be carefully considered, particularly given the issues we have comparing images of rasterised text across different versions of FreeType.
Hey guys,

Thanks for bumping this, Phil. I've done a merge up to the current head. (You said "rebase" and I'm not sure if you actually prefer a rebase instead of a merge -- I'm personally not a fan, but if you really want a rebase, let me know and I'll do that.) You guys should be able to pull this branch and run the tests.

The other change I made was in commit 5e22c11: I deleted the section which Michael added in 1283feb, which retries a failed comparison after removing any pixel differences of 1. This change was presumably made to work around the fact that if you have a lot of pixels with a slightly different colour, you will get a big error, such as in my all-127 versus all-128 case. My branch fixes that issue, so I don't think we need this extra case. Let me know if there is another good reason for it.

Now this still fails a lot of tests due to RMS failures. As I said in the original PR, we will have to go through and update either the expected output or the tolerance for each test. I can do this, but it would be good to come to a policy decision first.
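For the all-127 versus all-128 case mentioned above, a per-pixel RMS (sketched here with an illustrative helper, not the patch's exact code) reports only a small error, which is why the retry special case is no longer needed:

```python
import numpy as np

def rms_error(expected, actual):
    # Signed cast avoids uint8 wraparound when subtracting.
    diff = expected.astype(np.int16) - actual.astype(np.int16)
    return np.sqrt(np.mean(diff.astype(np.float64) ** 2))

# Every pixel is off by exactly one shade: the error is 1.0 regardless
# of how many pixels differ, so a small tolerance already absorbs it.
a = np.full((16, 16, 3), 127, dtype=np.uint8)
b = np.full((16, 16, 3), 128, dtype=np.uint8)
print(rms_error(a, b))  # → 1.0
```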
That is perfectly reasonable - we don't want you doing a lot of tedious work if it only goes stale again. I'm confident that if we can blitz through a review of what's here this week, @mgiuca-google can then go through rms values next (or if there are other volunteers to help with that process, it can be shared appropriately).
I did mean rebase, which is generally our preferred way of bringing branches up to date, but the reason why this is preferred over merge eludes me (for a linear history on master???). I'm sure others can fill in the details on that, and on whether or not to undo the merge and rebase instead. Cheers,
Yes -- we definitely want a rebase, not a merge. The merge creates clutter in the history, and it makes it look like the old master is not the "trunk".

Why are there more RMS failures with this change? The images should either be identical to the input (in which case they pass) or any differences should be handled by this new algorithm. If not, then updating the baselines will only cause the comparisons to work for you but fail for me (who produced most of the existing baseline images). Or am I missing something?

I hope to find some time shortly to check this out and poke at it a bit.
Ok -- I see what's happening. It seems like most of these tests are failing due to a subtle text positioning improvement, which shows up mostly in the vector formats. I don't see any failures that look problematic -- in fact, one failure is due to a baseline image still showing an old bug.

I think the thing to do here is "reset" all of the baselines by updating all of them. I'll file a PR against this one shortly to do just that.
@mgiuca-google: github won't let me file a PR against your repo (???). Perhaps you could just manually merge
Well, for what it's worth, if you merge the branch into master with

Thanks for going to the effort of resetting all of those images. I wasn't sure you'd want to do that, but I think it's the best outcome. I was able to manually merge it, but it doesn't seem directly relevant to my branch. Wouldn't it be better to cherry-pick fb68c58 into master (since it should not break with the existing comparison function)? Then this branch is just about fixing the comparison function, and not the images themselves.
Okay, I have done the rebase. Now all of my commits are applied to the current HEAD. I am not sure whether I've done a "rebase" as you intended, though. Did you just want my commits applied to HEAD, or did you actually want me to go back through the history and fix up the commits so that they are all in logical order and pristine? For example, removing the "Use int16 instead of int32 arrays" commit and just using int16 from the start. Also, should it be the case that the tests pass on all of the commits (so, don't commit a failing test case before fixing the code)?

I'm just trying to get an idea of what style of branch you want to accept. If you intend to do a merge --no-ff, then it shouldn't matter if the history is a bit buggy, as long as the final product is fine. If you intend to do a fast-forward merge, then all of the commits need to be sensible.
@@ -316,37 +300,25 @@ def compare_images( expected, actual, tol, in_decorator=False ):
# open the image files and remove the alpha channel (if it exists)
expectedImage = _png.read_png_int( expected )
actualImage = _png.read_png_int( actual )
expectedImage = expectedImage[:,:,:3]
actualImage = actualImage[:,:,:3]
Hmmm. Is there a reason for not taking the alpha channel into account?
I think for the vast majority of our tests, it doesn't matter. But it's conceivable it might for one testing that the background of the figure is transparent for example. It probably saves some time, which is important when considering how long the tests currently take to run.
Note also that for PDF comparisons, Ghostscript never gives an alpha channel, so it's completely redundant there.
Maybe it should be a kwarg compare_alpha (defaulting to False)? I wouldn't want that to hold up this PR because of it.
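What the slicing in the hunk above does, shown on a hypothetical RGBA array:

```python
import numpy as np

# A 2x2 RGBA image; the fourth channel is alpha.
rgba = np.zeros((2, 2, 4), dtype=np.uint8)
rgba[..., 3] = 255  # fully opaque

# Dropping the alpha channel, as the diff does:
rgb = rgba[:, :, :3]
print(rgba.shape, rgb.shape)  # → (2, 2, 4) (2, 2, 3)
```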
Updating the PR (note: I didn't rebase again but I will after further discussion). Here are the remaining issues. Let me know if I've missed some:
I agree with that approach. I'm prepared to accept that some developers' test suites (depending on machine/OS) will fail after merging this PR - it's easy for us to iteratively increase tolerances as needed.

Perhaps, in an ideal world, we would do well to be able to specify regions of different tolerances in the same image. But not here. 😄
Excellent saving! Thanks for doing this.
@@ -299,8 +282,9 @@ def compare_images( expected, actual, tol, in_decorator=False ):
= INPUT VARIABLES
- expected The filename of the expected image.
- actual The filename of the actual image.
- tol The tolerance (a unitless float). This is used to
  determine the 'fuzziness' to use when comparing images.
- tol The tolerance (a colour value difference, where 255 is the
That's my kind of spelling - but is probably inconsistent with the rest of the docs/codebase. Would you mind dropping the "u" from colour? (It feels very alien asking you to do this...)
@pelson in fact:

```
nvaroqua@u900-bdd-1-156t-6917:~/Projects/matplotlib$ git grep color | wc
  15543   81737 1965493
nvaroqua@u900-bdd-1-156t-6917:~/Projects/matplotlib$ git grep colour | wc
     59     396    4876
```
That's probably my fault...
Changed. (Don't worry, I'm used to writing "color" to be consistent with the code around me. I'm usually less careful in comments, but fixed for consistency.)
This tests the image comparison function itself. Currently, all three cases fail, due to a buggy comparison algorithm. In particular, test_image_compare_scrambled shows the algorithm massively under-computing the error, and test_image_compare_shade_difference shows the algorithm massively over-computing the error.
(This regressed when the PIL code was migrated to numpy.)
The previous implementation did not compute the RMS error. It computed the RMS of the difference in the number of occurrences of each colour value (a histogram comparison, rather than a per-pixel one). While this computes 0 for equal images, it is incorrect in general. In particular, it does not detect differences in images with the same pixels in different places. It also cannot distinguish small changes in the colour of a pixel from large ones.
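To make the flaw concrete, here is an illustrative paraphrase (not the codebase's exact code) comparing a histogram-based RMS with a per-pixel one, on two images that contain the same pixel values in different places:

```python
import numpy as np

def histogram_rms(expected, actual):
    # Old approach (paraphrased): compare counts of each colour value.
    he = np.bincount(expected.ravel(), minlength=256)
    ha = np.bincount(actual.ravel(), minlength=256)
    return np.sqrt(np.mean((he - ha).astype(np.float64) ** 2))

def pixel_rms(expected, actual):
    # New approach (paraphrased): compare corresponding pixels directly.
    diff = expected.astype(np.int16) - actual.astype(np.int16)
    return np.sqrt(np.mean(diff.astype(np.float64) ** 2))

a = np.array([[0, 255], [255, 0]], dtype=np.uint8)
b = np.array([[255, 0], [0, 255]], dtype=np.uint8)  # same pixels, scrambled

print(histogram_rms(a, b))  # → 0.0 (difference goes undetected)
print(pixel_rms(a, b))      # → 255.0
```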
This was arbitrary and made no sense. Increased all tolerances by a factor of 10000. Note that some are ridiculously large (e.g., 200 out of 255).
…and sub-image 1-pixel offset.
…ly 1." This was introduced in 1283feb, presumably to hack around the fact that 1-pixel differences can make a very large error value. This is not necessary any more, since the root cause has been fixed.
…is not available.
…line images derived from basn3p02 in pngsuite tests. These are much smaller images than the cosine tests.
…evant under the new algorithm. Added a few new tolerance values, for output I am seeing that is valid but slightly different to the baseline image.
This is possible due to anti-aliasing.
I have rebased from master and there was just one image that had changed in the meantime and needed to be regenerated (
The Travis failure on 2.6 appears to be due to a network failure, not really any fault of ours. Ideally, we'd do something to push Travis to try again, but I'm reasonably confident that we are OK here, given that we have 2.7 and 3.2 working.
I'm just going to bite the bullet and merge this. I'm reasonably confident that the 2.6 test will pass once this is merged. Thanks for all of this work -- I know this was a long-lived PR for a change so pervasive and fundamental to our workflow, but I think it represents a real improvement.
Sweet! Thanks for dealing with this, Michael. It's a relief to have it done.

Indeed. Very nice work @mgiuca-google - thanks for this.
Fixes the compare_image RMS calculation algorithm, so that it computes the RMS of the difference between corresponding pixels, as opposed to the RMS of the histograms between the two images.
See discussion on Issue 1287.
Note: This is not yet ready to merge, since it breaks a lot of tests. Some negotiation is required to figure out whether to update the expected output for each test, or bump up the tolerance.