Comparison Optimization #270

Open · rpspringuel opened this issue Sep 30, 2016 · 2 comments

@rpspringuel (Contributor)

This is a brainstorm I've had but don't have time to work on right now, so I'm recording it here so I don't forget. No one should feel obligated to make it work unless they really want to. I'll get back to it eventually.

It occurs to me that the expensive part of the test process might be the image comparisons (I really should profile just where the processor time is actually spent). As such, if we could find a way to compare images faster, it would help speed things up. Obviously reducing the image size helps with this, but, as we know from experience, reducing the image resolution means that small but important differences may get overlooked (this is why we introduced the density option). What is needed is a comparison that is both fast and strict.

Looking at ImageMagick's page on image comparison, I found a section on finding duplicate images which suggests two quick methods for identifying identical images. The first, using md5 checksums, won't work for us: we're comparing images created at different times, and the difference in the file metadata will cause the hashes to differ even when the image data is the same. However, the second, which uses identify to compute a hash signature based purely on the image data (and not the file metadata), might work. My idea is that it might be faster to run identify on the two images and compare the signatures than to run compare. I have not yet tested this idea to see if it pans out.
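For concreteness, here's a minimal sketch of what that signature check might look like, assuming ImageMagick's identify and its `%#` format escape (which prints a hash computed from the pixel data only); the file names here are just placeholders:

```sh
# Compare pixel-data signatures; file metadata does not affect these hashes.
sig_expected=$(identify -quiet -format '%#' expected.png)
sig_result=$(identify -quiet -format '%#' result.png)

if [ "$sig_expected" = "$sig_result" ]; then
    echo "images are pixel-identical"
else
    echo "images differ (perhaps insignificantly)"
fi
```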

If it does pan out, my concern is that this might be too strict. The signature comparison will only show whether the images are exactly the same, while our current compare check has a threshold below which differences are not considered significant. As a result, using the signature comparison as a replacement metric would cause false failures. To get around this, I'm thinking about using the signature comparison as a gatekeeper check (sketched below): if the signature comparison shows that the images are different, then we run the existing compare check to determine whether that difference is significant. This, however, involves a trade-off. Since we'd be running two comparisons on any image which is not identical, images which change (even insignificantly) would have their comparison slowed down. For this to actually save time on average, the signature comparison would need to be significantly faster, and most changes would need to affect only a small number of tests. Indeed, assuming the signature comparison is faster, it would be useful to know just how many tests need to change before this process starts costing additional time. In that manner we could make a more educated decision about whether it's a worthwhile change.
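A rough sketch of that gatekeeper in shell, assuming ImageMagick's identify and compare with the AE (absolute error, i.e. differing-pixel count) metric; the threshold and file names are placeholders, not our actual test-suite values:

```sh
#!/bin/sh
# Usage: gatecompare expected.png result.png [threshold]
expected=$1
result=$2
threshold=${3:-0}  # placeholder: differing pixels tolerated before failing

# Cheap gatekeeper: if the pixel-data signatures match, pass immediately.
if [ "$(identify -quiet -format '%#' "$expected")" = \
     "$(identify -quiet -format '%#' "$result")" ]; then
    exit 0
fi

# Signatures differ, so fall back to the full (slower) comparison.
# compare -metric AE prints the number of differing pixels on stderr.
ae=$(compare -metric AE "$expected" "$result" null: 2>&1)
[ "$ae" -le "$threshold" ] && exit 0  # difference below threshold: pass
exit 1                                # significant difference: fail
```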

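On the break-even question, a back-of-the-envelope bound (the symbols are mine, not measured values): with N test images, per-image signature time t_s, per-image compare time t_c, and k images whose signatures differ, the gatekeeper scheme costs N·t_s + k·t_c against N·t_c for running compare alone, so it only saves time while

```latex
N t_s + k t_c < N t_c
\quad\Longleftrightarrow\quad
\frac{k}{N} < 1 - \frac{t_s}{t_c}
```

For example, if a signature check took a tenth as long as a compare, the scheme would pay off so long as fewer than 90% of the test images changed.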
@henryso (Contributor) commented Sep 30, 2016

Before you spend too much time tweaking things, you might want to try the -n flag first, which runs the tests without verifying anything.

On my computer, running everything (i.e., with no cached images) takes 100% of the time; about 8% of that is saved when all the images are cached; and with the -n flag (which skips the image conversion and comparison completely) the run takes about 80% of the time.

@rpspringuel (Contributor, Author)

So, if I understand what you're saying correctly, things break down like this:

- Time spent converting test expectations: 8% (the time saved by caching the images)
- Time spent generating test results: 80% (the time spent when using -n)
- Time spent converting test results: 8% (probably the same as the time saved by caching the images, since we're converting the same number and sorts of images)

That would leave 100% − 80% − 8% − 8% = 4% as the time spent actually doing the comparisons, so even if my idea could save time, there isn't much time available to save.

I'll check the -n option on my machine to make sure the percentages are roughly the same, but if you're right, then clearly this sort of optimization is more effort than it could possibly be worth.
