
[Proposal] The problem with visual comparison #24312

Open
KuznetsovRoman opened this issue Jul 19, 2023 · 16 comments
@KuznetsovRoman

Problem

Pixelmatch (which is used to compare snapshots) has an issue, as I described in mapbox/pixelmatch#127.

And that is not the only one. Pixelmatch also has some sort of antialiasing detection, but it's neither configurable nor perfect:
I launched a Playwright test 500 times and 6 of those runs failed due to browser rendering instability. That's 1.2%, which is a lot.
6 failed out of 500
Diff
Test code src: https://pastebin.com/eWzMpKBW

Solution

I can solve both of these problems by integrating our package looks-same into Playwright in place of pixelmatch.

Tools comparison

  • Color comparator: looks-same has a configurable CIEDE2000 tolerance (2.3 by default) instead of pixelmatch's threshold, which takes human color perception into account.

These colors from example 1 (linked in the issue above) would be considered different if tolerance < 33.5, so the diff image (marked in pink) would be:
looks-same diff

At the same time, looks-same wouldn't mark the difference from example 2 (linked in the issue above) with tolerance > 1.86.

  • Antialiasing: looks-same has a configurable antialiasingTolerance, which would help with those 6/500 failures.

It can detect the diff from above, but ignores it given a proper antialiasingTolerance.
looks-same antiAliasing

Benchmarking

Case:

Compare two images (896x5069) and save the diff image to the file system.

Environment:

Apple M1 Pro, 32GB

Currently, looks-same is not that fast, but I managed to improve its performance; here is what I got on my branch:

  • equal buffers

    • looks-same: 0.1ms
    • pixelmatch: 260ms
  • indistinguishable difference (23166 / 4541824 pixels have #fefefe color instead of #ffffff)

    • looks-same: 110ms
    • pixelmatch: 260ms
  • distinguishable difference (5 red pixels on white background)

    • looks-same: 130ms
    • pixelmatch: 500ms
  • big difference (500x500 red square in the center)

    • looks-same: 260ms
    • pixelmatch: 540ms
  • huge difference (two 500x500 red squares in the center)

    • looks-same: 380ms
    • pixelmatch: 540ms
  • gigantic difference (four 500x500 red squares in the center)

    • looks-same: 640ms
    • pixelmatch: 580ms

Conclusion

looks-same

  • provides better comparison quality than pixelmatch
  • configurable: ignoreCaret, tolerance, antialiasingTolerance
  • faster than pixelmatch on average, unless the diff is really huge (130 ms constant time for a 896x5069 image + 130 ms per each 250k different pixels)
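The last bullet can be read as a rough cost model (my own extrapolation from the benchmark numbers above, not a measured formula):

```javascript
// Rough cost model for looks-same on a 896x5069 image, extrapolated from the
// benchmark numbers above (an assumption, not measured code).
function estimateLooksSameMs(differentPixels) {
    const constantMs = 130;  // fixed per-comparison cost
    const perBlockMs = 130;  // added cost per 250k differing pixels
    return constantMs + perBlockMs * (differentPixels / 250000);
}

// e.g. a 500x500 red square is 250k differing pixels, giving ~260 ms,
// in line with the "big difference" measurement above.
```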

I can complete the job and open a pull request to use looks-same in Playwright's image comparison, but first I would like to clarify: are you interested in this?

P.S. If you need benchmark code examples, I could provide them in the PR after polishing the looks-same code.

@StanislavGrishaev

@KuznetsovRoman Hi! Is there any chance to try your comparator with PW now? For example, in a fork or some other way? Sounds very interesting...

aslushnikov added a commit to aslushnikov/playwright that referenced this issue Jul 21, 2023
This patch brings in antialiasing tests from `looks-same` project for
our experimental `ssim-cie94` comparator.

One of the new tests found a bug in our implementation.

References microsoft#24312
@KuznetsovRoman
Author

KuznetsovRoman commented Jul 21, 2023

@KuznetsovRoman Hi! Is there any chance to try your comparator with PW now? For example, in a fork or some other way? Sounds very interesting...

You can check out my branch (https://github.com/gemini-testing/looks-same/tree/HERMIONE-1049) and use it like this:

const looksSame = require('looks-same');

// createDiffImage: true is mandatory
const result = await looksSame(image1, image2, {createDiffImage: true}); 

image1 and image2 are either fs paths like images/image1.png or PNG buffers.

I don't have a fork, but you could monkey-patch your Playwright installation.
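For the monkey-patching route, an adapter along these lines could bridge the two APIs. This is only a sketch: the contract here (resolve null on a match, {errorMessage, diff} on a mismatch) is my assumption about what a Playwright comparator expects, and compareFn is injected so the shape is clear without the package installed:

```javascript
// Hypothetical adapter; `compareFn` stands in for require('looks-same') from the
// HERMIONE-1049 branch. The returned contract (null on match, {errorMessage, diff}
// on mismatch) is an assumption about Playwright's comparator interface.
function makeLooksSameComparator(compareFn) {
    return async function compare(expected, actual, options = {}) {
        const result = await compareFn(expected, actual, {
            createDiffImage: true, // mandatory on the HERMIONE-1049 branch
            ...options,
        });
        if (result.equal) {
            return null;
        }
        return {
            errorMessage: `${result.differentPixels} of ${result.totalPixels} pixels differ`,
            diff: result.diffImage, // exposes a save(path) method on that branch
        };
    };
}
```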

aslushnikov added a commit that referenced this issue Jul 21, 2023
…or (#24348)

This patch brings in antialiasing tests from `looks-same` project for
our experimental `ssim-cie94` comparator.

One of the new tests found a bug in our implementation.

References #24312
@aslushnikov
Collaborator

aslushnikov commented Jul 21, 2023

Hey @KuznetsovRoman,

Thank you for bringing looks-same to our attention! I didn't see this project when we were researching image comparison a while ago.

I looked into the project and played a bit with it. Here are my high-level thoughts/comments:

  • looks-same claims support for multiple formats, including lossy formats like JPEG. As of today, we've yet to see any demand for anything but PNG. Do you actually use non-PNG images for VRT? If yes, why?
  • I'm surprised to see CIEDE2000 usage instead of CIE94. I did a micro benchmark to measure CIEDE2000 vs CIE94, and my results are:
    • CIE94 is almost 2x faster than CIEDE2000 on my machine; implementation is also way simpler.
    • There's not much difference between CIE94 and CIEDE2000 when colors are similar. The biggest difference comes when colors are extremely different; in that case CIE94 might yield 50, whereas CIEDE2000 yields 130. However, this is not important for an image comparator and IMO isn't worth a 2x penalty plus the extra complexity. Am I missing something?
  • In terms of anti-aliasing cancelling, I see that looks-same employs the same approach as pixelmatch, but exposes an extra option for the algorithm to tweak behavior.

Now, we have an experimental ssim-cie94 comparator, which is intended to be a "zero-config, just works" comparator for VRT. You can enable it like this:

await expect(page).toHaveScreenshot({
  _comparator: 'ssim-cie94',
});

This comparator uses cie94 with JND=1.0 for color difference, and uses local structural similarity to fight anti-aliasing artifacts. We suspect that using SSIM for anti-aliasing detection might yield great results. We also have an extensive dataset to test the comparator, a small part of which is available publicly at https://github.com/microsoft/playwright/tree/d307c8e63afe29fcc8fc0c8d4e561b07a4ebb391/tests/image_tools/fixtures
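For intuition, the "local structural similarity" part boils down to the standard SSIM formula evaluated over small windows. This is a generic textbook sketch, not Playwright's actual ssim-cie94 code:

```javascript
// Standard SSIM over two equally-sized windows of grayscale values (0-255).
// A generic textbook sketch, not Playwright's ssim-cie94 implementation.
function ssim(windowA, windowB) {
    const n = windowA.length;
    const mean = xs => xs.reduce((a, b) => a + b, 0) / n;
    const muA = mean(windowA);
    const muB = mean(windowB);
    const varA = mean(windowA.map(x => (x - muA) ** 2));
    const varB = mean(windowB.map(x => (x - muB) ** 2));
    const cov = mean(windowA.map((x, i) => (x - muA) * (windowB[i] - muB)));
    const C1 = (0.01 * 255) ** 2; // stabilizing constants from the SSIM paper
    const C2 = (0.03 * 255) ** 2;
    return ((2 * muA * muB + C1) * (2 * cov + C2)) /
           ((muA ** 2 + muB ** 2 + C1) * (varA + varB + C2));
}
```

Identical windows score 1; windows whose local structure differs (as around an antialiased edge that moved) score lower, which is what lets SSIM discount antialiasing artifacts.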


Now, I'd like to compare ssim-cie94 to looks-same.

First, I pulled the anti-aliasing tests from looks-same into our repo. One of them uncovered a corner-case bug in the ssim-cie94 implementation; this is now fixed on tip-of-tree. Other than that, it passes all the tests.

Secondly, I hooked up looks-same and ran it against our public dataset. It failed a few tests, with weird-looking diffs:

image

Running looks-same against our full dataset failed 7% of the tests. Did I use it wrong? If yes, could you please help me hook it up properly?

Finally, could you please try using ssim-cie94 and see if it works / doesn't work for you? If it doesn't would you mind sharing actual/expected images?

@KuznetsovRoman
Author

Running looks-same against our full dataset failed 7% of tests. Did I use it wrong?

I need specific examples. For instance, in should-match/webkit-rendering-artifacts/webkit-corner-2x-expected.png vs should-match/webkit-rendering-artifacts/webkit-corner-2x-actual.png there clearly is a visible color difference:

image

I doubt this is an antialiasing problem, or that "expected" could really render as "actual".

@aslushnikov
Collaborator

aslushnikov commented Jul 22, 2023

I need specific examples

@KuznetsovRoman sure

  1. Check out https://github.com/aslushnikov/playwright/tree/try-using-looks-same
  2. Build playwright: npm i && npm run build
  3. Run image tests with HTML reporter: npm run ttest image_tools -- --reporter html

This fails 5 tests for me.

I doubt this is an antialiasing problem, or that "expected" could really render as "actual".

This is an actual rendering artifact we were getting in one of our WebKit screenshot tests. Whether this kind of artifact should or should not be ignored is questionable, and ultimately it's a client call. My concern is the diff, which is clearly wrong.

@KuznetsovRoman
Author

KuznetsovRoman commented Jul 22, 2023

My concern is the diff that is clearly wrong.

The problem with your usage is the following:

    const diff = await looksSame.createDiff({
      reference: expected,
      current: actual,
      extension: 'png',
      highlightColor: '#ff0000', // color to highlight the differences
      ...looksSameOptions,
    });

Here, diff is a PNG buffer (with a PNG signature and other chunks, not a raw image buffer), so you don't need to call "diffPNG.data = diff" and "PNG.sync.write(diffPNG)"; instead, just pass "body: diff" to the "testInfo.attach" method.

This would work:

    const diff = await looksSame.createDiff({
      reference: expected,
      current: actual,
      extension: 'png',
      highlightColor: '#ff0000', // color to highlight the differences
      ...looksSameOptions,
    });


    await testInfo.attach(fixtureName + '-diff.png', {
      body: diff,
      contentType: 'image/png',
    });
image

Also, on my branch https://github.com/gemini-testing/looks-same/tree/HERMIONE-1049 you should use the looksSame function with the {createDiffImage: true} option to get an API suitable for Playwright. It is also much faster.

const result = await looksSame(image1, image2, {createDiffImage: true}); 
// {
//   equal: false,
//   diffImage: OriginalImage,
//   differentPixels: 2,
//   totalPixels: 10404,
//   diffBounds: { left: 1, top: 0, right: 1, bottom: 1 },
//   diffClusters: [ { left: 1, top: 0, right: 1, bottom: 1 } ]
// }

diffImage has a save method:

const {diffImage} = result;
await diffImage.save('my-diff.png');

@KuznetsovRoman
Author

This is an actual rendering artifact we were getting in one of our webkit screenshot tests

Could you give an HTML example, so I could render it ~500 times in a WebKit browser and see if it really could be rendered with this kind of difference? It looks like some SVG.

@KuznetsovRoman
Author

KuznetsovRoman commented Jul 22, 2023

I'm surprised to see CIEDE2000 usage instead of CIE94

CIE94 still has perceptual uniformity issues, so CIEDE2000 is better.

Example:
color1 = [129, 174, 237], color2 = [129, 174, 255], cie94 = 4.5, ciede2000 = 1.85
color3 = [111, 120, 138], color4 = [118, 120, 138], cie94 = 2.27, ciede2000 = 3.5

CIE94 is almost 2x faster than CIEDE2000 on my machine

On my branch I made some modifications to improve image comparison speed: https://github.com/gemini-testing/looks-same/blob/HERMIONE-1049/index.js#L23. With those improvements, comparison speed is similar.

So, in most cases we don't really need to compute CIEDE2000 to be sure that white and red are different colors.

The trick is to compute the fast CIE76 first, and in most cases that is enough (based on my calculations, [0.695 * CIEDE2000] <= CIE76 <= [6.2 * CIEDE2000] for any pair of RGB colors). We only need to compute the slow CIEDE2000 when, based on CIE76, we can't be sure whether CIEDE2000 is below or above the tolerance value. (Keep in mind that most of the time colors are exactly equal, requiring no calculation at all, so on average this is not a problem.)
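A sketch of this fast-path check (a hypothetical helper of my own; the real code lives in the HERMIONE-1049 branch of looks-same, and ciede2000Fn stands in for a full CIEDE2000 implementation):

```javascript
// Fast-path color comparison using the empirical bounds from the comment above:
// 0.695 * CIEDE2000 <= CIE76 <= 6.2 * CIEDE2000. Hypothetical helper, not the
// actual looks-same code; ciede2000Fn is a full CIEDE2000 implementation.
function exceedsTolerance(lab1, lab2, tolerance, ciede2000Fn) {
    // CIE76 is the plain Euclidean distance in L*a*b* space
    const cie76 = Math.hypot(lab1.L - lab2.L, lab1.a - lab2.a, lab1.b - lab2.b);
    // CIEDE2000 >= cie76 / 6.2, so this is definitely above tolerance:
    if (cie76 > 6.2 * tolerance) return true;
    // CIEDE2000 <= cie76 / 0.695, so this is definitely within tolerance:
    if (cie76 <= 0.695 * tolerance) return false;
    // Ambiguous zone: only here do we pay for the slow CIEDE2000
    return ciede2000Fn(lab1, lab2) > tolerance;
}
```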

@KuznetsovRoman
Author

@aslushnikov are there any updates?

@aslushnikov
Collaborator

@aslushnikov are there any updates?

@KuznetsovRoman I haven't had a chance to continue experiments yet; will come back to this in a bit.

Finally, could you please try using ssim-cie94 and see if it works / doesn't work for you? If it doesn't would you mind sharing actual/expected images?

Meanwhile, any chance you can give this a try?

@KuznetsovRoman
Author

Finally, could you please try using ssim-cie94 and see if it works / doesn't work for you? If it doesn't would you mind sharing actual/expected images?

Meanwhile, any chance you can give this a try?

#24312 (comment)

I want to be sure "expected" could render as "actual"

@aslushnikov
Collaborator

@KuznetsovRoman unfortunately, I don't recall exactly where it comes from. IIRC it is one of the Playwright tests from back in the day that rendered differently on x86 vs arm64 Ubuntu. However, the comparator's behavior is not that important in this particular test case; algorithms can be tuned to yield either outcome.

I'm interested in getting your feedback on using our experimental ssim-cie94 for your workloads.

@KuznetsovRoman
Author

KuznetsovRoman commented Jul 28, 2023

I'm interested in getting your feedback on using our experimental ssim-cie94

  • rendering instability tolerance: it still sometimes fails on the example above
Screenshot 2023-07-28 at 15 31 13

Previously, I did not include the time needed to save the image, but that was wrong: looks-same resolves with a ready PNG buffer, while pixelmatch and ssim-cie94 still have to transform a raw image buffer into a PNG buffer, and looks-same does this significantly faster.

  • performance (900x5000 PNG images, average over 10 function calls):
    • equal buffers (excluding saving to the file system)

      • looks-same: 0.12 ms
      • pixelmatch: 260 ms
      • ssim-cie94: 390 ms
    • indistinguishable difference (960 / 4541824 pixels have #fefefe color instead of #ffffff, excluding saving to the file system)

      • looks-same: 111 ms
      • pixelmatch: 240 ms
      • ssim-cie94: 908 ms
    • distinguishable difference (5 red pixels on white background, including saving to the file system)

      • looks-same: 140 ms
      • pixelmatch: 500 ms
      • ssim-cie94: 1170 ms
    • big difference (500x500 red square in the center, including saving to the file system)

      • looks-same: 250 ms
      • pixelmatch: 530 ms
      • ssim-cie94: 1340 ms
    • huge difference (two 500x500 red squares in the center, including saving to the file system)

      • looks-same: 374 ms
      • pixelmatch: 544 ms
      • ssim-cie94: 1500 ms
    • gigantic difference (four 500x500 red squares in the center, including saving to the file system)

      • looks-same: 630 ms
      • pixelmatch: 600 ms
      • ssim-cie94: 1882 ms
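Averages like these can be collected with a simple harness along the following lines (a sketch of mine; `fn` stands in for one comparison call per library):

```javascript
// Minimal timing harness of the kind usable for the averages above (a sketch).
// `fn` stands in for one comparison call (looks-same / pixelmatch / ssim-cie94).
async function averageMs(fn, runs = 10) {
    let totalNs = 0n;
    for (let i = 0; i < runs; i++) {
        const start = process.hrtime.bigint(); // monotonic nanosecond clock
        await fn();
        totalNs += process.hrtime.bigint() - start;
    }
    return Number(totalNs / BigInt(runs)) / 1e6; // ns -> ms
}
```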

@aslushnikov
Collaborator

@KuznetsovRoman awesome, thank you for the numbers! Could you please share the code so that I can reproduce these findings locally?

@KuznetsovRoman
Author

@KuznetsovRoman awesome, thank you for the numbers! Could you please share the code so that I can reproduce these findings locally?

Sure: https://github.com/KuznetsovRoman/img-comparison-bench

nvm use
npm i
node index.js

Germandrummer92 pushed a commit to OctoMind-dev/playwright that referenced this issue Oct 27, 2023
…or (microsoft#24348)

This patch brings in antialiasing tests from `looks-same` project for
our experimental `ssim-cie94` comparator.

One of the new tests found a bug in our implementation.

References microsoft#24312
@stuymedova

stuymedova commented Jan 31, 2024

@aslushnikov It seems you're quite motivated to improve ssim-cie94. Have you had a chance to continue the experiments with looks-same? I'm not advocating for it, but given the expertise its maintainers have, I believe it's worth giving it a chance. Ultimately, what users like myself want is for Playwright to use the best tool for the job.

@KuznetsovRoman It'd be nice to hear about the looks-same algorithm's corner cases and drawbacks too. It's hard to believe the algorithm is that good without any drawbacks, given the complexity of the matter. It would also be nice to see research confirming its superiority; otherwise it's easy to be sceptical.

On the topic, I've found this talk where the speaker explains how they omitted complex color-matching algorithms altogether and got about a 5x speed improvement. Perhaps naive, but if their solution is stable, I believe the tradeoff would be justified. The speaker, I believe, now works at Microsoft. The talk is quite old; I wonder if it is still relevant: https://youtu.be/ULwdj_Vr_WA?feature=shared&t=1942

cc @mourner
