[Question] Visual testing in docker on different CPU architecture #13873
@nickofthyme No good news, unfortunately. Overall, this sounds like a docker issue. It is expected that an arm docker image and an intel docker image produce different screenshots; after all, they have different libraries/executables inside. Ideally, you would force the intel image everywhere as you tried, but I guess that does not work, as you've linked above. As for mitigation, we have … @aslushnikov Any more ideas?
@nickofthyme any chance you can supply me with a short repro that illustrates these image differences? I'd love to play with it!
Yeah totally! I put up a branch at …

Start web server on …:

```sh
yarn install
cd ./e2e && yarn install && cd ../

# generate web server files
yarn test:e2e:generate

# starts local web server
yarn test:e2e:server
```

Run playwright against the local web server on … The key part is … Let me know if that gives you enough to play with! 👍🏼
@nickofthyme I got stuck here; first, Puppeteer failed to download the binary: …

I ran with … and it failed with a few dozen error lines.
You shouldn't need any …
@nickofthyme I'm on an M1 Mac (macOS 12.3), running with Node.js 16.13
Interesting, that's pretty similar to mine. Let me know if it still gives you trouble and I can try running it on mine from scratch.
Yep, still the same issue with …
Ok, I'm afk but I'll try it tomorrow and let you know.
Ok, I removed all references to … Since you are on an M1 Mac, I also pushed a commit (elastic/elastic-charts@dbfd2ed) updating all screenshots on my Intel Mac so you can see the diff from your M1. I was able to install and run the above commands on my M1 Mac; I show running …

Let me know if you still have trouble with these changes.
Hey @nickofthyme 🙂 I'm wondering what you ended up doing with this? Did you manage to solve it or find some workaround?
No, not really 😞 We tried everything in the book to make it work so that we could update screenshots locally on either … The only way we could make it consistent with a 0% threshold was to update the screenshots on the CI machines via a GitHub PR comment (see elastic/elastic-charts#1819 (comment)) and handle updating the screenshots by committing directly to the PR from CI. Not the most ideal solution, but it does the trick for us. I don't see this changing anytime soon, since the two docker images use completely different binaries and the …

source: https://docs.docker.com/desktop/mac/apple-silicon/#known-issues
I see... It is what it is 🙁
Hey @nickofthyme, thank you again for the repro. I came back to this issue and can reproduce it reliably.
Yeah, QEMU emulation is very slow and not advanced enough to run browser binaries, so things are likely to crash or be very slow there. Running docker on M1 was the main motivation for us to provide aarch64 linux browser binaries. Now, back to your repro. I researched this topic again, and it turns out that chromium rendering is not necessarily deterministic even within a single platform (see https://crbug.com/919955). The differences might happen in the anti-aliasing, so while the human eye won't be able to see any difference between the images, they are not pixel-to-pixel identical. With this, I draw the following conclusions: …
Realistically, though, the anti-aliasing differences are not important for visual regression testing use cases. So far we have some tools to deal with anti-aliasing noise: …

I see you chose to run …
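For intuition, here is a rough sketch of how the pixel-count options could gate a pass/fail decision. This is my own illustration, not Playwright's actual implementation; the `comparisonPasses` helper and its combination rule (every explicitly set cap must hold) are assumptions.

```typescript
// Illustrative only: how maxDiffPixels / maxDiffPixelRatio might gate a
// comparison, given the number of differing pixels from a pixel-diff.
interface DiffLimits {
  maxDiffPixels?: number;
  maxDiffPixelRatio?: number;
}

function comparisonPasses(diffPixels: number, totalPixels: number, limits: DiffLimits): boolean {
  const caps: number[] = [];
  if (limits.maxDiffPixels !== undefined) caps.push(limits.maxDiffPixels);
  if (limits.maxDiffPixelRatio !== undefined) caps.push(limits.maxDiffPixelRatio * totalPixels);
  // With no explicit cap, require a pixel-perfect match.
  if (caps.length === 0) return diffPixels === 0;
  // Every explicitly set cap must be satisfied, so the stricter one wins.
  return caps.every((cap) => diffPixels <= cap);
}
```

Under these assumptions, `maxDiffPixels: 10` combined with `maxDiffPixelRatio: 0.0001` on a 10000-pixel image allows only 1 differing pixel, since the stricter cap decides.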
@aslushnikov Thank you for the incredibly detailed review of this!

Agreed, this would be amazing! I looked at differences running … That being said, it would be great if playwright allowed passing all the … options (see playwright/packages/playwright-core/src/utils/comparators.ts, lines 61 to 63 at 5f03bd9).
We have some screenshots with animations that are flaky and produce sporadic results on identical test runs, even when disabling or waiting for animations to complete. I'm guessing this is due to anti-aliasing, because the diff is imperceptible to the eye... So we used …

A side note, now that you mention it... It was strange to me that setting the …

```ts
// playwright.config.ts
const config: PlaywrightTestConfig = {
  expect: {
    toMatchSnapshot: {
      threshold: 0,
      maxDiffPixelRatio: 0,
      maxDiffPixels: 0,
    },
  },
};
```

```ts
// some.test.ts
function getSnapshotOptions(options?: ScreenshotDOMElementOptions) {
  if (options?.maxDiffPixels !== undefined) {
    // need to clear default options for maxDiffPixels to be respected,
    // else could still fail on threshold or maxDiffPixelRatio
    return {
      threshold: 1,
      maxDiffPixelRatio: 1,
      maxDiffPixels: 1000000, // some large value
      ...options,
    };
  }
  return options;
}

expect(await page.screenshot()).toMatchSnapshot('landing-page.png', getSnapshotOptions({ maxDiffPixels: 5 }));
```

The current approach would be sensible to me in most other config override scenarios, but in this case all three options are tightly interdependent. Ideally, if I set any of the three directly in the options of … This may just be me; I could see an explanation for both ways, but thought I'd put it out there. 🤷🏼♂️
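To make the wished-for semantics concrete, here is a hypothetical sketch (`mergeDiffOptions` is an invented helper, not Playwright API): if the per-test override sets any one of the three interdependent options, the config-level defaults for the whole group are dropped instead of layered on top.

```typescript
// Hypothetical helper, not part of Playwright: merge config defaults with a
// per-test override, treating the three options as one interdependent group.
interface DiffOptions {
  threshold?: number;
  maxDiffPixels?: number;
  maxDiffPixelRatio?: number;
}

const DIFF_KEYS: (keyof DiffOptions)[] = ['threshold', 'maxDiffPixels', 'maxDiffPixelRatio'];

function mergeDiffOptions(configDefaults: DiffOptions, override: DiffOptions = {}): DiffOptions {
  const overrideSetsAny = DIFF_KEYS.some((key) => override[key] !== undefined);
  // If the test sets any option in the group, ignore the config defaults for
  // the whole group so only the explicitly requested limits apply.
  return overrideSetsAny ? { ...override } : { ...configDefaults, ...override };
}
```

With this rule, a test passing only `maxDiffPixels: 5` would not also inherit a config-level `threshold: 0` or `maxDiffPixelRatio: 0`, which removes the need for the `getSnapshotOptions` workaround above.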
I'd say we could probably get away with a looser … All that being said, we do sometimes have large diffs on PRs that require going through hundreds of virtually identical screenshots, which is a pain when reviewing files in GitHub!
Reviving this a bit, as I'm noticing similar issues. A few weeks ago tests were passing on different machines with pretty tight tolerances, but now things are failing. The goal is to create baseline images locally and run them on CI, but reading the above, that seems like an issue. Are the rendering issues chrome-specific, or do they affect any chromium-based browser, or Firefox too? Or maybe we can switch to WebKit (Safari)? Or do they all have rendering issues that don't guarantee visual accuracy across different browsers? Or is the only solution to run docker on one CPU architecture locally, so it matches CI? The only way we're able to pass semi-accurately on two different macOS machines, both M1, is setting `maxDiffPixelRatio` to 0.05 (5%), which seems fairly high.
I'm pretty sure there is no way to guarantee visual accuracy across different browsers, because the rendering engine differs from browser to browser (with some exceptions) and may render with slight differences. Font rendering is one example: I never managed to get the same exact text path on two different browsers. Visual accuracy should instead be guaranteed by rendering on the same CPU architecture with the same browser, even when running the tests on different machines (i.e. CI and local). We used to run an x86 Docker image on both macOS (x86 arch) and a different x86 CI with no visual differences. Here, instead, we are highlighting that the difference happens when rendering with the same browser but under two different CPU architectures (e.g. testing a screenshot generated using an x86 chrome-based image against an arm64 chrome-based image). @gselsidi you mentioned: …

Does this mean that you also get different screenshots when using the same browser and the same CPU arch?
What @markov00 said... Also, 5% …
I don't know anymore; screenshots behave weirdly. I just created baseline images on a Mac M1 and had a coworker run them on his Intel-based Mac: they all passed with a super strict 0.0001 and 0 threshold. Yesterday a coworker with the same M1 would get constant failures.
I don't recall ever seeing inconsistencies in failures (flakiness) for the same screenshots across differing CPU architectures. But I have seen flakiness related to the nature of the screenshot itself, specifically when using animations.
Yeah, how can I give them to you without uploading them here publicly? Or should I just upload them and delete them after you've saved them?
@gselsidi feel free to send them over to anlushni[at]microsoft[dot]com
@gselsidi Please file a separate feature request for this!
Did you receive my email with all the files?
@gselsidi just checked: yes, I did! The latest version of the comparator passes all your samples, though. Could you please check with …
I was on the Dec 09 version; this new Dec 13 version passes. Also, what else can we achieve with this? Would headed vs headless pass? Would non-docker images vs docker images pass? I haven't gotten around to testing it that in-depth yet; would those scenarios still fail?
@gselsidi thank you for checking!
The new comparator is designed to handle browser rendering non-determinism, so as long as the page layout is exactly the same, the screenshots should pass now. Do note, though, that for the layout to be the same, the font metrics must be the same as well. Otherwise, line wrapping might happen in different places, boxes will have different sizes, and the images will in fact be different even to the human eye.
According to some exploration I did back in the day, headless vs headed rendering differences consist of the following nuances: …

We didn't aim to fix this, though, and haven't experimented yet.
This won't work, since browsers inside and outside docker will use different font stacks, resulting in different font metrics, resulting in different layout, and finally yielding visually different screenshots.
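For intuition only, here is a toy per-pixel comparator that tolerates small color deviations. This is my own illustration of the general idea of perceptual slack, not the actual ssim-cie94 algorithm, which uses SSIM windows and CIE94 color distance rather than plain RGB distance.

```typescript
// Toy illustration of perceptual tolerance: count pixels whose plain RGB
// distance exceeds a threshold, instead of failing on any byte difference.
// The real ssim-cie94 comparator is far more sophisticated.
type RGB = [number, number, number];

function colorDistance(a: RGB, b: RGB): number {
  // Euclidean distance in RGB space, ranging 0..~441.
  return Math.hypot(a[0] - b[0], a[1] - b[1], a[2] - b[2]);
}

function countPerceptibleDiffs(expected: RGB[], actual: RGB[], tolerance = 10): number {
  if (expected.length !== actual.length) throw new Error('image sizes differ');
  let diffs = 0;
  for (let i = 0; i < expected.length; i++) {
    if (colorDistance(expected[i], actual[i]) > tolerance) diffs++;
  }
  return diffs;
}
```

A comparator like this ignores anti-aliasing noise (tiny per-pixel color shifts) but still fails on real changes; it cannot help with layout differences, which move whole blocks of pixels.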
I ran playwright on all CSS tests from the Web Platform Test Suite, with a … The run results are here: … With SSIM I did notice a few cases where the actual and expected looked exactly the same, but there was a large diff: … Not sure what's happening here, but it looks like the images being displayed aren't the ones actually used for comparison (some race condition?). The number of failures with a threshold of … Caveat: web platform tests test breadth, and render cases aren't always complex.
On a related note, it'd be convenient if the official docker image set fonts to something that is available on most Linux machines. Currently, it seems to use an obscure font (…).
Reposting what I saw here: …
Has ssim-cie94 been removed, or not yet released?
@github-daniel-mcdonald the comparator is still under experiment/development.
@aslushnikov Thanks for this tip. I've tried:

```ts
expect: {
  toMatchSnapshot: { comparator: 'ssim-cie94' } as any,
  toHaveScreenshot: { comparator: 'ssim-cie94' } as any,
},
```

But as the property is not recognized, it does bring up the question: is the new comparator being used at all, or is this just a property that's not read by the Playwright runner? Is there any way I can confirm the new comparator is being used, anything in the logs or in the HTML report to look for that mentions …?
@JPtenBerge it is currently available as …
Ah, great, that's doing something! To other readers: you can verify the property is actually being read by passing a bogus value:

```ts
expect: {
  toMatchSnapshot: { _comparator: 'loremipsum' } as any,
  toHaveScreenshot: { _comparator: 'loremipsum' } as any,
},
```

It clearly gives an error: …

@aslushnikov: Thanks!
@JPtenBerge I'd appreciate it if you could share feedback on how it goes for you; things that worked and things that didn't!
@aslushnikov Will do. I have some other work to focus on first, but I hope to experiment a bit further later this week.
Hi! I am also using …
@michalstrzelecki I'm sorry for the inconvenience. This is an experimental feature that has never been released or documented. I'd also encourage everybody to share their experience with the …
Finally had a chance to test this and found a pretty large reduction in failures (…). Most failures with … @aslushnikov make sure to download the html reports before they expire.
@nickofthyme thank you so much for trying out the comparator! I'm going over the HTML report, and so far the majority of the failures actually look legit to me.

Two tests caught my eye: …
Hey @aslushnikov @nickofthyme, a few things I've noticed here:

Actually that "red dot" is a dash of the red dashed line; it is really strange that one architecture renders it and the other doesn't. It looks like a different algorithm to me, where x86 renders only full dashes, whereas ARM always renders them, cut to the edges if needed.

The layout differences are possible only due to a difference in how text metrics are computed in SVG. I believe this is something that can't be fixed with a comparator and is specific to the browser's underlying implementation. I have already tested multiple times how different browsers handle the measurement of the same text with the same font differently, and I strongly believe this is the reason for these failures.

I also noticed different text ligatures in Arabic across charts; this needs more investigation.

The different color icon: strange fact here, I will investigate more, but testing with two real machines (x86 and arm), same browser, same OS, they render correctly with the same svg fill color.
@aslushnikov It would be nice if this option was added behind …

```diff
 await expect(page).toHaveScreenshot({
-  // https://github.com/microsoft/playwright/issues/20097#issuecomment-1382672908
-  // @ts-expect-error experimental feature
-  _comparator: 'ssim-cie94',
+  experimental: { comparator: 'ssim-cie94' },
   clip: box,
   fullPage: true,
 });
```
Hey team 👋🏼

I am working to migrate my puppeteer visual regression testing to playwright. My team has people working on Macs with either `arm64` (M1 SoC) or `amd64` (Intel) CPU architecture. I'd like a way to run and update playwright tests/screenshots locally from either architecture and have the local screenshots match the screenshots produced in CI (`linux/amd64`).

Currently we use the `mcr.microsoft.com/playwright:vx.x.x-focal` docker image to run tests both locally and in CI. However, running on these different architectures produces screenshots that are ever so slightly different, with virtually imperceptible differences.

- Screenshot from M1 mac (`arm64`)
- Screenshot from Intel mac (`amd64`)
- Diff screenshot
- Diff gif

So my question is: does anyone have a good strategy to avoid the above errors on these two architectures without reducing the `threshold`?

I've tried running docker with `--platform=linux/amd64` on my M1 mac, but I run into #13724 (comment) when running the tests, even on the latest docker version (`v20.10.8`) with Rosetta 2 installed. Sounds like this could just be a known issue with docker.