-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(test runner): improve sharding algorithm to better spread similar tests among shards #30962
base: main
Are you sure you want to change the base?
Conversation
Maybe it's better to make this an option to allow restoring the old behaviour. ¯_(ツ)_/¯
|
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Do you think you can achieve the same better behavior with your sharding seed? Or are you looking for additional bias against subsequent tests being put into the same group? |
Not sure yet. But I will test this new sharding logic in our test setup to gather some results.
The seeded shuffle is basically just a quick and easy way to influence the test group to shard assignment… it's random and so it's results may vary. However, this change is aimed to improve the sharding logic to generally yield better results, which yet needs to be proved. 😅 Currently this sharding algorithm uses the number of tests per test group as a cost metric. It would be great if we could use the test duration of a previous run (when available) to even better distribute the tests among the shards. But the algorithm would be quite similar. |
I think your seed change allows users to experiment with the seeds and arrive at some better state than they are today. Any other changes w/o the timing feedback are going to yield similar results, not need to experiment with biases.
This requires a feedback loop with the test time stats which we don't have today. We recently started storing the last run stats in |
Yes, I would like to work on that. I was not yet aware of the .last-run.json. Is that something that is also written by the merge reports command? Because we need the stats combined from all shard runs. I was thinkings about adding a separate reporter for that purpose, but if those last run stats are already there…, then there might not be the need to create a separate reporter. |
Shaping this code as reporter sounds good, but Playwright core would need to consume the output of that reporter, so it needs to be baked in. Merging those should not be a hard problem, reporter or not. Unfortunately merging mangles test ids today, so we'd need to figure that out. Maybe not using the ids altogether and falling back to the file names and test titles. Also has some tricky edge cases as tests that are fast on Chromium and are slow on Firefox... |
This comment has been minimized.
This comment has been minimized.
I've added a Surprisingly when merging the reports the test ids just had a 1 character suffix that I was able to strip off… but it doesn't feel like the right way to do this. What's the reason to modify test ids when merging blobs? Couldn't this be done in a way that only modifies a test id when there is a collision? |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
const testDurations = testRun.rootSuite?.allTests().reduce((map, t) => { | ||
if (t.results.length) | ||
map[t.id] = t.results.reduce((a, b) => a + b.duration, 0); | ||
return map; | ||
}, {} as { [testId: string]: number }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually not sure it is the right way to sum all the durations… maybe it makes more sense to calc the average? Or only include durations from successful test runs… 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is ok for now, we'll iterate based on feedback.
@pavelfeldman |
…icrosoft#30817)" This reverts commit 825e0e4. API review notes: sounds like this change did not solve the problem for the contributor, there is a new approach under development in microsoft#30962
This is awesome - we were planning on implementing something similar outside playwright before we migrate cypress tests that currently shard across 50 build servers. Is there anything todo here I could help with to get this merged quicker ? |
Hey, I did not noticed this PR got conflicts due to the revert of my other change… I’ll resolve those conflicts in the next days. There was still on open question on whether we should introduce a command line flag to specify a last run state file or not… but I think that are improvements that can also happen after this has been merged Though, this PR could have some more documentation changes… anyone want to help with that? |
This comment has been minimized.
This comment has been minimized.
904a8dd
to
2536da7
Compare
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was still on open question on whether we should introduce a command line flag to specify a last run state file or not
We discussed this and decided that it should be fine to add --last-run-file=
command line flag that will affect both the reading and writing location of the last-run.json
.
Though, this PR could have some more documentation changes… anyone want to help with that?
Of course! We can help with adding more documentation after landing this PR.
@@ -152,6 +152,7 @@ export class Runner { | |||
export type LastRunInfo = { | |||
status: FullResult['status']; | |||
failedTests: string[]; | |||
testDurations?: { [testId: string]: number }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are a bit worried that this will force testId
to never change to preserve backwards compatibility. Therefore, we would like to use testCase.titlePath()
instead of id for now. Something like this:
export type LastRunInfo = {
status: FullResult['status'],
projects: {
name: string,
files: {
path: string,
tests: {
title: string[],
duration: number,
status: TestStatus,
}[],
}[],
}[],
};
Here file.path
is toPosixPath(path.relative(config.rootDir, filePath))
, similar to this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, more verbose is also better for future use-cases that will consume the last-run info. Having the titles and paths in there is more user friendly… 👍
However, removing the already existing failedTests: string[]
from LastRunInfo would be a backward incompatible change… maybe we should keep it for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, removing the already existing failedTests: string[] from LastRunInfo would be a backward incompatible change… maybe we should keep it for now?
There are no backwards compatibility concerns as of today, since we do not expect last-run.json
to be used between different versions. It will be the case once we support "commit last-run.json to the repo for shard optimization" scenario.
That said, you can leave it as is, and we'll iterate on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried to write the LastRunInfo in the proposed format, but its not as straight forward as I thought it would be.
That said, you can leave it as is, and we'll iterate on it.
You mean, I don't need to proceed here to get this merged? 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[…] we do not expect last-run.json to be used between different versions
In our CI setup, I'm using pipeline steps to restore the last-run.json from the last successful main build. So, in our case last-run info from previous playwright versions will be used, and it would be great if there is some level of backward compatibility... at least it should not fail when the format is incorrect.
const testDurations = testRun.rootSuite?.allTests().reduce((map, t) => { | ||
if (t.results.length) | ||
map[t.id] = t.results.reduce((a, b) => a + b.duration, 0); | ||
return map; | ||
}, {} as { [testId: string]: number }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is ok for now, we'll iterate based on feedback.
This comment has been minimized.
This comment has been minimized.
@dgozman I've just added |
Test results for "tests 1"6 passed Merge workflow run. |
Test results for "tests 1"5 flaky19538 passed, 329 skipped Merge workflow run. |
Test results for "tests 1"1 failed 6 flaky25063 passed, 489 skipped Merge workflow run. |
Test results for "tests 1"7 flaky28509 passed, 655 skipped Merge workflow run. |
Test results for "tests 1"1 failed 5 flaky24434 passed, 473 skipped Merge workflow run. |
Test results for "tests 1"7 flaky28509 passed, 655 skipped Merge workflow run. |
Just added some more documentation, mostly copying what’s in this PR description to the test-sharding page. And cleaned up imports which were previously automatically “organized” due to my vs code settings, which resulted in unnecessary changes with high conflict potential… It would be great if we could get this over the finish line for the next playwright release. People already asked me how they could use playwright of my PR in their CI… 😅 @dgozman @pavelfeldman can I get a review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@muhqu Sorry for the slow review, I was unavailable for a week.
I think we can land this PR if you remove the reporter for now, and remove the round-robin
strategy due to stability concerns. This will give us the necessary functionality, and we can iterate further.
I can follow up with updating the duration-round-robin
to fallback to partition
.
Defines the algorithm to be used for sharding. Defaults to `'partition'`. | ||
* `'partition'` - divide the set of test groups by number of shards. e.g. first | ||
half goes to shard 1/2 and seconds half to shard 2/2. | ||
* `'round-robin'` - spread test groups to shards in a round-robin way. e.g. loop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this is not a really useful strategy, because it is not stable. Whenever I add a test or move it between files, all my shards will be completely new. This introduces a lot of variability on CI, which from our experience is something to avoid. I wonder whether we can make this stable somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my current project round-robin
yields much better distribution than partition
. Here is why, our tests are organised something like this (simplified)…
tests/framework/toolbar.test.ts
tests/framework/stage.test.ts
tests/logged-in/login-free.test.ts
tests/logged-in/login-premium.test.ts
tests/logged-in/premium-upsell.test.ts
tests/logged-out/view-only.test.ts
tests/logged-out/seo.test.ts
Our framework tests are pretty exhaustive and take the most time. The logged-in tests require more time than logged-out tests due to additional bootstrap and teardown. By using the partition
algorithm the first shards require significantly more time as they have all the long-tests from either framework or logged-in suites. With round-robin
we're already achieving a much better balanced overall runtime, even though the algorithm doesn't know anything about per test duration.
I don't feel like the round-robin algorithm is "not stable". It's deterministic, given the same set of tests it always yields the same result. Also I think in CI there should be no assumption that one test goes after another, unless they are explicitly configured to execute sequentially within the same test group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that's fair. If you would like to opt-in into round-robin
and it works for you, that's great. I still think it is less stable than I would prefer, but it doesn't hurt to have it as an option anyway.
test groups to shards in a round-robin way. e.g. loop over test groups and | ||
always assign to the shard that has the lowest duration of tests. new tests | ||
which were not present in the last run will use an average duration time. When | ||
no `.last-run.json` could be found the behavior is identical to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the comment above about the round-robin
strategy, I think we should use partition
when no .last-run.json
is present, and also use partition
for any tests that were not found in the .last-run.json
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree.
Also please note that duration-round-robin
and 'round-robin' are using the same implementation. The only difference is that in case of 'duration-round-robin' the last run info is passed to the implementation. However, in case the last run info is passed but it is empty, the behaviour is identical to 'round-robin' where no last run info is passed.
I don't think it would be a good to choose a different algorithm based on whether last run info is available or not.
@@ -271,7 +277,7 @@ export function toReporters(reporters: BuiltInReporter | ReporterDescription[] | | |||
return reporters; | |||
} | |||
|
|||
export const builtInReporters = ['list', 'line', 'dot', 'json', 'junit', 'null', 'github', 'html', 'blob', 'markdown'] as const; | |||
export const builtInReporters = ['list', 'line', 'dot', 'json', 'junit', 'null', 'github', 'html', 'blob', 'markdown', 'lastrun'] as const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need lastrun
here, because it should not be user-visible.
@@ -347,6 +353,7 @@ const testOptions: [string, string][] = [ | |||
['--headed', `Run tests in headed browsers (default: headless)`], | |||
['--ignore-snapshots', `Ignore screenshot and snapshot expectations`], | |||
['--last-failed', `Only re-run the failures`], | |||
['--last-run-file', `Path to a json file where the last run information is read from and written to (default: .last-run.json)`], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
['--last-run-file', `Path to a json file where the last run information is read from and written to (default: .last-run.json)`], | |
['--last-run-file', `Path to a json file where the last run information is read from and written to (default: test-results/.last-run.json)`], |
@@ -281,6 +285,8 @@ function overridesFromOptions(options: { [key: string]: any }): ConfigCLIOverrid | |||
retries: options.retries ? parseInt(options.retries, 10) : undefined, | |||
reporter: resolveReporterOption(options.reporter), | |||
shard: shardPair ? { current: shardPair[0], total: shardPair[1] } : undefined, | |||
shardingMode: options.shardingMode ? options.shardingMode : undefined, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should validate that shardingMode
has one of the supported values.
const failedTests = testRun.rootSuite?.allTests().filter(t => !t.ok()).map(t => t.id); | ||
const lastRunReport = JSON.stringify({ status, failedTests }, undefined, 2); | ||
const testDurations = testRun.rootSuite?.allTests().reduce((map, t) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const testDurations = testRun.rootSuite?.allTests().reduce((map, t) => { | |
const testDurations = (testRun.rootSuite?.allTests() || []).reduce((map, t) => { |
map[t.id] = t.results.reduce((a, b) => a + b.duration, 0); | ||
return map; | ||
}, {} as { [testId: string]: number }); | ||
const lastRunReport = JSON.stringify({ status, failedTests, testDurations }, undefined, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are still writing the file here, why do we need a separate lastrun
reporter? Let's remove the reporter for now, to make the scope of worker more narrow. We can introduce the reporter later if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lastrun
reporter is required to generate last run info when merging blob reports from shards.
Maybe it would be better to implicitly write the lastrun info when merging blob reports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. From the user's perspective, lastrun just working out of the box would be easier. Whether it is a reporter internally or not is an implementation detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about always using the last run reporter? Then we would only need to have a single implementation….
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That definitely works. We'll have to add it here when isTestServer === false
.
return group.tests.reduce((sum, test) => sum + Math.max(1, lastRunInfo.testDurations?.[test.id] || averageDuration), 0); | ||
}; | ||
|
||
// We sort the test groups by the number of tests in descending order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// We sort the test groups by the number of tests in descending order. | |
// We sort the test groups by group duration in descending order. |
]); | ||
}); | ||
|
||
test('should not produce skipped tests for zero-sized shards', async ({ runInlineTest }) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can safely remove tests starting from this one, because it is already covered in shard.spec.ts
, and sharding strategy does not affect any of them.
Adds alternative algorithms to assign test groups to shards to better distribute tests.
Problem
Currently the way sharding works is something like this…
Tests are ordered in the way they are discovered, which is mostly alphabetically. This has the effect that test cases are sorted nearby similar tests… for example your have first 6 tests which are testing logged-in state and then 6 tests which test logged-out state. The first 6 tests require more setup time as they are testing logged-in behaviour… With the current sharding algorithm shard 1 & 2 get those slow logged-in tests and shard 3 & 4 get the more quicker tests…
Solution
This PR adds a new
shardingMode
configuration which allows to specify the sharding algorithm to be used…shardingMode: 'partition'
That's the current behaviour, which is the default. Let me know if you have a better name to describe the current algorithm...
shardingMode: 'round-robin'
Distribute the test groups more evenly. It…
Here is a simple example where every test group represents a single test (e.g.
--fully-parallel
) ...…or a more complex scenario where test groups have different number of tests…
shardingMode: 'duration-round-robin'
It's very similar to
round-robin
, but it uses the duration of a tests previous run as cost factor. The duration will be read from.last-run.json
when available. When a test can not be found in.last-run.json
it will use the average duration of available tests. When no last run info is available, the behaviour would be identical toround-robin
.Other changes
testDurations?: { [testId: string]: number }
to.last-run.json
lastrun
reporter, which allowsmerge-reports
to generate a.last-run.json
to be generatedAppendix
Below are some runtime stats from a project I've been working on, which shows the potential benefit of this change.
The tests runs had to complete 161 tests. Single test duration ranges from a few seconds to over 2 minutes.
The partition run gives the baseline performance and illustrates the problem quite good. We have a single shard that takes almost 16 min while another one completes in under 5 min.
The round-robin algorithm gives a bit better performance, but it still has a shard that requires twice the time of another shard.
The duration-round-robin run was using the duration info from a previous run and achieves the best result by far. All shards complete in 10-11 minutes. 🏆 🎉