
Improve handling of skipped tests and dynamic parallelism #261

Open
devinburnette opened this issue Jun 14, 2024 · 12 comments
@devinburnette

Trying to figure out if there's a way that datadog-ci and knapsack can be better integrated, maybe this is just a queue mode problem, but here's the scenario I'm trying to solve for:

Datadog's ITR examines the test coverage and determines that most of the tests can actually be skipped, but we still pass the full list of tests to Knapsack in our CI config. Knapsack determines subsets and breaks the list up, but because most things are being skipped, some parallel nodes end up with no tests to run. That throws the entire split timings out of whack, and the job takes too long to complete (in some cases longer than without skipping any tests).

I'm wondering whether Knapsack could be made aware of skipped tests when determining subsets so the distribution improves, and/or whether we could introduce dynamic parallelism and determine the proper number of nodes to run based on the number of tests that will actually execute, effectively fixing the split-timings issue.

/cc @ArturT @anmarchenko

@anmarchenko

From the Datadog side, I would be happy to provide a list of skipped tests to the Knapsack library at some point in the lifecycle. Unfortunately, right now we instrument the knapsack:queue:rspec_go command, which runs after the parallel runners are created. The runners are created outside of Knapsack (in the static config of CircleCI), so I am curious what could be done in this case.

@ArturT
Member

ArturT commented Jun 18, 2024

@devinburnette Please ensure you use the latest version of the knapsack_pro gem and datadog-ci.

> knapsack determines subsets and breaks the list up but because most things are being skipped, some parallel nodes end up with no tests to run and it throws the entire split timings out of whack and the job takes too long to complete (in some cases longer than without skipping any tests).

Could you share a link to the problematic CI build in the knapsack_pro dashboard or commit hash?

@anmarchenko said:

> The runners are created outside of knapsack (in static config of Circle CI), so I am curious what could be done in this case.

@devinburnette Does it mean you use CircleCI and the Rerun only failed tests feature?
https://docs.knapsackpro.com/ruby/circleci/#rerun-only-failed-tests

Is the problem related to retried CI nodes that run only a subset of the test suite? In that case, CircleCI reruns all the nodes, but only some would run tests when there are fewer test files than parallel nodes. The number of parallel nodes would have to be configured in the CircleCI YAML config file even if the Knapsack API could provide a suggested number of parallel nodes to use.

> datadog's ITR examines the test coverage and determines that most of the tests can actually be skipped, but we still pass the full list of tests to knapsack in our CI config.

@devinburnette How do you pass the list of tests to Knapsack?

@anmarchenko

This is not a retry; it is a normal test run. Datadog is able to skip tests based on commit data and run only the relevant ones. So, for example, instead of 9500 tests it runs 100 of them and skips 9400: dozens of parallel workers waste time in this case.

@ArturT
Member

ArturT commented Jun 18, 2024

@anmarchenko Are tests skipped by DataDog during the runtime of the RSpec tests?
Is it visible as skipped (pending tests) in the RSpec summary?

Or does DataDog return a list of tests to skip up front and then Knapsack is fed with the list using KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE?
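For reference, a minimal sketch of feeding Knapsack Pro a fixed test file list via that variable. The file paths below are made up, and the format assumed here is one test file path per line:

```ruby
# Hypothetical sketch: write a plain-text list of test files (one path
# per line) and point KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE at it.
require "fileutils"

FileUtils.mkdir_p("tmp")
File.write("tmp/test_file_list.txt", <<~LIST)
  spec/models/user_spec.rb
  spec/services/billing_spec.rb
LIST
ENV["KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE"] = "tmp/test_file_list.txt"
```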

@devinburnette
Author

Yes. They are skipped during the RSpec runtime and marked as pending. The source file passed to Knapsack contains the entire set of tests, including the skipped ones.

@ArturT
Member

ArturT commented Jun 18, 2024

Do you skip individual test cases within a test file or do you always skip all test cases within a test file?

@anmarchenko

anmarchenko commented Jun 18, 2024

Individual tests within a test file. The tests are visible as pending in RSpec summary.

@ArturT
Member

ArturT commented Jun 18, 2024

DataDog could provide the total execution time of tests that are going to run (without skipped tests) because it tracks individual test cases.

Note

Knapsack Pro does not track RSpec individual test cases unless they belong to bottleneck test files that must be split by examples with this feature.

The customer could use the total execution time of the tests to estimate the number of parallel nodes needed. Perhaps it could be a Ruby or Bash script that pulls the execution time from DataDog, estimates the number of parallel nodes, and configures CI.

For example, DataDog wants to run 100 test cases. Their total execution time is 30 minutes.
The customer desires to run tests under 5 minutes per parallel CI node.

30 minutes / 5 minutes = 6 nodes are needed.

The customer dynamically configures their CI to run 6 nodes instead.
Knapsack does the rest without any changes.

Would that make sense? Do you see any issue with this approach?
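The estimate above can be sketched in a few lines of Ruby. The runtime and target values are hypothetical placeholders; in practice the total runtime would come from DataDog:

```ruby
# Hypothetical sketch of the node-count estimate described above.
total_runtime_minutes = 30.0    # total runtime of non-skipped tests (from DataDog)
target_minutes_per_node = 5.0   # desired runtime per parallel CI node

# Round up so the target per-node time is not exceeded.
node_count = (total_runtime_minutes / target_minutes_per_node).ceil
puts node_count # => 6
```

The computed count would then be fed into the CI config (e.g. CircleCI's parallelism setting) before the build starts.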

@anmarchenko

Yes. Datadog is evaluating the idea of providing a script for Knapsack's edge case that returns the percentage of skipped tests (the best we can do on the library's side). We will see whether that is enough for @devinburnette's case.

@ArturT
Member

ArturT commented Jun 19, 2024

The percentage of skipped tests might be good enough. If you run only 20% of the test suite, then you could also run 20% of the parallel nodes, or a bit more if these 20% of tests are very slow.
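That rule of thumb could look like this; all numbers are hypothetical placeholders, with the skipped percentage assumed to come from the Datadog tooling:

```ruby
# Hypothetical sketch: scale the node count by the fraction of tests
# that will actually run, keeping at least one node.
full_node_count = 30        # nodes used when running the full suite
skipped_percentage = 80.0   # e.g. reported by datadog-ci

running_fraction = (100.0 - skipped_percentage) / 100.0
node_count = [(full_node_count * running_fraction).ceil, 1].max
puts node_count # => 6
```

A small safety margin on top of the computed count would cover the "these 20% of tests are very slow" case mentioned above.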

@ArturT
Member

ArturT commented Jun 25, 2024

@anmarchenko Please let me know if I can assist with anything, or post an update here if you have made any progress. Thank you.

@anmarchenko

We are testing this idea with @devinburnette; so far it is not clear whether it works:
DataDog/datadog-ci-rb#194
