
Improve handling of skipped tests and dynamic parallelism #261

Open
devinburnette opened this issue Jun 14, 2024 · 12 comments
@devinburnette

Trying to figure out if there's a way that datadog-ci and knapsack can be better integrated, maybe this is just a queue mode problem, but here's the scenario I'm trying to solve for:

Datadog's ITR examines the test coverage and determines that most of the tests can actually be skipped, but we still pass the full list of tests to Knapsack in our CI config. Knapsack determines subsets and breaks the list up, but because most things are being skipped, some parallel nodes end up with no tests to run. That throws the entire split timings out of whack, and the job takes too long to complete (in some cases longer than without skipping any tests).

I'm wondering whether Knapsack could be made aware of skipped tests when determining subsets so the distribution improves, and/or whether we could introduce dynamic parallelism and determine the proper number of nodes to run based on the number of tests that will actually execute, effectively fixing the split-timings issue.

/cc @ArturT @anmarchenko

@anmarchenko

From the Datadog side, I would be happy to provide a list of skipped tests to the Knapsack library at some point in the lifecycle. Unfortunately, right now we instrument the knapsack:queue:rspec_go command, which runs after the parallel runners are created. The runners are created outside of Knapsack (in the static config of CircleCI), so I am curious what could be done in this case.

@ArturT
Member

ArturT commented Jun 18, 2024

@devinburnette Please ensure you use the latest version of the knapsack_pro gem and datadog-ci.

> knapsack determines subsets and breaks the list up but because most things are being skipped, some parallel nodes end up with no tests to run and it throws the entire split timings out of whack and the job takes too long to complete (in some cases longer than without skipping any tests).

Could you share a link to the problematic CI build in the knapsack_pro dashboard or commit hash?

@anmarchenko said:

> The runners are created outside of knapsack (in static config of Circle CI), so I am curious what could be done in this case.

@devinburnette Does it mean you use CircleCI and the Rerun only failed tests feature?
https://docs.knapsackpro.com/ruby/circleci/#rerun-only-failed-tests

Is the problem related to retried CI nodes that run only a subset of the test suite? In that case, CircleCI reruns all the nodes, but only some would run tests when there are fewer test files than parallel nodes. The number of parallel nodes would have to be configured in the CircleCI YAML config file even if the Knapsack API could provide a suggested number of parallel nodes to use.

> datadog's ITR examines the test coverage and determines that most of the tests can actually be skipped, but we still pass the full list of tests to knapsack in our CI config.

@devinburnette How do you pass the list of tests to Knapsack?

@anmarchenko

This is not a retry; it is a normal test run. Datadog is able to skip tests based on commit data and run only the relevant ones. So, for example, instead of 9500 tests it runs 100 of them and skips 9400: dozens of parallel workers waste time in this case.

@ArturT
Member

ArturT commented Jun 18, 2024

@anmarchenko Are tests skipped by DataDog during the runtime of the RSpec tests?
Is it visible as skipped (pending tests) in the RSpec summary?

Or does DataDog return a list of tests to skip up front and then Knapsack is fed with the list using KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE?
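For reference, a minimal sketch of feeding Knapsack Pro a fixed test file list via that variable. The file paths below are made up, and the format assumed here is one test file path per line:

```ruby
# Hypothetical sketch: write a plain-text list of test files (one path
# per line) and point KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE at it.
require "fileutils"

FileUtils.mkdir_p("tmp")
File.write("tmp/test_file_list.txt", <<~LIST)
  spec/models/user_spec.rb
  spec/services/billing_spec.rb
LIST
ENV["KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE"] = "tmp/test_file_list.txt"
```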

@devinburnette
Author

Yes. They are skipped during the RSpec runtime and marked as pending. The source file passed to Knapsack contains the entire set of tests, including the skipped ones.

@ArturT
Member

ArturT commented Jun 18, 2024

Do you skip individual test cases within a test file or do you always skip all test cases within a test file?

@anmarchenko

anmarchenko commented Jun 18, 2024

Individual tests within a test file. The tests are visible as pending in RSpec summary.

@ArturT
Member

ArturT commented Jun 18, 2024

DataDog could provide the total execution time of tests that are going to run (without skipped tests) because it tracks individual test cases.

Note

Knapsack Pro does not track RSpec individual test cases unless they belong to bottleneck test files that must be split by examples with this feature.

The customer could use the total execution time of the tests to estimate the number of parallel nodes needed. Perhaps it could be a Ruby or Bash script that pulls the execution time from DataDog, estimates the number of parallel nodes, and configures CI.

For example, DataDog wants to run 100 test cases. Their total execution time is 30 minutes.
The customer desires to run tests under 5 minutes per parallel CI node.

30 minutes / 5 minutes = 6 nodes are needed.

The customer dynamically configures their CI to run 6 nodes instead.
Knapsack does the rest without any changes.

Would that make sense? Do you see any issue with this approach?
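The estimate above can be sketched in a few lines of Ruby. The runtime and target values are hypothetical placeholders; in practice the total runtime would come from DataDog:

```ruby
# Hypothetical sketch of the node-count estimate described above.
total_runtime_minutes = 30.0    # total runtime of non-skipped tests (from DataDog)
target_minutes_per_node = 5.0   # desired runtime per parallel CI node

# Round up so the target per-node time is not exceeded.
node_count = (total_runtime_minutes / target_minutes_per_node).ceil
puts node_count # => 6
```

The computed count would then be fed into the CI config (e.g. CircleCI's parallelism setting) before the build starts.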

@anmarchenko

Yes. Datadog is evaluating the idea of providing a script for Knapsack's edge case that returns the percentage of skipped tests (the best we can do on the library's side). We will see whether that is enough for @devinburnette's case.

@ArturT
Member

ArturT commented Jun 19, 2024

The percentage of skipped tests might be good enough. If you run only 20% of the test suite, then you could also run 20% of the parallel nodes, or a bit more if these 20% of tests are very slow.
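That rule of thumb could look like this; all numbers are hypothetical placeholders, with the skipped percentage assumed to come from the Datadog tooling:

```ruby
# Hypothetical sketch: scale the node count by the fraction of tests
# that will actually run, keeping at least one node.
full_node_count = 30        # nodes used when running the full suite
skipped_percentage = 80.0   # e.g. reported by datadog-ci

running_fraction = (100.0 - skipped_percentage) / 100.0
node_count = [(full_node_count * running_fraction).ceil, 1].max
puts node_count # => 6
```

A small safety margin on top of the computed count would cover the "these 20% of tests are very slow" case mentioned above.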

@ArturT
Member

ArturT commented Jun 25, 2024

@anmarchenko Please let me know if I can assist with anything, or post an update here if you have made any progress. Thank you.

@anmarchenko

We are testing this idea with @devinburnette; so far it is not clear whether it works:
DataDog/datadog-ci-rb#194
