Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All examples skipped after exception in before: suite hook in queue mode #196

Closed
amanfredi opened this issue May 19, 2023 · 2 comments
Closed

Comments

@amanfredi
Copy link

amanfredi commented May 19, 2023

We run 40 parallel CI runners using knapsack pro queue mode, which normally complete in about 10 minutes each.
I recently observed that after retrying certain failed test nodes, the test runner would consistently time out. I realized this was because the runner had fetched more tests from the knapsack queue than it could possibly execute, and that distribution was persisted with the fixed queue split & retry configuration. The underlying cause of this is that the first failure did not fail the entire run, but instead caused knapsack to continue fetching examples from the queue and then not actually execute them.

See the following log for examples:

I, [2023-05-19T16:35:06.300147 #129]  INFO -- : [knapsack_pro] To retry the last batch of tests fetched from the API Queue, please run the following command on your machine:
I, [2023-05-19T16:35:06.300239 #129]  INFO -- : [knapsack_pro] bundle exec rspec --profile=5 --format documentation --color --format RspecJunitFormatter --out tmp/rspec.xml --default-path spec <<<spec-files-batch-1>>>"

An error occurred in a `before(:suite)` hook.
Failure/Error: DatabaseCleaner.clean

ActiveRecord::ConnectionTimeoutError:
  could not obtain a connection from the pool within 5.000 seconds (waited 5.005 seconds); all pooled connections were in use
<<<stacktrace>>>

I, [2023-05-19T16:35:11.656783 #129]  INFO -- : [knapsack_pro] To retry the last batch of tests fetched from the API Queue, please run the following command on your machine:
I, [2023-05-19T16:35:11.656859 #129]  INFO -- : [knapsack_pro] bundle exec rspec --profile=5 --format documentation --color --format RspecJunitFormatter --out tmp/rspec.xml --default-path spec <<<spec-files-batch-2>
No examples found.

I, [2023-05-19T16:35:11.949911 #129]  INFO -- : [knapsack_pro] To retry the last batch of tests fetched from the API Queue, please run the following command on your machine:
I, [2023-05-19T16:35:11.949998 #129]  INFO -- : [knapsack_pro] bundle exec rspec --profile=5 --format documentation --color --format RspecJunitFormatter --out tmp/rspec.xml --default-path spec <<<spec-files-batch-3>>>
No examples found.

...
@ArturT
Copy link
Member

ArturT commented May 19, 2023

Hi @amanfredi

I'm wondering how to reproduce it. I did

# spec_helper.rb
KnapsackPro::Hooks::Queue.before_queue do |queue_id|
  raise 'a'
end

Indeed knapsack pro consumes tests from Queue API and does it fast. It does not execute tests because RSpec could not start tests due to a failure. Tests failed.

CI build should complete work faster when tests are not executed.

I'm not sure why you experienced a timeout from CI? Do you mean you fixed the bug in the code and you started a new build (attempted some kind of retry) and it was split using cached distribution (fixed queue split) so that a single CI node got too many tests (due to them being cached on the API side). If you fixed a bug in the code then I would expect a new git commit hash so that the fixed queue split (cached distribution on API side) won't be used.

Could you give more details on how to reproduce the problem. What you did?
Did you start 1 CI build or maybe 2 because you did a retry? Did you create a new commit?
What would you expect to happen?

@ArturT ArturT added the bug label May 19, 2023
@ArturT
Copy link
Member

ArturT commented May 23, 2023

For the future reference.

In order to reproduce the problem, raise an exception outside of a test example.

describe 'abc' do
  raise 'error'
  it do
    expect(true).to be true
  end
end

When tests fail, then new tests are fetched from Queue API. All tests fail so that means a lot of tests will be fetched quickly from Queue API.
Because of that too many test files are assigned to a given CI node index.

If you retry a CI node (and you use KNAPSACK_PRO_FIXED_QUEUE_SPLIT=true) and connection to DB works fine, then the same set of tests that was assigned to the node in the first place would run. Tests will take a long time because there is a lot of them.

Problem:
knapsack_pro does not know how to catch exceptions from inside of the RSpec process because RSpec already caught the exception, so we don't know if the test failure is a regular failure or not (exception outside of a test example). So Knapsack Pro keeps asking Queue API for more tests instead of failing fast and stopping pulling tests.

Possible solution. Ensure the DB is up and running:

Example for Github Actions:

    services:
      postgres:
        image: postgres:10.8
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: ""
          POSTGRES_DB: postgres
        ports:
          - 5432:5432
        # needed because the postgres container does not provide a healthcheck
        # tmpfs makes DB faster by using RAM
        options: >-
          --mount type=tmpfs,destination=/var/lib/postgresql/data
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

https://docs.github.com/en/actions/using-containerized-services/creating-postgresql-service-containers#running-jobs-in-containers

Maybe this will be helpful:
https://github.com/orgs/community/discussions/25757

If the problem happens again, you can create a new commit and this will start a new queue so that the tests are distributed in new order.

@ArturT ArturT added the RSpec label May 23, 2023
@ArturT ArturT closed this as not planned Won't fix, can't repro, duplicate, stale Jun 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants