Revamp integration tests, run in parallel #1105

Merged
13 commits merged on Dec 2, 2023

Conversation

@briantist (Contributor) commented Nov 11, 2023

I started looking at improving the performance of tests, especially integration tests. Since they run serially and have to spin up Vault and/or Consul, they tend to take a while to run.

On my local machine they take a little over 1 minute to run; in CI it's more like 1m40s. Not bad, but iterating locally can be a slog, and in CI we'd like to add more platforms to test against, and possibly more Python versions for integration, so speedier test times will add up.
(Unit tests were already very fast: ~3s on my machine, ~10s or less in CI.)

pytest-xdist seemed like a good choice for parallelizing the tests and taking advantage of multiple cores.

The challenges came with the Vault and Consul servers that need to be started up, and with assumptions made in the tests themselves: everything was written with hardcoded TCP port numbers and IP addresses, and assumed that a server already listening on those addresses was an error state.

So this PR makes a whole lot of changes to that process, so that we can start multiple Vault and Consul servers simultaneously without them stepping on each other.

For this, the configuration files for each server are now generated and patched at test run time, with the help of a new class that works as a context manager to find free port numbers.

The context manager usage helps when launching a Vault cluster, where we need two instances of Vault and one instance of Consul, all with ports that don't conflict with each other. We also need to know those ports before the servers start, so that we can write the configs before the servers actually use the ports.
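A minimal sketch of the idea, with illustrative names rather than the actual class in this PR: bind port 0 so the OS hands out a free port, keep the socket open while the config files are written, and release it just before the server starts.

```python
import socket
from contextlib import ExitStack


class PortPicker:
    """Hold an OS-assigned free port reserved until the context exits."""

    def __enter__(self):
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        self.port = self._sock.getsockname()[1]
        return self

    def __exit__(self, *exc):
        # Release the socket so the server we're about to launch can bind the port.
        self._sock.close()


# Reserve three non-conflicting ports (two Vault nodes + one Consul node),
# write the config files while the ports are still held, then launch the servers.
with ExitStack() as stack:
    vault_a, vault_b, consul = (stack.enter_context(PortPicker()) for _ in range(3))
    # write_configs(vault_a.port, vault_b.port, consul.port)  # hypothetical helper
# The ports are released here; start the servers against the recorded numbers.
```

There is still a tiny window between releasing a port and the server binding it, which is part of why the launch retries described further down are useful.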

This also means that tests which assumed Vault is always available on 127.0.0.1:8200 needed to be updated to use whichever Vault server was launched for that test. That required changes in the server manager class and the test case class to ensure that all the right things got to the right places.
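A hedged illustration of the shape of that change; `vault_server`, `.url`, and `.root_token` are placeholder names for whatever the server manager and test case classes actually expose, not the real fixtures in this PR:

```python
import hvac


def test_vault_is_initialized(vault_server):  # "vault_server" is a hypothetical fixture
    # Before: client = hvac.Client(url="http://127.0.0.1:8200")
    # After: build the client from the server launched for this specific test run.
    client = hvac.Client(url=vault_server.url, token=vault_server.root_token)
    assert client.sys.is_initialized()
```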

The result: integration tests take ~16s on my 6-core machine with 12 test workers.
In CI, ~30s.

It's worth noting that GitHub's Ubuntu runners are supposed to have 2 vCPUs. xdist is configured by default to auto-detect the number of CPUs (or hyperthreads) and set the worker count to match, and I noticed that some test runs were choosing 2 workers and some were choosing 4.

I have not been able to find official word, but it seems like GitHub is slowly increasing the number of vCPUs on its runners (perhaps also the performance per vCPU, but that is harder for me to tell). Some independent research led me to find evidence that Ubuntu and Windows runners are going from 2 -> 4 vCPUs, and MacOS runners are going from 3 -> 4 vCPUs.

This GitHub Action, which reports the number of cores available, runs in its own CI once a week, and you can look at the history of runs to see some evidence: https://github.com/SimenB/github-actions-cpu-cores/actions

All of the above means this is a great time to parallelize the tests, because we get even more gain from the change!

Unit tests are also set to use xdist now, but the times on both my local machine and in CI are roughly the same: they were already so fast that any parallelization savings are eaten up by the fixed overhead of setting up the workers and distributing the test load. If we lean on unit tests more, we'll see the benefit in the future.


Other changes:

  • 2a6a90e -- we've started to see failures in one of the PKI tests because a field returned by Vault as a comma-separated string of values apparently does not have a guaranteed order, and the order Vault returns suddenly changed (a minimal sketch of an order-insensitive comparison follows this list).
  • Although all of our integration tests do run in CI, every individual run skips one test or another, and we didn't have good visibility into that, so I've changed the CI invocation to report exactly which tests are skipped.
  • Generally, add more retries and reporting when launching processes. With tests running in parallel there is still a small chance of port conflicts, since test workers don't talk to each other or synchronize, so the retries should handle those rare cases (a rough sketch of the retry idea also follows below).
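For the PKI fix, the assertion change is roughly of this shape; this is a sketch with made-up values, not the exact test code:

```python
# Compare the comma-separated field as a set of elements rather than as a raw
# string, since Vault does not guarantee the order of the elements.
expected = "DNS:alt1.example.com,DNS:alt2.example.com"
returned = "DNS:alt2.example.com,DNS:alt1.example.com"  # same elements, different order

assert set(returned.split(",")) == set(expected.split(","))
```

And the retry behavior when launching a process is roughly along these lines; `pick_port` and `start_server` are hypothetical stand-ins for the port context manager and the server manager's launch step:

```python
import time


def launch_with_retries(pick_port, start_server, attempts=5):
    """Try a fresh port on each attempt; start_server(port) is assumed to raise
    OSError (address already in use) if another worker grabbed the port first."""
    last_error = None
    for attempt in range(1, attempts + 1):
        with pick_port() as picked:
            port = picked.port  # the port is only reserved while the context is open
        try:
            return start_server(port)
        except OSError as exc:
            last_error = exc
            time.sleep(0.5 * attempt)  # brief backoff before picking a new port
    raise RuntimeError(f"could not start server after {attempts} attempts") from last_error
```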

Other notes

  • I originally started out seeing whether we could replace the use of local binaries with containers instead, using docker compose. It was interesting and we might pursue it later, but given how much was already built around using binaries and assuming open communication between the launched processes, I decided to put that aside and build on what's here.
  • Doctests (the thing that runs code examples in the documentation) are still slow, and single-threaded.
    • First of all, they don't run in pytest; they run via the Sphinx doctest plugin, which does still use the same server manager component for launching instances that we use in the integration tests.
    • I spent a lot of time trying to get the doctests to even use the dynamic port logic I put in place (without parallelization), and they kept running into weird conflicts despite dynamically choosing ports and launching processes. I could not figure out why.
    • I also spent time trying to get our doctests running in pytest. pytest already supports running doctest tests, including in RST files, but the Sphinx doctest plugin has a number of differences that make our tests incompatible.
    • I did take some steps toward converting them and making that work, but it would have meant rewriting almost all of our examples (even to work without dynamic ports and without parallelization), so I put that aside as well.
    • We also use some kind of vendored/hacked version of doctest, which could further complicate it.
    • Instead, that work of moving doctests to pytest could possibly go hand-in-hand with other docs modernization things:
    • Update docs theme and packages #1076
  • As part of this I also explored running our integration tests in CI on MacOS and Windows in addition to Ubuntu.
    • The first challenge was that our CI step to install Vault and Consul is Ubuntu-specific: it uses apt.
    • I did manage to come up with a PowerShell script for this that works across all 3 platforms. I am thinking about making that its own GitHub action.
    • Once that was done, the MacOS tests ran successfully; however, GitHub Actions' free usage limits us to 5 concurrent MacOS jobs at a time, and we run more than 5, so there is some contention there. As we reduce the number of Vault versions we test against, that should be alleviated.
    • Windows had some other problems. First, there's a bug in Poetry that has nothing to do with Windows but seemed to come up more frequently there, and it caused failures when installing dependencies. It seems to be fixed in Poetry 1.7.0.
    • After working through that and getting the tests to actually run, we hit the problem I expected: the tests freeze up indefinitely, as described here:
    • Integration tests can't be run on Windows #1007
    • Mainly, I wanted to see what else we might run into and whether there were any other blockers to running against Windows. I think I have a good handle on what those are, so I've removed all the other platform work from this PR, and we'll get to it in another one.

@briantist briantist added misc Used as a release-drafter "category" consul tests related to tests (not necessarily CI/CD) maintenance General technical debt developer experience Developer setup and experience labels Nov 11, 2023
@briantist briantist self-assigned this Nov 11, 2023
codecov bot commented Nov 11, 2023

Codecov Report

Merging #1105 (5211bda) into main (ede8622) will decrease coverage by 0.04%.
The diff coverage is n/a.

❗ Current head 5211bda differs from pull request most recent head 8568f67. Consider uploading reports for the commit 8568f67 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1105      +/-   ##
==========================================
- Coverage   87.16%   87.12%   -0.04%     
==========================================
  Files          64       64              
  Lines        3162     3162              
==========================================
- Hits         2756     2755       -1     
- Misses        406      407       +1     

see 1 file with indirect coverage changes

@briantist briantist changed the title from "WIP" to "Revamp integration tests, run in parallel" Nov 23, 2023
@briantist briantist marked this pull request as ready for review November 23, 2023 00:42
@briantist briantist requested a review from a team as a code owner November 23, 2023 00:42