Revamp integration tests, run in parallel #1105

Merged
13 commits merged on Dec 2, 2023

Conversation

@briantist (Contributor) commented Nov 11, 2023

I started looking at improving the performance of tests, especially integration tests. Since they run serially and have to spin up Vault and/or Consul, they tend to take a while to run.

On my local machine they take a little over 1 minute to run; in CI it's more like 1m40s. Not bad, but iterating locally can be a slog, and in CI we'd like to add more platforms to test against, and possibly more Python versions for integration, so speedier test times will add up.
(Unit tests were already very fast: ~3s on my machine, ~10s or less in CI.)

pytest-xdist seemed like a good choice for parallelizing the tests and taking advantage of multiple cores.

The challenges came with the Vault and Consul servers that need to be started up, and with assumptions made in the tests themselves: everything was written with hardcoded TCP port numbers and IP addresses, and assumed that a server already listening on those addresses was an error state.

So this PR makes a whole lot of changes to that process, so that we can start multiple Vault and Consul servers simultaneously without them stepping on each other.

For this, the configuration files for each server are now generated and patched at test run time, with the help of a new class that works as a context manager to find free port numbers.

The context manager usage helps when launching a Vault cluster, where we need two instances of Vault and one instance of Consul, all with ports that don't conflict with each other. We also need to know those ports before the servers start, so that we can write the configs before the servers actually use the ports.
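A minimal sketch of the idea, with illustrative names rather than the actual class in this PR: bind port 0 so the OS hands out a free port, keep the socket open while the config files are written, and release it just before the server starts.

```python
import socket
from contextlib import ExitStack


class PortPicker:
    """Hold an OS-assigned free port reserved until the context exits."""

    def __enter__(self):
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        self.port = self._sock.getsockname()[1]
        return self

    def __exit__(self, *exc):
        # Release the socket so the server we're about to launch can bind the port.
        self._sock.close()


# Reserve three non-conflicting ports (two Vault nodes + one Consul node),
# write the config files while the ports are still held, then launch the servers.
with ExitStack() as stack:
    vault_a, vault_b, consul = (stack.enter_context(PortPicker()) for _ in range(3))
    # write_configs(vault_a.port, vault_b.port, consul.port)  # hypothetical helper
# The ports are released here; start the servers against the recorded numbers.
```

There is still a tiny window between releasing a port and the server binding it, which is part of why the launch retries described further down are useful.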

This also means that tests which assumed Vault is always available on 127.0.0.1:8200 needed to be updated to use whichever Vault server was launched for that test. That required changes in the server manager class and the test case class to ensure that all the right things got to the right places.
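A hedged illustration of the shape of that change; `vault_server`, `.url`, and `.root_token` are placeholder names for whatever the server manager and test case classes actually expose, not the real fixtures in this PR:

```python
import hvac


def test_vault_is_initialized(vault_server):  # "vault_server" is a hypothetical fixture
    # Before: client = hvac.Client(url="http://127.0.0.1:8200")
    # After: build the client from the server launched for this specific test run.
    client = hvac.Client(url=vault_server.url, token=vault_server.root_token)
    assert client.sys.is_initialized()
```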

The result: integration tests take ~16s on my 6-core machine with 12 test workers.
In CI, ~30s.

It's worth noting that GitHub's Ubuntu runners are supposed to have 2 vCPUs. xdist is configured by default to auto-detect the number of CPUs (or hyperthreads) and set the worker count to match, and I noticed that some test runs were choosing 2 workers and some were choosing 4.

I have not been able to find official word, but it seems like GitHub is slowly increasing the number of vCPUs on its runners (perhaps also the performance per vCPU, but that is harder for me to tell). Some independent research led me to find evidence that Ubuntu and Windows runners are going from 2 -> 4 vCPUs, and MacOS runners are going from 3 -> 4 vCPUs.

This GitHub Action, which reports the number of cores available, runs in its own CI once a week, and you can look at the history of runs to see some evidence: https://github.com/SimenB/github-actions-cpu-cores/actions

All of the above means this is a great time to parallelize the tests, because we get even more gain from the change!

Unit tests are also set to use xdist now, but the times on both my local machine and in CI are roughly the same: they were already so fast that any parallelization savings are eaten up by the fixed overhead of setting up the workers and distributing the test load. If we lean on unit tests more, we'll see the benefit in the future.


Other changes:

  • 2a6a90e -- we've started to see failures in one of the PKI tests because a field returned by Vault as a comma-separated string of values apparently does not have a guaranteed order, and the order Vault returns suddenly changed (a minimal sketch of an order-insensitive comparison follows this list).
  • Although all of our integration tests do run in CI, every individual run skips one test or another, and we didn't have good visibility into that, so I've changed the CI invocation to report exactly which tests are skipped.
  • Generally, add more retries and reporting when launching processes. With tests running in parallel there is still a small chance of port conflicts, since test workers don't talk to each other or synchronize, so the retries should handle those rare cases (a rough sketch of the retry idea also follows below).
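For the PKI fix, the assertion change is roughly of this shape; this is a sketch with made-up values, not the exact test code:

```python
# Compare the comma-separated field as a set of elements rather than as a raw
# string, since Vault does not guarantee the order of the elements.
expected = "DNS:alt1.example.com,DNS:alt2.example.com"
returned = "DNS:alt2.example.com,DNS:alt1.example.com"  # same elements, different order

assert set(returned.split(",")) == set(expected.split(","))
```

And the retry behavior when launching a process is roughly along these lines; `pick_port` and `start_server` are hypothetical stand-ins for the port context manager and the server manager's launch step:

```python
import time


def launch_with_retries(pick_port, start_server, attempts=5):
    """Try a fresh port on each attempt; start_server(port) is assumed to raise
    OSError (address already in use) if another worker grabbed the port first."""
    last_error = None
    for attempt in range(1, attempts + 1):
        with pick_port() as picked:
            port = picked.port  # the port is only reserved while the context is open
        try:
            return start_server(port)
        except OSError as exc:
            last_error = exc
            time.sleep(0.5 * attempt)  # brief backoff before picking a new port
    raise RuntimeError(f"could not start server after {attempts} attempts") from last_error
```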

Other notes

  • I originally started out seeing whether we could replace the use of local binaries with containers instead, using docker compose. It was interesting and we might pursue it later, but given how much was already built around using binaries and assuming open communication between the launched processes, I decided to put that aside and build on what's here.
  • Doctests (the thing that runs code examples in the documentation) are still slow, and single-threaded.
    • First of all, they don't run in pytest; they run via the Sphinx doctest plugin, which does still use the same server manager component for launching instances that we use in the integration tests.
    • I spent a lot of time trying to get the doctests to even use the dynamic port logic I put in place (without parallelization), and they kept running into weird conflicts despite dynamically choosing ports and launching processes. I could not figure out why.
    • I also spent time trying to get our doctests running in pytest. pytest already supports running doctest tests, including in RST files, but the Sphinx doctest plugin has a number of differences that make our tests incompatible.
    • I did take some steps toward converting them and making that work, but it would have meant rewriting almost all of our examples (even to work without dynamic ports and without parallelization), so I put that aside as well.
    • We also use some kind of vendored/hacked version of doctest, which could further complicate it.
    • Instead, that work of moving doctests to pytest could possibly go hand-in-hand with other docs modernization things:
    • Update docs theme and packages #1076
  • As part of this I also explored running our integration tests in CI on MacOS and Windows in addition to Ubuntu.
    • The first challenge was that our CI step to install Vault and Consul is Ubuntu-specific: it uses apt.
    • I did manage to come up with a PowerShell script for this that works across all 3 platforms. I am thinking about making that its own GitHub action.
    • Once that was done, the MacOS tests ran successfully; however, GitHub Actions' free usage limits us to 5 concurrent MacOS jobs at a time, and we run more than 5, so there is some contention there. As we reduce the number of Vault versions we test against, that should be alleviated.
    • Windows had some other problems. First, there's a bug in Poetry that has nothing to do with Windows but seemed to come up more frequently there, and it caused failures when installing dependencies. It seems to be fixed in Poetry 1.7.0.
    • After working through that and getting the tests to actually run, we hit the problem I expected: the tests freeze up indefinitely, as described here:
    • Integration tests can't be run on Windows #1007
    • Mainly, I wanted to see what else we might run into and whether there were any other blockers to running against Windows. I think I have a good handle on what those are, so I've removed all the other platform work from this PR, and we'll get to it in another one.

@briantist briantist added misc Used as a release-drafter "category" consul tests related to tests (not necessarily CI/CD) maintenance General technical debt developer experience Developer setup and experience labels Nov 11, 2023
@briantist briantist self-assigned this Nov 11, 2023
codecov bot commented Nov 11, 2023

Codecov Report

Merging #1105 (5211bda) into main (ede8622) will decrease coverage by 0.04%.
The diff coverage is n/a.

❗ Current head 5211bda differs from pull request most recent head 8568f67. Consider uploading reports for the commit 8568f67 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1105      +/-   ##
==========================================
- Coverage   87.16%   87.12%   -0.04%     
==========================================
  Files          64       64              
  Lines        3162     3162              
==========================================
- Hits         2756     2755       -1     
- Misses        406      407       +1     

see 1 file with indirect coverage changes

@briantist briantist changed the title from "WIP" to "Revamp integration tests, run in parallel" Nov 23, 2023
@briantist briantist marked this pull request as ready for review November 23, 2023 00:42
@briantist briantist requested a review from a team as a code owner November 23, 2023 00:42