CI Spring Cleaning #2137

sehrope · 2021-04-23T16:06:51Z

This PR addresses a couple of the topics in #2136

Limits code-style to only run on hosted runners
Disables the self-hosted runners when running GitHub Actions on forks (no more build errors on unpublished branches!)
Renames some of the GitHub Actions so the start of the name is more descriptive, e.g. "Ubuntu, code style" becomes "Code style, Ubuntu"
Cleans up the docker directory to not reference travis scripts
Removes Travis

This does not yet include the third-party action version pinning as I'm hoping for a way to store the value just once and reference it throughout the template. I tried top level environment values but you can't reference them in uses: ... clauses.

The "Only run self-hosted on pgjdbc/pgjdbc..." is done via a new build step that dynamically creates the build matrix to use for downstream steps. It outputs a JSON string that gets parsed by the other jobs as their matrix key. More info on how this works here: https://github.blog/changelog/2020-04-15-github-actions-new-workflow-features/

sehrope · 2021-04-23T16:11:45Z

Here's what the new workflow with the dynamic matrix looks like:

vlsi · 2021-04-23T16:19:09Z

I'm afraid the change kills testing for the old DB versions.

There are lots of combinations (Java in [8, 11, 16]; os in [Linux, windows, mac]; GSS in [Y, N]; PG version in [9...14]; timezone in [...]; and so on)

What if set-matrix pulls a random subset of 5-8 configurations and schedules tests appropriately?
Then we do not need to figure out which versions to run, and we have decent coverage even for unexpected combinations.

sehrope · 2021-04-23T16:38:06Z

Yes we definitely need a wider coverage than what's in the current actions matrix but rather than have it be random, I'm thinking we handle that with a separate full workflow that's run on a schedule (e.g. daily).

I don't like randomizing the PR action test matrix itself. Maybe if the randomness was seeded with something that was fixed per-PR (hash of the branch name?) so that it's consistent. But that seems like it'd be more hassle than it's worth. Especially if we have a separate task providing wider coverage.

vlsi · 2021-04-23T22:47:26Z

What is your plan?
If you remove Travis CI, then you remove a lot of important test coverage

Do you intend to replicate the comparable coverage via Travis?

sehrope · 2021-04-24T10:31:55Z

Yes that's the plan. A full matrix covering everything currently in Travis and hopefully even more.

I'm putting it together as an improved docker configuration. It'll be leveraged by the GitHub Actions and also provide a simple way for developers to test their changes against the exact same server configurations.

There's no rush to remove anything until the new coverage is in place. We can hold off on this entire PR or merge the rest and leave out the Travis removal.

sehrope · 2021-04-25T11:55:16Z

It's still a work in progress but see here for the "full matrix" branch: https://github.com/sehrope/pgjdbc/tree/omni-ci

I haven't pushed it to this PR branch yet as I didn't want to spam the repo. Took quite few a few pushes to get the actions actually running. Once it's ready for review I'm going to layer it atop of here.

That branch adds an "Omni CI" action to test a gigantic matrix of JDK x PG x SSL x XA. Also has a couple extra for replication and TZ testing. Rather than waiting for it for every PR, it runs every day at 06:00UTC so we'll have a daily report (Here's an example: https://github.com/sehrope/pgjdbc/actions/runs/782109472). We can also start it manually from the GitHub UI which is handy for a release branch (you can pick the branch it builds).

While working on this I found a couple of race conditions in the tests but overall they succeed across everything that's tested so far. The flakiest ones are the replication tests that hang on that large read/write so I've marked those as "experimental" for now. I can reproduce that locally when running the full suite but interestingly running just the failing tests does not cause the error. It must require a build-up of WAL generated by the rest of the suite or some other slowdown to trigger the exact situation.

I've also cleaned up the docker environment a bit and fiddled with some of it's default server settings to speed things up (e.g. we don't need fsync for ephemeral tests). On my local machine I'm seeing a 30-40% improvement in run time for the full suite running against the stock test docker container. As the main.yml uses the same container, it should speed up the regular PR testing flow as well.

Still need to add a couple of the more advanced test configurations, additional JDK options (right now it's just OpenJDK 8-15) , and possibly limiting the overall matrix a bit as the N x M x O x P... is a bit overkill. I'm hoping to circle back to this next week. If it looks like the additional complex configs are taking too long to set up, I'll rip out Travis removal and get this ready for inclusion with the current improvements.

bokken · 2021-04-25T13:33:11Z

Do we really need to test every version of jdk 8-<latest>? Would it make more sense to test LTS releases and latest and maybe an EA of next? So right now that would be 8, 11, 16, and 17ea. It seems that could cut down the test execution time.

…

On Sun, Apr 25, 2021 at 6:55 AM Sehrope Sarkuni ***@***.***> wrote: It's still a work in progress but see here for the "full matrix" branch: https://github.com/sehrope/pgjdbc/tree/omni-ci I haven't pushed it to this PR branch yet as I didn't want to spam the repo. Took quite few a few pushes to get the actions actually running. Once it's ready for review I'm going to layer it atop of here. That branch adds an "Omni CI" action to test a gigantic matrix of JDK x PG x SSL x XA. Also has a couple extra for replication and TZ testing. Rather than waiting for it for every PR, it runs every day at 06:00UTC so we'll have a daily report (Here's an example: https://github.com/sehrope/pgjdbc/actions/runs/782109472). We can also start it manually from the GitHub UI which is handy for a release branch (you can pick the branch it builds). While working on this I found a couple of race conditions in the tests but overall they succeed across everything that's tested so far. The flakiest ones are the replication tests that hang on that large read/write so I've marked those as "experimental" for now. I can reproduce that locally when running the full suite but interestingly running just the failing tests does not cause the error. It must require a build-up of WAL generated by the rest of the suite or some other slowdown to trigger the exact situation. I've also cleaned up the docker environment a bit and fiddled with some of it's default server settings to speed things up (e.g. we don't need fsync for ephemeral tests). On my local machine I'm seeing a 30-40% improvement in run time for the full suite running against the stock test docker container. As the main.yml uses the same container, it should speed up the regular PR testing flow as well. Still need to add a couple of the more advanced test configurations, additional JDK options (right now it's just OpenJDK 8-15) , and possibly limiting the overall matrix a bit as the N x M x O x P... is a bit overkill. I'm hoping to circle back to this next week. If it looks like the additional complex configs are taking too long to set up, I'll rip out Travis removal and get this ready for inclusion with the current improvements. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#2137 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAW3U3IIAXCU6IFC53MQASDTKP7LFANCNFSM43O5LESQ> .

vlsi · 2021-04-25T14:02:07Z

That branch adds an "Omni CI" action to test a gigantic matrix of JDK x PG x SSL x XA

I am afraid that does not qualify as fair use of what GitHub kindly provides.
That is why I suggest we should set a hard limit (e.g. 10) on jobs, and that is exactly why I suggest using a random selection so we still test unusual combinations from time to time.

sehrope · 2021-04-26T12:20:21Z

Yes I'm definitely going to limit things before bringing it into this PR. Part of the big matrix approach was just to see if it would even work (both the tests on all the platforms and the overkill matrix). Will probably do something like:

Latest stable JDK x each PG server
Latest PG server x other stable JDKs
Latest PG built from source on latest stable JDK (marked as experimental so it doesn't actually fail CI but would provide a canary environment for upcoming changes that might break the driver)
Latest stable JDK / Latest stable PG x interseting combinations (e.g. different TimeZone, different langs, etc)
Latest stable PG x Manual list of additional non-openjdk

So it'd be on the order of: (Number of PG versions) + (Number of stable JDKs) + 1 + (Number of "interesting" combos) + (Numer of "other" JDKs)

I think that should be an acceptable amount and it won't grow that fast either (linearly with each new PG or JDK release).

From what I've read in the GitHub Actions docs, I think we can parameterize the actions as well. That means we could have either the randomization factor be controlled (defaulting to N random tests but override to run a different limit) or even add options to turn on slower / extensive tests that would be disabled by default.

I'll circle back once I've had some time to clean this up some more.

sehrope · 2021-04-29T19:18:09Z

I pushed an update that cleans up some more tests and the omni ci now runs with the following builds:

Latest JDK x each stable PG version
Latest PG version x each remaining JDK
One (1) No SSL / No SCRAM (only on latest PG / JDK)
Two (2) custom server timezones (only on latest PG / JDK)
Three (3) custom query modes (only on latest PG / JDK)
One (1) latest PG / JDK with all slow tests and replication tests enabled (marked experimental for now due to replication causing that hanging issue)
Other (non Zulu OpenJDK) JDKs (only on latest PG)

Here's a sample run: https://github.com/sehrope/pgjdbc/actions/runs/797070873

For the "Other JDKs" it should be able to handle anything the GitHub action/install-jdk can handle directly or anything that can supply a tarball. For now I've included:

Adopt 8 / 11
OpenJ9 8 / 11
Amazon Corretto 8 / 11

Thanks to a combination of defaulting to disabling some of the slower tests and fancy server settings (e.g. disable fsync and full page writes), the entire run takes only 17m. I think that should be fine for something that runs daily. We can also manually fire it for a branch from the GitHub UI.

This should cover just about everything that Travis was previously testing. Only thing left out is running against server HEAD revision. We can set that up as a separate exercise as the container required for it is totally different.

Another positive side effect of all this is that all the scripts for bootstrapping the CI environment work locally. There's nothing GitHub specific so for somebody new to the repo can easily reproduce CI errors locally via:

# Start server in the foreground in a terminal:
$ docker/bin/postgres-server

# Build and test entire driver:
$ ./gradlew clean build test

sehrope · 2021-05-04T14:26:02Z

Any other thoughts on this prior to merging?

Still some open TODO items to get a server source CI target going, but that might as well be it's own separate action given how completely different it will be.

vlsi · 2021-05-04T15:30:26Z

Any other thoughts on this prior to merging?

a) The key issue is that it is really hard to maintain the set of tested (and untested) combinations manually. Travis did include hand-crafted list of jobs, and it was hard to figure out which one can be removed.

b) We need to collect coverage so we know if the test code does touch the lines affected by PR. I guess we don't need to collect coverage from all the jobs since most of them might be similar and coverage collection makes tests slower, hovewer, XA and Java might touch different code paths, so we aggregate coverage data from several jobs.

c) the current labels are very long, and it makes Actions UI almost useless as you don't see what is tested.

d) 27 jobs for each PR might be too much.

sehrope · 2021-05-04T16:20:56Z

a) The key issue is that it is really hard to maintain the set of tested (and untested) combinations manually. Travis did include hand-crafted list of jobs, and it was hard to figure out which one can be removed.

We're not really losing anything with Travis right now anyway as the jobs run successfully there anymore. We could always leave it on, but having every PR complain about not passing CI because Travis itself doesn't work is annoying.

b) We need to collect coverage so we know if the test code does touch the lines affected by PR. I guess we don't need to collect coverage from all the jobs since most of them might be similar and coverage collection makes tests slower, hovewer, XA and Java might touch different code paths, so we aggregate coverage data from several jobs.

We could do the reverse and see what's being included in the "default" test environment to figure out what's missing.

c) the current labels are very long, and it makes Actions UI almost useless as you don't see what is tested.

I tried out a couple options for giving descriptive names and settled on the JSON representation as it's the only one that has the full details. I figured that we're primarily going to look at them when one of them failed and the JSON has everything needed to recreate it locally.

The matrix generation in the CI template is commented to describe the environment and what impact that would have on testing. Easier to check there to understand the combos.

d) 27 jobs for each PR might be too much.

This isn't meant to be run per-PR. It's a single CI pipeline that runs daily at 6am UTC or manually (e.g. before a release or a one-off to test a particularly complex branch).

sehrope · 2021-05-12T14:53:05Z

I removed the commit that deletes Travis and added a couple that might fix things over there. The two issues are the underlying OS for Travis is ancient (Ubuntu Trusty 14.04) and it's pointing at apt.postgresql.org which does not have the out of support server versions.

I'm trying some more modern OS options and switched out PGDG repo to use apt-archive.postgresql.org. Let's see if this finally gets that environment working again.

…own copy

…edQueue

…oncurrency friendly

Adds a helper script to wait in a loop for a postgres server to startup.

Refactors docker environment to support versions going back to 8.4. Speeds up tests by defaulting to disabling fsync, syncronous_commit, and full_page_writes. All three can be enabled back via environment variables. Adds omni.yml GitHub Actions CI configuration to automate testing of a wide dynamic matrix of server versions, JDKs, and other testing options on an daily basis. Also adds a helper script, docker/bin/postgres-server, that can be used from on a developer's local machine to create a server environment exactly matching the CI platform.

… JDKs

sehrope · 2021-05-12T17:42:42Z

After going through the Travis setup I'm convinced we should scrap it entirely. Between the yaml file, the scripts, and the matrix environment flags, it's impossible to figure out where anything gets configured. If anybody wants to follow up on trying to get it working, I've got my latest attempt here: https://github.com/sehrope/pgjdbc/tree/ci-spring-cleaning-misc-travis-non-working

I've removed the "Removing Travis" commit from this branch. Once it passes CI I'm going to let it sit till tomorrow and then merge. All these changes either fix things or are purely additive to the test setup (e.g. the new omni action). We can have a separate PR to either remove Travis or rewrite it from scratch.

sehrope force-pushed the ci-spring-cleaning branch 3 times, most recently from 496f6c5 to 68fcded Compare May 12, 2021 14:51

sehrope force-pushed the ci-spring-cleaning branch 6 times, most recently from 322c3dd to 748fb7f Compare May 12, 2021 16:56

sehrope added 7 commits May 12, 2021 13:27

test: Limit CI codestyle to only run on hosted GitHub action runner

a2d3caa

test: Disable self-hosted GitHub action runner execution on forks

1ec220e

test: Rename GitHub Action workflow steps

e41142f

test: Replace docker-compose reference to travis ssl script with its …

1cbeb3d

…own copy

fix: Correct race condition in TestLogHandler by using ConcurrentLink…

c04105d

…edQueue

test: Change replication tests to use privileged connection

425802f

test: Mark more tests that sleep or wait as SlowTests

f158978

sehrope added 7 commits May 12, 2021 13:27

test: Change some internal test DDL to use temp tables to make them c…

7be84ce

…oncurrency friendly

test: Add missing ServerVersionRule to ReplicationConnectionTest

fae1496

test: Add wait_for_pg_isready script

d8546a4

Adds a helper script to wait in a loop for a postgres server to startup.

test: Upgrade omni ci to use setup-java@v2 and options for additional…

8eddef2

… JDKs

test: Add additional JDK options to omni ci

ee133b0

test: Remove pointless test

50bc398

sehrope force-pushed the ci-spring-cleaning branch from d08fae4 to 50bc398 Compare May 12, 2021 17:27

sehrope merged commit a627cd6 into pgjdbc:master May 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI Spring Cleaning #2137

CI Spring Cleaning #2137

sehrope commented Apr 23, 2021

sehrope commented Apr 23, 2021

vlsi commented Apr 23, 2021

sehrope commented Apr 23, 2021

vlsi commented Apr 23, 2021

sehrope commented Apr 24, 2021

sehrope commented Apr 25, 2021

bokken commented Apr 25, 2021 via email

vlsi commented Apr 25, 2021

sehrope commented Apr 26, 2021

sehrope commented Apr 29, 2021

sehrope commented May 4, 2021

vlsi commented May 4, 2021 •

edited

sehrope commented May 4, 2021

sehrope commented May 12, 2021

sehrope commented May 12, 2021

CI Spring Cleaning #2137

CI Spring Cleaning #2137

Conversation

sehrope commented Apr 23, 2021

sehrope commented Apr 23, 2021

vlsi commented Apr 23, 2021

sehrope commented Apr 23, 2021

vlsi commented Apr 23, 2021

sehrope commented Apr 24, 2021

sehrope commented Apr 25, 2021

bokken commented Apr 25, 2021 via email

vlsi commented Apr 25, 2021

sehrope commented Apr 26, 2021

sehrope commented Apr 29, 2021

sehrope commented May 4, 2021

vlsi commented May 4, 2021 • edited

sehrope commented May 4, 2021

sehrope commented May 12, 2021

sehrope commented May 12, 2021

vlsi commented May 4, 2021 •

edited