Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Spring Cleaning #2137

Merged
merged 14 commits into from
May 13, 2021
Merged

CI Spring Cleaning #2137

merged 14 commits into from
May 13, 2021

Conversation

sehrope
Copy link
Member

@sehrope sehrope commented Apr 23, 2021

This PR addresses a couple of the topics in #2136

  • Limits code-style to only run on hosted runners
  • Disables the self-hosted runners when running GitHub Actions on forks (no more build errors on unpublished branches!)
  • Renames some of the GitHub Actions so the start of the name is more descriptive, e.g. "Ubuntu, code style" becomes "Code style, Ubuntu"
  • Cleans up the docker directory to not reference travis scripts
  • Removes Travis

This does not yet include the third-party action version pinning as I'm hoping for a way to store the value just once and reference it throughout the template. I tried top level environment values but you can't reference them in uses: ... clauses.

The "Only run self-hosted on pgjdbc/pgjdbc..." is done via a new build step that dynamically creates the build matrix to use for downstream steps. It outputs a JSON string that gets parsed by the other jobs as their matrix key. More info on how this works here: https://github.blog/changelog/2020-04-15-github-actions-new-workflow-features/

@sehrope
Copy link
Member Author

sehrope commented Apr 23, 2021

Here's what the new workflow with the dynamic matrix looks like:

github-action-dynamic-ci-matrix

@vlsi
Copy link
Member

vlsi commented Apr 23, 2021

I'm afraid the change kills testing for the old DB versions.

There are lots of combinations (Java in [8, 11, 16]; os in [Linux, windows, mac]; GSS in [Y, N]; PG version in [9...14]; timezone in [...]; and so on)

What if set-matrix pulls a random subset of 5-8 configurations and schedules tests appropriately?
Then we do not need to figure out which versions to run, and we have decent coverage even for unexpected combinations.

@sehrope
Copy link
Member Author

sehrope commented Apr 23, 2021

Yes we definitely need a wider coverage than what's in the current actions matrix but rather than have it be random, I'm thinking we handle that with a separate full workflow that's run on a schedule (e.g. daily).

I don't like randomizing the PR action test matrix itself. Maybe if the randomness was seeded with something that was fixed per-PR (hash of the branch name?) so that it's consistent. But that seems like it'd be more hassle than it's worth. Especially if we have a separate task providing wider coverage.

@vlsi
Copy link
Member

vlsi commented Apr 23, 2021

What is your plan?
If you remove Travis CI, then you remove a lot of important test coverage

Do you intend to replicate the comparable coverage via Travis?

@sehrope
Copy link
Member Author

sehrope commented Apr 24, 2021

Yes that's the plan. A full matrix covering everything currently in Travis and hopefully even more.

I'm putting it together as an improved docker configuration. It'll be leveraged by the GitHub Actions and also provide a simple way for developers to test their changes against the exact same server configurations.

There's no rush to remove anything until the new coverage is in place. We can hold off on this entire PR or merge the rest and leave out the Travis removal.

@sehrope
Copy link
Member Author

sehrope commented Apr 25, 2021

It's still a work in progress but see here for the "full matrix" branch: https://github.com/sehrope/pgjdbc/tree/omni-ci

I haven't pushed it to this PR branch yet as I didn't want to spam the repo. Took quite few a few pushes to get the actions actually running. Once it's ready for review I'm going to layer it atop of here.

That branch adds an "Omni CI" action to test a gigantic matrix of JDK x PG x SSL x XA. Also has a couple extra for replication and TZ testing. Rather than waiting for it for every PR, it runs every day at 06:00UTC so we'll have a daily report (Here's an example: https://github.com/sehrope/pgjdbc/actions/runs/782109472). We can also start it manually from the GitHub UI which is handy for a release branch (you can pick the branch it builds).

While working on this I found a couple of race conditions in the tests but overall they succeed across everything that's tested so far. The flakiest ones are the replication tests that hang on that large read/write so I've marked those as "experimental" for now. I can reproduce that locally when running the full suite but interestingly running just the failing tests does not cause the error. It must require a build-up of WAL generated by the rest of the suite or some other slowdown to trigger the exact situation.

I've also cleaned up the docker environment a bit and fiddled with some of it's default server settings to speed things up (e.g. we don't need fsync for ephemeral tests). On my local machine I'm seeing a 30-40% improvement in run time for the full suite running against the stock test docker container. As the main.yml uses the same container, it should speed up the regular PR testing flow as well.

Still need to add a couple of the more advanced test configurations, additional JDK options (right now it's just OpenJDK 8-15) , and possibly limiting the overall matrix a bit as the N x M x O x P... is a bit overkill. I'm hoping to circle back to this next week. If it looks like the additional complex configs are taking too long to set up, I'll rip out Travis removal and get this ready for inclusion with the current improvements.

@bokken
Copy link
Member

bokken commented Apr 25, 2021 via email

@vlsi
Copy link
Member

vlsi commented Apr 25, 2021

That branch adds an "Omni CI" action to test a gigantic matrix of JDK x PG x SSL x XA

I am afraid that does not qualify as fair use of what GitHub kindly provides.
That is why I suggest we should set a hard limit (e.g. 10) on jobs, and that is exactly why I suggest using a random selection so we still test unusual combinations from time to time.

@sehrope
Copy link
Member Author

sehrope commented Apr 26, 2021

Yes I'm definitely going to limit things before bringing it into this PR. Part of the big matrix approach was just to see if it would even work (both the tests on all the platforms and the overkill matrix). Will probably do something like:

  • Latest stable JDK x each PG server
  • Latest PG server x other stable JDKs
  • Latest PG built from source on latest stable JDK (marked as experimental so it doesn't actually fail CI but would provide a canary environment for upcoming changes that might break the driver)
  • Latest stable JDK / Latest stable PG x interseting combinations (e.g. different TimeZone, different langs, etc)
  • Latest stable PG x Manual list of additional non-openjdk

So it'd be on the order of: (Number of PG versions) + (Number of stable JDKs) + 1 + (Number of "interesting" combos) + (Numer of "other" JDKs)

I think that should be an acceptable amount and it won't grow that fast either (linearly with each new PG or JDK release).

From what I've read in the GitHub Actions docs, I think we can parameterize the actions as well. That means we could have either the randomization factor be controlled (defaulting to N random tests but override to run a different limit) or even add options to turn on slower / extensive tests that would be disabled by default.

I'll circle back once I've had some time to clean this up some more.

@sehrope
Copy link
Member Author

sehrope commented Apr 29, 2021

I pushed an update that cleans up some more tests and the omni ci now runs with the following builds:

  • Latest JDK x each stable PG version
  • Latest PG version x each remaining JDK
  • One (1) No SSL / No SCRAM (only on latest PG / JDK)
  • Two (2) custom server timezones (only on latest PG / JDK)
  • Three (3) custom query modes (only on latest PG / JDK)
  • One (1) latest PG / JDK with all slow tests and replication tests enabled (marked experimental for now due to replication causing that hanging issue)
  • Other (non Zulu OpenJDK) JDKs (only on latest PG)

Here's a sample run: https://github.com/sehrope/pgjdbc/actions/runs/797070873

For the "Other JDKs" it should be able to handle anything the GitHub action/install-jdk can handle directly or anything that can supply a tarball. For now I've included:

  • Adopt 8 / 11
  • OpenJ9 8 / 11
  • Amazon Corretto 8 / 11

Thanks to a combination of defaulting to disabling some of the slower tests and fancy server settings (e.g. disable fsync and full page writes), the entire run takes only 17m. I think that should be fine for something that runs daily. We can also manually fire it for a branch from the GitHub UI.

This should cover just about everything that Travis was previously testing. Only thing left out is running against server HEAD revision. We can set that up as a separate exercise as the container required for it is totally different.

Another positive side effect of all this is that all the scripts for bootstrapping the CI environment work locally. There's nothing GitHub specific so for somebody new to the repo can easily reproduce CI errors locally via:

# Start server in the foreground in a terminal:
$ docker/bin/postgres-server

# Build and test entire driver:
$ ./gradlew clean build test

@sehrope
Copy link
Member Author

sehrope commented May 4, 2021

Any other thoughts on this prior to merging?

Still some open TODO items to get a server source CI target going, but that might as well be it's own separate action given how completely different it will be.

@vlsi
Copy link
Member

vlsi commented May 4, 2021

Any other thoughts on this prior to merging?

a) The key issue is that it is really hard to maintain the set of tested (and untested) combinations manually. Travis did include hand-crafted list of jobs, and it was hard to figure out which one can be removed.

b) We need to collect coverage so we know if the test code does touch the lines affected by PR. I guess we don't need to collect coverage from all the jobs since most of them might be similar and coverage collection makes tests slower, hovewer, XA and Java might touch different code paths, so we aggregate coverage data from several jobs.

c) the current labels are very long, and it makes Actions UI almost useless as you don't see what is tested.

d) 27 jobs for each PR might be too much.

@sehrope
Copy link
Member Author

sehrope commented May 4, 2021

a) The key issue is that it is really hard to maintain the set of tested (and untested) combinations manually. Travis did include hand-crafted list of jobs, and it was hard to figure out which one can be removed.

We're not really losing anything with Travis right now anyway as the jobs run successfully there anymore. We could always leave it on, but having every PR complain about not passing CI because Travis itself doesn't work is annoying.

b) We need to collect coverage so we know if the test code does touch the lines affected by PR. I guess we don't need to collect coverage from all the jobs since most of them might be similar and coverage collection makes tests slower, hovewer, XA and Java might touch different code paths, so we aggregate coverage data from several jobs.

We could do the reverse and see what's being included in the "default" test environment to figure out what's missing.

c) the current labels are very long, and it makes Actions UI almost useless as you don't see what is tested.

I tried out a couple options for giving descriptive names and settled on the JSON representation as it's the only one that has the full details. I figured that we're primarily going to look at them when one of them failed and the JSON has everything needed to recreate it locally.

The matrix generation in the CI template is commented to describe the environment and what impact that would have on testing. Easier to check there to understand the combos.

d) 27 jobs for each PR might be too much.

This isn't meant to be run per-PR. It's a single CI pipeline that runs daily at 6am UTC or manually (e.g. before a release or a one-off to test a particularly complex branch).

@sehrope sehrope force-pushed the ci-spring-cleaning branch 3 times, most recently from 496f6c5 to 68fcded Compare May 12, 2021 14:51
@sehrope
Copy link
Member Author

sehrope commented May 12, 2021

I removed the commit that deletes Travis and added a couple that might fix things over there. The two issues are the underlying OS for Travis is ancient (Ubuntu Trusty 14.04) and it's pointing at apt.postgresql.org which does not have the out of support server versions.

I'm trying some more modern OS options and switched out PGDG repo to use apt-archive.postgresql.org. Let's see if this finally gets that environment working again.

@sehrope sehrope force-pushed the ci-spring-cleaning branch 6 times, most recently from 322c3dd to 748fb7f Compare May 12, 2021 16:56
Adds a helper script to wait in a loop for a postgres server to startup.
Refactors docker environment to support versions going back to 8.4.

Speeds up tests by defaulting to disabling fsync, syncronous_commit, and
full_page_writes. All three can be enabled back via environment variables.

Adds omni.yml GitHub Actions CI configuration to automate testing of a wide
dynamic matrix of server versions, JDKs, and other testing options on an daily
basis.

Also adds a helper script, docker/bin/postgres-server, that can be used from
on a developer's local machine to create a server environment exactly matching
the CI platform.
@sehrope
Copy link
Member Author

sehrope commented May 12, 2021

After going through the Travis setup I'm convinced we should scrap it entirely. Between the yaml file, the scripts, and the matrix environment flags, it's impossible to figure out where anything gets configured. If anybody wants to follow up on trying to get it working, I've got my latest attempt here: https://github.com/sehrope/pgjdbc/tree/ci-spring-cleaning-misc-travis-non-working

I've removed the "Removing Travis" commit from this branch. Once it passes CI I'm going to let it sit till tomorrow and then merge. All these changes either fix things or are purely additive to the test setup (e.g. the new omni action). We can have a separate PR to either remove Travis or rewrite it from scratch.

@sehrope sehrope merged commit a627cd6 into pgjdbc:master May 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants