@geekosaur
Collaborator

@geekosaur geekosaur commented Nov 17, 2025

For unknown reasons, GitHub's ubuntu-22.04 image randomly breaks validate for older supported ghcs with shared object versioning errors for GLIBC (alex) and GLIBCXX (text). Force 3.16 up to ubuntu-latest, which doesn't show the problem (it hasn't happened on master or PRs targeting it, only on the 3.16 branch).


Template B: This PR does not modify behaviour or interface

E.g. the PR only touches documentation or tests, does refactorings, etc.

Include the following checklist in your PR:

  • Patches conform to the coding conventions.
  • Is this a PR that fixes CI? If so, it will need to be backported to older cabal release branches (ask maintainers for directions).

@Bodigrim
Collaborator

At the very beginning of the log, GitHub prints the versions of the runner and image it uses:

Current runner version: '2.329.0'
Runner Image Provisioner
  Hosted Compute Agent
  Version: 20251016.436
  Commit: 8ab8ac8bfd662a3739dab9fe09456aba92132568
  Build Date: 2025-10-15T20:44:12Z
Operating System
  Ubuntu
  22.04.5
  LTS
Runner Image
  Image: ubuntu-22.04
  Version: 20251112.150.1
  Included Software: https://github.com/actions/runner-images/blob/ubuntu22/20251112.150/images/ubuntu/Ubuntu2204-Readme.md
  Image Release: https://github.com/actions/runner-images/releases/tag/ubuntu22%2F20251112.150

Is there any difference in configuration between successful and failed runs?

@geekosaur
Collaborator Author

I don't see one. https://github.com/haskell/cabal/actions/runs/19380948375/job/55459273250 is an example; I matched it with https://github.com/haskell/cabal/actions/runs/19443824531/job/55633432198 (both ghc 9.2.8).

3.16 still runs on Ubuntu 22.04; master is up to 24.04, so PRs on master won't match. I hope we don't have to force 3.16 up to 24.04, since I suspect it's on 22.04 still for a reason.

For some reason 22.04 is now randomly failing with old ghcs (that
should have been built on it??) with GLIBC and GLIBCXX versioning
errors. This doesn't show up on `master`, and this PR was failing
until I bumped it up to ubuntu-latest as well.

As with `master`, `validate-old-ghcs` still uses ubuntu-22.04
because they need `libtinfo5`, which isn't available on 24.04.
@Bodigrim
Collaborator

(Moritz suggested checking for cache issues.)

Looking at https://github.com/haskell/cabal/actions/runs/19446903122/job/55643460914?pr=11296#step:9:6, the cache key is just Linux-9.2.8-0fdff09e7d2103b8b688b4ba27d1dd8c4963baa4, meaning that caches from Ubuntu 22.04 and 24.04 will be used interchangeably.

You probably want to extend

key: ${{ runner.os }}-${{ matrix.ghc }}-${{ github.sha }}

with GLIBC version as a part of the key.
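
A minimal sketch of what such a key could look like, assuming the cache is handled by `actions/cache` and the job runs on a glibc-based Linux runner. The step id `libc`, the output name `version`, and the cached path are illustrative assumptions, not taken from cabal's workflow:

```yaml
steps:
  # Hypothetical step: record the runner's glibc version as a step output.
  # On glibc systems `getconf GNU_LIBC_VERSION` prints a string such as
  # "glibc 2.35"; the space is replaced so the value is safe in a key.
  - name: Detect glibc version
    id: libc
    shell: bash
    run: |
      echo "version=$(getconf GNU_LIBC_VERSION | tr ' ' '-')" >> "$GITHUB_OUTPUT"

  - uses: actions/cache@v4
    with:
      # Assumed path; the real workflow may cache something else.
      path: ~/.cabal/store
      # The existing key with the glibc version added as one more component.
      key: ${{ runner.os }}-${{ steps.libc.outputs.version }}-${{ matrix.ghc }}-${{ github.sha }}
```

With a key of this shape, runners whose glibc versions differ produce different keys, so their caches no longer collide under a single `Linux-...` prefix.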

@geekosaur
Collaborator Author

I'd already flushed the caches.

I don't think that sharing is the problem, because it's happening on this PR, which isn't a backport from master; it's a direct PR on 3.16. So the SHA can't be matching a cache built with 24.04.

(I do wonder if we can improve cache usage by replacing the SHA with the image name, allowing caches to be shared between PRs; I don't think that can break anything, but would need significant testing. Also I'd have to make sure the cache didn't include dist-newstyle or variants thereof, but I suspect it shouldn't be anyway because a PR in development might see considerable changes that we don't necessarily want to keep between runs. Maybe we can stick that in a separate per-SHA cache if it does turn out to be needed, though. In any case, this will have to wait for post-release; right now, the important part is getting 3.16.1.0 out the door.)

@Bodigrim
Collaborator

Bodigrim commented Nov 17, 2025

> So the SHA can't be matching a cache built with 24.04.

Could it be the case that some jobs in this PR use 22.04 and some 24.04? If I'm not mistaken, "Bootstrap" ones are on 24.04 and "Validate" are on 22.04.

@geekosaur
Collaborator Author

Oh, hm, we have a fallback cache set. That might inappropriately be bootstrapping the cache from the wrong image; I'll have to look at how (or if!) we make those fallbacks and see if I can build the image name into it.
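
For illustration, assuming the fallback is the `restore-keys` input of `actions/cache` (the thread does not show the actual configuration), the sketch below shows why a fallback can cross images. `restore-keys` entries are matched as prefixes, so a prefix built from `runner.os` (which is just `Linux`) matches caches saved on any Ubuntu image. The cached path is an assumption:

```yaml
- uses: actions/cache@v4
  with:
    # Assumed path; the real workflow may cache something else.
    path: ~/.cabal/store
    key: ${{ matrix.os }}-${{ matrix.ghc }}-${{ github.sha }}
    # Prefix match: any cache whose key starts with this string can be
    # restored when the exact key misses. Using matrix.os instead of
    # runner.os only narrows the match if the matrix names a concrete
    # image such as ubuntu-22.04 rather than ubuntu-latest.
    restore-keys: |
      ${{ matrix.os }}-${{ matrix.ghc }}-
```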

@geekosaur
Collaborator Author

geekosaur commented Nov 17, 2025

Bootstrap caches are separate (the cache keys start with "Bootstrap-"). That's one reason we're blowing out our cache limit so often now.

@geekosaur
Collaborator Author

Hm, actually, sharing caches between PRs will cause thrashing (probably not corruption, I hope) if multiple PRs are building at the same time. I think I'll just replace runner.os with matrix.os for now. (In a separate PR, since that should be done on master and backported.)

@Bodigrim
Collaborator

I suspect that matrix.os is literally ubuntu-latest, which probably does not meaningfully improve things.

@geekosaur
Collaborator Author

I know, but I couldn't find a context which supplies anything better and I can't use a shell-out in a cache key. And otherwise threading stuff like that between jobs that need to share a cache is rather painful (I hit soooo many pain points while working on #10503…).

Member

@Mikolaj Mikolaj left a comment

Let's do it!

@geekosaur
Collaborator Author

Okay, ugly hack but it looks like I can expose a version via a step output. Problem then becomes how to do it for 3 platforms.

(I'm also going to add arch, which we don't currently care about but will matter when I bring back Intel Mac validates or we at some point extend testing to e.g. AArch64 Linux, both of which will lead to the same mayhem we've been seeing here.)
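
One possible shape for a step output that covers three platforms, offered as a sketch rather than as what the PR ended up doing. The step id `platform` and output name `version` are hypothetical, and the Windows branch assumes the `ImageVersion` environment variable that GitHub-hosted runner images are understood to set:

```yaml
steps:
  - name: Detect platform version
    id: platform
    shell: bash   # bash is available on GitHub-hosted Linux, macOS, and Windows runners
    run: |
      # RUNNER_OS is set by the runner to Linux, macOS, or Windows.
      case "$RUNNER_OS" in
        Linux)   ver="$(getconf GNU_LIBC_VERSION | tr ' ' '-')" ;;
        macOS)   ver="macos-$(sw_vers -productVersion)" ;;
        # Assumption: fall back to "unknown" if ImageVersion is not set.
        Windows) ver="windows-${ImageVersion:-unknown}" ;;
        *)       ver="unknown" ;;
      esac
      echo "version=$ver" >> "$GITHUB_OUTPUT"
```

A later step in the same job could then reference `${{ steps.platform.outputs.version }}` inside its cache key. Jobs that need to share a cache would each have to run the same detection step, since step outputs are local to a job unless they are re-exported as job outputs.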

@Bodigrim
Collaborator

I think arch is covered by GHC ABI version and should lead to different package hashes, so they won't clash even if mixed in the same cache.

@geekosaur
Collaborator Author

Right, but it'll make the cache contention/thrashing worse (cf. previous comment about sharing them across PRs). If GitHub's gonna prune caches aggressively, we need to be more aggressive about not wasting our cache space.

@geekosaur
Collaborator Author

And yes, that means my comment about adding them was wrong. Sorry, too many things in flight at the moment.

@geekosaur
Collaborator Author

Ugh, Mergify's started applying #reviews >= 2 to release branch PRs even though branch protection correctly says only 1 is needed. Hacking around…

@mergify mergify bot added the queued label Nov 18, 2025
@geekosaur
Collaborator Author

Mmm, maybe I'm misremembering. But branch protection says only one is needed, so Mergify should accept that.

@mergify mergify bot merged commit a63f250 into 3.16 Nov 18, 2025
58 checks passed
@mergify mergify bot deleted the GLIBCXX-trace-again branch November 18, 2025 07:42
@mergify mergify bot removed the queued label Nov 18, 2025