Cirrus CI free usage going away - job runtime issues & credits #24280

Open
rgommers opened this issue Jul 28, 2023 · 38 comments

Comments

@rgommers
Member

This is getting kinda annoying: [skip cirrus] is broken and the wheel build and other CI jobs are running way too often. It was probably broken by the addition of other logic, like always triggering on PRs with a build-related label.

@charris
Member

charris commented Jul 28, 2023

What do you want to happen with the build label? I use it when I want to test wheel builds. It is a bit annoying, as simply adding a label will retrigger the builds.

@rgommers
Member Author

Labels are for indicating what a PR or issue is about, so I add it to build-related PRs. And running wheel builds by default is fine, but the [skip cirrus] should supersede that (just like it does for [skip actions] on GHA).

@charris
Member

charris commented Jul 28, 2023

but the [skip cirrus] should supersede

Fair enough, but some labels do trigger wheel builds, so they aren't only about what the PR or issue is about. Not sure how we could change that unless we want to add a [build wheels] option.

@rgommers
Member Author

Why [skip cirrus] doesn't work is explained by

numpy/.cirrus.star

Lines 23 to 25 in 7fc7277

# Obtain commit message for the event. Unfortunately CIRRUS_CHANGE_MESSAGE
# only contains the actual commit message on a non-PR trigger event.
# For a PR event it contains the PR title and description.

It looks like it's not the build label though, we run wheel builds on Cirrus always. E.g. look at this doc-only PR with no labels: gh-24277. This wastes a huge amount of compute time.

@charris
Member

charris commented Jul 28, 2023

Cirrus is a bit different. One problem is that it cannot be manually run, I think it needs something in the *.yml file for that. I also think (only) wheels are built and tested, I'm not sure how a build can be made without that. It is on my problem list, but not annoying enough to spend time on it. Maybe @andyfaff has some thoughts.

@andyfaff
Member

andyfaff commented Jul 29, 2023

.cirrus.star is triggered to run with a lot of GH events on the numpy/numpy repo, e.g. if there are commits to PRs, commits to branches, PRs are opened, labels are attached to PRs, tags are pushed, merges, etc.

Here is the current logic for numpy's cirrus CI, as contained in .cirrus.star:

  • only run jobs on the numpy/numpy repo.
  • (do a wheels build if a 'nightly' cron job is requested)
  • for all other triggers (and there can be a lot of them) the configuration script looks at the commit message of the SHA, provided by CIRRUS_CHANGE_IN_REPO, for the event. If there is [skip cirrus] or [skip ci] in the message then don't run any CI.
  • otherwise do the wheel build and the macosx_arm64 runs.

This logic will dictate that the wheels always get built if there is no [skip cirrus] in the commit message.

[skip cirrus] is broken and the wheel build and other CI jobs are running way too often.

I'm pretty sure [skip cirrus] works for commits to PRs. But for other events if the commit message for the SHA doesn't contain those words then the wheel build and macosx_arm64 will run. For example adding labels to a PR will trigger these runs if the last commit to the PR doesn't contain the magic words.

It's not clear to me from previous comments what additional logic is being requested. The following environment variables may be useful in reducing the number of runs made:

  • CIRRUS_PR | PR number if current build was triggered by a PR - we may be able to query Github as to what occurred in the PR to trigger cirrus-ci. We already query GH in cirrus.star.
  • CIRRUS_PR_LABELS | comma separated list of PR's labels if current build was triggered by a PR - examine the labels to see if 36 - Build is present. e.g. if you look in the debugging info for https://cirrus-ci.com/build/4758553810960384 you'll see that label present.
  • CIRRUS_TAG | Tag name if current build was triggered by a new tag.

Possible extra logic that could be done:

  • don't automatically do wheel build unless ...
  • it's a nightly job
  • scan SHA message for [wheel build], do wheel build if present
  • scan CIRRUS_PR_LABELS for 36 - Build, do wheel build if present
  • build wheels for a tag event
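
The extra logic proposed above could be sketched roughly as follows. This is plain Python rather than the Starlark used in .cirrus.star, and the function name `should_build_wheels`, its parameters and the `is_nightly_cron` flag are hypothetical. Only the `[wheel build]` marker, the `36 - Build` label and the comma-separated format of CIRRUS_PR_LABELS come from the comment above.

```python
def should_build_wheels(commit_message, pr_labels="", tag="", is_nightly_cron=False):
    """Decide whether a wheel build should run (hypothetical helper).

    pr_labels mimics CIRRUS_PR_LABELS (comma-separated label names),
    tag mimics CIRRUS_TAG (empty when the event is not a tag push).
    """
    # Nightly cron jobs always build wheels.
    if is_nightly_cron:
        return True
    # A tag event builds wheels.
    if tag:
        return True
    # An explicit request in the commit message builds wheels.
    if "[wheel build]" in commit_message:
        return True
    # Otherwise only build when the build label is attached to the PR.
    labels = [label.strip() for label in pr_labels.split(",") if label.strip()]
    return "36 - Build" in labels
```

With this shape, a doc-only commit with no labels returns False, so wheels would no longer be built by default.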

Relevant links:

CIRRUS_CHANGE_IN_REPO
numpy's .cirrus.star config
numpy's wheel build config
numpy's macosx_arm64 config
cirrus environment variables

EDIT:
[skip cirrus] wasn't working, fixed in #24285.

@andyfaff
Member

Having just said that I see an issue at https://github.com/numpy/numpy/blob/main/.cirrus.star#L27. I'll open a PR.

@andyfaff
Member

You can examine what we requested from the GH API in 24282. The github request is:
https://api.github.com/repos/numpy/numpy/git/commits/cad8595a8c86c173285d82b61f6797ff24324364.

This is what is returned:

{
  "sha": "cad8595a8c86c173285d82b61f6797ff24324364",
  "node_id": "C_kwDOAA3dP9oAKGNhZDg1OTVhOGM4NmMxNzMyODVkODJiNjFmNjc5N2ZmMjQzMjQzNjQ",
  "url": "https://api.github.com/repos/numpy/numpy/git/commits/cad8595a8c86c173285d82b61f6797ff24324364",
  "html_url": "https://github.com/numpy/numpy/commit/cad8595a8c86c173285d82b61f6797ff24324364",
  "author": {
    "name": "Andrew Nelson",
    "email": "andyfaff@gmail.com",
    "date": "2023-07-29T00:15:13Z"
  },
  "committer": {
    "name": "Andrew Nelson",
    "email": "andyfaff@gmail.com",
    "date": "2023-07-29T00:15:13Z"
  },
  "tree": {
    "sha": "1c9f90fbbcff3162542b6663e0fe75a86e819bb4",
    "url": "https://api.github.com/repos/numpy/numpy/git/trees/1c9f90fbbcff3162542b6663e0fe75a86e819bb4"
  },
  "message": "CI: correct URL in cirrus.star [skip cirrus]",
  "parents": [
    {
      "sha": "422854fa8dc501e5fcbd713093fdee04e7e9ebb8",
      "url": "https://api.github.com/repos/numpy/numpy/git/commits/422854fa8dc501e5fcbd713093fdee04e7e9ebb8",
      "html_url": "https://github.com/numpy/numpy/commit/422854fa8dc501e5fcbd713093fdee04e7e9ebb8"
    }
  ],
  "verification": {
    "verified": false,
    "reason": "unsigned",
    "signature": null,
    "payload": null
  }
}

The cirrus CI didn't run because [skip cirrus] is in dct['message'].
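
As a minimal illustration of that check, here is a Python sketch that applies the same substring test to a trimmed copy of the response above. The helper name `should_skip` and the `SKIP_MARKERS` tuple are made up for this example; the real check lives in .cirrus.star and is written in Starlark.

```python
import json

# Trimmed copy of the GitHub API response shown above (only the fields used here).
payload = json.loads("""
{
  "sha": "cad8595a8c86c173285d82b61f6797ff24324364",
  "message": "CI: correct URL in cirrus.star [skip cirrus]"
}
""")

# Markers that suppress all Cirrus CI jobs.
SKIP_MARKERS = ("[skip cirrus]", "[skip ci]")


def should_skip(dct):
    """Return True if the commit message contains one of the skip markers."""
    return any(marker in dct["message"] for marker in SKIP_MARKERS)


print(should_skip(payload))
```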

@andyfaff
Member

andyfaff commented Jul 29, 2023

I think we should be able to limit the wheel build to whatever is in the GHA wheels build, if that's desired.

EDIT: apart from manual trigger, not sure how to do that.

@rgommers
Member Author

rgommers commented Jul 30, 2023

Thanks for the fix @andyfaff!

Given the major reduction in free resources available as of 1 Sep (see https://cirrus-ci.org/blog/2023/07/17/limiting-free-usage-of-cirrus-ci/), I think we have a lot more work to do here unfortunately (and may consider buying some credits).

Regarding the label-based trigger, I think there are two things wrong with it:

  • the Build label is wrong for this, it should be a dedicated and clearly named label like trigger-cirrus
  • it's a problem that this label, once added to a PR, tends to stay on it and then the wheel builds run for every subsequent push. Typically what's intended is a one-off "check wheel builds". It doesn't really make sense (resource-usage wise) to run the full battery of wheel builds on every push to a PR.

Our resource usage at the current rate (see screenshot below) is completely unsustainable: it would cost ~$2,500/month (or ~$1,400 after the upcoming price reductions also announced in the blog post linked above) if we had to pay for it from 1 Sep. Given that and the above, I'd much prefer to get rid of label-based triggering completely. Manual wheel build triggers should be rare and reserved for maintainers who know what they are doing and are able to push an empty commit with the correct commands in the commit message.

image

CPU usage is also bad on cibuildwheel jobs (and note that credits go per CPU-minute, i.e. per core rather than per job); we need to make sure pytest uses 2 cores:

image

I'll note that on jobs with 2 CPUs, using -n auto also isn't great: pytest-xdist translates that to -n4 rather than -n2, and its parallel scaling is poor. -n2 on 2-core jobs is already far from linear, -n3 is questionable but typically still gains, and >=4 workers quickly decreases performance.

Example log from a recent macos_arm64_test run showing we get 4 pytest-xdist workers:

$ /Users/admin/numpy-dev/bin/python3.10 -m pytest --rootdir=/private/var/folders/76/zy5ktkns50v6gt5g8r0sf6sc0000gn/T/cirrus-ci-build/build-install/usr/lib/python3.10/site-packages -n auto -m 'not slow' numpy
============================= test session starts ==============================
platform darwin -- Python 3.10.6, pytest-7.4.0, pluggy-1.2.0
rootdir: /private/var/folders/76/zy5ktkns50v6gt5g8r0sf6sc0000gn/T/cirrus-ci-build/build-install/usr/lib/python3.10/site-packages
configfile: ../../../../../pytest.ini
plugins: hypothesis-6.82.0, xdist-3.3.1
created: 4/4 workers
4 workers [34198 items]
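
One way to avoid relying on -n auto is to pass an explicit worker count. The sketch below is only an illustration: the helper name `pytest_command` and the default cap of 2 workers are assumptions for this example, not what the CI config does. The `-n` and `-m 'not slow'` arguments are the ones visible in the log above.

```python
def pytest_command(cpus_allotted, max_workers=2):
    """Build a pytest invocation with an explicit pytest-xdist worker count.

    cpus_allotted is the number of CPUs the CI job was given; max_workers is
    a hypothetical cap, since scaling beyond a few workers was reported to
    be poor.
    """
    workers = max(1, min(cpus_allotted, max_workers))
    return ["python", "-m", "pytest", "-n", str(workers), "-m", "not slow", "numpy"]


# A 4-core macOS instance would still only get 2 workers with the default cap.
print(" ".join(pytest_command(4)))
```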

Here is the full list of jobs and runtimes for a single run:

image

That's a total of ~222 CPU minutes for wheel builds per run, divided in

  • 77.5 * 2 = 144 min on Linux aarch64 = $0.145 per run
  • 33.5 * 4 = 134 min on macOS = $0.67 per run

So each wheel build costs about $0.80 each time it's triggered - this is a lot. We also have issues with some tests in the full test suite that need fixing (e.g., the slow typing tests shouldn't be run by default, they're the same on all platforms and take well over a minute). But most importantly, we should not be triggering wheel builds so much, they're only very rarely useful.

@rgommers
Member Author

As you can see in gh-24289, that PR (which only tweaked a code comment in a meson.build file, and to which I added the 04 - Documentation label) still triggered a full set of wheel builds on Cirrus. I cancelled them manually after they started running.

@rgommers
Member Author

And then after a merge to main, it's running yet again: https://cirrus-ci.com/build/5871472732798976.

@rgommers
Member Author

rgommers commented Jul 30, 2023

This is the relevant code in .cirrus.star:

    # Obtain commit message for the event. Unfortunately CIRRUS_CHANGE_MESSAGE
    # only contains the actual commit message on a non-PR trigger event.
    # For a PR event it contains the PR title and description.
    SHA = env.get("CIRRUS_CHANGE_IN_REPO")
    url = "https://api.github.com/repos/numpy/numpy/git/commits/" + SHA
    dct = http.get(url).json()
    # if "[wheel build]" in dct["message"]:
    #     return fs.read("ci/cirrus_wheels.yml")

    if "[skip cirrus]" in dct["message"] or "[skip ci]" in dct["message"]:
        return []

    # add extra jobs to the cirrus run by += adding to config
    config = fs.read("tools/ci/cirrus_wheels.yml")
    config += fs.read("tools/ci/cirrus_macosx_arm64.yml")

    return config

I don't see any label-based triggers, also not in tools/ci/cirrus_*. I think what we need here is to (a) uncomment the lines with [wheel build], and (b) delete the line config += fs.read("tools/ci/cirrus_wheels.yml").
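
Applying (a) and (b) literally would give control flow along these lines. This is a Python model, not the Starlark file itself: `select_configs` is a hypothetical name, it returns config paths instead of calling fs.read, and it assumes the tools/ci/ path for the wheels config (the commented-out line above still says ci/cirrus_wheels.yml).

```python
WHEELS = "tools/ci/cirrus_wheels.yml"
MACOS_ARM64 = "tools/ci/cirrus_macosx_arm64.yml"


def select_configs(message):
    """Return the config files to load for a given commit message."""
    # (a) the previously commented-out check: wheels only on explicit request
    if "[wheel build]" in message:
        return [WHEELS]

    if "[skip cirrus]" in message or "[skip ci]" in message:
        return []

    # (b) the wheels config is no longer added by default
    return [MACOS_ARM64]
```

Note that in this literal reading a message containing both `[wheel build]` and `[skip cirrus]` would still build wheels, because the wheel check comes first.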

@andyfaff
Member

@rgommers, see #24286

@rgommers
Member Author

rgommers commented Jul 30, 2023

I'll try to fix some of the test suite invocation and runtime issues. EDIT: see gh-24291

rgommers changed the title from "[skip cirrus] logic broken." to "[skip cirrus] logic broken and job runtime issues on Cirrus CI" on Jul 30, 2023
@andyfaff
Member

andyfaff commented Aug 1, 2023

I'm currently experimenting with ccache for scipy builds (which use meson). Would the numpy macosx_arm64 benefit from this?

@rgommers
Member Author

rgommers commented Aug 1, 2023

I think so - not by much though, given that the whole build is less than a minute and ~10 seconds of that is the configure stage. So if ccache helps by a factor of ~2x, it may save 20 sec or so.

@mattip
Member

mattip commented Aug 1, 2023

Is there a way to tie the cirrus CI builds into the successful run of the smoke test from github actions?

@andyfaff
Member

andyfaff commented Aug 1, 2023

@mattip , I'm not sure. It might be possible to have manual triggering if desired, https://cirrus-ci.org/guide/writing-tasks/#manual-tasks

@andyfaff
Member

I think this can probably be closed now

@rgommers
Member Author

The skip/run logic is fixed (thanks!), but gh-24291 still needs finishing and then we need to deal with Cirrus CI credits. So let me re-title this issue rather than close it.

rgommers changed the title from "[skip cirrus] logic broken and job runtime issues on Cirrus CI" to "Cirrus CI free usage going away - job runtime issues & credits" on Aug 12, 2023
@rgommers
Member Author

Current state after 12 days in August - this is looking pretty good, ~3x over the free limit:

image

We haven't done many wheel builds in August though, and we do need those soon for the 1.26.x releases. Finishing up gh-24291 should be useful there. And then we'll probably end up with an O($150/month) bill; we can then figure out whether we're happy with that and, if so, the logistics of paying it.

@rgommers
Member Author

rgommers commented Sep 6, 2023

We're 19 credits away from an outage, so at this rate another 5 days or so. I'll have a look at buying some credits or wiring up a credit card tomorrow.

rgommers added a commit to rgommers/numpy that referenced this issue Sep 13, 2023
See docs at https://cirrus-ci.org/pricing/#compute-credits
Starting with collaborators only, because those are the only ones who
should trigger wheel builds, and also author the vast majority of
PRs where architecture-specific CI is actually useful.
We can always set it to an unconditional "true" later on.

xref gh-24280

[skip actions] [skip azp] [skip circle]
@rgommers
Member Author

Cirrus upped the free credits from 40 to 50, and we're at 41 now - so no problems so far. I've bought a bunch of credits and opened gh-24695 to enable using them.

@rgommers
Member Author

After 1.5 days of usage we used 1.02 credits. It's actually quite nice that you get to see how much each run costs:

image

macOS arm64 is a little expensive 🤔. This is what the docs say right now:

  • 1000 minutes of 1 virtual CPU for Linux platform for 3 compute credits
  • 1000 minutes of 1 virtual CPU for FreeBSD platform for 3 compute credits
  • 1000 minutes of 1 virtual CPU for Windows platform for 4 compute credits
  • 1000 minutes of 1 Apple Silicon CPU for 15 compute credits

I had interpreted the Apple Silicon CPU as being the whole CPU, not a CPU core. Since you can only get an instance with 4 cores, I thought pricing would end up similar to that for Windows - but it's 4x more.
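
To make the per-core pricing concrete, here is a small sketch that turns the rates quoted above into credits per job. The function name `credits_used`, the platform keys and the 10-minute job duration are made up for illustration; only the four rates come from the docs excerpt.

```python
# Compute credits per 1000 CPU-minutes, as quoted from the Cirrus CI docs above.
CREDITS_PER_1000_CPU_MINUTES = {
    "linux": 3,
    "freebsd": 3,
    "windows": 4,
    "apple_silicon": 15,
}


def credits_used(platform, cpus, wall_minutes):
    """Credits consumed by one job: billing is per CPU-minute, not per job."""
    cpu_minutes = cpus * wall_minutes
    return cpu_minutes * CREDITS_PER_1000_CPU_MINUTES[platform] / 1000


# Hypothetical 10-minute job on a 4-core instance of each platform.
mac = credits_used("apple_silicon", cpus=4, wall_minutes=10)
win = credits_used("windows", cpus=4, wall_minutes=10)
print(f"macOS arm64: {mac:.2f}, Windows: {win:.2f}, ratio: {mac / win:.2f}")
```

For equal core counts and durations the ratio is just 15/4 = 3.75, which is the "roughly 4x" noted above.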

These jobs were all from maintainers; the external contributor PRs continue to run jobs but don't consume credits. I guess we'll have to see what happens when Cirrus starts to enforce the free credit limit.

So right now the $1/day consumption isn't too worrying, but if consumption goes up we may have to think about redoing how we trigger the macOS job.

@rgommers
Member Author

For the record, the NumPy Steering Council signed off on my proposal to spend credits - up to max $200/month.

My goal would be to stay below $100/month, and that seems to be feasible. And the invoicing and consumption reporting seems reasonably smooth, so all good so far.

@andyfaff
Member

There are a couple more things to try: manual triggering of the Mac run, or a cron job, e.g. every couple of days.

@rgommers
Member Author

We used $45 in the first 28 days. I just added $99, so we're good for quite a while now. The consumption is close to what I estimated before.
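
As a rough back-of-the-envelope check of those numbers (a sketch only: it assumes consumption stays constant, uses a 30-day month, and ignores any balance left over before the top-up):

```python
spent_usd = 45.0   # consumed in the first 28 days
days = 28
added_usd = 99.0   # amount just added

daily_burn = spent_usd / days          # dollars per day
monthly_burn = daily_burn * 30         # dollars per 30-day month
runway_days = added_usd / daily_burn   # how long the top-up lasts at this rate

print(f"~${monthly_burn:.0f}/month, top-up lasts ~{runway_days:.0f} days")
```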

@rgommers
Member Author

We're at $0 now. There are a couple of issues:

  • Credit consumption went too fast. We did not use 99 credits in the last 35 days, more like 40.
  • SciPy has the opposite issue, zero credits were subtracted in the last 6 weeks. Since I paid both with the same credit card, it looks like the SciPy credit usage got subtracted from the NumPy funds 🤔.
  • I cannot add more credits right now, somehow Cirrus is not liking my credit card anymore.

I have to follow up with them, but in the meantime CI jobs may stop running.

@charris
Member

charris commented Nov 16, 2023

If worst comes to worst, I also have a credit card 😉

@rgommers
Member Author

I've added $98 today (20 Dec '23), they're accepting my credit card again. Don't know what happened, assuming some validation issue. I still need to follow up with Cirrus CI about issues with mixing NumPy/SciPy credits and getting better invoices.

@rgommers
Member Author

Hmm, with all the recent changes to wheel builds we're now up to about $2.75 per run, that's a lot more than the ~$0.80 per run estimated higher up in this thread.

image

Reasons:

  • We added musllinux_aarch64, doubling the duration of the Linux jobs
  • We added double the number of macOS jobs, for Accelerate wheels
  • The test suite has gotten slower by a significant amount

Only the last reason is fixable.

@rgommers
Member Author

rgommers commented Jan 7, 2024

In anticipation of lots of wheel builds in the run-up to 2.0.0rc1 I added $121 more credits - we're at 164 credits as of today (7 Jan 2024). Should be enough until sometime in Feb. Next up: testing the reimbursement process.

@rgommers
Member Author

rgommers commented Mar 8, 2024

Bought another $97 in credits (note: slightly different amount each time is to ensure the invoices are easier to tell apart). With a number of the macOS arm64 jobs now migrated to GHA, this should hopefully last us a little while.

@rgommers
Member Author

rgommers commented Mar 8, 2024

Just a note on support quality: I noticed that a single job ran for ~17 minutes longer yesterday, and hence consumed more credits. I emailed Cirrus support; they were already on it, and fixed the issue with a comprehensive explanation within an hour or so, with credits returned and all previous jobs in the last month also audited. Impressive.

@charris
Member

charris commented Mar 8, 2024

@rgommers This is an example that supports the argument some economists make for leaving spending decisions to the individual :)

@rgommers
Member Author

rgommers commented Jun 5, 2024

We hit zero; I added $150 with the shiny new NumPy credit card - seems to have gone fine.

Looks like we're burning a bit more with recent release and dev activity - but still well within budget:

image image

@rgommers
Member Author

Added another $151 today. Burn is roughly $50/month at the moment.

With the free usage months gone, the graph is more informative now:

image


4 participants