
frequent CI failures with Julia nightly and 1.9 on macOS due to timeout #1888

Closed
fingolfin opened this issue Jan 30, 2023 · 12 comments

@fingolfin
Member

Looking at e.g. https://github.com/oscar-system/Oscar.jl/actions/runs/4026539430, the macOS nightly job exceeds our self-imposed 2 hour timeout limit.

In contrast, Julia nightly on Linux runs the test suite in ~1 hour, while Julia 1.6 on macOS requires roughly 1 hour 25 minutes (about the same as Julia 1.6 on Linux in that CI run).

The failure started with our commit c6ece85, but given the nature of the failure I don't think that commit is at fault; it seems more likely that a change on the Julia side is responsible.

The last "good" run used

Julia Version 1.10.0-DEV.418
Commit 7630606f16a (2023-01-22 05:37 UTC)

The first "bad" run used

Julia Version 1.10.0-DEV.430
Commit 9b1ffbbe1bc (2023-01-25 12:52 UTC)

Changes on the Julia side: JuliaLang/julia@7630606...9b1ffbb

Of course there could also be changes in one of our dependencies (one could try to extract this from the CI logs above). It might also be interesting to find out which tests got slower... A quick check shows that e.g. the AbelianClousre (sic!) went from 52s to 81.3s and MPolyAnyMap/MPolyRing went from 4m04.7s to 6m47.2s. But why, and why only on macOS + Julia nightly?

@benlorenz
Member

I thought I could re-run the last good job on GitHub Actions to see whether the newer Julia version would make it time out as well, but the re-run installed an even older Julia version:

Julia Version 1.10.0-DEV.387
Commit e1fc4824b92 (2023-01-18 03:54 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.5.0)
  CPU: 3 × Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, ivybridge)
  Threads: 1 on 3 virtual cores

@benlorenz
Member

benlorenz commented Jan 30, 2023

I did another retry of that build, this time it used:

Julia Version 1.10.0-DEV.456
Commit 4f92e79ffd6 (2023-01-30 11:52 UTC)

and the tests succeeded in 1.5 hours. Better but still quite a bit worse than before.

I did notice one particularly bad testcase in the failed log:

K3 surface automorphism groups |   28     28  19m15.4s

In the new one that I linked here and in the last good one linked by Max this took only about 2-3 minutes.

@benlorenz
Member

benlorenz commented Feb 1, 2023

I think the CI runners are at least partly to blame; I have been running a few tests manually here. I added some system stats at the beginning and enabled GC logging (a sketch of that instrumentation follows the list below).

  • For example, in this run the load average before starting the tests was about 8 (the runners have 3 CPUs available), with CPU usage: 43.68% user, 56.6% sys, 0.24% idle, and mds + mds_stores working hard to index files. Unfortunately that run failed early due to the msolve issue.
  • Or here we were at 500MB free of 14GB total memory. The log then shows quite a lot of GC pressure, especially at the end here. That run reached line 5000 before being aborted; a normal test run (without GC logging) needs about 1300 lines and the Ubuntu tests needed 2260 lines, so roughly 1000 GC printouts on Ubuntu versus about four times as many GC pauses on macOS.
  • The latest run does look somewhat promising: it had about 5.5GB free and the load average was 4.6 at the beginning, but at least the mds processes, which seem to be rather resource-intensive, were not as busy. However, rather close to the end there are a lot of long pauses with very little collected memory.
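For reference, a minimal sketch of that kind of instrumentation (not the exact CI snippet; these are plain Julia Base calls, and GC.enable_logging needs Julia 1.8 or newer):

# print some system stats before the tests start (sketch, not the actual CI code)
println("CPU threads:  ", Sys.CPU_THREADS)
println("load average: ", Sys.loadavg())
println("free memory:  ", Base.format_bytes(Sys.free_memory()),
        " of ", Base.format_bytes(Sys.total_memory()))

# make the GC print one line per collection (pause time and collected memory)
GC.enable_logging(true)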

Apart from these issues with the runners, it does look somewhat similar to the effects seen in JuliaLang/julia#48473, but the runners are not using M1, and the free memory values do agree between versioninfo and the output of top. The other PR mentioned there, JuliaLang/julia#47292, is included in the list of commits where we started seeing this.

Edit: This did happen earlier as well, e.g. about 3 weeks ago here with Julia Version 1.10.0-DEV.320 Commit e81218ea2ff (2023-01-10 15:56 UTC).

@fingolfin
Member Author

I wonder if these GC pauses could be caused by us caching things too eagerly: we create lots and lots of rings and other objects that might get cached. That would fit with what Benjamin observed, namely the heap running full, which causes more and more GC runs.

Perhaps we can flush at least some of those caches between certain parts of the test suite?

Regarding the K3 surface automorphism groups sometimes taking so long: perhaps an artifact of randomization? Perhaps @simonbrandhorst has some thoughts on this?

@thofma
Collaborator

thofma commented Feb 6, 2023

There is something like Hecke.print_cache() and Hecke.clear_cache(). It looks for global variables that are either AbstractDict or AbstractVector and prints their lengths (we could extend it to also make it work for Oscar).

In my experience, even if all caching is disabled, GC time is strictly increasing for long-running jobs. I had something running for three days (all caching disabled), and now in this session every GC invocation takes > 0.5s and is triggered quite often:

julia> GC.gc()
GC: pause 68.48ms. collected 0.024240MB. full recollect
GC: pause 549.10ms. collected 0.000304MB. incr

(although it is doing effectively nothing).
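Combining the two suggestions above, a minimal sketch of flushing caches between parts of the test suite; it assumes Hecke.print_cache() and Hecke.clear_cache() behave as just described, and flush_caches is only an illustrative name:

using Hecke

# illustrative helper: report and empty the global caches, then force a full GC
function flush_caches()
    Hecke.print_cache()   # list the global AbstractDict/AbstractVector caches and their lengths
    Hecke.clear_cache()   # empty them (assumed behaviour, see above)
    GC.gc(true)           # full collection so the released objects are actually reclaimed
end

Calling something like flush_caches() between the test groups in runtests.jl would at least show whether these caches are what keeps the heap full.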

@simonbrandhorst
Collaborator

Regarding the K3 surface automorphism groups sometimes taking so long: perhaps an artifact of randomization? Perhaps @simonbrandhorst has some thoughts on this?

Yes, that is possible. There is a little bit of randomization in the code, though I am surprised that it could take 20 minutes.
Locally it took just 71-82s to run the test in each of the 10 times I tried it.

@simonbrandhorst
Collaborator

I did another retry of that build, this time it used:

Julia Version 1.10.0-DEV.456
Commit 4f92e79ffd6 (2023-01-30 11:52 UTC)

and the tests succeeded in 1.5 hours. Better but still quite a bit worse than before.

I did notice one particularly bad testcase in the failed log:

K3 surface automorphism groups |   28     28  19m15.4s

In the new one that I linked here and in the last good one linked by Max this took only about 2-3 minutes.

Is there a way to reproduce this on Linux? I wonder where this behavior comes from. So far this has never happened for me. There is indeed some randomization going on, but I am surprised that it has such a big effect.

@benlorenz
Member

benlorenz commented Feb 8, 2023

Is there a way to reproduce this on Linux? I wonder where this behavior comes from. So far this has never happened for me. There is indeed some randomization going on, but I am surprised that it has such a big effect.

I haven't been able to reproduce this either after letting it run in a loop for a while; the timings I got after about 1000 iterations are all between 117 and 139 seconds. (Note that I restarted Julia for each iteration.)
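A sketch of that kind of loop (the project path and the test file location are assumptions, not taken from the actual setup):

# time the test group in a fresh Julia process per iteration, so GC state and
# caches cannot accumulate across runs; paths are illustrative
for i in 1:1000
    run(`julia --project=. -e 'using Test, Oscar; @time include("test/AlgebraicGeometry/Surfaces/K3Auto.jl")'`)
end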

I believe the reason for that long duration was extreme memory starvation of that macOS runner; I found one further log with such a timing in my experiments on GitHub Actions: https://github.com/oscar-system/Oscar.jl/actions/runs/4091477929/jobs/7064161311#step:9:3447
This one does include the Julia GC logging, and between that output and the previous testcase there are about 700 GC pauses of roughly 1.6 seconds each. Summing all those numbers gives about 19.88 minutes (just for the GC).
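For anyone who wants to redo that sum, a small sketch that totals the GC pauses reported by GC.enable_logging in a saved CI log (the log file name is an assumption):

# sum the "GC: pause ...ms" lines produced by GC.enable_logging in a saved CI log
pauses = Float64[]
for line in eachline("macos-nightly-ci.log")   # hypothetical file name
    m = match(r"GC: pause (\d+(?:\.\d+)?)ms", line)
    m === nothing || push!(pauses, parse(Float64, m.captures[1]))
end
println(length(pauses), " pauses, ", round(sum(pauses) / 60_000; digits=2), " minutes in total")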

The reason why that testcase is particularly affected is that it requires somewhat more memory than other test groups:

For the most recent run on my test branch I added @time for each include statement, and we can see that this group (K3Auto.jl) needed 32.958GiB of memory allocations (one of the largest such values in our testsuite). There are still quite a few GC pauses above in the log, but far fewer, and they are also a lot shorter. Still, the printout below of Free mem: 45.578 MiB is quite low.
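The instrumentation itself is tiny; a sketch, assuming the test groups are pulled in via include (the file names below are illustrative):

# wrap each test-group include in @time so the per-group runtime and allocation
# volume (e.g. "32.958 GiB" for K3Auto.jl) show up in the CI log; names are illustrative
for f in ("AlgebraicGeometry/Surfaces/K3Auto.jl", "Rings/AbelianClosure.jl")
    println("== ", f)
    @time include(f)
end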

I think there is not really a specific commit for Julia nightly that triggers all this, but:

  1. the macOS runners are rather short on memory and/or busy with other things, and
  2. Julia nightly might need somewhat more memory than 1.9 or older (I haven't really checked this, but that might be something to investigate).

I have started a new CI job to compare nightly with 1.8 and 1.9 now.

@simonbrandhorst
Collaborator

I see. Thank you @benlorenz for your explanations. Indeed the algorithm/implementation is allocation-heavy.

@benlorenz
Member

After torturing the GitHub Actions runners some more, it looks like Julia 1.9 (on macOS) is similarly affected, for example here: https://github.com/oscar-system/Oscar.jl/actions/runs/4125228965/jobs/7131303651#step:8:6741. There are a few more attempts to look at; sometimes nightly fails, sometimes 1.9, and sometimes they succeed, but even then there appears to be a lot of memory pressure for still unknown reasons.

@benlorenz benlorenz changed the title CI failure with Julia nightly on macOS due to timeout frequent CI failures with Julia nightly and 1.9 on macOS due to timeout Feb 9, 2023
@benlorenz
Member

This should be fixed for nightly now (or at least be significantly better) after the merge of JuliaLang/julia#48614. I re-ran a bunch of GitHub jobs and all of them succeeded for nightly. The last two merges to master also ran fine on macOS nightly.

It is still an issue for 1.9, but the backport label is already in place, so this should be fixed there soon as well.

@thofma
Collaborator

thofma commented Feb 13, 2023

Thanks for keeping track of this
