
frequent CI failures with Julia nightly and 1.9 on macOS due to timeout #1888

Closed
fingolfin opened this issue Jan 30, 2023 · 12 comments

@fingolfin
Member

Looking at e.g. https://github.com/oscar-system/Oscar.jl/actions/runs/4026539430, the macOS nightly job exceeds our self-imposed 2 hour timeout limit.

In contrast, Julia nightly on Linux runs the test suite in ~1 hour, while Julia 1.6 on macOS requires roughly 1 hour 25 minutes (about the same as Julia 1.6 on Linux in that CI run).

The failure started with our commit c6ece85, but given the nature of the failure I don't think that commit is at fault; it seems more likely that a change on the Julia side is responsible.

The last "good" run used

Julia Version 1.10.0-DEV.418
Commit 7630606f16a (2023-01-22 05:37 UTC)

The first "bad" run used

Julia Version 1.10.0-DEV.430
Commit 9b1ffbbe1bc (2023-01-25 12:52 UTC)

Changes on the Julia side: JuliaLang/julia@7630606...9b1ffbb

Of course there could also be changes in one of our dependencies (one could try to extract this from the CI logs above). It might also be interesting to find out which tests got slower... A quick check shows that e.g. the AbelianClousre (sic!) went from 52s to 81.3s and MPolyAnyMap/MPolyRing went from 4m04.7s to 6m47.2s. But why, and why only on macOS + Julia nightly?

@benlorenz
Member

I thought I could re-run the last good job on GitHub Actions to see whether the newer Julia version would make it time out as well, but the re-run installed an even older Julia version:

Julia Version 1.10.0-DEV.387
Commit e1fc4824b92 (2023-01-18 03:54 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.5.0)
  CPU: 3 × Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, ivybridge)
  Threads: 1 on 3 virtual cores

@benlorenz
Member

benlorenz commented Jan 30, 2023

I did another retry of that build, this time it used:

Julia Version 1.10.0-DEV.456
Commit 4f92e79ffd6 (2023-01-30 11:52 UTC)

and the tests succeeded in 1.5 hours. Better but still quite a bit worse than before.

I did notice one particularly bad testcase in the failed log:

K3 surface automorphism groups |   28     28  19m15.4s

In the new one that I linked here and in the last good one linked by Max this took only about 2-3 minutes.

@benlorenz
Member

benlorenz commented Feb 1, 2023

I think the CI runners are at least partly to blame; I have been running a few tests manually here. I added some system stats at the beginning and enabled GC logging (a sketch of that instrumentation follows the list below).

  • For example, in this run the load average before starting the tests was about 8 (the runners have 3 CPUs available), with CPU usage: 43.68% user, 56.6% sys, 0.24% idle, and mds + mds_stores working hard to index files. Unfortunately that run failed early due to the msolve issue.
  • Or here we were at 500MB free of 14GB total memory. The log then shows quite a lot of GC pressure, especially at the end here. That run reached line 5000 before being aborted; a normal test run (without GC logging) needs about 1300 lines and the Ubuntu tests needed 2260 lines, so roughly 1000 GC printouts on Ubuntu versus about four times as many GC pauses on macOS.
  • The latest run does look somewhat promising: it had about 5.5GB free and the load average was 4.6 at the beginning, but at least the mds processes, which seem to be rather resource-intensive, were not as busy. However, rather close to the end there are a lot of long pauses with very little collected memory.
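For reference, a minimal sketch of that kind of instrumentation (not the exact CI snippet; these are plain Julia Base calls, and GC.enable_logging needs Julia 1.8 or newer):

# print some system stats before the tests start (sketch, not the actual CI code)
println("CPU threads:  ", Sys.CPU_THREADS)
println("load average: ", Sys.loadavg())
println("free memory:  ", Base.format_bytes(Sys.free_memory()),
        " of ", Base.format_bytes(Sys.total_memory()))

# make the GC print one line per collection (pause time and collected memory)
GC.enable_logging(true)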

Apart from these issues with the runners, it does look somewhat similar to the effects seen in JuliaLang/julia#48473, but the runners are not using M1, and the free memory values do agree between versioninfo and the output of top. The other PR mentioned there, JuliaLang/julia#47292, is included in the list of commits where we started seeing this.

Edit: This did happen earlier as well, e.g. about 3 weeks ago here with Julia Version 1.10.0-DEV.320 Commit e81218ea2ff (2023-01-10 15:56 UTC).

@fingolfin
Member Author

I wonder if these GC pauses could be caused by us caching things too eagerly: we create lots and lots of rings and other objects that might get cached. That would fit with what Benjamin observed, namely the heap running full, which causes more and more GC runs.

Perhaps we can flush at least some of those caches between certain parts of the test suite?

Regarding the K3 surface automorphism groups sometimes taking so long: perhaps an artifact of randomization? Perhaps @simonbrandhorst has some thoughts on this?

@thofma
Collaborator

thofma commented Feb 6, 2023

There is something like Hecke.print_cache() and Hecke.clear_cache(). It looks for global variables that are either AbstractDict or AbstractVector and prints their lengths (we could extend it to also make it work for Oscar).

In my experience, even if all caching is disabled, GC time is strictly increasing for long-running jobs. I had something running for three days (all caching disabled), and now in this session every GC invocation takes > 0.5s and is triggered quite often:

julia> GC.gc()
GC: pause 68.48ms. collected 0.024240MB. full recollect
GC: pause 549.10ms. collected 0.000304MB. incr

(although it is doing effectively nothing).
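Combining the two suggestions above, a minimal sketch of flushing caches between parts of the test suite; it assumes Hecke.print_cache() and Hecke.clear_cache() behave as just described, and flush_caches is only an illustrative name:

using Hecke

# illustrative helper: report and empty the global caches, then force a full GC
function flush_caches()
    Hecke.print_cache()   # list the global AbstractDict/AbstractVector caches and their lengths
    Hecke.clear_cache()   # empty them (assumed behaviour, see above)
    GC.gc(true)           # full collection so the released objects are actually reclaimed
end

Calling something like flush_caches() between the test groups in runtests.jl would at least show whether these caches are what keeps the heap full.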

@simonbrandhorst
Collaborator

Regarding the K3 surface automorphism groups sometimes taking so long: perhaps an artifact of randomization? Perhaps @simonbrandhorst has some thoughts on this?

Yes, that is possible. There is a little bit of randomization in the code, though I am surprised that it could take 20 minutes.
Locally it took just 71-82s to run the test in each of the 10 times I tried it.

@simonbrandhorst
Collaborator

I did another retry of that build, this time it used:

Julia Version 1.10.0-DEV.456
Commit 4f92e79ffd6 (2023-01-30 11:52 UTC)

and the tests succeeded in 1.5 hours. Better but still quite a bit worse than before.

I did notice one particularly bad testcase in the failed log:

K3 surface automorphism groups |   28     28  19m15.4s

In the new one that I linked here and in the last good one linked by Max this took only about 2-3 minutes.

Is there a way to reproduce this on Linux? I wonder where this behavior comes from. So far this has never happened for me. There is indeed some randomization going on, but I am surprised that it has such a big effect.

@benlorenz
Member

benlorenz commented Feb 8, 2023

Is there a way to reproduce this on Linux? I wonder where this behavior comes from. So far this has never happened for me. There is indeed some randomization going on, but I am surprised that it has such a big effect.

I haven't been able to reproduce this either after letting it run in a loop for a while; the timings I got after about 1000 iterations are all between 117 and 139 seconds. (Note that I restarted Julia for each iteration.)
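A sketch of that kind of loop (the project path and the test file location are assumptions, not taken from the actual setup):

# time the test group in a fresh Julia process per iteration, so GC state and
# caches cannot accumulate across runs; paths are illustrative
for i in 1:1000
    run(`julia --project=. -e 'using Test, Oscar; @time include("test/AlgebraicGeometry/Surfaces/K3Auto.jl")'`)
end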

I believe the reason for that long duration was extreme memory starvation of that macOS runner; I found one further log with such a timing in my experiments on GitHub Actions: https://github.com/oscar-system/Oscar.jl/actions/runs/4091477929/jobs/7064161311#step:9:3447
This one does include the Julia GC logging, and between that output and the previous testcase there are about 700 GC pauses of roughly 1.6 seconds each. Summing all those numbers gives about 19.88 minutes (just for the GC).
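For anyone who wants to redo that sum, a small sketch that totals the GC pauses reported by GC.enable_logging in a saved CI log (the log file name is an assumption):

# sum the "GC: pause ...ms" lines produced by GC.enable_logging in a saved CI log
pauses = Float64[]
for line in eachline("macos-nightly-ci.log")   # hypothetical file name
    m = match(r"GC: pause (\d+(?:\.\d+)?)ms", line)
    m === nothing || push!(pauses, parse(Float64, m.captures[1]))
end
println(length(pauses), " pauses, ", round(sum(pauses) / 60_000; digits=2), " minutes in total")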

The reason why that testcase is particularly affected is that it requires somewhat more memory than other test groups:

For the most recent run on my test branch I added @time for each include statement, and we can see that this group (K3Auto.jl) needed 32.958GiB of memory allocations (one of the largest such values in our testsuite). There are still quite a few GC pauses above in the log, but far fewer, and they are also a lot shorter. Still, the printout below of Free mem: 45.578 MiB is quite low.
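The instrumentation itself is tiny; a sketch, assuming the test groups are pulled in via include (the file names below are illustrative):

# wrap each test-group include in @time so the per-group runtime and allocation
# volume (e.g. "32.958 GiB" for K3Auto.jl) show up in the CI log; names are illustrative
for f in ("AlgebraicGeometry/Surfaces/K3Auto.jl", "Rings/AbelianClosure.jl")
    println("== ", f)
    @time include(f)
end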

I think there is not really a specific commit for Julia nightly that triggers all this, but:

  1. the macOS runners are rather short on memory and/or busy with other things, and
  2. Julia nightly might need somewhat more memory than 1.9 or older (I haven't really checked this, but that might be something to investigate).

I have started a new CI job to compare nightly with 1.8 and 1.9 now.

@simonbrandhorst
Collaborator

I see. Thank you @benlorenz for your explanations. Indeed the algorithm/implementation is allocation-heavy.

@benlorenz
Member

After torturing the GitHub Actions runners some more, it looks like Julia 1.9 (on macOS) is similarly affected, for example here: https://github.com/oscar-system/Oscar.jl/actions/runs/4125228965/jobs/7131303651#step:8:6741. There are a few more attempts to look at; sometimes nightly fails, sometimes 1.9, and sometimes they succeed, but even then there appears to be a lot of memory pressure for still unknown reasons.

@benlorenz benlorenz changed the title CI failure with Julia nightly on macOS due to timeout frequent CI failures with Julia nightly and 1.9 on macOS due to timeout Feb 9, 2023
@benlorenz
Member

This should be fixed for nightly now (or at least be significantly better) after the merge of JuliaLang/julia#48614. I re-ran a bunch of GitHub jobs and all of them succeeded for nightly. The last two merges to master also ran fine on macOS nightly.

It is still an issue for 1.9, but the backport label is already in place, so this should be fixed there soon as well.

@thofma
Collaborator

thofma commented Feb 13, 2023

Thanks for keeping track of this
