Problems with Linux (Arch) #25

cvalencia09 · 2024-03-05T17:02:14Z

The Julia package seems to have a problem when running on a Linux machine. I get a StackOverflowError after running the newuoa algorithm; this problem doesn't occur on my Windows partition. On Windows I have the Intel Fortran compiler; on Linux, just gcc. Which libraries do you require to run the package on Linux?

Regards,

amontoison · 2024-03-06T05:26:13Z

When you use PRIMA.jl, a precompiled version of PRIMA is provided with the artifact PRIMA_jll.jl so you don't need any compiler to use this Julia interface.
Can you run the Julia tests with the following commands and provide the error(s) that you encounter?

julia> ]
pkg> test PRIMA

soldasim · 2024-04-03T21:09:51Z

Hello, I have also encountered the StackOverflowError on various Linux devices.

I have tested the issue on multiple devices. This is the information I could gather:

It appears the error only occurs on Linux. (At least I have not encountered it on Windows yet.)
On some Linux devices, the error only occurs when running PRIMA in parallel. On some devices it occurs even when running PRIMA in serial.
On the devices where the error occurs only when running in parallel, it does not matter whether I parallelize using tasks (Threads.@spawn), a parallel for loop (Threads.@threads for), or run PRIMA in a distributed manner (using @distributed). All of these options behave the same.
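
For readers unfamiliar with the three parallelization styles mentioned above, here is a minimal sketch of what each looks like. The function `work` is a hypothetical stand-in for the PRIMA call (it is not part of PRIMA.jl), so the snippet runs without the package installed.

```julia
using Distributed

# `work` is a hypothetical stand-in for the optimizer call in the MWE.
# @everywhere defines it on all processes (only the current one if no workers were added).
@everywhere work(i) = i^2

# 1. Tasks: one spawned task per item, results collected with fetch
r_tasks = fetch.([Threads.@spawn work(i) for i in 1:4])

# 2. Threaded loop: iterations are distributed over the available threads
r_threads = zeros(Int, 4)
Threads.@threads for i in 1:4
    r_threads[i] = work(i)
end

# 3. Distributed loop with a reducer: falls back to the current process
#    when no worker processes have been added
r_dist = @distributed (vcat) for i in 1:4
    work(i)
end
```

All three variants compute the same four values; they differ only in how the work is scheduled.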

Unfortunately, the error message does not provide any information (not even a stacktrace), so this may be difficult to debug. I will try to provide as much information as I can.

MWE

Consider the following MWE:

using PRIMA

function prima_serial(; tasks=1)
    obj = (x) -> abs(5. - x[1])
    start = [0.]

    results = [newuoa(obj, start)[1] for _ in 1:tasks]
end

function prima_parallel(; tasks=1)
    obj = (x) -> abs(5. - x[1])
    start = [0.]

    tasks = [Threads.@spawn newuoa(obj, start)[1] for _ in 1:tasks]
    results = fetch.(tasks)
end

Note that I am running only a single task when parallelizing. So there actually are not multiple PRIMA instances running in parallel. But somehow it causes errors on some devices anyway.

Test Results

The following table summarizes the results of running the two functions prima_serial and prima_parallel from the MWE on various devices that I have access to:

| Device | `prima_serial` | `prima_parallel` |
| --- | --- | --- |
| PC-1 (Win) | ✅ | ✅ |
| PC-2 (Win) | ✅ | ✅ |
| PC-2 (Linux) | ✅ | `StackOverflowError` |
| PC-2 (Linux) [VSCode REPL] | `StackOverflowError` | `StackOverflowError` |
| Cluster-1 (Linux) | ✅ | `StackOverflowError` |
| Cluster-2 (Linux) | ✅ | `StackOverflowError` |
| PC-3 (MacOS) | ✅ | ✅ |

(Note that PC-2 (Win) and PC-2 (Linux) are the exact same computer, set up for dual boot.)

Correct output:

julia> prima_serial()
1-element Vector{Vector{Float64}}:
 [5.0000000000000115]

julia>

Error message for prima_serial:

julia> prima_serial()
ERROR: StackOverflowError:

julia>

Error message for prima_parallel:

julia> prima_parallel()
ERROR: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:352 [inlined]
  [2] fetch
    @ ./task.jl:372 [inlined]
  [3] _broadcast_getindex_evalf
    @ ./broadcast.jl:709 [inlined]
  [4] _broadcast_getindex
    @ ./broadcast.jl:682 [inlined]
  [5] getindex
    @ ./broadcast.jl:636 [inlined]
  [6] copy
    @ ./broadcast.jl:942 [inlined]
  [7] materialize
    @ ./broadcast.jl:903 [inlined]
  [8] prima_parallel(; tasks::Int64)
    @ Main ~/julia-sandbox/prima_parallel/test.jl:15
  [9] prima_parallel()
    @ Main ~/julia-sandbox/prima_parallel/test.jl:10
 [10] top-level scope
    @ REPL[3]:1

    nested task error: StackOverflowError:

julia>
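
The REPL output above only shows the outer TaskFailedException. As a sketch of how the nested error could be pulled out programmatically, the snippet below catches the TaskFailedException and walks the failed task's exception stack with `current_exceptions`. The spawned task here simply throws a `StackOverflowError` as a stand-in for the failing PRIMA call, so this is an illustration of the unwrapping technique only; it may well reveal nothing beyond what the REPL already printed.

```julia
# Stand-in for the failing task; in the MWE this would be
# `Threads.@spawn newuoa(obj, start)[1]`.
t = Threads.@spawn throw(StackOverflowError())

nested = try
    fetch(t)
    nothing                      # task succeeded, nothing to unwrap
catch err
    err isa TaskFailedException || rethrow()
    # Exception stack of the failed task: entries have
    # `.exception` and `.backtrace` fields.
    stack = current_exceptions(err.task)
    for (exc, bt) in stack
        showerror(stderr, exc, bt)
        println(stderr)
    end
    last(stack).exception        # the innermost error that killed the task
end
```

After this runs, `nested` holds the exception object itself, which can be inspected or compared by type.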

Device Specifications

PC-1 and PC-2 are my personal computers. PC-2 dual-boots Windows and Linux. Cluster-1 and Cluster-2 are academic clusters that I have access to. The information below contains specs of both the "login" and "work" nodes of the clusters. I've tested the MWE on both the login and work nodes and the behavior does not differ between them.

PC-1

OS: Microsoft Windows 10 Pro
Version: 10.0.19045 Build 19045
System Type: x64-based PC
Processor: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz, 3401 Mhz, 4 Core(s), 4 Logical Processor(s)

Julia version: 1.10.2

PC-2 (Windows)

OS: Microsoft Windows 10 Home
Version: 10.0.19045 Build 19045
System Type: x64-based PC
Processor: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 2001 Mhz, 4 Core(s), 8 Logical Processor(s)

Julia version: 1.10.2

PC-2 (Linux)

OS: Ubuntu 20.04.6 LTS
Kernel: Linux 5.4.0-171-generic
Architecture: x86-64
Processor: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

Julia version: 1.10.2

Cluster-1

Login Node:
OS: CentOS Linux 7 (Core)
Kernel: Linux 3.10.0-1127.13.1.el7.x86_64
Architecture: x86-64
Processor: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz

Work Node:
Kernel: Linux 4.18.0-425.13.1.el8_7.x86_64
Architecture: x86-64
Processor: Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz

Julia version: 1.10.0

Cluster-2

Login Node:
OS: Ubuntu 20.04.6 LTS
Kernel: Linux 5.15.0-94-generic
Architecture: x86-64
Processor: Common KVM processor

Work Node:
OS: Ubuntu 22.04.3 LTS
Kernel: Linux 5.15.0-91-generic
Architecture: x86-64
Processor: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz

Julia version: 1.10.2

PC-3 (MacOS)

OS: macOS (arm64-apple-darwin22.4.0)
CPU: 12 × Apple M3 Pro

Julia version: 1.11.1

Let me know if I can help with any additional information or testing. :)

EDIT-1: Added PC-2 (Linux) VSCode and "non-VSCode" versions to the test result table.

EDIT-2: Added PC-3 (MacOS)

soldasim · 2024-04-03T21:33:25Z

I have run ] test PRIMA on all of the devices mentioned above and all tests succeed on all devices.

Test Summary: | Pass  Total   Time
PRIMA.jl      |   81     81  12.9s
     Testing PRIMA tests passed

emmt · 2024-04-04T06:03:18Z

Thank you for all these details. I have tested your examples on my Linux laptop (Ubuntu 23.10 with 6.0.0 kernel) with the following results:

julia> prima_serial()
1-element Vector{Vector{Float64}}:
 [5.0000000000000115]

julia> prima_parallel()
ERROR: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:352 [inlined]
  [2] fetch
    @ ./task.jl:372 [inlined]
  [3] _broadcast_getindex_evalf
    @ ./broadcast.jl:709 [inlined]
  [4] _broadcast_getindex
    @ ./broadcast.jl:682 [inlined]
  [5] getindex
    @ ./broadcast.jl:636 [inlined]
  [6] copy
    @ ./broadcast.jl:942 [inlined]
  [7] materialize
    @ ./broadcast.jl:903 [inlined]
  [8] prima_parallel(; tasks::Int64)
    @ Main ./REPL[3]:6
  [9] prima_parallel()
    @ Main ./REPL[3]:1
 [10] top-level scope
    @ REPL[7]:1

    nested task error: StackOverflowError:

So the serial version worked but the parallel one did not. The serial version also worked with tasks=2 or more, while the parallel one always failed with the same stack overflow error.

Are you sure that the serial version failed on your PC-2 (Linux)?

For the parallel version, I can see some questions that need to be answered:

Are the functions in libprima (the Fortran90 and the C versions) thread safe or not?
The StackOverflowError seems to indicate a problem on the side of the Julia interface. So the same question arises for the Julia version. In principle, this interface allows for having an objective function that itself calls one of the PRIMA optimizers (hierarchical optimization). But it may not have been fully tested.

amontoison · 2024-04-04T06:19:36Z

In Julia, @ccall itself is thread-safe, if that helps to isolate the issue.

emmt · 2024-04-04T06:32:40Z

Yes, @ccall is used. The problem I see is that the Julia interface uses a per-thread stack of contexts (stored in the global variable _objfun_stack) in order to be thread-safe and to allow hierarchical optimization at the same time, and this has not been thoroughly tested. With the new C API of libprima (see #28), this management would no longer be necessary and the problem may be solved (provided the C and Fortran code are thread-safe).
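
To make the idea of a "per-thread stack of contexts" concrete, here is a generic sketch of that pattern. It is not PRIMA.jl's actual code: `CONTEXT_STACKS`, `with_context`, and `current_context` are hypothetical names chosen for illustration, and the real `_objfun_stack` may be organized differently. Each thread gets its own stack, a context is pushed before a call and popped afterwards, and nesting (as in hierarchical optimization) works because the inner call pushes on top of the outer one.

```julia
# Hypothetical illustration only; names are not taken from PRIMA.jl.
# One stack per possible thread id (Threads.maxthreadid requires Julia 1.9+).
const CONTEXT_STACKS = [Any[] for _ in 1:Threads.maxthreadid()]

# Run `f` with `ctx` pushed on the calling thread's stack; always pop afterwards.
function with_context(f, ctx)
    stack = CONTEXT_STACKS[Threads.threadid()]
    push!(stack, ctx)
    try
        return f()
    finally
        pop!(stack)
    end
end

# Context of the innermost active call on the current thread.
current_context() = last(CONTEXT_STACKS[Threads.threadid()])

# Nested use: the inner context shadows the outer one, which is restored afterwards.
result = with_context(:outer) do
    inner = with_context(:inner) do
        current_context()
    end
    (inner, current_context())
end
# result == (:inner, :outer)
```

One general caveat with any state indexed by `Threads.threadid()`: tasks started with `Threads.@spawn` are allowed to migrate between threads, so a lookup made after a yield is not guaranteed to hit the same stack. Whether that plays any role in this issue is not established here.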

amontoison · 2024-04-04T06:54:49Z

Ok, I see 👍
It increases the priority of doing an unofficial build of PRIMA_jll.jl v0.8.0 asap.

soldasim · 2024-04-04T08:11:37Z

> Are you sure that the serial version failed on your PC-2 (Linux)?

I have tested it again to be sure. The serial version really fails on my Linux PC but only in Julia started by VSCode.

When I run the two functions from Julia REPL started by the VSCode's Julia extension (Ctrl+Shift+P -> Julia: Start REPL), both the serial and the parallel version throw StackOverflowError.

When I start the Julia REPL from bash myself, only the parallel version fails and the serial version works fine, as on the other Linux devices.

I don't know what to make of this, but at least it is consistent when tried multiple times.

Version info

The only difference in versioninfo() is that the REPL started by VSCode has an additional line JULIA_EDITOR = code in the Environment.

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  LD_LIBRARY_PATH = :/opt/gurobi10.0.0_linux64/gurobi1000/linux64/lib
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8

julia>

emmt · 2024-04-04T08:35:12Z

Ok that's really puzzling...

We have started to figure out a way to deal with thread-safety (and hierarchical optimization) differently than is currently done in PRIMA.jl. I hope this will solve the issue. Your MWE should definitely be part of the test suite of PRIMA.jl.
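
As a rough sketch of what such a regression test might look like, the helper below runs the MWE's objective through an optimizer inside a spawned task and checks the result. The helper name `check_in_spawned_task` is hypothetical and not part of PRIMA.jl; it assumes only the calling convention visible in the MWE, namely that `optimizer(obj, start)` returns a tuple whose first element is the solution vector. The tolerance is an arbitrary choice.

```julia
using Test

# Hypothetical helper (not part of PRIMA.jl). `optimizer` must follow the calling
# convention used in the MWE: optimizer(obj, start)[1] is the solution vector.
function check_in_spawned_task(optimizer)
    obj = x -> abs(5.0 - x[1])
    start = [0.0]
    t = Threads.@spawn optimizer(obj, start)[1]
    x = fetch(t)                 # throws TaskFailedException if the task failed
    @test x[1] ≈ 5.0 atol=1e-6
    return x
end

# With PRIMA.jl installed, this would be invoked as:
#   using PRIMA
#   @testset "newuoa in a spawned task" begin
#       check_in_spawned_task(newuoa)
#   end
```

Taking the optimizer as an argument keeps the helper reusable for the other solvers, provided they share the same return convention.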

renatomatz · 2024-07-16T15:53:11Z

Hi, I have the same issue on my Linux machine, with prima_serial working just fine but not prima_parallel. Other than that, all tests from test PRIMA seem to pass.

Has there been any progress on this bug?

OS: Ubuntu 22.04.4 LTS
Kernel: Linux 6.5.0-41-generic
Architecture: x86-64
Processor: Intel i7-7700HQ (8) @ 3.800GHz
Julia version: 1.10.4

soldasim · 2024-11-05T12:38:17Z

Hi, got my hands on a Mac, so I've tested the parallelization on it as well.

prima_parallel runs fine on MacOS (at least in my case). I've added the results to my original comment.

So the issue seems to be purely Linux-related.

soldasim · 2024-11-05T15:33:51Z

Interestingly, some tests do not pass on my Mac.

I believe this to be unrelated to this parallelization issue, so I've created a new issue.
