Problems with Linux (Arch) #25

Open
cvalencia09 opened this issue Mar 5, 2024 · 12 comments

@cvalencia09

It seems that the Julia package has a problem when running on a Linux machine. I get a StackOverflowError after running the newuoa algorithm; this problem doesn't occur on my Windows partition. On Windows I have the Intel Fortran compiler, on Linux just the gcc compiler. Which libraries do you require to run the package on Linux?

Regards,

@amontoison
Member

When you use PRIMA.jl, a precompiled version of PRIMA is provided with the artifact PRIMA_jll.jl so you don't need any compiler to use this Julia interface.
Can you run the Julia tests with the following commands and provide the error(s) that you encounter?

julia> ]
pkg> test PRIMA

@soldasim

soldasim commented Apr 3, 2024

Hello, I have also encountered the StackOverflowError on various Linux devices.

I have tested the issue on multiple devices. This is the information I could gather:

  • It appears the error only occurs on Linux. (At least I have not encountered it on Windows yet.)
  • On some Linux devices, the error only occurs when running PRIMA in parallel. On some devices it occurs even when running PRIMA in serial.
  • On the devices where the error occurs only when running in parallel, it does not matter if I parallelize using tasks (Threads.@spawn), parallel for loop (Threads.@threads for) or even if I run PRIMA in a distributed manner (using @distributed). All of these options behave the same.
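
The three parallelization styles listed above can be sketched as follows. This is only an illustration using the standard library: `work` is a hypothetical stand-in for a call such as `newuoa(obj, start)[1]`, so the snippet runs without PRIMA and does not reproduce the error by itself.

```julia
using Distributed  # standard library; provides @distributed

# Hypothetical stand-in for an optimizer call such as newuoa(obj, start)[1].
work(i) = i^2

# 1. Tasks: one Threads.@spawn per job, results collected with fetch.
spawned = fetch.([Threads.@spawn work(i) for i in 1:4])

# 2. Parallel for loop: each iteration writes to its own slot.
threaded = zeros(Int, 4)
Threads.@threads for i in 1:4
    threaded[i] = work(i)
end

# 3. Distributed loop with a reducer; runs on the local process
#    if no worker processes were added with addprocs.
distributed = @distributed (vcat) for i in 1:4
    [work(i)]
end
```

Replacing `work` with the real optimizer call in each of the three variants would give one compact script for checking which styles fail on a given machine.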

Unfortunately, the error message does not provide any information (not even a stacktrace), so this may be difficult to debug. I will try to provide as much information as I can.

MWE

Consider the following MWE:

using PRIMA

function prima_serial(; tasks=1)
    obj = (x) -> abs(5. - x[1])
    start = [0.]

    results = [newuoa(obj, start)[1] for _ in 1:tasks]
end

function prima_parallel(; tasks=1)
    obj = (x) -> abs(5. - x[1])
    start = [0.]

    tasks = [Threads.@spawn newuoa(obj, start)[1] for _ in 1:tasks]
    results = fetch.(tasks)
end

Note that I am running only a single task when parallelizing. So there actually are not multiple PRIMA instances running in parallel. But somehow it causes errors on some devices anyway.

Test Results

The following table summarizes the results of running the two functions prima_serial and prima_parallel from the MWE on various devices that I have access to:

Device                        | prima_serial       | prima_parallel
PC-1 (Win)                    | OK                 | OK
PC-2 (Win)                    | OK                 | OK
PC-2 (Linux)                  | OK                 | StackOverflowError
PC-2 (Linux) [VSCode REPL]    | StackOverflowError | StackOverflowError
Cluster-1 (Linux)             | OK                 | StackOverflowError
Cluster-2 (Linux)             | OK                 | StackOverflowError
PC-3 (MacOS)                  | OK                 | OK

(Note that PC-2 (Win) and PC-2 (Linux) are the exact same computer with dual boot.)

Correct output:

julia> prima_serial()
1-element Vector{Vector{Float64}}:
 [5.0000000000000115]

julia> 

Error message for prima_serial:

julia> prima_serial()
ERROR: StackOverflowError:

julia> 

Error message for prima_parallel:

julia> prima_parallel()
ERROR: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:352 [inlined]
  [2] fetch
    @ ./task.jl:372 [inlined]
  [3] _broadcast_getindex_evalf
    @ ./broadcast.jl:709 [inlined]
  [4] _broadcast_getindex
    @ ./broadcast.jl:682 [inlined]
  [5] getindex
    @ ./broadcast.jl:636 [inlined]
  [6] copy
    @ ./broadcast.jl:942 [inlined]
  [7] materialize
    @ ./broadcast.jl:903 [inlined]
  [8] prima_parallel(; tasks::Int64)
    @ Main ~/julia-sandbox/prima_parallel/test.jl:15
  [9] prima_parallel()
    @ Main ~/julia-sandbox/prima_parallel/test.jl:10
 [10] top-level scope
    @ REPL[3]:1

    nested task error: StackOverflowError:

julia> 

Device Specifications

PC-1 and PC-2 are my personal computers. PC-2 has both Windows and Linux on dualboot. Cluster-1 and Cluster-2 are academic clusters that I have access to. The information below contains specs of both the "login" and "work" nodes from the clusters. I've tested the MWE on both the login and work nodes and the behavior does not differ between them.

PC-1

OS: Microsoft Windows 10 Pro
Version: 10.0.19045 Build 19045
System Type: x64-based PC
Processor: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz, 3401 Mhz, 4 Core(s), 4 Logical Processor(s)

Julia version: 1.10.2

PC-2 (Windows)

OS: Microsoft Windows 10 Home
Version: 10.0.19045 Build 19045
System Type: x64-based PC
Processor: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 2001 Mhz, 4 Core(s), 8 Logical Processor(s)

Julia version: 1.10.2

PC-2 (Linux)

OS: Ubuntu 20.04.6 LTS
Kernel: Linux 5.4.0-171-generic
Architecture: x86-64
Processor: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

Julia version: 1.10.2

Cluster-1

Login Node:
OS: CentOS Linux 7 (Core)
Kernel: Linux 3.10.0-1127.13.1.el7.x86_64
Architecture: x86-64
Processor: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz

Work Node:
Kernel: Linux 4.18.0-425.13.1.el8_7.x86_64
Architecture: x86-64
Processor: Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz

Julia version: 1.10.0

Cluster-2

Login Node:
OS: Ubuntu 20.04.6 LTS
Kernel: Linux 5.15.0-94-generic
Architecture: x86-64
Processor: Common KVM processor

Work Node:
OS: Ubuntu 22.04.3 LTS
Kernel: Linux 5.15.0-91-generic
Architecture: x86-64
Processor: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz

Julia version: 1.10.2

PC-3 (MacOS)

OS: macOS (arm64-apple-darwin22.4.0)
CPU: 12 × Apple M3 Pro

Julia version: 1.11.1


Let me know if I can help with any additional information or testing. :)

EDIT-1: Added PC-2 (Linux) VSCode and "non-VSCode" versions to the test result table.

EDIT-2: Added PC-3 (MacOS)

@soldasim

soldasim commented Apr 3, 2024

I have run ] test PRIMA on all of the devices mentioned above and all tests succeed on all devices.

Test Summary: | Pass  Total   Time
PRIMA.jl      |   81     81  12.9s
     Testing PRIMA tests passed 

@emmt
Collaborator

emmt commented Apr 4, 2024

Thank you for all these details. I have tested your examples on my Linux laptop (Ubuntu 23.10 with 6.0.0 kernel) with the following results:

julia> prima_serial()
1-element Vector{Vector{Float64}}:
 [5.0000000000000115]

julia> prima_parallel()
ERROR: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:352 [inlined]
  [2] fetch
    @ ./task.jl:372 [inlined]
  [3] _broadcast_getindex_evalf
    @ ./broadcast.jl:709 [inlined]
  [4] _broadcast_getindex
    @ ./broadcast.jl:682 [inlined]
  [5] getindex
    @ ./broadcast.jl:636 [inlined]
  [6] copy
    @ ./broadcast.jl:942 [inlined]
  [7] materialize
    @ ./broadcast.jl:903 [inlined]
  [8] prima_parallel(; tasks::Int64)
    @ Main ./REPL[3]:6
  [9] prima_parallel()
    @ Main ./REPL[3]:1
 [10] top-level scope
    @ REPL[7]:1

    nested task error: StackOverflowError:

So the serial version worked, but the parallel one did not. Note that the serial version also worked for tasks=2 or more, whereas the parallel one always failed with the same StackOverflowError.

Are you sure that the serial version failed on your PC-2 (Linux)?

For the parallel version, I can see some questions that need to be answered:

  1. Are the functions in libprima (the Fortran90 and the C versions) thread safe or not?
  2. The StackOverflowError seems to indicate a problem on the side of the Julia interface. So the same question arises for the Julia version. In principle, this interface allows for having an objective function that itself calls one of the PRIMA optimizers (hierarchical optimization). But it may not have been fully tested.

@amontoison
Member

In Julia, the use of @ccall is thread-safe, if that helps to isolate the issue.

@emmt
Collaborator

emmt commented Apr 4, 2024

Yes, @ccall is used, but the problem I can see is that the Julia interface uses a per-thread stack of contexts (stored in the global variable _objfun_stack) in order to be thread-safe and, at the same time, to allow for hierarchical optimization, and this has not been thoroughly tested. With the new C API of libprima (see #28), this management would no longer be necessary and the problem may be solved (provided the C and Fortran code are thread-safe).
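
To make the idea of a per-thread stack of contexts concrete, here is a minimal sketch. It is not the PRIMA.jl implementation: the names `CONTEXT_STACKS`, `with_context` and `current_context` are hypothetical, and the actual `_objfun_stack` may be organized differently.

```julia
# Hypothetical sketch, not taken from PRIMA.jl.
# One stack of objective functions per thread id
# (Threads.maxthreadid requires Julia 1.9 or later).
const CONTEXT_STACKS = [Function[] for _ in 1:Threads.maxthreadid()]

# Run `body` with `objfun` pushed on the calling thread's stack.
function with_context(body, objfun)
    stack = CONTEXT_STACKS[Threads.threadid()]
    push!(stack, objfun)
    try
        return body()
    finally
        pop!(stack)  # always restore the stack, even if `body` throws
    end
end

# What a callback would use to find the objective function currently in use.
current_context() = last(CONTEXT_STACKS[Threads.threadid()])

# Hierarchical use: the outer objective starts an inner "optimization".
inner = x -> x + 1
outer = x -> with_context(() -> current_context()(x) * 2, inner)
result = with_context(() -> current_context()(1), outer)
```

Because Julia tasks may migrate between threads, state indexed by `Threads.threadid()` needs care when used from `Threads.@spawn`. Whether this is related to the error reported here has not been established.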

@amontoison
Member

Ok, I see 👍
This increases the priority of doing an unofficial build of PRIMA_jll.jl v0.8.0 asap.

@soldasim

soldasim commented Apr 4, 2024

> Are you sure that the serial version failed on your PC-2 (Linux)?

I have tested it again to be sure. The serial version really fails on my Linux PC but only in Julia started by VSCode.

When I run the two functions from a Julia REPL started by VSCode's Julia extension (Ctrl+Shift+P -> Julia: Start REPL), both the serial and the parallel version throw StackOverflowError.

When I start the Julia REPL from bash myself, only the parallel version fails and the serial version works fine, as on the other Linux devices.

I don't know what to make of this, but at least it is consistent when tried multiple times.

Version info

The only difference in versioninfo() is that the REPL started by VSCode has an additional line JULIA_EDITOR = code in the Environment.

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  LD_LIBRARY_PATH = :/opt/gurobi10.0.0_linux64/gurobi1000/linux64/lib
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8

julia> 
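
Since versioninfo() prints only part of the environment, a fuller comparison of the two REPL sessions might reveal other differences. The following is a minimal sketch using only Base functions; the name `session_report` is hypothetical. The idea is to run it in both sessions and diff the output.

```julia
# Hypothetical helper: collect settings that may differ between two REPL sessions.
function session_report()
    lines = String[]
    push!(lines, "julia_version = $(VERSION)")
    push!(lines, "nthreads = $(Threads.nthreads())")
    push!(lines, "active_project = $(Base.active_project())")
    push!(lines, "load_path = $(join(LOAD_PATH, ':'))")
    # Sorted so that two reports can be compared line by line.
    for key in sort!(collect(keys(ENV)))
        push!(lines, "ENV[$key] = $(ENV[key])")
    end
    return lines
end

foreach(println, session_report())
```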

@emmt
Collaborator

emmt commented Apr 4, 2024

Ok that's really puzzling...

We have started to figure out a way to deal with thread-safety (and hierarchical optimization) differently than currently done in PRIMA.jl. I hope this will solve the issue. Your MWE should definitely be part of the test suite of PRIMA.jl.
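
A minimal sketch of how the MWE could be turned into a regression test. To keep the snippet runnable without PRIMA, the solver is passed as an argument; `check_spawned_solver` and `fake_solver` are hypothetical names, and in the real test suite one would presumably pass `newuoa` instead.

```julia
using Test  # standard library

# Hypothetical helper: run `solver(obj, start)` inside spawned tasks and
# check that every task returns a minimizer close to `expected`.
function check_spawned_solver(solver; tasks=2, expected=5.0, atol=1e-6)
    obj = x -> abs(expected - x[1])
    start = [0.0]
    jobs = [Threads.@spawn solver(obj, start)[1] for _ in 1:tasks]
    results = fetch.(jobs)  # throws TaskFailedException if a task failed
    return all(r -> isapprox(r[1], expected; atol=atol), results)
end

# Stand-in solver so the sketch runs without PRIMA; it mimics returning a
# tuple whose first element is the solution vector.
fake_solver(obj, start) = ([5.0], nothing)

@testset "solver inside Threads.@spawn" begin
    @test check_spawned_solver(fake_solver; tasks=1)
    @test check_spawned_solver(fake_solver; tasks=4)
end
```

With the behavior reported in this thread, such a test would be expected to fail on the affected Linux machines as soon as the real solver is plugged in, even with tasks=1.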

@renatomatz

renatomatz commented Jul 16, 2024

Hi, I have the same issue on my Linux machine, with prima_serial working just fine but not prima_parallel. Other than that, all tests from test PRIMA seem to pass.

Has there been any progress on this bug?

OS: Ubuntu 22.04.4 LTS
Kernel: Linux 6.5.0-41-generic
Architecture: x86-64
Processor: Intel i7-7700HQ (8) @ 3.800GHz
Julia version: 1.10.4

@soldasim

soldasim commented Nov 5, 2024

Hi, got my hands on a Mac, so I've tested the parallelization on it as well.

The prima_parallel runs fine on MacOS (at least in my case). I've added the results to my original comment.

So the issue seems to be purely Linux-related.

@soldasim

soldasim commented Nov 5, 2024

Interestingly, some tests do not pass on my Mac.

I believe this to be unrelated to this parallelization issue, so I've created a new issue.
