gethostbyname error on macos 10.15 (github actions vm) #4710

Closed
jakebolewski opened this issue Jul 17, 2020 · 12 comments · Fixed by JuliaParallel/MPI.jl#477

jakebolewski commented Jul 17, 2020

We are seeing an issue on a GitHub Actions macOS 10.15 VM (with MPICH 3.3.2) where calling MPI.Init() fails:

Fatal error in MPI_Init: Invalid group, error stack:
MPIR_Init_thread(586)..............:
MPID_Init(224).....................: channel initialization failed
MPIDI_CH3_Init(105)................:
MPID_nem_init(324).................:
MPID_nem_tcp_init(175).............:
MPID_nem_tcp_get_business_card(401):
MPID_nem_tcp_init(373).............: gethostbyname failed, Mac-1594849612293 (errno 0)
(unknown)(): Invalid group
SingleStackUtils: Error During Test at /Users/runner/work/ClimateMachine.jl/ClimateMachine.jl/test/testhelpers.jl:16
Test threw exception
Expression: mpiexec() do cmd
run($cmd $oversubscribe -n $ntasks $(Base.julia_cmd()) --startup-file=no --project=$(Base.active_project()) $file)
true
end
failed process: Process(/Users/runner/.julia/artifacts/848ee2ddce903941ae946cd49f63eac561bd636d/bin/mpiexec -n 1 /Users/runner/hostedtoolcache/julia/1.4.2/x64/bin/julia -Cnative -J/Users/runner/work/ClimateMachine.jl/ClimateMachine.jl/ClimateMachine.so --check-bounds=yes -g1 --startup-file=no --project=/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/jl_pVc3V7/Project.toml /Users/runner/work/ClimateMachine.jl/ClimateMachine.jl/test/Utilities/SingleStackUtils/ssu_tests.jl, ProcessExited(8)) [8]

https://github.com/CliMA/ClimateMachine.jl/runs/875221461#step:8:169

Reading the source code, it seems like defining MPICH_INTERFACE_HOSTNAME as localhost should override gethostbyname behavior, but unfortunately this doesn't seem to fix the issue.

It seems that the replacement of gethostbyname with getaddrinfo in the upcoming 3.4 release might fix the issue.
#2889

Ref: JuliaParallel/MPI.jl#407

@hzhou
Contributor

hzhou commented Jul 17, 2020

Could you try the latest version: https://www.mpich.org/downloads/ , either v3.3.2 or v3.4a3?

@jakebolewski
Author

Sorry, I had a typo in the version number. This is using v3.3.2, which should be the latest stable release.

I'm working on testing the pre-release version, but it's not packaged so it's taking longer.

@hzhou
Contributor

hzhou commented Jul 17, 2020

> I'm working on testing the pre-release version, but it's not packaged so it's taking longer.

You could try v3.4a3, which is very close to the edge.

@simonbyrne

What's the timeline for the 3.4 release?

@hzhou
Contributor

hzhou commented Jul 17, 2020

> What's the timeline for the 3.4 release?

The plan is to release 3.4b in a couple months, then 3.4 another month after.

@jakebolewski
Author

jakebolewski commented Jul 21, 2020

Running mpiexec -host localhost with the environment variable MPICH_INTERFACE_HOSTNAME=localhost also set seems to work around the issue.
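
For anyone scripting this in CI, the workaround above can be sketched as a small Python helper that builds the launcher command line and environment. This is only an illustration, not part of MPICH or MPI.jl: the function name `mpich_localhost_workaround` and its parameters are hypothetical, and the only details taken from this thread are the `-host localhost` flag and the `MPICH_INTERFACE_HOSTNAME=localhost` variable.

```python
import os
import shlex


def mpich_localhost_workaround(mpiexec, program_args, ntasks=1, base_env=None):
    """Build (argv, env) that apply the workaround reported in this thread.

    `mpiexec` is the path to the MPICH launcher and `program_args` is the
    command to run under MPI; both are supplied by the caller (hypothetical
    helper, not an MPICH API).
    """
    # Copy the environment so the caller's mapping is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # Environment variable named in the thread.
    env["MPICH_INTERFACE_HOSTNAME"] = "localhost"
    # Launcher flag reported to work when combined with the variable above.
    argv = [mpiexec, "-host", "localhost", "-n", str(ntasks), *program_args]
    return argv, env


if __name__ == "__main__":
    argv, env = mpich_localhost_workaround("mpiexec", ["julia", "--startup-file=no", "test.jl"])
    # Show the equivalent shell invocation without launching anything.
    print("MPICH_INTERFACE_HOSTNAME=" + env["MPICH_INTERFACE_HOSTNAME"], shlex.join(argv))
    # To actually launch: subprocess.run(argv, env=env, check=True)
```

Returning the command and environment instead of launching directly keeps the sketch usable on machines without MPICH installed.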

@hzhou
Contributor

hzhou commented Jul 21, 2020

> MPID_nem_tcp_init(373).............: gethostbyname failed, Mac-1594849612293 (errno 0)

The root issue seems to be that the Mac sets an arbitrary local hostname that can't be resolved by getaddrinfo. Out of curiosity, were you running inside a Docker container?
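
A quick way to check whether a CI machine is affected by the unresolvable-hostname condition described above is to try resolving the machine's own hostname before launching MPI. The sketch below uses Python's `socket.getaddrinfo` as a stand-in for the resolver call; it is an assumption that this mirrors what MPICH sees, since MPICH makes its own C library calls, and the helper name `hostname_resolves` is hypothetical.

```python
import socket


def hostname_resolves(hostname=None):
    """Return True if `hostname` (default: this machine's hostname) resolves."""
    name = socket.gethostname() if hostname is None else hostname
    try:
        # Only resolvability matters here, so the returned addresses are ignored.
        socket.getaddrinfo(name, None)
    except socket.gaierror:
        return False
    return True


if __name__ == "__main__":
    name = socket.gethostname()
    status = "resolves" if hostname_resolves(name) else "does NOT resolve"
    print(f"hostname {name!r} {status}")
```

If the check reports that the hostname does not resolve, the failure in the error stack above is the likely outcome on MPICH 3.3.2, and the localhost workaround from this thread is worth trying.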

@jakebolewski
Author

jakebolewski commented Jul 21, 2020

We were running inside a GitHub Actions macOS image (for CI), which I believe is a VM, although it's not clear how they instantiate the running environment.

They do create an arbitrary hostname, and sadly this can also change during the runtime I believe:

There are similar reported issues for Azure Pipelines (which I believe is shared backing infrastructure with GitHub Actions):

OSX hostname change at runtime for github actions:

Although I know this is not a typical deployment use case for MPI / MPICH, being able to run MPI tests in one of the widely used CI environments is very useful, so I'm willing to help out in any way possible to make it easier for future users (maybe by adding some notes to the documentation with examples if there is a workaround).

@raffenet
Contributor

@jakebolewski I believe this commit now in main may have resolved this issue (51ab64e). Can you confirm?

@raffenet
Contributor

@jakebolewski reminder to confirm if this issue has been resolved, if you have time. Thanks.

@jakebolewski
Author

I'm working on building the artifacts to test it now that a new beta release has been tagged.

@hzhou
Contributor

hzhou commented Jun 4, 2021

This issue is stale. If it is still relevant, please re-open it.

@hzhou hzhou closed this as completed Jun 4, 2021
simonbyrne added a commit to JuliaParallel/MPI.jl that referenced this issue Jun 19, 2021
MPICH 3.4 should have fixed pmodels/mpich#4710
simonbyrne added a commit to JuliaParallel/MPI.jl that referenced this issue Jul 19, 2021