orte_init failed for some reason #369

@momonara

Hi.

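I ran the mlpstorage training workflow (datasize, datagen, then run) with the commands below. The run step fails during MPI initialization with the following errors: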

/root/storage/mlpstorage training datasize --model retinanet --client-host-memory-in-gb 250 --num-client-hosts 1 --max-accelerators 4 --accelerator-type b200 --file --allow-run-as-root
/root/storage/mlpstorage training datagen --hosts 127.0.0.1 --num-processes 4 --model retinanet --data-dir /home/nvme/retinanet_data --results-dir /home/mlp_results/retinanet_results  --param dataset.num_files_train=4155900 --file --allow-run-as-root
/root/storage/mlpstorage training run --hosts 127.0.0.1 --num-client-hosts 1 --client-host-memory-in-gb 250 --num-accelerators 4 --accelerator-type b200 --model retinanet  --data-dir /home/nvme/retinanet_data --results-dir /home/mlp_results/retinanet_results --param dataset.num_files_train=4155900 reader.read_threads=4 --file --allow-run-as-root
[OUTPUT] 2026-05-12T13:15:08.757087 Running DLIO [Training] with 4 process(es)
[OUTPUT] 2026-05-12T13:15:16.778433 Max steps per epoch: 43290 = 1 * 4155900 / 24 / 4 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-05-12T13:15:39.206214 Starting epoch 1: 43290 steps expected
[OUTPUT] 2026-05-12T13:15:39.206664 Starting block 1
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[localhost.localdomain:4185562] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to
guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[localhost.localdomain:4185561] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to
guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[localhost.localdomain:4185563] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to
guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[localhost.localdomain:4185564] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to
guarantee that all other processes were killed!
[localhost.localdomain:4188348] 3 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
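
The "Max steps per epoch" line in the log is consistent with the parameters passed on the command line, which suggests the dataset settings were read as intended before MPI initialization failed. A minimal sketch of that arithmetic follows. The function name is hypothetical (not part of DLIO or mlpstorage), and the use of floor division is an assumption inferred from the logged value:

```python
def max_steps_per_epoch(samples_per_file: int, num_files: int,
                        batch_size: int, comm_size: int) -> int:
    """Steps per rank per epoch, following the formula printed in the log:
    samples per file * num files / batch size / comm size.

    Assumption: each division rounds down (partial batches are dropped).
    """
    total_samples = samples_per_file * num_files
    batches = total_samples // batch_size   # whole batches across all ranks
    return batches // comm_size             # whole batches per rank


# Values taken from the log line: 1 * 4155900 / 24 / 4
steps = max_steps_per_epoch(samples_per_file=1, num_files=4155900,
                            batch_size=24, comm_size=4)
print(steps)  # 43290, matching "Max steps per epoch: 43290"
```

Since the computed value matches the log, the failure appears to be unrelated to the dataset parameters and limited to the Open MPI startup ("No permission (-17)") reported above.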
