I'm trying to run PyTorch Lightning on a SLURM cluster.
It first ran into MPI initialization errors. After setting the SLURM job name to 'bash', as suggested in issue #16730, I can run my scripts successfully using sbatch. However, I still can't run the script in interactive sessions (it fails with the job name set to either 'bash' or 'interactive').
# Please note that I've already loaded openmpi/4.0.4.
(pl_dbg) jianan.zhao@cn-g009:~/scratch/INC$ python src/scripts/pl_test.py
Starts trainer initialization
/home/mila/j/jianan.zhao/scratch/miniconda3/envs/pl_dbg/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python src/scripts/pl_test.py ...
[cn-g009.server.mila.quebec:910946] OPAL ERROR: Unreachable in file pmix3x_client.c at line 111
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[cn-g009.server.mila.quebec:910946] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
(pl_dbg) jianan.zhao@cn-g009:~/scratch/INC$ echo $SLURM_JOB_NAME
bash
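For anyone trying to compare the interactive session with the working sbatch run, here is a minimal diagnostic sketch that summarizes the SLURM variables in the current shell. It does not call Lightning at all. The helper name `summarize_slurm_env` is hypothetical, and treating the job names 'bash' and 'interactive' as "interactive" is an assumption based on the workaround in #16730, not a statement of how Lightning decides internally.

```python
import os
from typing import Mapping, Optional

# Assumption: these job names mark an interactive session, following the
# workaround discussed in issue #16730. Not taken from Lightning's source.
INTERACTIVE_JOB_NAMES = ("bash", "interactive")


def summarize_slurm_env(env: Mapping[str, str]) -> dict:
    """Hypothetical helper: collect the SLURM variables relevant to this report."""
    job_name: Optional[str] = env.get("SLURM_JOB_NAME")
    return {
        "job_name": job_name,
        "ntasks": env.get("SLURM_NTASKS"),
        # SLURM_JOB_ID is set by SLURM inside any allocation (sbatch, srun, salloc).
        "inside_slurm_job": "SLURM_JOB_ID" in env,
        "looks_interactive": job_name in INTERACTIVE_JOB_NAMES,
    }


if __name__ == "__main__":
    # Print one "key: value" line per field for the current shell.
    for key, value in summarize_slurm_env(os.environ).items():
        print(f"{key}: {value}")
```

Running this once inside the interactive session and once inside the sbatch job should show whether the two environments differ in anything besides the job name.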
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Error messages and logs
Bash commands and errors
Environment
Current environment
More info
No response