Cannot run in SLURM Interactive Session #19762

Open
AndyJZhao opened this issue Apr 11, 2024 · 0 comments
Labels
bug (Something isn't working), needs triage (Waiting to be triaged by maintainers)

Comments

@AndyJZhao
Bug description

I'm trying to run PyTorch Lightning on a SLURM cluster.
It first ran into MPI initialization errors. After setting the SLURM job name to 'bash', as suggested in issue #16730, I can successfully run my scripts using sbatch. However, I still can't run the script in interactive sessions (both the 'bash' and 'interactive' job names fail).
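
To help narrow down what differs between the sbatch run and the interactive session, a small diagnostic like the one below could be run in both. This is only a sketch: `slurm_summary` and `looks_interactive` are hypothetical helper names, the list of variables is an illustrative selection, and the 'bash'/'interactive' check simply mirrors the job names discussed above rather than reproducing Lightning's actual detection code.

```python
import os
from typing import Dict, Mapping, Optional

# Illustrative selection of SLURM variables; SLURM sets many more.
SLURM_KEYS = ("SLURM_JOB_NAME", "SLURM_JOB_ID", "SLURM_NTASKS", "SLURM_PROCID")

# Job names mentioned in this report as marking an interactive session.
INTERACTIVE_JOB_NAMES = ("bash", "interactive")


def slurm_summary(environ: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Return the selected SLURM variables, with None for any that are unset."""
    return {key: environ.get(key) for key in SLURM_KEYS}


def looks_interactive(environ: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical helper: True if the job name is one of the interactive names."""
    return environ.get("SLURM_JOB_NAME") in INTERACTIVE_JOB_NAMES


if __name__ == "__main__":
    for key, value in slurm_summary().items():
        print(f"{key}={value}")
    print("looks interactive:", looks_interactive())
```

Comparing the printed values from an sbatch job and from the interactive shell should show which variables differ between the two launches.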

What version are you seeing the problem on?

v2.2

How to reproduce the bug

import pytorch_lightning as pl
import numpy as np
import torch
from torch.nn import MSELoss
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset
import torch.nn as nn


class SimpleDataset(Dataset):
    def __init__(self):
        X = np.arange(10000)
        y = X * 2
        X = [[_] for _ in X]
        y = [[_] for _ in y]
        self.X = torch.Tensor(X)
        self.y = torch.Tensor(y)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return {"X": self.X[idx], "y": self.y[idx]}


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1, 1)
        self.criterion = MSELoss()

    def forward(self, inputs_id, labels=None):
        outputs = self.fc(inputs_id)
        loss = 0
        if labels is not None:
            loss = self.criterion(outputs, labels)
        return loss, outputs

    def train_dataloader(self):
        dataset = SimpleDataset()
        return DataLoader(dataset, batch_size=1000)

    def training_step(self, batch, batch_idx):
        input_ids = batch["X"]
        labels = batch["y"]
        loss, outputs = self(input_ids, labels)
        return {"loss": loss}

    def configure_optimizers(self):
        optimizer = Adam(self.parameters())
        return optimizer


if __name__ == '__main__':
    model = MyModel()
    print('Starts trainer initialization')
    trainer = pl.Trainer(max_epochs=2000, accelerator='gpu')
    trainer.fit(model)

    X = torch.Tensor([[1.0], [51.0], [89.0]])
    _, y = model(X)
    print(y)
# python src/scripts/pl_test.py

Error messages and logs

Bash commands and errors

# Please note that I've already loaded openmpi/4.0.4.
(pl_dbg) jianan.zhao@cn-g009:~/scratch/INC$ python src/scripts/pl_test.py
Starts trainer initialization
/home/mila/j/jianan.zhao/scratch/miniconda3/envs/pl_dbg/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python src/scripts/pl_test.py ...
[cn-g009.server.mila.quebec:910946] OPAL ERROR: Unreachable in file pmix3x_client.c at line 111
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[cn-g009.server.mila.quebec:910946] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
(pl_dbg) jianan.zhao@cn-g009:~/scratch/INC$ echo $SLURM_JOB_NAME
bash
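
The abort happens inside MPI_Init_thread, and mpi4py is present in the environment listed in this report. With mpi4py's default settings, importing `mpi4py.MPI` initializes MPI. So one thing that may be worth checking (an assumption, not a confirmed cause) is whether `mpi4py.MPI` gets imported during trainer setup. The sketch below uses only the standard library; `module_installed` and `module_loaded` are hypothetical helper names.

```python
import importlib.util
import sys


def module_installed(name: str) -> bool:
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def module_loaded(name: str) -> bool:
    """True if the module has already been imported in this process."""
    return name in sys.modules


if __name__ == "__main__":
    # find_spec only locates the package, so MPI is not initialized here.
    print("mpi4py installed:", module_installed("mpi4py"))
    # Calling this after `import pytorch_lightning` shows whether MPI was pulled in.
    print("mpi4py.MPI loaded:", module_loaded("mpi4py.MPI"))
```

If `mpi4py.MPI` turns out to be loaded right before the abort, testing in an environment without mpi4py would be one way to confirm or rule out this path.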

Environment

Current environment
  • CUDA:
    • GPU:
      • NVIDIA A100-SXM4-80GB MIG 2g.20gb
    • available: True
    • version: 12.1
  • Lightning:
    • lightning: 2.2.1
    • lightning-utilities: 0.11.2
    • pytorch-lightning: 2.2.1
    • torch: 2.2.1
    • torchaudio: 2.2.1
    • torchdata: 0.7.1
    • torchmetrics: 1.3.2
    • torchvision: 0.17.1
  • Packages:
    • aiohttp: 3.9.3
    • aiosignal: 1.3.1
    • annotated-types: 0.6.0
    • antlr4-python3-runtime: 4.9.3
    • anyio: 4.3.0
    • appdirs: 1.4.4
    • argon2-cffi: 23.1.0
    • argon2-cffi-bindings: 21.2.0
    • arrow: 1.3.0
    • asttokens: 2.4.1
    • async-lru: 2.0.4
    • async-timeout: 4.0.3
    • attrs: 23.2.0
    • babel: 2.14.0
    • beautifulsoup4: 4.12.3
    • bleach: 6.1.0
    • brotli: 1.1.0
    • cached-property: 1.5.2
    • certifi: 2024.2.2
    • cffi: 1.16.0
    • charset-normalizer: 3.3.2
    • click: 8.1.7
    • colorama: 0.4.6
    • colorlog: 6.8.2
    • comm: 0.2.2
    • datasets: 2.18.0
    • debugpy: 1.8.1
    • decorator: 5.1.1
    • defusedxml: 0.7.1
    • dgl: 2.1.0+cu118
    • dill: 0.3.8
    • docker-pycreds: 0.4.0
    • easydict: 1.13
    • einops: 0.7.0
    • entrypoints: 0.4
    • exceptiongroup: 1.2.0
    • executing: 2.0.1
    • fastjsonschema: 2.19.1
    • filelock: 3.13.3
    • fqdn: 1.5.1
    • frozenlist: 1.4.1
    • fsspec: 2024.2.0
    • gitdb: 4.0.11
    • gitpython: 3.1.43
    • gmpy2: 2.1.2
    • h11: 0.14.0
    • h2: 4.1.0
    • hpack: 4.0.0
    • httpcore: 1.0.5
    • httpx: 0.27.0
    • huggingface-hub: 0.22.2
    • hydra-colorlog: 1.2.0
    • hydra-core: 1.3.2
    • hyperframe: 6.0.1
    • idna: 3.6
    • importlib-metadata: 7.1.0
    • importlib-resources: 6.4.0
    • ipykernel: 6.29.3
    • ipython: 8.22.2
    • ipywidgets: 8.1.2
    • isoduration: 20.11.0
    • jedi: 0.19.1
    • jinja2: 3.1.3
    • joblib: 1.3.2
    • json5: 0.9.24
    • jsonpointer: 2.4
    • jsonschema: 4.21.1
    • jsonschema-specifications: 2023.12.1
    • jupyter: 1.0.0
    • jupyter-client: 8.6.1
    • jupyter-console: 6.6.3
    • jupyter-core: 5.7.2
    • jupyter-events: 0.10.0
    • jupyter-lsp: 2.2.4
    • jupyter-server: 2.13.0
    • jupyter-server-terminals: 0.5.3
    • jupyterlab: 4.1.5
    • jupyterlab-pygments: 0.3.0
    • jupyterlab-server: 2.25.4
    • jupyterlab-widgets: 3.0.10
    • lightning: 2.2.1
    • lightning-utilities: 0.11.2
    • littleutils: 0.2.2
    • markdown-it-py: 3.0.0
    • markupsafe: 2.1.5
    • matplotlib-inline: 0.1.6
    • mdurl: 0.1.2
    • mistune: 3.0.2
    • mpi4py: 3.1.5
    • mpmath: 1.3.0
    • multidict: 6.0.5
    • multiprocess: 0.70.16
    • nbclient: 0.10.0
    • nbconvert: 7.16.3
    • nbformat: 5.10.4
    • nest-asyncio: 1.6.0
    • networkx: 3.3
    • notebook: 7.1.2
    • notebook-shim: 0.2.4
    • numpy: 1.26.4
    • nvidia-htop: 1.2.0
    • ogb: 1.3.6
    • omegaconf: 2.3.0
    • outdated: 0.2.2
    • overrides: 7.7.0
    • packaging: 24.0
    • pandas: 2.2.1
    • pandocfilters: 1.5.0
    • parso: 0.8.4
    • pathtools: 0.1.2
    • pexpect: 4.9.0
    • pickleshare: 0.7.5
    • pillow: 9.4.0
    • pip: 24.0
    • pkgutil-resolve-name: 1.3.10
    • platformdirs: 4.2.0
    • prometheus-client: 0.20.0
    • prompt-toolkit: 3.0.42
    • protobuf: 4.25.3
    • psutil: 5.9.8
    • ptyprocess: 0.7.0
    • pure-eval: 0.2.2
    • pyarrow: 15.0.2
    • pyarrow-hotfix: 0.6
    • pycparser: 2.22
    • pydantic: 2.6.4
    • pydantic-core: 2.16.3
    • pygments: 2.17.2
    • pysocks: 1.7.1
    • python-dateutil: 2.9.0
    • python-dotenv: 1.0.1
    • python-json-logger: 2.0.7
    • pytorch-lightning: 2.2.1
    • pytz: 2024.1
    • pyyaml: 6.0.1
    • pyzmq: 25.1.2
    • qtconsole: 5.5.1
    • qtpy: 2.4.1
    • referencing: 0.34.0
    • regex: 2023.12.25
    • requests: 2.31.0
    • rfc3339-validator: 0.1.4
    • rfc3986-validator: 0.1.1
    • rich: 13.7.1
    • rootutils: 1.0.7
    • rpds-py: 0.18.0
    • safetensors: 0.4.2
    • scikit-learn: 1.4.1.post1
    • scipy: 1.13.0
    • send2trash: 1.8.2
    • sentry-sdk: 1.44.1
    • setproctitle: 1.3.3
    • setuptools: 69.2.0
    • six: 1.16.0
    • smmap: 5.0.0
    • sniffio: 1.3.1
    • soupsieve: 2.5
    • stack-data: 0.6.2
    • sympy: 1.12
    • termcolor: 2.4.0
    • terminado: 0.18.1
    • threadpoolctl: 3.4.0
    • tinycss2: 1.2.1
    • tokenizers: 0.15.2
    • tomli: 2.0.1
    • torch: 2.2.1
    • torchaudio: 2.2.1
    • torchdata: 0.7.1
    • torchmetrics: 1.3.2
    • torchvision: 0.17.1
    • tornado: 6.4
    • tqdm: 4.66.2
    • traitlets: 5.14.2
    • transformers: 4.39.3
    • triton: 2.2.0
    • types-python-dateutil: 2.9.0.20240316
    • typing-extensions: 4.11.0
    • typing-utils: 0.1.0
    • tzdata: 2024.1
    • uri-template: 1.3.0
    • urllib3: 2.2.1
    • wandb: 0.16.5
    • wcwidth: 0.2.13
    • webcolors: 1.13
    • webencodings: 0.5.1
    • websocket-client: 1.7.0
    • wheel: 0.43.0
    • widgetsnbextension: 4.0.10
    • xxhash: 3.4.1
    • yarl: 1.9.4
    • zipp: 3.17.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.10.14
    • release: 5.15.0-101-generic
    • version: #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024

More info

No response

@AndyJZhao AndyJZhao added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Apr 11, 2024