Preconfigured memory allocation for SLURM jobs is too high #590

@mreso

Description

🐛 Describe the bug

Hi,

It seems like we're requesting a constant amount of memory for each SLURM job:

role.resource.memMB = 2062607

The value was presumably the right size for the CoreWeave cluster, but it now fails on a different cluster with:

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/mreso/torchforge/apps/openenv/main.py", line 947, in <module>
    _main()
  File "/home/mreso/torchforge/src/forge/util/config.py", line 340, in wrapper
    sys.exit(recipe_main(conf))
             ^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/apps/openenv/main.py", line 945, in _main
    asyncio.run(main(cfg))
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/apps/openenv/main.py", line 673, in main
    env_actor = await GenericOpenEnvActor.options(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/actor.py", line 235, in as_actor
    actor = await cls.launch(*args, **actor_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/actor.py", line 217, in launch
    proc_mesh = await get_proc_mesh(process_config=cfg)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/provisioner.py", line 536, in get_proc_mesh
    return await provisioner.get_proc_mesh(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/provisioner.py", line 328, in get_proc_mesh
    host_mesh, server_name = await self.create_host_mesh(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/provisioner.py", line 255, in create_host_mesh
    alloc, alloc_constraints, server_name = await self.launcher.get_allocator(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/launcher.py", line 152, in get_allocator
    server_info = await commands.get_or_create(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/site-packages/monarch/tools/commands.py", line 330, in get_or_create
    new_server_handle = str(create(config, name))
                            ^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/site-packages/monarch/tools/commands.py", line 161, in create
    server_handle = runner.schedule(info)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/site-packages/torchx/runner/api.py", line 330, in schedule
    app_id = sched.schedule(dryrun_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/site-packages/torchx/schedulers/slurm_scheduler.py", line 472, in schedule
    p = subprocess.run(req.cmd, stdout=subprocess.PIPE, check=True)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--parsable', '/tmp/tmpmqoqoxz9/torchx-sbatch.sh']' returned non-zero exit status 1.
Instead of hard-coding this value, we should retrieve the available node memory automatically with sinfo.
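A minimal sketch of what that could look like: query `sinfo` for each node's configured memory and take the minimum, so the requested `memMB` is always satisfiable on the current cluster. `get_min_node_memory_mb` and `parse_sinfo_memory` are hypothetical helpers, not existing forge APIs.

```python
import subprocess


def parse_sinfo_memory(output: str) -> int:
    """Parse the output of `sinfo -N -h -o "%m"` and return the smallest
    per-node memory value in MB. "%m" prints a node's configured memory;
    a trailing '+' (heterogeneous node groups) is stripped before parsing."""
    mems = []
    for tok in output.split():
        tok = tok.rstrip("+")
        if tok.isdigit():
            mems.append(int(tok))
    if not mems:
        raise RuntimeError("sinfo returned no node memory information")
    return min(mems)


def get_min_node_memory_mb() -> int:
    """Hypothetical helper: ask SLURM for the smallest node memory so we
    never request more than the cluster can provide."""
    out = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%m"],
        stdout=subprocess.PIPE,
        check=True,
        text=True,
    ).stdout
    return parse_sinfo_memory(out)
```

The launcher could then set `role.resource.memMB = get_min_node_memory_mb()` (possibly minus a small headroom for the OS) rather than the fixed 2062607.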

Versions

No response
