Closed as duplicate of #611
Description
🐛 Describe the bug
Hi,
It seems like we're requesting a constant amount of memory for each SLURM job:
torchforge/src/forge/controller/launcher.py, line 139 at 6177da8:

```python
role.resource.memMB = 2062607
```
The value was presumably the right size for the Coreweave cluster, but it now fails on a different cluster with:
```
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/mreso/torchforge/apps/openenv/main.py", line 947, in <module>
    _main()
  File "/home/mreso/torchforge/src/forge/util/config.py", line 340, in wrapper
    sys.exit(recipe_main(conf))
             ^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/apps/openenv/main.py", line 945, in _main
    asyncio.run(main(cfg))
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/apps/openenv/main.py", line 673, in main
    env_actor = await GenericOpenEnvActor.options(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/actor.py", line 235, in as_actor
    actor = await cls.launch(*args, **actor_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/actor.py", line 217, in launch
    proc_mesh = await get_proc_mesh(process_config=cfg)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/provisioner.py", line 536, in get_proc_mesh
    return await provisioner.get_proc_mesh(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/provisioner.py", line 328, in get_proc_mesh
    host_mesh, server_name = await self.create_host_mesh(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/provisioner.py", line 255, in create_host_mesh
    alloc, alloc_constraints, server_name = await self.launcher.get_allocator(
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/torchforge/src/forge/controller/launcher.py", line 152, in get_allocator
    server_info = await commands.get_or_create(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/site-packages/monarch/tools/commands.py", line 330, in get_or_create
    new_server_handle = str(create(config, name))
                            ^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/site-packages/monarch/tools/commands.py", line 161, in create
    server_handle = runner.schedule(info)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/site-packages/torchx/runner/api.py", line 330, in schedule
    app_id = sched.schedule(dryrun_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/site-packages/torchx/schedulers/slurm_scheduler.py", line 472, in schedule
    p = subprocess.run(req.cmd, stdout=subprocess.PIPE, check=True)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mreso/miniforge3/envs/monarch/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--parsable', '/tmp/tmpmqoqoxz9/torchx-sbatch.sh']' returned non-zero exit status 1.
```
We should instead retrieve the available node memory automatically with `sinfo`.
Versions
No response