## 🐛 Bug
I am seeing an issue with the `slurm_scheduler` where it raises an exception at this line:

That is because `job_resources` is actually `None` at that point.
The full job is this:

```python
{'account': 'root', 'accrue_time': {'set': True, 'infinite': False, 'number': 0}, 'admin_comment': '', 'allocating_node': 'slurm-login-0', 'array_job_id': {'set': True, 'infinite': False, 'number': 0}, 'array_task_id': {'set': False, 'infinite': False, 'number': 0}, 'array_max_tasks': {'set': True, 'infinite': False, 'number': 0}, 'array_task_string': '', 'association_id': 6, 'batch_features': '', 'batch_flag': True, 'batch_host': '', 'flags': ['EXACT_CPU_COUNT_REQUESTED', 'EXACT_MEMORY_REQUESTED', 'USING_DEFAULT_ACCOUNT', 'USING_DEFAULT_PARTITION', 'USING_DEFAULT_QOS', 'USING_DEFAULT_WCKEY', 'PARTITION_ASSIGNED'], 'burst_buffer': '', 'burst_buffer_state': '', 'cluster': 'slurm', 'cluster_features': '', 'command': '/tmp/tmpvbbeyrkr/torchx-sbatch.sh', 'comment': '', 'container': '', 'container_id': '', 'contiguous': False, 'core_spec': 0, 'thread_spec': 32766, 'cores_per_socket': {'set': False, 'infinite': False, 'number': 0}, 'billable_tres': {'set': False, 'infinite': False, 'number': 0.0}, 'cpus_per_task': {'set': True, 'infinite': False, 'number': 48}, 'cpu_frequency_minimum': {'set': False, 'infinite': False, 'number': 0}, 'cpu_frequency_maximum': {'set': False, 'infinite': False, 'number': 0}, 'cpu_frequency_governor': {'set': False, 'infinite': False, 'number': 0}, 'cpus_per_tres': '', 'cron': '', 'deadline': {'set': True, 'infinite': False, 'number': 0}, 'delay_boot': {'set': True, 'infinite': False, 'number': 0}, 'dependency': '', 'derived_exit_code': {'status': ['SUCCESS'], 'return_code': {'set': True, 'infinite': False, 'number': 0}, 'signal': {'id': {'set': False, 'infinite': False, 'number': 0}, 'name': ''}}, 'eligible_time': {'set': True, 'infinite': False, 'number': 1756128930}, 'end_time': {'set': True, 'infinite': False, 'number': 0}, 'excluded_nodes': '', 'exit_code': {'status': ['SUCCESS'], 'return_code': {'set': True, 'infinite': False, 'number': 0}, 'signal': {'id': {'set': False, 'infinite': False, 'number': 0}, 'name': ''}}, 'extra': '', 'failed_node': '', 'features': '', 'federation_origin': '', 'federation_siblings_active': '', 'federation_siblings_viable': '', 'gres_detail': [], 'group_id': 2033, 'group_name': 'ahmads', 'het_job_id': {'set': True, 'infinite': False, 'number': 19}, 'het_job_id_set': '19-20', 'het_job_offset': {'set': True, 'infinite': False, 'number': 0}, 'job_id': 19, 'job_resources': None, 'job_size_str': [], 'job_state': ['PENDING'], 'last_sched_evaluation': {'set': True, 'infinite': False, 'number': 1756128930}, 'licenses': '', 'mail_type': [], 'mail_user': 'ahmads', 'max_cpus': {'set': True, 'infinite': False, 'number': 0}, 'max_nodes': {'set': True, 'infinite': False, 'number': 0}, 'mcs_label': '', 'memory_per_tres': '', 'name': 'mesh0-0', 'network': '', 'nodes': '', 'nice': 0, 'tasks_per_core': {'set': False, 'infinite': True, 'number': 0}, 'tasks_per_tres': {'set': True, 'infinite': False, 'number': 0}, 'tasks_per_node': {'set': True, 'infinite': False, 'number': 1}, 'tasks_per_socket': {'set': False, 'infinite': True, 'number': 0}, 'tasks_per_board': {'set': True, 'infinite': False, 'number': 0}, 'cpus': {'set': True, 'infinite': False, 'number': 48}, 'node_count': {'set': True, 'infinite': False, 'number': 1}, 'tasks': {'set': True, 'infinite': False, 'number': 1}, 'partition': 'all', 'prefer': '', 'memory_per_cpu': {'set': False, 'infinite': False, 'number': 0}, 'memory_per_node': {'set': True, 'infinite': False, 'number': 186777}, 'minimum_cpus_per_node': {'set': True, 'infinite': False, 'number': 48}, 'minimum_tmp_disk_per_node': 
{'set': True, 'infinite': False, 'number': 0}, 'power': {'flags': []}, 'preempt_time': {'set': True, 'infinite': False, 'number': 0}, 'preemptable_time': {'set': True, 'infinite': False, 'number': 0}, 'pre_sus_time': {'set': True, 'infinite': False, 'number': 0}, 'hold': False, 'priority': {'set': True, 'infinite': False, 'number': 1}, 'priority_by_partition': [], 'profile': ['NOT_SET'], 'qos': 'normal', 'reboot': False, 'required_nodes': '', 'required_switches': 0, 'requeue': True, 'resize_time': {'set': True, 'infinite': False, 'number': 0}, 'restart_cnt': 0, 'resv_name': '', 'scheduled_nodes': '', 'selinux_context': '', 'shared': [], 'sockets_per_board': 0, 'sockets_per_node': {'set': False, 'infinite': False, 'number': 0}, 'start_time': {'set': True, 'infinite': False, 'number': 0}, 'state_description': '', 'state_reason': 'None', 'standard_error': '/mnt/home/ahmads/monarch/examples/slurm-19.out', 'standard_input': '/dev/null', 'standard_output': '/mnt/home/ahmads/monarch/examples/slurm-19.out', 'submit_time': {'set': True, 'infinite': False, 'number': 1756128930}, 'suspend_time': {'set': True, 'infinite': False, 'number': 0}, 'system_comment': '', 'time_limit': {'set': False, 'infinite': True, 'number': 0}, 'time_minimum': {'set': True, 'infinite': False, 'number': 0}, 'threads_per_core': {'set': False, 'infinite': False, 'number': 0}, 'tres_bind': '', 'tres_freq': '', 'tres_per_job': '', 'tres_per_node': 'gres/gpu:4', 'tres_per_socket': '', 'tres_per_task': 'cpu=48', 'tres_req_str': 'cpu=48,mem=186777M,node=1,billing=48,gres/gpu=4', 'tres_alloc_str': '', 'user_id': 2033, 'user_name': 'ahmads', 'maximum_switch_wait_time': 0, 'wckey': '', 'current_working_directory': '/mnt/home/ahmads/monarch/examples'}
```

Notice that `'job_resources': None` is there.
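For illustration, any subscripting of that field while the job is pending fails with a `TypeError` (a minimal sketch of the failure mode, not necessarily the exact torchx access):

```python
# Minimal illustration: the pending job reports job_resources as None,
# so any field access on it raises.
job = {"job_resources": None}  # abbreviated pending-job entry from above
job["job_resources"]["nodes"]  # TypeError: 'NoneType' object is not subscriptable
```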
When I run `squeue` it shows:

```
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
21+1 all mesh0-1 ahmads PD 0:00 1 (None)
21+0 all mesh0-0 ahmads PD 0:00 1 (None)
```

This state lasts a few seconds, and after that the state changes to:

```
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
21+0 all mesh0-0 ahmads R 0:46 1 slurm-h100-206-061
21+1 all mesh0-1 ahmads R 0:46 1 slurm-h100-206-089
```

At that point `job_resources` is not `None` and the full job is:

```python
{'account': 'root', 'accrue_time': {'set': True, 'infinite': False, 'number': 1756128037}, 'admin_comment': '', 'allocating_node': 'slurm-login-0', 'array_job_id': {'set': True, 'infinite': False, 'number': 0}, 'array_task_id': {'set': False, 'infinite': False, 'number': 0}, 'array_max_tasks': {'set': True, 'infinite': False, 'number': 0}, 'array_task_string': '', 'association_id': 6, 'batch_features': '', 'batch_flag': True, 'batch_host': 'slurm-h100-206-089', 'flags': ['EXACT_CPU_COUNT_REQUESTED', 'ACCRUE_COUNT_CLEARED', 'JOB_WAS_RUNNING', 'EXACT_MEMORY_REQUESTED', 'USING_DEFAULT_ACCOUNT', 'USING_DEFAULT_PARTITION', 'USING_DEFAULT_QOS', 'USING_DEFAULT_WCKEY', 'PARTITION_ASSIGNED', 'BACKFILL_ATTEMPTED'], 'burst_buffer': '', 'burst_buffer_state': '', 'cluster': 'slurm', 'cluster_features': '', 'command': '/tmp/tmpfxjjtdp9/torchx-sbatch.sh', 'comment': '', 'container': '', 'container_id': '', 'contiguous': False, 'core_spec': 0, 'thread_spec': 32766, 'cores_per_socket': {'set': False, 'infinite': False, 'number': 0}, 'billable_tres': {'set': True, 'infinite': False, 'number': 48.0}, 'cpus_per_task': {'set': True, 'infinite': False, 'number': 48}, 'cpu_frequency_minimum': {'set': False, 'infinite': False, 'number': 0}, 'cpu_frequency_maximum': {'set': False, 'infinite': False, 'number': 0}, 'cpu_frequency_governor': {'set': False, 'infinite': False, 'number': 0}, 'cpus_per_tres': '', 'cron': '', 'deadline': {'set': True, 'infinite': False, 'number': 0}, 'delay_boot': {'set': True, 'infinite': False, 'number': 0}, 'dependency': '', 'derived_exit_code': {'status': ['SUCCESS'], 'return_code': {'set': True, 'infinite': False, 'number': 0}, 'signal': {'id': {'set': False, 'infinite': False, 'number': 0}, 'name': ''}}, 'eligible_time': {'set': True, 'infinite': False, 'number': 1756128037}, 'end_time': {'set': True, 'infinite': False, 'number': 1787664038}, 'excluded_nodes': '', 'exit_code': {'status': ['SUCCESS'], 'return_code': {'set': True, 'infinite': False, 'number': 0}, 'signal': {'id': {'set': False, 'infinite': False, 'number': 0}, 'name': ''}}, 'extra': '', 'failed_node': '', 'features': '', 'federation_origin': '', 'federation_siblings_active': '', 'federation_siblings_viable': '', 'gres_detail': ['gpu:h100:4(IDX:0-3)'], 'group_id': 2033, 'group_name': 'ahmads', 'het_job_id': {'set': True, 'infinite': False, 'number': 17}, 'het_job_id_set': '17-18', 'het_job_offset': {'set': True, 'infinite': False, 'number': 1}, 'job_id': 18, 'job_resources': {'select_type': ['CORE'], 'nodes': {'count': 1, 'select_type': ['ONE_ROW'], 'list': 'slurm-h100-206-089', 'whole': False, 'allocation': [{'index': 0, 'name': 'slurm-h100-206-089', 'cpus': {'count': 48, 'used': 48}, 'memory': {'used': 186777, 'allocated': 186777}, 'sockets': [{'index': 0, 'cores': [{'index': 0, 'status': ['ALLOCATED']}, {'index': 1, 'status': ['ALLOCATED']}, {'index': 2, 'status': ['ALLOCATED']}, {'index': 3, 'status': ['ALLOCATED']}, {'index': 4, 'status': ['ALLOCATED']}, {'index': 5, 'status': ['ALLOCATED']}, {'index': 6, 'status': ['ALLOCATED']}, {'index': 7, 'status': ['ALLOCATED']}, {'index': 8, 'status': ['ALLOCATED']}, {'index': 9, 'status': ['ALLOCATED']}, {'index': 10, 'status': ['ALLOCATED']}, {'index': 11, 'status': ['ALLOCATED']}, {'index': 12, 'status': ['ALLOCATED']}, {'index': 13, 'status': ['ALLOCATED']}, {'index': 14, 'status': ['ALLOCATED']}, {'index': 15, 'status': ['ALLOCATED']}, {'index': 16, 'status': ['ALLOCATED']}, {'index': 17, 'status': ['ALLOCATED']}, {'index': 18, 'status': ['ALLOCATED']}, {'index': 19, 
'status': ['ALLOCATED']}, {'index': 20, 'status': ['ALLOCATED']}, {'index': 21, 'status': ['ALLOCATED']}, {'index': 22, 'status': ['ALLOCATED']}, {'index': 23, 'status': ['ALLOCATED']}, {'index': 24, 'status': ['UNALLOCATED']}, {'index': 25, 'status': ['UNALLOCATED']}, {'index': 26, 'status': ['UNALLOCATED']}, {'index': 27, 'status': ['UNALLOCATED']}, {'index': 28, 'status': ['UNALLOCATED']}, {'index': 29, 'status': ['UNALLOCATED']}, {'index': 30, 'status': ['UNALLOCATED']}, {'index': 31, 'status': ['UNALLOCATED']}]}, {'index': 1, 'cores': [{'index': 0, 'status': ['UNALLOCATED']}, {'index': 1, 'status': ['UNALLOCATED']}, {'index': 2, 'status': ['UNALLOCATED']}, {'index': 3, 'status': ['UNALLOCATED']}, {'index': 4, 'status': ['UNALLOCATED']}, {'index': 5, 'status': ['UNALLOCATED']}, {'index': 6, 'status': ['UNALLOCATED']}, {'index': 7, 'status': ['UNALLOCATED']}, {'index': 8, 'status': ['UNALLOCATED']}, {'index': 9, 'status': ['UNALLOCATED']}, {'index': 10, 'status': ['UNALLOCATED']}, {'index': 11, 'status': ['UNALLOCATED']}, {'index': 12, 'status': ['UNALLOCATED']}, {'index': 13, 'status': ['UNALLOCATED']}, {'index': 14, 'status': ['UNALLOCATED']}, {'index': 15, 'status': ['UNALLOCATED']}, {'index': 16, 'status': ['UNALLOCATED']}, {'index': 17, 'status': ['UNALLOCATED']}, {'index': 18, 'status': ['UNALLOCATED']}, {'index': 19, 'status': ['UNALLOCATED']}, {'index': 20, 'status': ['UNALLOCATED']}, {'index': 21, 'status': ['UNALLOCATED']}, {'index': 22, 'status': ['UNALLOCATED']}, {'index': 23, 'status': ['UNALLOCATED']}, {'index': 24, 'status': ['UNALLOCATED']}, {'index': 25, 'status': ['UNALLOCATED']}, {'index': 26, 'status': ['UNALLOCATED']}, {'index': 27, 'status': ['UNALLOCATED']}, {'index': 28, 'status': ['UNALLOCATED']}, {'index': 29, 'status': ['UNALLOCATED']}, {'index': 30, 'status': ['UNALLOCATED']}, {'index': 31, 'status': ['UNALLOCATED']}]}]}]}, 'cpus': 48, 'threads_per_core': {'set': False, 'infinite': False, 'number': 0}}, 'job_size_str': [], 'job_state': ['RUNNING'], 'last_sched_evaluation': {'set': True, 'infinite': False, 'number': 1756128038}, 'licenses': '', 'mail_type': [], 'mail_user': 'ahmads', 'max_cpus': {'set': True, 'infinite': False, 'number': 0}, 'max_nodes': {'set': True, 'infinite': False, 'number': 0}, 'mcs_label': '', 'memory_per_tres': '', 'name': 'mesh0-1', 'network': '', 'nodes': 'slurm-h100-206-089', 'nice': 0, 'tasks_per_core': {'set': False, 'infinite': True, 'number': 0}, 'tasks_per_tres': {'set': True, 'infinite': False, 'number': 0}, 'tasks_per_node': {'set': True, 'infinite': False, 'number': 1}, 'tasks_per_socket': {'set': False, 'infinite': True, 'number': 0}, 'tasks_per_board': {'set': True, 'infinite': False, 'number': 0}, 'cpus': {'set': True, 'infinite': False, 'number': 48}, 'node_count': {'set': True, 'infinite': False, 'number': 1}, 'tasks': {'set': True, 'infinite': False, 'number': 1}, 'partition': 'all', 'prefer': '', 'memory_per_cpu': {'set': False, 'infinite': False, 'number': 0}, 'memory_per_node': {'set': True, 'infinite': False, 'number': 186777}, 'minimum_cpus_per_node': {'set': True, 'infinite': False, 'number': 48}, 'minimum_tmp_disk_per_node': {'set': True, 'infinite': False, 'number': 0}, 'power': {'flags': []}, 'preempt_time': {'set': True, 'infinite': False, 'number': 0}, 'preemptable_time': {'set': True, 'infinite': False, 'number': 1756128038}, 'pre_sus_time': {'set': True, 'infinite': False, 'number': 0}, 'hold': False, 'priority': {'set': True, 'infinite': False, 'number': 1}, 'priority_by_partition': [], 'profile': 
['NOT_SET'], 'qos': 'normal', 'reboot': False, 'required_nodes': '', 'required_switches': 0, 'requeue': True, 'resize_time': {'set': True, 'infinite': False, 'number': 0}, 'restart_cnt': 0, 'resv_name': '', 'scheduled_nodes': '', 'selinux_context': '', 'shared': [], 'sockets_per_board': 0, 'sockets_per_node': {'set': False, 'infinite': False, 'number': 0}, 'start_time': {'set': True, 'infinite': False, 'number': 1756128038}, 'state_description': '', 'state_reason': 'None', 'standard_error': '/mnt/home/ahmads/monarch/examples/slurm-18.out', 'standard_input': '/dev/null', 'standard_output': '/mnt/home/ahmads/monarch/examples/slurm-18.out', 'submit_time': {'set': True, 'infinite': False, 'number': 1756128037}, 'suspend_time': {'set': True, 'infinite': False, 'number': 0}, 'system_comment': '', 'time_limit': {'set': False, 'infinite': True, 'number': 0}, 'time_minimum': {'set': True, 'infinite': False, 'number': 0}, 'threads_per_core': {'set': False, 'infinite': False, 'number': 0}, 'tres_bind': '', 'tres_freq': '', 'tres_per_job': '', 'tres_per_node': 'gres/gpu:4', 'tres_per_socket': '', 'tres_per_task': 'cpu=48', 'tres_req_str': 'cpu=48,mem=186777M,node=1,billing=48,gres/gpu=4', 'tres_alloc_str': 'cpu=48,mem=186777M,node=1,billing=48,gres/gpu=4', 'user_id': 2033, 'user_name': 'ahmads', 'maximum_switch_wait_time': 0, 'wckey': '', 'current_working_directory': '/mnt/home/ahmads/monarch/examples'}
```

So we probably need to check that `job_resources is not None`. However, that alone is not sufficient, because `scheduled_nodes` is also empty while the job is pending, so there is no other field to read the node list from yet; see the sketch below.
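A minimal sketch of the kind of guard I mean, assuming a helper shaped roughly like the scheduler's describe path (`nodes_for_job` is an illustrative name, not the actual torchx code):

```python
# Hypothetical guard for the slurm scheduler's describe path.
# `job` is one decoded job entry as shown in the dumps above.
def nodes_for_job(job: dict) -> str:
    resources = job.get("job_resources")
    if resources is None:
        # Job is still PENDING: no allocation yet, and scheduled_nodes
        # is empty too, so report "no nodes" instead of raising.
        return ""
    # Job is RUNNING: the node list lives at job_resources["nodes"]["list"].
    return resources.get("nodes", {}).get("list", "")
```

With a guard like this, a pending job would simply show no nodes until Slurm fills in `job_resources`.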
The Slurm version is:

```
$ sinfo --version
slurm 24.11.5
```
Module (check all that apply):
* [ ] `torchx.spec`
* [ ] `torchx.component`
* [ ] `torchx.apps`
* [ ] `torchx.runtime`
* [ ] `torchx.cli`
* [x] `torchx.schedulers`
* [ ] `torchx.pipelines`
* [ ] `torchx.aws`
* [ ] `torchx.examples`
* [ ] `other`
## To Reproduce
Steps to reproduce the behavior:
1. Submit a job through the `slurm` scheduler (here, a heterogeneous two-replica job, `mesh0-0` and `mesh0-1`).
1. Query the job's status while it is still `PENDING` (see the repro sketch below).
1. The scheduler raises an exception because `job_resources` is `None` in the job description.
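A minimal repro sketch of step 2, assuming `scontrol show job --json` (the job id and the direct `scontrol` call are illustrative; the JSON mirrors the job dicts the scheduler parses):

```python
# Poll a freshly submitted het job while it is still PENDING.
import json
import subprocess

job_id = "21"  # het job id from the squeue output above (illustrative)
out = subprocess.run(
    ["scontrol", "show", "job", job_id, "--json"],
    capture_output=True, text=True, check=True,
).stdout
for job in json.loads(out)["jobs"]:
    # While PENDING, job_resources is None; subscripting it raises.
    print(job["job_id"], job["job_state"], job["job_resources"])
```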
<!-- If you have a code sample, error messages, stack traces, please provide it here as well -->
## Expected behavior
Describing a job that is still `PENDING` should not raise. The scheduler should handle `job_resources` being `None` (and `scheduled_nodes`/`nodes` being empty) gracefully, e.g. by reporting the job as pending with no nodes assigned yet.
## Environment

```
(m2) ahmads@slurm-login-0:~$ python torchx/scripts/collect_env.py
Collecting environment information...
PyTorch version: 2.8.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (conda-forge gcc 14.3.0-4) 14.3.0
Clang version: 20.1.8 (https://github.com/conda-forge/clangdev-feedstock 55af0edabf37262c7a2a1caaa38f9a9b843ba87f)
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.18 | packaged by conda-forge | (main, Jun 4 2025, 14:45:41) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9454 48-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 3810.7910
CPU min MHz: 1500.0000
BogoMIPS: 5491.85
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 96 MiB (96 instances)
L3 cache: 512 MiB (16 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] torch==2.8.0
[pip3] torchx-nightly==2025.8.24
[pip3] triton==3.4.0
[conda] nccl 2.27.7.1 hfee04f2_2 conda-forge
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.8.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.10.2.21 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.3.83 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.9.90 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.3.90 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.8.93 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.7.1 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.27.3 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.8.90 pypi_0 pypi
[conda] torch 2.8.0 pypi_0 pypi
[conda] torchx-nightly 2025.8.24 pypi_0 pypi
[conda] triton 3.4.0 pypi_0 pypi
Versions of CLIs:
AWS CLI: aws-cli/2.28.10 Python/3.13.4 Linux/6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc exe/x86_64.ubuntu.22
gCloud CLI: None
AZ CLI: None
Slurm: slurm 24.11.5
Docker: None
kubectl: v1.29.9
torchx dev package versions:
backports.asyncio.runner:1.2.0
docker:7.1.0
fsspec:2025.7.0
Pygments:2.19.2
pyre-extensions:0.0.32
pytest:8.4.1
pytest-asyncio:1.1.0
pytest-timeout:2.4.0
pytest-xdist:3.8.0
requests:2.32.5
torch:2.8.0
torchx-nightly:2025.8.24
wheel:0.45.1
torchx config:
N/A
```
- torchx version (e.g. 0.1.0rc1): torchx-nightly 2025.8.24
- Python version: 3.10.18
- OS (e.g., Linux): linux
- How you installed torchx (`conda`, `pip`, source, `docker`): pip
- Docker image and tag (if using docker):
- Git commit (if installed from source):
- Execution environment (on-prem, AWS, GCP, Azure etc):
- Any other relevant information:
## Additional context
<!-- Add any other context about the problem here. -->