## 🐛 Bug
I am seeing an issue with the `slurm_scheduler` where it raises an exception at this line:

That is because `job_resources` is actually `None` at that point.
The full job is this:

```python
{'account': 'root', 'accrue_time': {'set': True, 'infinite': False, 'number': 0}, 'admin_comment': '', 'allocating_node': 'slurm-login-0', 'array_job_id': {'set': True, 'infinite': False, 'number': 0}, 'array_task_id': {'set': False, 'infinite': False, 'number': 0}, 'array_max_tasks': {'set': True, 'infinite': False, 'number': 0}, 'array_task_string': '', 'association_id': 6, 'batch_features': '', 'batch_flag': True, 'batch_host': '', 'flags': ['EXACT_CPU_COUNT_REQUESTED', 'EXACT_MEMORY_REQUESTED', 'USING_DEFAULT_ACCOUNT', 'USING_DEFAULT_PARTITION', 'USING_DEFAULT_QOS', 'USING_DEFAULT_WCKEY', 'PARTITION_ASSIGNED'], 'burst_buffer': '', 'burst_buffer_state': '', 'cluster': 'slurm', 'cluster_features': '', 'command': '/tmp/tmpvbbeyrkr/torchx-sbatch.sh', 'comment': '', 'container': '', 'container_id': '', 'contiguous': False, 'core_spec': 0, 'thread_spec': 32766, 'cores_per_socket': {'set': False, 'infinite': False, 'number': 0}, 'billable_tres': {'set': False, 'infinite': False, 'number': 0.0}, 'cpus_per_task': {'set': True, 'infinite': False, 'number': 48}, 'cpu_frequency_minimum': {'set': False, 'infinite': False, 'number': 0}, 'cpu_frequency_maximum': {'set': False, 'infinite': False, 'number': 0}, 'cpu_frequency_governor': {'set': False, 'infinite': False, 'number': 0}, 'cpus_per_tres': '', 'cron': '', 'deadline': {'set': True, 'infinite': False, 'number': 0}, 'delay_boot': {'set': True, 'infinite': False, 'number': 0}, 'dependency': '', 'derived_exit_code': {'status': ['SUCCESS'], 'return_code': {'set': True, 'infinite': False, 'number': 0}, 'signal': {'id': {'set': False, 'infinite': False, 'number': 0}, 'name': ''}}, 'eligible_time': {'set': True, 'infinite': False, 'number': 1756128930}, 'end_time': {'set': True, 'infinite': False, 'number': 0}, 'excluded_nodes': '', 'exit_code': {'status': ['SUCCESS'], 'return_code': {'set': True, 'infinite': False, 'number': 0}, 'signal': {'id': {'set': False, 'infinite': False, 'number': 0}, 'name': ''}}, 'extra': '', 'failed_node': '', 'features': '', 'federation_origin': '', 'federation_siblings_active': '', 'federation_siblings_viable': '', 'gres_detail': [], 'group_id': 2033, 'group_name': 'ahmads', 'het_job_id': {'set': True, 'infinite': False, 'number': 19}, 'het_job_id_set': '19-20', 'het_job_offset': {'set': True, 'infinite': False, 'number': 0}, 'job_id': 19, 'job_resources': None, 'job_size_str': [], 'job_state': ['PENDING'], 'last_sched_evaluation': {'set': True, 'infinite': False, 'number': 1756128930}, 'licenses': '', 'mail_type': [], 'mail_user': 'ahmads', 'max_cpus': {'set': True, 'infinite': False, 'number': 0}, 'max_nodes': {'set': True, 'infinite': False, 'number': 0}, 'mcs_label': '', 'memory_per_tres': '', 'name': 'mesh0-0', 'network': '', 'nodes': '', 'nice': 0, 'tasks_per_core': {'set': False, 'infinite': True, 'number': 0}, 'tasks_per_tres': {'set': True, 'infinite': False, 'number': 0}, 'tasks_per_node': {'set': True, 'infinite': False, 'number': 1}, 'tasks_per_socket': {'set': False, 'infinite': True, 'number': 0}, 'tasks_per_board': {'set': True, 'infinite': False, 'number': 0}, 'cpus': {'set': True, 'infinite': False, 'number': 48}, 'node_count': {'set': True, 'infinite': False, 'number': 1}, 'tasks': {'set': True, 'infinite': False, 'number': 1}, 'partition': 'all', 'prefer': '', 'memory_per_cpu': {'set': False, 'infinite': False, 'number': 0}, 'memory_per_node': {'set': True, 'infinite': False, 'number': 186777}, 'minimum_cpus_per_node': {'set': True, 'infinite': False, 'number': 48}, 'minimum_tmp_disk_per_node': 
{'set': True, 'infinite': False, 'number': 0}, 'power': {'flags': []}, 'preempt_time': {'set': True, 'infinite': False, 'number': 0}, 'preemptable_time': {'set': True, 'infinite': False, 'number': 0}, 'pre_sus_time': {'set': True, 'infinite': False, 'number': 0}, 'hold': False, 'priority': {'set': True, 'infinite': False, 'number': 1}, 'priority_by_partition': [], 'profile': ['NOT_SET'], 'qos': 'normal', 'reboot': False, 'required_nodes': '', 'required_switches': 0, 'requeue': True, 'resize_time': {'set': True, 'infinite': False, 'number': 0}, 'restart_cnt': 0, 'resv_name': '', 'scheduled_nodes': '', 'selinux_context': '', 'shared': [], 'sockets_per_board': 0, 'sockets_per_node': {'set': False, 'infinite': False, 'number': 0}, 'start_time': {'set': True, 'infinite': False, 'number': 0}, 'state_description': '', 'state_reason': 'None', 'standard_error': '/mnt/home/ahmads/monarch/examples/slurm-19.out', 'standard_input': '/dev/null', 'standard_output': '/mnt/home/ahmads/monarch/examples/slurm-19.out', 'submit_time': {'set': True, 'infinite': False, 'number': 1756128930}, 'suspend_time': {'set': True, 'infinite': False, 'number': 0}, 'system_comment': '', 'time_limit': {'set': False, 'infinite': True, 'number': 0}, 'time_minimum': {'set': True, 'infinite': False, 'number': 0}, 'threads_per_core': {'set': False, 'infinite': False, 'number': 0}, 'tres_bind': '', 'tres_freq': '', 'tres_per_job': '', 'tres_per_node': 'gres/gpu:4', 'tres_per_socket': '', 'tres_per_task': 'cpu=48', 'tres_req_str': 'cpu=48,mem=186777M,node=1,billing=48,gres/gpu=4', 'tres_alloc_str': '', 'user_id': 2033, 'user_name': 'ahmads', 'maximum_switch_wait_time': 0, 'wckey': '', 'current_working_directory': '/mnt/home/ahmads/monarch/examples'}
```

Notice that `'job_resources': None` is there.
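For illustration, any subscripting of that field while the job is pending fails with a `TypeError` (a minimal sketch of the failure mode, not necessarily the exact torchx access):

```python
# Minimal illustration: the pending job reports job_resources as None,
# so any field access on it raises.
job = {"job_resources": None}  # abbreviated pending-job entry from above
job["job_resources"]["nodes"]  # TypeError: 'NoneType' object is not subscriptable
```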
When I run `squeue` it shows:

```
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
21+1 all mesh0-1 ahmads PD 0:00 1 (None)
21+0 all mesh0-0 ahmads PD 0:00 1 (None)
```

This state lasts a few seconds, and after that the state changes to:

```
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
21+0 all mesh0-0 ahmads R 0:46 1 slurm-h100-206-061
21+1 all mesh0-1 ahmads R 0:46 1 slurm-h100-206-089
```

At that point `job_resources` is not `None` and the full job is:

```python
{'account': 'root', 'accrue_time': {'set': True, 'infinite': False, 'number': 1756128037}, 'admin_comment': '', 'allocating_node': 'slurm-login-0', 'array_job_id': {'set': True, 'infinite': False, 'number': 0}, 'array_task_id': {'set': False, 'infinite': False, 'number': 0}, 'array_max_tasks': {'set': True, 'infinite': False, 'number': 0}, 'array_task_string': '', 'association_id': 6, 'batch_features': '', 'batch_flag': True, 'batch_host': 'slurm-h100-206-089', 'flags': ['EXACT_CPU_COUNT_REQUESTED', 'ACCRUE_COUNT_CLEARED', 'JOB_WAS_RUNNING', 'EXACT_MEMORY_REQUESTED', 'USING_DEFAULT_ACCOUNT', 'USING_DEFAULT_PARTITION', 'USING_DEFAULT_QOS', 'USING_DEFAULT_WCKEY', 'PARTITION_ASSIGNED', 'BACKFILL_ATTEMPTED'], 'burst_buffer': '', 'burst_buffer_state': '', 'cluster': 'slurm', 'cluster_features': '', 'command': '/tmp/tmpfxjjtdp9/torchx-sbatch.sh', 'comment': '', 'container': '', 'container_id': '', 'contiguous': False, 'core_spec': 0, 'thread_spec': 32766, 'cores_per_socket': {'set': False, 'infinite': False, 'number': 0}, 'billable_tres': {'set': True, 'infinite': False, 'number': 48.0}, 'cpus_per_task': {'set': True, 'infinite': False, 'number': 48}, 'cpu_frequency_minimum': {'set': False, 'infinite': False, 'number': 0}, 'cpu_frequency_maximum': {'set': False, 'infinite': False, 'number': 0}, 'cpu_frequency_governor': {'set': False, 'infinite': False, 'number': 0}, 'cpus_per_tres': '', 'cron': '', 'deadline': {'set': True, 'infinite': False, 'number': 0}, 'delay_boot': {'set': True, 'infinite': False, 'number': 0}, 'dependency': '', 'derived_exit_code': {'status': ['SUCCESS'], 'return_code': {'set': True, 'infinite': False, 'number': 0}, 'signal': {'id': {'set': False, 'infinite': False, 'number': 0}, 'name': ''}}, 'eligible_time': {'set': True, 'infinite': False, 'number': 1756128037}, 'end_time': {'set': True, 'infinite': False, 'number': 1787664038}, 'excluded_nodes': '', 'exit_code': {'status': ['SUCCESS'], 'return_code': {'set': True, 'infinite': False, 'number': 0}, 'signal': {'id': {'set': False, 'infinite': False, 'number': 0}, 'name': ''}}, 'extra': '', 'failed_node': '', 'features': '', 'federation_origin': '', 'federation_siblings_active': '', 'federation_siblings_viable': '', 'gres_detail': ['gpu:h100:4(IDX:0-3)'], 'group_id': 2033, 'group_name': 'ahmads', 'het_job_id': {'set': True, 'infinite': False, 'number': 17}, 'het_job_id_set': '17-18', 'het_job_offset': {'set': True, 'infinite': False, 'number': 1}, 'job_id': 18, 'job_resources': {'select_type': ['CORE'], 'nodes': {'count': 1, 'select_type': ['ONE_ROW'], 'list': 'slurm-h100-206-089', 'whole': False, 'allocation': [{'index': 0, 'name': 'slurm-h100-206-089', 'cpus': {'count': 48, 'used': 48}, 'memory': {'used': 186777, 'allocated': 186777}, 'sockets': [{'index': 0, 'cores': [{'index': 0, 'status': ['ALLOCATED']}, {'index': 1, 'status': ['ALLOCATED']}, {'index': 2, 'status': ['ALLOCATED']}, {'index': 3, 'status': ['ALLOCATED']}, {'index': 4, 'status': ['ALLOCATED']}, {'index': 5, 'status': ['ALLOCATED']}, {'index': 6, 'status': ['ALLOCATED']}, {'index': 7, 'status': ['ALLOCATED']}, {'index': 8, 'status': ['ALLOCATED']}, {'index': 9, 'status': ['ALLOCATED']}, {'index': 10, 'status': ['ALLOCATED']}, {'index': 11, 'status': ['ALLOCATED']}, {'index': 12, 'status': ['ALLOCATED']}, {'index': 13, 'status': ['ALLOCATED']}, {'index': 14, 'status': ['ALLOCATED']}, {'index': 15, 'status': ['ALLOCATED']}, {'index': 16, 'status': ['ALLOCATED']}, {'index': 17, 'status': ['ALLOCATED']}, {'index': 18, 'status': ['ALLOCATED']}, {'index': 19, 
'status': ['ALLOCATED']}, {'index': 20, 'status': ['ALLOCATED']}, {'index': 21, 'status': ['ALLOCATED']}, {'index': 22, 'status': ['ALLOCATED']}, {'index': 23, 'status': ['ALLOCATED']}, {'index': 24, 'status': ['UNALLOCATED']}, {'index': 25, 'status': ['UNALLOCATED']}, {'index': 26, 'status': ['UNALLOCATED']}, {'index': 27, 'status': ['UNALLOCATED']}, {'index': 28, 'status': ['UNALLOCATED']}, {'index': 29, 'status': ['UNALLOCATED']}, {'index': 30, 'status': ['UNALLOCATED']}, {'index': 31, 'status': ['UNALLOCATED']}]}, {'index': 1, 'cores': [{'index': 0, 'status': ['UNALLOCATED']}, {'index': 1, 'status': ['UNALLOCATED']}, {'index': 2, 'status': ['UNALLOCATED']}, {'index': 3, 'status': ['UNALLOCATED']}, {'index': 4, 'status': ['UNALLOCATED']}, {'index': 5, 'status': ['UNALLOCATED']}, {'index': 6, 'status': ['UNALLOCATED']}, {'index': 7, 'status': ['UNALLOCATED']}, {'index': 8, 'status': ['UNALLOCATED']}, {'index': 9, 'status': ['UNALLOCATED']}, {'index': 10, 'status': ['UNALLOCATED']}, {'index': 11, 'status': ['UNALLOCATED']}, {'index': 12, 'status': ['UNALLOCATED']}, {'index': 13, 'status': ['UNALLOCATED']}, {'index': 14, 'status': ['UNALLOCATED']}, {'index': 15, 'status': ['UNALLOCATED']}, {'index': 16, 'status': ['UNALLOCATED']}, {'index': 17, 'status': ['UNALLOCATED']}, {'index': 18, 'status': ['UNALLOCATED']}, {'index': 19, 'status': ['UNALLOCATED']}, {'index': 20, 'status': ['UNALLOCATED']}, {'index': 21, 'status': ['UNALLOCATED']}, {'index': 22, 'status': ['UNALLOCATED']}, {'index': 23, 'status': ['UNALLOCATED']}, {'index': 24, 'status': ['UNALLOCATED']}, {'index': 25, 'status': ['UNALLOCATED']}, {'index': 26, 'status': ['UNALLOCATED']}, {'index': 27, 'status': ['UNALLOCATED']}, {'index': 28, 'status': ['UNALLOCATED']}, {'index': 29, 'status': ['UNALLOCATED']}, {'index': 30, 'status': ['UNALLOCATED']}, {'index': 31, 'status': ['UNALLOCATED']}]}]}]}, 'cpus': 48, 'threads_per_core': {'set': False, 'infinite': False, 'number': 0}}, 'job_size_str': [], 'job_state': ['RUNNING'], 'last_sched_evaluation': {'set': True, 'infinite': False, 'number': 1756128038}, 'licenses': '', 'mail_type': [], 'mail_user': 'ahmads', 'max_cpus': {'set': True, 'infinite': False, 'number': 0}, 'max_nodes': {'set': True, 'infinite': False, 'number': 0}, 'mcs_label': '', 'memory_per_tres': '', 'name': 'mesh0-1', 'network': '', 'nodes': 'slurm-h100-206-089', 'nice': 0, 'tasks_per_core': {'set': False, 'infinite': True, 'number': 0}, 'tasks_per_tres': {'set': True, 'infinite': False, 'number': 0}, 'tasks_per_node': {'set': True, 'infinite': False, 'number': 1}, 'tasks_per_socket': {'set': False, 'infinite': True, 'number': 0}, 'tasks_per_board': {'set': True, 'infinite': False, 'number': 0}, 'cpus': {'set': True, 'infinite': False, 'number': 48}, 'node_count': {'set': True, 'infinite': False, 'number': 1}, 'tasks': {'set': True, 'infinite': False, 'number': 1}, 'partition': 'all', 'prefer': '', 'memory_per_cpu': {'set': False, 'infinite': False, 'number': 0}, 'memory_per_node': {'set': True, 'infinite': False, 'number': 186777}, 'minimum_cpus_per_node': {'set': True, 'infinite': False, 'number': 48}, 'minimum_tmp_disk_per_node': {'set': True, 'infinite': False, 'number': 0}, 'power': {'flags': []}, 'preempt_time': {'set': True, 'infinite': False, 'number': 0}, 'preemptable_time': {'set': True, 'infinite': False, 'number': 1756128038}, 'pre_sus_time': {'set': True, 'infinite': False, 'number': 0}, 'hold': False, 'priority': {'set': True, 'infinite': False, 'number': 1}, 'priority_by_partition': [], 'profile': 
['NOT_SET'], 'qos': 'normal', 'reboot': False, 'required_nodes': '', 'required_switches': 0, 'requeue': True, 'resize_time': {'set': True, 'infinite': False, 'number': 0}, 'restart_cnt': 0, 'resv_name': '', 'scheduled_nodes': '', 'selinux_context': '', 'shared': [], 'sockets_per_board': 0, 'sockets_per_node': {'set': False, 'infinite': False, 'number': 0}, 'start_time': {'set': True, 'infinite': False, 'number': 1756128038}, 'state_description': '', 'state_reason': 'None', 'standard_error': '/mnt/home/ahmads/monarch/examples/slurm-18.out', 'standard_input': '/dev/null', 'standard_output': '/mnt/home/ahmads/monarch/examples/slurm-18.out', 'submit_time': {'set': True, 'infinite': False, 'number': 1756128037}, 'suspend_time': {'set': True, 'infinite': False, 'number': 0}, 'system_comment': '', 'time_limit': {'set': False, 'infinite': True, 'number': 0}, 'time_minimum': {'set': True, 'infinite': False, 'number': 0}, 'threads_per_core': {'set': False, 'infinite': False, 'number': 0}, 'tres_bind': '', 'tres_freq': '', 'tres_per_job': '', 'tres_per_node': 'gres/gpu:4', 'tres_per_socket': '', 'tres_per_task': 'cpu=48', 'tres_req_str': 'cpu=48,mem=186777M,node=1,billing=48,gres/gpu=4', 'tres_alloc_str': 'cpu=48,mem=186777M,node=1,billing=48,gres/gpu=4', 'user_id': 2033, 'user_name': 'ahmads', 'maximum_switch_wait_time': 0, 'wckey': '', 'current_working_directory': '/mnt/home/ahmads/monarch/examples'}
```

So we probably need to check that `job_resources is not None`. However, that alone is not sufficient, because `scheduled_nodes` is also empty while the job is pending, so there is no other field to read the node list from yet; see the sketch below.
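A minimal sketch of the kind of guard I mean, assuming a helper shaped roughly like the scheduler's describe path (`nodes_for_job` is an illustrative name, not the actual torchx code):

```python
# Hypothetical guard for the slurm scheduler's describe path.
# `job` is one decoded job entry as shown in the dumps above.
def nodes_for_job(job: dict) -> str:
    resources = job.get("job_resources")
    if resources is None:
        # Job is still PENDING: no allocation yet, and scheduled_nodes
        # is empty too, so report "no nodes" instead of raising.
        return ""
    # Job is RUNNING: the node list lives at job_resources["nodes"]["list"].
    return resources.get("nodes", {}).get("list", "")
```

With a guard like this, a pending job would simply show no nodes until Slurm fills in `job_resources`.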
The Slurm version is:

```
$ sinfo --version
slurm 24.11.5
```
Module (check all that apply):
* [ ] `torchx.spec`
* [ ] `torchx.component`
* [ ] `torchx.apps`
* [ ] `torchx.runtime`
* [ ] `torchx.cli`
* [x] `torchx.schedulers`
* [ ] `torchx.pipelines`
* [ ] `torchx.aws`
* [ ] `torchx.examples`
* [ ] `other`
## To Reproduce
Steps to reproduce the behavior:
1. Submit a job through the `slurm` scheduler (here, a heterogeneous two-replica job, `mesh0-0` and `mesh0-1`).
1. Query the job's status while it is still `PENDING` (see the repro sketch below).
1. The scheduler raises an exception because `job_resources` is `None` in the job description.
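A minimal repro sketch of step 2, assuming `scontrol show job --json` (the job id and the direct `scontrol` call are illustrative; the JSON mirrors the job dicts the scheduler parses):

```python
# Poll a freshly submitted het job while it is still PENDING.
import json
import subprocess

job_id = "21"  # het job id from the squeue output above (illustrative)
out = subprocess.run(
    ["scontrol", "show", "job", job_id, "--json"],
    capture_output=True, text=True, check=True,
).stdout
for job in json.loads(out)["jobs"]:
    # While PENDING, job_resources is None; subscripting it raises.
    print(job["job_id"], job["job_state"], job["job_resources"])
```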
<!-- If you have a code sample, error messages, stack traces, please provide it here as well -->
## Expected behavior
Describing a job that is still `PENDING` should not raise. The scheduler should handle `job_resources` being `None` (and `scheduled_nodes`/`nodes` being empty) gracefully, e.g. by reporting the job as pending with no nodes assigned yet.
## Environment

```
(m2) ahmads@slurm-login-0:~$ python torchx/scripts/collect_env.py
Collecting environment information...
PyTorch version: 2.8.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (conda-forge gcc 14.3.0-4) 14.3.0
Clang version: 20.1.8 (https://github.com/conda-forge/clangdev-feedstock 55af0edabf37262c7a2a1caaa38f9a9b843ba87f)
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.18 | packaged by conda-forge | (main, Jun 4 2025, 14:45:41) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9454 48-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 3810.7910
CPU min MHz: 1500.0000
BogoMIPS: 5491.85
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 96 MiB (96 instances)
L3 cache: 512 MiB (16 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] torch==2.8.0
[pip3] torchx-nightly==2025.8.24
[pip3] triton==3.4.0
[conda] nccl 2.27.7.1 hfee04f2_2 conda-forge
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.8.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.10.2.21 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.3.83 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.9.90 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.3.90 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.8.93 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.7.1 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.27.3 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.8.90 pypi_0 pypi
[conda] torch 2.8.0 pypi_0 pypi
[conda] torchx-nightly 2025.8.24 pypi_0 pypi
[conda] triton 3.4.0 pypi_0 pypi
Versions of CLIs:
AWS CLI: aws-cli/2.28.10 Python/3.13.4 Linux/6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc exe/x86_64.ubuntu.22
gCloud CLI: None
AZ CLI: None
Slurm: slurm 24.11.5
Docker: None
kubectl: v1.29.9
torchx dev package versions:
backports.asyncio.runner:1.2.0
docker:7.1.0
fsspec:2025.7.0
Pygments:2.19.2
pyre-extensions:0.0.32
pytest:8.4.1
pytest-asyncio:1.1.0
pytest-timeout:2.4.0
pytest-xdist:3.8.0
requests:2.32.5
torch:2.8.0
torchx-nightly:2025.8.24
wheel:0.45.1
torchx config:
N/A
```
- torchx version (e.g. 0.1.0rc1): torchx-nightly 2025.8.24
- Python version: 3.10.18
- OS (e.g., Linux): linux
- How you installed torchx (`conda`, `pip`, source, `docker`): pip
- Docker image and tag (if using docker):
- Git commit (if installed from source):
- Execution environment (on-prem, AWS, GCP, Azure etc):
- Any other relevant information:
## Additional context
<!-- Add any other context about the problem here. -->