'Invalid Trackable RESource (TRES) specification' error #68

Open
kimchitsigai opened this issue Apr 26, 2022 · 3 comments

@kimchitsigai

Hi,

I'm using slurm-drmaa to submit a job and I get the error below:

d #89f27 [     0.00]  * # Native specification:  --time=1:00:00 --ntasks=1 --gres=gpu:1 --cpus-per-task=2 --nodes=1 --account=xxx@yyy --partition=mypartition

t #89f27 [     0.00] -> slurmdrmaa_parse_native

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr

d #89f27 [     0.00]  * # time_limit = 1:00:00

t #89f27 [     0.00] -> slurmdrmaa_datetime_parse(1:00:00)

d #89f27 [     0.00]  * parsed: 0000-00-00 01:00:00 +00:00:00 [---hms-]

t #89f27 [     0.00] <- slurmdrmaa_datetime_parse(1:00:00) =60 minutes
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * # ntasks = 1
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * # gres = gpu:1
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * # cpus_per_task = 2
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * nodes: 1 ->
d #89f27 [     0.00]  * # min_nodes = 1
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * # account = xxx@yyy
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * # partition = allgpus
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

d #89f27 [     0.00]  * finalizing job constraints
d #89f27 [     0.00]  * set min_cpus to ntasks*cpus_per_task: 2
t #89f27 [     0.00] <- slurmdrmaa_parse_native
E #89f27 [     4.24]  * fsd_exc_new(1016,slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification,1)

t #89f27 [     4.24] -> slurmdrmaa_free_job_desc
t #89f27 [     4.24] <- slurmdrmaa_free_job_desc

t #89f27 [     4.24] <- drmaa_run_job=17: slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification

Traceback (most recent call last):
  ..
  File "/.../python3.6/site-packages/drmaa/session.py", line 314, in runJob
    c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
  File "/.../python3.6/site-packages/drmaa/helpers.py", line 302, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/.../site-packages/drmaa/errors.py", line 151, in error_check
    raise _ERRORS[code - 1](error_string)
drmaa.errors.DeniedByDrmException: code 17: slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification

The same job without `--gres=gpu:1` works fine.
The slurm-drmaa version is 1.1.3 and the Slurm version is 21.08.6. The OS is RHEL 8.4.

Any hint would be greatly appreciated,
Kimchi

@kimchitsigai
Author

kimchitsigai commented Apr 28, 2022

Hi,

I've added a logging instruction in slurm_drmaa/session.c:
fsd_log_debug(("job_desc.tres_per_node = %s", job_desc.tres_per_node));
and it produced this output:
job_desc.tres_per_node = gpu:1
Finally, I still get the 'Invalid Trackable RESource (TRES) specification' error.

By changing the input from --gres=gpu:1 to --gres=gres:gpu:1, the logging instruction outputs:
job_desc.tres_per_node = gres:gpu:1
as expected. The TRES error disappeared and the job was correctly submitted.
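Until slurm-drmaa handles this itself, the workaround above can be applied on the caller's side before the native specification is handed to DRMAA. The following is a minimal sketch, not part of slurm-drmaa or drmaa-python: the helper name `add_gres_prefix` is hypothetical, and it assumes the option is written in the `--gres=VALUE` form with a single GRES entry, as in the log above.

```python
import re

# Matches "--gres=VALUE" only when VALUE does not already start with "gres:".
_GRES_OPT = re.compile(r"(--gres=)(?!gres:)(\S+)")


def add_gres_prefix(native_spec):
    """Return native_spec with "gres:" prepended to the --gres value.

    Hypothetical helper: it only rewrites the string and leaves every
    other option untouched. Already-prefixed values are not changed.
    """
    return _GRES_OPT.sub(r"\1gres:\2", native_spec)


spec = "--time=1:00:00 --ntasks=1 --gres=gpu:1 --cpus-per-task=2 --nodes=1"
fixed_spec = add_gres_prefix(spec)
print(fixed_spec)
```

With drmaa-python, the result would then be assigned to the job template's `nativeSpecification` attribute before calling `runJob`.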

Looking at the code in https://github.com/SchedMD/slurm/blob/master/src/common/gres.c and https://github.com/SchedMD/slurm/blob/master/src/common/slurm_opt.c, it seems that Slurm now expects GPU resource requests to be formatted as gres:gpu:1 rather than gpu:1, as was accepted in previous versions (20.x) of Slurm.

Shouldn't slurm-drmaa be updated to take into account this change in Slurm?

Best,
Kimchi

@scholtalbers

I can confirm that changing --gres=gpu:1 to --gres=gres:gpu:1 solves the TRES error.

@natefoo
Owner

natefoo commented Sep 13, 2023

Thanks, yes, it sounds like we should probably prepend gres: to the string in the job template. gres in the job template was generalized to tres_per_node in 18.08, but I am not sure whether that is when it broke.
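
The normalization suggested here can be modeled as follows. This is only an illustration of the logic in Python; the real change would live in slurm-drmaa's C code, and no such fix is shown in this thread. The function name `to_tres_per_node` is hypothetical, and treating the value as a comma-separated list in which every entry gets its own prefix is an assumption, not something confirmed above.

```python
def to_tres_per_node(gres_value):
    """Model of the proposed fix: build a tres_per_node string from a gres value.

    Assumption: entries are comma-separated and each one needs a "gres:"
    prefix. Entries that already carry the prefix are left as they are,
    so the function is safe to apply to input using the workaround.
    """
    entries = [e.strip() for e in gres_value.split(",") if e.strip()]
    return ",".join(
        e if e.startswith("gres:") else "gres:" + e
        for e in entries
    )


print(to_tres_per_node("gpu:1"))
print(to_tres_per_node("gres:gpu:1"))
```

Keeping the step idempotent matters because users who have already switched to `--gres=gres:gpu:1` should not end up with a doubled prefix after an upgrade.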
