
Fails with Slurm 18.08.8 #32

Closed
pdblood opened this issue Sep 9, 2019 · 5 comments

@pdblood commented Sep 9, 2019

Testing with the drmaa-run utility, I find that slurm-drmaa fails with the 18.08.8 release of Slurm, but the exact same procedure works fine with 18.08.7. With 18.08.8 it fails at the job run step:

E #2af1 [     0.77]  * fsd_exc_new(1001,slurm_submit_batch_job error (-1): Unspecified error,1)
t #2af1 [     0.77] -> slurmdrmaa_free_job_desc
t #2af1 [     0.77] <- slurmdrmaa_free_job_desc
t #2af1 [     0.77] <- drmaa_run_job=1: slurm_submit_batch_job error (-1): Unspecified error
F #2af1 [     0.77]  * Failed to submit a job: slurm_submit_batch_job error (-1): Unspecified error

Corresponding to this part of the drmaa-run code:

        /* run */
        if (api.run_job(jobid, sizeof(jobid) - 1, jt, errbuf, sizeof(errbuf) - 1) != DRMAA_ERRNO_SUCCESS) {
                fsd_log_fatal(("Failed to submit a job: %s ", errbuf));
                exit(2); /* TODO exception */

Slurm 18.08.8 addresses a security vulnerability that exists in prior versions of Slurm.
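
For context, the `api.run_job` call in the snippet above appears to resolve to libdrmaa's `drmaa_run_job()`, and the trace shows the error string coming from `slurm_submit_batch_job()` inside slurm-drmaa. Below is a minimal standalone sketch of that same submission path, not drmaa-run itself: it links directly against libdrmaa instead of loading it at runtime, and `/bin/true` is just a placeholder command.

```c
/*
 * Minimal sketch (not drmaa-run) of the submission path traced above.
 * Build with something like: cc test.c -ldrmaa
 */
#include <stdio.h>
#include <drmaa.h>

int main(void)
{
        char err[DRMAA_ERROR_STRING_BUFFER];
        char jobid[DRMAA_JOBNAME_BUFFER];
        drmaa_job_template_t *jt = NULL;

        if (drmaa_init(NULL, err, sizeof(err) - 1) != DRMAA_ERRNO_SUCCESS) {
                fprintf(stderr, "drmaa_init failed: %s\n", err);
                return 1;
        }
        drmaa_allocate_job_template(&jt, err, sizeof(err) - 1);
        drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/bin/true",
                            err, sizeof(err) - 1);

        /* drmaa_run_job() is where slurm-drmaa reports
         * "slurm_submit_batch_job error (-1)" and copies the message
         * into the caller's error buffer (errbuf in drmaa-run, err here). */
        if (drmaa_run_job(jobid, sizeof(jobid) - 1, jt,
                          err, sizeof(err) - 1) != DRMAA_ERRNO_SUCCESS) {
                fprintf(stderr, "Failed to submit a job: %s\n", err);
                return 2;
        }
        printf("submitted job %s\n", jobid);

        drmaa_delete_job_template(jt, err, sizeof(err) - 1);
        drmaa_exit(err, sizeof(err) - 1);
        return 0;
}
```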

@natefoo (Owner) commented Sep 13, 2019

I've just tried reproducing this with 18.08.8 and it worked for me. Can you include the debug log leading up to the exception?

@natefoo (Owner) commented Sep 13, 2019

Never mind, I see I have a bit more detail in your email:

d #2af1 [     0.00]  * # Setting defaults for tasks and processors
d #2af1 [     0.00]  * # Native specification: -A pscstaff -p RM-small
t #2af1 [     0.00] -> slurmdrmaa_parse_native
d #2af1 [     0.00]  * # account = pscstaff
d #2af1 [     0.00]  * # partition = RM-small
d #2af1 [     0.00]  * finalizing job constraints
d #2af1 [     0.00]  * set min_cpus to ntasks: 1
t #2af1 [     0.00] <- slurmdrmaa_parse_native
E #2af1 [     0.77]  * fsd_exc_new(1001,slurm_submit_batch_job error (-1): Unspecified error,1)
t #2af1 [     0.77] -> slurmdrmaa_free_job_desc
t #2af1 [     0.77] <- slurmdrmaa_free_job_desc
t #2af1 [     0.77] <- drmaa_run_job=1: slurm_submit_batch_job error (-1): Unspecified error
F #2af1 [     0.77]  * Failed to submit a job: slurm_submit_batch_job error (-1): Unspecified error

This could be an issue with the native spec; I'll have a look at that.
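
For reference, the `-A pscstaff -p RM-small` options in that log are handed to slurm-drmaa as the DRMAA native specification and parsed by `slurmdrmaa_parse_native`. Reusing the job template `jt` and error buffer `err` from the sketch in the earlier comment, a client would pass them roughly like this (illustrative only, not a suggested cause):

```c
/* Sketch: pass the sbatch-style options from the log as the DRMAA native
 * specification, which slurm-drmaa parses in slurmdrmaa_parse_native().
 * jt and err are from the earlier sketch. */
drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION,
                    "-A pscstaff -p RM-small", err, sizeof(err) - 1);
```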

@pdblood (Author) commented Sep 14, 2019

It turns out this error was caused by a site configuration that requires a job name to be specified. For jobs submitted via sbatch, the script name is used when no job name is given, so the problem never surfaced there. Once the admin changed job_script.lua to handle nil values for the job name, the drmaa-run tests started working with Slurm 18.08.8. This did not fix my related issue with submitting jobs from Galaxy via slurm-drmaa, but drmaa-run now works as expected with Slurm 18.08.8.
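
This is not what the admin did (the actual fix was server-side, in job_script.lua), but as a hypothetical client-side workaround a DRMAA submitter can also avoid a nil name by always setting the standard job-name attribute explicitly. Continuing the sketch above, with `jt` and `err` as before and an arbitrary placeholder name:

```c
/* Hypothetical client-side workaround sketch: set an explicit job name so a
 * site job_submit plugin never sees a nil name. The name string is an
 * arbitrary placeholder; jt and err are from the earlier sketch. */
drmaa_set_attribute(jt, DRMAA_JOB_NAME, "drmaa_run_test", err, sizeof(err) - 1);
```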

@pdblood (Author) commented Sep 17, 2019

Closing this issue: the failure appears to have been due to a configuration detail in job_script.lua on the system running Slurm 18.08.8 that differed from the system I tested with Slurm 18.08.7, which led me to believe there was an incompatibility with Slurm 18.08.8. After further testing with drmaa-run, Slurm 18.08.8 works as expected.

@pdblood closed this Sep 17, 2019
@natefoo (Owner) commented Sep 18, 2019

Thanks for the update. I'd tried with Python drmaa and couldn't get it to fail, so it's good to know what the issue was.
