
Problems with using --srun-args "--async -o ${work_dir} " on Slurm #194

Open
GhaSiKey opened this issue Jan 13, 2023 · 2 comments

GhaSiKey commented Jan 13, 2023

I tried to use mim to submit training tasks asynchronously on Slurm, using the following command:
mim train mmcls resnet101_b16x8_cifar10.py --launcher slurm --gpus 1 --gpus-per-node 1 --partition aide_dev --work-dir tmp --srun-args "--async -o /mnt/petrelfs/gaoshiqi/"
To submit asynchronously on Slurm and redirect the log to /mnt/petrelfs/gaoshiqi/, I added the parameter --srun-args "--async -o /mnt/petrelfs/gaoshiqi/".
However, the command runs successfully, yet the task is never submitted to the Slurm cluster, and I can't find my log /mnt/petrelfs/gaoshiqi/phoenix-slurm-5181985.out.
The log is as follows:
[screenshot: output of the mim train command]
I tried to find my log, but it does not exist:
[screenshot: no phoenix-slurm-5181985.out under /mnt/petrelfs/gaoshiqi/]
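For context, my understanding is that mim assembles an srun command similar to mmcv's slurm_train.sh and splices the --srun-args string into it. The following is only a sketch of what I expect it to run, not the exact command mim generates: the job name and the train.py entry point are placeholders, and --async / -o are the extra flags from my --srun-args (where --async comes from our cluster's srun wrapper rather than stock Slurm):

```bash
# Sketch of the kind of srun command mim builds for --launcher slurm
# (modeled on mmcv's slurm_train.sh; the job name is a hypothetical
#  placeholder that mim would otherwise auto-generate, and the final
#  training command is abbreviated).
srun -p aide_dev \
    --job-name=resnet101_cifar10 \
    --gres=gpu:1 \
    --ntasks=1 \
    --ntasks-per-node=1 \
    --kill-on-bad-exit=1 \
    --async -o /mnt/petrelfs/gaoshiqi/ \
    python -u train.py resnet101_b16x8_cifar10.py --work-dir tmp --launcher slurm
```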

GhaSiKey (Author) commented

I found the cause of the problem: when submitting asynchronously, a batch script is automatically generated, and there is a problem with its content.
[screenshots: the generated batch script showing misplaced content]
If the job-name parameter is not supplied, mim generates one automatically, which misplaces the content of the batch script and causes the job submission to fail.
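If the auto-generated job name is what breaks the batch script, a possible workaround (untested) is to pass an explicit job name through --srun-args, for example:

```bash
# --job-name is the standard srun option; the name itself is arbitrary.
mim train mmcls resnet101_b16x8_cifar10.py \
    --launcher slurm --gpus 1 --gpus-per-node 1 \
    --partition aide_dev --work-dir tmp \
    --srun-args "--async -o /mnt/petrelfs/gaoshiqi/ --job-name=resnet101_cifar10"
```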

ice-tong (Collaborator) commented

It seems to be a bug in srun:
[screenshot: srun output illustrating the suspected bug]
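To isolate it from mim, a minimal check along these lines (a sketch; --async and the output directory come from your site-specific setup) could show whether srun itself mishandles these flags when combined with a job name:

```bash
# Run a trivial command asynchronously with the same flags and check
# whether the batch script and log file are generated correctly.
srun -p aide_dev --job-name=async_test --async -o /mnt/petrelfs/gaoshiqi/ \
    python -c "print('async srun check')"
```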
