Deadline-Aware AI Training Job Scheduler for Heterogeneous Clusters
The scheduler must be run on SIST's AI Cluster, using the partition named critical.
cd slurm-cli
- Initialize the database with
python jobs.py
and check that it prints 数据库创建成功 ("Database created successfully").
- Then run the CLI on a Slurm script:
python slurm-cli.py -i <path of slurm script>
The monitor should always be running:
python monitor.py
cd slurm-cli
python eva_base.py
cd slurm-cli
python eva_dash.py
Pay attention to the hard-coded paths in the above two files.
All prerequisites are the sbatch files in test_generator.
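One way to find the paths that need editing is to scan both scripts for quoted absolute paths. The following is a minimal sketch, not part of the repository: the function name find_hardcoded_paths and the regular expression are assumptions, and it only catches string literals that start with a slash.

```python
import re
from pathlib import Path

# Assumption: a hard-coded path appears as a quoted string starting with "/".
ABS_PATH_LITERAL = re.compile(r"""(['"])(/[^'"\s]+)\1""")


def find_hardcoded_paths(script):
    """Return (line_number, path) pairs for quoted absolute paths in a script."""
    hits = []
    text = Path(script).read_text(encoding="utf-8")
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ABS_PATH_LITERAL.finditer(line):
            hits.append((lineno, match.group(2)))
    return hits
```

Calling find_hardcoded_paths("eva_base.py") and find_hardcoded_paths("eva_dash.py") from inside slurm-cli would list the candidate lines to review. Paths built by string concatenation or written without a leading slash are not detected.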
- If you see
Lock!
when executing a command, delete
slurm-cli/database.lock
- This happens if a previous run exited unexpectedly.
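The stale-lock cleanup described above can also be scripted. This is a minimal sketch, assuming the lock is the plain file slurm-cli/database.lock and that no scheduler process is still using the database; the function name remove_stale_lock is hypothetical.

```python
from pathlib import Path


def remove_stale_lock(lock_path="slurm-cli/database.lock"):
    """Delete the lock file if present; return True if a file was removed."""
    lock = Path(lock_path)
    if lock.is_file():
        lock.unlink()
        return True
    # Nothing to clean up: the lock file does not exist.
    return False
```

Run it from the repository root so the default relative path resolves, or pass the full path to the lock file explicitly.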
You may run into problems such as lacking privileges on the AI cluster, being unable to run the PyTorch benchmark, or missing paths. If you hit any of these errors, contact qinfr@shanghaitech.edu.cn.