This little tool helps in scheduling, tracking and aggregating calculations and their results. It forms the step that brings you from 'a directory with working code for a job' to 'running dozens of jobs and getting results easily'.
Install it with pip:

    pip install fenpei
This is intended for running multiple intensive computations on a (Linux) cluster. At present, it assumes a shared file system on the cluster.
It takes a bit of work to integrate with your situation, but it is very flexible and should make your life easier once it is set up. Some features:
- Jobs are defined in Python files, which keeps job definitions short and extremely flexible.
- It uses a command line interface (some shell experience required) to easily start, stop or monitor jobs.
- Easy to use with existing code and easily reproducible, since it works by creating isolated job directories.
- Can replace scheduling queue functionality and start jobs through ssh, or can work with existing systems (slurm and qsub included, others implementable).
- Flexibility for caching, preparation and result extraction.
- Uses multi-processing and can easily use caching for greater performance, and symlinks to save space.
Note that:
- You will have to write Python code for your specific job, as well as any analysis or visualization for the extracted data.
- Except in status monitoring mode, it derives the state afresh on each run; it does not keep a database that can become outdated or corrupted.
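To picture what deriving the state on each run can look like, here is a minimal sketch. It is not Fenpei's actual implementation: the `derive_status` helper and the marker file names (`started.txt`, `result.json`) are hypothetical and chosen only for illustration. The point is that the status is a pure function of what is currently on disk, so there is nothing to get out of sync.

```python
import tempfile
from pathlib import Path


def derive_status(job_dir):
    """Infer a job's status from the files in its directory.

    The file names used here are hypothetical, not Fenpei's real layout.
    """
    job_dir = Path(job_dir)
    if not job_dir.is_dir():
        return 'not prepared'
    if (job_dir / 'result.json').exists():
        return 'completed'
    if (job_dir / 'started.txt').exists():
        return 'running or crashed'
    return 'prepared'


# Walk one job through its life cycle and record the derived status each time.
with tempfile.TemporaryDirectory() as tmp:
    job_dir = Path(tmp) / 'a0.01_b0'
    statuses = [derive_status(job_dir)]
    job_dir.mkdir()
    statuses.append(derive_status(job_dir))
    (job_dir / 'started.txt').write_text('started')
    statuses.append(derive_status(job_dir))
    (job_dir / 'result.json').write_text('{}')
    statuses.append(derive_status(job_dir))

print(statuses)
```

Telling a running job from a crashed one needs more information than this sketch uses (for example, asking the scheduler), which is presumably where hooks such as is_complete and crash_reason come in.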
One way to run reproducible jobs with Fenpei (there are many):
- Make a script that runs your code from source to completion for one set of parameters.
- Subclass the ShJobSingle job and add all the files that you need in get_nosub_files.
- Replace all the parameters in the run script and other config files by {{ some_param_name }}. Add these files to get_sub_files.
- Make a Python file (example below) for each analysis you want to run, and fill in every some_param_name with the appropriate value.
- From a shell, use python your_jobfile.py -s to see the status, then use other flags for more functionality (see below).
- Implement is_complete and result in your job (and crash_reason if you want to use -t). Other methods can be overridden too, if you require special behaviour.
- Add analysis code to your job file if you want to visualize the results.
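The parameter substitution step above can be sketched as follows. This is an illustration of the idea only, not Fenpei's own substitution code: the `substitute` helper, the regular expression and the `./simulate` command are assumptions made for the example. Each `{{ some_param_name }}` placeholder in a file is replaced by the matching value from the job's parameters.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r'\{\{\s*(\w+)\s*\}\}')


def substitute(template, subs):
    """Replace every {{ name }} in template with str(subs[name])."""
    def replace(match):
        name = match.group(1)
        if name not in subs:
            # Fail loudly instead of leaving a placeholder in a run script.
            raise KeyError('no value supplied for parameter {0!r}'.format(name))
        return str(subs[name])
    return PLACEHOLDER.sub(replace, template)


# Hypothetical run script; './simulate' stands in for your own program.
run_script = './simulate --alpha {{ alpha }} --beta {{beta}} --gamma {{ gamma }}\n'
filled = substitute(run_script, dict(alpha=0.01, beta=3, gamma=5, delta='yes'))
print(filled)
```

Unused parameters (here `delta`) are simply ignored in this sketch, while a placeholder without a value raises an error.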
Example file to generate jobs:
    from os.path import basename, splitext

    # create_jobs, SlurmQueue and your own job class (ShefJob here) must be
    # imported as well; their import paths are not shown in this example.


    def generate_jobs():
        # Yield one job specification per (alpha, beta) combination.
        for alpha in [0.01, 0.10, 1.00]:
            for beta in range(0, 41):
                yield dict(name='a{0:.2f}_b{1:d}'.format(alpha, beta), subs=dict(
                    alpha=alpha,
                    beta=beta,
                    gamma=5,
                    delta='yes'
                ), use_symlink=True)


    def analyze(queue):
        results = queue.compare_results(('J', 'init_vib', 'init_rot',))
        # You now have the results for all jobs, indexed by the above three parameters.
        # Visualization is up to you, and will be run when the user adds -x


    if __name__ == '__main__':
        jobs = create_jobs(JobCls=ShefJob, generator=generate_jobs(),
            default_batch=splitext(basename(__file__))[0])
        queue = SlurmQueue(partition='example', jobs=jobs, summary_func=analyze)
        queue.run_argv()
This file registers many jobs for combinations of alpha and beta parameters. You can now use the command line:
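To see exactly what gets registered, the sketch below repeats the generator from the example on its own, written with `yield` and without any Fenpei imports, and inspects the job specifications it produces. The `specs` and `names` variables are just illustrative names; what Fenpei does with each dictionary afterwards is not shown here.

```python
def generate_jobs():
    # Same parameter grid as the example: 3 alpha values x 41 beta values.
    for alpha in [0.01, 0.10, 1.00]:
        for beta in range(0, 41):
            yield dict(name='a{0:.2f}_b{1:d}'.format(alpha, beta),
                       subs=dict(alpha=alpha, beta=beta, gamma=5, delta='yes'),
                       use_symlink=True)


specs = list(generate_jobs())
names = [spec['name'] for spec in specs]

print(len(specs))                       # 123
print(names[0], names[41], names[-1])   # a0.01_b0 a0.10_b0 a1.00_b40

# Job names double as identifiers, so they should be unique.
assert len(set(names)) == len(names)
```

Encoding the varying parameters in the job name, as done here, keeps each job's name unique and makes it easy to pick specific jobs by name on the command line.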
    usage: results.py [-h] [-v] [-f] [-e] [-a] [-d] [-l] [-p] [-c] [-w WEIGHT]
                      [-q LIMIT] [-k] [-r] [-g] [-s] [-m] [-x] [-t] [-j]
                      [--jobs JOBS] [--cmd ACTIONS]

    distribute jobs over available nodes

    optional arguments:
      -h, --help            show this help message and exit
      -v, --verbose         more information (can be used multiple times, -vv)
      -f, --force           force certain mistake-sensitive steps instead of
                            failing with a warning
      -e, --restart         with this, start and cleanup ignore complete
                            (/running) jobs
      -a, --availability    list all available nodes and their load (cache reload)
      -d, --distribute      distribute the jobs over available nodes
      -l, --list            show a list of added jobs
      -p, --prepare         prepare all the jobs
      -c, --calc            start calculating one jobs, or see -z/-w/-q
      -w WEIGHT, --weight WEIGHT
                            -c will start jobs with total WEIGHT running
      -q LIMIT, --limit LIMIT
                            -c will add jobs until a total LIMIT running
      -k, --kill            terminate the calculation of all the running jobs
      -r, --remove          clean up all the job files
      -g, --fix             fix jobs, check cache etc (e.g. after update)
      -s, --status          show job status
      -m, --monitor         show job status every few seconds
      -x, --result          run analysis code to summarize results
      -t, --whyfail         print a list of failed jobs with the reason why they
                            failed
      -j, --serial          job commands (start, fix, etc) may NOT be run in
                            parallel (parallel is faster but order of jobs and
                            output is inconsistent)
      --jobs JOBS           specify by name the jobs to (re)start, separated by
                            whitespace
      --cmd ACTIONS         run a shell command in the directories of each job
                            that has a dir ($NAME/$BATCH/$STATUS if --s)

    actions are executed (largely) in the order they are supplied; some actions
    may call others where necessary
Pull requests, extra documentation and bug reports are welcome! It's Revised BSD-licensed so you can do many things.