Schedule and manage computational jobs on a Linux cluster
Python Shell
Latest commit c079b7e Feb 18, 2017 mverleg queue start fixes
fenpei
test
.gitignore
LICENSE.txt
README.rst
__init__.py
setup.cfg
setup.py
todo.txt

README.rst

Fenpei

This little tool helps in scheduling, tracking and aggregating calculations and their results. It forms the step that brings you from 'a directory with working code for a job' to 'running dozens of jobs and getting results easily'.

pip install fenpei

Fenpei is intended for running multiple intensive computations on a (Linux) cluster. At present, it assumes a shared file system on the cluster.

It takes a bit of work to integrate with your situation but it is very flexible and should make your life easier after setting it up. Some features:

  • Jobs are created in Python files, which keeps job definitions short and extremely flexible.
  • It uses a command line interface (some shell experience required) to easily start, stop or monitor jobs.
  • Easy to use with existing code and easily reproducible, since it works by creating isolated job directories.
  • Can replace scheduling queue functionality and start jobs through ssh, or can work with existing systems (Slurm and qsub included, others implementable).
  • Flexibility for caching, preparation and result extraction.
  • Uses multi-processing and can easily use caching for greater performance, and symlinks to save space.
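
To make the isolated job directories and the symlink option concrete, here is a minimal sketch that uses only the standard library. It is not Fenpei's implementation: the function name prepare_job_dir and its arguments are hypothetical, chosen only to illustrate the idea of giving every job its own directory and linking large input files instead of copying them.

```python
import os
import shutil


def prepare_job_dir(job_dir, files, use_symlink=True):
    """Create an isolated directory for one job (hypothetical helper, not Fenpei API).

    Each file in `files` is either symlinked into the directory (saves space)
    or copied (fully independent of the original).
    """
    os.makedirs(job_dir, exist_ok=True)
    for src in files:
        dst = os.path.join(job_dir, os.path.basename(src))
        # Remove a stale file or link left over from an earlier preparation.
        if os.path.lexists(dst):
            os.remove(dst)
        if use_symlink:
            os.symlink(os.path.abspath(src), dst)
        else:
            shutil.copy2(src, dst)
    return job_dir
```

Symlinks keep many job directories cheap when they share large read-only inputs; copying is the safer choice for files the job may modify.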

Note that:

  • You will have to write Python code for your specific job, as well as any analysis or visualization for the extracted data.
  • Except in status monitoring mode, it derives the state anew on each run; it doesn't keep a database that can get outdated or corrupted.
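
The idea of deriving state on each run, rather than storing it, can be sketched as follows. This is an illustration under assumptions, not Fenpei's actual logic: the marker file names (result.json, error.log, run.pid) and the status strings are invented for the example.

```python
import os


def derive_status(job_dir):
    """Infer a job's status from what is on disk (hypothetical marker files)."""
    if not os.path.isdir(job_dir):
        return 'none'       # job directory was never prepared
    if os.path.exists(os.path.join(job_dir, 'result.json')):
        return 'completed'  # the job wrote its result
    if os.path.exists(os.path.join(job_dir, 'error.log')):
        return 'crashed'    # the job left an error log and no result
    if os.path.exists(os.path.join(job_dir, 'run.pid')):
        return 'running'    # started, but no result or error yet
    return 'prepared'       # directory exists, nothing was started
```

Because the status is recomputed from the files every time, there is no separate record that can drift out of sync with what actually happened on the cluster.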

One example to run reproducible jobs with Fenpei (there are many ways):

  • Make a script that runs your code from source to completion for one set of parameters.
  • Subclass the ShJobSingle job and add all the files that you need in get_nosub_files.
  • Replace all the parameters in the run script and other config files by {{ some_param_name }}. Add these files to get_sub_files.
  • Make a Python file (example below) for each analysis you want to run, and fill in each some_param_name placeholder with the appropriate value.
  • From a shell, use python your_jobfile.py -s to see the status, then use other flags for more functionality (see below).
  • Implement is_complete and result in your job (and crash_reason if you want -t) (others can be overridden too, if you require special behaviour).
  • Add analysis code to your job file if you want to visualize the results.
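
The placeholder step above can be illustrated with a small standalone sketch. It shows one way that {{ some_param_name }} markers in a run script could be filled in from a dictionary of values; the substitute function and the regular expression are assumptions made for this example, and Fenpei's own substitution mechanism may work differently.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r'\{\{\s*(\w+)\s*\}\}')


def substitute(template, subs):
    """Replace every {{ name }} in `template` by str(subs[name]).

    Raises KeyError for a placeholder without a value, so that a forgotten
    parameter fails loudly instead of ending up in the run script.
    """
    def replace(match):
        name = match.group(1)
        if name not in subs:
            raise KeyError('no value for placeholder {0!r}'.format(name))
        return str(subs[name])
    return PLACEHOLDER.sub(replace, template)


script = './simulate --alpha={{ alpha }} --beta={{beta}} --delta={{ delta }}'
filled = substitute(script, dict(alpha=0.01, beta=3, delta='yes'))
# filled == './simulate --alpha=0.01 --beta=3 --delta=yes'
```

Failing on a missing value is a deliberate choice in this sketch: with dozens of generated jobs, a silently unfilled placeholder is much harder to spot than an exception at preparation time.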

Example file to generate jobs:

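from os.path import basename, splitext

# create_jobs and SlurmQueue are provided by fenpei; ShefJob stands for your
# own job class (see the steps above). Import them from wherever they live
# in your setup.

def generate_jobs():
    # Yield one job description per (alpha, beta) combination.
    for alpha in [0.01, 0.10, 1.00]:
        for beta in range(0, 41):
            yield dict(name='a{0:.2f}_b{1:d}'.format(alpha, beta), subs=dict(
                alpha=alpha,
                beta=beta,
                gamma=5,
                delta='yes'
            ), use_symlink=True)

def analyze(queue):
    results = queue.compare_results(('J', 'init_vib', 'init_rot',))
    # You now have the results for all jobs, indexed by the above three parameters.
    # Visualization is up to you, and will be run when the user adds -x

if __name__ == '__main__':
    # The batch name defaults to this file's name without its extension.
    jobs = create_jobs(JobCls=ShefJob, generator=generate_jobs(), default_batch=splitext(basename(__file__))[0])
    queue = SlurmQueue(partition='example', jobs=jobs, summary_func=analyze)
    queue.run_argv()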

This file registers one job for each combination of the alpha and beta parameters (3 values of alpha times 41 values of beta, so 123 jobs). You can now use the command line:

usage: results.py [-h] [-v] [-f] [-e] [-a] [-d] [-l] [-p] [-c] [-w WEIGHT]
                  [-q LIMIT] [-k] [-r] [-g] [-s] [-m] [-x] [-t] [-j]
                  [--jobs JOBS] [--cmd ACTIONS]

distribute jobs over available nodes

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         more information (can be used multiple times, -vv)
  -f, --force           force certain mistake-sensitive steps instead of
                        failing with a warning
  -e, --restart         with this, start and cleanup ignore complete
                        (/running) jobs
  -a, --availability    list all available nodes and their load (cache reload)
  -d, --distribute      distribute the jobs over available nodes
  -l, --list            show a list of added jobs
  -p, --prepare         prepare all the jobs
  -c, --calc            start calculating one jobs, or see -z/-w/-q
  -w WEIGHT, --weight WEIGHT
                        -c will start jobs with total WEIGHT running
  -q LIMIT, --limit LIMIT
                        -c will add jobs until a total LIMIT running
  -k, --kill            terminate the calculation of all the running jobs
  -r, --remove          clean up all the job files
  -g, --fix             fix jobs, check cache etc (e.g. after update)
  -s, --status          show job status
  -m, --monitor         show job status every few seconds
  -x, --result          run analysis code to summarize results
  -t, --whyfail         print a list of failed jobs with the reason why they
                        failed
  -j, --serial          job commands (start, fix, etc) may NOT be run in
                        parallel (parallel is faster but order of jobs and
                        output is inconsistent)
  --jobs JOBS           specify by name the jobs to (re)start, separated by
                        whitespace
  --cmd ACTIONS         run a shell command in the directories of each job
                        that has a dir ($NAME/$BATCH/$STATUS if --s)

actions are executed (largely) in the order they are supplied; some actions
may call others where necessary
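
The note about actions running in the order they are supplied can be illustrated with argparse. The standard store_true action does not remember the order of flags, so this sketch uses a custom action that appends each flag to a list. It is one possible approach, not necessarily how Fenpei does it; the names OrderedFlag, build_parser and ordered are hypothetical, and only a few of the flags from the help text above are included.

```python
import argparse


class OrderedFlag(argparse.Action):
    """Flag without a value that records the order in which flags were given."""

    def __init__(self, option_strings, dest, nargs=None, **kwargs):
        # nargs=0 makes this behave like a switch (no argument consumed).
        super().__init__(option_strings, dest, nargs=0, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        # Build a new list each time so the shared default is never mutated.
        namespace.ordered = list(namespace.ordered) + [self.dest]


def build_parser():
    parser = argparse.ArgumentParser(description='distribute jobs over available nodes')
    parser.set_defaults(ordered=[])
    for short, long_name, text in [
        ('-p', '--prepare', 'prepare all the jobs'),
        ('-c', '--calc', 'start calculating jobs'),
        ('-s', '--status', 'show job status'),
        ('-x', '--result', 'run analysis code to summarize results'),
    ]:
        parser.add_argument(short, long_name, action=OrderedFlag, help=text)
    return parser


args = build_parser().parse_args(['-p', '-c', '-s'])
# args.ordered == ['prepare', 'calc', 'status']
```

A caller could then loop over args.ordered and dispatch each name to the matching job operation, which gives the "prepare, then start, then show status" behaviour from a single command line.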

Pull requests, extra documentation and bug reports are welcome! It's Revised BSD-licensed so you can do many things.