HPC job management suite supporting compute clusters and the IBM BlueGene series. Mirror of SVN repository. (DEPRECATED)


changes in cobalt 0.98.2

* change to the simulator's XML file
* the simulator can simulate bad hardware
* bug fix so that the state of a reservation queue is honored
* reservation queues are shown along with normal queues in partadm and
  partlist output
* added a --sort flag to [c]qstat which allows the user to specify how the
  results are sorted
* hardened state file saving so that a full disk can't cause cobalt to
  corrupt its state files
* job dependencies are supported
* cobalt's representation of job states has changed
* qalter -t takes relative time arguments
* partadm --diag can be used for running "diagnostics" on a partition and
  its children
* releaseres can properly release multiple reservations at once
* the fields shown with [c]qstat -f can be controlled through an
  environment variable or a setting in cobalt.conf
* some problems with script mode jobs are fixed
* the scheduler now uses a utility function to choose which job to execute
* the high-prio queue policy has been renamed high_prio (as this is now
  handled by a function written in python, and '-' isn't legal inside a
  python identifier)
* job validation has been moved from [c]qsub into the bgsystem component
* cobalt uses the bridge API to autodetect certain situations that will
  prevent jobs from running successfully
* added an additional .cobaltlog file to the output generated by jobs
* added a -u flag to [c]qsub which allows users to specify the desired
  umask for output files created by cobalt

The XML format used to describe partition information to the simulator has
changed.  A new file in this format is included with the release, and one
can now use

    partadm --xml

to have a running cobalt instance create an XML file describing the system
being managed.

To simulate bad hardware, one can use the client script named "hammer.py".
The components that one can break are the NodeCards and Switches listed in
the simulator's XML file.

Job dependencies are created by using the --dependencies flag with [c]qsub.
The argument to this flag is a colon-separated list of jobids, all of which
must complete successfully before the job being submitted is allowed to run.
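For illustration only, the bookkeeping could be sketched in python like this
(parse_dependencies and may_run are hypothetical helpers, not cobalt's
internals):

```python
def parse_dependencies(spec):
    # Split a colon-separated jobid list such as "1234:1235" into ints.
    return [int(jobid) for jobid in spec.split(":") if jobid]

def may_run(dependencies, finished_ok):
    # A job may run only once every dependency has finished successfully.
    return all(jobid in finished_ok for jobid in dependencies)
```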

Job states have changed substantially.  "administrative" holds (as
specified with cqadm) and "user" holds (as specified with qhold) can now be
separately applied to a job.  That is to say, a job can have both kinds of
hold applied to it, with qrls only releasing a user hold, and cqadm only
releasing an administrative hold.  Additionally, jobs may exhibit states
like "dep_hold" or "maxrun_hold".  There is also a new output field available
to [c]qstat, specified with short_state.  This produces single-letter
output showing job states in the style of PBS.

There is a diagnostic framework that can be used to run any kind of program
which can help diagnose bad hardware (e.g. a normal science application
which is hard on the machine).  Problems are isolated by using a binary
search on the children of a suspect partition.  Use 

    partadm --diag=diag_name partition_name

to run a script/program named diag_name found in /usr/lib/cobalt/diags/ .
The exit value of the script should be 0 to indicate no problem found or
non-zero to indicate an error.
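Since a partition's children split it into smaller pieces, recursing into
whichever child still fails amounts to the binary search described above.
A rough sketch (run_diag and children are stand-ins for the diag script's
exit status and cobalt's partition tree, not the actual implementation):

```python
def isolate_bad_parts(partition, run_diag, children):
    # Narrow a failing diagnostic down to the smallest failing children.
    # run_diag(p) mimics the diag script's exit status: 0 means healthy,
    # non-zero means a problem was found.
    if run_diag(partition) == 0:
        return []                      # this subtree is healthy
    kids = children(partition)
    if not kids:
        return [partition]             # leaf partition is the culprit
    bad = []
    for kid in kids:
        bad.extend(isolate_bad_parts(kid, run_diag, children))
    return bad or [partition]          # children pass but the parent fails
```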

The scheduler now uses utility functions to decide on which job to execute.
Cobalt has two built in utility functions:  "high_prio" and "default".
Both of these utility functions imitate the behavior of those policies in
previous versions of cobalt.  In the [bgsched] section of cobalt.conf, one
may make an entry such as

    utility_file: /etc/cobalt.utility

which tells cobalt where to find user-defined cost functions.  Also in the
[bgsched] section, one may include an entry like

    default_reservation_policy: biggest_first

to control the default policy applied to a newly created reservation queue.
The file /etc/cobalt.utility simply contains the definitions of python
functions, the names of which can be used as queue policies, set via cqadm.

The scheduler iterates through the jobs which are available to run, and
evaluates them one by one with the utility function specified by each job's
queue.  The job having the highest utility value is selected to run.  If
this job is unable to run (perhaps because it needs a partition which is
currently blocked), cobalt can use a threshold to try to run jobs that are
"almost as good" as the one which cannot start.  This threshold is set by
the utility function itself.  If no such jobs exist, cobalt will apply a
conservative backfill which should not interfere with the "best" job.
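This selection logic can be sketched as follows (a simplification with
stand-in plumbing, not cobalt's actual scheduler code):

```python
def pick_job(jobs, evaluate, can_start):
    # jobs: iterable of job objects; evaluate(job) -> (score, threshold)
    # from the job's queue's utility function; can_start(job) -> bool.
    scored = sorted(((evaluate(j), j) for j in jobs),
                    key=lambda pair: pair[0][0], reverse=True)
    if not scored:
        return None
    (best_score, threshold), winner = scored[0]
    if can_start(winner):
        return winner
    # The winner is blocked: only jobs scoring at least `threshold`
    # (set by the winner's utility function) may run instead.
    for (score, _), job in scored[1:]:
        if score >= threshold and can_start(job):
            return job
    return None    # cobalt would fall back to conservative backfill here
```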

The utility functions take no arguments, and should return a tuple of
length 2: the first entry is the score for the job, and the second entry is
the minimum allowed score for some other job that is allowed to start
instead of this one.  Information about the job currently being evaluated
by the utility function is available through several variables:

    queued_time -- the time in seconds that the job has been waiting
    wall_time -- the time in seconds requested by the job for execution
    size -- the number of nodes requested by the job
    user_name -- the user name of the person owning the job
    project -- the project under which the job was submitted
    queue_priority -- the priority of the queue in which the job lives
    machine_size -- the total number of nodes available in the machine
    jobid -- the integer job id shown in [c]qstat output

Here is an example of a utility function that tries to avoid starvation:

    def wfp():
        val = (queued_time / wall_time)**2 * size
        return (val, 0.75 * val)

This utility function allows jobs that have been waiting in the queue to
get angrier and angrier that they haven't been allowed to run.  The second
entry in the return value says that if cobalt is unable to start the
"winning" job, it should only start a job having a utility value of at
least 75% of the winning job's utility value.  In this way, starved jobs
can prevent other jobs from starting until enough resources are freed for
the starved jobs to run.

Here are some more considerations about utility functions.

Queues pointing to overlapping partitions may have different utility
functions, but the values generated by these utility functions will be
compared against each other.  Queues which point to disjoint partitions do
not have the utility values of their jobs compared against each other.
In the first case, since the queues are competing for resources, one queue
can prevent jobs in the other queue from starting.  In the second case,
since there is no competition for resources, the queues cannot interfere
with each other.

Cobalt attempts to determine whether queues have overlapping partitions by
looking at the nodecards available to each queue.  Any queues which share
nodecards are assumed to be competing for resources.
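The overlap test amounts to set intersection, roughly like this (a sketch
assuming each queue is mapped to the set of nodecards its partitions cover):

```python
def queues_compete(nodecards_by_queue, q1, q2):
    # Queues compete for resources iff their partitions share nodecards.
    return bool(nodecards_by_queue[q1] & nodecards_by_queue[q2])
```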

Of special note: if you are trying to configure your cobalt installation to
have queues pointing to disjoint pieces of the machine, you need to either
remove the "top level" partition that encompasses the entire machine, or
change that partition to the "unschedulable" state.  Otherwise, cobalt will
detect that all of the queues are competing for resources.

Changes to the /etc/cobalt.utility file can be made at runtime.  To tell
cobalt to reload this file, issue the command:

    schedctl --reread-policy

partadm -l and partlist may now report certain partitions as having "hardware
offline".  This indicates that the bridge API has reported that either a node
card or a switch is in a state that would result in job failure.  Cobalt will
avoid running jobs on these partitions while the "hardware offline" state
persists.

Jobs now produce a .cobaltlog file in addition to the .error and .output files.
This new file contains things like the actual mpi command executed, and the
environment variables set when the command was invoked.



A class definition changed, which breaks the state files used by cqm.  The
state files used by bgsystem and bgsched should still load.

To recreate the information stored in the state file, use the mk_jobs.py
and mk_queues.py scripts.  These will dump a series of commands that will
recreate your queue configuration and jobs that are queued.