# Executing external tasks

A complex workflow usually has a large number of steps. Most of them are lightweight and can be executed locally, but some are time- and resource-consuming and are better suited to dedicated servers or cluster systems. There are several approaches to running such workflows:

1. Execute the entire workflow on the cluster as a single multi-processing job. This is not ideal because (1) you have to allocate enough resources for the most CPU- and RAM-demanding step, and the longest execution time for the most time-consuming step, although these resources sit idle most of the time, and (2) you are limited to a single node and cannot really execute the workflow in parallel.

2. Separate the workflow by steps and submit them as separate tasks to a queuing system. This approach allows better parallelization, but executing a large number of small jobs can be difficult and time-consuming because these tasks might spend more time waiting than running.

SoS takes a different approach than most workflow systems in that

1. It executes most or all steps directly. The workflow is executed in a multi-processing manner, where multiple processes (4 by default) execute different branches of the DAG (Directed Acyclic Graph).

2. Some of the steps can be defined as **tasks** that are executed externally. Tasks can be executed locally as separate processes, on a remote server, through distributed task queues (such as [rq](http://python-rq.org/) or [Celery](http://www.celeryproject.org/)), or on cluster systems based on PBS, Torque, or Sun Grid Engine. SoS handles file synchronization so **tasks can be submitted to queues with their own file systems**. SoS can wait for the completion of the tasks or exit (default mode). Tasks can be executed and monitored independently of the workflows, and SoS can resume the execution of a workflow that depends on the completion of some of the tasks.

This external execution model offers great flexibility in the execution of workflows. For example,
* You can execute a step on a remote server with more resources by specifying the `queue` parameter of a single task.
* You can submit all tasks of a workflow to a cluster by specifying the cluster name with the `-q` option of the `sos run` command.
* You can submit some of the tasks to one machine and others to another task queue by combining task-specific options with the global `-q` option.
* You can use SoS as a task-generation tool to generate a bunch of tasks, and send the tasks to different computer systems for execution. SoS automatically handles file synchronization so that you can easily move from one cluster to another cluster or another server.

## Specification of tasks

If a job is long and time-consuming, it is much preferable to submit it as a separate task to be executed, for example, on a cluster system. Such jobs should be specified using the `task` keyword, which marks the beginning of a task and accepts optional runtime options that control its execution. For example,

```
[10]
input: group_by='single'

task: concurrent=True

run('''
samtools index {_input}
''')
```

executes a shell script in parallel (with `concurrent=True`). The step process can consist of arbitrary Python statements and execute multiple step actions. For example,

```python
task:
try:
   action1()
except RuntimeError:
   action2()
```

executes `action1`, and executes `action2` only if `action1` raises a `RuntimeError`.

```python
task:
for par in ['-4', '-6']:
   run('command with ${par}')
```

executes commands in a loop. This is similar to

```
pars = ['-4', '-6']
input: for_each=pars
task:
run('command with ${_pars}')
```

but the `for` loop version cannot be executed in parallel. Note that SoS actions can be used outside of a task, but only statements specified after the `task` keyword can have runtime options and be executed in separate processes. That is to say,

```
pars = ['-4', '-6']
input: for_each=pars
run('command with ${_pars}')
```

is equivalent to

```
pars = ['-4', '-6']
input: for_each=pars
task:
run('command with ${_pars}')
```

but the latter can take additional runtime options, for example to run the commands in parallel:

```
pars = ['-4', '-6']
input: for_each=pars
task: concurrent=True
run('command with ${_pars}')
```
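
As a rough analogy in plain Python (not how SoS is implemented), `concurrent=True` turns the repetitions of the task into jobs that a pool of workers can pick up side by side. In this sketch `run_command` is a hypothetical stand-in for the `run` action, and `max_workers=4` is an assumed limit on simultaneous jobs.

```python
from concurrent.futures import ThreadPoolExecutor

def run_command(par):
    # Hypothetical stand-in for run('command with ${_pars}');
    # it only builds the command string instead of executing anything.
    return f"command with {par}"

pars = ['-4', '-6']

# Sequential execution: one parameter after another, like a plain for loop.
sequential_results = [run_command(par) for par in pars]

# Concurrent execution: each parameter is handed to a worker in the pool.
# max_workers is an assumed cap on the number of simultaneous jobs.
with ThreadPoolExecutor(max_workers=4) as pool:
    concurrent_results = list(pool.map(run_command, pars))

print(concurrent_results)
```

Both versions produce the same results in the same order; only the scheduling differs, which is why the `for_each` form can be parallelized while an explicit `for` loop inside a single task cannot.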

Because step tasks are executed outside of SoS, variables assigned in step tasks are not accessible to SoS. For example,

```
[10: shared='res']
res = some_action()
```

executes `some_action()` in the step process and returns its result as a shared variable `res`. The following script,

```
[10: shared='res']
task:
res = some_action()
```

however, does not work because `res` is assigned in the step task and is not accessible from the step.
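
The same isolation can be seen with ordinary Python processes. The following sketch is only an analogy, assuming a task behaves like a separate process; `run_in_child` is a hypothetical helper and `6 * 7` stands in for `some_action()`.

```python
import subprocess
import sys

def run_in_child(code):
    """Run Python code in a separate process and return what it printed."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        check=True, capture_output=True, text=True,
    )
    return completed.stdout.strip()

res = None
run_in_child("res = 6 * 7")   # the assignment happens in the child only
before = res
print(before)                  # the parent still sees None

# A value only reaches the parent if the child hands it over explicitly,
# here through its standard output.
res = int(run_in_child("print(6 * 7)"))
print(res)
```

The first call changes nothing in the parent, just as an assignment inside a task does not change a step variable.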

## Common host configuration

### `alias`

### `address`

### `path_map`

### `shared`

### `send_cmd`

### `receive_cmd`

### `execute_cmd`

## Common queue configuration

### `queue_type`

### `status_check_interval`

### `max_running_jobs`

## RQ configuration

### `redis_host`

### `redis_port`

## Celery configuration

### `broker`

### `backend`

## PBS/Torque configuration

### `template_file`

`template_file` should point to a template file (available locally, not on the remote host) that is used to generate the shell script submitted to the PBS system. A typical template looks like

```
#!/bin/bash
#PBS -N ${task}
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -l walltime=${walltime}
#PBS -l mem=${mem}
#PBS -o ${task}.out
#PBS -e ${task}.err
#PBS -q long
#PBS -m ae
#PBS -M your@email.address
#PBS -v ${cur_dir}

cd ${cur_dir}

source /setup.sh
sos execute ${task} -v ${verbosity} -s ${sig_mode}
```

The template file will be interpolated with the following information

* `task`: task id
* `nodes`, `ppn`, `walltime`, `mem`: resource task options
* `cur_dir`: translated current project directory
* `verbosity` and `sig_mode`: sos run mode.

Note that
1. You will need to specify all resource options (`nodes`, `ppn`, `walltime`, and `mem`) as task options if they are used in the template file.
2. If you need to specify more options (e.g. queue name), you will have to define multiple host entries with different template files. For example, you could define two queue entries as `cluster-short` and `cluster-long` for two queues on the same cluster.
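
The interpolation step can be pictured with Python's `string.Template`, which uses the same `${name}` placeholder syntax. This is a minimal sketch and not SoS's own code; all values in `values` are hypothetical, and the template is abbreviated.

```python
from string import Template

# Abbreviated job template using the ${name} placeholders described above.
job_template = Template("""#!/bin/bash
#PBS -N ${task}
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -l walltime=${walltime}
#PBS -l mem=${mem}

cd ${cur_dir}
sos execute ${task} -v ${verbosity} -s ${sig_mode}
""")

# Hypothetical values; in SoS they would come from task options and the run mode.
values = {
    'task': 'task_id_placeholder',
    'nodes': 1,
    'ppn': 4,
    'walltime': '01:00:00',
    'mem': '4gb',
    'cur_dir': '/home/user/project',
    'verbosity': 2,
    'sig_mode': 'default',
}

script = job_template.substitute(values)
print(script)

# A resource option that the template uses but the task does not define
# cannot be filled in, so the substitution fails.
incomplete = {key: value for key, value in values.items() if key != 'mem'}
try:
    job_template.substitute(incomplete)
except KeyError as err:
    missing = err.args[0]
    print(f"missing template variable: {missing}")
```

The failure in the second half mirrors the first note: every resource option that appears in the template has to be supplied as a task option.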

### `job_template`

`job_template` should be the content of the template file if you prefer listing the content directly in the config file, and happen to know how to specify multi-line strings in YAML format.

### `submit_cmd`

A `submit_cmd` template is the command that will be executed to submit the job. It accepts the same set of variables as `job_template`, with an additional variable `job_file` pointing to the location of the job file on the remote host. The `submit_cmd` is usually as simple as

```
qsub ${job_file}
```

but you could specify some options from command line instead of the job file and define `submit_cmd` as

```
msub -l walltime=${walltime} < ${job_file}
```

### `status_cmd` (unused)

This option is currently unused because SoS provides its own job monitoring method.

### `kill_cmd` (unused)

This option is currently unused because SoS provides its own job management method.

## Resource options

### Option `walltime`

### Option `nodes`

### Option `ppn`

### Option `mem`

## Execution options

### Option `queue`

### Option `workdir`

Defaults to the current working directory.

Option `workdir` controls the working directory of the process. For example, the following step downloads a file to the `resource_dir` using command `wget`.

```python
[10]

run: workdir=resource_dir

  wget a_url -O filename

```
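
For readers familiar with Python's `subprocess` module, `workdir` plays a role similar to the `cwd` argument. The sketch below illustrates that analogy only and makes no claim about SoS internals; a temporary directory stands in for `resource_dir`.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as resource_dir:
    # Start a child process whose working directory is resource_dir
    # and ask it to report where it is running.
    completed = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=resource_dir, check=True, capture_output=True, text=True,
    )
    child_cwd = completed.stdout.strip()
    # Compare resolved paths because temporary directories may be symlinked.
    ran_in_resource_dir = (
        os.path.realpath(child_cwd) == os.path.realpath(resource_dir)
    )

print(ran_in_resource_dir)
```

Relative file names used by the command, such as `filename` in the `wget` example, are resolved against this directory.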

### Option `concurrent`

Defaults to `False`.

If the step process is repeated for different input files or parameters (using input options `group_by` or `for_each`), the repeated processes can be executed in parallel, up to the maximum number of concurrent jobs specified by the command line option `-j`.

### Option `env`

The `env` option allows you to modify the runtime environment, similar to the `env` parameter of the `subprocess.Popen` function. For example, you can execute a command located in a specific directory using

```
task:  env={'PATH': '/path/to/mycommand' + os.pathsep + os.environ['PATH']}
run:
   mycommand
```

### Option `prepend_path`

Option `prepend_path` is a shortcut for option `env` that prepends one path (a string) or several paths (a list of strings) to the system path. For example, the above example can be shortened to

```
task:  prepend_path='/path/to/mycommand'
run:
   mycommand 
```
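
The relationship between `prepend_path` and `env` can be sketched in plain Python. The helper below is hypothetical and only mirrors the behavior described above: it accepts a string or a list of strings and joins them to the existing `PATH` with `os.pathsep`, the platform's path-list separator.

```python
import os
import subprocess
import sys

def prepend_path(paths, environ=None):
    """Return a copy of the environment with paths prepended to PATH.

    Hypothetical helper for illustration; not part of SoS.
    """
    if isinstance(paths, str):
        paths = [paths]
    env = dict(os.environ if environ is None else environ)
    current = env.get('PATH', '')
    parts = list(paths) + ([current] if current else [])
    env['PATH'] = os.pathsep.join(parts)
    return env

# Pass the modified environment to a child process, as the env option would.
env = prepend_path('/path/to/mycommand')
completed = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['PATH'])"],
    env=env, check=True, capture_output=True, text=True,
)
child_path = completed.stdout.strip()
first_entry = child_path.split(os.pathsep)[0]
print(first_entry)
```

Because the new directory comes first, a command found there takes precedence over one with the same name elsewhere on the path.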

### Option `active`

## Commands and Options

### `sos run -q`

### `sos status`

`sos status -q`

`sos status -v 3`

### `sos kill` 

`sos kill -q`

### `sos execute`

`sos execute -q`

## Examples

### Remote execution

### PBS Cluster