# SoS Syntax

## Overview

A SoS **script** defines one or more **workflows**, and each workflow consists of one or more **steps**. 

![workflow](../media/workflow.png)

A SoS script contains **comments**, **statements**, and one or more SoS **steps**. A SoS **step** consists of a **header**
with one or more step names and optional step options. The body of a SoS step consists of optional **comments**,
**statements**, **input**, **output**, and **depends** statements, and **parameter** definitions, followed by the step **process**. The following figure
shows a sample script that defines a workflow with two steps:

![sample_script](../media/sample_script.jpg)

The above workflow is defined in a plain-text `.sos` file and displayed in a browser with syntax highlighting. The same workflow could be defined in a Jupyter notebook, with cells starting with section headers as follows: 

In [None]:
[global]
local_resource = '~/Resource/'
data_dir       = '~/Data/bams/'
resource_dir   = "${local_resource}/resources/hg19/Ensembl/Genes"

# samples to be processed
parameter: samples = ['s312', 's315', 's312a', 's315a']

In [None]:
[gff_0]
# download gene models from the MISO website
output: "${resource_dir}/Homo_sapiens.GRCh37.65.gff.zip"
download: dest_dir=resource_dir, decompress=True
    http://genes.mit.edu/burgelab/miso/annotations/gene-models/Homo_sapiens.GRCh37.65.gff.zip

In [None]:
[gff_1]
# Index gtf file using index_gff
output: "${resource_dir}/${hg19_gff_index}/genes.gff"
task:   working_dir=resource_dir
run:    docker_image='mdabioinfo/miso:latest'
    rm -rf ${hg19_gff_index}
    index_gff --index ${hg19_gff_file} ${hg19_gff_index}

Note that a SoS Jupyter notebook can contain markdown cells and cells with other kernels, but only cells starting with section headers (ignoring magics, comments, etc.) are part of the workflow definition. Workflows in a Jupyter notebook can be executed within the notebook using SoS magics, or from the command line using commands such as `sos run`.

## Terminology & Grammar

* **Script**: A SoS script that defines one or more workflows.
* **Workflow**: A sequence of processes that can be executed to complete a certain task.
* **Step**: A step of a workflow that performs one piece of the workflow.
* **Target**: Objects that are the input and result of a SoS step, which are usually files, but can also be objects such as an executable command (with variable locations) or a SoS variable.
* **Step options**: Options of the step that assist the definition of the workflow.
* **Step input**: Specifies the input files of the step.
* **Step output**: Specifies the output files and targets of the step.
* **Step dependencies**: Specifies the files and targets that are required by the step.
* **Step process**: The process that a step executes to complete the specified work, given as one or more Python statements.
* **Task**: Part or all of a step process that will be executed and monitored outside of SoS. These are usually resource-intensive jobs that take a long time to complete.
* **Action**: SoS or user-defined Python functions. They differ from regular Python functions in that they may behave differently in different running modes of SoS (e.g. be ignored when executed in dryrun mode).

More formally defined, the SoS syntax obeys the following grammar, given in extended Backus-Naur form (EBNF):

```
Script         = {comment}, {statement}, {step};
comment        = "#", text, NEWLINE
assignment     = name, "=", expression, NEWLINE
```

with SoS steps defined as

```
step           = step_header,
                 {comment}, {{statement}, [input | output | depends ]},
                 [process, NEWLINE, {script} ]
step_header    = "[", section_names, [":", names | options], "]", NEWLINE
parameter      = "parameter", ":", assignment
input          = "input", ":", [expressions], [",", options], NEWLINE
output         = "output", ":", [expressions], [",", options], NEWLINE
depends        = "depends", ":", [expressions], [",", options], NEWLINE
task           = "task", ":",  [options]
action         = func_format | script_format
func_format    = name, "(", [options], ")"
script_format  = name, ":", [options], NEWLINE, script 
section_names  = section_name, {",", section_name}
section_name   = name, ["(", text, ")"]
names          = name, {",", name}
workflow       = name, ['_', steps], {"+", name, ['_', steps]}
assignment     = name, "=", expression, NEWLINE
expressions    = expression, {",", expression}
options        = option, {",", option}
option         = name, "=", expression
```

Here `name`, `expression`, and `statement` are arbitrary [Python 3](http://www.python.org) names, expressions, and statements with added SoS features.

## Basic Syntax

If you are unfamiliar with Python, you can learn the basics, usually in less than half a day, by reading a Python tutorial (e.g. [the official Python tutorial](https://docs.python.org/3/tutorial/)). This [short introduction](https://docs.python.org/3/tutorial/introduction.html) is good enough to get you started. A small difference between SoS and regular Python 3 syntax is that SoS is more lenient about mixing tabs and spaces for indentation. Although it is highly recommended that you use only spaces for indentation, SoS will give a warning and treat each tab as 4 spaces during execution.
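
The tab handling described above can be pictured with a small Python sketch. This is only an illustration of the stated rule (warn, then treat each leading tab as 4 spaces); the function name `normalize_indent` is hypothetical and is not part of SoS.

```python
import warnings

def normalize_indent(line, tab_width=4):
    """Replace each tab in the leading whitespace of `line` with spaces."""
    stripped = line.lstrip(' \t')
    indent = line[:len(line) - len(stripped)]
    if '\t' in indent:
        # Mirror the documented behavior: warn, then convert.
        warnings.warn('tab used for indentation; treating it as %d spaces' % tab_width)
        indent = indent.replace('\t', ' ' * tab_width)
    return indent + stripped

print(repr(normalize_indent('\tx = 1')))    # '    x = 1'
print(repr(normalize_indent('  \ty = 2')))  # '      y = 2'
```

Tabs that appear after the first non-whitespace character are left alone, because only indentation is affected by the rule.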

### String Interpolation

Unlike Python format strings, SoS string interpolation **does not require any prefix**, and is **applied only to double quoted strings** (`" "`, `""" """`, `r" "`, and `r""" """`). Single quoted strings (`' '`, `''' '''`, `r' '`, and `r''' '''`) are not interpolated.

Although configurable, the default sigil for SoS string interpolation is `'${ }'`, which means that by default any expression between `${` and `}` is evaluated by SoS. In the following example, the expressions `resource_path`, `sample_names[0]`, and `sample_names` are replaced by their values in the variables `ref_genome`, `title`, and `all_names`, but `${sample_names}` is left untouched in `single_quoted` because that string literal is single quoted. For convenience, we use the `%preview` magic of the SoS kernel to display the values of variables after the cell is evaluated.

In [1]:
%preview ref_genome title all_names single_quoted

resource_path = '~/.sos/resources'
ref_genome    = "${resource_path}/hg19/refGenome.fasta"

sample_names  = ['A', 'B', 'C']
title         = "Sample ${sample_names[0]} results"
all_names     = "Samples ${sample_names}"

single_quoted = '${sample_names} is not interpolated'

'~/.sos/resources/hg19/refGenome.fasta'

'Sample A results'

'Samples A B C'

'${sample_names} is not interpolated'
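
To make the mechanism concrete, here is a minimal Python sketch of `${ }` interpolation. It is not how SoS is implemented; it assumes that expressions contain no nested braces, and the names `SIGIL` and `interpolate` are made up for this illustration.

```python
import re

# Assumption: expressions contain no nested braces; SoS itself handles more cases.
SIGIL = re.compile(r'\$\{([^{}]+)\}')

def interpolate(text, namespace):
    """Replace every ${expr} in `text` with the value of expr evaluated in `namespace`."""
    def render(match):
        value = eval(match.group(1), {}, namespace)
        # Lists and tuples are joined by spaces, as in the `all_names` example.
        if isinstance(value, (list, tuple)):
            return ' '.join(str(item) for item in value)
        return str(value)
    return SIGIL.sub(render, text)

variables = {'resource_path': '~/.sos/resources', 'sample_names': ['A', 'B', 'C']}
print(interpolate('${resource_path}/hg19/refGenome.fasta', variables))
print(interpolate('Sample ${sample_names[0]} results', variables))
print(interpolate('Samples ${sample_names}', variables))
```

The three printed lines match the previewed values of `ref_genome`, `title`, and `all_names` above. Note that Python itself does not distinguish single and double quotes, so the quoting rule is a SoS feature that this sketch does not model.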

SoS actions specified in **script format** are assumed to be in raw triple quotes and will be interpolated. For example, the variable `num` is passed from SoS to a shell script in the following example

In [2]:
import random
num = random.randint(1, 6)
run:
    echo "Random number is ${num}"

Random number is 2


because the code is equivalent to

In [3]:
import random
num = random.randint(1, 6)
run(r"""
echo "Random number is ${num}"
""")

Random number is 4


#### String representation of Objects

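SoS evaluates an expression and returns the string representation of the value.

If the value is of a simple Python type such as a string, boolean, or number, its standard Python representation is returned. Strings are inserted as-is, without surrounding quotes, as the `user` example below shows; use the `r` converter described later if you need `repr()` output.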

In [4]:
"${2**10}"

'1024'

In [5]:
user = 'Bob'
"${\"Hi, \" + user}"

'Hi, Bob'

For objects with an iterator interface (e.g. Python `list`, `tuple`, `dict`, and `set`), SoS joins the string representation of each item with a space (or a comma with the `,` conversion flag). For example,

  * A list of strings is converted to a string by joining the strings with a space or comma.
  * A dictionary of strings is converted to a string by joining the dictionary keys, with no guarantee on the order of the keys.

In [6]:
names = ['James', 'Bob', 'Kathy']
"${names}"

'James Bob Kathy'

In [7]:
salary = {'James': 20, 'Bob': 25, 'Kathy': 18}
"Employees: ${salary}"

'Employees: Bob James Kathy'

It is worth noting that the step input and output variables (`input`, `output`, `depends`, and their looped versions `_input`, `_output`, and `_depends`) are always lists of targets. However, if the list contains only one filename, `"${input}"` would be the same as `"${input[0]}"`.
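
The conversion rules in this section can be summarized in a short Python sketch. The helper `as_text` is hypothetical and only mirrors the behavior described here (strings as-is, iterables joined item by item, everything else via its standard representation); SoS may treat additional types specially.

```python
def as_text(value, sep=' '):
    """Render `value` roughly the way this section describes."""
    if isinstance(value, str):
        return value
    try:
        items = iter(value)
    except TypeError:
        return repr(value)  # numbers, booleans, and other non-iterables
    # Iterating over a dict yields its keys, so only keys are joined.
    return sep.join(as_text(item, sep) for item in items)

print(as_text(2 ** 10))                                 # 1024
print(as_text(['James', 'Bob', 'Kathy']))               # James Bob Kathy
print(as_text({'James': 20, 'Bob': 25, 'Kathy': 18}))   # keys only, order may vary
print(as_text(['a.txt']) == as_text('a.txt'))           # True
```

The last line illustrates why a single-element list such as `input` renders the same as its only item.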

SoS string interpolation supports all format and conversion specifications of the [Python string format specifier](https://docs.python.org/3/library/string.html#formatspec). That is to say, you can add `: specifier` at the end of the expression to control the format of the output. For example

In [8]:
"${1/3. :.2f}"

'0.33'

In [9]:
filename = 'test.sos'
"${filename:>20}"

'            test.sos'
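
For comparison, the same format specifiers can be tried in plain Python with the built-in `format()` function. The wrapper `render` below is only a hypothetical convenience for this illustration.

```python
def render(value, spec=''):
    """Format `value` with a Python format specifier, as `${value:spec}` would."""
    return format(value, spec)

print(render(1 / 3., '.2f'))            # 0.33
print(repr(render('test.sos', '>20')))  # right-aligned in a field of width 20
```

Any specifier accepted by `format()` for the value's type should therefore be usable after the `:` in an interpolated expression.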

SoS also extends the conversion operators of the standard Python format string to give you more control over the string representation of objects, particularly file and directory names. The conversion operators should be specified after a `!` character.

SoS currently supports the following converters:


| converter | effect | input | output |
| :----------| :----- | :----- | :-------|
| `s`         | `str()`  | `${file1!s}` | `file 1.txt` |
| `r`         | `repr()`  | `${file1!r}` | `'file 1.txt'` |
| `q`         | `quoted()` | `${file1!q}` | `'file 1.txt'`|
| `e`         | `replace(' ', '\\ ')` | `${file1!e}` | `file\ 1.txt`|
| `a`         | `abspath(expanduser())` |  `${file2!a}` | `/path/to/user/SoS/test.sos` |
| `b`         | `basename()` |  `${file2!b}` | `test.sos` |
| `d`         | `dirname()` |  `${file2!d}` | `/path/to/user/SoS/` |
| `n`         | `splitext()[0]` | `${file2!n}` | `~/SoS/test` |
| `u`         | `expanduser()` | `${file2!u}` | `/path/to/user/SoS/test.sos`|
| `,`         | `','.join()` | `${files!,}` | `a.txt,b.txt`|


Here we assume

```
file1='file 1.txt'
file2='~/SoS/test.sos'
files=['a.txt', 'b.txt']
```

For example, if we need to create a file called `Bon Jovi.txt` and run

In [10]:
%sandbox --expect-error
filename = 'Bon Jovi.txt'
run:
    echo "test" > ${filename}
    cat ${filename}

test Jovi.txt


cat: Jovi.txt: No such file or directory
Failed to process statement run(r"""echo "test" > ${filena...name}""")\n (RuntimeError): Failed to execute script (ret=1).
Please use command
	``/bin/bash \
	  /var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmp0q6rm6bx/.sos/interactive_0_0_2b5a234c``
under "/private/var/folders/ys/gnzk0qbx5wbdgm531v82xxljv5yqy8/T/tmp0q6rm6bx" to test it.


We would get a single file named `Bon` that contains `test Jovi.txt`, and `cat` would fail on the non-existent `Jovi.txt`, because the command executed was actually

```
    echo "test" > Bon Jovi.txt
    cat Bon Jovi.txt
```

To avoid such problems, you can quote the filename using the `q` (quoted) converter:

In [11]:
run:
   echo "test" > ${filename!q}
   cat ${filename!q}

test
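
The effect of quoting can be checked outside of SoS with Python's `shlex` module. This sketch assumes that the `q` converter behaves like `shlex.quote`, which the table above only describes as `quoted()`.

```python
import shlex

filename = 'Bon Jovi.txt'

unquoted = 'cat ' + filename
quoted = 'cat ' + shlex.quote(filename)

# shlex.split shows how a POSIX shell would break each command into words.
print(unquoted, '->', shlex.split(unquoted))  # three words: cat, Bon, Jovi.txt
print(quoted, '->', shlex.split(quoted))      # two words: cat, 'Bon Jovi.txt'
```

Without quoting, the shell sees two separate arguments, which is exactly the failure shown in the earlier example.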


Depending on how your scripts handle filenames, it can be handy to pass filenames to scripts in expanded format. For example, it would be perfectly OK to pass `~/a.txt` to a shell script, but a `u` converter should be added if you are passing the filename to a script that does not understand `~` in filenames. SoS makes it easy for you to pass filenames in different forms to underlying scripts. For example,

In [12]:
%preview name filename basefilename expanded parparname
file = '~/sos/examples/update_toc.sos'
name = "${file!n}"
filename = "${file!b}"
basefilename = "${file!bn}"
expanded = "${file!u}"
parparname = "${file!ddb}"

'~/sos/examples/update_toc'

'update_toc.sos'

'update_toc'

'/Users/bpeng1/sos/examples/update_toc.sos'

'sos'

The last example is interesting because it applies three converters in sequence and gets the name of the grandparent directory, the equivalent of `basename(dirname(dirname(file)))`.
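
Chained converters can be modelled in Python as a left-to-right application of `os.path` functions. The sketch below covers only the path-related converters, takes their meaning from the "effect" column of the table above, and uses the hypothetical names `CONVERTERS` and `convert`.

```python
import os.path

# Only the path-related converters from the table are sketched here.
CONVERTERS = {
    'a': lambda p: os.path.abspath(os.path.expanduser(p)),
    'b': os.path.basename,
    'd': os.path.dirname,
    'n': lambda p: os.path.splitext(p)[0],
    'u': os.path.expanduser,
}

def convert(path, flags):
    """Apply one converter per character of `flags`, from left to right."""
    for flag in flags:
        path = CONVERTERS[flag](path)
    return path

file = '~/sos/examples/update_toc.sos'
print(convert(file, 'n'))    # ~/sos/examples/update_toc
print(convert(file, 'b'))    # update_toc.sos
print(convert(file, 'bn'))   # update_toc
print(convert(file, 'ddb'))  # sos
```

The order of the flags matters: `bn` first takes the base name and then removes the extension, while `ddb` climbs two directories before taking the base name.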

Finally, the `,` converter can be used to output Python sequences with items separated by commas instead of spaces. For example, to pass the items of a Python sequence (here the keys of a dictionary) to R as a character vector, you can write:

In [13]:
salary = {'James': 20, 'Bob': 25, 'Kathy': 18}
R:
    employee = c(${salary!,r})
    print(employee)

[1] "Bob"   "James" "Kathy"


Here the `r` converter quotes the strings, and the `,` converter joins the strings with `,`.

In [14]:
"${salary!,r}"

"'Bob','James','Kathy'"

Although the SoS format specifiers are convenient to use, you are not limited to these rules and can define your own ways to present objects. For example

In [15]:
def r_list(obj):
    return 'c(' + ','.join('{!r}'.format(x) for x in obj) + ')'

"${r_list(salary)}"

"c('James','Kathy','Bob')"

#### Inclusion of sigil in string

If you need to stop SoS from interpolating some expressions in a string, you can 

1. Use single quotes, or
2. precede the SoS sigil with a backslash (`\`)

For example, the following script includes a shell script with shell variable `${file}` that is not interpolated by SoS:

In [16]:
[10]
title = "Sample ${sample_names[0]} results"
run:
    echo ${title}
    for file in a.txt b.txt c.txt
    do
        echo Processing \${file} ...
    done


Sample A results
Processing a.txt ...
Processing b.txt ...
Processing c.txt ...


#### Alternative sigil

If your SoS script contains long bash, perl, or other scripts in which `${ }` is frequently used, it can be tedious and error prone to escape all sigils in these scripts with backslashes. In this case you can assign an alternative sigil to the steps using a `sigil` section option.

For example, the example above could be written as 

In [17]:
[10: sigil='%( )']
title = "Sample %(sample_names[0]) results"
run:
    echo %(title)
    for file in a.txt b.txt c.txt
    do
        echo Processing ${file} ...
    done

Sample A results
Processing a.txt ...
Processing b.txt ...
Processing c.txt ...


Here the step option `sigil` applies the alternative sigil `%( )` to this particular step only, so `${file}` is passed to the shell untouched.

You can define any sigil as long as it has a left sigil and a right sigil separated by a space. You can even use sigils with identical left and right parts (e.g. `# #`), although such sigils are more prone to errors.
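
The "left sigil, space, right sigil" rule can be illustrated with a small Python sketch that builds a regular expression from the sigil string. This is an assumption-laden illustration, not the SoS implementation: `make_pattern` and `interpolate` are hypothetical names, and the non-greedy match cannot handle expressions that themselves contain the right sigil.

```python
import re

def make_pattern(sigil):
    """Build a regex from a sigil such as '${ }' or '%( )'."""
    left, right = sigil.split(' ')
    # Non-greedy match so that identical left/right sigils like '# #' still pair up.
    return re.compile(re.escape(left) + r'(.+?)' + re.escape(right))

def interpolate(text, namespace, sigil='${ }'):
    """Evaluate every expression enclosed by `sigil` and substitute its value."""
    pattern = make_pattern(sigil)
    return pattern.sub(lambda m: str(eval(m.group(1), {}, namespace)), text)

script = 'echo %(title)\necho Processing ${file} ...'
print(interpolate(script, {'title': 'Sample A results'}, sigil='%( )'))
```

With `sigil='%( )'` only `%(title)` is replaced, and `${file}` reaches the shell unchanged.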

### Script style function

SoS allows you to call a SoS `action` (basically a Python function) that accepts a script (string) as its first parameter in a special script format. For example,

```sos
R("""
pdf('${_input}')
plot(0, 0)
dev.off()
""", workdir='result')
```

can be written as

```sos
R:     workdir='result'
pdf('${_input}')
plot(0, 0)
dev.off()
```

**The script is a string without quotation marks** and the normal string interpolation will take place. You can also indent the script (add the same amount of leading whitespace to all lines) and write the action as

```sos
R:  workdir='result'
   pdf('${_input}')
   plot(0, 0)
   dev.off()
```

The latter is much preferred because it avoids trouble if your script contains strings such as `[1]` and `option:` (which would otherwise be treated as SoS directives), and more importantly, allows starting a new statement from a non-indented line. For example, `print('Hello world')` would be considered part of an R script in

```sos
R:  workdir='result'
pdf('${_input}')
plot(0, 0)
dev.off()

print('Hello world')
```

but a separate statement in 

```sos
R:  workdir='result'
   pdf('${_input}')
   plot(0, 0)
   dev.off()

print('Hello world')
```

Although the script format is more concise and easier to read, it is limited to actions that accept a string as their first parameter, and such actions cannot return a value or be used within `try/except` or `if/else` statements.
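
The difference between the indented and non-indented forms can be pictured with a Python sketch that separates an indented script from the statements that follow it. It is a simplified model under the assumption that the script ends at the first non-blank, non-indented line; `split_indented_script` is a hypothetical helper, not a SoS function.

```python
import textwrap

def split_indented_script(lines):
    """Split `lines` into (script, remaining_lines) for the indented script format."""
    body = []
    for index, line in enumerate(lines):
        # A non-blank line that starts at column 0 ends the script.
        if line.strip() and not line[0].isspace():
            rest = lines[index:]
            break
        body.append(line)
    else:
        rest = []
    # Remove the common indentation and surrounding blank lines.
    script = textwrap.dedent('\n'.join(body)).strip('\n')
    return script, rest

lines = [
    "   pdf('${_input}')",
    "   plot(0, 0)",
    "   dev.off()",
    "",
    "print('Hello world')",
]
script, rest = split_indented_script(lines)
print(script)
print(rest)  # ["print('Hello world')"]
```

In this model the `print` call is returned separately from the R script, which corresponds to the second of the two examples above.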

## File structure

A complete SoS script would have a **header**, followed by a **global section** (without section header), and one or more SoS **sections** with header. SoS **pre-processors** can be used to include other scripts or exclude parts of the scripts conditionally. None of the parts is required so an empty script is a valid SoS script.

### File header

A SoS script usually starts with lines

```python
#!/usr/bin/env sos-runner
#fileformat=SOS1.0
```

The first line allows the script to be executed by command `sos-runner` if it is executed as an executable script. The second line tells SoS the format of the script. The `#fileformat` line does not have to be the first or second line but should be in the first comment block. SOS format 1.0 is assumed if no format line is present.
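
The lookup rule for the format line can be sketched in Python as follows. The sketch assumes that the "first comment block" is the run of lines starting with `#` at the top of the file and that the line has exactly the form `#fileformat=...`; the function name `file_format` is hypothetical.

```python
def file_format(script_text, default='SOS1.0'):
    """Return the value of the #fileformat line in the leading comment block."""
    prefix = '#fileformat='
    for line in script_text.splitlines():
        if not line.startswith('#'):
            break  # the first comment block has ended
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return default  # no format line: SOS1.0 is assumed

header = "#!/usr/bin/env sos-runner\n#fileformat=SOS1.0\n[10]\n"
print(file_format(header))            # SOS1.0
print(file_format("[10]\nrun:\n"))    # SOS1.0 (default)
```

A `#fileformat` line that appears after the first non-comment line is ignored by this sketch, in line with the rule stated above.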

### Global section and default variables

Python functions, classes, and variables can be defined or imported (using the Python `import` statement) before any SoS step is defined. These definitions usually contain variables such as the version and date of the script, paths to various resources, and utility functions that will be used by later steps. **These definitions are visible to all steps of workflows**.

A global section can be defined without a section header in a `.sos` file as statements before any SoS step, or with a `[global]` section.

SoS defines the following variables before any other variables are defined:

* **`SOS_VERSION`**: Version of the SoS command.
* **`CONFIG`**: A dictionary of configurations specified by the global sos configuration file (`~/.sos/config.yml`), local configuration file (`./config.yaml`) and command line option `-c config_file`. 

### SoS Sections

A **step** refers to a step of a SoS workflow and is defined by a **section** in a SoS script. A SoS script can define multiple workflows from multiple sections. A section can define multiple steps of one or more workflows.

A section starts with a header in the format of

```
[names: options]
```

The header should start with a `[` from the beginning of a line and end with a `]`. It can contain one or more names with optional section options. Please refer to [workflow specification](Workflow_Specification.html) for the specification of workflows from sections.
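
As a rough illustration of the header format, the following Python sketch splits a header line into its names and its raw option string. It is a simplification that assumes names contain no `:` or `,` (for example in a parenthesized description), and `parse_header` is a hypothetical name rather than a SoS function.

```python
def parse_header(line):
    """Split '[names: options]' into a list of names and the raw option string."""
    text = line.rstrip()
    if not (text.startswith('[') and text.endswith(']')):
        raise ValueError('not a section header: %r' % line)
    inner = text[1:-1]
    # Everything before the first ':' is the name list, the rest are options.
    names, _, options = inner.partition(':')
    return [name.strip() for name in names.split(',')], options.strip()

print(parse_header('[gff_0]'))              # (['gff_0'], '')
print(parse_header("[10: sigil='%( )']"))   # (['10'], "sigil='%( )'")
print(parse_header('[gff_0, gff_1]'))       # (['gff_0', 'gff_1'], '')
```

A line with leading whitespace before `[` is rejected, because the header must start at the beginning of a line.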

A section can have arbitrary Python statements and SoS-specific statements that define the input, output, and dependent targets, and external tasks of the step. These statements start with the keywords `input:`, `output:`, `depends:`, and `task:`. Please refer to [SoS step](SoS_Step.html) for more details about these statements.