# Execution of tasks on remote host

SoS allows you to execute tasks on remote hosts with or without their own file systems. For example, you can execute a complex workflow mostly locally, but execute a few jobs on a remote host if it provides more computing power, or if it has some software that cannot be installed locally. The remote host could have its own file system (separate systems), share its entire file system with the local machine (e.g. nodes on a cluster), or share only some storage with the local machine (e.g. a shared volume), so file synchronization will be needed in some cases.

With help from a few runtime options (options to `task`), SoS can

* Copy specified local files to the remote host, possibly to different directories
* Start a SoS task on the remote machine and wait for the completion of the task
* Copy results back from the remote host if the execution is successful

## System setup

### Public-key access

Following any online tutorial, set up public-key access from your local machine to the remote host. If your public key does not work, check the file permissions of `~/.ssh`, of the files under `~/.ssh`, and in some cases of `$HOME`. After setting up the server, make sure you can log in without a password using the command

```bash
% ssh remote-host
```

### Install software

You will need to install the latest version of sos (preferably the same version on the local and remote hosts) and the software you need to run. Test the installation by logging in to the remote machine and running

```bash
% sos -h
```

Commands that are available in login shell are not necessarily available during remote execution. Basically, remote execution through `ssh` invokes a non-interactive and non-login shell with basic `$PATH`. SoS tries to address this problem by executing commands through a login shell

```bash
% ssh remote-host "bash --login -c 'sos execute task_id'"
```

but default `.bashrc` on the remote server might contain a line like

```bash
[ -z "$PS1" ] && return
```

that makes it exit when `bash` is not running interactively. This line has to be removed in order to have complete `$PATH` during remote execution.

Now, run the command

```bash
% ssh remote-host "bash --login -c 'sos -h'"
```

from your local machine and see if `sos` can be invoked. Similarly, check if the command you would like to execute remotely can be executed in this manner.
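
The same check can be scripted. The following is a minimal sketch, not part of SoS: `login_shell_argv` and `check_remote_command` are hypothetical helper names, and `remote-host` is the placeholder host used above. It only wraps the `ssh ... "bash --login -c '...'"` pattern shown in this section.

```python
import shlex
import subprocess


def login_shell_argv(host, command):
    """Build the ssh argument list that runs `command` in a bash login shell."""
    # Quote the command so the remote shell passes it to `bash -c` as one argument.
    remote = "bash --login -c " + shlex.quote(command)
    return ["ssh", host, remote]


def check_remote_command(host, command, timeout=30):
    """Return True if `command` exits with status 0 in a remote login shell."""
    result = subprocess.run(
        login_shell_argv(host, command),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return result.returncode == 0
```

Calling `check_remote_command("remote-host", "sos -h")` requires a reachable host with public-key access; replace `sos -h` with any command you plan to run remotely.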

### Configure `address`

This step is optional but is highly recommended. Basically you can save necessary information for each remote host that you would like to use in a SoS configuration file so that you do not have to specify them one by one.

First, give your host an alias so that you do not have to specify the long URL each time. To do this, execute the command

```
% sos config --global --set hosts.monster.address dcdrlue8ee.yourdomain.com
```

where `monster` is a short alias and `dcdrlue8ee.yourdomain.com` is the complete address.

This command writes the following entry to `~/.sos/config.yml`

```bash
$ cat ~/.sos/config.yml
hosts:
  monster:
    address: dcdrlue8ee.yourdomain.com
```

You can write to this file directly if you are familiar with the YAML format.

If your account name differs between the local and remote servers, the complete address should be `username@address`. In this example, the address would be `john@dcdrlue8ee.yourdomain.com` if the remote server account is `john`.

### Configure `path_map`

`path_map` is a list of directory mappings between local and remote directories. For example, if you work locally on a Mac machine with home directory `/Users/myuser`, and the remote server is a Linux machine with home directory `/home/myuser`, you should define a `path_map` using command

```
% sos config --global --set hosts.monster.path_map /Users/myuser/:/home/myuser/
```

In this way, if the local data is `/Users/myuser/projects/input.fastq`, the path will be translated to `/home/myuser/projects/input.fastq` during remote execution.

In more complicated cases where there are different directories, more than one mapping can be specified. For example, if you have directories under different volumes, you can map them differently using

```bash
% sos config --global --set hosts.monster.path_map \
    /Users/myuser/projects/:/home/myuser/scratch/projects/  \
    /Volumes/Resource:/home/myuser/resource \
    /Users/myuser:/home/myuser
```

This command will write the following entries to the configuration file

```
$ cat ~/.sos/config.yml
hosts:
  monster:
    address: dcdrlue8ee.yourdomain.com
    path_map:
    - /Users/myuser/projects/:/home/myuser/scratch/projects/
    - /Volumes/Resource:/home/myuser/resource
    - /Users/myuser:/home/myuser
```

and will map your local directories as follows

```
~/projects/data ==> /home/myuser/scratch/projects/data
/Volumes/Resource/hg19.fasta ==> /home/myuser/resource/hg19.fasta
~/myscript.py ==> /home/myuser/myscript.py
```

Note that

1. Both source and destination paths should be absolute (start with `/` on Linux-like systems).
2. SoS expands local directories to absolute paths before applying `path_map`.
3. SoS applies path maps in the order in which they are specified, so you should specify general mappings after more specific ones. For example, if you specify `/Users/myuser:/home/myuser` before `/Users/myuser/projects/:/home/myuser/scratch/projects/`, path `/Users/myuser/projects/input.txt` would be mapped to `/home/myuser/projects/input.txt`, not to the intended path `/home/myuser/scratch/projects/input.txt`.
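
The ordering rule in note 3 can be illustrated with a small sketch. This is not SoS's implementation: `map_path` is a hypothetical function, and matching on whole directory components after stripping trailing slashes is an assumption made for the illustration.

```python
def map_path(path, path_map):
    """Translate an absolute local path using the first matching entry.

    `path_map` is a list of 'local:remote' strings, tried in the given order.
    """
    for entry in path_map:
        local, remote = entry.split(":", 1)
        # Assumption: ignore trailing slashes and match whole directory names.
        local, remote = local.rstrip("/"), remote.rstrip("/")
        if path == local or path.startswith(local + "/"):
            return remote + path[len(local):]
    return path  # no entry applies, so the path is left unchanged


specific_first = [
    "/Users/myuser/projects/:/home/myuser/scratch/projects/",
    "/Volumes/Resource:/home/myuser/resource",
    "/Users/myuser:/home/myuser",
]
general_first = [specific_first[2], specific_first[0], specific_first[1]]

print(map_path("/Users/myuser/projects/input.txt", specific_first))
print(map_path("/Users/myuser/projects/input.txt", general_first))
```

With the specific entry first, the path lands under `scratch/projects`. With the general entry first, the general mapping wins and the specific one is never consulted.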


### Configure `shared`

Option `shared` tells SoS which file systems are shared between local and remote hosts so that it does not have to synchronize files under these directories between the hosts.

* SoS assumes independent file systems so you do not have to specify option `shared` if the local and remote hosts do not share any file system.
 
* If your local and remote hosts share all file systems, you should use command

  ```bash
  % sos config --global --set hosts.monster.shared /
  ```
  to indicate that the root directory is shared so no cross-network copy is needed.
  
* If your local and remote hosts share one or more volumes, you can specify them with command

  ```bash
  % sos config --global --set hosts.monster.shared /projects /data
  ```
  to indicate that files under these directories are available on the remote host.

Shared file systems do not have to be mounted at the same locations. For example, a local file system `/projects` might be available at the remote host under `/scratch/projects`. In this case, you should

* Set `/projects` as `shared` so that files under `/projects` will not be copied.
* Set `/projects:/scratch/projects` in `path_map` so that the path can be correctly translated between local and remote hosts.

It is important to remember that **SoS does not copy files under shared directories**. If the local and remote hosts share a file system but you really would like to copy files to a different directory, you can omit the `shared` option and let SoS copy files as if the file systems were separate.
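
The effect of `shared` on file transfer can be sketched as a simple filter. The functions below are hypothetical and only illustrate the rule stated above (files under a shared directory are skipped); how SoS performs this check internally is not specified here.

```python
import posixpath


def is_shared(path, shared):
    """Return True if the absolute `path` lies under one of the `shared` directories."""
    path = posixpath.normpath(path)
    for root in shared:
        root = posixpath.normpath(root)
        # "/" means every file system is shared.
        if root == "/" or path == root or path.startswith(root + "/"):
            return True
    return False


def files_to_copy(paths, shared):
    """Keep only the files that are not on a shared file system."""
    return [p for p in paths if not is_shared(p, shared)]


print(files_to_copy(
    ["/projects/a.txt", "/data/b.txt", "/home/me/c.txt"],
    shared=["/projects", "/data"],
))
```

Only `/home/me/c.txt` remains in the list, so it is the only file that would need to be transferred.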

## Options `on_host`, `to_host` and `from_host`

Now that you have your machine configured, you should try to copy some files and see if the transfer works correctly. File copy is specified with options `to_host` and `from_host`. You can test these options using a simple SoS step such as the following (replace filenames with files you have, of course):

```
[1]
task:
    on_host='monster',
    to_host=['~/projects/data/test1', '/Volumes/Resource/hg19.fasta'],
    from_host='~/projects/data/test1.res'
run:
    echo "Hello, World"
```

Note that:

1. Option `on_host` specifies the host to connect to, and allows SoS to retrieve `path_map` from the configuration file.
2. Option `to_host` specifies local files or directories that will be copied to the remote host (using `rsync`, if files are not on a shared file system).
3. Option `from_host` specifies **local files** that need to be copied from the remote host. SoS will use `path_map` to determine paths of the corresponding remote files to be copied.
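
The direction of each transfer is easy to confuse, so here is a hedged sketch of how the two options relate to `rsync`. The helper names are hypothetical, the single-prefix translation stands in for the full `path_map`, and `rsync -a` is an assumed invocation; the command SoS actually runs may differ.

```python
def to_remote(local_path, local_root, remote_root):
    """Translate `local_path` with a single local_root -> remote_root mapping."""
    if not local_path.startswith(local_root):
        raise ValueError(f"{local_path} is not under {local_root}")
    return remote_root + local_path[len(local_root):]


def send_argv(host, local_path, remote_path):
    """to_host direction: the local file is the source."""
    return ["rsync", "-a", local_path, f"{host}:{remote_path}"]


def receive_argv(host, local_path, remote_path):
    """from_host direction: the local file is the destination."""
    return ["rsync", "-a", f"{host}:{remote_path}", local_path]


host = "dcdrlue8ee.yourdomain.com"
local_result = "/Users/myuser/projects/data/test1.res"
remote_result = to_remote(local_result, "/Users/myuser", "/home/myuser")
print(receive_argv(host, local_result, remote_result))
```

In both directions the task names the local path; the remote path is always derived from it.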

## Translate your task to be executed on remote host (`task` option `map_vars`)

### Variable translation

Each task has a *context* that contains variables that will be used to, for example, compose scripts to be executed. Even if the task will be executed remotely, you should write your task using local paths. For example, you might have a script that generates a `STAR` index from a fasta file. You can have all these files available locally and write the task as:

```
depends:      hg19_fasta
run:
    STAR \
		--runThreadN 8 \
		--runMode genomeGenerate \
		--genomeDir ${hg19_star_index} \
		--genomeFastaFiles ${hg19_fasta} \
		--sjdbGTFfile ${hg19_genes_gtf} \
		--sjdbOverhang 100
```

**SoS will by default translate all variables (of type string and list of strings) as if they are local paths**. In this case, because `hg19_star_index`, `hg19_fasta` and `hg19_genes_gtf` are all local paths, they will be translated to remote paths during remote execution. Variables that are translated can be viewed with option `-v3` (debug output).

### Option `preserved_vars`

Automatic variable translation is convenient but SoS can make mistakes because it does not really know which variables contain path names that need to be converted. For example, if you do not have `hg19_fasta` locally and would like to use `hg19_fasta` that is available on the remote host, you can add this variable to option `preserved_vars` so that its value will be preserved during context switch:

```sos
task:     on_host='monster', preserved_vars='hg19_fasta'
run:
    STAR \
		--runThreadN 8 \
		--runMode genomeGenerate \
		--genomeDir ${hg19_star_index} \
		--genomeFastaFiles ${hg19_fasta} \
		--sjdbGTFfile ${hg19_genes_gtf} \
		--sjdbOverhang 100
```

Note that 

1. The default value of `map_vars` is `True`, which translates all variables.
2. Variables such as `_input`, `input`, `_output` are automatically translated so you only need to specify non-system variables.
3. You can hard-code paths on the remote host to bypass the variable mapping, but that will make your script host-dependent. It is recommended that you **write your script in local paths** and let SoS do the conversion so that you do not have to change the script itself if you switch to another host with a different `path_map`.
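
As a rough illustration of the context switch, the sketch below translates every string and list-of-strings variable except those named in `preserved_vars`. Both functions are hypothetical, and `simple_translate` is a stand-in with one hard-coded mapping rather than the configured `path_map`.

```python
def simple_translate(path):
    """Stand-in for path_map translation: one hard-coded mapping."""
    local, remote = "/Users/myuser", "/home/myuser"
    return remote + path[len(local):] if path.startswith(local) else path


def translate_context(context, translate, preserved_vars=()):
    """Return a copy of `context` with path-like values translated."""
    translated = {}
    for name, value in context.items():
        if name in preserved_vars:
            translated[name] = value  # kept exactly as written
        elif isinstance(value, str):
            translated[name] = translate(value)
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            translated[name] = [translate(v) for v in value]
        else:
            translated[name] = value  # numbers and other types are untouched
    return translated


context = {
    "hg19_fasta": "/Users/myuser/resource/hg19.fasta",
    "hg19_star_index": "/Users/myuser/resource/STAR",
    "samples": ["/Users/myuser/a.fq", "/Users/myuser/b.fq"],
    "threads": 8,
}
remote_context = translate_context(context, simple_translate, preserved_vars=["hg19_fasta"])
print(remote_context)
```

Here `hg19_fasta` keeps its original value while the other path variables are rewritten for the remote host.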

## Running task remotely

With all the pieces put together, you can now execute the task on the remote host using `task` options

```sos
depends:  hg19_fasta
task:     on_host='monster',
          to_host=[hg19_fasta, hg19_genes_gtf], from_host=hg19_star_index
run:
    STAR \
		--runThreadN 8 \
		--runMode genomeGenerate \
		--genomeDir ${hg19_star_index} \
		--genomeFastaFiles ${hg19_fasta} \
		--sjdbGTFfile ${hg19_genes_gtf} \
		--sjdbOverhang 100
```

Here the default `map_vars=True` is used so all variables are translated.

## Advanced usages

### `path_map` option

You can specify `path_map` as a `task` option. This allows you to write everything in the script clearly and you do not have to use any configuration file. For example,

```sos
task:     on_host='dcdrlue8ee.yourdomain.com', path_map='/Users/myuser:/home/myuser',
          other_options...
```

A `path_map` specified in the script overrides the one in `config.yml`, if any.

### `send_cmd`, `received_cmd`, `execute_cmd`

SoS uses the `rsync` command to exchange files between hosts, and uses `ssh` to execute commands. If the default commands do not work for your configuration (e.g. if you do not have `rsync` and need to use `scp`), you can

1. Use option `-v3` to display the exact command used to transfer files and execute commands.

2. Define options `send_cmd`, `received_cmd` and `execute_cmd` for your particular configuration. These variables should be defined with `${source}` and `${dest}` which will be replaced by source and destination filenames for each file.
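
To show what the `${source}` and `${dest}` placeholders mean, the sketch below fills a command template with Python's `string.Template`. This is only an illustration: `render_command` is a hypothetical helper, the `scp` template is an invented example, and SoS performs its own substitution.

```python
import shlex
from string import Template


def render_command(template, source, dest):
    """Fill the ${source} and ${dest} placeholders of a command template."""
    # Quote both paths so that spaces or special characters survive the shell.
    return Template(template).substitute(
        source=shlex.quote(source),
        dest=shlex.quote(dest),
    )


# Hypothetical template; adapt the host and command to your configuration.
send_template = "scp ${source} remote-host:${dest}"
print(render_command(send_template, "/Users/myuser/a.txt", "/home/myuser/a.txt"))
```

Each file is substituted separately, so a template is rendered once per transferred file.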

