# AdaptiveMD

## Example 4 - Custom `Task` objects

### 0. Imports

In [1]:
import sys, os

In [2]:
from adaptivemd import (
    Project, Task, File
)

Let's open our `example-worker` project by its name. If you completed the first examples, this should all work out of the box.

In [3]:
project = Project('example-worker')

Open all connections to the `MongoDB` and `Session` so we can get started.

Let's see again where we are. These numbers will depend on whether you run this notebook for the first time or just continue again. Unless you delete your project it will accumulate models and files over time, as is our ultimate goal.

In [4]:
print project.files
print project.generators
print project.models

<StoredBundle for with 78 file(s) @ 0x10890f690>
<StoredBundle for with 2 file(s) @ 0x10890f650>
<StoredBundle for with 2 file(s) @ 0x10890f610>


Now restore our old ways to generate tasks by loading the previously used generators.

In [5]:
engine = project.generators['openmm']
modeller = project.generators['pyemma']
pdb_file = project.files['initial_pdb']

## A simple task

A task is in essence a bash script-like description of what should be executed by the worker. It has details about files to be linked to the working directory, bash commands to be executed and some meta information about what should happen in case we succeed or fail.

#### The execution structure

Let's first explain briefly how a task is executed and what its components are. This was originally built to be compatible with radical.pilot (RP) and still is. So, if you are familiar with it, all of the following information should sound very familiar.

A task is executed from within a unique directory that only exists for this particular task. These are located in `adaptivemd/workers/` and look like 

```
worker.0x5dcccd05097611e7829b000000000072L/
```

The long number is a hex representation of the UUID of the task. If you are curious, type
```
print hex(my_task.__uuid__)
```
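
As a small aside, the relationship between the UUID and the directory name can be sketched with the standard `uuid` module alone. The integer below is copied from the example directory name above; the `worker.<hex>/` pattern is inferred from that single example and is an assumption, not taken from the AdaptiveMD source. On Python 3, `hex` no longer appends the trailing `L` that Python 2 adds for long integers.

```python
import uuid

# Integer copied from the example directory name (Python 2's trailing `L` dropped).
task_uuid = uuid.UUID(int=0x5dcccd05097611e7829b000000000072)

# hex() of the integer value reproduces the suffix of the directory name.
# The "worker.<hex>/" pattern is an assumption based on the example above.
dirname = "worker.{}/".format(hex(task_uuid.int))

print(task_uuid)          # canonical dashed form of the same UUID
print(task_uuid.version)  # UUID version encoded in the number
print(dirname)
```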

Then we change directory to this folder, write a `running.sh` bash script and execute it. This script is created from the task definition and also depends on your resource settings (which basically only contain the path to the workers directory, etc.).

The script is divided into 3 parts.

1. **Pre-Exec**: Things to happen before the main command

    1. **Pre-Stage**: files are staged (copied or linked) to the working dir
    2. **Pre-Bash**: some bash commands

2. **Command**: the main command is executed (this is just one command)

3. **Post-Exec**: Things to happen after the main command

    1. **Post-Bash**: some bash commands
    2. **Post-Stage**: Move all files we want to keep to a more permanent place

A note on this structure: it matches radical.pilot's structure, although so far we are not using the main command. RP uses the main command to provide MPI support, etc. So for us, a task mainly consists of the pre-stage phase, some bash script and then the post-stage phase. RP also provides some more features, such as copying to remote locations. This is why the pre-stage phase is expressed differently.
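
The ordering of the phases can be sketched as a plain list concatenation. This is only an illustration of the structure described above: `build_script` is a hypothetical helper, not part of AdaptiveMD, and the paths in the usage example are made up.

```python
def build_script(pre_stage, pre_bash, command, post_bash, post_stage):
    """Concatenate the phases in execution order (hypothetical helper)."""
    script = []
    script += pre_stage      # Pre-Exec, part 1: stage files into the working dir
    script += pre_bash       # Pre-Exec, part 2: bash lines before the command
    if command:              # the main command is a single line and may be empty
        script.append(command)
    script += post_bash      # Post-Exec, part 1: bash lines after the command
    script += post_stage     # Post-Exec, part 2: move results to a permanent place
    return script


# Illustrative input only; the paths below are invented for this sketch.
script = build_script(
    pre_stage=["ln -s ../staging_area/alanine.pdb initial.pdb"],
    pre_bash=["hostname"],
    command='echo "DONE!"',
    post_bash=[],
    post_stage=["mv output.dcd ../trajs/output.dcd"],
)
print("\n".join(script))
```

Because every phase ends up as lines of one bash script, a long-running call placed in the pre-bash phase behaves the same as one placed in the main command, as long as no MPI launcher is involved.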

Okay, lots of theory, now some real code for running a task that generates a trajectory.

In [6]:
task = engine.task_run_trajectory(project.new_trajectory(pdb_file, 100))

First a look at the different stages

In [7]:
task.pre_exec

[Link('staging:///alanine.pdb' > 'worker://initial.pdb),
 Link('staging:///system.xml' > 'worker://system.xml),
 Link('staging:///integrator.xml' > 'worker://integrator.xml),
 Link('staging:///openmmrun.py' > 'worker://openmmrun.py),
 'hostname',
 'python openmmrun.py -r --report-interval 1 -p CPU --store-interval 1  -t initial.pdb --length 100 output.dcd']

We link several files to the worker directory and rename the .pdb file in the process. Then we call the actual `python` script that runs OpenMM. This would usually be the command, but since the command for us is just another line in the bash script, it practically does not matter. In RP with MPI this would fail!

In [8]:
task.command

'echo "DONE!"'

Well, this is just a dummy command and could even be `None` or empty.

In [9]:
task.post_exec

[Move('worker://output.dcd.restart' > 'sandbox://{}/00000026.dcd.restart),
 Move('worker://output.dcd' > 'sandbox://{}/00000026.dcd)]

And finally we move `output.dcd` and the restart file back to the trajectory folder.

There is a way to list lots of things about tasks, and we will use it a lot to see our modifications.

In [10]:
print task.description

Task: OpenMMEngine [created]

Required : ['staging:///system.xml', 'staging:///integrator.xml', 'staging:///alanine.pdb', 'staging:///openmmrun.py']
Output : ['sandbox://{}/00000026.dcd', 'sandbox://{}/00000026.dcd.restart']
Modified : []

<pretask>
Link('staging:///alanine.pdb' > 'worker://initial.pdb)
Link('staging:///system.xml' > 'worker://system.xml)
Link('staging:///integrator.xml' > 'worker://integrator.xml)
Link('staging:///openmmrun.py' > 'worker://openmmrun.py)
hostname
python openmmrun.py -r --report-interval 1 -p CPU --store-interval 1  -t initial.pdb --length 100 output.dcd
echo "DONE!"
Move('worker://output.dcd.restart' > 'sandbox://{}/00000026.dcd.restart)
Move('worker://output.dcd' > 'sandbox://{}/00000026.dcd)
<posttask>


### Modify a task

As long as a task is not saved and hence placed in the queue, it can be altered in any way. All three phases (five, counting the sub-phases) can be changed separately. You can add things to the staging phases or bash phases, or change the command. So, let's do that now.

#### Add a bash line

In [11]:
task.pre_bash('echo "This new line is pointless"')

In [12]:
print task.description

Task: OpenMMEngine [created]

Required : ['staging:///system.xml', 'staging:///integrator.xml', 'staging:///alanine.pdb', 'staging:///openmmrun.py']
Output : ['sandbox://{}/00000026.dcd', 'sandbox://{}/00000026.dcd.restart']
Modified : []

<pretask>
Link('staging:///alanine.pdb' > 'worker://initial.pdb)
Link('staging:///system.xml' > 'worker://system.xml)
Link('staging:///integrator.xml' > 'worker://integrator.xml)
Link('staging:///openmmrun.py' > 'worker://openmmrun.py)
hostname
python openmmrun.py -r --report-interval 1 -p CPU --store-interval 1  -t initial.pdb --length 100 output.dcd
echo "This new line is pointless"
echo "DONE!"
Move('worker://output.dcd.restart' > 'sandbox://{}/00000026.dcd.restart)
Move('worker://output.dcd' > 'sandbox://{}/00000026.dcd)
<posttask>


As expected, this line was added to the end of the pre-bash phase, that is, after the `python` command.

#### Add staging

Setting up staging is more difficult. The reason is that you normally have no idea where files are located, and hence writing a copy or move by hand is impossible. This is why the staging commands are not bash lines but objects that hold information about the actual file transaction to be done. There are some task methods that help you move files, but files themselves can also generate these commands for you.

Let's move one trajectory file around a little more as an example

In [13]:
traj = project.trajectories.one

In [14]:
transaction = traj.copy()
print transaction

Copy('sandbox://{}/00000000.dcd' > 'worker://00000000.dcd)


This looks like the output of the pre-exec phase. The default for a copy is to place the file in the worker directory under the same name, but you can give another name/location by passing it as an argument.

In [15]:
transaction = traj.copy('delete.pdb')
print transaction

Copy('sandbox://{}/00000000.dcd' > 'worker://delete.pdb)


If you do not want to place it in the worker directory, you have to specify the location. You can do so with the prefixes (`shared://`, `sandbox://`, `staging://`, as explained in the previous examples).

In [16]:
transaction = traj.copy('staging:///delete.pdb')
print transaction

Copy('sandbox://{}/00000000.dcd' > 'staging:///delete.pdb)


Or if you only give a path, the old filename is used.

In [17]:
transaction = traj.copy('staging:///')
print transaction

Copy('sandbox://{}/00000000.dcd' > 'staging:///00000000.dcd)
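
The naming rules shown in the last four cells can be summarised in a few lines. This is a sketch that reproduces the observed targets, assuming nothing about how AdaptiveMD implements them: `resolve_target` and `PREFIXES` are hypothetical names introduced only for this illustration.

```python
import posixpath

# Prefixes mentioned in this notebook; the tuple itself is an illustrative assumption.
PREFIXES = ("worker://", "staging://", "sandbox://", "shared://", "file://")


def resolve_target(source, target=None):
    """Hypothetical re-statement of the target rules seen in the cells above."""
    basename = posixpath.basename(source)
    if target is None:
        # No argument: same name, placed in the worker directory.
        return "worker://" + basename
    if not target.startswith(PREFIXES):
        # Bare name without a prefix: placed in the worker directory.
        target = "worker://" + target
    if target.endswith("/"):
        # Only a path was given: keep the old filename.
        target += basename
    return target


source = "sandbox://{}/00000000.dcd"
for arg in (None, "delete.pdb", "staging:///delete.pdb", "staging:///"):
    print(resolve_target(source, arg))
```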


Besides `.copy` you can also `.move` or `.link` files.

In [18]:
transaction = pdb_file.copy('staging:///delete.pdb')
print transaction
transaction = pdb_file.move('staging:///delete.pdb')
print transaction
transaction = pdb_file.link('staging:///delete.pdb')
print transaction

Copy('file://{}/alanine.pdb' > 'staging:///delete.pdb)
Move('file://{}/alanine.pdb' > 'staging:///delete.pdb)
Link('file://{}/alanine.pdb' > 'staging:///delete.pdb)


#### Local files

Let's mention these because they require special treatment. We cannot copy files to the HPC directly (as RP can); we need to store them in the DB first.

In [19]:
new_pdb = File('file://../files/ntl9/ntl9.pdb').load()

Make sure you use `file://` to indicate that you are using a local file. The above example uses a relative path, which will be replaced by an absolute one. Otherwise we would run into trouble once we open the project from a different directory.

In [20]:
print new_pdb.location

file:///Users/jan-hendrikprinz/Studium/git/adaptivemd/examples/files/ntl9/ntl9.pdb


Note that there are now three `/`: two from the `://` and one from the root directory of your machine.
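
Where the third `/` comes from can be seen with the standard library alone. The helper below is an illustration, not AdaptiveMD's own code: `to_file_url` is a hypothetical name, and the sketch assumes a POSIX system where absolute paths start with `/`.

```python
import os


def to_file_url(path):
    """Turn a local path into an absolute file:// location (illustrative only)."""
    # os.path.abspath resolves ".." against the current working directory,
    # so the result no longer depends on where the project is opened later.
    return "file://" + os.path.abspath(path)


location = to_file_url("../files/ntl9/ntl9.pdb")
print(location)
```

The printed location depends on your current working directory, which is exactly why the relative form would be unreliable to store.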

The `load()` at the end really loads the file and when you save this `File` now it will contain the content of the file. You can access this content as seen in the previous example.

In [21]:
print new_pdb.get_file()[300:]

          H  
ATOM      4  H3  MET     1      34.640  28.530  33.770  0.00  0.00           H  
ATOM      5  CA  MET     1      32.630  27.950  33.530  0.00  0.00           C  
ATOM      6  HA  MET     1      32.790  26.910  33.800  0.00  0.00           H  
ATOM      7  CB  MET     1      31.260  28.370  34.050  0.00  0.00           C  
ATOM      8  HB1 MET     1      30.520  27.780  33.510  0.00  0.00           H  
ATOM      9  HB2 MET     1      31.160  28.130  35.110  0.00  0.00           H  
ATOM     10  CG  MET     1      30.970  29.860  33.830  0.00  0.00           C  
ATOM     11  HG1 MET     1      31.530  30.440  34.570  0.00  0.00           H  
ATOM     12  HG2 MET     1      31.290  30.170  32.840  0.00  0.00           H  
ATOM     13  SD  MET     1      29.240  30.300  34.020  0.00  0.00           S  
ATOM     14  CE  MET     1      29.180  31.800  33.040  0.00  0.00           C  
ATOM     15  HE1 MET     1      28.810  31.560  32.040  0.00  0.00           H  
ATOM     16  H

For local files you normally use `.transfer`, but `copy`, `move` or `link` work as well. In practice there is no difference, since the file now exists only in the DB, and copying from the DB to a place on the HPC results in a simple file creation.

Now, we want to add a command to the staging and see what happens.

In [22]:
transaction = new_pdb.transfer()
print transaction

Transfer('file://{}/ntl9.pdb' > 'worker://ntl9.pdb)


In [23]:
task.pre_stage(transaction)

In [24]:
print task.description

Task: OpenMMEngine [created]

Required : ['staging:///system.xml', 'file://{}/ntl9.pdb', 'staging:///integrator.xml', 'staging:///alanine.pdb', 'staging:///openmmrun.py']
Output : ['sandbox://{}/00000026.dcd', 'sandbox://{}/00000026.dcd.restart']
Modified : []

<pretask>
Link('staging:///alanine.pdb' > 'worker://initial.pdb)
Link('staging:///system.xml' > 'worker://system.xml)
Link('staging:///integrator.xml' > 'worker://integrator.xml)
Link('staging:///openmmrun.py' > 'worker://openmmrun.py)
Transfer('file://{}/ntl9.pdb' > 'worker://ntl9.pdb)
hostname
python openmmrun.py -r --report-interval 1 -p CPU --store-interval 1  -t initial.pdb --length 100 output.dcd
echo "This new line is pointless"
echo "DONE!"
Move('worker://output.dcd.restart' > 'sandbox://{}/00000026.dcd.restart)
Move('worker://output.dcd' > 'sandbox://{}/00000026.dcd)
<posttask>


We now have one more transfer command. But something else has changed: there is one more file listed as required. So, the task can only run if that file exists, but since we loaded it into the DB, it exists (for us). By contrast, the trajectory this task will create (`00000026.dcd` in the output above) does not exist yet. If that were a requirement, the task would fail. But let's check that our new file exists.

In [25]:
new_pdb.exists

True
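
The rule described above, that a task can only run when every required file exists, can be modelled with a simple membership test. This is a sketch of the rule as stated in the text, not of AdaptiveMD's internals: `is_runnable` is a hypothetical helper and the set of existing locations is filled in by hand from the task description above.

```python
def is_runnable(required, existing):
    """Return (ready, missing): ready only if every required location exists."""
    missing = [loc for loc in required if loc not in existing]
    return len(missing) == 0, missing


# Locations copied from the task description above; the set is built by hand.
existing = {
    "staging:///system.xml",
    "staging:///integrator.xml",
    "staging:///alanine.pdb",
    "staging:///openmmrun.py",
    "file://{}/ntl9.pdb",  # exists because it was loaded into the DB
}

ready, missing = is_runnable(sorted(existing), existing)
print(ready, missing)

# A trajectory that has not been produced yet would block the task.
ready_later, missing_later = is_runnable(
    sorted(existing) + ["sandbox://{}/00000026.dcd"], existing
)
print(ready_later, missing_later)
```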

Okay, the PDB file is now staged, so any real bash command could work with a file `ntl9.pdb`. Alright, let's output its stats.

In [26]:
task.pre_bash('stat ntl9.pdb')

Now we can run this task as before and see if it works. (Make sure you still have a worker running.)

In [27]:
project.queue(task)

And check the state of the task.

In [38]:
task.state

u'success'

If we did not screw up the task, it should have succeeded and we can look at the STDOUT.

In [39]:
print task.stdout

12:29:50 [worker:3] stderr from running task
Stevie.fritz.box
GO...
Reading PDB
Done
Initialize Simulation
Done.
('# platform used:', 'CPU')
('# temperature:', Quantity(value=300.0, unit=kelvin))
START SIMULATION
DONE
('Written to file', 'output.dcd')
('Written to file', 'output.dcd.restart')
This new line is pointless
16777220 97228327 lrwxr-xr-x 1 jan-hendrikprinz staff 0 14 "Mar 17 12:29:46 2017" "Mar 17 12:29:46 2017" "Mar 17 12:29:46 2017" "Mar 17 12:29:46 2017" 4096 8 0 ntl9.pdb
DONE!



Well, great, we have the pointless output and the stats of the newly staged file `ntl9.pdb`.

#### Change the command

As said before, this is not really necessary, except possibly for MPI calls in the future, but you can change the actual main bash command with `call`.

In [40]:
task.call('echo')
print task.command
print task.executable  # show main executable
print task.arguments  # show arguments

echo 
echo
[]


The `call` method is special and works like `str.format`, while parsing the arguments for file locations and making sure these are wrapped in quotes, etc.

In [41]:
task.call('echo {} {}', 'HELLO', 'WORLD')
print task.command
print task.executable
print task.arguments

echo "HELLO" "HELLO"
echo
['HELLO', 'HELLO']


### What does a real script look like

Just for fun let's create the same scheduler that the `adaptivemdworker` uses, but from inside this notebook.

In [42]:
from adaptivemd import WorkerScheduler

In [43]:
sc = WorkerScheduler(project.resource)

If you really wanted to use the worker, you would need to initialize it, and it would create directories and stage files for the generators, etc. For that you need to call `sc.enter(project)`, but since we only want it to parse our tasks, we only set the project without invoking initialization. You should normally not do that.

In [44]:
sc.project = project

Now we can use a function `.task_to_script` that will parse a task into a bash script. So this is really what would be run on your machine now.

In [45]:
sc.task_to_script(task)

['ln -s ../staging_area/alanine.pdb initial.pdb',
 'ln -s ../staging_area/system.xml system.xml',
 'ln -s ../staging_area/integrator.xml integrator.xml',
 'ln -s ../staging_area/openmmrun.py openmmrun.py',
 '# write file `ntl9.pdb` from DB',
 'mv -s _file_ntl9.pdb ntl9.pdb',
 'hostname',
 'python openmmrun.py -r --report-interval 1 -p CPU --store-interval 1  -t initial.pdb --length 100 output.dcd',
 'echo "This new line is pointless"',
 'stat ntl9.pdb',
 'echo "HELLO" "HELLO"',
 'mv output.dcd.restart ../../projects/example-worker/trajs/00000026.dcd.restart',
 'mv output.dcd ../../projects/example-worker/trajs/00000026.dcd']

Now you see that all file paths have been properly interpreted to work. Note the comment about a temporary file from the DB that is then renamed. This is a little trick to be compatible with RP's way of handling files. (TODO: We might change this to just write to the target file. Need to check if that is still consistent.)
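
To make the translation step more concrete, here is a rough sketch of how a staging action could be turned into a bash line. It is not the `WorkerScheduler` implementation: `ROOTS`, `COMMANDS`, `to_path` and `to_bash` are hypothetical names, the `../staging_area/` mapping is read off the script output above, and using `cp` for a copy is an assumption.

```python
# Hypothetical mapping from location prefixes to paths relative to the
# task's working directory (the staging path is taken from the output above).
ROOTS = {
    "worker://": "",
    "staging:///": "../staging_area/",
}

# Assumed shell command per action; only `ln -s` and `mv` appear in the output above.
COMMANDS = {"Link": "ln -s", "Copy": "cp", "Move": "mv"}


def to_path(location):
    """Replace a known prefix by its directory (illustrative only)."""
    for prefix, root in ROOTS.items():
        if location.startswith(prefix):
            return root + location[len(prefix):]
    raise ValueError("unknown prefix in {!r}".format(location))


def to_bash(action, source, target):
    """Render one staging action as a single bash line."""
    return "{} {} {}".format(COMMANDS[action], to_path(source), to_path(target))


print(to_bash("Link", "staging:///alanine.pdb", "worker://initial.pdb"))
```

Keeping the prefix table separate from the command table is what lets the same task description run on resources with different directory layouts.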

Next, we will talk about the factories for `Task` objects, called `generators`.

In [46]:
project.close()