## Queueing System

Allow MD simulations to be scheduled

There are several ways to implement this and we should keep specific aspects in mind:

1. We do not know whether (a) Python will run on the cluster and (b) the cluster can connect to the internet.
2. You want to run on your local machine for testing and, with few changes, on a cluster.
3. You might want to use workers that pull a job from a list, or put jobs in a queue.
4. We want to be able to use different simulation engines: Gromacs, AceMD, OpenMM, ...

    class Strategy(StorableMixin):
        pass    

It would be great if a user could define a strategy using simple building blocks, but I presume this will be too rigid and too complicated. It might be better to provide a way that makes it easy to write a small Python program that does what we want. I could also imagine that you need to subclass a `Strategy` class and implement certain functions.
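
The subclassing route could look roughly like the sketch below. The method name `select_start_frames` and the `RandomRestart` subclass are hypothetical and only illustrate the idea; the `StorableMixin` base is left out so the example is self-contained, and trajectories are plain lists of frame labels.

```py
import random
from abc import ABC, abstractmethod


class Strategy(ABC):
    """Hypothetical base class: a strategy decides where new simulations start."""

    @abstractmethod
    def select_start_frames(self, trajectories, n_tasks):
        """Return n_tasks frames from which new trajectories should be started."""


class RandomRestart(Strategy):
    """Toy strategy: restart from the last frame of randomly chosen trajectories."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def select_start_frames(self, trajectories, n_tasks):
        return [self._rng.choice(trajectories)[-1] for _ in range(n_tasks)]


# Stand-in data: each trajectory is just a list of frame labels.
trajs = [["a0", "a1"], ["b0", "b1", "b2"]]
frames = RandomRestart(seed=0).select_start_frames(trajs, n_tasks=3)
```

A user would then only override `select_start_frames`; everything else (storage, queueing) stays outside the strategy.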

There are several ways for a user to express a strategy:

1. provide a high-level function with lots of options

    ```py
    strat = OneOverCAdaptiveStrategy(trajs='all', prior=1)
    ```

2. use building blocks to define a directed acyclic graph (DAG)

    ```py
    strat = {
        'traj_set': (storage.trajectories.load, slice(None, None)),
        'correlation_matrix': (pyemma.get_correlation, 'traj_set', 'featurizer'),
        'tica_proj': (pyemma.magic_tica_fnc, 'correlation_matrix', ),
        'clustering': (pyemma.magic_clustering_fnc, 'tica_proj', 'traj_set'),
        'count_matrix': (pyemma.count_fnc, 'clustering', 'traj_set'),
        'msm': (pyemma.ml_rev_msm, 'count_matrix'),
    }
    ```

3. use building blocks to create a quasi-Domain Specific Language (DSL) that expresses 2. but is more readable

    ```py
    Sequence([
        Add(
            MSM(
                trajs=Storage.trajectories)),
        Add(
            Clustering())
    ])
    ```
    
4. write python code that is translated into 2. (if possible) and gets executed using the scheduler

    ```py
    
    ```
    
5. write Python code that uses the scheduler and has helper functions to access the existing results

    ```py
    
    ```
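
To make option 2 more concrete, here is a minimal sketch of how such a dictionary-based task graph could be evaluated. It assumes that each node is either a plain value or a tuple `(function, arg, ...)`, and that string arguments naming another node are references. The function `evaluate` and the toy graph are hypothetical; the toy functions stand in for the storage and pyemma calls, and there is no cycle detection.

```py
def evaluate(graph, key, cache=None):
    """Resolve `key` in a dict-based task graph, computing dependencies first."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple) and node and callable(node[0]):
        func, *args = node
        # Strings that name another node are references; anything else is a literal.
        resolved = [
            evaluate(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = node  # plain value, e.g. an externally supplied input
    cache[key] = result
    return result


# Toy graph with stand-in functions (assumption: not real pyemma calls).
graph = {
    "traj_set": [[0, 1, 1, 0], [1, 1, 0, 0]],
    "n_frames": (lambda trajs: sum(len(t) for t in trajs), "traj_set"),
    "n_ones": (lambda trajs: sum(sum(t) for t in trajs), "traj_set"),
    "fraction": (lambda ones, total: ones / total, "n_ones", "n_frames"),
}
fraction = evaluate(graph, "fraction")
```

The cache ensures that a shared dependency such as `traj_set` is only resolved once, which is the main benefit of the graph form over nested function calls.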

    storage = MongoDBStorage('alanine_adaptive')
    queue = WorkerCeleryQS(storage=storage)
    strategy = RespawnAtOneOverC()
    sampler = AdaptiveSampling(storage, queue, strategy)
    sampler.start()

The main routine that combines all aspects.

1. The queueing system
2. A place to store objects
3. The logic to build models
4. The application of the adaptive strategy

Once you have created all these parts you can create the sampler and run it. The queueing system (QS) and the storage (ST) should exist independently of each other.

## Sampler

## PyEmma (build MSMs)

Construct an MSM from given data in the DB

    storage = MongoDBStorage('alanine_adaptive')

    storage.trajectories.save(last_trajectory)

    print(len(storage.trajectories))

### Example 1

This will remember the number of items per store at a certain point in time. If you want to store the result of a query you can simply remember the query function and the storage state to reconstruct the actual result. Very useful to keep track of new objects.
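
A minimal sketch of this idea, assuming stores are append-only so that "the first n items" fully describes a store at an earlier time. The classes `Store` and `StorageState` and their methods are hypothetical stand-ins, not existing API.

```py
class Store:
    """Append-only stand-in for a store such as storage.trajectories."""

    def __init__(self):
        self._items = []

    def save(self, obj):
        self._items.append(obj)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        return self._items[index]


class StorageState:
    """Remembers how many items each store held at one point in time."""

    def __init__(self, stores):
        self.sizes = {name: len(store) for name, store in stores.items()}

    def view(self, stores, name):
        """Items of a store as they existed when the state was taken."""
        return stores[name][: self.sizes[name]]

    def new_items(self, stores, name):
        """Items added to a store after the state was taken."""
        return stores[name][self.sizes[name]:]


def query(trajs):
    """Example query; any function of the store contents would do."""
    return [t for t in trajs if t.startswith("traj")]


stores = {"trajectories": Store()}
stores["trajectories"].save("traj0")
state = StorageState(stores)          # remember the sizes now
stores["trajectories"].save("traj1")  # saved after the state was taken

old_result = query(state.view(stores, "trajectories"))
```

Storing `query` together with `state` is enough to reconstruct `old_result` later, and `new_items` gives the objects that arrived since.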

#### StorageState

#### Special Iterators

A way to encode all trajectories of a specific type in the storage. Usually you would write

```py
stateA_trajs = [traj for traj in storage.trajectories if traj[0] in stateA]
```

We might want to simplify this into

```py
all_traj_iterator = storage.trajectories
stateA_traj_iterator = storage.trajectories.filter(lambda traj: traj[0] in stateA)
```

These queries should be storable as well, with their result stored only implicitly.
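
One way such a `filter` could behave is sketched below: a lazy view that keeps the chain of predicates and re-evaluates them on every iteration. The class `TrajectoryView` is hypothetical, and the store is a plain list of integer-state trajectories for illustration.

```py
class TrajectoryView:
    """Lazy, re-iterable view over a store with an optional chain of filters."""

    def __init__(self, source, predicates=()):
        self._source = source
        self._predicates = tuple(predicates)

    def filter(self, predicate):
        # Returns a new view; the predicate chain is the storable "query".
        return TrajectoryView(self._source, self._predicates + (predicate,))

    def __iter__(self):
        for traj in self._source:
            if all(p(traj) for p in self._predicates):
                yield traj


state_a = {0, 1}
store = [[0, 5, 6], [7, 1]]
view = TrajectoryView(store).filter(lambda traj: traj[0] in state_a)

before = list(view)
store.append([1, 2])  # a trajectory saved later shows up on the next iteration
after = list(view)
```

Because the view holds the query rather than the result, it always reflects the current content of the store.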

#### Externally stored content

### Additional Functionalities

#### ExternalTrajectory

Instead of saving a list of snapshots we could just save a whole trajectory as one object and reference it by filename. It would behave exactly like a normal trajectory, but the snapshots would be loaded in a dedicated way and iterating over the trajectory would also work differently.

Not sure how to get a UUID for the snapshots then, but I guess you can do that by skipping $n$ indices when saving the trajectory: if the trajectory has ID `17` and length 5, the frames get the UUIDs `18` to `22`. These snapshots can be accessed from a special `ExternalSnapshot` store that can handle the requests.
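
The index-skipping scheme can be sketched as follows, assuming IDs are consecutive integers handed out by a single counter. `IdAllocator` and `locate_frame` are hypothetical names used only for this illustration.

```py
class IdAllocator:
    """Hands out consecutive integer IDs and reserves a block for each trajectory."""

    def __init__(self, start=0):
        self._next = start

    def reserve_trajectory(self, n_frames):
        """Return (trajectory_id, frame_ids), skipping n_frames indices for the frames."""
        traj_id = self._next
        frame_ids = range(traj_id + 1, traj_id + 1 + n_frames)
        self._next = traj_id + 1 + n_frames
        return traj_id, frame_ids


def locate_frame(traj_id, frame_id):
    """Map a frame ID back to its position inside the external trajectory file."""
    return frame_id - traj_id - 1


alloc = IdAllocator(start=17)
traj_id, frame_ids = alloc.reserve_trajectory(5)   # ID 17, frames 18 to 22
next_id, _ = alloc.reserve_trajectory(3)           # next free ID is 23
```

With this layout the snapshot store never needs its own table: a frame ID plus the trajectory ID is enough to find the frame in the file.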

#### ExternalSnapshot

These reference a frame number and an `MDFile`.

#### ExternalFile

Referencing an external filename with a UUID. A filename should also be unique, but this way it is unified. You can also reference other URIs such as websites.

#### MDFile

An external file readable by MDTraj

There are some caveats to make sure this is efficient but the main point is bookkeeping and that should be possible.

#### Matrix

In general, a matrix with a special purpose and a reference to how it was created, e.g. `CorrelationMatrix`, `TransitionMatrix`, ...

### Additional StorableObjects

To keep it simple for a user we should create a simple API to instantiate a `QueueingSystem` with only a few options. This QS can be adapted to all sorts of other clusters by subclassing. The main point is that the QS can

1. accept new tasks to run
2. add the result of the tasks to a storage
3. report on the status of task execution
4. control the execution, i.e., abort gracefully, remove tasks, etc...

All the rest is handled by the specific QS instance. The QS should be responsible for the distribution and execution and, if desired, also for the setup of necessary workers or initial preparations, etc.
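
The four responsibilities above could translate into an interface like the following sketch. All method names (`submit`, `status`, `cancel`, `run`) and the `LocalQueueingSystem` subclass are assumptions for illustration; tasks are plain callables and the storage is a list.

```py
from abc import ABC, abstractmethod


class QueueingSystem(ABC):
    """Hypothetical minimal interface for the four responsibilities."""

    def __init__(self, storage):
        self.storage = storage

    @abstractmethod
    def submit(self, tasks):
        """1. Accept new tasks to run."""

    @abstractmethod
    def status(self):
        """3. Report on the status of task execution."""

    @abstractmethod
    def cancel(self, task=None):
        """4. Control the execution: remove one task or all pending ones."""


class LocalQueueingSystem(QueueingSystem):
    """Runs tasks one by one in the current process, e.g. for local testing."""

    def __init__(self, storage):
        super().__init__(storage)
        self.pending = []
        self.finished = 0

    def submit(self, tasks):
        self.pending.extend(tasks)

    def status(self):
        return {"pending": len(self.pending), "finished": self.finished}

    def cancel(self, task=None):
        if task is None:
            self.pending.clear()
        else:
            self.pending.remove(task)

    def run(self):
        while self.pending:
            task = self.pending.pop(0)
            self.storage.append(task())  # 2. add the result to the storage
            self.finished += 1


storage = []
qs = LocalQueueingSystem(storage)
qs.submit([lambda: "traj-a", lambda: "traj-b"])
qs.run()
```

A cluster variant would subclass `QueueingSystem` in the same way and replace `run` with whatever submission mechanism the cluster provides.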

##### Example 1

Assume you want to use workers on a cluster; then you need to run the server that workers can connect to and place the workers in the queue.

##### Example 2

Placing jobs in a queue might not require any additional preparation besides monitoring the queue to see whether you can place new jobs in it and, when a job is finished, doing some cleaning up and registering the result in the DB.

#### Storing

After a task is finished its results need to be registered in the DB. Either you store a pointer to the files 

```py
storage.trajectories.save(ExternalMDTrajectory('file0001.xtc'))
```

or save it as a full trajectory object in the DB copying all frames with it

```py
storage.trajectories.save(Trajectory(iterator_of_frames))
```

Still, the QS is responsible for doing that with whatever the worker returned. If it returned a file location, you might reference it (e.g. Gromacs); if it returned a stream of a pickled trajectory, store it (e.g. OpenMM).
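
A rough sketch of that dispatch, assuming a worker returns either a filename (`str`) or a pickled trajectory (`bytes`). The function `register_result` and the record layout are hypothetical, and the store is a plain list.

```py
import pickle


def register_result(store, result):
    """Save whatever a worker returned: reference a file, or copy pickled frames."""
    if isinstance(result, str):
        # File location: only store a pointer to the external file.
        record = {"kind": "external", "filename": result}
    elif isinstance(result, bytes):
        # Pickled trajectory: copy all frames into the store.
        # Note: only unpickle data that comes from trusted workers.
        record = {"kind": "inline", "frames": pickle.loads(result)}
    else:
        raise TypeError(f"unsupported worker result: {type(result).__name__}")
    store.append(record)
    return record


store = []
register_result(store, "file0001.xtc")
register_result(store, pickle.dumps([[0.0, 0.1], [0.2, 0.3]]))
```

Keeping this logic in the QS means workers stay simple and do not need to know about the storage at all.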

#### Persistence

The QS should (though this is not strictly necessary) run independently of the main task and be able to be stopped and continued. The idea is that the actual jobs to be done and their execution are independent. The Sampler will add tasks to the queue, and whenever you get time on a cluster you can work through the list. As long as tasks have not run you can still clear the queue or add more jobs. For the MSM adaptive scheme we usually do not have dependent tasks, and it never matters in which order we run the tasks or whether we skip some. If we pick good candidates we will converge faster, but we will not converge to a wrong solution (within the bounds of the projection error, of course).
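
As an illustration of this decoupling, the sketch below keeps the task list in a JSON file so that the sampler and the executor can run at different times. `PersistentQueue` and the task dictionary are hypothetical, and the file handling is not safe for concurrent writers.

```py
import json
import os
import tempfile


class PersistentQueue:
    """Task list kept in a JSON file so producer and consumer can run separately."""

    def __init__(self, path):
        self.path = path
        if not os.path.exists(path):
            self._write([])

    def _read(self):
        with open(self.path) as fh:
            return json.load(fh)

    def _write(self, tasks):
        with open(self.path, "w") as fh:
            json.dump(tasks, fh)

    def add(self, task):
        self._write(self._read() + [task])

    def pop(self):
        """Remove and return the oldest task, or None if the queue is empty."""
        tasks = self._read()
        if not tasks:
            return None
        self._write(tasks[1:])
        return tasks[0]

    def clear(self):
        self._write([])


path = os.path.join(tempfile.mkdtemp(), "queue.json")
PersistentQueue(path).add({"start_frame": 18, "n_frames": 100})  # sampler side
task = PersistentQueue(path).pop()                               # a later, separate run
```

Since the tasks are independent, popping them in any order (or clearing the file) never leaves the sampler in an inconsistent state.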

#### Additional features

- add automatic retries if a simulation crashes or is aborted.
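
Automatic retries could be as simple as the following sketch. It assumes a crashed simulation surfaces as a `RuntimeError`; `run_with_retries` and `flaky_simulation` are hypothetical names.

```py
def run_with_retries(task, max_retries=3):
    """Call task() until it succeeds or max_retries extra attempts are used up.

    Returns (result, number_of_retries_used).
    """
    last_error = None
    for attempt in range(1 + max_retries):
        try:
            return task(), attempt
        except RuntimeError as err:  # assumption: crashes surface as RuntimeError
            last_error = err
    raise RuntimeError(f"task failed after {1 + max_retries} attempts") from last_error


calls = {"n": 0}


def flaky_simulation():
    """Stand-in for a simulation that crashes on its first two runs."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated crash")
    return "trajectory"


result, retries_used = run_with_retries(flaky_simulation)
```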

    # return a pyemma source object
    storage.trajectories.as_source  

    # actually .trajectories is an iterator over all trajectories 

    # [Do all the pyemma magic]

    storage.models.save(model)  # should contain all references to how its created

    orchestrator = ClusterQueueingSystem(
        config_file,
        storage=ClusterStorage(),
        # [options]
    )

    simulation_task = SimulateNFramesMDTask(
        engine = my_gromacs_engine,
        start_frame = my_last_frame,
        n_frames = 100
    )

    # do this 100 times
    orchestrator.append([simulation_task] * 100)

    print(orchestrator.currently_running_tasks)

## ResultDB

Store all data / results from the queueing system, model builder, etc.

The database is some kind of repository that does the bookkeeping. This does not necessarily involve storing all trajectory files, but all existing files should be mentioned in the database. The same should be true for all models, tasks, clusterings, etc., i.e. all objects that might be of later value and that we want to access easily.

Since there are several ways to do that we should just provide an API and do several implementations for different purposes. 

I propose one of the following:

1. adopt the netCDF+ approach, which is basically a NoSQL DB in a single file, and extend it to point to external files (a PR already exists, but needs to be updated),
2. use MongoDB instead, which is the more general approach but requires a MongoDB server to be running,
3. use a special directory and store one file per object with a specific naming scheme.

But all will share the same API which I would adopt from what we use with OPS. 

The first two are more elegant and provide the additional benefit of easier search and access as well as reuse of existing objects. In this case it is also important to use strictly immutable objects, or objects where only non-essential attributes are mutable.

If objects are immutable we can safely use pointers to existing objects without the danger that these might change in the future. This will simplify storage immensely.
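
A small sketch of what immutability buys us, using frozen dataclasses from the standard library. The `Snapshot` and `Trajectory` classes here are simplified stand-ins, not the real storable objects.

```py
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Snapshot:
    coordinates: tuple


@dataclass(frozen=True)
class Trajectory:
    snapshots: tuple  # references to existing Snapshot objects, not copies


s0, s1 = Snapshot((0.0, 0.0)), Snapshot((1.0, 0.5))
traj_a = Trajectory((s0, s1))
traj_b = Trajectory((s1,))  # shares s1 with traj_a instead of copying it

try:
    s1.coordinates = (9.9, 9.9)  # attempts to modify a stored object fail
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Both trajectories point to the very same `s1`, so the storage only needs to save it once and record a reference in each trajectory.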

The most important point is that the storage is persistent and will not disappear once a job is finished or the main simulation has crashed. It should also be suited to restarting a simulation, either after a crash or if more simulation is needed. Lastly, it can also serve as the starting point for further simulations and analysis.

A DB is best suited to remain consistent, whereas the file-based approaches can suffer if the simulation crashes at the wrong moment.

For the purely file-based approach we need to write functions that parse the directory tree and return the appropriate objects.
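
Such parsing functions could look like the sketch below. The naming scheme `<store>/<zero-padded id>.json` and the helpers `save_object` and `load_store` are assumptions made for this example; the writes are not atomic, so a crash during `save_object` could still leave a partial file.

```py
import json
import tempfile
from pathlib import Path


def save_object(root, store, obj_id, data):
    """Write one object to <root>/<store>/<id>.json (hypothetical naming scheme)."""
    path = Path(root) / store / f"{obj_id:08d}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data))
    return path


def load_store(root, store):
    """Parse one store directory and return {object id: object data}."""
    files = sorted((Path(root) / store).glob("*.json"))
    return {int(f.stem): json.loads(f.read_text()) for f in files}


root = tempfile.mkdtemp()
save_object(root, "trajectories", 17, {"filename": "file0001.xtc", "n_frames": 5})
save_object(root, "models", 3, {"type": "msm", "lag": 10})
trajectories = load_store(root, "trajectories")
```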

## Strategy

The Brain

This is still the most experimental and least clear part. We might have to try different ideas to get something that feels right.