# ecFlow eLearning Jupyter Notebook

![ecflow](anim/ecflow_el.gif)

- This is a snapshot of ecflow_ui, the ecFlow GUI.
- The top *node* is the **server node**, called eowyn here.
- Right below are the **suite** nodes, which contain **families** and **tasks**.
- The **family** node is a container for families and tasks.
- The **task** node is the leaf node from which a **job** is created by ecFlow and **submitted**, locally or remotely.
- An **alias** node can be created from the GUI to modify the script and variables of a task on the fly and submit it independently, for debugging or fixing. It is attached to the task and can be removed directly. Multiple aliases can run simultaneously when the task is well designed (different work directory).
- **Attributes** can be attached to nodes:
    - **label, meter, event**, which may be updated from the job to follow its progress, reporting a string, an integer, or a boolean (set) respectively
    - **date, time, trigger** (black), **complete** (blue)
    - when the task has to run, a job is submitted, which reports its **status** so that the task colour changes
    - the **late** attribute is also visible here.

![workflow](anim/ecflow.gif)

# ecFlow components

  - the **ecflow_server**,
  - the clients (**ecflow_client**, the command line text client, **ecflow_ui**, the GUI, and the **Python API** for Python clients).
  - A **template language** is understood by ecFlow to transform the **task template scripts** into **job files**.

- The template language is generic, so that _you_ decide which language the job is written in: bash, ksh, python, perl, ...
- You start by describing the tasks to run, their relations and dependencies, as a **suite** in a **definition file**.
- You load the suite into the server.
- You provide the script templates that the ecFlow server reads when the jobs are generated.
- When the dependencies are fulfilled (Date, Time, Trigger, Complete, Inlimit), the job is created and submitted. At that moment the task is **submitted**. If anything goes wrong, the task turns **aborted**.
   - For example, the job could not be created, or it could not be submitted because the remote system is not responding or is refusing the submission (queue, account, size of the job).
- The job starts and writes its progress into the output file. It reports to the ecFlow server using **ecflow_client**:
  - **ecflow_client --init** reports the job **active**
  - **ecflow_client --complete** reports successful completion
  - at any time, if something goes wrong, the job shall send the **ecflow_client --abort** command to report the task **aborted**
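
The reporting protocol listed above can be sketched in Python. This is a minimal illustration, not the ecFlow implementation: `run_job` and `report` are hypothetical names, and the child commands are only recorded in a list rather than executed, so the sketch runs without a server.

```python
import os


def run_job(body, report):
    """Run `body` and report its progress the way an ecFlow job does.

    `report` is a hypothetical callback receiving, for each step, the
    ecflow_client command line as a list of arguments.
    """
    # Tell the server the job has started, passing the process id.
    report(["ecflow_client", "--init=%d" % os.getpid()])
    try:
        body()
    except Exception as exc:
        # Anything going wrong is reported: the task turns aborted.
        report(["ecflow_client", "--abort=%s" % exc])
        return False
    # Normal end: the task turns complete.
    report(["ecflow_client", "--complete"])
    return True


def failing_body():
    raise RuntimeError("disk full")


good, bad = [], []
run_job(lambda: None, good.append)
run_job(failing_body, bad.append)
print([command[1].split("=")[0] for command in good])
print([command[1].split("=")[0] for command in bad])
```

In a real job each recorded command line would be executed, and the header and trailer files included by the task template usually take care of these calls.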

![mindmap](anim/ecflowmm.png)

![status flow](anim/ecflow_status.gif)

## Status transition
- Initially, when a suite is loaded for the first time, its status is **unknown**, so that no job is immediately submitted. This gives the suite designer a chance to check that it is as expected:
  - scripts and headers can be read by the server,
  - no variable is used in a script without being defined in the suite,
  - the structure of the suite reflects expectations,
  - at that time, nodes can be suspended from the GUI, deleted, or moved.
- The **begin** command lets the ecFlow server initialise server variables and node statuses, and even start jobs when the server is in **restarted mode** and the suite node is not **suspended**.
- At that moment, if some jobs cannot be created or submitted, a few tasks may turn **aborted**.
- Most shall remain **queued**, the default status, waiting for their dependencies to be met (Date, Time, Trigger, Complete, Inlimit).
- The suite definition may contain nodes using the attribute **defstatus**. It defines the default status of a node when it is requeued or when the suite is begun.
- Submitted jobs are visible transiently (when a job is submitted to a queuing system, this status can last a few seconds).
- When the job starts and calls **ecflow_client --init**, the task turns **active**.
- When the job is completing, **ecflow_client --complete** is called.
- A job may face a problem and report it with **ecflow_client --abort**.
- The variables **ECF_JOB_CMD** and **ECF_KILL_CMD** can be defined in the suite, and modified throughout it, to set the command the ecFlow server calls to submit or kill the job.
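
The transitions described in this list can be summarised as a small state machine. The sketch below is a simplified model written for this tutorial: the event names (`begin`, `submit`, `submit_failed`, ...) are labels chosen here, not ecFlow commands, and suspension and **defstatus** are ignored.

```python
# Allowed status transitions, keyed by (current status, event).
TRANSITIONS = {
    ("unknown", "begin"): "queued",
    ("queued", "submit"): "submitted",
    ("submitted", "init"): "active",
    ("submitted", "submit_failed"): "aborted",
    ("active", "complete"): "complete",
    ("active", "abort"): "aborted",
    ("complete", "requeue"): "queued",
    ("aborted", "requeue"): "queued",
}


def follow(events, status="unknown"):
    """Return the list of statuses visited while applying `events`."""
    visited = [status]
    for event in events:
        key = (status, event)
        if key not in TRANSITIONS:
            raise ValueError("no transition for %s from %s" % (event, status))
        status = TRANSITIONS[key]
        visited.append(status)
    return visited


print(follow(["begin", "submit", "init", "complete"]))
print(follow(["begin", "submit", "init", "abort", "requeue"]))
```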

In [4]:
%load_ext autoreload
%autoreload 2
%quickref

[video](anim/ecflow_3min39.mp4)

## Start the ecFlow server

In [1]:
%%bash 
ecflow_start.sh -p 2500

ping server(eowyn:2500) succeeded in 00:00:00.002894  ~2 milliseconds
server is already started


## Start the GUI ecflow_ui

In [30]:
%%bash
ecflow_ui &

bash: line 1: ecflow_ui: command not found


## Experience the command line client

In [6]:
%%bash
ecflow_client --port 2500 --host localhost --ping

ping server(localhost:2500) succeeded in 00:00:00.001609  ~1 milliseconds


In [7]:
%%bash
ecflow_client --help


Client/server based work flow package:

Ecflow version(4.8.0) boost(1.53.0) compiler(gcc 4.8.4) protocol(TEXT_ARCHIVE) Compiled on Oct 10 2017 23:59:15

ecflow_client provides the command line interface, for interacting with the server:
Try:

   ecflow_client --help=all       # List all commands, verbosely
   ecflow_client --help=summary   # One line summary of all commands
   ecflow_client --help=child     # One line summary of child commands
   ecflow_client --help=user      # One line summary of user command
   ecflow_client --help=<cmd>     # Detailed help on each command

Commands:

   abort                alter                begin                ch_add               ch_auto_add          
   ch_drop              ch_drop_user         ch_register          ch_rem               ch_suites            
   check                checkJobGenOnly      check_pt             complete             debug                
   debug_server_off     debug_server_on      delete               edit_histor

## Python for suites design, client-server communication, or a script.py as a task template...

In [9]:
%%bash
export PYTHONPATH=$PYTHONPATH:/usr/local/apps/ecflow/lib/python2.7/site-packages/ecflow:/usr/local/lib/python2.7/site-packages/ecflow


In [3]:
import ecflow
help(ecflow)

Help on module ecflow:

NAME
    ecflow - The ecflow module provides the python bindings/api for creating definition structure and communicating with the server.

FILE
    /usr/local/lib/python2.7/site-packages/ecflow/ecflow.so

CLASSES
    Boost.Python.enum(__builtin__.int)
        AttrType
        CheckPt
        ChildCmdType
        DState
        Days
        FlagType
        SState
        State
        Style
        ZombieType
        ZombieUserActionType
    Boost.Python.instance(__builtin__.object)
        Autocancel
        Client
        Clock
        Cron
        Date
        Day
        Defs
        Ecf
        Event
        Expression
        FamilyVec
        File
        Flag
        FlagTypeVec
        InLimit
        JobCreationCtrl
        Label
        Late
        Limit
        Meter
        Node
            NodeContainer
                Family
                Suite
            Submittable
                Alias
                Task
        NodeVec
        PartExpres

## Define the simplest suite and print it as "text definition file"

In [9]:
%run src/el11_test_suite.py

Creating suite definition
# 4.8.0
suite elearning
  defstatus suspended
  edit ECF_INCLUDE '/home/map/ecflow_server/include'
  edit ECF_FILES '/home/map/ecflow_server/files'
  edit ECF_HOME '/home/map/ecflow_server'
  task t1
endsuite

Saving definition to file 'elearning.def'


In [1]:
%pycat src/el11_test_suite.py
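
The content of the script is not reproduced here. A script producing a definition like the one printed by the previous cell could look like the following sketch. It relies on the native `ecflow` Python API (`Defs`, `add_suite`, `add_defstatus`, `add_variable`, `add_task`); the function name `create_defs` and its `home` argument are assumptions, and the import is guarded so that the sketch does nothing when the module is not installed.

```python
import os

try:
    import ecflow  # native ecFlow Python API
except ImportError:
    ecflow = None  # sketch only: skip when ecflow is not installed


def create_defs(name, home):
    """Build a suite holding a single task, suspended by default."""
    defs = ecflow.Defs()
    suite = defs.add_suite(name)
    suite.add_defstatus(ecflow.DState.suspended)
    suite.add_variable("ECF_INCLUDE", os.path.join(home, "include"))
    suite.add_variable("ECF_FILES", os.path.join(home, "files"))
    suite.add_variable("ECF_HOME", home)
    suite.add_task("t1")
    return defs


if ecflow is None:
    print("ecflow module not available, nothing created")
else:
    defs = create_defs("elearning", os.path.expanduser("~/ecflow_server"))
    print(defs)
    # defs.save_as_defs("elearning.def") would write the text definition file
```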

## Create the head.h and tail.h files for all tasks in ECF_INCLUDE directory

In [5]:
%run src/el12_test_suite_include.py

head.h/tail.h files are now created in ECF_INCLUDE /home/map/ecflow_server/include


In [5]:
%pycat src/el12_test_suite_include.py

## Create the simplest task template in ECF_FILES directory

In [6]:
%run src/el13_test_suite_wrapper.py

The script template file is now created as /home/map/ecflow_server/files/t1.ecf


In [11]:
%pycat src/el13_test_suite_wrapper.py

## Load or replace the node into the server with a standalone Python ecFlow client

You may ask: why do we need to write an intermediate text file, which is later loaded by another Python script that has to read this file?

We don't: this tutorial simply proceeds step by step. In practice, the suite is most often loaded into the server directly, from its creation as ecflow.Defs, with ecflow.Client.

Yet the (expanded) text suite definition can be read, modified, and used with the ecflow_client command line to create or update a node in an existing suite.

In [8]:
%%bash
cd src; python el14_test_suite_client.py

Traceback (most recent call last):
  File "el14_test_suite_client.py", line 20, in <module>
    CLIENT.replace("/%s" % NAME, "%s.def" % NAME)
RuntimeError: ecflow:ClientInvoker::replace(/elearning,elearning.def, ...) failed: ReplaceNodeCmd::ReplaceNodeCmd: Could not parse file elearning.def : DefsStructureParser::DefsStructureParser: Unable to open file! elearning.def

Ecflow version(4.8.0) boost(1.53.0) compiler(gcc 4.8.4) protocol(TEXT_ARCHIVE) Compiled on Oct 20 2017 17:32:43



In [6]:
%pycat src/el14_test_suite_client.py

check_job_creation can be run on the client side to find out whether jobs can be created by the ecFlow server. It helps you to identify and fix:
- missing task template files
- missing include files
- ecFlow variables used in a script template, yet undefined in the suite definition
- missing directories (ECF_HOME, ECF_INCLUDE, ECF_FILES)
- unexpected directory permissions: read-only, or directory not accessible
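
To give an idea of what such a check involves, here is a deliberately simplified, pure-Python sketch covering two of the items above. It is not how check_job_creation is implemented: `check_template` is a hypothetical helper, `%` is assumed to be the micro character, and include files and template directives are ignored.

```python
import os
import re
import tempfile

VARIABLE = re.compile(r"%(\w+)%")


def check_template(path, variables):
    """Return a list of problems that would prevent job creation."""
    if not os.path.isfile(path):
        return ["missing task template: %s" % path]
    with open(path) as handle:
        text = handle.read()
    problems = []
    for name in sorted(set(VARIABLE.findall(text))):
        if name not in variables:
            problems.append("undefined variable: %s" % name)
    return problems


# Create a throw-away template to exercise the check.
workdir = tempfile.mkdtemp()
template = os.path.join(workdir, "t1.ecf")
with open(template, "w") as handle:
    handle.write("echo %ECF_HOME%\nsleep %SLEEP%\n")

print(check_template(template, {"ECF_HOME": "/tmp"}))
print(check_template(os.path.join(workdir, "t2.ecf"), {}))
```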

## Standalone Python client to download and display live server content

In [10]:
%run src/el15_checking_the_result.py

# 4.8.0
defs_state STATE state>:queued flag:message state_change:9
  edit ECF_MICRO '%' # server
  edit ECF_HOME '/home/map/ecflow_server' # server
  edit ECF_JOB_CMD '%ECF_JOB% 1> %ECF_JOBOUT% 2>&1' # server
  edit ECF_KILL_CMD 'kill -15 %ECF_RID%' # server
  edit ECF_STATUS_CMD 'ps --sid %ECF_RID% -f' # server
  edit ECF_URL_CMD '${BROWSER:=firefox} -remote 'openURL(%ECF_URL_BASE%/%ECF_URL%)'' # server
  edit ECF_URL_BASE 'https://software.ecmwf.int' # server
  edit ECF_URL 'wiki/display/ECFLOW/Home' # server
  edit ECF_LOG '/home/map/ecflow_server/eowyn.2500.ecf.log' # server
  edit ECF_INTERVAL '60' # server
  edit ECF_LISTS '/home/map/ecflow_server/ecf.lists' # server
  edit ECF_CHECK '/home/map/ecflow_server/eowyn.2500.check' # server
  edit ECF_CHECKOLD '/home/map/ecflow_server/eowyn.2500.check.b' # server
  edit ECF_CHECKINTERVAL '120' # server
  edit ECF_CHECKMODE 'CHECK_ON_TIME' # server
  edit ECF_TRIES '2' # server
  edit ECF_VERSION '4.8.0' # server
  edit ECF_PORT '2500' 

## Delete-Load-Begin a suite... or just replace a node

In [11]:
%run src/el16_client_load.py

Server was restarted
Suite elearning is now begun


In [1]:
%pycat src/el16_client_load.py

- **restart_server** is issued from the Python client, or from the GUI, **once**, so that jobs can be submitted.
- **begin_suite** must be issued each time the suite is loaded. To avoid that, **replace** is sometimes preferred. **load** does not overwrite a suite that already exists on the server; it reports an error instead. Sometimes **delete** is called on the live suite to clear the path for the incoming **load**.
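
The two approaches can be sketched as two small functions. They only call methods that exist on `ecflow.Client` (`delete`, `load`, `begin_suite`, `replace`), but the function names are assumptions, and the sketch is exercised with a hypothetical `RecordingClient` stand-in so that it runs without a server.

```python
def reload_suite(client, name, definition):
    """Delete-load-begin: start again from a clean suite."""
    try:
        client.delete("/" + name)  # clear the path for the incoming load
    except RuntimeError:
        pass  # assumed here: the suite was not on the server yet
    client.load(definition)
    client.begin_suite(name)  # needed each time the suite is loaded


def replace_node(client, path, definition):
    """Replace a node (or a whole suite) in place."""
    client.replace(path, definition)


class RecordingClient(object):
    """Hypothetical stand-in for ecflow.Client that records the calls."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, command):
        # Any unknown method becomes a recorder for that command name.
        return lambda *args: self.calls.append((command,) + args)


client = RecordingClient()
reload_suite(client, "elearning", "elearning.def")
replace_node(client, "/elearning/t2", "elearning.def")
for call in client.calls:
    print(call)
```

With the real API, `client` would be an `ecflow.Client` connected to the server host and port.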

# Going Further

## Add a task

In [12]:
%run src/el21_add_another_task.py

Creating suite definition
replaced node /elearning/t2 into localhost 2500


In [20]:
%pycat src/el21_add_another_task.py

## Add a family

In [13]:
%run src/el22_add_families.py

Creating suite definition
replaced node /elearning into localhost 2500


In [None]:
%pycat src/el22_add_families.py

## Add variables

Variables are essential to a suite. They are attached to a node as an **attribute**: the keyword is edit in the text definition file, the native ecFlow API is node.add_variable("NAME", "value"), and ecf.py, used in this tutorial, defines them with node.add(Variables(a_dictionary)).

When preprocessing the task script to generate the job, the ecFlow server replaces each occurrence of a variable (its name surrounded by the ECF_MICRO character, % by default) with its value.

A task may inherit a variable from a parent node, or overwrite the inherited value by defining it again.
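
Substitution and inheritance can be illustrated with a few lines of plain Python. This is a toy model, not the server's preprocessor: the `Node` class and `preprocess` function are hypothetical, and only the simple `%NAME%` form is handled.

```python
import re


class Node(object):
    """Hypothetical node holding variables and a link to its parent."""

    def __init__(self, name, parent=None, **variables):
        self.name = name
        self.parent = parent
        self.variables = variables

    def find_variable(self, name):
        """Look on this node first, then up the tree."""
        node = self
        while node is not None:
            if name in node.variables:
                return node.variables[name]
            node = node.parent
        raise KeyError("variable %s is not defined" % name)


def preprocess(template, node, micro="%"):
    """Replace each <micro>NAME<micro> with the value seen from `node`."""
    pattern = re.compile(re.escape(micro) + r"(\w+)" + re.escape(micro))
    return pattern.sub(lambda match: node.find_variable(match.group(1)), template)


suite = Node("elearning", SLEEP="2", ECF_HOME="/home/map/ecflow_server")
t1 = Node("t1", parent=suite)
t2 = Node("t2", parent=suite, SLEEP="10")  # overwrites the inherited value

print(preprocess("sleep %SLEEP%", t1))
print(preprocess("sleep %SLEEP%", t2))
```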

In [14]:
%run src/el23_add_variable.py

replaced node /elearning into localhost 2500


In [3]:
%pycat src/el23_add_variable.py

## Add Trigger

The Trigger attribute prevents the task from running until its expression is true.

The expression may refer to the status of other nodes, and to variables, events, meters, and limits.

While designing a suite, when a few trigger expressions still refer to "missing" nodes, it is possible to inhibit triggers by setting "ecf.USE_TRIGGER=False".
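
As an illustration of how a trigger holds a task, the sketch below evaluates a very small subset of the expression syntax: `node == status` (or `node eq status`) clauses joined by `and`. The function `trigger_is_true` is hypothetical and far simpler than the server's expression parser.

```python
def trigger_is_true(expression, statuses):
    """Evaluate 'node == status' clauses joined by 'and'.

    `statuses` maps node names to their current status. Only this tiny
    subset of the expression syntax is handled in this sketch.
    """
    for clause in expression.split(" and "):
        node, operator, expected = clause.split()
        if operator not in ("==", "eq"):
            raise ValueError("unsupported operator: %s" % operator)
        if statuses.get(node) != expected:
            return False
    return True


statuses = {"t1": "complete", "t2": "active"}
print(trigger_is_true("t1 == complete", statuses))
print(trigger_is_true("t1 eq complete and t2 eq complete", statuses))
```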

In [15]:
%run src/el24_add_trigger.py

replaced node /elearning/f1 into localhost 2500


In [1]:
%pycat src/el24_add_trigger.py

## Add Complete

The Complete attribute is the counterpart of Trigger. It sets the task complete as soon as its expression is true, so that the job may not have to run.

In [18]:
%run src/el26_add_complete.py

replaced node /elearning into localhost 2500


In [2]:
%pycat src/el26_add_complete.py

## Add Event

The event is created as an attribute attached to a node (Task, Family) and updated (set) 
- by the job calling the "ecflow_client --event" command
- or by a user, using the command "ecflow_client --alter change event"

In [20]:
%run src/el27_add_event.py


replaced node /elearning into localhost 2500


In [2]:
%pycat src/el27_add_event.py


## Add Meter

The Meter attribute is attached to a Task (or a Family) so that the job (or client) can update its integer value.
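
A job typically updates a meter with the `ecflow_client --meter` child command. The sketch below only builds the command lines, so it runs without a server; `meter_command` is a hypothetical helper, and the 0 to 100 range is an assumption that must match the range given when the meter is defined.

```python
def meter_command(name, value, minimum=0, maximum=100):
    """Build the child command a job would run to update a meter."""
    value = int(value)
    if not minimum <= value <= maximum:
        raise ValueError("%d is outside [%d, %d]" % (value, minimum, maximum))
    return ["ecflow_client", "--meter=%s" % name, str(value)]


# A job processing 4 steps could report its progress after each one.
steps = 4
for done in range(1, steps + 1):
    print(" ".join(meter_command("progress", 100 * done // steps)))
```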

In [21]:
%run src/el28_add_meter.py

<open file '/home/map/ecflow_server/files/t1.ecf', mode 'w' at 0x7f542c1e95d0>
<open file '/home/map/ecflow_server/files/t2.ecf', mode 'w' at 0x7f542c1e9420>
<open file '/home/map/ecflow_server/files/t3.ecf', mode 'w' at 0x7f542c1e95d0>
<open file '/home/map/ecflow_server/files/t4.ecf', mode 'w' at 0x7f542c1e9420>
<open file '/home/map/ecflow_server/files/t5.ecf', mode 'w' at 0x7f542c1e95d0>
<open file '/home/map/ecflow_server/files/t6.ecf', mode 'w' at 0x7f542c1e9420>
<open file '/home/map/ecflow_server/files/t7.ecf', mode 'w' at 0x7f542c1e95d0>
replaced node /elearning/f2 into localhost 2500


## Add Date and Time

The Date attribute holds the job until the date is reached.

A Time attribute prevents the job from being submitted immediately. It can be a single value (e.g. 10:00), or a range of times (start, end, and interval). Multiple Time attributes can be attached to the same node. Be careful: until the last occurrence is met, the task is immediately requeued, so no trigger should refer to the task being complete.

When Date and Time are combined, Date holds first, then the task waits for the Time to start.

When a parent node is **suspended**, the date or time condition may be met, yet the task will not start until the suspended node is **resumed**. Some may say these attributes have memory. The **why** command/panel shows when the next task occurrence is expected. The **requeue** command will 'restore the consumed token' when the task was **executed** (forced to run) manually with the GUI.

In many cases, we can attach these attributes to a dummy task and refer to it with a trigger. That way the task, with no time dependency directly attached, can be requeued without reactivating the time condition.
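
To make the time range concrete, the following sketch expands a range such as `00:00 23:00 01:00` (start, end, interval) into the list of slots at which the task is expected to run. `time_slots` is a hypothetical helper written for this illustration, not part of the ecFlow API.

```python
from datetime import datetime, timedelta


def time_slots(start, end=None, step=None):
    """Expand 'start [end step]' (HH:MM strings) into the list of slots."""
    if end is None:
        return [start]
    current = datetime.strptime(start, "%H:%M")
    last = datetime.strptime(end, "%H:%M")
    hours, minutes = step.split(":")
    increment = timedelta(hours=int(hours), minutes=int(minutes))
    if increment <= timedelta(0):
        raise ValueError("the interval must be positive")
    slots = []
    while current <= last:
        slots.append(current.strftime("%H:%M"))
        current += increment
    return slots


slots = time_slots("00:00", "23:00", "01:00")
print("%d slots from %s to %s" % (len(slots), slots[0], slots[-1]))
```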

In [23]:
%run src/el29_add_time_date.py

replaced node /elearning into localhost 2500


In [None]:
%pycat src/el29_add_time_date.py

## Add Label

A label is a text message attached to the task (or Family) which is updated by the task (or client).

In [24]:
%run src/el30_add_label.py

replaced node /elearning into localhost 2500


In [None]:
%pycat src/el30_add_label.py

## Add Repeat

A Repeat is like a (for) loop at suite level. It is incremented to the next value once all nodes below are complete. It is an **active** attribute, in the sense that it causes the nodes below to be requeued (default status, events and meters reset) when the increment occurs.
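
The loop analogy can be written out in plain Python for a repeat over dates. This is only a model of the behaviour described above: `repeat_date` and `run_family` are hypothetical names, and the family is assumed to complete at each iteration.

```python
from datetime import datetime, timedelta


def repeat_date(start, end, step=1):
    """Yield YYYYMMDD integers from `start` to `end`, `step` days apart."""
    if step < 1:
        raise ValueError("step must be a positive number of days")
    current = datetime.strptime(str(start), "%Y%m%d")
    last = datetime.strptime(str(end), "%Y%m%d")
    while current <= last:
        yield int(current.strftime("%Y%m%d"))
        current += timedelta(days=step)


def run_family(tasks, ymd):
    """Stand-in for the family running once: every task gets complete."""
    return dict((task, "complete") for task in tasks)


for ymd in repeat_date(20171230, 20180102):
    statuses = run_family(["get", "process", "store"], ymd)
    # Once all nodes below are complete, the repeat moves to the next
    # value and the nodes are requeued for the next iteration.
    assert all(status == "complete" for status in statuses.values())
    print("YMD=%d done, requeue" % ymd)
```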

In [25]:
%run src/el31_add_repeat.py

replaced node /elearning/f5 into localhost 2500


In [None]:
%pycat src/el31_add_repeat.py

## Add Limit and Inlimit

A limit may prevent jobs from being submitted immediately. It can represent a mutex (value 1) or a semaphore.

The Inlimit attribute registers a node with a limit.
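
The mutex and semaphore analogy can be made explicit with a toy model. The `Limit` class below is hypothetical and only mimics the token counting; in ecFlow the server does this bookkeeping for the nodes registered with Inlimit.

```python
class Limit(object):
    """Hypothetical model of a limit: a counter with a maximum."""

    def __init__(self, name, maximum):
        self.name = name
        self.maximum = maximum
        self.running = set()

    def try_submit(self, task):
        """Consume a token if one is free; otherwise the task stays queued."""
        if len(self.running) >= self.maximum:
            return False
        self.running.add(task)
        return True

    def release(self, task):
        """Give the token back when the task completes or aborts."""
        self.running.discard(task)


mutex = Limit("one_at_a_time", 1)
print(mutex.try_submit("t1"))
print(mutex.try_submit("t2"))
mutex.release("t1")
print(mutex.try_submit("t2"))
```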

In [26]:
%run src/el32_add_limit.py

replaced node /elearning/f5 into localhost 2500


In [None]:
%pycat src/el32_add_limit.py

## Add Late

In [28]:
%run src/el33_add_late.py

replaced node /elearning/f5 into localhost 2500


Late is an attribute which may cause a pop-up window, to catch attention, when a job remains in submitted or active status for too long, or when complete is not reached in time. To really catch attention, some might prefer a **watchdog**: a dedicated task which turns aborted beyond a given threshold, or quietly becomes complete, thanks to a Time and a Complete attribute.
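
The kind of test behind a late flag, or behind a home-made watchdog, can be sketched as follows. The function `is_late`, its signature, and the thresholds are assumptions made for this illustration; the case where complete is not reached in time is not modelled.

```python
from datetime import datetime, timedelta


def is_late(status, since, now, submitted_max, active_max):
    """Return True when a job stayed submitted or active for too long.

    `since` is the time the status was entered; the two thresholds are
    timedelta objects. Names and signature are assumptions of this sketch.
    """
    elapsed = now - since
    if status == "submitted":
        return elapsed > submitted_max
    if status == "active":
        return elapsed > active_max
    return False


start = datetime(2017, 10, 20, 12, 0)
limits = dict(submitted_max=timedelta(minutes=15), active_max=timedelta(hours=2))
print(is_late("submitted", start, start + timedelta(minutes=20), **limits))
print(is_late("active", start, start + timedelta(minutes=20), **limits))
```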

## Debug a task with an Alias

[alias](anim/ecflow_alias.mp4)

# Exercises

## Data Acquisition suite example

In [20]:
%run src/el41_data_acquisition.py

# 4.8.0
suite data_acquisition
  defstatus suspended
  repeat day 1
  edit ECF_INCLUDE '/home/map/ecflow_course'
  edit ECF_FILES '/home/map/ecflow_course/acq'
  edit SLEEP '2'
  edit ECF_HOME '/home/map/ecflow_course'
  family Exeter
    family archive
      family observations
        time 00:00 23:00 01:00
        task get
          label info ""
        task process
          trigger get eq complete
        task store
          trigger get eq complete
      endfamily
      family fields
        time 00:00 23:00 01:00
        task get
          label info ""
        task process
          trigger get eq complete
        task store
          trigger get eq complete
      endfamily
      family images
        time 00:00 23:00 01:00
        task get
          label info ""
        task process
          trigger get eq complete
        task store
          trigger get eq complete
      endfamily
    endfamily
  endfamily
  family Toulouse
    family archive
      family observations
   

In [None]:
%pycat src/el41_data_acquisition.py

## Operational suite?

In [21]:
%run src/el41_data_acquisition.py

# 4.8.0
suite data_acquisition
  defstatus suspended
  repeat day 1
  edit ECF_INCLUDE '/home/map/ecflow_course'
  edit ECF_FILES '/home/map/ecflow_course/acq'
  edit SLEEP '2'
  edit ECF_HOME '/home/map/ecflow_course'
  family Exeter
    family archive
      family observations
        time 00:00 23:00 01:00
        task get
          label info ""
        task process
          trigger get eq complete
        task store
          trigger get eq complete
      endfamily
      family fields
        time 00:00 23:00 01:00
        task get
          label info ""
        task process
          trigger get eq complete
        task store
          trigger get eq complete
      endfamily
      family images
        time 00:00 23:00 01:00
        task get
          label info ""
        task process
          trigger get eq complete
        task store
          trigger get eq complete
      endfamily
    endfamily
  endfamily
  family Toulouse
    family archive
      family observations
   

In [None]:
%pycat src/el41_data_acquisition.py

## Back archiving

In [22]:
%run src/el42_operational_suite_solution.py

Defs file operational_suite.def
  suite '/operational_suite has not completed
Please see files .flat and .depth for analysis
# 4.8.0
defs_state MIGRATE
  edit ECF_MICRO '%' # server
  edit ECF_HOME '.' # server
  edit ECF_JOB_CMD '%ECF_JOB% 1> %ECF_JOBOUT% 2>&1' # server
  edit ECF_KILL_CMD 'kill -15 %ECF_RID%' # server
  edit ECF_STATUS_CMD 'ps --sid %ECF_RID% -f' # server
  edit ECF_URL_CMD '${BROWSER:=firefox} -remote 'openURL(%ECF_URL_BASE%/%ECF_URL%)'' # server
  edit ECF_URL_BASE 'https://software.ecmwf.int' # server
  edit ECF_URL 'wiki/display/ECFLOW/Home' # server
  edit ECF_LOG 'eowyn.3141.ecf.log' # server
  edit ECF_INTERVAL '60' # server
  edit ECF_LISTS 'eowyn.3141.ecf.lists' # server
  edit ECF_CHECK 'eowyn.3141.ecf.check' # server
  edit ECF_CHECKOLD 'eowyn.3141.ecf.check.b' # server
  edit ECF_CHECKINTERVAL '120' # server
  edit ECF_CHECKMODE 'CHECK_ON_TIME' # server
  edit ECF_TRIES '2' # server
  edit ECF_VERSION '4.8.0' # server
  edit ECF_PORT '3141' # server
  edi

In [None]:
%pycat src/el42_operational_suite_solution.py

## All together

In [23]:
%run  src/el51_gallery_suite_example.py

Defs file elearning.def
  suite '/elearning has not completed
Please see files .flat and .depth for analysis
# 4.8.0
defs_state MIGRATE
  edit ECF_MICRO '%' # server
  edit ECF_HOME '.' # server
  edit ECF_JOB_CMD '%ECF_JOB% 1> %ECF_JOBOUT% 2>&1' # server
  edit ECF_KILL_CMD 'kill -15 %ECF_RID%' # server
  edit ECF_STATUS_CMD 'ps --sid %ECF_RID% -f' # server
  edit ECF_URL_CMD '${BROWSER:=firefox} -remote 'openURL(%ECF_URL_BASE%/%ECF_URL%)'' # server
  edit ECF_URL_BASE 'https://software.ecmwf.int' # server
  edit ECF_URL 'wiki/display/ECFLOW/Home' # server
  edit ECF_LOG 'eowyn.3141.ecf.log' # server
  edit ECF_INTERVAL '60' # server
  edit ECF_LISTS 'eowyn.3141.ecf.lists' # server
  edit ECF_CHECK 'eowyn.3141.ecf.check' # server
  edit ECF_CHECKOLD 'eowyn.3141.ecf.check.b' # server
  edit ECF_CHECKINTERVAL '120' # server
  edit ECF_CHECKMODE 'CHECK_ON_TIME' # server
  edit ECF_TRIES '2' # server
  edit ECF_VERSION '4.8.0' # server
  edit ECF_PORT '3141' # server
  edit ECF_NODE '%ECF

In [None]:
%pycat src/el51_gallery_suite_example.py

# Resources

- Confluence ecFlow: https://software.ecmwf.int/wiki/display/ECFLOW/ecflow+home
- ecFlow, one page: https://software.ecmwf.int/wiki/display/ECFLOW/ecFlow@ECMWF
- Tutorial: https://software.ecmwf.int/wiki/display/ECFLOW/Tutorial
- User Manual: https://software.ecmwf.int/wiki/display/ECFLOW/User+Manual
- Cookbook: https://software.ecmwf.int/wiki/display/ECFLOW/Cookbook
- Source code: https://software.ecmwf.int/stash/projects/ECFLOW/repos/ecflow/browse
- elearning repository: https://software.ecmwf.int/stash/projects/ECFLOW/repos/elearning/browse

# Glossary

https://software.ecmwf.int/wiki/display/ECFLOW/Glossary