# Basic examples
In this notebook we show how to write a basic pipeline in the **desipipe** framework. **desipipe** must be installed first, for example with:
```
python -m pip install git+https://github.com/cosmodesi/desipipe#egg=desipipe
```

## Task manager
Let's consider a simple example: the Monte-Carlo estimation of $\pi$.
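
For reference, the estimator itself can be sketched in plain Python, without **desipipe**. Points drawn uniformly in the square $[-1, 1]^2$ fall inside the unit disk with probability $\pi / 4$, so four times the observed fraction estimates $\pi$. The helper names `fraction_in_disk` and `estimate_pi` are illustrative only, and this sketch uses the standard `random` module with an explicit seed per batch.

```python
import random


def fraction_in_disk(seed, size=10000):
    """Fraction of points drawn uniformly in [-1, 1]^2 that fall inside the unit disk."""
    rng = random.Random(seed)  # one reproducible generator per batch
    inside = 0
    for _ in range(size):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x**2 + y**2 < 1.:
            inside += 1
    return inside / size


def estimate_pi(nbatches=20, size=10000):
    """Average the per-batch fractions and multiply by 4 (disk area pi over square area 4)."""
    fractions = [fraction_in_disk(seed, size=size) for seed in range(nbatches)]
    return 4. * sum(fractions) / len(fractions)


print('pi is ~ {:.4f}'.format(estimate_pi()))
```

With 20 batches of 10000 points the statistical error on the estimate is expected to be of order a few $10^{-3}$.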

In [1]:
import time

from desipipe import Queue, Environment, TaskManager, FileManager, spawn

# Let's instantiate a Queue, which records all tasks to be performed
queue = Queue('test', base_dir='_tests')
queue.clear()
# Pool of 4 workers
# Any environment variable can be passed to Environment: it will be set when running the tasks below
tm = TaskManager(queue, environ=Environment(), scheduler=dict(max_workers=4))

def draw_random_numbers(size):
    import numpy as np
    return np.random.uniform(-1, 1, size)

# We decorate the function (task) with tm.python_app
@tm.python_app
def fraction(seed=42, size=10000, draw_random_numbers=draw_random_numbers):
    # All definitions, except input parameters, must be in the function itself or in its arguments,
    # and this applies recursively: draw_random_numbers is defined above and passed as a default argument,
    # and everything it needs (here the numpy import) is in its own body.
    # This is required for the tasks to be picklable (~ can be converted to bytes)
    import time
    import numpy as np
    time.sleep(5)  # wait 5 seconds, just to show jobs are indeed run in parallel
    x, y = draw_random_numbers(size), draw_random_numbers(size)
    return np.sum((x**2 + y**2) < 1.) * 1. / size  # fraction of points in the inner circle of radius 1

# Here we use another task manager, with only 1 worker
tm2 = tm.clone(scheduler=dict(max_workers=1))
@tm2.python_app
def average(fractions):
    import numpy as np
    return np.average(fractions) * 4.

# Let's add another task, to be run with bash
@tm2.bash_app
def echo(avg):
    return ['echo', '-n', 'bash app says pi is ~ {:.4f}'.format(avg)]

t0 = time.time()
# The following line stacks all the tasks in the queue
fractions = [fraction(seed=i) for i in range(20)]
# fractions is a list of Future instances
# We can pass them to other tasks, which creates a dependency graph
avg = average(fractions)
ech = echo(avg)
print('Elapsed time: {:.4f}'.format(time.time() - t0))

Elapsed time: 1.0306


The cell above only stacks the tasks in the queue; nothing is run yet, which is why it returns in about a second.
The ``fraction`` tasks are 'PENDING' (eligible to be run), while the ``average`` task is 'WAITING' for them to complete; ``echo`` in turn waits for ``average``.
Running the cell writes a queue named 'test' to disk, in the directory ``_tests``
(the default directory is ``${HOME}/.desipipe/queues/${USERNAME}/``).
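
As a rough mental model of these two states (a toy sketch, not **desipipe**'s actual implementation), a task can be considered 'PENDING' once all the tasks it requires have succeeded, and 'WAITING' otherwise. The function `initial_state` and the dictionary `requires` are hypothetical names introduced for illustration.

```python
def initial_state(task, requires, succeeded):
    """Return 'PENDING' if every requirement of ``task`` has succeeded, else 'WAITING'."""
    if all(req in succeeded for req in requires.get(task, [])):
        return 'PENDING'
    return 'WAITING'


# Dependency graph of the example: average needs all fractions, echo needs average
fraction_names = ['fraction_{:d}'.format(i) for i in range(20)]
requires = {'average': fraction_names, 'echo': ['average']}

succeeded = set()  # nothing has run yet
for task in ['fraction_0', 'average', 'echo']:
    print(task, initial_state(task, requires, succeeded))

# Once all fractions have succeeded, average becomes eligible; echo still waits for average
succeeded.update(fraction_names)
print('average', initial_state('average', requires, succeeded))
print('echo', initial_state('echo', requires, succeeded))
```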

Now, we can spawn a manager process that will run the above tasks, following the specifications of the task managers.

In [2]:
# Spawn a process that will distribute the tasks over workers
spawn(queue)
# Alternatively, with the command line (see below):
# desipipe spawn -q ./_tests/test --spawn

bash app says pi is ~ 3.147


In [3]:
# result() returns the result of the function, which can take some time to complete
# in this case, ~ 20 tasks which take 5 seconds distributed over 4 processes: typically 25 seconds
print(ech.out())
print('pi is ~ {:.4f}'.format(avg.result()))
print('Elapsed time: {:.1f}'.format(time.time() - t0))

bash app says pi is ~ 3.1470
pi is ~ 3.1470
Elapsed time: 37.6
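
The 25 seconds quoted in the cell above is an idealized figure: it assumes equal-length tasks processed in successive waves of `max_workers`, with no overhead. The measured elapsed time is larger, presumably because it also covers queuing, spawning the workers and the two downstream tasks. The function name `ideal_runtime` is illustrative.

```python
import math


def ideal_runtime(ntasks, task_seconds, max_workers):
    """Wall time if equal-length tasks run in successive waves of max_workers, with no overhead."""
    nwaves = math.ceil(ntasks / max_workers)
    return nwaves * task_seconds


print(ideal_runtime(ntasks=20, task_seconds=5, max_workers=4))  # 25
```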


## Tips
If you re-execute the two cells above, the cached results are returned immediately.
If you modify e.g. ``fraction``, a new result (including that of ``average``, which depends on it) will be computed.
If you modify ``average``, only ``average`` will be computed again.
To change this default behavior, you can pass ``skip=True`` (skip this app) or ``name=True`` (or the name of the original app) to the decorator.

In [4]:
@tm2.bash_app(skip=True)  # no computation scheduled, just returns None
def echo2(avg):
    return 42

assert echo2(avg) is None

@tm2.bash_app(name=True)
def fraction():
    return None

for frac in fractions:
    assert fraction().result() == frac.result()  # the previous fraction result is used

@tm2.bash_app(name='echo')
def echo2(avg):
    return 42

print(echo2().out())  # the same as echo().out()

bash app says pi is ~ 3.1470


Note that one can incrementally build the script: previous tasks will not be rerun if they have not changed.
One can interact with ``queue`` from python directly, e.g.: ``queue.tasks()`` to list tasks, ``queue.pause()`` to pause the queue, ``queue.resume()`` to resume the queue, etc.
Usually though, one will use the command line: see below.
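
The rule that unchanged tasks are not rerun, while a change propagates to everything downstream, can be pictured with a toy model (an assumption made for illustration, not **desipipe**'s actual mechanism): identify each task by a hash of its source code, its arguments and the identifiers of the tasks it depends on. The function `task_id` is hypothetical.

```python
import hashlib


def task_id(code, args=(), requires=()):
    """Toy identifier: hash of the task's source, its arguments and its requirements' identifiers."""
    h = hashlib.sha256()
    for item in (code, repr(args), *requires):
        h.update(item.encode())
        h.update(b'\0')  # separator, so that ('ab', 'c') and ('a', 'bc') differ
    return h.hexdigest()


def pipeline_ids(fraction_code, average_code):
    """Identifiers of three fraction tasks and of the average task that depends on them."""
    fractions = [task_id(fraction_code, args=(seed,)) for seed in range(3)]
    average = task_id(average_code, requires=fractions)
    return fractions, average


ref_fractions, ref_average = pipeline_ids('fraction v1', 'average v1')

# Editing average: fraction identifiers are unchanged, only average must be recomputed
fractions, average = pipeline_ids('fraction v1', 'average v2')
print(fractions == ref_fractions, average == ref_average)  # True False

# Editing fraction: both fraction and average identifiers change
fractions, average = pipeline_ids('fraction v2', 'average v1')
print(fractions == ref_fractions, average == ref_average)  # False False
```

In this picture, a result stored under an identifier that is seen again can be returned without recomputation.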

## Command line
We provide a number of command-line instructions to interact with queues: list queues, list the tasks in a queue, pause or resume a queue, etc.
Each command has many options; to get help, run e.g. ``desipipe kill --help``.

### Print queues

In [5]:
%%bash
desipipe queues -q './_tests/*'

[000000.00]  09-12 18:45  desipipe                  INFO     Matching queues:
[000000.00]  09-12 18:45  desipipe                  INFO     Queue(size=22, state=ACTIVE, filename=/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 0
RUNNING   : 0
SUCCEEDED : 22
FAILED    : 0
KILLED    : 0
UNKNOWN   : 0


### Print tasks in a queue

In [6]:
%%bash
desipipe tasks -q ./_tests/test
# task state can be:
# WAITING: Waiting for requirements (other tasks) to finish
# PENDING: Eligible to be selected and run
# RUNNING: Running right now (out and err are updated live)
# SUCCEEDED: Finished with errno = 0
# FAILED: Finished with errno != 0

[000000.12]  09-12 18:45  desipipe                  INFO     Tasks that are SUCCEEDED:
[000000.12]  09-12 18:45  desipipe                  INFO     app: fraction
[000000.12]  09-12 18:45  desipipe                  INFO     jobid: 2344187
[000000.12]  09-12 18:45  desipipe                  INFO     errno: 0
[000000.12]  09-12 18:45  desipipe                  INFO     err: 
[000000.12]  09-12 18:45  desipipe                  INFO     out: 
[000000.12]  09-12 18:45  desipipe                  INFO     app: fraction
[000000.12]  09-12 18:45  desipipe                  INFO     jobid: 2344195
[000000.12]  09-12 18:45  desipipe                  INFO     errno: 0
[000000.12]  09-12 18:45  desipipe                  INFO     err: 
[000000.12]  09-12 18:45  desipipe                  INFO     out: 
[000000.12]  09-12 18:45  desipipe                  INFO     app: fraction
[000000.12]  09-12 18:45  desipipe                  INFO     jobid: 2344209
[000000.12]  09-12 18:45  desipipe                  INFO     errno: 0
[000000.12]  09-12 18:45  desipipe                  INFO     err: 
[000000.12]  09-12 18:45  desipipe                  INFO     out: 
[000000.12]  09-12 18:45  desipipe                  INFO     app: fraction
[000000.12]  09-12 18:45  desipipe                  INFO     jobid: 2344230
[000000.12]  09-12 18:45  desipipe                  INFO     errno: 0
[000000.12]  09-12 18:45  desipipe                  INFO     err: 
[000000.12]  09-12 18:45  desipipe                  INFO     out: 
[000000.12]  09-12 18:45  desipipe                  INFO     app: average
[000000.12]  09-12 18:45  desipipe                  INFO     jobid: 2344315
[000000.12]  09-12 18:45  desipipe                  INFO     errno: 0
[000000.12]  09-12 18:45  desipipe                  INFO     err: 
[000000.12]  09-12 18:45  desipipe                  INFO     out: 

### Pause a queue
When pausing a queue, all processes running tasks from this queue will stop (after they finish their current task).

In [7]:
%%bash
desipipe pause -q ./_tests/test
desipipe queues -q './_tests/*'  # state is now PAUSED

[000000.00]  09-12 18:45  desipipe                  INFO     Pausing queue Queue(size=22, state=ACTIVE, filename=/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
[000000.00]  09-12 18:45  desipipe                  INFO     Matching queues:
[000000.00]  09-12 18:45  desipipe                  INFO     Queue(size=22, state=PAUSED, filename=/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 0
RUNNING   : 0
SUCCEEDED : 22
FAILED    : 0
KILLED    : 0
UNKNOWN   : 0


### Resume a queue
Resuming a queue makes its tasks available for processing again.

In [8]:
%%bash
desipipe resume -q ./_tests/test  # pass --spawn to spawn a manager process that will distribute the tasks among workers
desipipe queues -q './_tests/*'  # state is now ACTIVE

[000000.00]  09-12 18:45  desipipe                  INFO     Resuming queue Queue(size=22, state=PAUSED, filename=/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
[000000.00]  09-12 18:45  desipipe                  INFO     Matching queues:
[000000.00]  09-12 18:45  desipipe                  INFO     Queue(size=22, state=ACTIVE, filename=/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 0
RUNNING   : 0
SUCCEEDED : 22
FAILED    : 0
KILLED    : 0
UNKNOWN   : 0


### Retry tasks
Change the state of the selected tasks (here, those in state SUCCEEDED) to PENDING, so that they are run again.

In [9]:
%%bash
desipipe retry -q ./_tests/test --state SUCCEEDED
desipipe queues -q './_tests/*'  # task state is now PENDING

[000000.00]  09-12 18:45  desipipe                  INFO     Matching queues:
[000000.00]  09-12 18:45  desipipe                  INFO     Queue(size=22, state=ACTIVE, filename=/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 22
RUNNING   : 0
SUCCEEDED : 0
FAILED    : 0
KILLED    : 0
UNKNOWN   : 0


### Spawn a manager process
Spawn a manager process that will distribute the tasks among workers, using the scheduler and provider defined above.

In [10]:
%%bash
desipipe spawn -q ./_tests/test  # pass --spawn to spawn an independent process, and exit this one
desipipe queues -q './_tests/*'  # tasks have been reprocessed: SUCCEEDED

bash app says pi is ~ 3.147
[000000.00]  09-12 18:45  desipipe                  INFO     Matching queues:
[000000.00]  09-12 18:45  desipipe                  INFO     Queue(size=22, state=ACTIVE, filename=/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 0
RUNNING   : 4
SUCCEEDED : 18
FAILED    : 0
KILLED    : 0
UNKNOWN   : 0


### Kill running tasks
Kill running tasks of a given queue.

In [11]:
%%bash
desipipe kill -q ./_tests/test


### Delete queue(s)

In [12]:
%%bash
desipipe delete -q './_tests/*'  # pass --force to actually delete the queue

[000000.00]  09-12 18:45  desipipe                  INFO     I will delete these queues:
[000000.00]  09-12 18:45  desipipe                  INFO     Queue(size=22, state=ACTIVE, filename=/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 0
RUNNING   : 0
SUCCEEDED : 22
FAILED    : 0
KILLED    : 0
UNKNOWN   : 0


## File manager
The file manager aims to keep track of the files (of all kinds) produced during processing.

In [13]:
%%file '_tests/files.yaml'

description: Some text file
id: my_input_file
filetype: text
path: ${SOMEDIR}/in_{option1}_{i:d}.txt
author: Chuck Norris
options:
  option1: ['a', 'b']
  i: range(0, 3, 1)

Overwriting _tests/files.yaml


In [14]:
fm = FileManager('_tests/files.yaml', environ=dict(SOMEDIR='_tests'))
# To select files
fm2 = fm.select(keywords='text file', option1=['a'])
# Iterate over files
for fi in fm2:
    print(fi)
    # Write text
    fi.write('hello world!')

BaseFile(
filetype: text,
id: my_input_file,
author: Chuck Norris,
options: {'option1': 'a', 'i': 0},
description: Some text file,
filepath: _tests/in_a_0.txt
)
BaseFile(
filetype: text,
id: my_input_file,
author: Chuck Norris,
options: {'option1': 'a', 'i': 1},
description: Some text file,
filepath: _tests/in_a_1.txt
)
BaseFile(
filetype: text,
id: my_input_file,
author: Chuck Norris,
options: {'option1': 'a', 'i': 2},
description: Some text file,
filepath: _tests/in_a_2.txt
)
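
The three entries printed above follow from expanding the `path` template over the selected option values. Here is a minimal sketch of that expansion with the standard library, assuming `${SOMEDIR}` is substituted like a shell-style variable and `{option1}`, `{i:d}` like `str.format` fields. The helper `expand_paths` is illustrative and not part of **desipipe**'s API.

```python
import itertools
from string import Template


def expand_paths(path, options, environ):
    """Return one path per combination of option values."""
    # '${SOMEDIR}' -> environ['SOMEDIR']; the '{option1}' and '{i:d}' fields are left untouched
    path = Template(path).safe_substitute(environ)
    names = list(options)
    return [path.format(**dict(zip(names, values)))
            for values in itertools.product(*(options[name] for name in names))]


paths = expand_paths('${SOMEDIR}/in_{option1}_{i:d}.txt',
                     options={'option1': ['a'], 'i': range(0, 3, 1)},
                     environ={'SOMEDIR': '_tests'})
print(paths)  # ['_tests/in_a_0.txt', '_tests/in_a_1.txt', '_tests/in_a_2.txt']
```

Without the selection on `option1`, the same template would expand to six paths (two values of `option1` times three values of `i`).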


In [15]:
# To add a new entry
fm.append(dict(description='added file', id='added_file', filetype='catalog', path='test.fits'))
# To delete an entry
del fm[-1]
# To add a cloned entry
fm.append(fm[0].clone(id='my_output_file', path='${SOMEDIR}/out_{option1}_{i:d}.txt'))
fm.write('_tests/files.yaml')
# Display the new file database
!cat '_tests/files.yaml'

author: Chuck Norris
description: Some text file
filetype: text
id: my_input_file
options:
  i: range(0, 3)
  option1: [a, b]
path: ${SOMEDIR}/in_{option1}_{i:d}.txt
---
author: Chuck Norris
description: Some text file
filetype: text
id: my_output_file
options:
  i: range(0, 3)
  option1: [a, b]
path: ${SOMEDIR}/out_{option1}_{i:d}.txt


In practice, we will just edit the *.yaml* file directly.

In [16]:
# Let's add a new task!
@tm.python_app
def copy(text_in, text_out):
    import numpy as np  # just to illustrate that the package version is tracked
    text = text_in.read()
    text += ' this is my first message'
    print('saving', text_out.filepath)
    text_out.write(text)

In [17]:
# Iterate over files
for fi in fm.select(option1=['a']):
    copy(fi.get(id='my_input_file'), fi.get(id='my_output_file'))

# Let's spawn a new process, as the previous one has finished (there was no work left!)
from desipipe import spawn
spawn(queue)

saving _tests/out_a_0.txt
saving _tests/out_a_1.txt
saving _tests/out_a_2.txt


In [18]:
!ls -a _tests/

.   .desipipe	in_a_0.txt  in_a_2.txt	 out_a_1.txt  test.sqlite
..  files.yaml	in_a_1.txt  out_a_0.txt  out_a_2.txt


In [19]:
!cat _tests/out_a_0.txt

hello world! this is my first message

In [20]:
# This is where desipipe processing information is saved
!ls -a _tests/.desipipe
print('\n*.py file is:')
!cat _tests/.desipipe/copy.py
print('\n*.versions file is:')
!cat _tests/.desipipe/copy.versions

.  ..  copy.py	copy.versions

*.py file is:
def copy(text_in, text_out):
    import numpy as np  # just to illustrate that the package version is tracked
    text = text_in.read()
    text += ' this is my first message'
    print('saving', text_out.filepath)
    text_out.write(text)

*.versions file is:
ctypes=1.1.0
json=2.0.9
numpy=1.22.3
mpi4py=3.1.4


In [21]:
# Delete queue
queue.delete()