# Task manager
In this notebook we will show how to process tasks within the **desipipe** framework. You first need to install **desipipe** with:
```
python -m pip install git+https://github.com/cosmodesi/desipipe#egg=desipipe
```
You can also take a look at https://desipipe.readthedocs.io/en/latest/user/getting_started.html.

## Toy example
Let's consider a simple example: the Monte-Carlo estimation of $\pi$.
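
Before distributing anything, the estimate itself can be sketched serially: draw points uniformly in the square $[-1, 1]^2$; the fraction that lands inside the unit circle approaches the area ratio $\pi / 4$. The snippet below uses only the standard library, and ``estimate_pi`` is an illustrative helper, not part of **desipipe**.

```python
import random

def estimate_pi(size=10000, seed=42):
    """Estimate pi from the fraction of random points falling inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(size):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x**2 + y**2 < 1.:
            inside += 1
    # Area of the circle / area of the square = pi / 4
    return 4. * inside / size

print('pi is ~ {:.4f}'.format(estimate_pi()))
```

The statistical error decreases as $1 / \sqrt{N}$ with the total number of points, which is why the example below averages many independent draws.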

In [1]:
import time

from desipipe import Queue, Environment, TaskManager, FileManager, spawn

# Let's instantiate a Queue, which records all tasks to be performed
queue = Queue('test', base_dir='_tests')
queue.clear()
# Pool of 4 workers
# Any environment variable can be passed to Environment: it will be set when running the tasks below
tm = TaskManager(queue, environ=Environment(), scheduler=dict(max_workers=4))

def draw_random_numbers(size):
    import numpy as np
    return np.random.uniform(-1, 1, size)

# We decorate the function (task) with tm.python_app
@tm.python_app
def fraction(seed=42, size=10000, draw_random_numbers=draw_random_numbers):
    # Everything the function uses (imports, helpers) must be defined inside the function itself
    # or passed through its arguments, and this applies recursively:
    # draw_random_numbers, defined above, is passed as a default argument and is itself self-contained
    # This is required for the tasks to be picklable (~ can be converted to bytes)
    import time
    import numpy as np
    time.sleep(5)  # wait 5 seconds, just to show jobs are indeed run in parallel
    x, y = draw_random_numbers(size), draw_random_numbers(size)
    return np.sum((x**2 + y**2) < 1.) * 1. / size  # fraction of points in the inner circle of radius 1

# Here we use another task manager, with only 1 worker
tm2 = tm.clone(scheduler=dict(max_workers=1))
@tm2.python_app
def average(fractions):
    import numpy as np
    return np.average(fractions) * 4.

# Let's add another task, to be run with bash
@tm2.bash_app
def echo(avg):
    return ['echo', '-n', 'bash app says pi is ~ {:.4f}'.format(avg)]

t0 = time.time()
# The following line stacks all the tasks in the queue
fractions = [fraction(seed=i) for i in range(20)]
# fractions is a list of Future instances
# We can pass them to other tasks, which creates a dependency graph
avg = average(fractions)
ech = echo(avg)
print('Elapsed time: {:.4f}'.format(time.time() - t0))

Elapsed time: 0.7672


The cell above stacks all tasks in the queue. ``fraction`` tasks will be 'PENDING' (waiting to be run),
while ``average`` tasks will be 'WAITING' for the former to complete. ``echo`` also depends on ``average``.
Running the script above will write a queue on disk, with name 'test', in the directory ``_tests``
(by default, it is ``${HOME}/.desipipe/queues/${USERNAME}/``).
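
To make the two states concrete, here is a toy model of the rule described above: a task is 'PENDING' once all of its requirements have 'SUCCEEDED', and 'WAITING' otherwise. The dictionary layout and the ``task_states`` helper are assumptions made for illustration; they do not reflect how **desipipe** stores tasks internally.

```python
# Toy dependency graph: task name -> names of the tasks it requires
requires = {
    'fraction_0': [],
    'fraction_1': [],
    'average': ['fraction_0', 'fraction_1'],
    'echo': ['average'],
}

def task_states(requires, succeeded=()):
    """Return the state of each task, given the tasks that have already succeeded."""
    states = {}
    for name, deps in requires.items():
        if name in succeeded:
            states[name] = 'SUCCEEDED'
        elif all(dep in succeeded for dep in deps):
            states[name] = 'PENDING'  # eligible to run
        else:
            states[name] = 'WAITING'  # blocked by unfinished requirements
    return states

# Right after stacking the tasks: nothing has run yet
print(task_states(requires))
# Once both fraction tasks are done, average becomes eligible; echo still waits for average
print(task_states(requires, succeeded={'fraction_0', 'fraction_1'}))
```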

Now, we can spawn a manager process that will run the above tasks, following the specifications of the task managers.

In [2]:
# Spawn a process that will distribute the tasks over workers
spawn(queue, timestep=1.)
# Alternatively, with the command line (see below):
# desipipe spawn -q ./_tests/test --spawn

In [3]:
# result() returns the result of the function, which can take some time to complete:
# here, 20 tasks of 5 seconds each distributed over 4 workers, i.e. typically ~ 25 seconds
print(ech.out())
print('pi is ~ {:.4f}'.format(avg.result()))
print('Elapsed time: {:.1f}'.format(time.time() - t0))

bash app says pi is ~ 3.1420
pi is ~ 3.1420
Elapsed time: 30.4
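
The figure of about 25 seconds quoted in the cell above is simple arithmetic: with 4 workers, 20 tasks of equal length are processed in 5 successive batches of 5 seconds each. A minimal sketch, where ``expected_runtime`` is a hypothetical helper that ignores any overhead:

```python
import math

def expected_runtime(ntasks, task_time, max_workers):
    """Ideal wall-clock time for equal-length tasks, ignoring any scheduling overhead."""
    nbatches = math.ceil(ntasks / max_workers)
    return nbatches * task_time

print(expected_runtime(ntasks=20, task_time=5., max_workers=4))  # 25.0
```

The measured time is somewhat larger, presumably because of scheduling overhead and the downstream ``average`` and ``echo`` tasks.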


## Tips
If you re-execute the two cells above, the cached result is immediately returned.
If you modify e.g. ``fraction``, a new result (including ``average``) will be computed.
If you modify ``average``, only ``average`` will be computed again.
To change this default behavior and *not recompute* ``average``, you can pass ``skip=True`` (skip this app no matter what) or ``name=True`` (reuse the results already stored for the app of the same name; the original app name can also be passed explicitly, e.g. ``name='echo'``).

In [4]:
@tm2.bash_app(skip=True)  # no computation scheduled, just returns None
def echo2(avg):
    return 42

assert echo2(avg) is None

@tm2.bash_app(name=True)
def fraction():
    return None

for frac in fractions:
    assert fraction().result() == frac.result()  # the previous fraction result is used

@tm2.bash_app(name='echo')
def echo2(avg):
    return 42

print(echo2().out())  # the same as echo().out()

bash app says pi is ~ 3.1420


Now, let's imagine some tasks have failed, and you want to rerun them (and only them) after changing the code. Let's illustrate this with some 'FAILED' tasks:

In [5]:
@tm2.python_app
def test_error(i):
    if i >= 2:
        raise ValueError(str(i))
    else:
        return i

errors = [test_error(i) for i in range(4)]  # list of tasks
spawn(queue)  # run the tasks in spawned processes
errors = [error.err() for error in errors]  # let's get the err output
# No error for the first 2 (i < 2)
print(errors[:2])
# ValueError for the others (i >= 2)
print(errors[2:])

['', '']
['Traceback (most recent call last):\n  File "/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/desipipe/task_manager.py", line 1432, in run\n    result = self._run(**kwargs)\n  File "/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/desipipe/task_manager.py", line 1353, in _run\n    return self.func(*args, **kw)\n  File "<string>", line 3, in test_error\nValueError: 2\n', 'Traceback (most recent call last):\n  File "/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/desipipe/task_manager.py", line 1432, in run\n    result = self._run(**kwargs)\n  File "/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/desipipe/task_manager.py", line 1353, in _run\n    return self.func(*args, **kw)\n  File "<string>", line 3, in test_error\nValueError: 3\n']


We notice that we made a (here, artificial) mistake in the code, so we do:

In [6]:
@tm2.python_app(name=True, state='SUCCEEDED')  # SUCCEEDED tasks, with this name ('test_error') are not rerun
def test_error(i):
    return i + 10  # let's add 10 to distinguish them from the previous run

errors = [test_error(i) for i in range(4)]  # list of tasks
spawn(queue)  # run the tasks in spawned processes
errors = [error.result() for error in errors]  # let's get the result
print(errors)

[0, 1, 12, 13]


The first two tasks are not rerun (they were 'SUCCEEDED'), giving 0 and 1. The other tasks (previously 'FAILED') have been rerun with the new code, giving 12 and 13.
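
The selection logic can be mimicked in a few lines of plain Python. This is only a toy model of the behavior observed above: the ``previous`` dictionary and the ``run`` and ``rerun`` helpers are hypothetical and do not correspond to **desipipe** internals.

```python
def old_code(i):
    if i >= 2:
        raise ValueError(str(i))
    return i

def new_code(i):
    return i + 10

def run(func, i):
    """Run func(i), returning a (state, result) pair."""
    try:
        return 'SUCCEEDED', func(i)
    except Exception:
        return 'FAILED', None

def rerun(previous, func, state='SUCCEEDED'):
    """Keep the results of tasks in the given state, rerun all the others with func."""
    return {i: (st, res) if st == state else run(func, i) for i, (st, res) in previous.items()}

# First pass, with the faulty code: tasks 2 and 3 fail
previous = {i: run(old_code, i) for i in range(4)}
# Second pass, with the fixed code: only the tasks that did not succeed are rerun
current = rerun(previous, new_code)
print([res for st, res in current.values()])  # [0, 1, 12, 13]
```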

Note that one can incrementally build the script: previous tasks will not be rerun if they have not changed.
One can interact with ``queue`` from python directly, e.g.: ``queue.tasks()`` to list tasks, ``queue.pause()`` to pause the queue, ``queue.resume()`` to resume the queue, etc.
Usually though, one will use the command line: see below.
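
One way to picture why unchanged tasks are not rerun, and why modifying ``fraction`` also triggers ``average``, is to identify each task by a hash of its code and inputs, where the inputs of a downstream task include the identifiers of the tasks it depends on. The sketch below illustrates this general technique only; ``task_id`` is a hypothetical helper and **desipipe** may compute task identifiers differently.

```python
import hashlib

def task_id(func, *args):
    """Hash the function bytecode, its constants and its arguments into an identifier."""
    code = func.__code__
    payload = repr((code.co_code, code.co_consts, args)).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

def fraction_v1(size):
    return size * 1

def fraction_v2(size):  # "modified" version of the task
    return size * 2

def average(fractions):
    return sum(fractions) / len(fractions)

# Same code, same arguments: same identifier, so a stored result can be reused
assert task_id(fraction_v1, 100) == task_id(fraction_v1, 100)
# Modified code: new identifier, so the task is computed again
assert task_id(fraction_v1, 100) != task_id(fraction_v2, 100)
# The downstream identifier depends on the upstream one, so the change propagates
avg_v1 = task_id(average, task_id(fraction_v1, 100))
avg_v2 = task_id(average, task_id(fraction_v2, 100))
assert avg_v1 != avg_v2
```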

## Command line
We provide a number of command line instructions to interact with queues: list queues, list tasks in a queue, pause or resume a queue, etc.
Each command has many options. To get help on a given command, run e.g. ``desipipe kill --help``.

### Print queues
Print the list of all your queues.

In [7]:
%%bash
desipipe queues -q './_tests/*'

[000000.27]  10-17 20:47  desipipe                  INFO     Matching queues:
[000000.27]  10-17 20:47  desipipe                  INFO     Queue(size=28, state=ACTIVE, filename=/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 0
RUNNING   : 0
SUCCEEDED : 26
FAILED    : 2
KILLED    : 0
UNKNOWN   : 0


### Print tasks in a queue

Task state can be:

  - 'WAITING': Waiting for requirements (other tasks) to finish.
  - 'PENDING': Eligible to be selected and run.
  - 'RUNNING': Running right now (out and err are updated live).
  - 'SUCCEEDED': Finished with errno = 0. All good!
  - 'FAILED': Finished with errno != 0. This means the code raised an exception.
  - 'KILLED': Killed. Typically when the task has not had time to finish, because the requested amount of time (if any) was not sufficient. May be raised by out-of-memory as well.
  - 'UNKNOWN': The task has been in 'RUNNING' state longer than the requested amount of time (if any) in the provider. This means that **desipipe** could not properly update the task state before the job was killed, typically because the job ran out-of-time. If you scheduled the requested time to be able to fit in multiple tasks, you may just want to retry running these tasks (see below).
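
As an illustration of the last rule, the 'UNKNOWN' state can be pictured as a timeout check on tasks still marked 'RUNNING'. The helper ``displayed_state`` and its arguments are hypothetical; this is a sketch of the rule as stated above, not of the actual **desipipe** code.

```python
import time

def displayed_state(state, start_time, max_time=None, now=None):
    """Return 'UNKNOWN' for a task marked 'RUNNING' for longer than the requested time."""
    now = time.time() if now is None else now
    if state == 'RUNNING' and max_time is not None and now - start_time > max_time:
        # The job was most likely killed before the task state could be updated
        return 'UNKNOWN'
    return state

# Task started at t = 0 s, with 10 s requested
print(displayed_state('RUNNING', start_time=0., max_time=10., now=5.))   # RUNNING
print(displayed_state('RUNNING', start_time=0., max_time=10., now=20.))  # UNKNOWN
```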


In [8]:
%%bash
desipipe tasks -q ./_tests/test

[000000.32]  10-17 20:47  desipipe                  INFO     Tasks that are SUCCEEDED:
[000000.32]  10-17 20:47  desipipe                  INFO     app: fraction
[000000.32]  10-17 20:47  desipipe                  INFO     jobid: 134277
[000000.32]  10-17 20:47  desipipe                  INFO     app: fraction
[000000.32]  10-17 20:47  desipipe                  INFO     jobid: 134279
[000000.32]  10-17 20:47  desipipe                  INFO     app: fraction
[000000.32]  10-17 20:47  desipipe                  INFO     jobid: 134276
[000000.32]  10-17 20:47  desipipe                  INFO     app: fraction
[000000.32]  10-17 20:47  desipipe                  INFO     jobid: 134278
[000000.32]  10-17 20:47  desipipe                  INFO     app: fraction
[000000.32]  10-17 20:47  desipipe                  INFO     jobid: 134308
[000000.32]  10-17 20:47  desipipe                  INFO     app: fraction
[000000.32]  10-17 20:47  desipipe                  INFO     jobid: 134307
[000000.32]  

### Pause a queue
When pausing a queue, all processes running tasks from this queue will stop (after they finish their current task).

In [9]:
%%bash
desipipe pause -q ./_tests/test
desipipe queues -q './_tests/*'  # state is now PAUSED

[000000.28]  10-17 20:47  desipipe                  INFO     Pausing queue Queue(size=28, state=ACTIVE, filename=/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
[000000.28]  10-17 20:47  desipipe                  INFO     Matching queues:
[000000.28]  10-17 20:47  desipipe                  INFO     Queue(size=28, state=PAUSED, filename=/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 0
RUNNING   : 0
SUCCEEDED : 26
FAILED    : 2
KILLED    : 0
UNKNOWN   : 0


### Resume a queue
This is the opposite of ``pause``. When resuming a queue, tasks can get processed again (if a manager process is running).

In [10]:
%%bash
desipipe resume -q ./_tests/test  # pass --spawn to spawn a manager process that will distribute the tasks among workers
desipipe queues -q './_tests/*'  # state is now ACTIVE

[000000.27]  10-17 20:47  desipipe                  INFO     Resuming queue Queue(size=28, state=PAUSED, filename=/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
[000000.28]  10-17 20:47  desipipe                  INFO     Matching queues:
[000000.28]  10-17 20:47  desipipe                  INFO     Queue(size=28, state=ACTIVE, filename=/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 0
RUNNING   : 0
SUCCEEDED : 26
FAILED    : 2
KILLED    : 0
UNKNOWN   : 0
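
The pause and resume semantics described above (a worker finishes its current task, then stops picking up new ones until the queue is resumed) can be sketched with a flag that is checked only between tasks. ``ToyQueue`` and ``work`` are hypothetical stand-ins written for this illustration, not the **desipipe** ``Queue``.

```python
from collections import deque

class ToyQueue:
    """Minimal stand-in for a task queue with a pause flag."""
    def __init__(self, tasks):
        self.tasks = deque(tasks)
        self.paused = False

def work(toy_queue, on_task):
    """Process tasks until the queue is empty or paused; the flag is checked between tasks only."""
    done = []
    while toy_queue.tasks and not toy_queue.paused:
        task = toy_queue.tasks.popleft()
        on_task(toy_queue, task)  # the current task always runs to completion
        done.append(task)
    return done

def on_task(toy_queue, task):
    if task == 1:
        toy_queue.paused = True  # a pause request arrives while task 1 is running

toy_queue = ToyQueue(range(5))
print(work(toy_queue, on_task))  # [0, 1]
toy_queue.paused = False  # resume the queue
print(work(toy_queue, on_task))  # [2, 3, 4]
```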


### Retry tasks
For the sake of the example, we retry the 'SUCCEEDED' tasks here (typically you will want to retry the 'KILLED' ones). Only the tasks in the requested state are changed to 'PENDING', i.e. they will be processed again.

In [11]:
%%bash
desipipe retry -q ./_tests/test --state SUCCEEDED
desipipe queues -q './_tests/*'  # task state is now PENDING

[000000.27]  10-17 20:47  desipipe                  INFO     Matching queues:
[000000.28]  10-17 20:47  desipipe                  INFO     Queue(size=28, state=ACTIVE, filename=/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 26
RUNNING   : 0
SUCCEEDED : 0
FAILED    : 2
KILLED    : 0
UNKNOWN   : 0


### Spawn a manager process
This is the command that actually "gets the job done".
Specifically, it spawns a manager process that distributes the tasks among workers.

In [12]:
%%bash
desipipe spawn -q ./_tests/test  # pass --spawn to spawn an independent process, and exit this one
desipipe queues -q './_tests/*'  # tasks have been reprocessed: SUCCEEDED

[000000.27]  10-17 20:48  desipipe                  INFO     Matching queues:
[000000.27]  10-17 20:48  desipipe                  INFO     Queue(size=28, state=ACTIVE, filename=/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 0
RUNNING   : 0
SUCCEEDED : 26
FAILED    : 2
KILLED    : 0
UNKNOWN   : 0


### Kill running tasks
Kill the running tasks of the queue:

In [13]:
%%bash
#desipipe kill -q ./_tests/test

Kill all processes related to this queue (including manager processes):

In [14]:
%%bash
#desipipe kill -q ./_tests/test --all

### Delete queue(s)

In [15]:
%%bash
desipipe delete -q './_tests/*'  # pass --force to actually delete the queue

[000000.27]  10-17 20:48  desipipe                  INFO     I will delete these queues:
[000000.27]  10-17 20:48  desipipe                  INFO     Queue(size=28, state=ACTIVE, filename=/local/home/adematti/Bureau/DESI/NERSC/cosmodesi/desipipe/nb/_tests/test.sqlite)
WAITING   : 0
PENDING   : 0
RUNNING   : 0
SUCCEEDED : 26
FAILED    : 2
KILLED    : 0
UNKNOWN   : 0


## Troubleshooting

For troubleshooting advice, see https://desipipe.readthedocs.io/en/latest/user/getting_started.html.