# Distributed task graph execution
First, we run again on a single local machine. Then, we distribute the Dask scheduler and workers across several machines.

In [1]:
from pi_workload import define_pi_workload

Note that we only need to import the Dask distributed package,

In [2]:
import dask.distributed

And specify a worker,

In [3]:
client = dask.distributed.Client(n_workers=1)

In [4]:
client

Client   Scheduler: tcp://127.0.0.1:42161  Dashboard: http://127.0.0.1:8787/status
Cluster  Workers: 1  Cores: 16  Memory: 135.09 GB


The rest of the code is basically identical,

In [5]:
import time, numpy

In [7]:
start = time.time()
pi = define_pi_workload().compute()
elapse = time.time() - start

print('pi computed:   ', pi)
print('pi from numpy: ', numpy.pi)
print('wall time: ', elapse, 'sec')

workload in giga bytes: 250.0
500.0 chunks to process
pi computed:    3.141582139648
pi from numpy:  3.141592653589793
wall time:  67.92618155479431 sec


Note that task graph execution was still local here, as we didn't specify a distributed setup of Dask workers!
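
The `pi_workload` module is not shown in this notebook. As a rough idea of what such a workload could compute, here is a minimal sketch that assumes a Monte Carlo estimate of pi split into independent chunks. The function names `count_hits` and `estimate_pi`, the chunk sizes, and the use of plain Python instead of Dask are all assumptions made for illustration, not the course's implementation.

```python
import math
import random


def count_hits(n_points, seed):
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_chunks=20, points_per_chunk=10_000):
    """Estimate pi from independent chunks, one 'task' per chunk."""
    # Each chunk has its own seed, so chunks could run in any order or on any worker.
    hits = sum(count_hits(points_per_chunk, seed) for seed in range(n_chunks))
    return 4.0 * hits / (n_chunks * points_per_chunk)


pi_estimate = estimate_pi()
print("pi estimated: ", pi_estimate)
print("pi from math: ", math.pi)
```

Because every chunk depends only on its own seed, the per-chunk counts are the kind of independent tasks that a scheduler can hand to whichever worker is free.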

## Really going distributed...

There are only two ingredients for this to work properly. First, each machine needs local access to a Dask software environment. Second, the machines need to be able to communicate with each other. (If these kinds of examples don't work on your machines, a likely reason is that the necessary ports are blocked. In this case, your admins might be able to help.)
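
Whether a port is reachable can be checked before involving Dask at all. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` and the default timeout are assumptions made for illustration.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

By default, a Dask scheduler listens on port 8786 and serves its dashboard on port 8787, so once a scheduler is running on some machine, a call such as `port_is_open("scalc03.geomar.de", 8786)` from the client machine would show whether that port can be reached.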

This is what it looks like on our example machines,

In [8]:
!curl scalcmon.geomar.de


Thursday, 19 November 2020, 16:00

                      |               |           load_averages |         |
                 host |   used_memory |      1min   5min  15min |   users |
----------------------+---------------+-------------------------+---------+--------------
              scalc01 |         29.3% |      0.12   0.20   0.32 |       4 | #
              scalc02 |          0.5% |      0.00   0.01   0.00 |       1 | #
              scalc03 |          0.5% |      0.04   0.01   0.00 |       0 | #
              scalc04 |          0.5% |      0.02   0.02   0.00 |       0 | #
              scalc05 |          0.5% |      0.01   0.02   0.00 |       0 | #
              scalc06 |          0.5% |      0.07   0.06   0.01 |       0 | #
              scalc07 |          0.5% |      0.00   0.00   0.00 |       0 | #
              scalc08 |          0.5% |      0.00   0.01   0.00 |       0 | #
              scalc09 |          0.5% |      0.00   0.00   0.00 |       0 | #
              scalc1

### Create a Dask scheduler process on a second machine
First, we need to set up a Dask scheduler process that is visible to the local machine running the Dask client process. This will only set up the scheduler, without any actual Dask workers.

In [9]:
!hostname # this is where we are!

scalc15


In a Jupyter terminal, start the Dask scheduler on a remote machine, e.g. via SSH,


```bash
# Quote the remote command so that `&&` and $HOME are evaluated on the remote host
$ ssh khoeflich@scalc03.geomar.de \
'source $HOME/Course-Data-Science-with-Dask/03_distributed/conda.sh && dask-scheduler'
```


### Connect the scheduler with the Dask client process on the local machine
We just copy and paste the scheduler address from the `dask-scheduler` standard output in the terminal,

In [12]:
client = dask.distributed.Client('tcp://10.199.124.115:8786')

In [13]:
client

Client   Scheduler: tcp://10.199.124.115:8786  Dashboard: http://10.199.124.115:8787/status
Cluster  Workers: 0  Cores: 0  Memory: 0 B


### Create a Dask worker on a third machine
To be able to execute task graphs, we need at least one worker. Note that we have to tell the `dask-worker` where exactly the `dask-scheduler` process is located in the network,

```bash
# Quote the remote command so that `&&` and $HOME are evaluated on the remote host
$ ssh khoeflich@scalc08.geomar.de \
'source $HOME/Course-Data-Science-with-Dask/03_distributed/conda.sh && dask-worker 10.199.124.115:8786'
```

Now, the client already displays that a worker is connected to the scheduler and hence to the cluster,

In [14]:
client

Client   Scheduler: tcp://10.199.124.115:8786  Dashboard: http://10.199.124.115:8787/status
Cluster  Workers: 1  Cores: 16  Memory: 135.09 GB


And we can repeat the workload from above, but on a scaled out / distributed cluster.

In [18]:
start = time.time()
pi = define_pi_workload().compute()
elapse = time.time() - start

print('pi computed:   ', pi)
print('pi from numpy: ', numpy.pi)
print('wall time: ', elapse, 'sec')

workload in giga bytes: 250.0
500.0 chunks to process
pi computed:    3.141618020096
pi from numpy:  3.141592653589793
wall time:  66.5627076625824 sec


### Of course, we could add further Dask workers manually...

But doing this would be rather tedious. Luckily, Dask is relatively mature and provides a whole collection of ready-to-use, high-level "cluster managers" that spawn the Dask scheduler and the Dask workers for you. Here, we could use the `dask.distributed.SSHCluster()` interface or the `dask-ssh` command-line tool. (For an incomplete list of much more sophisticated Dask deployment options, see the slides!)
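
To make the tedium concrete, the sketch below only generates the SSH command lines one would have to run by hand for four more workers; it does not execute anything. The helper name `worker_command` is an assumption made for illustration, while the user name, environment script, scheduler address, and host names are taken from the commands and outputs in this notebook.

```python
import shlex
from urllib.parse import urlsplit

# Values copied from the commands and outputs shown in this notebook.
SCHEDULER_ADDRESS = "tcp://10.199.124.115:8786"
ENV_SCRIPT = "$HOME/Course-Data-Science-with-Dask/03_distributed/conda.sh"
USER = "khoeflich"


def worker_command(host, scheduler_address=SCHEDULER_ADDRESS):
    """Build the ssh command line that would start one dask-worker on `host`."""
    parts = urlsplit(scheduler_address)
    remote = f"source {ENV_SCRIPT} && dask-worker {parts.hostname}:{parts.port}"
    # Quote the remote command so `&&` and $HOME are evaluated on the remote host.
    return f"ssh {USER}@{host} {shlex.quote(remote)}"


hosts = [f"scalc{i:02d}.geomar.de" for i in range(6, 10)]
for host in hosts:
    print(worker_command(host))
```

Each printed line still needs its own terminal and has to be stopped separately afterwards, which is the bookkeeping a cluster manager takes over.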

## Using a cluster manager interface

We could again scale out a Dask cluster with `dask-ssh scalc{06..09}.geomar.de` in the terminal, but let's have a look at how cluster management works from within a Jupyter notebook. (The logic is very similar for many other Dask cluster deployment tools and options! Migrating a Dask task graph calculation to another type of machine only requires setting up and connecting to a different cluster object than the one in this example. The task graph execution code stays as it is.)
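
The shell brace expansion `scalc{06..09}` has a direct Python counterpart, which is handy when the host list for a cluster manager should be generated rather than typed out. This is a small sketch; the variable names are assumptions, and the convention that the first host runs the scheduler and the remaining hosts run workers follows the comments in the next cell.

```python
# scalc05 ... scalc09, zero-padded like the shell expansion scalc{05..09}
hosts = [f"scalc{i:02d}.geomar.de" for i in range(5, 10)]

# First entry: scheduler host; all remaining entries: worker hosts.
scheduler_host, *worker_hosts = hosts

print("scheduler:", scheduler_host)
print("workers:  ", worker_hosts)
```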

In [19]:
ssh_cluster = dask.distributed.SSHCluster(
    ["scalc05.geomar.de", # scheduler
     "scalc06.geomar.de", # workers...
     "scalc07.geomar.de",
     "scalc08.geomar.de",
     "scalc09.geomar.de",
    ],
)

distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - -----------------------------------------------
distributed.deploy.ssh - INFO - distributed.scheduler - INFO - Clear task state
distributed.deploy.ssh - INFO - distributed.scheduler - INFO -   Scheduler at: tcp://10.199.124.105:8786
distributed.deploy.ssh - INFO - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.199.124.108:38787'
distributed.deploy.ssh - INFO - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.199.124.106:40993'
distributed.deploy.ssh - INFO - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.199.124.109:42719'
distributed.deploy.ssh - INFO - distributed.nanny - INFO -         Sta

Now, instead of providing the Dask scheduler address manually, we pass the above Dask cluster object to the client,

In [20]:
client = dask.distributed.Client(ssh_cluster)

In [21]:
client

Client   Scheduler: tcp://10.199.124.105:8786  Dashboard: http://10.199.124.105:8787/status
Cluster  Workers: 4  Cores: 64  Memory: 540.37 GB


And compute again,

In [23]:
start = time.time()
pi = define_pi_workload().compute()
elapse = time.time() - start

print('pi computed:   ', pi)
print('pi from numpy: ', numpy.pi)
print('wall time: ', elapse, 'sec')

workload in giga bytes: 250.0
500.0 chunks to process
pi computed:    3.141599005184
pi from numpy:  3.141592653589793
wall time:  25.776468753814697 sec


Note that four workers reduced the task graph execution wall time from about 67 seconds to about 26 seconds, a speedup of roughly 2.6!
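
The wall times printed in this notebook can be turned into a speedup and a parallel efficiency. The sketch below simply redoes that arithmetic; taking the single remote worker run as the baseline is an assumption, and the variable names are illustrative.

```python
# Wall times in seconds, copied from the outputs in this notebook.
wall_time_one_worker = 66.5627076625824       # one remote worker
wall_time_four_workers = 25.776468753814697   # SSHCluster with four workers
n_workers = 4

# Speedup: how many times faster the four-worker run finished.
speedup = wall_time_one_worker / wall_time_four_workers
# Efficiency: fraction of the ideal n_workers-fold speedup that was reached.
efficiency = speedup / n_workers

print(f"speedup: {speedup:.2f}x")
print(f"parallel efficiency: {efficiency:.0%}")
```

An efficiency below 100% is expected here, since scheduling, communication between machines, and the final reduction do not shrink as workers are added.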

Also, don't forget to close Dask clusters (to free up machines!) after you are done,

In [24]:
ssh_cluster.close()
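
A cell that fails halfway can leave a cluster running. One way to guard against that is `contextlib.closing`, which calls `close()` on exit even if an exception is raised. The sketch uses a hypothetical stand-in class (`StandInCluster`) instead of a real cluster object so that it runs without any remote machines.

```python
from contextlib import closing


class StandInCluster:
    """Hypothetical stand-in for a cluster object; only close() matters here."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


cluster = StandInCluster()
try:
    with closing(cluster):
        # Simulate a computation that fails while the cluster is still up.
        raise RuntimeError("simulated failure during the computation")
except RuntimeError:
    pass

print("cluster closed:", cluster.closed)  # prints "cluster closed: True"
```

The same `with closing(...)` wrapper applies to any object that exposes a `close()` method, as `ssh_cluster` does above.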

## Python environment

In [26]:
!conda list --explicit

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2020.11.8-ha878542_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/ld_impl_linux-64-2.35.1-hed1e6ac_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libgfortran5-9.3.0-he4bcb1c_17.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libstdcxx-ng-9.3.0-h2ae2ef3_17.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/pandoc-2.11.1.1-h36c2ea0_0.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libgfortran-ng-9.3.0-he4bcb1c_17.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libgomp-9.3.0-h5dbcf3e_17.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/_openmp_mutex-4.5-1_gnu.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/libgcc-ng-9.3.0-h5dbcf3e_17.tar.bz2
ht