# Running Dask on the cluster with mlrun

The Dask framework enables users to parallelize their Python code and run it as a distributed process on an Iguazio cluster, which can dramatically accelerate performance. <br>
In this notebook you'll learn how to create a Dask cluster and then an MLRun function that runs as a Dask client. <br>
It also demonstrates how to parallelize a custom algorithm using the Dask Delayed option.

For more information on dask over kubernetes: https://kubernetes.dask.org/en/latest/
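
To make the Dask Delayed idea concrete, here is a minimal sketch that is not part of the original notebook. The functions `square` and `add_all` are hypothetical stand-ins for the steps of a custom algorithm. Without a distributed client the graph runs on Dask's local scheduler; once a distributed `Client` has been created in the session, `compute()` should use that cluster instead.

```python
from dask import delayed


def square(x: int) -> int:
    """Hypothetical stand-in for an expensive, independent computation."""
    return x * x


def add_all(values: list) -> int:
    """Hypothetical stand-in for the step that combines partial results."""
    return sum(values)


# Wrapping a call in delayed() records it in a task graph instead of running it.
partials = [delayed(square)(i) for i in range(5)]
total = delayed(add_all)(partials)

# Nothing has executed yet; compute() runs the graph, in parallel where possible.
result = total.compute()
print(result)
```

The calls to `square` do not depend on each other, so the scheduler is free to run them concurrently before `add_all` combines them.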

## Set up the environment

In [1]:
# load or create the MLRun project used in this notebook
import mlrun
project_name = "dask-demo"
project = mlrun.get_or_create_project(name=project_name)

> 2023-02-16 13:26:39,219 [info] loaded project dask-demo from MLRun DB


## Create and Start Dask Cluster
Dask functions can be local (local workers) or remote (containers in the cluster). For remote functions, users can optionally specify the number of replicas, or leave it unset for auto-scaling.<br>
We use `new_function()` to define our Dask cluster and set the desired configuration of that clustered function.

If the Dask workers need to access the shared file system, we apply a shared volume mount (e.g. via a v3io mount).

The Dask function spec has several unique attributes (in addition to the standard job attributes):

* **.remote** - bool, use local or clustered dask
* **.replicas** - number of desired replicas, keep 0 for auto-scale
* **.min_replicas**, **.max_replicas** - set replicas range for auto-scale
* **.scheduler_timeout** - cluster will be killed after timeout (inactivity), default is '60 minutes'
* **.nthreads** - number of worker threads
<br>

To access the Dask dashboard or scheduler remotely, use the NodePort service type (set `.service_type` to 'NodePort') and specify the external IP in the MLRun configuration (`mlconf.remote_host`). This is set automatically when you are running on an Iguazio cluster.

We specify the kind (`dask`) and the container image:

In [2]:
# create an mlrun function which will init the dask cluster
dask_cluster_name = "dask-cluster"
dask_cluster = mlrun.new_function(dask_cluster_name, kind='dask', image='mlrun/ml-models')
dask_cluster.apply(mlrun.mount_v3io())

<mlrun.runtimes.daskjob.DaskCluster at 0x7f803cfa1700>

In [3]:
# set the replica range for auto-scale with min_replicas and max_replicas
dask_cluster.spec.min_replicas = 1
dask_cluster.spec.max_replicas = 4

# set the use of dask remote cluster (distributed) 
dask_cluster.spec.remote = True
dask_cluster.spec.service_type = "NodePort"

# set dask worker memory and cpu requests
dask_cluster.with_worker_requests(mem='2G', cpu='2')
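
Because the requests above apply to each worker, the total the cluster asks for grows with the replica count. The following is a rough arithmetic sketch using the values configured above. It assumes, for illustration only, that the scheduler pod's own resources are ignored and that every replica requests the full amount.

```python
# Values copied from the configuration cell above.
CPU_PER_WORKER = 2      # with_worker_requests(cpu='2')
MEM_G_PER_WORKER = 2    # with_worker_requests(mem='2G')
MIN_REPLICAS = 1        # spec.min_replicas
MAX_REPLICAS = 4        # spec.max_replicas


def total_requests(replicas: int) -> dict:
    """Total CPU and memory requested by `replicas` workers (scheduler excluded)."""
    return {
        "cpu": replicas * CPU_PER_WORKER,
        "mem_g": replicas * MEM_G_PER_WORKER,
    }


smallest = total_requests(MIN_REPLICAS)
largest = total_requests(MAX_REPLICAS)
print(f"min: {smallest['cpu']} CPUs, {smallest['mem_g']}G memory")
print(f"max: {largest['cpu']} CPUs, {largest['mem_g']}G memory")
```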

### Initialize the Dask Cluster

Requesting the Dask cluster's `client` attribute verifies that the cluster is up and running.

In [4]:
# init dask client and use the scheduler address as param in the following cell
dask_cluster.client

> 2023-02-16 13:26:39,366 [info] trying dask client at: tcp://mlrun-dask-cluster-7001917e-f.default-tenant:8786
> 2023-02-16 13:26:39,437 [info] using remote dask scheduler (mlrun-dask-cluster-7001917e-f) at: tcp://mlrun-dask-cluster-7001917e-f.default-tenant:8786


Mismatched versions found

+---------+----------------+----------------+----------------+
| Package | client         | scheduler      | workers        |
+---------+----------------+----------------+----------------+
| blosc   | None           | 1.11.1         | 1.11.1         |
| lz4     | None           | 3.1.10         | 3.1.10         |
| python  | 3.9.16.final.0 | 3.9.12.final.0 | 3.9.12.final.0 |
+---------+----------------+----------------+----------------+


Connection method: Direct
Dashboard: http://mlrun-dask-cluster-7001917e-f.default-tenant:8787/status

Comm: tcp://10.200.196.39:8786
Workers: 1
Dashboard: http://10.200.196.39:8787/status
Total threads: 1
Started: 36 minutes ago
Total memory: 20.00 GiB

Comm: tcp://10.200.196.40:41351
Total threads: 1
Dashboard: http://10.200.196.40:46595/status
Memory: 20.00 GiB
Nanny: tcp://10.200.196.40:35230
Local directory: /mlrun/dask-worker-space/worker-31w64g0e
Tasks executing: 0
Tasks in memory: 0
Tasks ready: 0
Tasks in flight: 0
CPU usage: 2.0%
Last seen: Just now
Memory usage: 408.47 MiB
Spilled bytes: 0 B
Read bytes: 0.0 B
Write bytes: 838.366183960203 B


## Creating A Function Which Runs Over Dask

In [5]:
# mlrun: start-code

Import mlrun and dask. The nuclio commands are used only to convert the code into an MLRun function.

In [6]:
import mlrun 

In [7]:
%nuclio config kind = "job"
%nuclio config spec.image = "mlrun/ml-models"

%nuclio: setting kind to 'job'
%nuclio: setting spec.image to 'mlrun/ml-models'


In [8]:
from dask.distributed import Client
from dask import delayed
from dask import dataframe as dd

import warnings
import numpy as np
import os
import mlrun

warnings.filterwarnings("ignore")

### Python function code

This simple function reads a CSV file using a Dask dataframe, runs the group-by and describe functions on the dataset, and stores the results as a dataset artifact. <br>

In [9]:
def test_dask(context,
              dataset: mlrun.DataItem,
              client=None,
              dask_function: str=None) -> None:
    
    # setup dask client from the MLRun dask cluster function
    if dask_function:
        client = mlrun.import_function(dask_function).client
    elif not client:
        client = Client()
    
    # load the dataitem as dask dataframe (dd)
    df = dataset.as_df(df_module=dd)
    
    # run describe (get statistics for the dataframe) with dask
    df_describe = df.describe().compute()
    
    # run groupby and count using dask 
    df_grpby = df.groupby("VendorID").count().compute()
    
    # log the group-by counts as a dataset artifact (stored under the key "describe")
    context.log_dataset("describe",
                        df=df_grpby,
                        format='csv', index=True)
    return

In [10]:
# mlrun: end-code

## Test Our Function Over Dask

### Load sample data

In [11]:
DATA_URL="/User/examples/ytrip.csv"

In [12]:
!mkdir -p /User/examples/
!curl -L "https://s3.wasabisys.com/iguazio/data/Taxi/yellow_tripdata_2019-01_subset.csv" > {DATA_URL}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 84.9M  100 84.9M    0     0  7224k      0  0:00:12  0:00:12 --:--:-- 13.0M


### Convert the code to MLRun function

Use `code_to_function` to convert the code to an MLRun function and specify the configuration for the Dask process (e.g. replicas, memory, etc.). <br>
Note that the resource configurations are per worker.

In [13]:
# mlrun will transform the code above (up to the "mlrun: end-code" cell) into a serverless function
# which will run in k8s pods
fn = mlrun.code_to_function("test_dask",  kind='job', handler="test_dask").apply(mlrun.mount_v3io())

### Run the function

When you run the function, a link appears as part of the result. Clicking this link takes you to the Dask monitoring dashboard.

In [14]:
# function URI is db://<project>/<name>
dask_uri = f'db://{project_name}/{dask_cluster_name}'
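
The URI format in the comment above can be sketched with two small helpers. `build_function_uri` and `parse_function_uri` are hypothetical names introduced here for illustration, not MLRun APIs, and the sketch handles only the `db://<project>/<name>` form shown in this notebook.

```python
URI_PREFIX = "db://"


def build_function_uri(project: str, name: str) -> str:
    """Build a function URI of the form db://<project>/<name>."""
    return f"{URI_PREFIX}{project}/{name}"


def parse_function_uri(uri: str) -> tuple:
    """Split db://<project>/<name> back into (project, name)."""
    if not uri.startswith(URI_PREFIX):
        raise ValueError(f"expected a URI starting with {URI_PREFIX!r}: {uri!r}")
    project, _, name = uri[len(URI_PREFIX):].partition("/")
    if not project or not name:
        raise ValueError(f"expected db://<project>/<name>, got {uri!r}")
    return project, name


example_uri = build_function_uri("dask-demo", "dask-cluster")
print(example_uri)
print(parse_function_uri(example_uri))
```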

In [15]:
r = fn.run(handler = test_dask,
           inputs={"dataset": DATA_URL},
           params={"dask_function": dask_uri})

> 2023-02-16 13:26:58,829 [info] starting run test-dask-test_dask uid=ae1608c7992a446ba64ef4088ffe51fb DB=http://mlrun-api:8080
> 2023-02-16 13:26:58,995 [info] Job is running in the background, pod: test-dask-test-dask-c7bpd
Names with underscore '_' are about to be deprecated, use dashes '-' instead. Replacing underscores with dashes.
> 2023-02-16 13:27:06,941 [info] trying dask client at: tcp://mlrun-dask-cluster-7001917e-f.default-tenant:8786
> 2023-02-16 13:27:06,965 [info] using remote dask scheduler (mlrun-dask-cluster-7001917e-f) at: tcp://mlrun-dask-cluster-7001917e-f.default-tenant:8786
remote dashboard: default-tenant.app.vmdev94.lab.iguazeng.com:32365
> 2023-02-16 13:27:14,515 [info] To track results use the CLI: {'info_cmd': 'mlrun get run ae1608c7992a446ba64ef4088ffe51fb -p dask-demo', 'logs_cmd': 'mlrun logs ae1608c7992a446ba64ef4088ffe51fb -p dask-demo'}
> 2023-02-16 13:27:14,515 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.vmdev94.lab.iguaze

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| dask-demo | ...8ffe51fb | 0 | Feb 16 13:27:06 | completed | test-dask-test_dask | v3io_user=dani<br>kind=job<br>owner=dani<br>mlrun/client_version=1.3.0-rc23<br>mlrun/client_python_version=3.9.16<br>host=test-dask-test-dask-c7bpd | dataset | dask_function=db://dask-demo/dask-cluster | | describe |





> 2023-02-16 13:27:17,389 [info] run executed, status=completed


## Track the progress in the UI

Users can view the progress and detailed information in the MLRun UI by clicking the uid above. <br>
To track the Dask progress in the Dask UI, click the "dashboard link" above the "client" section.
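
The run log above also prints CLI commands for tracking a run. As a small sketch, the hypothetical helper `tracking_commands` (not an MLRun API) rebuilds those two command strings from a run uid and a project name, following the pattern shown in the log.

```python
def tracking_commands(uid: str, project: str) -> dict:
    """Rebuild the CLI commands that the run log prints for a given run."""
    return {
        "info_cmd": f"mlrun get run {uid} -p {project}",
        "logs_cmd": f"mlrun logs {uid} -p {project}",
    }


# uid and project name taken from the run output above
cmds = tracking_commands("ae1608c7992a446ba64ef4088ffe51fb", "dask-demo")
for label, cmd in cmds.items():
    print(f"{label}: {cmd}")
```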