# Using MLRun with Dask Distributed Jobs

In [1]:
# recommended, installing the exact package versions
#!pip install dask==2.9.1 distributed==2.9.1 kubernetes==10.0.0 dask_kubernetes==0.10.0 kubernetes-asyncio==10.0.0 msgpack==0.6.2

## Writing the function code

In [2]:
# function that will be distributed to the Dask workers (adds 2 to its input)
def inc(x):
    return x+2

When running a Dask function, the MLRun context has an extra attribute, `dask_client`,
which is initialized from the function spec (below) and can be used to submit Dask commands.

In [3]:
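# wrapper function, uses the dask client object
def hndlr(context, x=1, y=2):
    context.logger.info('params: x={},y={}'.format(x, y))
    print('params: x={},y={}'.format(x, y))
    # submit inc(x) to the cluster; a Future is returned immediately
    future = context.dask_client.submit(inc, x)
    print(future)
    # block until the task finishes, then log the value as result 'y'
    result = future.result()
    print(result)
    context.log_result('y', result)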
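
The handler only needs a context with a logger, a `dask_client` that supports `submit()`, and `log_result()`. As a rough sketch, it can be exercised without a cluster by passing a stand-in context. `FakeContext` below is a hypothetical helper (not part of MLRun), and a `ThreadPoolExecutor` stands in for the Dask client on the assumption that only `submit()` and `Future.result()` are used, since Dask futures follow the same interface as `concurrent.futures`.

```python
import logging
from concurrent.futures import ThreadPoolExecutor


def inc(x):
    return x + 2


def hndlr(context, x=1, y=2):
    context.logger.info('params: x={},y={}'.format(x, y))
    future = context.dask_client.submit(inc, x)
    context.log_result('y', future.result())


class FakeContext:
    """Hypothetical stand-in for the MLRun context (illustration only)."""

    def __init__(self, executor):
        self.logger = logging.getLogger('fake-context')
        # Assumption: the handler only calls .submit() on the client.
        self.dask_client = executor
        self.results = {}

    def log_result(self, key, value):
        # Record results in a dict instead of sending them to the MLRun DB.
        self.results[key] = value


with ThreadPoolExecutor(max_workers=1) as pool:
    ctx = FakeContext(pool)
    hndlr(ctx, x=12)

print(ctx.results)  # {'y': 14}
```

This checks the handler logic only; it does not exercise serialization, the scheduler, or any cluster behavior.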

In [4]:
# nuclio: end-code
# marks the end of a code section, do not delete

In [5]:
# nuclio: ignore
import nuclio

In [6]:
from mlrun import new_function, mlconf, code_to_function, mount_v3io, NewTask
mlconf.dbpath = 'http://mlrun-api:8080'

## Define a Dask function object
Dask functions can be local (local workers) or remote (using containers in the cluster).
For `remote` functions, you can specify the number of replicas (optional) or leave it unset for auto-scale.

We use `code_to_function()`, which packs the function code into the function object/YAML and eliminates the need to update the function image. Use `new_function()` instead if the function code is already part of the image or can be mounted remotely (e.g. via a v3io mount).

The Dask function spec has several unique attributes:
* **.remote** - bool, use local or clustered dask
* **.replicas** - number of desired replicas, keep 0 for auto-scale
* **.min_replicas, .max_replicas** - set replicas range for auto-scale
* **.scheduler_timeout** - cluster will be killed after timeout (inactivity), default is `'60 minutes'`
* **.nthreads** - number of worker threads
* **.kfp_image** - optional, container image used by the KFP pipeline runner (defaults to mlrun/dask)

To access the Dask dashboard remotely, use the `NodePort` service type (set **.service_type** to 'NodePort') and specify the external IP in the MLRun configuration (`mlconf.remote_host`); this is set automatically when running on an Iguazio cluster.
To also use the `NodePort` for remote access to the scheduler, set `function.use_remote=True`.
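
To make the auto-scale attributes listed above concrete, here is a minimal sketch that sets them together and checks the replica range. `configure_autoscale` is a hypothetical helper, not an MLRun API, and a `SimpleNamespace` stands in for `dsf.spec` so the snippet runs on its own. The `'30 minutes'` value assumes the timeout accepts the same string format as the documented default.

```python
from types import SimpleNamespace


def configure_autoscale(spec, min_replicas, max_replicas,
                        scheduler_timeout='60 minutes', nthreads=1):
    """Hypothetical helper: set the Dask spec attributes for auto-scale."""
    if min_replicas < 0 or max_replicas < min_replicas:
        raise ValueError('expected 0 <= min_replicas <= max_replicas')
    spec.remote = True               # clustered (not local) Dask
    spec.replicas = 0                # 0 means auto-scale, per the list above
    spec.min_replicas = min_replicas
    spec.max_replicas = max_replicas
    spec.scheduler_timeout = scheduler_timeout
    spec.nthreads = nthreads
    return spec


# SimpleNamespace is only a stand-in for a real function spec.
spec = configure_autoscale(SimpleNamespace(), min_replicas=1, max_replicas=4,
                           scheduler_timeout='30 minutes', nthreads=2)
print(vars(spec))
```

With a real function object, the same helper would presumably be called as `configure_autoscale(dsf.spec, 1, 4)`.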

In [9]:
# create the function from the notebook code + annotations, add volumes
dsf = code_to_function('mydask', kind='dask').apply(mount_v3io())

In [10]:
dsf.spec.image = 'daskdev/dask:2.9.1'
dsf.spec.remote = True
dsf.spec.replicas = 1
dsf.spec.service_type = 'NodePort'
dsf.spec.image_pull_policy = 'Always'

## Build the function with extra packages
We can skip the build step if we don't add packages; in that case, specify the image directly (e.g. `dsf.spec.image='daskdev/dask:2.9.1'`).

In [11]:
# uncomment if you want to add packages to the workers and build a new image
# dsf.build_config(base_image='daskdev/dask:2.9.1', commands=['pip install pandas'])
# dsf.deploy()

## Run a task using our distributed dask function (cluster)

In [12]:
myrun = dsf.run(handler=hndlr, params={'x': 12})

[mlrun] 2020-02-24 22:23:48,435 starting run mydask-hndlr uid=d724caf66eba4e4baa86faf539c81b53  -> http://mlrun-api:8080
[mlrun] 2020-02-24 22:23:49,032 saving function: mydask, tag: latest
[mlrun] 2020-02-24 22:23:56,887 trying dask client at: tcp://mlrun-mydask-75a63220-9.default-tenant:8786
[mlrun] 2020-02-24 22:23:56,896 using remote dask scheduler (mlrun-mydask-75a63220-9) at: tcp://mlrun-mydask-75a63220-9.default-tenant:8786


[mlrun] 2020-02-24 22:23:56,904 params: x=12,y=2
params: x=12,y=2
<Future: pending, key: inc-d94d5c09f5ea9337e23aad080db4545d>
14

[mlrun] 2020-02-24 22:24:02,571 run ended with state 


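| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...c81b53 | 0 | Feb 24 22:23:48 | completed | mydask-hndlr | kind=dask, owner=admin, host=jupyter-68bdf65845-httbr | | x=12 | y=14 | |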


to track results use .show() or .logs() or in CLI: 
!mlrun get run d724caf66eba4e4baa86faf539c81b53  , !mlrun logs d724caf66eba4e4baa86faf539c81b53 
[mlrun] 2020-02-24 22:24:02,609 run executed, status=completed


In [13]:
# get the function status and addresses
dsf.status.to_dict()

{'scheduler_address': 'tcp://mlrun-mydask-75a63220-9.default-tenant:8786',
 'cluster_name': 'mlrun-mydask-75a63220-9',
 'node_ports': {'dashboard': 30489, 'scheduler': 31426}}
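
When the `NodePort` service type is used, the `node_ports` entries in the status can be combined with the externally reachable host to form addresses. The sketch below assumes a status dict shaped like the output above; `node_port_urls` is a hypothetical helper, and `203.0.113.10` is a placeholder documentation address standing in for the value of `mlconf.remote_host`.

```python
def node_port_urls(status, remote_host):
    """Hypothetical helper: build external addresses from a status dict."""
    ports = status.get('node_ports') or {}
    urls = {}
    if 'dashboard' in ports:
        # The Dask dashboard is served over HTTP.
        urls['dashboard'] = 'http://{}:{}'.format(remote_host, ports['dashboard'])
    if 'scheduler' in ports:
        # The scheduler address uses the tcp:// scheme, as in the status above.
        urls['scheduler'] = 'tcp://{}:{}'.format(remote_host, ports['scheduler'])
    return urls


status = {
    'scheduler_address': 'tcp://mlrun-mydask-75a63220-9.default-tenant:8786',
    'cluster_name': 'mlrun-mydask-75a63220-9',
    'node_ports': {'dashboard': 30489, 'scheduler': 31426},
}

# Placeholder host; in practice this would come from mlconf.remote_host.
urls = node_port_urls(status, '203.0.113.10')
print(urls['dashboard'])  # http://203.0.113.10:30489
print(urls['scheduler'])  # tcp://203.0.113.10:31426
```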

### Accessing the dask client directly
You can get the dask client object and cluster information by reading the function's `client` attribute.

> Note: the cluster can time out. When you access the client, MLRun verifies that the cluster is live and active; if it is not, MLRun restarts the dask cluster and refreshes the function object with the latest addresses and status.

In [14]:
c = dsf.client

[mlrun] 2020-02-24 22:24:10,108 trying dask client at: tcp://mlrun-mydask-75a63220-9.default-tenant:8786
[mlrun] 2020-02-24 22:24:10,115 using remote dask scheduler (mlrun-mydask-75a63220-9) at: tcp://mlrun-mydask-75a63220-9.default-tenant:8786


### Access a dask function using the DB
If we want to access the dask function (or its cluster), we can load the function object from the DB (assuming we have already called `.run()` or `.save()` on it).

This is useful when loading the same function in a different notebook or container, or after restarting the notebook.

In [15]:
from mlrun import import_function
# function URL format: db://<project>/<name>[:tag]
dsf_obj = import_function('db://default/mydask')
c = dsf_obj.client

[mlrun] 2020-02-24 22:24:15,282 trying dask client at: tcp://mlrun-mydask-75a63220-9.default-tenant:8786
[mlrun] 2020-02-24 22:24:15,289 using remote dask scheduler (mlrun-mydask-75a63220-9) at: tcp://mlrun-mydask-75a63220-9.default-tenant:8786
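
To illustrate how the `db://<project>/<name>[:tag]` format breaks down, here is a small sketch of a parser. `parse_function_url` is a hypothetical helper written for this example and is not how MLRun itself resolves the URL; it returns `None` for the tag when none is given rather than assuming a default.

```python
def parse_function_url(url):
    """Hypothetical helper: split db://<project>/<name>[:tag] into parts."""
    prefix = 'db://'
    if not url.startswith(prefix):
        raise ValueError('expected a URL starting with db://')
    project, sep, rest = url[len(prefix):].partition('/')
    if not sep or not project:
        raise ValueError('expected db://<project>/<name>[:tag]')
    name, _, tag = rest.partition(':')
    if not name:
        raise ValueError('function name is missing')
    # The tag is optional; None means it was not specified in the URL.
    return project, name, tag or None


print(parse_function_url('db://default/mydask'))         # ('default', 'mydask', None)
print(parse_function_url('db://default/mydask:latest'))  # ('default', 'mydask', 'latest')
```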


## Building a Pipeline using dask functions

In [16]:
import kfp
from kfp import dsl
from mlrun import run_pipeline

In [17]:
@dsl.pipeline(name="dask_pipeline")
def dask_pipe(x=1,y=10):
    # use_db=True passes a pointer to the function in the DB instead of embedding the function spec in the YAML
    myrun = dsf.as_step(NewTask(handler=hndlr, name="dask_pipeline", params={'x': x, 'y': y}), use_db=True)
    # if the step (dask client) needs v3io access, add: .apply(mount_v3io())
    myrun.container.set_image_pull_policy('Always')

In [18]:
# for pipeline debug
kfp.compiler.Compiler().compile(dask_pipe, 'daskpipe.yaml', type_check=False)

[mlrun] 2020-02-24 22:24:38,635 saving function: mydask, tag: latest




In [19]:
arguments={'x':4,'y':-5}
artifact_path = '/User/test'
run_pipeline(dask_pipe, arguments, artifact_path=artifact_path,
             run="DaskExamplePipeline", experiment="dask pipe")

[mlrun] 2020-02-24 22:24:49,917 saving function: mydask, tag: latest


[mlrun] 2020-02-24 22:24:49,966 Pipeline run id=4eefbad4-abc9-48aa-9dc6-82c70f968a82, check UI or DB for progress


'4eefbad4-abc9-48aa-9dc6-82c70f968a82'