# Creating A Pipeline Using MLRUN

In [1]:
import kfp
from kfp import dsl
from mlrun import run_start, mlrun_op, get_run_db
from mlrun.platforms import mount_v3io

<b> Test/debug the code locally and verify that it works </b>

In [None]:
!python -m mlrun run -p p1=5 -s file=secrets.txt --out-path /User/mlrun/xx  training.py

## Build & Run a KubeFlow Pipeline 

This example uses the iguazio shared file system (v3io). The `/User` directory is the "Home" for both the user and the Jupyter notebook.<br>
The code is mounted into the pipeline containers, so there is no need to rebuild containers when the code changes, and the runtime has access to the user's local files.

MLRUN keeps a run DB at the location specified by `db_path` (this example uses files to store runs and artifacts).<br>
Result artifacts are versioned and stored under the specified location, and each workflow has a unique artifacts directory (`/<path>/{{workflow.uid}}/`).

Artifact and DB paths can be file paths or URLs for supported datastores (prefixed with s3://, v3io://, ..).<br>
Notes: artifacts in a file store cannot be viewed by KFP (use object URLs instead), and URL-based stores may require passing secrets.
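
To make the file-path versus URL distinction concrete, here is a minimal sketch that classifies a path by its prefix using only the standard library. The helper name `store_kind` and the `s3://my-bucket/...` address are hypothetical and are not part of MLRUN.

```python
from urllib.parse import urlparse

def store_kind(path):
    """Return 'file' for plain file paths, otherwise the URL scheme (e.g. 'v3io')."""
    scheme = urlparse(path).scheme
    # No scheme (or an explicit file:// scheme) is treated as a local file path
    return "file" if scheme in ("", "file") else scheme

for candidate in ("/User/mlrun",
                  "v3io:///bigdata/mlrun/{{workflow.uid}}/",
                  "s3://my-bucket/mlrun/"):  # hypothetical bucket name
    print(candidate, "->", store_kind(candidate))
```

A check like this could be used to warn early when a file path is chosen for artifacts that are meant to be viewed in KFP.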

In [20]:
this_path = '/User/mlrun'
db_path = this_path
#artifacts_path = this_path + '/data/{{workflow.uid}}/'
artifacts_path = 'v3io:///bigdata/mlrun/{{workflow.uid}}/'
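
The `{{workflow.uid}}` placeholder in `artifacts_path` is expected to be filled in at run time, which is what gives each workflow its own directory. The sketch below only imitates that substitution with a plain string replace; the function name and the uid value are hypothetical.

```python
def resolve_artifacts_path(template, workflow_uid):
    """Imitate the run-time substitution of the {{workflow.uid}} placeholder."""
    return template.replace("{{workflow.uid}}", workflow_uid)

template = "v3io:///bigdata/mlrun/{{workflow.uid}}/"
# 'example-uid-1234' is a made-up value used for illustration only
print(resolve_artifacts_path(template, "example-uid-1234"))
```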

## Example 1: a 2 step workflow (training, validation)
* 1st step: execute a training job with parameters <b>p1</b> and <b>p2</b>, log results and various artifacts including the model (see [training.py](training.py))
* 2nd step: take the <b>modelfile</b> from the 1st step and conduct validation (see [validation.py](validation.py))

In [3]:
# run training using params p1 and p2, generate 2 registered outputs (model, dataset) to be listed in the pipeline UI
# user can specify the target path per output e.g. 'model.txt':'<some-path>', or leave blank to use the default out_path
def mlrun_train(p1, p2):
    return mlrun_op('training', 
                    command = this_path + '/training.py', 
                    params = {'p1':p1, 'p2':p2},
                    outputs = {'model.txt':'', 'dataset.csv':''},
                    out_path = artifacts_path,
                    rundb = db_path)
                    
# use data (model) from the first step as an input
def mlrun_validate(modelfile):
    return mlrun_op('validation', 
                    command = this_path + '/validation.py', 
                    inputs = {'model.txt':modelfile},
                    out_path = artifacts_path,
                    rundb = db_path)

<b> Create a Kubeflow Pipelines DSL (execution graph/DAG)</b>

In [4]:
@dsl.pipeline(
    name='My MLRUN pipeline',
    description='Shows how to use mlrun.'
)
def mlrun_pipeline(
   p1 = 5 , p2 = '"text"'
):
    # create a train step, apply v3io mount to it (will add the /User mount to the container)
    train = mlrun_train(p1, p2).apply(mount_v3io())
    
    # feed 1st step results into the second step
    # Note: the '.' in model.txt must be substituted with '-'
    validate = mlrun_validate(train.outputs['model-txt']).apply(mount_v3io())
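
The note in the pipeline code says that the `.` in an output file name must be replaced with `-` when the output is referenced. A small helper can keep that rule in one place; `output_key` is a hypothetical name, and it only covers the `.` substitution mentioned in the note.

```python
def output_key(filename):
    """Map an output file name to the key used in `outputs`, per the '.' -> '-' note."""
    return filename.replace(".", "-")

print(output_key("model.txt"))
print(output_key("dataset.csv"))
```

With such a helper, the reference in the pipeline could be written as `train.outputs[output_key('model.txt')]`.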

<b> Create the pipeline spec </b><br>
Compile the pipeline and create a YAML file from it.

In [5]:
kfp.compiler.Compiler().compile(mlrun_pipeline, 'mlrunpipe.yaml')

<b> Create a KFP client and an experiment, then run the pipeline with a custom parameter </b>

In [6]:
client = kfp.Client(namespace='default-tenant')
arguments = {'p1': 4}
experiment = client.create_experiment('mlrun demo')
run_result = client.run_pipeline(experiment.id, 'mlrun pipe demo', 'mlrunpipe.yaml', arguments)

<b> See the run status and results in the run database </b>

In [7]:
# connect to the run db 
db = get_run_db(db_path).connect()

In [10]:
# query the DB with filter on workflow ID (only show this workflow) 
db.list_runs('', labels=f'workflow={run_result.id}').show()

| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...75fd89 | 0 | Jul 28 22:11:35 | completed | validation | workflow=9b98341e-b184-11e9-8636-0aeff3c69daa<br>owner=root<br>host=my-mlrun-pipeline-66k76-520096857<br>runtime=local | model.txt | | | validation.html |
| ...ec50d5 | 0 | Jul 28 22:11:28 | completed | training | workflow=9b98341e-b184-11e9-8636-0aeff3c69daa<br>owner=root<br>host=my-mlrun-pipeline-66k76-2436334224<br>runtime=local<br>framework=sklearn | infile.txt | p1=4<br>p2=text | accuracy=8<br>loss=12 | model.txt<br>results.html<br>dataset.csv<br>chart.html |


## Example 2: Building a Pipeline with Hyperparams and Parallel Execution
We may want to run the same training job with multiple parameter options, and for that we can leverage MLRUN parallelism.<br>
Instead of running each run in a separate container, with extra start and stop times, we can use a pool of serverless functions<br>
or containers which run the workload in parallel.

We extend our pipeline to use hyperparameters: the training job accepts a list per parameter and runs all the parameter<br>
combinations (grid search), combining the fixed parameters (`params`) with the expanded parameters (from `hyperparams`).<br>
Since we now have an array of results (called `iterations`), we need an extra step between training and validation<br>
(see [best_fit.py](best_fit.py)) which selects the best result.
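
The grid expansion described above can be sketched with `itertools.product`. This is an illustration of the idea under stated assumptions, not MLRUN's implementation: `expand_grid` is a hypothetical helper, and numbering the iterations from 1 is an assumption.

```python
from itertools import product

def expand_grid(params, hyperparams):
    """Combine fixed params with every combination of the hyperparameter lists."""
    names = list(hyperparams)
    runs = []
    for values in product(*(hyperparams[name] for name in names)):
        run = dict(params)             # start from the fixed parameters
        run.update(zip(names, values)) # add one value per hyperparameter
        runs.append(run)
    return runs

# Same shape as the example pipeline: p2 is fixed, p1 is a list of options
for iteration, run in enumerate(expand_grid({"p2": "text"}, {"p1": [5, 6, 2]}), start=1):
    print(iteration, run)
```

With more than one hyperparameter list the number of runs is the product of the list lengths, so two lists of three values each would yield nine runs.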

Parameter combinations can also be provided using the `param_file` option, which reads the parameter values per iteration<br>
from a CSV file (where the first row holds the parameter names and the following rows hold the parameter values).<br>
The use of `hyperparams` and `param_file` can be extended to many tasks, including data and ETL tasks,<br>
e.g. create a list of text or image file paths in a CSV file and run a step which processes all those files in parallel.
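
As an illustration of the CSV layout described above (first row with parameter names, one row of values per iteration), the sketch below reads such a file with the standard `csv` module. The helper name and the sample values are hypothetical, and this is not how MLRUN itself parses `param_file`.

```python
import csv
import io

def read_param_file(handle):
    """Return one dict of parameter values per data row; the header row gives the names."""
    return [dict(row) for row in csv.DictReader(handle)]

# In-memory stand-in for a parameter file; the values are made up
sample = io.StringIO("p1,p2\n5,text\n6,text\n2,other\n")
for row in read_param_file(sample):
    print(row)
```

Note that `csv` returns every value as a string, so numeric parameters would need an explicit conversion before use.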

> Note: `Iteration` steps always generate an output `iteration_results.csv` which holds the results table.<br>
Each iteration has its own run DB entry, but all iterations share the same default out_path. Use the `context.iteration`<br>
value in your code if you want to create per-iteration paths/filenames.
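
One way to act on the note above is to fold the iteration number into the file name. The sketch below is a hypothetical naming scheme, not an MLRUN convention; in a real job the `iteration` argument would come from `context.iteration`.

```python
import posixpath

def iteration_target(out_path, iteration, filename):
    """Build a per-iteration file path, e.g. model.txt -> model_2.txt for iteration 2."""
    stem, ext = posixpath.splitext(filename)
    return posixpath.join(out_path, f"{stem}_{iteration}{ext}")

# '/User/mlrun/data' is an illustrative out_path
print(iteration_target("/User/mlrun/data", 2, "model.txt"))
```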

In [21]:
def mlrun_train(p1, p2):
    return mlrun_op('training', 
                    command = this_path + '/training.py', 
                    params = {'p2':p2},
                    hyperparams = {'p1': p1},
                    out_path = artifacts_path,
                    rundb = db_path)
                    
# select best fit
def mlrun_select(iterations):
    return mlrun_op('best_fit', 
                    command = this_path + '/best_fit.py', 
                    inputs = {'iterations.csv': iterations},
                    outputs = {'model.txt':''},
                    out_path = artifacts_path,
                    rundb = db_path)              

# use data (model) from the best-fit selection step as an input
def mlrun_validate(modelfile):
    return mlrun_op('validation', 
                    command = this_path + '/validation.py', 
                    inputs = {'model.txt':modelfile},
                    out_path = artifacts_path,
                    rundb = db_path)
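
The `best_fit` step receives the iterations table and emits a single model. The source does not show [best_fit.py](best_fit.py), so the following is only a sketch of what a selection could look like, assuming "best" means the highest `accuracy` and assuming a CSV with one row per iteration. The column layout and every row after the first are made up.

```python
import csv
import io

def select_best(handle, metric="accuracy"):
    """Return the iteration row with the highest value in the `metric` column."""
    rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError("no iteration results to choose from")
    return max(rows, key=lambda row: float(row[metric]))

# Hypothetical results table; column names are assumptions
sample = io.StringIO("iter,p1,accuracy,loss\n1,5,10,15\n2,6,12,18\n3,2,4,6\n")
best = select_best(sample)
print(best["iter"], best["p1"], best["accuracy"])
```

Selecting on the lowest `loss` instead would only require swapping `max` for `min` and passing `metric="loss"`.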

In [22]:
@dsl.pipeline(
    name='My MLRUN pipeline',
    description='Shows how to use mlrun.'
)
def mlrun_pipeline(
   p1 = [5, 6, 2] , p2 = '"text"'
):
    train = mlrun_train(p1, p2).apply(mount_v3io())
    
    # feed the result list into a "best fit" selection step
    selector = mlrun_select(train.outputs['iterations']).apply(mount_v3io())
    
    # feed the best fit model into a validation step
    validate = mlrun_validate(selector.outputs['model-txt']).apply(mount_v3io())

In [23]:
kfp.compiler.Compiler().compile(mlrun_pipeline, 'mlrunpipe_hyper.yaml')

In [24]:
client = kfp.Client(namespace='default-tenant')
arguments = {'p1': [5, 7, 3]}
experiment = client.create_experiment('mlrun demo hyper')
run_result = client.run_pipeline(experiment.id, 'mlrun hyper pipe demo', 'mlrunpipe_hyper.yaml', arguments)

<b> See the run status and results in the run database </b>

In [25]:
db.list_runs('', labels=f'workflow={run_result.id}').show()

| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...3fa36c | 1 | Jul 28 22:43:58 | completed | training | workflow=2c9e86bd-b189-11e9-8636-0aeff3c69daa<br>owner=root<br>host=my-mlrun-pipeline-8p7c7-855036829<br>runtime=local<br>framework=sklearn | infile.txt | p2=text<br>p1=5 | accuracy=10<br>loss=15 | model.txt<br>results.html<br>dataset.csv<br>chart.html |
| ...3fa36c | 0 | Jul 28 22:43:57 | running | training | workflow=2c9e86bd-b189-11e9-8636-0aeff3c69daa<br>owner=root<br>host=my-mlrun-pipeline-8p7c7-855036829<br>runtime=local | | p2=text | | |
