# Demonstrate Local Or Remote Functions And Full Pipelines
  --------------------------------------------------------------------

This notebook creates a local function, runs predefined functions, and then builds and runs a full ML pipeline that combines local and library functions.

#### **notebook how-to's**
* Create and test a simple function
* Examine data using a containerized `describe` function
* Create an ML pipeline from various functions
* Run and track the pipeline results and artifacts

## Create and Test a Local Function (Iris Data Generator)
Import the nuclio SDK and magics, <b>do not remove this cell or its comment!</b><br>
and set the path to the MLRun API service.

In [1]:
# nuclio: ignore
import nuclio

<b>Specify function dependencies and configuration</b>

In [2]:
%%nuclio cmd -c
pip install scikit-learn
pip install pyarrow

In [3]:
%nuclio config spec.build.baseImage = "mlrun/mlrun"

%nuclio: setting spec.build.baseImage to 'mlrun/mlrun'


#### Function code
Generate the iris dataset and log the dataframe (as csv or parquet file)

In [4]:
import os
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import accuracy_score
from mlrun.artifacts import TableArtifact, PlotArtifact
import pandas as pd


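def iris_generator(context, format='csv'):
    """Build the iris dataframe (features plus a 'label' column) and log it as a dataset."""
    iris = load_iris()
    iris_dataset = pd.DataFrame(data=iris.data, columns=iris.feature_names)
    iris_labels = pd.DataFrame(data=iris.target, columns=['label'])
    # append the label column to the feature columns
    iris_dataset = pd.concat([iris_dataset, iris_labels], axis=1)
    
    context.logger.info('saving iris dataframe to {}'.format(context.artifact_path))
    # store the dataframe as a run artifact in the requested format (e.g. 'csv' or 'pq')
    context.log_dataset('iris_dataset', df=iris_dataset, format=format, index=False)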
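
The handler only touches `context.logger`, `context.artifact_path`, and `context.log_dataset`, so its call pattern can be exercised without an MLRun service by passing a stand-in object. The sketch below is an illustration under assumptions: `StubContext` and `toy_generator` are hypothetical names (not part of MLRun), and the toy handler logs a two-row list instead of the real iris dataframe so that the example needs only the standard library.

```python
import logging


class StubContext:
    """Hypothetical test double; records what a handler logs."""

    def __init__(self, artifact_path):
        self.artifact_path = artifact_path
        self.logger = logging.getLogger("stub-context")
        self.datasets = {}

    def log_dataset(self, key, df=None, format="csv", index=False, **kwargs):
        # Store the call arguments instead of writing a file.
        self.datasets[key] = {"df": df, "format": format, "index": index}


def toy_generator(context, format="csv"):
    # Same call pattern as iris_generator, with a tiny in-memory table.
    rows = [
        {"sepal length (cm)": 5.1, "label": 0},
        {"sepal length (cm)": 7.0, "label": 1},
    ]
    context.logger.info("saving toy table to %s", context.artifact_path)
    context.log_dataset("iris_dataset", df=rows, format=format, index=False)


ctx = StubContext(artifact_path="/tmp/data")
toy_generator(ctx, format="pq")
logged = ctx.datasets["iris_dataset"]
print(logged["format"], len(logged["df"]))
```

A stub like this only checks that the handler calls the context as expected; it does not replace a real local run, which also records the run and its artifacts in the MLRun database.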


The following end-code annotation tells ```nuclio``` to stop parsing the notebook from this cell. _**Please do not remove this cell**_:

In [5]:
# nuclio: end-code
# marks the end of a code section

<a id='test-locally'></a>
### Run the data generator function locally

The function above can be tested locally. Parameters, inputs, and outputs can be specified in the API or in the `Task` object.

We use the ```local``` runtime here; later on we will use the ```job``` runtime to run in containers.

In each run we can specify the function, inputs, parameters/hyper-parameters, and so on. For more details, see the [mlrun_basics notebook](mlrun_basics.ipynb).

In [6]:
from os import path
from mlrun import run_local, NewTask, mlconf, import_function, mount_v3io
mlconf.dbpath = mlconf.dbpath or 'http://mlrun-api:8080'

# specify artifacts target location
artifact_path = mlconf.artifact_path or path.abspath('./')
project_name = 'sk2-project'
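
The cell above relies on the `value or default` idiom: a setting that is already configured is kept, and a missing one falls back to a default. The following minimal sketch shows that behavior in isolation. The helper `with_default` and the URL `http://my-mlrun:8080` are hypothetical and used only for illustration.

```python
from os import path


def with_default(value, default):
    # `or` returns `default` when `value` is falsy (None, empty string, ...).
    return value or default


# Hypothetical starting values: nothing configured yet.
dbpath = with_default("", "http://mlrun-api:8080")
artifact_path = with_default(None, path.abspath("./"))

# An explicitly configured value (hypothetical URL) is kept as-is.
kept = with_default("http://my-mlrun:8080", "http://mlrun-api:8080")

print(dbpath, kept)
```

Note that the fallback also triggers for an empty string, not only for `None`, which is usually what is wanted for configuration values.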

<b>Run</b>

In [7]:
# run the function locally
gen = run_local(name='iris_gen', handler=iris_generator, 
                project=project_name, artifact_path=path.join(artifact_path, 'data')) 

[mlrun] 2020-03-29 21:28:11,596 starting run iris_gen uid=d056a4f9fcbb40c1817d2c3046c219a9  -> http://10.196.88.27:80
[mlrun] 2020-03-29 21:28:11,634 saving iris dataframe to /User/ml/demos/sklearn-pipe/data
[mlrun] 2020-03-29 21:28:11,684 log artifact iris_dataset at /User/ml/demos/sklearn-pipe/data/iris_dataset.csv, size: 2776, db: Y



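| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...c219a9 | 0 | Mar 29 21:28:11 | completed | iris_gen | v3io_user=admin, kind=handler, owner=admin, host=jupyter-74f9488695-6wrxj | | | | iris_dataset |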


to track results use .show() or .logs() or in CLI: 
!mlrun get run d056a4f9fcbb40c1817d2c3046c219a9 --project sk2-project , !mlrun logs d056a4f9fcbb40c1817d2c3046c219a9 --project sk2-project
[mlrun] 2020-03-29 21:28:11,729 run executed, status=completed


## Create a project to host our functions, jobs and artifacts

Projects are used to package multiple functions, workflows, and artifacts. We usually store project code and definitions in a Git archive.

The following code creates a new project in a local directory and initializes Git tracking for it.

In [8]:
from mlrun import new_project, code_to_function
project_dir = './project'
skproj = new_project(project_name, project_dir, init_git=True)

#### Convert our local code to a distributed function object 

In [9]:
gen_func = code_to_function(name='gen_iris', kind='job')
skproj.set_function(gen_func)

<mlrun.projects.project.MlrunProject at 0x7efda6e43ac8>

## Load and run a library function (visualize dataset features and stats)

<b>Step 1:</b> Load the function object from the function hub (marketplace).


In [10]:
skproj.set_function('hub://describe', 'describe')

<mlrun.projects.project.MlrunProject at 0x7efda6e43ac8>

<b>Step 2:</b> Run the describe function as a Kubernetes job with specified parameters.

> `mount_v3io()` connects our function to the v3io shared file system, which lets us pass data to the function and get the results (plots) back directly in our notebook.

In [11]:
skproj.func('describe').apply(mount_v3io()).run(params={'label_column': 'label'}, 
                                                inputs={"table": gen.outputs['iris_dataset']}, 
                                                artifact_path=artifact_path)

[mlrun] 2020-03-29 21:28:31,152 starting run describe-summarize uid=95710f97ca6c4e2ba512b73990ad44e8  -> http://10.196.88.27:80
[mlrun] 2020-03-29 21:28:31,224 Job is running in the background, pod: describe-summarize-r29jx
findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.
[mlrun] 2020-03-29 21:28:44,717 log artifact histograms at /User/ml/demos/sklearn-pipe/plots/hist.html, size: 280685, db: Y
[mlrun] 2020-03-29 21:28:45,133 log artifact imbalance at /User/ml/demos/sklearn-pipe/plots/imbalance.html, size: 11840, db: Y
[mlrun] 2020-03-29 21:28:45,317 log artifact correlation at /User/ml/demos/sklearn-pipe/plots/corr.html, size: 33266, db: Y

[mlrun] 2020-03-29 21:28:45,377 run executed, status=completed
final state: succeeded


| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...ad44e8 | 0 | Mar 29 21:28:42 | completed | describe-summarize | host=describe-summarize-r29jx, kind=job, owner=admin, v3io_user=admin | table | label_column=label | scale_pos_weight=1.00 | histograms, imbalance, correlation |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 95710f97ca6c4e2ba512b73990ad44e8 --project sk2-project , !mlrun logs 95710f97ca6c4e2ba512b73990ad44e8 --project sk2-project
[mlrun] 2020-03-29 21:28:50,464 run executed, status=completed


<mlrun.model.RunObject at 0x7efda775e0b8>

#### Add more functions to our project (from the functions hub/marketplace)

In [12]:
skproj.set_function('hub://sklearn_classifier', 'train')
skproj.set_function('hub://test_classifier', 'test')
skproj.set_function('hub://model_server', 'serving')
skproj.sync_functions()
#print(skproj.to_yaml())

#### Define and save a pipeline 

The following workflow definition is written to a file. It describes an execution graph (DAG) and how the functions are connected to form an end-to-end pipeline.

* Build the iris generator (ingest) function container
* Ingest the iris data
* Analyze the data (describe)
* Train and test the model
* Deploy the model as a serverless function
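
The steps above form a dependency graph rather than a strict sequence. As a rough illustration, the sketch below encodes one plausible reading of those dependencies and derives a valid execution order with the standard-library `graphlib` module (Python 3.9+). The step labels and edges are assumptions made for this example (ingest needs the built image, describe and train need the ingested data, test and deploy need the trained model); the pipeline engine resolves the real graph from the step inputs and outputs.

```python
from graphlib import TopologicalSorter

# step -> set of steps it depends on (labels are illustrative only)
dependencies = {
    "build": set(),
    "ingest": {"build"},
    "describe": {"ingest"},
    "train": {"ingest"},
    "test": {"train"},
    "deploy": {"train"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Steps with no dependency on each other, such as describe and train, may appear in either order and could run in parallel.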

In [13]:
%%writefile project/workflow.py
from kfp import dsl
from mlrun import mount_v3io

funcs = {}
DATASET = 'iris_dataset'
LABELS  = "label"

def init_functions(functions: dict, project=None, secrets=None):
    for f in functions.values():
        f.apply(mount_v3io())
        f.spec.image_pull_policy = 'Always'

@dsl.pipeline(
    name="My XGBoost training pipeline",
    description="Shows how to use mlrun."
)
def kfpipeline():
    
    # build our ingestion function (container image)
    builder = funcs['gen-iris'].deploy_step(skip_deployed=True)
    
    # run the ingestion function with the new image and params
    ingest = funcs['gen-iris'].as_step(
        name="get-data",
        handler='iris_generator',
        image=builder.outputs['image'],
        params={'format': 'pq'},
        outputs=[DATASET])

    # analyze our dataset
    describe = funcs["describe"].as_step(
        name="summary",
        params={"label_column": LABELS},
        inputs={"table": ingest.outputs[DATASET]})
    
    # train with hyper-parameters
    train = funcs["train"].as_step(
        name="train-skrf",
        params={"model_pkg_class" : "sklearn.ensemble.RandomForestClassifier",
                "sample"          : -1, 
                "label_column"    : LABELS,
                "test_size"       : 0.10},
        hyperparams={'CLASS_n_estimators': [100, 300, 500]},
        selector='max.accuracy',
        inputs={"dataset"         : ingest.outputs[DATASET]},
        outputs=['model', 'test_set'])

    # test and visualize our model
    test = funcs["test"].as_step(
        name="test",
        params={"label_column": LABELS},
        inputs={"models_path" : train.outputs['model'],
                "test_set"    : train.outputs['test_set']})

    # deploy our model as a serverless function
    deploy = funcs["serving"].deploy_step(models={f"{DATASET}_v1": train.outputs['model']})

Overwriting project/workflow.py
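
The training step passes a list of values through `hyperparams` and uses `selector='max.accuracy'` to choose among the resulting iterations. The sketch below mimics that idea in plain Python: it expands the grid into one parameter set per iteration and picks the iteration with the highest `accuracy`. It is a simplified model, not MLRun's implementation: `expand_grid` and `select_best` are hypothetical helpers, the iteration numbering from 1 is an assumption, and the accuracy values are made up.

```python
from itertools import product


def expand_grid(hyperparams):
    """Yield one parameter dict per combination of the listed values."""
    keys = list(hyperparams)
    for values in product(*(hyperparams[k] for k in keys)):
        yield dict(zip(keys, values))


def select_best(iterations, selector):
    """Pick the iteration matching a '<max|min>.<result>' selector."""
    op, key = selector.split(".", 1)
    if op not in ("max", "min"):
        raise ValueError(f"unsupported selector operation: {op}")
    choose = max if op == "max" else min
    return choose(iterations, key=lambda it: it["results"][key])


grid = list(expand_grid({"CLASS_n_estimators": [100, 300, 500]}))

# Accuracy values below are made up for illustration only.
made_up_accuracy = [0.95, 0.97, 0.96]
iterations = [
    {"iter": i, "params": params, "results": {"accuracy": acc}}
    for i, (params, acc) in enumerate(zip(grid, made_up_accuracy), start=1)
]

best = select_best(iterations, "max.accuracy")
print(best["iter"], best["params"])
```

With more than one hyper-parameter key, `expand_grid` produces the full Cartesian product, so the number of iterations grows multiplicatively.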


In [14]:
skproj.set_workflow('main', 'workflow.py')

Save the project definitions to a file (project.yaml). It is recommended to commit all changes to a Git repo.

In [15]:
skproj.save()

<a id='run-pipeline'></a>
## Run a pipeline workflow
You can check the [workflow.py](src/workflow.py) file to see how function objects are initialized and used (by name) inside the workflow.
The `workflow.py` file has two parts: initializing the function objects and defining the pipeline DSL (connecting the function inputs and outputs).
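
The two-part structure can be tried out without a cluster by using stand-in objects. In the sketch below, `make_stub_function` and `pipeline` are hypothetical helpers, and the stub objects are plain namespaces rather than real MLRun functions. It assumes that the project fills the `funcs` dictionary and calls `init_functions` before the pipeline is built.

```python
from types import SimpleNamespace


def make_stub_function(name):
    # Hypothetical stand-in for an MLRun function object (not the real class).
    return SimpleNamespace(
        name=name,
        spec=SimpleNamespace(image_pull_policy=None),
        applied=[],
    )


def init_functions(functions: dict, project=None, secrets=None):
    # Part 1: adjust every function object before the pipeline is built.
    for f in functions.values():
        f.applied.append("mount_v3io")  # stands in for f.apply(mount_v3io())
        f.spec.image_pull_policy = "Always"


def pipeline(funcs):
    # Part 2: look functions up by name and wire them together.
    return [funcs[name].name for name in ("gen-iris", "describe", "train")]


names = ("gen-iris", "describe", "train", "test", "serving")
funcs = {n: make_stub_function(n) for n in names}
init_functions(funcs)
steps = pipeline(funcs)
print(steps)
```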

> Note the pipeline can include CI steps like building container images and deploying models.



### Run
Use the `run` method to execute a workflow. You can provide alternative arguments and specify the default target for workflow artifacts.<br>
The workflow ID is returned and can be used to track progress; you can also use the hyperlinks.

> Note: The same command can be issued through CLI commands:<br>
    `mlrun project my-proj/ -r main -p "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"`

The `dirty` flag allows us to run a project with uncommitted changes (when the notebook is in the same Git directory, the project will always be dirty).

In [16]:
artifact_path = path.abspath('./{{workflow.uid}}')
run_id = skproj.run(
    'main',
    arguments={}, 
    artifact_path=artifact_path, 
    dirty=True)

[mlrun] 2020-03-29 21:28:57,864 Pipeline run id=8d4f35e6-73fa-4109-96a8-a6d7ee8c42f5, check UI or DB for progress
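
The artifact path contains the `{{workflow.uid}}` placeholder, so each pipeline run writes to its own directory. The sketch below only mimics that substitution with a string replace, to show what the resolved path looks like; the real substitution is assumed to be done by the pipeline engine at run time, and `render` is a hypothetical helper.

```python
from os import path

template = path.abspath("./{{workflow.uid}}")

# Run ID taken from the log line above.
run_id = "8d4f35e6-73fa-4109-96a8-a6d7ee8c42f5"


def render(path_template, uid):
    # Replace the placeholder with the concrete workflow ID.
    return path_template.replace("{{workflow.uid}}", uid)


resolved = render(template, run_id)
print(resolved)
```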


#### Track pipeline results

In [17]:
from mlrun import get_run_db
db = get_run_db().connect()
db.list_runs(project=skproj.name, labels=f'workflow={run_id}').show()

| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...11ec5d | 0 | Mar 29 21:29:54 | completed | test | host=test-xgmdx, kind=job, owner=admin, v3io_user=admin, workflow=8d4f35e6-73fa-4109-96a8-a6d7ee8c42f5 | models_path, test_set | label_column=label | accuracy=0.9333333333333333, avg_precscore=0.37779997779997776, f1_score=0.9333333333333333, rocauc=0.43671760338427007 | roc, confusion |
| ...284f21 | 0 | Mar 29 21:29:30 | completed | summary | host=summary-n2929, kind=job, owner=admin, v3io_user=admin, workflow=8d4f35e6-73fa-4109-96a8-a6d7ee8c42f5 | table | label_column=label | scale_pos_weight=1.00 | histograms, imbalance, correlation |
| ...bd4f9c | 0 | Mar 29 21:29:29 | completed | train-skrf | kind=job, owner=admin, v3io_user=admin, workflow=8d4f35e6-73fa-4109-96a8-a6d7ee8c42f5 | dataset | label_column=label, model_pkg_class=sklearn.ensemble.RandomForestClassifier, sample=-1, test_size=0.1 | accuracy=0.9705882352941176, avg_precscore=0.9942548033145147, best_iteration=1, f1_score=0.9705882352941176, rocauc=0.9945117845117846 | test_set, model, roc, confusion, iteration_results |
| ...c326a1 | 0 | Mar 29 21:29:11 | completed | get-data | host=get-data-s9g8f, kind=job, owner=admin, v3io_user=admin, workflow=8d4f35e6-73fa-4109-96a8-a6d7ee8c42f5 | | format=pq | | iris_dataset |
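
The `labels=f'workflow={run_id}'` argument narrows the listing to runs that carry the pipeline's workflow label. The sketch below reproduces that filtering on plain dictionaries, using the run names and labels shown in this notebook (the earlier local `iris_gen` run has no workflow label). `filter_by_label` is a hypothetical helper, not the MLRun database API.

```python
RUN_ID = "8d4f35e6-73fa-4109-96a8-a6d7ee8c42f5"

runs = [
    {"name": "test", "state": "completed", "labels": {"kind": "job", "workflow": RUN_ID}},
    {"name": "summary", "state": "completed", "labels": {"kind": "job", "workflow": RUN_ID}},
    {"name": "train-skrf", "state": "completed", "labels": {"kind": "job", "workflow": RUN_ID}},
    {"name": "get-data", "state": "completed", "labels": {"kind": "job", "workflow": RUN_ID}},
    # The local run from earlier in the notebook carries no workflow label.
    {"name": "iris_gen", "state": "completed", "labels": {"kind": "handler"}},
]


def filter_by_label(records, selector):
    # A selector has the form 'key=value'; keep records whose label matches.
    key, _, value = selector.partition("=")
    return [r for r in records if r["labels"].get(key) == value]


selected = filter_by_label(runs, f"workflow={RUN_ID}")
print([r["name"] for r in selected])
```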


**[back to top](#top)**