# Demonstrate Local Or Remote Functions And Full Pipelines
  --------------------------------------------------------------------

This notebook creates and tests a local function, runs predefined (library) functions, and then builds and runs a full ML pipeline that combines local and library functions.

#### **notebook how-to's**
* Create and test a simple function
* Examine data using a containerized `describe` function
* Create an ML pipeline from various functions
* Run the pipeline and track its results and artifacts

## Create and Test a Local Function (Iris Data Generator)
Import the nuclio SDK and magics (<b>do not remove this cell or its comment</b>).<br>
The path to the MLRun API service is set further down, before the local run.

In [1]:
# nuclio: ignore
import nuclio

<b>Specify function dependencies and configuration</b>

In [2]:
%nuclio config spec.build.baseImage = "mlrun/ml-models:0.4.6"

%nuclio: setting spec.build.baseImage to 'mlrun/ml-models:0.4.6'


#### Function code
Generate the iris dataset and log the dataframe as a CSV or Parquet file.

In [3]:
import os
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import accuracy_score
from mlrun.artifacts import TableArtifact, PlotArtifact
import pandas as pd


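def iris_generator(context, format='csv'):
    """Build the iris dataset as a single dataframe and log it as an artifact.

    :param context: MLRun function context (logger, artifact path, logging API)
    :param format:  file format of the logged dataset, e.g. 'csv' or 'pq'
    """
    iris = load_iris()
    # combine features and target into one table, with the target in 'label'
    iris_dataset = pd.DataFrame(data=iris.data, columns=iris.feature_names)
    iris_labels = pd.DataFrame(data=iris.target, columns=['label'])
    iris_dataset = pd.concat([iris_dataset, iris_labels], axis=1)

    context.logger.info('saving iris dataframe to {}'.format(context.artifact_path))
    context.log_dataset('iris_dataset', df=iris_dataset, format=format, index=False)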


The following end-code annotation tells ```nuclio``` to stop parsing the notebook from this cell. _**Please do not remove this cell**_:

In [4]:
# nuclio: end-code
# marks the end of a code section
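
The handler above only relies on three members of `context`: `logger`, `artifact_path`, and `log_dataset`. As a minimal sketch of that contract, the snippet below uses a hypothetical `StubContext` class and a hypothetical `toy_generator` handler (neither is part of MLRun) to show how a handler with the same shape could be exercised without an MLRun service. The stub records the call instead of writing a file.

```python
import logging


class StubContext:
    """Hypothetical stand-in for the MLRun context; it only records calls."""

    def __init__(self, artifact_path="./artifacts"):
        self.artifact_path = artifact_path
        self.logger = logging.getLogger("stub-context")
        self.datasets = {}

    def log_dataset(self, key, df=None, format="csv", index=False, **kwargs):
        # Keep the arguments in memory instead of writing an artifact.
        self.datasets[key] = {"df": df, "format": format, "index": index}


def toy_generator(context, format="csv"):
    """Same call pattern as iris_generator, with a tiny in-memory table."""
    rows = [
        {"sepal length (cm)": 5.1, "label": 0},
        {"sepal length (cm)": 7.0, "label": 1},
    ]
    context.logger.info("saving toy table to {}".format(context.artifact_path))
    context.log_dataset("toy_dataset", df=rows, format=format, index=False)


ctx = StubContext()
toy_generator(ctx, format="pq")
print(ctx.datasets["toy_dataset"]["format"])
```

A stub like this only checks the call pattern; the real context also stores the artifact and registers it in the MLRun database.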

<a id='test-locally'></a>
### Run the data generator function locally

The functions above can be tested locally. Parameters, inputs, and outputs can be specified in the API or the `Task` object.

Here we use the ```local``` runtime. Later on we will use a ```job``` runtime to run the code in containers.

In each run we can specify the function, inputs, parameters, hyper-parameters, and so on. For more details, see the [mlrun_basics notebook](mlrun_basics.ipynb).
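
Conceptually, run parameters end up as keyword arguments of the handler. The snippet below is a sketch of that idea only, not MLRun's implementation: `bind_params` and `example_handler` are hypothetical names, and the rule of silently dropping parameters the handler does not declare is an assumption made for the illustration.

```python
import inspect


def bind_params(handler, params):
    """Keep only the params that the handler declares as arguments."""
    signature = inspect.signature(handler)
    accepted = {}
    for name, value in params.items():
        # 'context' is supplied by the runner, never by the user.
        if name in signature.parameters and name != "context":
            accepted[name] = value
    return accepted


def example_handler(context, format="csv", sample=-1):
    # Return the effective arguments so the binding is easy to inspect.
    return {"format": format, "sample": sample}


params = {"format": "pq", "unused": 123}
kwargs = bind_params(example_handler, params)
result = example_handler(None, **kwargs)
print(result)
```

Arguments that are not overridden keep the defaults declared in the handler signature, which is why `iris_generator` writes CSV unless `format` is passed explicitly.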

In [5]:
from os import path
from mlrun import run_local, NewTask, mlconf, import_function, mount_v3io
mlconf.dbpath = mlconf.dbpath or 'http://mlrun-api:8080'

# specify artifacts target location
artifact_path = mlconf.artifact_path or path.abspath('./')
mlconf.artifact_path

'/User/artifacts'

In [7]:
project_name = 'sklearn-project-demo'

<b>Run</b>

In [8]:
# run the function locally
gen = run_local(
    name='iris_gen', 
    handler=iris_generator, 
    project=project_name, 
    artifact_path=artifact_path) 

[mlrun] 2020-04-28 09:33:24,376 starting run iris_gen uid=9923e9d0be2f436db8bf6a64e8ed2eca  -> http://mlrun-api:8080
[mlrun] 2020-04-28 09:33:24,670 saving iris dataframe to /User/artifacts
[mlrun] 2020-04-28 09:33:25,193 log artifact iris_dataset at /User/artifacts/iris_dataset.csv, size: 2776, db: Y



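| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sklearn-project-demo | ...e8ed2eca | 0 | Apr 28 09:33:24 | completed | iris_gen | v3io_user=admin<br>kind=handler<br>owner=admin<br>host=jupyter-6dc6ff466f-q56kd | | | | iris_dataset |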


to track results use .show() or .logs() or in CLI: 
!mlrun get run 9923e9d0be2f436db8bf6a64e8ed2eca --project sklearn-project-demo , !mlrun logs 9923e9d0be2f436db8bf6a64e8ed2eca --project sklearn-project-demo
[mlrun] 2020-04-28 09:33:25,377 run executed, status=completed


## Create a project to host our functions, jobs and artifacts

Projects are used to package multiple functions, workflows, and artifacts. We usually store project code and definitions in a Git archive.

The following code creates a new project in a local directory and initializes Git tracking for it.

In [9]:
from mlrun import new_project, code_to_function
project_dir = './project'
skproj = new_project(project_name, project_dir, init_git=True)

#### Convert our local code to a distributed function object 

In [10]:
gen_func = code_to_function(name='gen_iris', 
                            kind='job',
                            image='mlrun/ml-models:0.4.6')
skproj.set_function(gen_func)

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f6374c200f0>
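
Note that the function is created here with `name='gen_iris'`, while the workflow later in this notebook looks it up as `funcs['gen-iris']`. That suggests the name is normalized when the function is registered. The sketch below illustrates one such rule and is an assumption, not a copy of MLRun's code: the helper `normalize_name` is hypothetical and simply replaces whitespace and underscores with hyphens and lowercases the result.

```python
import re


def normalize_name(name):
    """Hypothetical rule: whitespace/underscores -> hyphens, then lowercase."""
    return re.sub(r"[\s_]+", "-", name.strip()).lower()


print(normalize_name("gen_iris"))
```

If a lookup by name fails in your environment, printing the project's registered function names is a quick way to see which form is actually stored.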

## Load and run a library function (visualize dataset features and stats)

<b>Step 1:</b> Load the function object from the function hub (marketplace).


In [11]:
skproj.set_function('hub://describe', 'describe')

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f6345acb080>

<b>Step 2:</b> Run the describe function as a Kubernetes job with specified parameters.

> `mount_v3io()` connects our function to the v3io shared file system, which allows us to pass the data to the job and get the results (plots) back directly in our notebook.

In [12]:
skproj.func('describe').apply(mount_v3io()).run(params={'label_column': 'label'}, 
                                                inputs={"table": gen.outputs['iris_dataset']}, 
                                                artifact_path=artifact_path)

[mlrun] 2020-04-28 09:33:56,838 starting run describe-summarize uid=8729a6637254490f8cf739cc359ae8d9  -> http://mlrun-api:8080
[mlrun] 2020-04-28 09:33:57,133 Job is running in the background, pod: describe-summarize-pxz29
[mlrun] 2020-04-28 09:34:15,998 log artifact histograms at /User/artifacts/plots/hist.html, size: 282853, db: Y
[mlrun] 2020-04-28 09:34:17,688 log artifact imbalance at /User/artifacts/plots/imbalance.html, size: 11716, db: Y
[mlrun] 2020-04-28 09:34:18,357 log artifact correlation at /User/artifacts/plots/corr.html, size: 30642, db: Y

[mlrun] 2020-04-28 09:34:18,552 run executed, status=completed
final state: succeeded


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sklearn-project-demo | ...359ae8d9 | 0 | Apr 28 09:34:09 | completed | describe-summarize | host=describe-summarize-pxz29<br>kind=job<br>owner=admin<br>v3io_user=admin | table | label_column=label | scale_pos_weight=1.00 | histograms<br>imbalance<br>correlation |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 8729a6637254490f8cf739cc359ae8d9 --project sklearn-project-demo , !mlrun logs 8729a6637254490f8cf739cc359ae8d9 --project sklearn-project-demo
[mlrun] 2020-04-28 09:34:26,937 run executed, status=completed


<mlrun.model.RunObject at 0x7f6341fd7710>

#### Add more functions to our project (from the functions hub/marketplace)

In [13]:
skproj.set_function('hub://sklearn_classifier', 'train')
skproj.set_function('hub://test_classifier', 'test')
skproj.set_function('hub://model_server', 'serving')
skproj.set_function('hub://model_server_tester', 'live_tester')
#print(skproj.to_yaml())

<mlrun.runtimes.function.RemoteRuntime at 0x7f6345e1eda0>

#### Define and save a pipeline 

The following workflow definition is written to a file. It describes an execution graph (DAG) and how the functions are connected to form an end-to-end pipeline:

* Build the iris generator (ingest) function container 
* Ingest the iris data
* Analyze the dataset (describe)
* Train and test the model using multiple algorithms (AutoML)
* Deploy the best model as a serverless function
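
The steps above form a dependency graph rather than a strict sequence: the analysis and training steps both consume the ingested data, and testing and deployment both consume the trained model. As a rough sketch (independent of Kubeflow Pipelines, which derives the graph from step inputs and outputs), the same structure can be written as a mapping and ordered with the standard-library `graphlib` module (Python 3.9+). The step names `build` and `deploy` are labels chosen for this illustration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it needs.
dependencies = {
    "build": set(),
    "get-data": {"build"},        # runs on the freshly built image
    "summary": {"get-data"},      # describe the ingested dataset
    "train-skrf": {"get-data"},   # train on the ingested dataset
    "test": {"train-skrf"},       # needs the model and the test set
    "deploy": {"train-skrf"},     # serves the selected model
}

# One valid execution order; independent steps may appear in any order.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Steps with no path between them, such as `summary` and `train-skrf`, can run in parallel.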

In [14]:
%%writefile project/workflow.py
from kfp import dsl
from mlrun import mount_v3io

funcs = {}
DATASET = 'iris_dataset'
LABELS  = "label"

def init_functions(functions: dict, project=None, secrets=None):
    for f in functions.values():
        f.apply(mount_v3io())
        f.spec.image_pull_policy = 'Always'

@dsl.pipeline(
    name="My XGBoost training pipeline",
    description="Shows how to use mlrun."
)
def kfpipeline():
    
    # build our ingestion function (container image)
    builder = funcs['gen-iris'].deploy_step(skip_deployed=True)
    
    # run the ingestion function with the new image and params
    ingest = funcs['gen-iris'].as_step(
        name="get-data",
        handler='iris_generator',
        image=builder.outputs['image'],
        params={'format': 'pq'},
        outputs=[DATASET])

    # analyze our dataset
    describe = funcs["describe"].as_step(
        name="summary",
        params={"label_column": LABELS},
        inputs={"table": ingest.outputs[DATASET]})
    
    # train with multiple algo and compare max accuracy 
    train = funcs["train"].as_step(
        name="train-skrf",
        params={"sample"          : -1, 
                "label_column"    : LABELS,
                "test_size"       : 0.10},
        hyperparams={'model_pkg_class': ["sklearn.ensemble.RandomForestClassifier", 
                                         "sklearn.linear_model.LogisticRegression",
                                         "sklearn.ensemble.AdaBoostClassifier"]},
        selector='max.accuracy',
        inputs={"dataset"         : ingest.outputs[DATASET]},
        outputs=['model', 'test_set'])

    # test and visualize our model
    test = funcs["test"].as_step(
        name="test",
        params={"label_column": LABELS},
        inputs={"models_path" : train.outputs['model'],
                "test_set"    : train.outputs['test_set']})

    # deploy our model as a serverless function
    deploy = funcs["serving"].deploy_step(models={f"{DATASET}_v1": train.outputs['model']})

Writing /User/artifacts/project/workflow.py
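
The training step combines `hyperparams` with `selector='max.accuracy'`: one iteration runs per hyperparameter combination, and the iteration with the highest `accuracy` is kept. The snippet below sketches that logic in plain Python. It is an illustration under assumptions, not MLRun's implementation: `expand_grid` and `select_best` are hypothetical helpers, and the scores are placeholders rather than measured results.

```python
from itertools import product


def expand_grid(hyperparams):
    """Return one parameter dict per combination of hyperparameter values."""
    keys = list(hyperparams)
    combos = product(*(hyperparams[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def select_best(results, selector):
    """Pick a result using a '<max|min>.<metric>' selector string."""
    op, metric = selector.split(".", 1)
    if op not in ("max", "min"):
        raise ValueError("selector must start with 'max.' or 'min.'")
    chooser = max if op == "max" else min
    return chooser(results, key=lambda r: r[metric])


hyperparams = {
    "model_pkg_class": [
        "sklearn.ensemble.RandomForestClassifier",
        "sklearn.linear_model.LogisticRegression",
        "sklearn.ensemble.AdaBoostClassifier",
    ]
}
iterations = expand_grid(hyperparams)
print(len(iterations))

# Placeholder scores for illustration only (not real training results).
results = [
    {"iter": 1, "accuracy": 0.5},
    {"iter": 2, "accuracy": 0.7},
    {"iter": 3, "accuracy": 0.6},
]
best = select_best(results, "max.accuracy")
print(best["iter"])
```

With several hyperparameter lists, the grid grows as the product of their lengths, so the number of training iterations can increase quickly.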


In [15]:
skproj.set_workflow('main', 'workflow.py')

Save the project definitions to a file (project.yaml). It is recommended to commit all changes to a Git repo.

In [16]:
skproj.save()

<a id='run-pipeline'></a>
## Run a pipeline workflow
You can check the [workflow.py](src/workflow.py) file to see how function objects are initialized and used (by name) inside the workflow.
The `workflow.py` file has two parts: initializing the function objects and defining the pipeline DSL (connecting the function inputs and outputs).

> Note: the pipeline can include CI steps like building container images and deploying models.



### Run
Use the `run` method to execute a workflow. You can provide alternative arguments and specify the default target for workflow artifacts.<br>
The workflow ID is returned and can be used to track the progress, or you can use the hyperlinks.

> Note: The same command can be issued through CLI commands:<br>
    `mlrun project my-proj/ -r main -p "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"`

The `dirty` flag allows us to run a project with uncommitted changes (when the notebook is in the same Git directory, the project will always be dirty).
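
The artifact path in the CLI example contains a `{{workflow.uid}}` placeholder, which gives each workflow run its own artifact directory. The sketch below only illustrates the resulting path: `render_artifact_path` is a hypothetical helper, and it assumes the placeholder is replaced by the workflow run ID at execution time (the ID used here is the one reported in this notebook's run output).

```python
def render_artifact_path(template, workflow_uid):
    """Hypothetical helper: substitute the run ID into the path template."""
    return template.replace("{{workflow.uid}}", workflow_uid)


template = "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"
rendered = render_artifact_path(template, "897c1f81-703c-47ee-87ad-d4c4cdbd6301")
print(rendered)
```

Keeping the run ID in the path prevents artifacts from different pipeline runs from overwriting each other.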

In [17]:
artifact_path = mlconf.artifact_path
run_id = skproj.run(
    'main',
    arguments={}, 
    artifact_path=artifact_path, 
    dirty=True)

[mlrun] 2020-04-28 09:35:17,635 Pipeline run id=897c1f81-703c-47ee-87ad-d4c4cdbd6301, check UI or DB for progress


#### Track pipeline results

In [20]:
from mlrun import get_run_db
db = get_run_db().connect()
db.list_runs(project=skproj.name, labels=f'workflow={run_id}').show()

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sklearn-project-demo | ...64c88170 | 0 | Apr 28 09:36:02 | error | test | host=test-n9g4r<br>kind=job<br>owner=admin<br>v3io_user=admin<br>workflow=897c1f81-703c-47ee-87ad-d4c4cdbd6301 | models_path<br>test_set | label_column=label | | |
| sklearn-project-demo | ...40123b84 | 0 | Apr 28 09:35:47 | completed | summary | host=summary-lns4z<br>kind=job<br>owner=admin<br>v3io_user=admin<br>workflow=897c1f81-703c-47ee-87ad-d4c4cdbd6301 | table | label_column=label | scale_pos_weight=1.00 | histograms<br>imbalance<br>correlation |
| sklearn-project-demo | ...6d05ae27 | 0 | Apr 28 09:35:47 | completed | train-skrf | kind=job<br>owner=admin<br>v3io_user=admin<br>workflow=897c1f81-703c-47ee-87ad-d4c4cdbd6301 | dataset | label_column=label<br>sample=-1<br>test_size=0.1 | accuracy=0.9705882352941176<br>best_iteration=1<br>f1_score=0.9705882352941176<br>rocauc=0.9945117845117846 | test_set<br>model<br>roc<br>confusion<br>iteration_results |
| sklearn-project-demo | ...3613b380 | 0 | Apr 28 09:35:36 | completed | get-data | host=get-data-8mqjr<br>kind=job<br>owner=admin<br>v3io_user=admin<br>workflow=897c1f81-703c-47ee-87ad-d4c4cdbd6301 | | format=pq | | iris_dataset |
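
The run table above shows that one step of this pipeline run (`test`) ended in the `error` state while the others completed. As a small sketch of how such a listing can be checked programmatically, the snippet below copies the name, state, and truncated uid of each row into plain dictionaries and filters them. It deliberately avoids the MLRun API, so the record layout is an assumption made for the illustration.

```python
from collections import Counter

# Name, state, and truncated uid copied from the run table above.
runs = [
    {"name": "test", "state": "error", "uid": "...64c88170"},
    {"name": "summary", "state": "completed", "uid": "...40123b84"},
    {"name": "train-skrf", "state": "completed", "uid": "...6d05ae27"},
    {"name": "get-data", "state": "completed", "uid": "...3613b380"},
]

# Count runs per state, then list the steps that did not complete.
by_state = Counter(run["state"] for run in runs)
failed = [run["name"] for run in runs if run["state"] != "completed"]

print(dict(by_state))
print(failed)
```

For a failed step, the logs of the corresponding run are the first place to look for the cause.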


**[back to top](#top)**