# Demonstrate Local Or Remote Functions And Full Pipelines
  --------------------------------------------------------------------

This notebook creates a local function, runs predefined functions, and builds and runs a full ML pipeline from local and library functions.

#### **notebook how-to's**
* Create and test a simple function
* Examine data using the containerized `describe` function
* Create an ML pipeline from various functions
* Run and track the pipeline results and artifacts

## Create and Test a Local Function (Iris Data Generator)
Import the nuclio SDK and magics (<b>do not remove this cell or its comment!</b>)<br>
and set the path to the MLRun API service.

In [1]:
# nuclio: ignore
import nuclio

from os import environ
# set the location/url of mlrun service 
#environ['MLRUN_DBPATH'] = environ.get('MLRUN_DBPATH', 'http://mlrun-api:8080')

<b>Specify function dependencies and configuration</b>

In [2]:
%%nuclio cmd -c
pip install scikit-learn
pip install pyarrow

In [3]:
%nuclio config spec.build.baseImage = "mlrun/mlrun"

%nuclio: setting spec.build.baseImage to 'mlrun/mlrun'


#### Function code
Generate the iris dataset and log the dataframe (as a CSV or Parquet file).

In [4]:
import os
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import accuracy_score
from mlrun.artifacts import TableArtifact, PlotArtifact
import pandas as pd


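def iris_generator(context, format='csv'):
    """Generate the iris dataset and log it as a dataset artifact.

    :param context: MLRun function context (logger, artifact path, logging API)
    :param format:  output file format, 'csv' by default ('pq' is used for parquet)
    """
    iris = load_iris()
    # build one dataframe holding the features and the label column
    iris_dataset = pd.DataFrame(data=iris.data, columns=iris.feature_names)
    iris_labels = pd.DataFrame(data=iris.target, columns=['label'])
    iris_dataset = pd.concat([iris_dataset, iris_labels], axis=1)

    # report the target location, then log the dataframe as an artifact
    context.logger.info('saving iris dataframe to {}'.format(context.artifact_path))
    context.log_dataset('iris_dataset', df=iris_dataset, format=format, index=False)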


The following end-code annotation tells ```nuclio``` to stop parsing the notebook from this cell. _**Please do not remove this cell**_:

In [5]:
# nuclio: end-code
# marks the end of a code section

<a id='test-locally'></a>
### Run the data generator function locally

The function above can be tested locally. Parameters, inputs, and outputs can be specified in the API or in the `Task` object.

We use the ```local``` runtime here; later on we will use the ```job``` runtime to run in containers.

In each run we can specify the function, inputs, parameters/hyper-parameters, and so on. For more details, see the [mlrun_basics notebook](mlrun_basics.ipynb).
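
Before involving the MLRun service at all, the handler contract can be exercised with a stand-in context. The sketch below is an illustration only: `StubContext` and `toy_generator` are hypothetical names, not part of mlrun, and the stub assumes the handler needs nothing beyond `context.logger.info`, `context.artifact_path`, and `context.log_dataset`, which is all `iris_generator` uses.

```python
# Hypothetical stand-in for the MLRun context; not part of the mlrun package.
import logging


class StubContext:
    """Records what a handler logs instead of talking to an MLRun service."""

    def __init__(self, artifact_path):
        self.artifact_path = artifact_path
        self.logger = logging.getLogger("stub-context")
        self.datasets = {}  # key -> keyword arguments passed to log_dataset

    def log_dataset(self, key, df=None, format="csv", **kwargs):
        # Keep the payload in memory so a test can inspect it.
        self.datasets[key] = {"df": df, "format": format, **kwargs}


def toy_generator(context, format="csv"):
    """Same call pattern as iris_generator, with a tiny hard-coded table."""
    rows = [
        {"sepal length (cm)": 5.1, "label": 0},
        {"sepal length (cm)": 7.0, "label": 1},
    ]
    context.logger.info("saving toy table to {}".format(context.artifact_path))
    context.log_dataset("toy_dataset", df=rows, format=format, index=False)


ctx = StubContext(artifact_path="./data")
toy_generator(ctx, format="pq")
print(ctx.datasets["toy_dataset"]["format"])  # pq
```

A stub like this only checks that the handler calls the context as expected; it does not write files or register artifacts, which is what `run_local` adds.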

In [6]:
from os import path
from mlrun import run_local, NewTask, mlconf, import_function, mount_v3io
mlconf.dbpath = mlconf.dbpath or 'http://mlrun-api:8080'
# specify artifacts target location
artifact_path = mlconf.artifact_path or path.abspath('./')
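
The `value or default` idiom used in the cell above treats both `None` and an empty string as "not set". A minimal sketch of that fallback, extended with an environment lookup, is shown here; `resolve_setting` is a hypothetical helper and the precedence order (explicit value, environment variable, default) is an assumption made for illustration, not a description of how mlrun resolves its configuration.

```python
import os


def resolve_setting(configured, env_name, default, env=os.environ):
    """Return the first non-empty value: explicit config, environment, default."""
    # `or` skips None and empty strings alike, like `mlconf.dbpath or '...'`.
    return configured or env.get(env_name) or default


# Hypothetical values, for illustration only.
fake_env = {"MLRUN_DBPATH": "http://my-mlrun:8080"}
print(resolve_setting("", "MLRUN_DBPATH", "http://mlrun-api:8080", env=fake_env))
print(resolve_setting(None, "MLRUN_DBPATH", "http://mlrun-api:8080", env={}))
```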

<b>Run</b>

In [7]:
# run the function locally
gen = run_local(name='iris_gen', handler=iris_generator, artifact_path=path.join(artifact_path, 'data')) 

[mlrun] 2020-03-24 20:12:37,953 starting run iris_gen-iris_generator uid=d68bb9f7a9f54f1b93e3408b3cac4b47  -> http://mlrun-api:8080
[mlrun] 2020-03-24 20:12:37,989 saving iris dataframe to /User/new-xgb/data
[mlrun] 2020-03-24 20:12:38,055 log artifact iris_dataset at /User/new-xgb/data/iris_dataset.csv, size: 2776, db: Y



uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
...ac4b47,0,Mar 24 20:12:37,completed,iris_gen-iris_generator,kind=handler owner=admin host=jupyter-db8d675b8-rcpjx,,,,iris_dataset


to track results use .show() or .logs() or in CLI: 
!mlrun get run d68bb9f7a9f54f1b93e3408b3cac4b47 --project default , !mlrun logs d68bb9f7a9f54f1b93e3408b3cac4b47 --project default
[mlrun] 2020-03-24 20:12:38,109 run executed, status=completed


## Load and run a library function (visualize dataset features and stats)

<b>Step 1:</b> load the function object from the function hub (marketplace)

In [8]:
describe = import_function('hub://describe').apply(mount_v3io())

<b>Step 2:</b> Run the describe function as a Kubernetes job with specified parameters.

> The v3io shared file system allows us to pass the data to the job and get the results (plots) back directly in our notebook.

In [9]:
describe.run(params={'label_column': 'label'}, inputs={"table": gen.outputs['iris_dataset']}, artifact_path=artifact_path)

[mlrun] 2020-03-24 20:12:38,327 starting run describe-summarize uid=d6956b9fb37c4069969146a1343f63b4  -> http://mlrun-api:8080
[mlrun] 2020-03-24 20:12:38,390 Job is running in the background, pod: describe-summarize-xdtxq
findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.
.csv
[mlrun] 2020-03-24 20:12:54,380 log artifact histograms at /User/new-xgb/plots/hist.html, size: 280685, db: Y
[mlrun] 2020-03-24 20:12:54,790 log artifact imbalance at /User/new-xgb/plots/imbalance.html, size: 11840, db: Y
[mlrun] 2020-03-24 20:12:54,964 log artifact correlation at /User/new-xgb/plots/corr.html, size: 33266, db: Y

[mlrun] 2020-03-24 20:12:55,028 run executed, status=completed
final state: succeeded


uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
...3f63b4,0,Mar 24 20:12:51,completed,describe-summarize,host=describe-summarize-xdtxq kind=job owner=admin,table,label_column=label,scale_pos_weight=1.00,histograms imbalance correlation


to track results use .show() or .logs() or in CLI: 
!mlrun get run d6956b9fb37c4069969146a1343f63b4  , !mlrun logs d6956b9fb37c4069969146a1343f63b4 
[mlrun] 2020-03-24 20:12:57,581 run executed, status=completed


<mlrun.model.RunObject at 0x7fce58428e48>

## Create a project with multiple functions

Projects are used to package multiple functions, workflows, and artifacts. We usually store project code and definitions in a Git archive.

The following code creates a new project in a local directory and initializes Git tracking for it.

In [10]:
from mlrun import new_project, code_to_function
project_dir = './project'
xgbproj = new_project('xgb-project', project_dir, init_git=True)

#### Convert our local code to a distributed function object 

In [11]:
gen_func = code_to_function(name='gen_iris', kind='job')
xgbproj.set_function(gen_func)

<mlrun.projects.project.MlrunProject at 0x7fce56997e80>

#### Add more functions to our project (from the functions hub/marketplace)

In [12]:
xgbproj.set_function('hub://describe', 'describe')
xgbproj.set_function('hub://sklearn_classifier', 'train')
xgbproj.set_function('hub://test_classifier', 'test')
xgbproj.set_function('hub://serving', 'serving')
xgbproj.sync_functions()
#print(xgbproj.to_yaml())

#### Define and save a pipeline 

The following workflow definition is written to a file. It describes an execution graph (DAG) and how the functions are connected to form an end-to-end pipeline.

* Build the iris generator (ingest) function container 
* Ingest the iris data
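* Describe the dataset (summary and plots)
* Train a model
* Test the model
* Deploy the model as a serving function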

In [13]:
%%writefile project/workflow.py
from kfp import dsl
from mlrun import mount_v3io

funcs = {}
DATASET = 'iris_dataset'
LABELS  = "label"
MODEL_KEY = "models"

def init_functions(functions: dict, project=None, secrets=None):
    for f in functions.values():
        f.apply(mount_v3io())
        f.spec.image_pull_policy = 'Always'

@dsl.pipeline(
    name="My XGBoost training pipeline",
    description="Shows how to use mlrun."
)
def kfpipeline():
    
    builder = funcs['gen-iris'].deploy_step(skip_deployed=True)
    
    ingest = funcs['gen-iris'].as_step(
        name="get-data",
        handler='iris_generator',
        image=builder.outputs['image'],
        params={'format': 'pq'},
        outputs=[DATASET])

    describe = funcs["describe"].as_step(
        name="summary",
        params={"label_column": LABELS},
        inputs={"table": ingest.outputs[DATASET]})
    
    train = funcs["train"].as_step(
        name="train",
        params={"model_pkg_class" : "sklearn.linear_model.LogisticRegression",
                "model_key"       : MODEL_KEY, 
                "sample"          : -1, 
                "label_column"    : LABELS,
                "test_size"       : 0.10,
                "class_params_updates"  : {"random_state": 1},
                "fit_params_updates"    : {}},
        inputs={"data_key"        : ingest.outputs[DATASET]},
        outputs=[MODEL_KEY, "test_set"])

    test = funcs["test"].as_step(
        name="test",
        params={"label_column": LABELS},
        inputs={"models_dir"  : train.outputs[MODEL_KEY],
                "test_set"    : train.outputs["test_set"]},
        outputs=[MODEL_KEY])

    deploy = funcs["serving"].deploy_step(models={f"{DATASET}_v1": train.outputs[MODEL_KEY]})

Overwriting project/workflow.py
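
The step dependencies implied by the workflow above can be written down as a plain graph and ordered with the standard library. This is only a sketch for reasoning about the DAG: the step names mirror the Python variables in `workflow.py`, the edges are read off the `inputs`, `image`, and `models` arguments, and the ordering is computed locally with `graphlib` rather than by Kubeflow Pipelines.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the steps whose outputs it consumes in workflow.py.
depends_on = {
    "builder": set(),
    "ingest": {"builder"},   # image=builder.outputs['image']
    "describe": {"ingest"},  # table=ingest.outputs[DATASET]
    "train": {"ingest"},     # data_key=ingest.outputs[DATASET]
    "test": {"train"},       # models_dir and test_set come from train
    "deploy": {"train"},     # models={...: train.outputs[MODEL_KEY]}
}

# One valid execution order; independent steps may appear in either order.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

Steps with no path between them, such as `describe` and `train`, do not depend on each other, so a scheduler is free to run them in parallel.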


In [14]:
xgbproj.set_workflow('main', 'workflow.py')

Save the project definitions to a file (project.yaml). It is recommended to commit all changes to a Git repo.

In [15]:
xgbproj.save()

<a id='run-pipeline'></a>
## Run a pipeline workflow
You can check the [workflow.py](src/workflow.py) file to see how function objects are initialized and used (by name) inside the workflow.
The `workflow.py` file has two parts: initializing the function objects and defining the pipeline DSL (connecting the function inputs and outputs).

> Note: the pipeline can include CI steps such as building container images and deploying models.



### Run
Use the `run` method to execute a workflow. You can provide alternative arguments and specify the default target for workflow artifacts.<br>
The workflow ID is returned and can be used to track progress; you can also use the hyperlinks.

> Note: The same command can be issued through CLI commands:<br>
    `mlrun project my-proj/ -r main -p "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"`

The `dirty` flag allows us to run a project with uncommitted changes (when the notebook is in the same Git directory, the project will always be dirty).
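
To see what "dirty" means in Git terms, the sketch below interprets the output of `git status --porcelain`, which is empty for a clean working tree. The helper names are hypothetical, and this is not necessarily how MLRun performs its check; in particular, this sketch counts untracked files as changes, while other tools may ignore them.

```python
import subprocess


def is_dirty(porcelain_output):
    """True if `git status --porcelain` reported any changed or untracked path."""
    return any(line.strip() for line in porcelain_output.splitlines())


def working_tree_is_dirty(repo_dir):
    """Run git in repo_dir and interpret its porcelain status output."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return is_dirty(result.stdout)


print(is_dirty(""))                                         # False
print(is_dirty(" M project/workflow.py\n?? demo.ipynb\n"))  # True
```

Calling `working_tree_is_dirty('./project')` requires the `git` executable and an initialized repository, so only the pure parsing function is demonstrated here.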

In [19]:
run_id = xgbproj.run(
    'main',
    arguments={}, 
    artifact_path=artifact_path, 
    dirty=True)

[mlrun] 2020-03-24 20:21:31,126 Pipeline run id=964afbfa-c716-48b3-9534-44320a53c282, check UI or DB for progress


#### Track pipeline results

In [None]:
from mlrun import get_run_db
db = get_run_db().connect()
db.list_runs(project=xgbproj.name, labels=f'workflow={run_id}').show()
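
The `labels` argument above selects the runs that carry a `workflow` label equal to the pipeline run id. The selection logic can be sketched over in-memory records as follows; `filter_by_label` and the sample records (including the shortened id) are hypothetical and only illustrate the idea, they do not reproduce the MLRun DB API.

```python
def filter_by_label(runs, label):
    """Keep runs whose labels contain the given 'key=value' pair."""
    key, _, value = label.partition("=")
    return [run for run in runs if run.get("labels", {}).get(key) == value]


# Hypothetical run records; real records come from the MLRun DB.
runs = [
    {"name": "get-data", "labels": {"workflow": "964afbfa", "kind": "job"}},
    {"name": "iris_gen", "labels": {"kind": "handler"}},
    {"name": "train", "labels": {"workflow": "964afbfa", "kind": "job"}},
]

selected = filter_by_label(runs, "workflow=964afbfa")
print([run["name"] for run in selected])  # ['get-data', 'train']
```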

**[back to top](#top)**