# Demonstrate Git Based ML Pipeline Automation
  --------------------------------------------------------------------

This notebook creates a local function, runs predefined functions, and then creates and runs a full ML pipeline that combines local and library functions.

#### **notebook how-to's**
* Create and test a simple function
* Examine data using serverless (containerized) `describe` function
* Create an automated ML pipeline from various library functions
* Run and track the pipeline results and artifacts

## Create and Test a Local Ingestion/Data-prep Function (e.g. Iris Data Generator)
Import the nuclio SDK and magics. <b>Do not remove this cell or its comment!</b>

In [1]:
# nuclio: ignore
import nuclio

<b>Specify function dependencies and configuration</b>

In [2]:
%nuclio config spec.image = "mlrun/ml-models"

%nuclio: setting spec.image to 'mlrun/ml-models'


#### Function code
Generate the iris dataset and log the dataframe (as a CSV or Parquet file).

In [3]:
import os
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import accuracy_score
from mlrun.artifacts import TableArtifact, PlotArtifact
import pandas as pd

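def iris_generator(context, format='csv'):
    """Build the iris dataframe (features plus a 'label' column) and log it as a dataset."""
    iris = load_iris()
    iris_dataset = pd.DataFrame(data=iris.data, columns=iris.feature_names)
    iris_labels = pd.DataFrame(data=iris.target, columns=['label'])
    # append the target as a 'label' column next to the feature columns
    iris_dataset = pd.concat([iris_dataset, iris_labels], axis=1)

    context.logger.info('saving iris dataframe to {}'.format(context.artifact_path))
    # log the dataframe as an artifact in the requested file format
    context.log_dataset('iris_dataset', df=iris_dataset, format=format, index=False)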
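
The handler above only touches three members of `context`: `logger.info`, `artifact_path`, and `log_dataset`. As a rough sketch, that call pattern can be exercised without an MLRun service by passing a stand-in object. `StubContext` and `tiny_generator` below are hypothetical names made up for this illustration and are not part of MLRun; the toy handler replaces the sklearn/pandas logic with a single hard-coded row.

```python
from types import SimpleNamespace


class StubContext:
    """Hypothetical stand-in for the MLRun context, for offline checks only."""

    def __init__(self, artifact_path="/tmp/artifacts"):
        self.artifact_path = artifact_path
        self.messages = []   # captured log lines
        self.datasets = {}   # captured log_dataset() calls, keyed by artifact name
        # expose logger.info(...) the way the handler expects it
        self.logger = SimpleNamespace(info=self.messages.append)

    def log_dataset(self, key, df=None, format="csv", index=False):
        self.datasets[key] = {"df": df, "format": format, "index": index}


def tiny_generator(context, format="csv"):
    """Toy handler that follows the same call pattern as iris_generator."""
    rows = [{"sepal length (cm)": 5.1, "label": 0}]
    context.logger.info("saving rows to {}".format(context.artifact_path))
    context.log_dataset("iris_dataset", df=rows, format=format, index=False)


ctx = StubContext()
tiny_generator(ctx, format="pq")
print(ctx.messages)
print(ctx.datasets["iris_dataset"]["format"])
```

A stub like this only checks that the handler calls the context as expected; it does not validate the data itself or the real artifact storage.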


The following end-code annotation tells ```nuclio``` to stop parsing the notebook from this cell. _**Please do not remove this cell**_:

In [4]:
# nuclio: end-code
# marks the end of a code section
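
To make the role of the two annotations more concrete, here is a simplified imitation of how a converter might pick the cells that end up in the function code. It is a sketch under stated assumptions, not nuclio's actual parser: `extract_function_code` is a hypothetical helper, and it only looks at the first line of each cell.

```python
def extract_function_code(cells):
    """Collect code cells until an end-code marker, skipping ignored cells."""
    kept = []
    for source in cells:
        stripped = source.strip()
        first_line = stripped.splitlines()[0] if stripped else ""
        if first_line.startswith("# nuclio: end-code"):
            break      # everything after this cell is notebook-only code
        if first_line.startswith("# nuclio: ignore"):
            continue   # cell runs in the notebook but is left out of the function
        kept.append(source)
    return kept


cells = [
    "# nuclio: ignore\nimport nuclio",
    "import pandas as pd",
    "def iris_generator(context, format='csv'):\n    ...",
    "# nuclio: end-code",
    "from mlrun import new_project",
]
function_cells = extract_function_code(cells)
print(len(function_cells))
```

In this toy run only the import and the handler definition are kept, which is why project setup and test cells placed after the end-code cell do not leak into the deployed function.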

## Create a project to host our functions, jobs and artifacts

Projects are used to package multiple functions, workflows, and artifacts. We usually store project code and definitions in a Git archive.

The following code creates a new project in a local directory and initializes Git tracking for it.

In [5]:
from os import path
from mlrun import run_local, NewTask, mlconf, import_function, mount_v3io
mlconf.dbpath = mlconf.dbpath or 'http://mlrun-api:8080'

# specify artifacts target location
artifact_path = mlconf.artifact_path or path.abspath('./')
project_name = 'gitops-project'

In [6]:
from mlrun import new_project, code_to_function
project_dir = './'
skproj = new_project(project_name, project_dir)

<a id='test-locally'></a>
### Run/test the data generator function locally

The functions above can be tested locally. Parameters, inputs, and outputs can be specified in the API or the `Task` object.<br>
When using `run_local()`, the function inputs and outputs are automatically recorded by the MLRun experiment and data tracking DB.

In each run we can specify the function, inputs, parameters/hyper-parameters, etc. For more details, see the [mlrun_basics notebook](mlrun_basics.ipynb).

In [8]:
# run the function locally
gen = run_local(name='iris_gen', handler=iris_generator, 
                project=project_name, artifact_path=path.join(artifact_path, 'data')) 

[mlrun] 2020-05-12 15:39:21,448 starting run iris_gen uid=4e946f5fcd4f41e98aec9d62d1226a67  -> http://10.199.227.162:8080
[mlrun] 2020-05-12 15:39:21,485 saving iris dataframe to /User/gitops/data
[mlrun] 2020-05-12 15:39:21,535 log artifact iris_dataset at /User/gitops/data/iris_dataset.csv, size: 2776, db: Y



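| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gitops-project | ...d1226a67 | 0 | May 12 15:39:21 | completed | iris_gen | v3io_user=admin<br>kind=handler<br>owner=admin<br>host=jupyter-65887d7ffb-5jsn2 | | | | iris_dataset |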


to track results use .show() or .logs() or in CLI: 
!mlrun get run 4e946f5fcd4f41e98aec9d62d1226a67 --project gitops-project , !mlrun logs 4e946f5fcd4f41e98aec9d62d1226a67 --project gitops-project
[mlrun] 2020-05-12 15:39:21,576 run executed, status=completed
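
The log output above suggests two CLI commands for following up on a run. As a small convenience sketch, the same command lines can be assembled from a run uid and project name. `tracking_commands` is a hypothetical helper, and the command shapes are copied from the output above rather than from the CLI reference.

```python
import shlex


def tracking_commands(uid, project):
    """Build the `mlrun get run` and `mlrun logs` command lines for one run."""
    get_cmd = ["mlrun", "get", "run", uid, "--project", project]
    logs_cmd = ["mlrun", "logs", uid, "--project", project]
    # shlex.join quotes arguments safely in case a value contains spaces
    return shlex.join(get_cmd), shlex.join(logs_cmd)


get_cmd, logs_cmd = tracking_commands(
    "4e946f5fcd4f41e98aec9d62d1226a67", "gitops-project"
)
print(get_cmd)
print(logs_cmd)
```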


#### Convert our local code to a distributed serverless function object 

In [9]:
gen_func = code_to_function(name='gen_iris', kind='job')
skproj.set_function(gen_func)

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f00230156a0>

## Create a Fully Automated ML Pipeline

#### Add more functions to our project to be used in our pipeline (from the functions hub/marketplace)

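The functions added are AutoML training (sklearn_classifier), model validation (test_classifier), a real-time model server, a model REST API tester, and GitHub utilities (github_utils), which the pipeline uses in its exit handler.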

In [10]:
skproj.set_function('hub://sklearn_classifier', 'train')
skproj.set_function('hub://test_classifier', 'test')
skproj.set_function('hub://model_server', 'serving')
skproj.set_function('hub://model_server_tester', 'live_tester')
skproj.set_function('hub://github_utils:development', 'git_utils')
#print(skproj.to_yaml())

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f0023710c88>

#### Define and save a pipeline 

The following workflow definition is written to a file. It describes a Kubeflow execution graph (DAG)<br>
and how functions and data are connected to form an end-to-end pipeline.

* Build the iris generator (ingest) function container 
* Ingest the iris data
* Analyze the dataset (describe)
* Train and test the model
* Deploy the model as a real-time serverless function
* Test the serverless function REST API with test dataset

Check the code below to see how function objects are initialized and used (by name) inside the workflow.<br>
The `workflow.py` file has two parts: initializing the function objects and defining the pipeline DSL (connecting the function inputs and outputs).

> Note: the pipeline can include CI steps like building container images and deploying models, as illustrated in the following example.


In [23]:
%%writefile ./workflow.py
from kfp import dsl
from mlrun import mount_v3io, NewTask


funcs = {}
this_project = None
DATASET = 'iris_dataset'
LABELS  = "label"

# init functions is used to configure function resources and local settings
def init_functions(functions: dict, project=None, secrets=None):
    for f in functions.values():
        f.apply(mount_v3io())
     
    # uncomment this line to collect the inference results into a stream
    # and specify a path in V3IO (<datacontainer>/<subpath>)
    #functions['serving'].set_env('INFERENCE_STREAM', 'users/admin/model_stream')

    
@dsl.pipeline(
    name="Demo training pipeline",
    description="Shows how to use mlrun."
)
def kfpipeline():
    
    
    exit_task = NewTask(handler='run_summary_comment')
    exit_task.with_params(workflow_id='{{workflow.uid}}', 
                          repo=this_project.params.get('git_repo'),
                          issue=this_project.params.get('git_issue'))
    exit_task.with_secrets('inline', {'GITHUB_TOKEN': this_project.get_secret('GITHUB_TOKEN')})
    with dsl.ExitHandler(funcs['git_utils'].as_step(exit_task, name='exit-handler')):

        # run the ingestion function with the new image and params
        ingest = funcs['gen-iris'].as_step(
            name="get-data",
            handler='iris_generator',
            params={'format': 'pq'},
            outputs=[DATASET])

        # train with hyper-parameters
        train = funcs["train"].as_step(
            name="train",
            params={"sample"          : -1, 
                    "label_column"    : LABELS,
                    "test_size"       : 0.10},
            hyperparams={'model_pkg_class': ["sklearn.ensemble.RandomForestClassifier", 
                                             "sklearn.linear_model.LogisticRegression",
                                             "sklearn.ensemble.AdaBoostClassifier"]},
            selector='max.accuracy',
            inputs={"dataset"         : ingest.outputs[DATASET]},
            labels={"commit": this_project.params.get('commit', '')},
            outputs=['model', 'test_set'])

        # test and visualize our model
        test = funcs["test"].as_step(
            name="test",
            params={"label_column": LABELS},
            inputs={"models_path" : train.outputs['model'],
                    "test_set"    : train.outputs['test_set']})

        # deploy our model as a serverless function
        deploy = funcs["serving"].deploy_step(models={f"{DATASET}_v1": train.outputs['model']}, 
                                              tag=this_project.params.get('commit', 'v1'))

        # test our new model server (via REST API calls)
        tester = funcs["live_tester"].as_step(name='model-tester',
            params={'addr': deploy.outputs['endpoint'], 'model': f"{DATASET}_v1"},
            inputs={'table': train.outputs['test_set']})


Overwriting ./workflow.py
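
The way `workflow.py` passes one step's outputs into the next step's inputs implies an execution order. As a sketch, those data dependencies can be written down as a plain graph and sorted with the standard library. The step names follow the `name=` arguments in the file, except `deploy`, which is a label chosen here for the `deploy_step` call; the exit handler is left out because it wraps the whole block rather than consuming a step output.

```python
from graphlib import TopologicalSorter

# step name -> set of steps whose outputs it consumes (read off workflow.py)
dependencies = {
    "get-data": set(),
    "train": {"get-data"},               # inputs: ingest.outputs[DATASET]
    "test": {"train"},                   # inputs: model and test_set
    "deploy": {"train"},                 # models: train.outputs['model']
    "model-tester": {"deploy", "train"}, # endpoint address and test_set
}

# static_order() yields each step only after all of its predecessors
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Steps with no dependency between them, such as `test` and `deploy`, may appear in either order here, which mirrors the fact that a pipeline engine is free to run them in parallel.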


In [24]:
# register the workflow file as "main", embed the workflow code into the project YAML
skproj.set_workflow('main', 'workflow.py')

Save the project definitions to a file (project.yaml). It is recommended to commit all changes to a Git repo.

In [25]:
skproj.artifact_path = 'v3io:///users/admin/pipe/{{workflow.uid}}'
skproj.save()
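
The `{{workflow.uid}}` part of the artifact path is a placeholder that is filled in per pipeline run, so each run writes to its own directory. The snippet below only imitates the effect with a string replacement; the real substitution is performed by the pipeline engine at run time, `resolve_artifact_path` is a hypothetical helper, and the uid value is invented.

```python
def resolve_artifact_path(template, workflow_uid):
    """Imitate the run-time substitution of the {{workflow.uid}} placeholder."""
    return template.replace("{{workflow.uid}}", workflow_uid)


template = "v3io:///users/admin/pipe/{{workflow.uid}}"
# "1234-abcd" is a made-up uid used only for illustration
resolved = resolve_artifact_path(template, "1234-abcd")
print(resolved)
```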

### Set parameters for test

In [28]:
skproj.params['git_repo'] = 'yaronha/tstactions'
skproj.params['git_issue'] = 4
skproj.with_secrets('inline', {'GITHUB_TOKEN': '<your git token>'})

<mlrun.projects.project.MlrunProject at 0x7f002695d588>
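
Since the notebook is meant to be committed to Git, it may be safer to read the token from the environment than to paste it into a cell. The following is a minimal sketch of that idea, assuming the token is exported as `GITHUB_TOKEN`; `load_github_token` is a hypothetical helper, and the demo passes an explicit mapping so it runs without a real token.

```python
import os


def load_github_token(env=os.environ, var_name="GITHUB_TOKEN"):
    """Return the token from the environment, failing loudly if it is missing."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"set the {var_name} environment variable first")
    return token


# demo with an explicit mapping instead of the real process environment
secrets = {"GITHUB_TOKEN": load_github_token({"GITHUB_TOKEN": "dummy-token"})}
print(sorted(secrets))
# in the notebook, the dict could then be passed on as before:
# skproj.with_secrets('inline', secrets)
```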

<a id='run-pipeline'></a>
## Run a pipeline workflow
Use the `run` method to execute a workflow. You can provide alternative arguments and specify the default target for workflow artifacts.<br>
The workflow ID is returned and can be used to track the progress, or you can use the hyperlinks.

> Note: The same command can be issued through CLI commands:<br>
    `mlrun project my-proj/ -r main -p "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"`

The `dirty` flag allows us to run a project with uncommitted changes (when the notebook is in the same Git directory, the project is always dirty).

In [29]:
run_id = skproj.run(
    'main',
    arguments={}, 
    dirty=True)

[mlrun] 2020-05-12 15:49:08,108 Pipeline run id=168edbf9-5e02-40d9-b47d-1258a6148fef, check UI or DB for progress


#### Track pipeline results

In [28]:
from mlrun import get_run_db
db = get_run_db().connect()
db.list_runs(project=skproj.name, labels=f'workflow={run_id}').show()



| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sk-project | ...22adb6ee | 0 | May 09 22:09:43 | completed | model-tester | host=model-tester-sq7mm<br>kind=job<br>owner=admin<br>v3io_user=admin<br>workflow=ed91b97d-505b-4337-bff1-bfe647216c9e | table | addr=http://18.221.106.241:30785<br>model=iris_dataset_v1 | avg_latency=4580<br>errors=0<br>match=14<br>max_latency=6472<br>min_latency=4155<br>total_tests=15 | latency |
| sk-project | ...cb8e215d | 0 | May 09 22:09:30 | completed | test | host=test-bcr7h<br>kind=job<br>owner=admin<br>v3io_user=admin<br>workflow=ed91b97d-505b-4337-bff1-bfe647216c9e | models_path<br>test_set | label_column=label | accuracy=0.9333333333333333<br>avg_precscore=0.3673982494785104<br>f1_score=0.9333333333333333<br>rocauc=0.3333333333333333 | roc<br>confusion<br>test_set_preds |
| sk-project | ...12c2c896 | 0 | May 09 22:09:17 | completed | summary | host=summary-l2ddf<br>kind=job<br>owner=admin<br>v3io_user=admin<br>workflow=ed91b97d-505b-4337-bff1-bfe647216c9e | table | label_column=label | scale_pos_weight=1.00 | histograms<br>imbalance<br>correlation |
| sk-project | ...87d9b3dc | 0 | May 09 22:09:17 | completed | train-skrf | kind=job<br>owner=admin<br>v3io_user=admin<br>workflow=ed91b97d-505b-4337-bff1-bfe647216c9e | dataset | label_column=label<br>sample=-1<br>test_size=0.1 | accuracy=0.9705882352941176<br>best_iteration=2<br>f1_score=0.9705882352941176<br>rocauc=0.9945117845117846 | test_set<br>roc<br>confusion<br>model<br>iteration_results |
| sk-project | ...847c0a88 | 0 | May 09 22:09:06 | completed | get-data | host=get-data-jzglm<br>kind=job<br>owner=admin<br>v3io_user=admin<br>workflow=ed91b97d-505b-4337-bff1-bfe647216c9e | | format=pq | | iris_dataset |
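
The training run in this listing reports a `best_iteration` result, which comes from combining a list of `hyperparams` values with `selector='max.accuracy'` in the workflow's train step. The sketch below shows one plausible reading of such a selector: run one iteration per candidate value, then keep the iteration whose result is largest. `select_best` is a hypothetical helper, and the accuracy numbers are invented for illustration; they are not taken from the run above.

```python
def select_best(iterations, selector):
    """Pick one iteration using a '<max|min>.<result key>' selector string."""
    operator, key = selector.split(".", 1)
    chooser = max if operator == "max" else min
    return chooser(iterations, key=lambda item: item["results"][key])


# invented accuracy values, one entry per model_pkg_class candidate
iterations = [
    {"iter": 1, "param": "sklearn.ensemble.RandomForestClassifier",
     "results": {"accuracy": 0.90}},
    {"iter": 2, "param": "sklearn.linear_model.LogisticRegression",
     "results": {"accuracy": 0.95}},
    {"iter": 3, "param": "sklearn.ensemble.AdaBoostClassifier",
     "results": {"accuracy": 0.85}},
]
best = select_best(iterations, "max.accuracy")
print(best["iter"], best["param"])
```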


**[back to top](#top)**