# E2E Serverless ML pipeline - Ingest, Train, Auto Deploy Model
  --------------------------------------------------------------------

This notebook uses the classic Iris dataset to demonstrate how to define and automate an end-to-end ML pipeline.

#### **notebook how-to's**
* Write and test an ML pipeline in a notebook.
* Run hyper-parameter tests.
* Convert the code to serverless functions and run them in the cluster.
* Define an ML pipeline DAG (using KubeFlow Pipelines)
  * with 4 steps: data prep, training, model deployment, model report
* Check the pipeline results from the notebook.

<a id='top'></a>
#### **steps**
**[define a new function and its dependencies](#define-function)**<br>
**[run the data collection and training locally](#test-locally)**<br>
**[running a task with Hyper parameters (GridSearch)](#hyper-param)**<br>
**[define cluster jobs, build images and run](#build)**<br>
**[Create a multi-stage KubeFlow Pipeline from our functions](#pipeline)**<br>

In [1]:
# nuclio: ignore
import nuclio

<a id='define-function'></a>
### **define a new function and its dependencies**

In [2]:
# %%nuclio cmd -c
# pip install scikit-learn
# pip install xgboost
# pip install matplotlib

In [4]:
# use this to suppress XGB FutureWarning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [5]:
import xgboost as xgb
import os
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import accuracy_score
from mlrun.artifacts import TableArtifact, PlotArtifact
import pandas as pd


def iris_generator(context):
    iris = load_iris()
    iris_dataset = pd.DataFrame(data=iris.data, columns=iris.feature_names)
    iris_labels = pd.DataFrame(data=iris.target, columns=['label'])
    iris_dataset = pd.concat([iris_dataset, iris_labels], axis=1)
    context.logger.info('saving iris dataframe to {}'.format(context.out_path))
    context.log_artifact(TableArtifact('iris_dataset', df=iris_dataset))
    

def xgb_train(context, 
              dataset='',
              model_name='model.bst',
              max_depth=6,
              num_class=10,
              eta=0.2,
              gamma=0.1,
              steps=20):

    df = pd.read_csv(dataset)
    X = df.drop(['label'], axis=1)
    y = df['label']
    
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
    dtrain = xgb.DMatrix(X_train, label=Y_train)
    dtest = xgb.DMatrix(X_test, label=Y_test)

    # Get params from event
    param = {"max_depth": max_depth,
             "eta": eta, "nthread": 4,
             "num_class": num_class,
             "gamma": gamma,
             "objective": "multi:softprob"}

    # Train model
    xgb_model = xgb.train(param, dtrain, steps)

    preds = xgb_model.predict(dtest)
    best_preds = np.asarray([np.argmax(line) for line in preds])

    # log results and artifacts
    context.log_result('accuracy', float(accuracy_score(Y_test, best_preds)))
    context.log_artifact('model', body=bytes(xgb_model.save_raw()), 
                         local_path=model_name, labels={'framework': 'xgboost'})
    
    
import matplotlib
import matplotlib.pyplot as plt
from io import BytesIO

def plot_iter(context, iterations, col='accuracy', num_bins=10):
    df = pd.read_csv(BytesIO(iterations.get()))
    x = df['output.{}'.format(col)]
    fig, ax = plt.subplots(figsize=(6,6))
    n, bins, patches = ax.hist(x, num_bins, density=1)
    ax.set_xlabel('Accuracy')
    ax.set_ylabel('Density')  # hist is called with density=1, so the bars are normalized
    context.log_artifact(PlotArtifact('myfig', body=fig))

The following end-code annotation tells ```nuclio``` to stop parsing the notebook from this cell. _**Please do not remove this cell**_:

In [6]:
# nuclio: end-code
# marks the end of a code section
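
The effect of the `# nuclio: ignore` and `# nuclio: end-code` annotations can be pictured with the simplified sketch below. It is only an illustration of the idea, not the actual nuclio parser: `collect_function_code` and the sample cells are hypothetical, and the sketch assumes the annotation is the first line of a cell.

```python
def collect_function_code(cells):
    """Join cell sources up to the end-code marker, skipping ignored cells."""
    kept = []
    for source in cells:
        stripped = source.strip()
        first_line = stripped.splitlines()[0] if stripped else ""
        if first_line.startswith("# nuclio: end-code"):
            break  # nothing from this cell onward is exported
        if first_line.startswith("# nuclio: ignore"):
            continue  # this cell is for the notebook only
        kept.append(source)
    return "\n\n".join(kept)


# Hypothetical notebook cells, in order of appearance
cells = [
    "# nuclio: ignore\nimport nuclio",
    "def handler(context):\n    return 'ok'",
    "# nuclio: end-code",
    "handler(None)  # local test, not exported",
]
exported = collect_function_code(cells)
print(exported)
```

In this sketch only the `handler` definition ends up in the exported code, which is why local test cells belong below the end-code cell.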

<a id='test-locally'></a>
### run the data collection and training locally

The functions above can be tested locally. Parameters, inputs, and outputs can be specified in the API or the `Task` object.

We use the ```local``` runtime by default; later on we will use a ```job``` runtime to run the code in containers.

In each run we can specify the function, inputs, parameters/hyper-parameters, and so on. For more details, see the [mlrun_basics notebook](mlrun_basics.ipynb).
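
As a minimal sketch of what a handler needs from its context, the stand-in below mimics only the members the handlers above use (`log_result`, `log_artifact`, `logger.info`, `out_path`). `StubContext` and `toy_handler` are hypothetical names used for illustration and are not part of MLRun; a real run through `new_function().run(...)` supplies the actual context object.

```python
import logging


class StubContext:
    """Hypothetical stand-in that records what a handler logs (not an MLRun class)."""

    def __init__(self, out_path=""):
        self.out_path = out_path
        self.logger = logging.getLogger("stub-context")
        self.results = {}
        self.artifacts = []

    def log_result(self, key, value):
        # keep results in a dict so a test can inspect them
        self.results[key] = value

    def log_artifact(self, item, **kwargs):
        # record the artifact (or its key) together with any keyword arguments
        self.artifacts.append((item, kwargs))


def toy_handler(context, p1=1, p2=2):
    """Toy handler with the same (context, **params) shape as xgb_train."""
    context.logger.info("running with p1=%s p2=%s", p1, p2)
    context.log_result("total", p1 + p2)
    context.log_artifact("report", body=b"done")


ctx = StubContext(out_path="/tmp/artifacts")
toy_handler(ctx, p1=3, p2=4)
print(ctx.results)
```

A stub like this can be handy for quick unit tests of handler logic, while the runs below exercise the real tracking and artifact storage.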

In [7]:
from mlrun import new_function, code_to_function, NewTask, v3io_cred, new_model_server, mlconf, get_run_db, mount_v3io
# for local DB path use 'User/mlrun' instead 
mlconf.dbpath = 'http://mlrun-api:8080'

#### Generate the iris dataset and store in a CSV

In [8]:
out_path='/User/artifacts/xgb-demo-project'

In [9]:
gen = new_function().run(name='iris_gen', 
                         handler=iris_generator, 
                         out_path=out_path) 

[mlrun] 2020-04-29 21:20:52,583 starting run iris_gen uid=617bb603696a47e2ab82c10d584facee  -> http://mlrun-api:8080
[mlrun] 2020-04-29 21:20:52,617 .out_path will soon be deprecated, use .artifact_path
[mlrun] 2020-04-29 21:20:52,617 saving iris dataframe to /User/artifacts/xgb-demo-project
[mlrun] 2020-04-29 21:20:52,639 log artifact iris_dataset at /User/artifacts/xgb-demo-project/iris_dataset.csv, size: 2776, db: Y



project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
default,...584facee,0,Apr 29 21:20:52,completed,iris_gen,v3io_user=admin kind=handler owner=admin host=jupyter-6b586bcb85-4f5j7,,,,iris_dataset


to track results use .show() or .logs() or in CLI: 
!mlrun get run 617bb603696a47e2ab82c10d584facee --project default , !mlrun logs 617bb603696a47e2ab82c10d584facee --project default
[mlrun] 2020-04-29 21:20:52,693 run executed, status=completed


#### define a training task and run locally

In [10]:
task = NewTask(handler=xgb_train, out_path=out_path, 
               inputs={'dataset': gen.outputs['iris_dataset']})
task.with_params(eta=0.1, max_depth=6, gamma=0.1)

<mlrun.model.RunTemplate at 0x7f5d3d154908>

In [11]:
run = new_function().run(task)

[mlrun] 2020-04-29 21:20:52,718 starting run mlrun-cdfbf2-xgb_train uid=2eeb027d2acc4db0a3e1c56b4e35731c  -> http://mlrun-api:8080
[mlrun] 2020-04-29 21:20:52,791 log artifact model at /User/artifacts/xgb-demo-project/model.bst, size: 49772, db: Y



project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
default,...4e35731c,0,Apr 29 21:20:52,completed,mlrun-cdfbf2-xgb_train,v3io_user=admin kind=handler owner=admin host=jupyter-6b586bcb85-4f5j7,dataset,eta=0.1 max_depth=6 gamma=0.1,accuracy=0.9333333333333333,model


to track results use .show() or .logs() or in CLI: 
!mlrun get run 2eeb027d2acc4db0a3e1c56b4e35731c --project default , !mlrun logs 2eeb027d2acc4db0a3e1c56b4e35731c --project default
[mlrun] 2020-04-29 21:20:52,829 run executed, status=completed


<a id="hyper-param" ></a>
### running a task with Hyper parameters (GridSearch)

In many cases we want to run a job with multiple parameter combinations. To do so, we can create a task with hyper params (a list of possible values per parameter) and MLRun will run all the combinations.

MLRun stores the results of all iterations (see the `iteration_results` artifact). You can specify which result is the best (it will be treated as the overall task output) using a selection criterion, e.g. `max.accuracy` selects the iteration with the maximum value of the `accuracy` result.
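
The grid expansion and the selection criterion can be sketched in plain Python as follows. This is an illustration of the concept, not MLRun's implementation: `expand_grid`, `select_best`, and `fake_score` are hypothetical helpers, `fake_score` returns a made-up number instead of training a model, and support for a `min.` prefix is assumed by symmetry with `max.`.

```python
from itertools import product

parameters = {
    "eta": [0.10, 0.20],
    "max_depth": [3, 6, 10],
    "gamma": [0.1, 0.3],
}


def expand_grid(grid):
    """Yield one dict per combination of the listed values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def select_best(iterations, selector):
    """Pick an iteration by a '<max|min>.<result>' selector string."""
    op, key = selector.split(".", 1)
    if op not in ("max", "min"):
        raise ValueError(f"unsupported selector: {selector}")
    chooser = max if op == "max" else min
    return chooser(iterations, key=lambda it: it["results"][key])


def fake_score(params):
    """Placeholder for training; the value is invented, not a real accuracy."""
    return 1.0 - abs(params["eta"] - 0.2) - 0.01 * params["max_depth"]


iterations = [
    {"iter": i, "params": p, "results": {"accuracy": fake_score(p)}}
    for i, p in enumerate(expand_grid(parameters), start=1)
]
best = select_best(iterations, "max.accuracy")
print(len(iterations), best["iter"], best["params"])
```

With two values for `eta`, three for `max_depth`, and two for `gamma`, the grid expands to 2 x 3 x 2 = 12 combinations, and the selector reduces them to a single winning iteration.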

In [12]:
# test our function locally with multiple parameters
parameters = {
     "eta":       [0.10, 0.20],
     "max_depth": [3, 6, 10],
     "gamma":     [0.1, 0.3],
     }

hyper_task = NewTask(handler=xgb_train, out_path=out_path, 
                     inputs={'dataset': gen.outputs['iris_dataset']})
hyper_task.with_hyper_params(parameters, 'max.accuracy')
run = new_function().run(hyper_task)

[mlrun] 2020-04-29 21:20:52,843 starting run mlrun-4422a5-xgb_train uid=e597c6e4e33047eea89aaab5d31a7306  -> http://mlrun-api:8080
> --------------- Iteration: (1) ---------------
[mlrun] 2020-04-29 21:20:52,916 log artifact model at /User/artifacts/xgb-demo-project/1/model.bst, size: 48980, db: Y

> --------------- Iteration: (2) ---------------
[mlrun] 2020-04-29 21:20:52,989 log artifact model at /User/artifacts/xgb-demo-project/2/model.bst, size: 49700, db: Y

> --------------- Iteration: (3) ---------------
[mlrun] 2020-04-29 21:20:53,064 log artifact model at /User/artifacts/xgb-demo-project/3/model.bst, size: 49844, db: Y

> --------------- Iteration: (4) ---------------
[mlrun] 2020-04-29 21:20:53,132 log artifact model at /User/artifacts/xgb-demo-project/4/model.bst, size: 48764, db: Y

> --------------- Iteration: (5) ---------------
[mlrun] 2020-04-29 21:20:53,246 log artifact model at /User/artifacts/xgb-demo-project/5/model.bst, size: 50060, db: Y

> --------------- Iterat

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
default,...d31a7306,0,Apr 29 21:20:52,completed,mlrun-4422a5-xgb_train,v3io_user=admin kind=handler owner=admin,dataset,,best_iteration=1 accuracy=1.0,model iteration_results


to track results use .show() or .logs() or in CLI: 
!mlrun get run e597c6e4e33047eea89aaab5d31a7306 --project default , !mlrun logs e597c6e4e33047eea89aaab5d31a7306 --project default
[mlrun] 2020-04-29 21:20:53,914 run executed, status=completed


<a id="build"></a>
______________________________________________
### **define cluster jobs and build images**

In order to use our function in a cluster we need to package our code and dependencies.

The ```code_to_function``` call will automatically generate a ```function``` object from the current notebook (or a specified file) with its list of dependencies and runtime configuration.

The `.deploy()` command will build the dependencies and image required for running our function.

We use `.apply(mount_v3io())` to attach a v3io (Iguazio data fabric) volume to our function. By default, v3io mounts the current user's home directory into the `/User` path of the function.

Alternatively, we can use S3 as a data source or target. For that, add AWS credentials to the task and specify paths starting with `s3://`, e.g.:

    task.with_secrets('file', 'secrets.txt')
    out_path='s3://my-bucket/data'

In [13]:
# create the function from the notebook code + annotations
xgbfn = code_to_function('xgb', kind='job').apply(mount_v3io())

In [14]:
#xgbfn.deploy()

**run our task using the cluster job**

In [15]:
task.with_input('dataset', gen.outputs['iris_dataset'])
nrun = xgbfn.run(task, handler='xgb_train', out_path=out_path, watch=True)

[mlrun] 2020-04-29 21:20:57,595 starting run xgb-xgb_train uid=ff13a88da0f14e7985dd5f06a1ab242a  -> http://mlrun-api:8080
[mlrun] 2020-04-29 21:20:57,678 Job is running in the background, pod: xgb-xgb-train-s85p5
[mlrun] 2020-04-29 21:21:02,016 log artifact model at /User/artifacts/xgb-demo-project/model.bst, size: 48836, db: Y

[mlrun] 2020-04-29 21:21:02,031 run executed, status=completed
final state: succeeded


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
default,...a1ab242a,0,Apr 29 21:21:01,completed,xgb-xgb_train,host=xgb-xgb-train-s85p5 kind=job owner=admin v3io_user=admin,dataset,eta=0.1 gamma=0.1 max_depth=6,accuracy=0.9666666666666667,model


to track results use .show() or .logs() or in CLI: 
!mlrun get run ff13a88da0f14e7985dd5f06a1ab242a  , !mlrun logs ff13a88da0f14e7985dd5f06a1ab242a 
[mlrun] 2020-04-29 21:21:03,845 run executed, status=completed


<a id="pipeline"></a>
______________________________________________
## Create a multi-stage KubeFlow Pipeline from our functions
* Load Iris dataset into a CSV
* Train a model using XGBoost with hyper-parameters
* Deploy the model using Nuclio-serving
* Generate a plot of the training results
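
The stages above form a small DAG: training consumes the ingested dataset, while deployment and plotting both consume training outputs. As a rough sketch of how such dependencies determine a valid execution order, the snippet below uses the standard-library `graphlib` module (Python 3.9+). The step labels are illustrative, and KubeFlow Pipelines derives the actual ordering itself from the inputs and outputs wired between steps.

```python
from graphlib import TopologicalSorter

# step -> set of steps whose outputs it consumes (labels are illustrative)
dependencies = {
    "ingest_iris": set(),
    "xgb_train": {"ingest_iris"},
    "plot": {"xgb_train"},
    "deploy": {"xgb_train"},
}

# static_order() yields every step after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

`plot` and `deploy` do not depend on each other, so they may appear in either order and can run in parallel once training completes.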

In [16]:
import kfp
from kfp import dsl

**create a model serving function from the [model-serving notebook](nuclio-serving.ipynb)** 

This function will be used in our workflow

In [17]:
# define a nuclio-serving function, generated from a remote notebook file
srvfn = new_model_server('iris-xgb-demo-project',
                         model_class='XGBoostModel', 
                         filename='/User/repos/functions/xgb_serving/xgb_serving.ipynb')

# attach to the fabric (to read the model file)
srvfn.apply(mount_v3io())

<mlrun.runtimes.function.RemoteRuntime at 0x7f5d18da86a0>

**define a 4 step workflow with hyper-params**

In [18]:
@dsl.pipeline(
    name='My XGBoost training pipeline',
    description='Shows how to use mlrun.'
)
def xgb_pipeline(
   eta = [0.1, 0.2, 0.3], gamma = [0.1, 0.2, 0.3]
):

    ingest = xgbfn.as_step(name='ingest_iris', handler='iris_generator',
                          outputs=['iris_dataset'])

    
    train = xgbfn.as_step(name='xgb_train', handler='xgb_train',
                          hyperparams = {'eta': eta, 'gamma': gamma},
                          selector='max.accuracy',
                          inputs = {'dataset': ingest.outputs['iris_dataset']}, 
                          outputs=['model'])

    
    plot = xgbfn.as_step(name='plot', handler='plot_iter',
                         inputs={'iterations': train.outputs['iteration_results']},
                         outputs=['myfig'])  # plot_iter logs its figure under the key 'myfig'

    # deploy the model serving function with inputs from the training stage
    deploy = srvfn.deploy_step(project = 'sklearn-servers', models={'iris_v1': train.outputs['model']})

#### Create a KubeFlow client and submit the pipeline with parameters

**define the artifacts output path**
The pipeline outputs will be written to the artifacts path directory. The path can be a file path (requires volume mounts) or an object path (v3io://, s3://, ..).

If we specify `{{workflow.uid}}` in the path, it will be replaced with the actual workflow ID, so every workflow run stores its artifacts in a unique location for reproducibility.
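
The placeholder behaves like a simple template. The sketch below only illustrates the resulting paths: `resolve_artifact_path` is a hypothetical helper, and the random UUIDs stand in for workflow IDs. In a real run the substitution is performed by the workflow engine, not by user code.

```python
import uuid

artifact_template = "v3io:///users/admin/artifacts/xgb_trainer/{{workflow.uid}}/"


def resolve_artifact_path(template, workflow_uid):
    """Replace the {{workflow.uid}} placeholder with a concrete workflow ID."""
    return template.replace("{{workflow.uid}}", workflow_uid)


# two simulated workflow runs, each with its own (random) ID
run_a = resolve_artifact_path(artifact_template, str(uuid.uuid4()))
run_b = resolve_artifact_path(artifact_template, str(uuid.uuid4()))
print(run_a)
print(run_b)
```

Because each run resolves to a different directory, artifacts from one run are not overwritten by the next.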

In [19]:
from mlrun import run_pipeline
artifact_path = 'v3io:///users/admin/artifacts/xgb_trainer/{{workflow.uid}}/'
arguments = {'eta': [0.05, 0.10, 0.40, 0.5], 'gamma': [0.1, 0.3, 0.6]}

In [20]:
id = run_pipeline(xgb_pipeline, arguments, experiment='demo-xgb-project', artifact_path=artifact_path)



[mlrun] 2020-04-29 21:21:05,718 Pipeline run id=cd5b05dc-5f54-45be-8b00-780916ea19d9, check UI or DB for progress


### check the results of our pipeline

In [21]:
# connect to the run db 
db = get_run_db().connect()

In [22]:
# query the DB with filter on workflow ID (only show this workflow) 
db.list_runs('', labels=f'workflow={id}').show()

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts


**[back to top](#top)**