# Part 4: Pipeline Automation  

This part of the feature store demo walks you through the steps for creating an automated pipeline for our project.
The pipeline is created using [Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/pipelines-quickstart/).

The integration of MLRun with Kubeflow Pipelines enables you to take the functions in your project and build a pipeline that contains these functions.

> **Note**: The Iguazio Data Science Platform has a default (pre-deployed) shared Kubeflow Pipelines service (`pipelines`).

An ML Engineer can gather the different functions created by the Data Engineer and Data Scientist and create this automated pipeline.

## Environment Setup

In [1]:
import mlrun

project_name, _ = mlrun.set_environment(project='fraud-demo', 
                                        user_project=True, artifact_path='./artifact')

In [2]:
mlrun_project = mlrun.new_project(project_name)
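
The first cell keeps the project name returned by `set_environment` rather than reusing the literal `'fraud-demo'`, because `user_project=True` makes the name specific to the current user. The following is a minimal plain-Python sketch of that idea. It is not MLRun code: the helper name `user_scoped_project`, the `V3IO_USERNAME` lookup, and the exact `<project>-<user>` format are assumptions for illustration, so check the MLRun documentation for the behavior of your version.

```python
import getpass
import os
from typing import Optional


def user_scoped_project(project: str, username: Optional[str] = None) -> str:
    """Hypothetical helper (not an MLRun API): make a project name unique per user."""
    # Assumption: the user name comes from the V3IO_USERNAME environment
    # variable, falling back to the operating-system user.
    user = username or os.environ.get("V3IO_USERNAME") or getpass.getuser()
    # Assumption: the user name is appended after a hyphen.
    return f"{project}-{user}"


print(user_scoped_project("fraud-demo", username="admin"))  # fraud-demo-admin
```

Whatever the exact format, later cells should use the returned `project_name` (as this notebook does) so that store URIs point at the user's own project.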

## Create a Fully Automated ML Pipeline

#### Add more functions to our project to be used in our pipeline (from the functions hub/marketplace)

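The following cell adds five functions from the hub: dataset description (`describe`), AutoML training (`sklearn_classifier`), model validation (`test_classifier`), a real-time model server (`v2_model_server`), and a model REST API tester (`v2_model_tester`).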

In [3]:
mlrun_project.set_function('hub://describe')
mlrun_project.set_function('hub://sklearn_classifier', 'train')
mlrun_project.set_function('hub://test_classifier', 'test')
mlrun_project.set_function('hub://v2_model_server', 'serving')
mlrun_project.set_function('hub://v2_model_tester', 'live_tester')

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f1b87a9c7d0>

#### Define and save a pipeline 

The following workflow definition is written to a file. It describes a Kubeflow execution graph (DAG)<br>
and how functions and data are connected to form an end-to-end pipeline.
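
As a rough illustration of what "execution graph" means here, the dependencies between the four steps defined in `workflow.py` (the next cell) can be written down as a plain dictionary and ordered with the standard library. This is only a sketch for reasoning about the graph, not how Kubeflow Pipelines represents it. In the workflow itself, a dependency arises when one step's outputs are passed as another step's inputs. The `dependencies` dictionary is hand-written for this example.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
# Keys mirror the step names and function keys used in workflow.py.
dependencies = {
    "summary": set(),      # reads only the feature vector
    "train": set(),        # reads only the feature vector
    "test": {"train"},     # consumes train.outputs['model'] and ['test_set']
    "serving": {"train"},  # consumes train.outputs['model']
}

# Any valid execution order places "train" before "test" and "serving".
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because `summary` and `train` do not depend on each other, a scheduler is free to run them in parallel, while `test` and `serving` must wait for `train` to finish.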


In [4]:
%%writefile workflow.py
from kfp import dsl
from mlrun import mount_v3io

funcs = {}

model_list = {"model_name": ['transaction_fraud_rf','transaction_fraud_xgboost', 
                             'transaction_fraud_adaboost'],
              
              "model_pkg_class": ['sklearn.ensemble.RandomForestClassifier',
                                  'sklearn.ensemble.GradientBoostingClassifier',
                                  'sklearn.ensemble.AdaBoostClassifier']}


# init_functions is used to configure function resources and local settings
def init_functions(functions: dict, project=None, secrets=None):
    for f in functions.values():
        f.apply(mount_v3io())
    
@dsl.pipeline(
    name="Demo training pipeline",
    description="Shows how to use mlrun."
)
def kfpipeline(vector, label):
       
    # analyze our dataset
    describe = funcs["describe"].as_step(
        name="summary",
        params={"label_column": label},
        inputs={"table": vector})
    
    # train with hyper-parameters
    train = funcs["train"].as_step(
        name="train",
        params={"label_column": label},
        hyperparams=model_list,
        selector='max.accuracy',
        inputs={"dataset"         : vector},
        outputs=['model', 'test_set'])

    # test and visualize our model
    test = funcs["test"].as_step(
        name="test",
        params={"label_column": label},
        inputs={"models_path" : train.outputs['model'],
                "test_set"    : train.outputs['test_set']})

    # deploy our model as a serverless function, we can pass a list of models to serve 
    deploy = funcs["serving"].deploy_step(
        models=[{"key": "fraud:v1", "model_path": train.outputs['model']}])

Writing workflow.py
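
The `train` step receives `model_list` as `hyperparams`, a dictionary of value lists, and `selector='max.accuracy'` keeps the run with the highest accuracy. How the lists are combined depends on the hyperparameter strategy in use (MLRun documents grid and list strategies, among others). The sketch below, in plain Python with hypothetical helper names that are not MLRun APIs, shows what each combination rule would produce for this dictionary. It does not call MLRun, so check which strategy applies in the MLRun version you run.

```python
from itertools import product

model_list = {
    "model_name": [
        "transaction_fraud_rf",
        "transaction_fraud_xgboost",
        "transaction_fraud_adaboost",
    ],
    "model_pkg_class": [
        "sklearn.ensemble.RandomForestClassifier",
        "sklearn.ensemble.GradientBoostingClassifier",
        "sklearn.ensemble.AdaBoostClassifier",
    ],
}


def grid_combinations(params):
    """Cartesian product: every model name paired with every class."""
    keys = list(params)
    return [dict(zip(keys, values)) for values in product(*params.values())]


def paired_combinations(params):
    """Element-wise pairing: first name with first class, and so on."""
    keys = list(params)
    return [dict(zip(keys, values)) for values in zip(*params.values())]


print(len(grid_combinations(model_list)))    # 9
print(len(paired_combinations(model_list)))  # 3
```

With a grid rule, each of the three classifiers would be trained under each of the three names. With element-wise pairing, each name is matched only with the classifier on the same position in the lists.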


In [5]:
# register the workflow file as "main", embed the workflow code into the project YAML
mlrun_project.set_workflow('main', 'workflow.py')

Save the project definitions to a file (project.yaml). It is recommended to commit all changes to a Git repo.

In [6]:
mlrun_project.save("project.yaml")

<a id='run-pipeline'></a>
## Run a Pipeline Workflow
Use the `run` method to execute a workflow. You can provide alternative arguments and specify the default target for workflow artifacts.<br>
The workflow ID is returned and can be used to track progress, or you can use the hyperlinks.

> **Note**: The same operation can be run from the CLI:<br>
    `mlrun project my-proj/ -r main -p "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"`

The `dirty` flag allows us to run a project with uncommitted changes (when the notebook is in the same Git directory, the project is always dirty).<br>
The `watch` flag waits for the pipeline to complete and prints the results.

In this cell we run the `main` workflow with Kubeflow Pipelines on our cluster.  
Running the pipeline may take some time. Because the Jupyter session can time out, it's best to track the pipeline's progress in the KFP or MLRun UI.
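
The artifact path used in the next cell contains the literal text `{{workflow.uid}}`. It is a placeholder that is filled in at run time with the ID of the workflow run, so each run writes its artifacts to a separate directory. The sketch below imitates that substitution in plain Python. The helper `resolve_artifact_path` is hypothetical and is not part of MLRun or Kubeflow Pipelines, and the ID is only an example value taken from the sample log output in this section.

```python
from os import path

# Same expression as in the notebook: the placeholder stays literal here.
template = path.abspath("./pipe/{{workflow.uid}}")


def resolve_artifact_path(template: str, workflow_uid: str) -> str:
    """Hypothetical helper: substitute the workflow UID placeholder."""
    return template.replace("{{workflow.uid}}", workflow_uid)


# Example value only; a real run gets its own ID.
resolved = resolve_artifact_path(template, "e9d00261-3939-43a7-8bb2-2a26bba14d97")
print(resolved)
```

Note that `path.abspath` resolves `./pipe` against the notebook's working directory, so the resulting path is only meaningful if that directory is also reachable from the pipeline steps.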

In [None]:
from os import path

vector = f'store://feature-vectors/{project_name}/transactions-fraud'

artifact_path = path.abspath('./pipe/{{workflow.uid}}')
run_id = mlrun_project.run(
    'main',
    arguments={'vector': vector, 'label': 'label'}, 
    artifact_path=artifact_path, 
    dirty=True, watch=True)

> 2021-08-11 06:37:14,549 [info] using in-cluster config.


> 2021-08-11 06:37:15,078 [info] Pipeline run id=e9d00261-3939-43a7-8bb2-2a26bba14d97, check UI or DB for progress
> 2021-08-11 06:37:15,079 [info] waiting for pipeline run completion


**[back to top](#top)**