# Tutorial 4: Running a Pipeline

### Overview

This tutorial demonstrates how to create an automated pipeline for our project. <br>
The pipeline is built with Kubeflow Pipelines (on the Iguazio platform, it is available as a built-in service). <br>
The integration with MLRun lets us take the functions previously created in our project and build a pipeline composed of those functions. <br>


## Prerequisites

The tutorial is a continuation of [Tutorial 3](tutorial-3.ipynb). Make sure to complete the prior tutorial before running this tutorial.

### Load Project

In [1]:
from os import path, getenv
from mlrun import load_project
import mlrun

project_path = path.abspath('conf')
project = load_project(project_path)

print(f'Project path: {project_path}\nProject name: {project.name}')
print(f'Artifacts path: {mlrun.mlconf.artifact_path}')

Project path: /User/new-tutorials/conf
Project name: getting-started-tutorial-admin
Artifacts path: /v3io/projects/{{run.project}}/artifacts


### View all existing functions in my project

Run `get_run_db().list_functions` to list the functions in this project, filtering by the `latest` tag. <br>
The project should contain the following functions:
* get-data - the first function, which ingests the Iris dataset into the platform
* describe - generates statistics on the dataset
* train-iris - the training function
* test-classifier - tests the model
* mlrun-model - the serving function

In [2]:
from mlrun import get_run_db
get_run_db().list_functions(project=project.name, tag='latest')

[{'kind': 'job',
  'metadata': {'name': 'get-data',
   'tag': 'latest',
   'project': 'getting-started-tutorial-admin',
   'hash': 'b590278815fff64145e375b13e3ba937e1454bf0',
   'updated': '2021-01-04T14:21:55.538190+00:00'},
  'spec': {'command': '',
   'args': [],
   'image': 'mlrun/ml-models',
   'volumes': [{'flexVolume': {'driver': 'v3io/fuse',
      'options': {'accessKey': '273a4e69-e87b-4d6b-8692-c0aa9f20f939'}},
     'name': 'v3io'}],
   'volume_mounts': [{'mountPath': '/v3io', 'name': 'v3io', 'subPath': ''},
    {'mountPath': '/User', 'name': 'v3io', 'subPath': 'users/admin'}],
   'env': [{'name': 'V3IO_API',
     'value': 'v3io-webapi.default-tenant.svc:8081'},
    {'name': 'V3IO_USERNAME', 'value': 'admin'},
    {'name': 'V3IO_ACCESS_KEY',
     'value': '273a4e69-e87b-4d6b-8692-c0aa9f20f939'}],
   'default_handler': '',
   'entry_points': {'get_data': {'name': 'get_data',
     'doc': '',
     'parameters': [{'name': 'context', 'default': ''},
      {'name': 'source_url', 'd
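
The full listing is verbose, so it can help to reduce it to a short summary. The following is a minimal sketch, assuming each entry is a dictionary with `kind` and `metadata` keys as in the output above. The `summarize_functions` helper and the `sample` list are hypothetical and are not part of MLRun.

```python
def summarize_functions(functions):
    """Return sorted (name, kind, tag) tuples from a list of function dicts."""
    rows = []
    for fn in functions:
        meta = fn.get('metadata', {})
        rows.append((meta.get('name', ''), fn.get('kind', ''), meta.get('tag', '')))
    return sorted(rows)


# Hypothetical sample shaped like the listing above (most fields omitted)
sample = [
    {'kind': 'job', 'metadata': {'name': 'train-iris', 'tag': 'latest'}},
    {'kind': 'job', 'metadata': {'name': 'get-data', 'tag': 'latest'}},
]

for name, kind, tag in summarize_functions(sample):
    print(f'{name:<12} {kind:<8} {tag}')
```

In the notebook, you could pass the list returned by `list_functions` to the helper instead of `sample`.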

You're now ready to create a full ML pipeline.
This is done by using [Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/), which is integrated into the Iguazio Data Science Platform.
Kubeflow Pipelines is an open-source framework for building and deploying portable, scalable machine-learning workflows based on Docker containers.
MLRun leverages this framework to take your existing code and deploy it as steps in the pipeline.

<a id="gs-pipeline-workflow-define-n-save"></a>

### Define and Save a Pipeline Workflow

A pipeline is created by running an MLRun **"workflow"**.
The following code defines a workflow and writes it to a file named **workflow.py** in your project's **conf** directory.
The workflow describes a directed acyclic graph (DAG) for execution using Kubeflow Pipelines, and depicts the connections between the functions and the data as part of an end-to-end pipeline.
The workflow file has two parts &mdash; initialization of the function objects, and definition of a pipeline DSL (domain-specific language) for connecting the function inputs and outputs.
Examine the code to see how function objects are initialized and used (by name) within the workflow.

The defined pipeline includes the following steps:

- Ingest the Iris flower data set (`ingest`).
- Train the model (`train`).
- Test the model with its test dataset (`test`).
- Deploy the model as a real-time serverless function (`deploy`).

> **Note**: A pipeline can also include continuous build integration and deployment (CI/CD) steps, such as building container images and deploying models.

In [3]:
%%writefile {path.join(project_path, 'workflow.py')}

from kfp import dsl
import mlrun
from mlrun.platforms import mount_v3io_extended, mount_v3io


funcs = {}
DATASET = 'source_data'
LABELS = "label"

# Configure function resources and local settings
def init_functions(functions: dict, project=None, secrets=None):
    for f in functions.values():
        # Alternative: f.apply(mount_v3io_extended())
        # Mount the "projects" container into every function
        f.apply(mount_v3io(remote='projects', mount_path='/v3io/projects'))

# Create a Kubeflow Pipelines pipeline
@dsl.pipeline(
    name="Getting-started-tutorial",
    description="This tutorial is designed to demonstrate some of the main "
                "capabilities of the Iguazio Data Science Platform.\n"
                "The tutorial uses the Iris flower data set."
)
def kfpipeline(source_url='https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv'):

    # Ingest the data set
    ingest = funcs['get-data'].as_step(
        name="get-data",
        handler='get_data',
        inputs={'source_url': source_url},
        params={'format': 'csv'},
        outputs=[DATASET])
    
    # Train a model   
    train = funcs["train-iris"].as_step(
        name="train",
        params={"label_column": LABELS},
        inputs={"dataset": ingest.outputs[DATASET]},
        outputs=['model', 'test_set'])
    
    # Test and visualize the model
    test = funcs["test"].as_step(
        name="test",
        params={"label_column": LABELS},
        inputs={"models_path": train.outputs['model'],
                "test_set": train.outputs['test_set']})
    
    # Deploy the model as a serverless function
    deploy = funcs["serving"].deploy_step(
        models={f"{DATASET}_v1": train.outputs['model']})

Overwriting /User/new-tutorials/conf/workflow.py
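
The execution order of the steps follows from how their inputs and outputs are connected. The following is a minimal sketch of that idea in plain Python, using the standard-library `graphlib` module (Python 3.9 or later). It is not how Kubeflow Pipelines schedules steps internally. The `dependencies` mapping is written by hand to mirror the workflow above, and `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps whose outputs it consumes,
# mirroring the inputs wired up in the workflow above.
dependencies = {
    'get-data': set(),
    'train': {'get-data'},
    'test': {'train'},
    'deploy': {'train'},
}


def execution_order(deps):
    """Return one valid execution order for the dependency graph."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because `test` and `deploy` both depend only on `train`, either can come first in a valid ordering, and a pipeline engine is free to run them in parallel.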


#### Register the Workflow

Use the `set_workflow` MLRun project method to register your workflow with MLRun.
The following code sets the `name` parameter to the selected workflow name ("main") and the `code` parameter to the name of the workflow file that is found in your project directory (**workflow.py**).

In [4]:
# Register the workflow file as "main"
project.set_workflow('main', 'workflow.py')

<a id="gs-save-project"></a>

### Save Your Project Configuration

In [5]:
project.save()

### Run the Pipeline

Use the `run` MLRun project method to execute your workflow pipeline with Kubeflow Pipelines.
The tutorial code sets the following method parameters (for the full parameter list, see the MLRun documentation or the embedded help):

- **`name`** &mdash; the workflow name (in this case, "main" &mdash; see the previous step).
- **`arguments`** &mdash; a dictionary of Kubeflow Pipelines arguments (parameters).
  The tutorial code sets this parameter to an empty dictionary (`{}`), but you can edit the code to add arguments.
- **`artifact_path`** &mdash; a path or URL that identifies a location for storing the workflow artifacts.
  You can use `{{workflow.uid}}` in the path to signify the ID of the current workflow run iteration.
  The tutorial code sets the artifacts path to a **&lt;workflow ID&gt;** directory (`{{workflow.uid}}`) in a **pipeline** directory under the project's artifacts path (**/v3io/projects/&lt;project name&gt;/artifacts/pipeline/&lt;workflow ID&gt;**).
- **`dirty`** &mdash; set to `True` to allow running the workflow also when the project's Git repository is dirty (i.e., contains uncommitted changes).
  (When the notebook that contains the execution code is in the same Git directory as the executed workflow, the directory will always be dirty during the execution.)
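
To see how the templated artifact path is composed, the following sketch builds the same path string as the tutorial code and then substitutes the placeholders by hand. This is an illustration only: at run time the placeholders are resolved by the platform, not by string replacement in your notebook. The `build_artifact_path` helper and the `example-workflow-uid` value are hypothetical.

```python
import posixpath  # keeps forward slashes regardless of the host OS


def build_artifact_path(base_path, project_name, workflow_uid):
    """Return the templated path and an illustrative resolved version of it."""
    template = posixpath.join(base_path, 'pipeline', '{{workflow.uid}}')
    # Manual substitution, for illustration only
    resolved = (template
                .replace('{{run.project}}', project_name)
                .replace('{{workflow.uid}}', workflow_uid))
    return template, resolved


template, resolved = build_artifact_path(
    '/v3io/projects/{{run.project}}/artifacts',  # default artifact path shown earlier
    'getting-started-tutorial-admin',            # project name shown earlier
    'example-workflow-uid',                      # hypothetical workflow ID
)
print(template)
print(resolved)
```

Each workflow run gets its own directory, so artifacts from different runs do not overwrite each other.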

The `run` method returns the ID of the executed workflow, which the code stores in a `run_id` variable.
You can use this ID to track the progress of your workflow, as demonstrated in the following sections.

> **Note**: You can also run the workflow from a command-line shell by using the `mlrun` CLI.
> The following CLI command defines a similar execution logic as that of the `run` call in the tutorial:
> ```
> mlrun project /User/getting-started-tutorial/conf -r main -p "$V3IO_HOME_URL/getting-started-tutorial/pipeline/{{workflow.uid}}/"
> ```

In [6]:
import os
from mlrun import mlconf

In [7]:
pipeline_path = mlconf.artifact_path

run_id = project.run(
    'main',
    arguments={}, 
    artifact_path=os.path.join(pipeline_path, "pipeline", '{{workflow.uid}}'),
    dirty=True,
    watch=True)

> 2021-01-04 14:31:51,710 [info] using in-cluster config.


> 2021-01-04 14:31:52,143 [info] Pipeline run id=f07c7c0d-ce73-4ce2-9298-d9ecfc4a7b68, check UI or DB for progress
> 2021-01-04 14:31:52,144 [info] waiting for pipeline run completion


| uid | start | state | name | results | artifacts |
| --- | --- | --- | --- | --- | --- |
| ...3b7a65cf | Jan 04 14:32:22 | completed | test | accuracy=1.0<br>test-error=0.0<br>auc-micro=1.0<br>auc-weighted=1.0<br>f1-score=1.0<br>precision_score=1.0<br>recall_score=1.0 | confusion-matrix<br>precision-recall-multiclass<br>roc-multiclass<br>test_set_preds |
| ...f90f9aa6 | Jan 04 14:32:10 | completed | train | accuracy=1.0<br>test-error=0.0<br>auc-micro=1.0<br>auc-weighted=1.0<br>f1-score=1.0<br>precision_score=1.0<br>recall_score=1.0 | train_set<br>test_set<br>confusion-matrix<br>precision-recall-multiclass<br>roc-multiclass<br>model |
| ...074c44a5 | Jan 04 14:31:58 | completed | get-data | | source_data |


### View pipeline in the UI

Go to the pipeline report (in the left-hand menu). After the run completes, you should see a pipeline that comprises three functions:
* get-data
* train
* test


<img src="./images/kubeflow-pipeline.JPG" alt="pipeline" width="600"/>

<a id='gs-pipeline-workflow-run'></a>

## Done!

Congratulations! You've completed the getting-started tutorial of the Iguazio Data Science Platform.