# Tutorial 4: Running the Project as a Pipeline

### Overview

This tutorial demonstrates how to create an automated pipeline for our project. <br>
To create the pipeline, we use Kubeflow Pipelines, a built-in Iguazio service <br>
that is based on the open-source Kubeflow Pipelines project. <br>
The integration with MLRun lets us take the functions created earlier in our project and build a pipeline <br>
composed of those functions. <br>


## Prerequisites

The tutorial is a continuation of [Tutorial 3](tutorial-3.ipynb). Make sure to complete the prior tutorial before running this tutorial.

### Load Project

In [28]:
from os import path, getenv
from mlrun import load_project

project_path = path.abspath('conf')
project = load_project(project_path)

print(f'Project path: {project_path}\nProject name: {project.name}')

Project path: /User/new-tutorials/conf
Project name: getting-started-tutorial-admin


### View All Existing Functions in the Project

Run `get_run_db().list_functions` with the `latest` tag to list the functions in the project. <br>
The project is expected to contain the following functions:
* get_data - the first function, which ingests the Iris data set into the platform
* describe - generates statistics on the data set
* train-iris - the training function
* test-classifier - tests the model
* mlrun-model - the serving function

In [37]:
from mlrun import get_run_db
get_run_db().list_functions(project='getting-started-tutorial-admin', tag='latest')

[{'kind': '',
  'metadata': {'name': 'get_data',
   'tag': 'latest',
   'project': 'getting-started-tutorial-admin',
   'categories': [],
   'hash': '84b0d47c7f0737782161e42b650af5d6221dcea9',
   'updated': '2020-12-24T14:26:48.167498+00:00'},
  'spec': {'command': '/tmp/tmpm7tlfcg0.py',
   'args': [],
   'image': '',
   'build': {'commands': []},
   'description': ''},
  'verbose': False,
  'status': {}},
 {'kind': 'job',
  'metadata': {'name': 'describe',
   'tag': 'latest',
   'hash': '6fc307f7e3afddb38475ba3b9cfae7a97d7b0598',
   'project': 'getting-started-tutorial-admin',
   'labels': {'author': 'yjb'},
   'categories': ['analysis'],
   'updated': '2020-12-27T09:00:47.083430+00:00'},
  'spec': {'command': '',
   'args': [],
   'image': 'mlrun/ml-models',
   'volumes': [{'flexVolume': {'driver': 'v3io/fuse',
      'options': {'accessKey': '442b1585-d0f0-4c6f-b639-6d5a8fa3f292'}},
     'name': 'v3io'}],
   'volume_mounts': [{'mountPath': '/v3io', 'name': 'v3io', 'subPath': ''},
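
Because the raw listing is verbose, it can help to reduce it to names, kinds, and tags. The following is a minimal sketch and not part of the tutorial code: `summarize_functions` and `missing_functions` are hypothetical helpers, and the sample records are trimmed copies of the first two entries in the output above.

```python
from typing import Dict, Iterable, List, Tuple


def summarize_functions(functions: Iterable[Dict]) -> List[Tuple[str, str, str]]:
    """Return a (name, kind, tag) tuple for each function record."""
    summary = []
    for record in functions:
        metadata = record.get("metadata", {})
        # The listing above contains an empty 'kind'; show a placeholder instead.
        kind = record.get("kind") or "<unset>"
        summary.append((metadata.get("name", ""), kind, metadata.get("tag", "")))
    return summary


def missing_functions(functions: Iterable[Dict], expected: Iterable[str]) -> List[str]:
    """Return the expected function names that do not appear in the listing."""
    found = {record.get("metadata", {}).get("name") for record in functions}
    return [name for name in expected if name not in found]


# Sample input: two records trimmed from the output above
sample = [
    {"kind": "", "metadata": {"name": "get_data", "tag": "latest"}},
    {"kind": "job", "metadata": {"name": "describe", "tag": "latest"}},
]

for name, kind, tag in summarize_functions(sample):
    print(f"{name:<12} kind={kind:<8} tag={tag}")

print("missing:", missing_functions(sample, ["get_data", "describe", "train-iris"]))
```

In the notebook, the same helpers could be applied to the list returned by `get_run_db().list_functions(...)`, assuming it has the dictionary shape shown in the output above.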
   

You're now ready to create a full ML pipeline.
This is done by using [Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/), which is integrated into the Iguazio Data Science Platform.
Kubeflow Pipelines is an open-source framework for building and deploying portable, scalable machine-learning workflows based on Docker containers.
MLRun leverages this framework to take your existing code and deploy it as steps in the pipeline.

<a id="gs-pipeline-workflow-define-n-save"></a>

### Define and Save a Pipeline Workflow

A pipeline is created by running an MLRun **"workflow"**.
The following code defines a workflow and writes it to a file in your project conf directory (file name is workflow.py).
The workflow describes a directed acyclic graph (DAG) for execution using Kubeflow Pipelines, and depicts the connections between the functions and the data as part of an end-to-end pipeline.
The workflow file has two parts &mdash; initialization of the function objects, and definition of a pipeline DSL (domain-specific language) for connecting the function inputs and outputs.
Examine the code to see how function objects are initialized and used (by name) within the workflow.

The defined pipeline includes the following steps:

- Ingest the Iris flower data set (`ingest`).
- Analyze the data set (`describe`).
- Train and test the model with hyperparameters (`train`).
- Deploy the model as a real-time serverless function (`deploy`).
- Test the serverless serving-model function with a test data set by using REST API calls (`Tester`).

> **Note**: A pipeline can also include continuous build integration and deployment (CI/CD) steps, such as building container images and deploying models.
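
To make the DAG idea concrete, here is a minimal sketch that uses Python's standard `graphlib` module (Python 3.9 or later) to derive a valid execution order. The step names follow the list above, but the dependency edges are assumptions made for illustration; in a real workflow, the edges are implied by how the steps' inputs and outputs are connected.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps it depends on.
# These edges are assumed for illustration; they are not read from workflow.py.
dependencies = {
    "ingest": set(),
    "describe": {"ingest"},
    "train": {"ingest"},
    "deploy": {"train"},
    "tester": {"deploy"},
}

# static_order() yields every step after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Steps with no dependency between them (here, `describe` and `train`) can appear in either order, which is what allows a pipeline engine to run such steps in parallel.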

In [42]:
%%writefile {path.join(project_path, 'workflow.py')}

from kfp import dsl
from mlrun import mount_v3io
import mlrun
from mlrun.platforms import mount_v3io_extended


funcs = {}
DATASET = 'source_data'
LABELS = "training-iris"

# Configure function resources and local settings
def init_functions(functions: dict, project=None, secrets=None):
    for f in functions.values():
        f.apply(mount_v3io_extended())

# Create a Kubeflow Pipelines pipeline
@dsl.pipeline(
    name="Getting-started-tutorial",
    description="This tutorial is designed to demonstrate some of the main "
                "capabilities of the Iguazio Data Science Platform.\n"
                "The tutorial uses the Iris flower data set."
)
def kfpipeline(source_url='https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv'):

    # Ingest the data set
    ingest = funcs['get-data'].as_step(
        name="get-data",
        handler='get_data',
        inputs={'source_url': source_url},
        params={'format': 'csv'},
        outputs=[DATASET])
    
    # Train a model   
    train = funcs["train-iris"].as_step(
        name="train",
        params={"label_column": LABELS},
        inputs={"dataset": ingest.outputs[DATASET]})

Overwriting /User/new-tutorials/conf/workflow.py


#### Register the Workflow

Use the `set_workflow` MLRun project method to register your workflow with MLRun.
The following code sets the `name` parameter to the selected workflow name ("main") and the `code` parameter to the name of the workflow file that is found in your project directory (**workflow.py**).

In [43]:
# Register the workflow file as "main"
project.set_workflow('main', 'workflow.py')

<a id="gs-save-project"></a>

### Save Your Project Configuration

Use the `save` MLRun project method to save your project definitions to a project-configuration file in your project directory (**/User/getting-started-tutorial/conf**).
The default name of the project file is **project.yaml**, but you can optionally change it by setting the `filepath` parameter of the `save` method.

> **Note:** It's recommended that you commit your project-configuration file and any future changes to this file to a Git repository.

In [44]:
project.save()

Use the `run` MLRun project method to execute your workflow pipeline with Kubeflow Pipelines.
The tutorial code sets the following method parameters (for the full parameter list, see the MLRun documentation or embedded help):

- **`name`** &mdash; the workflow name (in this case, "main" &mdash; see the previous step).
- **`arguments`** &mdash; a dictionary of Kubeflow Pipelines arguments (parameters).
  The tutorial code sets this parameter to an empty dictionary (`{}`), but you can edit the code to add arguments.
- **`artifact_path`** &mdash; a path or URL that identifies a location for storing the workflow artifacts.
  You can use `{{workflow.uid}}` in the path to signify the ID of the current workflow run.
  The tutorial code sets the artifacts path to a **&lt;workflow ID&gt;** directory (`{{workflow.uid}}`) in a **pipeline** directory within the current tutorial directory (**/User/getting-started-tutorial/pipeline/&lt;workflow ID&gt;**).
- **`dirty`** &mdash; set to `True` to allow running the workflow even when the project's Git repository is dirty (i.e., contains uncommitted changes).
  (When the notebook that contains the execution code is in the same Git directory as the executed workflow, the directory will always be dirty during the execution.)
- **`watch`** &mdash; set to `True` to wait for the pipeline run to complete before the call returns.
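
The `{{workflow.uid}}` placeholder in the artifact path can be illustrated with a short sketch. The real substitution happens when the pipeline runs, not in your notebook; `resolve_artifact_path` below is a hypothetical helper that only mimics the result, and the workflow ID is made up.

```python
from os import path

# Same relative template as the artifact_path argument in the tutorial's run call
template = path.join("pipeline", "{{workflow.uid}}")


def resolve_artifact_path(template: str, workflow_uid: str) -> str:
    """Illustrative stand-in for the substitution performed at run time."""
    return template.replace("{{workflow.uid}}", workflow_uid)


# Hypothetical workflow ID, used only for illustration
resolved = resolve_artifact_path(template, "1234-abcd")
print(resolved)
```

Because each run gets its own ID, the artifacts of different runs end up in separate directories and do not overwrite one another.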

The `run` method returns the ID of the executed workflow, which the code stores in a `run_id` variable.
You can use this ID to track the progress of your workflow, as demonstrated in the following sections.

> **Note**: You can also run the workflow from a command-line shell by using the `mlrun` CLI.
> The following CLI command runs logic similar to that of the `run` call in the tutorial:
> ```
> mlrun project /User/getting-started-tutorial/conf -r main -p "$V3IO_HOME_URL/getting-started-tutorial/pipeline/{{workflow.uid}}/"
> ```

In [45]:
run_id = project.run(
    'main',
    arguments={}, 
    artifact_path=path.abspath(path.join('pipeline','{{workflow.uid}}')), 
    dirty=True,
    watch=True)

> 2020-12-27 20:38:33,101 [info] Pipeline run id=9b08da1f-bdb4-4a43-bcd3-9b6b6ea2f619, check UI or DB for progress
> 2020-12-27 20:38:33,102 [info] waiting for pipeline run completion


RuntimeError: run status Failed not in expected statuses
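
The run above ended with a `RuntimeError` because the pipeline finished with a Failed status. One way to keep the notebook usable while you investigate is to catch the error around the call. The following is a minimal sketch under assumptions: `run_and_report` is a hypothetical helper, and `failing_run` is a stand-in that raises the same message as the output above instead of calling MLRun.

```python
from typing import Callable, Optional


def run_and_report(run_workflow: Callable[[], str]) -> Optional[str]:
    """Call run_workflow and return its run ID, or None if it raises RuntimeError."""
    try:
        return run_workflow()
    except RuntimeError as err:
        # Report the failure instead of stopping the notebook
        print(f"Pipeline run failed: {err}")
        return None


# Hypothetical stand-in that fails the same way as the output above
def failing_run() -> str:
    raise RuntimeError("run status Failed not in expected statuses")


run_id = run_and_report(failing_run)
```

In the notebook, you could pass a lambda that wraps the tutorial's `project.run(...)` call in place of `failing_run`. The run ID printed in the log output can then be used to look up the failed step in the UI.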

<a id='gs-pipeline-workflow-run'></a>