# Projects and automated ML pipeline

This notebook demonstrate how to work with projects, source control (git), and automating the ML pipeline.

Make sure you went over the basics in MLRun {ref}`quick-start-tutor`.

MLRun Project is a container for all your work on a particular activity: all the associated code, {ref}`functions <Functions>`, 
{ref}`jobs`, {ref}`workflows <multi-stage-workflows>`, data, models and {ref}`artifacts`. Projects can be mapped to `git` repositories to enable versioning, collaboration, and CI/CD.

You can create project definitions using the SDK or a yaml file and store those in MLRun DB, file, or archive.
Once the project is loaded you can run jobs/workflows which refer to any project element by name, allowing separation between configuration and code. See [Create and load projects](../projects/create-load-import-project.html) for details.

Projects contain `workflows` that execute the registered functions in a sequence/graph (DAG), and which can reference project parameters, secrets and artifacts by name. MLRun currently supports two workflow engines, `local` (for simple tasks) and [Kubeflow Pipelines](https://www.kubeflow.org/docs/pipelines/pipelines-quickstart/) (for more complex/advanced tasks). MLRun also supports a real-time workflow engine (see {ref}`serving`). 

An ML Engineer can gather the different functions created by the Data Engineer and Data Scientist and create this automated pipeline.

Tutorial steps:
- [**Setup the project and functions**](#project)
- [**Working with GIT and archives**](#archives)
- [**Build and run automated ML pipelines and CI/CD**](#pipeline)
- [**Test the deployed model endpoint**](#test-model)

## MLRun installation and configuration

Before running this notebook make sure the `mlrun` package is installed (`pip install mlrun`) and that you have configured the access to MLRun service. 

In [None]:
# install MLRun if not installed, run this only once (restart the notebook after the install !!!)
%pip install mlrun

<a id="project"></a>
## Setup the project and functions

**Get or create a project:**

There are three ways to create/load **`{ref}MLRun projects <Projects>`**:
* `mlrun.projects.new_project()`  &mdash; Create a new MLRun project and optionally load it from a yaml/zip/git template.
* `mlrun.projects.load_project()` &mdash; Load a project from a context directory or remote git/zip/tar archive.
* `mlrun.projects.get_or_create_project()` &mdash; Load a project from the MLRun DB if it exists, or from a specified 
  context/archive. 

Projects refer to a `context` directory that holds all the project code and configuration. The `context` dir is 
usually mapped to a `git` repository and/or to an IDE (PyCharm, VSCode, etc.) project.   

In [1]:
import mlrun
project = mlrun.get_or_create_project("tutorial", context="./", user_project=True)

> 2022-09-01 07:38:01,914 [info] Username was normalized to match the required pattern for project name: {'username': 'Davesh', 'normalized_username': 'davesh'}
> 2022-09-01 07:38:01,914 [info] Username was normalized to match the required pattern for project name: {'username': 'Davesh', 'normalized_username': 'davesh'}
> 2022-09-01 07:38:01,975 [info] loaded project tutorial from MLRun DB


<a id="gs-tutorial-4-step-setting-up-project"></a>

### Register project functions

To run workflows, you must save the definitions for the functions in the project so function objects will be initialized automatically when you load a project or when running a project version in automated CI/CD workflows. In addition, you might want to set/register other project attributes such as global parameters, secrets, and data.

Functions are registered using the `set_function()` command, where you can specify the code, requirements, image, etc. Functions can be created from a single code/notebook file or have access to the entire project context directory (by adding the `with_repo=True` flag, it will guarantee the project context is cloned into the function runtime environment).

Function registration examples:

```python
    # example: register a notebook file as a function
    project.set_function('mynb.ipynb', name='test-function', image="mlrun/mlrun", handler="run_test")

    # define a job (batch) function which uses code/libs from the project repo
    project.set_function(
        name="myjob", handler="my_module.job_handler",
        image="mlrun/mlrun", kind="job", with_repo=True,
    )
```

**Function code:**

Run the following cell to generate the data prep file (or copy it manually):

In [2]:
%%writefile data-prep.py
import pandas as pd
import mlrun

@mlrun.function(outputs=[
    "label_column",
    ("dataset", mlrun.ArtifactType.DATASET, {'format': 'csv', 'index': False}),
])
def worldcup_data_generator(context, data_path="./WorldCupMatches.csv"):
    """a function which generates and preprocess the world cup dataset"""
    data = pd.read_csv(data_path, encoding='UTF-8')
    data.dropna(inplace=True)
    data = preprocess(data)
    teams = set(list(data['Home Team Name'].unique()) + (list(data['Away Team Name'].unique())))
    stages = list(data['Stage'].unique())
    data['Home Team Name'] = pd.Categorical(data['Home Team Name'], categories=list(teams))
    data['Away Team Name'] = pd.Categorical(data['Away Team Name'], categories=list(teams))
    data['Stage'] = pd.Categorical(data['Stage'], categories=list(stages))
    home_team = pd.get_dummies(data['Away Team Name'], prefix='Away Team Name')
    away_team = pd.get_dummies(data['Home Team Name'], prefix='Home Team Name')
    stage = pd.get_dummies(data['Stage'], prefix='Stage')
    
    data.drop(columns=['Home Team Goals', 'Away Team Goals', 'Away Team Name', 'Home Team Name', 'Stage', 'Attendance'], inplace=True)
    data = pd.concat([data, stage, home_team, away_team], axis=1)
    data.dropna(inplace=True)
    context.logger.info("saving world cup matches dataframe")
    
    return 'Win', data
    

def preprocess(data):
    teams = set(list(data['Home Team Name'].unique()) + (list(data['Away Team Name'].unique())))
    stages = list(data['Stage'].unique())
    victories, finals, goals, victories_tur, victories_tur, goals_tur = create_zero_dict(teams, 6)
    year = 1930
    
    for i, row in data.iterrows():
        if row['Stage'] == 'Final' or row['Stage'] == 'finals':
            finals[row['Home Team Name']] += 1
            finals[row['Away Team Name']] += 1
        if year != row['Year']:
            victories_tur, goals_tur = create_zero_dict(teams, 2)
        data = insert_to_data(data, i, row, ['finals', 'victory current tournament', 'victory', 'goals', 'goals current tournament'], 
                              [finals, victories_tur, victories, goals, goals_tur])

        goals, goals_tur = update_goals(data, row, [goals, goals_tur])
        if row['Win'] == 1:
            victories[row['Home Team Name']] += 1
            victories_tur[row['Home Team Name']] += 1
        elif row['Win'] == 2:
            victories[row['Away Team Name']] += 1
            victories_tur[row['Away Team Name']] += 1
        
    data = data[data.Year > 1950]
    data = data[data.Win > 0]
    return data

def create_zero_dict(teams, num_of_dict):
    
    return [dict(zip(teams, [0]*len(teams))) for i in range(num_of_dict)]

def insert_to_data(data, index, row, fields, dictionaries):
    for field, dictionary in zip(fields, dictionaries):
        data.at[index, f'Home {field}'] = dictionary[row['Home Team Name']]
        data.at[index, f'Away {field}'] = dictionary[row['Away Team Name']]
    return data
    
def update_goals(data, row, goals_dictionaries):
    for dictionary in goals_dictionaries:
        dictionary[row['Home Team Name']] += row['Home Team Goals']
        dictionary[row['Away Team Name']] += row['Away Team Goals']
    return goals_dictionaries


Overwriting data-prep.py


**Register the function above in the project:**

In [3]:
project.set_function("data-prep.py", name="data-prep", kind="job", image="mlrun/mlrun", handler="worldcup_data_generator")

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7fcb9b12ee10>

**Register additional project objects and metadata:**

You can define other objects (workflows, artifacts, secrets) and parameters in the project and use them in your functions, for example:

```python
    # register a simple named artifact in the project (to be used in workflows)  
    data_url = 'https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv'
    project.set_artifact('data', target_path=data_url)

    # add a multi-stage workflow (./workflow.py) to the project with the name 'main' and save the project 
    project.set_workflow('main', "./workflow.py")
    
    # read env vars from dict or file and set as project secrets
    project.set_secrets({"SECRET1": "value"})
    project.set_secrets(file_path="secrets.env")
    
    project.spec.params = {"x": 5}
```

**Save the project:**

In [4]:
# save the project in the db (and into the project.yaml file)
project.save()

<mlrun.projects.project.MlrunProject at 0x7fcb8f4538d0>

When you save the project it stores the project definitions in the `project.yaml`, this allows reconstructing the project in a remote cluster or a CI/CD system. 

See the generated project file: [**project.yaml**](project.yaml).

<a id="archives"></a>
## Working with GIT and archives

### Push the project code/metadata into an Archive

Use standard Git commands to push the current project tree into a git archive, make sure you `.save()` the project before pushing it

    git remote add origin <server>
    git commit -m "Commit message"
    git push origin master

Alternatively you can use MLRun SDK calls:
- `project.create_remote(git_uri, branch=branch)` - to register the remote Git path
- `project.push()` - save project state and commit/push updates to remote git repo

you can also save the project content and metadata into a local or remote `.zip` archive, examples: 

    project.export("../archive1.zip")
    project.export("s3://my-bucket/archive1.zip")
    project.export(f"v3io://projects/{project.name}/archive1.zip")
    

<a id='load'></a>
### Load a project from local/remote archive 

The project metadata and context (code and configuration) can be loaded and initialized using the {py:meth}`~mlrun.projects.load_project` method.
when `url` (of the git/zip/tar) is specified it clones a remote repo into the local `context` dir.

    # load the project and run the 'main' workflow
    project = load_project(context="./", name="myproj", url="git://github.com/mlrun/project-archive.git")
    project.run("main", arguments={'data': data_url})

Projects can also be loaded and executed using the CLI:

    mlrun project -n myproj -u "git://github.com/mlrun/project-archive.git" .
    mlrun project -r main -w -a data=<data-url> .

In [5]:
# load the project in the current context dir
project = mlrun.get_or_create_project("tutorial", context="./", user_project=True)

> 2022-09-01 07:38:06,370 [info] Username was normalized to match the required pattern for project name: {'username': 'Davesh', 'normalized_username': 'davesh'}
> 2022-09-01 07:38:06,370 [info] Username was normalized to match the required pattern for project name: {'username': 'Davesh', 'normalized_username': 'davesh'}
> 2022-09-01 07:38:06,426 [info] loaded project tutorial from MLRun DB


<a id="pipeline"></a>
## Build and run automated ML pipelines and CI/CD

A pipeline is created by running an MLRun **"workflow"**.
The following code defines a workflow and writes it to a file in your local directory, with the file name **workflow.py**.
The workflow describes a directed acyclic graph (DAG) which is executed using the `local`, `remote`, or `kubeflow` engines.

See {ref}`projects-workflows`.
The defined pipeline includes the following steps:

- Generate/prepare the data (`ingest`).
- Train and the model (`train`).
- Deploy the model as a real-time serverless function (`serving`).

> **Note**: A pipeline can also include continuous build integration and deployment (CI/CD) steps, such as building container images and deploying models.

In [6]:
%%writefile './workflow.py'

from kfp import dsl
import mlrun
import pandas as pd

# Create a Kubeflow Pipelines pipeline
@dsl.pipeline(name="worldcup-demo")
def pipeline(model_name="worldcup-classifier"):
    # run the ingestion function with the new image and params
    ingest = mlrun.run_function(
        "data-prep",
        name="get-data",
        inputs={'data':'./WorldCupMatches.csv'},
        params={"format": "csv", "model_name": model_name},
        outputs=["dataset"],
        local=True
    )

    # Train a model using the auto_trainer hub function
    train = mlrun.run_function(
        "hub://auto_trainer",
        inputs={"dataset": ingest.outputs["dataset"]},
        params = {
            "model_class": "sklearn.ensemble.RandomForestClassifier",
            "train_test_split_size": 0.2,
            "label_columns": "Win",
            "model_name": model_name,
        }, 
        handler='train',
        outputs=["model"],
    )

    # Deploy the trained model as a serverless function
    serving_fn = mlrun.new_function("serving", image="mlrun/mlrun", kind="serving")
    serving_fn.with_code(body=" ")
    mlrun.deploy_function(
        serving_fn,
        models=[
            {
                "key": model_name,
                "model_path": train.outputs["model"],
                "class_name": 'mlrun.frameworks.sklearn.SklearnModelServer',
            }
        ],
    )

Overwriting ./workflow.py


<a id="gs-tutorial-4-step-register-workflow"></a>

**Run the workflow:**

In [None]:
# run the workflow
run_id = project.run(
    workflow_path="./workflow.py",
    arguments={"model_name": "worldcup-classifier"},
    engine='local',
    watch=True)

> 2022-09-01 07:38:11,170 [info] starting run get-data uid=f950b894657a46729639f1f86cc83b99 DB=http://mlrun-api:8080
> 2022-09-01 07:38:11,818 [info] saving world cup matches dataframe


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
tutorial-davesh,...6cc83b99,0,Sep 01 07:38:11,completed,get-data,workflow=ebabe00a1d2a4db2a2de26d3091357a3v3io_user=Daveshkind=owner=Daveshhost=jupyter-davids-55f4d7f589-q7jwr,data,format=csvmodel_name=worldcup-classifier,label_column=Win,dataset





> 2022-09-01 07:38:12,116 [info] run executed, status=completed
> 2022-09-01 07:38:12,290 [info] starting run auto-trainer-train uid=34d86bb2c88940419bc6936e31d7e384 DB=http://mlrun-api:8080
> 2022-09-01 07:38:12,560 [info] Job is running in the background, pod: auto-trainer-train-hn85b
> 2022-09-01 07:38:16,893 [info] Sample set not given, using the whole training set as the sample set
> 2022-09-01 07:38:17,076 [info] training 'worldcup-classifier'
> 2022-09-01 07:38:19,788 [info] run executed, status=completed
final state: completed


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
tutorial-davesh,...31d7e384,0,Sep 01 07:38:16,completed,auto-trainer-train,workflow=ebabe00a1d2a4db2a2de26d3091357a3v3io_user=Daveshkind=jobowner=Daveshmlrun/client_version=0.0.0+unstablehost=auto-trainer-train-hn85b,dataset,model_class=sklearn.ensemble.RandomForestClassifiertrain_test_split_size=0.2label_columns=Winmodel_name=worldcup-classifier,accuracy=0.7647058823529411f1_score=0.8409090909090909precision_score=0.8131868131868132recall_score=0.8705882352941177,feature-importancetest_setconfusion-matrixroc-curvescalibration-curvemodel





> 2022-09-01 07:38:22,030 [info] run executed, status=completed
> 2022-09-01 07:38:22,036 [info] Starting remote function deploy
2022-09-01 07:38:22  (info) Deploying function
2022-09-01 07:38:22  (info) Building
2022-09-01 07:38:23  (info) Staging files and preparing base images
2022-09-01 07:38:23  (info) Building processor image


<br>

**View the pipeline in MLRun UI:**

![workflow](../_static/images/tutorial/workflow.png)

<br>

**Run workflows using the CLI:**

With MLRun you can use a single command to load the code from local dir or remote archive (Git, zip, ..) and execute a pipeline. This can be very useful for integration with CI/CD frameworks and practices. See {ref}`ci-integration` for more details.

The following command loads the project from the current dir (`.`) and executes the workflow with an argument, for running locally (without k8s).

    mlrun project -r ./workflow.py -w -a model_name=classifier2 .!mlrun project -r ./workflow.py -w -a model_name=classifier2 .

<a id="test-model"></a>
## Test the deployed model endpoint

Now that your model is deployed using the pipeline, you can invoke it as usual:

In [None]:
serving_fn = project.get_function("serving")

In [None]:
# create a mock (simulator of the real-time function)
my_data = {"inputs"
           :[[1]*187
            ]
}
serving_fn.invoke("/v2/models/worldcup-classifier/infer", body=my_data)

## Done!

Congratulations! You've completed the getting started tutorial.

You might also want to explore the following demos:

- For an example of distributed training pipeline using TensorFlow, Keras, and PyTorch, see the [**mask detection demo**](https://github.com/mlrun/demos/tree/1.1.x/mask-detection).
- To learn more about deploying live endpoints and concept drift, see the [**network-operations (NetOps) demo**](https://github.com/mlrun/demos/tree/1.1.x/network-operations).
- To learn about using the feature store to process raw transactions and events in real-time and respond and block transactions before they occur, see the [**Fraud prevention demo**](https://github.com/mlrun/demos/tree/1.1.x/fraud-prevention-feature-store).  
- For an example of a pipeline that summarizes and extracts keywords from a news article URL, see the [**News article summarization and keyword extraction via NLP**](https://github.com/mlrun/demos/tree/1.1.x/news-article-nlp).