# Using MLRun Projects and Git
  --------------------------------------------------------------------

Load a full project with multiple functions and a workflow, and work with Git.

#### **notebook how-to's**
* Load a project with multiple functions from Git
* Run automated workflows (using KubeFlow)
* Update, maintain and debug code 

<a id='top'></a>
#### **steps**
**[Load project from Git or Archive](#load-project)**<br>
**[Run a pipeline workflow](#run-pipeline)**<br>
**[Updating the project and code](#update-project)**<br>

In [1]:
from mlrun import load_project, code_to_function

<a id='load-project'></a>
## Load project from Git or Archive

Projects can be stored in a Git repo or in a tar archive (on an object store such as S3 or V3IO).

`load_project(context, url)` will load/clone the project to the local context dir and build the project object from the `project.yaml` file in the git/archive root directory. 

> Note: If the URL is not specified, `load_project` uses the context directory and searches for a Git repo inside it. Alternatively, set the `init_git=True` flag to initialize a Git repo in the target context directory.

You can also clone the code into a directory using a CLI command:

`mlrun project my-proj/ -u git://github.com/mlrun/demo-xgb-project.git`
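
As a rough illustration of the two source types, the sketch below classifies a project URL by its scheme and file suffix. `classify_source` is a hypothetical helper written for this notebook, not part of MLRun, and the rule it applies (a `git://` scheme or `.git` suffix means Git, a tar suffix means archive) is an assumption based on the example URLs in this section.

```python
from urllib.parse import urlparse


def classify_source(url: str) -> str:
    """Guess whether a project URL points at a Git repo or a tar archive.

    Hypothetical helper for illustration only; MLRun's own logic may differ.
    """
    if not url:
        # No URL given: the project is looked up in the local context directory
        return "context"
    parsed = urlparse(url)
    # urlparse splits off any '#refs/...' fragment, so path holds only the repo/file path
    path = parsed.path
    if parsed.scheme == "git" or path.endswith(".git"):
        return "git"
    if path.endswith((".tar.gz", ".tgz", ".tar")):
        return "archive"
    return "unknown"


print(classify_source("git://github.com/mlrun/demo-xgb-project.git"))
print(classify_source("v3io:///users/admin/tars/src-project.tar.gz"))
```

A check like this can be handy for validating a URL before passing it to `load_project`, but it is not required.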


In [2]:
# source Git Repo, YOU SHOULD fork this to your account and use the fork
url = 'git://github.com/mlrun/demo-xgb-project.git' # refs/tags/v0.4.5'

# alternatively can use tar files, e.g.
#url = 'v3io:///users/admin/tars/src-project.tar.gz'

# change if you want to clone into a different dir, can use clone=True to override the dir content
project_dir = '/User/my-proj/'  
proj = load_project(project_dir, url, clone=True)

<br><b> TL;DR You can just jump to [running the project](#run-cmd) now</b>

## Play with the project

In [3]:
# if you are not in the project dir, change dir into the project dir
%cd {project_dir}

/User/my-proj


In [4]:
proj.source

'git://github.com/mlrun/demo-xgb-project.git'

Examine the project object. It contains lists of `functions` and `workflows` that are used in the project. Functions can be local to the project or referenced via a URL (to an .ipynb, .py, or .yaml file and/or a container image).

In [5]:
print(proj.to_yaml())

name: iris
functions:
- url: ./src/iris.yaml
  name: xgb
- url: https://raw.githubusercontent.com/mlrun/mlrun/master/examples/xgb_serving.ipynb
  name: serving
workflows:
- name: main
  path: src/workflow.py
artifacts: []
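
To make the local versus referenced distinction concrete, here is a minimal sketch that separates the two kinds of function entries. It uses a plain dictionary that mirrors the spec printed above, and `split_functions` is a hypothetical helper (not an MLRun API). The rule that a URL containing `://` is remote is an assumption for illustration.

```python
# Plain-dict mirror of the project spec printed above (no YAML parser needed)
project_spec = {
    "name": "iris",
    "functions": [
        {"url": "./src/iris.yaml", "name": "xgb"},
        {
            "url": "https://raw.githubusercontent.com/mlrun/mlrun/master/examples/xgb_serving.ipynb",
            "name": "serving",
        },
    ],
    "workflows": [{"name": "main", "path": "src/workflow.py"}],
}


def split_functions(spec: dict) -> tuple:
    """Return (local, remote) function names based on the form of each URL."""
    local, remote = [], []
    for fn in spec.get("functions", []):
        # Assumption: a URL with '://' is remote, anything else is a project-relative path
        target = remote if "://" in fn["url"] else local
        target.append(fn["name"])
    return local, remote


local_fns, remote_fns = split_functions(project_spec)
print("local:", local_fns)
print("remote:", remote_fns)
```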



In [6]:
# You can update the function .py and .yaml from a notebook (code + spec)
# the "code_output" option generates a .py file from the notebook, which can be used for source control and local runs
xgbfn = code_to_function('xgb', filename='notebooks/train-xgboost.ipynb', kind='job', code_output='src/iris.py')

# tell the builder to clone this repo into the function container 
xgbfn.spec.build.source = './'
xgbfn.export('src/iris.yaml')

[mlrun] 2020-03-30 20:27:27,357 function spec saved to path: src/iris.yaml


<mlrun.runtimes.kubejob.KubejobRuntime at 0x7feadc11f518>

In [7]:
# read specific function spec
print(proj.func('xgb').to_yaml())

kind: job
metadata:
  name: xgb
  tag: ''
  hash: 4350869590bf4fddb3b66315e35ec1b67bfe728d
  project: iris
  categories: []
spec:
  command: src/iris.py
  args: []
  image: ''
  volumes: []
  volume_mounts: []
  env: []
  default_handler: ''
  description: ''
  build:
    source: ./
    base_image: mlrun/mlrun
    commands:
    - pip install sklearn
    - pip install xgboost
    - pip install matplotlib
    code_origin: https://github.com/mlrun/demo-xgb-project.git#32ab2068eed70f995ad13a94e3f2da6733715f48



### Run a project function locally 

In [8]:
from mlrun import run_local, NewTask
run_local(NewTask(handler='iris_generator'), proj.func('xgb'), workdir='./')

[mlrun] 2020-03-30 20:27:52,636 artifact path is not defined or is local, artifacts will not be visible in the UI
[mlrun] 2020-03-30 20:27:52,643 starting run xgb-iris_generator uid=54a28b770c2446c5b3040fb2188078c8  -> http://10.196.88.27:80
[mlrun] 2020-03-30 20:27:53,601 .out_path will soon be deprecated, use .artifact_path
[mlrun] 2020-03-30 20:27:53,601 saving iris dataframe to 
[mlrun] 2020-03-30 20:27:53,618 log artifact iris_dataset at iris_dataset.csv, size: 2776, db: Y



uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
...8078c8,0,Mar 30 20:27:53,completed,xgb-iris_generator,v3io_user=adminkind=owner=adminhost=jupyter-74f9488695-6wrxj,,,,iris_dataset


to track results use .show() or .logs() or in CLI: 
!mlrun get run 54a28b770c2446c5b3040fb2188078c8 --project iris , !mlrun logs 54a28b770c2446c5b3040fb2188078c8 --project iris
[mlrun] 2020-03-30 20:27:53,664 run executed, status=completed


<mlrun.model.RunObject at 0x7feaae62e6d8>

<a id='update-project'></a>
## Updating the project and code

You can update the code using the standard Git process (commit, push, ...). If you update or edit the project object, run `proj.save()` to update the `project.yaml` file in your context directory, then push your updates.

You can use `proj.push(branch, commit_message, add=[])`, which does the work for you (saves the yaml, commits the updates, and pushes).

> Note: you must push updates before you build functions or run workflows since the builder will pull the code from the git repo.

If you are using containerized Jupyter you may need to first set your Git parameters, e.g.:

```
!git config --global user.email "<my@email.com>"
!git config --global user.name "<name>"
!git config --global credential.helper store
```

In [7]:
proj.push('master', 'some edits')

To pull changes made by others, use `proj.pull()`. If the yaml file changed and you need to update the project spec, use `proj.reload()`. To update the local/remote function specs, use `proj.sync_functions()` (or add `sync=True` to `.reload()`).
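
The pull-then-reload sequence can be wrapped in a small helper, sketched below. `refresh_project` is a hypothetical convenience function, not part of MLRun. It calls only the `pull()` and `reload(sync=...)` methods described above and assumes `proj` is a project object returned by `load_project`.

```python
def refresh_project(proj, sync_functions: bool = True):
    """Pull remote changes, then rebuild the in-memory project from project.yaml.

    Hypothetical wrapper; `proj` is assumed to expose pull() and reload(sync=...).
    """
    proj.pull()                       # fetch commits made by others
    proj.reload(sync=sync_functions)  # re-read project.yaml, optionally syncing function specs
    return proj
```

With a loaded project this would be called as `refresh_project(proj)`.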

In [3]:
proj.pull()

<a id='run-pipeline'></a>
## Run a pipeline workflow
You can check the [workflow.py](src/workflow.py) file to see how function objects are initialized and used (by name) inside the workflow.
The `workflow.py` file has two parts: initializing the function objects and defining the pipeline DSL (connecting the function inputs and outputs).

> Note: the pipeline can include CI steps like building container images and deploying models.

### Initializing the functions (e.g. mount them on the v3io fabric)
```python
from mlrun import mount_v3io

def init_functions(functions: dict, project=None, secrets=None):
    # mount the v3io fabric on every function in the project
    for f in functions.values():
        f.apply(mount_v3io())
```
<br>

### Workflow DSL:
```python
@dsl.pipeline(
    name='My XGBoost training pipeline',
    description='Shows how to use mlrun.'
)
def kfpipeline(
        eta=[0.1, 0.2, 0.3], gamma=[0.1, 0.2, 0.3]
):
    # first step build the function container
    builder = funcs['xgb'].deploy_step(with_mlrun=False)

    # use xgb.iris_generator function to generate data (container image from the builder)
    ingest = funcs['xgb'].as_step(name='ingest_iris', handler='iris_generator',
        image=builder.outputs['image'],
        outputs=['iris_dataset'])

    # use xgb.xgb_train function to train on the data (from the generator step)
    train = funcs['xgb'].as_step(name='xgb_train', handler='xgb_train',
        image=builder.outputs['image'],
        hyperparams={'eta': eta, 'gamma': gamma},
        selector='max.accuracy',
        inputs={'dataset': ingest.outputs['iris_dataset']},
        outputs=['model'])

    # deploy the trained model using a nuclio real-time function
    deploy = funcs['serving'].deploy_step(models={'iris_v1': train.outputs['model']})
```
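
The training step above passes lists of `eta` and `gamma` values together with `selector='max.accuracy'`. The sketch below illustrates one reading of that combination: the lists are expanded into a full grid, and the run with the highest `accuracy` is kept. `expand_grid` and `select_best` are hypothetical helpers, and the accuracy values are made up purely to exercise the selector. None of this is MLRun's implementation.

```python
from itertools import product


def expand_grid(hyperparams: dict) -> list:
    """Return one parameter dict per combination of the listed values."""
    names = list(hyperparams)
    return [dict(zip(names, values)) for values in product(*hyperparams.values())]


def select_best(results: list, selector: str) -> dict:
    """Pick a result using a '<max|min>.<metric>' selector string."""
    op, metric = selector.split(".", 1)
    chooser = max if op == "max" else min
    return chooser(results, key=lambda r: r[metric])


grid = expand_grid({"eta": [0.1, 0.2, 0.3], "gamma": [0.1, 0.2, 0.3]})
print(len(grid))  # 3 x 3 combinations

# Made-up scoring function (not real training results): peaks at eta=0.2, gamma=0.1
results = [
    dict(p, accuracy=1.0 - abs(p["eta"] - 0.2) - abs(p["gamma"] - 0.1))
    for p in grid
]
best = select_best(results, "max.accuracy")
print(best["eta"], best["gamma"])
```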

<a id='run-cmd'></a>
### Run
Use the `run` method to execute a workflow. You can provide alternative arguments and specify the default target for workflow artifacts.<br>
The workflow ID is returned and can be used to track progress; you can also use the hyperlinks.

> Note: The same command can be issued through CLI commands:<br>
    `mlrun project my-proj/ -r main -p "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"`

The `dirty` flag allows us to run a project with uncommitted changes (when the notebook is in the same Git directory, the project will always be dirty).
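
The artifact path used in this section contains a `{{workflow.uid}}` placeholder, so each pipeline run writes to its own directory. In a real run the substitution is done for you at execution time. The sketch below only imitates the effect with a plain string replacement; `render_artifact_path` and the run id `my-run-id` are hypothetical.

```python
def render_artifact_path(template: str, workflow_uid: str) -> str:
    """Replace the '{{workflow.uid}}' placeholder with a concrete run id."""
    return template.replace("{{workflow.uid}}", workflow_uid)


template = "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"
# 'my-run-id' is a stand-in for the id that a real pipeline run would supply
print(render_artifact_path(template, "my-run-id"))
```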

In [10]:
proj.run('main', arguments={}, artifact_path='v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/', dirty=True)





[mlrun] 2020-03-30 20:28:28,277 Pipeline run id=ec4d2321-a6f2-4f32-bafc-c49ef76a2fa0, check UI or DB for progress


'ec4d2321-a6f2-4f32-bafc-c49ef76a2fa0'

## Replacing the source path to speed up debugging

Instead of updating Git every time we modify code, we can build the code from the shared file system on the cluster (the build container mounts the same location that holds the code instead of reading from Git).

We need to change the project source to point to the shared file system URL of our context directory (e.g. v3io), and we can re-run the workflow. 

In [8]:
proj.source = 'v3io:///users/admin/my-proj'

**[back to top](#top)**