# LightGBM Using Serverless Functions

```mlrun``` is an open-source Python package that provides a framework for running machine learning tasks transparently in multiple, scalable, runtime environments.  ```mlrun``` provides tracking of code, metadata, inputs, outputs and the results of machine learning pipelines. 

In this notebook we'll take a look at using ```mlrun```, ```nuclio``` and KubeFlow to assemble a data acquisition and model training pipeline and deploy it as a nuclio serverless function with an API endpoint for testing.  The focus here is on how all the components interact, and less on the boosting model and its optimization for the Higgs dataset.

#### **notebook take-aways**
* write and test reusable and replaceable **[MLRun](https://github.com/mlrun)** components in a notebook, file or GitHub repository
* store and load models
* run the components as a **[KubeFlow](https://www.kubeflow.org/)** pipeline

<a id='top'></a>
#### **steps**
**[Nuclio code section](#nuclio-code-section)<br>**
    - [nuclio's ignore notation](#nignore)<br>
    - [function dependencies](#function-dependencies)<br>
    - [utility functions](#utilities)<br>
**[Pipeline methods](#pipeline-methods)<br>**
    - [acquire](#acquire)<br>
    - [train](#train)<br>
    - [test](#test)<br>
    - [importance](#importance)<br>
**[Testing locally](#testing)<br>**
**[Create a deployment image](#image)<br>**
**[Test remotely](#remotely)<br>**
**[Create a KubeFlow Pipeline](#pipeline)<br>**
**[Compile the pipeline](#compile-the-pipeline)<br>**

<a id="nuclio-code-section"></a>
______________________________________________

# **nuclio code section**

<a id="nignore"></a>
### nuclio's _**ignore**_ notation

You'll write all the code that gets packaged for execution between the tags ```# nuclio: ignore``` and ```# nuclio: end-code```. A cell tagged ```# nuclio: ignore``` is left out of the packaged function, and everything after ```# nuclio: end-code``` is ignored. Methods in this code section can be called separately if designed as such (```acquire```, ```split```, ```train```, ```test```), or, as you'll discover below, they are most often "chained" together to form a pipeline where the output of one stage serves as the input to the next. The **[docs](https://github.com/nuclio/nuclio-jupyter#creating-and-debugging-functions-using-nuclio-magic)** also suggest another approach: use ```# nuclio: start-code``` at the first relevant code cell instead of marking all the cells above it with ```# nuclio: ignore```.

See the **[nuclio-jupyter](https://github.com/nuclio/nuclio-jupyter)** repo for further information on these and many other **[nuclio magic commands](https://github.com/nuclio/nuclio-jupyter#creating-and-debugging-functions-using-nuclio-magic)** that make it easy to transform a Jupyter notebook environment into a platform for developing production-quality, machine learning systems.
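
To make the role of the two annotations concrete, here is a toy model of how a notebook's cells could be filtered before packaging. It is only a sketch: `select_cells` and the sample cell list are hypothetical, this is not nuclio-jupyter's implementation, and treating ```# nuclio: ignore``` as skipping just the cell that contains it is an assumption of the sketch.

```python
from typing import List

IGNORE = "# nuclio: ignore"
END = "# nuclio: end-code"


def select_cells(cells: List[str]) -> List[str]:
    """Toy model: return the cell sources that would be packaged."""
    kept = []
    for src in cells:
        if END in src:
            break       # everything from end-code onwards is dropped
        if IGNORE in src:
            continue    # assumption: only the tagged cell itself is skipped
        kept.append(src)
    return kept


# hypothetical notebook, one string per code cell
notebook = [
    "# nuclio: ignore\nimport nuclio",
    "import pandas as pd",
    "def acquire(context):\n    pass",
    "# nuclio: end-code",
    "workflow = new_function()",
]

packaged = select_cells(notebook)
print(packaged)
```

Only the import cell and the function definition survive in this toy run; the local testing code after the end marker never reaches the packaged function.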

The following two lines _**should be in the same cell**_ and mark the start of your machine learning coding section:

In [1]:
# nuclio: ignore
import nuclio 

<a id='function-dependencies'></a>
### function dependencies

The installs made in the section **[Setup](#Setup)** covered the Jupyter environment within which this notebook runs. However, we need to ensure that all the dependencies our nuclio function relies upon (such as ```matplotlib```, ```sklearn```, ```lightgbm```) will be available when that code is wrapped up into a nuclio function _**on some presently unknown runtime**_. Within the nuclio code section we can ensure these dependencies get built into the function with the ```%%nuclio cmd``` cell magic.

In [2]:
%%nuclio cmd -c
rm /conda/lib/python3.6/site-packages/seaborn* -rf
pip uninstall -y mlrun
pip install -U -q git+https://github.com/mlrun/mlrun.git@development
pip install -U -q kfp
pip install -U -q pyarrow
pip install -U -q pandas
pip install -U -q matplotlib
pip install -U -q seaborn
pip install -U -q scikit-learn
pip install -U -q lightgbm
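
Before relying on a freshly built runtime, it can help to confirm that the packages are actually importable there. The following is a minimal sketch using only the standard library; `check_imports` is a hypothetical helper and not part of nuclio or mlrun.

```python
from importlib.util import find_spec
from typing import Dict, Iterable


def check_imports(modules: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be imported here."""
    return {name: find_spec(name) is not None for name in modules}


# import names, which can differ from pip names (scikit-learn -> sklearn)
status = check_imports(["json", "pickle", "lightgbm", "sklearn"])
missing = sorted(name for name, ok in status.items() if not ok)
print("missing:", missing)
```

Running this as the first lines of a handler, or in a scratch cell, gives an early and readable failure instead of an `ImportError` halfway through a pipeline step.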

We'll use a standard base image here. The build step can be shortened by preparing images with pre-installed packages.

In [3]:
%nuclio config spec.build.baseImage = "python:3.6-jessie"

%nuclio: setting spec.build.baseImage to 'python:3.6-jessie'


In [4]:
from io import BytesIO
from os import path, makedirs
import json
from pickle import load, dump
from pathlib import Path
from urllib.request import urlretrieve
from typing import IO, AnyStr, TypeVar, Union, List

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import (roc_curve, accuracy_score, confusion_matrix)
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
import seaborn as sns

import pyarrow.parquet as pq
import pyarrow as pa
from pyarrow import Table

from mlrun.artifacts import TableArtifact, PlotArtifact
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

_______________

some useful parameters to keep the notebook neat:
    

In [5]:
ARCHIVE_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"
CHUNK_SIZE = 10_000
TARGET_PATH = '/User/mlrun/lightgbm/'
MODEL_NAME = 'lightgbm_classifier.pkl'

<a id='utilities'></a>
### utility functions

We log tables to the artifact store and read them back often in this demo, so we provide two small utilities for those operations:

In [6]:
def get_context_table(ctxtable: DataItem) -> pd.DataFrame:
    """Deserialize a parquet table from the artifact store into a DataFrame.
    """
    blob = BytesIO(ctxtable.get())
    return pd.read_parquet(blob, engine='pyarrow')

In [7]:
def log_context_table(
    context: MLClientCtx,
    target_path: str,
    key: str,
    table: pd.DataFrame
) -> None:
    """Log a table in the artifact store.
    
    The table is written as a parquet file, and its target
    path is saved in the context.
    
    :param context:      the function context
    :param target_path:  location (folder) of our DataItem
    :param key:          name of the object in the artifact store
    :param table:        the object we wish to store
    """
    filepath = path.join(target_path, key + '.parquet')
    pq.write_table(pa.Table.from_pandas(table), filepath)    
    context.log_artifact(key, target_path=filepath)

<a id=pipeline-methods></a>
### pipeline methods

These are the methods that will be chained together in a pipeline consisting of 4 steps: ```acquire```, ```split```, ```train```, ```test```.  
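
The chaining idea can be sketched without any of the infrastructure. In this toy model a plain `dict` stands in for the artifact store, and `acquire_step`, `train_step` and `run_pipeline` are hypothetical stand-ins for the real handlers defined below; none of this is the mlrun API.

```python
from typing import Any, Callable, Dict, List

Store = Dict[str, Any]  # stand-in for the artifact store


def acquire_step(store: Store) -> None:
    """Pretend to download data and publish a header and a raw table."""
    store["header"] = ["labels", "f1", "f2"]
    store["raw"] = [[1, 0.2, 0.5], [0, 0.9, 0.1]]


def train_step(store: Store) -> None:
    """Consume the outputs of the previous step and publish a 'model'."""
    header = store["header"]
    rows = store["raw"]
    store["model"] = {"n_features": len(header) - 1, "n_rows": len(rows)}


def run_pipeline(steps: List[Callable[[Store], None]]) -> Store:
    """Run the steps in order, sharing one store between them."""
    store: Store = {}
    for step in steps:
        step(store)
    return store


result = run_pipeline([acquire_step, train_step])
print(result["model"])
```

Each step only knows the keys it reads and writes, which is what makes a step replaceable without touching its neighbours.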

#### ```acquire```

In the first step, we retrieve data in the form of a gzip archive from the **[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)**.  The file is saved as a parquet file and its location made available to the next step in the pipeline. The table header was scraped from the UCI site, and no details are provided there:



In [8]:
higgs_header = ['labels', 'lepton pT ', 'lepton eta ', 'lepton phi ',
               'missing energy magnitude ', 'missing energy phi ', 'jet 1 pt ',
               'jet 1 eta ', 'jet 1 phi ', 'jet 1 b-tag ', 'jet 2 pt ',
               'jet 2 eta ', 'jet 2 phi ', 'jet 2 b-tag ', 'jet 3 pt ',
               'jet 3 eta ', 'jet 3 phi ', 'jet 3 b-tag ', 'jet 4 pt ',
               'jet 4 eta ', 'jet 4 phi ', 'jet 4 b-tag', 'm_jj', 'm_jjj',
               'm_lv ', 'm_jlv', 'm_bb ', 'm_wbb ', 'm_wwbb']

In [9]:
def acquire(
    context: MLClientCtx,
    archive_url: Union[str, Path, IO[AnyStr]],
    header: Union[None, List[str]],
    name: str = '',
    target_path: str = '',
    chunksize: int = 256
) -> None:
    """Open a file/object archive and save as a parquet file.
    
    :param context:       the function context
    :param archive_url:   any value accepted as the path argument of
                          pandas.read_csv: file path strings, URLs,
                          pathlib.Path objects, etc.
    :param header:        column names
    :param name:          local filename
    :param target_path:   destination folder of file
    :param chunksize:     (256) number of rows read per iteration
    """
    makedirs(target_path, exist_ok=True)
    context.logger.info('verified directories')
   
    if not name.endswith('.parquet'):
        name += '.parquet'
    dest_path = path.join(target_path , name)
    
    if not path.isfile(dest_path):
        context.logger.info('destination file does not exist, downloading')
        pqwriter = None
        for i, df in enumerate(pd.read_csv(archive_url, chunksize=chunksize, names=header)):
            table = pa.Table.from_pandas(df)
            if i == 0:
                pqwriter = pq.ParquetWriter(dest_path, table.schema)
            pqwriter.write_table(table)

        if pqwriter:
            pqwriter.close()

    context.logger.info(f'saved table to {dest_path}')
    
    # store header as artifact:
    if header:
        header = [x.replace(' ', '_') for x in header]
        filepath = path.join(target_path, 'header.json')
        # use a context manager so the file is flushed and closed
        with open(filepath, 'w') as fp:
            json.dump(header, fp)
        context.log_artifact('header', target_path=filepath)
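
The chunked loop in ```acquire``` keeps memory use bounded by never holding more than ```chunksize``` rows at once. The same pattern can be sketched with the standard library alone; `iter_chunks` is a hypothetical helper, and the in-memory CSV below is made-up data standing in for the real archive.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, TextIO


def iter_chunks(handle: TextIO, chunksize: int) -> Iterator[List[List[str]]]:
    """Yield lists of at most `chunksize` parsed CSV rows."""
    reader = csv.reader(handle)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


# five made-up rows, read two at a time
sample = io.StringIO("1,0.5,0.1\n0,0.7,0.3\n1,0.2,0.9\n0,0.4,0.4\n1,0.6,0.8\n")
sizes = [len(chunk) for chunk in iter_chunks(sample, chunksize=2)]
print(sizes)
```

The last chunk is usually shorter than the others, which is why the writer in ```acquire``` is created from the first chunk's schema and then reused for every later chunk.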

#### ```train```
For demonstration purposes we expose only two model parameters, ```learning_rate``` and ```num_leaves```. See **[LightGBM Parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html#parameters)** for
more detail on the other parameters available and their default values.
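
To get a feel for how quickly a search over even two parameters grows, the sketch below enumerates every combination of a small grid. The candidate values are the ones passed as pipeline arguments later in this notebook; the enumeration itself is only an illustration and is not something the notebook runs.

```python
from itertools import product

# candidate values for the two exposed parameters
grid = {"learning_rate": [0.1, 0.3], "num_leaves": [31, 32]}

names = list(grid)
combos = [dict(zip(names, values)) for values in product(*grid.values())]

for params in combos:
    print(params)
```

Two values per parameter already give four training runs, so each additional parameter multiplies the cost of an exhaustive search.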


In [10]:
def _plot_validation(
    context: MLClientCtx, 
    train_metric,
    valid_metric,
    title: str = "training validation results",
    xlabel: str = "epoch",
    ylabel: str = "",
    target_path: str = "",
    key: str = "",
) -> None:
    """Plot train and validation loss curves from a metrics table in an
    artifact store.
    
    These curves represent the training round losses from the training
    and validation sets.
    :param train_metric:    train metric
    :param valid_metric:    validation metric
    :param title:           plot title
    :param xlabel:          X-axis label
    :param ylabel:          Y-axis label
    :param target_path:     save plot in this folder
    :param key:             plot's key in the artifact store
    """
    plt.plot(train_metric)
    plt.plot(valid_metric)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.legend(["train", "valid"])
    fig = plt.gcf()

    plotpath = path.join(target_path, "history.png")
    plt.savefig(plotpath)
    context.log_artifact(PlotArtifact('training-validation-plot', body=fig, target_path=plotpath))

    # to ensure we don't overwrite this figure when creating the next:
    plt.cla()
    plt.clf()
    plt.close()

In [11]:
def _plot_roc(
    context: MLClientCtx, 
    y_labels,
    y_probs,
    title: str = "roc curve",
    xlabel: str = "false positive rate",
    ylabel: str = "true positive rate",
    fmt: str = "png",
    target_path: str = "",
    key: str = "",
) -> None:
    """Plot an ROC curve from test data saved in an artifact store.
    :param y_labels:        test data labels
    :param y_probs:         predicted probabilities of the positive class
    :param title:           plot title
    :param xlabel:          X-axis label (not tick labels)
    :param ylabel:          Y-axis label (not tick labels)
    :param fmt:             plot file image format (png, jpg, ...)
    :param target_path:     save plot in this folder
    :param key:             plot's key in the artifact store                
    """
    fpr_xg, tpr_xg, _ = roc_curve(y_labels, y_probs)
    plt.plot([0, 1], [0, 1], "k--")
    plt.plot(fpr_xg, tpr_xg, label="roc")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.legend(loc="best")
    fig = plt.gcf()

    plotpath = path.join(target_path, "roc.png")
    fig.savefig(plotpath, format=fmt)
    context.log_artifact(PlotArtifact('roc', body=fig))
    
    # to ensure we don't overwrite this figure when creating the next:
    plt.cla()
    plt.clf()
    plt.close()

In [12]:
def _plot_confusion_matrix(
    context: MLClientCtx, 
    labels,
    predictions,
    title: str = "confusion matrix",
    axislabels: List = None,
    target_path: str = "",
    key: str = "",
) -> None:
    """Create a confusion matrix.
    Plot and save a confusion matrix using test_data from a
    pipeline step.  The plot is generated using default arguments.
    The present example could be extended by including a parameters `dict`
    that is passed through to sklearn's `confusion_matrix`,
    `ConfusionMatrixDisplay`, and matplotlib `plot`.
    :param labels:          test data labels
    :param predictions:     test data predictions
    :param title:           plot title
    :param axislabels:      list of classes, for labeling axes
    :param target_path:     save plot in this folder
    :param key:             plot's key in the artifact store
    """
    cm = confusion_matrix(labels,
                          predictions,
                          sample_weight=None,
                          labels=axislabels,
                          normalize='all')
    sns.heatmap(cm, annot=True, cmap="Blues")
    plotpath = path.join(target_path, "confusion.png")
    fig = plt.gcf()
    fig.savefig(plotpath)
    context.log_artifact(PlotArtifact('confusion_matrix', body=fig))

    # to ensure we don't overwrite this figure when creating the next:
    plt.cla()
    plt.clf()
    plt.close()

In [13]:
def _plot_importance(
    context: MLClientCtx, 
    model,
    header: List = [],
    title: str = 'LightGBM Features',
    fmt: str = "png",
    target_path: str = '',
    key: str = ''
)-> None:
    """Display estimated feature importances.
    
    :param model:       fitted lightgbm model
    :param header:      list of feature names
    :param title:       ('LightGBM Features') plot title
    :param fmt:         plot file image format (png, jpg, ...)
    :param target_path: destination folder for files
    :param key:         key of artifact in artifact store
    
    """
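    # create a feature importance table with desired labels; the target
    # column ('labels') is not a feature, so drop it before pairing
    features = [name for name in header if name != 'labels']
    zipped = zip(model.feature_importances_, features)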
    
    feature_imp = pd.DataFrame(sorted(zipped), columns=['freq','feature']
                              ).sort_values(by="freq", ascending=False)
    
    plt.figure(figsize=(20, 10))
    sns.barplot(x="freq", y="feature", data=feature_imp)
    plt.title(title)
    plt.tight_layout()
    fig = plt.gcf()
    plotpath = path.join(target_path, "feature-importances.png")
    fig.savefig(plotpath)
    context.log_artifact(PlotArtifact('feature-importances-plot', body=fig))

    # feature importances are also saved as a table:
    tablepath = path.join(target_path, "feature-importances-table.csv")
    feature_imp.to_csv(tablepath)
    context.log_artifact(TableArtifact('feature-importances-table', target_path=tablepath))
    
    # to ensure we don't overwrite this figure when creating the next:
    plt.cla()
    plt.clf()
    plt.close()

In [14]:
def log_model(
    context: MLClientCtx,
    model,
    history,
    test_data: pd.DataFrame, 
    header: List = [],
    target_path: str = '',
    name: str = '',  # with file extension
    key: str = 'model',
    labels: dict = {}
):
    """log a classifier model to the artifact store
    
    :param context:       function context
    :param model:         estimated model
    :param history:       training-validation metrics
    :param test_data:     test labels and test predictions
    :param header:        features labels
    :param target_path:   destination folder for file artifacts
    :param name:          name of model file (or, prefix to model files)
    :param key:           key of model in artifact store
    :param labels:        model artifact labels
    
    Save an estimated model along with metadata, its training-validation metrics 
    history and plots, roc curve, confusion matrix and feature importances.  
    """
    # `np.float` was removed from recent NumPy releases; the builtin works everywhere
    loss = np.asarray(history['train']['binary_logloss'], dtype=float)
    val_loss = np.asarray(history['valid']['binary_logloss'], dtype=float)
    
    _plot_validation(context, loss, val_loss, target_path=target_path, key='training-validation-metrics')
    _plot_roc(context, test_data.y_test, test_data.y_probs, target_path=target_path, key='roc-curve')
    _plot_confusion_matrix(context, test_data.y_test, test_data.y_pred, target_path=target_path, key='confusion-matrix')
    _plot_importance(context, model, header, target_path=target_path, key='feature-importances')
   
    # save the model and log it as an artifact
    filepath = path.join(target_path, name)
    with open(filepath, 'wb') as fp:
        dump(model, fp)
    context.log_artifact(key,
                         target_path=filepath)
                         #,
                         #labels=labels)    
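
Storing and loading a model comes down to a pickle round trip. The sketch below uses a plain dictionary as a stand-in for the fitted classifier and a temporary folder instead of the notebook's target path, both of which are assumptions made to keep the example self-contained.

```python
import pickle
import tempfile
from os import path

# stand-in for a fitted estimator; any picklable object behaves the same way
model = {"num_leaves": 31, "learning_rate": 0.1}

with tempfile.TemporaryDirectory() as folder:
    filepath = path.join(folder, "lightgbm_classifier.pkl")

    # store
    with open(filepath, "wb") as fh:
        pickle.dump(model, fh)

    # load
    with open(filepath, "rb") as fh:
        restored = pickle.load(fh)

print(restored == model)
```

A pickled estimator should be loaded in an environment with the same library versions it was saved with, and only from sources you trust, since unpickling can execute arbitrary code.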

In [15]:
def train(
    context: MLClientCtx,
    src_file: str,
    header: DataItem,
    test_size: float = 0.1,
    train_val_split: float = 0.75,
    sample: int = -1,
    target_path: str = '',
    name: str = '',
    key: str = '',
    labels = {
        'type': 'classifier',
        'framework': 'lightgbm_booster'},  # 'lightgbm_sklearn' if this were a pipeline
    verbose: bool = False,
    random_state: int = 1,
    num_leaves: int = 31,
    learning_rate: float = 0.1,

) -> None:
    """Train and save a LightGBM model.
    
    :param context:         the function context
    :param src_file:        name of raw data file
    :param header:          DataItem holding the column names (json list)
    :param test_size:       (0.1) test set size
    :param train_val_split: (0.75) Once the test set has been removed the 
                            training set gets this proportion.
    :param sample:          (-1) use all rows. If >= 1, use the first `sample`
                            rows. If < -1, draw a random sample of
                            abs(sample) rows from the entire file
    :param target_path:   folder location of files
    :param name:          destination name for model file
    :param key:           key for model artifact
    :param labels:        metadata dict, some keys are required (type, framework). 'type'
                          is either classifier or regressor, 'framework' can be sklearn or not
                          (sklearn models have a generic interface)
    :param verbose :       (False) show metrics for training/validation steps.
    :param random_state:  (1) sklearn rng seed
        
    Also included for demonstration are a randomly selected sample
    of LightGBM parameters:
    :param num_leaves : (Default is 31).  In the LightGBM model
            controls complexity.
    :param learning_rate : Step size at each iteration, constant.
    """
    # load local data
    srcfilepath = path.join(target_path, src_file)
    raw = pq.read_table(srcfilepath).to_pandas()
    # optionally keep only a sample, intended for debugging
    if sample >= 1:
        # contiguous sample: the first `sample` rows
        raw = raw.iloc[:sample, :]
    elif sample < -1:
        # random sample of abs(sample) rows from the entire file
        raw = raw.sample(-sample)
    # otherwise (sample == -1) keep all rows

    # use a separate name so the `labels` metadata argument is not overwritten
    y_all = raw.pop('labels')

    x, xtest, y, ytest = train_test_split(raw, 
                                          y_all, 
                                          train_size=1-test_size, 
                                          test_size=test_size, 
                                          random_state=random_state)
   
    # save these for later
    log_context_table(context, target_path, 'xtest', xtest)
    log_context_table(context, target_path, 'ytest', pd.DataFrame({'labels':ytest}))

    xtrain, xvalid, ytrain, yvalid = train_test_split(x, 
                                                      y, 
                                                      train_size=train_val_split, 
                                                      test_size=1-train_val_split,
                                                      random_state=random_state)        
    
    lgb_clf = lgb.LGBMClassifier(num_leaves=num_leaves,
                                 learning_rate= learning_rate,
                                 objective='binary',
                                 metric='binary_logloss',
                                 random_state=random_state,
                                 verbose=int(verbose == True))

    eval_results = dict()
    eval_result = lgb.record_evaluation(eval_results)

    lgb_clf.fit(xtrain, 
                ytrain,
                eval_set=[(xvalid, yvalid), (xtrain, ytrain)],
                eval_names=['valid', 'train'],
                callbacks=[eval_result],
                verbose=verbose)
    
    ypred_probs = lgb_clf.predict_proba(xtest)[:, 1]
    ypred = np.where(ypred_probs >= 0.5, 1, 0)
    
    acc = accuracy_score(ytest, ypred)
    context.log_result("accuracy", float(acc))
    log_model(context, 
              lgb_clf, 
              eval_results, 
              pd.DataFrame({'y_test': ytest.values, 
                            'y_pred': ypred, 
                            'y_probs': ypred_probs}),
              target_path=target_path,
              header=json.loads(header.get().decode('utf-8')),
              name=name, 
              key=key, 
              labels=labels)
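
The two-stage split in ```train``` is easy to misread, so here is the arithmetic worked through for the sample size used in the runs below. `split_sizes` is a hypothetical helper that uses simple rounding; scikit-learn's own rounding may differ by a row when the products are not whole numbers.

```python
from typing import Tuple


def split_sizes(
    n_rows: int,
    test_size: float = 0.1,
    train_val_split: float = 0.75,
) -> Tuple[int, int, int]:
    """Return (train, validation, test) row counts for the two-stage split."""
    n_test = round(n_rows * test_size)         # held out first
    n_rest = n_rows - n_test                   # what remains for fitting
    n_train = round(n_rest * train_val_split)  # share used for training
    n_valid = n_rest - n_train                 # the rest monitors the fit
    return n_train, n_valid, n_test


sizes = split_sizes(20_000)
print(sizes)
```

Note that ```train_val_split``` is a fraction of what is left after the test set is removed, not of the full table, so the defaults give 67.5% train, 22.5% validation and 10% test overall.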

#### ```importance```
```lightgbm``` models provide a ```feature_importances_``` attribute, and the library offers a ```plot_importance``` function. Here we create a table from the attribute (which by default reports how many times each feature is used in the model), log it as an artifact and create a barplot using ```seaborn```.
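
The ranking step itself is just pairing each importance with its feature name and sorting. The sketch below shows this with made-up counts and three feature names borrowed from the header; the numbers are illustrative only and do not come from a fitted model.

```python
# made-up split counts, one per feature, in training column order
importances = [12, 40, 7]
features = ["lepton_pT", "m_bb", "jet_1_pt"]

# pair, then sort by importance with the largest first
ranked = sorted(zip(importances, features), reverse=True)

for freq, feature in ranked:
    print(f"{feature:<12}{freq}")
```

`zip` stops at the shorter of its inputs and pairs purely by position, so the list of names must contain only the feature columns, in the same order the model saw them during training.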

#### **end of nuclio function definition**

In [16]:
# nuclio: end-code

<a id="testing"></a>
## Testing locally

The function can be run locally and debugged/tested before deployment:

In [17]:
from mlrun import (code_to_function, 
                   new_function, 
                   NewTask,
                   new_model_server, 
                   mlconf, 
                   mount_v3io)
mlconf.dbpath = 'http://mlrun-api:8080'

In [18]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

_**please note:**_ the following operation will download an archive of approximately 3 GB. When testing or debugging your workflows you may want to prepare a small sample file stored on `s3`.
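
One way to prepare such a sample is to copy the first rows of the archive into a new, much smaller gzip file. This is a sketch under stated assumptions: `write_sample` is a hypothetical helper, the file names are made up, and a tiny synthetic archive is generated here so the example runs without downloading anything.

```python
import gzip
import tempfile
from itertools import islice
from os import path


def write_sample(src_gz: str, dest_gz: str, n_rows: int) -> int:
    """Copy the first `n_rows` lines of a gzip text file to a new gzip file."""
    written = 0
    with gzip.open(src_gz, "rt") as src, gzip.open(dest_gz, "wt") as dest:
        for line in islice(src, n_rows):
            dest.write(line)
            written += 1
    return written


with tempfile.TemporaryDirectory() as folder:
    full = path.join(folder, "full.csv.gz")
    small = path.join(folder, "sample.csv.gz")

    # synthetic stand-in for the real archive: 100 short rows
    with gzip.open(full, "wt") as fh:
        fh.writelines(f"{i % 2},{i * 0.1:.1f}\n" for i in range(100))

    n_written = write_sample(full, small, n_rows=10)

    with gzip.open(small, "rt") as fh:
        n_lines = sum(1 for _ in fh)

print(n_written, n_lines)
```

Because the file is streamed line by line, the full archive is never decompressed into memory, and reading stops as soon as the requested number of rows has been copied.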

In [34]:
# no artifacts are created in this step
workflow = new_function()

acquire_run = workflow.run(
    name='acquire_remote_data',
    handler=acquire, 
    params={
        'archive_url': ARCHIVE_URL,
        'header':      higgs_header,
        'name':        'raw',
        'target_path': TARGET_PATH,
        'chunksize'  : CHUNK_SIZE})

[mlrun] 2020-01-15 11:10:27,490 verified directories
[mlrun] 2020-01-15 11:10:27,491 saved table to /User/mlrun/lightgbm/raw.parquet



uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
...af16d4,0,Jan 15 11:10:27,completed,acquire_remote_data,host=jupyter-1-6c8766ddb7-rp87j,,"archive_url=https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz; header=['labels', 'lepton pT ', 'lepton eta ', 'lepton phi ', 'missing energy magnitude ', 'missing energy phi ', 'jet 1 pt ', 'jet 1 eta ', 'jet 1 phi ', 'jet 1 b-tag ', 'jet 2 pt ', 'jet 2 eta ', 'jet 2 phi ', 'jet 2 b-tag ', 'jet 3 pt ', 'jet 3 eta ', 'jet 3 phi ', 'jet 3 b-tag ', 'jet 4 pt ', 'jet 4 eta ', 'jet 4 phi ', 'jet 4 b-tag', 'm_jj', 'm_jjj', 'm_lv ', 'm_jlv', 'm_bb ', 'm_wbb ', 'm_wwbb']; name=raw; target_path=/User/mlrun/lightgbm/; chunksize=10000",,header


to track results use .show() or .logs() or in CLI: 
!mlrun get run 7ede4b3e5a394aac86d1da7531af16d4  , !mlrun logs 7ede4b3e5a394aac86d1da7531af16d4 
[mlrun] 2020-01-15 11:10:27,554 run executed, status=completed


In [20]:
train_run = workflow.run(
    name = 'train_model',
    handler=train,
    inputs={'header': acquire_run.outputs['header']},
    params={
        'src_file':         'raw.parquet',
        'sample':           20_000,
        'test_size':        0.1,
        'train_val_split':  0.75,
        'target_path':      TARGET_PATH,
        'name':             MODEL_NAME,
        'key' :             'model',
        'num_leaves':       31,
        'learning_rate':    0.1,
        'verbose':          False,
        'labels':          {'type'     : 'classifier', 
                            'framework': 'lightgbm', 
                            'mode'     : 'model'}}
)  # 'mode': 'sklearn' if this were a pipeline

uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
...117526,0,Jan 15 09:39:18,completed,train_model,host=jupyter-1-6c8766ddb7-rp87j,header,"src_file=raw.parquet; sample=20000; test_size=0.1; train_val_split=0.75; target_path=/User/mlrun/lightgbm/; name=lightgbm_classifier.pkl; key=model; num_leaves=31; learning_rate=0.1; verbose=False; labels={'type': 'classifier', 'framework': 'lightgbm', 'mode': 'model'}",accuracy=0.7105,"xtest; ytest; training-validation-plot.html; roc.html; confusion_matrix.html; feature-importances-plot.html; feature-importances-table; model"


to track results use .show() or .logs() or in CLI: 
!mlrun get run 1975142ca0aa4a55a7ab552cc1117526  , !mlrun logs 1975142ca0aa4a55a7ab552cc1117526 
[mlrun] 2020-01-15 09:39:30,460 run executed, status=completed


<a id="image"></a>
### Create a deployment image

Once debugged you can create a reusable image, and then deploy it for testing. In the following line we are converting the code block between the ```#nuclio: ignore``` and ```#nuclio: end-code``` to be run as a KubeJob.   _**It is important to ensure that this image has been built at least once, and that you have access to it.**_

In [21]:
lgbm_job = code_to_function(name='lgbm_job', 
                            runtime='job', 
                            with_doc=False).apply(mount_v3io())

In [22]:
# lgbm_job.spec.no_cache = True

In [23]:
# lgbm_job.deploy()
# other options:
# ignore if exists
# save and/or export

While debugging, and _**after you have run**_ ```deploy``` **at least once**, you can comment out the last cell so that the build process isn't started needlessly.  Any code changes made after the image has been built and deployed can be injected into the job using the following line (so long as the changes don't include adding or deleting packages):

In [24]:
lgbm_job.with_code()

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7ff539a52518>

____

<a id="remotely"></a>
### test your code remotely

In [25]:
task = NewTask()

In [26]:
task.with_params(
    archive_url=ARCHIVE_URL,
    header=higgs_header,
    name='raw',
    target_path=TARGET_PATH,
    chunksize=CHUNK_SIZE)

acquire_run_tsk = lgbm_job.run(task, handler='acquire',out_path=TARGET_PATH)

[mlrun] 2020-01-15 09:39:36,225 starting run acquire uid=92b94371d0ea42e39182ebfde40a01b8  -> http://mlrun-api:8080
[mlrun] 2020-01-15 09:39:36,385 Job is running in the background, pod: acquire-mxjp5
[mlrun] 2020-01-15 09:39:43,653 verified directories
[mlrun] 2020-01-15 09:39:43,654 saved table to /User/mlrun/lightgbm/raw.parquet

[mlrun] 2020-01-15 09:39:43,678 run executed, status=completed
final state: succeeded


uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
...0a01b8,0,Jan 15 09:39:43,completed,lgbm-job,"host=acquire-mxjp5; kind=job; owner=admin",,"archive_url=https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz; chunksize=10000; header=['labels', 'lepton pT ', 'lepton eta ', 'lepton phi ', 'missing energy magnitude ', 'missing energy phi ', 'jet 1 pt ', 'jet 1 eta ', 'jet 1 phi ', 'jet 1 b-tag ', 'jet 2 pt ', 'jet 2 eta ', 'jet 2 phi ', 'jet 2 b-tag ', 'jet 3 pt ', 'jet 3 eta ', 'jet 3 phi ', 'jet 3 b-tag ', 'jet 4 pt ', 'jet 4 eta ', 'jet 4 phi ', 'jet 4 b-tag', 'm_jj', 'm_jjj', 'm_lv ', 'm_jlv', 'm_bb ', 'm_wbb ', 'm_wwbb']; name=raw; target_path=/User/mlrun/lightgbm/",,header


to track results use .show() or .logs() or in CLI: 
!mlrun get run 92b94371d0ea42e39182ebfde40a01b8  , !mlrun logs 92b94371d0ea42e39182ebfde40a01b8 
[mlrun] 2020-01-15 09:39:45,571 run executed, status=completed


In [27]:
task.with_input('header', acquire_run_tsk.outputs['header'])
task.with_params(
    src_file='raw.parquet',
    sample=20_000,
    test_size=0.1,
    train_val_split=0.75,
    target_path=TARGET_PATH,
    name=MODEL_NAME,
    key='model',
    num_leaves=31,
    learning_rate=0.1,
    verbose=False,
    labels= {'type'      : 'classifier', 
             'framework' : 'lightgbm',
             'mode'      : 'model'    })

nrun2 = lgbm_job.run(task, handler='train', out_path=TARGET_PATH)

[mlrun] 2020-01-15 09:39:45,578 starting run train uid=9d4baaaebac84c00a654306b0e6c4fe1  -> http://mlrun-api:8080
[mlrun] 2020-01-15 09:39:45,722 Job is running in the background, pod: train-8p75n
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[mlrun] 2020-01-15 09:40:07,810 run executed, status=completed
final state: succeeded


uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
...6c4fe1,0,Jan 15 09:39:52,completed,lgbm-job,"host=train-8p75n; kind=job; owner=admin",header,"key=model; labels={'framework': 'lightgbm', 'mode': 'model', 'type': 'classifier'}; learning_rate=0.1; name=lightgbm_classifier.pkl; num_leaves=31; sample=20000; src_file=raw.parquet; target_path=/User/mlrun/lightgbm/; test_size=0.1; train_val_split=0.75; verbose=False",accuracy=0.7105,"xtest; ytest; training-validation-plot.html; roc.html; confusion_matrix.html; feature-importances-plot.html; feature-importances-table; model"


to track results use .show() or .logs() or in CLI: 
!mlrun get run 9d4baaaebac84c00a654306b0e6c4fe1  , !mlrun logs 9d4baaaebac84c00a654306b0e6c4fe1 
[mlrun] 2020-01-15 09:40:11,086 run executed, status=completed


<a id="pipeline"></a>
### Create a KubeFlow Pipeline from our functions

Our pipeline will consist of two steps instead of three, ```acquire``` and ```train```.  We'll drop the ```test``` step
here, since at the end of this deployment we can test the system with API requests.

For complete details on KubeFlow Pipelines please refer to the following docs:
1. **[KubeFlow pipelines](https://www.kubeflow.org/docs/pipelines/)**.
2. **[kfp.dsl Python package](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.dsl.html#module-kfp.dsl)**.

Please note, the model server file name in the ```new_model_server``` function call below should be identical in every respect to the name of the model server notebook.

In [35]:
import kfp
from kfp import dsl

In [36]:
srvfn = new_model_server('lgbm', 
                         model_class='MyLGBoostModel', 
                         filename='model-server.ipynb')
srvfn.apply(mount_v3io())

<mlrun.runtimes.function.RemoteRuntime at 0x7ff539ebc2b0>

In [42]:
@dsl.pipeline(name='LGBM', description='lightgbm classifier')
def lgbm_pipeline(learning_rate = [0.1, 0.3], num_leaves = [31, 32]):
    acquire_step = lgbm_job.as_step(
            name='acquire_remote_data',
            handler='acquire',
            params={
                'archive_url': ARCHIVE_URL,
                'header': higgs_header,
                'name':        'raw',
                'target_path': TARGET_PATH},
            outputs=['header'], 
            out_path=TARGET_PATH).apply(mount_v3io())
    
    train_step = lgbm_job.as_step(
            name='train_model', 
            handler='train',
            inputs={'header' : acquire_step.outputs['header']},
            params={
                'src_file':         'raw.parquet',
                'sample':           20000,
                'test_size':        0.1,
                'train_val_split':  0.75,
                'target_path':      TARGET_PATH,
                'name':             MODEL_NAME,
                'key' :             'model',
                'num_leaves':       31,
                'learning_rate':    0.1,
                'verbose':          False,
                'labels':          {'type'      : 'classifier',
                                    'framework' : 'lightgbm',
                                    'mode'      : 'model'}},
            outputs=['model'],
            out_path= TARGET_PATH).apply(mount_v3io())

    srvfn.deploy_step(project='github-demos', 
                      models={'lgbm_pickle': train_step.outputs['model']})

<a id="compile-the-pipeline"></a>
### compile the pipeline

We can compile our KubeFlow pipeline and produce a yaml description of the pipeline workflow:

In [43]:
# TARGET_PATH already ends with a slash, so join instead of concatenating
kfp.compiler.Compiler().compile(lgbm_pipeline, path.join(TARGET_PATH, 'mlrunpipe.yaml'))

In [44]:
client = kfp.Client(namespace='default-tenant')

Finally, the following lines will run the pipeline as a job:

In [45]:
arguments = {
    'learning_rate': [ 0.1, 0.3],
    'num_leaves':    [31, 32]}

run_result = client.create_run_from_pipeline_func(
    lgbm_pipeline, 
    arguments, 
    run_name='my lgbm run',
    experiment_name='lgbm')