# LightGBM Using Serverless Functions

```mlrun``` is an open-source Python package that provides a framework for running machine learning tasks transparently in multiple, scalable, runtime environments.  ```mlrun``` provides tracking of code, metadata, inputs, outputs and the results of machine learning pipelines. 

In this notebook we'll take a look at using ```mlrun```, ```nuclio``` and KubeFlow to assemble a data acquisition and model training pipeline and deploy it as a nuclio serverless function with an API endpoint for testing.  The focus here is on how all the components interact, and less on the boosting model and its optimization for the Higgs dataset.

#### **notebook take-aways**
* write and test reusable and replaceable **[MLRun](https://github.com/mlrun)** components in a notebook, file or GitHub repository
* store and load models
* run the components as a **[KubeFlow](https://www.kubeflow.org/)** pipeline

<a id='top'></a>
#### **steps**
**[Nuclio code section](#nuclio-code-section)<br>**
    - [nuclio's ignore notation](#nignore)<br>
    - [function dependencies](#function-dependencies)<br>
    - [utility functions](#utilities)<br>
**[Pipeline methods](#pipeline-methods)<br>**
    - [acquire](#acquire)<br>
    - [train](#train)<br>
**[Testing locally](#testing)<br>**
**[Create a deployment image](#image)<br>**
**[Test remotely](#remote)<br>**
**[Create a KubeFlow Pipeline](#pipeline)<br>**
**[Compile the pipeline](#compile-the-pipeline)<br>**

<a id="nuclio-code-section"></a>
______________________________________________

# **nuclio code section**

<a id="nignore"></a>
### nuclio's _**ignore**_ notation

You'll write all the code that gets packaged for execution between two tags: ```# nuclio: ignore```, meaning ignore all the code here and above, and ```# nuclio: end-code```, meaning ignore everything after this annotation.  Methods in this code section can be called separately if designed as such (```acquire```, ```split```, ```train```, ```test```), or, as you'll discover below, they are most often "chained" together to form a pipeline in which the output of one stage serves as the input to the next. The **[docs](https://github.com/nuclio/nuclio-jupyter#creating-and-debugging-functions-using-nuclio-magic)** also suggest another approach: use ```# nuclio: start-code``` at the first relevant code cell instead of marking all the cells above it with ```# nuclio: ignore```.

See the **[nuclio-jupyter](https://github.com/nuclio/nuclio-jupyter)** repo for further information on these and many other **[nuclio magic commands](https://github.com/nuclio/nuclio-jupyter#creating-and-debugging-functions-using-nuclio-magic)** that make it easy to transform a Jupyter notebook environment into a platform for developing production-quality, machine learning systems.
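As a rough mental model of the two annotations described above, the sketch below filters a list of cell sources down to the ones that would be packaged. It is a toy illustration of the behaviour as this notebook describes it, not nuclio-jupyter's implementation; `select_function_cells` and the sample cells are hypothetical.

```python
from typing import List

IGNORE_TAG = "# nuclio: ignore"
END_TAG = "# nuclio: end-code"


def select_function_cells(cells: List[str]) -> List[str]:
    """Return the cells that lie between the ignore tag and the end-code tag.

    Hypothetical helper: everything up to and including the cell that starts
    with the ignore tag is dropped, as is everything from the end-code cell on.
    """
    start, end = 0, len(cells)
    for i, src in enumerate(cells):
        stripped = src.strip()
        first_line = stripped.splitlines()[0].strip() if stripped else ""
        if first_line == IGNORE_TAG:
            start = i + 1          # packaging starts after this cell
        elif first_line == END_TAG:
            end = i                # packaging stops before this cell
            break
    return cells[start:end]


notebook_cells = [
    "!pip install mlrun",                 # setup, not packaged
    "# nuclio: ignore\nimport nuclio",    # marks the start of the code section
    "def train(context):\n    pass",      # packaged
    "# nuclio: end-code",                 # marks the end of the code section
    "train(None)",                        # local test, not packaged
]
packaged = select_function_cells(notebook_cells)
```

Only the `train` definition survives the filter, which is why local test calls and environment setup can live in the same notebook without ending up in the deployed function.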

The following two lines _**should be in the same cell**_ and mark the start of your machine learning coding section:

In [1]:
# nuclio: ignore
import nuclio 

<a id='function-dependencies'></a>
### function dependencies

The installs made in the section **[Setup](#Setup)** covered the Jupyter environment within which this notebook runs.  However, we need to ensure that all the dependencies our nuclio function relies upon (such as ```matplotlib```, ```sklearn```, ```lightgbm```), will be available when that code is wrapped up into a nuclio function _**on some presently unknown runtime**_.   Within the nuclio code section we can ensure these dependencies get built into the function with the ```%nuclio cmd``` magic command.

In [2]:
%%nuclio cmd -c
rm /conda/lib/python3.6/site-packages/seaborn* -rf
pip uninstall -y mlrun
pip install -U -q mlrun
pip install -U -q kfp
pip install -U -q pyarrow
pip install -U -q pandas
pip install -U -q matplotlib
pip install -U -q seaborn
pip install -U -q scikit-learn
pip install -U -q lightgbm

We'll use a standard base image here; however, the build step can be shortened by preparing images with pre-installed packages.

In [3]:
%nuclio config spec.build.baseImage = "python:3.6-jessie"

%nuclio: setting spec.build.baseImage to 'python:3.6-jessie'


In [4]:
from io import BytesIO
from os import path, makedirs
import json
from cloudpickle import load, dump
from pathlib import Path
from urllib.request import urlretrieve
from typing import IO, AnyStr, TypeVar, Union, List, Optional

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import (roc_curve, confusion_matrix)
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
from matplotlib.figure import Figure
import seaborn as sns

import pyarrow.parquet as pq
import pyarrow as pa
from pyarrow import Table

from mlrun.artifacts import TableArtifact, PlotArtifact
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

_______________

<a id='utilities'></a>
### utility functions

We log tables to the artifact store and read them back often in this demo, so we provide two utilities for those tasks:

In [5]:
def get_context_table(ctxtable: DataItem) -> pd.DataFrame:
    """deserialize a table in the artifact store into a DataFrame
    
    :param ctxtable:  table in the artifact store
    """
    blob = BytesIO(ctxtable.get())
    return pd.read_parquet(blob, engine='pyarrow')

In [6]:
def log_context_table(
    context: MLClientCtx,
    target_path: str,
    key: str,
    table: pd.DataFrame
) -> None:
    """Log a table in the artifact store.
    
    The table is written as a parquet file, and its target
    path is saved in the context.
    
    :param context:      the function context
    :param target_path:  location (folder) of our DataItem
    :param key:          name of the object in the artifact store
    :param table:        the object we wish to store
    """
    filepath = path.join(target_path, key + '.parquet')
    pq.write_table(pa.Table.from_pandas(table), filepath)    
    context.log_artifact(key, target_path=filepath)

#### ```train```
We use only two parameters for demonstration purposes, ```learning_rate``` and ```num_leaves```. See **[LightGBM Parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html#parameters)** for
more detail on the other parameters available and their default values.
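The sketch below shows one way keyword parameters such as ```learning_rate``` and ```num_leaves``` can travel from a caller, through a `**kwargs` catch-all, into an estimator's constructor. `StandInClassifier` and `build_classifier` are hypothetical names used only for illustration; the stand-in takes the place of a real estimator such as `lgb.LGBMClassifier` so that the example runs without LightGBM installed.

```python
from typing import Any


class StandInClassifier:
    """Hypothetical stand-in for an estimator such as lgb.LGBMClassifier."""

    def __init__(self, learning_rate: float = 0.1, num_leaves: int = 31,
                 **kwargs: Any) -> None:
        self.learning_rate = learning_rate
        self.num_leaves = num_leaves
        self.extra = kwargs  # anything else the caller passed through


def build_classifier(random_state: int = 1, **model_params: Any) -> StandInClassifier:
    # every keyword not named in the signature is collected in model_params
    # and forwarded unchanged to the estimator
    return StandInClassifier(random_state=random_state, **model_params)


clf_sketch = build_classifier(learning_rate=0.3, num_leaves=32)
default_sketch = build_classifier()
```

Forwarding a catch-all dictionary keeps the training function's signature short while still letting a pipeline step override any estimator parameter by name.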


In [7]:
def log_lgbm_model(
    context: MLClientCtx,
    model,
    data,
    header: List = [],
    target_path: str = '',
    name: str = '',  # with file extension
    key: str = 'model',
    exp_labels: dict = {}
):
    """log a classifier model to the artifact store
    
    :param context:       function context
    :param model:         estimated model
    :param data:          test data, a dict with keys 'xtest' and 'ytest'
    :param header:        features labels
    :param target_path:   destination folder for file artifacts
    :param name:          name of model file (or, prefix to model files)
    :param key:           key of model in artifact store
    :param exp_labels:    model artifact labels
    
    Save an estimated model along with its metadata, its training-validation
    metrics history and plots, ROC curve, confusion matrix and feature importances.
    """
    def _gcf_clear(plt):
        plt.cla()
        plt.clf()
        plt.close()        
    
    def plot_validation(train_metric, valid_metric):
        """Plot train and validation loss curves from a metrics table in an
        artifact store.

        These curves represent the training round losses from the training
        and validation sets.
        :param train_metric:    train metric
        :param valid_metric:    validation metric
        """
        plt.plot(train_metric)
        plt.plot(valid_metric)
        plt.title("training validation results")
        plt.xlabel("epoch")
        plt.ylabel("")
        plt.legend(["train", "valid"])
        fig = plt.gcf()

        plotpath = path.join(target_path, "history.png")
        plt.savefig(plotpath)
        context.log_artifact(PlotArtifact('training-validation-plot', body=fig, target_path=plotpath))

        # to ensure we don't overwrite this figure when creating the next:
        _gcf_clear(plt)

    def plot_roc(y_labels, y_probs):
        """Plot an ROC curve from test data saved in an artifact store.
        :param y_labels:        test data labels
        :param y_probs:         predicted probabilities of the positive class
        """
        fpr_xg, tpr_xg, _ = roc_curve(y_labels, y_probs)
        plt.plot([0, 1], [0, 1], "k--")
        plt.plot(fpr_xg, tpr_xg, label="roc")
        plt.xlabel("false positive rate")
        plt.ylabel("true positive rate")
        plt.title("roc curve")
        plt.legend(loc="best")
        fig = plt.gcf()

        plotpath = path.join(target_path, "roc.png")
        fig.savefig(plotpath)
        context.log_artifact(PlotArtifact('roc', body=fig))

        # to ensure we don't overwrite this figure when creating the next:
        _gcf_clear(plt)

    def plot_confusion_matrix(labels, predictions):
        """Create a confusion matrix.
        Plot and save a confusion matrix using test data from a
        pipeline step.  The plot is generated using default arguments.
        The present example could be extended by including a parameters `dict`
        that is passed through to sklearn's `confusion_matrix`,
        `ConfusionMatrixDisplay`, and matplotlib `plot`.
        :param labels:          test data labels
        :param predictions:     test data predictions
        """
        cm = confusion_matrix(labels,
                              predictions,
                              sample_weight=None,
                              normalize='all')
        sns.heatmap(cm, annot=True, cmap="Blues")
        plotpath = path.join(target_path, "confusion.png")
        fig = plt.gcf()
        fig.savefig(plotpath)
        context.log_artifact(PlotArtifact('confusion_matrix', body=fig))

        # to ensure we don't overwrite this figure when creating the next:
        _gcf_clear(plt)

    def plot_importance(model, header: List = []):
        """Display estimated feature importances.

        :param model:       fitted lightgbm model
        :param header:      list of feature names
        """
        # create a feature importance table with desired labels
        zipped = zip(model.feature_importances_, header)

        feature_imp = pd.DataFrame(sorted(zipped), columns=['freq','feature']
                                  ).sort_values(by="freq", ascending=False)

        plt.figure(figsize=(20, 10))
        sns.barplot(x="freq", y="feature", data=feature_imp)
        plt.title('LightGBM Features')
        plt.tight_layout()
        fig = plt.gcf()
        plotpath = path.join(target_path, "feature-importances.png")
        fig.savefig(plotpath)
        context.log_artifact(PlotArtifact('feature-importances-plot', body=fig))

        # feature importances are also saved as a table:
        tablepath = path.join(target_path, "feature-importances-table.csv")
        feature_imp.to_csv(tablepath)
        context.log_artifact(TableArtifact('feature-importances-table', target_path=tablepath))

        # to ensure we don't overwrite this figure when creating the next:
        _gcf_clear(plt)

    if callable(getattr(model, 'predict_proba', None)):
        ypred_probs = model.predict_proba(data['xtest'])[:, 1]
        ypred = np.where(ypred_probs >= 0.5, 1, 0)
    else:
        ypred = model.predict(data['xtest'])
        ypred_probs = None

    context.log_result("test_accuracy", float(model.score(data['xtest'], data['ytest'])))

    loss = np.asarray(model.evals_result_['train']['binary_logloss'], dtype=float)
    val_loss = np.asarray(model.evals_result_['valid']['binary_logloss'], dtype=float)

    plot_validation(loss, val_loss)
    # arrays have no unambiguous truth value, so compare against None explicitly
    if ypred_probs is not None:
        plot_roc(data['ytest'], ypred_probs)
    if ypred is not None:
        plot_confusion_matrix(data['ytest'], ypred)
    if hasattr(model, 'feature_importances_'):
        plot_importance(model, header)
   
    # save the model and log it as an artifact
    filepath = path.join(target_path, name)
    with open(filepath, 'wb') as fh:
        dump(model, fh)
    context.log_artifact(key,
                         target_path=filepath,
                         labels=exp_labels)
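To make the evaluation step in `log_lgbm_model` concrete, here is a dependency-free sketch of the two operations it relies on: thresholding positive-class probabilities at 0.5, and building a confusion matrix normalised over all samples (the effect of `normalize='all'`). The helper names and the four-sample data are made up for illustration.

```python
from typing import List, Sequence


def threshold(probs: Sequence[float], cutoff: float = 0.5) -> List[int]:
    """Turn positive-class probabilities into 0/1 predictions."""
    return [1 if p >= cutoff else 0 for p in probs]


def normalized_confusion(labels: Sequence[int],
                         preds: Sequence[int]) -> List[List[float]]:
    """2x2 confusion matrix as fractions of all samples.

    Rows index the true class and columns the predicted class.
    """
    counts = [[0, 0], [0, 0]]
    for y, p in zip(labels, preds):
        counts[y][p] += 1
    total = len(labels)
    return [[c / total for c in row] for row in counts]


y_true = [0, 0, 0, 1]            # illustrative labels
y_prob = [0.2, 0.6, 0.1, 0.9]    # illustrative probabilities
y_pred = threshold(y_prob)
cm = normalized_confusion(y_true, y_pred)
```

Because the matrix is normalised over all samples, its four cells sum to 1, and the diagonal sums to the accuracy.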

In [8]:
def train(
    context: MLClientCtx,
    src_file: str,
    header: DataItem,
    test_size: float = 0.1,
    train_val_split: float = 0.75,
    sample: int = -1,
    target_path: str = '',
    name: str = '',
    key: str = '',
    exp_labels = {},  # 'lightgbm_sklearn' if this were a pipeline
    verbose: bool = False,
    random_state = np.random.RandomState(1),
    **sklearn_params
) -> None:
    """Train and save a LightGBM model.
    
    :param context:         the function context
    :param src_file:        ('raw') name of raw data file
    :param header:          header artifact
    :param test_size:       (0.1) test set size
    :param train_val_split: (0.75) Once the test set has been removed the 
                            training set gets this proportion.
    :param sample:          (-1). Selects the first n rows, or select a sample starting
                            from the first. If negative <-1, select a random sample from 
                            the entire file
    :param target_path:     folder location of files
    :param name:            destination name for model file
    :param key:             key for model artifact
    :param exp_labels:      metadata dict, some keys are required (type, framework). 'type'
                            is either classifier or regressor, 'framework' can be sklearn or not
                            (sklearn models have a generic interface)
    :param verbose:         (False) show metrics for training/validation steps.
    :param random_state:    (1) sklearn rng seed
    :param sklearn_params:  sklearn keyword params
    """
    # load local data
    srcfilepath = path.join(target_path, src_file)
    # optionally keep only a sample, intended for debugging
    raw = pq.read_table(srcfilepath).to_pandas()
    if sample >= 1:
        # contiguous sample starting at the first row
        raw = raw.iloc[:sample, :]
    elif sample < -1:
        # random sample of abs(sample) rows
        raw = raw.sample(-sample)
    # sample == -1 keeps every row
    labels = raw.pop('labels')

    x, xtest, y, ytest = train_test_split(raw, labels, train_size=1-test_size, 
                                          random_state=random_state)
   
    xtrain, xvalid, ytrain, yvalid = train_test_split(x, y, 
                                                      train_size=train_val_split, 
                                                      random_state=random_state)        
    
    clf = lgb.LGBMClassifier(random_state=random_state,
                             verbose=int(verbose),
                             **sklearn_params)

    eval_results = dict()

    clf.fit(xtrain, 
            ytrain,
            eval_set=[(xvalid, yvalid), (xtrain, ytrain)],
            eval_names=['valid', 'train'],
            callbacks=[lgb.record_evaluation(eval_results)],
            verbose=verbose)
    
    context.log_result("train_accuracy", float(clf.score(xtrain, ytrain)))
    
    log_lgbm_model(
        context, 
        clf, 
        data = {'xtest':xtest, 'ytest':ytest},
        target_path=target_path,
        header=load(open(str(header), 'rb')),
        name=name, 
        key=key,
        exp_labels=exp_labels)
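The `sample` parameter of `train` packs three behaviours into one integer. The sketch below spells out the semantics given in the docstring using row indices only, so it runs without pandas. `select_rows` is a hypothetical helper, and because the docstring does not define `sample=0`, the sketch rejects it.

```python
import random
from typing import List


def select_rows(n_rows: int, sample: int = -1, seed: int = 1) -> List[int]:
    """Return the row indices that a given `sample` value would keep."""
    if sample == -1:
        # keep every row
        return list(range(n_rows))
    if sample >= 1:
        # first `sample` rows, in file order
        return list(range(min(sample, n_rows)))
    if sample < -1:
        # random sample of abs(sample) rows drawn from the whole file
        rng = random.Random(seed)
        return sorted(rng.sample(range(n_rows), min(-sample, n_rows)))
    raise ValueError("sample must be -1, >= 1, or < -1")


all_rows = select_rows(5)
first_three = select_rows(5, 3)
random_four = select_rows(10, -4)
```

A contiguous head sample is cheap and reproducible, while a random sample is the safer choice when the file is ordered in a way that correlates with the label.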

#### **end of nuclio function definition**

In [9]:
# nuclio: end-code

Some useful parameters to keep the notebook neat:
    

In [10]:
ARCHIVE_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"
FILE_NAME = 'higgs.parquet'
CHUNK_SIZE = 10_000
TARGET_PATH = '/User/mlrun/models/'
MODEL_NAME = 'lgb-classifier.pkl'

In [11]:
HIGGS_HEADER = ['labels', 'lepton_pT', 'lepton_eta', 'lepton_phi', 'missing_energy_magnitude', 'missing_energy_phi',
 'jet_1_pt', 'jet_1_eta', 'jet_1_phi', 'jet_1_b-tag', 'jet_2_pt', 'jet_2_eta', 'jet_2_phi', 'jet_2_b-tag', 'jet_3_pt',
 'jet_3_eta', 'jet_3_phi', 'jet_3_b-tag', 'jet_4_pt', 'jet_4_eta', 'jet_4_phi', 'jet_4_b-tag', 'm_jj', 'm_jjj', 'm_lv',
 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']

In [12]:
import mlrun

#### _acquire_ - use an existing github function to acquire and store data

In [None]:
acquire_job = mlrun.import_function(
    'https://raw.githubusercontent.com/yjb-ds/functions/lgbm-serving/fileutils/arc_to_parquet/arc_to_parquet.yaml'
).apply(mlrun.mount_v3io())
acquire_job.deploy()



[mlrun] 2020-01-21 21:42:55,693 database connection is not configured
[mlrun] 2020-01-21 21:42:55,694 building image (.mlrun/func-default-arc-to-parquet-latest)
FROM python:3.6-jessie
RUN python -m pip uninstall mlrun
RUN python -m pip install -U -q mlrun
RUN python -m pip install -U -q pandas
RUN python -m pip install -U -q pyarrow
RUN python -m pip install -U -q numpy==1.17.4
RUN pip install mlrun

[mlrun] 2020-01-21 21:42:55,696 using in-cluster config.
[mlrun] 2020-01-21 21:42:55,713 Pod mlrun-build-arc-to-parquet-fzdsd created
..
INFO[0000] Resolved base name python:3.6-jessie to python:3.6-jessie 
INFO[0000] Resolved base name python:3.6-jessie to python:3.6-jessie 
INFO[0000] Downloading base image python:3.6-jessie     
INFO[0000] Error while retrieving image from cache: getting file info: stat /cache/sha256:0318d80cb241983eda20b905d77fa0bfb06e29e5aabf075c7941ea687f1c125a: no such file or directory 
INFO[0000] Downloading base image 

#### _train_ - use the notebook train function

In [None]:
# lgbm_job = mlrun.code_to_function(
#     name='lgbm_job', 
#     runtime='job', 
#     with_doc=False)  #.apply(mlrun.mount_v3io())

# lgbm_job.export('/User/repos/functions/serving/train-lgbm.yaml')

#### _train_ - use the github function spec

In [None]:
train_job = mlrun.import_function(
    '/User/repos/functions/serving/train-lgbm.yaml'
).apply(mlrun.mount_v3io())
train_job.deploy()

<a id="pipeline"></a>
### Create a KubeFlow Pipeline from our functions

Our pipeline will consist of two steps instead of three, ```acquire``` and ```train```.  We'll drop the ```test``` step
here since at the end of this deployment we can test the system with API requests.

For complete details on KubeFlow Pipelines please refer to the following docs:
1. **[KubeFlow pipelines](https://www.kubeflow.org/docs/pipelines/)**.
2. **[kfp.dsl Python package](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.dsl.html#module-kfp.dsl)**.

Please note, the model server file name in the ```new_model_server``` function call below should be identical in every respect to the name of the model server notebook.

In [None]:
import kfp
from kfp import dsl

In [None]:
srvfn = mlrun.new_model_server(
    'classifier', 
    model_class='ClassifierModel', 
    filename='/User/repos/functions/serving/classifier_server.ipynb')

srvfn.apply(mlrun.mount_v3io())

In [None]:
@dsl.pipeline(name='LGBM', description='lightgbm classifier')
def lgbm_pipeline(learning_rate = [0.1, 0.3], num_leaves = [31, 32]):
    acquire_step = acquire_job.as_step(
            name='acquire_remote_data',
            handler='arc_to_parquet',
            params={
                'archive_url': ARCHIVE_URL,
                'header':      HIGGS_HEADER,
                'name':        FILE_NAME,
                'target_path': TARGET_PATH},
            outputs=['header'], 
            out_path=TARGET_PATH).apply(mlrun.mount_v3io())
    
    train_step = train_job.as_step(
            name='train_model', 
            handler='train',
            inputs={'header' : acquire_step.outputs['header']},
            params={
                'src_file':         FILE_NAME,
                'sample':           20000,
                'test_size':        0.1,
                'train_val_split':  0.75,
                'target_path':      TARGET_PATH,
                'name':             MODEL_NAME,
                'key' :             'model',
                'num_leaves':       31,
                'learning_rate':    0.1,
                'verbose':          False,
                'exp_labels':      {'type'      : 'classifier',
                                    'framework' : 'lightgbm',
                                    'mode'      : 'model'}},
            outputs=['model'],
            out_path= TARGET_PATH).apply(mlrun.mount_v3io())

    srvfn.deploy_step(
        project='default', 
        models={'classifier_gen': train_step.outputs['model']})
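With the parameters passed to the training step above (`sample=20000`, `test_size=0.1`, `train_val_split=0.75`), the two successive splits inside `train` carve the data into three parts. The sketch below works through that arithmetic; `split_sizes` is a hypothetical helper, and the rounding used here (floor for the training share, remainder for the rest) is an assumption about how the fractions are turned into row counts.

```python
import math
from typing import Tuple


def split_sizes(n_rows: int,
                test_size: float = 0.1,
                train_val_split: float = 0.75) -> Tuple[int, int, int]:
    """Return (n_train, n_valid, n_test) for two successive splits."""
    # first split: hold out the test set
    n_trainval = math.floor((1 - test_size) * n_rows)
    n_test = n_rows - n_trainval
    # second split: divide what is left into train and validation
    n_train = math.floor(train_val_split * n_trainval)
    n_valid = n_trainval - n_train
    return n_train, n_valid, n_test


n_train, n_valid, n_test = split_sizes(20000)
```

For 20,000 rows this gives 13,500 training rows, 4,500 validation rows and 2,000 test rows, so `train_val_split` is a fraction of the rows remaining after the test set is removed, not of the whole sample.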

<a id="compile-the-pipeline"></a>
### compile the pipeline

We can compile our KubeFlow pipeline and produce a yaml description of the pipeline workflow:

In [None]:
kfp.compiler.Compiler().compile(lgbm_pipeline, TARGET_PATH + 'mlrunpipe.yaml')

In [None]:
client = kfp.Client(namespace='default-tenant')

Finally, the following line will run the pipeline as a job:

In [None]:
arguments = {}

run_result = client.create_run_from_pipeline_func(
    lgbm_pipeline, 
    arguments, 
    run_name='my classifier run',
    experiment_name='classifier')