# **TensorFlow-Keras and Scikit-Learn With MLRun**

_______________________________________________________________________________

MLRun is an open-source Python package that provides a framework for running machine learning tasks transparently in multiple, scalable, runtime environments.  MLRun provides tracking of code, metadata, inputs, outputs and the results of machine learning pipelines. 

In this notebook we'll compose a pipeline that generates a synthetic dataset, uses it as the input to a training and validation step, and deploys the resulting classifier model. We'll be working with a synthetic features matrix of dimension 1 million rows by 20 features (the `N_SAMPLES` and `M_FEATURES` settings below) and a binary label.  The model will be a 2-layer neural net classifier using **[tensorflow-keras](https://www.tensorflow.org/)** (v2.0.0b1), without GPU support.

The dataset we create is balanced; however, the data generator function takes a `weight` parameter that specifies the fraction of observations labeled 0/False. The number of samples and the number of features are also parameters.  The demonstration could easily be modified to allow finer-grained control over the simulated dataset, either by adding more parameters or by replacing the underlying function altogether.
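
As a rough illustration of what the `weight` parameter controls, the sketch below builds labels so that a chosen fraction of them is 0/False. It uses only the standard library and is a stand-in, not the `create_binary_classification` helper used later in this notebook; `simulate_binary_dataset` and its arguments are hypothetical names.

```python
import random
from typing import List, Tuple


def simulate_binary_dataset(
    n_samples: int, m_features: int, weight: float = 0.5, seed: int = 1
) -> Tuple[List[List[float]], List[int]]:
    """Return (features, labels); a fraction `weight` of the labels is 0,
    rounded to a whole number of rows."""
    rng = random.Random(seed)
    n_negative = round(n_samples * weight)
    labels = [0] * n_negative + [1] * (n_samples - n_negative)
    rng.shuffle(labels)
    # shift the mean of the positive class so the two classes differ
    features = [
        [rng.gauss(float(label), 1.0) for _ in range(m_features)]
        for label in labels
    ]
    return features, labels


X, y = simulate_binary_dataset(n_samples=1_000, m_features=20, weight=0.9)
negative_fraction = y.count(0) / len(y)
```

Setting `weight=0.5` gives the balanced case used in this demonstration; other values give a quick way to look at class imbalance.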

The training and validation step employs a scikit-learn `Pipeline` to perform feature engineering. Some of the feature engineering needs to be done _**after**_ the train-valid-test split: in some preprocessing scenarios we estimate a data transformation on the training set before model training, and then apply that estimate to the validation and test sets before prediction. Since the same transformation must be applied before inference, all fitted pipeline steps are stored.
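
The idea of estimating a transformation on the training set and reusing it elsewhere can be sketched with a small standard-library scaler. This is an illustration only, not the `StandardScaler` used in the pipeline; `ColumnStandardizer` is a hypothetical name.

```python
from statistics import mean, pstdev
from typing import List


class ColumnStandardizer:
    """Standardize columns using statistics estimated in `fit` only."""

    def fit(self, rows: List[List[float]]) -> "ColumnStandardizer":
        columns = list(zip(*rows))
        self.means_ = [mean(col) for col in columns]
        # fall back to 1.0 for constant columns to avoid division by zero
        self.scales_ = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows: List[List[float]]) -> List[List[float]]:
        return [
            [(x - m) / s for x, m, s in zip(row, self.means_, self.scales_)]
            for row in rows
        ]


train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test_rows = [[2.0, 40.0]]

scaler = ColumnStandardizer().fit(train_rows)  # estimate on the training set only
train_scaled = scaler.transform(train_rows)
test_scaled = scaler.transform(test_rows)      # reuse the training estimates
```

The fitted `means_` and `scales_` are exactly the state that has to be stored alongside the model so that the same transformation can be applied at inference time.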

Serializing models can be challenging for a number of reasons. A pipeline with multiple steps may require just as many encoding and decoding routines, and applying Python's `pickle` to a Keras model that has been wrapped in a scikit-learn API fails.  Since we have the model architecture in a class definition, all we need to do is save the weights.  Some steps in a pipeline may have no internal state to store, while others can be stored and loaded using `pickle`.  Most of it boils down to storing dicts/JSON with numpy objects.
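
A minimal sketch of the "dicts/JSON" approach, assuming the weights have already been converted to nested lists (a numpy array would first go through its `tolist()` method). The layer names and the `save_weights`/`load_weights` helpers are hypothetical and are not the routines in the `functions` package.

```python
import json
import os
import tempfile
from typing import Dict, List

# hypothetical fitted state: per-layer weights as nested lists
Weights = Dict[str, List[List[float]]]


def save_weights(weights: Weights, path: str) -> None:
    """Write the weights dict to a JSON file."""
    with open(path, "w") as fh:
        json.dump(weights, fh)


def load_weights(path: str) -> Weights:
    """Read the weights dict back from a JSON file."""
    with open(path) as fh:
        return json.load(fh)


weights = {"dense_1": [[0.1, -0.2], [0.3, 0.4]], "dense_2": [[0.5], [-0.6]]}
with tempfile.TemporaryDirectory() as tmp_dir:
    weights_path = os.path.join(tmp_dir, "weights.json")
    save_weights(weights, weights_path)
    restored = load_weights(weights_path)
```

Because the architecture lives in code, restoring a model under this scheme amounts to rebuilding it from its class definition and assigning the loaded weights.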

One of the upsides of the present architecture is that we can mix many simulations of data with a given model estimator, or many models with a given data sample and track everything in **MLRun**.  Research, development, and deployment, all on one page, running under multiple configurations, limited only by the compute resources at our disposal.


#### **notebook take-aways**
* write and test reusable and replaceable **[MLRun](https://github.com/mlrun)** components in a notebook, file or github repository
* store and load models
* run the components as a **[KubeFlow](https://www.kubeflow.org/)** pipeline

<a id='top'></a>
#### **steps**
**[nuclio code section](#nuclio-code-section)**<br>
    - [nuclio's ignore](#ignore)<br>
    - [function dependencies](#function-dependencies)<br>

**[components](#components)**<br>
    - [supporting functions](#support)<br>
    - [data simulation](#data_generator)<br>
    - [feature engineering](#feateng)<br>
    - [a classifier](#classifier)<br>
    - [save and load pipeline model](#save-load)<br>
    - [training and validation](#train)<br>
**[local tests](#local-testing)**<br>
**[remote tests](#remote)**<br>
**[compose pipeline](#compose)**<br>
**[run](#run)**<br>


______________________________________________

<a id="nuclio-code-section"></a>
# **nuclio code section**

<a id='ignore'></a>
### _nuclio's **ignore** notation_

You'll write all the code that gets packaged for execution between the tags ```# nuclio: ignore```, meaning ignore all the code here and above, and ```# nuclio: end-code```, meaning ignore everything after this annotation.  The **[docs](https://github.com/nuclio/nuclio-jupyter#creating-and-debugging-functions-using-nuclio-magic)** also suggest another approach: we can use ```# nuclio: start``` at the first relevant code cell instead of marking all the cells above with ```# nuclio: ignore```.

See the **[nuclio-jupyter](https://github.com/nuclio/nuclio-jupyter)** repo for further information on these and many other **[nuclio magic commands](https://github.com/nuclio/nuclio-jupyter#creating-and-debugging-functions-using-nuclio-magic)** that make it easy to transform a Jupyter notebook environment into a platform for developing production-quality, machine learning systems.

The following two lines _**should be in the same cell**_ and mark the start of your machine learning coding section:

In [None]:
# nuclio: ignore
import nuclio 

<a id="function-dependencies"></a>
### _function dependencies_

The installs made in the section **[Setup](#Setup)** covered the Jupyter environment within which this notebook runs.  However, we need to ensure that all the dependencies our nuclio function relies upon (such as ```matplotlib```, ```sklearn```, ```lightgbm```), will be available when that code is wrapped up into a nuclio function _**on some presently unknown runtime**_.   Within the nuclio code section we can ensure these dependencies get built into the function with the ```%nuclio cmd``` magic command.

In [None]:
%%nuclio cmd
rm /conda/lib/python3.6/site-packages/seaborn* -rf
pip install -U -q seaborn
pip install -U -q matplotlib
pip install -U -q tensorflow==2.0.0b1
pip install -U -q scikit-learn
pip install -U -q pandas 
pip install -U -q numpy==1.17.4
pip install -U -q git+https://github.com/yjb-ds/functions-demo.git
    
pip uninstall -y mlrun
pip install -U -q git+https://github.com/mlrun/mlrun.git@development

We'll use a standard base image here; however, the build step can be shortened by preparing images with pre-installed packages.

In [None]:
%nuclio config spec.build.baseImage = "python:3.6-jessie"

<a id="support"></a>
### _imports_

Some of the functionality is provided in custom components within the ```functions``` package (found at the github repo **[functions-demo](https://github.com/yjb-ds/functions-demo)**):<br>

- **[datasets](functions/datasets.py)**:&emsp;generate simulation data
- **[files](functions/file_fs.py)**:&emsp;&emsp;&emsp;save and load _remote_ files
- **[models](functions/model_fs.py)**:&nbsp; &emsp;save, load, and instantiate models
- **[plots](functions/plot_fs.py)**:&emsp;  &emsp; sundry plotting functions
- **[tables](functions/tables.py)**:&emsp; &nbsp; &nbsp;logging and retrieving table artifacts

In [None]:
import os
import io
from pickle import dump, load
import json
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
from functions.datasets import create_binary_classification
from functions.models import (build_fn,
                              FeaturesEngineer, 
                              KerasClassifier,
                              class_instance)

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential, save_model, load_model
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.initializers import Constant

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
from typing import Any, Union, Optional, List

In [None]:
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

<a id='paths'></a>
### _paths and parameters_

In [None]:
TARGET_PATH = '/User/mlrun/simdata'

# data simulation and ml training parameters
BATCH_SIZE = 1_024
LEARNING_RATE = 0.1
EPOCHS = 3
N_SAMPLES = 1_000_000
M_FEATURES = 20
CLASS_BALANCE = 0.5
DROPOUT = 0.5

<a id="components" ></a>
______________________________________________

# **components**

<a id='data_generator'></a>
## **data generator**

In [None]:
def data_generator(
    context: MLClientCtx,
    samples: int,
    features: int,
    features_hdr: Optional[List[str]],
    neg_weight: float,
    target_path: str,
    key: str
) -> None:
    """Generate raw data for this pipeline
    
    This component will be the entry point of the pipeline.
    
    In this demonstration our component is a simple wrapper for scikit-learn's 
    `make_classification`, a convenient utility enabling us to build
    and test a pipeline from start to finish with a clean and 
    predictable dataset. By fiddling with neg_weight, we can also take a 
    quick look at the effect of class balance on our model before exposing it
    to the kind of data we find in the real world.
    
    :param context:       function context
    :param samples:       number of samples (rows) to generate
    :param features:      number of features (cols)
    :param features_hdr:  (optional) header for the features array
    :param neg_weight:    fraction of negative samples
    :param target_path:   destination for data including file name
    :param key:           context key of data
    """
    if features_hdr:
        # the header must name every generated feature column
        assert len(features_hdr) == features, f"features header dimension mismatch for {key}"
    data = create_binary_classification(
                context, n_samples=samples, m_features=features,
                features_hdr=features_hdr,  weight=neg_weight, 
                target_path=target_path, key=key)

<a id='feateng'></a>
## **feature engineering**

This class implements the scikit-learn transformer API, enabling it to fit into an sklearn `Pipeline` as a step.<br>

For code please see the custom sklearn transformer `FeaturesEngineer` in **[models.py](functions/models.py)**.  

<a id="classifier"></a>
## **classifier**

This method generates a small keras Sequential model with 2 layers which gets wrapped in a `KerasClassifier` class. The latter provides it with a convenient sklearn interface for use in **[sklearn Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn-pipeline-pipeline)**. The list of metrics collected during training can also be found in the same module as `METRICS` and includes accuracy, precision, recall, auc and a confusion matrix.<br>

For code please see `KerasClassifier` and `build_fn` in **[models.py](functions/models.py)**.  

In [None]:
METRICS = [
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.BinaryAccuracy(name="accuracy"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
    keras.metrics.AUC(name="auc"),
]
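
The `tp`, `fp`, `tn` and `fn` counts above are enough to derive the accuracy, precision and recall figures. The sketch below shows the arithmetic with the standard library only; `summarize_counts` is a hypothetical helper and the counts are made up for illustration.

```python
from typing import Dict


def summarize_counts(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Derive accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


summary = summarize_counts(tp=40, fp=10, tn=45, fn=5)
```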

<a id='train'></a>
## **training and validation**

In this notebook demonstration we follow standard practice by wrapping the training/validation and test steps into the same method.

**exercise / todos**

To complete the demonstration, instead of hard-coding the `train_test_split` method, add a splitter class into the pipeline, like a cross-validator. 

The model encoder/decoder could also be input as a parameter.
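
The `train` function below splits the data twice, first holding out `test_size` of all rows and then `valid_size` of what remains. A short sketch of the resulting shares of the full dataset, using the default values from its signature (`split_fractions` is a hypothetical helper):

```python
from typing import Dict


def split_fractions(test_size: float, valid_size: float) -> Dict[str, float]:
    """Fractions of the full dataset after two successive splits."""
    remainder = 1.0 - test_size      # rows left after the test split
    valid = remainder * valid_size   # validation share of the full dataset
    train = remainder - valid
    return {"train": train, "valid": valid, "test": test_size}


fractions = split_fractions(test_size=0.1, valid_size=0.3)
# approximately train 0.63, valid 0.27, test 0.10
```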

In [None]:
def train(
    context: MLClientCtx,
    dataset: DataItem,
    engineer_cls: str,
    scaler_cls: str,
    classifier_cls: str,
    target_path: str,
    model_key: str = '',
    test_data_key: str = '',
    metrics_key: str = '',
    test_size: float = 0.1,
    valid_size: float = 0.3,
    batch_size: int = 1024,
    epochs: int = 5,
    verbose: bool = True,
    random_state: int = 1,
    ) -> None:
    """Train, validate, test and save a classifier model pipeline.
    
    Here we split the data, instantiate our pipeline and its models, and proceed
    to training and validation.
    
    The target_path defines the base folder where artifacts will be stored.  Since we
    intend to save both the model (and its components), the test set and its predictions,
    and the history of metric estimates we provide three keys.
    
    :param context:             function context
    :param dataset:             cleaned input dataset
    :param engineer_cls:        feature engineering class
    :param scaler_cls:          scaler class
    :param classifier_cls:      classifier class    
    :param target_path:         destination (folder) for artifact files
    :param model_key:           model key in the artifact store
    :param test_data_key:       test set and predictions key in the artifact store
    :param metrics_key:         metrics key in the artifact store
    :param test_size:           (0.1) test set size as fraction
    :param valid_size:          (0.3) validation set size as fraction
    :param batch_size:          (1024) network feed batch size
    :param epochs:              (5) training epochs
    :param verbose:             (default True) show metrics for
                                training/validation steps
    :param random_state:        (1) random seed
    """
    raw = pd.read_parquet(io.BytesIO(dataset.get()), engine='pyarrow')

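    # seed both splits so that runs are reproducible
    train, test = train_test_split(raw, test_size=test_size, random_state=random_state)
    train, valid = train_test_split(train, test_size=valid_size, random_state=random_state)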
    
    y_train = train.pop('labels')
    y_valid = valid.pop('labels')
    y_test = test.pop('labels')

    # instantiate features engineer, scaler and classifier
    Engineer = class_instance(engineer_cls)
    Scaler = class_instance(scaler_cls)
    Classifier = class_instance(classifier_cls)

    pipe = Pipeline(steps=[('engineer', Engineer()),
                           ('scaler', Scaler()),
                           ('classifier', KerasClassifier(build_fn=build_fn, input_size=20))])
    pipe.fit(train, y_train,
             classifier__epochs=epochs, 
             classifier__batch_size=batch_size,
             classifier__validation_split=0.25) # fudge

    y_pred = pipe.predict(test)                          
    
    acc = accuracy_score(y_test, y_pred)
    context.log_result("accuracy", float(acc))
    
    # keras metrics history is a table
    metrics = pd.DataFrame(pipe.named_steps.classifier.model.history.history)
    
    # run plotting routines
    _plot_validation(metrics.loss, 
                     metrics.val_loss,  # Keras prefixes validation metrics with 'val_'
                     target_path=target_path, 
                     key='training-validation-metrics')
    _plot_roc(y_test, y_pred, target_path=target_path, key='roc-curve')
    _plot_confusion_matrix(y_test, y_pred, target_path=target_path, key='confusion-matrix')
    
    
    modelpath = os.path.join(target_path, model_key + '.pkl')
    # use a context manager so the file handle is closed after writing
    with open(modelpath, 'wb') as model_file:
        dump(pipe, model_file)
    context.log_artifact(model_key, target_path=modelpath)
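
The `train` function receives its engineer, scaler and classifier as dotted-path strings and turns them into classes with `class_instance`. The implementation of that helper lives in the `functions` package and is not shown here; the sketch below is one plausible way to do the same thing with `importlib`, under the hypothetical name `load_class`.

```python
from importlib import import_module
from typing import Any


def load_class(dotted_path: str) -> Any:
    """Resolve 'package.module.ClassName' to the class object it names."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    return getattr(import_module(module_path), class_name)


# any importable class works; a standard-library one keeps the example self-contained
OrderedDict = load_class("collections.OrderedDict")
counts = OrderedDict(a=1)
```

Passing class names as strings keeps the run parameters serializable, so a component can be swapped by changing a parameter rather than the function code.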

#### _end of nuclio function definition_

In [None]:
# nuclio: end-code

<a id="local-testing" ></a>
______________________________________________

# **testing your code locally**

The function can be run locally and debugged/tested before deployment:

In [None]:
from mlrun import (mlconf,
                   code_to_function,
                   new_function,
                   NewTask,
                   new_model_server,
                   mount_v3io)  

Set MLRun's DB path.  MLRun will store all of its tracking data and metadata at the location given by the `MLRUN_DBPATH` environment variable (set here through `mlconf.dbpath`).  We set a `TARGET_PATH` earlier in this notebook, in the section **[paths and parameters](#paths)**.

In [None]:
mlconf.dbpath = 'http://mlrun-api:8080'

In [None]:
workflow = new_function()

In [None]:
datagen_run = workflow.run(
    name='data generator',
    handler=data_generator,
    params={
        'samples':      N_SAMPLES,
        'features':     M_FEATURES,
        'neg_weight':   CLASS_BALANCE, # this is a balanced dataset
        'target_path':  TARGET_PATH,
        'key':          'simdata'})

In [None]:
train_run = workflow.run(
    name='train, validate and store model',
    handler=train,
    inputs={
        'dataset': datagen_run.outputs['simdata']},
    params={
        'scaler_cls':     'sklearn.preprocessing.StandardScaler',
        'engineer_cls':   'functions.models.FeaturesEngineer',
        'classifier_cls': 'functions.models.KerasClassifier',
        'target_path':     TARGET_PATH,
        'model_key':       'model',
        'test_data_key':   'test_data',
        'metrics_key':     'metrics',
        'batch_size':      BATCH_SIZE,
        'epochs':          25})

In [None]:
plot_history = workflow.run(
    name='training history',
    handler=plot_validation,
    inputs={'metrics': train_run.outputs['metrics']},
    params={
        'fmt': 'png',
        'target_path': TARGET_PATH,
        'key' : 'training',})

In [None]:
plot_confusion = workflow.run(
    name='confusion matrix',
    handler=plot_confusion_matrix,
    inputs={'test_data': train_run.outputs['test_data']},
    params={'labels': [0, 1],
            'target_path': TARGET_PATH, 
            'key': 'confusion_matrix'})

<a id="image"></a>
#### _Create a deployment image_

Once debugged you can create a reusable image, and then deploy it for testing. In the following line we are converting the code block between the ```#nuclio: ignore``` and ```#nuclio: end-code``` to be run as a KubeJob. _**It is important to ensure that this function has been `deploy`ed at least once, and that you have access to it.**_

In [None]:
tfkeras_job = code_to_function(name='tfkeras_job',
                               runtime="job").apply(mount_v3io())

# set this to True so that updates to our git package are reflected in the built image;
# note that this may lengthen image build times:
# tfkeras_job.spec.no_cache = True

In [None]:
tfkeras_job.deploy()
# other options:
# ignore if exists
# save and/or export

In [None]:
#tfkeras_job.with_code()

<a id="remote"></a>
# **test your code remotely**

In [None]:
task = NewTask()

task.with_params(samples=N_SAMPLES,
                 features=M_FEATURES,
                 neg_weight=CLASS_BALANCE,
                 target_path=TARGET_PATH,
                 key='simdata')

nrun = tfkeras_job.run(task, 
                       handler='data_generator', 
                       out_path=TARGET_PATH)

In [None]:
task.with_input('dataset', nrun.outputs['simdata'])

task.with_params(scaler_cls='sklearn.preprocessing.StandardScaler',
                 engineer_cls='functions.models.FeaturesEngineer',
                 classifier_cls='functions.models.KerasClassifier',
                 target_path=TARGET_PATH,
                 model_key='model',
                 test_data_key='test_data',
                 metrics_key='metrics',
                 batch_size=BATCH_SIZE,
                 epochs=10)

nrun2 = tfkeras_job.run(task, handler='train', out_path=TARGET_PATH)

In [None]:
task.with_input('metrics', nrun2.outputs['metrics'])

task.with_params(fmt='png', target_path=TARGET_PATH, key='training')

nrun3 = tfkeras_job.run(task, handler='plot_validation', out_path=TARGET_PATH)

In [None]:
task.with_input('test_data', nrun2.outputs['test_data'])

task.with_params(labels=[0, 1], fmt='png', target_path=TARGET_PATH, key='confusion_matrix')

nrun4 = tfkeras_job.run(task, handler='plot_confusion_matrix', out_path=TARGET_PATH)

<a id="compose"></a>
# **Create a KubeFlow Pipeline from our functions**

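Our pipeline will consist of four job steps, ```data_generator```, ```train```, ```plot_validation``` and ```plot_confusion_matrix```, followed by a step that deploys the model server.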

For complete details on KubeFlow Pipelines please refer to the following docs:
1. **[KubeFlow pipelines](https://www.kubeflow.org/docs/pipelines/)**.
2. **[kfp.dsl Python package](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.dsl.html#module-kfp.dsl)**.

Please note, the model server file name in the ```new_model_server``` function call below should be identical in every respect to the name of the model server notebook (here, **[model_server.ipynb](#model-server.ipynb)**).

In [None]:
import kfp
from kfp import dsl

In [None]:
srvfn = new_model_server("tfkeras",  
                         model_class="MyKerasClassifier",   
                         filename="model_server.ipynb")
srvfn.apply(mount_v3io())

In [None]:
@dsl.pipeline(
    name="Sklearn and KubeFlow",
    description="Shows how to use mlrun/kfp."
)
def tfkeras_pipeline(
    neg_weight = [0.5, 0.1],
):

    datagen = tfkeras_job.as_step(
        name='data generator',
        handler='data_generator',
        out_path=TARGET_PATH, 
        params={        
            'samples':         N_SAMPLES,
            'features':        M_FEATURES,
            'neg_weight':      CLASS_BALANCE,
            'target_path':     TARGET_PATH,
            'key':            'simdata'},
        outputs=['simdata']).apply(mount_v3io())
    
    train = tfkeras_job.as_step(
        name='sklearn pipe train',
        handler='train',
        out_path=TARGET_PATH, 
        inputs={'dataset': datagen.outputs['simdata']},
        outputs=['model', 'test_data', 'metrics'],
        params={
            'scaler_cls':     'sklearn.preprocessing.StandardScaler',
            'engineer_cls':   'functions.models.FeaturesEngineer',
            'classifier_cls': 'functions.models.KerasClassifier',
            'target_path':     TARGET_PATH,
            'model_key':      'model',
            'test_data_key':  'test_data',
            'metrics_key':    'metrics',
            'batch_size':      BATCH_SIZE,
            'epochs':          10}).apply(mount_v3io())

    plot_valid = tfkeras_job.as_step(
        name='plot training validation accuracy',
        handler='plot_validation',
        out_path=TARGET_PATH, 
        inputs={'metrics': train.outputs['metrics']},
        params={
            'target_path':     TARGET_PATH,
            'key':             'history',
            'fmt':             'png'}).apply(mount_v3io())

    plot_confusion = tfkeras_job.as_step(
        name='plot confusion matrix',
        handler='plot_confusion_matrix',
        out_path=TARGET_PATH, 
        inputs={'test_data': train.outputs['test_data']},
        outputs=['model'],
        params={
            'labels':          [0, 1], 
            'target_path':     TARGET_PATH,
            'key':             'confusion',
            'fmt':             'png'}).apply(mount_v3io())

    # define a nuclio-serving function, generated from a notebook file
    srvfn.deploy_step(project="github-demos", 
                      models={"tfkeras_pickle": train.outputs["model"]})
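
The steps above depend on one another through the outputs they consume, and that is what determines the order in which they can run. The standard-library sketch below (Python 3.9+, `graphlib`) lays out those dependencies and derives one valid execution order. It only illustrates the dependency structure and does not use the `kfp` API; `deploy_model_server` is a hypothetical label for the serving step.

```python
from graphlib import TopologicalSorter

# each step maps to the set of steps whose outputs it consumes
dependencies = {
    "data_generator": set(),
    "train": {"data_generator"},
    "plot_validation": {"train"},
    "plot_confusion_matrix": {"train"},
    "deploy_model_server": {"train"},
}

# one valid ordering: every step appears after the steps it depends on
execution_order = list(TopologicalSorter(dependencies).static_order())
```

The three steps that depend only on `train` have no ordering among themselves, so they are free to run in parallel once training has finished.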

<a id="compile the pipeline"></a>
### _compile the pipeline_

We can compile our KubeFlow pipeline and produce a yaml description of the pipeline workflow:

In [None]:
# os.makedirs(TARGET_PATH, exist_ok=True)
kfp.compiler.Compiler().compile(tfkeras_pipeline, TARGET_PATH+"/mlrunpipe.yaml")

In [None]:
client = kfp.Client(namespace="default-tenant")

Finally, the following line will run the pipeline as a job:

In [None]:
arguments = {
    'neg_weight' : [0.5, 0.1]    
}

run_result = client.create_run_from_pipeline_func(
    tfkeras_pipeline, 
    arguments, 
    run_name="tfkeras",
    experiment_name="tfkeras")

In [None]:
# !mlrun clean