# **Tensorflow-Keras and Scikit-Learn With MLRun**

_______________________________________________________________________________

MLRun is an open-source Python package that provides a framework for running machine learning tasks transparently in multiple, scalable, runtime environments.  MLRun provides tracking of code, metadata, inputs, outputs and the results of machine learning pipelines. 

In this notebook we'll compose a pipeline that generates a synthetic dataset, uses it as the input to a training and validation step, and deploys the resulting classifier model. We'll be working with a synthetic features matrix of 1 million rows by 20 features and a binary label.  The model will be a 2-layer neural net classifier using **[tensorflow-keras](https://www.tensorflow.org/)** (v2.0.0b1), without GPU support.

The dataset we create is balanced; however, there is a `weight` parameter in the data generator function specifying the fraction of observations that are labeled 0/False. The number of samples and features are also parameters.  The demonstration could easily be modified to allow finer-grained control over the simulated dataset, either by adding more parameters or by replacing the underlying function altogether.
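To make the role of the `weight` parameter concrete, here is a minimal standard-library sketch of how a negative-class fraction might translate into labels. The helper `simulate_labels` and its arguments are hypothetical and are not part of `functions/datasets.py`; the notebook itself delegates this work to `create_binary_classification`.

```python
import random


def simulate_labels(n_samples: int, neg_weight: float, seed: int = 0) -> list:
    """Return shuffled 0/1 labels where about neg_weight of them are 0.

    Hypothetical helper for illustration only.
    """
    n_neg = round(n_samples * neg_weight)
    labels = [0] * n_neg + [1] * (n_samples - n_neg)
    random.Random(seed).shuffle(labels)  # seeded so the result is repeatable
    return labels


balanced = simulate_labels(1_000, neg_weight=0.5)
skewed = simulate_labels(1_000, neg_weight=0.9)
print(balanced.count(0), skewed.count(0))
```

Changing `neg_weight` from 0.5 to 0.9 is the kind of experiment the paragraph above has in mind: the same model can then be trained on a balanced and on a skewed sample.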

The training and validation step employs a scikit-learn `Pipeline` to perform feature engineering. Some of the feature engineering needs to be done _**after**_ the train-valid-test set split. In some preprocessing scenarios we might estimate a data transformation on the training set before model training, and then apply the estimate to the validation and test sets before prediction. Since we need to perform the same transformation before inference, all pipeline model steps are stored.
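The idea of estimating a transformation on the training set and then re-applying it elsewhere can be sketched without any dependencies. `TinyScaler` is a hypothetical stand-in for a fitted step such as `StandardScaler`, operating on a single column; it is not the code used in this notebook.

```python
from statistics import mean, pstdev


class TinyScaler:
    """Hypothetical single-column stand-in for a scaler such as StandardScaler."""

    def fit(self, column):
        # estimates come from the data passed to fit() and nothing else
        self.mean_ = mean(column)
        self.scale_ = pstdev(column) or 1.0  # avoid dividing by zero
        return self

    def transform(self, column):
        return [(x - self.mean_) / self.scale_ for x in column]


train_col = [1.0, 2.0, 3.0, 4.0, 5.0]
test_col = [6.0, 7.0]

scaler = TinyScaler().fit(train_col)         # estimate on the training set only
train_scaled = scaler.transform(train_col)
test_scaled = scaler.transform(test_col)     # reuse the training estimates
```

Because the test values are scaled with the training mean and deviation, they do not end up centered on zero. That is the intended behavior, and it is also why the fitted estimates have to be stored alongside the model.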

Serializing models can be challenging for a number of reasons. A pipeline with multiple steps may require just as many encoding and decoding routines: for example, applying joblib to a Keras model that has been wrapped in a scikit-learn API fails.  Since we have the model architecture in a class definition, all we need to do is save the weights.  Some steps in a pipeline may have no internal state to store, while others can be stored and loaded using `joblib`.  Most of it boils down to storing dicts/JSON with numpy objects.

One of the upsides of the present architecture is that we can mix many simulations of data with a given model estimator, or many models with a given data sample and track everything in **MLRun**.  Research, development, and deployment, all on one page, running under multiple configurations, limited only by the compute resources at our disposal.


#### **notebook take-aways**
* write and test reusable and replaceable **[MLRun](https://github.com/mlrun)** components in a notebook
* store and load models
* run the components as a **[KubeFlow](https://www.kubeflow.org/)** pipeline

<a id='top'></a>
#### **steps**
**[install the python mlrun package](#install)**<br>
**[nuclio code section](#nuclio-code-section)**<br>
    - [nuclio's ignore](#ignore)<br>
    - [function dependencies](#function-dependencies)<br>

**[components](#components)**<br>
    - [supporting functions](#utilties)<br>
    - [data simulation](#datasim)<br>
    - [feature engineering](#feateng)<br>
    - [a classifier](#classifier)<br>
    - [training and validation](#train)<br>
**[local tests](#local-testing)**<br>
**[compose pipeline](#image)**<br>
**[run](#run)**<br>

<a id="install" ></a>
______________________________________________

# **notebook installs**

The following will reinstall the latest development version of ```mlrun```:

    !pip uninstall mlrun -y
        
    !pip install -U git+https://github.com/mlrun/mlrun.git@development

Install the KubeFlow pipelines package ```kfp```. For more information see the **[KubeFlow documentation on nuclio](https://www.kubeflow.org/docs/components/misc/nuclio/)** and  **[Kubeflow pipelines and nuclio](https://github.com/kubeflow/pipelines/tree/master/components/nuclio)**. For logging the estimated machine learning models we'll use ```joblib```'s ```dump``` and ```load```. For more details see **[Joblib: running Python functions as pipeline jobs](https://joblib.readthedocs.io/en/latest/index.html)**.

    !pip install -U kfp joblib seaborn tensorflow==2.0.0b1 keras

<a id="nuclio-code-section"></a>
______________________________________________

# **nuclio code section**

<a id='ignore'></a>
### _nuclio's **ignore** notation_

You'll write all the code that gets packaged for execution between the tags ```# nuclio: ignore```, meaning ignore all the code here and above, and ```# nuclio: end-code```, meaning ignore everything after this annotation.  The **[docs](https://github.com/nuclio/nuclio-jupyter#creating-and-debugging-functions-using-nuclio-magic)** also suggest another approach: we can use ```# nuclio: start``` at the first relevant code cell instead of marking all the cells above with ```# nuclio: ignore```.

See the **[nuclio-jupyter](https://github.com/nuclio/nuclio-jupyter)** repo for further information on these and many other **[nuclio magic commands](https://github.com/nuclio/nuclio-jupyter#creating-and-debugging-functions-using-nuclio-magic)** that make it easy to transform a Jupyter notebook environment into a platform for developing production-quality, machine learning systems.

The ```nuclio-jupyter``` package provides methods for automatically generating and deploying nuclio serverless functions from code, repositories or Jupyter notebooks. **_If you have never run nuclio functions in your notebooks, please uncomment and run the following_**:

        !pip install nuclio-jupyter

The following two lines _**should be in the same cell**_ and mark the start of your machine learning coding section:

In [1]:
# nuclio: ignore
import nuclio 

<a id="function-dependencies"></a>
### _function dependencies_

The installs made in the section **[notebook installs](#install)** covered the Jupyter environment within which this notebook runs.  However, we need to ensure that all the dependencies our nuclio function relies upon (such as ```matplotlib```, ```sklearn```, ```tensorflow```) will be available when that code is wrapped up into a nuclio function _**on some presently unknown runtime**_.   Within the nuclio code section we can ensure these dependencies get built into the function with the ```%nuclio cmd``` magic command.

In [2]:
%nuclio cmd -c pip install -U matplotlib tensorflow==2.0.0b1 keras scikit-learn pandas numpy joblib
%nuclio cmd -c pip install -U git+https://github.com/mlrun/mlrun.git@development

We'll use a standard base image here; however, the build step can be shortened by preparing images with pre-installed packages.

In [3]:
%nuclio config spec.build.baseImage = "python:3.6-jessie"

%nuclio: setting spec.build.baseImage to 'python:3.6-jessie'


### _imports_

Some of the functionality is provided in supporting components within the ```functions``` folder.<br>

Links to the code:
- **[datasets](functions/datasets.py)**:&emsp;generate simulation data
- **[files](functions/file_fs.py)**:&emsp;&emsp;&emsp;save and load files
- **[models](functions/models.py)**:&nbsp; &emsp;save, load, and instantiate models
- **[plots](functions/plot_fs.py)**:&emsp;  &emsp; sundry plotting functions
- **[tables](functions/tables.py)**:&emsp; &nbsp; &nbsp;logging and retrieving table artifacts

In [4]:
from functions.datasets import create_binary_classification
from functions.tables import log_context_table, get_context_table
from functions.models import FeaturesEngineer, Classifier, class_instance



In [5]:
import os
from typing import Any, Union, Optional, List

In [6]:
import joblib
import json
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [7]:
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

<a id='paths'></a>
### _paths and parameters_

In [10]:
TARGET_PATH = '/User/mlrun/simdata'
IMAGE_NAME = "mlrun/mlruntfkeras:latest"
MODEL_NAME = 'my-binclass-tfkeras-model'

# data simulation and ml training parameters
BATCH_SIZE = 1024
LEARNING_RATE = 0.1
EPOCHS = 3
N_SAMPLES = 1_000_000
M_FEATURES = 20
CLASS_BALANCE = 0.5
DROPOUT = 0.5

<a id="components" ></a>
______________________________________________

# **components**

<a id='datasim'></a>
## **data generator**

In [11]:
def data_generator(
    context: MLClientCtx,
    samples: int,
    features: int,
    features_hdr: Optional[List[str]],
    neg_weight: float,
    target_path: str,
    key: str
) -> None:
    """Generate raw data for this pipeline
    
    This component will be the entry point of the pipeline.
    
    In this demonstration our component is a simple wrapper for scikit-learn's 
    `make_classification`, a convenient utility enabling us to build
    and test a pipeline from start to finish with a clean and 
    predictable dataset. By fiddling with neg_weight, we can also take a 
    quick look at the effect of class balance on our model before exposing it
    to the kind of data we find in the real world.
    
    :param context:       function context
    :param samples:       number of samples (rows) to generate
    :param features:      number of features (cols)
    :param features_hdr:  (optional) header for the features array
    :param neg_weight:    fraction of negative samples
    :param target_path:   destination for data including file name
    :param key:           context key of data
    """
    if features_hdr:
        # `features` and `key` are the names actually defined in this scope
        assert len(features_hdr) == features, f"features header dimension mismatch for {key}"
    data = create_binary_classification(
                context, n_samples=samples, m_features=features,
                features_hdr=features_hdr,  weight=neg_weight, 
                target_path=target_path, key=key)

<a id='feateng'></a>
## **feature engineering**

For code please see the custom sklearn transformer `FeaturesEngineer` in **[models.py](functions/models.py)**.

<a id="classifier"></a>
## **classifier**

For code please see `KerasClassifier` and `classifier_gen` in **[models.py](functions/models.py)**.  `classifier_gen` generates a small Keras `Sequential` model with 2 layers, which gets wrapped in a `KerasClassifier` class. The latter provides it with a convenient sklearn interface for use in **[sklearn Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn-pipeline-pipeline)**. The list of metrics collected during training can also be found in the same module as `METRICS` and includes accuracy, precision, recall, AUC and a confusion matrix.
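As a reminder of what some of these metrics measure, the following dependency-free sketch derives accuracy, precision, recall and a 2x2 confusion matrix from true and predicted binary labels. The function `summarize` is a hypothetical helper; it is not the `METRICS` list from `models.py`, and AUC is left out because it needs predicted scores rather than hard labels.

```python
def summarize(y_true, y_pred):
    """Compute basic binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # rows are true classes, columns are predicted classes
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
report = summarize(y_true, y_pred)
print(report)
```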

<a id='model-save'></a>
## **save model**

The model presented here has three stages, wrapped into an sklearn pipeline.  In order to save this model pipeline, each of its components may have to be saved independently.  There may be other advantages to serializing the pipeline components separately.

In our pipeline, the `FeaturesEngineer` has no state so we just create a new one during load. 

The `StandardScaler`'s estimates need to be re-used when transforming new data, so its state is pickled using `joblib` (it contains arrays).  Its filename is `{target_path}/scaler.pickle`. **It is important to remember that feature stores need more than feature arrays: they need code and often data in the form of parameter estimates, correlation matrices and so on, so that these may be used again on new data and for forensics.**
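The save and load functions in this section store the scaler's `__dict__` and later assign it to a fresh instance. The round trip can be sketched as follows, under two assumptions: the standard library's `pickle` stands in for `joblib`, and the hypothetical `FittedStep` class stands in for `StandardScaler`.

```python
import os
import pickle
import tempfile


class FittedStep:
    """Hypothetical stateful pipeline step holding fitted estimates."""

    def __init__(self):
        self.mean_ = None
        self.scale_ = None


fitted = FittedStep()
fitted.mean_, fitted.scale_ = [0.5, 2.0], [1.5, 0.25]

with tempfile.TemporaryDirectory() as target_path:
    state_file = os.path.join(target_path, "scaler.pickle")

    # save only the state dict, not the object itself
    with open(state_file, "wb") as fh:
        pickle.dump(fitted.__dict__, fh)

    # load: create an empty instance, then restore its state
    restored = FittedStep()
    with open(state_file, "rb") as fh:
        restored.__dict__ = pickle.load(fh)
```

Storing the state dict rather than the object keeps the file independent of how the class is imported, at the cost of relying on the class's attribute layout staying the same between save and load.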

Keras models have an architecture saved as JSON, and the corresponding weights are saved in HDF5 format. The architecture can be recreated by instantiating the class or converting the JSON representation into a model, with the weights loaded into that structure. Filenames for the model are `{target_path}/weights.h5` and `{target_path}/model.json`.

In [12]:
def my_pipeline_save(
    context: MLClientCtx,
    pipe: Pipeline,
    target_path: str,
    key: str
):
    """Serialize a specific pipeline.
    
    :param context:         function context
    :param pipe:            estimated model pipeline
    :param target_path:     destination for saved model
    :param key:             model name
    """
    if target_path:
        os.makedirs(target_path, exist_ok=True)
    
    # StandardScaler
    joblib.dump(
        pipe.steps[1][1].__dict__, 
        target_path + '/scaler.pickle')
    
    # Keras model--if we have the class code, we don't need this
    # (the context manager makes sure the file handle is closed)
    with open(f'{target_path}/model.json', 'w') as fh:
        json.dump(pipe.steps[2][1].model.to_json(), fh)
    pipe.steps[2][1].model.save_weights(f'{target_path}/weights.h5')
    
    context.log_artifact('model', 
                         target_path=target_path, 
                         labels={"engineer": "functions.models.FeaturesEngineer",
                                 "scaler": "sklearn.preprocessing.StandardScaler",
                                 'model': 'tensorflow.keras.sklearn.KerasClassifier',
                                 "type": "classifier"})

In [13]:
def my_pipeline_load(
    target_path: str,
    key: str
) -> Pipeline:
    """Deserialize a specific pipeline.
    
    See 'my_pipeline_save' for details.
    
    :param target_path: directory holding the saved pipeline files,
                        for example '/User/mlrun/simdata'
    :param key:         model name
    """

    # this particular feature generator has no state
    # or parameters
    ffg = FeaturesEngineer()
    
    # scaler
    ss = StandardScaler()
    ss.__dict__ = joblib.load(f'{target_path}/scaler.pickle')
    
    # keras model
    ksm = classifier_gen()
    ksm.load_weights(f'{target_path}/weights.h5')
    
    pipe = make_pipeline(ffg, ss, ksm)
    
    return pipe

<a id='train'></a>
## **training and validation**

In this notebook demonstration we wrap the training and validation steps into the same method.  The data is split into train, validation, and test sets, with the latter being saved for further tests.
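The two-stage split used in `train` first holds out the test set and then carves the validation set out of what remains. Here is a minimal standard-library sketch of that logic; `two_stage_split` is a hypothetical helper that approximates, but does not reproduce, the shuffling and rounding of scikit-learn's `train_test_split`.

```python
import random


def two_stage_split(rows, test_size=0.1, valid_size=0.3, seed=1):
    """Shuffle rows, hold out a test set, then split the rest into valid/train."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_size)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    # valid_size is a fraction of what is left after removing the test set
    n_valid = round(len(rest) * valid_size)
    valid, train = rest[:n_valid], rest[n_valid:]
    return train, valid, test


train_rows, valid_rows, test_rows = two_stage_split(range(1_000))
print(len(train_rows), len(valid_rows), len(test_rows))
```

With the defaults, 10% of the rows go to the test set and 30% of the remaining 90% go to validation, leaving 63% for training. This is consistent with the 630000 training rows and 100000 test rows reported in the local test run for 1 million samples.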

**exercise / todos**

To complete the demonstration, instead of hard-coding the `train_test_split` method, add a splitter class into the pipeline, like a cross-validator. 

The model encoder/decoder should also be input as a parameter.
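The `train` component receives its feature engineer, scaler and classifier as dotted-path strings such as `'functions.models.FeaturesEngineer'`. One way such a string could be resolved into a class is sketched here with `importlib`. The name `resolve_class` is hypothetical, and this is not necessarily how `class_instance` in `functions/models.py` is implemented.

```python
from importlib import import_module


def resolve_class(dotted_path: str):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    return getattr(import_module(module_name), attr)


# standard-library classes are used here so the sketch runs anywhere
decoder_cls = resolve_class("json.JSONDecoder")
decoder = decoder_cls()
print(decoder.decode('{"epochs": 10}'))
```

Passing class names as strings keeps the run parameters serializable, so the same component can be reconfigured from a pipeline definition without changing code.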

In [14]:
def train(
    context: MLClientCtx,
    dataset: DataItem,
    engineer_cls: str,
    scaler_cls: str,
    classifier_cls: str,
    target_path: str,
    key: str = '',
    test_size: float = 0.1,
    valid_size: float = 0.3,
    batch_size: int = 1024,
    epochs: int = 5,
    verbose: bool = True,
    random_state: int = 1,
    ) -> None:
    """Train, validate, test and save a classifier model pipeline.
    
    Here we split the data, instantiate our pipeline and its models, and proceed
    to training and validation.
    
    :param context:             function context
    :param dataset:             cleaned input dataset
    :param engineer_cls:        feature engineering class
    :param scaler_cls:          scaler class
    :param classifier_cls:      classifier class    
    :param target_path:         destination for artifacts
    :param key:                 model key in context
    :param test_size:           (0.1) test set size as fraction
    :param valid_size:          (0.3) validation set size as fraction
    :param batch_size:          (1024) network feed batch size
    :param epochs:              (5) training epochs
    :param verbose:             (default True) Show metrics for 
                                training/validation steps
        
    :param random_state:        (1) random seed used for the data splits
    """
    raw = get_context_table(dataset)

    # pass the seed through so the splits are reproducible
    train, test = train_test_split(raw, test_size=test_size,
                                   random_state=random_state)
    train, valid = train_test_split(train, test_size=valid_size,
                                    random_state=random_state)
    
    y_train = train.pop('labels')
    y_valid = valid.pop('labels')
    y_test = test.pop('labels')

    # instantiate features engineer, scaler and classifier
    Engineer = class_instance(engineer_cls)
    Scaler = class_instance(scaler_cls)
    classifier = class_instance(classifier_cls)

    # we want to save the features, and perhaps use them elsewhere,
    # so the Engineer gets the header (deprecate?), path and key:
    engineer = Engineer(raw.columns, target_path, 'features')
    # the other components have no parameters in this demonstration:

    pipe = make_pipeline(engineer,
                         Scaler(),
                         classifier)
    pipe.fit(train, y_train)

    y_pred = pipe.predict(test)                          
    acc = accuracy_score(y_test, y_pred)
    context.log_result("accuracy", float(acc))

    my_pipeline_save(context, pipe, target_path, 'tf-keras-pipe-key')

#### _end of nuclio function definition_

In [15]:
# nuclio: end-code

<a id="local-testing" ></a>
______________________________________________

# **testing your code locally**

The function can be run locally and debugged/tested before deployment:

In [16]:
from mlrun import (mlconf,
                   code_to_function,
                   new_function,  
                   new_model_server,
                   mount_v3io)  

Set MLRun's DB path.  MLRun will store all of its tracking data and metadata at the location given by the `MLRUN_DBPATH` environment variable or, as here, by `mlconf.dbpath`.  We have set a `TARGET_PATH` earlier in this notebook in the above section **[paths and parameters](#paths)**.

In [17]:
mlconf.dbpath = "http://mlrun-api:8080"

In [18]:
workflow = new_function()

In [19]:
datagen_run = workflow.run(
    name='data generator',
    handler=data_generator,
    params={
        'samples': N_SAMPLES,
        'features': M_FEATURES,
        'neg_weight': CLASS_BALANCE, # this is a balanced dataset
        'target_path': TARGET_PATH,
        'key': 'simdata'})

writing /User/mlrun/simdata/simdata-1e06X20.parquet



| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...1e4642 | 0 | Dec 26 00:45:16 | completed | data generator | host=jupyter-qlqrqnzi25-vogv2-79db4f79d-qt2wf | | samples=1000000<br>features=20<br>neg_weight=0.5<br>target_path=/User/mlrun/simdata<br>key=simdata | | features |


to track results use .show() or .logs() or in CLI: 
!mlrun get run d5e893bf80ad4cd681a87178c21e4642  , !mlrun logs d5e893bf80ad4cd681a87178c21e4642 
[mlrun] 2019-12-26 00:45:20,753 run executed, status=completed


In [20]:
train_run = workflow.run(
    name='train, validate and store model',
    handler=train,
    inputs={
        'dataset': datagen_run.output('features')},
    params={
        'scaler_cls':     'sklearn.preprocessing.StandardScaler',
        'engineer_cls':   'functions.models.FeaturesEngineer',
        'classifier_cls': 'functions.models.Classifier',
        'target_path':     TARGET_PATH,
        'key':             'model',
        'batch_size':      BATCH_SIZE,
        'epochs':          10})

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
630000 20
Train on 630000 samples
100000 20



| uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ...fa5cb5 | 0 | Dec 26 00:45:20 | completed | train, validate and store model | host=jupyter-qlqrqnzi25-vogv2-79db4f79d-qt2wf | dataset | scaler_cls=sklearn.preprocessing.data.StandardScaler<br>engineer_cls=functions.models.FeaturesEngineer<br>classifier_cls=functions.models.Classifier<br>target_path=/User/mlrun/simdata<br>key=model<br>batch_size=1024<br>epochs=10 | accuracy=0.98014 | model |


to track results use .show() or .logs() or in CLI: 
!mlrun get run 3c0d63660cfc4b6481bb26ce28fa5cb5  , !mlrun logs 3c0d63660cfc4b6481bb26ce28fa5cb5 
[mlrun] 2019-12-26 00:45:48,653 run executed, status=completed


<a id="image"></a>
#### _Create a deployment image_

Once debugged, you can create a reusable image and then deploy it for testing. The following line converts the code block between the ```# nuclio: ignore``` and ```# nuclio: end-code``` annotations so that it runs as a KubeJob.  Next we build an image named ```mlrun/mlruntfkeras:latest```.  _**It is important to ensure that this image has been built at least once, and that you have access to it.**_

In [27]:
tfkeras_job = code_to_function(name='tfkeras-named-pipe', runtime="job").apply(mount_v3io())

<a id="pipeline"></a>
______________________________________________

# **Create a KubeFlow Pipeline from our functions**

Our pipeline will consist of a ```data_generator``` step and a ```train``` step, followed by the deployment of a model server.  We'll drop a separate ```test``` step here since at the end of this deployment we can test the system with API requests.

For complete details on KubeFlow Pipelines please refer to the following docs:
1. **[KubeFlow pipelines](https://www.kubeflow.org/docs/pipelines/)**.
2. **[kfp.dsl Python package](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.dsl.html#module-kfp.dsl)**.

Please note, the model server file name in the ```new_model_server``` function call below should be identical in every respect to the name of the model server notebook.

In [26]:
import kfp
from kfp import dsl

In [23]:
@dsl.pipeline(
    name="Sklearn and KubeFlow",
    description="Shows how to use mlrun/kfp."
)
def tfkeras_pipeline(
):

    datagen = tfkeras_job.as_step(
        handler='data_generator',
        out_path=TARGET_PATH, 
        params={        
            'samples':         N_SAMPLES,
            'features':        M_FEATURES,
            'neg_weight':      CLASS_BALANCE,
            'target_path':     TARGET_PATH,
            'key':            'simdata'},
        outputs=['features'])
    
    train = tfkeras_job.as_step(
        handler='train',
        out_path=TARGET_PATH, 
        inputs={'dataset': datagen.outputs['features']},
        outputs=['model'],
        params={
            'scaler_cls':     'sklearn.preprocessing.StandardScaler',
            'engineer_cls':   'functions.models.FeaturesEngineer',
            'classifier_cls': 'functions.models.Classifier',
            'target_path':     TARGET_PATH,
            'key':             'model',
            'batch_size':      BATCH_SIZE,
            'epochs':          10})        

    # define a nuclio-serving function, generated from a notebook file
    srvfn = new_model_server(
        "tfkeras-serving", 
        model_class="TFKerasClassifier", 
        filename="model-server.ipynb")
    
    # deploy the model serving function with inputs from the training stage
    deploy = srvfn.with_v3io("User", "~/")
    deploy = deploy.deploy_step(project="refactor-demos", 
                                models={"tfkeras_joblib": train.outputs["model"]})

<a id="compile the pipeline"></a>
### _compile the pipeline_

We can compile our KubeFlow pipeline and produce a YAML description of the pipeline workflow:

In [24]:
os.makedirs(TARGET_PATH, exist_ok=True)
kfp.compiler.Compiler().compile(tfkeras_pipeline, TARGET_PATH+"/mlrunpipe.yaml")

In [25]:
client = kfp.Client(namespace="default-tenant")

Finally, the following line will run the pipeline as a job:

In [None]:
arguments = {
}

run_result = client.create_run_from_pipeline_func(
    tfkeras_pipeline, 
    arguments, 
    run_name="tfkeras_latest",
    experiment_name="tfkeras")

In [28]:
import mlrun
!mlrun clean

[mlrun] 2019-12-26 00:52:02,087 using in-cluster config.
state      started          type     name
Failed     Dec 26 00:31:16  job      kubeflow-pipeline-67222
Succeeded  Dec 26 00:14:40  build    mlrun-build-kubeflow-pipeline-gft9m
