# **Keras and Scikit-Learn With MLRun: Kaggle's Credit Fraud Data**

_______________________________________________________________________________

MLRun is an open-source Python package that provides a framework for running machine learning tasks transparently in multiple, scalable, runtime environments.  MLRun provides tracking of code, metadata, inputs, outputs and the results of machine learning pipelines. 

In this notebook we'll compose a pipeline that deploys a classifier model, and uses it as the input in either evaluation, inference, or retraining steps. We'll be working with **[Kaggle's Synthetic Financial Datasets For Fraud Detection](https://www.kaggle.com/ntnu-testimon/paysim1)**, a synthetic mobile transactions dataset with 6362620 rows, 10 features and a binary label.  The model will be a dense neural net classifier written with **[keras](https://keras.io/)** (v2.3.1), using a **[tensorflow](https://www.tensorflow.org/)** backend (v1.14.0), and without GPU support.

One of the challenges with the paysim dataset is its imbalance. For more details on some of the issues, take a look at **[Classification on imbalanced data](https://www.tensorflow.org/tutorials/structured_data/imbalanced_data)** covering the same dataset. We'll borrow some of the feature transformations presented there as part of a pipeline within a pipeline.
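To make the imbalance concrete, here is a minimal sketch of inverse-frequency class weights, in the spirit of the linked tutorial. The helper name `balanced_class_weights` and the toy label vector are assumptions made for illustration; the `train` function in this notebook does not apply class weights.

```python
import numpy as np


def balanced_class_weights(y):
    """Hypothetical helper: weight_c = n_samples / (n_classes * count_c)."""
    classes, counts = np.unique(y, return_counts=True)
    total = counts.sum()
    return {int(c): float(total / (len(classes) * n))
            for c, n in zip(classes, counts)}


# toy labels (made up): 98 legitimate transactions and 2 fraudulent ones
y_toy = np.array([0] * 98 + [1] * 2)
weights = balanced_class_weights(y_toy)
```

A dictionary of this shape is what Keras' `fit` accepts through its `class_weight` argument, should you want the rare class to count for more in the loss.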

Another transformation we'll apply is a standard normalization of our feature input matrix.  The ```fit``` is performed on the training data only, and the fitted transformer is saved and reused on any data going into the model: during validation, testing, and inference the data must be transformed using this pre-estimated normalizer. One potential source of model drift is the case where the pre-estimated normalizer no longer matches the incoming data.  
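The fit-on-train, transform-everywhere rule can be sketched as follows. The random feature matrix is made up for illustration; only `StandardScaler` and `train_test_split` come from this notebook's imports.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # made-up feature matrix

X_train, X_test = train_test_split(X, test_size=0.1, random_state=1)

scaler = StandardScaler()
scaler.fit(X_train)  # mean and scale are estimated on the training rows only

X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)  # reuses the stored mean_ and scale_
```

The test rows are standardized with the training statistics, so their column means will generally be close to, but not exactly, zero.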

All of the fitted transformations and latest estimated model are run as an sklearn pipeline, then saved and deployed for later testing and inference. 
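As a rough illustration of chaining a fitted transformer and an estimator, the sketch below builds a two-step scikit-learn `Pipeline`. It swaps in `LogisticRegression` for the Keras classifier so that it runs without tensorflow, and the data and sample weights are made up. Note how a fit-time argument reaches a named step through the `<step>__<param>` prefix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)
X = rng.normal(size=(200, 4))            # made-up features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # made-up binary label
w = np.where(y == 1, 2.0, 1.0)           # made-up sample weights

pipe = Pipeline([("scaler", StandardScaler()),
                 ("classifier", LogisticRegression())])

# 'classifier__sample_weight' is routed to the fit method of the 'classifier' step
pipe.fit(X, y, classifier__sample_weight=w)
y_pred = pipe.predict(X)
```

Calling `predict` on the pipeline applies the already-fitted scaler before the classifier, which is why the whole pipeline, not just the model, is the object worth saving.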

#### **notebook take-aways**
* write and test reusable **[MLRun](https://github.com/mlrun)** components in a notebook
* run the components as a **[KubeFlow](https://www.kubeflow.org/)** pipeline

<a id='top'></a>
#### **steps**
**[install the python mlrun package](#install)**<br>
**[nuclio code section](#nuclio-code-section)**<br>
    - [nuclio's ignore](#ignore)<br>
    - [function dependencies](#function-dependencies)<br>

**[components](#components)**<br>
    - [supporting functions](#utilties)<br>
    - [feature engineering](#feateng)<br>
    - [a classifier](#classifier)<br>
    - [training and validation](#train)<br>
**[local tests](#local-testing)**<br>
**[compose pipeline](#image)**<br>
**[run](#run)**<br>

<a id="install" ></a>
______________________________________________

# **notebook installs**

The following will reinstall the latest development version of ```mlrun```:

    !pip install -U git+https://github.com/mlrun/mlrun.git@development

Install the KubeFlow pipelines package ```kfp```. For more information see the **[KubeFlow documentation on nuclio](https://www.kubeflow.org/docs/components/misc/nuclio/)** and  **[Kubeflow pipelines and nuclio](https://github.com/kubeflow/pipelines/tree/master/components/nuclio)**. For logging the estimated machine learning models we'll use ```joblib```'s [```dump``` and ```load```](https://joblib.readthedocs.io/en/latest/persistence.html#persistence).

    !pip install -U kfp joblib seaborn tensorflow==1.14 keras
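A minimal sketch of the ```dump```/```load``` round trip, assuming a small fitted scaler as the object to persist. The file name and the temporary directory are hypothetical; in this notebook the artifacts go through the helpers in the ```functions``` folder instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# a tiny fitted estimator standing in for the full pipeline
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "scaler.pickle")  # hypothetical file name
    joblib.dump(scaler, model_path)     # serialize the fitted object to disk
    restored = joblib.load(model_path)  # read it back as a new object
```

The restored object carries the fitted attributes (here ```mean_``` and ```scale_```), so it can transform new data without being refit.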

<a id="nuclio-code-section"></a>
______________________________________________

# **nuclio code section**

<a id='ignore'></a>
### _nuclio's **ignore** notation_

You'll write all the code that gets packaged for execution between the tags ```# nuclio: ignore```, meaning ignore all the code here and above, and ```# nuclio: end-code```, meaning ignore everything after this annotation.  Methods in this code section can be called separately if designed as such (```acquire```, ```split```, ```train```, ```test```), or as you'll discover below, they are most often "chained" together to form a pipeline where the output of one stage serves as the input to the next. The **[docs](https://github.com/nuclio/nuclio-jupyter#creating-and-debugging-functions-using-nuclio-magic)** also suggest another approach: we can use ```# nuclio: start``` at the first relevant code cell instead of marking all the cells above with ```# nuclio: ignore```.

See the **[nuclio-jupyter](https://github.com/nuclio/nuclio-jupyter)** repo for further information on these and many other **[nuclio magic commands](https://github.com/nuclio/nuclio-jupyter#creating-and-debugging-functions-using-nuclio-magic)** that make it easy to transform a Jupyter notebook environment into a platform for developing production-quality, machine learning systems.

The ```nuclio-jupyter``` package provides methods for automatically generating and deploying nuclio serverless functions from code, repositories or Jupyter notebooks. **_If you have never run nuclio functions in your notebooks, please uncomment and run the following_**: ```!pip install nuclio-jupyter```

The following two lines _**should be in the same cell**_ and mark the start of your machine learning coding section:

In [None]:
# nuclio: ignore
import nuclio 

<a id="function-dependencies"></a>
### _function dependencies_

The installs made in the section **[notebook installs](#install)** covered the Jupyter environment within which this notebook runs.  However, we need to ensure that all the dependencies our nuclio function relies upon (such as ```matplotlib```, ```sklearn```, ```tensorflow```), will be available when that code is wrapped up into a nuclio function _**on some presently unknown runtime**_.   Within the nuclio code section we can ensure these dependencies get built into the function with the ```%nuclio cmd``` magic command.

In [None]:
%nuclio cmd -c pip install -U matplotlib tensorflow==1.14.0 keras scikit-learn pandas numpy joblib pyarrow
%nuclio cmd -c pip install -U git+https://github.com/mlrun/mlrun.git@7604c4d35e076897b49815f8f1cb8ee13cbd0286

We'll use a standard base image here, however the build step can be shortened by preparing images with pre-installed packages.

In [None]:
%nuclio config spec.build.baseImage = "python:3.6-jessie"

### _imports_

Some of the functionality is provided in supporting components within the functions folder:
- **[files](functions/file_functions.py)**
- **[models](functions/model_functions.py)**
- **[plots](functions/plot_functions.py)**
- **[tables](functions/table_functions.py)**

In [None]:
import functions

In [None]:
import os
from typing import Any

In [None]:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin

import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow as pa

from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

<a id='paths'></a>
### _source and destination paths_

In [None]:
TARGET_PATH = "/User/projects/paysim"
SRC_PATH = TARGET_PATH
SRC_URL = ''
SRCNAME = "data/PS_20174392719_1491204439457_log.csv.zip"

<a id="components" ></a>
______________________________________________

# **components**

<a id='feateng'></a>
## **feature engineering**
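Before the full transformer, here is a toy illustration of the three derived columns it adds: the two balance-error features and the log-transformed amount. The two rows are made up; the column names and formulas are the ones used by `FraudFeaturesGenerator`.

```python
import numpy as np
import pandas as pd

# two made-up transactions using the paysim column names
toy = pd.DataFrame({
    "amount":         [100.0, 250.0],
    "oldbalanceOrg":  [100.0, 300.0],
    "newbalanceOrig": [0.0, 100.0],
    "oldbalanceDest": [0.0, 50.0],
    "newbalanceDest": [100.0, 300.0],
})

# a consistent ledger gives zero error; a non-zero value flags a mismatch
toy["errorBalanceOrig"] = toy.newbalanceOrig + toy.amount - toy.oldbalanceOrg
toy["errorBalanceDest"] = toy.oldbalanceDest + toy.amount - toy.newbalanceDest

# small offset so that a zero amount does not produce log(0)
toy["amount_log"] = np.log(toy["amount"] + 0.001)
```

In the first row both accounts balance, so both errors are zero. In the second row the originating account lost 200 although the amount was 250, leaving an origin error of 50.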

In [None]:
class FraudFeaturesGenerator(TransformerMixin, BaseEstimator):
    """Generate features from raw input.

    A standard transformer mixin that can be inserted into a scikit learn Pipeline.
    
    These modifications to the features matrix follows the approach taken by 
    https://www.tensorflow.org/tutorials/structured_data/imbalanced_data
    where this data set is looked at in some detail.

    To use, 
    >>> ffg = FraudFeaturesGenerator()
    >>> ffg.fit(X)
    >>> x_transformed = ffg.transform(X)
    or
    >>> ffg = FraudFeaturesGenerator()
    >>> x_transformed = ffg.fit_transform(X)
    
    In a pipeline:
    >>> from sklearn.pipeline import Pipeline
    >>> from sklearn.preprocessing import StandardScaler
    >>> transformers = [('feature_gen', FraudFeaturesGenerator()), 
                        ('scaler', StandardScaler())]
    >>> transformer_pipe = Pipeline(transformers)
    """
    def fit(self, x, y=None):
        """fit is unused here
        """
        return self

    def transform(
        self,
        x: pd.DataFrame,
        y=None
    ) -> pd.DataFrame:
        """Transform raw input features as a preprocessing step.
        
        :param x: Raw input features, as a pandas Dataframe 
        
        Returns a cleaned DataFrame of features.
        """
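        # drop identifier columns without mutating the caller's DataFrame
        x = x.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis=1)
        # keep TRANSFER and CASH_OUT rows; copy so the assignments below are safe
        x = x.loc[(x.type == 'TRANSFER') | (x.type == 'CASH_OUT')].copy()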
        
        # Binary-encoding of labelled data in 'type'
        x.loc[x.type == 'TRANSFER', 'type'] = 0
        x.loc[x.type == 'CASH_OUT', 'type'] = 1
        x.type = x.type.astype(int) # convert dtype('O') to dtype(int)
        
        # cannot happen so mark
        x.loc[(x.oldbalanceDest == 0) & (x.newbalanceDest == 0) & (x.amount != 0), \
              ['oldbalanceDest', 'newbalanceDest']] = - 1
        x.loc[(x.oldbalanceOrg == 0) & (x.newbalanceOrig == 0) & (x.amount != 0), \
              ['oldbalanceOrg', 'newbalanceOrig']] = np.nan
        
        # create 2 new features (columns) recording errors in the originating and 
        # destination accounts for each transaction. 
        x['errorBalanceOrig'] = x.newbalanceOrig + x.amount - x.oldbalanceOrg
        x['errorBalanceDest'] = x.oldbalanceDest + x.amount - x.newbalanceDest
        
        # log transform amount
        eps=0.001 # 0 => 0.1¢
        x['amount_log'] = np.log(x['amount'].values+eps)
        
        # reset the index and save.
        x.reset_index(inplace=True)
        
        return x
    

<a id="classifier"></a>
## **a classifier**

In [None]:
class KerasSequentialModel(Sequential):
    """Generate a simple Keras model.
        
    This could be a very involved model builder, or it could
    download a model and make it available for retraining,
    à la transfer learning.
    
    So `model_args` might be number of layers, initializers,
    or whole custom layers built elsewhere.
    
    :param model_args: Unused.
    
    Instantiating the class yields a compiled keras sequential model.
    """
    def __init__(self, model_args: dict = None):
        # __init__ cannot return a value, so the layers are added to self
        super().__init__()
        self.add(Dense(units=15,
                       kernel_initializer='uniform',
                       activation='relu',
                       input_dim=11))
        self.add(Dense(units=15,
                       kernel_initializer='uniform',
                       activation='relu'))
        # single sigmoid unit for the binary fraud label
        self.add(Dense(units=1,
                       kernel_initializer='uniform',
                       activation='sigmoid'))
        self.compile(optimizer='adam',
                     loss='binary_crossentropy',
                     metrics=['accuracy'])

<a id='train'></a>
## **training and validation**
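The `train` function below splits the data twice: first a test set is held out, then a validation set is carved from what remains. A small sketch of the resulting sizes, using a made-up array of 1000 row ids and the function's default fractions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for the rows of the raw table

# first hold out the test set, then carve validation out of the remainder
train_rows, test_rows = train_test_split(rows, test_size=0.1, random_state=1)
train_rows, valid_rows = train_test_split(train_rows, test_size=0.3, random_state=1)

sizes = (len(train_rows), len(valid_rows), len(test_rows))
```

Because `valid_size` applies to the 90% left after the first split, the validation set is 27% of all rows rather than 30%, leaving 63% for training.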

In [None]:

def train(
    context: MLClientCtx,
    feature_generator: TransformerMixin,
    normalizer: TransformerMixin,
    classifier_model: Any,
    data: DataItem,
    target_path: str = '',
    name: str = 'model-default.pickle',
    batch_size: int = 1024,
    epochs: int = 5,
    learning_rate: float = 0.1,
    test_size: float = 0.1,
    valid_size: float = 0.3,
    verbose: bool = True,
    random_state: int = 1,
    ) -> None:
    """Train, validate, test and save a TF-Keras classifier model.
    
    :param context:             function context
    :param feature_generator:   scikit learn transformer
    :param normalizer:          scikit learn transformer
    :param classifier_model:    keras Sequential model, could be any model
    :param data:                complete raw input dataset
    :param batch_size:          (1024) network feed batch size
    :param epochs:              (5) training epochs
    :param test_size:           (0.1) test set size as fraction
    :param valid_size:          (0.3) validation set size as fraction
    :param verbose:             (default True) Show metrics for 
                                training/validation steps
    :param random_state:        (1) seed for the train/valid/test splits
    :param target_path:         destination for artifacts
    :param name:                model name
        
    Also included for demonstration are a randomly selected sample
    of training parameters:
    :param learning_rate: Step size at each iteration, constant.
    """
    # imported here so that the function carries its own dependencies
    from sklearn.metrics import accuracy_score
    from sklearn.pipeline import Pipeline

    raw = functions.get_context_table(data)

    train, test = train_test_split(raw, test_size=test_size,
                                   random_state=random_state)
    train, valid = train_test_split(train, test_size=valid_size,
                                    random_state=random_state)
    
    y_train = train.pop('isFraud')
    y_valid = valid.pop('isFraud')
    y_test = test.pop('isFraud')

    # chain the transformers and the classifier into one sklearn Pipeline
    training_pipeline = Pipeline([('features', feature_generator()),
                                  ('scaler', normalizer()),
                                  ('classifier', classifier_model())])

    # fit parameters reach a pipeline step through the '<step>__<param>' prefix
    training_pipeline.fit(train, 
                          y_train, 
                          classifier__batch_size=batch_size, 
                          classifier__epochs=epochs)

    # the sigmoid output is a probability, threshold it to get class labels
    y_pred = training_pipeline.predict(test)
    acc = accuracy_score(y_test, (y_pred > 0.5).astype(int).ravel())
    
    context.log_result("accuracy", float(acc))

    functions.log_context_table(context, target_path, 'xtest.parquet', test)
    functions.log_context_table(context, target_path, 'ytest.parquet', y_test)
    functions.log_context_model(context, target_path, name, training_pipeline)

#### _end of nuclio function definition_

In [None]:
# nuclio: end-code

<a id="local-testing" ></a>
______________________________________________

# **testing your code locally**

The function can be run locally and debugged/tested before deployment:

In [None]:
from mlrun import (mlconf,           # you can use mlconf for mlconf settings
                  code_to_function,  # convert your code to a function
                  new_function,      # create a new function, give it some code
                  new_model_server,  # in this notebook create an inference server 
                  mount_v3io)        # use the v3io data fabric

Set MLRun's DB path.  MLRun will generate and store all of its tracking and metadata at the path given by the `MLRUN_DBPATH` environment variable. **Please note that you should not store other data related to your experiments in this folder; let MLRun manage it.** We set a `target_path` earlier in this notebook in the section [source and destination paths](#paths).

In [None]:
%env MLRUN_DBPATH=/User/mlrun
mlconf.dbpath = "/User/mlrun"

In [None]:
workflow = new_function()

In [None]:
arc2parq_run = workflow.run(
    name='acquire and store',
    handler=functions.arc_to_parquet,
    params={
        'archive_url': '/User/projects/paysim/data/PS_20174392719_1491204439457_log.csv.zip',
        'header': None,
        'target_path': TARGET_PATH,
        'name': 'paysim.parquet',
        'chunksize': 10_000
    })

In [None]:
train_run = workflow.run(
    name='train and validate then store model',
    handler=train,
    inputs={
        'data': arc2parq_run.outputs['data']
    },
    params={
        'target_path': TARGET_PATH,
        'srcname': 'paysim.parquet',
        'feature_generator': FraudFeaturesGenerator,
        'normalizer': StandardScaler,
        'classifier_model': KerasSequentialModel,
        'batch_size': 1024,
        'epochs': 5,
        'name': 'tfkeras-seq.pickle'
    })

<a id="image"></a>
#### _Create a deployment image_

Once debugged you can create a reusable image, and then deploy it for testing. In the following line we are converting the code block between the ```# nuclio: ignore``` and ```# nuclio: end-code``` annotations to be run as a KubeJob.  Next we build an image named ```mlrun/mlruntfkeras:latest```.  _**It is important to ensure that this image has been built at least once, and that you have access to it.**_

In [None]:
tfkeras_job = code_to_function(runtime="job").apply(mount_v3io())

tfkeras_job.build(image="mlrun/mlruntfkeras:latest")

While debugging, and _**after you have run**_ ```build``` _**at least once**_, you can comment out the last cell so that the build process isn't started needlessly.  The code can be injected into the job using the following line:

In [None]:
# tfkeras_job.with_code()

<a id="pipeline"></a>
______________________________________________

# **Create a KubeFlow Pipeline from our functions**

Our pipeline will consist of two steps instead of three, ```load``` and ```train```.  We'll drop the ```test``` step
here since at the end of this deployment we can test the system with API requests.

For complete details on KubeFlow Pipelines please refer to the following docs:
1. **[KubeFlow pipelines](https://www.kubeflow.org/docs/pipelines/)**.
2. **[kfp.dsl Python package](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.dsl.html#module-kfp.dsl)**.

Please note, the model server file name in the ```new_model_server``` function call below should be identical in every respect to the name of the model server notebook.

In [None]:
import kfp
from kfp import dsl

In [None]:
@dsl.pipeline(
    name="TF-Keras Classifier Training Pipeline - Paysim",
    description="Shows how to use mlrun/kfp."
)
def tfkeras_pipeline(
   learning_rate = [0.1, 0.3]
):

    acquire = tfkeras_job.as_step(
        handler='my_funcs.file_functions.arc_to_parquet',
        out_path=TARGET_PATH, 
        params={
        },
        outputs=['data'])
    
    # train on the table produced by the acquire step and expose the model
    train_step = tfkeras_job.as_step(
        handler='train',
        out_path=TARGET_PATH, 
        inputs={'data': acquire.outputs['data']},
        outputs=['model'])

    # define a nuclio-serving function, generated from a notebook file
    srvfn = new_model_server(
        "paysim-serving", 
        model_class="TFKerasClassifier", 
        filename="model-server.ipynb")
    
    # deploy the model serving function with inputs from the training stage
    deploy = srvfn.with_v3io("User", "~/")
    deploy = deploy.deploy_step(project="refactor-demos", 
                                models={"tfkeras_joblib": train_step.outputs["model"]})

<a id="compile the pipeline"></a>
### _compile the pipeline_

We can compile our KubeFlow pipeline and produce a yaml description of the pipeline workflow:

In [None]:
os.makedirs(TARGET_PATH, exist_ok=True)
kfp.compiler.Compiler().compile(tfkeras_pipeline, os.path.join(TARGET_PATH, "mlrunpipe.yaml"))

In [None]:
client = kfp.Client(namespace="default-tenant")

Finally, the following line will run the pipeline as a job:

In [None]:
arguments = {
    "learning_rate": [ 0.1, 0.3]
}

run_result = client.create_run_from_pipeline_func(
    tfkeras_pipeline, 
    arguments, 
    run_name="tfkeras_latest",
    experiment_name="tfkeras")