# Train and Deploy the Credit Risk Model

This pipeline trains and deploys the components of the credit risk application that are related to the model.

The components that are trained and deployed are:
* Transformer (data preprocessing)
* Predictor (ONNX Runtime)
* Explainer (Alibi)

The pipeline also includes training data quality reports and model quality evaluation.

In [1]:
import kfp
from kfp.components import InputPath, OutputPath
from kfp import dsl
from typing import List, NamedTuple, Tuple
from kfp.dsl import ContainerOp
from kubernetes.client.models import (
    V1ConfigMapKeySelector,
    V1EnvVar,
    V1EnvVarSource,
    V1SecretKeySelector,
)
import os

## Images
The source for these images is included in the container image subdirectory.

This example uses many custom images, so that we can use open-source packages not available in our notebook container images. Using multiple images allows us to layer our builds, so individual components can be built faster. It also allows us to scope packages to only those parts of the application that need them.

In [2]:
BASE_IMAGE = "quay.io/ntlawrence/demo-workflow:3.0.0"
KSERVE_IMAGE = "quay.io/ntlawrence/demo-kserve:3.0.0"
TRANSFORMER_IMAGE = "quay.io/ntlawrence/demo-transformer:3.0.0"
PREDICTOR_IMAGE = "quay.io/ntlawrence/demo-predictor:3.0.0"
EXPLAINER_IMAGE = "quay.io/ntlawrence/demo-explainer:3.0.0"

## Define the TensorBoard component

Define a component that starts a TensorBoard service to monitor training.

In [3]:
CONFIGURE_TENSORBOARD_COMPONENT = f"{os.getenv('HOME')}/kubeflow-ppc64le-examples/configure_tensorboard_component/configure_tensorboard_component.yaml"

configure_tensorboard_comp = kfp.components.load_component_from_file(
    CONFIGURE_TENSORBOARD_COMPONENT
)

## Define components to load training data

The demo is designed to be able to pull training data from either DB2 or PostgreSQL. The following components are for each source. The actual pipeline will only use one of them.

We could imagine defining both components in different files, and pulling in only the one that is needed here. But to keep things simple, they are both defined here in the pipeline creation script.

In [4]:
def load_df_from_db2(table_name: str,
                     data_frame_pkl: OutputPath(str)):
    import warnings
    import ibm_db
    import ibm_db_dbi
    import os
    import json
    import pandas as pd
    import pickle
    from typing import Dict, Any
    
    def assign_categories_to_df(df: pd.DataFrame, column_info: Dict[str, Any]) -> None:
        for col_name, levels in column_info["label_columns"].items():
            if col_name in df.columns:
                ctype = pd.CategoricalDtype(categories=levels, ordered=False)
                df[col_name] = df[col_name].astype(ctype)

    def df_from_sql(
        name: str,
        conn: ibm_db.IBM_DBConnection,
        column_info: Dict[str, Any],
    ) -> pd.DataFrame:
        sql_safe_name = name.replace('"', "")

        rStmtColsSql = ",".join([f'"{col}"' for col in column_info["columns"]])
        rSql = f'SELECT {rStmtColsSql} FROM "{sql_safe_name}" ORDER BY "ACCOUNT_ID"'

        read_conn = ibm_db_dbi.Connection(conn)
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", message="pandas only support SQLAlchemy")
            df = pd.read_sql(rSql, read_conn)

        assign_categories_to_df(df, column_info)
        return df
    
    conn_str = (
        "DRIVER={IBM DB2 ODBC DRIVER};"
        f"DATABASE=BLUDB;HOSTNAME={os.environ['db2_host']};PORT={os.environ['db2_port']};"
        f"PROTOCOL=TCPIP;UID={os.environ['db2_user']};Pwd={os.environ['db2_pwd']};SECURITY=SSL;"
    )

    conn = ibm_db.connect(conn_str, "", "")

    column_info = json.loads(os.environ["COLUMNS"])
    df = df_from_sql(table_name, conn, column_info)
    df.to_pickle(data_frame_pkl)


load_df_from_db2_comp = kfp.components.create_component_from_func(
    func=load_df_from_db2, base_image=BASE_IMAGE
)

In [5]:
def load_df_from_postgresql(table_name: str,
                           data_frame_pkl: OutputPath(str)):
    import os
    import json
    import pandas as pd
    import pickle
    from typing import Dict, Any
    import psycopg
    from psycopg import sql
    import yaml
    
    def get_pg_conn() -> psycopg.Connection:
        host, dbname, username, password, port = (
            os.environ.get('PG_HOST'),
            os.environ.get('PG_DB_NAME'),
            os.environ.get('PG_USER'),
            os.environ.get('PG_PWD'),
            int(os.environ.get('PG_PORT'))
        )

        conn_str = f"postgresql://{username}:{password}@{host}:{port}/{dbname}?connect_timeout=10&application_name=mlpipeline"
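        # Log the target without the password so credentials do not end up in pod logs
        print(f"Connecting to postgresql://{username}@{host}:{port}/{dbname}")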
        conn = psycopg.connect(conn_str)

        return conn
    
    def assign_categories_to_df(df: pd.DataFrame, column_info: Dict[str, Any]) -> None:
        for col_name, levels in column_info["label_columns"].items():
            if col_name in df.columns:
                ctype = pd.CategoricalDtype(categories=levels, ordered=False)
                df[col_name] = df[col_name].astype(ctype)

    def df_from_sql(
        name: str,
        db: psycopg.Connection,
        column_info: Dict[str, Any],
    ) -> pd.DataFrame:
        with db.cursor() as cur:
            # Use the function's own parameter rather than the enclosing table_name
            cur.execute(sql.SQL('SELECT * FROM {} ORDER BY "ACCOUNT_ID"').format(sql.Identifier(name)))
            df = pd.DataFrame(cur.fetchall(), columns=[desc[0] for desc in cur.description])
        assign_categories_to_df(df, column_info)

        return df
    
    conn = get_pg_conn()
    column_info = json.loads(os.environ["COLUMNS"])
    df = df_from_sql(table_name, conn, column_info)
    df.to_pickle(data_frame_pkl)


load_df_from_postgresql_comp = kfp.components.create_component_from_func(
    func=load_df_from_postgresql, base_image=BASE_IMAGE, packages_to_install=["psycopg[binary,pool]"]
)
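
Both loaders read column metadata from the `COLUMNS` environment variable. Judging from how the components index into it, the value is a JSON object with a `columns` list, a `label_columns` mapping from column name to its allowed levels, and an `int_columns` list. The sketch below shows what such a value could look like and how the quoted column list for the DB2 query is derived from it. The column names, levels, and table name are illustrative assumptions, not the demo's actual configuration.

```python
import json

# Hypothetical COLUMNS value; the real column names and levels come from the
# demo's configuration, which is not shown in this notebook.
columns_json = json.dumps(
    {
        "columns": ["ACCOUNT_ID", "Age", "Sex", "Risk"],
        "label_columns": {"Sex": ["female", "male"], "Risk": ["No Risk", "Risk"]},
        "int_columns": ["Age"],
    }
)

column_info = json.loads(columns_json)

# Same quoting scheme as df_from_sql in the DB2 loader: each column name is
# wrapped in double quotes and the names are joined with commas.
select_list = ",".join(f'"{col}"' for col in column_info["columns"])

# "ACCOUNTS" is a made-up table name used only for this example.
query = f'SELECT {select_list} FROM "ACCOUNTS" ORDER BY "ACCOUNT_ID"'
print(query)
```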

## Generate a data quality report

The report appears in the task's Visualizations tab. It helps reveal bias that may exist in the training data.

In [6]:
def data_quality_report(df: InputPath(str),
                        features: List[str],
                        mlpipeline_ui_metadata_path: OutputPath(str),
                        output_report: OutputPath(str),
                        target: str = 'Risk'):
    from evidently.metric_preset import DataQualityPreset
    from evidently.report import Report
    from evidently import ColumnMapping
    import pandas as pd
    import os
    from pathlib import Path
    import json

    
    dataset = pd.read_pickle(df)
    column_info = json.loads(os.environ["COLUMNS"])

    column_mapping = ColumnMapping()
    column_mapping.target = target
    column_mapping.task = "classification"
    feature_set = set(features)
    column_mapping.numerical_features = [
        c
        for c in column_info["int_columns"]
        if c in feature_set
    ]
    column_mapping.categorical_features = [
        c
        for c in column_info["label_columns"]
        if c in feature_set
    ]

    report = Report(
        metrics=[
            DataQualityPreset(),
        ]
    )

    report.run(
        reference_data=None,
        current_data=dataset,
        column_mapping=column_mapping,
    )

    Path(output_report).parent.mkdir(parents=True, exist_ok=True)
    report.save_html(output_report)
    html_content = Path(output_report).read_text()  # reads and closes the file
    metadata = {
        "outputs": [
            {
                "type": "web-app",
                "storage": "inline",
                "source": html_content,
            }
        ]
    }

    with open(mlpipeline_ui_metadata_path, "w") as f:
        json.dump(metadata, f)
        
data_quality_report_comp = kfp.components.create_component_from_func(
    func=data_quality_report, base_image=BASE_IMAGE
)
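
The visualization works because the component writes a small JSON document to `mlpipeline_ui_metadata_path`. Here is a minimal sketch of that structure, using a placeholder HTML string instead of a real Evidently report. The helper name `inline_web_app_metadata` is hypothetical.

```python
import json


def inline_web_app_metadata(html: str) -> dict:
    """Wrap an HTML string in the metadata layout used by the report component."""
    return {
        "outputs": [
            {"type": "web-app", "storage": "inline", "source": html},
        ]
    }


# Placeholder report body; the pipeline uses the HTML saved by Evidently.
metadata = inline_web_app_metadata("<h1>Data quality</h1>")
serialized = json.dumps(metadata)
print(serialized)
```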

## Define a component to fit the preprocessor

Because the input features contain categorical data, we need a preprocessor to one-hot encode the categorical features. Non-categorical features are passed through unchanged, so the output is ready to be used for training.

In [7]:
def fit_preprocessor(
    training_df: InputPath(str),
    preprocessor_pkl: OutputPath(str),
    features: List[str],
):
    import pandas as pd
    import json
    import joblib
    import os

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import Pipeline

    feature_set = set(features)
    column_info = json.loads(os.environ["COLUMNS"])

    ohe_labels = [
        (
            "ohe_" + label,
            OneHotEncoder(
                handle_unknown="ignore", sparse_output=False, categories=[levels]
            ),
            [label],
        )
        for label, levels in column_info["label_columns"].items()
        if label in feature_set
    ]

    int_cols = [
        (
            "passthrough",
            "passthrough",
            [col for col in column_info["int_columns"] if col in feature_set],
        )
    ]

    pipe = Pipeline(
        steps=[
            ("Preprocess", ColumnTransformer(ohe_labels + int_cols, remainder="drop")),
        ]
    )

    print(pipe)
    train = pd.read_pickle(training_df)
    print(train.dtypes)
    pipe.fit(train)
    joblib.dump(pipe, preprocessor_pkl)


fit_preprocessor_comp = kfp.components.create_component_from_func(
    func=fit_preprocessor, base_image=BASE_IMAGE
)
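
To illustrate what the preprocessor produces, here is a dependency-free sketch of one-hot encoding with a fixed list of levels, where an unseen value maps to all zeros (the behavior requested by `handle_unknown="ignore"`). The column names and levels are made up for the example, and the real component uses scikit-learn's `OneHotEncoder` rather than these hypothetical helpers.

```python
from typing import Any, Dict, List


def one_hot(value: str, levels: List[str]) -> List[float]:
    """Return one indicator per level; an unknown value yields all zeros."""
    return [1.0 if value == level else 0.0 for level in levels]


def encode_row(
    row: Dict[str, Any],
    label_columns: Dict[str, List[str]],
    int_columns: List[str],
) -> List[float]:
    """Encode categorical columns first, then append pass-through columns."""
    encoded: List[float] = []
    for name, levels in label_columns.items():
        encoded.extend(one_hot(str(row[name]), levels))
    encoded.extend(float(row[name]) for name in int_columns)
    return encoded


# Hypothetical column metadata and input rows.
label_columns = {"Sex": ["female", "male"], "Housing": ["own", "rent"]}
int_columns = ["Age"]

known = encode_row({"Sex": "male", "Housing": "own", "Age": 35}, label_columns, int_columns)
unknown = encode_row({"Sex": "male", "Housing": "free", "Age": 35}, label_columns, int_columns)
print(known)
print(unknown)
```

Fixing the levels up front keeps the width of the encoded vector stable even when a level never appears in the training data.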

## Define a component to train the model

The model is a shallow neural network of fully connected layers, implemented in TensorFlow. In practice, we might get better model performance from a random forest built with Scikit-learn or [SnapML](https://snapml.readthedocs.io/en/latest/installation.html); however, a neural network was selected to show off GPU acceleration on IBM Power9 AC922 systems and IBM Power10 systems with MMA technology.

Since a neural network has difficulty providing explanations for its predictions, using one also allows us to showcase open-source solutions for explainable AI.

The German Credit Risk dataset is usually paired with random forests in AI demos. SnapML provides that capability and has been reported to have excellent performance on Power10 systems; this would be a worthwhile alternative to investigate in the future.

In [8]:
def train(
    training_df: InputPath(str),
    preprocessor: InputPath(str),
    model: OutputPath(str),
    target_processing_config: OutputPath(str),
    mlpipeline_ui_metadata_path: OutputPath(str),
    target: str = "Risk",
    tensorboard_dir: str = None,
):
    import pandas as pd
    import json
    import joblib
    import tensorflow as tf
    from keras import Sequential
    from keras.layers import Dense, Dropout, BatchNormalization, Input
    from keras.callbacks import EarlyStopping, ReduceLROnPlateau, TensorBoard
    from sklearn.metrics import precision_recall_curve
    import numpy as np
    import os
    from sklearn.metrics import PrecisionRecallDisplay
    import base64

    tf.keras.utils.set_random_seed(42)
    tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.DEBUG)
    target_processing_config_dict = {
        "threshold" : 0.5,
        "target_names" : {0: "No Risk", 1: "Risk"}
    }

    def get_tf_model(num_features: int) -> Tuple[tf.keras.Model, List[tf.keras.callbacks.Callback]]:

        tf_model = Sequential(
            [
                Input(shape=(num_features,)),
                BatchNormalization(),
                Dense(64, activation="sigmoid", name="layer1"),
                BatchNormalization(),
                Dropout(0.3, name="dropout1"),
                Dense(32, activation="sigmoid", name="layer2"),
                BatchNormalization(),
                Dropout(0.3, name="dropout2"),
                Dense(16, activation="sigmoid", name="layer3"),
                Dropout(0.3, name="dropout3"),
                Dense(
                    1,
                    activation="sigmoid",
                    name="output",
                ),
            ]
        )

        tf_model.compile(optimizer="adam", loss="binary_crossentropy")

        callbacks = [
            EarlyStopping(
                monitor="val_loss",
                patience=20,
                verbose=0,
                mode="min",
                restore_best_weights=True,
            ),
            ReduceLROnPlateau(
                monitor="val_loss",
                factor=0.1,
                patience=10,
                verbose=1,
                min_delta=0.0001,
                mode="min",
            )
        ]
        print(f"constructing TensorBoard callback with log dir {tensorboard_dir}...")
        if tensorboard_dir:
            callbacks.append(TensorBoard(
                log_dir=tensorboard_dir
            ))

        return tf_model, callbacks

    print("loading training data...")
    train = pd.read_pickle(training_df)
    print("loading preprocessor...")
    preprocessor = joblib.load(preprocessor)

    X = tf.convert_to_tensor(preprocessor.transform(train))
    y = tf.convert_to_tensor(
        train.loc[:, target].apply(lambda v: 1 if v == target_processing_config_dict["target_names"][1] else 0)
    )
    print("obtaining model....")
    tf_model, callbacks = get_tf_model(num_features=X.shape[1])
    print("Training...")
    tf_model.fit(
        X,
        y,
        validation_split=0.2,
        epochs=500,
        callbacks=callbacks,
        class_weight={0: 1, 1: 2},
    )
    
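    # Choose the decision threshold with the highest F1 score
    predictions = tf_model.predict(X)
    precision, recall, thresholds = precision_recall_curve(
        y_true=y.numpy(), probas_pred=predictions.flatten()
    )
    # A 0/0 division yields NaN, which np.argmax would select, so map NaN to 0.
    # precision and recall have one more entry than thresholds, so skip the last one.
    f1s = np.nan_to_num(2 * (precision * recall) / (precision + recall))
    threshold_index = int(np.argmax(f1s[:-1]))
    target_processing_config_dict["threshold"] = float(thresholds[threshold_index])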
    
    # Save model and threshold config
    tf_model.save(model, save_format="h5")
    with open(target_processing_config, "w") as f:
        json.dump(target_processing_config_dict, f)
        
    # Plot precision recall curve
    plt = PrecisionRecallDisplay.from_predictions(
        y_true=y.numpy(), y_pred=predictions.flatten(),
    )
    plt.ax_.plot(recall[threshold_index], precision[threshold_index], 
                     marker="o", markersize=10, markeredgecolor="black", markerfacecolor="red")
    plt.ax_.text(recall[threshold_index] + .03, precision[threshold_index] + .03,
                  (f"({recall[threshold_index]:.3f},{precision[threshold_index]:.3f})\n" +
                   f"threshold={thresholds[threshold_index]:.3f}\n" +
                   f"f1={f1s[threshold_index]:.3f}")
                )
    plt.ax_.set_title("Precision Recall - Risk (Training)")

    plt.figure_.savefig("pr.jpg")
    with open("pr.jpg", "rb") as f:
        jpg = base64.b64encode(f.read())
    html = f'<img src="data:image/jpg;base64,{jpg.decode("utf-8")}"/>'
    metadata = {"outputs": [{"type": "markdown", "storage": "inline", "source": html}]}
    with open(mlpipeline_ui_metadata_path, "w") as metadata_file:
        json.dump(metadata, metadata_file)

train_comp = kfp.components.create_component_from_func(
    func=train, base_image=BASE_IMAGE, packages_to_install=["minio"]
)
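
The threshold selection in the training component can be sketched without TensorFlow or scikit-learn. The example below scans every distinct score as a candidate threshold, computes F1 for the rule `score >= threshold`, and keeps the threshold with the highest F1. The labels and scores are made-up numbers, and `best_f1_threshold` is a hypothetical helper, not part of the pipeline.

```python
from typing import List, Tuple


def best_f1_threshold(y_true: List[int], scores: List[float]) -> Tuple[float, float]:
    """Return (threshold, f1) maximizing F1 for the rule score >= threshold."""
    best_threshold, best_f1 = 0.5, -1.0
    for threshold in sorted(set(scores)):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for p, t in zip(predicted, y_true) if p == 1 and t == 1)
        fp = sum(1 for p, t in zip(predicted, y_true) if p == 1 and t == 0)
        fn = sum(1 for p, t in zip(predicted, y_true) if p == 0 and t == 1)
        # F1 = 2*TP / (2*TP + FP + FN); treat it as 0 when there are no true positives.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Illustrative labels (1 = "Risk") and model scores.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.20]

threshold, f1 = best_f1_threshold(y_true, scores)
print(threshold, round(f1, 3))
```

Because the threshold is tuned on the same data the model was trained on, it is likely to be optimistic; a held-out split would give a more conservative choice.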

## Define a component to evaluate the model

This component evaluates the model and produces a classification report in the Visualizations tab. The output parameter `mlpipeline_ui_metadata_path` defines the visualization.

In [9]:
def evaluate(
    df: InputPath(str),
    preprocessor: InputPath(str),
    onnx_model: InputPath(str),
    target_processing_config: InputPath(str),
    output_report: OutputPath(str),
    mlpipeline_ui_metadata_path: OutputPath(str),
    target="Risk",
) -> NamedTuple("EvaluationOutput", [("mlpipeline_metrics", "Metrics")]):
    import pandas as pd
    import joblib
    import json
    from evidently.metric_preset import ClassificationPreset
    from evidently.report import Report
    from evidently import ColumnMapping
    import os
    from pathlib import Path
    import onnxruntime as ort
    import numpy as np
    from collections import namedtuple

    dataset = pd.read_pickle(df)
    preprocessor = joblib.load(preprocessor)
    
    inference_session = ort.InferenceSession(
            onnx_model, providers=["CPUExecutionProvider"]
    )
    
    with open(target_processing_config, "r") as f:
        target_processing_config_dict = json.load(f)
        target_processing_config_dict["target_names"] = {int(k):v for k,v in target_processing_config_dict["target_names"].items()}

    column_info = json.loads(os.environ["COLUMNS"])

    X = preprocessor.transform(dataset).astype(np.float32)
    y_prob = np.array(
        inference_session.run(
            [], {"input_1": X}
        )).flatten()

    dataset["Prediction"] = pd.Series(y_prob).apply(
        lambda p: 1 if p > target_processing_config_dict["threshold"] else 0
    )
    dataset["Actual"] = dataset.loc[:, target].apply(
        lambda v: 1 if v == target_processing_config_dict["target_names"][1] else 0
    )

    column_mapping = ColumnMapping()
    column_mapping.target_names = target_processing_config_dict["target_names"]
    column_mapping.target = "Actual"
    column_mapping.prediction = "Prediction"
    column_mapping.task = "classification"
    column_mapping.numerical_features = [
        c
        for c in column_info["int_columns"]
        if c in set(preprocessor.feature_names_in_)
    ]
    column_mapping.categorical_features = [
        c
        for c in column_info["label_columns"]
        if c in set(preprocessor.feature_names_in_)
    ]

    report = Report(
        metrics=[
            ClassificationPreset(),
        ]
    )

    report.run(
        reference_data=None,
        current_data=dataset,
        column_mapping=column_mapping,
    )

    Path(output_report).parent.mkdir(parents=True, exist_ok=True)
    report.save_html(output_report)
    html_content = Path(output_report).read_text()  # reads and closes the file
    metadata = {
        "outputs": [
            {
                "type": "web-app",
                "storage": "inline",
                "source": html_content,
            }
        ]
    }

    with open(mlpipeline_ui_metadata_path, "w") as f:
        json.dump(metadata, f)

    report_dict = report.as_dict()
    print(report_dict)
    classification_metrics = next(
        filter(lambda m: m["metric"]=="ClassificationQualityMetric",
               report_dict["metrics"])
    )
    
    
    metrics = {
        "metrics": [
            {"name": "acc", 
             "numberValue": classification_metrics["result"]["current"]["accuracy"],
             "format": "RAW"},
            {"name": "F1", 
             "numberValue": classification_metrics["result"]["current"]["f1"],
             "format": "RAW"},
        ]
    }
    
    out_tuple = namedtuple("EvaluationOutput", ["mlpipeline_metrics"])
    return out_tuple(json.dumps(metrics))
    
eval_comp = kfp.components.create_component_from_func(
    func=evaluate, base_image=BASE_IMAGE
)
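
The `mlpipeline_metrics` output is a JSON string listing named scalar metrics. Here is a small sketch of how the structure is assembled, with made-up accuracy and F1 values standing in for the ones read from the Evidently report. `build_metrics` is a hypothetical helper.

```python
import json
from collections import namedtuple


def build_metrics(accuracy: float, f1: float) -> str:
    """Serialize accuracy and F1 in the layout returned by the evaluate component."""
    metrics = {
        "metrics": [
            {"name": "acc", "numberValue": accuracy, "format": "RAW"},
            {"name": "F1", "numberValue": f1, "format": "RAW"},
        ]
    }
    return json.dumps(metrics)


EvaluationOutput = namedtuple("EvaluationOutput", ["mlpipeline_metrics"])

# Illustrative values only.
result = EvaluationOutput(build_metrics(accuracy=0.75, f1=0.6))
print(result.mlpipeline_metrics)
```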

## Define a component to convert the model to ONNX

We use an ONNX model for inference. ONNX is an open standard that offers a common way to represent TensorFlow and PyTorch models, and serving the model with ONNX Runtime can also yield some performance improvements.

In [10]:
def convert_model_to_onnx(tf_model: InputPath(str),
                          onnx_model: OutputPath(str)):
    import tf2onnx
    import tensorflow as tf
    import onnx
    
    keras_model = tf.keras.models.load_model(tf_model)    
    converted_model, _ = tf2onnx.convert.from_keras(keras_model)
    onnx.save_model(converted_model, onnx_model)

convert_model_to_onnx_comp = kfp.components.create_component_from_func(
    func=convert_model_to_onnx, base_image=BASE_IMAGE
)

## Define a component to train the explainer

This example uses [Alibi Anchor explainers](https://docs.seldon.io/projects/alibi/en/latest/overview/high_level.html). The explainer is a black-box explainer; it has no visibility into the model's inner workings. When asked to explain an example, it performs a search over similar examples while invoking the predictor. Using the predicted results, it learns which feature values are most important for the example's prediction. It can then return an answer that resembles a set of rules for why the prediction was made.

In order to perform the search, the explainer needs to be fit with the distributions of the input variables. It also needs to be informed of the one-hot encoding scheme.

This component fits the explainer model. The ONNX model and ONNX Runtime are used directly as the predictor.

In [11]:
def train_explainer(train_df: InputPath(str),
                    preprocessor: InputPath(str),
                    onnx_model: InputPath(str),
                    target_processing_config: InputPath(str),
                    explainer_dll: OutputPath(str)):
    from alibi.explainers.anchors import anchor_tabular
    from alibi.utils import gen_category_map
    import joblib
    from sklearn.pipeline import Pipeline
    import re
    import numpy as np
    import dill
    from functools import partial
    from typing import Tuple, List, Dict
    import json
    import os
    import pandas as pd
    import onnxruntime as ort
    import logging
    
    logging.basicConfig()
    
    def generate_category_map(preprocessor_pipeline: Pipeline) -> Tuple[List[str], Dict[int, List[str]]]:
        features = []
        seen_features = set()
        category_map = dict()

        for out_col_name in preprocessor_pipeline.get_feature_names_out():
            parts = re.search("(.*)__([^_]+)(_([A-Za-z0-9_-]+))?", out_col_name)
            if parts:
                if parts.group(2) not in seen_features:
                    features.append(parts.group(2))
                    seen_features.add(parts.group(2))

                if parts.group(1) == "categorical" or parts.group(1).startswith("ohe_"):
                    levels = category_map.get(len(features) - 1, [])
                    levels.append(parts.group(4))
                    category_map[len(features) - 1] = levels
            else:
                raise ValueError("Could not parse column " + out_col_name)

        return features, category_map


    with open(target_processing_config, "r") as f:
        target_processing_config_dict = json.load(f)
    column_info = json.loads(os.environ["COLUMNS"])

    inference_session = ort.InferenceSession(
            onnx_model, providers=["CPUExecutionProvider"]
        )
    threshold:float = float(target_processing_config_dict["threshold"])
    def predict(X: np.ndarray) -> np.ndarray:
        scores= np.array(inference_session.run(
            [], {"input_1": X}
        )).flatten()
        predictions = pd.Series(scores).apply(lambda p: 1 if p > threshold else 0).to_numpy()
        return predictions

    preprocessor_pipeline = joblib.load(preprocessor)
    features, category_map = generate_category_map(preprocessor_pipeline)

    dataset = pd.read_pickle(train_df)
    X = preprocessor_pipeline.transform(dataset).astype(np.float32)

    logging.info(f"Training explainer: features={features} categories={category_map} X.shape = {X.shape}")
    explainer = anchor_tabular.AnchorTabular(
        predictor=predict, feature_names=features, categorical_names=category_map, ohe=True, seed=42
    )
    explainer.fit(X, disc_prec=[10,25,33,50,66,75,90])

    explainer.reset_predictor(None)   # Clear the explainer's predict_fn, as it's a lambda and will be reset when loaded
    with open(explainer_dll, "wb") as f:
        dill.dump(explainer, f)

train_explainer_comp = kfp.components.create_component_from_func(
    func=train_explainer, base_image=BASE_IMAGE
)
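
The column-name parsing in `generate_category_map` can be tried on its own. The sketch below applies the same regular expression to a handful of output column names of the form `<transformer>__<feature>_<level>`. The names are invented for illustration, and the helper `category_map_from_names` is a simplified, hypothetical variant of the function above. Note that, as written, the pattern assumes feature names contain no underscores.

```python
import re
from typing import Dict, List, Tuple

PATTERN = re.compile(r"(.*)__([^_]+)(_([A-Za-z0-9_-]+))?")


def category_map_from_names(names: List[str]) -> Tuple[List[str], Dict[int, List[str]]]:
    """Group one-hot output columns by feature and collect each feature's levels."""
    features: List[str] = []
    category_map: Dict[int, List[str]] = {}
    for name in names:
        parts = PATTERN.search(name)
        if not parts:
            raise ValueError("Could not parse column " + name)
        transformer, feature, level = parts.group(1), parts.group(2), parts.group(4)
        if feature not in features:
            features.append(feature)
        if transformer.startswith("ohe_"):
            # The key is the feature's position in the feature list.
            category_map.setdefault(features.index(feature), []).append(level)
    return features, category_map


# Hypothetical output of get_feature_names_out() for two categorical columns
# and one pass-through column.
names = [
    "ohe_Sex__Sex_female",
    "ohe_Sex__Sex_male",
    "ohe_Housing__Housing_own",
    "ohe_Housing__Housing_rent",
    "passthrough__Age",
]
features, category_map = category_map_from_names(names)
print(features)
print(category_map)
```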

## Define a component to upload artifacts

Three trained or fitted artifacts need to be uploaded to S3 storage:
* Data preprocessor
* ONNX model
* Explainer

This component stores the artifacts in a tar archive and uploads the archive to S3 storage (MinIO).

In [12]:
def upload_artifacts(
    onnx_model: InputPath(str),
    preprocessor: InputPath(str),
    explainer: InputPath(str),
    archive_name: str,
    minio_url: str = "minio-service.kubeflow:9000",
    version: str = "1"
) -> NamedTuple("UploadOutput", [("s3_address", str)]):
    """Uploads the model artifact archive to the MinIO artifact store."""

    from collections import namedtuple
    import logging
    from minio import Minio
    import sys
    import tarfile
    import os

    logging.basicConfig(
        stream=sys.stdout,
        level=logging.INFO,
        format="%(levelname)s %(asctime)s: %(message)s",
    )
    logger = logging.getLogger()

    ARCHIVE_FILE = f"/tmp/{archive_name}"
    with tarfile.open(ARCHIVE_FILE, "w") as f:
        f.add(onnx_model, arcname="model.onnx")
        f.add(preprocessor, arcname="preprocessor.joblib")
        f.add(explainer, arcname="explainer.dll")

    minio_client = Minio(
            minio_url, 
            access_key=os.environ["MINIO_ID"], 
            secret_key=os.environ["MINIO_PWD"], secure=False
        )

    # Create export bucket if it does not yet exist
    export_bucket="{{workflow.namespace}}"
    existing_bucket = next(filter(lambda bucket: bucket.name == export_bucket, minio_client.list_buckets()), None)

    if not existing_bucket:
        logger.info(f"Creating bucket '{export_bucket}'...")
        minio_client.make_bucket(bucket_name=export_bucket)

    path = f"tar/{version}/{archive_name}"
    s3_address = f"s3://{export_bucket}/tar"

    logger.info(f"Saving archive to MinIO (s3 address: {s3_address})...")
    minio_client.fput_object(
        bucket_name=export_bucket,  # bucket name in Minio
        object_name=path,  # file name in bucket of Minio 
        file_path=ARCHIVE_FILE,  # file path / name in local system
    )

    logger.info("Finished.")
    out_tuple = namedtuple("UploadOutput", ["s3_address"])
    return out_tuple(s3_address)


upload_artifacts_comp = kfp.components.create_component_from_func(
    func=upload_artifacts, base_image=BASE_IMAGE, packages_to_install=["minio==7.1.13"]
)
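
The archive layout can be checked locally with the standard library. This sketch writes three placeholder files into a tar archive under the same member names the component uses, then lists the members. The file contents, temporary paths, and archive name are stand-ins for the real artifacts.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Placeholder artifacts; the pipeline passes real files produced upstream.
    sources = {
        "model.onnx": tmp_path / "onnx_model",
        "preprocessor.joblib": tmp_path / "preprocessor",
        "explainer.dll": tmp_path / "explainer",
    }
    for path in sources.values():
        path.write_bytes(b"placeholder")

    archive = tmp_path / "credit-risk.tar"  # hypothetical archive name
    with tarfile.open(str(archive), "w") as tar:
        for arcname, path in sources.items():
            # arcname decouples the name inside the archive from the local path.
            tar.add(str(path), arcname=arcname)

    with tarfile.open(str(archive), "r") as tar:
        member_names = tar.getnames()

print(member_names)
```

Using `arcname` matters here because the input paths assigned by the pipeline are arbitrary, while the serving containers presumably expect fixed file names inside the archive.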

## Define a component to deploy the inference service

The inference service has three custom-built components:
* Transformer
* Predictor
* Explainer

The transformer and explainer use gRPC to communicate with the predictor, which improves data transfer performance.

The annotation `"sidecar.istio.io/inject": "false"` is added to the service because we don't want Istio to enforce the same access restrictions that are used for Kubeflow services.

The deployed service will pull and expand the model file from S3 in each pod. (The transformer, predictor, and explainer run as separate pods).
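
One detail of the deployment component is how the class labels reach the transformer: the `target_names` mapping saved by the training step has string keys after a JSON round trip, and it is flattened into an index-ordered JSON list for the `TARGET_NAMES` environment variable. Here is a minimal sketch of that conversion, using the default labels from the training component. `target_names_env_value` is a hypothetical helper name.

```python
import json


def target_names_env_value(target_processing: dict) -> str:
    """Turn {"0": "No Risk", "1": "Risk"} into '["No Risk", "Risk"]'."""
    names = target_processing["target_names"]
    # Missing indices fall back to "?", mirroring the deployment component.
    ordered = [names.get(str(idx), "?") for idx in range(len(names))]
    return json.dumps(ordered)


# Simulate the JSON round trip: integer keys become strings when serialized.
saved = json.dumps({"threshold": 0.5, "target_names": {0: "No Risk", 1: "Risk"}})
loaded = json.loads(saved)

env_value = target_names_env_value(loaded)
print(env_value)
```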

In [13]:
def deploy_inference_service(name:str,
                             target_processing_config: InputPath(str),
                             model_archive_s3: str,
                             transformer_image: str,
                             predictor_image: str,
                             explainer_image: str,
                             predictor_max_replicas: int = 1,
                             predictor_min_replicas: int = 1,
                             predictor_concurrency_target: int = None,
                             transformer_max_replicas: int = 1,
                             transformer_min_replicas: int = 1,
                             transformer_concurrency_target: int = None,
                             explainer_max_replicas: int = 1,
                             explainer_min_replicas: int = 1,
                             explainer_concurrency_target: int = None
                            ):
    import kserve
    from kubernetes import client, config
    from kubernetes.client import (V1ServiceAccount, 
                                   V1Container, 
                                   V1EnvVar, 
                                   V1ObjectMeta, 
                                   V1ContainerPort, 
                                   V1ObjectReference,
                                   V1ResourceRequirements
                                  )
    from kserve import KServeClient
    from kserve import constants
    from kserve import V1beta1PredictorSpec
    from kserve import V1beta1ExplainerSpec
    from kserve import V1beta1TransformerSpec
    from kserve import V1beta1InferenceServiceSpec
    from kserve import V1beta1InferenceService
    import json
    from http import HTTPStatus
    import logging
    import yaml
    from time import sleep

    with open(target_processing_config, "r") as f:
        target_processing = json.load(f)

    prediction_threshold = target_processing["threshold"]
    target_names = json.dumps(
        [target_processing["target_names"].get(str(idx),"?")
         for idx in range(len(target_processing["target_names"]))]
    )

    config.load_incluster_config()
    
    # Setup the service account for the inference service to run under
    # The service account must have access to the secret with the
    # MinIO credentials
    SERVICE_ACCOUNT = "credit-risk-inference-sa"

    sa = V1ServiceAccount(
        api_version="v1",
        kind="ServiceAccount",
        metadata=V1ObjectMeta(name=SERVICE_ACCOUNT, 
                              namespace="{{workflow.namespace}}"),
        secrets=[V1ObjectReference(name="minio-credentials")]
    )
    corev1 = client.CoreV1Api()
    try:
        corev1.create_namespaced_service_account(namespace="{{workflow.namespace}}",
                                                 body=sa)
    except client.exceptions.ApiException as e:
        if e.status==HTTPStatus.CONFLICT:
            corev1.patch_namespaced_service_account(name=SERVICE_ACCOUNT,
                                                    namespace="{{workflow.namespace}}",
                                                    body=sa)
        else:
            raise
    
    ### Build the InferenceService spec components
    
    # Predictor component
    # We use a custom predictor in this example, and so it is
    # necessary to be more complete with the spec than we would
    # need for 'built-in' predictors such as Triton.
    predictor_spec = V1beta1PredictorSpec(
        max_replicas=predictor_max_replicas,
        min_replicas=predictor_min_replicas,
        scale_target=predictor_concurrency_target,
        scale_metric="concurrency",
        containers=[
            V1Container(
                name="kserve-container",
                image=predictor_image,
                args=["--grpc_port=8081", f"--model_name={name}"],
                ports=[V1ContainerPort(
                    container_port=8081,
                    name="h2c",
                    protocol="TCP"
                )],
                resources=V1ResourceRequirements(
                    limits={"memory": "10Gi"},
                    requests={"memory": "2Gi"},
                ),
                env=[
                 V1EnvVar(
                     name="STORAGE_URI", value=model_archive_s3
                 ),
                 V1EnvVar(
                     name="THRESHOLD",
                     value=str(prediction_threshold)
                 )
                ],
            )
        ],
        service_account_name=SERVICE_ACCOUNT
    )

    # Transformer spec
    # Uses a custom image, explicitly specifies that GRPC should be used
    # to communicate with the predictor
    transformer_spec = V1beta1TransformerSpec(
        max_replicas=transformer_max_replicas,
        min_replicas=transformer_min_replicas,
        scale_target=transformer_concurrency_target,
        scale_metric="concurrency",
        containers=[
            V1Container(
                name="kserve-container",
                image=transformer_image,
                args=["--protocol=grpc-v2", f"--model_name={name}"],
                resources=V1ResourceRequirements(
                    limits={"memory": "10Gi"},
                    requests={"memory": "2Gi"},
                ),
                env=[
                 V1EnvVar(
                     name="STORAGE_URI", value=model_archive_s3
                 ),
                 V1EnvVar(
                     name="TARGET_NAMES", value=target_names
                 )
                ],
            )
        ],
        service_account_name=SERVICE_ACCOUNT
    )

    # Explainer spec
    # A custom image is used
    # GRPC is explicitly specified as the protocol to communicate with the
    # predictor.
    explainer_spec=V1beta1ExplainerSpec(
        max_replicas=explainer_max_replicas,
        min_replicas=explainer_min_replicas,
        scale_target=explainer_concurrency_target,
        scale_metric="concurrency",
        containers=[
            V1Container(
                name="kserve-container",
                image=explainer_image,
                args=["--protocol=grpc-v2", f"--model_name={name}"],
                resources=V1ResourceRequirements(
                    limits={"memory": "10Gi"},
                    requests={"memory": "4Gi"},
                ),
                env=[
                 V1EnvVar(
                     name="STORAGE_URI", value=model_archive_s3
                 ),
                 V1EnvVar(
                     name="EXPLAIN_MIN_SAMPLES_START", value="15000"
                 )
                ],
            )
        ],
        service_account_name=SERVICE_ACCOUNT
    )


    # Build the inference service
    # enable-prometheus-scraping causes the service to export metrics.
    inference_service=V1beta1InferenceService(
        api_version=constants.KSERVE_V1BETA1,
        kind=constants.KSERVE_KIND,
        metadata=V1ObjectMeta(name=name, 
                              namespace="{{workflow.namespace}}",
                              annotations={"sidecar.istio.io/inject": "false",
                                           "serving.kserve.io/enable-prometheus-scraping" : "true"}),
        spec=V1beta1InferenceServiceSpec(predictor=predictor_spec,
                                         transformer=transformer_spec,
                                         explainer=explainer_spec)
    )
    
    # Dump the created inference service spec to the log for easier debugging
    logging.info(
        yaml.dump(
            client.ApiClient().sanitize_for_serialization(inference_service)
        )
    )

    # Create the service, handling a conflict error for existing service
    api_instance = client.CustomObjectsApi()
    while True:
        try:
            api_instance.create_namespaced_custom_object(
                    group=constants.KSERVE_GROUP,
                    version=inference_service.api_version.split("/")[1],
                    namespace="{{workflow.namespace}}",
                    plural=constants.KSERVE_PLURAL,
                    body=inference_service)
            break
        except client.exceptions.ApiException as api_exception:
            if api_exception.status==HTTPStatus.CONFLICT:
                try:
                    api_instance.delete_namespaced_custom_object(
                        group=constants.KSERVE_GROUP,
                        version=inference_service.api_version.split("/")[1],
                        namespace="{{workflow.namespace}}",
                        plural=constants.KSERVE_PLURAL,
                        name=name)
                    sleep(15)
                except client.exceptions.ApiException as api_exception2:
                    if api_exception2.status in {HTTPStatus.NOT_FOUND, HTTPStatus.GONE}:
                        pass
                    else:
                        raise

            else:
                raise
            
    # Wait for inference service to become ready
    kclient = KServeClient()
    kclient.wait_isvc_ready(name=name, namespace="{{workflow.namespace}}")
    
    if not kclient.is_isvc_ready(name=name, namespace="{{workflow.namespace}}"):
        raise RuntimeError(f"The inference service {name} is not ready!")

deploy_inference_service_comp = kfp.components.create_component_from_func(
    func=deploy_inference_service, base_image=KSERVE_IMAGE
)
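
The create-or-replace loop in `deploy_inference_service` retries until the create call succeeds, so a service that never finishes deleting would keep the step running indefinitely. A minimal sketch of the same pattern with a bounded number of attempts follows. The `create` and `delete` callables, the `ConflictError` and `NotFoundError` classes, and the `max_attempts` and `delay_seconds` parameters are hypothetical stand-ins for the Kubernetes client calls and `ApiException` status checks above; they are not part of the kubernetes or KServe APIs.

```python
import time
from typing import Callable


class ConflictError(Exception):
    """Hypothetical stand-in for an ApiException with status 409."""


class NotFoundError(Exception):
    """Hypothetical stand-in for an ApiException with status 404 or 410."""


def create_or_replace(
    create: Callable[[], None],
    delete: Callable[[], None],
    max_attempts: int = 5,
    delay_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Create a resource, deleting any existing copy first on conflict.

    Returns the number of create attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            create()
            return attempt
        except ConflictError:
            try:
                delete()
                # Give the cluster time to finish removing the old resource.
                sleep(delay_seconds)
            except NotFoundError:
                # Already gone; retry the create immediately.
                pass
    raise TimeoutError(f"resource still conflicts after {max_attempts} attempts")
```

Bounding the attempts turns a stuck deletion into a visible step failure instead of a run that never ends.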


## Define the pipeline

The load_training_data_task and load_test_data_task must use the loader component that matches the data source (DB2 or PostgreSQL); this pipeline uses PostgreSQL for both the training and test data. Connection credentials are passed to the tasks as environment variables that reference Kubernetes secrets.
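
Inside a component, these credentials arrive as ordinary environment variables. The following is a minimal sketch of how a loader might collect the PostgreSQL settings, using the variable names set by `add_pg_connection_secrets` in the pipeline definition. The `pg_settings_from_env` helper and its return shape are hypothetical and are not taken from the actual loader component.

```python
import os
from typing import Dict, Mapping

# Variable names match those set by add_pg_connection_secrets.
PG_ENV_VARS = ("PG_HOST", "PG_DB_NAME", "PG_USER", "PG_PWD", "PG_PORT")


def pg_settings_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect PostgreSQL connection settings, failing fast if any are missing."""
    missing = [name for name in PG_ENV_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in PG_ENV_VARS}
```

Checking for missing variables up front gives a clearer error than a failed database connection when a secret is absent or misnamed.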


In [14]:
PIPELINE_NAME = "Build-Credit-Risk-Model"

In [15]:
from kubernetes.client import V1PersistentVolumeClaimVolumeSource, V1Volume, V1VolumeMount

@dsl.pipeline(
    name=PIPELINE_NAME,
    description="An example pipeline that builds and deploys a credit risk model",
)
def credit_model_pipeline():
    def env_var_from_secret(env_var_name: str, secret_name: str, secret_key: str) -> V1EnvVar:
        return V1EnvVar(
            name=env_var_name,
            value_from=V1EnvVarSource(
                secret_key_ref=V1SecretKeySelector(name=secret_name, key=secret_key)
            ),
        )
    
    def add_db2_connection_secrets(pipeline_task) -> None:
        pipeline_task.container.add_env_variable(env_var_from_secret("db2_host", "db2-credentials", "host"))
        pipeline_task.container.add_env_variable(env_var_from_secret("db2_user", "db2-credentials", "username"))
        pipeline_task.container.add_env_variable(env_var_from_secret("db2_pwd", "db2-credentials", "password"))
        pipeline_task.container.add_env_variable(env_var_from_secret("db2_port", "db2-credentials", "port"))

    
    def add_pg_connection_secrets(pipeline_task) -> None:
        pipeline_task.container.add_env_variable(V1EnvVar(name="PG_HOST", value="postgresql.{{workflow.namespace}}.svc"))
        pipeline_task.container.add_env_variable(env_var_from_secret("PG_DB_NAME", "postgresql", "database-name"))
        pipeline_task.container.add_env_variable(env_var_from_secret("PG_USER", "postgresql", "database-user"))
        pipeline_task.container.add_env_variable(env_var_from_secret("PG_PWD", "postgresql", "database-password"))
        pipeline_task.container.add_env_variable(V1EnvVar(name="PG_PORT", value="5432"))

    feature_columns = [
            "CheckingStatus",
            "LoanDuration",
            "CreditHistory",
            "LoanPurpose",
            "LoanAmount",
            "ExistingSavings",
            "EmploymentDuration",
            "InstallmentPercent",
            "Sex",
            "OthersOnLoan",
            "CurrentResidenceDuration",
            "OwnsProperty",
            "Age",
            "InstallmentPlans",
            "Housing",
            "ExistingCreditsCount",
            "Job",
            "Dependents",
            "Telephone",
            "ForeignWorker",
        ]
    
    load_training_data_task = load_df_from_postgresql_comp(table_name="TRAIN")
    load_training_data_task.set_display_name("Load_Training_Data")
    load_training_data_task.execution_options.caching_strategy.max_cache_staleness = "P0D"
    add_pg_connection_secrets(load_training_data_task)

    data_quality_report_comp(df=load_training_data_task.outputs["data_frame_pkl"],
                             features=feature_columns)
    
    load_test_data_task = load_df_from_postgresql_comp(table_name="TEST")
    load_test_data_task.set_display_name("Load_Test_Data")
    load_test_data_task.execution_options.caching_strategy.max_cache_staleness = "P0D"
    add_pg_connection_secrets(load_test_data_task)

    fit_preprocessor_task = fit_preprocessor_comp(
        training_df=load_training_data_task.outputs["data_frame_pkl"],
        features=feature_columns
    )

    create_tensorboard_volume = dsl.VolumeOp(
        name="Create PVC for tensorboard",
        resource_name="tensorboard",
        modes=dsl.VOLUME_MODE_RWM,
        size="4G",
        set_owner_reference=True,
    )
    create_tensorboard_volume.add_pod_annotation(
            name="pipelines.kubeflow.org/max_cache_staleness", value="P0D"
        )
        
    configure_tensorboard_task = configure_tensorboard_comp(
        pipeline_name=PIPELINE_NAME,
        pvc_name=create_tensorboard_volume.volume.persistent_volume_claim.claim_name
    )
    
    train_model_task = train_comp(
        training_df=load_training_data_task.outputs["data_frame_pkl"],
        preprocessor=fit_preprocessor_task.outputs["preprocessor_pkl"],
        target="Risk",
        tensorboard_dir="/tensorboard",
    )
    train_model_task.after(configure_tensorboard_task)
    train_model_task.add_pvolumes({"/tensorboard": create_tensorboard_volume.volume})

    
    convert_model_to_onnx_task = convert_model_to_onnx_comp(tf_model=train_model_task.outputs["model"])
    
    train_explainer_task = train_explainer_comp(train_df=load_training_data_task.outputs["data_frame_pkl"],
                                                preprocessor=fit_preprocessor_task.outputs["preprocessor_pkl"],
                                                onnx_model=convert_model_to_onnx_task.outputs["onnx_model"],
                                                target_processing_config=train_model_task.outputs["target_processing_config"])
    
    evaluate_model_task = eval_comp(
        load_test_data_task.outputs["data_frame_pkl"],
        preprocessor=fit_preprocessor_task.outputs["preprocessor_pkl"],
        onnx_model=convert_model_to_onnx_task.outputs["onnx_model"],
        target_processing_config=train_model_task.outputs["target_processing_config"],
        target="Risk"
    )
    
    upload_artifacts_task = upload_artifacts_comp(
        onnx_model=convert_model_to_onnx_task.outputs["onnx_model"],
        preprocessor=fit_preprocessor_task.outputs["preprocessor_pkl"],
        explainer=train_explainer_task.outputs["explainer_dll"],
        archive_name="credit-risk.tar"
    )
    upload_artifacts_task.container.add_env_variable(env_var_from_secret("MINIO_ID", "mlpipeline-minio-artifact", "accesskey"))
    upload_artifacts_task.container.add_env_variable(env_var_from_secret("MINIO_PWD", "mlpipeline-minio-artifact", "secretkey"))
    upload_artifacts_task.after(evaluate_model_task)
    
    deploy_inference_service_task=deploy_inference_service_comp(name="credit-risk",
                                                                target_processing_config=train_model_task.outputs["target_processing_config"],
                                                                model_archive_s3=upload_artifacts_task.output,
                                                                transformer_image=TRANSFORMER_IMAGE,
                                                                predictor_image=PREDICTOR_IMAGE,
                                                                explainer_image=EXPLAINER_IMAGE,
                                                                predictor_max_replicas=4,
                                                                predictor_concurrency_target=1,
                                                                transformer_max_replicas=4,
                                                                transformer_concurrency_target=1
                                                               )


## Compile the pipeline

Compiling the pipeline creates a YAML file that describes it. We can then create a pipeline from that YAML file, and a run of the pipeline can be started from the UI without running this notebook.

In [16]:
pipeline_conf = kfp.dsl.PipelineConf()

def provide_column_info_transformer(op: dsl.ContainerOp):
    
    if isinstance(op, dsl.ContainerOp):
        op.container.add_env_variable(
            V1EnvVar(name="COLUMNS",
                    value_from=V1EnvVarSource(
                                         config_map_key_ref=V1ConfigMapKeySelector(
                                             name="credit-risk-columns",
                                             key="columns"
                                         )
                                     )
                    )
        )

pipeline_conf.add_op_transformer(provide_column_info_transformer)



In [17]:
       
kfp.compiler.Compiler().compile(
    pipeline_func=credit_model_pipeline,
    package_path=f"{PIPELINE_NAME}.yaml",
    pipeline_conf=pipeline_conf,
)


## Upload the pipeline

The compiled pipeline is uploaded here. We could also do this manually using the UI.

Only one pipeline can exist with a given name, so if this script has been run before, we delete the existing pipeline before uploading.
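
The lookup in the next cell requests a single page of up to 999 pipelines, so on a server with more pipelines than that a name match could be missed. A minimal sketch of following page tokens is shown here. It assumes the list response exposes `pipelines` and `next_page_token` attributes, as the KFP v1 list responses do; the `find_ids_by_name` helper and the `list_page` callable are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, List


def find_ids_by_name(list_page: Callable[[str], Any], name: str) -> List[str]:
    """Return the ids of all pipelines called `name`, following page tokens.

    `list_page` takes a page token and returns an object with `pipelines`
    and `next_page_token` attributes (assumed response shape).
    """
    ids: List[str] = []
    token = ""
    while True:
        response = list_page(token)
        # `pipelines` may be None when a page is empty.
        ids.extend(p.id for p in (response.pipelines or []) if p.name == name)
        token = response.next_page_token
        if not token:
            return ids
```

With a real client, this might be called as `find_ids_by_name(lambda t: client.list_pipelines(page_token=t, page_size=100), PIPELINE_NAME)`, assuming `client` is a `kfp.Client()` instance.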

In [18]:
def delete_pipeline(pipeline_name: str):
    """Deletes every pipeline with the specified name"""

    client = kfp.Client()
    existing_pipelines = client.list_pipelines(page_size=999).pipelines
    matches = (
        [ep.id for ep in existing_pipelines if ep.name == pipeline_name]
        if existing_pipelines
        else []
    )
    # Avoid shadowing the built-in id() with the loop variable
    for pipeline_id in matches:
        client.delete_pipeline(pipeline_id)


In [19]:
# Pipeline names need to be unique, so before we upload,
# check for and delete any pipeline with the same name
delete_pipeline(PIPELINE_NAME)

client = kfp.Client()
uploaded_pipeline = client.upload_pipeline(f"{PIPELINE_NAME}.yaml", PIPELINE_NAME)

## Create a pipeline run

This creates a run of the pipeline. As with uploading, we could do this manually using the UI.

Running a pipeline requires a KFP experiment. This code will create the experiment if it does not already exist.

Grouping runs into an experiment is advantageous, since it allows us to compare metrics between pipeline runs.

In [20]:
def get_experiment_id(experiment_name: str) -> str:
    """Returns the id for the experiment, creating the experiment if needed"""
    client = kfp.Client()
    existing_experiments = client.list_experiments(page_size=999).experiments
    matches = (
        [ex.id for ex in existing_experiments if ex.name == experiment_name]
        if existing_experiments
        else []
    )

    if matches:
        return matches[0]

    exp = client.create_experiment(experiment_name)
    return exp.id


In [21]:
run = client.run_pipeline(
    experiment_id=get_experiment_id("credit-risk"),
    job_name="credit-risk",
    pipeline_id=uploaded_pipeline.id,
)

## Wait for pipeline completion

This shows how to wait for the pipeline to complete and check for success. The metrics from the evaluate component are also included in the response.

In [22]:
TWENTY_MIN = 20 * 60
result = client.wait_for_run_completion(run.id, timeout=TWENTY_MIN)
{
    "status": result.run.status,
    "error": result.run.error,
    "time": str(result.run.finished_at - result.run.created_at),
    "metrics": result.run.metrics,
}


{'status': 'Succeeded',
 'error': None,
 'time': '0:06:52',
 'metrics': [{'format': 'RAW',
   'name': 'acc',
   'node_id': 'build-credit-risk-model-v5d2g-3574078477',
   'number_value': 0.786},
  {'format': 'RAW',
   'name': 'F1',
   'node_id': 'build-credit-risk-model-v5d2g-3574078477',
   'number_value': 0.6786786786786787}]}
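
The metrics list can be flattened into a dictionary for easier checks, for example to stop a notebook run when accuracy drops. This is a minimal sketch, assuming each metric entry exposes `name` and `number_value` attributes as the output above suggests. The `metrics_by_name` helper and the 0.75 threshold are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Optional


def metrics_by_name(metrics: Optional[Iterable[Any]]) -> Dict[str, float]:
    """Map each metric name to its numeric value; an empty dict if none."""
    return {m.name: m.number_value for m in (metrics or [])}


# Stand-in objects that mirror the values printed above.
example_metrics = [
    SimpleNamespace(name="acc", number_value=0.786),
    SimpleNamespace(name="F1", number_value=0.6786786786786787),
]

scores = metrics_by_name(example_metrics)
# Hypothetical quality gate: stop if accuracy falls below a chosen threshold.
if scores["acc"] < 0.75:
    raise RuntimeError(f"accuracy too low: {scores['acc']}")
```

With the real response, the equivalent call would be `metrics_by_name(result.run.metrics)`.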