# End-to-End Machine Learning Pipeline for Income Prediction

We use [demographic features from the 1994 US census](https://archive.ics.uci.edu/ml/datasets/census+income) to build an end-to-end machine learning pipeline. The pipeline is also annotated so that it can be run as a [Kubeflow Pipeline](https://www.kubeflow.org/docs/pipelines/overview/pipelines-overview/) using the [Kale](https://github.com/kubeflow-kale/kale) pipeline generator.

The notebook/pipeline stages are:

 1. Setup
   * Imports
   * Pipeline parameters
   * MinIO client test
 1. Train a simple sklearn model and push it to MinIO
 1. Prepare an Anchors explainer for the model and push it to MinIO
 1. Test the explainer
 1. Train an isolation forest outlier detector for the model and push it to MinIO
 1. Deploy a Seldon model and test it
 1. Deploy a KFServing model and test it
 1. Deploy an outlier detector



In [58]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from alibi.explainers import AnchorTabular
from alibi.datasets import fetch_adult
from minio import Minio
from minio.error import ResponseError
from joblib import dump, load
import dill
import time
import json
from subprocess import run, Popen, PIPE
from alibi_detect.utils.data import create_outlier_batch

In [2]:
MINIO_HOST="minio-service.kubeflow:9000"
MINIO_ACCESS_KEY="minio"
MINIO_SECRET_KEY="minio123"
MINIO_MODEL_BUCKET="seldon"
INCOME_MODEL_PATH="sklearn/income/model"
EXPLAINER_MODEL_PATH="sklearn/income/explainer"
OUTLIER_MODEL_PATH="sklearn/income/outlier"
DEPLOY_NAMESPACE="admin"

In [3]:
def get_minio():
    return Minio(MINIO_HOST,
                    access_key=MINIO_ACCESS_KEY,
                    secret_key=MINIO_SECRET_KEY,
                    secure=False)

In [4]:
minioClient = get_minio()
buckets = minioClient.list_buckets()
for bucket in buckets:
    print(bucket.name, bucket.creation_date)

mlpipeline 2020-07-04 08:55:55.417000+00:00
mybucket 2020-07-17 09:54:33.136000+00:00
seldon 2020-07-24 19:24:13.525000+00:00


In [5]:
if not minioClient.bucket_exists(MINIO_MODEL_BUCKET):
    minioClient.make_bucket(MINIO_MODEL_BUCKET)

## Train Model

In [6]:
adult = fetch_adult()
adult.keys()

dict_keys(['data', 'target', 'feature_names', 'target_names', 'category_map'])

In [7]:
data = adult.data
target = adult.target
feature_names = adult.feature_names
category_map = adult.category_map

Note that for your own datasets you can use our utility function [gen_category_map](../api/alibi.utils.data.rst) to create the category map:

In [8]:
from alibi.utils.data import gen_category_map
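
To make the idea of a category map concrete, here is a minimal pure-Python sketch. It is not the implementation of `gen_category_map`; the helper names `build_category_map` and `encode_rows` and the toy rows are hypothetical. It only illustrates the shape used in this notebook: a dict from column index to the list of category names, with each categorical value stored as its position in that list.

```python
def build_category_map(rows, categorical_columns):
    """Map each categorical column index to the sorted list of its distinct values."""
    return {col: sorted({row[col] for row in rows}) for col in categorical_columns}


def encode_rows(rows, category_map):
    """Replace each categorical value by its position in that column's category list."""
    return [
        [category_map[col].index(value) if col in category_map else value
         for col, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy rows: age, work class, sex.
toy_rows = [
    [39, "Private", "Male"],
    [50, "Self-emp", "Female"],
    [38, "Private", "Female"],
]
toy_category_map = build_category_map(toy_rows, categorical_columns=[1, 2])
print(toy_category_map)
print(encode_rows(toy_rows, toy_category_map))
```

Numeric columns (here column 0) are left untouched, which is why the ordinal features in this notebook are simply the columns that do not appear in the map.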

Define shuffled training and test sets:

In [9]:
np.random.seed(0)
data_perm = np.random.permutation(np.c_[data, target])
data = data_perm[:,:-1]
target = data_perm[:,-1]

In [10]:
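idx = 30000
X_train,Y_train = data[:idx,:], target[:idx]
# Note: the test slice starts at idx+1, so row idx ends up in neither the training nor the test set.
X_test, Y_test = data[idx+1:,:], target[idx+1:]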

### Create feature transformation pipeline
Create a feature pre-processor. It needs to have `fit` and `transform` methods. Different types of pre-processing can be applied to all or part of the features. In the example below we standardize the ordinal features and apply one-hot encoding to the categorical features.

Ordinal features:

In [11]:
ordinal_features = [x for x in range(len(feature_names)) if x not in list(category_map.keys())]
ordinal_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

Categorical features:

In [12]:
categorical_features = list(category_map.keys())
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])

Combine the two transformers (they are fitted later, as part of the model pipeline):

In [13]:
preprocessor = ColumnTransformer(transformers=[('num', ordinal_transformer, ordinal_features),
                                               ('cat', categorical_transformer, categorical_features)])
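
The `fit`/`transform` contract used by the pre-processing steps above can be sketched with two toy classes. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: `ToyStandardizer` and `ToyOneHot` are hypothetical names, they work on a single column given as a plain list, and the one-hot encoder emits an all-zero row for a category it did not see during `fit`, which is the behaviour we expect from `handle_unknown='ignore'`.

```python
import statistics


class ToyStandardizer:
    """Learn mean and population standard deviation in fit, rescale in transform."""

    def fit(self, column):
        self.mean_ = statistics.mean(column)
        self.std_ = statistics.pstdev(column) or 1.0  # avoid dividing by zero
        return self

    def transform(self, column):
        return [(value - self.mean_) / self.std_ for value in column]


class ToyOneHot:
    """Learn the category list in fit, emit one indicator per category in transform."""

    def fit(self, column):
        self.categories_ = sorted(set(column))
        return self

    def transform(self, column):
        # A value not seen during fit matches no category and yields all zeros.
        return [[1 if value == cat else 0 for cat in self.categories_]
                for value in column]


scaled = ToyStandardizer().fit([20, 30, 40]).transform([20, 30, 40])
encoded = ToyOneHot().fit([0, 1, 2]).transform([1, 7])
print(scaled)
print(encoded)
```

The point of the split is that statistics are learned once on the training data in `fit` and then reused unchanged in `transform` on any later data, including requests sent to the deployed model.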

### Train Random Forest model

Fit on pre-processed (imputing, OHE, standardizing) data.

In [14]:
np.random.seed(0)
clf = RandomForestClassifier(n_estimators=50)

In [15]:
model=Pipeline(steps=[("preprocess",preprocessor),("model",clf)])
model.fit(X_train,Y_train)

Pipeline(memory=None,
     steps=[('preprocess', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('num', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

Define the predict function:

In [16]:
def predict_fn(x):
    return model.predict(x)

In [17]:
#predict_fn = lambda x: clf.predict(preprocessor.transform(x))
print('Train accuracy: ', accuracy_score(Y_train, predict_fn(X_train)))
print('Test accuracy: ', accuracy_score(Y_test, predict_fn(X_test)))

Train accuracy:  0.9655333333333334
Test accuracy:  0.855859375


In [18]:
dump(model, 'model.joblib') 

['model.joblib']

In [19]:
print(get_minio().fput_object(MINIO_MODEL_BUCKET, f"{INCOME_MODEL_PATH}/model.joblib", 'model.joblib'))

694bfd00872125a7906e3a12413a3c71-7


## Train Explainer

In [20]:
model.predict(X_train)
explainer = AnchorTabular(predict_fn, feature_names, categorical_names=category_map)

Discretize the ordinal features into quartiles

In [21]:
explainer.fit(X_train, disc_perc=[25, 50, 75])
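
As a rough illustration of what discretizing into quartiles means, the sketch below computes the 25th, 50th and 75th percentile cut points of a numeric column and assigns each value a bin index. It is not alibi's implementation: the helper names and the toy values are hypothetical, and the percentile definition used by `statistics.quantiles` may differ in detail from the one alibi uses.

```python
import statistics
from bisect import bisect_left


def quartile_cuts(values):
    """Return the three cut points that split the values into four bins."""
    return statistics.quantiles(values, n=4, method="inclusive")


def to_bin(value, cuts):
    """Bin 0 holds values <= cuts[0], bin 1 holds cuts[0] < value <= cuts[1], and so on."""
    return bisect_left(cuts, value)


toy_ages = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical numeric feature
toy_cuts = quartile_cuts(toy_ages)
toy_bins = [to_bin(value, toy_cuts) for value in [2, 3, 4, 7, 9]]
print(toy_cuts)
print(toy_bins)
```

After this step an ordinal feature can appear in an anchor as a range condition, in the same way a categorical feature appears as an equality condition.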

In [22]:
# The with block closes the file on exit, so no explicit close() is needed.
with open("explainer.dill", "wb") as dill_file:
    dill.dump(explainer, dill_file)
print(get_minio().fput_object(MINIO_MODEL_BUCKET, f"{EXPLAINER_MODEL_PATH}/explainer.dill", 'explainer.dill'))

9b0f4efeb7c68d482cac140a5aceb7d8-2


## Get Explanation

Below, we get an anchor for the prediction of the first observation in the test set. An anchor is a sufficient condition: when the anchor holds, the prediction should be the same as the prediction for this instance.

In [23]:
model.predict(X_train)
idx = 0
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predict_fn(X_test[idx].reshape(1, -1))[0]])

Prediction:  <=50K


We set the precision threshold to 0.95. This means that predictions on observations where the anchor holds will be the same as the prediction on the explained instance at least 95% of the time.

In [24]:
explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation['names'])))
print('Precision: %.2f' % explanation['precision'])
print('Coverage: %.2f' % explanation['coverage'])

Anchor: Marital Status = Separated AND Sex = Female
Precision: 0.96
Coverage: 0.11
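
To make precision and coverage concrete, the sketch below computes both for a fixed rule on a small hand-made table. This is a simplification and not how `AnchorTabular` works internally, since the Anchors algorithm estimates precision by sampling perturbed instances rather than counting rows of a dataset. The rows, predictions and helper names are all hypothetical.

```python
# Hypothetical rows: (marital status, sex, model prediction).
toy_table = [
    ("Separated", "Female", "<=50K"),
    ("Separated", "Female", "<=50K"),
    ("Separated", "Female", ">50K"),
    ("Separated", "Male", "<=50K"),
    ("Married", "Female", ">50K"),
    ("Married", "Male", ">50K"),
    ("Never-Married", "Female", "<=50K"),
    ("Separated", "Female", "<=50K"),
    ("Married", "Male", "<=50K"),
    ("Never-Married", "Male", "<=50K"),
]


def anchor_holds(row):
    """The rule 'Marital Status = Separated AND Sex = Female'."""
    return row[0] == "Separated" and row[1] == "Female"


def precision_and_coverage(table, rule, explained_prediction):
    covered = [row for row in table if rule(row)]
    # Coverage: share of all rows to which the rule applies.
    coverage = len(covered) / len(table)
    # Precision: share of covered rows that get the same prediction as the explained instance.
    same = sum(1 for row in covered if row[2] == explained_prediction)
    precision = same / len(covered) if covered else 0.0
    return precision, coverage


toy_precision, toy_coverage = precision_and_coverage(toy_table, anchor_holds, "<=50K")
print(toy_precision, toy_coverage)
```

A longer rule usually raises precision and lowers coverage, which is why the explainer stops adding conditions once the precision threshold is met.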


## Train Outlier Detector

In [25]:
from alibi_detect.od import IForest

od = IForest(
    threshold=0.,
    n_estimators=200,
)


In [26]:
od.fit(X_train)



In [27]:
np.random.seed(0)
perc_outlier = 5
threshold_batch = create_outlier_batch(X_train, Y_train, n_samples=1000, perc_outlier=perc_outlier)
X_threshold, y_threshold = threshold_batch.data.astype('float'), threshold_batch.target
#X_threshold = (X_threshold - mean) / stdev
print('{}% outliers'.format(100 * y_threshold.mean()))

5.0% outliers


In [28]:
od.infer_threshold(X_threshold, threshold_perc=100-perc_outlier)
print('New threshold: {}'.format(od.threshold))
threshold = od.threshold

New threshold: 0.029017499251428627
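
The threshold inference above can be pictured as taking a percentile of the outlier scores. The sketch below assumes, without having checked the alibi-detect source, that `infer_threshold` sets the threshold to the `threshold_perc` percentile of the instance scores and that an instance is flagged when its score is strictly above the threshold. The `percentile` helper and the toy scores are hypothetical.

```python
def percentile(values, q):
    """Percentile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


toy_scores = [float(score) for score in range(1, 21)]  # 20 hypothetical outlier scores
toy_perc_outlier = 5
toy_threshold = percentile(toy_scores, 100 - toy_perc_outlier)

# Assumption: scores strictly above the threshold are flagged as outliers.
toy_flagged = [score for score in toy_scores if score > toy_threshold]
print(toy_threshold, toy_flagged)
```

With 5% outliers expected in the batch, the 95th percentile is the natural cut: roughly the top 5% of scores end up above it.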


In [29]:
X_outlier = [[300,  4,  4,  2,  1,  4,  4,  0,  0,  0, 600,  9]]

In [30]:
od.predict(
    X_outlier
)

{'data': {'instance_score': array([0.04198649]),
  'feature_score': None,
  'is_outlier': array([1])},
 'meta': {'name': 'IForest',
  'detector_type': 'offline',
  'data_type': 'tabular'}}

In [31]:
from alibi_detect.utils.saving import save_detector, load_detector
from os import listdir
from os.path import isfile, join

filepath="ifoutlier"
save_detector(od, filepath) 
onlyfiles = [f for f in listdir(filepath) if isfile(join(filepath, f))]
for filename in onlyfiles:
    print(filename)
    print(get_minio().fput_object(MINIO_MODEL_BUCKET, f"{OUTLIER_MODEL_PATH}/{filename}", join(filepath, filename)))

W0727 16:22:05.788773 140193113450304 saving.py:68] Directory ifoutlier does not exist and is now created.


meta.pickle
e57a0ae93b75c8a169b2d003231243f8
IForest.pickle
c9c344f140bd36fc2520ed5abcf67d0b


## Deploy Seldon Core Model

In [32]:
secret = f"""apiVersion: v1
kind: Secret
metadata:
  name: seldon-init-container-secret
  namespace: {DEPLOY_NAMESPACE}
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: {MINIO_ACCESS_KEY}
  AWS_SECRET_ACCESS_KEY: {MINIO_SECRET_KEY}
  AWS_ENDPOINT_URL: http://{MINIO_HOST}
  USE_SSL: "false"
"""
with open("secret.yaml","w") as f:
    f.write(secret)
run("cat secret.yaml | kubectl apply -f -", shell=True)

CompletedProcess(args='cat secret.yaml | kubectl apply -f -', returncode=0)
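
The cells in this section return a `CompletedProcess` but do not fail when `kubectl` exits with an error. As a possible alternative, the sketch below passes the manifest on standard input without a shell and raises on a non-zero exit code. The helper `apply_manifest` and its `command` parameter are hypothetical; the demonstration uses the Python interpreter as a stand-in command so that it runs without a cluster.

```python
import subprocess
import sys


def apply_manifest(manifest, command=("kubectl", "apply", "-f", "-")):
    """Send the manifest to the command's stdin and return its stdout.

    check=True raises subprocess.CalledProcessError on a non-zero exit code.
    """
    completed = subprocess.run(
        list(command),
        input=manifest,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


# Stand-in for kubectl: report how many characters arrived on stdin.
stand_in = (sys.executable, "-c", "import sys; print(len(sys.stdin.read()))")
stand_in_output = apply_manifest("kind: Secret\n", command=stand_in)
print(stand_in_output.strip())
```

In the notebook this would be called as `apply_manifest(secret)`, which also avoids writing the credentials to `secret.yaml` on disk.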

In [33]:
sa = f"""apiVersion: v1
kind: ServiceAccount
metadata:
  name: minio-sa
  namespace: {DEPLOY_NAMESPACE}
secrets:
  - name: seldon-init-container-secret
"""
with open("sa.yaml","w") as f:
    f.write(sa)
run("kubectl apply -f sa.yaml", shell=True)

CompletedProcess(args='kubectl apply -f sa.yaml', returncode=0)

In [34]:
model_yaml=f"""apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: income-classifier
  namespace: {DEPLOY_NAMESPACE}
spec:
  predictors:
  - componentSpecs:
    graph:
      implementation: SKLEARN_SERVER
      modelUri: s3://{MINIO_MODEL_BUCKET}/{INCOME_MODEL_PATH}
      envSecretRefName: seldon-init-container-secret
      name: classifier
      logger:
         mode: all
    explainer:
      type: AnchorTabular
      modelUri: s3://{MINIO_MODEL_BUCKET}/{EXPLAINER_MODEL_PATH}
      envSecretRefName: seldon-init-container-secret
    name: default
    replicas: 1
"""
with open("model.yaml","w") as f:
    f.write(model_yaml)
run("kubectl apply -f model.yaml", shell=True)

CompletedProcess(args='kubectl apply -f model.yaml', returncode=0)

In [35]:
run(f"kubectl rollout status -n {DEPLOY_NAMESPACE} deploy/$(kubectl get deploy -l seldon-deployment-id=income-classifier -o jsonpath='{{.items[0].metadata.name}}' -n {DEPLOY_NAMESPACE})", shell=True)

CompletedProcess(args="kubectl rollout status -n admin deploy/$(kubectl get deploy -l seldon-deployment-id=income-classifier -o jsonpath='{.items[0].metadata.name}' -n admin)", returncode=0)

In [36]:
run(f"kubectl rollout status -n {DEPLOY_NAMESPACE} deploy/$(kubectl get deploy -l seldon-deployment-id=income-classifier -o jsonpath='{{.items[1].metadata.name}}' -n {DEPLOY_NAMESPACE})", shell=True)

CompletedProcess(args="kubectl rollout status -n admin deploy/$(kubectl get deploy -l seldon-deployment-id=income-classifier -o jsonpath='{.items[1].metadata.name}' -n admin)", returncode=0)

Make a prediction request

In [37]:
payload='{"data": {"ndarray": [[53,4,0,2,8,4,4,0,0,0,60,9]]}}'
cmd=f"""curl -d '{payload}' \
   http://income-classifier-default.{DEPLOY_NAMESPACE}:8000/api/v1.0/predictions \
   -H "Content-Type: application/json"
"""
ret = Popen(cmd, shell=True,stdout=PIPE)
raw = ret.stdout.read().decode("utf-8")
print(raw)

{"data":{"names":["t:0","t:1"],"ndarray":[[0.88,0.12]]},"meta":{}}
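
The response is a JSON document with one row of class probabilities per input row. A minimal sketch of turning it into a label is shown below, using the response printed above as sample input. The helper `top_class` is hypothetical, and the class name list assumes that index 0 is `<=50K` and index 1 is `>50K`, as in `adult.target_names`.

```python
import json

# Response body copied from the prediction request above.
sample_response = '{"data":{"names":["t:0","t:1"],"ndarray":[[0.88,0.12]]},"meta":{}}'
assumed_class_names = ["<=50K", ">50K"]  # assumption: same order as adult.target_names


def top_class(raw_response, class_names):
    """Return (label, probability) of the most likely class for each input row."""
    body = json.loads(raw_response)
    results = []
    for probabilities in body["data"]["ndarray"]:
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        results.append((class_names[best], probabilities[best]))
    return results


print(top_class(sample_response, assumed_class_names))
```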



Make an explanation request

In [38]:
payload='{"data": {"ndarray": [[53,4,0,2,8,4,4,0,0,0,60,9]]}}'
cmd=f"""curl -d '{payload}' \
   http://income-classifier-default-explainer.{DEPLOY_NAMESPACE}:9000/api/v1.0/explain \
   -H "Content-Type: application/json"
"""
ret = Popen(cmd, shell=True,stdout=PIPE)
raw = ret.stdout.read().decode("utf-8")
print(raw)

{"names": ["Marital Status = Separated", "Sex = Female"], "precision": 0.9635627530364372, "coverage": 0.109, "raw": {"feature": [3, 7], "mean": [0.898186889818689, 0.9635627530364372], "precision": [0.898186889818689, 0.9635627530364372], "coverage": [0.1797, 0.109], "examples": [{"covered": [[71, "Private", "Dropout", "Separated", "Service", "Not-in-family", "Asian-Pac-Islander", "Female", "Capital Gain <= 0.00", "Capital Loss <= 0.00", 75, "United-States"], [34, "Self-emp-not-inc", "High School grad", "Separated", "Blue-Collar", "Husband", "White", "Male", "Capital Gain <= 0.00", "Capital Loss <= 0.00", 40, "United-States"], [71, "Private", "High School grad", "Separated", "Sales", "Husband", "White", "Male", "Capital Gain <= 0.00", 2467, 52, "United-States"], [30, "Private", "Bachelors", "Separated", "Other", "Not-in-family", "White", "Female", 4787, "Capital Loss <= 0.00", 45, "United-States"], [73, "Private", "High School grad", "Separated", "Service", "Not-in-family", "White", "

## Deploy Outlier Detector

In [55]:
outlier_yaml=f"""apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: income-outlier
  namespace: {DEPLOY_NAMESPACE}
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"
    spec:
      containers:
      - image: seldonio/alibi-detect-server:1.2.2-dev_alibidetect
        imagePullPolicy: IfNotPresent
        args:
        - --model_name
        - adultod
        - --http_port
        - '8080'
        - --protocol
        - seldon.http
        - --storage_uri
        - s3://{MINIO_MODEL_BUCKET}/{OUTLIER_MODEL_PATH}
        - --reply_url
        - http://default-broker       
        - --event_type
        - io.seldon.serving.inference.outlier
        - --event_source
        - io.seldon.serving.incomeod
        - OutlierDetector
        envFrom:
        - secretRef:
            name: seldon-init-container-secret
"""
with open("outlier.yaml","w") as f:
    f.write(outlier_yaml)
run("kubectl apply -f outlier.yaml", shell=True)

CompletedProcess(args='kubectl apply -f outlier.yaml', returncode=0)

In [40]:
trigger_outlier_yaml=f"""apiVersion: eventing.knative.dev/v1alpha1
kind: Trigger
metadata:
  name: income-outlier-trigger
  namespace: {DEPLOY_NAMESPACE}
spec:
  filter:
    sourceAndType:
      type: io.seldon.serving.inference.request
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1alpha1
      kind: Service
      name: income-outlier
"""
with open("outlier_trigger.yaml","w") as f:
    f.write(trigger_outlier_yaml)
run("kubectl apply -f outlier_trigger.yaml", shell=True)

CompletedProcess(args='kubectl apply -f outlier_trigger.yaml', returncode=0)

In [41]:
run(f"kubectl rollout status -n {DEPLOY_NAMESPACE} deploy/$(kubectl get deploy -l serving.knative.dev/service=income-outlier -o jsonpath='{{.items[0].metadata.name}}' -n {DEPLOY_NAMESPACE})", shell=True)

CompletedProcess(args="kubectl rollout status -n admin deploy/$(kubectl get deploy -l serving.knative.dev/service=income-outlier -o jsonpath='{.items[0].metadata.name}' -n admin)", returncode=0)

## Deploy Knative Eventing Event Display

In [46]:
event_display=f"""apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-display
  namespace: {DEPLOY_NAMESPACE}          
spec:
  replicas: 1
  selector:
    matchLabels: &labels
      app: event-display
  template:
    metadata:
      labels: *labels
    spec:
      containers:
        - name: helloworld-go
          # Source code: https://github.com/knative/eventing-contrib/tree/master/cmd/event_display
          image: gcr.io/knative-releases/knative.dev/eventing-contrib/cmd/event_display@sha256:f4628e97a836c77ed38bd3b6fd3d0b06de4d5e7db6704772fe674d48b20bd477
---
kind: Service
apiVersion: v1
metadata:
  name: event-display
  namespace: {DEPLOY_NAMESPACE}
spec:
  selector:
    app: event-display
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
apiVersion: eventing.knative.dev/v1alpha1
kind: Trigger
metadata:
  name: income-outlier-display
  namespace: {DEPLOY_NAMESPACE}
spec:
  broker: default
  filter:
    attributes:
      type: io.seldon.serving.inference.outlier
  subscriber:
    ref:
      apiVersion: v1
      kind: Service
      name: event-display
"""
with open("event_display.yaml","w") as f:
    f.write(event_display)
run("kubectl apply -f event_display.yaml", shell=True)

CompletedProcess(args='kubectl apply -f event_display.yaml', returncode=0)

In [47]:
run(f"kubectl rollout status -n {DEPLOY_NAMESPACE} deploy/event-display -n {DEPLOY_NAMESPACE}", shell=True)

CompletedProcess(args='kubectl rollout status -n admin deploy/event-display -n admin', returncode=0)

## Test Outlier Detection

In [62]:
def predict():
    payload='{"data": {"ndarray": [[300,  4,  4,  2,  1,  4,  4,  0,  0,  0, 600,  9]]}}'
    cmd=f"""curl -d '{payload}' \
       http://income-classifier-default.{DEPLOY_NAMESPACE}:8000/api/v1.0/predictions \
       -H "Content-Type: application/json"
    """
    ret = Popen(cmd, shell=True,stdout=PIPE)
    raw = ret.stdout.read().decode("utf-8")
    print(raw)

In [63]:
def get_outlier_event_display_logs():
    cmd=f"kubectl logs $(kubectl get pod -l app=event-display -o jsonpath='{{.items[0].metadata.name}}' -n {DEPLOY_NAMESPACE}) -n {DEPLOY_NAMESPACE}"
    ret = Popen(cmd, shell=True,stdout=PIPE)
    res = ret.stdout.read().decode("utf-8").split("\n")
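    data = []
    # Each event in the log is a 'Data,' line followed by a line holding a JSON-encoded string,
    # hence the double json.loads. Stop one line early so that res[i+1] always exists.
    for i in range(len(res) - 1):
        if res[i] == 'Data,':
            j = json.loads(json.loads(res[i+1]))
            if "is_outlier" in j["data"]:
                data.append(j)
    # Return the most recent outlier event, or None if none has been logged yet.
    return data[-1] if data else None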
j = None
while j is None:
    predict()
    print("Waiting for outlier logs, sleeping")
    time.sleep(2)
    j = get_outlier_event_display_logs()
    
print(j)
print("Outlier",j["data"]["is_outlier"]==[1])

{"data":{"names":["t:0","t:1"],"ndarray":[[0.92,0.08]]},"meta":{}}

Waiting for outlier logs, sleeping
{'data': {'instance_score': None, 'feature_score': None, 'is_outlier': [1]}, 'meta': {'name': 'IForest', 'detector_type': 'offline', 'data_type': 'tabular'}}
Outlier True
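
The polling loop above never gives up if no outlier event arrives. One possible variant, sketched below, stops after a deadline. The helper `wait_for` and its parameters are hypothetical, and a fake fetch function stands in for `get_outlier_event_display_logs` so that the example runs without a cluster.

```python
import time


def wait_for(fetch, timeout_s=60.0, interval_s=2.0):
    """Call fetch until it returns something other than None or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout_s} seconds")
        time.sleep(interval_s)


# Fake fetch: returns None twice, then a hypothetical outlier event.
attempts = {"count": 0}


def fake_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        return None
    return {"data": {"is_outlier": [1]}}


event = wait_for(fake_fetch, timeout_s=5.0, interval_s=0.01)
print(event, "after", attempts["count"], "attempts")
```

In the notebook, `fetch` could be a small function that calls `predict()` and then returns `get_outlier_event_display_logs()`, which keeps the retry behaviour of the original loop.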


## Clean Up Resources

In [60]:
run(f"kubectl delete sdep income-classifier -n {DEPLOY_NAMESPACE}", shell=True)
run(f"kubectl delete ksvc income-outlier -n {DEPLOY_NAMESPACE}", shell=True)
run(f"kubectl delete sa  minio-sa -n {DEPLOY_NAMESPACE}", shell=True)
run(f"kubectl delete secret seldon-init-container-secret -n {DEPLOY_NAMESPACE}", shell=True)
run(f"kubectl delete deployment event-display -n {DEPLOY_NAMESPACE}", shell=True)
run(f"kubectl delete svc event-display -n {DEPLOY_NAMESPACE}", shell=True)

CompletedProcess(args='kubectl delete svc event-display -n admin', returncode=0)