# Basic Example for SKLearn Prepackaged Server trained with Pachyderm and deployed to MinIO


## Prerequisites

 * A Kubernetes cluster with `kubectl` configured
 * curl
 * pygmentize
 * Python 3.7 locally (3.8 does not work, use pyenv if necessary)
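
The prerequisites above can be checked before starting. The following is a minimal sketch using only the standard library. Including `pachctl`, `mc` and `jq` in the tool list is an assumption based on the commands used later in this notebook, and the helper names are illustrative.

```python
import shutil
import sys

# Tools from the prerequisites list, plus CLIs used later in this notebook
# (pachctl, mc and jq are included here as an assumption).
REQUIRED_TOOLS = ["kubectl", "curl", "pygmentize", "pachctl", "mc", "jq"]


def missing_tools(tools):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_supported(version_info=sys.version_info):
    """The prerequisites state that Python 3.7 works and 3.8 does not."""
    return tuple(version_info[:2]) == (3, 7)


if __name__ == "__main__":
    for tool in missing_tools(REQUIRED_TOOLS):
        print(f"missing: {tool}")
    if not python_is_supported():
        print(f"unsupported Python: {sys.version.split()[0]}")
```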

TODO: test with local minikube, ensure example works end to end with a totally fresh cluster (rather than working on pachub cluster and skipping some bits)

## Setup Seldon Core

Use the [Setup Cluster](seldon_core_setup.ipynb) notebook to set up Seldon Core with an ingress.


## Setup MinIO (TODO: remove this section)

Use the provided [notebook](../../../notebooks/minio_setup.ipynb) to install MinIO in your cluster and configure the `mc` CLI tool.
The instructions are [also available online](./minio_setup.html).

## Python dependencies

This tutorial requires the following Python packages, including pandas and scikit-learn, in the versions listed below:

In [2]:
!cat iris-trainer/requirements.txt

scikit-learn == 0.20.3
numpy >= 1.8.2
joblib >= 0.13.0
pandas >= 1.0.1
PyYAML >= 5.3


Install them with the following command:

In [4]:
!pip install -r iris-trainer/requirements.txt

Collecting scikit-learn==0.20.3
  Using cached scikit_learn-0.20.3-cp37-cp37m-manylinux1_x86_64.whl (5.4 MB)
Collecting numpy>=1.8.2
  Using cached numpy-1.19.1-cp37-cp37m-manylinux2010_x86_64.whl (14.5 MB)
Collecting joblib>=0.13.0
  Using cached joblib-0.16.0-py3-none-any.whl (300 kB)
Collecting pandas>=1.0.1
  Using cached pandas-1.1.0-cp37-cp37m-manylinux1_x86_64.whl (10.5 MB)
Collecting PyYAML>=5.3
  Using cached PyYAML-5.3.1.tar.gz (269 kB)
Collecting scipy>=0.13.3
  Downloading scipy-1.5.2-cp37-cp37m-manylinux1_x86_64.whl (25.9 MB)
Collecting python-dateutil>=2.7.3
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting pytz>=2017.2
  Downloading pytz-2020.1-py2.py3-none-any.whl (510 kB)
Collecting six>=1.5
  Downloading six-1.15.0-py2.py3-

## Get Pachyderm CLI (pachctl) client tool

Follow the steps for your platform in the official [documentation](https://docs.pachyderm.com/latest/getting_started/local_installation/#install-pachctl) to get the `pachctl` command line tool.

Verify correct client installation:

In [5]:
!pachctl version --client-only

1.12.0-3ad6aa7344f90eeebedb6235eeb561bdded45879


## Install Pachyderm in cluster

Use `pachctl` to deploy Pachyderm:

In [4]:
%%bash
kubectl create ns pachyderm
pachctl deploy local --no-expose-docker-socket --namespace pachyderm

namespace/pachyderm created
serviceaccount/pachyderm created
serviceaccount/pachyderm-worker created
clusterrole.rbac.authorization.k8s.io/pachyderm created
clusterrolebinding.rbac.authorization.k8s.io/pachyderm created
role.rbac.authorization.k8s.io/pachyderm-worker created
rolebinding.rbac.authorization.k8s.io/pachyderm-worker created
deployment.apps/etcd created
service/etcd created
service/pachd created
service/pachd-peer created
deployment.apps/pachd created
service/dash created
deployment.apps/dash created
secret/pachyderm-storage-secret created

Pachyderm is launching. Check its status with "kubectl get all"
Once launched, access the dashboard by running "pachctl port-forward"



In [5]:
!kubectl rollout status deployment pachd

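Error from server (NotFound): deployments.apps "pachd" not found

The command fails because `pachd` was deployed to the `pachyderm` namespace while `kubectl` looked in the current one. Add `-n pachyderm` to the `kubectl rollout status` command to check the right namespace.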


### Port-forward Pachyderm to localhost

In a separate terminal:

```bash
pachctl port-forward
```

## Train model using Pachyderm

### Add training data to Pachyderm "iris-data" repository

We will now use a helper Python script to pull the Iris training data from sklearn:

In [6]:
!pygmentize get-data.py

from sklearn import datasets
import pandas as pd
import numpy as np


def main():
    print("Getting Iris Dataset")
    iris = datasets.load_iris()
    X, y = iris.data, iris.target

    data = pd.DataFrame(
        data=np.c_[iris["data"], iris["target"]],
        columns=iris["feature_names"] + ["target"],
    )

    data.to_csv("data.csv", index=False)
    print("Iris dataset saved to '

In [7]:
!pwd

/home/luke/Projects/Pachyderm/seldon-core/examples/pachyderm


In [8]:
!python get-data.py

Getting Iris Dataset
Iris dataset saved to 'data.csv' file
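
Before uploading the file, it can be worth confirming that `data.csv` has the expected shape. The sketch below uses only the standard library; `summarize_csv` is an illustrative helper name, not part of the example repository. For the Iris dataset written by `get-data.py`, one would expect five columns (four features plus `target`) and 150 data rows.

```python
import csv


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # first line holds the column names
        row_count = sum(1 for _ in reader)  # remaining lines are data rows
    return header, row_count
```

Calling `summarize_csv("data.csv")` in the notebook's working directory returns the header and the number of rows, which can then be compared against the expectation above.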


Next, create the `iris-data` repository in Pachyderm, which will hold the produced `data.csv` file:

In [9]:
%%bash
pachctl create repo iris-data
pachctl list repo

NAME      CREATED      SIZE (MASTER) ACCESS LEVEL 
iris-data 1 second ago 0B            OWNER         


Then upload `data.csv` to the `master` branch of the `iris-data` repository and list the resulting commit:

In [10]:
%%bash
pachctl put file iris-data@master -f data.csv
pachctl list commit iris-data

REPO      BRANCH COMMIT                           FINISHED     SIZE     PROGRESS DESCRIPTION
iris-data master b6f21966b1ca434281bc23fb751ab5d9 1 second ago 3.005KiB -         




In [11]:
!pachctl list file iris-data@master

NAME      TYPE SIZE     
/data.csv file 3.005KiB 


### Create Pachyderm pipeline

The Pachyderm pipeline is defined by the following file:

In [12]:
%%writefile train.json

{
  "pipeline": {
    "name": "iris"
  },
  "description": "A pipeline that trains simple Iris classifier.",
  "transform": {
    "cmd": [ "python3", "/train_iris.py" ],
    "image": "seldonio/pachyderm-iris-trainer:0.1"
  },
  "input": {
    "pfs": {
      "repo": "iris-data",
      "glob": "/*"
    }
  }
}


Overwriting train.json
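
Since this notebook writes several pipeline specs with the same structure, the spec can also be generated from Python. This is a minimal sketch that reproduces the fields of `train.json` shown above; `build_pipeline_spec` and `write_pipeline_spec` are illustrative helper names, not Pachyderm APIs.

```python
import json


def build_pipeline_spec(name, input_repo, image, cmd, glob="/*", description=""):
    """Build a Pachyderm pipeline spec with a single PFS input."""
    return {
        "pipeline": {"name": name},
        "description": description,
        "transform": {"cmd": list(cmd), "image": image},
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


def write_pipeline_spec(spec, path):
    """Write the spec as indented JSON so it can be passed to `pachctl ... -f`."""
    with open(path, "w") as handle:
        json.dump(spec, handle, indent=2)
        handle.write("\n")


# Same values as the train.json cell above.
train_spec = build_pipeline_spec(
    name="iris",
    input_repo="iris-data",
    image="seldonio/pachyderm-iris-trainer:0.1",
    cmd=["python3", "/train_iris.py"],
    description="A pipeline that trains simple Iris classifier.",
)
```

`write_pipeline_spec(train_spec, "train.json")` would then produce a file equivalent to the one written by the `%%writefile` cell.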


In [13]:
!pachctl create pipeline -f train.json

### Verify pipeline success

Give Pachyderm a moment to process the pipeline first!

In [1]:
!pachctl list job

ID                               PIPELINE STARTED      DURATION  RESTART PROGRESS  DL       UL      STATE   
145ae50e43e24019b923b823e2813eeb iris     18 hours ago 4 seconds 0       1 + 0 / 1 3.005KiB 1.01KiB success 


In [2]:
!pachctl list commit iris

REPO BRANCH COMMIT                           FINISHED     SIZE    PROGRESS DESCRIPTION
iris master 74976e9cc5e540cbb4d61d370b350518 18 hours ago 1.01KiB -         


In [3]:
!pachctl list file iris@master

NAME          TYPE SIZE    
/model.joblib file 1.01KiB 
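
Instead of waiting by hand for the pipeline to finish, the check can be polled. The sketch below is a generic polling helper plus one possible readiness check that reuses the `pachctl list file iris@master` command from the cell above. The helper names, the timeout and the interval are assumptions chosen for illustration.

```python
import subprocess
import time


def wait_until(predicate, timeout=120.0, interval=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call `predicate` until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def model_is_ready():
    """True once /model.joblib is listed in the `iris` output repository."""
    result = subprocess.run(
        ["pachctl", "list", "file", "iris@master"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode == 0 and "/model.joblib" in result.stdout
```

With `pachctl` installed and the port-forward running, `wait_until(model_is_ready)` blocks until the model file appears or two minutes pass.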


In [35]:
%%writefile metadata.json

{
  "pipeline": {
    "name": "iris-meta"
  },
  "description": "Copy model over and create seldon metadata based on model commit in pachyderm",
  "transform": {
    "cmd": [ "python3", "-c", "import os; import pprint; pprint.pprint(os.environ)" ],
    "image": "python:3"
  },
  "input": {
    "pfs": {
      "repo": "iris",
      "glob": "/*"
    }
  }
}


Overwriting metadata.json


In [61]:
%%writefile metadata.json

{
  "pipeline": {
    "name": "iris-meta"
  },
  "description": "Copy model over and create seldon metadata based on model commit in pachyderm",
  "transform": {
    "cmd": [ "python3", "-c", "import os; print(os.environ)" ],
    "image": "python:3"
  },
  "input": {
    "pfs": {
      "repo": "iris",
      "glob": "/*"
    }
  }
}


Overwriting metadata.json


In [62]:
!pachctl update pipeline -f metadata.json

In [64]:
!pachctl list job

ID                               PIPELINE  STARTED       DURATION  RESTART PROGRESS  DL       UL      STATE   
1a2ffcd611b54fcbb7207267bc8add2c iris-meta 9 seconds ago 1 second  0       0 + 1 / 1 0B       0B      success 
145ae50e43e24019b923b823e2813eeb iris      3 days ago    4 seconds 0       1 + 0 / 1 3.005KiB 1.01KiB success 


In [65]:
!pachctl logs --job 1a2ffcd611b54fcbb7207267bc8add2c

In [48]:
!pachctl logs --help

Return logs from a job.

Usage:
  pachctl logs [--pipeline=<pipeline>|--job=<job>] [--datum=<datum>] [flags]

Examples:

# Return logs emitted by recent jobs in the "filter" pipeline
$ pachctl logs --pipeline=filter

# Return logs emitted by the job aedfa12aedf
$ pachctl logs --job=aedfa12aedf

# Return logs emitted by the pipeline \"filter\" while processing /apple.txt and a file with the hash 123aef
$ pachctl logs --pipeline=filter --inputs=/apple.txt,123aef

Flags:
      --datum string      Filter for log lines for this datum (accepts datum ID)
  -f, --follow            Follow logs as more are created.
  -h, --help              help for logs
      --inputs string     Filter for log lines generated while processing these files (accepts PFS paths or file hashes)
  -j, --job string        Filter for log lines from this job (accepts job ID)
      --master            Return log messages from the master process (pipeline must be set).
  -p, --pipeline string   Filter the log for lines fro

## Add trained model to remote S3 storage

### Create metadata.yaml 

In the metadata we can use Pachyderm's commit hash to version the deployed model:

In [17]:
commitId = !pachctl list commit iris --raw |jq -r .commit.id

In [18]:
commitId = commitId[0]
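
If `jq` is not available, the same extraction can be done in Python. This sketch assumes that `--raw` prints JSON objects one after another and that each object has the `commit.id` field used by the `jq` filter above; any other fields in the real output are ignored. The function names are illustrative.

```python
import json


def parse_json_stream(text):
    """Parse a string holding several JSON objects written back to back."""
    decoder = json.JSONDecoder()
    objects = []
    position = 0
    while position < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while position < len(text) and text[position].isspace():
            position += 1
        if position >= len(text):
            break
        obj, position = decoder.raw_decode(text, position)
        objects.append(obj)
    return objects


def commit_ids(raw_output):
    """Extract commit IDs, mirroring the jq filter `.commit.id`."""
    return [entry["commit"]["id"] for entry in parse_json_stream(raw_output)]
```

In the notebook, `commit_ids("\n".join(raw))[0]` with `raw = !pachctl list commit iris --raw` would play the role of `commitId` above.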

In [19]:
# Write the Seldon model metadata; the Pachyderm commit ID becomes the model version.
with open("metadata.yaml", "w") as f:
    f.write(f"""name: iris
versions: [iris/pachyderm:{commitId}]
platform: sklearn
inputs:
- datatype: BYTES
  name: input
  shape: [ 1, 4 ]
outputs:
- datatype: BYTES
  name: output
  shape: [ 3 ]""")

### TODO
Extend the above pachyderm pipeline to output to a repo which contains the model from the previous pipeline stage along with the metadata.yml with 

Then we can automatically generate the Seldon metadata for a versioned model whenever the data changes, which is cool.

Also: how to access the Pach S3 gateway? Look at the docs...
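
One possible shape for the transform of such a second pipeline stage is sketched below. It relies on Pachyderm mounting the input repository at `/pfs/iris` and collecting output from `/pfs/out`. The environment variable `iris_COMMIT` is an assumption about how Pachyderm exposes the input commit ID and should be verified against the `os.environ` dump produced by the `iris-meta` pipeline. `publish_model` is an illustrative name.

```python
import os
import shutil

# Same metadata fields as the metadata.yaml cell above.
METADATA_TEMPLATE = """name: iris
versions: [iris/pachyderm:{commit_id}]
platform: sklearn
inputs:
- datatype: BYTES
  name: input
  shape: [ 1, 4 ]
outputs:
- datatype: BYTES
  name: output
  shape: [ 3 ]
"""


def publish_model(input_dir, output_dir, commit_id):
    """Copy model.joblib to output_dir and write metadata.yaml next to it."""
    os.makedirs(output_dir, exist_ok=True)
    shutil.copy(os.path.join(input_dir, "model.joblib"), output_dir)
    with open(os.path.join(output_dir, "metadata.yaml"), "w") as handle:
        handle.write(METADATA_TEMPLATE.format(commit_id=commit_id))


# Only runs inside a Pachyderm worker, where the input repo is mounted.
if __name__ == "__main__" and os.path.isdir("/pfs/iris"):
    # Assumption: the input commit ID is exposed as <input name>_COMMIT.
    publish_model("/pfs/iris", "/pfs/out", os.environ.get("iris_COMMIT", "unknown"))
```

Writing the metadata inside the pipeline keeps the model and its version string in the same output commit, so both change together whenever the input data changes.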

### Add metadata to Pachyderm

This is so that Seldon can fetch it via Pachyderm's S3 gateway, which allows Seldon to access files in Pachyderm using the S3 protocol.

In [25]:
!pachctl put file iris@master -f metadata.yaml

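cannot start a commit on an output branch

The upload is rejected because `iris` is the output repository of the `iris` pipeline, and Pachyderm does not accept manual commits on a pipeline's output branch.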


### Create bucket for our trained model and push it

In [20]:
%%bash
mc mb minio-seldon/pachyderm-iris -p

mc cp model.joblib minio-seldon/pachyderm-iris/
mc cp metadata.yaml minio-seldon/pachyderm-iris/

Bucket created successfully `minio-seldon/pachyderm-iris`.
`model.joblib` -> `minio-seldon/pachyderm-iris/model.joblib`
Total: 0 B, Transferred: 1.01 KiB, Speed: 146.70 KiB/s
`metadata.yaml` -> `minio-seldon/pachyderm-iris/metadata.yaml`
Total: 0 B, Transferred: 205 B, Speed: 24.81 KiB/s
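
The `mc cp model.joblib ...` command above expects `model.joblib` in the working directory, but no earlier cell downloads it from Pachyderm. The sketch below shows one way to fetch it first. It assumes that the installed `pachctl` client supports `get file <repo>@<branch>:<path> -o <file>`, which should be confirmed with `pachctl get file --help`. The helper names are illustrative.

```python
import subprocess


def get_file_command(repo, branch, path, destination):
    """Build the pachctl command that downloads one file from a repo."""
    return ["pachctl", "get", "file", f"{repo}@{branch}:{path}", "-o", destination]


def fetch_model(destination="model.joblib"):
    """Download /model.joblib from the `iris` output repo to the working directory."""
    command = get_file_command("iris", "master", "/model.joblib", destination)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
```

Calling `fetch_model()` before the `mc cp` cell would place the trained model next to `metadata.yaml`.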


In [21]:
!mc ls minio-seldon/pachyderm-iris

[2020-05-24 18:53:00 BST]    205B metadata.yaml
[2020-05-24 18:53:00 BST]  1.0KiB model.joblib

## Deploy sklearn server

In [22]:
%%writefile secret.yaml

apiVersion: v1
kind: Secret
metadata:
  name: seldon-init-container-secret
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: minioadmin
  AWS_SECRET_ACCESS_KEY: minioadmin
  AWS_ENDPOINT_URL: http://minio.minio-system.svc.cluster.local:9000
  USE_SSL: "false"

Overwriting secret.yaml


In [23]:
!kubectl apply -f secret.yaml

secret/seldon-init-container-secret configured


In [24]:
%%writefile deploy.yaml

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: pachyderm-sklearn
spec:
  annotations:
    seldon.io/executor: "true"
  name: iris
  predictors:
  - componentSpecs:
    graph:
      children: []
      implementation: SKLEARN_SERVER
      modelUri: s3://pachyderm-iris
      envSecretRefName: seldon-init-container-secret
      name: classifier
    name: default
    replicas: 1

Overwriting deploy.yaml


In [25]:
!kubectl apply -f deploy.yaml

seldondeployment.machinelearning.seldon.io/pachyderm-sklearn created


In [26]:
!kubectl rollout status deploy/$(kubectl get deploy -l seldon-deployment-id=pachyderm-sklearn -o jsonpath='{.items[0].metadata.name}')

Waiting for deployment "pachyderm-sklearn-default-0-classifier" rollout to finish: 0 of 1 updated replicas are available...
deployment "pachyderm-sklearn-default-0-classifier" successfully rolled out


## Test deployment

### Test prediction

In [29]:
%%bash
curl -s -X POST -H 'Content-Type: application/json' \
    -d '{"data":{"ndarray":[[5.964, 4.006, 2.081, 1.031]]}}' \
    http://localhost:8003/seldon/seldon/pachyderm-sklearn/api/v1.0/predictions  | jq .

{
  "data": {
    "names": [
      "t:0",
      "t:1",
      "t:2"
    ],
    "ndarray": [
      [
        0.9548873249364185,
        0.04505474761561256,
        5.792744796895459e-05
      ]
    ]
  },
  "meta": {}
}
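
The same request can be sent from Python, which makes it easier to post-process the response. This is a minimal sketch using only the standard library. The URL is the one from the `curl` command above, and `predict` and `top_class` are illustrative helper names.

```python
import json
import urllib.request

PREDICT_URL = "http://localhost:8003/seldon/seldon/pachyderm-sklearn/api/v1.0/predictions"


def predict(features, url=PREDICT_URL):
    """POST one feature row to the Seldon endpoint and return the parsed JSON."""
    body = json.dumps({"data": {"ndarray": [features]}}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def top_class(response):
    """Return (class_name, probability) for the most likely class of the first row."""
    names = response["data"]["names"]
    probabilities = response["data"]["ndarray"][0]
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return names[best], probabilities[best]
```

For the response shown above, `top_class` selects `t:0`, the class with probability of about 0.955.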


### Test model metadata (optional)

In [30]:
%%bash
curl -s http://localhost:8003/seldon/seldon/pachyderm-sklearn/api/v1.0/metadata/classifier | jq .

{
  "inputs": [
    {
      "datatype": "BYTES",
      "name": "input",
      "shape": [
        1,
        4
      ]
    }
  ],
  "name": "iris",
  "outputs": [
    {
      "datatype": "BYTES",
      "name": "output",
      "shape": [
        3
      ]
    }
  ],
  "platform": "sklearn",
  "versions": [
    "iris/pachyderm:f8849a38b3f64c4b8998abf1f732f486"
  ]
}
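
Because the version string embeds the Pachyderm commit ID, the metadata response can be used to confirm which commit is currently deployed. The sketch below extracts the ID from the `versions` field. The `iris/pachyderm:` prefix matches the `metadata.yaml` written earlier, and `deployed_commits` is an illustrative name.

```python
def deployed_commits(metadata, prefix="iris/pachyderm:"):
    """Return the Pachyderm commit IDs encoded in the metadata `versions` list."""
    return [
        version[len(prefix):]
        for version in metadata.get("versions", [])
        if version.startswith(prefix)
    ]
```

Comparing the result with the ID reported by `pachctl list commit iris` shows whether the served model comes from the latest training run.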


## Cleanup

In [56]:
!kubectl delete -f deploy.yaml

seldondeployment.machinelearning.seldon.io "pachyderm-sklearn" deleted
