
Update torchserve docs (#1271)
* Update torchserve doc

* Fix autoscaling/canary example

* Reorganize torchserve examples

* Add bert example
yuzisun committed Dec 29, 2020
1 parent 7df81e5 commit 1ca8977
Showing 21 changed files with 421 additions and 259 deletions.
103 changes: 62 additions & 41 deletions docs/samples/v1beta1/torchserve/README.md
# Predict on an InferenceService using TorchServe

In this example, we use a trained PyTorch MNIST model to predict handwritten digits by running an InferenceService with the [TorchServe](https://github.com/pytorch/serve) predictor.

## Setup

1. Your ~/.kube/config should point to a cluster with [KFServing installed](https://github.com/kubeflow/kfserving/#install-kfserving).
2. Your cluster's Istio Ingress gateway must be [network accessible](https://istio.io/latest/docs/tasks/traffic-management/ingress/ingress-control/).

## Creating model storage with the model archive file

TorchServe provides a utility to package all the model artifacts into a single [TorchServe Model Archive (MAR) file](https://github.com/pytorch/serve/blob/master/model-archiver/README.md).

You can store your model and dependent files on remote storage or a local persistent volume; the MNIST model and dependent files can be obtained
from [here](https://github.com/pytorch/serve/tree/master/examples/image_classifier/mnist).

The KFServing/TorchServe integration expects the following model store layout.

```bash
├── config
│   ├── config.properties
├── model-store
│   ├── densenet_161.mar
│   ├── mnist.mar
```

- For remote storage you can choose to start the example using the prebuilt MNIST MAR file stored on the KFServing example GCS bucket
  `gs://kfserving-examples/models/torchserve/image_classifier`.
  You can also generate the MAR file with `torch-model-archiver` and create the model store on remote storage according to the above layout, for example as sketched after the command below.

```bash
torch-model-archiver --model-name mnist --version 1.0 \
--model-file model-archiver/model-store/mnist/mnist.py \
--serialized-file model-archiver/model-store/mnist/mnist_cnn.pt \
--handler model-archiver/model-store/mnist/mnist_handler.py
```
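A minimal sketch of building the model store and copying it to your own GCS bucket; the bucket name and the use of `gsutil` here are illustrative and not part of the example repository.

```bash
# Arrange the generated MAR file and config.properties into the expected layout.
mkdir -p model-store-root/config model-store-root/model-store
cp config.properties model-store-root/config/
cp mnist.mar model-store-root/model-store/

# Copy the layout to remote storage (assumes gsutil is configured for your project),
# then point storageUri at the destination.
gsutil cp -r model-store-root/* gs://<your-bucket>/models/torchserve/image_classifier/
```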

- For PVC users, please refer to [model archive file generation](./model-archiver/README.md) for auto-generation of MAR files from
  the model and dependent files.

## TorchServe with KFS envelope inference endpoints

The KFServing/TorchServe integration supports the KFServing v1 protocol, and support for the v2 protocol is in progress.

| API | Verb | Path | Payload |
| ------------- | ------------- | ------------- | ------------- |

[Sample requests for text and image classification](https://github.com/pytorch/serve/tree/master/kubernetes/kfserving/kf_request_json)

## Create the InferenceService

For deploying the `InferenceService` on CPU
```bash
kubectl apply -f torchserve.yaml
```

For deploying the `InferenceService` on GPU
```bash
kubectl apply -f gpu.yaml
```

Expected Output

```bash
$inferenceservice.serving.kubeflow.org/torchserve created
```

## Inference

The first step is to [determine the ingress IP and ports](../../../README.md#determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`
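If your cluster's ingress has no externally reachable address, a common alternative is to port-forward the Istio ingress gateway; this is a sketch assuming the default `istio-system` namespace.

```bash
# Forward the Istio ingress gateway to localhost (run in a separate terminal).
kubectl port-forward --namespace istio-system svc/istio-ingressgateway 8080:80

# Point the client at the forwarded port.
export INGRESS_HOST=localhost
export INGRESS_PORT=8080
```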

```bash
MODEL_NAME=mnist
SERVICE_HOSTNAME=$(kubectl get inferenceservice torchserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

### Prediction Request

Use the [image converter](../imgconv/README.md) to create the input request for mnist.
For other models please refer to [input request](https://github.com/pytorch/serve/tree/master/kubernetes/kfserving/kf_request_json).
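If you prefer to build the request by hand, a minimal sketch is shown below; the file name and payload field names are illustrative, so follow the image converter and the sample requests linked above for the exact schema expected by the handler.

```bash
# Build a v1 prediction request from a 28x28 digit image (file name is illustrative).
IMAGE_B64=$(base64 -w0 digit.png)
cat <<EOF > mnist.json
{"instances": [{"data": "${IMAGE_B64}"}]}
EOF
```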

```bash
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/mnist:predict -d @./mnist.json
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d @./mnist.json
```

Expected Output
Expand Down Expand Up @@ -87,11 +112,13 @@ Expected Output
{"predictions": ["2"]}
```
## Explanation

Model interpretability is an important aspect which helps to understand which of the input features were important for a particular classification.
[Captum](https://captum.ai) is a model interpretability library. The `KFServing Explain Endpoint` uses Captum's state-of-the-art algorithms, including integrated
gradients, to provide users with an easy way to understand which features are contributing to the model output.

You can refer to the [Captum Tutorials](https://captum.ai/tutorials/) for more examples.
### Explain Request
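The explain request mirrors the prediction call but targets the `:explain` verb; this sketch assumes the same `mnist.json` payload used above.

```bash
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:explain -d @./mnist.json
```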
Expected Output

```bash
317543458, 0.0060051362999805355, -0.0008195376963202741, 0.0041728603512658224, -0.0017597169567888774, -0.0010577007775543158, 0.00046033327178068433, -0.0007674196306044449, -0.0], [-0.0, -0.0, 0.0013386963856532302, 0.00035183178922260837, 0.0030610334903526204, 8.951834979315781e-05, 0.0023676793550483524, -0.0002900551076915047, -0.00207019445286608, -7.61697478482574e-05, 0.0012150086715244216, 0.009831239281792168, 0.003479667642621962, 0.0070584324334114525, 0.004161851261339585, 0.0026146296354490665, -9.194746959222099e-05, 0.0013583866966571571, 0.0016821551239318913, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0]]]]}
```
## Autoscaling

One of the main serverless inference features is to automatically scale the replicas of an `InferenceService` to match the incoming workload.
KFServing by default enables the [Knative Pod Autoscaler](https://knative.dev/docs/serving/autoscaling/), which watches traffic flow and scales up and down
based on the configured metrics.

[Autoscaling Example](autoscaling/README.md)
## Canary Rollout

Canary rollout is a deployment strategy where you release a new version of the model to a small percentage of the production traffic.

[Canary Deployment](canary/README.md)
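A minimal sketch of a canary rollout is shown below; it assumes the v1beta1 `canaryTrafficPercent` field, and the storage path for the new model version is illustrative. See the linked canary example for the maintained configuration.

```bash
kubectl apply -f - <<EOF
apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchserve"
spec:
  predictor:
    # Route 20% of the traffic to the latest revision; the rest stays on the previous one.
    canaryTrafficPercent: 20
    pytorch:
      # Illustrative path for a new model version; not part of the example bucket.
      storageUri: "gs://<your-bucket>/models/torchserve/image_classifier-v2"
EOF
```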
## Monitoring

[Expose metrics and setup grafana dashboards](metrics/README.md)
99 changes: 99 additions & 0 deletions docs/samples/v1beta1/torchserve/autoscaling/README.md
# Autoscaling
KFServing supports the implementation of Knative Pod Autoscaler (KPA) and Kubernetes’ Horizontal Pod Autoscaler (HPA).
The features and limitations of each of these Autoscalers are listed below.

IMPORTANT: If you want to use Kubernetes Horizontal Pod Autoscaler (HPA), you must install [HPA extension](https://knative.dev/docs/install/any-kubernetes-cluster/#optional-serving-extensions)
after you install Knative Serving.

Knative Pod Autoscaler (KPA)
- Part of the Knative Serving core and enabled by default once Knative Serving is installed.
- Supports scale to zero functionality.
- Does not support CPU-based autoscaling.

Horizontal Pod Autoscaler (HPA)
- Not part of the Knative Serving core, and must be enabled after Knative Serving installation.
- Does not support scale to zero functionality.
- Supports CPU-based autoscaling.
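
For reference, a sketch of switching the autoscaler class to HPA with CPU-based scaling, using the standard Knative autoscaling annotations; it assumes the HPA extension mentioned above is installed, and the target value is illustrative.

```bash
kubectl apply -f - <<EOF
apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchserve"
  annotations:
    # Use the Horizontal Pod Autoscaler instead of the default KPA.
    autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
    autoscaling.knative.dev/metric: "cpu"
    # Target 80% CPU utilization per pod (illustrative).
    autoscaling.knative.dev/target: "80"
spec:
  predictor:
    pytorch:
      storageUri: "gs://kfserving-examples/models/torchserve/image_classifier"
EOF
```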

## Create InferenceService with concurrency target


### Soft limit
You can configure the InferenceService with the annotation `autoscaling.knative.dev/target` for a soft limit. The soft limit is a targeted limit rather than
a strictly enforced bound; particularly if there is a sudden burst of requests, this value can be exceeded.

```yaml
apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
name: "torchserve"
annotations:
autoscaling.knative.dev/target: "10"
spec:
predictor:
pytorch:
protocolVersion: v2
storageUri: "gs://kfserving-examples/models/torchserve/image_classifier"
```

### Hard limit

You can also configure the InferenceService with the field `containerConcurrency` for a hard limit. The hard limit is an enforced upper bound.
If concurrency reaches the hard limit, surplus requests will be buffered and must wait until enough capacity is free to execute the requests.

```yaml
apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
name: "torchserve"
spec:
predictor:
containerConcurrency: 10
pytorch:
protocolVersion: v2
storageUri: "gs://kfserving-examples/models/torchserve/image_classifier"
```

### Create the InferenceService

```bash
kubectl apply -f torchserve.yaml
```

Expected Output

```bash
$inferenceservice.serving.kubeflow.org/torchserve created
```

## Run inference with concurrent requests

The first step is to [determine the ingress IP and ports](../../../README.md#determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`

Install the `hey` load generator
```bash
go get -u github.com/rakyll/hey
```

Send concurrent inference requests
```bash
MODEL_NAME=mnist
SERVICE_HOSTNAME=$(kubectl get inferenceservice torchserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)

./hey -m POST -z 30s -D ./mnist.json -host ${SERVICE_HOSTNAME} http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict
```

### Check the pods that are scaled up
`hey` by default generates 50 requests concurrently, so you can see that the InferenceService scales up to 5 pods, as the container concurrency target is 10.

```bash
kubectl get pods -n kfserving-test

NAME READY STATUS RESTARTS AGE
torchserve-predictor-default-cj2d8-deployment-69444c9c74-67qwb 2/2 Terminating 0 103s
torchserve-predictor-default-cj2d8-deployment-69444c9c74-nnxk8 2/2 Terminating 0 95s
torchserve-predictor-default-cj2d8-deployment-69444c9c74-rq8jq 2/2 Running 0 50m
torchserve-predictor-default-cj2d8-deployment-69444c9c74-tsrwr 2/2 Running 0 113s
torchserve-predictor-default-cj2d8-deployment-69444c9c74-vvpjl 2/2 Running 0 109s
torchserve-predictor-default-cj2d8-deployment-69444c9c74-xvn7t 2/2 Terminating 0 103s
```
# For example, if you specify a “concurrency target” of “10”, the autoscaler will try to make sure that every replica receives on average 10 requests at a time.
# A target is always evaluated against a specified metric.
apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchserve"
  annotations:
    autoscaling.knative.dev/target: "10"
spec:
  predictor:
    pytorch:
105 changes: 105 additions & 0 deletions docs/samples/v1beta1/torchserve/bert/README.md
# TorchServe example with the Hugging Face BERT model

In this example we show how to serve [Huggingface Transformers with TorchServe](https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers)
on KFServing.

## Model archive file creation

Clone the [pytorch/serve](https://github.com/pytorch/serve) repository,
navigate to `examples/Huggingface_Transformers` and follow the steps for creating the MAR file, including the serialized model and other dependent files.
TorchServe supports both eager mode and TorchScript models; in this example the pretrained model is saved in eager mode.

```bash
torch-model-archiver --model-name BERTSeqClassification --version 1.0 \
--serialized-file Transformer_model/pytorch_model.bin \
--handler ./Transformer_handler_generalized.py \
--extra-files "Transformer_model/config.json,./setup_config.json,./Seq_classification_artifacts/index_to_name.json"
```
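Once the MAR file is created, one possible way to lay out and upload the model store is sketched below; the bucket path and `gsutil` usage are illustrative, while the directory layout matches the `config.properties` shown later in this example.

```bash
# Arrange the MAR file and config.properties into the layout TorchServe expects.
mkdir -p bert-store/config bert-store/model-store
cp config.properties bert-store/config/
cp BERTSeqClassification.mar bert-store/model-store/

# Upload to your own bucket (assumes gsutil is configured),
# then point storageUri in bert.yaml at the destination.
gsutil cp -r bert-store/* gs://<your-bucket>/models/torchserve/huggingface/
```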

## Create the InferenceService

Apply the CRD

```bash
kubectl apply -f bert.yaml
```

Expected Output

```bash
$inferenceservice.serving.kubeflow.org/torchserve-bert created
```

## Run a prediction

The first step is to [determine the ingress IP and ports](../../../../README.md#determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`

```bash
MODEL_NAME=torchserve-bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -n <namespace> -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/BERTSeqClassification:predict -d ./sample_text.txt
```

Expected Output

```bash
* Trying 44.239.20.204...
* Connected to a881f5a8c676a41edbccdb0a394a80d6-2069247558.us-west-2.elb.amazonaws.com (44.239.20.204) port 80 (#0)
> PUT /v1/models/BERTSeqClassification:predict HTTP/1.1
> Host: torchserve-bert.kfserving-test.example.com
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 79
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< cache-control: no-cache; no-store, must-revalidate, private
< content-length: 8
< date: Wed, 04 Nov 2020 10:54:49 GMT
< expires: Thu, 01 Jan 1970 00:00:00 UTC
< pragma: no-cache
< x-request-id: 4b54d3ac-185f-444c-b344-b8a785fdeb50
< x-envoy-upstream-service-time: 2085
< server: istio-envoy
<
* Connection #0 to host torchserve-bert.kfserving-test.example.com left intact
Accepted
```
## Captum Explanations
In order to understand the word importances and attributions when we make an explanation request, we use Captum Insights for the Hugging Face Transformers pre-trained model.
```bash
MODEL_NAME=torchserve-bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -n <namespace> -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/BERTSeqClassification:explaine -d ./sample_text.txt
```
Expected output
```bash
* Trying ::1:8080...
* Connected to localhost (::1) port 8080 (#0)
> POST /v1/models/BERTSeqClassification:explain HTTP/1.1
> Host: torchserve-bert.default.example.com
> User-Agent: curl/7.73.0
> Accept: */*
> Content-Length: 84
> Content-Type: application/x-www-form-urlencoded
>
Handling connection for 8080

* upload completely sent off: 84 out of 84 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 292
< content-type: application/json; charset=UTF-8
< date: Sun, 27 Dec 2020 05:53:52 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 5769
<
* Connection #0 to host localhost left intact
{"explanations": [{"importances": [0.0, -0.6324463574494716, -0.033115653530477414, 0.2681695752722339, -0.29124745608778546, 0.5422589681903883, -0.3848768219546909, 0.0],
"words": ["[CLS]", "bloomberg", "has", "reported", "on", "the", "economy", "[SEP]"], "delta": -0.0007350619859377225}]}
```
10 changes: 10 additions & 0 deletions docs/samples/v1beta1/torchserve/bert/bert.yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: "torchserve-bert"
spec:
  predictor:
    pytorch:
      protocolVersion: v2
      storageUri: gs://kfserving-examples/models/torchserve/huggingface
      # storageUri: pvc://model-pv-claim
6 changes: 6 additions & 0 deletions docs/samples/v1beta1/torchserve/bert/config.properties
inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8081
number_of_netty_threads=4
job_queue_size=10
model_store=/mnt/models/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"bert":{"1.0":{"defaultVersion":true,"marName":"BERTSeqClassification.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":5000,"responseTimeout":120}}}}
