Add amdserver #179

Merged
merged 7 commits into from
Nov 12, 2022
1 change: 1 addition & 0 deletions docs/community/adopters.md
@@ -4,6 +4,7 @@ This page contains a list of organizations who are using KServe either in produc

| Organization | Contact |
| ------------ | ------- |
| [Advanced Micro Devices](https://www.amd.com) | [Varun Sharma](https://github.com/varunsh-xilinx) |
| [Amazon Web Services](https://aws.amazon.com/) | [Ellis Tarn](https://github.com/ellistarn) |
| [Bloomberg](https://www.bloomberg.com/) | [Dan Sun](https://github.com/yuzisun) |
| [Cars24](https://www.cars24.com/) | [Swapnesh Khare](https://github.com/swapkh91) |
4 changes: 3 additions & 1 deletion docs/modelserving/servingruntimes.md
@@ -47,6 +47,9 @@ Several out-of-the-box _ClusterServingRuntimes_ are provided with KServe so that
| kserve-tritonserver | TensorFlow, ONNX, PyTorch, TensorRT |
| kserve-xgbserver | XGBoost |

In addition to these included runtimes, you can extend your KServe installation by adding custom runtimes.
This is demonstrated in the example for the [AMD Inference Server](./v1beta1/amd/).

## Spec Attributes

Available attributes in the `ServingRuntime` spec:
@@ -201,4 +204,3 @@ The previous schema would mutate into the new schema where the `kserve-sklearnse
In previous versions of KServe, supported predictor formats and container images were defined in a
[ConfigMap](https://github.com/kserve/kserve/blob/release-0.7/config/configmap/inferenceservice.yaml#L7) in the control plane namespace.
Existing _InferenceServices_ upgraded from v0.7 will continue to make use of the configuration listed in this config map, but this will eventually be phased out.

138 changes: 138 additions & 0 deletions docs/modelserving/v1beta1/amd/README.md
@@ -0,0 +1,138 @@
# AMD Inference Server

The [AMD Inference Server](https://xilinx.github.io/inference-server/main/index.html) is an easy-to-use inferencing solution specially designed for AMD CPUs, GPUs, and FPGAs.
It can be deployed as a standalone executable, run on a Kubernetes cluster with KServe, or used to create custom applications by linking to its C++ API.
This example demonstrates how to deploy a TensorFlow GraphDef model on KServe with the AMD Inference Server to run inference on [AMD EPYC CPUs](https://www.amd.com/en/processors/epyc-server-cpu-family).

## Prerequisites

This example was tested on an Ubuntu 18.04 host machine using the Bash shell.

These instructions assume:

- You have a machine with a modern version of Docker (>=18.09) and sufficient disk space to build the image

- You have a Kubernetes cluster set up

- KServe has been installed on the Kubernetes cluster

- You have some familiarity with Kubernetes and KServe

If any of these tools are missing, refer to their installation instructions.

## Set up the image

This example uses the [AMD ZenDNN](https://developer.amd.com/zendnn/) backend to run inference on TensorFlow models on AMD EPYC CPUs.

### Build the image

To build a Docker image for the AMD Inference Server that uses this backend, download the `TF_v2.9_ZenDNN_v3.3_C++_API.zip` package from ZenDNN.
You must agree to the EULA to download this package.
You need a modern version of Docker (at least 18.09) to build this image.

```bash
# clone the inference server repository
git clone https://github.com/Xilinx/inference-server.git

# place the downloaded ZenDNN zip in the repository
mv TF_v2.9_ZenDNN_v3.3_C++_API.zip ./inference-server/

# build the image
cd inference-server
./proteus dockerize --production --tfzendnn=./TF_v2.9_ZenDNN_v3.3_C++_API.zip
```

This builds an image on your host: `<username>/proteus:latest`.
To use it with KServe, you need to upload this image to a Docker registry, such as a [local registry server](https://docs.docker.com/registry/deploying/).
You will also need to update the YAML files in this example to use this image.
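
The upload step can be sketched as a small helper. This is a minimal sketch, not part of the AMD Inference Server tooling: the function name `push_proteus_image` and the default registry address `localhost:5000` are assumptions for illustration, and the local image name follows the `<username>/proteus:latest` pattern described above.

```bash
# Hypothetical helper: tag the locally built image for a registry and push it.
# The registry address is an assumption; replace it with your own registry.
push_proteus_image() {
  local registry="${1:-localhost:5000}"
  local local_image="$(whoami)/proteus:latest"
  local remote_image="${registry}/${local_image}"

  # Tag the local image with the registry prefix, push it, then print the
  # full image name so it can be copied into the YAML files.
  docker tag "${local_image}" "${remote_image}" &&
    docker push "${remote_image}" &&
    echo "${remote_image}"
}
```

Calling `push_proteus_image my-registry.example.com:5000` would print the full image name on success. That name, including the registry prefix, is what the YAML files in this example should then reference.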

More documentation for building a ZenDNN image for KServe is available in the [ZenDNN + AMD Inference Server](https://xilinx.github.io/inference-server/main/zendnn.html) and [KServe + AMD Inference Server](https://xilinx.github.io/inference-server/main/kserve.html) guides.

## Set up the model

In this example, you will use an [MNIST TensorFlow model](https://github.com/Xilinx/inference-server/blob/main/tests/assets/mnist.zip).
The AMD Inference Server also supports PyTorch, ONNX, and [Vitis AI models](https://github.com/Xilinx/Vitis-AI/tree/master/model_zoo) with the appropriate Docker images.
To prepare new models, look at the [KServe + AMD Inference Server documentation](https://xilinx.github.io/inference-server/main/kserve.html) for more information about the expected model format.

## Make an inference

The AMD Inference Server can be used in single model serving mode in KServe.
The code snippets below use the environment variables `INGRESS_HOST` and `INGRESS_PORT` to make requests to the cluster.
[Find the ingress host and port](https://kserve.github.io/website/master/get_started/first_isvc/#4-determine-the-ingress-ip-and-ports) for making requests to your cluster and set these values appropriately.
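
As a rough sketch of that step, the helper below sets both variables for one common setup. It assumes the cluster uses the Istio ingress gateway in the `istio-system` namespace and that the gateway service has an external load balancer IP; other setups (node ports, port forwarding, a different ingress) need different commands. The function name `set_ingress_from_istio` is hypothetical.

```bash
# Hypothetical helper: read the ingress host and port from the Istio
# ingress gateway service. Assumes an external LoadBalancer IP is assigned.
set_ingress_from_istio() {
  INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway \
    -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
  export INGRESS_HOST INGRESS_PORT
}
```

After running `set_ingress_from_istio`, `echo "${INGRESS_HOST}:${INGRESS_PORT}"` should print a non-empty address. If either value is empty, the cluster likely does not match the assumptions above.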

### Add the ClusterServingRuntime

To use the AMD Inference Server with KServe, add it as a [serving runtime](https://kserve.github.io/website/master/modelserving/servingruntimes/).
A `ClusterServingRuntime` configuration file is included in this example.
To apply it:

```bash
# update the kserve-amdserver.yaml to use the right image
# if you have a different image name, you'll need to edit it manually
sed -i "s/<image>/$(whoami)\/proteus:latest/" kserve-amdserver.yaml

kubectl apply -f kserve-amdserver.yaml
```

### Single model serving

Once the AMD Inference Server has been added as a serving runtime, you can start a service that uses it.

```bash
# download the inference service file and input data
curl -O https://raw.githubusercontent.com/kserve/website/master/docs/modelserving/v1beta1/amd/single_model.yaml
curl -O https://raw.githubusercontent.com/kserve/website/master/docs/modelserving/v1beta1/amd/input.json

# create the inference service
kubectl apply -f single_model.yaml

# wait for service to be ready
kubectl wait --for=condition=ready isvc -l app=example-amdserver-runtime-isvc

export SERVICE_HOSTNAME=$(kubectl get inferenceservice example-amdserver-runtime-isvc -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

### Make a request with REST

Once the service is ready, you can make requests to it.
Assuming that `INGRESS_HOST`, `INGRESS_PORT`, and `SERVICE_HOSTNAME` have been defined as above, the following command runs an inference over REST to the example MNIST model.

```bash
export MODEL_NAME=mnist
export INPUT_DATA=@./input.json
curl -v -H "Host: ${SERVICE_HOSTNAME}" "http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer" -d "${INPUT_DATA}"
```

This shows the response from the server in KServe's v2 API format.
For this example, it will be similar to:

```{ .bash .no-copy }
{
  "id": "",
  "model_name": "TFModel",
  "outputs": [
    {
      "data": [
        0.11987821012735367,
        0.18648317456245422,
        -0.83796119689941406,
        -0.088459312915802002,
        0.030454874038696289,
        0.074872657656669617,
        -1.1334009170532227,
        -0.046301722526550293,
        -0.31683838367462158,
        0.32014602422714233
      ],
      "datatype": "FP32",
      "name": "input-0",
      "parameters": {},
      "shape": [10]
    }
  ]
}
```

For MNIST, the output data holds one score per digit class (0 through 9), and the index of the highest score is the predicted digit.
The input image in this example is the number 9.
In this response, the highest value is at the last index (index 9), indicating that the image was correctly classified as nine.
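
To read the prediction off the response without scanning the values by eye, take the zero-based index of the largest score. The following is a minimal sketch that assumes the ten scores have been copied from the example response above; the variable name `predicted` is illustrative.

```bash
# Scores copied from the example response, one per digit class (0 through 9).
# awk tracks the largest value seen so far and its zero-based position.
predicted=$(printf '%s\n' \
  0.11987821012735367 0.18648317456245422 -0.83796119689941406 \
  -0.088459312915802002 0.030454874038696289 0.074872657656669617 \
  -1.1334009170532227 -0.046301722526550293 -0.31683838367462158 \
  0.32014602422714233 |
  awk '{ v = $1 + 0 } NR == 1 || v > max { max = v; idx = NR - 1 } END { print idx }')

echo "predicted digit: ${predicted}"
```

For these scores, the script prints `predicted digit: 9`. If `jq` is installed, the scores could instead be taken directly from a saved response with a filter such as `jq -r '.outputs[0].data[]'` and piped into the same `awk` command.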