This repository provides a demo for deploying a custom predictor to serve ONNX models using KServe, designed to run seamlessly on macOS and a local Kind cluster. It addresses challenges like NVIDIA Triton’s incompatibility with Macs by using a lightweight, CPU-based custom predictor, making it a practical starting point for ONNX model serving in Kubernetes.
For a production environment, KServe's built-in ONNX model serving can be used instead: https://kserve.github.io/website/master/modelserving/v1beta1/onnx/
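With that built-in path, no custom image is needed; the InferenceService just declares the model format and a storage location. The snippet below is a minimal sketch following the linked docs, with the storageUri as a placeholder for your own model storage:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: onnx-model
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      # Placeholder: point this at your model's actual location.
      storageUri: gs://<your-bucket>/onnx-model/
```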
Serving ONNX models with KServe can be complex on environments not supported by NVIDIA Triton, especially on macOS with ARM64 architecture. This demo:
- Offers a custom predictor built with Python and ONNX Runtime, ensuring compatibility with macOS and CPU environments.
- Demonstrates deployment on Kind (Kubernetes in a container engine) with KServe, simplifying local testing and scaling.
- Provides a reproducible setup for developers to experiment with ONNX model serving on their Macs.
- Model: An example ONNX model (in this case, for fraud detection) with 5 input features, served via a custom predictor.
- Predictor: A Python script (predictor.py) leveraging KServe's API and ONNX Runtime to load and serve ONNX models (see the sketch after this list).
- Deployment: Instructions for running locally on macOS or deploying to Kind with KServe, including Kubernetes manifests.
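For orientation, a custom predictor of this kind typically looks like the sketch below. This is a hedged illustration, not the repo's exact predictor.py: the model filename (model.onnx), the float32 input dtype, and the served model name are assumptions based on the rest of this README.

```python
# Minimal sketch of a KServe custom predictor backed by ONNX Runtime.
# Assumptions: the model file is model.onnx, inputs are float32, and the
# served name matches the onnx-model name used in the curl examples.
import numpy as np
import onnxruntime as ort
from kserve import Model, ModelServer


class ONNXModel(Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.session = None
        self.load()

    def load(self):
        # CPU-only execution provider: works on macOS/ARM64 where Triton doesn't.
        self.session = ort.InferenceSession(
            "model.onnx", providers=["CPUExecutionProvider"]
        )
        self.ready = True

    def predict(self, payload: dict, headers: dict = None) -> dict:
        # V1 protocol body: {"instances": [[f1, f2, f3, f4, f5], ...]}
        inputs = np.array(payload["instances"], dtype=np.float32)
        input_name = self.session.get_inputs()[0].name
        outputs = self.session.run(None, {input_name: inputs})
        return {"predictions": outputs[0].tolist()}


if __name__ == "__main__":
    # ModelServer listens on port 8080 by default, matching the curl examples.
    ModelServer().start([ONNXModel("onnx-model")])
```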
To run this demo you will need:
- Container Engine: A container engine, like Podman, for building the container image and running Kind.
- Kind: To create a local Kubernetes cluster.
- kubectl: To interact with the cluster.
- Python 3.11+: For local execution.
- uv: For dependency management (optional but recommended).
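If you use uv, dependency setup typically looks like this (assuming the repo ships a pyproject.toml; adjust if it manages dependencies differently):

```sh
uv sync                      # create a virtual environment and install dependencies
uv run python predictor.py   # run the predictor inside that environment
```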
Alternatively, start the server directly:

```sh
python predictor.py
```

This starts a server on http://0.0.0.0:8080.
In a new terminal, send a sample request:
```sh
curl -i -X POST http://0.0.0.0:8080/v1/models/onnx-model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[50.0, 5.0, 0.0, 0.0, 1.0]]}'
```

Expected output:
```
HTTP/1.1 200 OK
date: Sat, 05 Apr 2025 14:42:34 GMT
server: uvicorn
content-length: 38
content-type: application/json

{"predictions":[[0.9998821020126343]]}
```

Set up KServe in your Kind cluster by following the official KServe instructions: https://kserve.github.io/website/master/get_started/#install-the-kserve-quickstart-environment
Build the container image using a container engine, like Podman, and push it to a registry (e.g., Quay.io):
```sh
podman build -t quay.io/<your-username>/kserve-onnx-predictor-demo:latest -f Containerfile .
podman push quay.io/<your-username>/kserve-onnx-predictor-demo:latest
```

Replace <your-username> with your Quay.io username.
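For context, a Containerfile for this kind of predictor usually follows the pattern below. This is an illustrative sketch, not the repo's actual Containerfile: the base image, dependency list, and copied file names are assumptions.

```dockerfile
# Illustrative sketch; see the repo's Containerfile for the real build.
FROM python:3.11-slim
WORKDIR /app
# Assumed file names: the predictor script and the ONNX model it serves.
COPY predictor.py model.onnx ./
RUN pip install --no-cache-dir kserve onnxruntime numpy
EXPOSE 8080
ENTRYPOINT ["python", "predictor.py"]
```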
Update inferenceservice.yaml, setting spec.predictor.containers[0].image to the image you built in the previous step.
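The relevant part of the manifest looks roughly like this. It is a sketch: the name and namespace come from this README, and the repo's actual inferenceservice.yaml may differ in other fields:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: onnx-model
  namespace: onnx-model
spec:
  predictor:
    containers:
      - name: kserve-container
        # Set this to the image you pushed in the previous step.
        image: quay.io/<your-username>/kserve-onnx-predictor-demo:latest
```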
Apply the Kubernetes manifests from the manifests directory:
```sh
kubectl apply -k manifests
```

This deploys the InferenceService in the onnx-model namespace.
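Before testing, you can confirm the InferenceService and its pod are up:

```sh
kubectl get inferenceservice -n onnx-model
kubectl get pods -n onnx-model
```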
The Services KServe creates lack pod selectors, so port-forward directly to the pod:

```sh
kubectl get pods -n onnx-model -l serving.kserve.io/inferenceservice=onnx-model -o jsonpath="{.items[0].metadata.name}" | xargs -I {} kubectl port-forward -n onnx-model pod/{} 8080:8080
```

With the port-forward active, test the deployed model:
```sh
curl -i -X POST http://localhost:8080/v1/models/onnx-model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[50.0, 5.0, 0.0, 0.0, 1.0]]}'
```

Expected output:

```
HTTP/1.1 200 OK
date: Mon, 07 Apr 2025 21:34:03 GMT
server: uvicorn
content-length: 38
content-type: application/json

{"predictions":[[0.9998821020126343]]}
```
Run the following command to undeploy this demo:
```sh
kubectl delete -k manifests
```