diff --git a/docs/assets/images/guides/mlops/serving/deployment_endpoints.png b/docs/assets/images/guides/mlops/serving/deployment_endpoints.png
new file mode 100644
index 000000000..c97a83598
Binary files /dev/null and b/docs/assets/images/guides/mlops/serving/deployment_endpoints.png differ
diff --git a/docs/user_guides/mlops/serving/index.md b/docs/user_guides/mlops/serving/index.md
index dc0915bd9..3f1ce0e92 100644
--- a/docs/user_guides/mlops/serving/index.md
+++ b/docs/user_guides/mlops/serving/index.md
@@ -24,6 +24,10 @@ Configure the predictor to batch inference requests, see the [Inference Batcher
 
 Configure the predictor to log inference requests and predictions, see the [Inference Logger Guide](inference-logger.md).
 
+### REST API
+
+Send inference requests to deployed models using the REST API, see the [REST API Guide](rest-api.md).
+
 ### Troubleshooting
 
 Inspect the model server logs to troubleshoot your model deployments, see the [Troubleshooting Guide](troubleshooting.md).
diff --git a/docs/user_guides/mlops/serving/rest-api.md b/docs/user_guides/mlops/serving/rest-api.md
new file mode 100644
index 000000000..93461c9c3
--- /dev/null
+++ b/docs/user_guides/mlops/serving/rest-api.md
@@ -0,0 +1,98 @@
+# Hopsworks Model Serving REST API
+
+## Introduction
+
+Hopsworks provides model serving capabilities by leveraging [KServe](https://kserve.github.io/website/) as the model serving platform and [Istio](https://istio.io/) as the ingress gateway to the model deployments.
+
+This document explains how to interact with a model deployment via the REST API.
+
+## Base URL
+
+Deployed models are accessible through the Istio ingress gateway. The URL to interact with a model deployment is provided on the model deployment page in the Hopsworks UI.
+
+The URL follows the format `http://<ISTIO_GATEWAY_HOST>/<RESOURCE_PATH>`, where `RESOURCE_PATH` depends on the [model server](https://docs.hopsworks.ai/latest/user_guides/mlops/serving/predictor/#model-server) (e.g. vLLM, TensorFlow Serving, SKLearn ModelServer).
+
+<p align="center">
+  <figure>
+    <img src="../../../../assets/images/guides/mlops/serving/deployment_endpoints.png" alt="Endpoints">
+    <figcaption>Deployment Endpoints</figcaption>
+  </figure>
+</p>
+
+## Authentication
+
+All requests must include an API key for authentication. You can create an API key by following this [guide](../../projects/api_key/create_api_key.md).
+
+Include the key in the `Authorization` header:
+```text
+Authorization: ApiKey <API_KEY>
+```
+
+## Headers
+
+| Header          | Description                                      | Example Value             |
+| --------------- | ------------------------------------------------ | ------------------------- |
+| `Host`          | Model’s hostname, provided in the Hopsworks UI.  | `fraud.test.hopsworks.ai` |
+| `Authorization` | API key for authentication.                      | `ApiKey <API_KEY>`        |
+| `Content-Type`  | Request payload type (always JSON).              | `application/json`        |
+
+## Request Format
+
+The request format depends on the model server being used.
+
+For predictive inference (i.e. TensorFlow, SKLearn, or Python Serving), the request must be sent as a JSON object containing an `inputs` or `instances` field. You can find more information on the request format [here](https://kserve.github.io/website/docs/concepts/architecture/data-plane/v1-protocol#request-format). An example is given below.
+
+=== "Python"
+
+    !!! example "REST API example for Predictive Inference (TensorFlow, SKLearn or Python Serving)"
+        ```python
+        import requests
+
+        data = {
+            "inputs": [
+                [
+                    4641025220953719,
+                    4920355418495856
+                ]
+            ]
+        }
+
+        headers = {
+            "Host": "fraud.test.hopsworks.ai",
+            "Authorization": "ApiKey 8kDOlnRlJU4kiV1Y.RmFNJY3XKAUSqmJZ03kbUbXKMQSHveSBgMIGT84qrM5qXMjLib7hdlfGeg8fBQZp",
+            "Content-Type": "application/json"
+        }
+
+        response = requests.post(
+            "http://10.87.42.108/v1/models/fraud:predict",
+            headers=headers,
+            json=data
+        )
+        print(response.json())
+        ```
+
+=== "Curl"
+
+    !!! example "REST API example for Predictive Inference (TensorFlow, SKLearn or Python Serving)"
+        ```bash
+        curl -X POST "http://10.87.42.108/v1/models/fraud:predict" \
+            -H "Host: fraud.test.hopsworks.ai" \
+            -H "Authorization: ApiKey 8kDOlnRlJU4kiV1Y.RmFNJY3XKAUSqmJZ03kbUbXKMQSHveSBgMIGT84qrM5qXMjLib7hdlfGeg8fBQZp" \
+            -H "Content-Type: application/json" \
+            -d '{
+                "inputs": [
+                    [
+                        4641025220953719,
+                        4920355418495856
+                    ]
+                ]
+            }'
+        ```
+
+For generative inference (i.e. vLLM), the request follows the [OpenAI specification](https://platform.openai.com/docs/api-reference/chat/create).
+
+## Response
+
+The model returns predictions in a JSON object. The exact response format depends on the model server implementation. You can find more information about specific model servers in the [KServe documentation](https://kserve.github.io/website/docs/intro).
diff --git a/mkdocs.yml b/mkdocs.yml
index 984a41326..98a3875b3 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -203,6 +203,7 @@ nav:
           - Inference Logger: user_guides/mlops/serving/inference-logger.md
           - Inference Batcher: user_guides/mlops/serving/inference-batcher.md
           - API Protocol: user_guides/mlops/serving/api-protocol.md
+          - REST API: user_guides/mlops/serving/rest-api.md
           - Troubleshooting: user_guides/mlops/serving/troubleshooting.md
           - External Access: user_guides/mlops/serving/external-access.md
       - Vector Database: user_guides/mlops/vector_database/index.md
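For the generative inference path, the new `rest-api.md` page points to the OpenAI specification but does not include a request example. The sketch below shows what such a call could look like, assuming an OpenAI-compatible chat-completions resource path; the gateway address, `Host` value, deployment name, and resource path are illustrative placeholders rather than values taken from the diff — the real ones are shown on the deployment page in the Hopsworks UI.

```python
import requests

# Placeholders -- copy the real gateway URL, Host and API key from the
# deployment page in the Hopsworks UI.
ISTIO_GATEWAY = "http://10.87.42.108"
RESOURCE_PATH = "/openai/v1/chat/completions"  # assumed OpenAI-compatible path

headers = {
    "Host": "llm.test.hopsworks.ai",      # hypothetical deployment hostname
    "Authorization": "ApiKey <API_KEY>",  # replace with a real API key
    "Content-Type": "application/json",
}

# Request body following the OpenAI chat-completions specification
# referenced in rest-api.md.
data = {
    "model": "llm",  # hypothetical deployment/model name
    "messages": [
        {"role": "user", "content": "Summarize the last three transactions."}
    ],
    "max_tokens": 128,
}

response = requests.post(f"{ISTIO_GATEWAY}{RESOURCE_PATH}", headers=headers, json=data)
print(response.json())
```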
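The Response section describes the returned JSON only in prose. As a rough illustration, a KServe v1-protocol predictor such as the `fraud` example wraps its output in a `predictions` field; the sketch below shows how a client might read it, with the prediction values themselves being hypothetical and the API key replaced by a placeholder.

```python
import requests

# Same request as the predictive-inference example in rest-api.md.
headers = {
    "Host": "fraud.test.hopsworks.ai",
    "Authorization": "ApiKey <API_KEY>",  # replace with a real API key
    "Content-Type": "application/json",
}
data = {"inputs": [[4641025220953719, 4920355418495856]]}

response = requests.post(
    "http://10.87.42.108/v1/models/fraud:predict", headers=headers, json=data
)
response.raise_for_status()  # surface HTTP errors (401, 404, ...) before parsing

# KServe v1-protocol servers wrap the result in a "predictions" field,
# e.g. {"predictions": [0, 1]} -- the values here are illustrative only.
body = response.json()
for prediction in body.get("predictions", []):
    print(prediction)
```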