Point users to vLLM production server (#362)
The vLLM team states that [`vllm.entrypoints.api_server`](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py#L2-L6) exists only to demonstrate usage of their AsyncEngine; for production use, they point users to `vllm.entrypoints.openai.api_server` instead.

So I think this should be the entrypoint used in the KServe documentation too, to avoid confusing newcomers.

Signed-off-by: Pierre Dulac <pierre@dotprod.ai>
dulacp committed May 11, 2024
1 parent d85a1f8 commit 5ed7232
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions docs/modelserving/v1beta1/llm/vllm/README.md
```diff
@@ -28,7 +28,7 @@ The LLaMA model can be downloaded from [huggingface](https://huggingface.co/meta
       command:
         - python3
         - -m
-        - vllm.entrypoints.api_server
+        - vllm.entrypoints.openai.api_server
       env:
         - name: STORAGE_URI
           value: gs://kfserving-examples/llm/huggingface/llama
```
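
For anyone verifying the change before redeploying, the production entrypoint can also be launched directly; a minimal sketch, assuming vLLM is installed locally and using an illustrative model id (substitute the model you actually serve):

```bash
# Launch vLLM's OpenAI-compatible server locally as a sanity check.
# The model id below is illustrative, not taken from the KServe example.
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --port 8000
```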
````diff
@@ -69,7 +69,7 @@ to find out your ingress IP and port.
 You can run the [benchmarking script](./benchmark.py) and send the inference request to the exposed URL.
 
 ```bash
-python benchmark.py --backend vllm --port ${INGRESS_PORT} --host ${INGRESS_HOST} --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer ./tokenizer --request-rate 5
+python benchmark_serving.py --backend openai --port ${INGRESS_PORT} --host ${INGRESS_HOST} --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer ./tokenizer --request-rate 5
 ```
 
 !!! success "Expected Output"
````
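
Since the server now speaks the OpenAI protocol, individual requests can also be sent by hand; a sketch, assuming the ingress variables from the surrounding docs and an illustrative model name:

```bash
# Send one completion request to the OpenAI-compatible endpoint.
# The "model" value is illustrative and must match the model vLLM serves;
# depending on the ingress setup, a Host header for the InferenceService
# may also be required.
curl -s http://${INGRESS_HOST}:${INGRESS_PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama", "prompt": "San Francisco is a", "max_tokens": 16}'
```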
