Point users to vLLM production server (#362)
The vLLM team states that [`vllm.entrypoints.api_server`](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py#L2-L6) exists only to demonstrate usage of their AsyncEngine; for production use, they point users to `vllm.entrypoints.openai.api_server` instead.

So I think this should be the entrypoint used in the KServe documentation too, to avoid confusing newcomers.

Signed-off-by: Pierre Dulac <pierre@dotprod.ai>
dulacp committed May 11, 2024
1 parent d85a1f8 commit 5ed7232
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions docs/modelserving/v1beta1/llm/vllm/README.md
```diff
@@ -28,7 +28,7 @@ The LLaMA model can be downloaded from [huggingface](https://huggingface.co/meta
       command:
         - python3
         - -m
-        - vllm.entrypoints.api_server
+        - vllm.entrypoints.openai.api_server
       env:
         - name: STORAGE_URI
           value: gs://kfserving-examples/llm/huggingface/llama
```
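
For anyone verifying the change before redeploying, the production entrypoint can also be launched directly; a minimal sketch, assuming vLLM is installed locally and using an illustrative model id (substitute the model you actually serve):

```bash
# Launch vLLM's OpenAI-compatible server locally as a sanity check.
# The model id below is illustrative, not taken from the KServe example.
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --port 8000
```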
````diff
@@ -69,7 +69,7 @@ to find out your ingress IP and port.
 You can run the [benchmarking script](./benchmark.py) and send the inference request to the exposed URL.
 
 ```bash
-python benchmark.py --backend vllm --port ${INGRESS_PORT} --host ${INGRESS_HOST} --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer ./tokenizer --request-rate 5
+python benchmark_serving.py --backend openai --port ${INGRESS_PORT} --host ${INGRESS_HOST} --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer ./tokenizer --request-rate 5
 ```
 
 !!! success "Expected Output"
````
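
Since the server now speaks the OpenAI protocol, individual requests can also be sent by hand; a sketch, assuming the ingress variables from the surrounding docs and an illustrative model name:

```bash
# Send one completion request to the OpenAI-compatible endpoint.
# The "model" value is illustrative and must match the model vLLM serves;
# depending on the ingress setup, a Host header for the InferenceService
# may also be required.
curl -s http://${INGRESS_HOST}:${INGRESS_PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama", "prompt": "San Francisco is a", "max_tokens": 16}'
```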
