Update huggingface readme #3678 (merged 8 commits, May 11, 2024)
python/huggingfaceserver/README.md: 31 changes (18 additions, 13 deletions)
@@ -1,12 +1,12 @@
-# Huggingface Serving Runtime
+# HuggingFace Serving Runtime

-The Huggingface serving runtime implements a runtime that can serve huggingface transformer based model out of the box.
+The HuggingFace serving runtime implements a runtime that can serve HuggingFace transformer-based models out of the box.
The preprocess and post-process handlers are implemented for different ML tasks, for example text classification,
token classification, text generation, and text2text generation. Based on the performance requirements, you can choose to run
the inference on a more optimized inference engine like Triton Inference Server, or on vLLM for text generation.


-## Run Huggingface Server Locally
+## Run HuggingFace Server Locally

```bash
python -m huggingfaceserver --model_id=bert-base-uncased --model_name=bert
```

@@ -45,7 +45,7 @@ curl -H "content-type:application/json" -v localhost:8080/v1/models/bert:predict
> 1. `SAFETENSORS_FAST_GPU` is set by default to improve the model loading performance.
> 2. `HF_HUB_DISABLE_TELEMETRY` is set by default to disable telemetry.

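Both defaults can be overridden through the environment when running locally. A minimal sketch, assuming `SAFETENSORS_FAST_GPU=0` opts out of the fast-loading path (the exact value semantics are an assumption; check the safetensors docs):

```bash
# Sketch: launch the local server with both defaults overridden explicitly.
# The values shown are illustrative assumptions, not documented settings.
SAFETENSORS_FAST_GPU=0 HF_HUB_DISABLE_TELEMETRY=1 \
  python -m huggingfaceserver --model_id=bert-base-uncased --model_name=bert
```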
-1. Serve the huggingface model using KServe python runtime for both preprocess(tokenization)/postprocess and inference.
+1. Serve the BERT model using the KServe Python HuggingFace runtime for both the preprocess (tokenization)/postprocess steps and inference.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -70,7 +70,7 @@ spec:
memory: 2Gi
```
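Once the InferenceService above is ready, it can be exercised through the KServe v1 protocol. A minimal sketch against the `bert` predict endpoint shown earlier (the host assumes local port-forwarding, and the masked-sentence input is illustrative):

```bash
# Sketch: KServe v1 prediction request to the "bert" model; assumes the
# service is reachable on localhost:8080, e.g. via kubectl port-forward.
curl -H "content-type:application/json" -v \
  localhost:8080/v1/models/bert:predict \
  -d '{"instances": ["The capital of France is [MASK]."]}'
```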

-2. Serve the huggingface model using triton inference runtime and KServe transformer for the preprocess(tokenization) and postprocess.
+2. Serve the BERT model using the Triton inference runtime, with a KServe transformer running the HuggingFace runtime for the preprocess (tokenization) and postprocess steps.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -111,8 +111,11 @@ spec:
cpu: 100m
memory: 2Gi
```
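Before sending traffic, it helps to confirm that the InferenceService reports ready. A generic check; the service name here is hypothetical, since the manifest's `metadata` is folded out of the diff:

```bash
# Sketch: check readiness of the InferenceService created above.
# "huggingface-triton" is an assumed name; substitute your metadata.name.
kubectl get inferenceservice huggingface-triton
```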
-3. Serve the huggingface model using vllm runtime. Note - Model need to be supported by vllm otherwise KServe python runtime will be used as a failsafe.
-vllm supported models - https://docs.vllm.ai/en/latest/models/supported_models.html
+3. Serve the llama2 model using the KServe HuggingFace vLLM runtime. For the llama2 model, vLLM is supported and used as the default backend.
+If vLLM is available for a model, it is set as the default backend; otherwise the KServe HuggingFace runtime is used as a failsafe.
+You can find vLLM supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -135,10 +138,10 @@ spec:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"

```
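Once deployed, this service is consumed through the OpenAI-style endpoints described below. A sketch, assuming the model was registered under the name `llama2` and the service is port-forwarded locally:

```bash
# Sketch: OpenAI-style completions request against the vLLM-backed service.
# The model name "llama2" is an assumption based on the manifest above.
curl -H "content-type:application/json" localhost:8080/openai/v1/completions \
  -d '{"model": "llama2", "prompt": "<prompt>", "stream": false, "max_tokens": 30}'
```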

-If vllm needs to be disabled include the flag `--backend=huggingface` in the container args. In this case the KServe python runtime will be used.
+If vLLM needs to be disabled, include the flag `--backend=huggingface` in the container args. In this case the HuggingFace backend is used.

```yaml
apiVersion: serving.kserve.io/v1beta1
@@ -165,18 +168,20 @@ spec:
nvidia.com/gpu: "1"
```

-Perform the inference for vllm specific runtime
-vllm runtime deployments only support OpenAI `v1/completions` and `v1/chat/completions` endpoints for inference.
-Sample OpenAI Completions request
+Perform the inference:
+
+KServe HuggingFace runtime deployments support the OpenAI `v1/completions` and `v1/chat/completions` endpoints for inference.
+
+Sample OpenAI Completions request:
```bash
curl -H "content-type:application/json" -v localhost:8080/openai/v1/completions -d '{"model": "gpt2", "prompt": "<prompt>", "stream":false, "max_tokens": 30 }'

{"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"<generated_text>"}],"created":1715353182,"model":"gpt2","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
```

-Sample OpenAI Chat request
+Sample OpenAI Chat request:

```bash
curl -H "content-type:application/json" -v localhost:8080/openai/v1/chat/completions -d '{"model": "gpt2", "messages": [{"role": "user","content": "<message>"}], "stream":false }'

```
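Both endpoints also accept `"stream": true`, in which case the server returns tokens incrementally rather than as a single JSON body, mirroring the OpenAI API. A sketch of the streaming variant:

```bash
# Sketch: streaming completions. "stream": true makes the server emit
# incremental chunks; curl -N disables output buffering so they appear live.
curl -N -H "content-type:application/json" localhost:8080/openai/v1/completions \
  -d '{"model": "gpt2", "prompt": "<prompt>", "stream": true, "max_tokens": 30}'
```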