
Update huggingface readme #3678

Merged (8 commits) on May 11, 2024
Changes from 6 commits
python/huggingfaceserver/README.md: 14 changes (7 additions, 7 deletions)
@@ -45,7 +45,7 @@ curl -H "content-type:application/json" -v localhost:8080/v1/models/bert:predict
> 1. `SAFETENSORS_FAST_GPU` is set by default to improve the model loading performance.
> 2. `HF_HUB_DISABLE_TELEMETRY` is set by default to disable the telemetry.

-1. Serve the huggingface model using KServe python runtime for both preprocess(tokenization)/postprocess and inference.
+1. Serve the BERT model using the KServe python huggingface runtime for both preprocess (tokenization) / postprocess and inference.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -70,7 +70,7 @@ spec:
memory: 2Gi
```
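
The hunk above collapses most of the manifest. A complete InferenceService for this runtime would look roughly like the sketch below; the metadata name, Hugging Face model id, image tag, and resource values are illustrative assumptions rather than the README's exact content.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-bert                      # illustrative name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: kserve/huggingfaceserver:latest   # assumed image tag
        args:
          - --model_name=bert
          - --model_id=bert-base-uncased         # assumed Hugging Face model id
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 2Gi
```

Applying a manifest along these lines creates the predictor, after which the `/v1/models/bert:predict` curl shown earlier can be used against it.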

-2. Serve the huggingface model using triton inference runtime and KServe transformer for the preprocess(tokenization) and postprocess.
+2. Serve the BERT model using the triton inference runtime and a KServe transformer with huggingface for the preprocess (tokenization) and postprocess steps.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -111,8 +111,8 @@ spec:
cpu: 100m
memory: 2Gi
```
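
Again only the edges of the manifest are visible in the hunk. A rough sketch of the triton predictor plus huggingface transformer layout follows; the storageUri, runtime version, image tag, and argument values (including `--predictor_protocol`) are assumptions for illustration, not the README's exact content.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-triton                    # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: triton
      storageUri: gs://your-bucket/models/bert   # placeholder triton model repository
      runtimeVersion: 23.10-py3                  # assumed triton image tag
  transformer:
    containers:
      - name: kserve-container
        image: kserve/huggingfaceserver:latest   # assumed image tag
        args:
          - --model_name=bert
          - --model_id=bert-base-uncased         # assumed Hugging Face model id
          - --predictor_protocol=v2              # assumed flag: talk to triton over the open inference protocol
        resources:
          requests:
            cpu: 100m
            memory: 2Gi
```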
-3. Serve the huggingface model using vllm runtime. vllm is the default runtime. Note - Model need to be supported by vllm otherwise KServe python runtime will be used as a failsafe.
-vllm supported models - https://docs.vllm.ai/en/latest/models/supported_models.html
+3. Serve the llama2 model using the huggingface vLLM runtime. When a model is supported by vLLM, it is used as the default runtime; otherwise the KServe python runtime is used as a failsafe.
+vLLM supported models - https://docs.vllm.ai/en/latest/models/supported_models.html
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -138,7 +138,7 @@ spec:

```
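
Most of this manifest is likewise collapsed. A full manifest for serving llama2 on the vLLM runtime might look like the following sketch; the model id and GPU resource values are assumptions.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama2                    # illustrative name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: kserve/huggingfaceserver:latest       # assumed image tag
        args:
          - --model_name=llama2
          - --model_id=meta-llama/Llama-2-7b-chat-hf # assumed Hugging Face model id
        resources:
          limits:
            cpu: "6"
            memory: 24Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 24Gi
            nvidia.com/gpu: "1"
```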

-If vllm needs to be disabled include the flag `--disable_vllm` in the container args. In this case the KServe python runtime will be used.
+If vLLM needs to be disabled, include the flag `--disable_vllm` in the container args. In this case the KServe python runtime will be used.

```yaml
apiVersion: serving.kserve.io/v1beta1
@@ -165,9 +165,9 @@ spec:
nvidia.com/gpu: "1"
```
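
For completeness, disabling vLLM only changes the container args; a sketch under the same assumptions as the previous example:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama2                    # illustrative name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: kserve/huggingfaceserver:latest       # assumed image tag
        args:
          - --model_name=llama2
          - --model_id=meta-llama/Llama-2-7b-chat-hf # assumed Hugging Face model id
          - --disable_vllm                           # fall back to the KServe python runtime
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
```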

-Perform the inference for vllm specific runtime
+Perform inference for the vLLM-specific runtime

-vllm runtime deployments only support `/generate` endpoint for inference. Please refer to [text generation API schema](https://github.com/kserve/open-inference-protocol/blob/main/specification/protocol/generate_rest.yaml) for more details.
+vLLM runtime deployments only support the `/generate` endpoint for inference. Please refer to the [text generation API schema](https://github.com/kserve/open-inference-protocol/blob/main/specification/protocol/generate_rest.yaml) for more details.
```bash
curl -H "content-type:application/json" -v localhost:8080/v2/models/gpt2/generate -d '{"text_input": "The capital of france is [MASK]." }'

```