
Update huggingface readme #3678

Merged (8 commits) on May 11, 2024
Changes from 6 commits
python/huggingfaceserver/README.md: 14 changes (7 additions, 7 deletions)
@@ -45,7 +45,7 @@ curl -H "content-type:application/json" -v localhost:8080/v1/models/bert:predict
> 1. `SAFETENSORS_FAST_GPU` is set by default to improve the model loading performance.
> 2. `HF_HUB_DISABLE_TELEMETRY` is set by default to disable the telemetry.

-1. Serve the huggingface model using KServe python runtime for both preprocess(tokenization)/postprocess and inference.
+1. Serve the BERT model using the KServe python huggingface runtime for both preprocess (tokenization) / postprocess and inference.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -70,7 +70,7 @@ spec:
memory: 2Gi
```
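
The hunk above collapses most of the manifest. A complete InferenceService for this runtime would look roughly like the sketch below; the metadata name, Hugging Face model id, image tag, and resource values are illustrative assumptions rather than the README's exact content.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-bert                      # illustrative name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: kserve/huggingfaceserver:latest   # assumed image tag
        args:
          - --model_name=bert
          - --model_id=bert-base-uncased         # assumed Hugging Face model id
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 2Gi
```

Applying a manifest along these lines creates the predictor, after which the `/v1/models/bert:predict` curl shown earlier can be used against it.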

-2. Serve the huggingface model using triton inference runtime and KServe transformer for the preprocess(tokenization) and postprocess.
+2. Serve the BERT model using the triton inference runtime and a KServe transformer with huggingface for the preprocess (tokenization) and postprocess steps.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -111,8 +111,8 @@ spec:
cpu: 100m
memory: 2Gi
```
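
Again only the edges of the manifest are visible in the hunk. A rough sketch of the triton predictor plus huggingface transformer layout follows; the storageUri, runtime version, image tag, and argument values (including `--predictor_protocol`) are assumptions for illustration, not the README's exact content.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-triton                    # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: triton
      storageUri: gs://your-bucket/models/bert   # placeholder triton model repository
      runtimeVersion: 23.10-py3                  # assumed triton image tag
  transformer:
    containers:
      - name: kserve-container
        image: kserve/huggingfaceserver:latest   # assumed image tag
        args:
          - --model_name=bert
          - --model_id=bert-base-uncased         # assumed Hugging Face model id
          - --predictor_protocol=v2              # assumed flag: talk to triton over the open inference protocol
        resources:
          requests:
            cpu: 100m
            memory: 2Gi
```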
-3. Serve the huggingface model using vllm runtime. vllm is the default runtime. Note - Model need to be supported by vllm otherwise KServe python runtime will be used as a failsafe.
-vllm supported models - https://docs.vllm.ai/en/latest/models/supported_models.html
+3. Serve the llama2 model using the huggingface vLLM runtime. When a model is supported by vLLM, it is used as the default runtime; otherwise the KServe python runtime is used as a failsafe.
+vLLM supported models - https://docs.vllm.ai/en/latest/models/supported_models.html
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
@@ -138,7 +138,7 @@ spec:

```
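
Most of this manifest is likewise collapsed. A full manifest for serving llama2 on the vLLM runtime might look like the following sketch; the model id and GPU resource values are assumptions.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama2                    # illustrative name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: kserve/huggingfaceserver:latest       # assumed image tag
        args:
          - --model_name=llama2
          - --model_id=meta-llama/Llama-2-7b-chat-hf # assumed Hugging Face model id
        resources:
          limits:
            cpu: "6"
            memory: 24Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 24Gi
            nvidia.com/gpu: "1"
```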

-If vllm needs to be disabled include the flag `--disable_vllm` in the container args. In this case the KServe python runtime will be used.
+If vLLM needs to be disabled, include the flag `--disable_vllm` in the container args. In this case the KServe python runtime will be used.

```yaml
apiVersion: serving.kserve.io/v1beta1
@@ -165,9 +165,9 @@ spec:
nvidia.com/gpu: "1"
```
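
For completeness, disabling vLLM only changes the container args; a sketch under the same assumptions as the previous example:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama2                    # illustrative name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: kserve/huggingfaceserver:latest       # assumed image tag
        args:
          - --model_name=llama2
          - --model_id=meta-llama/Llama-2-7b-chat-hf # assumed Hugging Face model id
          - --disable_vllm                           # fall back to the KServe python runtime
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
```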

-Perform the inference for vllm specific runtime
+Perform inference for the vLLM-specific runtime

-vllm runtime deployments only support `/generate` endpoint for inference. Please refer to [text generation API schema](https://github.com/kserve/open-inference-protocol/blob/main/specification/protocol/generate_rest.yaml) for more details.
+vLLM runtime deployments only support the `/generate` endpoint for inference. Please refer to the [text generation API schema](https://github.com/kserve/open-inference-protocol/blob/main/specification/protocol/generate_rest.yaml) for more details.
```bash
curl -H "content-type:application/json" -v localhost:8080/v2/models/gpt2/generate -d '{"text_input": "The capital of france is [MASK]." }'

```