# Quickstart

This guide shows how to use vLLM to:

-   run offline batched inference on a dataset;
-   build an API server for a large language model;
-   start an OpenAI-compatible API server.

Be sure to complete the installation instructions before continuing
with this guide.

Note

By default, vLLM downloads models from
[HuggingFace](https://huggingface.co/). To use models from
[ModelScope](https://www.modelscope.cn) in the following examples,
set this environment variable:

``` shell
export VLLM_USE_MODELSCOPE=True
```
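
If you launch vLLM from a Python script or notebook rather than a shell, the same variable can be set through `os.environ`. The following is a minimal sketch; it assumes the variable is read when vLLM is imported, so it sets the variable before any `vllm` import.

``` python
import os

# Assumption: vLLM reads this variable at import time, so set it before
# the first `import vllm` in the process.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

# from vllm import LLM, SamplingParams  # import only after setting the variable
print(os.environ["VLLM_USE_MODELSCOPE"])
```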

## Offline Batched Inference

We first show an example of using vLLM for offline batched inference on
a dataset. In other words, we use vLLM to generate text for a list of
input prompts.

Import `LLM` and `SamplingParams` from vLLM. The `LLM` class is the main
class for running offline inference with the vLLM engine. The
`SamplingParams` class specifies the parameters for the sampling
process.

``` python
from vllm import LLM, SamplingParams
```

Define the list of input prompts and the sampling parameters for
generation. The sampling temperature is set to 0.8 and the nucleus
sampling probability is set to 0.95. For more information about the
sampling parameters, refer to the [class
definition](https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py).

``` python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```
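
To make the two parameters concrete, here is a small standalone sketch of temperature scaling followed by nucleus (top-p) filtering over a toy distribution. It is an illustration only and not vLLM's implementation; the function name `sample_token` and the toy logits are made up for this example.

``` python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample one token from a {token: logit} mapping.

    Illustrative only: this is not vLLM's implementation.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Numerically stable softmax.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # Nucleus (top-p) filtering: keep the smallest prefix of the most
    # probable tokens whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical logits for the next token after "The capital of France is".
toy_logits = {"Paris": 5.0, "Lyon": 2.0, "Rome": 1.0, "banana": -3.0}
print(sample_token(toy_logits, temperature=0.8, top_p=0.95))
```

With these toy logits, "Paris" alone carries about 97% of the probability mass after temperature scaling, so the nucleus contains a single token. Raising `top_p` to 1.0 lets the less likely tokens back in.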

Initialize vLLM's engine for offline inference with the `LLM` class and
the [OPT-125M model](https://arxiv.org/abs/2205.01068). The list of
supported models can be found in the supported models documentation.

``` python
llm = LLM(model="facebook/opt-125m")
```




Call `llm.generate` to generate the outputs. It adds the input prompts
to the vLLM engine's waiting queue and runs the engine to generate the
outputs with high throughput. The outputs are returned as a list of
`RequestOutput` objects, which include all the output tokens.

``` python
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```



The code example can also be found in
[examples/offline_inference.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py).
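
When the results feed into further processing rather than being printed, it can help to gather them into a plain dictionary. The sketch below relies only on the `prompt`, `outputs`, and `text` attributes used in the loop above. The helper name `collect_generations` and the stand-in objects are hypothetical, included so the snippet runs without a model.

``` python
from types import SimpleNamespace


def collect_generations(request_outputs):
    """Map each prompt to the list of texts generated for it."""
    results = {}
    for request_output in request_outputs:
        texts = [completion.text for completion in request_output.outputs]
        results[request_output.prompt] = texts
    return results


# Hypothetical stand-ins that mimic only the attributes used above
# (`prompt`, `outputs`, `text`); real objects come from `llm.generate`.
fake_outputs = [
    SimpleNamespace(
        prompt="The capital of France is",
        outputs=[SimpleNamespace(text="<generated text>")],
    ),
]
print(collect_generations(fake_outputs))
```

With a real engine, `collect_generations(llm.generate(prompts, sampling_params))` would be the equivalent call.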


## OpenAI-Compatible Server

vLLM can be deployed as a server that implements the OpenAI API
protocol. This allows vLLM to be used as a drop-in replacement for
applications using the OpenAI API. By default, it starts the server at
`http://localhost:8000`. You can specify the address with the `--host`
and `--port` arguments. The server currently hosts one model at a time
(OPT-125M in the command below) and implements the [list
models](https://platform.openai.com/docs/api-reference/models/list),
[create chat
completion](https://platform.openai.com/docs/api-reference/chat/completions/create),
and [create
completion](https://platform.openai.com/docs/api-reference/completions/create)
endpoints. We are actively adding support for more endpoints.

Start the server:

``` shell
python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m
```


By default, the server uses a predefined chat template stored in the
tokenizer. You can override this template by using the `--chat-template`
argument:

``` shell
python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m \
    --chat-template ./examples/template_chatml.jinja
```
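
A chat template turns the list of chat messages into the single prompt string the model actually sees. As a rough illustration, the sketch below renders messages in the general ChatML style. It is a plain-Python approximation written for this guide and does not reproduce the contents of `template_chatml.jinja`; the function name `render_chatml` is hypothetical.

``` python
def render_chatml(messages, add_generation_prompt=True):
    """Flatten chat messages into one prompt string, ChatML style."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Cue the model to answer as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a joke."},
]
print(render_chatml(messages))
```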






This server can be queried in the same format as the OpenAI API. For
example, list the models:

``` shell
curl http://localhost:8000/v1/models
```
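
The same request can be made from Python with only the standard library, which is handy for checking whether the server has finished starting. This is a minimal sketch: the helper names are hypothetical, and the response is assumed to follow the OpenAI list format with a `data` array of objects that each carry an `id`.

``` python
import json
import urllib.error
import urllib.request


def models_url(host="localhost", port=8000):
    """Build the list-models URL for a server at host:port."""
    return f"http://{host}:{port}/v1/models"


def list_model_ids(url, timeout=2.0):
    """Return the served model ids, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not JSON.
        return None
    # Assumption: the body follows the OpenAI list format,
    # {"object": "list", "data": [{"id": ...}, ...]}.
    return [model["id"] for model in payload.get("data", [])]


print(models_url())
print(list_model_ids(models_url()))  # None while no server is listening
```

A `None` result usually means the server is not running yet or is listening on a different host or port.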


Pass the `--api-key` argument or set the `VLLM_API_KEY` environment
variable to make the server check for an API key in the request header.
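
On the client side, the key then has to accompany every request. The sketch below assumes the server follows the OpenAI convention of reading the key from an `Authorization: Bearer ...` header. The helper name and the key value are placeholders, and the snippet only builds the request object without sending it.

``` python
import urllib.request


def build_authorized_request(url, api_key):
    """Create a GET request that carries the API key as a bearer token."""
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# "my-secret-key" is a placeholder; use the value given to --api-key.
request = build_authorized_request(
    "http://localhost:8000/v1/models", "my-secret-key"
)
print(request.get_header("Authorization"))
```

When using the `openai` Python package, passing the same value as `api_key` to the `OpenAI` client serves the same purpose.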

### Using OpenAI Completions API with vLLM

Query the model with input prompts:


``` shell
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "facebook/opt-125m",
      "prompt": "San Francisco is a",
      "max_tokens": 7,
      "temperature": 0
}'
```


Since this server is compatible with the OpenAI API, you can use it as a
drop-in replacement for any application that uses the OpenAI API. For
example, another way to query the server is via the `openai` Python
package:

``` python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="facebook/opt-125m",
                                      prompt="San Francisco is a")
print("Completion result:", completion)
```

For a more detailed client example, refer to
[examples/openai_completion_client.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py).

### Using OpenAI Chat API with vLLM

The vLLM server is designed to support the OpenAI Chat API, allowing you
to engage in dynamic conversations with the model. The chat interface is
a more interactive way to communicate with the model, allowing
back-and-forth exchanges that can be stored in the chat history. This is
useful for tasks that require context or more detailed explanations.

Querying the model using OpenAI Chat API:

You can use the [create chat
completion](https://platform.openai.com/docs/api-reference/chat/completions/create)
endpoint to communicate with the model in a chat-like interface:

``` shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
}'
```

Python Client Example:

Using the `openai` Python package, you can also communicate with the
model in a chat-like manner:

``` python
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)
```
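
Each chat request carries the whole conversation so far, so a back-and-forth exchange amounts to growing the `messages` list and sending it again. The sketch below shows that bookkeeping on its own, without a server. The helper `add_turn` is hypothetical, and the assistant text is a placeholder rather than real model output.

``` python
def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    return history + [{"role": role, "content": content}]


history = [{"role": "system", "content": "You are a helpful assistant."}]
history = add_turn(history, "user", "Tell me a joke.")
# In a real session the assistant text would come from the server response;
# the string below is a placeholder, not model output.
history = add_turn(history, "assistant", "<assistant reply>")
history = add_turn(history, "user", "Explain why it is funny.")

for message in history:
    print(f"{message['role']}: {message['content']}")
```

The resulting `history` list is what would be passed as `messages` in the next chat completion request.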

For more in-depth examples and advanced features of the chat API, you
can refer to the official OpenAI documentation.