This tutorial demonstrates how to use Pixeltable's built-in `vLLM` integration to run local LLMs with high-throughput inference.

<div class="alert alert-block alert-info"><!-- mdx:none -->
<b>If you are running this tutorial in Colab:</b>
vLLM requires a GPU for efficient operation. Click on the <code>Runtime -> Change runtime type</code> menu item at the top, then select the <code>GPU</code> radio button and click on <code>Save</code>.
</div>
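After switching the runtime, it can be useful to confirm that a GPU is actually visible before installing anything. The check below is only a heuristic sketch using the standard library: it assumes an NVIDIA GPU whose driver exposes the `nvidia-smi` command, and the helper name `looks_gpu_ready` is hypothetical.

```python
import platform
import shutil


def looks_gpu_ready() -> bool:
    """Heuristic: a Linux host with the NVIDIA driver CLI on the PATH.

    Assumption: the GPU is an NVIDIA device whose driver installs
    `nvidia-smi`. Other accelerators are not detected by this check.
    """
    is_linux = platform.system() == 'Linux'
    has_nvidia_smi = shutil.which('nvidia-smi') is not None
    return is_linux and has_nvidia_smi


print('Linux host with NVIDIA driver detected:', looks_gpu_ready())
```

If this prints `False` in Colab, the runtime type has probably not been switched to GPU yet.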

### Important notes

- vLLM provides high-throughput inference with techniques like PagedAttention and continuous batching
- Models are loaded from Hugging Face and cached in memory for reuse
- vLLM currently targets Linux and performs best with GPU support
- Consider GPU memory when choosing model sizes
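As a rough guide for the last point, the memory needed just to hold a model's weights is about the parameter count times the bytes per parameter. The sketch below assumes 16-bit weights (2 bytes each) and nominal parameter counts; the helper `weight_memory_gib` is hypothetical, and the result is only a lower bound, since the KV cache and runtime overhead need additional GPU memory.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GiB.

    Assumption: 2 bytes per parameter (16-bit weights). KV cache,
    activations, and framework overhead are NOT included.
    """
    return num_params * bytes_per_param / 1024**3


# Nominal sizes used for illustration only.
for label, num_params in [('0.5B parameters', 0.5e9), ('1B parameters', 1e9)]:
    print(f'{label}: about {weight_memory_gib(num_params):.2f} GiB of weights')
```

A model whose weights alone approach the capacity of your GPU will usually not leave enough room for inference.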

## Set up environment

First, let's install Pixeltable with vLLM support:

In [None]:
%pip install -qU pixeltable vllm

## Create a table for chat completions

Now let's create a table that will contain our inputs and responses.

In [None]:
import pixeltable as pxt
from pixeltable.functions import vllm

pxt.drop_dir('vllm_demo', force=True)
pxt.create_dir('vllm_demo')

t = pxt.create_table('vllm_demo/chat', {'input': pxt.String})

Next, we add a computed column that calls the Pixeltable `chat_completions` UDF, which uses vLLM's high-throughput inference engine under the hood. We specify a Hugging Face model identifier, and vLLM will download and cache the model automatically.

(If this is your first time using Pixeltable, the <a href="https://docs.pixeltable.com/tutorials/tables-and-data-operations">Pixeltable Fundamentals</a> tutorial contains more details about table creation, computed columns, and UDFs.)

For this demo we'll use `Qwen2.5-0.5B-Instruct`, a very small (0.5-billion parameter) model that still produces decent results.

In [None]:
# Add a computed column that uses vLLM for chat completion
# against the input.

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': t.input},
]

t.add_computed_column(
    result=vllm.chat_completions(
        messages,
        model='Qwen/Qwen2.5-0.5B-Instruct',
    )
)

# Extract the output content from the JSON structure returned
# by vLLM.

t.add_computed_column(output=t.result.choices[0].message.content)
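The expression `t.result.choices[0].message.content` walks a nested JSON structure. The plain-Python sketch below mirrors that path on a hand-written sample; only the `choices[0].message.content` path comes from the code above, while the `role` field and the placeholder text are assumptions made for illustration.

```python
# Hypothetical sample shaped like the path used in the computed column.
# The real response may contain additional fields.
sample_result = {
    'choices': [
        {'message': {'role': 'assistant', 'content': '<model reply text>'}}
    ]
}


def extract_content(result: dict) -> str:
    """Follow the same path as t.result.choices[0].message.content."""
    return result['choices'][0]['message']['content']


print(extract_content(sample_result))
```

In Pixeltable the same navigation is written with attribute access on the column expression, and it is evaluated for every row.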

## Test chat completion

Let's try a few queries:

In [None]:
# Test with a few questions
t.insert(
    [
        {'input': 'What is the capital of France?'},
        {'input': 'What are some edible species of fish?'},
        {'input': 'Who are the most prominent classical composers?'},
    ]
)

In [None]:
t.select(t.input, t.output).collect()

## Comparing models

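vLLM makes it easy to compare the output of different models. Let's compare the output of `Qwen2.5-0.5B-Instruct` against that of a somewhat larger model, `Llama-3.2-1B-Instruct`. (The Llama models on Hugging Face are gated, so you may need to accept the license and authenticate with a Hugging Face token before the download succeeds.) As always, when we add a new computed column to our table, it's automatically evaluated against the existing table rows.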

In [None]:
t.add_computed_column(
    result_l3=vllm.chat_completions(
        messages,
        model='meta-llama/Llama-3.2-1B-Instruct',
    )
)

t.add_computed_column(output_l3=t.result_l3.choices[0].message.content)

t.select(t.input, t.output, t.output_l3).collect()
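Reading the two output columns side by side is the simplest comparison. If you want a crude numeric signal as well, something like response length can help. The sketch below works on a plain list of dicts; the rows are made-up placeholders rather than real model outputs, and the helper `word_counts` is hypothetical.

```python
# Placeholder rows standing in for collected results. These strings are
# invented for illustration and are NOT real model outputs.
rows = [
    {
        'input': 'example question',
        'output': 'short answer',
        'output_l3': 'a somewhat longer answer here',
    },
]


def word_counts(rows: list[dict], columns: list[str]) -> list[dict]:
    """Count whitespace-separated words in each named column, per row."""
    return [{col: len(row[col].split()) for col in columns} for row in rows]


print(word_counts(rows, ['output', 'output_l3']))
```

Word count says nothing about answer quality; it is only a quick way to spot models that are unusually terse or verbose.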

## Using sampling parameters

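vLLM supports fine-grained control over generation through sampling parameters, which are passed via the `sampling_kwargs` argument. Let's try running with a different system prompt and custom sampling settings: a cap of 256 generated tokens, a `temperature` of 0.7, and a `top_p` of 0.9.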

In [None]:
messages_teacher = [
    {
        'role': 'system',
        'content': 'You are a patient school teacher. '
        'Explain concepts simply and clearly.',
    },
    {'role': 'user', 'content': t.input},
]

t.add_computed_column(
    result_teacher=vllm.chat_completions(
        messages_teacher,
        model='Qwen/Qwen2.5-0.5B-Instruct',
        sampling_kwargs={'max_tokens': 256, 'temperature': 0.7, 'top_p': 0.9},
    )
)

t.add_computed_column(
    output_teacher=t.result_teacher.choices[0].message.content
)

t.select(t.input, t.output_teacher).collect()
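To build some intuition for `temperature` and `top_p`, here is a small pure-Python sketch of how these two settings are commonly defined: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`. This is a toy illustration on a four-token vocabulary, not vLLM's implementation, and the function names are hypothetical.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four tokens
probs = apply_temperature(toy_logits, temperature=0.7)
print('after temperature:', [round(p, 3) for p in probs])
print('after top-p:', top_p_filter(probs, top_p=0.9))
```

Temperatures below 1 concentrate probability on the most likely tokens, while a lower `top_p` removes more of the unlikely tail before sampling.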

## Text generation

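In addition to chat completions, the vLLM integration also supports direct text generation with the `generate` UDF, which continues a plain prompt string rather than responding to a list of chat messages.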

In [None]:
gen_t = pxt.create_table('vllm_demo/generation', {'prompt': pxt.String})

gen_t.add_computed_column(
    result=vllm.generate(
        gen_t.prompt,
        model='Qwen/Qwen2.5-0.5B-Instruct',
        sampling_kwargs={'max_tokens': 100},
    )
)

gen_t.add_computed_column(output=gen_t.result.choices[0].text)

gen_t.insert(
    [
        {'prompt': 'The capital of France is'},
        {'prompt': 'Once upon a time, there was a'},
    ]
)

gen_t.select(gen_t.prompt, gen_t.output).collect()
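The difference between the two UDFs is easiest to see in their inputs. Chat models are generally driven by a model-specific chat template that flattens a list of messages into one prompt string, whereas plain generation continues whatever string it is given. The sketch below illustrates the flattening step with a made-up template; real templates (including the one for Qwen models) use different markers, and `render_chat_prompt` is a hypothetical helper, not part of Pixeltable or vLLM.

```python
def render_chat_prompt(messages: list[dict]) -> str:
    """Flatten chat messages into a single prompt string.

    Assumption: the '[role] content' layout is invented for illustration.
    Real chat templates are model-specific and use special tokens.
    """
    lines = [f"[{m['role']}] {m['content']}" for m in messages]
    lines.append('[assistant]')  # cue for the model to start its reply
    return '\n'.join(lines)


chat_input = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'What is the capital of France?'},
]
print(render_chat_prompt(chat_input))
```

With direct generation you are responsible for writing a prompt the model can continue sensibly, which is why the prompts above are phrased as sentence openings rather than questions.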

## Additional resources

- [Pixeltable Documentation](https://docs.pixeltable.com/)
- [vLLM Documentation](https://docs.vllm.ai/)
- [vLLM GitHub](https://github.com/vllm-project/vllm)